Sesame's Conversational Speech Model Now Open-Sourced

March 18, 2025

3 minutes

Sesame AI recently released its Conversational Speech Model (CSM), an important development in voice AI technology. Early demos drew attention for the system's ability to reproduce natural pauses and intonation during conversation.

Sesame has now taken the additional step of open-sourcing the model, making it available for wider use through a public GitHub repository and a model checkpoint on Hugging Face.

What Is CSM?

CSM (Conversational Speech Model) is a speech generation model that converts text and audio inputs into RVQ (residual vector quantization) audio codes. In practical terms, it transforms written or spoken input into compact, machine-readable audio representations (called "Mimi" audio codes) that can be decoded into natural-sounding speech. The model combines a Llama-based backbone with a specialized audio decoder, allowing it to process multiple types of input and generate conversational audio.

You can experience CSM through Sesame's interactive demo on their website, where they showcase a fine-tuned version of the model. For developers interested in experimenting with audio generation, Sesame also provides access through a hosted space on Hugging Face.

Technical Deep Dive

Architecture Foundation

At its core, CSM builds on a Llama-based architecture that handles the language side: it interprets the text and context that guide audio synthesis. The model's audio decoder then produces Mimi audio codes, compact representations that capture the nuanced characteristics of speech, which are finally decoded into clear, expressive audio output.

Contextual Understanding

What sets CSM apart is its ability to process both text and audio inputs simultaneously. This multimodal approach allows the model to leverage additional information, such as speaker identity and previous statements, resulting in more coherent and contextually appropriate speech. Developers can enhance this capability by providing a Segment for each speaker's contribution, which helps the model maintain natural conversation flow.
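Concretely, a Segment bundles a transcript, an integer speaker ID, and the matching audio tensor. Here is a minimal sketch (the file name is hypothetical; in practice, resample the clip to generator.sample_rate as shown in the advanced example below):

from generator import Segment
import torchaudio

# Load a short mono reference clip of speaker 0 (hypothetical file)
audio_tensor, sample_rate = torchaudio.load("utterance_0.wav")

# Pair the clip with its transcript and speaker ID
segment = Segment(
    text="Hey how are you doing.",
    speaker=0,
    audio=audio_tensor.squeeze(0),  # 1-D waveform tensor
)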

System Requirements and Accessibility

The released 1B variant of CSM strikes a balance between quality and accessibility. It requires only a CUDA-compatible GPU (tested on CUDA 12.4 and 12.6), putting it within reach of consumer-grade hardware. The model is distributed under the Apache 2.0 license, so it can be used in both research and commercial applications, with the checkpoint conveniently hosted on Hugging Face.

Implementation Guide

Getting Started

Setting up CSM is straightforward: clone the repository, create a Python 3.10 virtual environment, install the dependencies, and log in to Hugging Face:

git clone git@github.com:SesameAILabs/csm.git
cd csm
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
huggingface-cli login

Note that Windows users should install the triton-windows package instead of triton. Before proceeding, ensure you have access to both the sesame/csm-1b and meta-llama/Llama-3.2-1B models through Hugging Face.
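On Windows, that swap might look like this (a sketch; if triton never installed successfully, the uninstall step simply does nothing):

pip uninstall -y triton
pip install triton-windows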

Basic Implementation

Let's explore how to implement CSM in your projects. Here's a basic example that demonstrates generating a simple audio output:

from generator import load_csm_1b
import torchaudio
import torch

# Set the device based on available hardware
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

# Load the CSM 1B generator model
generator = load_csm_1b(device=device)

# Generate audio from text
audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],  # no context provided in this simple example
    max_audio_length_ms=10_000,
)

# Save the generated audio to a WAV file
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)

Advanced Usage with Context

To fully leverage CSM's capabilities, you'll want to provide conversational context. Here's how you can create more sophisticated audio generations:

from generator import load_csm_1b, Segment
import torchaudio
import torch

# Select the device as in the basic example
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

# Load the CSM 1B generator model (used below by load_audio and generate)
generator = load_csm_1b(device=device)

# Define context for multiple speakers
speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

# Load a reference clip and resample it to the generator's sample rate
def load_audio(audio_path):
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0),
        orig_freq=sample_rate,
        new_freq=generator.sample_rate
    )
    return audio_tensor

# Wrap each utterance in a Segment: transcript, speaker ID, and audio
segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]

# Generate contextualized audio output
audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)

Common Questions and Ethical Considerations

Understanding CSM's Capabilities

While CSM is a powerful tool for audio generation, it's important to understand its scope and limitations. The model ships as a base generation system without pre-defined voices, so voice characteristics are driven by the input and context you provide. It is also purpose-built for speech generation: for full conversational capabilities it must be paired with a separate language model. And although it can handle some non-English content, it is primarily optimized for English.
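As a sketch of that division of labor: any language model can produce the words, and CSM only voices them. The get_llm_reply function below is a hypothetical placeholder, not part of the CSM repository:

from generator import load_csm_1b

generator = load_csm_1b(device="cuda")

def get_llm_reply(user_text: str) -> str:
    # Hypothetical placeholder: call your language model of choice here
    # (an API client, a local Llama, etc.) and return its text reply.
    return "Glad you asked. CSM turns my words into speech."

# The LLM produces the reply text; CSM produces the voice.
reply_text = get_llm_reply("What does CSM actually do?")
audio = generator.generate(
    text=reply_text,
    speaker=0,
    context=[],
    max_audio_length_ms=15_000,
)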

Responsible Usage Guidelines

Sesame emphasizes the importance of ethical implementation of CSM technology. The model is intended for research and educational purposes, with strict prohibitions against:

  • Creating speech that mimics specific individuals without consent
  • Producing misleading or deceptive content
  • Using the model for any illegal or harmful purposes

Users must adhere to applicable laws and ethical standards when implementing CSM in their projects.

Looking Forward

The open-sourcing of CSM marks a significant milestone in democratizing speech generation technology. By making the 1B variant publicly available, Sesame has opened new possibilities for researchers and developers to build innovative applications that leverage natural-sounding speech generation. Whether you're creating interactive demos, developing enterprise solutions, or conducting research, CSM provides a robust foundation for your audio generation needs.

Valeriia Kuka

Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities with 100K+ members and authored clear and concise explainers and historical articles.

