OpenAI Releases New Speech and Text Audio Models

March 25, 2025


OpenAI announced the release of a new suite of audio models available through their developer API. The release includes updated speech-to-text models (gpt-4o-transcribe and gpt-4o-mini-transcribe) and a new text-to-speech model (gpt-4o-mini-tts), all designed to enhance the capabilities of voice-based applications.

Background

This release follows OpenAI's recent focus on agent-based systems, including previous launches like Operator, Deep Research, Computer-Using Agents, and the Responses API. The company has recognized that effective human-AI interaction requires more than text-based interfaces, prompting this investment in advanced audio processing capabilities.

Key Improvements in Speech-to-Text Models

The new speech-to-text models demonstrate significant improvements in word error rate (WER), the fraction of words a transcript gets wrong, compared to previous Whisper models; a sketch of how the metric is computed follows the lists below. According to OpenAI's benchmarks, these models perform particularly well in challenging scenarios including:

  • Various accents and dialects
  • Noisy environments
  • Different speech speeds

These improvements make the models more suitable for applications such as:

  • Customer service call centers
  • Meeting transcription systems
  • Multilingual speech recognition

The technical advancements include:

  • Extensive pretraining on specialized audio datasets
  • A reinforcement learning approach that reduces transcription errors
  • Advanced distillation techniques from larger models to smaller ones
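
OpenAI has not published its evaluation code, but WER itself is a standard metric: the word-level edit distance between a reference transcript and the model's output, divided by the length of the reference. Here is a minimal sketch in Python (the function name and example strings are illustrative, not from OpenAI):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + deletions + insertions)
    divided by the number of words in the reference transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("sit" for "sat") out of three reference words: WER = 1/3
print(word_error_rate("the cat sat", "the cat sit"))  # 0.333...
```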

Text-to-Speech Innovations

The new gpt-4o-mini-tts model introduces a significant new capability: instructability. For the first time, developers can guide the model not just on what to say but how to say it, enabling more customized voice experiences. The model can adjust its speaking style based on specific instructions, such as "speak like a sympathetic customer service agent" or "narrate like a medieval knight."
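
Here is a minimal sketch of how that looks with the OpenAI Python SDK (the voice name and output path are placeholders; check OpenAI's audio documentation for the current parameter set):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The `instructions` field steers delivery style, separately from the text itself.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",  # one of the preset synthetic voices
    input="Thank you for reaching out. I completely understand your frustration.",
    instructions="Speak like a sympathetic customer service agent.",
) as response:
    response.stream_to_file("reply.mp3")
```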

OpenAI notes that these text-to-speech models are limited to artificial, preset voices, and that the company monitors outputs to ensure they stay consistent with those presets.

Technical Foundation

These audio models build upon the GPT-4o and GPT-4o-mini architectures and feature:

  1. Specialized pretraining: The models were trained on audio-centric datasets to optimize performance for speech-related tasks.

  2. Advanced distillation: Knowledge-transfer techniques from larger models to smaller ones yield efficient models that retain high-quality performance (a generic sketch follows this list).

  3. Reinforcement learning: The speech-to-text models in particular benefit from a reinforcement learning approach that improves accuracy and reduces hallucinations.
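
OpenAI has not disclosed the details of its distillation pipeline. As a generic illustration of the technique named in point 2, classic knowledge distillation trains the small model against both the ground-truth labels and the large model's softened output distribution (all names and hyperparameters below are illustrative):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-scaled distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    # Blend the two objectives; alpha controls how much the student
    # imitates the teacher versus fitting the labels directly.
    return alpha * soft + (1 - alpha) * hard
```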

API Availability and Integration

All new audio models are now available to developers through OpenAI's API. For those already building conversational experiences with text-based models, adding speech-to-text and text-to-speech capabilities is now more straightforward.
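
For example, transcribing a recording with the new speech-to-text model is a single call in the Python SDK (the file name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

# Send a local recording to the new speech-to-text model.
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)
```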

OpenAI has also released an integration with their Agents SDK to simplify the development process. For low-latency speech-to-speech applications, they recommend using their speech-to-speech models in the Realtime API.
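
As a rough sketch of the Agents SDK integration, based on the SDK's voice pipeline (class names and the event type below should be verified against the openai-agents documentation):

```python
import asyncio
import numpy as np
from agents import Agent
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

agent = Agent(name="Assistant", instructions="You are a helpful voice assistant.")
# The pipeline chains speech-to-text, the agent run, and text-to-speech.
pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))

async def main():
    # Stand-in input: 3 seconds of silence at 24 kHz; use real microphone audio in practice.
    buffer = np.zeros(24000 * 3, dtype=np.int16)
    result = await pipeline.run(AudioInput(buffer=buffer))
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            pass  # play event.data through your audio output device

asyncio.run(main())
```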

Future Development Plans

OpenAI has indicated several directions for future development:

  1. Continued improvement of audio model intelligence and accuracy
  2. Exploration of ways to allow developers to use custom voices while maintaining safety standards
  3. Ongoing engagement with policymakers, researchers, and creatives regarding synthetic voice technology
  4. Investment in other modalities, including video, to enable multimodal agent experiences

Valeriia Kuka

Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities of 100K+ members, and has authored clear, concise explainers and historical articles.


© 2025 Learn Prompting. All rights reserved.