April 2, 2026
Speaking of Voxtral
Today we’re releasing Voxtral TTS, our first text-to-speech model with state-of-the-art performance in multilingual voice generation. The model is lightweight at 4B parameters, making Voxtral-powered agents natural, reliable, and cost-effective at scale.

TL;DR
- Voxtral TTS is a lightweight, 4B-parameter text-to-speech model for multilingual voice generation.
- It supports 9 languages with realistic, emotionally expressive speech and covers diverse dialects.
- The model has very low time-to-first-audio latency and is easily adaptable to new voices.
- Voxtral TTS excels in contextual understanding and speaker modeling, capturing personality, pauses, rhythm, and intonation.
- In human evaluations, it was rated more natural than ElevenLabs Flash v2.5 while maintaining a similar time-to-first-audio.
- It can adapt to custom voices with as little as 3 seconds of reference audio, capturing nuances like accent and inflection.
- The model demonstrates zero-shot cross-lingual voice adaptation, enabling accent transfer.
- It is built on a transformer-based, autoregressive, flow-matching architecture.
- Voxtral TTS is available via API and as open weights on Hugging Face.
- Applications include customer support, financial services, sales & marketing, and real-time translation.
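
The announcement names a flow-matching architecture but gives no implementation details. For readers new to the term, here is a minimal sketch of flow-matching sampling in general, not Voxtral's actual method. A sample starts as Gaussian noise and is moved toward the data by integrating a velocity field with Euler steps. The `velocity` function is a hypothetical toy stand-in for a learned network, and the fixed `target` vector stands in for an audio representation; both are assumptions made for illustration.

```python
import random


def velocity(x, t, target):
    # Toy stand-in for a learned velocity network (an assumption, not Voxtral's model).
    # For the straight-line path x_t = (1 - t) * x0 + t * x1 toward a single
    # target point x1, the exact velocity field is (x1 - x_t) / (1 - t).
    return [(x1_i - x_i) / (1.0 - t) for x_i, x1_i in zip(x, target)]


def sample(target, steps=8, seed=0):
    # Start from Gaussian noise, then integrate dx/dt = v(x, t) from t=0 to t=1.
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so the division above is safe
        v = velocity(x, t, target)
        x = [x_i + dt * v_i for x_i, v_i in zip(x, v)]  # one Euler step
    return x


if __name__ == "__main__":
    print(sample([0.5, -1.0, 2.0]))
```

In a real system the velocity would come from a trained network conditioned on inputs such as the text and a reference voice, and the number of integration steps trades output quality against latency.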