April 2, 2026

Speaking of Voxtral

Today we’re releasing Voxtral TTS, our first text-to-speech model with state-of-the-art performance in multilingual voice generation. The model is lightweight at 4B parameters, making Voxtral-powered agents natural, reliable, and cost-effective at scale.

TL;DR

  • Voxtral TTS is a 4B parameter, lightweight text-to-speech model for multilingual voice generation.
  • It supports 9 languages and diverse dialects, producing realistic, emotionally expressive speech.
  • The model has very low time-to-first-audio latency and is easily adaptable to new voices.
  • Voxtral TTS excels in contextual understanding and speaker modeling, capturing personality, pauses, rhythm, and intonation.
  • Human evaluations show superior naturalness compared to ElevenLabs Flash v2.5 at a similar time-to-first-audio.
  • It can adapt to custom voices with as little as 3 seconds of reference audio, capturing nuances like accent and inflection.
  • The model demonstrates zero-shot cross-lingual voice adaptation, enabling accent transfer.
  • It is built on a transformer-based, autoregressive, flow-matching architecture.
  • Voxtral TTS is available via API and as open weights on Hugging Face.
  • Applications include customer support, financial services, sales & marketing, and real-time translation.
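
Time-to-first-audio, the latency metric cited in the list above, is the delay between sending a request and receiving the first playable audio chunk. The following is a minimal sketch of how it could be measured for any streaming source. It does not call the Voxtral API: `time_to_first_audio` and `fake_stream` are hypothetical names introduced for illustration, and the fake stream simply stands in for a streaming TTS response.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def time_to_first_audio(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], int]:
    """Consume a stream of audio chunks.

    Returns (seconds until the first non-empty chunk, total bytes received).
    The first value is None if the stream never yields audio.
    """
    start = clock()
    first: Optional[float] = None
    total = 0
    for chunk in chunks:
        # Empty keep-alive chunks do not count as audio.
        if chunk and first is None:
            first = clock() - start
        total += len(chunk)
    return first, total


def fake_stream(
    delay_s: float = 0.05, n_chunks: int = 3, chunk_size: int = 320
) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response (assumption)."""
    time.sleep(delay_s)  # simulated model and network latency
    for _ in range(n_chunks):
        yield b"\x00" * chunk_size  # silent placeholder audio bytes


if __name__ == "__main__":
    ttfa, total = time_to_first_audio(fake_stream())
    print(f"time-to-first-audio: {ttfa:.3f}s, bytes received: {total}")
```

Because the generator is lazy, the simulated delay falls inside the timed window, much as a real request would. With a real client, the iterable would be whatever chunk iterator that client exposes, and the timer should start just before the request is sent.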
