Microsoft’s new MAI-Transcribe-1 claims the top spot in speech recognition accuracy

Microsoft AI today launched MAI-Transcribe-1, a multilingual speech-to-text model that the company says achieves the lowest word error rate on the FLEURS benchmark across 25 languages. By Microsoft's account, it outperforms Scribe v2, Whisper-large-V3, GPT-Transcribe, and Gemini 3.1 Flash-Lite on that benchmark.
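
For readers unfamiliar with the metric: word error rate (WER) is conventionally the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below shows that standard definition only. It is not Microsoft's or FLEURS's scoring code, the function name is made up for illustration, and it skips the text normalization (casing, punctuation) that benchmark scoring usually applies first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

A lower value is better: one wrong word in a four-word reference gives a WER of 0.25.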

The model was built with messy real-world audio in mind. It is designed to stay reliable through background noise, low-quality recordings, overlapping speech, and heavy accents.

By the numbers:

Batch transcription runs 2.5x faster than Microsoft’s current Azure Fast offering
Priced at $0.36 per audio hour
Supports 25 languages with consistent accuracy across accents and speaking styles
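
To put the price in context, here is a rough cost estimator built on the $0.36 per audio hour figure above. It assumes simple linear, per-second proration with no minimum charge or billing increments. That is an assumption for illustration, not something the announcement states, and the function name is hypothetical.

```python
from decimal import Decimal

# Price quoted in the article; everything else here is an illustrative assumption.
PRICE_PER_AUDIO_HOUR_USD = Decimal("0.36")


def estimate_cost_usd(audio_seconds: float) -> Decimal:
    """Estimate transcription cost, assuming linear proration by the second."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    hours = Decimal(str(audio_seconds)) / Decimal(3600)
    # Round to whole cents for display.
    return (hours * PRICE_PER_AUDIO_HOUR_USD).quantize(Decimal("0.01"))
```

Under that assumption, 100 hours of audio (estimate_cost_usd(100 * 3600)) comes to $36.00.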

Where it’s going: MAI-Transcribe-1 is already in phased rollout for Copilot Voice mode and Microsoft Teams transcription. For developers building voice agents, Microsoft positions it as the foundational speech layer, meant to be combined with MAI-Voice-1 for text-to-speech and a separate LLM for reasoning.
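
The three-stage voice agent described above (speech-to-text, then an LLM, then text-to-speech) can be sketched as a simple pipeline. Every name below is hypothetical and the stages are stand-in stubs. No real Microsoft SDK or API is called, since the announcement does not describe one.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceAgent:
    """Chains three pluggable stages; each is any callable with the right shape."""

    transcribe: Callable[[bytes], str]  # speech-to-text stage (e.g. a transcription model)
    reason: Callable[[str], str]        # LLM stage that produces the reply text
    synthesize: Callable[[str], bytes]  # text-to-speech stage

    def respond(self, audio_in: bytes) -> bytes:
        text = self.transcribe(audio_in)
        reply = self.reason(text)
        return self.synthesize(reply)


# Stub stages so the sketch runs without any external service.
agent = VoiceAgent(
    transcribe=lambda audio: audio.decode("utf-8"),
    reason=lambda text: f"You said: {text}",
    synthesize=lambda text: text.encode("utf-8"),
)
```

Keeping each stage behind a plain callable means any one of them can be swapped out without touching the other two.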

Where to try it: The model is in public preview on Microsoft Foundry and the Microsoft AI Playground.

Source: Microsoft