Microsoft Unveils New AI Models for Enhanced Transcription, Voice, and Image Generation

Synopsis

With three new models (MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2), Microsoft aims to expand its push into multimodal AI capabilities for developers. The models are also being integrated into Microsoft products, including Copilot, Bing, and PowerPoint, with enterprise adoption already underway.

Microsoft on Thursday announced three new models from its Microsoft AI (MAI) model family for transcription, speech generation, and image generation.

The three models are MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2, with which Microsoft aims to expand its push into multimodal artificial intelligence (AI) capabilities for developers.

The models are available from today on Microsoft Foundry and the MAI Playground. Foundry, formerly Azure AI Studio, is a unified AI platform for building, customising, and scaling generative AI (GenAI) applications and agents. The MAI Playground is a public testing environment where users can experiment with features and provide feedback.

“Consistent with our commitment to safe and responsible AI, these MAI models were developed, tested, and rigorously red-teamed. Through Microsoft Foundry, developers get built-in guardrails, governance, and enterprise-grade controls designed to support safe, compliant deployment at scale,” wrote Mustafa Suleyman in a blog post. Suleyman leads the AI division at Microsoft.

MAI-Transcribe-1 is a speech-to-text model that supports transcription across the 25 most widely used languages, including Hindi. According to Microsoft, the model achieves a lower mean word error rate (WER) than Google’s Gemini 3.1 Flash and OpenAI’s GPT-Transcribe. WER evaluates the accuracy of automatic speech recognition (ASR) systems by measuring the percentage of words a model gets wrong.
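For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model's output, divided by the number of words in the reference. The sketch below illustrates that standard definition; it is not Microsoft's evaluation code, and the function name and the lowercase-and-split normalisation are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: real benchmarks usually apply more careful text
    normalisation (punctuation, numerals, casing) before scoring.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, a WER above 100 per cent is possible when the output contains many extra words.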

The model offers batch transcription speeds up to 2.5 times faster than Microsoft’s existing Azure Fast offering. The starting price of the model is $0.36 per hour.

Meanwhile, using MAI-Voice-1, developers will be able to create custom voices with a few seconds of input audio. The model can generate up to 60 seconds of audio in one second, with pricing starting at $22 per one million characters.

Finally, MAI-Image-2, Microsoft’s latest image generation model, introduced only in the MAI Playground last month, is now broadly accessible via Foundry. The model delivers at least twice the generation speed compared to earlier versions, based on production data, while maintaining output quality. Pricing starts at $5 per one million text tokens and $33 per one million image tokens.
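To put the quoted starting prices in context, the following sketch estimates costs from usage volumes. It uses only the figures reported above; the function names are hypothetical, and actual billing may involve tiers, rounding, or regional differences that the article does not describe.

```python
# Starting prices quoted in the article, in US dollars.
TRANSCRIBE_USD_PER_HOUR = 0.36
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_TEXT_TOKENS = 5.0
IMAGE_USD_PER_MILLION_IMAGE_TOKENS = 33.0


def transcription_cost(audio_hours: float) -> float:
    """Estimated MAI-Transcribe-1 cost for a given number of hours."""
    return audio_hours * TRANSCRIBE_USD_PER_HOUR


def voice_cost(characters: int) -> float:
    """Estimated MAI-Voice-1 cost for a given number of characters."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(text_tokens: int, image_tokens: int) -> float:
    """Estimated MAI-Image-2 cost from text and image token counts."""
    return (
        text_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_TEXT_TOKENS
        + image_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_IMAGE_TOKENS
    )


if __name__ == "__main__":
    # Hypothetical monthly workload, chosen only to show the arithmetic.
    print(f"100 hours transcribed: ${transcription_cost(100):.2f}")
    print(f"2 million characters spoken: ${voice_cost(2_000_000):.2f}")
    print(f"Image usage: ${image_cost(1_000_000, 1_000_000):.2f}")
```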

The models are also being integrated into Microsoft products, including Copilot, Bing, and PowerPoint, with enterprise adoption already underway.
