Microsoft is stepping up its ambitions in the field of artificial intelligence by introducing a series of new models that go beyond traditional text processing.
The move signals that the US technology company is shifting toward multimodal AI, spanning voice, transcription and images.
Specifically, Microsoft announced three new models, two of which are entirely new: one for voice generation and one for converting speech into text.
This is the first time the company has launched specialized tools for these tasks. The transcription model can convert audio into text in 25 languages, targeting applications such as video subtitling, meeting transcription and voice assistants.
Alongside it, the voice model can generate audio clips up to 60 seconds long, expanding the options for automated audio production.
That could help businesses and content creators save significant production time and cost.
On the image side, Microsoft introduced the second generation of its in-house image model, with faster generation and significantly improved image quality.
The model is now available on development platforms such as Microsoft Foundry and MAI Playground, and is expected to be integrated soon into widely used products such as Bing and PowerPoint.
These upgrades are a strategic step in expanding Microsoft's AI ecosystem. Previously, the company focused mainly on language models and tools such as Microsoft Copilot, one of the more popular AI assistants in the business environment, especially among users of Microsoft 365 and the Azure cloud platform.
Adding non-text models gives Microsoft a competitive edge by letting it offer more comprehensive solutions to businesses.
Products such as Copilot Cowork and Copilot Health also show the company's clear intent to bring AI into real-world work scenarios rather than stopping at technology experiments.
Notably, this strategy comes amid increasingly fierce AI competition. OpenAI has recently scaled back some projects to focus on core products, while Google is pursuing cost and energy optimization for generative models such as Veo 3.1 Lite.
Microsoft, meanwhile, is leveraging its financial and infrastructure strengths to invest in resource-intensive areas such as voice processing and image generation. These are important pieces in completing a multimodal AI ecosystem.
In 2026, the AI industry is shifting from demonstrating capability to demonstrating practical value.
With this series of new models, Microsoft shows that it is not only racing on technology but also focusing on real-world applicability, especially in the business environment, where efficiency and stability come first.