Alibaba releases Qwen3.5-Omni, an open-weight model that processes text, images, audio, and video natively

Alibaba released Qwen3.5-Omni, an open-weight model that sees, hears, and speaks — and can turn a voice description into working code in real time.


Alibaba’s Tongyi Lab released Qwen3.5-Omni, the latest in their open-weight model series. It handles text, images, audio, and video in a single unified model — no separate pipelines stitched together.

The demo that is getting the most attention is Audio-Visual Vibe Coding. You describe an idea out loud while pointing a camera at something, and the model generates working code from that.

That is a meaningful step beyond typing a prompt. The model understands what you are saying, what you are showing it, and produces functional output from the combination.
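If you want to try the audio-visual flow programmatically, DashScope exposes an OpenAI-compatible endpoint. The sketch below sends a spoken instruction plus a camera frame in a single request. The model ID is a placeholder and the content format is based on how earlier Qwen-Omni releases were served, so check the current DashScope docs before relying on it.

```python
import base64
from openai import OpenAI

# DashScope's OpenAI-compatible endpoint (international URL shown).
client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

def b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

# One request combining a spoken instruction with a camera frame.
# "qwen3.5-omni-flash" is a placeholder model ID; earlier Omni releases
# also required streamed output, so stream=True is used here.
stream = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[{
        "role": "user",
        "content": [
            {"type": "input_audio",
             "input_audio": {"data": b64("instruction.wav"), "format": "wav"}},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64('frame.jpg')}"}},
            {"type": "text",
             "text": "Write working code for what the audio describes, "
                     "using the photo as the visual reference."},
        ],
    }],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```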

  • Input languages: Speech recognition in 74 languages.
  • Output speech: Expressive speech generation in 29 languages, with controllable emotion and style.
  • Context window: Up to 256k tokens, handling over 10 hours of audio.
  • Video understanding: Script-level captioning with timestamps, scene cuts, and speaker mapping.
  • Agentic tools: Native web search and function calling built in (see the sketch after this list).
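
The function calling mentioned above follows the standard tools format of OpenAI-style chat APIs. Here is a minimal sketch, assuming the same compatible endpoint and placeholder model ID as before; the weather function is purely illustrative, not something the model ships with.

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# Declare a callable tool in JSON Schema; the model decides when to call it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool, defined by the caller
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.5-omni-flash",  # placeholder ID, as above
    messages=[{"role": "user", "content": "Do I need an umbrella in Hangzhou?"}],
    tools=tools,
)

# If the model chose to call the tool, the arguments arrive as JSON.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```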

The model uses a Thinker-Talker architecture: the Thinker processes all modalities together, and the Talker converts its output into streaming speech in real time. Both components use a Mixture of Experts (MoE) design, which keeps inference costs reasonable relative to the capability level.
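To make the division of labor concrete, here is a toy, dependency-free sketch of the streaming pattern: the Thinker emits text tokens as it reasons, and the Talker converts each one into an audio chunk as soon as it arrives instead of waiting for the full response. This illustrates the pattern only; it is not Qwen's actual implementation.

```python
from typing import Iterator

def thinker(prompt: str) -> Iterator[str]:
    """Stand-in for the MoE Thinker: yields text tokens one at a time."""
    for token in f"Answer to: {prompt}".split():
        yield token

def talker(tokens: Iterator[str]) -> Iterator[bytes]:
    """Stand-in for the MoE Talker: turns each token into an audio chunk
    immediately, so speech can start before the text is finished."""
    for token in tokens:
        yield f"<audio:{token}>".encode()  # placeholder for real codec frames

for chunk in talker(thinker("describe this scene")):
    print(chunk.decode())
```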

On benchmarks across 215 audio-visual and multilingual tasks, Alibaba claims the Plus variant matches Gemini 3.1 Pro on audio-visual understanding and surpasses it on pure audio tasks. Of course, these are self-reported benchmarks, so real-world results may, and probably will, vary.

The model is available on Hugging Face and through Alibaba’s DashScope API. Variants include Plus, Flash, and Light for different speed and size trade-offs.
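If you just want the weights, they can be pulled from the Hub with huggingface_hub. The repo ID below is a guess at the naming pattern, so verify it on the Qwen organization page first.

```python
from huggingface_hub import snapshot_download

# Placeholder repo ID following Qwen's usual naming; check the actual
# listing under https://huggingface.co/Qwen before running.
local_dir = snapshot_download("Qwen/Qwen3.5-Omni-Flash")
print("weights downloaded to", local_dir)
```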

Bottom line: We are way past just using the keyboard for vibe coding. Talking to your model is the latest trend, but Alibaba wants to go even further — now you can give video instructions too. More ways to vibe code your next project.

Source: Qwen Blog

RunPod

If you need on-demand GPUs for training, fine-tuning, inference, or running open-source models, give RunPod a try.

  • Available hardware: H100, H200, A100, L40S, RTX 4090, RTX 5090, and 30+ more
  • Cost: significantly cheaper than AWS or GCP, billed per second, no contracts
  • Setup: spins up in under a minute, 30+ regions worldwide
Try RunPod →
Affiliate disclosure: We may earn a commission if you sign up via our link, at no extra cost to you.
Efficienist Newsletter

Get the core business tech news delivered straight to your inbox. We track AI, automation, SaaS, and cybersecurity so you don't have to.

Just read what you want, and be done with it.
