Alibaba releases Qwen3.5-Omni, an open-weight model that processes text, images, audio, and video natively

Alibaba released Qwen3.5-Omni, an open-weight model that sees, hears, and speaks — and can turn a voice description into working code in real time.


Alibaba’s Tongyi Lab released Qwen3.5-Omni, the latest in their open-weight model series. It handles text, images, audio, and video in a single unified model — no separate pipelines stitched together.

The demo that is getting the most attention is Audio-Visual Vibe Coding. You describe an idea out loud while pointing a camera at something, and the model generates working code from that.

That is a meaningful step beyond typing a prompt. The model understands what you are saying, what you are showing it, and produces functional output from the combination.
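If you want to try the audio-visual flow programmatically, DashScope exposes an OpenAI-compatible endpoint. The sketch below sends a spoken instruction plus a camera frame in a single request. The model ID is a placeholder and the content format is based on how earlier Qwen-Omni releases were served, so check the current DashScope docs before relying on it.

```python
import base64
from openai import OpenAI

# DashScope's OpenAI-compatible endpoint (international URL shown).
client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

def b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

# One request combining a spoken instruction with a camera frame.
# "qwen3.5-omni-flash" is a placeholder model ID; earlier Omni releases
# also required streamed output, so stream=True is used here.
stream = client.chat.completions.create(
    model="qwen3.5-omni-flash",
    messages=[{
        "role": "user",
        "content": [
            {"type": "input_audio",
             "input_audio": {"data": b64("instruction.wav"), "format": "wav"}},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64('frame.jpg')}"}},
            {"type": "text",
             "text": "Write working code for what the audio describes, "
                     "using the photo as the visual reference."},
        ],
    }],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```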

  • Input languages: Speech recognition in 74 languages.
  • Output speech: Expressive speech generation in 29 languages, with controllable emotion and style.
  • Context window: Up to 256k tokens, handling over 10 hours of audio.
  • Video understanding: Script-level captioning with timestamps, scene cuts, and speaker mapping.
  • Agentic tools: Native web search and function calling built in (see the sketch after this list).
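
The function calling mentioned above follows the standard tools format of OpenAI-style chat APIs. Here is a minimal sketch, assuming the same compatible endpoint and placeholder model ID as before; the weather function is purely illustrative, not something the model ships with.

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

# Declare a callable tool in JSON Schema; the model decides when to call it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool, defined by the caller
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.5-omni-flash",  # placeholder ID, as above
    messages=[{"role": "user", "content": "Do I need an umbrella in Hangzhou?"}],
    tools=tools,
)

# If the model chose to call the tool, the arguments arrive as JSON.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```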

The model uses a Thinker-Talker architecture: the Thinker processes all modalities together, and the Talker converts its output into streaming speech in real time. Both components use a Mixture of Experts (MoE) design, which keeps inference costs reasonable relative to the capability level.
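To make the division of labor concrete, here is a toy, dependency-free sketch of the streaming pattern: the Thinker emits text tokens as it reasons, and the Talker converts each one into an audio chunk as soon as it arrives instead of waiting for the full response. This illustrates the pattern only; it is not Qwen's actual implementation.

```python
from typing import Iterator

def thinker(prompt: str) -> Iterator[str]:
    """Stand-in for the MoE Thinker: yields text tokens one at a time."""
    for token in f"Answer to: {prompt}".split():
        yield token

def talker(tokens: Iterator[str]) -> Iterator[bytes]:
    """Stand-in for the MoE Talker: turns each token into an audio chunk
    immediately, so speech can start before the text is finished."""
    for token in tokens:
        yield f"<audio:{token}>".encode()  # placeholder for real codec frames

for chunk in talker(thinker("describe this scene")):
    print(chunk.decode())
```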

On benchmarks across 215 audio-visual and multilingual tasks, Alibaba claims the Plus variant matches Gemini 3.1 Pro on audio-visual understanding and surpasses it on pure audio tasks. Of course, these are self-reported benchmarks, so real-world results may, and probably will, vary.

The model is available on Hugging Face and through Alibaba’s DashScope API. Variants include Plus, Flash, and Light for different speed and size trade-offs.
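If you just want the weights, they can be pulled from the Hub with huggingface_hub. The repo ID below is a guess at the naming pattern, so verify it on the Qwen organization page first.

```python
from huggingface_hub import snapshot_download

# Placeholder repo ID following Qwen's usual naming; check the actual
# listing under https://huggingface.co/Qwen before running.
local_dir = snapshot_download("Qwen/Qwen3.5-Omni-Flash")
print("weights downloaded to", local_dir)
```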

Bottom line: We are way past just using the keyboard for vibe coding. Talking to your model is the latest trend, but Alibaba wants to go even further — now you can give video instructions too. More ways to vibe code your next project.

Source: Qwen Blog

RunPod

If you need on-demand GPUs for training, fine-tuning, inference, or running open-source models, give RunPod a try.

  • Available hardware: H100, H200, A100, L40S, RTX 4090, RTX 5090, and 30+ more
  • Cost: significantly cheaper than AWS or GCP, billed per second, no contracts
  • Setup: spins up in under a minute, 30+ regions worldwide
Try RunPod →
Affiliate disclosure: We may earn a commission if you sign up via our link, at no extra cost to you.
Efficienist Newsletter

Get the core business tech news delivered straight to your inbox. We track AI, automation, SaaS, and cybersecurity so you don't have to.

Just read what you want, and be done with it.
