According to 1M AI News monitoring, Tongyi Laboratory has released its multimodal model Qwen3.5-Omni, which accepts text, image, audio, and audio-video inputs and can generate fine-grained, timestamped audio-video captions. According to the official announcement, Qwen3.5-Omni-Plus achieved 215 SOTA results across tasks such as audio and audio-video understanding, reasoning, dialogue, and translation, surpassing Gemini-3.1-Pro.
The most notable addition this time is not the leaderboard, but a "naturally emerging Audio-Visual Vibe Coding capability." Tongyi says the model was never specifically trained for this, yet it can already generate runnable code directly from audio-video instructions. The announcement also claims that the model supports a 256K context window, recognizes 113 languages, can process up to 10 hours of audio or 1 hour of video, and natively supports WebSearch and complex Function Calls.
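For readers who want a sense of how such an audio-video request might look in practice, here is a minimal sketch that calls an omni-style model through Bailian's OpenAI-compatible endpoint. The model identifier "qwen3.5-omni-plus" and the multimodal content fields are assumptions modeled on earlier Qwen-Omni releases, not details confirmed in this announcement; consult the official Bailian documentation for the actual names.

```python
# Minimal sketch: requesting timestamped audio-video captions via Bailian's
# OpenAI-compatible endpoint. The model ID "qwen3.5-omni-plus" and the
# "video_url" content type are assumptions based on earlier Qwen-Omni
# releases; check the official docs before relying on them.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    # Bailian (DashScope) exposes an OpenAI-compatible base URL.
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3.5-omni-plus",  # hypothetical model identifier
    messages=[
        {
            "role": "user",
            "content": [
                # A video input plus a plain-text instruction.
                {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
                {"type": "text", "text": "Generate fine-grained audio-video captions with timestamps."},
            ],
        }
    ],
)

print(completion.choices[0].message.content)
```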
Qwen3.5-Omni retains the Thinker-Talker split architecture, with both components upgraded to a Hybrid-Attention MoE design. Tongyi offers three sizes, Plus, Flash, and Light, via Alibaba Cloud's Bailian platform, and has also launched a real-time version, Qwen3.5-Omni-Plus-Realtime.
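To make the Thinker-Talker split concrete, the schematic sketch below shows the general idea: a "Thinker" backbone handles multimodal understanding and text generation, while a "Talker" consumes the Thinker's hidden states to predict speech tokens for a separate vocoder. All class names, layer choices, and dimensions here are invented for illustration; this is not Tongyi's implementation.

```python
# Schematic sketch of a Thinker-Talker split (illustrative only, not
# Tongyi's implementation). The Thinker produces text logits and hidden
# states; the Talker conditions on those hidden states to emit speech-token
# logits that a downstream vocoder would turn into waveforms.
import torch
import torch.nn as nn

class Thinker(nn.Module):
    """Multimodal backbone: fused embeddings -> text logits + hidden states."""
    def __init__(self, d_model=1024, vocab_size=32000):
        super().__init__()
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,  # toy depth for illustration
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, fused_embeddings):
        hidden = self.backbone(fused_embeddings)
        return self.lm_head(hidden), hidden

class Talker(nn.Module):
    """Speech decoder conditioned on the Thinker's hidden states."""
    def __init__(self, d_model=1024, speech_codebook=4096):
        super().__init__()
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.speech_head = nn.Linear(d_model, speech_codebook)

    def forward(self, thinker_hidden):
        out, _ = self.decoder(thinker_hidden)
        return self.speech_head(out)  # speech-token logits for a vocoder

# Toy forward pass: batch of 1, sequence of 16 fused multimodal embeddings.
fused = torch.randn(1, 16, 1024)
thinker, talker = Thinker(), Talker()
text_logits, hidden = thinker(fused)
speech_logits = talker(hidden)
print(text_logits.shape, speech_logits.shape)  # (1, 16, 32000) (1, 16, 4096)
```

The design motivation usually cited for this kind of split is that text reasoning and speech synthesis can run with different latency budgets, which is also what makes a real-time variant like Qwen3.5-Omni-Plus-Realtime practical.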