Tongyi integrates Vibe Coding across all modalities as Qwen3.5-Omni claims 215 SOTA results.

BlockBeatNews

According to 1M AI News monitoring, Tongyi Laboratory has released its multimodal model Qwen3.5-Omni, which accepts text, image, audio, and audio-video inputs and can generate fine-grained audio-video captions with timestamps. Officially, Qwen3.5-Omni-Plus has achieved 215 SOTA results on tasks including audio and audio-video analysis, reasoning, dialogue, and translation, with capabilities said to exceed Gemini-3.1-Pro.
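
To make the captioning claim concrete, here is a minimal sketch of what a timestamped-caption request might look like. The endpoint URL, the model ID `qwen3.5-omni-plus`, and the `video_url` content part are assumptions based on how Bailian has exposed earlier Qwen models through its OpenAI-compatible API, not confirmed details of this release.

```python
# Hypothetical sketch: requesting timestamped audio-video captions from
# Qwen3.5-Omni. The base_url, model ID, and "video_url" content type are
# assumptions, not confirmed API details for this release.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",  # Bailian/DashScope credential
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # assumed
)

response = client.chat.completions.create(
    model="qwen3.5-omni-plus",  # hypothetical model ID
    messages=[
        {
            "role": "user",
            "content": [
                # Assumed content part for video input, as in earlier Qwen-VL APIs
                {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
                {
                    "type": "text",
                    "text": "Generate fine-grained captions for this clip, "
                            "with a timestamp for each event.",
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```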

This time, the most notable addition isn't the leaderboard numbers but a "naturally emerging Audio-Visual Vibe Coding capability." Tongyi says the model was never specifically trained for it, yet it can already generate runnable code directly from audio-video instructions. Tongyi also claims the model supports a 256K context window, recognizes 113 languages, can process 10 hours of audio or 1 hour of video, and natively supports WebSearch and complex Function Calls.
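
To illustrate the Function Call claim, the sketch below uses the standard OpenAI-style `tools` schema against the same assumed endpoint; the model ID is again hypothetical, and the `get_weather` tool is purely illustrative rather than anything the model ships with.

```python
# Hypothetical sketch of OpenAI-style function calling; the endpoint,
# model ID, and get_weather tool are all illustrative assumptions.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # assumed
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # illustrative tool, not part of the model
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="qwen3.5-omni-plus",  # hypothetical model ID
    messages=[{"role": "user", "content": "What's the weather in Hangzhou?"}],
    tools=tools,
)

# If the model decides to call the tool, the arguments arrive as JSON.
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    print(tool_calls[0].function.name, tool_calls[0].function.arguments)
```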

Qwen3.5-Omni retains the Thinker-Talker split architecture, with both components upgraded to Hybrid-Attention MoE. Tongyi offers three sizes, Plus, Flash, and Light, through Alibaba Cloud's Bailian platform, and has also launched a real-time variant, Qwen3.5-Omni-Plus-Realtime.
