Qwen3-VL Technical Report

Published in arXiv preprint, 2025

Qwen3-VL is the most capable vision-language model in the Qwen series to date. It comes in dense variants (2B/4B/8B/32B) and mixture-of-experts variants (30B-A3B/235B-A22B). The model supports contexts of up to 256K tokens that interleave text, images, and video, and uses an enhanced interleaved-MRoPE for spatial-temporal modeling together with DeepStack integration of multi-level ViT features.

Recommended citation: Shuai Bai, ..., Jie Huang, ..., et al. (2025). "Qwen3-VL Technical Report." arXiv preprint arXiv:2511.21631.