Qwen3-VL Technical Report
Published as an arXiv preprint, 2025
Qwen3-VL is the most capable vision-language model in the Qwen series to date, supporting interleaved contexts of up to 256K tokens that combine text, images, and video.
Recommended citation: Shuai Bai, ..., Jie Huang, ..., et al. (2025). "Qwen3-VL Technical Report." arXiv preprint arXiv:2511.21631.
