Revisiting Multimodal Positional Encoding in Vision-Language Models

Published in arXiv preprint, 2025

We conduct a comprehensive analysis of multimodal Rotary Positional Embedding (RoPE) in vision-language models and distill three essential guidelines: positional coherence, full frequency utilization, and preservation of textual priors. Based on these insights, we propose Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), which consistently outperform existing approaches across diverse benchmarks.
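For concreteness, here is a minimal sketch of the interleaving idea behind MRoPE-I, assuming the common multimodal-RoPE convention of a (temporal, height, width) position triple per token; the function name, the cyclic t/h/w layout, and the example positions are illustrative assumptions, not the paper's released code:

```python
import numpy as np

def interleaved_mrope_angles(pos_t, pos_h, pos_w, head_dim=8, base=10000.0):
    """Rotation angle per (frequency pair, token) under interleaved axis assignment."""
    num_freqs = head_dim // 2
    # Standard RoPE frequency ladder: theta_i = base^(-2i / head_dim).
    inv_freq = base ** (-2.0 * np.arange(num_freqs) / head_dim)
    # Cycle the three positional axes across the frequency index (t, h, w, t, ...),
    # so each axis touches low, mid, and high frequencies alike.
    axis_pos = np.stack([pos_t, pos_h, pos_w])           # (3, seq_len)
    pos_per_freq = axis_pos[np.arange(num_freqs) % 3]    # (num_freqs, seq_len)
    return pos_per_freq * inv_freq[:, None]              # (num_freqs, seq_len)

# Text tokens advance t, h, and w together; image patches share a temporal
# index while h/w sweep the patch grid (a common convention, assumed here).
pos_t = np.array([0, 1, 2, 2, 2, 2, 3])
pos_h = np.array([0, 1, 2, 0, 0, 1, 3])
pos_w = np.array([0, 1, 2, 0, 1, 0, 3])
print(interleaved_mrope_angles(pos_t, pos_h, pos_w).shape)  # (4, 7)
```

Under this layout, a pure-text token with t = h = w = n recovers exactly the 1-D RoPE angles for position n, which is one way to read the "preservation of textual priors" guideline; and because the t, h, and w axes each cycle through the whole frequency ladder rather than receiving a contiguous low- or high-frequency chunk, the layout also reflects the "full frequency utilization" guideline.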

Recommended citation: Jie Huang, Xuejing Liu, Shijie Song, Ruibing Hou, Hong Chang, Jinlin Lin, Shuai Bai. (2025). "Revisiting Multimodal Positional Encoding in Vision-Language Models." arXiv preprint arXiv:2510.23095.
Download Paper