Revisiting Multimodal Positional Encoding in Vision-Language Models
Published as an arXiv preprint, 2025
We conduct a comprehensive analysis of multimodal Rotary Positional Embedding (RoPE) in vision-language models and distill three essential guidelines: positional coherence, full frequency utilization, and preservation of textual priors. Based on these insights, we propose Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), which consistently outperform existing approaches across diverse benchmarks.
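Two of the guidelines above can be made concrete with a small sketch of how multimodal RoPE assigns rotary frequencies to the temporal, height, and width axes. This is an illustrative toy, not the paper's actual implementation: the function names (`rope_freqs`, `mrope_angles`, `apply_rope`), the specific "chunk" vs. "interleave" assignment schemes, and the dimension sizes are all assumptions made for exposition. The chunked scheme gives each axis only one contiguous frequency band, while the interleaved scheme lets every axis span the full frequency range ("full frequency utilization"); and when a text token's three coordinates coincide, either scheme collapses to standard 1-D RoPE ("preservation of textual priors").

```python
import numpy as np

def rope_freqs(dim, base=10000.0):
    # Standard RoPE inverse frequencies: one frequency per channel pair.
    return base ** (-np.arange(0, dim, 2) / dim)

def axis_assignment(num_pairs, scheme="interleave"):
    # Assign each frequency pair to one of the three axes (t=0, h=1, w=2).
    # "chunk": contiguous thirds, so each axis sees only one frequency band.
    # "interleave": round-robin, so each axis spans high and low frequencies.
    # (Assumes num_pairs is divisible by 3 for simplicity.)
    if scheme == "chunk":
        return np.repeat(np.arange(3), num_pairs // 3)
    return np.arange(num_pairs) % 3

def mrope_angles(pos_t, pos_h, pos_w, dim, scheme="interleave"):
    # Rotation angle for each frequency pair: the position along that
    # pair's assigned axis, scaled by the pair's frequency.
    freqs = rope_freqs(dim)
    axes = axis_assignment(len(freqs), scheme)
    pos = np.array([pos_t, pos_h, pos_w], dtype=float)
    return pos[axes] * freqs

def apply_rope(x, angles):
    # Rotate consecutive channel pairs of x by the given angles.
    x1, x2 = x[0::2], x[1::2]
    c, s = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x1 * c - x2 * s
    out[1::2] = x1 * s + x2 * c
    return out
```

For a text token at index 5, calling `mrope_angles(5, 5, 5, dim)` yields `5 * rope_freqs(dim)` under either scheme, i.e. exactly the angles of ordinary 1-D RoPE, which is one way to read the textual-prior guideline.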
Recommended citation: Jie Huang, Xuejing Liu, Shijie Song, Ruibing Hou, Hong Chang, Jinlin Lin, Shuai Bai. (2025). "Revisiting Multimodal Positional Encoding in Vision-Language Models." arXiv preprint arXiv:2510.23095.
Download Paper
