• Author(s): Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, Jiaqi Wang

“MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs” introduces a benchmark, MMDU, for evaluating how well Large Vision-Language Models (LVLMs) handle complex dialog scenarios involving multiple images, together with an accompanying instruction-tuning dataset (as the title indicates) aimed at improving that ability. MMDU focuses on multi-turn interactions in which the model must understand and answer queries grounded in a sequence of images spread across the conversation.

MMDU tests a model’s ability to comprehend and generate accurate responses in dialogs that require integrating information from multiple images. This is particularly challenging because the model must maintain context across turns while reasoning about the relationships between the different images presented during the dialog.
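To make the setting concrete, the sketch below shows one way such a multi-turn, multi-image sample could be represented. The field names, image paths, and the `<image>` placeholder are illustrative assumptions, not the paper’s actual schema.

```python
# A minimal, hypothetical representation of one multi-turn, multi-image dialog sample.
sample = {
    "images": ["street_day.jpg", "street_night.jpg"],  # images referenced across the dialog
    "turns": [
        {
            "role": "user",
            "content": "<image> <image> How does the scene change between the two photos?",
        },
        {
            "role": "assistant",  # reference answer used for scoring
            "content": "The first photo shows the street in daylight with heavy foot traffic; "
                       "the second shows the same street at night with the shops closed.",
        },
        {
            "role": "user",
            # A follow-up that depends on both earlier images and the previous answer,
            # so the model must carry context across turns.
            "content": "Which photo would better illustrate an article about nightlife, and why?",
        },
    ],
}
```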

The benchmark includes a diverse set of tasks that mirror real-world scenarios, such as identifying objects, reasoning about spatial relationships, and generating descriptive captions grounded in the visual and textual context of the dialog. The underlying dataset is large, covering these tasks across many images and long conversations, which supports a comprehensive evaluation of both visual and language understanding.
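As a rough illustration of how a model might be run over such data, the following sketch replays a dialog turn by turn while accumulating history. Here `query_lvlm` is a hypothetical stand-in for whatever inference call a given model exposes; it is an assumption of this sketch, not an API from the paper.

```python
from typing import Callable


def run_dialog(sample: dict, query_lvlm: Callable[[list, list], str]) -> list:
    """Replay a multi-turn, multi-image dialog against an LVLM, turn by turn."""
    history = []    # running dialog context fed back to the model each turn
    responses = []  # the model's answer to each user turn
    for turn in sample["turns"]:
        if turn["role"] != "user":
            continue  # keep reference answers out of the prompt; use the model's own replies instead
        history.append({"role": "user", "content": turn["content"]})
        reply = query_lvlm(history, sample["images"])
        history.append({"role": "assistant", "content": reply})
        responses.append(reply)
    return responses


# Example usage with the `sample` dict sketched above and a trivial mock model:
# run_dialog(sample, lambda history, images: f"(answer to turn {len(history) // 2 + 1})")
```

Feeding the model its own earlier replies, rather than the reference answers, keeps later turns consistent with what the model actually said; the paper’s exact evaluation protocol may differ.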

Key contributions of the paper include a standardized test-bed for evaluating multi-turn, multi-image dialog understanding in LVLMs and a detailed analysis of how current models perform on these tasks. The authors argue that such a benchmark is needed to drive the development of vision-language models capable of handling complex dialog interactions.

Overall, MMDU gives researchers and developers a robust tool for assessing and improving the dialog understanding capabilities of their models. Its focus on multi-turn, multi-image scenarios addresses a gap in existing evaluation methodologies and points toward more contextually aware vision-language systems.