• Author(s): Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee

Large multimodal models (LMMs) have demonstrated remarkable performance in visual-linguistic reasoning tasks. These models embed images into a fixed, large number of visual tokens, which are then fed into a Large Language Model (LLM). However, this approach becomes inefficient when dealing with dense visual scenarios like high-resolution images and videos, as it leads to an excessive number of tokens.
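To make the scaling problem concrete, here is a rough back-of-the-envelope sketch. The encoder resolution, tile count, and frame count are illustrative assumptions, not numbers from the paper (beyond the 576 tokens per image cited later, which corresponds to a 24×24 patch grid).

```python
# Back-of-the-envelope visual token counts, assuming a CLIP ViT-L/14-style
# encoder at 336x336 resolution (24x24 = 576 patch tokens per image);
# tile and frame counts are illustrative.
patch_grid = 336 // 14                    # 24 patches per side
tokens_per_image = patch_grid ** 2        # 576 visual tokens

tiles = 5                                 # e.g. 4 high-res crops + 1 global view
frames = 32                               # a short video clip

print(tokens_per_image * tiles)           # 2880 tokens for one high-res image
print(tokens_per_image * frames)          # 18432 tokens for 32 video frames
```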

While token pruning and merging methods exist, they produce a single output length for each image, lacking flexibility in balancing information density and efficiency. To address this limitation, the authors propose M3: Matryoshka Multimodal Models, inspired by the concept of Matryoshka dolls. M3 learns to represent visual content as nested sets of visual tokens, capturing information across multiple coarse-to-fine granularities.
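The nested structure can be pictured as successively average-pooling the base grid of visual tokens into smaller grids. Below is a minimal PyTorch sketch of that idea; the 9- and 576-token counts mentioned in the next paragraph appear as two of the levels, while the function name, the intermediate scales, and the pooling scheme are illustrative assumptions rather than the authors' exact implementation.

```python
# A minimal sketch of nested, coarse-to-fine visual token sets, assuming the
# 576 base tokens form a 24x24 spatial grid and that coarser levels are
# obtained by average pooling; names and scales are illustrative, not the
# authors' implementation.
import torch
import torch.nn.functional as F

def matryoshka_token_sets(visual_tokens, sides=(24, 12, 6, 3, 1)):
    """visual_tokens: (batch, 576, dim) -> list of (batch, side**2, dim)
    tensors, i.e. 576, 144, 36, 9, 1 tokens from finest to coarsest."""
    b, n, d = visual_tokens.shape
    grid = visual_tokens.transpose(1, 2).reshape(b, d, sides[0], sides[0])
    levels = []
    for side in sides:
        pooled = F.adaptive_avg_pool2d(grid, side)        # average-pool to side x side
        levels.append(pooled.flatten(2).transpose(1, 2))  # back to (batch, side**2, dim)
    return levels

tokens = torch.randn(1, 576, 1024)                        # e.g. encoder features of one image
print([t.shape[1] for t in matryoshka_token_sets(tokens)])  # [576, 144, 36, 9, 1]
```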

This approach offers several unique benefits for LMMs. First, it gives explicit control over the visual granularity per test instance during inference, so the number of tokens used to represent an image can be adjusted to its anticipated complexity or simplicity. Second, M3 provides a framework for analyzing the granularity that existing datasets actually require, revealing that COCO-style benchmarks need only around 9 visual tokens to reach accuracy similar to using all 576 tokens. Third, the approach lays a foundation for exploring the optimal trade-off between performance and visual token length at the sample level, where the authors’ investigation reveals a significant gap between the oracle upper bound and current fixed-scale representations.
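As a hypothetical usage example of the first benefit, the snippet below selects one of the nested levels per test instance at inference time. It reuses the `matryoshka_token_sets` helper sketched above, and the routing rule (a token budget keyed on the expected difficulty of the query) is purely illustrative and not part of the paper.

```python
# Hypothetical per-instance granularity selection at inference time; the
# budget heuristic is illustrative and not part of the paper.
import torch

levels = matryoshka_token_sets(torch.randn(1, 576, 1024))
by_count = {lvl.shape[1]: lvl for lvl in levels}   # keys: 576, 144, 36, 9, 1

expected_simple = True                             # e.g. a COCO-style caption query
budget = 9 if expected_simple else 576             # a coarse scale often suffices
visual_tokens = by_count[budget]                   # (1, budget, dim)

# visual_tokens would then be projected and concatenated with the text tokens
# before being fed to the LLM, just like the full 576-token set.
print(visual_tokens.shape)
```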