• Author(s): Jang Hyun Cho, Boris Ivanovic, Yulong Cao, Edward Schmerling, Yue Wang, Xinshuo Weng, Boyi Li, Yurong You, Philipp Krähenbühl, Yan Wang, Marco Pavone

This paper presents Cube-LLM, a Multi-modal Large Language Model (MLLM) that extends the perceptual capabilities of MLLMs to grounding and reasoning about images in three-dimensional space. Whereas most existing models focus on 2D vision-and-language tasks, Cube-LLM is trained on a large-scale pre-training dataset named LV3D, which combines multiple existing 2D and 3D recognition datasets under a unified multi-turn question-answering format.
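
To make the unified question-answering format concrete, the sketch below shows one way a single object annotation could be flattened into 2D-then-3D conversation turns. This is a minimal illustration under assumed conventions: the box parameterization (x, y, z, w, l, h, yaw), decimal precision, and prompt wording are assumptions for exposition, not the paper's exact specification of LV3D.

```python
def box3d_to_text(box):
    """Serialize an assumed (x, y, z, w, l, h, yaw) 3D box as a plain-text token string."""
    return "(" + ", ".join(f"{v:.2f}" for v in box) + ")"


def make_qa_turns(category, box2d, box3d):
    """Build a multi-turn sample: 2D grounding first, then the 3D box for the same object."""
    x1, y1, x2, y2 = box2d
    return [
        {"question": f"Provide the 2D box of the {category}.",
         "answer": f"({x1:.1f}, {y1:.1f}, {x2:.1f}, {y2:.1f})"},
        {"question": f"Provide the 3D box of the {category}.",
         "answer": box3d_to_text(box3d)},
    ]


# Example usage with made-up annotation values.
sample = make_qa_turns(
    category="parked car",
    box2d=(120.0, 80.0, 260.0, 190.0),
    box3d=(4.2, 1.1, 23.5, 1.8, 4.5, 1.6, 0.12),  # assumed order: x, y, z, w, l, h, yaw
)
for turn in sample:
    print(turn["question"], "->", turn["answer"])
```

Serializing every label as plain text in this way is what lets 2D and 3D datasets share one training interface, since the model only ever sees question-answer token sequences.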

Cube-LLM is pre-trained on LV3D without any 3D-specific architectural design or training objective, demonstrating that scaling up data alone can substantially improve 3D perception. The model exhibits several notable properties: it can apply chain-of-thought prompting to reason from 2D context toward 3D predictions, it adapts to complex and diverse instructions, and it accepts visual prompts such as 2D boxes or candidate 3D boxes produced by specialist models.
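
The sketch below illustrates how such 2D-to-3D chain-of-thought prompting and candidate-box prompting could look at inference time. It is a hedged approximation: `StubMLLM`, the `generate` call, and the prompt wording are placeholders I am assuming for illustration, not Cube-LLM's documented interface.

```python
class StubMLLM:
    """Placeholder standing in for a served multi-modal LLM client (assumed interface)."""

    def generate(self, image, prompt):
        # A real client would run the model on (image, prompt); here we just echo the prompt.
        return f"<model answer to: {prompt}>"


def ground_in_3d(mllm, image, query, candidate_box3d=None):
    """Chain-of-thought style prompting: ask for the 2D box first, then lift it to 3D."""
    # Step 1: ground the referring expression in image space.
    box2d = mllm.generate(image, f"Provide the 2D box of: {query}.")

    # Step 2: condition the 3D question on the 2D answer, optionally seeding it
    # with a candidate 3D box from a specialist detector (visual prompting).
    prompt = f"The object '{query}' is at 2D box {box2d}. Provide its 3D box."
    if candidate_box3d is not None:
        prompt += f" A candidate 3D box from a specialist is {candidate_box3d}."
    return mllm.generate(image, prompt)


print(ground_in_3d(StubMLLM(), image=None, query="the pedestrian crossing ahead"))
```

The key idea the sketch tries to capture is that the intermediate 2D answer (or an external detector's candidate box) becomes part of the next prompt, so the 3D prediction is conditioned on easier, earlier reasoning steps.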

Empirical results show that Cube-LLM substantially outperforms existing baselines, improving AP-BEV by 21.3 points on the Talk2Car dataset for 3D grounded reasoning and by 17.7 points on the DriveLM dataset for complex reasoning about driving scenarios. Cube-LLM also performs competitively on general MLLM benchmarks, reaching an average score of 87.0 on refCOCO for 2D grounding, and on visual question answering benchmarks such as VQAv2, GQA, SQA, and POPE for complex reasoning tasks. The paper underscores the potential of data scaling for enhancing the 3D perception and reasoning capabilities of MLLMs, setting a new standard for future developments in the field.