• Author(s): Yuanhui Huang, Wenzhao Zheng, Yunpeng Zhang, Jie Zhou, Jiwen Lu

3D semantic occupancy prediction is a critical task for enhancing the robustness of vision-centric autonomous driving systems. It aims to recover the fine-grained 3D geometry and semantics of the surrounding scene. Traditional methods typically represent scenes with dense grids, such as voxels. However, dense grids ignore both the sparsity of occupancy and the varying scales of objects, so resources are allocated uniformly across the scene regardless of where they are needed.

To address these limitations, the authors propose an object-centric representation for describing 3D scenes using sparse 3D semantic Gaussians. Each Gaussian represents a flexible region of interest along with its semantic features. The approach aggregates information from images through an attention mechanism and iteratively refines the properties of the 3D Gaussians, including their position, covariance, and semantics.
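To make the representation concrete, the following is a minimal sketch of a single semantic Gaussian and of how it could contribute semantics to a query point. It is not the authors' implementation: the scale-plus-quaternion parameterization of the covariance, the function names, and the unnormalized Gaussian weighting are all assumptions chosen for illustration.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def covariance(scale, quat):
    """Build a covariance matrix as R S S^T R^T (assumed parameterization)."""
    R = quat_to_rotmat(quat)
    S = np.diag(scale)
    return R @ S @ S.T @ R.T

def gaussian_semantics(point, mean, scale, quat, logits):
    """Semantic logits of one Gaussian, weighted by its density at `point`."""
    d = point - mean
    cov = covariance(scale, quat)
    # Mahalanobis distance via a linear solve instead of an explicit inverse.
    weight = np.exp(-0.5 * d @ np.linalg.solve(cov, d))
    return weight * logits
```

Under these assumptions, a Gaussian contributes its full logits at its own mean and an exponentially decaying share elsewhere, which is what lets a single Gaussian cover a flexible, object-sized region.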

Additionally, the authors introduce an efficient Gaussian-to-voxel splatting method to generate 3D occupancy predictions. For a given position, this method aggregates only the neighboring Gaussians, which avoids computation over Gaussians too far away to contribute. Extensive experiments were conducted on the widely adopted nuScenes and KITTI-360 datasets. The results show that the proposed model, GaussianFormer, achieves performance comparable to state-of-the-art methods while using only 17.8% to 24.8% of their memory consumption.
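The idea of aggregating only neighboring Gaussians can be sketched as follows. This is an illustrative simplification rather than the method from the paper: it assumes axis-aligned Gaussians (no rotation), a hypothetical cutoff of `radius_sigmas` standard deviations, and a plain sum of weighted logits. The function name and all parameters are made up for this example.

```python
import numpy as np

def splat_to_voxels(means, scales, logits, grid_shape, voxel_size, origin,
                    radius_sigmas=3.0):
    """Accumulate weighted semantic logits of Gaussians onto a voxel grid.

    Each Gaussian only touches voxels inside its own bounding box, so the
    cost scales with the Gaussians' extent rather than with the full grid.
    """
    num_classes = logits.shape[1]
    grid = np.zeros((*grid_shape, num_classes))
    shape = np.array(grid_shape)
    for mean, scale, logit in zip(means, scales, logits):
        radius = radius_sigmas * scale  # per-axis cutoff (assumption)
        lo = np.floor((mean - radius - origin) / voxel_size).astype(int)
        hi = np.ceil((mean + radius - origin) / voxel_size).astype(int)
        lo = np.clip(lo, 0, shape)
        hi = np.clip(hi, 0, shape)
        if np.any(hi <= lo):
            continue  # Gaussian lies entirely outside the grid
        axes = [np.arange(l, h) for l, h in zip(lo, hi)]
        idx = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
        centers = origin + (idx + 0.5) * voxel_size
        d = (centers - mean) / scale
        weight = np.exp(-0.5 * np.sum(d * d, axis=-1))
        grid[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] += weight[..., None] * logit
    return grid
```

A per-voxel class prediction can then be read off with `grid.argmax(axis=-1)`. Voxels that no Gaussian reaches keep all-zero logits, so a real system would need an explicit rule for treating them as empty.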

By accounting for the sparsity of occupancy and the diversity of object scales in 3D scenes, this approach avoids the inefficiencies of dense grid methods and allocates resources in a more balanced way. The findings suggest that GaussianFormer is a promising way to improve the efficiency of 3D semantic occupancy prediction in autonomous driving applications.