Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness

*Corresponding authors Project lead

1 Institute of Automation, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 Dexmal     4 MEGVII Technology     5 StepFun
Overview of Ross3D. Ross3D introduces 3D-aware visual guidance via cross-view (left) and global-view (right) reconstruction. In this way, LMMs are required to model 3D correspondence, thereby enhancing 3D scene comprehension capabilities.

Abstract

The rapid development of Large Multimodal Models (LMMs) for 2D images and videos has spurred efforts to adapt these models for interpreting 3D scenes. However, the absence of large-scale 3D vision-language datasets has posed a significant obstacle. To address this issue, typical approaches focus on injecting 3D awareness into 2D LMMs by designing 3D input-level scene representations. This work provides a new perspective. We introduce reconstructive visual instruction tuning with 3D-awareness (Ross3D), which integrates 3D-aware visual supervision into the training procedure. Specifically, it incorporates cross-view and global-view reconstruction. The former requires reconstructing masked views by aggregating overlapping information from other views. The latter aims to aggregate information from all available views to recover Bird’s-Eye-View images, contributing to a comprehensive overview of the entire scene. Empirically, Ross3D achieves state-of-the-art performance across various 3D scene understanding benchmarks. More importantly, our semi-supervised experiments demonstrate significant potential in leveraging large amounts of unlabeled 3D vision-only data.
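The cross-view idea in the abstract (hide some views, then score the model only on how well it recovers them from the rest) can be illustrated with a toy sketch. This is not the Ross3D implementation: the function names `choose_masked_views` and `masked_view_loss`, the `mask_ratio` parameter, and the plain mean-squared-error objective are all assumptions made for illustration, and each "view" is reduced to a flat list of numbers.

```python
import random


def choose_masked_views(num_views, mask_ratio, seed=0):
    """Pick which view indices to hide (hypothetical helper).

    At least one view always stays visible, since masked views are
    meant to be recovered from the remaining ones.
    """
    if num_views < 1:
        raise ValueError("need at least one view")
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    num_masked = min(int(num_views * mask_ratio), num_views - 1)
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_views), num_masked))


def masked_view_loss(targets, predictions, masked):
    """Mean squared error over the masked views only.

    MSE is a stand-in objective chosen for simplicity; visible views
    contribute nothing to the loss.
    """
    if not masked:
        return 0.0
    total, count = 0.0, 0
    for i in masked:
        for t, p in zip(targets[i], predictions[i]):
            total += (t - p) ** 2
            count += 1
    return total / count


# Four toy "views", each flattened to two values.
views = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 3.0]]
masked = choose_masked_views(len(views), mask_ratio=0.5)
# A perfect reconstruction gives zero loss on the masked views.
perfect_loss = masked_view_loss(views, views, masked)
```

Restricting the loss to masked views is what forces information to flow from the visible views; global-view (Bird’s-Eye-View) reconstruction would need scene geometry and is not sketched here.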

Quantitative Performance

[Figures: Ross3D quantitative results (four images)]

Qualitative Performance

[Figure: Ross3D qualitative results]

BibTeX

@article{wang2025ross3d,
  author={Haochen Wang and Yucheng Zhao and Tiancai Wang and Haoqiang Fan and Xiangyu Zhang and Zhaoxiang Zhang},
  title={Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness},
  journal={arXiv preprint arXiv:2504.01901},
  year={2025},
}