Towards Foundation Models for 3D Vision: How Close are We?

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Building a foundation model for 3D vision is a complex challenge that remains unsolved. Towards that goal, it is important to understand the 3D reasoning capabilities of current models as well as identify the gaps between these models and humans. Therefore, we construct a new 3D visual understanding benchmark named UniQA-3D. UniQA-3D covers fundamental 3D vision tasks in the Visual Question Answering (VQA) format. We evaluate state-of-the-art Vision-Language Models (VLMs), specialized models, and human subjects on it. Our results show that VLMs generally perform poorly, while the specialized models are accurate but not robust, failing under geometric perturbations. In contrast, human vision continues to be the most reliable 3D visual system. We further demonstrate that neural networks align more closely with human 3D vision mechanisms compared to classical computer vision methods, and Transformer-based networks such as ViT [17] align more closely with human 3D vision mechanisms than CNNs. We hope our study will benefit the future development of foundation models for 3D vision. Code is available at https://github.com/princeton-vl/UniQA-3D.

Original language: English (US)
Title of host publication: Proceedings - 2025 International Conference on 3D Vision, 3DV 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1285-1296
Number of pages: 12
ISBN (Electronic): 9798331538514
State: Published - 2025
Event: 12th International Conference on 3D Vision, 3DV 2025 - Singapore, Singapore
Duration: Mar 25 2025 to Mar 28 2025

Publication series

Name: Proceedings - 2025 International Conference on 3D Vision, 3DV 2025

Conference

Conference: 12th International Conference on 3D Vision, 3DV 2025
Country/Territory: Singapore
City: Singapore
Period: 3/25/25 to 3/28/25

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Computer Science Applications
  • Computer Vision and Pattern Recognition
  • Signal Processing
  • Modeling and Simulation

Keywords

  • benchmark
  • dataset
  • foundation model
  • human subject research
