TY - GEN
T1 - Towards Foundation Models for 3D Vision
T2 - 12th International Conference on 3D Vision, 3DV 2025
AU - Zuo, Yiming
AU - Kayan, Karhan
AU - Wang, Maggie
AU - Jeon, Kevin
AU - Deng, Jia
AU - Griffiths, Thomas L.
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
AB - Building a foundation model for 3D vision is a complex challenge that remains unsolved. Towards that goal, it is important to understand the 3D reasoning capabilities of current models as well as identify the gaps between these models and humans. Therefore, we construct a new 3D visual understanding benchmark named UniQA-3D. UniQA-3D covers fundamental 3D vision tasks in the Visual Question Answering (VQA) format. We evaluate state-of-the-art Vision-Language Models (VLMs), specialized models, and human subjects on it. Our results show that VLMs generally perform poorly, while the specialized models are accurate but not robust, failing under geometric perturbations. In contrast, human vision continues to be the most reliable 3D visual system. We further demonstrate that neural networks align more closely with human 3D vision mechanisms compared to classical computer vision methods, and Transformer-based networks such as ViT [17] align more closely with human 3D vision mechanisms than CNNs. We hope our study will benefit the future development of foundation models for 3D vision. Code is available at https://github.com/princeton-vl/UniQA-3D.
KW - benchmark
KW - dataset
KW - foundation model
KW - human subject research
UR - https://www.scopus.com/pages/publications/105016164469
U2 - 10.1109/3DV66043.2025.00122
DO - 10.1109/3DV66043.2025.00122
M3 - Conference contribution
AN - SCOPUS:105016164469
T3 - Proceedings - 2025 International Conference on 3D Vision, 3DV 2025
SP - 1285
EP - 1296
BT - Proceedings - 2025 International Conference on 3D Vision, 3DV 2025
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 25 March 2025 through 28 March 2025
ER -