TY - JOUR
T1 - Understanding the Limits of Vision Language Models Through the Lens of the Binding Problem
AU - Campbell, Declan
AU - Rane, Sunayana
AU - Giallanza, Tyler
AU - De Sabbata, Nicolò
AU - Ghods, Kia
AU - Joshi, Amogh
AU - Ku, Alexander
AU - Frankland, Steven M.
AU - Griffiths, Thomas L.
AU - Cohen, Jonathan D.
AU - Webb, Taylor
N1 - Publisher Copyright: © 2024 Neural Information Processing Systems Foundation. All rights reserved.
PY - 2024
Y1 - 2024
N2 - Recent work has documented striking heterogeneity in the performance of state-of-the-art vision language models (VLMs), including both multimodal language models and text-to-image models. These models are able to describe and generate a diverse array of complex, naturalistic images, yet they exhibit surprising failures on basic multi-object reasoning tasks (such as counting, localization, and simple forms of visual analogy) that humans perform with near-perfect accuracy. To better understand this puzzling pattern of successes and failures, we turn to theoretical accounts of the binding problem in cognitive science and neuroscience, a fundamental problem that arises when a shared set of representational resources must be used to represent distinct entities (e.g., to represent multiple objects in an image), necessitating the use of serial processing to avoid interference. We find that many of the puzzling failures of state-of-the-art VLMs can be explained as arising from the binding problem, and that these failure modes are strikingly similar to the limitations exhibited by rapid, feedforward processing in the human brain.
UR - http://www.scopus.com/inward/record.url?scp=105000492808&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105000492808&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:105000492808
SN - 1049-5258
VL - 37
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
T2 - 38th Conference on Neural Information Processing Systems, NeurIPS 2024
Y2 - 9 December 2024 through 15 December 2024
ER -