Do LMMs Possess Human-Like Geometrical Intuition?

COS 597B: Computational Models of Cognition @ Princeton University

Recent Large Multimodal Models (LMMs) have shown impressive results on visual question-answering (VQA) benchmarks, demonstrating a high degree of understanding of visual and textual inputs. However, this work shows that a significant gap remains between the mechanisms underlying these systems and those behind human visual perception. Through a series of experiments in which LMMs are prompted to answer questions about simple geometric properties of synthetic images, I demonstrate that these models do not possess the geometric intuitions that are innate in humans. Their inability to identify geometric properties that humans find trivial, contrasted with their remarkable performance on VQA benchmarks built from more natural inputs, reveals that their inductive biases, and the abstractions they form, differ from those underlying human cognition.
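
The experimental setup described above can be pictured as generating a synthetic geometric stimulus and querying an LMM about it. The following is a minimal sketch of what one such trial might look like, not the project's actual stimuli or prompts: it assumes the OpenAI Python SDK, a vision-capable model such as gpt-4o, and an illustrative parallel-lines task; all of these choices are assumptions for illustration only.

```python
# Hypothetical sketch of a single trial: draw two line segments and ask an
# LMM whether they are parallel. Assumes OPENAI_API_KEY is set and the
# OpenAI Python SDK (>=1.0) is installed; the model name and the stimulus
# are illustrative assumptions, not the project's actual configuration.
import base64
import io

from openai import OpenAI
from PIL import Image, ImageDraw


def make_parallel_lines_image(parallel: bool) -> bytes:
    """Draw two line segments that are either parallel or clearly not."""
    img = Image.new("RGB", (256, 256), "white")
    draw = ImageDraw.Draw(img)
    draw.line((40, 60, 216, 60), fill="black", width=3)
    # Tilt the second segment when the lines should not be parallel.
    y_end = 180 if parallel else 120
    draw.line((40, 180, 216, y_end), fill="black", width=3)
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()


def ask_lmm(png_bytes: bytes, question: str) -> str:
    """Send the image and question to a vision-capable chat model."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    b64 = base64.b64encode(png_bytes).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any vision-capable LMM endpoint works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    image = make_parallel_lines_image(parallel=True)
    print(ask_lmm(image, "Are the two line segments in this image parallel? "
                         "Answer yes or no."))
```

Scoring such trials against the known ground truth of the rendered stimulus is what allows performance on these simple geometric judgments to be compared with the models' performance on natural-image VQA benchmarks.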


Presentation

Report