AI Models Show Limited Success in Abstract Reasoning Tests Compared to Humans

Recent research examining artificial intelligence’s ability to solve visual puzzles similar to those found in human IQ tests revealed significant limitations.

Can AI systems effectively tackle cognitive challenges designed to test human intelligence? The findings present a complex picture.

Scientists at the USC Viterbi School of Engineering’s Information Sciences Institute (ISI) conducted research exploring whether multi-modal large language models (MLLMs) could successfully navigate abstract visual assessments traditionally used for human cognitive testing.

The study, presented at the Conference on Language Modeling (COLM 2024) in Philadelphia, evaluated “the nonverbal abstract reasoning capabilities of both open-source and closed-source MLLMs” by examining whether image-processing models could demonstrate advanced reasoning abilities when confronted with visual puzzles.

“Consider a scenario where a yellow circle transforms into a blue triangle – can the model identify and apply this pattern in different contexts?” elaborated Kian Ahrabian, a research assistant involved in the study, as reported by Neuroscience News. Such tasks demand both visual perception and logical reasoning abilities akin to human cognitive processes, presenting a particularly complex challenge.
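For readers who want to try this themselves, the snippet below is a minimal sketch of how such a pattern-completion puzzle could be posed to a vision-capable model through the OpenAI Python client. The file name puzzle.png, the prompt wording, and the A-to-F answer options are illustrative placeholders, not the study’s actual materials or prompts.

```python
# Hypothetical sketch: posing a visual pattern-completion puzzle to a
# vision-capable chat model. Requires the `openai` package (v1+) and an
# OPENAI_API_KEY in the environment; puzzle.png is a placeholder image.
import base64

from openai import OpenAI

client = OpenAI()

# Encode a local puzzle image, e.g. a 3x3 grid whose last cell is blank,
# plus six candidate answers labelled A-F.
with open("puzzle.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable chat model
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "The shapes in this grid change according to a hidden "
                        "rule (for example, a yellow circle becoming a blue "
                        "triangle). Which option, A to F, completes the "
                        "pattern? Answer with a single letter."
                    ),
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```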

The research team evaluated 24 distinct MLLMs using puzzles derived from Raven’s Progressive Matrices, a widely recognised abstract reasoning assessment—and the results were notably disappointing.

“The performance was remarkably poor. The models struggled to extract meaningful insights,” Ahrabian reported. The AI systems encountered difficulties in both visual comprehension and pattern recognition.

Nevertheless, performance varied across models. The study revealed that open-source models generally performed worse on visual reasoning puzzles than closed-source alternatives such as GPT-4V, though even those fell short of human cognitive capabilities. Researchers achieved some improvement in model performance through Chain of Thought prompting, a technique that breaks the reasoning process down into sequential steps.
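To make that concrete, the fragment below contrasts a direct prompt with a hypothetical Chain of Thought variant; the wording is an illustration of the general technique, not the prompts used in the paper. Either string could be dropped into the text part of a request like the one sketched earlier.

```python
# Direct prompt: ask only for the final answer.
DIRECT_PROMPT = (
    "Which option (A-F) completes the pattern? Answer with a single letter."
)

# Chain of Thought prompt: ask the model to spell out intermediate steps
# before committing to an answer. Wording is illustrative only.
COT_PROMPT = (
    "Solve this puzzle step by step:\n"
    "1. Describe the shapes, colours, and counts in each panel.\n"
    "2. State the rule that transforms one panel into the next.\n"
    "3. Apply that rule to the incomplete row or column.\n"
    "4. Only then give your final answer as a single letter (A-F)."
)
```

Asking the model to articulate the transformation rule before answering is what breaking the reasoning into sequential steps amounts to at the prompt level.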

Closed-source models’ superior performance is attributed to their specialised development, extensive training datasets, and access to substantial corporate computing resources. “While GPT-4V demonstrated relatively strong reasoning capabilities, it remains significantly imperfect,” Ahrabian observed.

“Our comprehension of modern AI models’ capabilities remains remarkably limited, and until we fully grasp these constraints, we cannot enhance their safety, utility, and overall performance,” explained Jay Pujara, research associate professor and study author. “This research illuminates a crucial gap in our understanding of AI’s limitations.”

By identifying these shortcomings in AI systems’ reasoning capabilities, studies of this nature can help guide future developments to strengthen these cognitive abilities—with the ultimate aim of achieving human-comparable logical processing. However, there’s no immediate concern: At present, these systems remain distinctly inferior to human cognitive capabilities.