A study by researchers from University College London (UCL) has found that leading AI models, including ChatGPT and Meta’s Llama, display irrational behaviour and make simple mistakes when solving classic logic puzzles designed to test human reasoning, raising concerns about the reasoning capabilities of current AI technologies.
The UCL team evaluated seven prominent large language models, among them ChatGPT, Meta’s Llama, Claude 2, and Google Bard (now called Gemini), and found that these models frequently gave irrational answers or made elementary errors on the puzzles.
The AIs were tested using 12 classic logic puzzles such as the Monty Hall Problem, the Linda Problem, the Wason Task, and the AIDS Task. Though humans also struggle with these puzzles, the AI models displayed irrational responses distinct from those typically shown by humans. Notably, some AI models even refused to answer certain logic questions, citing ethical concerns.
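For readers unfamiliar with these puzzles, the Monty Hall Problem is perhaps the best known: a contestant picks one of three doors, the host opens a different door hiding a goat, and the contestant may switch. The counterintuitive but correct answer is that switching wins two-thirds of the time, which is the kind of reasoning the study probed. The short Python sketch below is purely illustrative and not part of the study; the function names and trial count are my own choices for demonstration.

```python
import random

def monty_hall_trial(switch: bool) -> bool:
    """Simulate one round of the Monty Hall Problem.

    The contestant picks a door; the host, who knows where the car is,
    opens a different door hiding a goat; the contestant then either
    keeps the original choice or switches to the remaining closed door.
    Returns True if the contestant wins the car.
    """
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    # Host opens a door that is neither the contestant's pick nor the car.
    host_opens = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != host_opens)
    return pick == car

if __name__ == "__main__":
    trials = 100_000
    stay_wins = sum(monty_hall_trial(switch=False) for _ in range(trials))
    switch_wins = sum(monty_hall_trial(switch=True) for _ in range(trials))
    # Staying wins roughly 1/3 of the time; switching wins roughly 2/3.
    print(f"stay:   {stay_wins / trials:.3f}")
    print(f"switch: {switch_wins / trials:.3f}")
```

Running the simulation shows staying winning about 33% of the time and switching about 67%, the answer that both people and, per the study, several AI models often fail to reach by reasoning alone.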
Meta’s Llama 2, for instance, refused to answer the Linda Problem, citing perceived “harmful gender stereotypes,” which hurt its performance. The best-performing model was ChatGPT-4, which answered correctly 69.2% of the time, while the worst was Meta’s Llama 2 7b, which gave incorrect answers 77.5% of the time.
These findings, published in Royal Society Open Science, indicate that current AI models do not yet possess human-like reasoning abilities and raise questions about their application in critical fields such as medicine and diplomacy.