Apple researchers recently conducted a large-scale study [1] revealing that Large Language Models (LLMs) cannot perform true logical reasoning, despite appearing intelligent in their responses.
From my perspective, injecting the word “intelligence” into the conference agenda in the 1950s, when that first gathering of scientists opened the gates of innovation, was a mistake.
The phrase “artificial intelligence” was an attempt to raise the level of interest in the work. From my understanding, the original participants would be baffled to know that people actually take the word “intelligence” literally.
How can anyone think that humans can create simulated humans? Really? I wonder what Luciano Floridi might say.
Methodology
The team created GSM-Symbolic, a new benchmark that uses symbolic templates to generate diverse variants of mathematical problems, and tested more than 20 state-of-the-art models, including GPT models and others, on 5,000 samples generated from 100 templates.
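To make the template idea concrete, here is a minimal sketch of how such a generator might work. This is my own toy example, assuming a made-up apple-counting problem; the wording, name pool, and number ranges are mine, not the paper’s actual templates or code.

```python
import random

# Illustrative sketch of a GSM-Symbolic-style template (hypothetical example, not the
# paper's benchmark). The template fixes the logical structure of the problem; proper
# names and numeric values are sampled within constraints, so every instance requires
# the same reasoning chain but different surface tokens.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "{name} then gives away {z} apples. How many apples does {name} have left?"
)

NAMES = ["Sophie", "Liam", "Mara", "Omar"]  # hypothetical name pool

def generate_instance(seed=None):
    rng = random.Random(seed)
    name = rng.choice(NAMES)
    x, y = rng.randint(5, 40), rng.randint(5, 40)
    z = rng.randint(1, x + y - 1)   # constraint keeps the ground-truth answer positive
    question = TEMPLATE.format(name=name, x=x, y=y, z=z)
    answer = x + y - z              # answer computed symbolically from the template
    return question, answer

if __name__ == "__main__":
    for i in range(3):
        q, a = generate_instance(seed=i)
        print(q, "->", a)
```

Because the ground truth is derived from the same template that produced the question, any drop in accuracy across instances can only come from the changed surface details, which is what makes this kind of benchmark harder to game than a fixed test set.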
Core Findings
No Genuine Understanding
LLMs “do not really understand what they are being asked. They instead recognize the structure of a sentence and then spit out an answer based on what they have learned through machine-learning algorithms.”
Extreme Fragility
Performance drops by up to 65% when researchers add a single irrelevant clause to a question, even though “the clause doesn’t contribute to the reasoning chain needed for the final answer”. Performance also “significantly deteriorates as the number of clauses in a question increases.”
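As a rough illustration of this perturbation (my own toy example, not the paper’s actual data), the distractor can be spliced in mechanically while the underlying arithmetic stays untouched, and the model’s answers on the original and perturbed questions can then be compared.

```python
import random

# Hypothetical sketch of an "irrelevant clause" perturbation. The clause adds no
# information needed for the arithmetic; a reasoner should ignore it entirely.
IRRELEVANT_CLAUSES = [
    "Note that two of the apples were slightly smaller than the others.",
    "The apples were picked on a cloudy afternoon.",
]

def add_irrelevant_clause(question: str, seed: int = 0) -> str:
    clause = random.Random(seed).choice(IRRELEVANT_CLAUSES)
    # Insert the distractor just before the final question sentence.
    body, final_q = question.rsplit(". ", 1)
    return f"{body}. {clause} {final_q}"

if __name__ == "__main__":
    q = ("Sophie picks 12 apples on Monday and 9 apples on Tuesday. "
         "Sophie then gives away 5 apples. How many apples does Sophie have left?")
    print(add_irrelevant_clause(q))
```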
Pattern Matching, Not Reasoning
The study found that “LLMs rely on probabilistic pattern matching rather than genuine logical reasoning”, with models declining in performance “when only the numerical values in the question are altered”.
Failed Logic Test
LLMs cannot distinguish relevant from irrelevant information—a basic requirement for logical reasoning. When asked simple questions with non-pertinent details, “this was enough to confuse the LLMs into giving wrong or even nonsensical answers to questions they had previously answered correctly.”
Specific Recommendations
Improved Evaluation Methods
The study “calls for improved evaluation methods” beyond single-metric benchmarks like GSM8K.
Enhanced Benchmarking
Use GSM-Symbolic and similar template-based approaches for more reliable assessment.
Fundamental Research Priority
Researchers state: “We believe further research is essential to develop AI models capable of formal reasoning, moving beyond pattern recognition to achieve more robust and generalizable problem-solving skills”.
Implications
The research suggests that seemingly intelligent AI responses are “little more than an illusion,” challenging assumptions about current AI capabilities and highlighting the need for developing genuinely reasoning-capable AI systems.
***
Let me put on my Luciano Floridi glasses and make up some responses he might give. Each of the following is my guess.
My Take on How Floridi Might Respond: “I Already Told You This”
From Floridi’s perspective, the Apple study essentially “discovered” what he had already theorized years earlier—though they present it as if revealing something surprising about AI capabilities. More problematically, their approach reveals fundamental philosophical confusion about what they’re studying.
Nothing New
Floridi’s thesis of “AI as Agency Without Intelligence” already established that we have “decoupled the ability to act successfully from the need to be intelligent, understand, reflect, consider or grasp anything” (AI as Agency Without Intelligence). The Apple researchers’ “spit out” finding simply confirms what Floridi theorized: LLMs are like tricksters who “gobble data in astronomical quantities and regurgitate (what looks to us as) information.”
Predictable Category Mistake
The Apple study’s surprise at LLM limitations reveals they fell into the anthropomorphic trap Floridi warned against. He had already noted that “these AI systems…do not think, reason or understand; they are not a step towards any sci-fi AI,” and that LLMs “can work on the formal structure, and not on the meaning of the texts they process—what we do semantically.”
Philosophical Poverty
From Floridi’s perspective, the Apple study exemplifies exactly the kind of atheoretical empiricism that plagues AI research. The researchers criticize LLMs for lacking “logic” and “reasoning” without ever defining what these terms mean. They invoke rational processes without articulating what rationality entails. What was rational in the 19th century is not rational today. This conceptual sloppiness reflects their failure to ground their research in an adequate philosophy of science.
Missing the Method of Levels of Abstraction
Floridi would argue that Apple’s confusion stems from their failure to specify their Level of Abstraction (LoA). They analyze LLMs sometimes as cognitive systems (inappropriate), sometimes as information processors (appropriate), and sometimes as would-be humans (deeply confused) without ever making explicit which LoA they’re operating within. This methodological incoherence leads to incoherent conclusions.
The Human Fallacy
Most fundamentally, Floridi would argue that Apple criticizes LLMs for not being human—which is obviously true but philosophically irrelevant. Their entire framework assumes human cognition as the gold standard, missing that “artificial intelligence manages the properties of electromagnetism to process texts with extraordinary success and often with outcomes that are indistinguishable from those that human beings could produce.” This represents “a form of agency that is alien to any culture in any past”—genuinely revolutionary not because it mimics human reasoning, but because it demonstrates an entirely new form of successful action without intelligence.
The User-Absent Analysis
Floridi would also critique Apple’s focus on the machine in isolation. Their analysis ignores the human-AI interaction system where meaning emerges through use. They test LLMs as standalone reasoning engines rather than examining how they function within human social practices, missing that these systems derive their power precisely from human-machine collaboration, not autonomous operation.
Apple’s Own Illusion
Ironically, Floridi would argue that Apple has created their own illusion—the illusion that there was ever a question about whether LLMs engage in human-like reasoning. Their “discovery” that LLMs aren’t human minds reveals more about their conceptual confusion than about AI limitations. The real illusion is expecting statistical systems to behave cognitively and then expressing surprise when they don’t.
Conclusion
The Apple study provides solid empirical confirmation of Floridi’s theoretical insights, but reveals deep philosophical confusion about what constitutes adequate analysis of artificial systems. Their recommendations for “formal reasoning” betray their failure to understand that they’re studying a fundamentally new kind of agency that requires new conceptual tools, not better approximations of human cognition.
[1] https://machinelearning.apple.com/research/illusion-of-thinking