Introduction
When was the last time you stopped to think about what "intelligence" actually means? This word—once primarily in the domain of psychologists, educators, and philosophers—now turns up at the gas station, the doctor’s office, who knows where. We find ourselves in an unusual moment where a concept that hummed quietly in the background of human affairs for a century or so has suddenly become a contested territory claimed by multiple stakeholders.
The AI industry confidently asserts that its benchmark performances represent the essence of intelligence. Educators point to GPA as the definitive measure of intellectual capacity. Psychologists stand by their carefully constructed IQ tests as the gold standard. Meanwhile, everyday conversations revolve around whether such foolish machines can really be considered intelligent. For heaven's sake, keep them out of schools.
But what difference does it make? Why should we care whether a chatbot is "intelligent" or simply appears that way? Perhaps because how we define intelligence reflects our deepest values about what makes human beings so special, or even worth emulating? Our definitions reveal what we believe is worth assessing, worth fostering, worth preserving.
I invite you to explore your own conception of intelligence as we navigate this topic of benchmarks and quotation marks. Answer the question for yourself: Intelligent or not? What activities make you feel most intellectually alive? When do you recognize intelligence in others? Is it their ability to recall facts, to solve novel problems, to communicate clearly, to empathize? The answer might vary depending on whether you're in a classroom, a hospital, an art studio, or a wilderness survival situation.
With too many airplanes in the sky at once—too many competing definitions circling the same concept—perhaps we need not a single definition but a more nuanced understanding of how intelligence operates across different contexts and values. As we consider the evolution of artificial intelligence discourse from 2007 to 2025, we find ourselves moving inside quotation marks—inhabiting a world where fundamental concepts like "intelligence," "understanding," and "knowing"—even "learning"—have become simultaneously more operationalized and more contested.
Flash Back, Flash Forward: Whiplash
"A fundamental problem in artificial intelligence is that nobody really knows what intelligence is," wrote Legg and Hutter (2007) for an audience of academics. Little did they suspect they would be documenting an admission few corporate experts would cop to in 2025.
Benchmarks are all the rage.
MMMU (Massive Multi-discipline Multimodal Understanding): Introduced in 2023, this benchmark tests a model’s ability to handle complex, cross-disciplinary, and multimodal tasks. AI performance on MMMU has improved sharply, with scores rising by nearly 19 percentage points in just one year.
GPQA (Graduate-Level Google-Proof Q&A): Designed to assess advanced reasoning and subject-matter expertise, GPQA scores jumped almost 49 percentage points between 2023 and 2024, reflecting rapid progress in model capability.
Humanity’s Last Exam: A new, extremely rigorous academic test where even the best AI systems score below 9%, highlighting the remaining gap between AI and human expertise in some domains.
The irony of this situation in the field of artificial intelligence has become rich—literally. In 2007, an admission of epistemological ignorance was normal. Theoretical foundations could remain unsettled while practical work continued. Researchers might have been creating a program called WatsonMD or some such to automate medical diagnosis using rule-based inference, developing a chess-playing engine using minimax algorithms, or constructing a natural language processing system to parse and answer questions in restricted domains—all without giving a fig about the nature of intelligence.
“Who knows?” functioned as a rhetorical space for innovation. Truth be told, nobody knew. By 2025, however, as AI systems have become embedded in everyday life with trillion-dollar industries built upon them, admitting such fundamental uncertainty has become professionally and commercially untenable.
The field has developed a vested interest in projecting certainty about intelligence and understanding, even as the yawning gap between statistical pattern matching and human consciousness has become more apparent. I’m aware of the sleight of hand, the swapping of consciousness for intelligence, but if anyone’s looking for a gold standard, why not consciousness? Artificial consciousness. Benchmarks of consciousness… a rock, a cucumber, a lemon tree, a flea, a bat, a golden retriever, a dolphin, an LLM, Thomas Pynchon?
This reluctance to acknowledge the paralyzing mystery at the heart of "intelligence" illustrates how thoroughly commercial imperatives have reshaped scientific discourse, which touches only lightly on education, leaving teachers who want to do the best for their students with little to go on. Where is the research educators need, the "benchmarks"? The quotation marks around concepts like "intelligence" and "understanding" haven't disappeared, but using them has become increasingly meaningless.
Settling the Question
The problem of premature closure on the concept of intelligence seems especially acute today. Fewer than twenty years ago, Legg and Hutter (2007) went to great lengths to keep the topic open to all comers. Would anyone today be ready to open the aperture so widely?
“In this paper we approach this problem in the following way: we take a number of well known informal definitions of human intelligence that have been given by experts, and extract their essential features. These are then mathematically formalised to produce a general measure of intelligence for arbitrary machines. We believe that this equation formally captures the concept of machine intelligence in the broadest reasonable sense. We then show how this formal definition is related to the theory of universal optimal learning agents.”
This kitchen-sink approach has fallen by the wayside. Melanie Mitchell, a widely published AI expert of considerable stature who also writes a Substack, wrote recently in Science about the central role of consensus in the industry. Mitchell's article helps us understand why this shift from humility to arrogance occurred in the AI industry. She notes how "questions about the definition [of intelligence]… in machines have become central… as a result of the recent successes and broad real-world deployment of deep learning systems." The economic stakes have transformed philosophical questions into existential business concerns.
What's particularly revealing is Mitchell's discussion of benchmarks. She documents the twisted logic inherent in benchmarking: "due to the incentives the field puts on successful performance on specific benchmarks, … research becomes … focused on a particular benchmark rather than the … underlying task" (pp. 6-7). The industry has effectively substituted benchmark performance for human intelligence—what one person called exploiting "cheap tricks" rather than developing deeper understanding.
The industry has found it convenient to define intelligence operationally through performance metrics rather than confronting the messier philosophical questions about what understanding actually entails. These are the questions teachers want answered. As Mitchell notes, "Could machines be said to 'understand' differently from humans? What is the difference between merely representing some aspect of the world, as a thermostat represents temperature, and truly understanding what it is that you are representing?" These important questions aren’t benchmarked. If an autonomous thermostat is reliable in turning on and off a flame, what can be said about the reliability of an LLM in answering a question of fact in a history class? How might we benchmark this topic?
Conclusion: Living Inside Quotation Marks
As we consider the evolution of artificial “intelligence” discourse from 2007 to 2025, as we acknowledge that the dominant interpretive framework has settled the question of “intelligence” by “benchmarking,” we find ourselves “living” inside quotation marks—inhabiting a world where fundamental concepts like "intelligence," "understanding," and "knowing" have become operationalized, not theoretically understood, not anchored in human reality.
Takeaway 1: The Commercialization of Uncertainty
The transition from academic humility to corporate certainty about intelligence reflects not an advancement in understanding, but a strategic repositioning driven by commercial imperatives. When trillions of dollars rest on the presumed inevitability of artificial intelligence, admitting fundamental uncertainties becomes financially hazardous.
Discussion Question: How might educators use market forces to create spaces within the AI industry where fundamental uncertainties about intelligence can be productively explored without threatening commercial narratives or investment flows?
Takeaway 2: The Benchmark Substitution
The AI industry has effectively substituted benchmark performance for deeper conceptions of intelligence, allowing "cheap tricks" that exploit statistical patterns to stand in for genuine understanding. This substitution enables progress measured in metrics while sidestepping the thornier philosophical questions.
Discussion Question: What would an alternative benchmarking system look like that assessed not just pattern-matching capabilities on really hard tasks but also the kinds of successful interactions LLMs can have with students across developmental levels?
Takeaway 3: The Embodiment Challenge
Mitchell highlighted embodiment as a potential key to understanding—the notion that intelligence emerges not from disembodied computation but from the inseparable combination of brain and body interacting with the world. This presents a fundamental challenge to current AI approaches.
Discussion Question: If embodiment is essential to understanding, can disembodied language models ever cross what Rota called "the barrier of meaning," or are we developing an entirely different phenomenon that we've mistakenly labeled "intelligence"?
Takeaway 4: The Contextual Nature of Intelligence
The concept of "intelligence" may not refer to a single, unified phenomenon but rather to different capabilities valued in different contexts. What counts as intelligence in a classroom differs from intelligence in a hospital, a battlefield, or a creative writing workshop. By treating intelligence as a universal abstraction rather than a contextually-defined set of valued capacities, AI research may be chasing a philosophical mirage.
Discussion Question: How might AI development change if we abandoned the search for general intelligence and instead focused on developing systems optimized for specific contexts and communities, each with their own locally-defined conception of what intelligent contribution looks like?
These questions have no simple answers, but they remind us that, while we live inside quotation marks, we must remain vigilant about what those quotation marks signify—not merely semantic quibbling, but profound uncertainties about the nature of intelligence, understanding, and ultimately, what it means to be human in an age of increasingly sophisticated machine interlocutors.
"The AI industry has effectively substituted benchmark performance for deeper conceptions of intelligence, allowing "cheap tricks" that exploit statistical patterns to stand in for genuine understanding. This substitution enables progress measured in metrics while sidestepping the thornier philosophical questions."
It occurs to me that art realizes, then combines, seemingly unrelated and unnamed flickers of essence into a new collection of essences. That is hard to say without resorting to a term like "non-things": the invention of new words, for example.
I wonder if it's reasonable to consider intelligence as cognitive output. I see intelligence in the architecture of my feet, which allows for the mobility to run away from saber-toothed tigers, to dance with my wife, to paint with my toes, or to kick a ball. I see intelligence in the hydrological cycle of our planet, the unfolding of a flower, the spawning of a fish. Intelligence is all around us, most of it not the result of human cognition.
I prefer to think of intelligence as a domain or substrate that finds expression through complexity. Humans once again fall into the trap of trying to assign value based upon a measurement, but can we compare the genius of the architecture of my knee to the brilliance of Terence Tao? I don't know that my knee 'understands' math; I don't know that Tao can calculate the genius of my knee.