Everybody remembers the Phonics Wars. They started long ago, in a distant time when schools were separated by race, corporal punishment was considered appropriate, and students were there to toe the line. After decades of recriminations, with phalanxes of experimentalists fighting in the trenches, the war was over in 2022 with the advent of "Sold a Story," Emily Hanford's prize-winning podcast series from American Public Media that investigated how reading is taught in American schools.1
Legislation was passed shortly thereafter in almost every state mandating phonics instruction in isolation and banning three-cueing pedagogy.2 Teachers were expected to set aside their professional understandings, throw up their arms, and say either "Hurrah! The promised land!" or "Darn, I was wrong all along." Teachers tend to value their classroom experience above research and are not likely to switch sides at the drop of a hat, or in this case, a law.
The expectation that instruction can be transformed by legislative mandate based primarily on selected quantitative test scores is problematic and oversimplifies the complex reality of classroom teaching. It’s especially troubling to imagine a legislator in an office scrutinizing the research. While data-driven approaches have their place, they often fail to capture the full picture of effective instruction.
Mainstream consumers of research expect it to be experimental (i.e., involving "hard evidence" or numbers from two or more different groups), credible (involving statistical tests of significance, not expert judgement), deep (replicated over and over again for years), and peer reviewed by independent scholars blinded to the author (impossible, since peer reviewers worth their salt understand the dynamics of the war and can tell whose side you're on after the first paragraph). The question of phonics first is infinitely more complex than the question of, say, whether asbestos causes respiratory problems, including deadly diseases.
Even if there were no problems in creating and interpreting quantitative data, studies must be theoretically grounded in frameworks that are plausible. Phonemes, for example, are much more than sounds.3 Research on phonics rests on a brittle, definitionally impoverished framework whose surface appeal is exemplified in the phrase "sound it out." An example: Suppose I wanted to test whether teaching a child to dribble a basketball will improve that child's game. First off, I would need a theory of dribbling that accounts for the phenomenon and its relations with other phenomena in the basketball game, in the way I need a theory of the phoneme and its relation to reading.
So I teach 15 third graders to dribble using a systematic, replicable approach, the same for everyone. I compare my 15 trained third graders against 15 others who receive only traditional physical education instruction without specific dribbling focus. This comparison, while common in educational research, raises important questions about what constitutes a fair control condition, a methodological challenge that appears in many, though not all, phonics and AI studies. The study would be disqualified if I designed the treatment and then measured it against a simple "no treatment" group. The most I would be able to say is this: teaching third graders to dribble results in better dribbling than no instruction at all.
I measure how much improvement I see in each group and compare the results. Voilà! The children I taught to dribble increase a) their endurance (length of dribbles without stopping or bobbling the ball), b) their speed (accurately dribbling the ball at increasing rates), and c) their double-handedness (equal ability with each hand). Whether it improved their overall game? Who knows?
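To make the arithmetic of such a comparison concrete, here is a minimal sketch in Python using entirely invented scores, not data from any real study: two groups of fifteen, improvement measured only on the trained skills, and an independent-samples t-test.

```python
# Hypothetical illustration only: invented "improvement" scores on the very
# skills the program trained (endurance, speed, double-handedness combined).
from scipy import stats

trained = [12, 9, 14, 11, 10, 13, 8, 15, 12, 11, 9, 13, 10, 14, 12]  # dribbling instruction
control = [6, 8, 5, 9, 7, 6, 8, 4, 7, 9, 5, 6, 8, 7, 6]              # "business as usual" PE

t_stat, p_value = stats.ttest_ind(trained, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Even a tiny p-value licenses only the narrow claim above: teaching dribbling
# produces better dribbling scores than not teaching it. It says nothing about
# game performance or transfer.
```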
A Merciless Critique of the Dribbling Study
By the way, any study purporting to tell teachers how to teach must be critiqued mercilessly. This basketball dribbling study exemplifies textbook methodological flaws that would make any serious researcher cringe. Let me dissect its profound inadequacies.
First and most glaringly, the study commits the cardinal sin of educational research by using a "business as usual" control group, particularly troubling if I had good reason to think the intervention is truly more effective. As research methodologists point out, it is unethical, and no IRB would ever approve a study in which one group of people is denied an intervention the researcher is convinced is superior to the status quo. Such research fails to qualify as research because the researcher has no legitimate question to study and is simply trying to win an argument, not advance knowledge for everyone.
For AI classroom research to truly advance educational practice, we need studies that:
- Compare AI tools to other reasonable, active interventions requiring similar time, attention, and resources
- Measure not just immediate academic performance but transfer of learning to new contexts
- Examine both cognitive and social-emotional impacts of AI integration
- Track effectiveness across diverse student populations and learning environments
- Evaluate sustainability and long-term impact beyond initial implementation periods
Just as the interesting question about basketball dribbling isn't whether to teach it but how to teach it most effectively, the valuable questions about AI in education aren't about whether to use technology but how to integrate it within comprehensive, theoretically grounded educational approaches that genuinely serve students' long-term learning needs.
The proper design would compare a systematic approach against another legitimate instructional approach. Perhaps comparing this dribbling technique against a more play-based approach, or a peer-coaching method, or any other reasonable alternative would provide meaningful data. Without this, we’re simply proving that teaching is better than not teaching, hardly a groundbreaking revelation.
Furthermore, this dribbling study suffers from confirmation bias in its very structure. Its selected measurement criteria (endurance, speed, ambidexterity) perfectly align with what the program explicitly trains. This is equivalent to teaching to the test, then claiming superior results because students did well on that specific test. It tells us nothing about overall basketball ability, game performance, or whether these isolated skills transfer to actual play situations. Could it be that dribbling in the context of a game isn't the same phenomenon as dribbling alone?
The methodology also fails to address numerous confounding variables. Were the children randomly assigned? Were baseline measures taken? Were the assessors blinded to group assignment? Without these controls, researcher expectancy effects likely contaminated results.
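As a purely hypothetical sketch, the mechanics behind two of those questions are simple enough to write down; the names and numbers below are invented, not drawn from any actual study.

```python
# Hypothetical sketch of two of the missing controls: random assignment to
# conditions and a baseline measure taken before any instruction.
import random

random.seed(42)  # fixed seed so the assignment is reproducible

children = [f"child_{i:02d}" for i in range(1, 31)]  # 30 invented third graders
random.shuffle(children)

treatment_group = children[:15]  # systematic dribbling instruction
control_group = children[15:]    # the comparison condition

# Invented baseline dribbling scores collected before training begins.
baseline = {child: random.randint(1, 10) for child in children}

print(f"treatment n = {len(treatment_group)}, control n = {len(control_group)}")
```

Blinding the assessors is the harder part; it is an organizational safeguard, not a line of code.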
Most damning is the complete lack of theoretical framework. The design reduced the complex process of motor skill development to a mechanistic drill-and-practice model without acknowledging the rich literature on contextual learning, motivation, and skill transfer. This represents the same reductionist thinking that plagues phonics research—and increasingly plagues AI research—focusing on isolated subskills while ignoring the holistic nature of learning.
In summary, the study doesn't demonstrate effectiveness; it demonstrates the predictable outcome of a flawed research design that was built to confirm preconceptions rather than examine them rigorously.
The Illusion of "Evidence-Based" Approaches
The phonics wars appeared to end with a victory for explicit phonics instruction, backed by experimental studies and meta-analyses. However, this "evidence" rests on an indefensible definition of reading as pronouncing words accurately. The evidence also measures the explicit skills directly taught in the treatment rather than long-term literacy outcomes.
Similarly, current AI education research often measures success through immediate task completion or engagement metrics rather than deeper learning. Or in reverse, it measures risk and danger by drawing theoretically on squishy concepts like cognitive offloading and critical thinking atrophy. Then, it either reduces these complex terms to a meaningless construct or ignores the complexity entirely.
In both debates, theoretical models of how learning in reading and writing occurs have been relegated to afterthoughts. A simplified theory of mind centered on cognitive offloading appears to underpin many influential AI studies in education, though certainly not all. This framework, while offering some insights, often fails to capture the full complexity of human learning; it seems to explain a lot by application of common sense. But if common sense worked, why would anyone go to the trouble of labor-intensive, highly expert research design?
The dominant research paradigm assumes learning is primarily about attention and task adherence, about memory and arguments, a fundamentally incomplete view of human cognition and development. This reductionist approach ignores the complex, social, and contextual nature of authentic learning.
Case in point: the Microsoft study on knowledge workers a few months ago sounded alarm bells heard around the world.4 The study found a strong negative correlation between workers' confidence in AI and their likelihood to engage in critical thinking. When workers trusted AI outputs more, they were less likely to question, verify, or critically engage with those outputs.
Conversely, individuals with higher self-confidence in their own expertise were more likely to scrutinize AI-generated content, even though it required more effort. It’s appropriate to ask whether the word “confidence” even makes sense in this context. “Trusting AI outputs” means having confidence that the outputs are non-problematic. “Having confidence in expertise” is not the same as having confidence in AI output.
Which of these factors do you think might be a bit challenging to measure, especially using a survey method designed to study opinions? "I am generally confident in AI output." How would you respond? "Am I confident in anything all the time? Is confidence a monolithic construct we can fit in containers labeled 1 = never, 2, 3, 4, and 5 = all the time?"
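To see how thin the resulting measure is, here is a hypothetical sketch, with invented responses and variable names of my own, of how two Likert items end up as a single correlation coefficient.

```python
# Hypothetical illustration, not the study's data or analysis. Each invented
# worker answers two 1-5 Likert items; whatever "confidence" means to each
# person is collapsed into one ordinal number.
from scipy import stats

confidence_in_ai = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]   # "I am generally confident in AI output."
critical_scrutiny = [2, 2, 3, 4, 1, 5, 2, 1, 4, 3]  # self-reported scrutiny of AI output

rho, p = stats.spearmanr(confidence_in_ai, critical_scrutiny)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")

# A negative rho reads as "more trust in AI, less critical thinking," but the
# number inherits every ambiguity baked into the survey items themselves.
```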
Many participants reported that as they relied more on AI for tasks such as writing, analysis, or evaluation, they practiced these skills less themselves. Over time, this led to a self-reported decline in their ability to perform such tasks independently—a phenomenon described as "cognitive atrophy" (the gradual weakening of mental skills through disuse). As a researcher, I have serious methodological concerns about this study's approach to measuring such complex cognitive processes, especially via self-reports. The findings warrant considerably more caution in their interpretation and application than they've received in public discourse.
The most egregious flaw is the non-existent theoretical model. To flesh out the construct we call "critical thinking" by drawing on Benjamin Bloom's taxonomy (a hierarchical framework from the 1950s classifying educational learning objectives) would be like drawing on definitions of intelligence from the early 20th century. Given the centrality of the construct, one might expect at least some awareness on the part of the researchers of the problems this choice presents for interpreting the results.
Nonetheless, the study claimed that workers were motivated to think critically when the stakes were high, such as ensuring work quality, avoiding errors, or developing professional skills. However, time pressure, high trust in AI, and lack of motivation or awareness often inhibited critical thinking. It is likely a truism that people think more intently when the stakes are high and less intently when they are pressed for time.
The Quick Fix Fallacy
Education faces constant pressure to deliver immediate, measurable results. Phonics offered a seemingly straightforward solution to literacy problems; now AI promises similar quick fixes—and equally quick cognitive assaults. Both fail to address the fundamental complexity of learning and the diverse needs of different learners.
A truly evidence-based approach would value long-term outcomes over short-term gains, integrate theoretical understanding with empirical findings, recognize the limitations of experimental methods in complex learning environments, consider diverse forms of evidence including qualitative research, and focus on equity of outcomes across different learner populations.
Moving Forward: AI and Reading Instruction Beyond the Wars
Rather than repeating the mistakes of the phonics wars, we need to stop and think carefully about our biases and preconceptions to act responsibly as teachers. We may harbor aspirations of saving mankind from destruction whether we are pro or con on AI. But we should be mindful of how compelling narratives can sometimes overshadow nuanced research, as happened in some aspects of the phonics debate. Even well-intentioned advocacy can sometimes lead to oversimplification of complex educational issues. This moment calls for better from us. We don't yet have the research we need. We simply don't know the full picture.
What would more robust research look like? Strong studies would employ active comparison groups using alternative valid approaches rather than 'do-nothing' controls. They would measure both immediate skill acquisition and long-term transfer to authentic contexts. Mixed-methods designs combining quantitative outcomes with qualitative analysis of the learning process would provide deeper insights. Longitudinal studies tracking development over years rather than weeks or months would better capture true educational impact. And crucially, research would be conducted by interdisciplinary teams bringing expertise in both the content area and research methodology.
We need AI research that prioritizes deeper learning objectives over surface-level metrics, develops theoretical frameworks before implementation, designs technology applications to support diverse learning paths rather than standardizing instruction, and measures success through long-term educational outcomes as determined by expert teacher judgement. We need to remain vigilant in our reading of empirical research. We need to separate out the emotional noise, the preaching, the proselytizing, to reach reasoned conclusions rooted in good science. This moment challenges us to question what evidence matters in educational decision-making, whether about reading instruction or AI integration.
When evaluating quantitative research on AI in education, consider these key questions: Does the study measure meaningful learning outcomes beyond immediate task completion? Does it acknowledge the social and contextual nature of learning? Does it compare the intervention to other valid approaches rather than to no intervention? Does it track impacts over meaningful timeframes? And perhaps most importantly, does it recognize the diversity of learner needs and experiences? By demanding higher standards of evidence, we can avoid repeating the reductive debates that characterized the phonics wars and move toward more nuanced, effective educational approaches to AI.
1. For readers unfamiliar with the reference, "Sold a Story" documented how some popular teaching methods weren't aligned with the science of reading, sparking widespread policy changes across numerous states.
2. Three-cueing pedagogy differs from phonics instruction in a fundamental way. Three-cueing pedagogy teaches readers to decode words in context, not in isolation. Phonics instruction holds that every student must learn and pass tests indicating the ability to decode words in lists before they are expected to read words in context.
3. Phonemes are defined as the smallest units of sound that make a difference in meaning in a language. For example, /b/ is a phoneme because it makes a meaningful difference: replacing the /b/ in "bat" with another sound yields "sat," "cat," or "that." This definition is encoded in the scientific literature and contextualized in varied theoretical frameworks. To competently discuss the nature and significance of the concept "phoneme" requires a fairly sophisticated level of understanding.
4. https://www.microsoft.com/en-us/research/wp-content/uploads/2025/01/lee_2025_ai_critical_thinking_survey.pdf