The educational technology world may soon be buzzing about a new study claiming AI feedback systems dramatically outperform human teachers. But buried in the data is a revelation that should stop us in our tracks: AI feedback worked beautifully for STEM students while largely failing language learners.
As someone who's spent years watching tech evangelists promise educational silver bullets, I see this "interaction effect" not as a statistical footnote but as the whole story. The researchers themselves admit their AI system produced "more complex feedback" for STEM subjects while language students received "standardized feedback" focused mainly on surface-level corrections.
This isn't just about algorithms. It's about whether we truly understand the profound differences between teaching calculus and teaching writing, because feedback in those two disciplines is fundamentally different in kind. In STEM, we often seek concrete, rule-based corrections where answers can be definitively right or wrong. A missing negative sign or incorrect formula has a clear fix. But language education thrives on nuance, voice, and the messy human adventures of meaning-making. Strong writing feedback requires understanding intention, audience, and cultural context—precisely the areas where our current AI systems struggle most.
Let me walk you through what this April 2025 study actually reveals when you look beyond the headline numbers, and why the implications matter for every classroom considering an AI teaching assistant...
Background
A recent study published in April 2025 examines how artificial intelligence might transform educational outcomes in mixed-ability undergraduate classrooms by providing students with customized feedback, something particularly important for struggling learners. As educators grapple with diverse learning needs in the post-Covid era, the research offers timely insight into whether AI-driven adaptive feedback can effectively close learning gaps. For me, the most interesting finding from the study is this:
“The STEM group benefited more from AI feedback than the language group. The interaction effect was statistically significant, suggesting discipline-specific differences in AI feedback efficiency.”
As you will see in this critique, the study essentially asked whether AI feedback or human feedback, from any teacher in any discipline, results in better learning. To find out, the researchers studied generic “undergraduate students” in bulk, 700 of them to be exact, a mixture of STEM and “language education” students. The interaction effect I’ve highlighted suggests that it probably wasn’t a good idea to study AI feedback apart from particular disciplinary learning. In other words, the kind of feedback AI provides is critical to its effects, and quality feedback in language education differs from quality feedback in STEM.
The Study's Strengths
The research by Naseer and Khawaja (April 18, 2025) deserves recognition for several noteworthy strengths:
A robust sample size (700 undergraduate students). Unfortunately, this large size made a mixed-methods design impractical; as a result, we have no qualitative understanding of why the interaction occurred, that is, why feedback uptake differed so markedly between language education and STEM students.
A reasonably balanced demographic representation across gender and academic disciplines. Educational research in real classrooms always struggles to achieve this kind of balance; in this case, the roll of the dice went in the researchers’ favor.
Use of multiple assessment methods, including pre/post-tests, cognitive load measures, and engagement tracking. Again, these methods would have been more informative with a qualitative component showing how students interpreted crucial terms such as “cognitive load” and “engagement” when they filled out the self-report surveys.
Comprehensive system design incorporating multiple feedback mechanisms. Given the expertise amassed within the institution where the research was conducted and the mission of the researchers, the technological feedback mechanisms were, as near as I can tell, reliable and predictable.
Addressing a genuine educational challenge: learning gaps in mixed-ability classrooms.
Methodology
The researchers divided participants into two equal groups:
Experimental mixed group (350 students): Received AI-driven adaptive feedback
Control mixed group (350 students): Received traditional instructor-led feedback
The AI system incorporated real-time performance tracking, personalized learning paths, adaptive feedback complexity, engagement analytics, and live AI tutoring capabilities.
Key Findings
The results showed advantages for the AI-feedback group across multiple dimensions:
Conceptual Mastery: Students receiving AI feedback demonstrated a 28% improvement compared to 14% in the control group.
Student Engagement: The AI group showed a 35% increase in engagement, while the control group experienced declining engagement after week 8 of the 20-week term.
Cognitive Load Reduction: Students in the AI group reported a 22% reduction in cognitive overload, compared to only 6% in the control group.
Knowledge Retention: Students using the AI system retained 85% of learned concepts four weeks later, versus 65% in the control group.
Study Limitations: From Minor to Serious Concerns
Common Limitations in Educational Research
Self-Report Measures: As noted before, self-reports using large-scale surveys are always suspect. Like many educational studies, this research relies on self-reported engagement and cognitive load, which may introduce response bias.
Novelty Effect: The initial enthusiasm for new technology may have temporarily boosted student engagement independent of the system's effectiveness.
Moderate Methodological Concerns
Limited Context for Effect Sizes: The reported improvements (28% in conceptual mastery, 35% in engagement) far exceed what educational interventions typically achieve, where gains on the order of 5-15% are more common. Without comparative context, or the standardized effect sizes that would permit it (see the sketch after this list), these unusually large effects raise questions.
Generalizability Issues: The findings from this single institutional context may not transfer to diverse educational settings with different resource levels or cultural contexts.
Demographic Analysis Gaps: The study collected some individual data but inadequately analyzed how prior AI experience or other demographic factors might influence outcomes.
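For readers who want to see what that comparative context would involve, here is a minimal sketch, assuming hypothetical group means and standard deviations (the paper does not report the variance figures such a calculation requires), of how raw gains could be converted into a standardized effect size and set against published benchmarks for educational interventions:

```python
# Hypothetical illustration: converting raw gains into Cohen's d so they can be
# compared with published benchmarks for educational interventions.
# The numbers below are placeholders, not values reported in the study.
from math import sqrt

def cohens_d(mean_treat: float, mean_ctrl: float, sd_treat: float, sd_ctrl: float) -> float:
    """Standardized mean difference using a simple pooled standard deviation."""
    pooled_sd = sqrt((sd_treat**2 + sd_ctrl**2) / 2)
    return (mean_treat - mean_ctrl) / pooled_sd

# Placeholder post-test gains (percentage points) and spreads.
print(cohens_d(mean_treat=28.0, mean_ctrl=14.0, sd_treat=18.0, sd_ctrl=18.0))
# ≈ 0.78, a large effect by most educational benchmarks, which is exactly why
# the missing variance information matters.
```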
Significant Research Design Flaws
Non-Equivalent Control Conditions: The study fails to establish that the AI group and control group received comparable amounts and quality of feedback. If the AI system provided more frequent or detailed feedback, this alone could explain the performance differences. It is difficult to know the quality of the AI’s feedback in any detail; it is impossible to know anything at all about the quality of feedback from the language instructors.
The researchers report that “language education” ran the gamut from English composition to EFL, presenting a diverse and complex constellation of feedback needs different in kind from the feedback useful in STEM courses. Consider the language of the following quote from the methods section; this sort of feedback is unlikely to serve the needs of writing students:
“The AI Feedback System, which has openAI API integrated at backend to respond to student queries in Figure 3 enabled users to submit their work for instant feedback that categorized mistakes and offered structured recommendation solutions. Students used the text-based input panel to prompt open and multiple choice questions and to provide written and audio responses while the AI analysis panel presented systematic explanations that suggested improvement steps. The system implements adjustable feedback protocols, that use students’ past performance patterns for delivering immediate individualized productive feedback.”
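To make the concern concrete, here is a purely hypothetical sketch (in Python, using the OpenAI client the methods section mentions; none of the prompts, rules, or names below come from the study) of how an “adjustable feedback protocol” of this kind might be wired up, and of how easily a subject-keyed rule can deliver substantive, conceptual feedback to STEM students while defaulting to surface-level correction for language students:

```python
# Hypothetical sketch of an "adjustable feedback protocol"; NOT the study's code.
# It illustrates how a subject-keyed depth rule can privilege STEM submissions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

DEPTH_RULES = {
    # Illustrative rules only; the study does not publish its prompts.
    "stem": "Explain the underlying concept, diagnose the error, and give step-by-step remediation.",
    "language": "Identify grammatical and lexical errors and suggest corrections.",
}

def feedback(submission: str, subject: str, past_avg: float) -> str:
    """Return AI feedback whose depth depends on subject and past performance."""
    depth = DEPTH_RULES[subject]
    if past_avg < 0.6:  # adapt to the student's performance history
        depth += " Use simpler wording and include one worked example."
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": f"You are a tutor. {depth}"},
            {"role": "user", "content": submission},
        ],
    )
    return response.choices[0].message.content
```

Nothing in a protocol like this asks the language branch about audience, purpose, or voice; the depth rule itself encodes the disparity that the interaction effect later exposes.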
Returning to the problem of the statistical interaction, we have even more reason to argue that the research actually supports a narrower conclusion: AI feedback of the type described in this study is more effective for STEM courses than for language education courses. At most, we can say that the sort of feedback provided in the study appeared to support learning in one domain but not in the other.
A main effect occurs when the measured outcome for subjects in the treatment condition (students lumped together and receiving AI feedback) is significantly greater than in the control condition (students lumped together and receiving instructor feedback instead). In this case, a main effect would indicate that AI feedback is better for all students in the treatment condition irrespective of major, gender, age, or any other potentially confounding variable. One is warranted in asserting a superiority effect for the treatment only if there is no interaction, that is, only if the treatment’s effect does not depend on a third factor such as discipline.
“If there is a significant interaction, then ignore the following two sets of hypotheses for the main effects. A significant interaction tells you that the change in the true average response for a level of Factor A depends on the level of Factor B. The effect of simultaneous changes cannot be determined by examining the main effects separately. If there is NOT a significant interaction, then proceed to test the main effects” (Schuster, 2021).
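For concreteness, here is a minimal sketch in Python (using pandas and statsmodels; the file and column names are hypothetical, not the study’s data) of how this interaction test is typically run before any main effect is interpreted:

```python
# Minimal sketch of a 2x2 factorial analysis: feedback type (AI vs. instructor)
# crossed with discipline (STEM vs. language education). The data file and
# column names are hypothetical placeholders, not the study's dataset.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("scores.csv")  # columns: gain, feedback, discipline (hypothetical)

# Fit a model with both main effects and their interaction.
model = ols("gain ~ C(feedback) * C(discipline)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))

# Following Schuster's rule: if the C(feedback):C(discipline) row is significant,
# the effect of AI feedback depends on discipline, and the main effect of
# feedback should not be interpreted on its own.
```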
The researchers did, in fact, investigate the interaction and concluded essentially the same thing I concluded, namely, that “feedback” isn’t the same in language education as in STEM:
“The AI-driven feedback engine produced more complex feedback content consistently for students taking STEM classes because it used the subject matter to adapt feedback depth accordingly. Language education students received [AI] standardized feedback which dedicated most attention to linguistic aspects of their work.”
AI feedback may have some use in undergraduate education for mitigating differences in levels of student expertise, but its effectiveness depends on the canonical qualities of the subject area.
Insufficient Long-Term Assessment: True educational impact requires measurement beyond a one-month delayed recall test. Without longer follow-up, claims about "retention" are premature. To be fair, perhaps the majority of research in education fails to meet this criterion. For an intervention with potentially transformative impact on teaching and learning, however, long-term follow-up seems essential.
Ethical Framework Omissions: Although the paper gestures occasionally toward ethical concerns, it lacks reference to established AI ethics frameworks (IEEE, UNESCO, etc.) regarding data privacy, algorithmic transparency, and potential technology dependency.
The Primary Concern: Didactic vs. Dialogic Feedback
Didactic and dialogic pedagogies represent two distinct approaches to classroom instruction, each with unique characteristics and implications for student learning. Didactic pedagogy is rooted in a traditional, teacher-centered model where the teacher transmits knowledge directly to students. In this approach, classroom talk is primarily one-way: the teacher explains, demonstrates, or instructs, and students are expected to listen, absorb, and reproduce information. The focus is on delivering curriculum content efficiently, often through structured lessons and authoritative explanations. This method is knowledge-oriented, emphasizing the accurate transfer of established facts and procedures from teacher to student. Feedback is often evaluative and directive. Successful learning is measured against objective, often countable, criteria.
In contrast, dialogic pedagogy centers on two-way communication and active participation. Here, both teachers and students engage in dialogue to explore ideas, ask questions, and build understanding collaboratively. Dialogic teaching values students’ contributions, encourages them to elaborate on their thinking, and uses classroom discussion as a tool for deeper learning. The teacher’s role shifts from sole authority to facilitator, guiding conversations that help students construct knowledge together. This process-oriented approach fosters critical thinking, adaptability, and a sense of shared inquiry with less priority on declarative content knowledge.
While didactic pedagogy prioritizes efficient content delivery and clear authority, dialogic pedagogy emphasizes interactive learning and the co-construction of meaning. Effective teaching often involves moving between these modes, choosing the most appropriate approach for the learning context and goals.
I don’t see these two types of teaching as a binary. In one activity a teacher may approach the lesson didactically, in the next dialogically—even within the same lesson. But I do think the feedback provided in this study was uniformly didactic, not dialogic, and I would argue against a monolithic model of feedback that would mix language education students with STEM students, and make a generic claim about AI feedback. Further, language education flourishes in a dialogic mode. This study fails to reflect these categorical differences. Feedback is not feedback is not feedback. I find this study laudable in intent but flawed in its theoretical model of feedback and its methodology.
Recommendations for Researchers and Practitioners
For future research to advance this promising topic more credibly:
Match feedback frequency and detail across experimental conditions. If a study integrates learning across disciplines by design, provide feedback appropriate to the nuances of each discipline's pedagogical and learning requirements.
Conduct longer-term follow-up studies (6-12 months) to assess true retention. This point was addressed above, but it bears repeating. Part of this long-term strategy should include robust qualitative research to answer questions not just of what and how but of why and when.
Implement independent evaluation by researchers whose mission is to explore both didactic and dialogic feedback, delivered by AI and by humans, in different configurations. To my mind, there will obviously be a place for both types of feedback in future classrooms. Beginning to take these questions seriously and to research them thoughtfully is long past due.
Adopt established AI ethics frameworks with clear protocols for data privacy.
Conclusion
This study highlights AI's potential to transform educational outcomes while also demonstrating the need for more rigorous research standards in educational technology. While the findings suggest promising directions, stakeholders should view them as preliminary until independently replicated with stronger controls and research design.
The educational technology domain requires high-quality assessment like never before: much more teacher time devoted to feedback, and neither uncritical acceptance nor dismissal of innovation but careful, ethical implementation based on transparent, rigorous evidence grounded in well-conceived theoretical models. Only then can we responsibly harness AI's potential to address genuine educational feedback challenges.