A Critical Analysis: Microsoft’s January 2025 Study of Critical Thinking and Knowledge Workers
“Your Brain on AI: Atrophied and Unprepared” screamed the headline from Forbes. “AI is harming your critical thinking: Microsoft study suggests” hedged The Daily Star. “A new paper from Microsoft Research finds that humans who increasingly rely on generative AI (GAI) in their work use less critical thinking that can deteriorate their cognitive abilities,” reported Mediapost. Forbes, The Daily Star, and Mediapost at least got the study’s warning right, even though they overstated the speculation the Microsoft paper offers on the basis of its findings. I found the paper genuinely interesting for its useful and clear review of prior research on knowledge workers and AI. But I have serious reservations about whether the study says anything about critical thinking. Do read it, but don’t expect a coherent theoretical framework or a defensible methodology; the absence of both makes its central claim underwhelming.
A Humble Critique of the Validity of Interpretations of the Findings
The Microsoft paper purports to explore the impact of generative AI on critical thinking among knowledge workers. The study included 319 participants, with the largest occupational categories (top five) being:
Computer and Mathematical (18.5%)
Arts, Design, Entertainment, Sports, and Media (13.8%)
Office and Administrative Support (11.7%)
Business and Financial Operations (10.8%)
Educational Instruction and Library (7.6%)
It examines retrospective self-reports of AI-integrated tasks and finds complex relationships between confidence, effort, and the enaction of critical thinking. The following excerpt from the study best explains how the data were collected in terms of task dynamics:
From Section 3.3 (Analysis Procedure): "In our survey, participants were asked to share three real examples of their GenAI tool use at work. To increase the variety of examples collected, participants were asked to think of three different examples, one for each task type: Creation, Information, and Advice (see Section 3.1.1). Then, participants were asked to share an example of each task type in detail. The order of task types was randomised to avoid order and fatigue effects. For each example, as mentioned, we measure participants' perceived enaction of critical thinking, perceived effort in critical cognitive activities, and perceived confidence. All participants shared three examples. However, they were allowed to skip any task type they did not have experience of and substitute another task type — e.g., a participant could share two examples about Creation and one example about Advice, if they had no experience of an Information task."
The study measures confidence along multiple dimensions: confidence in self, confidence in AI, and confidence in evaluating AI output. This multifaceted approach suggests that the relationship between confidence, effort, and critical thinking varies depending on whether the confidence is placed in oneself or in the AI tool. The fundamental question remains, however, whether self-reported enaction of critical thinking, effort, and confidence can serve as valid proxies for critical thinking, no matter how carefully these factors are measured.
As described in section 3.1.3 of the paper, the researchers measured effort by asking participants whether using GenAI changed the effort required for six different cognitive activities (based on Bloom's taxonomy). Here's the quote:
"For each task example, participants were asked if, and how much, the use of the GenAI tool changed the effort of critical thinking activities compared to when they did not use the AI tool. We used the five-point scale 'much less effort', 'less effort', 'about the same', 'more effort', to 'much more effort' (which we code as integers ranging between −2 and +2). Participants could choose 'N/A' if they thought that a cognitive activity was not relevant to the task."
A graphic from the study helps explain the research design. Two task factors were populated with data: task type and two kinds of task confidence (in the AI and in the self). Participants were first asked to describe a self-selected task in detail through a free-response question: "Please tell us: 1) what you were trying to achieve, 2) in what GenAI tool, and 3) how you used the GenAI tool, including any prompts." They then categorized their task into a task type from Brachman et al.'s (2024) taxonomy: creation tasks (generating artifacts or ideas), information tasks (searching, learning, summarizing, analyzing), and advice tasks (improving, guidance, validation). Finally, respondents indicated their level of confidence in performing the task on a scale of 1 to 5.
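For readers who think in data structures, here is a minimal sketch of what one collected task example might look like as a record. The field names are my own invention; the task types come from Brachman et al.'s (2024) taxonomy as described above, and the three confidence ratings reflect the dimensions the paper reports (confidence in self, in the AI, and in evaluating the AI's output).

```python
# Hypothetical record layout for one shared task example. Field names are
# assumptions for illustration; the categories and scales follow the study's
# description as summarized above.
from dataclasses import dataclass
from typing import Literal

TaskType = Literal["creation", "information", "advice"]  # Brachman et al. (2024)

@dataclass
class TaskExample:
    description: str          # free-text account of goal, GenAI tool, and prompts used
    task_type: TaskType       # self-categorized by the participant
    confidence_self: int      # 1-5: confidence in one's own ability to do the task
    confidence_ai: int        # 1-5: confidence in the GenAI tool's ability to do the task
    confidence_evaluate: int  # 1-5: confidence in evaluating the AI's output

example = TaskExample(
    description="Drafted a client update email in a GenAI chatbot from a short prompt.",
    task_type="creation",
    confidence_self=4,
    confidence_ai=3,
    confidence_evaluate=4,
)
```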
In this essay, I critique two problems with the study, its theoretical framework and its methodology. I illustrate the limitations of the study’s operational definition of critical thinking and argue that the evidence presented does not adequately speak to the construct of critical thinking as teachers and professors in the academic world understand it; it must instead be interpreted as workplace research. While the study addresses an important workplace issue worthy of investigation, the researchers themselves seem to recognize its exploratory purpose. Headlines shouting proof that AI will bring on an epidemic of cognitive atrophy are premature.
The Critique
Critical thinking is widely recognized in education as a complex reasoning process involving the systematic examination of claims, assumptions, premises, evidence, warrants, backing for warrants, limitations, counterarguments, structural integrity, and overall validity. Although the idea of critical thinking might be stretched to include small “c” acts of self-assessing one’s work, capital “C” thinking for knowledge-building and vital decision-making consists of intense, strategic, deliberate probing and a comprehensive analysis of the information at hand in light of prior knowledge. Learning to think critically is among the most highly valued outcomes of schooling and among the most difficult to assess.
The MS study, however, uses self-reported instances of on-the-job tasks as a logical starting point for studying critical thinking activity. This method may capture moments of metacognitive awareness leading to basic error-correction, but metacognitive activity is heavily involved in knowledge retrieval, comprehension, and application, all lower-order thinking skills on Bloom’s Taxonomy that sometimes require only small “c” thinking. Whether self-checking and self-correcting rise to the level of Critical Thinking depends in part on the expertise of the individual, in part on the task, and in part on the intention. In the mainstream thinking of educators, germane cognition involves schema rehabilitation, expansion, construction, adaptation for a specific purpose, modification, or deletion. Such work indeed calls on deliberate, facile, and adaptive critical thinking.
Here are the study’s research questions as articulated in the document:
RQ1: When and how do knowledge workers perceive the enaction of critical thinking when using GenAI?
RQ2: When and why do knowledge workers perceive increased/decreased effort for critical thinking due to GenAI?
In plain English, here are the research questions as I interpret them.
When do workers stop and think more carefully about what they're doing while using AI tools like ChatGPT, and how do they do it?
When using AI tools, what makes workers feel that they have to think harder or less hard about their work, and why?
The Problem of Ordinary Cognition Versus Robust Critical Thinking
Knowledge workers may indeed rely on AI to handle routine tasks on automatic pilot if they have confidence in AI and re-engage cognitively only when the output appears far afield. In my view, this use of human brain power is metacognitive. Cognitive offloading is a strategy to reduce mental load; it does not necessarily imply that workers are or are not engaging in a deeper analysis regarding the task or any other matter.
On the other hand, if we consider the full breadth of what constitutes critical thinking in non-routine learning, the mere act of thinking harder does not guarantee that a knowledge worker is performing higher-order cognitive tasks. Critical thinking as educators usually define it involves, for instance, a systematic interrogation of the logic behind a claim, the identification of hidden biases, and the integration of multiple sources of evidence to form a coherent judgment. Simply reporting that one stopped to think may reflect a moment of uncertainty or a routine check rather than an in-depth evaluative process. Therefore, using such a self-report measure as a stand-in for critical thinking is, at best, an incomplete and, at worst, a misleading indicator.
The researchers chose Bloom’s Taxonomy to serve as the foundation of their definition of critical thinking. The taxonomy has been critiqued, revised, pitted against competitors, and twisted like a pretzel, yet it remains a familiar anchor for a complex construct.
Notice that there is no discussion of how the taxonomy is a conceptual match for this study. The knowledge worker may have little or a great deal of confidence in the bot, but the output must be read to determine its fitness for service. The job demands rise to the level of comprehension and require expert human knowledge, yet the decision to send or modify the output does not involve probing the assumptions behind the output, looking for warrants for the evidence, or finding backing for those warrants.
Instead, the researchers chose Bloom because the model has stood the test of time. Indeed, so have standardized tests. Bloom has been severely criticized and has withstood the scrutiny; so, for that matter, has the five-paragraph essay. Bloom is simple, unencumbered by complex nuances; so is the Simple View of Reading. That simplicity, the very reason for choosing it, becomes a core flaw of the study.
Methodological Issues
The study's methodological architecture exhibits fatal flaws that render its conclusions about critical thinking not merely questionable, but fundamentally invalid. These deficiencies manifest at multiple levels of research design, data collection, and theoretical grounding.
Primary Data Collection: Binary Reduction of Complex Cognition
The study's primary approach of foregrounding critical thinking for survey respondents, rather than asking a bare question such as "Have you ever done any reflective/critical thinking?", still represents a profound methodological misstep. It commits a category error by treating critical thinking, a complex cognitive process requiring systematic analysis and evaluation, as a binary, self-reportable state, what the researchers describe as "perceiving the enaction of critical thinking." The researchers provide no theoretical justification for how pre-training respondents in Bloom's model could capture the multidimensional nature of critical thinking as it is understood in education, and perhaps as it was understood by the respondents. There is no guarantee that the researchers and the respondents were talking about the same thing.
Measurement Validity: The Effort-Thinking Conflation
The researchers employ 5-point Likert scales to measure perceived effort, proceeding from an unstated and unjustified assumption that perceived effort correlates with critical thinking engagement. This assumption ignores fundamental principles of cognitive psychology wherein expertise often reduces perceived effort while potentially increasing analytical depth. The researchers themselves inadvertently acknowledge this problem, noting that "participants occasionally conflated reduced effort in using GenAI with reduced effort in critical thinking with GenAI." This admission effectively invalidates their core measurement strategy.
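To make the measurement objection concrete: even if the effort codes were paired with an independent measure of analytical depth, the association would have to be demonstrated rather than assumed. Below is a minimal sketch of that check on entirely fabricated numbers, using Spearman's rank correlation because both measures are ordinal; nothing here comes from the paper's data.

```python
# Illustrative only, with made-up numbers: testing (rather than assuming) an
# association between perceived-effort codes (-2..+2) and an independent,
# rubric-scored rating of analytical depth (1-5).
from scipy.stats import spearmanr

perceived_effort = [-2, -1, -1, 0, 0, 1, 1, 2, -2, 0]  # fabricated effort codes
analytical_depth = [3, 2, 4, 3, 5, 4, 2, 5, 1, 3]      # fabricated rubric scores

rho, p_value = spearmanr(perceived_effort, analytical_depth)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
# A weak or null correlation here would be exactly the expertise effect noted
# above: perceived effort can fall while analytical depth stays high.
```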
Sampling and Self-Selection Biases
The methodology suffers from compounded selection biases. First, participants self-select their examples of AI use, introducing narrative bias toward favorable scenarios. Second, as the researchers acknowledge, "Our sample was biased towards younger, more technologically skilled participants who regularly use GenAI tools." This dual-layer sampling bias fundamentally undermines any claims about the broader implications for knowledge work.
Retrospective Self-Reporting: The Memory Problem
The reliance on retrospective self-reporting introduces insurmountable validity problems. Participants must not only accurately recall past cognitive states—a notoriously unreliable process—but must also make post-hoc judgments about whether those states constituted critical thinking. The absence of any real-time observational data or objective measures renders these self-reports effectively unverifiable. A much better approach would have been phenomenological interviews in conjunction with a think-aloud protocol.
Theoretical Framework: The Missing Bridge
Perhaps most problematically, the study lacks a coherent theoretical framework linking its measures to its conclusions. The researchers fail to articulate how self-reported reductions in perceived effort would lead to a deterioration of critical thinking skills. This theoretical gap renders their conclusions about skill deterioration not merely unmeasurable by their own methodology but unsupportable.
The cumulative effect of these methodological deficiencies is not just to weaken the study's conclusions but to render them epistemologically invalid. A methodologically sound investigation of AI's impact on critical thinking in the workplace would require, at minimum:
Clear operational definitions of critical thinking in the workplace, whether treated as a generic or a culture-bound phenomenon, grounded in learning science
Real-time observational data of cognitive processes
Objective measures of analytical depth
Controlled comparisons of similar tasks with and without AI (a sketch of such a comparison follows this list)
Longitudinal measures of skill development or atrophy
Theoretical mechanisms linking observed phenomena to cognitive outcomes
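As a concrete illustration of the controlled comparison item above, here is a minimal sketch, on entirely fabricated rubric scores, of a paired with/without-AI analysis. The 1-to-5 rubric, the sample of ten workers, and the variable names are assumptions for the example, not anything drawn from the Microsoft study.

```python
# Illustrative only, with made-up numbers: a within-subjects comparison of
# rubric-scored analytical depth for the same workers on matched tasks done
# with and without a GenAI tool. Pairing controls for individual expertise.
from scipy.stats import wilcoxon

depth_without_ai = [4, 3, 5, 4, 2, 5, 3, 5, 4, 3]  # fabricated rubric scores (1-5)
depth_with_ai    = [3, 2, 4, 5, 1, 3, 2, 4, 3, 2]  # fabricated rubric scores (1-5)

# The Wilcoxon signed-rank test suits paired ordinal scores
stat, p_value = wilcoxon(depth_without_ai, depth_with_ai)
print(f"W = {stat:.1f}, p = {p_value:.3f}")
# Only a design of this kind, not retrospective self-report, could show whether
# analytical depth actually changes when GenAI enters the workflow.
```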
In the absence of such methodological rigor, the study's conclusions about critical thinking and its potential deterioration must be regarded as speculative rather than empirical, representing hypotheses for future research rather than supported findings.
Analogies Illustrating the Conceptual Flaw
Two analogies help reveal this flaw:
The MS study is like measuring a chef’s creative genius by asking the chef to indicate when and why they taste their sauce and how much effort the tasting took. Tasting is essential, but it does not capture the analytical depth involved in inventing an entirely new dish. Similarly, the MS study is like evaluating a qualitative researcher’s discourse-analytic coding skills by asking them to self-report a time when they assigned a unit of discourse to a classificatory code. Such work is routine, involving comparisons rather than a deliberate, rigorous process of interrogation. Carrying the argument to its end, both chefs and discourse analysts would appear, by the study’s logic, to risk diminishing their critical thinking ability by doing work that is routine for them but complex for non-experts. Said differently, each analogy underscores that tasks that would be highly complex for a non-expert but routine for an expert are not reliable indicators of critical thinking. An individual enacting critical thinking does so for a reason, not simply to complete a routine job.
Given these concerns, the study’s title “The Impact of Generative AI on Critical Thinking” is misleading. Operationalizing critical thinking as “moments when the worker perceives more effort” risks leading a reader to conflate basic metacognitive behavior with the broader, more demanding construct of critical thinking. A more accurate title might have been “A Study of Metacognitive Behavior in Knowledge Workers Using Generative AI,” which would better reflect the nature of the self-reported measures. The title, as it stands, overstates the evidence provided by the data and implies a level of analytical depth that the measures do not support.
The Study's Contributions Despite Methodological Limitations
While this critique has focused on the conceptual flaws in measuring critical thinking and a weak theoretical framework, it is important to acknowledge the study's valuable contributions to our understanding of human-AI interaction in knowledge work. The research provides important insights into how workers monitor their own cognitive processes when using AI tools, even if these insights fall short of capturing critical thinking in its fullest sense. The findings about how confidence and trust in AI affect workplace behaviors are particularly valuable, especially for educators, as they help us understand the psychological dynamics of human-AI collaboration.
Furthermore, we must acknowledge the considerable challenges in measuring critical thinking in real-world workplace contexts. The complex, multifaceted nature of critical thinking makes it resistant to straightforward empirical measurement, particularly in naturalistic settings. While self-reports of memorable tasks are an inadequate proxy, designing a study that could capture the full depth of critical thinking while maintaining ecological validity would be extremely challenging. Such a study might require a combination of:
Think-aloud protocols to capture reasoning processes
Analysis of work products to evaluate the quality of critical analysis
Observational data to document decision-making processes
Structured interviews to understand the depth of analytical thinking
Longitudinal tracking to assess changes in analytical capabilities
The study also makes a meaningful contribution by highlighting the phenomenon of cognitive offloading in AI-assisted work. While this may not constitute critical thinking per se, understanding how workers delegate cognitive tasks to AI and when they choose to re-engage their own faculties is crucial for designing AI systems that support rather than suppress human cognitive capabilities. The study's findings about changes in cognitive effort and confidence provide valuable insights for workplace AI implementation, even if they don't fully capture critical thinking.
Conclusion
The study’s central flaw lies in its reliance on self-reported data as proxies for critical thinking. While such measures may capture aspects of metacognitive monitoring or cognitive offloading, they do not fully encapsulate the rigorous, evaluative, and synthetic processes that define true critical thinking. As a result, the study’s title and its conclusions regarding the impact of generative AI on critical thinking are overstated. To accurately assess the influence of AI on critical thinking, future research must employ more nuanced and comprehensive measures, ones that can differentiate between basic error-checking and the deep, reflective analysis that constitutes genuine critical thought.