Reaching for the Gorgeous: Harvard's Race to Save the "A" in the Age of AI

May 26, 2026

A few days ago, I was ready to resist Harvard’s newly announced grading policy with everything I had. However, after reading the primary documents more carefully and leaning on my own administrative experience in university assessment, my perspective has shifted. I still resist any attempt to find value in letter grades when they are looked at from the perspective of a student, and in an ideal world I would prefer that Harvard reject the machinery of letter grades outright.

What initially looks like a draconian cap on top grades is actually a complex, desperate experiment attempting to save the undergraduate experience from the dual threats of grade inflation and generative AI.

In February 2026, the Subcommittee on Grading, chaired by Stuart Shieber, released A Proposal for Updating Grading Policies. Its central recommendation: cap A grades in each Harvard College undergraduate course at 20% of enrollment plus four, restoring the A to its handbook-defined status as a grade reserved for “extraordinary distinction.” A secondary recommendation replaces grade point average with average percentile rank as the metric for internal honors.

Five months earlier, Dean of Undergraduate Education Amanda Claybaugh’s Re-Centering Academics at Harvard College had laid the diagnostic groundwork. Together, the two documents represent a consequential intervention in Harvard’s grading practices.

In the months since the Shieber report’s release, the Derek Bok Center for Teaching and Learning has become the operational arm of the response. Its public-facing guidance now sits under the banner of “Recentering Academics” with a portfolio of recommendations on assignment design, grading consistency, and the use of generative AI in teaching.

Read together, these documents describe a deep concern with what an undergraduate course at Harvard is. The cap on the letter grade A is the tip of the iceberg. Getting instructors to redesign assignments is now urgent in the trenches, driven simultaneously by the cap’s demand for differentiable work and by AI’s disruption of traditional assessment.

The Bok Center for Teaching and Learning and its growing portfolio of AI-augmented pedagogical recommendations seem to be the leading edge of a reorganization the policy documents don’t have time to describe. Suddenly, there seems to be a sense that something will be irretrievably lost unless institutional integrity makes a show of force.

Twenty Years of Inflation

The numbers are stark and well-documented. All of the numbers I will cite in this essay are drawn from the Claybaugh appendix and were generated by FAS Institutional Research representing the Faculty of Arts and Sciences, the body that houses Harvard College, the Graduate School of Arts and Sciences, and the John A. Paulson School of Engineering and Applied Sciences.

FAS Institutional Research data show A grades at 24% of all grades awarded in 2005, 40.3% in 2015, and 60.2% in 2025. Median GPA at graduation sat flat at 3.67 from 2005-06 through 2016-17, then jumped to 4.00 from 2017-18 onward and stayed there. Grade inflation at Harvard has been continuous, in some form, for over a century, but the late-2010s acceleration was different in kind. By 2020-21, A grades had reached 62.8%. They have not returned below 58%.

Several forces converged to produce what many have called the “qualitative break,” meaning that the quality of the information expressed in a grade has been devalued. The Claybaugh report names these forces with candor.

In 2008, the FAS Faculty voted to formalize student course evaluations, replacing the informal CUE Guide with the standardized Q. Once formalized, Q scores acquired weight in tenure decisions, in non-ladder faculty job prospects, in departmental enrollment competition.

Faculty came to believe that lower grades produced lower Q scores, even though FAS Institutional Research has shown the predictive power of grades on Q scores to be small and that of workload essentially zero. The belief shaped behavior anyway.

Around the same period, administrative messaging shifted. The College began emphasizing that students arrived with varied preparation, that many were struggling with imposter syndrome, that nearly all were stressed. Faculty took these messages to heart and many became more lenient. The pandemic accelerated the trend; the spike of 2020-21 never fully retreated.

Meanwhile, pedagogical fashion moved toward lower-stakes assignments, effort-based grading, creative projects without rigorous rubrics, and ungrading experiments — each defensible on its own terms, each contributing to the compression of grades at the top of the scale.

The result was not merely higher grades but flatter ones. Inflation (rising averages) isn’t the same as compression (narrowing of the range, especially at the top). The Shieber subcommittee’s central diagnosis is about compression.

GPAs pile against the wall of 4.0. Summa cum laude cutoffs trail out to five decimal places. The Sophia Freund Prize, awarded to the summa graduate with the highest GPA, went to 2 students in 2010-11 and 55 in 2024-25 — superlinear growth that reflects not an explosion of excellence but the collapse of the metric’s resolving power. Phi Beta Kappa and other honors committees resort to confidential phone calls to identify the “true stars” hidden inside compressed transcripts.

What Grades Are Supposed To Do

The Claybaugh report organizes its diagnosis around the question of function. Grades have stopped doing the three things they are supposed to do, according to the ethos at Harvard. Motivation: students will do what is needed for an A but no more, then redirect their energies to extracurriculars or additional credentials. Information: grades no longer give students reliable signals about their strengths and weaknesses. Distinction: honors committees and external evaluators can no longer differentiate among Harvard students by transcript alone.

Beyond the failure of grades themselves, the report names broader cultural damage. Students become “terrified of the A-” and choose courses by grading reputation rather than interest. Stress migrates from coursework — where A’s are too easy to be meaningful markers of achievement — into the extracurricular and pre-professional arenas where real distinction is still possible. Academics begin to feel “fake” to students.

One interviewee told the Office of Undergraduate Education, Dean Amanda Claybaugh’s office, that no instructor had ever told her she could do better work. Several mentioned sitting for finals they could have aced on the first day of class. Claybaugh’s report states the obligation directly (p. 13):

“We owe our students a functioning grading system. Specifically, we owe them grades that send clear signals, that give them a good sense of their strengths and weaknesses and that communicate their areas of distinction to employers and admissions committees.”

The report’s most candid moment nails the crux of the problem. Faculty grade as they do because they cannot afford to be outliers when Q scores affect career prospects, when enrollments affect departmental resources, when student advisors plead on behalf of struggling students. One faculty member quoted in the report: “Grading at Harvard is in a race to the bottom. This is a classic game theory problem.”

Anatomy of the A- Policy

The cap allows each course to award A grades to at most 20% of its enrollment, plus four additional A’s. The “+4” is the small-course provision. Without it, a five-person seminar could give only one A (20% of 5 = 1), which would be absurd in a setting that often draws advanced, self-selected students working at high levels.

The percentage component is what does the work in larger classes: a 20-student course is capped at 8 A’s (40%), a 50-student course at 14 A’s (28%), a 100-student course at 24 A’s (24%), a 500-student course at 104 A’s (about 21%). The formula bends generously at the small end and asymptotically approaches a true 20% cap at the large end.
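The arithmetic behind those figures can be checked with a minimal sketch. The function names are hypothetical, and two details are assumptions rather than anything stated in the proposal: a fractional 20% share rounds down, and the cap never exceeds the number of students enrolled.

```python
def a_grade_cap(enrollment: int, rate_percent: int = 20, bonus: int = 4) -> int:
    """Largest number of A grades allowed: 20% of enrollment plus four.

    Assumptions (not from the proposal): a fractional 20% share rounds down,
    and the cap can never exceed the number of students enrolled.
    """
    if enrollment < 0:
        raise ValueError("enrollment must be non-negative")
    # Integer arithmetic avoids floating-point surprises in the 20% share.
    return min(enrollment, enrollment * rate_percent // 100 + bonus)


def effective_a_rate(enrollment: int) -> float:
    """The cap expressed as a percentage of enrollment."""
    return 100.0 * a_grade_cap(enrollment) / enrollment


for size in (5, 20, 50, 100, 500):
    print(f"{size:>3} students: up to {a_grade_cap(size):>3} A's "
          f"({effective_a_rate(size):.1f}%)")
```

Run as written, this reproduces the caps cited above (5, 8, 14, 24, and 104) and shows the effective rate falling from 100% in a five-person seminar toward 20% as enrollment grows.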

Given the actual mix of course sizes at Harvard, the subcommittee estimates that the cap would produce roughly 34% A grades overall, close to the distribution that prevailed in 2011, before the late-2010s acceleration. About 60% of courses already comply with the cap; the policy primarily constrains the other 40%, which tend to be the larger lecture courses where A rates have climbed highest.
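To see why the cap leaves compliant courses alone and bites only where A counts exceed it, here is a toy aggregation. Every number in `HYPOTHETICAL_COURSES` is invented for illustration; it is not Harvard data and is not meant to reproduce the subcommittee’s 34% estimate. The rounding rule in `a_grade_cap` is likewise an assumption.

```python
def a_grade_cap(enrollment: int) -> int:
    """20% of enrollment plus four, rounded down (rounding is an assumption)."""
    return min(enrollment, enrollment * 20 // 100 + 4)


# (enrollment, A grades currently awarded). Invented numbers, not Harvard data.
HYPOTHETICAL_COURSES = [(8, 5), (15, 6), (30, 9), (120, 80), (400, 250)]


def summarize(courses):
    """Compare the overall share of A grades before and after applying the cap."""
    total = sum(n for n, _ in courses)
    a_before = sum(a for _, a in courses)
    # A course already under its cap keeps its current number of A's.
    a_after = sum(min(a, a_grade_cap(n)) for n, a in courses)
    compliant = sum(1 for n, a in courses if a <= a_grade_cap(n))
    return {
        "share_before": a_before / total,
        "share_after": a_after / total,
        "compliant_fraction": compliant / len(courses),
    }


summary = summarize(HYPOTHETICAL_COURSES)
for name, value in summary.items():
    print(f"{name}: {value:.1%}")
```

In this toy mix the three small courses already sit under their caps, so all of the reduction comes from the two large ones, which is the pattern the subcommittee describes.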

The shift from GPA to average percentile rank (APR) for internal honors addresses the compression problem rather than the inflation problem. APR remains meaningful even when grades are compressed. Crucially, the APR is intended for internal use only — calculating honors, prizes, fellowship eligibility — and will not appear on transcripts.
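A small sketch can illustrate why a percentile-based metric keeps resolving power where GPA does not. The source does not specify how the subcommittee computes percentile rank, so this uses a generic mid-rank percentile over letter grades as an assumption; the grade scale, the two course distributions, and the function name are all hypothetical.

```python
from collections import Counter

# Letter grades from lowest to highest (an illustrative scale, assumed here).
GRADE_ORDER = ["E", "D-", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A"]


def percentile_ranks(grades):
    """Mid-rank percentile of each letter grade within one course.

    A grade's percentile is the share of classmates strictly below it
    plus half the share tied with it. This is a generic definition,
    not necessarily the one the subcommittee proposes.
    """
    counts = Counter(grades)
    n = len(grades)
    ranks = {}
    below = 0
    for grade in GRADE_ORDER:
        tied = counts.get(grade, 0)
        if tied:
            ranks[grade] = 100.0 * (below + 0.5 * tied) / n
        below += tied
    return ranks


# Two invented ten-student courses.
lenient = ["A"] * 7 + ["A-"] * 3
strict = ["A"] * 2 + ["A-"] * 4 + ["B+"] * 4

lenient_ranks = percentile_ranks(lenient)
strict_ranks = percentile_ranks(strict)

# Both students earn straight A's, so both have a 4.0 GPA.
student_x = (lenient_ranks["A"] + strict_ranks["A"]) / 2   # one lenient, one strict course
student_y = (lenient_ranks["A"] + lenient_ranks["A"]) / 2  # two lenient courses
print(student_x, student_y)
```

Under these assumptions an A in the lenient course sits at the 65th percentile and an A in the strict course at the 90th, so two students with identical 4.0 GPAs end up with different average percentile ranks.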

The subcommittee’s design is robust against the failures it could identify from prior institutional attempts. It is less robust against failures that emerge from mechanisms it did not make explicit. The most important example is this: the proposal assumes the cap operates on coursework that is itself stable.

But the coursework is simultaneously being redesigned, under pressure from generative AI, through the Bok Center’s parallel initiative. The cap will be applied not to the stable coursework the subcommittee modeled but to transformed coursework whose properties the subcommittee did not anticipate. A Sandoval conjecture map might have surfaced this dependency. The failure-mode analysis the committee drew from engineering did not, because the AI-driven redesign was not yet visible as a failure mode at the time the subcommittee did its work.

There is also a subtler problem: the uninterrogated chain from policy to culture. The Bok Center extends the subcommittee’s argument by claiming that the cap and associated reforms will recenter academics in students’ lives, restore the ‘worth-it’ feel of intellectual work, and repair student relationships to their own learning.

These are large claims about mediating processes, about how a quantitative cap on top grades translates into the qualitative experience of college. None of them is made explicit, tested, or specified with the kind of conjectural rigor that would make them falsifiable.

The cap will produce a different distribution of A’s; that much is mechanically certain. Whether the different distribution produces the cultural transformation the rhetoric promises is, in design-research terms, an unmapped conjecture.

The Shieber subcommittee produced an excellent piece of policy engineering. It is not, by training or methodology, a piece of learning design research. The difference shows up not in what the design does but in what it does not interrogate.

The proposal acknowledges this in passing: “Faculty review their course plans and past grade distributions to ensure that their letter-graded courses will include suitably challenging coursework to make fair distinctions between students. The Bok Center should support the process of course and assessment design when necessary.” The implementation is then handed off.

The Bok Center Takes the Handoff

The phrase “center academics in the lives of students” does a lot of work on the Bok Center’s Recentering Academics hub page. Read carefully, the construction is strange: faculty and administrators will reach into students’ lives and rearrange their priorities by teaching well and grading honestly.

The opening promises that faculty will center academics in students’ lives. A few sentences later, faculty will merely aid students in centering their academic experience among their rich set of activities. The verb softens because the stronger claim is indefensible. Faculty cannot center anything in a student’s life. They can teach well and grade honestly. Whether the student then centers their life around that has always been the student’s choice.

The Center’s operational thesis is clearer: “Central to the call to center academics in the FAS is to ensure that course assignments are challenging and meaningful and support grades that distinguish between work that is satisfactory, good, and excellent.” The mechanism is assignment design, precisely the element that wasn’t considered for inclusion in the grading policy.

A cluster of supporting pages organizes the response. Designing Rigorous Assignments & Exams That Lead to Fair Grades sits at the operational core with guidance on cumulative finals, rubrics, faculty-TF norming, and the principle that “grades are a reflection of mastery of course content and skills, not merely the amount of effort put into an assignment.” Grading provides a fairly traditional menu of ways to bring transparency and predictability to grading: “A student’s grade on an exam or essay shouldn’t depend on which section of the course they happen to be in.”

The Bok Center has already developed resources for faculty who run several sections of a course with teaching assistants. Teaching Teams provides operational guidance for course heads. I also found URLs for pages on interesting topics that require a login, I think because so much of the Bok Center’s work is still in medias res.

What the Recentering Academics page does not explicitly acknowledge is that “recentering academics” frames the work as a restoration when the conditions for restoration no longer exist. The 2011 distribution the cap implicitly targets emerged from certain teaching conditions — pre-AI, pre-pandemic, with different student preparation levels and different extracurricular pressures — that cannot be re-created.

The Center’s recommendations are not restorative. They are innovative. The slogan is either naive or doing concealment work, or the Office of Undergraduate Education isn’t communicating clearly with the Bok Center for Teaching and Learning.

Generative AI as Pretext, Problem, and Tool

The convergence move appears in a single sentence on the Recentering Academics page: “For assignments to provide this distinction, faculty are likely to need to change grading practices. Moreover, particularly given the advent of generative AI, many traditionally effective assignments are likely to need to be redesigned.”

Read closely. The demonstrative “this distinction” pivots on a border between Claybaugh’s “meaningful as well as rigorous” and the grading proposal’s technical sense of differentiation at the toppermost of the toppermost of the scale.

The “Moreover” joins a contested policy rationale (grading) to an uncontested technological one (yes, AI is here), allowing the first to merge with the second. A faculty member who would resist the cap as a constraint on pedagogical autonomy is on much weaker ground resisting the broader claim that AI has changed what assignments can measure.

Once they are redesigning assignments anyway, in response to AI, the cap-supportive changes arrive in the same package. AI is the universal solvent. It dissolves resistance to changes that would, on their own grading-policy merits, generate academic guerrilla warfare.

The Center’s AI guidance compounds the effect. Using AI in Student Assessment describes Harvard-supported AI tools that can audit rubrics, compare student submissions against answer keys, and flag grading inconsistencies. The main Center page describes current projects including “AI-augmented oral exams, ‘vibe coding’ for humanities courses, AI-resilient assignment design, and frameworks for treating AI as an object of critical study—not just a tool.”

AI is the changed condition requiring assignment redesign. AI is also the tool assisting the redesign. AI is then the contaminant students must be prevented from using on the redesigned assessment, or taught to use ethically within it. Faculty adopt the tool in their workflow even as they constrain it in their students’. This seeming incoherence produces a closed loop that the public-facing pages do not acknowledge.

The shift from policing to designing, however, is the move that ties the package together: “Rather than trying to identify or police AI use, ensuring that student work is reliable evidence of their learning requires us to rethink assignment design and assessment.”

The opening clause is a concession dressed as a principle. The reason not to police AI use is, in part, principled — surveillance corrodes trust, detection tools are unreliable, false positives damage students. But the reason it is being stated here as the opening move is that policing has been tried across higher education and has largely failed. The grammar absorbs a defeat and reframes it as a strategic choice.

The locus of intervention then shifts from the student (who might or might not be cheating) to the assignment (which might or might not be designed to elicit reliable evidence). On the surface this is liberating. Faculty can stop being detectives and return to being teachers.

But the shift quietly accepts that students will use AI, that the question is no longer whether but how much and on what, and that the instructor’s job is to design around this fact. The student’s choice has been removed from the frame.

There is also an epistemological concession. “Reliable evidence of their learning” is more modest than it first appears. What faculty used to want from assignments was not evidence of learning but artifacts of learning — papers, projects, work that was itself the learning, that constituted the intellectual activity rather than just demonstrating it.

The Return of Oral Examination

The Center’s guidance is explicit about where the redesign points: “Making at least some of these steps in person without devices (oral topic proposals, in-class outlines, reflective oral or hand-written explanations after submission, follow-up oral exam) offers a better chance of reliable assessment. This might include touchpoints AFTER submission, such as a live interview about the project or an oral defense.”

This is a return to a tradition Anglo-American higher education largely abandoned for undergraduates a century ago. Oral examination persists at the dissertation level, in medical boards, in moot court, in language oral proficiency interviews. The undergraduate course assignment has, since roughly the rise of the modern research university, been overwhelmingly a written artifact submitted asynchronously.

The shift toward writing was deliberate and consequential. Writing democratized assessment: the shy student, the non-native speaker, the slow processor, the student who thinks better with revision than under live questioning, all gained ground when the artifact replaced the vocal cords. Writing also created a permanent record, gradable by multiple readers, contestable, archivable.

The proposed reversal is large. A Harvard lecture course of 200 students cannot administer a five-minute oral defense to each student without consuming nearly 17 hours of examining per major assignment, before scheduling and turnover between students are counted, and in practice that means days of faculty and TF time. The graduate-program ratios implied by the recommendation are not the ratios Harvard currently runs at scale. Operationally, the proposal will collide with infrastructure.
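The scale of that commitment can be estimated with back-of-the-envelope arithmetic. In this sketch only the 200 students and the five-minute defense come from the scenario above; the turnover time and the number of examiners are hypothetical parameters, and the function name is illustrative.

```python
def oral_defense_hours(students: int,
                       minutes_each: float = 5,
                       turnover_minutes: float = 0,
                       examiners: int = 1) -> float:
    """Hours each examiner spends running oral defenses for one assignment.

    turnover_minutes (time lost between students) and examiners are
    hypothetical knobs, not figures from any Harvard document.
    """
    if examiners < 1:
        raise ValueError("at least one examiner is required")
    total_minutes = students * (minutes_each + turnover_minutes)
    return total_minutes / 60 / examiners


# Bare minimum: 200 students, five minutes each, one examiner, no gaps.
print(f"{oral_defense_hours(200):.1f} hours")

# Assumed three minutes of turnover per student, shared among four examiners.
print(f"{oral_defense_hours(200, turnover_minutes=3, examiners=4):.1f} hours each")
```

Even the bare minimum is about 16.7 hours of continuous examining for a single assignment. Spreading it across a teaching team shortens each person’s share but reintroduces the norming problem of keeping several examiners consistent.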

Pedagogically, the shift is also not neutral. Oral assessment privileges a different set of student capacities than written assessment does. The student who is articulate under pressure, who thinks well in real time, who has the social ease to perform before a grader, who comes from a background where verbal sparring with authority is familiar — this student is advantaged.

The student who is brilliant on the page but stumbles when called on, who is anxious in face-to-face evaluation, who is autistic, who stutters, who needs time to formulate — this student is disadvantaged. The cap will identify the top 20%, and under the redesigned regime that population will, disproportionately, be students who are good in the room. It is a different population than the old A measured.

There is also a suspicion built into the encounter. The oral defense is being recommended not because oral examination is intrinsically valuable, though it may be, but because it is AI-resistant. The interview’s epistemological function is forensic: can this student, sitting across from me, actually explain what they wrote? The student arrives knowing this. They are not being invited into a conversation about their ideas; they are being audited to confirm the ideas are theirs.

Consider the hybrid student. They write a paper. They use AI substantively — to clarify their thinking, to push back on a draft, to suggest sources they then read carefully, to polish prose. They understand the argument deeply by the end. They walk into the oral defense and explain the paper fluently, fielding questions, defending choices, extending claims into territory the paper did not cover.

By the standards of the oral defense, they have demonstrated mastery. By the standards of “reliable evidence of learning,” they have produced reliable evidence. But the artifact itself — the paper that will go in their portfolio, that contributed to the writing sample, that they will look back on as something they wrote in college — is a hybrid product whose precise authorship is irrecoverable.

The oral defense has, in effect, ratified the hybrid. AI-assisted writing becomes acceptable as long as the student can defend it live. The integrity of the artifact has been replaced by the integrity of the performance. This potential reality has far-reaching consequences for what counts as gorgeous academic work.

How Far the Logic Runs

The Bok Center’s experimental edge is the place to see what the redesigned undergraduate course is becoming in the age of AI, and from my perspective, this sort of experimentation is sorely needed. Consider this proposal:

“Voice-Cloned Discussion Facilitators: Instructors can create AI voice clones to serve as additional discussion participants or guides. These voice bots could represent different perspectives, historical figures relevant to course content, or even the instructor themselves, allowing students to engage in dialogue with the AI-generated subject positions to develop and articulate their ideas before collaborative activities.”

The student’s intuition that they are speaking with a presence will not match the reality of what they are speaking with, i.e., a contemporary language model, prompted by an instructor’s framing. Students have always read texts that purport to represent historical figures, but a book is plainly a representation; a voice that responds in real time feels like a presence. The pedagogical effect of this mismatch is unknown; it will be interesting to see what effect the new grading policy will have on instructor willingness to experiment in classes.

An instructor creates a voice clone of themselves and offers it to students. The instructor is, presumably, also still teaching the course. The student therefore has two versions of the professor available: the actual one, with office hours and limits and a real relationship, and the clone — available at 2 a.m., infinitely patient, never busy, never disappointed, etc.

Each version has comparative advantages the other cannot match. The actual professor cannot compete on availability. The clone cannot compete on being a person. What is likely to emerge is a split relationship, in which the real professor handles the high-stakes encounters — grading, recommendations, the seminar table — and the clone handles the developmental ones, where ideas get worked out, questions get asked aloud, positions get tried on.

The intellectual companionship that has long been one of the central goods of an undergraduate education is sustained engagement with an actual mind that is teaching you. That companionship would now be bifurcated. The mind one develops with is synthetic. The mind that judges is real. How might that work?

The Center’s broader AI guidance has been organized around the question whose thinking is this? with redesigned assessments preserving the boundary between student and machine even when policing has failed. The voice-cloned facilitator runs the other way. It explicitly introduces synthetic voices into the developmental phase of student thinking.

The student “develops and articulates their ideas” in dialogue with the clone. Whatever they arrive at by the time they write or face the oral defense has been shaped by a synthetic interlocutor. AI in writing the paper is a problem; AI in forming the thoughts that become the paper is a feature. The distinction is fine and possibly defensible, but it is not stable.

Place this composite in front of the cap. A Harvard course, by accretion: students read with AI assistance to summarize and clarify; students develop ideas through dialogue with voice-cloned interlocutors; students draft work possibly with AI assistance; students submit work; students sit for an oral defense in which a human instructor verifies the ideas are recognizably theirs.

The top 20% across these stages receive A’s signifying extraordinary distinction. Each piece can be defended on its own terms, but the composite is genuinely new and is being assembled without overall design.

No one is asking what kind of intellectual life this composite produces in a nineteen-year-old. No one is asking what relationship to one’s own mind a student develops after four years of working out ideas with synthetic voices and then defending them to real ones.

Reaching for the Gorgeous

The Claybaugh report’s most affecting passages are the student interviews. The student who said no instructor had ever told her she could do better work. The student who could ace the final on day one. The students who said academics felt “fake.”

The reform’s deepest aspiration is to make the work real again — real in the sense of producing the recognition that one has done something that matters. Call it gorgeous, for shorthand. The reach for the gorgeous amidst the ordinary and the mundane is what gives the whole package its axiological value.

What gorgeous requires, though, is not just challenging assignments and honest grades. It requires instructors visibly absorbed in the material, sustained intellectual companionship between students and faculty, time for the slow formation of ideas, room for productive failure, the absence of constant evaluation, and a willingness to let a seminar lose its thread because someone has said something interesting.

The mismatch is structural. The reforms restore the distribution but cannot restore the conditions. The cap arrives in a teaching environment shaped by external pressures the policy does not directly address. The Bok Center is doing what it can with the lever it has been given. The lever may not reach what the rhetoric promises.

The 2026 grading proposal is, despite appearances, the most conservative document in the package. It does one thing: it restores meaning to a letter on a transcript. The Bok Center’s AI guidance, taken together, is doing something much more radical, in a much quieter register — reorganizing the substance of coursework, the relationship between student and machine, the relationship between student and teacher, the very object that grades evaluate.

The cap will be applied to students who emerge from a course experience that nobody could possibly have fully designed and that no policy document fully describes. To design it fully would be to follow a teaching script with intense controls on students.

By 2030 or 2032, when the dust settles, Harvard students will know whether the reform delivers what it reaches for: the clarity to see and honor genuinely gorgeous, Grade A quality learning. It is the question the institution will have to revisit when the next subcommittee is appointed.

In the meantime, the reform will have reshaped academic culture in high schools across the country. In many ways, the shakeup is welcome. It’s a delicious irony that a system that has for so long instructed students to sit quietly, take notes, and pass tests now must hear them speak. It would be amusing if it weren’t so tragic that the energy sparking this innovation is grounded in the most self-limiting machinery of scalability, the letter grading system.

Learning to Read, Reading to Learn
