Beyond the One and Done: When Old Hearts Become Young Again
When a statistic is useless to clinicians but perfect for a headline, it’s not a finding from analysis, it’s a gift to a narcissistic President or a threat to a school district superintendent.
When I looked into the White House claim that Trump has the cardiovascular age of a man 14 years younger, I found that some physicians questioned or mocked it, while others noted that AI-ECG heart-age research is a real and active field rather than obvious nonsense.
In his memorandum following Trump’s May 26 examination at Walter Reed, physician Sean Barbabella reported that an AI-enhanced electrocardiogram (ECG) estimated the president’s cardiac age — which Dr. Barbabella described as “an established measure of cardiovascular vitality” — to be roughly 14 years younger than his chronological age of 79.
An AI-ECG works by feeding the standard tracing of the heart’s electrical activity into a neural network trained on hundreds of thousands of ECGs paired with patients’ real ages. The model learns which features of the waveform tend to track with age, then estimates an age from a new tracing.
The gap between that estimate and a person’s actual age is offered as a rough index of cardiovascular condition. A heart’s tracing of electrical activity that “reads” younger than the calendar is good news.
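For readers who think better in code, here is a minimal sketch of the idea, not the method any of these research groups used. A one-feature straight-line fit stands in for the neural network, and the feature values, ages, and function names are all made up for illustration.

```python
# Toy stand-in for an AI-ECG age model. Real systems feed the raw 12-lead
# signal to a neural network; here a single invented waveform feature and a
# straight-line fit play that role. All numbers are hypothetical.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# "Training" pairs: one waveform feature per tracing, plus the patient's real age.
train_feature = [0.9, 1.4, 2.1, 2.6, 3.2, 3.9]
train_age = [25, 35, 45, 55, 65, 75]
slope, intercept = fit_line(train_feature, train_age)

def estimate_age(feature):
    """Age the toy model reads off a new tracing's feature."""
    return slope * feature + intercept

def age_gap(feature, chronological_age):
    """Estimated age minus calendar age; a negative gap 'reads younger'."""
    return estimate_age(feature) - chronological_age

print(age_gap(3.2, 70))  # negative: the tracing reads younger than the calendar
print(age_gap(3.9, 45))  # positive: the tracing reads older than the calendar
```

The sketch shows only the mechanics: the model learns an age from patterns in other people's tracings, and the gap is a subtraction. Whether that subtraction says anything about health is a separate question.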
Or is it?
One Cardiologist to Another
Enter Dr. Jonathan Reiner, the cardiologist who appeared on CNN’s Laura Coates Live and reported that he had taken the AI finding to his colleagues and that the reaction was unanimous. “When I discuss this with some of my colleagues in cardiology, everyone laughed,” he said.
It reminds me of how the twelve reading specialists on the district committee I worked with during Whole Language laughed when standardized test scores came back to the school. They laughed the way you laugh when someone reads a horoscope and seems to take it seriously.
On CNN, Reiner didn’t stop at the chuckle. He pointed out that this AI-enhanced diagnosis is “not a clinically used tool,” and that the science amounts to “one paper on this technology, so that’s not really a way to gauge cardiac health.”
He’s slightly understating the literature — there are several reputable papers asking good questions — but his clinical instinct squares with what even that one paper says if you read its limitations section, at least as I, admittedly a layperson, read the clear caveats. The number, while potentially interesting in some future, well-defined setting, is not established as a routine clinical decision tool about the patient in the room.
For the past decade, the American public has been led by the nose by political handlers when it comes to straightforward answers about how well old men can handle the pressure of the Oval Office. There is something miraculous about living into the golden years, but being qualified to serve as President may not be among those miracles.
Looking More Deeply into the Literature
The early studies Reiner waved off are real, but young. The story starts in 2019 with Attia and colleagues at the Mayo Clinic, who showed that a neural network could indeed read a person’s age and sex off a standard 12-lead ECG. This excerpt illustrates the respect for science these early researchers had:
“We hypothesized that a convolutional neural network (CNN) could be trained through a process called deep learning to predict a person’s age and self-reported sex using only 12-lead ECG signals. We further hypothesized that discrepancies between CNN-predicted age and chronological age may serve as a physiological measure of health.”
The hoped-for payoff was never the party trick of guessing one’s age. It was the gap between real and predicted age. If the network pegged your heart older than your birth certificate, maybe that discrepancy was capturing something real about the wear on your heart, and we would have a cheap, noninvasive early readout of biological aging tucked inside a common medical test. But what did the gap mean?
Interest in creating a predicted heart age strategy had surfaced earlier, even before neural models capable of creative mathematics arrived. In 2014, Ball et al. developed a multilinear regression model that used ECG and other outputs such as body mass index to estimate the cardiac ages, or “heart ages,” of healthy individuals. Note that the outcome was aspirational, not implementational. Here is an excerpt from the 2014 paper:
“Such estimates might be useful to both physicians and patients for better encouraging lifestyle changes that may be beneficial for cardiovascular health.”
Two studies followed Attia et al. (2019), asking different questions. Lima and colleagues (2021), in Nature Communications, asked whether an AI-estimated ECG age could predict death from any cause, across roughly 1.5 million ECGs. Does a person whose heart reads older die sooner, of anything? Yes, they found; at the population scale, people whose hearts the AI tagged as older than their years died earlier.
Why? Anything doctors can do about it? The answer remains blowing in the wind.
The most interesting finding from the perspective of an educator involved a spin-off study. To assess whether ECG-age was capturing ECG changes recognizable to humans, Lima et al. conducted an experiment using pairs of strictly normal ECGs. They asked three experienced medical doctors to examine these healthy-looking tracings and identify which one the AI had flagged as having an older heart or a younger heart.
The goal was to determine how well the human doctors could “see” in the ECGs what the AI “saw”—how good they were at predicting the output from the AI. Within each pair of equal chronological age and sex, one individual had an AI-ECG-age more than 8 years greater than their chronological age and the other had an AI-ECG-age more than 8 years smaller than their chronological age.
The experiment was divided into three stages where doctors annotated 44, 45, and 45 pairs of ECG tracings respectively. In stages 1 and 3, doctors were not given the answer after accomplishing the task, but in stage 2 they were. They guessed in 1, guessed and got feedback in 2, and guessed the AI’s finding in 3 to see if they had learned.
Here is the rationale:
“Since the maintenance of a normal ECG status over time is associated with a low risk of cardiovascular diseases in a dose-response relationship, we hypothesize that the [AI] might be able to identify subtle abnormalities that are not being currently identified in traditional analysis. This could help justify the capacity of evaluating the risk even for apparently normal ECGs.”
Here is the finding:
“Analyzing doctor’s assessments of 134 pairs of traces, aggregated through majority voting, we found that they were not significantly better than random. … [D]octors were given feedback about their predictions (in Stage 2), [but] this did not increase their accuracy in the subsequent stage. In fact, they performed worse in Stage 3 (accuracy = 45.5%), after the feedback, than in Stage 1 (accuracy = 64.4%), before the feedback, or in Stage 2 (accuracy = 62.2%), during the feedback.”
It’s apparent that something distinguishes the human reading of ECGs from the AI readings, but it isn’t clear from the research that the AI reading is necessarily superior to the doctor’s reading. The research hypothesizes that the AI notices aspects of the tracing more deeply or comprehensively than the doctors, but the experiment, at least as I read it, doesn’t offer that purchase. The researchers frame the finding as a “lack” on the part of the doctors:
“The lack of capability of trained doctors to distinguish between pairs of normal ECGs of the same age but different ECG-age also supports this hypothesis.”
The doctors’ failure to predict AI-ECG age is treated as confirmation, as evidence that the machine sees more than humans can see. But the same result is equally consistent with the opposite: that the signal corresponds to no cardiac abnormality a clinician would recognize because it isn’t one. The data underdetermine which story is true. The authors pick the flattering one and call the null result “support.”
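To make “not significantly better than random” concrete, here is a small sketch of the usual reasoning: ask how often pure coin-flipping would do at least as well. I am not claiming this is the test Lima et al. ran, and the counts in the usage lines are hypothetical, not taken from the paper. The function name is mine.

```python
from math import comb

def p_at_least(correct, total, chance=0.5):
    """Probability of getting `correct` or more pairs right out of `total`
    by guessing alone (upper tail of a binomial distribution)."""
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )

# Hypothetical counts, for illustration only:
print(p_at_least(55, 100))  # a 55% hit rate is easy to reach by luck
print(p_at_least(65, 100))  # a 65% hit rate is hard to reach by luck
```

A result like the first line is what a null finding looks like: the doctors’ guesses cannot be told apart from coin flips. By itself, that says nothing about which side, human or machine, is reading the heart correctly.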
Hirota and colleagues (2023) asked the narrower, clinical question. Where Lima et al. wondered about AI-ECG estimates of heart age as a predictor of death from any cause, Hirota et al. asked whether the gap predicts real-time cardiovascular events in a specific patient.
Hirota and colleagues conducted a single-center study based in one cardiovascular hospital in Tokyo, with no external validation, atrial fibrillation cases excluded, and a mean follow-up of about fifteen months. The authors are candid that the over-60 findings “…should be re-evaluated in different cohorts, such as multi-center cohorts or the general population.”
But the design is fundamentally different in a very important way. Rather than asking whether an AI-ECG age-gap correlates with eventual death, it generated AI-ECG predictions and then tracked whether actual cardiovascular events followed: heart failure, acute coronary syndrome, stroke, cardiac death. It tested the prediction against history. And in part, the numbers worked. The model showed real predictive value, evidence the machine is noticing something of empirical worth, not reading tea leaves.
That value, however, was confined to one portion of the population. In patients under 60 years old, AI-predicted heart age outperformed chronological age at forecasting events: an AUC [1] of 0.700 against 0.642, a difference that held up statistically. In the under-60 group, risk rose with the gap between the heart’s estimated age and the calendar. As the predicted heart age climbed above the real one, annual cardiovascular event rates rose from 0.98% to 1.52% to 2.66%.
The authors read this as the gap “…presumably representing the progression of atherosclerotic change.” So far, the metric holds: a younger-reading heart in a 45-year-old plausibly is a healthier one. The trouble begins when you carry it across the age line.
Cross into the over-60 group and the finding doesn’t just fade; it actually flips.
“In patients aged > 60 years, AI-predicted age was not predictive for cardiovascular events.”
Worse for the over-60 crowd, the association for heart failure and valvular disease turned U-shaped, meaning an ECG reading younger than chronological age now tracked with more adverse incidents. The authors explain why, and the explanation needs to be considered when making up your own mind about the White House physician’s rosy take on Trump’s AI-ECG score.
According to the researchers, a strained, overloaded heart in an old person produces large-amplitude waveforms in the left-side leads, and those are the same tall waves that mark a healthy young heart. So “the ECG characteristics with disease burden may be mistakenly regarded as that with young age in the age-prediction model….”
They emphasize this complication: “These points would be the critical limitations of AI-predicted age especially when applied to patients with older age.”
Read that against the White House Presidential health memo. In a younger patient, “14 years younger” might plausibly suggest better cardiovascular status; in an older patient, the literature is more complicated, and in some cases a younger AI-ECG age can reflect confounding features of a troubled heart rather than straightforward health. The White House, therefore, highlighted a number whose meaning in Trump’s age group is uncertain, not a clean clinical victory lap.
What This All Means for Learning to Write, Writing to Learn, and AI Diagnosis of Learning Capacity
The thread running from the White House memo to the school district report is spun from the same act of misdirection. In both, a number is lifted out of the murky, qualified research that produced it, scrubbed of its caveats and limitations, and set in a headline where it gets the last word about something that calls for a long conversation.
“The heart of a 65-year-old in the body of a 79-year-old.”
“District writing proficiency: 3.2.”
Each is a real measurement of something. Neither methodology measures the individual thing it’s presented as measuring, and both work only because the constructed number looks like the end of an argument rather than the start of one.
For the narcissist in the Oval Office, the prop flatters. For the superintendent in the district office, the average threatens — a number that can close a program, rank a school, or follow a teacher into a review. The deeper damage is distributed. For good or ill, the public learns to mistake the readout for the reality and stops wondering what the writer in seat three is actually capable of thinking and writing.
I have little doubt that AI will play a productive, if limited, role in both education and medicine as a data-analytic tool. A number, understood, can fortify a policy judgment or a procurement of materials and tools; it cannot warrant a decision about a person. A population distribution can tell you what is likely at different points along a scale. It cannot tell you what is true of one person who was never actually present in that distribution.
[1] AUC stands for “area under the curve,” the curve in question being the receiver operating characteristic (ROC) curve. Imagine you reach into the data and pull out two patients at random: one who went on to have a cardiovascular event and one who didn’t. The AUC is the probability that the model assigned the higher risk score to the one who actually had the event. That’s it: the model’s batting average at putting the right person on top across every possible pairing.
In practice the scale runs from 0.5 to 1.0 (a score below 0.5 would mean the model ranks pairs wrong more often than right). An AUC of 0.5 is pure chance: a coin flip, no discriminating power at all. An AUC of 1.0 is perfect: the model never gets a pair wrong. Roughly: 0.6 is weak, 0.7 is fair-to-modest, 0.8 is good, 0.9 is excellent.
So in Hirota, the under-60 numbers — 0.700 for the AI age versus 0.642 for plain chronological age — say the AI was right about 70% of the time when ranking a pair, the calendar figure about 64%. Both are mediocre in absolute terms; neither is anywhere near a precise individual test. But the AI’s edge over the calendar was real, not noise. That’s what “held up statistically” means: the gap between 0.700 and 0.642 was unlikely to be a fluke of this particular sample (their p-value was 0.003, comfortably below the 0.05 threshold researchers treat as significant).
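The pairwise reading of AUC can be checked by brute force. The sketch below uses invented risk scores, not Hirota’s data; the function name and the handling of ties (a tie counts as half a win, a common convention) are my assumptions.

```python
from itertools import product

def pairwise_auc(scores_with_event, scores_without_event):
    """Share of (event, no-event) pairs in which the patient who had the
    event received the higher score. Ties count as half a win."""
    wins = 0.0
    for s_event, s_none in product(scores_with_event, scores_without_event):
        if s_event > s_none:
            wins += 1.0
        elif s_event == s_none:
            wins += 0.5
    return wins / (len(scores_with_event) * len(scores_without_event))

# Hypothetical model scores for seven imaginary patients.
had_event = [0.8, 0.6, 0.4]
no_event = [0.7, 0.3, 0.2, 0.1]

# 10 of the 12 possible pairs are ranked correctly, so this prints about 0.833.
print(pairwise_auc(had_event, no_event))
```

Notice what the number does and does not say. It summarizes how the model ranks pairs across a whole group; it gives no verdict on any single patient in that group.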
