Multiple Choice Items: Misunderstood by Winners and Losers Alike
Which of the following best describes the origin of the multiple-choice test format?
a) It was developed in the early 1900s to efficiently assess large numbers of students
b) It emerged from Chinese civil service examinations during the Tang Dynasty
c) It was created in the 1800s as an educational parlor game
d) It originated in religious institutions to test scriptural knowledge
Before reading further, select your answer and justify your choice for yourself. If you can’t justify it, how can you be sure you’re right? If you are wrong, how will you find that out? Does any of that matter in the testing situation? What are you doing inside those options?
The correct answer is (a). Multiple-choice testing was developed in the early 1900s—specifically 1914-1915—for efficient large-scale assessment. If the test item is “good” by industry standards, a minority of you should have gotten it right, and the rest of you should be spread roughly evenly among the distractors. Of course, after instruction, readministering the item should increase the number of right answers.
The Uncomfortable Truth
Multiple-choice questions are designed to create winners and losers. That’s not a flaw. The purpose is to efficiently and reliably separate those who meet a threshold of knowledge from those who don’t. To sort. To rank. To decide who passes and who fails.
For decades, we’ve asked MC items to do more and more—to assess complex thinking, creativity, authentic application. There is no governmental authority regulating the construction of MC items, and experts disagree about the kinds of thinking the format can evoke. But the research makes one thing clear: MC items excel at measuring recognition and recall. The moment we ask them to measure more, we introduce error variance unrelated to actual knowledge, and regulation or not, error variance is a problem in measurement.
The RICA Case: When Sorting Creates Problems
California’s Reading Instruction Competence Assessment (RICA) was implemented in 1998 as a certification requirement for elementary teachers. We couldn’t have teachers who didn’t know how to teach phonics, who continued to believe in the three-cueing model that had become prominent.
For over two decades, it functioned as designed—sorting teacher candidates into those who passed and those who didn’t. I remember it well because I was teaching elementary reading methods courses designed to prepare credential candidates to pass it and get into their own classrooms. It was another period when I was grateful for the descriptive linguistics courses I took as an undergraduate and graduate student.
Between 2017 and 2021, more than 40% of teacher candidates failed the test the first time they took it, and Black and Latino candidates had lower passing rates overall than their white and Asian peers. The state wanted good reading teachers, but it also wanted more teachers from underrepresented groups in its schools, including linguistic-minority teachers, and the test systematically created barriers to that goal.
Critics of the test, who worked to change it, said RICA “does not align with current state English language arts standards, is racially biased and has added to the state’s teacher shortage.” Of course, just because a measurement shows differential results across racial groups, racial bias cannot be assumed. (See “Soon-to-be retired California reading instruction test gets high marks in national analysis” for the rest of the story.)
California responded by adopting new Literacy Teaching Performance Expectations in November 2019, explicitly addressing the knowledge, skills, and abilities necessary for literacy development, including reading instruction. These TPEs required all preliminary teacher preparation programs to restructure by 2021–22.
The shift moved from a single standardized test measuring phonics knowledge to performance-based assessment of actual teaching practice. This represented a real change in accountability. Rather than sorting candidates largely by MC test scores, the new system evaluated whether teachers could effectively implement literacy instruction with diverse learners in authentic classroom contexts, theoretically removing racial bias while maintaining instructional quality standards.
The lesson? How reliably MC tests sort is determined by the empirical quality of their distractors: do they fool enough of the people enough of the time to produce an acceptable level of statistical variance within tolerances for bias? But when sorting outcomes conflict with institutional values, the threshold measurement gets adjusted—not because the MC test malfunctioned, but because it functioned exactly as designed.
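To make “empirical quality of distractors” concrete, here is a minimal sketch of a classic item analysis: item difficulty, the spread of wrong answers across distractors, and a point-biserial discrimination index. The response data are hypothetical, and this is an illustration of the arithmetic, not a production psychometrics tool.

```python
# Minimal item-analysis sketch (hypothetical data, not a production psychometrics tool).
from collections import Counter
from statistics import mean, pstdev

# Each row: the option a test-taker chose on one item, plus that person's total test score.
responses = [
    ("a", 38), ("a", 35), ("b", 22), ("c", 21), ("a", 33),
    ("d", 19), ("b", 24), ("a", 31), ("c", 18), ("a", 29),
]
KEY = "a"  # the keyed (correct) option

choices = [c for c, _ in responses]
scores = [s for _, s in responses]
correct = [1 if c == KEY else 0 for c in choices]

# Item difficulty (p-value): proportion answering correctly.
p = mean(correct)

# Distractor distribution: how the wrong answers spread across the options.
distractor_counts = Counter(c for c in choices if c != KEY)

# Point-biserial discrimination: do higher scorers choose the key more often?
mean_correct = mean(s for s, c in zip(scores, correct) if c == 1)
mean_incorrect = mean(s for s, c in zip(scores, correct) if c == 0)
sd_all = pstdev(scores)
r_pb = (mean_correct - mean_incorrect) / sd_all * (p * (1 - p)) ** 0.5

print(f"difficulty p = {p:.2f}")
print(f"distractor counts = {dict(distractor_counts)}")
print(f"point-biserial discrimination = {r_pb:.2f}")
```

An item with a difficulty near the target range, distractors that each attract some takers, and a healthy positive discrimination index is “good” in the industry’s terms; whether that sorting serves the institution’s values is a separate question.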
The Irony: Using MC to Assess Reading and Writing
There’s profound irony in using MC questions to assess reading comprehension and writing ability, skills fundamentally about creating and interpreting complex, open-ended meaning.
The Writing Absurdity
Van Binsbergen’s pioneering 1972 study of MC tests in humanities and social sciences found they proved relevant to seven of eight educational aims in undergraduate sociology—“the evident exception being: students’ writing skill.” Though one could quibble with the reasoning that includes MC items in the appropriate toolkit for assessing higher-order thinking apart from comprehension and composition of text, the exclusion of writing is noteworthy because it came at the dawn of scaled interest in holistic writing assessment.
This issue isn’t about teacher certification. In California, a direct writing assessment using on-demand writing prompts and rubrics, the CBEST, showed up in the 1980s to test teacher writing ability as a prerequisite for entrance into a credential program. I had to take and pass it in 1985 to be credentialed to teach fourth grade.
This issue is about decades of using MC tests of editing skills to represent measures of student writing—in K-12 assessments and college placement exams, standardized MC tests were the norm. Students who could identify comma splices in a list might be unable to write clear sentences themselves, but that wasn’t the question. Item writers knew exactly what they were measuring.
We’ve measured knowledge about writing instead of writing ability itself, just as we do with history. Our system is interested in the capacity to recite the factual narratives as we’ve recorded them in textbooks. There’s been little interest in measuring the capacity to apply the historical imagination to significant historical formations like, say, the American Civil War.
The Reading Paradox
MC reading comprehension reverses how reading functions. In these tests, the items focus not on knowledge about reading—we don’t ask sixth graders to pick a good definition of a phoneme or a syllable. We ask them to pick out of a lineup what we *think* students should *notice* about a predetermined meaning—the “facts” of the text. This point is tricky, subtle. Yes, there are brute facts of a text. Yes, MC items can determine whether a reader controls those facts. Yes, comprehension fails when such facts are missed.
But whether this sort of reading—surveilled reading—reflects authentic reading is a separate question. What does a reader do when the text doesn’t make sense? We gain no insight from MC items.
Without belaboring the point, real reading involves generating interpretations, not selecting from predetermined options; constructing meaning actively from prior knowledge and experience, not recognizing who said or did what, when, or how this caused that; tolerating ambiguity and multiple valid interpretations, not searching for ways to stop ambiguity in its tracks.
There is a world of philosophy, empirical research, personal narrative, and the like leaning into these murky aspects of reading. That we can’t yet see clearly in this arena is no excuse for taking refuge in MC items. AI is going to bring change in this regard.
When we ask “select the main idea from these four options,” we measure whether students can identify what test-makers consider important, not whether students can determine importance themselves. This is reading recognition of what the authority says the text means, not reading comprehension of the text the reader produced in any authentic sense. Even if the authority is obviously right, the approach distorts what it sets out to measure beyond recognition.
AI-Generated MC Items: Promise and Peril
Recent research reveals that large language models can generate MC questions with varying but sometimes impressive quality. This development means that educational administrators and policy makers can put standardized test machinery in high gear every day, everywhere. It also means that surveillance of learning and knowing has more potential than ever to shape what is learned and known to the liking of the authority.
It is now possible for AI to “listen” to the talk in a classroom session—lectures, discussions, and so on—access any texts students were assigned to read for that session, and develop an MC test on the spot to serve as an exit ticket for the day.
A 2023 multinational study comparing ChatGPT-generated MCQs with those written by experienced university professors for medical graduate exams found no significant overall quality difference. ChatGPT generated 50 questions in 20 minutes 25 seconds (about 24.5 seconds per question), while the human professors required 211 minutes 33 seconds—making ChatGPT roughly 10 times faster.
Five independent international assessors rated the questions generated by experienced item writers and by GPT across multiple domains. Criteria for “good” MC items are not formally standardized anywhere, but experts who write these items for a living practice according to industry standards. One standard, for example, is relevance: each option in a “good” item must be relevant to the question being asked. Makes sense, eh? Only one domain showed a significant difference: AI scored lower than human item writers on relevance (7.56 vs. 7.88 on a 10-point scale). Also makes sense.
Enlightened Use: Formative Assessment Without Stakes
This development also creates an opportunity for enlightened MC use: AI-generated questions for formative assessment with immediate feedback, not high-stakes evaluation.
Consider this approach (a rough sketch of the generation step follows the list):
Generate a quiz on recent material
Administer it without grade consequences
Students score their own work with an immediate answer key
Pair discussion of missed questions
Whole-class exploration of common misconceptions
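What might the generation step look like? Here is a minimal sketch using the OpenAI Python client; the model name, the prompt wording, and the idea of pasting in the day’s class notes are illustrative assumptions, not an endorsement of a particular vendor or a finished tool.

```python
# Minimal sketch of generating a low-stakes formative quiz from class material.
# Assumes the `openai` Python package and an OPENAI_API_KEY in the environment;
# the model name and prompt wording are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_exit_ticket(class_notes: str, num_items: int = 5) -> str:
    """Ask the model for recall-level MC items plus an answer key."""
    prompt = (
        f"Write {num_items} multiple-choice questions (four options each, one keyed "
        "answer, plausible distractors) that check recall of the material below. "
        "Target recognition and recall only. End with an answer key.\n\n"
        f"MATERIAL:\n{class_notes}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice; any capable model would do
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    notes = "Today we covered item difficulty, distractors, and why stakes distort measurement."
    print(generate_exit_ticket(notes))
    # The output is a draft: a human still reviews items before students see them.
```

The point of the sketch is the workflow, not the vendor: generate quickly, review by hand, administer without stakes, and let the discussion do the teaching.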
This transforms MC items from measurement tools into learning tools, serving diagnostic purposes without the distorting effects of stakes and surveillance.
Why Stakes Undermine Learning
When MC quizzes count toward grades, distortions occur: test anxiety replaces engagement; surveillance effects teach compliance rather than curiosity; gaming incentives promote test-taking strategies over genuine learning; reduced risk-taking discourages intellectual growth; external motivation displaces intrinsic interest.
MC pop quizzes that count for a grade function as attendance enforcement, compliance tools, behavior management, and engines of grade inflation—none of which aligns with assessment’s legitimate purpose: providing information about learning to guide instruction.
MC items are measurement tools. Using them for surveillance and compliance degrades both their measurement function and the learning environment.
Critical Limitations Remain
AI-generated items still measure only recognition and recall. They can’t assess performance skills, creativity, or application. They may contain errors requiring human review. And they carry new risks: a) hallucination of plausible but incorrect content and b) bias amplification from training data. Over-reliance on these instruments without adequate review is risky. I suppose human reviewers may be susceptible to the same flaws.
The Verdict: Honest About What We Measure
Multiple-choice items have narrow, specific, defensible uses: measuring recognition and recall to reliably sort test-takers against thresholds. This purpose is limited but honest.
Problems arise when we use MC items beyond their scope—for high-stakes decisions about complex competencies, as substitutes for authentic assessment, or to pretend scores represent deep understanding rather than recognition.
An instructor using AI to generate daily low-stakes quizzes with immediate answers, partner discussions, and misconception exploration—that’s a defensible use of a limited tool.
An administrator using AI to generate high-stakes exit exams determining graduation, claiming this efficiently assesses complex competencies—that’s technological perpetuation of assessment failure.
The tool itself is neutral, whether human- or AI-written. I worked for years with a professor of criminal justice who set out to convince me that he alone was skilled enough as an MC item writer to assess critical thinking. Back then, I was almost convinced the task was impossible. I’m not so sure any more with AI in the picture. We’d meet for lunch sometimes in the student union to scrutinize his items and have good, hearty laughs when the item format forced us down absurd alleys.
MC items have clear quality-control and application metrics, and there are good use cases. When the goal is to verify a raw knowledge base at a fine-grained level, probing the knowledge every expert needs in order to practice, and when the items are good, MC items can inform decisions about whether a person should be licensed to perform surgery on an anesthetized patient or to represent an accused person facing the death penalty.
The use determines whether MC items are appropriate. Whether such use serves learning or undermines it is the question. We should be clear: high-quality MC items sort winners from losers efficiently and effectively. That’s what they do. We should use them only when that’s what we need.
We don’t need that kind of sorting to teach children.
References
Burton, S. J., Sudweeks, R. R., Merrill, P. F., & Wood, B. (1991). How to prepare better multiple-choice test items: Guidelines for university faculty. Brigham Young University Testing Services. https://testing.byu.edu/handbooks/betteritems.pdf
California Commission on Teacher Credentialing. (2023, June). Agenda item 2D: Reading Instruction Competence Assessment (RICA). https://www.ctc.ca.gov/docs/default-source/commission/agendas/2023-06/2023-06-2d.pdf
Cheung, K., Lam, J., Li, V. et al. (2023). ChatGPT versus human in generating medical graduate exam multiple choice questions—A multinational prospective study (Hong Kong S.A.R., Singapore, Ireland, and the United Kingdom). PLoS ONE, 18(8), e0290691. https://pmc.ncbi.nlm.nih.gov/articles/PMC10464959/
EdSource. (2023, November 15). Soon-to-be retired California reading instruction test gets high marks in national analysis. https://edsource.org/2023/soon-to-be-retired-california-reading-instruction-test-gets-high-marks-in-national-analysis/700144
Ngo, A., Gupta, S., Perrine, O., Reddy, R., Ershadi, S., & Remick, D. (2024). ChatGPT 3.5 fails to write appropriate multiple choice practice exam questions. Academic Pathology, 11, 100099. https://www.sciencedirect.com/science/article/pii/S2374289523000313
van Binsbergen, W. M. J. (1972). Multiple choice tests as an assessment technique for humanities and social sciences. University of Zambia. https://www.quest-journal.net/shikanda/publications/multiple_choice_BEST.pdf

And the right answer is
A, B, C, & D.
Honesty in measurement.