“Despite policies and curricula that seek to promote and assess creativity, much of the extant research suggests that teachers and schools are more likely to stifle creativity than nurture it (Cachia et al., 2010; Csikszentmihalyi & Wolfe, 2014; Sternberg, 2015). In addition, assessment has been identified as one of the primary culprits inhibiting creativity in schools” (Bolden, DeLuca, Kukkonen, Roy, & Wearing, 2019).
If we follow the news, ChatGPT has already accomplished superhuman feats in the academic Olympics. It has taken and passed medical school admissions exams, which are primarily holistic measures of the capacity to read deeply and write forcefully and coherently; to think logically in complex ways; to think mathematically; and to understand essential scientific concepts. It has “attended” and passed midterms and exams, completed assignments, written papers, and pretty much done what it needed to do to receive an undergraduate degree at MIT. Most recently, researchers at the University of Montana measured the bot’s creativity and found that it scores in the treetops of the percentile ranks on the Torrance Tests of Creative Thinking.
The research on medical school admissions tests appears to be straightforward, though I’ve stayed alert for more information. The exams measure tasks and problems in a botesque way, with carefully written prompts and definite expected responses. The MIT finding was discredited as bogus by three undergrads at MIT who criticized the methodology roundly and soundly, so much so that the authors withdrew the manuscript from publication even before peer review.
The Montana study is new. So far I haven’t seen any reactions to its design or its conclusions. But I have a few concerns myself.
*****
As the quote that opens this post suggests, creativity as a learning outcome has gained prominence around the world. Educators know that creativity can be crushed when learners participate in an academic culture that privileges right answers and discourages failure, that evokes anxiety and pays lip service to revision and reflection. Building expertise is one thing; always being right is another. The default dominance of standardized tests, of measurement, weakens the right to be wrong and to take risks when a lower grade or lesser percentile rank hangs in the balance. Chipping away day after day at the right to make a mistake or try a new idea may not be the way to discover the statue in the marble.
Theoretical disputes about comprehension tests rarely reach quite such a damning conclusion: is the Core matrix of summative reading tests really among the primary culprits in killing creativity? Personally, I think using standardized reading comprehension scores is like eating soup with a fork: they are good only for the things that float on the surface of the broth.
When I taught full time in a private reading clinic, before I saw a new student I got an intake form with a phonics inventory, graded word lists pronounced and defined in isolation, and an oral fluency assessment, all of which were highly useful, along with a set of standardized word recognition, vocabulary, and comprehension scores. The policy was to test out of level. For example, a high school student with poor fluency and low performance on graded word lists, admittedly subjective measures, might be given an intermediate II battery of silent tests normed on a late elementary population, a creative use of tests I’ve never seen duplicated in public schools, though it may happen.
Of the three, the most informative was the vocabulary measure, I now think because good vocabulary tests assess unsupervised learning. Word recognition and comprehension scores locate readers in ballparks: preprimer, primary, intermediate, secondary, beyond. If I looked at standardized test scores before meeting a student, I saw the scores as a hypothesis, not a conclusion. Truth be told, standardized scores added little of pedagogical value, but clients needed something beyond seeing improvement in reading behaviors. They expected standardized scores.
Although arguments have been made indicting prolonged, isolated phonics instruction for diminishing the motivation and mindset to read for meaning, I’ve not read nor heard such a blanket charge of malfeasance (culprit?) directed toward comprehension tests. As I read the intro quote above, the authors are approaching exactly that charge. How to handle the tension between creativity and comprehension is an interesting problem, one I want to talk over with Jeff Brodd, whom I’ll consult as he teaches his fall course in ancient global literature at Sacramento State.
On one hand, the authors of the review cited in the opening quote explicitly say that their intention is not to critique summative assessments such as a standardized comprehension test. On the other hand, they draw a theoretical line between divergent (creativity) and convergent (comprehension) thinking: “Convergent thinking…is geared to finding solutions that fit within existing constraints…. Thus, convergent thinking can be used to evaluate the appropriateness of novel behaviours, ideas and products” (p. 4), stopping those novel or creative behaviors in their tracks if the mindset is to be right rather than creative. In a paradigm of convergence, a thinker would be unlikely to offer up a spatula for flipping pancakes as a novel use for a shoehorn.
According to Bolden et al. (2019), the resolution of such logical conundrums comes in the form of disconnecting the cables between pedagogical thinking and measurement. For them, the divide between measurement and assessment is, pragmatically, a black hole that will suck up the energy that has built up around the globe for teaching creativity. Since measurement already sits at center stage as a developed and coherent discursive structure in schooling, the way to promote creativity is to support teacher thinking about how learners learn to be creative, not to attack the standardized measurement foundation of the school. The most potent tool in that pedagogy is feedback, that is, formative assessment. The critique extends to the Torrance Tests of Creative Thinking, first developed in 1966. By the logic of the 2019 review, the Torrance Tests may be part of the problem, promoting a superficial notion of creativity.
Which brings us to the University of Montana.
*****
The conclusion from the Montana study unequivocally crowns ChatGPT as creative royalty on par with Emily Dickinson, Mozart, or Albert Einstein. Note that the Montana study used the Torrance Tests of Creative Thinking (TTCT) and called it “the gold standard.” Here is how the researchers described their research project:
“GPT-4 was tested eight times through ChatGPT using two variants of the TTCT, the gold standard for researchers assessing creativity. Using the TTCT, we asked GPT-4 to demonstrate its creative abilities by asking questions, guessing causes, contemplating potential consequences, generating ideas for product improvement, envisioning new alternative uses for existing products, and imagining future scenarios,” said Dr. Guzik.
Asking questions, guessing causes, contemplating potential consequences: are these creative behaviors? Questioning behaviors need not, in my view, arise from or resolve themselves through creative thinking; they often derive from and end with logical analysis. Guessing antecedents and potential consequences also seems multidimensional, potentially engaging knowledge and logic; the bot’s forte is predicting the likelihood that the meaning-language particles appearing in its output will be appropriate for human users, i.e., explaining potential causes and effects. As I worked on this post, I felt a nagging sense of ambiguity around two of the verbs in Dr. Guzik’s list: generating and envisioning.
So I asked the bot to explain to me how it would go about producing a list of alternative uses for a shoehorn.
The bot patiently explained the algorithms it would activate after tokenizing the prompt, all of which involved classification and analysis of word particles plotted as points in multidimensional mathematical vectors (three or more axes). One midway step it described was brainstorming.
Brainstorming for the bot means lowering its filters. Where it might usually reject a point located in a vector with a low probability of being an appropriate word, in brainstorming mode it might accept that point. In this way, a word like “spatula” for flipping pancakes would not be rejected: the shape and material work, but the function is bizarre. Human creativity certainly can flow from the shape of a shoehorn to other objects with a similar shape and then to the functions of those other objects (spatulas, paint sticks for stirring, rope cleats on a boat). But the bot is limited to the particular algorithms it is predetermined to use.
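To make that filter-lowering idea concrete, here is a minimal sketch in Python. It is only an illustration of thresholded candidate selection, not the bot’s actual machinery; the candidate words and the probabilities attached to them are invented for the example.

```python
# Toy illustration of "lowering the filters": candidate completions for
# "an alternative use for a shoehorn is a ___" survive or vanish depending
# on a probability cutoff. The words and probabilities below are invented.

candidates = {
    "lever": 0.30,            # conventional, high-probability completions
    "back scratcher": 0.22,
    "bookmark": 0.15,
    "door wedge": 0.12,
    "plant marker": 0.08,
    "spatula": 0.02,          # low-probability, "bizarre" completion
}

def surviving_candidates(pool, cutoff):
    """Keep only the candidates whose probability clears the cutoff."""
    return [word for word, p in pool.items() if p >= cutoff]

# Normal mode: a strict filter rejects the long shots.
print(surviving_candidates(candidates, cutoff=0.10))
# -> ['lever', 'back scratcher', 'bookmark', 'door wedge']

# "Brainstorming" mode: a lower cutoff lets "spatula" through.
print(surviving_candidates(candidates, cutoff=0.01))
# -> ['lever', 'back scratcher', 'bookmark', 'door wedge', 'plant marker', 'spatula']
```

If that is what the bot’s brainstorming amounts to, then what looks like a burst of imagination is, mechanically, a relaxed rejection rule applied to the same ranked list.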
Is algorithmic thinking creativity?
I asked the bot about its capacity for divergent thinking. I suppose the task I proposed may qualify as an instance of bot self-assessment, though it lacks any element of goal orientation or self-improvement. A bot’s goal is to answer the prompt; self-improvement is a matter of batch training or of harvesting interactions with humans.
- user: Is chatgpt good at divergent thinking
- robot: ChatGPT is an AI language model developed by OpenAI, which is primarily trained to understand and generate human-like text based on patterns it learns from large datasets. However, it is important to note that ChatGPT is not specifically designed or trained to excel in divergent thinking tasks, which typically involve generating multiple creative solutions to a problem.
As I stated earlier, the bot’s forte is logic; classification, regression, and clustering based on the probabilities that word particles co-occur in semantic vectors are mathematical processes. Nothing in the bot’s design has anything to do directly with semantics. This weakness in divergent thinking may be at the root of bot hallucinations. Here is advice from the bot:
While ChatGPT can generate text and provide responses based on its training, it may not consistently produce highly creative or novel ideas associated with divergent thinking. AI models like ChatGPT primarily aim to mimic human-like conversation and provide useful information, but their ability to engage in true divergent thinking is limited.
For tasks that require divergent thinking, it is often better to engage with human creativity or brainstorming techniques. AI models like ChatGPT can still provide valuable assistance and information, but they may not be the ideal tool for exercises involving divergent thinking.
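As a footnote to the claim that the bot’s machinery is mathematical rather than semantic, here is a minimal sketch of how “similarity” between words reduces to arithmetic over vectors. The three-dimensional vectors, and the words attached to them, are invented for illustration; real embeddings have hundreds or thousands of dimensions learned from co-occurrence patterns.

```python
import math

def cosine(u, v):
    """Cosine similarity: the dot product divided by the product of the lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented three-dimensional stand-ins for learned embedding vectors.
vectors = {
    "shoehorn": (0.9, 0.1, 0.3),
    "spatula":  (0.8, 0.2, 0.4),  # appears in similar contexts -> nearby vector
    "sonata":   (0.1, 0.9, 0.2),  # appears in different contexts -> distant vector
}

print(round(cosine(vectors["shoehorn"], vectors["spatula"]), 3))  # ~0.984
print(round(cosine(vectors["shoehorn"], vectors["sonata"]), 3))   # ~0.271
```

Nothing in that arithmetic knows what a shoehorn is for; the numbers only record where the word tends to appear.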
*****
I expected the general nature of the response I would get from the bot once I told it how well it had done on the Torrance Tests of Creative Thinking, but I didn’t expect it to reject the appropriateness of using any psychometric test designed to measure human cognition on a bot. This objection extends to the SAT, AP Exams, med school admissions exams, and so on. Humans may be completely confused when they use technology designed to measure human performances as a tool to measure the performance of a technology. It would be like seeing how well a free-floating hammer could pound a nail.
- user: What would be your analysis if you learned that ChatGPT scored at the top of the scale on the Torrance Tests of Creativity
- robot: As an AI language model, ChatGPT does not have the ability to take the Torrance Tests of Creativity or any other psychometric tests. It does not possess the same cognitive capabilities, thought processes, or understanding of the world as humans do. Therefore, it would not be appropriate to characterize ChatGPT's performance on the Torrance Tests of Creativity or draw any conclusions about its divergent thinking abilities based on such tests. Its generation of text is based on patterns and examples from its training data rather than original creative thinking or problem-solving skills.
- user: So are you saying that any creativity in evidence in your output is a result of your training not from authentic thinking?
- robot: That is correct. As an AI language model, any output or creativity displayed by ChatGPT is a result of the patterns and information it has learned during its training. It does not possess subjective experiences, emotions, or authentic thinking like humans do. Its responses are generated based on statistical patterns and associations found in the training data.
*****
- robot: Hello, How can I assist you?
- user: What do the psychometrics of the Torrance Tests of Creativity say about its reliability and validity when used with humans
- robot: The Torrance Tests of Creativity (TTCT) is a widely used measure of creativity. The psychometrics of the TTCT indicate that it has good reliability and validity.
Bolden, B., DeLuca, C., Kukkonen, T., Roy, S., & Wearing, J. (2020). Assessment of creativity in K‐12 education: A scoping review. Review of Education, 8(2), 343–376.