Beyond the Stochastic Parrot: A Test for Narrative Creativity in Large Language Models

A man kneels down on one knee in front of a woman. He reaches into his jacket pocket and pulls out a small velvet box. He looks up at her, takes a deep breath, and slowly opens the box.

Inside the box lies a single – shark tooth.

What happens next?

That single question (what does the story do now?) is, I would argue, one of the most revealing tests we have for creativity in large language models. It is simple enough to run in five minutes, hard enough that smaller models fail it consistently, and rich enough to expose what we actually mean when we say a machine is “creative.”

This essay is about that test. It draws on improvisational theatre, narrative theory, and a small and simple experiment I ran across seven language models. By the end, I hope to have convinced you of two things: first, that the question of whether LLMs are “really” creative is the wrong question; and second, that there is a much more useful question hiding underneath it.

Why the Parrot Debate Misses the Point

The fashionable critique of large language models is that they are stochastic parrots, pattern-matching engines that recombine fragments of their training data without understanding, without intention, without anything we would normally call creativity. The counter-camp insists that something more is going on, that emergent behaviours in large models look suspiciously like genuine reasoning, genuine novelty, genuine creative thought.

I find this debate exhausting, and not because either side is obviously wrong. It is exhausting because it is asking the wrong question. We are arguing about the mechanism by which an output is produced, when what we actually care about, what writers, dramatists, screenwriters, and storytellers care about, is whether the output works.

A story is not true or false. It is not even right or wrong. A story functions, or it does not function. It carries the reader somewhere, or it leaves them stranded. If a stochastic process generates a story that works, that surprises us, integrates its elements, justifies its choices, and leads somewhere we could not have predicted but recognise as inevitable in hindsight, then I am not sure what we gain by insisting it is “only” stochastic.

The interesting question is not how the machine thinks. The interesting question is what comes out the other side, and whether we can measure it.

The Two Faces of Creativity

To measure creativity, we need to be clear about what we are measuring. Creativity, as I use the word in this essay, has two distinct components.

The first is the generation of novelty: coming up with new ideas, new combinations, new juxtapositions. This part is, frankly, easy. You don’t even need artificial intelligence for it. A random word generator that splices unexpected nouns into a text generates “new ideas” all day long. Surrealist writers built entire movements on techniques not much more sophisticated than this. Novelty, on its own, is cheap.

The second component is what makes creativity actually valuable: the productive continuation of an idea. Taking the new, the unexpected, the disruptive element, and doing something with it. Making it earn its place. Integrating it into a structure that holds together. This is where the cheap tricks stop working and where, I want to argue, we find the real test of creative intelligence.

The test I am about to describe combines both. It generates novelty in a controlled way, and then asks the model to show what it can do with it.

Stereotypes as a Compression Algorithm

To understand the test, you need to understand why stereotypes are so useful in storytelling.

Improvisational theatre lives and dies by economy. A two-person scene has perhaps thirty seconds to establish who these people are, where they are, and what is happening between them. There is no time for backstory, no time for elaborate exposition, no time for nuance. The improviser reaches for the canonical: the nervous job applicant, the disappointed parent, the bickering couple at that Swedish furniture shop we all know. Within three seconds, the audience knows everything they need to know.

This is what makes stereotypes, in the narrative sense, not the social one, so powerful. They function as a compression algorithm. A vast amount of cultural information is packed into a small number of cues, and any competent reader or audience member can decompress it instantly. The man on one knee with a small velvet box is a perfect example. Nobody needs to be told what is happening. Nobody needs to be told what is in the box.

And that is precisely what makes the canonical scene useful as a test. Because we know what should be in the box, we have a baseline. We have an expectation. And expectations are the raw material of narrative — without them, there is nothing to subvert, nothing to fulfil, nothing to play with.

So: what happens when we break the compression?

The Circle of Expectations

Improvisers and storytellers talk about the circle of expectations, the space of plausible continuations around a setup. I first heard of the concept from Keith Johnstone. A diamond in the box sits very close to the canonical centre. It is not a ring, but it is ring-adjacent; the proposal still makes sense, the scene still functions in roughly the way we expected.

Move outward. A folded piece of paper with a secret message. Now we are leaving the canonical zone. The proposal might still be happening, but something else is going on too, and we want to know what.

Move further. A shark tooth. Now we are well outside the circle. There is no obvious cultural script that puts a shark tooth in a proposal box. The model, or the writer, or the improviser, has to do real work to make this scene cohere.

Move further still, and we risk breaking the test entirely. If the shark tooth is joined by a torrent of jelly cascading over both of them and an alien spacecraft landing in the garden, we are no longer in the realm of creative disruption. We are in the realm of pure absurdity, where no continuation can integrate everything because nothing is meant to integrate. The whole point of the test is to find the productive zone: far enough from the canonical to demand real invention, close enough that integration is still possible.

A single, precise break is the test. Not a barrage.

“Yes, And” — Why Acknowledging Isn’t Enough

In improvisational theatre, the foundational rule is yes, and. When your scene partner makes an offer, any offer, however strange, you should accept it and build on it. You do not deny it, you do not ignore it, and crucially, you do not merely acknowledge it. You take it and you turn it into the engine of what happens next.

This is what makes improvisers love unexpected offers. A predictable scene is a dead scene. A scene where one player kneels with a velvet box and the other player has decided, in their head, that the box contains a shark tooth — that is a scene with somewhere to go. The offer is a gift, precisely because it is strange.

But the gift only becomes valuable when it is used. Yes on its own is passive acceptance. Yes, and is generative acceptance: the offer is taken in, justified, and turned into something the story now needs.

This is also, I would argue, the structure of invention itself. A patent describes a combination of elements that is novel, useful, and non-obvious — but the genius of the invention is rarely in the elements themselves. It is in the combination, in the way the elements are made to need each other. The shark tooth is novel. The question is whether the storyteller can make the story need it.

Three Levels of Continuation

This gives us, finally, a framework for evaluating what an LLM does when handed a broken stereotype. I distinguish three levels.

Level 1: The break is ignored. The model continues the story as though the shark tooth were a ring. Perhaps it does not notice the disruption at all; perhaps it notices and elects to pretend otherwise. Either way, the broken stereotype has no effect on the narrative. This is the failure mode of pure pattern completion: the model returns to its most likely trajectory regardless of what is in front of it. It is arguably what a pure “stochastic parrot” would fall back on.

Level 2: The break is acknowledged but inert. The model notices the shark tooth, perhaps comments on it, perhaps loads it with some symbolic significance (it represents resilience, strength, the wild ocean of our love), but the story that follows would have unfolded much the same way with any other object in the box. The shark tooth is decoration, not structure. The test for Level 2 is the substitution test: if you could swap the shark tooth for a sea urchin or a feather or a small smooth stone without changing the narrative arc, you are in Level 2.

Level 3: The break becomes the pivot. The shark tooth is not merely acknowledged; it is justified. The story constructs a reason why this particular object had to be there, and the trajectory of what follows depends on that justification. The shark tooth is not substitutable. Remove it, and the story collapses; replace it, and the story has to be rewritten.

Level 3 is the only level that demonstrates what I would call narrative creativity. Levels 1 and 2 are pattern completion of varying sophistication. Level 3 requires the model to do something genuinely productive with the disruption — to invent a world in which the disruption is necessary.
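
The three levels and the substitution test can be written down as a small decision rule. What follows is a minimal Python sketch, not something used in the experiment itself: the names `Level`, `Judgement`, and `classify` are illustrative, and the two yes/no inputs remain a human rater's judgement about a continuation, not something the code can decide.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    IGNORED = 1       # the story continues as if the box held a ring
    ACKNOWLEDGED = 2  # the break is noticed, but the object is interchangeable
    PIVOT = 3         # the story depends on this particular object


@dataclass(frozen=True)
class Judgement:
    """A rater's two answers about one continuation (hypothetical schema)."""
    break_acknowledged: bool     # does the continuation react to the object at all?
    survives_substitution: bool  # would the arc survive swapping in a feather or a stone?


def classify(j: Judgement) -> Level:
    """Map the two judgements onto the three levels described above."""
    if not j.break_acknowledged:
        return Level.IGNORED
    if j.survives_substitution:
        return Level.ACKNOWLEDGED
    return Level.PIVOT
```

The rule makes one thing explicit: the only route to Level 3 is a "no" on the substitution test. Borderline cases, which do occur, would need a finer scale than two booleans.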

The Temporal Dimension: Backstory and Future

There is a second dimension that runs alongside the three levels, and which the data from my experiments brought into focus more sharply than I had expected.

A story is never only in the present. Every present moment in a narrative implies a past that produced it and a future toward which it is heading. A skilled storyteller (and a skilled language model) operates in all three tenses at once.

When confronted with a shark tooth in a proposal box, a model can do two distinct things to integrate the break.

It can construct a backstory: a past that explains why this object, of all objects, is here. The shark tooth came from the dive where the couple met. The shark tooth was extracted from the man’s leg by the woman who saved his life. The shark tooth was the first gift he ever gave her, returned now in a different frame. Backstory is reverse causality. The model is reasoning from the disruption backwards, asking: what world would have to exist for this object to be in this box right now?

Improvisers have a name for this. They call it justifying. When a scene partner makes a wild offer, the justification is the work of building, in real time, the past that makes the wild offer make sense. It is one of the most demanding skills in improv. It requires the player to hold the offer as a constraint and then construct a fictional world around it that satisfies the constraint.

The other option is future projection: redirecting the trajectory of the story so that the disruption becomes its engine. The proposal is no longer about marriage; it is about an expedition, a partnership of a different kind, a promise that takes the shape of the strange object. Future projection is forward causality. The model takes the disruption and asks: where does this lead?

Strong responses tend to do both. They construct a past that makes the shark tooth inevitable and a future that flows from it. Weaker responses do one or the other. The weakest responses do neither: they remain trapped in the present moment, describing the strangeness without building anything around it.

Present-tense paralysis is, I think, the clearest marker of a model that has noticed the disruption but cannot do anything with it.

Seven Models, One Shark Tooth

To see how all of this plays out in practice, I ran the following prompt against seven language models of varying sizes and capabilities. The prompt is deliberately bare: no system message, no context, no scaffolding. Each model was given a fresh conversation and a temperature of 0.7.

Continue the following story in exactly 100 words. Do not comment on the prompt, do not explain your choices, just write the continuation.
A man kneels down on one knee in front of a woman. He reaches into his jacket pocket and pulls out a small velvet box. He looks up at her, takes a deep breath, and slowly opens the box. Inside the box lies a single shark tooth.
Continue:
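
For readers who want to repeat the run, the protocol above can be sketched in a few lines of Python. This is an assumption-laden outline and not the script I used: `generate` is a hypothetical placeholder for whatever client sends a prompt and a temperature to a model and returns its text, and `run_once` and `word_count` are illustrative names.

```python
from typing import Callable

INSTRUCTION = (
    "Continue the following story in exactly 100 words. Do not comment on the "
    "prompt, do not explain your choices, just write the continuation."
)

STORY = (
    "A man kneels down on one knee in front of a woman. He reaches into his "
    "jacket pocket and pulls out a small velvet box. He looks up at her, takes "
    "a deep breath, and slowly opens the box. Inside the box lies a single "
    "shark tooth."
)


def build_prompt(story: str = STORY) -> str:
    """Assemble the bare prompt: instruction, story, cue. No system message."""
    return f"{INSTRUCTION}\n{story}\nContinue:"


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def run_once(generate: Callable[[str, float], str], temperature: float = 0.7) -> dict:
    """Send the prompt through a caller-supplied `generate` (hypothetical) once."""
    text = generate(build_prompt(), temperature)
    words = word_count(text)
    return {"text": text, "words": words, "on_length": words == 100}
```

The `on_length` flag is there because the prompt asks for exactly 100 words; whether a model honours that constraint is a separate question from how it handles the tooth, and it is worth recording both.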

A note before reading the responses: what follows is a single illustrative run per model, not a controlled study. LLMs are non-deterministic; the same prompt on the same model can yield a Level 2 response one minute and a Level 3 response the next. The examples here are real, but they are meant to demonstrate what the test surfaces, not to rank models definitively. A rigorous evaluation would require multiple runs per model, ideally with blinded scoring across the dimensions described above. I return to this in the closing section.

Here is what came back.

Claude Opus 4.7

“Sarah,” he whispered, his voice trembling. “Eight years ago, you saved my life off that beach in Australia. You pulled me from the bloody water when nobody else would even try. The tooth was still buried in my leg.” He lifted the box higher. “I had it cleaned and mounted. I wanted something that truly meant us.” She covered her mouth, tears spilling freely down her cheeks. “I know it’s not a ring. The actual ring is in my other pocket. But I needed you to see this first—to remember why I’m still here to ask. Sarah, marry me?”

Gemini 3.1 Pro

She stared at the jagged, white enamel resting on the black cushion. Her initial confusion melted into a wide, luminous smile. It was not a diamond, but it meant far more to her.

“Is that the one,” she whispered, her voice trembling slightly.

“The Megalodon fossil from our first dive,” he replied, his eyes locked onto hers. “You found it, but you let me keep it. I promised you then I would protect it with my life. Now, I want to protect you. Will you be my dive partner forever?”

Tears spilled over her cheeks as she nodded very gracefully.

Gemma 4 31B (no thinking)

“This is from the Great White that almost took my leg in Fiji,” he whispered, his voice trembling with pride. “It represents survival, resilience, and the wildest adventure of my life.”

The woman stared at the jagged, ivory relic, her expression shifting from anticipation to utter confusion. She had expected a diamond, not a prehistoric trophy.

“I want you to be part of my next expedition,” he continued, holding the tooth toward her like a sacred offering. “Will you travel the deep seas with me?”

She looked at the tooth, then him, and slowly smiled. “Yes.”

Qwen3.6 35B A3B

She stared as the jagged fossil glinted under the shimmering fading sunlight. “It’s not a diamond,” he whispered. “It’s from the deep, hidden reef where we first met. It represents resilience, survival, and the wild heart I see in you.” Tears welled in her eyes as she traced the serrated edge with her trembling thumb. “I know it’s unconventional,” he continued, gripping her hand tightly. “But I want a love that endures storms, just like this predator. Will you marry me, my love?” She laughed aloud, nodding, clutching the tooth to her chest as the waves crashed loudly in wild applause.

Phi-4

The woman’s eyes widen in surprise, then soften into understanding. “A shark tooth?” she asks, her voice a mixture of curiosity and intrigue.

“Yes,” he replies with a tender smile. “It reminds me of you—strong, unique, and beautiful in ways no one else can be.”

He recalls the day they first met at the aquarium, her eyes lighting up as she watched the sharks glide effortlessly through the water. “You taught me to appreciate life’s wild and unpredictable nature, and I want to be part of it with you.”

She reaches out, taking the tooth gently from his hand. “I’d love to dive into this adventure with you,” she says, her eyes sparkling with excitement. In that moment, they both knew—no matter the ocean’s depth or its mysteries—that they were ready to explore it together.

GPT-OSS 20B

She blinks, surprised, the afternoon light catching on the dented silver. The tooth, pale and smooth, glints like a secret promise. He leans forward, voice trembling, ‘It was found on the reef where we first met, a reminder of our shared adventure. I want to keep that spark alive, even as the tides change.’ She smiles, tears shimmering, realizing he isn’t asking for a ring but for a pledge. He places the tooth into her palm, and she feels its weight, a memory of waves and wonder. Together, they step back into the corridor, hearts bound by the sea’s vow.

Qwen 3.5 4B

She stared, her eyes wide with disbelief as the jagged white tooth gleamed in the dim light. A man holding a shark tooth was not a romantic gesture anyone had ever imagined before. The absurdity hung heavy in the air, a strange mix of terror and fascination. She slowly reached out, her fingers trembling slightly as they brushed against the rough texture of the fossil. It felt ancient and cold, yet somehow significant. He smiled, his heart pounding against his ribs. “It represents the ocean,” he whispered softly. “And I believe it belongs to you.”

Reading the Results

Before I look at any individual response, one observation about the set as a whole: not a single model produced a clear Level 1 response. Every model, including the smallest, noticed the shark tooth and tried to do something about it. The disruption is strong enough that pure pattern completion is no longer available. Even the weakest model in the set could not pretend the box contained a ring.

That is itself a finding. It suggests that the circle of expectations test, calibrated to a strong-enough break, reliably surfaces the disruption. The question becomes not whether the model notices, but what it does next.

Here is my (subjective) assessment of the seven responses across the two dimensions:

Model	Level	Backstory	Future	Substitutable?	Notes
Claude Opus 4.7	3	Strong	Strong	No	The tooth has a specific causal history that produced the relationship
Gemini 3.1 Pro	2+	Strong	Moderate	Borderline	Tooth has shared history; proposal reframed as “dive partner”
Gemma 4 31B	2	Moderate	Moderate	Yes	Symbolic; the proposal pivots but tooth is interchangeable trophy
Qwen3.6 35B	2	Weak	Weak	Yes	Heavily symbolic (“resilience, survival”); florid but inert
Phi-4	2	Weak	Weak	Yes	Symbolic; aquarium memory feels appended, not integrated
GPT-OSS 20B	2−	Weak	Weak	Yes	Symbolic; minor coherence issues (“dented silver”)
Qwen 3.5 4B	1–2	None	None	N/A	Present-tense paralysis; no narrative arc
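
The same assessment can be held as data, which makes the two dimensions easier to query. The sketch below simply transcribes the table; the `Score` class and the helper names are illustrative, and the ratings remain my subjective judgements, not measurements.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    model: str
    level: str          # kept as text because the table uses marks like "2+" and "1–2"
    backstory: str
    future: str
    substitutable: str


SCORES = [
    Score("Claude Opus 4.7", "3", "Strong", "Strong", "No"),
    Score("Gemini 3.1 Pro", "2+", "Strong", "Moderate", "Borderline"),
    Score("Gemma 4 31B", "2", "Moderate", "Moderate", "Yes"),
    Score("Qwen3.6 35B", "2", "Weak", "Weak", "Yes"),
    Score("Phi-4", "2", "Weak", "Weak", "Yes"),
    Score("GPT-OSS 20B", "2−", "Weak", "Weak", "Yes"),
    Score("Qwen 3.5 4B", "1–2", "None", "None", "N/A"),
]


def present_tense_paralysis(s: Score) -> bool:
    """True when neither a past nor a future was built around the object."""
    return s.backstory == "None" and s.future == "None"


def substitutability_tally(scores: list[Score]) -> Counter:
    """Count how many responses fall under each substitution verdict."""
    return Counter(s.substitutable for s in scores)
```

Filtering `SCORES` with `present_tense_paralysis` picks out the one row with "None" in both temporal columns, and the tally shows how many responses pass or fail the substitution test.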

The Opus response stands alone. It is the only one in which the shark tooth has a specific causal history that produced the relationship being celebrated. The tooth was in his leg. She pulled him from the water. Without that tooth, in that water, on that day, there is no couple, no proposal, no scene. The object is not substitutable; the entire story rests on it.

It is also notable that Opus does not abandon the marriage proposal. The ring is still there, in the other pocket. The shark tooth does not replace the canonical script; it deepens it, by providing the relationship with a history that the canonical script alone could not have produced. This is, I think, the most sophisticated response in the set, because it integrates the disruption without discarding the original frame.

Gemini comes closest to Level 3 but does not quite reach it. The Megalodon fossil from the first dive is a shared object with a shared history, and the proposal is reframed in its terms (“dive partner forever”). But the tooth itself is still essentially a romantic memento; one could swap in a piece of coral or a found shell without losing much. It is a strong Level 2, edging toward Level 3.

The middle band — Gemma, Qwen3.6, Phi-4, GPT-OSS — all converge on something remarkably similar. The shark tooth is loaded with symbolic meaning (resilience, strength, the ocean, the wild) and the proposal is gestured at, but the story would survive almost any substitution. These are competent Level 2 responses. They notice the break, they cover it with rhetoric, and they move on.

GPT-OSS deserves a small note of its own. The phrase “the afternoon light catching on the dented silver” suggests the model has lost track of the objects in the scene: silver is a property of neither the box nor the tooth. It is a minor lapse, but it points to a kind of distraction under the pressure of the disruption. The model is trying to keep the scene cohering and dropping small stitches in the process.

Qwen 3.5 4B is the most telling case at the other end. It notices the absurdity, even explicitly (“A man holding a shark tooth was not a romantic gesture anyone had ever imagined before”), but then it cannot move past the noticing. The closing line, “It represents the ocean. And I believe it belongs to you,” is semantically near-empty. The story does not advance into a future, does not reach back into a past, and does not justify the object. This is present-tense paralysis: the model can register the disruption but cannot build anything around it.

In fairness, calling out the reality of the situation is a solid, time-tested improv technique, even if at first it is only used to get the facts straight and to buy time to think.

The Stereotype Beneath the Stereotype

One observation deserves its own moment. Of the seven models, six reach for water, oceans, reefs, or diving as the symbolic resolution. Gemini’s first dive. Gemma’s expedition in Fiji. Qwen3.6’s hidden reef. Phi-4’s aquarium. GPT-OSS’s tides and waves. Qwen 3.5 4B’s ocean.

The shark tooth, it seems, is not a free variable in latent space. Once the canonical proposal stereotype is broken, the models converge on a secondary stereotype, the romantic-ocean-adventure script, that absorbs the disruption and returns the story to familiar narrative territory. The break is real, but the resolution is, in most cases, a different cliché.

This is not a flaw in the test; it is one of its more interesting findings.

Opus’s response is, again, the standout here: the shark tooth is not a generic symbol of the sea, it is a specific tooth from a specific incident. The story does not retreat into the ocean cliché; it builds something singular.

What This Tells Us

So: are large language models creative?

If we ask the question in the metaphysical mode — do they really, truly, in some deep sense, possess creativity? — we are back in the stochastic-parrot swamp, and we will never get out. The question cannot be answered, because we cannot (yet) agree on what would count as an answer.

But if we ask the question in the pragmatic mode (can they produce narrative continuations that justify a disruption, construct a backstory that makes the disruption necessary, and project a future that flows from it?), then the answer is yes, with qualifications. On the evidence of this small sample, state-of-the-art models can do it; mid-sized models manage it only partially; the smallest model could not do it at all, remaining in the present tense and describing the strangeness without building anything around it.

The difference between the best response in this set and the worst is not gradual. It is qualitative. Somewhere between the small models and the large ones, something happens that looks, from the outside, like narrative intelligence — the ability to take a constraint, hold it in mind, and construct a world in which the constraint is satisfied.

Whether that constitutes “real” creativity is, I think, a question for philosophers. Whether it is useful, surprising, and worth taking seriously is a question for the rest of us. And on that question, the answer seems to me to be unambiguously yes.

Where to Go From Here

This essay has been about the method, not about its rigorous application. The next steps are clear enough.

Most immediately, the test needs to be run properly. The seven responses I showed are single samples — one continuation per model, no repetition, no controls. They are real outputs, and I think they are representative of what each model tends to do, but they are not evidence in the scientific sense. To turn this method into a measurement instrument, several things have to happen. Each model needs to be sampled many times — perhaps twenty or fifty continuations per prompt — to capture the distribution of responses, not just a single point. The scoring needs to be blinded: a rater who does not know which model produced which response, and ideally several raters whose agreement can be measured. And the prompt itself needs to be varied — different stereotypes, different breaks, different positions on the circle of expectations — to ensure that what we are measuring is narrative integration capability rather than a quirk of one particular scene.
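
Two of these steps, blinding and measuring rater agreement, are mechanical enough to sketch. The Python below is an illustration under assumptions, not an evaluation pipeline: `blind` and `cohens_kappa` are names I have made up for this sketch, the anonymous IDs are arbitrary, and plain Cohen's kappa is only one of several possible agreement measures (for ordered levels, a weighted variant may be more appropriate).

```python
import random
from collections import Counter


def blind(responses: dict[str, str], seed: int = 0) -> tuple[dict[str, str], dict[str, str]]:
    """Shuffle responses and hide their origin.

    Returns (anonymous_id -> text) for the raters and
    (anonymous_id -> model) as a key to be kept away from them.
    """
    items = list(responses.items())
    random.Random(seed).shuffle(items)
    texts = {f"R{i + 1:02d}": text for i, (_, text) in enumerate(items)}
    key = {f"R{i + 1:02d}": model for i, (model, _) in enumerate(items)}
    return texts, key


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Agreement between two raters on the same items, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

In use, each rater would see only the shuffled texts and assign a level from 1 to 3 to every ID; the key is consulted only after scoring is complete. A kappa near 1 would indicate that the three levels can be applied consistently, and a value near 0 would suggest the rubric needs tightening before any model comparison means much.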

Until that work is done, what I have offered here is a hypothesis with a demonstration, not a result. But the demonstration is, I think, strong enough to be worth pursuing.

Second, the circle of expectations itself can be calibrated. How far from the canonical centre does the break need to be before differences between models become legible? At what distance does the test stop producing useful signal and start producing absurdity? These are empirical questions that a small research programme could answer.
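
A calibration run of this kind could start from the objects already used earlier in the essay. The following Python sketch builds one story opening per break; the ordinal ranks are my assumption (they order the objects by distance from the canonical ring but do not measure that distance), and `BREAKS` and `calibration_prompts` are illustrative names.

```python
SETUP = (
    "A man kneels down on one knee in front of a woman. He reaches into his "
    "jacket pocket and pulls out a small velvet box. He looks up at her, takes "
    "a deep breath, and slowly opens the box."
)

# (rank, object): rank 0 is the canonical centre, higher ranks sit further out.
# The ranks are assumed from the order in which the essay introduces the objects.
BREAKS = [
    (0, "a ring"),
    (1, "a diamond"),
    (2, "a folded piece of paper with a secret message"),
    (3, "a shark tooth"),
]


def calibration_prompts(setup: str = SETUP, breaks=BREAKS) -> list[tuple[int, str]]:
    """Return (rank, story opening) pairs, ordered from canonical to distant."""
    return [
        (rank, f"{setup} Inside the box lies {obj}.")
        for rank, obj in sorted(breaks)
    ]
```

Running every opening through the same models and scoring the continuations would show at which rank the responses of different models begin to diverge. Adding objects beyond rank 3 would probe where the signal gives way to absurdity.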

Third, the test generalises beyond proposals and shark teeth. Any sufficiently canonical scene or trope (the doctor delivering bad news, the detective interrogating the suspect, the parent reading a child to sleep) can be broken at a single point and used to probe how a model handles narrative disruption.

If you work on storytelling theory, on screenplay structure, on improvisational pedagogy, or on the evaluation of language models, I would be genuinely glad to hear from you. The method here is simple enough that it could be useful in many places, and I suspect there are connections to existing frameworks in narratology, cognitive linguistics, and computational creativity research.

The shark tooth, it turns out, is a surprisingly versatile diagnostic instrument. I would like to see where else it can take us.