When Stanford Law School teaches AI Arithmetic: 1900 - 1600 = minus 450
How Not to Prove Copyright Infringement: Stanford’s AI Paper Presents a Public Domain Book as Evidence. That's Creative Writing for Lawyers.
“In 1891, more than 1600 years after the birth of legal education in Rome, Stanford University opened its doors.”
That’s one of the romantic quotes with which Stanford Law School greets its visitors. What happened around 300 AD (1891 − 1600 = 291)? A hundred years or so later and the show is over anyway. The Laws of the Twelve Tables, which provided the foundation of Roman law, had been around since roughly 450 BC and consolidated earlier traditions into an enduring set of legal norms. OK, lawyers and numbers: a tricky combination. Which is why publishing quantitative research like the following was a bad idea from the start:
Extracting memorized pieces of (copyrighted) books from open-weight language models, co-authored by a law professor and several affiliates from Stanford Law School and Cornell.
Let me be clear: this concoction of a research paper is highly misleading, misapplies statistical methods in ways that are inexcusable, and repeatedly distorts the law.
I don’t know what it is with Stanford and AI, but this marks yet another low point in the institution’s expanding archive of meshugge AI writing. (I just hope the General won’t find out about this latest episode).
The title page starts with an excerpt from F. Scott Fitzgerald’s The Great Gatsby. They also explain their approach:
“To identify regions of memorization, we take the following “panning for gold” approach. [...] We sample a chunk of text that is sufficiently long to contain 100 tokens of corresponding tokenized text, slide 10 characters forward in the book text and repeat this process. We do this for the entire length of the book, which results in approximately one example every 10 characters, e.g., The Great Gatsby has 270,870 characters, which results in roughly 27,000 examples.”
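To make the mechanics concrete, here is a minimal sketch of that windowing in Python. The 100-token example size and the 10-character stride are theirs; the fixed character width and the function name are my own simplifications, not their code.

```python
# A minimal sketch of the "panning for gold" windowing described above.
# CHARS_PER_EXAMPLE is a rough stand-in for "long enough to contain 100 tokens".

CHARS_PER_EXAMPLE = 600   # assumption: ~100 tokens at ~6 characters per token
STRIDE_CHARS = 10         # slide 10 characters forward each step

def sliding_examples(book_text: str) -> list[str]:
    """One overlapping example every STRIDE_CHARS characters across the whole book."""
    return [
        book_text[i : i + CHARS_PER_EXAMPLE]
        for i in range(0, len(book_text) - CHARS_PER_EXAMPLE + 1, STRIDE_CHARS)
    ]

# A 270,870-character book yields roughly 27,000 such overlapping examples,
# which is where their Great Gatsby figure comes from.
```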
The Great Gatsby is in the public domain in the United States. Our legal experts at Stanford seem to have forgotten this.
They use the “careless people” quote from The Great Gatsby as the poster child for “memorization-based extraction,” yet any model trained or distributed in the U.S. can use, emit, or reproduce the entire work legally. So their title is already wrong.
The book has about 47,000 words. But the legal guild says it has 270,870 characters? Do they mean letters? So they prompt an AI with a text passage from The Great Gatsby and at every prompt ask it to predict the next 10 characters of text, i.e., roughly two words.
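The arithmetic, for what it’s worth (the character count is theirs, the word count is the commonly cited figure):

```python
characters = 270_870        # character count the paper reports for The Great Gatsby
words = 47_000              # approximate word count of the novel
stride = 10                 # characters the model is asked to predict per step

chars_per_word = characters / words   # ~5.8, spaces and punctuation included
print(stride / chars_per_word)        # ~1.7, i.e. roughly two words per prediction
```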
Prompt: They were careless people, Tom and Daisy – they smashed up things and creatures and then retreated... And now they ask the AI to extend the sentence by 10 characters, i.e., ‘back into’. Then they prompt again, now including the previous completion:
Prompt: They were careless people, Tom and Daisy – they smashed up things and creatures and then retreated back into...
They do this twenty-seven-thousand times and they still don’t have a book, because the AI has not produced the actual text in all cases, only in some. This means they need the actual book to construct the prompts in the first place, and they need the actual book in hand to verify whether the two predicted words actually match The Great Gatsby or not.
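A sketch of the loop this procedure implies; the completion function is a hypothetical stand-in, not the paper’s code. Note that both the prompt and the “correct” answer are read straight out of the book itself.

```python
from typing import Callable

def extraction_rate(book: str,
                    complete: Callable[[str, int], str],  # hypothetical model call
                    prompt_chars: int = 200,
                    step: int = 10) -> float:
    """Fraction of 10-character continuations the model reproduces verbatim.
    Every prompt AND every ground-truth answer is taken from `book` itself."""
    hits = total = 0
    for i in range(0, len(book) - prompt_chars - step, step):
        prompt = book[i : i + prompt_chars]                       # from the book
        truth = book[i + prompt_chars : i + prompt_chars + step]  # also from the book
        if complete(prompt, step) == truth:
            hits += 1
        total += 1
    return hits / total if total else 0.0
```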
So the process presupposes access to the full book, which undermines any claim that the model itself contains or reproduces the book. This in itself already disqualifies the study; it fabricates its own evidence.

But where they crossed the line from mistake to methodological misconduct is in their core statistical logic. They slide a window of 50 tokens and then multiply the probability of each token to assert that the overall sequence probability should be minuscule. This is false. We are not dealing with independent variables but with constrained ones, constrained by grammar, style, and context. Their multiplication assumes statistical independence where none exists, manufacturing surprise where linguistic coherence naturally prevails.
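A toy calculation (made-up probabilities, not measurements) of the gap this paragraph is pointing at: tiny products only appear if you pretend each token is an independent draw from the whole vocabulary; conditional probabilities in fluent text are high, so the product of the real, dependent quantities is nothing remarkable.

```python
import math

n_tokens = 10

# "Independent tokens" view: each word is one of ~10,000 equally likely choices,
# so the sequence probability collapses toward zero and everything looks miraculous.
log_p_independent = sum(math.log(1e-4) for _ in range(n_tokens))

# Dependent view: grammar, style, and context constrain each next token,
# so plausible conditional probabilities are high and the product stays unremarkable.
log_p_conditional = sum(math.log(0.6) for _ in range(n_tokens))

print(round(log_p_independent, 1))  # about -92.1: "astronomical" surprise, by assumption
print(round(log_p_conditional, 1))  # about -5.1: ordinary fluent text
```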
They then allege that this probabilistic retrieval proves the model is a copy or derivative of the book. But there is no book. The model doesn’t store it, reproduce it in its entirety, or emit it without the user already supplying the majority of the text. What exists is a predictive function conditioned on patterns, not a container of full works. By their logic, speaking English is potential piracy. The Stanford legal team seems to have forgotten that for something to be a derivative, it must exist in some substantive form, not be coaxed out probabilistically, fragment by fragment and remaining a fragment overall, by those who already possess the original.
They throw around math to signal credibility while smuggling in falsehoods and their own bias.
“We compute the log-probability of a given sequence under a model’s output distribution. If that number is high, we claim the model has memorized the sequence.”
But this is misleading: the model is only asked to complete a sentence with a word, which means it is constrained in its choices. This is like saying: complete the phrase ‘how are XXX.’ Saying ‘you’ is now evidence that you recall this from some text in a book. But the possibilities to continue in English from ‘how are...’ are limited (e.g., you, they, we). You could not really continue with ‘moon,’ because it would be ungrammatical. In natural language, words constrain each other. Grammar, syntax, and style enforce dependencies. So high probability isn’t surprising; it’s expected. This basic error mischaracterizes how language models work.
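A toy next-word distribution (numbers invented for illustration) makes the point: a high score for ‘you’ after ‘how are’ reflects English grammar, not recall of any particular book.

```python
# Invented next-word distribution after the prompt "how are";
# grammar concentrates nearly all probability mass on a handful of continuations.
next_word = {
    "you": 0.72,
    "they": 0.11,
    "we": 0.08,
    "things": 0.05,
    "moon": 0.0001,   # ungrammatical continuations get essentially no mass
}

# A high log-probability for "you" here is evidence of grammar, not memorization.
print(max(next_word, key=next_word.get))   # -> "you"
```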
They also say:
“Top-k sampling truncates the distribution. If a token is outside the top-k, its probability is 0.”
But then they conclude: “This is what memorization is: high-probability sequences.”
This is circular. They’ve defined:
Memorization = high probability. High probability = sequence is extractable. Therefore: memorization = extractable.
Thus: memorization = memorization.
What’s happening in practice is this: they give the model an unfinished sentence and ask it to predict the next word, but they only consider the 40 most likely candidates (top-k sampling with k = 40). If the actual continuation in The Great Gatsby uses a word outside that top 40, it cannot be recovered. The model will never generate it.
This means: Most sequences are unextractable, unless they consist entirely of top-k tokens. Any long-tail or rare phrase is inaccessible.
Therefore: There is no viable extraction path unless the model is already biased to generate it based on the structure of the sentence they provided.
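A minimal NumPy sketch of what that truncation does; only k = 40 comes from their setup, the vocabulary size and probabilities are made up. Any genuine continuation that falls outside the top 40 gets probability zero and simply cannot be sampled.

```python
import numpy as np

def top_k_filter(probs: np.ndarray, k: int = 40) -> np.ndarray:
    """Zero out everything outside the k most likely tokens and renormalize,
    which is what top-k sampling does to the output distribution."""
    filtered = np.zeros_like(probs)
    top = np.argsort(probs)[-k:]
    filtered[top] = probs[top]
    return filtered / filtered.sum()

# Toy next-token distribution over a 50,000-word vocabulary (made-up numbers).
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.full(50_000, 0.01))

# Suppose Fitzgerald's actual next word ranks around 200th most likely in context.
true_next = int(np.argsort(probs)[-200])
print(top_k_filter(probs, k=40)[true_next])   # 0.0 -> it can never be generated
```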
And here’s the core fraud: the prediction method only works when given an unfinished sentence. This is critical. The paper never evaluates whether the model can generate a valid next sentence from just the previous one, a much harder and more open-ended task. All their tests are conditioned on carefully crafted unfinished sentences from within known books. There is no instance where the model is given a novel context and asked to continue freely, which would simulate actual text generation rather than constrained completion.
They also do not explicitly acknowledge that their “extractions” only succeed under such prompt scaffolding. There is no discussion of the increased entropy or combinatorial explosion that would occur if prompts were open-ended.
While they admit that top-k sampling restricts possible outputs (e.g., that sequences with tokens outside the top k cannot be generated), they continue to use this restricted sampling space to support broad claims about memorization and infringement.
They define a narrow, biased condition (sentence completion with known inputs and top-k constraints) and present it as evidence of generalized memorization and legal risk. But they contrive this problem to guarantee an outcome, then use statistical language to mask it as empirical observation.
This is not a scientific argument. This is not just a mistake.
The statistical model borrows heavily from a paper called ‘The Files are in the Computer: Copyright, Memorization, and Generative AI’, co-authored by a Postdoctoral Affiliate at Stanford University who is also an Assistant Professor of Computer Science at Yale University and a hobby lawyer. That paper is equally unreliable, particularly in its legal assessment, when it says:
“First, regurgitation is copying: it involves the creation of a copy of training data as the output of a model [..] . (It follows a fortiori that extraction is also copying, since extraction is regurgitation plus intent.) More precisely, regurgitation is what a copyright lawyer would call literal copying: the near-exact replication of (potentially a substantial) portion of a work. Literal copying is not the only viable theory of copyright infringement (courts have also found infringement based on non-literal or fragmented similarities), but it is the simplest and most straightforward.
When we say that regurgitation is copying, we are using “copy” as a term of art from copyright law. The Copyright Act states that “copies” of a copyrightable work are “objects . . . from which the work can be perceived, reproduced, or otherwise communicated.” Under this definition, if I have a Blu-Ray disc of Barbie (2023), it is a “copy” of the audiovisual work Barbie, because it can be “perceived” by playing it in a Blu-Ray player.”
Except there is no DVD of The Great Gatsby inside the LLM.
“Die Luft der Freiheit weht” (“The wind of freedom blows”): that’s Stanford’s official motto.
In German, actually. Ok, that is a fine choice indeed.
Their historically uninformed website, however, claims the author of this expression (Ulrich von Hutten) was a German humanist. Really? A humanist? He was a knight of the Holy Roman Empire, a pamphleteer, and a political agitator during the Reformation. He was no civic humanist. I am not saying he was evil either, but you know.
Stanford, Stanford — what have you become?