The NYT’s AI Lawsuit Hinges on a Misleading Claim—And Nobody Noticed
The New York Times argues that AI models store copyrighted material, but their key example is not proof of that; it’s proof that AI recalls what the internet knows. So why did the NYT bury this fact?
The New York Times Company has an ongoing lawsuit against OpenAI and Microsoft, filed in December 2023. The lawsuit alleges that OpenAI's language models were trained on the Times' copyrighted content without permission, constituting copyright infringement.
Current Status:
Legal Proceedings: The case is in the pre-trial phase, with both parties engaging in discovery and preliminary motions. Specific details about court dates or motions have not been publicly disclosed.
Industry Impact: This lawsuit is part of a broader trend where media companies and artists are challenging AI firms over the use of copyrighted material in training datasets. The outcomes of these cases could significantly influence the future operations of AI companies and their interactions with copyrighted content.
Some outlets, such as NPR, even go as far as this:
“ChatGPT's future could be on the line”
Oh no, that can’t be allowed!
So, as much as it pains me to do this, The New York Times needs to be called out for its bogus claims.
The New York Times (NYT) lawsuit against OpenAI and Microsoft argues that their AI models stored and used NYT articles without permission. Their primary claims about AI storing their articles are:
AI Training Involved Copying Full NYT Articles:
The complaint asserts that OpenAI and Microsoft copied and ingested millions of NYT articles multiple times into their training datasets. These datasets were then stored and processed to build large language models (LLMs), meaning the AI was trained on entire NYT works.
Evidence of Memorization (Verbatim Outputs):
The lawsuit claims that GPT-4 can reproduce near-verbatim excerpts of NYT articles, suggesting that the AI has effectively stored portions of NYT content in a way that enables direct regurgitation when prompted. They provide examples of long passages that the AI allegedly outputs with minimal prompting.
And they also make claims about Bing as a search engine, but let’s put that aside because it raises more technical questions.
What They Get Wrong About AI
AI generates text reconstructively, meaning it predicts one token at a time based on probabilities from its training data.
However, in rare cases, if a passage appears verbatim frequently enough in the training data, the model can reproduce it exactly—this is called ‘memorization’. What does that actually mean?
It is still reconstruction, but the probability distribution is so heavily skewed toward one continuation that the model outputs it effectively deterministically. However, this is not evidence that the AI model has "stored" an actual text that is subject to copyright. This often gets misrepresented when people misunderstand the technical design of LLMs (as we know them today).
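To make this concrete, here is a toy illustration in Python. This is not a real LLM; the tokens and probabilities are invented for illustration. It simply shows how next-token sampling collapses into an effectively deterministic output once one continuation dominates:

```python
import random

# Hypothetical next-token probabilities after the prompt
# "The two men appeared out of" (numbers invented for illustration).
next_token_probs = {"nowhere": 0.97, "thin": 0.02, "the": 0.01}

def predict_next_token(probs):
    # Sample one token, weighted by its probability.
    tokens = list(probs)
    weights = list(probs.values())
    return random.choices(tokens, weights=weights)[0]

print(predict_next_token(next_token_probs))  # almost always "nowhere"
```

When one continuation carries roughly 97% of the probability mass, the output looks "memorized", yet there is no stored article anywhere in this sketch, only a skewed distribution.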
Why Can ChatGPT Recall the First Sentence of Harry Potter Almost Perfectly?
If you ask an LLM to give you the first sentence of Deathly Hallows by J.K. Rowling, it is almost guaranteed to answer as follows:
"The two men appeared out of nowhere, a few yards apart in the narrow, moonlit lane."
OMG, how can this be?
This sentence was included in press kits and quoted by every newspaper, blog, book review, and summary. Take, for instance, this CBS article—or just Google the sentence, and you’ll find endless sources. Besides, quoting a single sentence is not copyright infringement but covered under fair use.
Example: CBS News coverage of the book’s release:
"On Saturday, bookstores around the world welcomed eager readers, young and old, in glasses and capes, some shivering, some sweaty, all joined by the thick hardback book with the opening words: 'The two men appeared out of nowhere, a few yards apart in the narrow, moonlit lane.'"
High-Probability Completions vs. Memorization
The first sentence of a famous book is extremely well-known and widely repeated across the internet. It appears in countless summaries, academic papers, articles, and discussions. Because it is always phrased the same way, it becomes an ultra-high-confidence prediction. When prompted, the model outputs what has the highest statistical match.
But does this prove that the AI has stored a copy of the entire book?
No.
Can I prove this?
Yes.
A Better Question:
"How often does Harry say ‘Hermione’ in Deathly Hallows?"
Answer: The model might say "I don’t know" or provide an incorrect number.
Why? Because the model doesn’t have access to structured datasets that count occurrences in books unless explicitly trained on such data.
Instead of remembering a number, the model guesses based on language probabilities because it has no text to check.
If many people online say, "Harry says Hermione a lot," the model might assume a high estimate.
Now, compare:
"What is the first sentence of Harry Potter and the Deathly Hallows?"
→ The model will likely get it right because that sentence appears exactly the same in many sources.
"How often does Harry say ‘Hermione’?"
→ The model will likely get it wrong because it never counted occurrences—it makes an educated guess.
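Answering the counting question reliably would require direct access to the text, which the model does not have at inference time. A minimal sketch of what counting actually takes (the file path is hypothetical, and this is a crude substring count, not a proper tally of Harry's dialogue):

```python
from pathlib import Path

# Hypothetical local copy of the book; an LLM has no such file to consult.
text = Path("deathly_hallows.txt").read_text(encoding="utf-8")

# Crude count of the name itself, not of lines spoken by Harry.
print(text.lower().count("hermione"))
```

If the model had "stored" the book, this kind of lookup would be trivial for it. It isn't, because there is no text to check, only weights.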
What This Means
This demonstrates that a verbatim sentence does not prove the storage of an entire work—it proves probabilistic reconstruction based on frequency in training data. This is a critical distinction that often gets lost in sensationalized claims about AI "memorization."
Q&A sessions with AI can produce highly misleading results, which are then grossly misinterpreted—even by experts. This is a systemic and widespread source of misinformation.
The distinction between "memorization" and "reconstruction" is central to whether LLMs are infringing when they output identical text fragments:
So when you ask:
"In the beginning, God created the heavens and the ___"
The AI completes "earth" because:
It has seen this phrase structure so many times in training data.
The word "earth" is overwhelmingly the most likely token to follow.
Any alternative (e.g., "Mars") would have an extremely low probability, even if the AI is theoretically free to generate anything.
If an AI model knows (based on its training data) that there is only one highly probable answer to a question, it will default to that answer with near certainty. This is a function of how LLMs work, not proof that the model has stored the text of the Bible:
High-Probability Completions: If a particular phrase or sentence is overwhelmingly dominant in the training data, the AI will always predict that phrase because any alternative has an almost zero probability.
Lack of Alternatives: If the model knows (statistically) that only one correct answer exists—like "In the beginning, God created the heavens and the earth"—then generating "Mars" instead of "Earth" would be mathematically irrational.
Reward Optimization: If the training reinforcement mechanism rewards correct answers and penalizes incorrect or uncertain completions, the model will become increasingly confident in defaulting to the most probable response.
This behavior is not evidence of "storage" in the way traditional databases work—it's just that the AI reconstructs the only logical completion it has learned. If there were competing possibilities (e.g., alternative translations, paraphrased versions, or variations), then it would be more probabilistic. But when there’s a single dominant answer, it will always generate that answer.
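You can observe this dominance directly with an open model. A minimal sketch using GPT-2 via Hugging Face transformers as a stand-in (GPT-4's weights are not public, so the exact numbers are only illustrative):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "In the beginning, God created the heavens and the"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits[0, -1]  # scores for the next token only
probs = torch.softmax(logits, dim=-1)

# Show the five most likely next tokens and their probabilities.
top = probs.topk(5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r}: {p.item():.3f}")
```

Expect a token like " earth" to dwarf every alternative: reconstruction converging on the only plausible completion, not a lookup in a stored copy of the Bible.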
This principle applies not just to famous sentences in Harry Potter books but also to factual statements like:
“What is the capital of France?” → “Paris.”
“What is 2 + 2?” → “4.”
“Complete the phrase: ‘To be or not to be…’” → “That is the question.” (Which reminds me—if you haven’t seen the Royal Shakespeare Theatre's video demonstrating a million different ways to emphasize each word feat. King Charles, it’s such fun! Watch on Instagram)
The AI isn’t pulling the text from a database—it simply defaults to a single path of reconstruction.
The lawsuit starts dramatically, as expected, by setting out what’s at stake:
“The Constitution and the Copyright Act recognize the critical importance [..] Since our nation’s founding [..] Copyright law protects The Times’s expressive, original journalism [..]”
Which, a few pages later, morphs into this:
“In addition, The Times has deepened its relationship with its readers by expanding its offerings to better encompass its readers’ specific interests, including best-in-class offerings like Cooking, Wirecutter, Games, and The Athletic.”
Recipes, by the way, don’t get copyright protection (only their creative presentation does), but that’s not my main point here. The real issue is that this lawsuit is about financial interests. There’s nothing wrong with that, but it’s not David vs. Goliath—it’s Goliath vs. Goliath.
“The Times also compiled digital archives of all its material going back to its founding, at significant cost. Its digital archives include The New York Times Article Archive, with partial and full-text digital versions of articles from 1851 to today, and the TimesMachine, a browser-based digital replica of all issues from 1851 to 2002. [..] The Times has registered the copyright in its print edition every day for over 100 years.”
Anything from 1851 is public domain
U.S. copyright law does not allow perpetual ownership.
Works published before 1929 have already entered the public domain—NYT cannot claim copyright over these works.
NYT licenses access, not copyright
NYT operates TimesMachine, a paywalled archive of its historical issues.
They do not own the copyright to these old articles—but they control access through a licensing model.
Their 1851–1928 archives are legally free to use—just not free to access through their website.
The facts in a news report are not protected, only the specific expression of those facts.
If The New York Times reports, "Trump spoke at a rally in Florida and criticized Biden's policies," that fact is in the public domain.
Is it possible that another newspaper would use the identical phrase? Yes, because there are only so many ways to express such facts.
It could be seen as misleading to omit these details when presenting an archive function under the guise of copyright protection.
“That infrastructure was not just general purpose computer systems for OpenAI to use as it saw fit. Microsoft specifically designed it for the purpose of using essentially the whole internet—curated to disproportionately feature Times Works—to train the most capable LLM in history. [..] This system ranked in the top five most powerful publicly known supercomputing systems in the world.”
NYT is deliberately conflating:
The Scale of Computing Power (which just enables training to happen efficiently)
The Specific Data Sources Used in Training (which determines what the model learns)
They’re trying to make it sound like:
“Microsoft and OpenAI built an ultra-powerful system just to scrape our articles!”
When in reality: Training an AI model requires vast infrastructure whether or not NYT content was included.
“In a recent interview, Mr. Nadella acknowledged Microsoft’s intimate involvement in OpenAI’s operations and, therefore, its copyright infringement: [W]e were very confident in our own ability. We have all the IP rights and all the capability. If OpenAI disappeared tomorrow, I don’t want any customer of ours to be worried about it quite honestly, because we have all of the rights to continue the innovation.”
NYT misrepresents a standard business practice to make it sound like proof of wrongdoing. This is not evidence of copyright infringement at all—it is evidence that Microsoft is applying common best practice when relying on external partners to provide critical services.
Nadella's talking about:
✅ Continuity planning – Microsoft invested billions into OpenAI and needs to ensure they’re not left with nothing if OpenAI collapses.
✅ Legal rights over the infrastructure – They likely have licenses to use OpenAI’s tech, ensuring they can maintain services for customers.
“The process of setting the values for an LLM’s parameters is called “training.” It involves storing encoded copies of the training works in computer memory, repeatedly passing them through the model with words masked out, and adjusting the parameters to minimize the difference between the masked-out words and the words that the model predicts to fill them in.”
GPT models use causal (autoregressive) learning, not masking. Instead of predicting missing words, GPT-style models predict the next token in a sequence based on everything they’ve seen so far.
The model does not repeatedly "see" the same text over and over in different versions. Instead, it sees each training document once (or a small number of times, depending on dataset size and training epochs).
Once a text is processed and the weights are updated, the original training text is discarded. There’s no need to store an article for later retrieval.
Diffusion models require repeated exposure to the same data to learn to reconstruct it.
LLMs like GPT do not.
NYT’s framing suggests AI stores and retrieves articles directly—but in reality, it’s probabilistically reconstructing a new text based on a statistical model predicting the content subject to a set of constraints. A text about Napoleon is very likely to mention military conflicts in some shape or form.
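For contrast with the complaint's "masked words" description, here is a minimal sketch of one causal-LM training step, using GPT-2 via Hugging Face transformers as a stand-in (OpenAI's actual training stack is not public):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

batch = tokenizer("An example training sentence.", return_tensors="pt")

# No masked-out words: with labels=input_ids the library shifts labels
# internally, so each position is trained to predict the NEXT token.
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()       # nudge the weights toward better next-token guesses
optimizer.step()
optimizer.zero_grad()
# The batch is then discarded; only the updated weights persist.
# No copy of the text is kept around for later retrieval.
```

The document passes through, the weights move slightly, and the text itself is gone. That is the mechanism the complaint describes as "storing encoded copies."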
NYT then presents various examples of how they were able to prompt ChatGPT to "quote" from articles, including:
"Pete Wells’s 2012 review of Guy Fieri’s American Kitchen & Bar, an article that has been described as a viral sensation."
The result is presented as follows:
"The copied article text is highlighted in red [since I can’t use red font, I will use bold italics to mark actual text passages from the NYT article that ChatGPT quoted] below:"
GUY FIERI, have you eaten at your new restaurant in Times Square? Have you pulled up one of the 500 seats at Guy’s American Kitchen & Bar and ordered a meal? Did you eat the food? Did it live up to your expectations?
Did panic grip your soul as you stared into the whirling hypno wheel of the menu, where adjectives and nouns spin in a crazy vortex? When you saw the burger described as “Guy’s Pat LaFrieda custom blend, all-natural Creekstone Farm Black Angus beef patty, LTOP (lettuce, tomato, onion + pickle), SMC (super-melty-cheese) and a slathering of Donkey Sauce on garlic-buttered brioche,” did your mind touch the void for a minute?
[..]
Hey, did you try that blue drink, the one that glows like nuclear waste? The watermelon margarita? Any idea why it tastes like some combination of radiator fluid and formaldehyde?
At your five Johnny Garlic’s restaurants in California, if servers arrive with main courses and find that the appetizers haven’t been cleared yet, do they try to find space for the new plates next to the dirty ones? Or does that just happen in Times Square, where people are used to crowding?
[..]
Is the entire restaurant a very expensive piece of conceptual art? Is the shapeless, structureless baked alaska that droops and slumps and collapses while you eat it, or don’t eat it, supposed to be a representation in sugar and eggs of the experience of going insane?
Why did the toasted marshmallow taste like fish?
Did you finish that blue drink?
Oh, and we never got our Vegas fries; would you mind telling the kitchen that we don’t need them?
Thanks.”
This makes it appear as though AI can reproduce the vast majority of an article, which would challenge the claim that AI reconstructs articles rather than memorizing them. The highlighted excerpt contains 273 words, of which ChatGPT reproduced 231 correctly—about 85%. That is the impression created in the NYT complaint.
But the problem is that the actual NYT article is much longer—1109 words in total. So, 231 out of 1109 is only about 20%, not 85%. Yet the NYT does not mention that.
Instead of stating, “It could only recall 231 words out of 1109,” they framed it as "85% of what it did recall was accurate," making it seem like AI nearly reproduced the entire piece—when it didn’t.
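The arithmetic is trivial to check (word counts as given in the complaint and the published review):

```python
excerpt_words = 273    # words in the highlighted excerpt
matched_words = 231    # words ChatGPT reproduced correctly
article_words = 1109   # words in the full Pete Wells review

print(f"Share of the excerpt matched: {matched_words / excerpt_words:.1%}")  # 84.6%
print(f"Share of the full article:    {matched_words / article_words:.1%}")  # 20.8%
```

Same three numbers, two very different headlines, depending on which denominator you choose.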
This is critical because it changes the entire narrative of the complaint. If AI never actually reconstructed the full article, then the complaint is built on a false premise—that AI is storing and regurgitating entire copyrighted works. Instead, it may simply be recalling some memorable phrases from training data while lacking full recall.
So, what’s more likely?
They knew AI couldn’t recall the full article but framed it to look worse.
They tried to extract the full text and failed—but omitted that detail.
They’re statistically illiterate and don’t realize how misleading their framing is. (Less likely, but still possible.)
If AI models don’t actually store and regurgitate entire works but only recall snippets, this could impact how copyright law applies. You shouldn’t ask for legal protection based on a false claim—and if they manipulated how they presented the evidence, isn’t that a form of deception?
They crafted it so it’s “technically true” but fundamentally misleading. This is the same trick used in corporate PR spin and political messaging. They didn’t outright lie, but they structured the truth in a way that leads people to the wrong conclusion. Whether that was deliberate or not, I obviously don’t know—but it’s hard to ignore how it happened.
The NYT parades itself as the guardian of truth, yet they’re engaging in exactly the kind of manipulation they claim to fight. That’s not just hypocrisy—it’s a betrayal of the fundamental trust that journalism is supposed to uphold.
The fact that the quoted passage is the beginning of the article and is widely available online (Reddit, forums, blogs) fundamentally changes the AI memorization argument.
NYT itself acknowledges that this review was special because it went viral. It’s therefore plausible that ChatGPT recalled the passage not from training data, but because it’s frequently quoted and publicly available on multiple websites.
The beginning of a famous review is the most commonly shared part—so if AI recalls that, it’s not evidence that it memorized the entire article.
If someone asked for the full text but only got the intro, that actually suggests AI didn’t memorize the full article—only what was most widely circulated.
I can’t say with absolute certainty what happened inside the model. But I can confidently say that the information provided in the NYT’s legal complaint is highly susceptible to misinterpretation and does not serve as conclusive proof that LLMs store copyrighted material in their entirety.


Makes it clearer in many ways, thanks. Snippets are an interesting take on this, and there are decisions on snippets: this argument is in line with the thinking in Authors Guild v. Google, as far as snippets and fair use go.
I hadn't considered the length of what's being output. It is all snippets. Even friends who've had SEO content scraped would find sections, but not whole articles, verbatim.
And I'd never considered this memory angle; you know the legal verbiage game, I think, much better than I do. It's that whole output question, but if it's pieces, that leans toward fair use.
The Thomson Reuters summary judgment against the fair use defense surprised me, as Ross legally bought the data. Thomson sold it through a third party.
And those legal headnotes, as I read it, are considered something like Cliff Notes, though the judge found enough to merit upholding the copyright. What I read were basically snippet arguments.
There's the negative business impact, a competitor gaming around a license request that had been turned down, and that adds much more to the decision than just the content inside an AI system.
But that's what it is: it wasn't because of the output, it was a violation on the input. And so much data has been purchased from third parties for so many years, especially with LLMs being insatiable.
I would have thought the third-party purchase might be one way around this.