Truth Be Told: Why AI Research is Stuck in a Loop And The Crisis of Performative Competence (Part 6)
How legal AI, governance, and ethics all suffer from the same fundamental flaw: We mistake structure for substance, fluency for truth, and pretend that hard choices don’t exist.
AI will revolutionise law, governance, and ethics—making decisions faster, more efficient, and even fairer. Or so the story goes. However, the reality is far messier.
We expect AI to improve legal reasoning, yet studies show that what it really does is make people sound more competent—rewarding fluency over accuracy. We demand that AI align with human values, yet we refuse to define those values explicitly, pretending that ethical dilemmas are merely technical problems.
This article examines how AI governance, legal AI research, and ethical debates are riddled with contradictions. By looking at a recent legal AI study, I’ll show how flawed assumptions lead to misleading conclusions—and why the real crisis isn’t AI itself, but our unwillingness to think deeply about what we’re actually asking it to do.
The Cranes of Ibycus
Friedrich Schiller tells the story of the murder of Ibycus on his way to the Isthmian Games in his ballad Die Kraniche des Ibykus (The Cranes of Ibycus).
"Von euch, ihr Kraniche dort oben,
Wenn keine andre Stimme spricht,
Sei meines Mordes Klag erhoben!"
Er ruft es, und sein Auge bricht.

"From you, ye cranes that are up yonder,
If not another voice doth rise,
Be rais'd indictments for my murder!"
He calls it out, and then he dies.
The Isthmian Games were one of the four major Panhellenic Games in Ancient Greece, alongside the Olympic Games, Pythian Games, and Nemean Games. Together, they formed what was known as the "periodos" (περίοδος)—a sacred cycle of competition, a term that originally described traveling in a circuit and returning to the starting point. It is, of course, the root of our modern word "period" (as in a repeating cycle).
Each of these four Games was characterized by three defining elements:
Periodos (περίοδος) – Circuit Games
The Panhellenic circuit (περίοδος) consisted of these four festivals, which athletes traveled between, competing in each.
Stephanos (στέφανος) – Crown Games
These were "stephanitic" (crown-giving) festivals, meaning the prize was a symbolic crown rather than monetary rewards.
Agon (ἀγών) – Competitive Games
These were agonistic festivals, structured across three contest categories:
Musical contests (μουσικός ἀγών) – singing, poetry, and instrumental performances.
Athletic contests (γυμνικὸς ἀγών) – wrestling, running, boxing, pankration, and the pentathlon.
Equestrian contests (ἱππικὸς ἀγών) – chariot races and horseback riding.
The hippikós agṓn (ἱππικὸς ἀγών) was not merely a display of status—it was a public demonstration of virtue (ἀρετή) and honor (τιμή). While the ideal Greek elite needed wealth to afford horses—demonstrating their readiness for battle—they were equally expected to prove themselves as models of excellence.
The Meaning of Going in Circles
In ancient Greece, going in circles (περίοδος) was not a failure—it was a sacred structure, a path to mastery. A leader was not just a strategist but an artist, a warrior, and a model of virtue—skill and wisdom together.
In modern times? Going in circles is a failure.
It means stagnation, bureaucracy, and aimlessness. Leadership is no longer about art, skill, and virtue together—we have split them apart, and in doing so, we have lost all three.
What Do We Have Instead?
Athletic contests—and branding deals with Coca-Cola.
Did we shed the superstition and mystery that once underpinned these games?
No. We dropped the meaningful part and kept the hollow spectacle.
But the rituals never disappeared.
We still practice magic—we just no longer recognize it.
And now, this unacknowledged magic has seeped into science, business, and politics, shaping the very structures we claim are rational.
The ancient cycle of mastery (περίοδος) was about returning to the beginning wiser than before.
Today, going in circles is seen as a failure—but we still go in circles without realizing it.
We believe in rationality, science, and optimization—yet much of what we do is ritualistic, performative, and superstitious.
The Performative Debate on AI Literacy
I came across two interesting articles on Substack that exemplified these issues. The first discussed to what extent anti-establishment backlash is, in fact, a result of establishment failures. It has been a very long time since I saw someone describe that issue in a nuanced way without losing clarity, and frame it in terms that resonate with me.

The second article was about AI literacy. Despite its title, it does not explain how AI works beyond legal and compliance jargon: there is no discussion of tokenization, embeddings, inference models, or probability-driven outputs. AI literacy, it seems, is framed as a compliance requirement rather than an actual understanding of AI. This kind of "AI literacy" talk leads to bureaucratic control, not real innovation. Governance is important—but if you don't understand how AI works, your policies are meaningless. It is safety training on an airplane, when what we need is flight school for real pilots.
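For contrast, here is the kind of thing real AI literacy would cover: a minimal toy sketch (the vocabulary and probabilities are invented for illustration, not taken from any actual model) of what "probability-driven outputs" means. A language model does not look up facts; at each step it samples the next token from a probability distribution over plausible continuations.

```python
import random

# Toy next-token distribution for the prefix "The court held that the contract is".
# The vocabulary and probabilities are invented purely for illustration.
next_token_probs = {
    "enforceable": 0.46,
    "void": 0.31,
    "ambiguous": 0.14,
    "binding": 0.09,
}

def sample_next_token(probs):
    """Pick one token in proportion to its probability, as an LLM does at every step."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

prefix = "The court held that the contract is"
for _ in range(3):
    print(prefix, sample_next_token(next_token_probs))
# The model produces a plausible continuation, not a verified one, which is why
# fluency and accuracy can come apart.
```

Anyone who has internalized even this toy version understands why a fluent answer and a correct answer are two different things, which is the thread running through everything that follows.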
The AI literacy article did, however, point me to a paper by several authors from the University of Michigan and the University of Minnesota law schools called "AI-Powered Lawyering: AI Reasoning Models, Retrieval Augmented Generation, and the Future of Legal Practice," which it calls a "must-read for lawyers using AI." That law school paper perfectly embodies my concern that our scientific research and understanding of digital technologies are often shallow, misleading, and performative.
The study tested six legal tasks with three different conditions:
No AI (two tasks)
Retrieval-Augmented Generation (RAG)-powered legal AI, Vincent AI (two tasks)
An AI reasoning model, OpenAI o1-preview (two tasks)
The study claims that AI reasoning models improve legal analysis while RAG reduces hallucinations, yet its own data raises serious doubts. For example, they define hallucinations strictly as citations to fabricated sources, yet even the No AI group supposedly produced hallucinations—something that should be impossible. This inconsistency suggests either flawed grading or an arbitrary distinction between human and AI errors. If misattributed legal references from humans weren’t counted as hallucinations, but similar AI errors were, then the study’s findings are skewed. Rather than proving AI's unique flaws, the evidence actually highlights biases in how AI-generated vs. human-generated mistakes are assessed.
The study states that 127 students successfully completed the experiment, yet the tables show task-specific sample sizes ranging from 126 to 135. If all students were required to complete six assignments, why do the per-task counts fluctuate, and how can any task show more submissions than there were participants? The paper does not address this discrepancy, which points to missing data, incomplete submissions, or some other methodological gap left unexplained.
The study presents its findings as evidence that AI improves legal work, but its own methodology raises significant questions about what is actually being measured. The regression model includes control variables like GPA, law school year, and prior AI use, yet we are not given transparency on how much these factors influenced the results. If these controls significantly impact outcomes, then the study may not be measuring the effectiveness of AI at all—but rather how well their control variable definitions capture student ability. Worse, the error term accounts for all unknown influences, meaning that unmeasured factors—like writing skill, motivation, or even random variation—could be skewing the results. Without clearer reporting on these effects, we cannot determine whether AI had a meaningful impact or whether the study’s conclusions are simply an artefact of flawed assumptions.
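To make the concern concrete, here is a rough sketch of the kind of specification the paper appears to rely on. The variable names and the synthetic data below are my own assumptions, not the authors' code; the point is only that once controls and an error term enter the model, you need to see how much of the variance each one absorbs before crediting AI with the result.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 127  # roughly the reported sample size

# Synthetic stand-in data; none of these values come from the study.
df = pd.DataFrame({
    "ai_condition": rng.integers(0, 2, n),   # 1 = AI-assisted task
    "gpa": rng.normal(3.3, 0.3, n),
    "law_year": rng.integers(1, 4, n),
    "prior_ai_use": rng.integers(0, 2, n),
})
# Everything unmeasured (writing skill, motivation, plain noise) lands in the error term.
unmeasured = rng.normal(0, 1.0, n)
df["quality"] = 3 + 0.2 * df["ai_condition"] + 1.5 * df["gpa"] + unmeasured

model = smf.ols("quality ~ ai_condition + gpa + law_year + prior_ai_use", data=df).fit()
print(model.summary())  # how much do the controls and the residual explain vs. the AI dummy?
```

Reporting only that the AI coefficient is positive, without the effect sizes of the controls or the residual variance, leaves readers unable to tell which term is doing the explanatory work.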
The study claims to assess AI’s impact on efficiency, but its methodology relies on self-reported time tracking, which is notoriously unreliable. Participants may have counted thinking time, research, and writing differently—some may have included breaks, while others only logged active typing. Without an objective measure of time-on-task, the study cannot reliably determine whether AI actually improves efficiency. This is particularly problematic given that the task was subject to a time limit, yet participants were merely asked to recall and report their time. Such self-reported metrics are inherently inconsistent and risk inflating or deflating AI’s perceived impact based on differences in how individuals perceive and record their work.
The study presents a serious contradiction:
It instructs participants to ensure their work does not appear AI-generated, yet hallucinated sources—one of the most obvious indicators of AI misuse—were included in submitted assignments. In real legal practice, submitting fabricated citations could result in malpractice claims, sanctions, or even criminal liability. However, the study does not account for these consequences in its grading. Instead, AI-assisted assignments were rated as higher quality despite containing errors that, in real-world practice, could be catastrophic. This raises concerns about whether the study truly measures AI’s effectiveness in law or simply evaluates stylistic improvements while overlooking fundamental legal accuracy.
Even more concerning, while the authors acknowledge that depth and nuance of legal analysis are the most critical aspects of legal work, they did not weight these factors differently in their scoring. This raises serious questions about whether AI truly improved legal reasoning—or simply made responses appear more polished while failing to ensure sound legal argumentation. If a model produces a well-organized but legally flawed argument, is that really a step forward for AI-powered lawyering?
The study’s claims about AI “raising the floor” obscure the real issue: AI improved presentation and structure, but its impact on legal reasoning and accuracy was inconsistent at best. The authors admit that AI was less effective on tasks requiring independent legal judgment, meaning it only excelled when the problem was already well-defined. This reinforces a crucial limitation for now—AI is not replacing deep legal expertise; it’s just making responses look more polished. The paper frames AI as a safety net, yet also acknowledges that it did not consistently improve accuracy, which is the only safeguard that truly matters in law.
The study presents a cleaner-looking legal document as progress, but in legal practice, a well-structured argument with incorrect citations is worse than a messy one that is legally sound.
What This Means
o1-preview (Reasoning Model) did not make students better at legal analysis—it made them sound more competent.
The grading process rewarded rhetorical fluency and structure over factual accuracy.
How do they not see the problem?
The paper lacks any grounding in AI’s technical foundations, preventing the kind of rigorous analysis needed to make real conclusions.
AI’s Core Limitations in Legal Reasoning
AI checks logical consistency at a local level (within a few sentences or paragraphs).
AI does not track long-term dependencies in contracts (e.g., if a definition on Page 1 is invalidated by a clause on Page 10).
AI does not re-evaluate earlier statements when new information is introduced later—it generates text linearly, not iteratively.
Why AI Struggles with Contracts
Reasoning Models (GPT-like AI)
AI does not persistently track contract definitions throughout a document.
It does not scan the full document to check if earlier terms have been invalidated.
Instead, it predicts the next most probable token—without verifying prior logic.
RAG (Retrieval-Augmented Generation)
RAG can fetch legal references but does not enforce contract-wide consistency.
It retrieves definitions if explicitly prompted but does not automatically ensure coherence across clauses.
This is why AI performs well in litigation-focused tasks (e.g., writing a persuasive letter) but struggles with transactional tasks (e.g., drafting a contract).
AI writes contracts as if they were novels—but contracts must be structured like code.
There is a fundamental limitation to solving this through better AI models alone.
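A crude way to see the failure mode (a toy sketch, not a description of any real legal AI product): when a long contract is processed in limited-context chunks, a definition on page 1 and the clause that overrides it on page 10 may never be evaluated together, so a purely local consistency check passes while the document as a whole is incoherent.

```python
# A definition and the clause that overrides it sit far apart, so any check that
# only looks at one chunk at a time never sees the conflict.
contract = [
    '1. "Affiliate" means any entity controlled by the Seller.',                 # page 1
    "2. The Seller shall indemnify the Buyer for losses caused by Affiliates.",
    '10. Notwithstanding Section 1, "Affiliate" excludes entities acquired after Closing.',  # page 10
]

CHUNK_SIZE = 2  # stand-in for a limited context window

def chunks(clauses, size):
    for i in range(0, len(clauses), size):
        yield clauses[i:i + size]

def locally_consistent(chunk):
    """Deliberately naive local check: flag a conflict only if a term is both
    defined and overridden within the same chunk."""
    text = " ".join(chunk)
    return not ('"Affiliate" means' in text and "Notwithstanding" in text)

print([locally_consistent(c) for c in chunks(contract, CHUNK_SIZE)])
# -> [True, True]: each chunk looks fine on its own, yet the contract as a whole
# silently redefines "Affiliate". That long-range dependency is exactly what a
# token-by-token generator never goes back to re-check.
```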
Contracts Are Fundamentally Different from Legal Precedents
Legal citations are like blockchain transactions—each stands alone and does not change over time.
Contracts are constraint-solving problems—every clause must be optimized in relation to the others.
Solving one constraint (e.g., adding a liability clause) can break previous optimizations (e.g., a definition that seemed fine now needs revision).
This creates an "infinite regress" problem—there is no single correct answer.
Possible Solutions
AI could be designed to dynamically adjust and revalidate clauses rather than treating contracts as static documents.
AI could use a structured "dependency graph," where changing one clause triggers a reevaluation of related clauses (a minimal sketch follows this list).
But even with better AI, multi-conditional legal trade-offs cannot be reduced to deterministic logic.
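Here is that second idea as a sketch, with clause names and dependencies invented purely for illustration: represent clauses as nodes and dependencies as edges, then re-check everything downstream whenever a clause changes.

```python
from collections import defaultdict, deque

# Hypothetical dependency graph: an edge (A, B) means clause B relies on clause A.
edges = [
    ("definitions.affiliate", "indemnity.scope"),
    ("definitions.affiliate", "termination.change_of_control"),
    ("indemnity.scope", "liability.cap"),
]
dependents = defaultdict(list)
for src, dst in edges:
    dependents[src].append(dst)

def clauses_to_revalidate(changed_clause):
    """Breadth-first walk: every clause downstream of a change must be re-checked."""
    seen, queue = set(), deque([changed_clause])
    while queue:
        for dependent in dependents[queue.popleft()]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen

print(clauses_to_revalidate("definitions.affiliate"))
# -> {'indemnity.scope', 'termination.change_of_control', 'liability.cap'}
```

Even this only tells the system what to re-examine. Deciding whether the revised clauses are acceptable still runs straight into the subjective trade-offs below.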
AI Cannot Solve Subjective Risk Trade-Offs
Human risk assessments are nonlinear and context-dependent—there is no absolute preference function.
AI must approximate human values, but it has no way to verify if the approximation is “correct.”
This means AI is always playing "risk bingo"—it guesses an optimal trade-off but never knows if it truly reflects human intent.
Example: Modeling Risk
A legal contract might need to assess:
Risk A: Losing one leg
Risk B: Losing two legs
Mathematically, AI might model Risk B as twice as bad as Risk A.
But in reality, losing one leg might already represent 99% of the total loss in human experience.
There is no single, objective way to resolve this preference.
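Numerically, the problem looks like this (the figures are invented to illustrate the shape of the issue, not to model real injuries): a linear score treats two lost legs as exactly twice as bad as one, a saturating score puts almost all of the harm in the first loss, and nothing in the data tells a model which curve reflects human experience.

```python
# Two equally "valid" ways to score the harm; choosing between them is a value
# judgement, not something the data can settle.

def linear_loss(legs_lost):
    return 0.5 * legs_lost                         # two legs = exactly twice one

def saturating_loss(legs_lost):
    return {0: 0.0, 1: 0.99, 2: 1.0}[legs_lost]    # most of the harm is in the first loss

for legs in (1, 2):
    print(legs, linear_loss(legs), saturating_loss(legs))
# 1 0.5 0.99
# 2 1.0 1.0
# An AI optimizing a risk allocation reaches different "optimal" clauses depending
# on which curve it assumes, and it has no way to verify which one reflects human intent.
```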
This is why AI struggles with law—it has no intrinsic understanding of trade-offs beyond probability models.
The Core Issue in AI Evaluation
We observe a pattern (e.g., o1-preview outperforms RAG).
We lack a precise model to explain why, so we apply vague reasoning ("structured reasoning models are better").
We fail to interrogate the details, making the conclusion functionally useless.
This is why AI research often lacks scientific rigor—people mistake correlation for causation.
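A toy simulation makes the trap explicit (all numbers invented): if some hidden factor, say user skill, drives both tool adoption and output quality, the benchmark will show the tool "outperforming" even when it contributes nothing at all.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000

skill = rng.normal(0, 1, n)                            # hidden confounder
uses_new_model = (skill + rng.normal(0, 1, n)) > 0     # stronger users adopt the new tool more often
score = skill + rng.normal(0, 1, n)                    # by construction, the tool adds nothing

print(round(score[uses_new_model].mean(), 2), round(score[~uses_new_model].mean(), 2))
# The "new model" group scores visibly higher even though it has zero causal effect.
# The pattern is real; the explanation we attach to it is not.
```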
Rome adopted a practice from Greece called augury—interpreting the will of the gods by watching the flight patterns of birds. No major political or military decision was made without it. The assumption was simple: if the birds signaled favorably, the course of action was correct.
In Schiller’s ballad Die Kraniche des Ibykus, a murderer confesses at the exact moment a flock of cranes flies over a theatre. The crowd gasps. The cranes have delivered justice!
Da hört man auf den höchsten Stufen
Auf einmal eine Stimme rufen:
"Sieh da! Sieh da, Timotheus,
Die Kraniche des Ibykus!"

Then hears one from the highest footing
A voice which suddenly is crying:
"See there! See there, Timotheus,
Behold the cranes of Ibycus!"
But did the cranes cause the confession? No.
Did augury actually reveal divine truth? No.
Yet, the belief in the pattern was enough to make it real.
This is exactly how we evaluate AI today.
Are structured reasoning models better than RAG? The study claims so, but lacks a real explanation. Just like the augurs of Rome, we watch the birds (AI benchmarks), see a pattern, and declare it meaningful—without questioning whether the pattern actually means anything at all.
Why This Matters Beyond Law
This same problem exists in AI ethics.
Let’s say a driverless car controlled by AI goes around a corner and sees two people:
A grandmother
A 10-year-old child
The car cannot stop in time.
It must either:
Hit the grandmother to save the child.
Hit the child to save the grandmother.
There is no algorithmic "truth" to resolve this decision.
This mirrors the Titanic dilemma:
In Titanic (1997), "Women and children first" was an explicit moral choice.
This was a social norm, not an objective truth—it was based on the values of that time.

But we hesitate to encode moral trade-offs into AI.
So we either "let AI decide" (introducing hidden bias), or we pretend AI isn’t making a moral choice at all.
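To see why "letting the AI decide" is still a moral choice, consider a minimal sketch with entirely hypothetical weights: whatever rule the car follows is equivalent to some weighting over outcomes, whether or not anyone wrote that weighting down.

```python
# Any crash policy implicitly assigns weights to outcomes. Making them explicit
# (these numbers are purely hypothetical) does not make them "correct"; it only
# makes the moral choice visible instead of hidden inside the model.

value_weights = {"child": 1.0, "grandmother": 1.0}     # one contestable choice
# value_weights = {"child": 2.0, "grandmother": 1.0}   # another contestable choice

def choose_path(weights):
    # Swerving left hits the grandmother; swerving right hits the child.
    harm = {"left": weights["grandmother"], "right": weights["child"]}
    return min(harm, key=harm.get)

print(choose_path(value_weights))
# With equal weights, the tie-break itself is an arbitrary, hidden choice. Refusing
# to write the weights down does not remove them; it hands the decision to whatever
# the training data and loss function happen to imply.
```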
This is the same trap as the contract issue:
AI must approximate human trade-offs.
But there is no objectively "right" function—it depends on values.
Humans have historically made these trade-offs explicitly (e.g., Titanic, war triage, medical ethics).
But we hesitate to program them into AI because it forces us to admit that our values are subjective and not always fair.
AI Ethics Is a Paradox
We demand that AI make choices reflecting human values, yet we either cannot or refuse to define those values explicitly—pretending the decision is purely technical instead of confronting the underlying moral dilemma.
AI ethics today is mostly about disguising bias, not actually solving it.
The same habit of hiding the problem applies to debates about AI used for contract drafting, litigation, and governance.
Instead of acknowledging trade-offs that are inherently subjective, we engage in shallow, misleading, and performative debates. We don't challenge our own assumptions, and we accept superficial speculation as explanation.
This erodes confidence in institutions, science, and politics—not out of malice, but because we are not thinking deeply enough.
This is the real crisis of today, which AI merely highlights:
People mistake polished narratives for real understanding.