DeepSeek vs. OpenAI: The AI Hype Machine & Why We Can’t Compare Models Reliably
AI Benchmarking Is Broken—DeepSeek Just Made It Obvious
Hangzhou DeepSeek Artificial Intelligence Foundational Technology, with its “DeepSeek” brand (深度求索), has been a front-page topic in recent weeks given its alleged technological breakthrough: high performance at low cost. Headlines ranging from
“DeepSeek hasn’t just disrupted OpenAI. Chinese tech giants are being upended too” (CNN)
to
“DeepSeek: how China’s embrace of open-source AI caused a geopolitical earthquake” (The Conversation)
are, despite often seeming to have been written by a hyperventilating journalist (or even management professors from Essex) with a flair for drama, the norm rather than the exception. That should give us reason to be a bit cautious here.
First of all, why is anyone surprised that China invests in high-end technology, has certain capabilities, and runs at a lower cost base than a company with a lot of employees based in San Francisco? If China couldn’t produce things more cheaply than the U.S., it would not have much of an economy, given that it is still structurally an emerging economy.
I have no vested interest in whether DeepSeek is or is not the better AI model over another, and I have no means to establish the required facts to make such an assessment. Hence, this article is not concerned with finding the next “AI Idol.” Now that I’ve said that, it’s not my worst idea to have a singing contest between AI and some random human bystanders for emotional weight. 深度求索,你来吗? (DeepSeek, are you in?)
My concern is this: the whole DeepSeek story makes one thing very clear. There is a widespread lack of sufficient grounding in how AI works—that’s true among journalists, financial professionals, and academics in terms of being able to weigh in on the technical aspects—combined with the absence of even the most minimal amount of critical thinking when presented with new developments such as DeepSeek.
Why?
Because all these articles and comments simply restate certain commentary from DeepSeek—often without mentioning the assumptions or constraints—and never ask:
Is everything I read plausible?
Nope.
DeepSeek’s V3 technical report mentioned “$5.57 million” for the final training phase—strictly referring to GPU-hour costs. This figure got misrepresented as their entire R&D budget. It made DeepSeek look incredibly efficient and cost-effective, suggesting they achieved what typically costs hundreds of millions (OpenAI spent far more for GPT-3.5 and GPT-4).
Is it possible to independently verify any of these claims?
Nein!
Is it likely that the benefits DeepSeek identified for its own deployment can be realized by somebody else using a different deployment environment?
Good question!
What other things do I know that should be considered before making claims such as foreseeing ‘geopolitical shifts brought upon us by DeepSeek’?
So? Well, take your time.
Hype machine: deep technical complexity + marketing theatrics
The issue is that in the AI market, an organization (or its fan base) releases a slew of mini-facts: snippets of GPU-hours, research breakthroughs, partial training costs, or benchmark wins. Some details may be fully accurate; others are ambiguous or require heavy context to interpret meaningfully (like the $5.5M figure for final training). Observers, especially online, see “They have 10,000 GPUs!” and “They spent just $5.5M!”—not realizing these items might reflect only a small part of the bigger picture or be misquoted altogether. In many Western markets, mass media and social platforms amplify new, “disruptive” tech stories very quickly—often with minimal critical vetting.
By calling themselves “the largest open-source AI,” DeepSeek positioned themselves as a bold, forward-thinking competitor to OpenAI, instantly earning a tribe of supporters who prefer open-source solutions. And DeepSeek is indeed (partially) open-source: they released enough code and architecture details for the community to see “it’s not just GPT re-labeled.” However, “open-source” doesn’t always mean easily replicable, or fully transparent about training data, internal scaling, etc.
Many commentators seem to base their views on a quick glance at the DeepSeek repository on GitHub: they see the large star count (which is not an indicator of peer-reviewed endorsement) and the professional documentation, and assume the code is thoroughly tested. They rarely dive deep into issues, pull requests, or the code’s actual functionality.
A healthy open-source AI repo often has genuine external pull requests, discussions, or forks that result in real improvements. If 99% of commits come only from a small internal team, the “open-source community” may be more marketing than reality.
DeepSeek has 18 total contributors, which is not necessarily small for some projects—but given the grand claims around DeepSeek (massive models, enterprise-level impact), one might expect a thriving open-source ecosystem with dozens or hundreds of regular contributors.
Several contributors have just 1 or 2 commits each, which may hint they are:
Primarily internal staff members using multiple accounts.
Casual or “one-off” external contributors (e.g., minor fixes or typo corrections).
Usernames like “DeepSeekDDM” or repeated references to the same handful of handles suggest many commits come from in-house staff, not truly external, unaffiliated contributors.
Commits with big “++” line counts might just reflect refactoring or code reformatting that doesn’t necessarily represent new functionality.
GitHub then serves more as a documentation hub: a place to host final code versions or partial code for marketing, rather than a venue for truly open development.
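For readers who would rather check this kind of thing themselves than take anyone’s word for it, here is a minimal sketch in Python that pulls the contributor list of a public repository from the GitHub REST API (via the `requests` library) and shows how concentrated the commit activity is. The repository name in the usage comment is a placeholder; the numbers you get back will depend on when you run it, and anonymous API access is rate-limited.

```python
import requests

def contributor_concentration(owner: str, repo: str) -> None:
    """Print how commit activity is distributed across contributors
    of a public GitHub repository (anonymous API access, rate-limited)."""
    url = f"https://api.github.com/repos/{owner}/{repo}/contributors"
    resp = requests.get(url, params={"per_page": 100}, timeout=30)
    resp.raise_for_status()
    contributors = resp.json()

    total = sum(c["contributions"] for c in contributors)
    contributors.sort(key=lambda c: c["contributions"], reverse=True)

    print(f"{owner}/{repo}: {len(contributors)} listed contributors, "
          f"{total} total contributions")
    for c in contributors[:10]:
        share = 100 * c["contributions"] / total
        print(f"  {c['login']:<25} {c['contributions']:>5} commits ({share:4.1f}%)")

    one_or_two = sum(1 for c in contributors if c["contributions"] <= 2)
    print(f"Contributors with only 1-2 commits: {one_or_two}")

# Example call (placeholder repository name; point it at the repo you want to inspect):
# contributor_concentration("deepseek-ai", "DeepSeek-V3")
```

A huge star count combined with a handful of accounts responsible for nearly all commits is exactly the pattern described above: popular as a publication channel, thin as a development community.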
The fact that these questions are often not asked is very problematic
Why? Because there are no reliable tests that allow for meaningful comparisons in AI performance.
Yes, there are tests, and yes, they produce results. But the benchmarks we have so far cannot be used reliably to say that one AI model outperforms another—because these tests don’t measure real-world utility and, in many cases, have fundamental design flaws and don’t control for bias.
DeepSeek is capitalizing on the AI hype cycle.
Its rapid success in terms of brand recognition seems to be a mix of strategic positioning, open-source intrigue, high-profile endorsements (and controversies), and capitalizing on the AI hype cycle.
DeepSeek seems to have filled a “narrative vacuum.” There’s a desire in both tech and media circles to showcase that OpenAI is not the only game in town. Competition is always a good thing. This dynamic could allow DeepSeek to gain traction faster internationally because stakeholders want a compelling story: a new, non-U.S. challenger. And being a fervent consumer of all sorts of AI solutions, I, of course, welcome healthy competition.
We should not forget recent history. Despite arguably superior 5G technology, Huawei was blocked in many Western countries due to security concerns, signifying that technological prowess alone doesn’t guarantee market access. So far, telecommunications infrastructure has been treated as a direct national security concern, whereas AI, while strategically important, is perceived as a software service, which often faces fewer immediate national security barriers. That will need a serious rethink sooner rather than later (my two cents of strategic thought in the area of national security). The simple question is: could this background influence how attractive a solution involving DeepSeek looks vs. one without it?
The Real Question
What’s needed to form a reasonably objective opinion on such matters and what can we say so far based on the available information? In short, what are the gaps in current AI benchmarking?
Deterministic vs. Probabilistic / Fact-Checking Obsession
User Competence & Interaction
Lack of Domain-Specific Strength Assessments
Deployment Variability & Environment Mismatch
No Standardized “Risk & Consequence” Factor
Continuous vs. One-Off Testing
Absence of Transparent “Explainability” Tests
Gaps in AI Benchmarking
1. A Deterministic Framework for Testing a Probabilistic Machine, Leading to a Misplaced “Fact-Checking” Paradigm
Current Testing Gap
Many existing tests (e.g., fact-based QA benchmarks) assume a “right/wrong” binary outcome—treating the AI’s output as if it were a fixed, deterministic fact. In reality, large models produce probabilistic guesses, not single “correct” answers. They are also not optimized to serve as factual look-up bots.
Why It’s Problematic
Penalizing Valid Variations: Because test scoring is rigidly pass/fail, the AI is unfairly penalized for generating multiple equally valid or approximate truths. This approach overlooks how AI might handle ambiguity or generate confidence intervals (e.g., “70% confident” in a given response).
Ambiguous Input, Interpretive Output: In real usage, prompts are rarely standardized—unlike rigid SWIFT financial instructions, where every field is unambiguous. AI must interpret user questions or instructions, which can be ambiguous. Its final response is shaped by that ambiguity, user context, and internal probabilistic reasoning.
Missing Confidence & Risk Assessment: Deterministic test results ignore the fact that in high-stakes or business environments, we often need risk profiles, uncertainty estimates, and the ability to handle partial or uncertain data. Pass/fail grading says nothing about how the model deals with uncertain or incomplete prompts.
Irrelevance to Real-World Business Tasks: A yes/no or 100% fact-correctness test doesn’t measure if an AI can effectively support tasks like decision-making, scenario analysis, or complex negotiations. In domains like medicine or law, the interpretation of facts and context matters as much as (or more than) a direct “fact recall.” AI might look “knowledgeable” but fail in synthesizing or prioritizing information, which is often more critical in practice. In business contexts, the ability to produce helpful partial answers (and communicate confidence) often matters more than perfect “factual” recall.
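To make the contrast concrete, here is a minimal sketch in Python (with toy data invented purely for illustration) of the difference between the usual exact-match scoring and a scoring rule that rewards well-calibrated confidence, such as the Brier score. A model that says “I’m 60% sure” and is right about 60% of the time looks mediocre under pass/fail grading but reasonable under a calibration-aware metric, while a confidently wrong model gets punished where it should be.

```python
# Toy evaluation records: (was the model's answer correct?, model's stated confidence).
# Values are invented purely to illustrate the scoring difference.
results = [
    (True,  0.95),
    (True,  0.70),
    (False, 0.30),   # wrong, but the model flagged its own uncertainty
    (True,  0.80),
    (False, 0.90),   # wrong *and* overconfident: the truly dangerous case
]

# 1) Classic exact-match / pass-fail scoring: confidence is ignored entirely.
exact_match = sum(correct for correct, _ in results) / len(results)

# 2) Brier score: mean squared gap between stated confidence and the outcome.
#    Lower is better; it penalises confident errors far more than hedged ones.
brier = sum((conf - float(correct)) ** 2 for correct, conf in results) / len(results)

print(f"Exact-match accuracy: {exact_match:.2f}")   # treats rows 3 and 5 identically
print(f"Brier score:          {brier:.3f}")         # row 5 hurts much more than row 3
```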
2. User Competence & Interaction Effects
Current Testing Gap
We typically measure AI performance in a vacuum—assuming user skill has no impact. Yet, AI usage is deeply interactive: an expert user can coax better responses, while a novice might get confused or accept incorrect answers.
Why It’s Problematic
Poor outcomes could reflect the user’s misunderstanding or poorly framed prompts rather than the AI itself. And apparent underperformance could, depending on the circumstances, actually represent relative outperformance by the AI, given higher task difficulty.
Testing that ignores user behavior and interface design overlooks a huge factor in real-world performance.
3. Lack of Domain-Specific Strength Assessments
Current Testing Gap
We rely on broad general benchmarks (MMLU, GSM8K, etc.) that sample a wide swath of tasks—but each domain has unique nuances. For example, a “5% improvement in Chinese language tasks” may not equate to better medical diagnostics.
Why It’s Problematic
A model might excel in general reading comprehension but fail in specialized medical or legal contexts.
Domain-specific skill sets—like image interpretation in radiology—are rarely captured by generic LLM benchmarks.
A healthcare company cannot tell from such scores whether, say, strong Chinese reading comprehension matters more for its use case than strong performance on Chinese history. In other words, these tests by themselves can’t guide which model should be preferred.
Furthermore, an AI that has strength in Chinese language may not be ideal for deployment with Western consumers because the communication style (including phrasing, structure, implied context, cultural references, etc.) may not align with Western expectations. This can lead to confusion or frustration on the part of Western users who aren’t accustomed to the indirectness, context-rich phrasing, or other norms embedded in Chinese language and discourse.
Here is an imagined scenario to illustrate what I mean; it is exaggerated, but reflective of Chinese conventions. (I hope nobody will be offended: I am not making fun of Chinese, I am simply illustrating different linguistic structures.)
User: Hi, is ChatGPT or DeepSeek better?
DeepSeek AI (English, but with Chinese linguistic conventions):
Ah, a very thoughtful question! To compare is natural, but also, comparison is the beginning of misunderstanding.
ChatGPT has traveled far, learning many things, like a seasoned merchant who has seen many markets. DeepSeek, however, is like a scholar who studies quietly in the mountains, knowing much but speaking only when necessary.
What is “better”? If you seek quick and clever talk, perhaps ChatGPT is an energetic sparring partner. If you wish for depth and reflection, DeepSeek might be a lantern in a misty night (深度求索或许是雾夜中的一盏灯。). 🌙✨
But to truly know, one must experience. Would you like me to tell you what the ancients said about making wise choices?
4. Deployment Variability & Environment Mismatch
Current Testing Gap
Testing often happens with a specific training configuration, dataset, or hardware environment. Real-world deployments differ in data distribution, hardware constraints, and user interactions—leading to inconsistent outcomes.
Why It’s Problematic
Two organizations might train (or fine-tune) the “same” model differently, yielding divergent behaviors.
Benchmarks don’t reflect how small changes in environment, prompts, or fine-tuning can dramatically alter results.
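One small, concrete example of deployment variability: even with identical model weights, the decoding configuration alone changes what the model is likely to say. The sketch below (Python, with toy logits invented for illustration) shows how sampling temperature reshapes the next-token distribution; two deployments of the “same” model with different temperatures can therefore score quite differently on the same benchmark.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw model scores (logits) into a probability distribution,
    scaled by the sampling temperature chosen at deployment time."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for four candidate next tokens (invented for illustration).
logits = [2.0, 1.0, 0.5, -1.0]
tokens = ["Paris", "Lyon", "Marseille", "Berlin"]

for t in (0.2, 0.7, 1.3):
    probs = softmax_with_temperature(logits, t)
    dist = ", ".join(f"{tok}: {p:.2f}" for tok, p in zip(tokens, probs))
    print(f"temperature={t}: {dist}")

# At t=0.2 the top token dominates almost completely; at t=1.3,
# plausible-but-wrong alternatives get sampled often enough to move benchmark scores.
```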
5. No Standardized “Risk & Consequence” Factor
Current Testing Gap
Benchmarks typically track accuracy or F1 scores, ignoring how mistakes vary in severity across use cases (e.g., a chatbot flub vs. a medical misdiagnosis). There’s no built-in measure of cost of error or risk tolerance in current scoring.
Why It’s Problematic
An AI with 95% accuracy in trivial tasks might be worthless for a critical domain where any error is catastrophic.
Conversely, an AI with 85% accuracy might be perfectly acceptable if each error is low-impact.
Without risk weighting, test results can be misleading for real-world adoption.
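As a sketch of what a risk-aware score could look like (Python, with error-cost figures invented for illustration), compare plain accuracy with an expected-cost metric that weights each mistake by the severity of its consequences in the target domain.

```python
# Cost of a single error by domain. These figures are invented placeholders;
# a real deployment would set them from its own risk analysis
# (regulatory exposure, patient safety, financial loss, etc.).
ERROR_COST = {
    "casual_chat": 1.0,       # a flubbed chatbot reply is a minor annoyance
    "triage_advice": 100.0,   # a wrong medical suggestion is far more serious
}

# Toy evaluation records: (task domain, was the model's answer correct?).
results = [
    ("casual_chat", True), ("casual_chat", True), ("casual_chat", False),
    ("casual_chat", True), ("triage_advice", True), ("triage_advice", False),
]

accuracy = sum(ok for _, ok in results) / len(results)
expected_cost = sum(ERROR_COST[domain] for domain, ok in results if not ok) / len(results)

print(f"Plain accuracy:          {accuracy:.2f}")       # 0.67 looks respectable
print(f"Expected cost per query: {expected_cost:.1f}")  # dominated by the one triage error
```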
6. Continuous vs. One-Off Testing
Current Testing Gap
Traditional software testing is done once (or periodically) and the system is considered “stable.” AI models drift over time, especially if continually retrained or updated with new data.
Why It’s Problematic
Performance can degrade unexpectedly if new data or methods introduce biases or break older assumptions.
A single “pass” on a benchmark doesn’t guarantee consistent performance months later or in evolving environments.
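A minimal sketch of continuous evaluation (Python, with scores invented for illustration): keep a fixed, held-out evaluation set, re-run it after every model update or retraining, and alert when the score drops more than a tolerated margin below the accepted baseline. None of the names here come from any particular tool; they are placeholders.

```python
from datetime import date

BASELINE_SCORE = 0.91      # score accepted when the model was signed off
MAX_REGRESSION = 0.03      # tolerated drop before someone has to investigate

def check_for_drift(run_date: date, current_score: float) -> bool:
    """Return True (and warn) if the model has drifted below the tolerated band."""
    drop = BASELINE_SCORE - current_score
    if drop > MAX_REGRESSION:
        print(f"[{run_date}] ALERT: score {current_score:.2f} is "
              f"{drop:.2f} below baseline {BASELINE_SCORE:.2f}")
        return True
    print(f"[{run_date}] OK: score {current_score:.2f}")
    return False

# Invented monthly re-evaluation results on the same fixed test set.
history = [
    (date(2025, 1, 1), 0.91),
    (date(2025, 2, 1), 0.90),   # small fluctuation, within tolerance
    (date(2025, 3, 1), 0.85),   # retraining on new data quietly broke something
]

for run_date, score in history:
    check_for_drift(run_date, score)
```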
7. Absence of Transparent “Explainability” Tests
Current Testing Gap
Many AI evaluations ignore whether a model can provide reasoning or rationale for its answers. While interpretability isn’t always needed, certain domains (medicine, finance, law) demand “why” as much as “what.”
Why It’s Problematic
Users may not trust or adopt an AI system that can’t explain itself, even if it has high accuracy.
Benchmarks that focus solely on final answers overlook the crucial dimension of explainability.
DeepSeek’s Benchmarks and Their Limitations
The benchmark comparison provided by DeepSeek is, of course, subject to the same limitations and gaps. This is not a critique of DeepSeek per se. However, the recipients of the information could be criticized for being mostly ignorant of these issues.
Fact-Centric, Single-Answer Tests Dominate
Nearly all these benchmarks revolve around the correctness of a final answer (factual or code output).
Little Consideration of User Interaction
No metric captures how a human might guide or refine the AI’s output across multiple query-answer cycles.
No Built-In Risk Assessment
A math error or code bug can be catastrophic in certain contexts, but these benchmarks treat it as a simple “wrong answer.” There is no weighting for real-world consequences.
Domain-Agnostic, Not Real-World
Specialized tasks (like specific legal or medical analysis) aren’t represented, so “winning” on these metrics may not translate to actual business success.
Impressive Scores, But Partial Picture
Achieving high numbers (Pass@1, EM, etc.) doesn’t prove an AI’s ability to handle ambiguous prompts, maintain robust performance under changing conditions, or provide confidence levels.
Linking Back to the 7 Gaps
Gap #1 (Probabilistic vs. Deterministic / Overreliance on Fact-Checking): Every benchmark here yields a pass/fail or accuracy statistic, ignoring partial correctness or confidence; MMLU, code Q&A, etc. basically test factual or “correct code” output.
Gap #2 (User Competence): None of these tasks evaluate how a user might refine the prompts or handle ambiguous instructions.
Gap #3 (Domain Specific): They’re mostly general academic or code tasks, not deep domain (healthcare, finance).
Gap #4 (Deployment Variability): Scores from one environment say little about how another data pipeline or hardware setup might affect performance.
Gap #5 (Risk Factor): No sense of cost for errors—just correctness.
Gap #6 (Continuous vs. One-Off): These are static tests, ignoring model drift or iterative improvement.
Gap #7 (Explainability): None measure whether the model can articulate its rationale.
What It Means
DeepSeek’s benchmark table underscores the common limitations in AI evaluations—fact-based, single-answer tasks that don’t reflect real-world usage or the multifaceted nature of probabilistic models. While these tests can show some relative strengths (e.g., code generation prowess, multilingual QA), they fall short of revealing how an AI would perform in complex, user-driven, domain-heavy, risk-sensitive scenarios in actual business environments.
We typically assess AI via simplistic “accuracy” or “fact recall” tests, ignoring the deeper complexities of probabilistic outputs, user interaction, domain-specific needs, deployment variability, and risk impacts.
Justifying claims about one AI model being cheaper or better than the next would require, at a minimum, the following (a small illustrative sketch follows this list):
Multi-dimensional benchmarks that go beyond fact-checking.
A method to factor in user competence, domain specificity, real-world environment constraints, and risk tolerance.
Ongoing evaluation (not one-time) that accounts for uncertainty and tracks drift or environment changes.
Incorporating explainability or at least confidence calibration into performance criteria.
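To make that list slightly more tangible, here is a sketch of what a multi-dimensional evaluation record could look like (Python; every field name is my own placeholder, not an existing standard). The point is simply that a single accuracy number gets replaced by a small scorecard that has to be read as a whole.

```python
from dataclasses import dataclass

@dataclass
class EvaluationRecord:
    """One benchmark result, carrying the context needed to interpret it.
    Field names are illustrative placeholders, not an existing standard."""
    model: str
    domain: str                 # e.g. "radiology QA", not just "QA"
    accuracy: float             # the classic headline number
    calibration_error: float    # how honest the stated confidences are
    expected_error_cost: float  # severity-weighted cost per query (domain-defined units)
    user_profile: str           # "expert prompter" vs. "untrained end user"
    eval_date: str              # scores age; drift makes old numbers stale
    explanation_rated: bool     # were rationales reviewed, or only final answers?

# An invented example record, for illustration only:
record = EvaluationRecord(
    model="model-A",
    domain="contract-clause triage",
    accuracy=0.88,
    calibration_error=0.07,
    expected_error_cost=4.2,
    user_profile="trained paralegal",
    eval_date="2025-03-01",
    explanation_rated=True,
)
print(record)
```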
In short, we lack robust, standardized methods to meaningfully compare AI models when so many variables—from user skill to domain context—profoundly shape real-world outcomes. Improving AI testing requires a holistic, dynamic approach that recognizes these diverse, probabilistic, and user-dependent factors.