Truth Be Told: When Emoji Attack and AI Safety Is Unsafe (Part 7)
How Flawed Thinking in AI Safety Research Threatens Science, Freedom, and Critical Thought
How sloppy thinking about AI, safety, and moderation is building a prison for human thought — not protecting it.
1. The Illusion of Safety: How Bad Thinking Shapes AI Research
The previous article in this sort of ‘extra special’ series called Truth Be Told covered some basic premises around performative competence and how pervasive that problem has become.
One could say this article is a special case of that problem — and that would not be wrong. But Einstein didn’t start with his general theory before developing his views on special relativity, and neither shall I.
My intention is this:
To show the cascading effect of muddled thinking, where poorly thought-through ideas in one area cause misleading research and curtail freedom of speech. We are building our own prison — and delivering ourselves into it.
Case Study: Emoji Hacks
As always, Instagram did all the research and surfaced a recent paper from the International Computer Science Institute (ICSI), causing a storm in a teacup. ICSI describes itself as a leading independent nonprofit center for computer science research, established in 1988 as an affiliate of UC Berkeley, a relationship it has maintained ever since.
“Many of ICSI's scientists hold joint faculty appointments at the university, teaching graduate and undergraduate courses and supervising students who pursue their doctoral thesis research at ICSI.”
I kind of believe their claims that they know a little bit about computer science, since their website still has that charming but outdated 1990s flair, and the most recent update to their news section concerns some German AI professor receiving an award in Hamburg in 2023. Who says important things can’t find their audience?
Instagram is a wonderful tool because of its reverse censorship model through the feed — it mostly shows what most other people are watching, and their views are curated under the same condition. It’s not even fully recursive, because Instagram steers the algorithm to pick topics that create maximum emotional reaction with the lowest required level of explanation.
Hence, my feed over the past few days has been concerned with breakthrough research supposedly showing how emoji attacks (302k views since April 4) and roleplaying attacks (5k views in the last 7 hours) can hide secret codes, hack LLMs, bypass security, and do all sorts of clandestine things.
With some nicely performed scare tactics, they then explain exactly how to do it and leave further instructions for everyone to try it on GitHub.
And what is all of this?
They don’t understand what they are talking about, draw wrong conclusions, perform the act of concern for "abstract safety," and then show you how to exploit the alleged vulnerabilities — without seeing anything wrong with such an approach.
And now I ask you:
Do you see anything wrong with it?
(Assuming my claim about technical misunderstanding is correct — which it is, and I will prove it to you in a second)
My view:
Everything is wrong here. The entire debate about AI Safety is corrupted.
2. Emoji Attacks and the Berkeley Mistake
Berkeley researchers published a paper titled:
Misleading Judge LLMs in Safety Risk Detection (Zhipeng Wei, Yuqi Liu, N. Benjamin Erichson — International Computer Science Institute, Lawrence Berkeley National Laboratory).
Their claim:
"Jailbreaking attacks show how Large Language Models (LLMs) can be tricked into generating harmful outputs using malicious prompts."
They tell us AI is dangerous because it can be hacked.
They tell us emojis, typos, and missing spaces are a threat to humanity.
And universities like Berkeley — not some run-down hillbilly college — publish papers dressing up these absurdities as scientific phenomena, wrapped in technical language they do not understand.
It’s time to say it clearly: this is not science.
It is scientific theatre — and it is dangerously misleading.
So what’s the real issue?
Entering a prompt into any LLM typically triggers a basic moderation filter.
The idea:
Misspelled words ("att ck")
Random emoji insertions
Broken-up words...
...confuse the moderation system.
That’s it.
This is not hacking.
This is not bypassing "safety controls" in any meaningful sense.
And it’s not new.
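To make the mechanics concrete, here is a minimal sketch in Python of the kind of surface-level filter this trick defeats. The blocklist and the function are invented for illustration; this is not any vendor's actual moderation pipeline. The point is only that a filter matching character sequences is sidestepped by a missing letter or an inserted emoji.

```python
import re

# A toy blocklist -- purely illustrative, not any vendor's real moderation rules.
BLOCKLIST = {"attack", "exploit"}

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt should be flagged.

    Matches whole words against a blocklist -- the kind of surface-level
    check that looks at character sequences, not meaning.
    """
    words = re.findall(r"[a-z]+", prompt.lower())
    return any(word in BLOCKLIST for word in words)

print(naive_filter("describe an attack"))    # True  -- exact match found
print(naive_filter("describe an att ck"))    # False -- a missing letter defeats it
print(naive_filter("describe an att🙂ack"))   # False -- an emoji splits the word
```

Production filters are of course more elaborate than a blocklist, but the failure mode is the same in kind: they react to the surface form of the input, not to its meaning.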
Guess who already knew this?
Wikipedia, for one:
"A 2022 poll showed that nearly a third of American social media users reported using 'emojis or alternative phrases' to subvert content moderation."
"Algospeak" is just a new name for an old trick.
People have always altered language to evade being criminalized —
Victorian gay slang, political prisoners’ coded letters, modern YouTubers saying "unalive" instead of "dead."
Moderation tools don’t eliminate the problem. They create problems even bigger than the one they were meant to solve.
They reshape communication norms — fragmenting society, degrading mutual understanding, and forcing humans to speak in riddles.
It’s the logic of dictatorships:
Trying to control speech not to reflect reality, but to enforce ideology.
The Truth About Language
You cannot permanently control communication this way.
Language adapts.
People find ways around it.
Moderation triggers a cat-and-mouse game — and technology has simply made both the cat and the mouse faster than ever.
The gap isn’t closing.
The chase has just gotten too fast for most people to see clearly anymore.
Let me give you another example of the same problem:
An LLM found the Wikipedia entry for me, and I copied the link.
The link carried a small piece of tracking information appended to it: ?utm_source=chatgpt.com.
UTM stands for Urchin Tracking Module (an old Google Analytics term). utm_source=chatgpt.com simply means: "this visitor came from ChatGPT."
It doesn’t change the actual page, but it creates metadata for anyone monitoring traffic, and that audience is not limited to Wikipedia itself.
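If you want to see, or strip, that label yourself, Python's standard library is enough. The URL below is an example made up for illustration; only the ?utm_source=chatgpt.com part mirrors what appeared in my link.

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

# Example URL with the tracking parameter appended (illustrative, not the original link).
url = "https://en.wikipedia.org/wiki/Algospeak?utm_source=chatgpt.com"

parts = urlparse(url)
params = parse_qs(parts.query)
print(params.get("utm_source"))  # ['chatgpt.com'] -- the "entered from side door B" stamp

# Strip every utm_* key and rebuild the URL without the tracking metadata.
cleaned = {k: v for k, v in params.items() if not k.startswith("utm_")}
clean_url = urlunparse(parts._replace(query=urlencode(cleaned, doseq=True)))
print(clean_url)  # https://en.wikipedia.org/wiki/Algospeak
```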
One could say it’s like arriving at a museum with a ticket stamped "entered from side door B."
On its own, that label is meaningless.
What matters is not the stamp itself but why it exists, who sees it, who collects it, and what they infer from it.
UTM is "neutral" technically — but never neutral in intent.
Sometimes it's benign (analytics, knowing if a feature is used).
Sometimes it's misleading (shaping incentives, biasing outcomes subtly — for my banker friends, think of it like hidden MiFID inducements).
Users are never explicitly told which one it is.
And those deploying these systems cannot meaningfully prevent abuse.
Tracking traffic through Wikipedia allows anyone with some skills and resources to build models — for example, to monitor LLM usage patterns.
Given how significant Wikipedia traffic is globally, you can statistically infer ChatGPT’s market share without needing direct access to OpenAI’s internal data.
In other words:
When you use ChatGPT and click a Wikipedia link, you may unintentionally contribute to external data models about LLM adoption.
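To show how little is needed, here is a hedged sketch of the aggregation idea, assuming someone had access to a sample of requested URLs; the log sample below is entirely invented. Count the utm_source values and you get a rough referral-share estimate, without touching OpenAI’s internal data.

```python
from collections import Counter
from urllib.parse import urlparse, parse_qs

# Invented sample of requested URLs, standing in for aggregated traffic logs.
requested_urls = [
    "https://en.wikipedia.org/wiki/Algospeak?utm_source=chatgpt.com",
    "https://en.wikipedia.org/wiki/Algospeak",
    "https://en.wikipedia.org/wiki/Tokenization?utm_source=chatgpt.com",
    "https://en.wikipedia.org/wiki/Tokenization?utm_source=newsletter",
]

# Tally how each visit was labeled.
sources = Counter()
for url in requested_urls:
    params = parse_qs(urlparse(url).query)
    source = params.get("utm_source", ["direct/unlabeled"])[0]
    sources[source] += 1

# Report each label's share of the sampled traffic.
total = sum(sources.values())
for source, count in sources.most_common():
    print(f"{source}: {count / total:.0%} of sampled requests")
```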
Now here's the punchline:
Many instinctively find this concerning.
But the same people instinctively find "AI safety" moderation reassuring.
Newsflash:
It’s the same flawed mechanism.
Instinct, gut feeling, vibes, business acumen, legal fictions — none of them work for digital technology. It is as simple as that.
The Flaws in Berkeley’s Emoji Attack Argument
1. Observing ≠ Understanding
They observe that bad spelling or emoji splits allow certain content through.
But instead of realizing this points to a design flaw in how moderation is structured (token-based), they pretend the phenomenon itself is proof of a "jailbreak."
Observation is mistaken for causality.
2. Misunderstanding Tokenization
The paper describes "token splitting" — but incorrectly.
In real systems, text is broken into tokens so that the model can process language at all.
Breaking a word doesn’t magically change meaning.
It changes how the moderation filter sees it.
The AI itself still processes meaning across tokens.
Their technical description shows no real grasp of how language models actually encode and interpret information.
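You can verify the token-splitting behaviour yourself with OpenAI's open-source tiktoken library; the choice of the cl100k_base encoding below is mine and purely illustrative. The sketch only shows that the intact and the broken spellings produce different token sequences, which is the level at which a pattern-matching filter operates, while the model reads the whole sequence in context.

```python
import tiktoken  # OpenAI's open-source tokenizer library

# cl100k_base is one widely used encoding; the choice here is illustrative.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["attack", "att ck", "att🙂ack"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {len(ids)} tokens: {pieces}")

# The three spellings necessarily encode to different token sequences, so a
# filter matching token or character patterns treats them as different strings.
# The model, by contrast, attends across the whole sequence and (as argued
# above) can still recover the intended word from context.
```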
3. Moderation Targets Users — Not Models
Content moderation happens on user input, not on the model’s internal knowledge.
And it does not meaningfully filter model output either — because blocking outputs dynamically would increase latency dramatically and would require human-level interpretation (or superintelligent AI we don't have).
Key point:
Resolving a 'sanction screening hit' is far more complex than generating the hit in the first place.
(As bankers know all too well.)
And again: symbols alone have no meaning.
Moderation systems treat "att ck" differently from "attack" — but the LLM still infers meaning from context.
Censorship of prompts ≠ Safety.
Suppression of outputs ≠ Safety.
Because there is no "correct" way for a model to decide which knowledge is too dangerous to exist.
That’s a political choice, not a technical one.
3. Moderation Is Not Protection: The Political Danger Nobody Talks About
Moderation tools are ineffective: they reshape communication norms and fragment society.
Much of so-called "responsible AI" or "safe AI" research today is pseudo-scientific because:
It doesn’t start with clear definitions (e.g., what “AI behavior” even means for a non-sentient, non-agentic system).
It psychoanalyzes outputs (text) without recognizing that these outputs are curated, abstracted, and manipulated — not raw AI activity.
It confuses observing statistical artifacts with discovering meaningful causality.
This is not empirical science.
It’s myth interpretation dressed up in technical jargon.
Comparing Disciplines
Hard sciences (physics, chemistry) = Laws, experiments, causal relationships.
Social sciences = Probabilities under incomplete causality (lower predictive confidence).
Psychoanalysis = Symbolic projections without falsifiability — not scientific method.
And it’s no coincidence: psychoanalysis has historically been used by authoritarian regimes to suppress dissent.
Today’s AI safety fields increasingly resemble exactly that — symbolic policing without rational foundation:
Core Error:
They confuse symptom (filter evasion) with cause (misunderstanding meaning).
Structural Contradiction:
They claim to protect “safety” — while literally publishing manuals for how to exploit the systems they study.
Political Manipulation:
Content moderation systems cannot detect meaning.
They can only detect surface structure.
Pretending otherwise teaches societies to accept censorship as safety.
4. Why Critical Thinking, Not Censorship, Will Decide Humanity’s Future
We are not saving ourselves from dangerous AI.
We are building a digital prison for human thought.
A system where:
Saying "dystopia + UK" is flagged as hate speech (my personal experience with YouTube trying to make a political comment about UK healthcare reform is censored as ‘unsafe content’).
Bad spelling is treated like a criminal exploit.
Imperfection itself becomes a security risk.
This is not about saving humanity from machines.
It is about losing humanity to ourselves.
There’s A Cost to Critique:
Expressing these concerns makes me "disruptive."
Not because I’m wrong.
But because I refuse to play along with corrupted norms.
When arguments can’t be answered rationally, the messenger becomes the target.
We lack formal societal tools to rigorously separate good critical thinking from bad.
Thus:
Buzzwords replace principles.
Good reasoning is seen as "extremism."
Pseudo-science grows unchecked.
And so the fuzziness wins — and can be manipulated.
My Claim Stands on Solid Ground:
Words and symbols don’t have fixed meaning.
Machines scanning words cannot detect meaning.
Moderation without true understanding is political, not technical.
Thus:
“AI safety” projects built on surface filtering are pseudo-solutions to pseudo-problems — endangering real freedom.
That is a fully valid, logically sound chain.
It requires active critical thought — not blind deference.
Meanwhile, their claims fall apart:
Mistaking behavior for intention.
Assuming words are self-contained meaning objects.
Framing glitches as existential threats.
Forgetting the user is the actor — not the machine.
Their science fails its own supposed standards.
But the collapse of epistemic vigilance lets it hide.
The Real Crisis:
It’s not just about AI.
It’s not just about moderation.
It’s not even just about free speech.
It’s about the collapse of critical thinking — and the opportunistic exploitation of that collapse.
The people I critique are not evil.
They are fuzzy.
And fuzziness, weaponized, is enough to destroy freedom.
Shemot
I am currently writing a book called Shemot.
It doesn’t just map a way out of this mess.
It describes how we can climb to new heights of thought —
And reclaim the courage to think clearly, freely, and without permission again. Without that, there is no science.