
Zero-Width Characters in AI Text: Why They Appear and How to Remove Them

May 9, 2026·5 min read·Free tool

TL;DR

Zero-width characters are invisible Unicode code points that AI tools embed in their output — from training data patterns or intentional watermarking. They break word counts, find-and-replace, CMS imports, and database queries without leaving any visible trace. TextPurify removes all of them in one pass.

Something strange happens when you copy text from an AI tool. Your word count doesn't match what the AI reported. Your find-and-replace isn't finding the text you can clearly see on screen. Your CMS threw a validation error on import. Your database query returned nothing for a string that is plainly there.

The most likely cause is a zero-width character — an invisible Unicode code point that exists in your text, occupies a position in the string, is counted as a character, and breaks string matching, but takes up no visible space whatsoever.

What Is a Zero-Width Character?

Unicode is the global standard for encoding text in software. It aims to cover the letters, digits, symbols, and emoji of the world's writing systems. It also defines a category of characters called format characters (general category Cf): code points that affect how text is laid out or processed but do not produce a visible glyph.

Zero-width characters are the most problematic of these. They sit in your string, they are counted, and they break matching, yet they are completely invisible. The characters below are the ones most often behind these problems in AI-generated text. The last two are not strictly zero-width (U+00A0 is as wide as a normal space, and U+00AD becomes visible when a line breaks at it), but they cause the same kind of silent mismatch:

U+200B Zero-Width Space: a line-break opportunity that takes up no visual width. Commonly inserted between or inside words in AI output. Breaks find-and-replace and makes word-count tools disagree with each other.
U+200C Zero-Width Non-Joiner: prevents adjacent characters from forming a joined glyph. Some scripts (Persian, for example) require it, but in English text it can interfere with ligatures and with string matching.
U+FEFF Byte Order Mark (BOM): officially named Zero Width No-Break Space. At the very start of a file or stream it acts as a signature that indicates encoding and byte order; anywhere else it is just another invisible character. It can appear at the beginning of copied AI text, and it corrupts file imports and API inputs that don't expect it.
U+00A0 Non-Breaking Space: looks visually identical to a regular space but is a different character. Breaks string comparisons, SQL queries, and most equality checks against text typed with an ordinary space.
U+00AD Soft Hyphen: an optional hyphenation point that is invisible in most renderers unless a line breaks there, but is still counted as a character by word-processing tools and search engines.
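
As a rough illustration, Python's standard unicodedata module can report the official name and general category of each code point in the list above. The names CODE_POINTS and describe are made up for this sketch.

```python
import unicodedata

# Code points from the list above (the variable name is illustrative).
CODE_POINTS = [0x200B, 0x200C, 0xFEFF, 0x00A0, 0x00AD]


def describe(cp: int) -> tuple[str, str, str]:
    """Return (U+XXXX label, official Unicode name, general category)."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))


for label, name, category in map(describe, CODE_POINTS):
    print(f"{label}  {category}  {name}")
```

Category Cf marks a format character. U+00A0 reports Zs (space separator) instead, which is a reminder that it is a real space with real width that simply is not the space your keyboard types.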

Why Do AI Models Output These Characters?

There are three reasons, and they are not mutually exclusive:

Training data: AI models are trained on internet text, which contains zero-width characters used legitimately — Wikipedia articles with complex scripts, professionally typeset web pages, internationalized content. Models replicate patterns from their training data, including the invisible ones. This is not malicious; it is a side-effect of learning from the full breadth of the web.

Intentional watermarking: Security researchers have documented zero-width characters being systematically embedded in AI output as implicit watermarks — patterns that can theoretically identify which model generated the text. DeepSeek in particular has been studied for this. The characters are not random; they appear in statistically consistent positions that could be decoded by someone who knows the pattern.

Typographic conventions: Non-breaking spaces between numbers and units (“100 km/h”), soft hyphens in long technical words, and zero-width joiners in emoji sequences all appear in properly typeset text. Models trained on well-formatted content produce them naturally.

What Problems Do They Cause?

Word count discrepancy: Suppose a zero-width space sits between “data” and “science” where a regular space should be. Tools that split only on whitespace count the pair as one word; tools that treat U+200B as a word boundary count two. Your CMS might report 847 words while your editor reports 851. Both are technically correct about different things.
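
A small sketch of how two counting strategies can disagree on the same text. Both counters are simplified stand-ins written for this example, not the logic of any particular CMS or editor.

```python
import re

# "data" and "science" are separated only by a zero-width space (U+200B).
text = "data\u200bscience is fun"

# Strategy 1: split on whitespace. Python does not treat U+200B as whitespace,
# so "data" + U+200B + "science" stays a single token.
whitespace_count = len(text.split())

# Strategy 2: also treat U+200B as a word boundary.
boundary_count = len([w for w in re.split(r"[\s\u200b]+", text) if w])

print(whitespace_count, boundary_count)
```

The same string yields 3 words under the first strategy and 4 under the second, which is how two tools end up reporting different totals for identical text.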

Failed find-and-replace: You search for “artificial intelligence” but the replace operation finds nothing — because “artificial” has a zero-width space inside it that you cannot see. The literal string you typed does not match the string in the document.
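
The failed search can be reproduced in a few lines. This is a minimal sketch; the sample sentence and variable names are invented for the example.

```python
ZWSP = "\u200b"  # zero-width space

# The word "artificial" carries a hidden U+200B in the middle.
document = f"Recent work in arti{ZWSP}ficial intelligence is moving fast."
needle = "artificial intelligence"

found_raw = needle in document                      # search the text as pasted
found_clean = needle in document.replace(ZWSP, "")  # search after removing U+200B

print(found_raw, found_clean)
```

The search fails on the pasted text and succeeds once the hidden character is gone, even though both versions look identical on screen.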

Database and API errors: Hidden characters in AI-generated content that flows into your application can end up stored in text fields, make strict JSON parsers reject a payload that starts with a BOM, and cause string comparison queries to return empty results for values that are visually present.
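
Two of these failure modes can be sketched with Python's standard json module (Python 3.9+ for str.removeprefix). The payload and the parse helper are hypothetical; other parsers may be more lenient about a leading BOM.

```python
import json

# A payload with a leading BOM (U+FEFF) and a non-breaking space (U+00A0) in the value.
payload = '\ufeff{"title": "Q3\u00a0report"}'


def parse(raw: str) -> dict:
    """Parse JSON after dropping one leading BOM, if present."""
    return json.loads(raw.removeprefix("\ufeff"))


# Python's json module refuses a string that starts with a BOM.
try:
    json.loads(payload)
    bom_rejected = False
except json.JSONDecodeError:
    bom_rejected = True

record = parse(payload)

# The stored title looks like "Q3 report" but contains U+00A0, so equality fails.
matches_typed_value = record["title"] == "Q3 report"

print(bom_rejected, matches_typed_value)
```

The same mismatch is what makes a WHERE title = 'Q3 report' style query come back empty: the comparison is exact, and the two spaces are different characters.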

CMS import failures: Many content management systems, email marketing tools, and spreadsheet importers validate incoming text strictly. BOMs and format characters are valid UTF-8, but importers that do not expect them may reject the document. Worse, the import sometimes succeeds and the text is subtly corrupted, for example when a leading BOM becomes part of the first column name in a CSV file.
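
One concrete form of this subtle corruption is a BOM leaking into the first header of a CSV file. The sketch below uses Python's csv module; the sample data and the first_header helper are invented for illustration.

```python
import csv
import io

# A small CSV file as raw bytes, starting with the UTF-8 encoded BOM (EF BB BF).
raw_bytes = b"\xef\xbb\xbfname,email\r\nAda,ada@example.com\r\n"


def first_header(data: bytes, encoding: str) -> str:
    """Decode the bytes with the given encoding and return the first column name."""
    reader = csv.reader(io.StringIO(data.decode(encoding), newline=""))
    return next(reader)[0]


plain = first_header(raw_bytes, "utf-8")      # BOM survives decoding
clean = first_header(raw_bytes, "utf-8-sig")  # BOM is consumed by the codec

print(repr(plain), repr(clean))
```

With plain utf-8 the first column is named '\ufeffname', so a lookup by "name" fails even though the import itself raised no error. The utf-8-sig codec strips the signature during decoding.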

The Security Angle

In 2025, security researchers demonstrated that zero-width characters embedded in code completions from AI coding assistants could be used for prompt injection attacks. An attacker who gets invisible characters into a shared document or codebase can embed hidden instructions that AI systems read but humans cannot see.

This is not theoretical. The Cursor code editor had a documented vulnerability where zero-width characters in shared context documents were invisibly influencing AI code suggestions. If you work in security-sensitive environments and use AI text in documents that other AI systems will also read, stripping zero-width characters is a hygiene requirement, not just a convenience.

How to Detect Zero-Width Characters

Most text editors do not show zero-width characters by default. In VS Code, the Unicode Highlight settings (editor.unicodeHighlight.invisibleCharacters, for example) control whether they are flagged. In a browser developer console, you can run text.length and compare it to the visible character count. Programmatically, the regex /[\u200B-\u200D\uFEFF\u00AD\u00A0]/g will match the most common variants.
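
For programmatic detection, a minimal sketch along these lines reports where each hidden character sits. It assumes a character class covering U+200B to U+200D, U+FEFF, U+00AD, and U+00A0; the find_hidden name is hypothetical, and the class is not an exhaustive list of invisible code points.

```python
import re
import unicodedata

# Common invisible or look-alike characters (an assumed, non-exhaustive set).
HIDDEN = re.compile(r"[\u200B-\u200D\uFEFF\u00AD\u00A0]")


def find_hidden(text: str) -> list[tuple[int, str, str]]:
    """Return (index, U+XXXX label, Unicode name) for every match in text."""
    return [
        (m.start(), f"U+{ord(m.group()):04X}", unicodedata.name(m.group(), "<unnamed>"))
        for m in HIDDEN.finditer(text)
    ]


sample = "a\u200bb\u00a0c"
for index, label, name in find_hidden(sample):
    print(index, label, name)
```

An empty result means none of the listed characters are present. It does not prove the text is free of every invisible code point Unicode defines.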

The practical approach for content work is to run all AI output through a cleaning tool before publishing — not because you know invisible characters are present, but because you cannot know they are not.

How to Remove Them

Paste your text into TextPurify and ensure the "Hidden chars" option is enabled (it is on by default). TextPurify removes all zero-width spaces, zero-width non-joiners, BOMs, soft hyphens, and other invisible Unicode from the U+200x range in a single pass. Non-breaking spaces (U+00A0) are converted to regular spaces rather than deleted outright, which preserves text flow while making the characters standard.
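
If you would rather script the same idea, the behavior described above (delete the invisible characters, turn non-breaking spaces into regular spaces) can be approximated as follows. This is a sketch under that assumption, not TextPurify's actual implementation; the clean function and its character set are illustrative.

```python
import re

# Characters deleted outright: U+200B to U+200D, the BOM, and the soft hyphen.
# Note that this range includes the zero-width joiner (U+200D).
_DELETE = re.compile(r"[\u200B-\u200D\uFEFF\u00AD]")


def clean(text: str) -> str:
    """Delete invisible characters, then convert non-breaking spaces to spaces."""
    text = _DELETE.sub("", text)
    return text.replace("\u00a0", " ")


dirty = "\ufeffzero\u200bwidth\u00a0space\u00ad"
print(repr(clean(dirty)))
```

Converting U+00A0 instead of deleting it keeps the words on either side separated, which matches the reasoning given above.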

For the other formatting issues that typically accompany zero-width characters in AI output — em-dashes, curly quotes, Markdown syntax — see our complete AI text cleaning checklist.

Frequently Asked Questions

Can I see zero-width characters with my eyes?

No — that is the defining property. They have no visible glyph. The only ways to know they are there are to use a tool that reveals them, compare character counts programmatically, or notice the downstream problems they cause.

Does every AI tool embed zero-width characters?

Not systematically. Any major AI tool can emit them occasionally as a byproduct of training on formatted web text. DeepSeek has been specifically reported to embed them as watermarks. ChatGPT and Claude output can contain them as a result of training data patterns.

Will removing zero-width characters break legitimate formatting?

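For standard prose, almost never. Zero-width characters are legitimately used in a few specific contexts: right-to-left script control, joiner and non-joiner handling in complex scripts, and zero-width joiners inside emoji sequences. If you are cleaning plain English prose for publishing, removal is safe. If the text contains multi-part emoji or passages in scripts that rely on joiners, keep those characters or review the result after cleaning.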
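
One case where removal does change what readers see is an emoji built from several code points joined by U+200D. The sketch below shows the effect and one possible way to opt out; the strip_hidden function and its keep parameter are hypothetical.

```python
ZWJ = "\u200d"  # zero-width joiner

# Woman + ZWJ + laptop renders as a single "woman technologist" emoji.
technologist = "\U0001F469" + ZWJ + "\U0001F4BB"

# Characters this sketch treats as removable (an assumed set).
REMOVABLE = {"\u200b", "\u200c", "\u200d", "\ufeff", "\u00ad"}


def strip_hidden(text: str, keep: frozenset[str] = frozenset()) -> str:
    """Remove the characters in REMOVABLE, except any listed in keep."""
    drop = REMOVABLE - keep
    return "".join(ch for ch in text if ch not in drop)


split_emoji = strip_hidden(technologist)                          # joiner removed
kept_emoji = strip_hidden(technologist, keep=frozenset({ZWJ}))    # joiner preserved

print(len(technologist), len(split_emoji), len(kept_emoji))
```

Without the joiner, the sequence falls apart into two separate emoji. Keeping U+200D avoids that, at the cost of also keeping any stray joiners elsewhere in the text.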

Do zero-width characters affect SEO?

Yes, potentially. Search engines index the text they receive, and how each one treats invisible characters is not guaranteed. A word split by a zero-width character may not match the query a user types. Removing the characters takes that risk away.
