UnicodeFix Levels Up: Metrics, Forensics, and Bracket Discipline
UnicodeFix Grew a Brain
Last time I wrote about UnicodeFix, it was in full CodExorcism mode: burn the smart quotes, banish zero-width daemons, normalize ellipses, and make Codex’s Unicode gremlins someone else’s problem.
Since then, the tool did what all good exorcists eventually do: it learned to take notes on the hauntings.
The current release line is:
UnicodeFix - CodExorcism Edition+ v1.1.3
Same mission (wipe AI breadcrumbs and Unicode trash), but now it can analyze what it just cleaned - and tell you how weird your text looked before the ritual.
From “Just Clean It” to “Show Me the Crime Scene”
The big change is that UnicodeFix is no longer just a firehose; it’s also a forensics report.
You still have the classic “make this text not suck” workflows:
# Filter mode (STDIN → STDOUT)
cat draft.md | cleanup-text > draft.clean.md
# Batch clean
cleanup-text *.txt
# In-place, with temp file safety
cleanup-text -t important.txt
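For a feel of what that cleaning pass involves, here is a minimal Python sketch of the same idea: strip a few invisible characters, then fold smart punctuation to ASCII. The `scrub` function and both tables are hypothetical and deliberately tiny. They are assumptions for illustration, not UnicodeFix’s actual rule set.

```python
import re

# Illustrative mapping only; the real tool's rules are not shown in this post.
SMART_PUNCT = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2026": "...",  # horizontal ellipsis
})

# Zero-width space, word joiner, and BOM / zero-width no-break space.
INVISIBLES = re.compile("[\u200b\u2060\ufeff]")


def scrub(text: str) -> str:
    """Drop a few invisible characters, then fold smart punctuation to ASCII."""
    return INVISIBLES.sub("", text).translate(SMART_PUNCT)


if __name__ == "__main__":
    print(scrub("\u201cHello\u201d\u200b world\u2026"))
```

The sketch leaves the zero-width joiner alone on purpose, since removing it can break emoji sequences.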
But now there’s a whole analysis layer bolted on top:
# Human-readable audit
cleanup-text --report notes.md
# JSON report (for scripts/CI)
cleanup-text --report --json notes.md
# Add experimental semantic metrics
cleanup-text --report --json --metrics essay.md
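If you want to consume the JSON report from a script, something like the following Python sketch could be a starting point. It assumes the report is the only thing written to stdout, and since the report schema is not shown here it simply returns whatever JSON comes back. `load_report` is a hypothetical helper name, not part of UnicodeFix.

```python
import json
import subprocess


def load_report(path, run=subprocess.run):
    """Run cleanup-text in JSON report mode and return the parsed result.

    Assumes stdout contains only the JSON document (an assumption, not
    something this post confirms).
    """
    proc = run(
        ["cleanup-text", "--report", "--json", path],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit may simply mean anomalies were found
    )
    return json.loads(proc.stdout)
```

The `run` parameter exists only so the function can be exercised with a stub when `cleanup-text` is not installed.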
What the Metrics Do (and Don’t Do)
Turn on --metrics and UnicodeFix will attach a metrics block to the report:
- Entropy / diversity
- ASCII vs Unicode ratios
- Repetition patterns
- A heuristic “AI-likeness” score
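To make the first two bullets concrete, here is a rough Python sketch of character-level Shannon entropy and an ASCII ratio. These are textbook definitions written for illustration. The function names are hypothetical, and UnicodeFix’s own metrics may be computed differently.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    # Sum of p * log2(1/p), with p = n / total for each distinct character.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def ascii_ratio(text: str) -> float:
    """Fraction of characters that are plain ASCII (1.0 for empty input)."""
    if not text:
        return 1.0
    return sum(ch.isascii() for ch in text) / len(text)
```

Low entropy hints at repetitive text, and a low ASCII ratio in a file that should be plain code is usually worth a second look.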
This is not a magical AI-detector. It’s a dashboard:
- “Why does this look machine-generated?”
- “Why is this file full of weird Unicode when the rest of the repo isn’t?”
- “Why did my CI suddenly start yelling at me?”
Pair it with:
- --metrics-help for a friendly legend (and ↑/↓ hints)
- --exit-zero to keep CI/pre-commit runs informative without breaking your flow
- --threshold when you do want to fail the pipeline if things get too spooky
Example:
# Warn, don't break commits
cleanup-text --report --metrics --exit-zero src/**/*.py
# Fail CI if anomalies cross your line in the sand
cleanup-text --report --threshold 1 docs/**/*.md
Unicode Discipline: Fullwidth Brackets & Friends
The 2025-11-15 update (v1.1.3) did a little Unicode housekeeping too:
- Fullwidth square brackets ［］ now fold to ASCII [] by default
  - keeps tables, code blocks, and terminal output aligned
- If you’re doing something visual/typographic and you want the fullwidth flavor:
cleanup-text --keep-fullwidth-brackets manuscript.md
The dagger † (e.g., for footnote markers) is left alone.
We’re exorcising daemons, not your typography.
There’s also a small helper for display-only folding if you want to keep your source text intact but render it in ASCII for terminals:
unicodefix.transforms.fold_for_terminal_display(text)
Use it when you want clean-looking logs without mutating the original content.
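The folding itself is easy to picture. Below is a small Python sketch of the idea using `str.translate`. `fold_brackets` and its `keep_fullwidth` parameter are hypothetical stand-ins written for this post. This is not the code behind `fold_for_terminal_display` or `--keep-fullwidth-brackets`.

```python
# Hypothetical stand-in, not the real unicodefix.transforms implementation.
FULLWIDTH_BRACKETS = str.maketrans({
    "\uff3b": "[",  # FULLWIDTH LEFT SQUARE BRACKET
    "\uff3d": "]",  # FULLWIDTH RIGHT SQUARE BRACKET
})


def fold_brackets(text: str, keep_fullwidth: bool = False) -> str:
    """Fold fullwidth square brackets to ASCII unless asked to keep them."""
    if keep_fullwidth:
        return text
    return text.translate(FULLWIDTH_BRACKETS)


if __name__ == "__main__":
    source = "see \uff3b1\uff3d\u2020"
    # Display-only use: print the folded copy; `source` itself is unchanged.
    print(fold_brackets(source))
```

Because Python strings are immutable, the original text survives untouched, which is the whole point of folding for display only. The dagger passes through because it is not in the table.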
Test Harness: Less Ceremony, More Coverage
The test suite grew up too.
Instead of a thicket of bespoke scripts, there’s now a single entry point:
tests/test_all.sh
What it does:
- Builds its run list automatically from data/
- Exercises all the main scenarios:
  - default clean
  - invisible-preserving (-i)
  - no-newline (-n)
  - custom output (-o)
  - in-place (-t / -t -p)
  - STDIN/STDOUT filter mode
- Drops diffs and word-count comparisons into test_output/<scenario>/
Binary fixtures are sensibly skipped in STDIN/STDOUT mode so the run doesn’t explode on UTF-8 decoding.
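The harness itself is a shell script, but the reason for the skip is easy to show in Python: bytes that are not valid UTF-8 raise as soon as you try to decode them. The check below is a hypothetical sketch of that kind of guard, not the logic test_all.sh actually uses.

```python
def is_utf8_text(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    png_magic = b"\x89PNG\r\n\x1a\n"  # the first bytes of any PNG file
    for name, blob in [("text", "caf\u00e9".encode("utf-8")), ("png", png_magic)]:
        status = "run" if is_utf8_text(blob) else "skip"
        print(f"{name}: {status}")
```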
Cleanup is one command away:
tests/test_all.sh clean
If you’re wiring UnicodeFix into CI, this harness is basically your paranoia button: push it before you publish, push it before you grade, push it before you merge PRs that smell like AI copy-paste.
Install / Upgrade in 60 Seconds
The installer path also got cleaned up and clarified.
Fresh install:
git clone https://github.com/unixwzrd/UnicodeFix.git
cd UnicodeFix
./setup.sh # creates a venv and installs dependencies
# Standard usage
pip install .
# Dev/editable mode
pip install -e .
# With optional NLP metrics support
pip install .[nlp]
The README and setup.sh now actually explain what just happened, how to activate the env, and how to get from “clone repo” to “cleanup-text on PATH” without spelunking.
Real-World Use Cases (Now With Forensics)
Students
- Clean AI-assisted code before handing it in
- Run --report --metrics on your own work to understand why it looks suspicious
- Use --exit-zero so your local checks warn you without blocking the compile/push cycle
Devs
- Put cleanup-text --report --threshold N into CI to catch sketchy Unicode in PRs
- Run cleanup-text -t on config files, docs, and scripts cloned from random Gists
- Keep logs / emails / release notes looking hand-typed instead of LLM-sprayed
Writers / Bloggers
- Use -Q and -D to keep smart quotes and em dashes where you actually want them
- Let UnicodeFix handle the invisible junk, weird spaces, and stray AI fingerprints
- Run --report on drafts to see where editors/LLMs have been “helpful” behind your back
TL;DR
Since the last CodExorcism post, UnicodeFix:
- Added analytics: --report, --json, --metrics, --metrics-help, --threshold, --exit-zero
- Slimmed and hardened the test suite: tests/test_all.sh drives everything from data/
- Tightened Unicode behavior: default folding for fullwidth square brackets with an escape hatch flag
- Polished the installer and docs so the path from “clone” to “clean” is straightforward
Same attitude, sharper tools.
If you’re already using UnicodeFix, pull the latest and run your usual workflows with --report once - it’s weirdly satisfying to watch your own files get roasted in the terminal.
If you’re new:
- Repo: https://github.com/unixwzrd/UnicodeFix
- Shortcut, tests, docs, and screenshots are all in there.
The Unicode arms race isn’t slowing down. At least your side has better metrics now.