
> it appears both homoglyph substitution and zero-width fingerprinting have been discovered by others

They were discovered a long time ago, and there are many other ways to hide data in documents.

Remember that you only need to encode enough bits for a relatively unique ID (not unique across all files in existence, just among copies of the same content; for a low-distribution file, even 2-5 bits might be enough). At the application level, the most common applications and formats (e.g., Word, Excel, the PDF standard) offer a very large number of features you can use to encode or simply insert data. At the bit level, unless the vendor has invested in writing exceptionally tight, secure code, you can probably find somewhere to hide or encode a few bits in a file.
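To make the "few bits" point concrete, here's a minimal sketch of the zero-width fingerprinting idea: a small recipient ID appended as invisible characters. The function names and the 5-bit width are my own illustration, not any real tool's API:

```python
# Sketch: fingerprint shared text by encoding a small recipient ID
# in zero-width characters. Illustration only, not a hardened tool.
ZW0 = "\u200b"  # zero-width space      -> bit 0
ZW1 = "\u200c"  # zero-width non-joiner -> bit 1

def embed_id(text: str, recipient_id: int, bits: int = 5) -> str:
    """Append recipient_id as `bits` invisible characters (LSB first)."""
    payload = "".join(
        ZW1 if (recipient_id >> i) & 1 else ZW0 for i in range(bits)
    )
    return text + payload

def extract_id(text: str, bits: int = 5) -> int:
    """Recover the ID from the last `bits` zero-width characters."""
    tail = [c for c in text if c in (ZW0, ZW1)][-bits:]
    return sum(1 << i for i, c in enumerate(tail) if c == ZW1)
```

Five bits distinguish 32 recipients, which is plenty for a low-distribution file, and the visible text is unchanged.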

But I think the author is on the right track with the solutions ...

> Use a tool that strips non-whitelisted characters from text before sharing it with others.

A more general solution is needed: Something that normalizes data in many formats, from text to Word to PDF to JPG to WAV to markup languages.

Personally, for non-security reasons, I'd love a utility that normalizes text to 7-bit ASCII (e.g., mapping UTF-8 characters outside the 7-bit range to ASCII equivalents) and that fits efficiently into a workflow (e.g., something that normalizes the clipboard when I press a hotkey combination). Anybody know of one?
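The core of such a utility is a one-liner in Python's standard library; a rough sketch (the function name is mine, and "best effort" means unmappable characters are simply dropped rather than transliterated):

```python
import unicodedata

def to_ascii(text: str) -> str:
    """Best-effort normalization to 7-bit ASCII: decompose accented
    characters (NFKD), then drop anything above code point 127,
    which also removes zero-width and homoglyph characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

For the hotkey-plus-clipboard part you'd wire this to something like a clipboard library and a global hotkey daemon, which is OS-specific; the normalization itself is the portable piece.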


