A possible way to handle Unicode stealth

published May 27, 2015 09:05 by admin ( last modified May 27, 2015 09:06 )

I haven't tried this, so I do not know if it is in any way practical, but wouldn't it be possible to classify each text as being written in a single language? If so, you could filter out any characters or glyph combinations that aren't native to that language.
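To make the idea concrete, here is a minimal, untested-in-production sketch in Python. It assumes the language of the text is already known, and it guesses a letter's script from the first word of its Unicode character name, since the standard library has no script lookup. The `NATIVE_SCRIPTS` table and the function names are made up for illustration.

```python
import unicodedata

# Hypothetical table: for each language code, the script prefixes (as they
# appear at the start of Unicode character names) treated as native to it.
NATIVE_SCRIPTS = {
    "en": ("LATIN",),
    "ru": ("CYRILLIC",),
    "el": ("GREEK",),
}

def is_native(ch, lang):
    """Decide whether a single character may stay in a text of this language."""
    category = unicodedata.category(ch)
    if category.startswith("L"):
        # Letters must belong to one of the language's scripts.
        return unicodedata.name(ch, "").startswith(NATIVE_SCRIPTS[lang])
    # Drop combining marks (M*) and invisible format characters (Cf); keep
    # digits, punctuation, spaces and line breaks, which languages share.
    return not (category.startswith("M") or category == "Cf")

def filter_foreign(text, lang):
    """Return text with every character that is not native to lang removed."""
    return "".join(ch for ch in text if is_native(ch, lang))

# A Cyrillic "a" (U+0430) hidden in an otherwise Latin word is removed.
print(filter_foreign("p\u0430ypal", "en"))
```

Note that this sketch also drops legitimate loanword spellings and any accents stored as separate combining marks, so a real filter would probably normalize the text first and allow per-language exceptions.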

It would be a bit like how markup languages such as Markdown, reStructuredText, Creole and others have largely supplanted HTML for creating rich text documents on Wikipedia, GitHub, in Python documentation and many other places.

It's pretty easy to have two different Unicode strings display identical output - and that can cause a whole host of problems. For instance, many family-friendly sites may ban foul language from user comments, but it's trivial to come up with Unicode-equivalent strings that bypass any blacklist of obscene language.
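A small Python sketch of the problem, using a harmless placeholder word and a made-up one-entry blacklist. It shows a naive substring check being bypassed, and how far the standard library's `unicodedata.normalize` with the `NFKC` form gets you. The function names are hypothetical.

```python
import unicodedata

# Placeholder blacklist; a real site would have its own list.
BLACKLIST = {"darn"}

def naive_is_banned(comment):
    """Plain substring match on the lowercased comment."""
    lowered = comment.lower()
    return any(word in lowered for word in BLACKLIST)

def normalized_is_banned(comment):
    """Same match, after folding compatibility forms and case."""
    folded = unicodedata.normalize("NFKC", comment).casefold()
    return any(word in folded for word in BLACKLIST)

# "darn" written with fullwidth letters (U+FF44 U+FF41 U+FF52 U+FF4E).
fullwidth = "\uff44\uff41\uff52\uff4e"
# "darn" with the Latin "a" swapped for the Cyrillic "a" (U+0430).
cyrillic = "d\u0430rn"

print(naive_is_banned(fullwidth))       # slips past the naive check
print(normalized_is_banned(fullwidth))  # caught after NFKC normalization
print(normalized_is_banned(cyrillic))   # still slips past
```

Normalization handles compatibility variants such as fullwidth letters, but a cross-script look-alike like the Cyrillic "a" is a distinct character as far as Unicode is concerned, so it survives normalization and needs some other check.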