wrp 4 days ago

I need to call out a myth about UTF-8. Tools built to assume UTF-8 are not backwards compatible with ASCII. An encoding INCLUDES but also EXCLUDES. When a tool is set to use UTF-8, it will process an ASCII stream, but it will not filter out non-ASCII.

I still use some tools that assume ASCII input. For many years now, Linux tools have been removing the option to set ASCII as the default encoding, leaving UTF-8 as the only relevant choice. This has caused me extra work: whenever the data processing chain passes through these tools, I have to manually inspect the data for any non-ASCII noise that has been introduced. I mostly use those older tools on Windows now, because most Windows tools still let you set ASCII as the default.
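
To make the asymmetry concrete, here is a minimal Python sketch, not taken from any tool mentioned in this thread. It shows that a UTF-8 decoder accepts input that a strict ASCII decoder rejects, and it automates the manual inspection step with a byte scan. The helper name `find_non_ascii` and the sample string are assumptions made up for illustration.

```python
def find_non_ascii(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, byte value) for every byte outside the 7-bit ASCII range."""
    return [(offset, byte) for offset, byte in enumerate(data) if byte > 0x7F]


# Hypothetical sample: one accented character slipped into otherwise ASCII data.
sample = "caf\u00e9 menu".encode("utf-8")

# A UTF-8 consumer decodes this without complaint...
decoded = sample.decode("utf-8")

# ...while a strict ASCII consumer rejects it.
try:
    sample.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# The scan reports where the non-ASCII bytes sit.
hits = find_non_ascii(sample)
for offset, byte in hits:
    print(f"offset {offset}: 0x{byte:02X}")
```

A check like this could sit at the boundary between a UTF-8 stage and an ASCII-only stage, so the pipeline fails loudly instead of passing the bytes along.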

account42 a day ago

Do you have an actual example where this causes an issue? "ASCII" tools mostly just passed along non-ASCII bytes unchanged even before UTF-8.

int_19h 3 days ago

The usual statement isn't that UTF-8 is backwards compatible with ASCII (it's obvious that any 8-bit encoding wouldn't be; that's why we have UTF-7!). It's that UTF-8 is backwards compatible with tools that are 8-bit clean.
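
A rough Python sketch of what "8-bit clean" buys you, under the assumption of a tool that only splits on ASCII delimiters and otherwise passes bytes through untouched. The record contents are invented for illustration. It relies on two properties of UTF-8: pure ASCII text encodes to the same bytes, and every byte of a multi-byte sequence has its high bit set, so it can never be mistaken for an ASCII delimiter.

```python
# Pure ASCII text produces identical bytes under both encodings.
ascii_text = "key=value\n"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Hypothetical record containing two non-ASCII characters.
record = "name=Zo\u00eb,city=K\u00f6ln\n".encode("utf-8")

# Every byte belonging to a multi-byte sequence is >= 0x80.
high_bytes = [byte for byte in record if byte > 0x7F]

# A byte-oriented tool that splits on the ASCII comma never cuts a
# multi-byte character in half, so each field still decodes cleanly.
fields = record.rstrip(b"\n").split(b",")
decoded_fields = [field.decode("utf-8") for field in fields]

print(same_bytes)
print([hex(byte) for byte in high_bytes])
print(decoded_fields)
```

This only shows that byte-transparent tools keep working. It says nothing about tools that reject or mangle bytes above 0x7F, which is the case the parent comment is concerned with.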

wrp 3 days ago

Yes, the myth I was pointing out is based on loose terminology. It needs to be made clear that "backwards compatible" means that UTF-8 based tools can receive valid ASCII but are not constrained to emit it. I see a lot of comments implying that UTF-8 can interact with an ASCII ecosystem without causing problems. Even worse, it seems most Linux developers believe there is no longer a need to provide a default ASCII setting if they have UTF-8.

kccqzy 4 days ago

That's not a myth about UTF-8. That's a decision by tools not to support pure ASCII.