Why “a caret, euro, trademark” â€™ in a file?

from John D. Cook on 2024-01-12 02:18 (#6HSPX)

Why might you see a in the middle of an otherwise intelligible file? The reason is very similar to the reason you might see , which I explained in the previous post. You might want to read that post first if you're not familiar with Unicode and character encodings.

It all has to do with an encoding error, probably. Not necessarily, since, for example, I deliberately put â€™ in the opening sentence. But assuming it is an error, it's likely an encoding error.

But it's the opposite of the error. The occurs when non- UTF-8 text has been declared (or implicitly interpreted as) Unicode. In particular, you can run into this error if text encoded in ISO 8859-1 is interpreted as as UTF-8.
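Here is a minimal Python sketch of how that error can arise. The sample word "café" is just an illustrative assumption, and `errors="replace"` stands in for what a lenient text viewer typically does with invalid bytes.

```python
# Illustrative sample text (an assumption, not from the post).
text = "caf\u00e9"  # "café"

# Encode as ISO 8859-1: the é becomes the single byte 0xE9.
raw = text.encode("iso-8859-1")

# Decode those bytes as UTF-8. A lone 0xE9 is not valid UTF-8,
# so the decoder substitutes U+FFFD, the replacement character.
decoded = raw.decode("utf-8", errors="replace")

print(raw)      # b'caf\xe9'
print(decoded)  # caf followed by the replacement character
```

Without `errors="replace"`, Python would raise a `UnicodeDecodeError` instead of silently substituting a character.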

The â€™ sequence is usually the opposite: UTF-8 encoded text is being interpreted as Windows-1252 (a.k.a. CP-1252) encoded text. In particular, a single quote (U+2019) encoded in UTF-8 has been interpreted as the Windows-1252 text â€™.
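The three-character sequence can be reproduced in a few lines of Python. This is only a sketch of the mechanism; the variable names are mine.

```python
# U+2019 is the curly single quote.
quote = "\u2019"

# In UTF-8 it takes three bytes: E2 80 99.
utf8_bytes = quote.encode("utf-8")

# Read those three bytes as Windows-1252 and each byte becomes
# its own character: 0xE2 is a-circumflex, 0x80 is the euro sign,
# and 0x99 is the trademark sign.
mojibake = utf8_bytes.decode("cp1252")

print(utf8_bytes.hex(" "))  # e2 80 99
print(mojibake)             # the a-circumflex, euro, trademark sequence
```

That accounts for the caret (the circumflex on the a), the euro, and the trademark in the title.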

Windows-1252 is a superset of IDO 8859-1, the error resulting in could also be described as a Windows-1252 error. So a means Windows-1252 text has been interpreted as UTF-8, and a means UTF-8 has been interpreted as Windows-1252. In the former case there is an invalid character. In the latter case all the characters are valid, though they're not the characters you were supposed to see.

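You can fix the error by making your content and your encoding match: either declare the encoding the file actually uses, or convert the file to the encoding you declare. Or remove the offending character, replacing the curly single quote ’ (U+2019) with a plain ASCII apostrophe (').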
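If the damage is already in the text, one possible repair is to reverse the mistaken decoding. The sketch below uses a hypothetical helper, `repair_mojibake`, and assumes the text was UTF-8 that got read as Windows-1252 exactly once. It is not a general-purpose fixer.

```python
def repair_mojibake(s: str) -> str:
    """Undo one round of UTF-8-read-as-Windows-1252, if possible.

    Hypothetical helper for illustration. If the string does not
    round-trip cleanly, it is returned unchanged.
    """
    try:
        # Recover the original bytes, then decode them correctly.
        return s.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake; leave the text alone.
        return s


damaged = "It\u00e2\u20ac\u2122s"   # "It" + mojibake + "s"
print(repair_mojibake(damaged))     # It’s

# Text that is already correct passes through untouched.
print(repair_mojibake("It\u2019s"))  # It’s
```

One caveat: Python's `cp1252` codec leaves a few byte values undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), so mojibake involving those bytes will not round-trip this way and the helper will return the input as is.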

You can find more details in this Stack Overflow post.

The post Why “a caret, euro, trademark” â€™ in a file? first appeared on John D. Cook.

Source	RSS or Atom Feed
Feed Location	http://feeds.feedburner.com/TheEndeavour?format=xml
Feed Title	John D. Cook
Feed Link	https://www.johndcook.com/blog