Excel, R, and Unicode

by
John
from John D. Cook on (#4PVQY)

I received some data as an Excel file recently. I cleaned things up a bit, exported the data to a CSV file, and read it into R. Then something strange happened.

Say the CSV file looked like this:

 foo,bar
 1,2
 3,4

I read the file into R with

 df <- read.csv("foobar.csv", header=TRUE)

and could access the second column as df$bar but could not access the first column as df$foo. What's going on?

When I ran names(df) it showed me that the first column was named not foo but ï..foo. I opened the CSV file in a hex editor and saw this:

 efbb bf66 6f6f 2c62 6172 0d0a 312c 320d

The ASCII code for f is 0x66, o is 0x6f, etc. and so the file makes sense, starting with the fourth byte.
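The same check can be done without leaving R. Here is a minimal sketch, assuming a hypothetical helper named has_utf8_bom (not from the original post) and a temporary file that recreates the start of the hex dump above:

```r
# Hypothetical helper (not from the post): TRUE if the file starts with EF BB BF.
has_utf8_bom <- function(path) {
  first <- readBin(path, what = "raw", n = 3L)
  identical(first, as.raw(c(0xEF, 0xBB, 0xBF)))
}

# Recreate the example file byte for byte: three leading bytes, then the CSV text
# with Windows (CR LF) line endings, as in the hex dump.
demo_path <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("foo,bar\r\n1,2\r\n3,4\r\n")),
         demo_path)

has_utf8_bom(demo_path)            # TRUE for this file
readBin(demo_path, "raw", n = 8L)  # ef bb bf 66 6f 6f 2c 62
```

Reading the file as raw bytes sidesteps any encoding guesswork, which is the same reason a hex editor was useful here.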

If you saw my post about Unicode the other day, you may have seen Daniel Lemire's comment:

There are various byte-order masks like EF BB BF for UTF-8 (unused).

Aha! The first three bytes of my data file are exactly the bytes Daniel mentioned: EF BB BF, the UTF-8 byte order mark. These bytes are intended to announce that the file should be read as UTF-8, a way of encoding Unicode that is equivalent to ASCII if the characters in the file are in the range of ASCII.

Now we can see where the funny characters in front of "foo" came from. Instead of interpreting EF BB BF as a byte order mark, R interpreted the first byte 0xEF as U+00EF, "Latin Small Letter I with Diaeresis." I don't know how BB and BF became periods (U+002E). But if I dump the file to a Windows command prompt, I see the first line as

 ï»¿foo,bar

with the first three characters being the Unicode characters U+00EF, U+00BB, and U+00BF.
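The two readings of those three bytes can be compared directly. The sketch below assumes the misreading amounts to a Latin-1-style interpretation, one character per byte; the variable names are only for illustration.

```r
bom_bytes <- as.raw(c(0xEF, 0xBB, 0xBF))

# Reading 1: decode the three bytes as UTF-8. Together they form one code point.
as_utf8 <- rawToChar(bom_bytes)
Encoding(as_utf8) <- "UTF-8"
utf8ToInt(as_utf8) == 0xFEFF              # TRUE

# Reading 2: treat each byte as its own Latin-1 character, giving three code points.
as_latin1 <- iconv(rawToChar(bom_bytes), from = "latin1", to = "UTF-8")
sprintf("U+%04X", utf8ToInt(as_latin1))   # "U+00EF" "U+00BB" "U+00BF"
```

Read as UTF-8, the bytes are a single invisible character. Read one byte at a time, they become three visible ones.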

How to fix the encoding problem with R? The read.csv function has an optional encoding parameter. I tried setting this parameter to "utf-8" and "utf8". Neither made any difference. I looked at the R documentation, and it seems I need to set it to "UTF-8". When I did that, the name of the first column became X.U.FEFF.foo [1]. I don't know what's up with that, except FEFF is the byte order mark (BOM) I mentioned in my Unicode post.
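A likely explanation for the odd column name, offered as an assumption rather than something the post confirms: read.csv has check.names=TRUE by default, so header names are passed through make.names, which prepends an X when a name does not start with a letter or a dot and replaces every invalid character with a period. If R had rendered the BOM as the escape <U+FEFF> before that step, the result would be exactly the name seen above. The same mechanism may account for the periods in the earlier column name, though that case depends on the locale.

```r
# make.names() is what read.csv applies to headers when check.names = TRUE.
make.names("<U+FEFF>foo")   # "X.U.FEFF.foo"
make.names("foo bar")       # "foo.bar"

# The effect on a small in-memory CSV, with and without name checking.
names(read.csv(text = "foo bar,baz\n1,2"))                       # "foo.bar" "baz"
names(read.csv(text = "foo bar,baz\n1,2", check.names = FALSE))  # "foo bar" "baz"
```

Setting check.names=FALSE keeps the raw header text, which can help when diagnosing a mangled name, but it does not remove the BOM itself.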

Apparently my troubles started when I exported my Excel file as CSV UTF-8. I converted the UTF-8 file to ASCII using Notepad and everything worked. I also could have saved the file directly to ASCII. If you look at the list of Excel export options, you'll first see CSV UTF-8 (that's why I picked it) but if you go further down you'll see an option that's simply CSV, implicitly in ASCII.

Unicode is great when it works. This blog is Unicode encoded as UTF-8, as are most pages on the web. But then you run into weird things like the problem described in this post. Does the fault lie with Excel? With R? With me? I don't know, but I do know that the problem goes away when I stick to ASCII.

***

[1] A couple people pointed out in the comments that you could use fileEncoding="UTF-8-BOM" to fix the problem. This works, though I didn't see it in the documentation the first time. The read.csv function takes an encoding parameter that appears to be for this purpose, but is a decoy. You need the fileEncoding parameter. With enough persistence you'll eventually find that "UTF-8-BOM" is a possible value for fileEncoding.
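Here is a minimal sketch of that fix. The temporary file stands in for foobar.csv and is built to match the bytes in the hex dump; the path and variable names are assumptions for illustration.

```r
# Stand-in for foobar.csv: UTF-8 BOM followed by the CSV text with CR LF line endings.
csv_path <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("foo,bar\r\n1,2\r\n3,4\r\n")),
         csv_path)

# fileEncoding = "UTF-8-BOM" tells R to decode the file as UTF-8 and drop a leading BOM.
df <- read.csv(csv_path, header = TRUE, fileEncoding = "UTF-8-BOM")
names(df)   # "foo" "bar"
df$foo      # 1 3
```

The difference between the two parameters, per the R documentation: encoding only marks the strings that were read as being in a given encoding, while fileEncoding re-encodes the file as it is read.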
