ETAOIN SHRDLU and all that


from John D. Cook on 2016-09-02 14:32 (#1SFMJ)

Statistics can be useful, even if its idealizations fall apart on close inspection.

For example, take English letter frequencies. These frequencies are fairly well known. E is the most common letter, followed by T, then A, etc. The string of letters "ETAOIN SHRDLU" comes from the days of Linotype, when the letters on the machine's keyboard were arranged in that order, in decreasing order of frequency. Sometimes you'd see ETAOIN SHRDLU in print, just as you might see "QWERTY" today.

Morse code is also based on English letter frequencies. The length of a letter in Morse code varies approximately inversely with its frequency, a sort of precursor to Huffman encoding. The most common letter, E, is a single dot, while the rarer letters like J and Q have a dot and three dashes. (So does Y, even though it occurs more often than some letters with shorter codes.)
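To make the relationship between frequency and code length concrete, here is a minimal sketch. It assumes the International Morse code table and the usual timing convention (a dot lasts one unit, a dash three units, with a one-unit gap between the symbols inside a letter). The names `MORSE` and `duration` are made up for this illustration.

```python
# International Morse code for the 26 letters (assumed table for this sketch).
MORSE = {
    "A": ".-",   "B": "-...", "C": "-.-.", "D": "-..",  "E": ".",
    "F": "..-.", "G": "--.",  "H": "....", "I": "..",   "J": ".---",
    "K": "-.-",  "L": ".-..", "M": "--",   "N": "-.",   "O": "---",
    "P": ".--.", "Q": "--.-", "R": ".-.",  "S": "...",  "T": "-",
    "U": "..-",  "V": "...-", "W": ".--",  "X": "-..-", "Y": "-.--",
    "Z": "--..",
}


def duration(letter):
    """Time units needed to send one letter.

    Assumes dot = 1 unit, dash = 3 units, and a 1-unit gap
    between consecutive symbols within the letter.
    """
    code = MORSE[letter.upper()]
    symbol_units = sum(1 if symbol == "." else 3 for symbol in code)
    gap_units = len(code) - 1
    return symbol_units + gap_units


if __name__ == "__main__":
    # List letters from quickest to slowest to send.
    for letter in sorted(MORSE, key=lambda c: (duration(c), c)):
        print(letter, MORSE[letter], duration(letter))
```

Under these assumptions E takes a single unit while J, Q, and Y each take 13, so sorting by duration gives an order that resembles, but does not exactly match, the frequency order.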

One letter has worn off my keyboard

So how frequently does the letter E, for example, appear in English? That depends on what you mean by English. You can count how many times it appears, for example, in a particular edition of A Tale of Two Cities, but that isn't the same as its frequency in English. And if you'd picked the novel Gadsby instead of A Tale of Two Cities, you'd get very different results, since that book was written without using a single letter E.
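Counting letters in one particular text is easy to do, which is part of why it is tempting to mistake the result for "the" frequency in English. A minimal sketch, assuming we only care about the 26 unaccented letters and ignore case (the function name `letter_frequencies` is hypothetical):

```python
from collections import Counter
import string


def letter_frequencies(text):
    """Return {letter: relative frequency} for the letters A-Z in text.

    Case is ignored; digits, punctuation, whitespace, and non-ASCII
    characters are skipped. Returns an empty dict if there are no letters.
    """
    letters = [c.upper() for c in text if c in string.ascii_letters]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {letter: n / total for letter, n in counts.items()}


if __name__ == "__main__":
    sample = "It was the best of times, it was the worst of times"
    freqs = letter_frequencies(sample)
    # Print letters from most to least frequent in this sample.
    for letter in sorted(freqs, key=lambda c: (-freqs[c], c)):
        print(f"{letter} {freqs[letter]:.2%}")
```

Running this on a different text, or on a longer excerpt of the same text, will generally give different numbers, which is the point of the paragraph above.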

Peter Norvig reports that E accounted for 12.49% of English letters in his analysis of the Google corpus. That's a better answer than just looking at Gadsby, or even A Tale of Two Cities, but it's still not English.

What might we mean by "English" when discussing letter frequency? Written or spoken English? Since when? American, British, or worldwide? If you mean blog articles, I've altered the statistics from what they were a moment ago by publishing this. Introductory statistics books avoid this kind of subtlety by distinguishing between samples and populations, but in this case the population isn't a fixed thing. When we say "English" as a whole we have in mind some idealization that strictly speaking doesn't exist.

If we want to say, for example, what the frequency of the letter E is in English as a whole, not some particular English corpus, we can't give an answer that is meaningful to many decimal places. Nor can we say, for example, which letter is the 18th most frequent. Context could easily change the second decimal place in a letter's frequency or, among the less common letters, its frequency rank.

And yet, for practical purposes we can say E is the most common letter, then T, etc. We can design better Linotype machines and telegraphy codes using our understanding of letter frequency. At the same time, we can't expect too much of this information. Anyone who has worked a cryptogram puzzle knows that you can't say with certainty that the most common letter in a particular sample must correspond to E, the next to T, etc.
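The cryptogram caveat can be illustrated with a short sketch. It ranks the letters in a sample by frequency and pairs them, in order, with ETAOIN SHRDLU. The function name `naive_guess` and the alphabetical tie-breaking rule are assumptions made for this illustration, not a standard cryptanalysis method.

```python
from collections import Counter
import string

# Reference order taken from the phrase discussed above (12 letters only).
REFERENCE_ORDER = "ETAOINSHRDLU"


def naive_guess(sample):
    """Pair each letter in sample with a reference letter of the same rank.

    Letters are ranked by decreasing count, ties broken alphabetically.
    Only the 12 most frequent letters in the sample receive a guess.
    """
    counts = Counter(c.upper() for c in sample if c in string.ascii_letters)
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    return dict(zip(ranked, REFERENCE_ORDER))


if __name__ == "__main__":
    # Unencrypted text: a perfect guess would map every letter to itself.
    sample = "It was the best of times, it was the worst of times"
    for letter, guess in naive_guess(sample).items():
        print(letter, "->", guess)
```

In this short sample T is the most common letter, so the naive rule pairs T with E and gets even the first letter wrong.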

By the way, Peter Norvig's analysis suggests that ETAOIN SHRDLU should be updated to ETAOIN SRHLDCU.
