Initial letter frequency
I needed to know the frequencies of letters at the beginning of words for a project. The overall frequency of letters, wherever they appear in a word, is well known. Tables of initial letter frequencies are harder to find, so I did a little experiment.
I downloaded the Canterbury Corpus and looked at the frequency of initial letters in a couple of the files in the corpus. I first tried a different approach, then realized a shell one-liner [1] would be simpler and less error-prone.
cat alice29.txt | lc | grep -o '\b[a-z]' | sort | uniq -c | sort -rn
This shows that the letters in descending order of frequency at the beginning of a word are t, a, s, ..., j, x, z.
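Since lc is not a standard utility, the same count can be sketched with common tools only. The function name initial_letter_counts is made up for illustration, and the pipeline assumes that every run of letters is a word, so it can disagree with \b for tokens that mix letters with digits or underscores.

```shell
# Hypothetical helper, not from the original post.
# Prints "letter count" pairs, most frequent initial letter first.
initial_letter_counts() {
    tr '[:upper:]' '[:lower:]' < "$1" |    # fold to lower case (stands in for lc)
        tr -cs 'a-z' '\n' |                # assumption: any run of letters is a word
        cut -c1 |                          # keep the first letter of each word
        grep '[a-z]' |                     # drop empty lines
        sort | uniq -c |                   # count each letter
        sort -k1,1nr -k2,2 |               # by count descending, ties alphabetical
        awk '{ print $2, $1 }'
}
```

Calling initial_letter_counts alice29.txt should give roughly the same ranking as the one-liner above, though the counts may not match exactly because of the different definition of a word.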
The file alice29.txt is the text of Alice's Adventures in Wonderland. Then for comparison I ran the same script on another file, lcet10.txt, a lengthy report from a workshop on electronic texts.
This technical report's initial letter frequencies order the alphabet as t, a, o, ..., y, z, x. So starting with the third letter, the two files rank initial letters differently.
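Raw counts from files of different lengths are not directly comparable, so one way to put the two files on a common scale is to convert counts to percentages. The sketch below is only an illustration: the name initial_letter_percent is hypothetical, and it again assumes that every run of letters counts as a word.

```shell
# Hypothetical helper, not from the original post.
# Prints "letter percent" pairs in alphabetical order.
initial_letter_percent() {
    tr '[:upper:]' '[:lower:]' < "$1" |    # fold to lower case
        tr -cs 'a-z' '\n' |                # assumption: any run of letters is a word
        cut -c1 |                          # first letter of each word
        grep '[a-z]' |                     # drop empty lines
        sort | uniq -c |                   # count each letter
        awk '{ count[$2] = $1; total += $1 }
             END {
                 for (letter in count)
                     printf "%s %.1f\n", letter, 100 * count[letter] / total
             }' |
        sort                               # alphabetical, so two outputs line up
}
```

Because the output is sorted by letter, two files can be set side by side with join. In bash, for example, join <(initial_letter_percent alice29.txt) <(initial_letter_percent lcet10.txt) lists each letter that appears in both files with its two percentages.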
I made the following plot to visualize how the frequencies differ. The horizontal axis is sorted by overall letter frequency (based on the Google corpus summarized here).
I expected the initial letter frequencies to differ from overall letter frequencies, but I did not expect the two corpora to differ.
Apparently initial letter frequencies vary more across corpora than overall letter frequencies. The following plot shows the overall letter frequencies for both corpora, with the horizontal axis again sorted by the frequency in the Google corpus.
Here the two corpora essentially agree with each other and with the Google corpus. The tech report ranks letters in essentially the same order as the Google corpus, which is why the orange dashed line is mostly decreasing, though there is a curious kink in the graph at c.
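Overall letter counts like those compared here can be obtained by matching every letter rather than only the first letter of each word. This is a sketch under assumptions, not necessarily how the plot was produced: the name letter_counts is hypothetical, and only unaccented a through z are counted.

```shell
# Hypothetical helper, not from the original post.
# Prints "letter count" pairs for all letters, most frequent first.
letter_counts() {
    tr '[:upper:]' '[:lower:]' < "$1" |    # fold to lower case
        LC_ALL=C grep -o '[a-z]' |         # one letter per line; C locale keeps a-z literal
        sort | uniq -c |                   # count each letter
        sort -k1,1nr -k2,2 |               # by count descending, ties alphabetical
        awk '{ print $2, $1 }'
}
```

Running letter_counts lcet10.txt and reading down the first column gives the file's ranking of letters, which can then be compared with a published ordering.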
[1] The lc function converts its input to lower case. See this post for how to install and use the function.
The post Initial letter frequency first appeared on John D. Cook.