Article 6AWD0 Tradeoff between alphabet size and word size

Tradeoff between alphabet size and word size

by
John
from John D. Cook on (#6AWD0)
Literal alphabets

Natural language alphabets are all within an order of magnitude of the size of the Roman alphabet. The Hebrew alphabet has a few less letters and Russian has a few more. The smallest alphabet I'm aware of is Hawaiian with 13 letters.

Syllabaries are larger than alphabets, but not an order of magnitude larger. For example, Japanese uses 46 symbols and the Cherokee uses 85.

There is a tradeoff between alphabet size and word size that constrains alphabets to be roughly the same size in every culture.

Metaphorical alphabets

Computer science generalizes the term alphabet to mean any fixed collection of symbols and generalizes word to mean any finite sequence of symbols from an alphabet. So, for example, the digits 0 through 9 form an alphabet in this sense, and the words are numbers.

Numbers are sometimes written using smaller or larger alphabets. Binary notation uses an alphabet of only 2 symbols, but numbers require significantly more symbols to express than they do in decimal. Hexadecimal uses an alphabet of 16 symbols and is more compact than decimal notation.

Tradeoffs

Many design problems can be formulated as a tradeoff between alphabet length and word length. I was working on such a problem recently, which motivated this blog post.

Larger alphabets mean shorter words, but a linear reduction in word size requires an exponential increase in alphabet size. For example, we could write numbers using half as many digits if we thought in base 100 and had symbols for numbers 0 through 99. Or we could think in base 1000, increasing the number of digits by a factor of 100 and reducing the length of number representations by a factor of 3.

Word length is inversely proportional to the logarithm of alphabet size, and so if we have some cost proportional to alphabet size a and another cost proportional to word length, the optimal alphabet size for the sum of these costs would minimize

C1a + C2 / log a.

We could divide this objective function by C1 and set w = C1/C2 to simplify the objective to minimizing

a + w / log a

Setting the derivative with respect to a to zero shows that the optimal alphabet size a is given by solving

a (log a)^2 = w.

As the weight w given to the cost of word size increases, the optimal alphabet size increases as well. A wide range of weights corresponds to a narrower range of alphabet sizes.

optimal_alphabet.png

This plot could be misleading. The graph is not showing the cost associated with various alphabet sizes but rather the optimal alphabet size for a given word size cost. The weight w is fixed in any particular application.

An alphabet of size 10 is optimal for a word length cost weight w of about 50. Using w = 50, the following plot shows how alphabet size cost and word size costs compare. A large increase in alphabet size cost only buys a small decrease in word size cost.

optimal_alphabet2.png

Related post: The base with the largest decibel

The post Tradeoff between alphabet size and word size first appeared on John D. Cook.
External Content
Source RSS or Atom Feed
Feed Location http://feeds.feedburner.com/TheEndeavour?format=xml
Feed Title John D. Cook
Feed Link https://www.johndcook.com/blog
Reply 0 comments