Alphabets and Unicode

from John D. Cook on 2020-09-27 19:36 (#58M3B)

ASCII codes may seem arbitrary when you're looking at decimal values, but they make more sense in hex [1]. For example, the ASCII value for 0 is 48. Why isn't it zero, or at least a number that ends in zero? Well it is, in hex: 0x30. And the codes are in consecutive order, so the ASCII value of a digit d is d + 0x30.
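Here is a minimal Python sketch of the digit pattern. The helper name digit_code is made up for illustration.

```python
def digit_code(d: int) -> int:
    """Return the ASCII code of the decimal digit d (0 through 9)."""
    if not 0 <= d <= 9:
        raise ValueError("d must be a single decimal digit")
    return d + 0x30


# Every digit character sits at 0x30 plus its value.
for d in range(10):
    assert chr(digit_code(d)) == str(d)

print(hex(ord("0")), hex(ord("9")))  # 0x30 0x39
```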

There are also patterns in ASCII codes for letters, and this post focuses on these patterns and their analogies in the Unicode values assigned to other alphabets.

Latin

Letters have a similar pattern to digits in ASCII. A is 0x41 and a is 0x61. The upper case and lower case codes are 32 (0x20) apart. Consecutive letters have consecutive ASCII codes, so the nth letter of the alphabet is 0x40 + n in capital form and 0x60 + n in lower case form.
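The same idea for letters, as a short sketch. The function name latin_letter is hypothetical; the offsets 0x40 and 0x60 follow from A being 0x41 and a being 0x61.

```python
def latin_letter(n: int, upper: bool = True) -> str:
    """Return the nth letter of the Latin alphabet (n from 1 to 26)."""
    if not 1 <= n <= 26:
        raise ValueError("n must be between 1 and 26")
    base = 0x40 if upper else 0x60
    return chr(base + n)


# Upper and lower case codes are always 0x20 apart.
for n in range(1, 27):
    assert ord(latin_letter(n, upper=False)) - ord(latin_letter(n)) == 0x20

print(latin_letter(1), latin_letter(26, upper=False))  # A z
```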

Unicode absorbed the first 128 ASCII values for backward compatibility. And some of the patterns in the Latin alphabet carry over to other alphabets. Older encodings for other languages were imported into Unicode much as ASCII was, but with an offset. For example, the Unicode values for Cyrillic letters are essentially those from ISO 8859-5 with an offset of 0x360.
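The Cyrillic offset can be checked with Python's built-in iso8859_5 codec. This sketch only looks at bytes 0xB0 through 0xEF, which I'm assuming is the range holding the basic capital and lower case letters; some bytes outside that range do not follow the offset, which is why the statement above says "essentially." The helper name is made up.

```python
def iso8859_5_to_unicode(byte: int) -> int:
    """Decode a single ISO 8859-5 byte and return its Unicode code point."""
    return ord(bytes([byte]).decode("iso8859_5"))


# Collect the distinct offsets seen across the basic letter range.
offsets = {iso8859_5_to_unicode(b) - b for b in range(0xB0, 0xF0)}
print({hex(o) for o in offsets})  # {'0x360'}
```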

Greek

For example, Greek upper case and lower case letters are also 0x20 apart. Capital alpha is U+0391, and lower case alpha is U+03B1 [2]. As with Latin, capital letters come first. Unicode values are consecutive, so the nth letter of the Greek alphabet is 0x390 + n in capital form and 0x3B0 + n in lower case form.

There's a wrinkle, however. The rule above only holds for n from 1 to 17, because there are two versions of the 18th letter, sigma. Greek has two versions of lower case sigma, ς (U+03C2) at the end of words and σ (U+03C3) everywhere else, but only one upper case sigma, Σ (U+03A3). The Unicode value U+03A2 is unassigned, so that the pattern of capitals and lower case letters being separated by 0x20 will continue after sigma.
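A sketch of the Greek pattern, sigma wrinkle included. It takes capital alpha, U+0391, as letter 1, so the capital offset is 0x390. The function names are hypothetical, and the standard unicodedata module is used only to confirm names and the unassigned slot.

```python
import unicodedata


def greek_capital(n: int) -> str:
    """Return the nth capital Greek letter (n from 1 to 24)."""
    if not 1 <= n <= 24:
        raise ValueError("n must be between 1 and 24")
    # From sigma (n = 18) onward, skip the unassigned code point U+03A2.
    skip = 1 if n >= 18 else 0
    return chr(0x390 + n + skip)


def greek_small(n: int) -> str:
    """Return the nth lower case Greek letter (non-final sigma for n = 18)."""
    return chr(ord(greek_capital(n)) + 0x20)


FINAL_SIGMA = chr(0x3C2)  # 0x20 above the unassigned U+03A2

print(unicodedata.name(greek_capital(18)))   # GREEK CAPITAL LETTER SIGMA
print(unicodedata.name(FINAL_SIGMA))         # GREEK SMALL LETTER FINAL SIGMA
print(unicodedata.category(chr(0x3A2)))      # Cn, i.e. not assigned
```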

Letters as numerals

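The Greeks associated numerical values with letters: α (alpha) = 1, β (beta) = 2, γ (gamma) = 3, etc. For these first letters, the numerical value associated with a capital letter is its Unicode value minus 0x390. That works for the numbers 1 through 5. The value 6 belonged to the archaic letter digamma, which is not in the modern alphabet, so ζ (zeta) = 7, η (eta) = 8, and θ (theta) = 9, one more than the formula gives.

Then starting with the 9th letter, ι (iota) = 10, the letters start counting by 10s: κ (kappa) = 20, etc. So for the capital letters ι (iota) through π (pi), the numerical value is 10(U - 0x398) where U is the Unicode value. The value 90 belonged to another archaic letter, koppa. The letters count by 100s starting with ρ (rho) = 100, and then the gap at U+03A2 complicates things.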
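Here is a sketch of the letter values for the 24 modern letters. It assumes the conventional assignment in which the archaic letters digamma and koppa hold 6 and 90, so ζ = 7, ι = 10, κ = 20, and ρ = 100. The piecewise offsets are worked out from those values and the capital letter code points, and the function name is made up.

```python
def greek_numeral_value(letter: str) -> int:
    """Numerical value of a modern Greek letter (either case)."""
    u = ord(letter.upper())      # work with capitals, U+0391 to U+03A9
    if 0x391 <= u <= 0x395:      # alpha through epsilon: 1 to 5
        return u - 0x390
    if 0x396 <= u <= 0x398:      # zeta through theta: 7 to 9 (6 was digamma)
        return u - 0x38F
    if 0x399 <= u <= 0x3A0:      # iota through pi: 10 to 80
        return 10 * (u - 0x398)
    if u == 0x3A1:               # rho: 100 (90 was koppa)
        return 100
    if 0x3A3 <= u <= 0x3A9:      # sigma through omega: 200 to 800
        return 100 * (u - 0x3A1)
    raise ValueError(f"not a modern Greek letter: {letter!r}")


values = [greek_numeral_value(chr(c)) for c in range(0x391, 0x3AA) if c != 0x3A2]
print(values)
```

The unassigned U+03A2 shows up here as the shift from 0x3A0 to 0x3A1 in the last branch.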

Russian

Russian uses the Cyrillic alphabet, so I should say "the Cyrillic alphabet," just as I started with the "Latin" alphabet, not the English alphabet. But several languages use the Cyrillic alphabet, and some may use it differently than Russian, so I'll say "Russian" to avoid possibly saying something that's not true.

As with Latin and Greek, the Unicode values for Russian letters are consecutive, and code points for capital letters and lower case letters differ by 32 (0x20). But the Russian alphabet has 33 letters, so something's got to give.

The quirk is the 7th letter, Ё (yo). The capital letters in the Russian alphabet start with U+0410 and are consecutive up to U+042F. But there's an interruption in the sequence with Ё. As the 7th letter, you would expect it to have Unicode value U+0416, but that's the code point for the 8th letter, Ж. Yo has Unicode value U+0401. And while you can find the lower case value of the rest of the letters in the Russian alphabet by adding 32 (0x20), the lower case yo has value U+0451.
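A sketch that rebuilds the 33-letter alphabet from the code points above and measures the gap between each capital and its lower case form. The function name is hypothetical.

```python
def russian_capitals() -> str:
    """Return the 33 capital letters of the Russian alphabet in order."""
    letters = [chr(c) for c in range(0x410, 0x430)]  # 32 consecutive capitals
    letters.insert(6, chr(0x401))                    # yo goes in 7th place
    return "".join(letters)


alphabet = russian_capitals()
print(len(alphabet))  # 33

# Distance from each capital to its lower case form, per Python's str.lower().
case_gap = {c: ord(c.lower()) - ord(c) for c in alphabet}
odd_ones = {f"U+{ord(c):04X}": hex(gap) for c, gap in case_gap.items() if gap != 0x20}
print(odd_ones)  # {'U+0401': '0x50'}
```

So yo is the only letter whose case gap is not 0x20: its lower case form is 0x50 away, at U+0451.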

Hebrew

Hebrew doesn't have upper and lower case letters, so that pattern can't carry over. Unicode does assign consecutive values to consecutive letters, but only if you count final forms as separate letters, and list them before their ordinary forms. The first letter of the Hebrew alphabet has Unicode value U+05D0, so the nth letter has Unicode value 0x5CF + n. That holds for n up to 10.

The first 10 letters of the Hebrew alphabet have only one form. But the 11th letter, kaf, has a final form and a non-final form. Final forms are listed first, so 0x5CF + 11 = 0x5DA goes to final kaf, ך, and 0x5DB goes to (non-final) kaf, כ.
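The ordering of final and ordinary forms is easy to see from the official character names. This sketch uses the standard unicodedata module, and it assumes the Hebrew letters run from U+05D0 through U+05EA. The helper name is made up.

```python
import unicodedata


def hebrew_code_point_name(offset: int) -> str:
    """Name of the character at 0x5CF + offset (offset from 1 to 27)."""
    return unicodedata.name(chr(0x5CF + offset))


# Positions 1 to 10 match the alphabet; position 11 is a final form.
for offset in (1, 10, 11, 12):
    print(offset, hex(0x5CF + offset), hebrew_code_point_name(offset))

# Every final form sits immediately before its ordinary form.
finals = [o for o in range(1, 28) if "FINAL" in hebrew_code_point_name(o)]
for o in finals:
    ordinary = hebrew_code_point_name(o).replace("FINAL ", "")
    assert hebrew_code_point_name(o + 1) == ordinary
```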

Hebrew has a way of associating numerical values to letters, very similar to the one described above for Greek. For the first 10 letters, the associated numerical value is the Unicode value minus 0x5CF, but then final forms complicate things.

[1] Hex is short for hexadecimal, i.e. base 16. The 0x in front of a number indicates that it's a hexadecimal number.

[2] It's standard to refer to Unicode values in the format U+xxxx where xxxx is a hexadecimal number. So U+03B1 has numerical value 0x3B1, or 945 in decimal.

The post Alphabets and Unicode first appeared on John D. Cook.
