Ideograph numerals
This post is a follow on to my previous post on Unicode numbers. I always welcome feedback from readers, but I especially welcome it here because I'm walking into an area I know next to nothing about.
Consecutive code pointsUnicode generally assigns code points to number-like things in consecutive order. For example, the Python code
for n in range(1,10): print(chr(0x30+n), chr(0x24f4+n), chr(0x215f+n))
prints
1 2 3 4 5 6 7 8 9
showing that ASCII digits, circled numerals, and Roman numerals are are encoded consecutively.
Parenthesized and circled ideographsSo the same is probably true for ideographs representing digits, right?
No, but before we get into that, the following code shows that parenthesized ideographs and circled ideographs for digits are numbered consecutively. The code
from unicodedata import numeric, name for i in range(1, 10): cp = 0x321f + i ch = chr(cp) print(ch, hex(cp), numeric(ch), name(ch)) for i in range(1, 10): cp = 0x327f + i ch = chr(cp) print(ch, hex(cp), numeric(ch), name(ch))
prints
0x3220 1.0 PARENTHESIZED IDEOGRAPH ONE 0x3221 2.0 PARENTHESIZED IDEOGRAPH TWO 0x3222 3.0 PARENTHESIZED IDEOGRAPH THREE 0x3223 4.0 PARENTHESIZED IDEOGRAPH FOUR 0x3224 5.0 PARENTHESIZED IDEOGRAPH FIVE 0x3225 6.0 PARENTHESIZED IDEOGRAPH SIX 0x3226 7.0 PARENTHESIZED IDEOGRAPH SEVEN 0x3227 8.0 PARENTHESIZED IDEOGRAPH EIGHT 0x3228 9.0 PARENTHESIZED IDEOGRAPH NINE 0x3280 1.0 CIRCLED IDEOGRAPH ONE 0x3281 2.0 CIRCLED IDEOGRAPH TWO 0x3282 3.0 CIRCLED IDEOGRAPH THREE 0x3283 4.0 CIRCLED IDEOGRAPH FOUR 0x3284 5.0 CIRCLED IDEOGRAPH FIVE 0x3285 6.0 CIRCLED IDEOGRAPH SIX 0x3286 7.0 CIRCLED IDEOGRAPH SEVEN 0x3287 8.0 CIRCLED IDEOGRAPH EIGHT 0x3288 9.0 CIRCLED IDEOGRAPH NINECJK Unified Ideographs
Now let's take the parentheses and circles off.
The following code shows that the CJK unified ideographs for digits are not digits (!) according to Unicode, but they are numeric. It also shows that their code points are not assigned in any apparent order.
numerals = "" for n in numerals: print(n, hex(ord(n)), n.isdigit(), numeric(n))
This outputs the following.
0x4e00 False 1.0 0x4e8c False 2.0 0x4e09 False 3.0 0x56db False 4.0 0x4e94 False 5.0 0x516d False 6.0 0x4e03 False 7.0 0x516b False 8.0 0x4e5d False 9.0 0x5341 False 10.0
I assume the ordering of ideographs in Unicode has its own internal logic (with exceptions and historical quirks) that I know nothing about. If anyone knows of any patterns of how code points are assigned to ideographs, please let me know.
The names of the characters above say nothing about what the characters mean. For example, the official Unicode name for (U+4E5D) is CJK UNIFIED IDEOGRAPH-4E5D. The name says nothing about the ideograph representing the digit 9, though the numeric property of the digit is indeed 9. My guess is that when that character represents a digit, it represents 9, but maybe it can mean other things in other contexts.
The post Ideograph numerals first appeared on John D. Cook.