Ideograph numerals


from John D. Cook on 2022-10-07 18:42 (#64GB1)

This post is a follow-on to my previous post on Unicode numbers. I always welcome feedback from readers, but I especially welcome it here because I'm walking into an area I know next to nothing about.

Consecutive code points

Unicode generally assigns code points to number-like things in consecutive order. For example, the Python code

 for n in range(1,10): print(chr(0x30+n), chr(0x24f4+n), chr(0x215f+n))

prints

 1 ⓵ Ⅰ
 2 ⓶ Ⅱ
 3 ⓷ Ⅲ
 4 ⓸ Ⅳ
 5 ⓹ Ⅴ
 6 ⓺ Ⅵ
 7 ⓻ Ⅶ
 8 ⓼ Ⅷ
 9 ⓽ Ⅸ

showing that ASCII digits, circled numerals, and Roman numerals are encoded consecutively.
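The same claim can be checked programmatically rather than by eye. The following is a minimal sketch, not part of the original experiment: the helper name `is_consecutive` is my own, and it relies on `unicodedata.numeric` accepting a default value to return for characters that have no numeric value.

```python
from unicodedata import numeric

# First code point of each run printed above:
# ASCII '1', DOUBLE CIRCLED DIGIT ONE, ROMAN NUMERAL ONE.
starts = {"ASCII": 0x31, "double circled": 0x24F5, "Roman": 0x2160}

def is_consecutive(start, count=9):
    """True if code points start, start+1, ... carry numeric values 1, 2, ..., count."""
    # The default of None avoids a ValueError on non-numeric characters.
    return all(numeric(chr(start + k), None) == k + 1 for k in range(count))

for label, start in starts.items():
    print(label, is_consecutive(start))
```

All three runs should report `True`. Starting the same check at U+4E00 should report `False`, because the code point after it has no numeric value.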

Parenthesized and circled ideographs

So the same is probably true for ideographs representing digits, right?

No, but before we get into that, the following code shows that parenthesized ideographs and circled ideographs for digits are numbered consecutively. The code

 from unicodedata import numeric, name

 for i in range(1, 10):
     cp = 0x321f + i
     ch = chr(cp)
     print(ch, hex(cp), numeric(ch), name(ch))

 for i in range(1, 10):
     cp = 0x327f + i
     ch = chr(cp)
     print(ch, hex(cp), numeric(ch), name(ch))

prints

 ㈠ 0x3220 1.0 PARENTHESIZED IDEOGRAPH ONE
 ㈡ 0x3221 2.0 PARENTHESIZED IDEOGRAPH TWO
 ㈢ 0x3222 3.0 PARENTHESIZED IDEOGRAPH THREE
 ㈣ 0x3223 4.0 PARENTHESIZED IDEOGRAPH FOUR
 ㈤ 0x3224 5.0 PARENTHESIZED IDEOGRAPH FIVE
 ㈥ 0x3225 6.0 PARENTHESIZED IDEOGRAPH SIX
 ㈦ 0x3226 7.0 PARENTHESIZED IDEOGRAPH SEVEN
 ㈧ 0x3227 8.0 PARENTHESIZED IDEOGRAPH EIGHT
 ㈨ 0x3228 9.0 PARENTHESIZED IDEOGRAPH NINE
 ㊀ 0x3280 1.0 CIRCLED IDEOGRAPH ONE
 ㊁ 0x3281 2.0 CIRCLED IDEOGRAPH TWO
 ㊂ 0x3282 3.0 CIRCLED IDEOGRAPH THREE
 ㊃ 0x3283 4.0 CIRCLED IDEOGRAPH FOUR
 ㊄ 0x3284 5.0 CIRCLED IDEOGRAPH FIVE
 ㊅ 0x3285 6.0 CIRCLED IDEOGRAPH SIX
 ㊆ 0x3286 7.0 CIRCLED IDEOGRAPH SEVEN
 ㊇ 0x3287 8.0 CIRCLED IDEOGRAPH EIGHT
 ㊈ 0x3288 9.0 CIRCLED IDEOGRAPH NINE

CJK Unified Ideographs

Now let's take the parentheses and circles off.

The following code shows that the CJK unified ideographs for digits are not digits (!) according to Unicode, but they are numeric. It also shows that their code points are not assigned in any apparent order.

 numerals = "一二三四五六七八九十"
 for n in numerals:
     print(n, hex(ord(n)), n.isdigit(), numeric(n))

This outputs the following.

 一 0x4e00 False 1.0
 二 0x4e8c False 2.0
 三 0x4e09 False 3.0
 四 0x56db False 4.0
 五 0x4e94 False 5.0
 六 0x516d False 6.0
 七 0x4e03 False 7.0
 八 0x516b False 8.0
 九 0x4e5d False 9.0
 十 0x5341 False 10.0
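Python has three related string tests, `isdecimal`, `isdigit`, and `isnumeric`, and comparing them side by side makes the "numeric but not a digit" distinction easier to see. This is a small sketch under my own naming (`number_tests` is a hypothetical helper). The sample characters are an ASCII digit, a Roman numeral, a parenthesized ideograph, and a CJK unified ideograph, all taken from the examples above.

```python
def number_tests(ch):
    """Return (isdecimal, isdigit, isnumeric) for a single character."""
    return ch.isdecimal(), ch.isdigit(), ch.isnumeric()

# ASCII '1', ROMAN NUMERAL ONE, PARENTHESIZED IDEOGRAPH ONE, CJK ideograph for one
samples = [chr(0x31), chr(0x2160), chr(0x3220), chr(0x4E00)]

for ch in samples:
    print(f"U+{ord(ch):04X}", ch, *number_tests(ch))
```

The ASCII digit passes all three tests, while the ideograph at U+4E00 passes only `isnumeric`.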

I assume the ordering of ideographs in Unicode has its own internal logic (with exceptions and historical quirks) that I know nothing about. If anyone knows of any patterns of how code points are assigned to ideographs, please let me know.
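One way to see how far the code point order is from the numeric order is to sort the ten ideographs by code point and read off their values. This is only an illustrative sketch. The code points are the ones printed above, and the variable names are mine.

```python
from unicodedata import numeric

# Code points of the ideographs for 1 through 10, as listed in the output above.
code_points = [0x4E00, 0x4E8C, 0x4E09, 0x56DB, 0x4E94,
               0x516D, 0x4E03, 0x516B, 0x4E5D, 0x5341]

# Sort by code point, then look up each character's numeric value.
values_in_code_point_order = [numeric(chr(cp)) for cp in sorted(code_points)]
print(values_in_code_point_order)
```

Sorted by code point, the values come out as 1, 7, 3, 9, 2, 5, 8, 6, 10, 4, so whatever principle orders these characters, it is not numeric value.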

The names of the characters above say nothing about what the characters mean. For example, the official Unicode name for 九 (U+4E5D) is CJK UNIFIED IDEOGRAPH-4E5D. The name says nothing about the ideograph representing the digit 9, though the numeric property of the character is indeed 9. My guess is that when that character represents a digit, it represents 9, but maybe it can mean other things in other contexts.
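To make the contrast concrete, here is a short sketch that prints the name and numeric value of three characters that all stand for nine: the unified ideograph and its parenthesized and circled forms. The helper `describe` is hypothetical. Only `unicodedata.name` and `unicodedata.numeric` come from the standard library.

```python
from unicodedata import name, numeric

def describe(cp):
    """Return (official Unicode name, numeric value) for a code point."""
    ch = chr(cp)
    return name(ch), numeric(ch)

# CJK unified ideograph, parenthesized ideograph, circled ideograph
for cp in (0x4E5D, 0x3228, 0x3288):
    print(f"U+{cp:04X}", *describe(cp))
```

The parenthesized and circled forms spell out NINE in their names, while the unified ideograph's name is just its code point. All three report a numeric value of 9.0.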

The post Ideograph numerals first appeared on John D. Cook.
