Graphemes
Here's something amusing I ran across in the glossary of Programming Perl:
grapheme A graphene is an allotrope of carbon arranged in a hexagonal crystal lattice one atom thick. Grapheme, or more fully, a grapheme cluster string is a single user-visible character, which in turn may be several characters (codepoints) long. For example … a "ȫ" is a single grapheme but one, two, or even three characters, depending on normalization.
In case the character ȫ doesn't display correctly for you, it is a lowercase o with a diaeresis (two dots) and a macron (a bar) stacked above it.
First, graphene has little to do with grapheme, but it's geeky fun to include it anyway. (Both are related to writing. A grapheme has to do with how characters are written, and the word graphene comes from graphite, the "lead" in pencils. The origin of grapheme has nothing to do with graphene but was an analogy to phoneme.)
Second, the example shows how complicated the details of Unicode can get. The Perl code below expands on the details of the comment about ways to represent ȫ.
The code demonstrates that the character . in regular expressions matches any single character, but \X matches any single grapheme. (Well, almost. The character . usually matches any character except a newline, though this can be modified via optional switches. But \X matches any grapheme, including newline characters.)
    use v5.10;                           # enables say
    binmode STDOUT, ':encoding(UTF-8)';  # print wide characters without warnings

    # U+022B, o with diaeresis and macron
    my $a = "\x{22B}";

    # U+00F6 U+0304, (o with diaeresis) + macron
    my $b = "\x{F6}\x{304}";

    # o U+0308 U+0304, o + diaeresis + macron
    my $c = "o\x{308}\x{304}";

    my @versions = ($a, $b, $c);

    # All versions display the same.
    say @versions;

    # The versions have length 1, 2, and 3.
    # Only $a contains one character and so matches .
    say map {length $_ if /^.$/} @versions;

    # All versions consist of one grapheme.
    say map {length $_ if /^\X$/} @versions;
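The glossary entry says the number of characters depends on normalization. Here is a minimal sketch of that point, assuming the core Unicode::Normalize module is available. It uses the same three strings as the code above; the variable names @nfc_lengths, @nfd_lengths, and $all_equal are my own, chosen for illustration.

```perl
use v5.10;
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# The same three encodings of the grapheme as in the code above.
my @versions = ("\x{22B}", "\x{F6}\x{304}", "o\x{308}\x{304}");

# Length in code points after composing (NFC) and after decomposing (NFD).
my @nfc_lengths = map { length(NFC($_)) } @versions;
my @nfd_lengths = map { length(NFD($_)) } @versions;

say "NFC lengths: @nfc_lengths";
say "NFD lengths: @nfd_lengths";

# Once normalized to the same form, the three strings compare equal.
my $all_equal = (NFC($versions[0]) eq NFC($versions[1])
              && NFC($versions[1]) eq NFC($versions[2])) ? 1 : 0;
say "Equal after NFC: $all_equal";
```

NFC should collapse every version to the single code point U+022B, while NFD should expand every version to the three code points o, U+0308, U+0304. Either way, the strings become directly comparable with eq, which they are not in their raw forms.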