Transliterating Hebrew


from John D. Cook on 2020-08-23 17:54 (#5784Z)

Yesterday I wrote about cjhebrew, a LaTeX package that lets you insert Hebrew text by using a sort of transliteration scheme. That reminded me of unidecode, a Python package for transliterating Unicode to ASCII, that I wrote about before. I wondered how the two compare, and so this post will answer that question.

Transliteration is a crude approximation. I started to say it's no substitute for a proper translation, but in fact sometimes it is a substitute for a proper translation. It takes in the smallest context possible, one character, and is utterly devoid of nuance, but it still might be good enough for some purposes. It might, for example, help in searching some text for relevant content worth the effort of a proper translation.

Here's a short bit of code to display unidecode's transliterations of the Hebrew alphabet.

    import unidecode

    for i in range(22 + 5):
        ch = chr(i + ord('א'))
        print(ch, unidecode.unidecode(ch))

I wrote 22 + 5 rather than 27 above to give a hint that the extra values are the final forms of five letters [1]. Also, if ord('א') doesn't work for you, you can replace it with 0x05d0.
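As a sanity check on the 22 + 5 count, here's a small sketch that uses only Python's standard unicodedata module, so it runs without unidecode installed. It assumes that a final form can be recognized by the word FINAL in its Unicode character name; the variable names are just for illustration.

```python
import unicodedata

# The 27 code points U+05d0 through U+05ea: 22 letters plus 5 final forms.
letters = [chr(0x05d0 + i) for i in range(22 + 5)]

# Assumption: a final form is any letter whose Unicode name contains "FINAL".
finals = [ch for ch in letters if "FINAL" in unicodedata.name(ch)]

for ch in finals:
    ordinary = chr(ord(ch) + 1)  # the ordinary form is the next code point
    print(f"U+{ord(ch):04x} {unicodedata.name(ch)} -> "
          f"U+{ord(ordinary):04x} {unicodedata.name(ordinary)}")
```

Each final form it finds sits one code point before the ordinary form of the same letter.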

Here's a comparison of the transliterations used in cjhebrew and unidecode. I've abbreviated the column headings to make a narrower table.

|---------+---+----+----|
| Unicode |   | cj | ud |
|---------+---+----+----|
| U+05d0  | א | '  | A  |
| U+05d1  | ב | b  | b  |
| U+05d2  | ג | g  | g  |
| U+05d3  | ד | d  | d  |
| U+05d4  | ה | h  | h  |
| U+05d5  | ו | w  | v  |
| U+05d6  | ז | z  | z  |
| U+05d7  | ח | .h | KH |
| U+05d8  | ט | .t | t  |
| U+05d9  | י | y  | y  |
| U+05da  | ך | K  | k  |
| U+05db  | כ | k  | k  |
| U+05dc  | ל | l  | l  |
| U+05dd  | ם | M  | m  |
| U+05de  | מ | m  | m  |
| U+05df  | ן | N  | n  |
| U+05e0  | נ | n  | n  |
| U+05e1  | ס | s  | s  |
| U+05e2  | ע | `  | `  |
| U+05e3  | ף | P  | p  |
| U+05e4  | פ | p  | p  |
| U+05e5  | ץ | .S | TS |
| U+05e6  | צ | .s | TS |
| U+05e7  | ק | q  | q  |
| U+05e8  | ר | r  | r  |
| U+05e9  | ש | /s | SH |
| U+05ea  | ת | t  | t  |
|---------+---+----+----|

The transliterations are pretty similar, despite different design goals. The unidecode module is trying to pick the best mapping to ASCII characters. The cjhebrew package is trying to use mnemonic ASCII sequences to map into Hebrew. The former doesn't need to be unique, and in fact several pairs of Hebrew letters share a unidecode output, but the latter does: each ASCII sequence has to pick out exactly one Hebrew letter. The post on cjhebrew explains, for example, that it uses capital letters for final forms of Hebrew letters.
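To make the uniqueness point concrete, here's a rough sketch that checks both columns of the table for collisions. The two lists are copied by hand from the table, in code point order starting at U+05d0, with tsadi entered as .s (an assumption: the lowercase partner of final tsadi's .S). The function name collisions is mine, not part of either package.

```python
from collections import defaultdict

# Transliterations in code point order, U+05d0 through U+05ea.
# Assumption: tsadi (U+05e6) is ".s", the lowercase partner of ".S".
CJ = ["'", "b", "g", "d", "h", "w", "z", ".h", ".t", "y",
      "K", "k", "l", "M", "m", "N", "n", "s", "`",
      "P", "p", ".S", ".s", "q", "r", "/s", "t"]
UD = ["A", "b", "g", "d", "h", "v", "z", "KH", "t", "y",
      "k", "k", "l", "m", "m", "n", "n", "s", "`",
      "p", "p", "TS", "TS", "q", "r", "SH", "t"]

def collisions(values):
    """Map each transliteration used more than once to its code points."""
    groups = defaultdict(list)
    for offset, value in enumerate(values):
        groups[value].append(f"U+{0x05d0 + offset:04x}")
    return {v: cps for v, cps in groups.items() if len(cps) > 1}

cj_collisions = collisions(CJ)
ud_collisions = collisions(UD)

print("cj:", cj_collisions)
print("ud:", ud_collisions)
```

With these lists the cj column has no collisions, while the ud column sends six pairs of letters to the same output: the five final forms along with their ordinary forms, plus tet and tav.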

Here's the corresponding table for vowel points (niqqud).

|---------+---+----+----|
| Unicode |   | cj | ud |
|---------+---+----+----|
| U+05b0  | ◌ְ | :  | @  |
| U+05b1  | ◌ֱ | E: | e  |
| U+05b2  | ◌ֲ | a: | a  |
| U+05b3  | ◌ֳ | A: | o  |
| U+05b4  | ◌ִ | i  | i  |
| U+05b5  | ◌ֵ | e  | e  |
| U+05b6  | ◌ֶ | E  | e  |
| U+05b7  | ◌ַ | a  | a  |
| U+05b8  | ◌ָ | A  | a  |
| U+05b9  | ◌ֹ | o  | o  |
| U+05ba  | ◌ֺ | o  | o  |
| U+05bb  | ◌ֻ | u  | u  |
|---------+---+----+----|

[1] Unicode lists the final form of a letter before its ordinary form. For example, final kaf has Unicode value U+05da and kaf has value U+05db.

The post Transliterating Hebrew first appeared on John D. Cook.
