Cellar
You are in a dark, damp cellar with narrow passageways to the north and east. On the west is the bottom of a steep metal ramp which is unclimbable.
> go north
We are very close to being able to decode zstrings. Let me lay out for you what the real compression system is.
First off, this is the compression scheme used by story versions 3 and up; I'm not going to bother implementing the slightly different compression scheme used in versions 1 and 2.
As we know, every 16 bits of compressed characters represents three five-bit characters and a "we're done" bit. There are in fact three five-bit character alphabets from which characters are drawn; I'll put them into an array-of-arrays:
let alphabet_table = [|[| " "; "?"; "?"; "?"; "?"; "?"; "a"; "b"; "c"; "d"; "e"; "f"; "g"; "h"; "i"; "j"; "k"; "l"; "m"; "n"; "o"; "p"; "q"; "r"; "s"; "t"; "u"; "v"; "w"; "x"; "y"; "z" |]; [| " "; "?"; "?"; "?"; "?"; "?"; "A"; "B"; "C"; "D"; "E"; "F"; "G"; "H"; "I"; "J"; "K"; "L"; "M"; "N"; "O"; "P"; "Q"; "R"; "S"; "T"; "U"; "V"; "W"; "X"; "Y"; "Z" |];[| " "; "?"; "?"; "?"; "?"; "?"; "?"; "\n"; "0"; "1"; "2"; "3"; "4"; "5"; "6"; "7"; "8"; "9"; "."; ","; "!"; "?"; "_"; "#"; "'"; "\""; "/"; "\\"; "-"; ":"; "("; ")" |] |]
(Code for this episode can be found at https://github.com/ericlippert/flathead/tree/blog5)
(A commenter on a previous episode asked why use an array of strings here when a simple string of characters would suffice? Because I wanted an excuse to figure out how arrays worked in OCaml! A string would do fine.)
As you can see the first alphabet is lowercase characters, and obviously these are going to be by far the most common characters in the story. The second alphabet is uppercase, and the third is numbers and symbols. Character 0 of each alphabet is space, and characters 1 through 6 have special meanings that I call out below. (See the comments for the reason why the first seven characters have the same meanings in all three alphabets; it has to do with how this standard evolved from the Z-machine version 1 and 2 standards.)
The meaning of a given five-bit character depends on what the previous character was. That's maybe an odd way to phrase it, so let's describe the effect that a given code has on the following character. A five-bit character code has the following meanings:
- If the character is 1, 2, or 3, then the next character indicates an offset into the abbreviation table; the string wishes to embed an abbreviation at this point. So now we see why there are 96 abbreviations: three possible lead characters, 32 possible trailing characters, makes 96 combinations.
- If the character is 4 or 5 then the next character is drawn from the uppercase or symbol alphabets, respectively, unless"
- If the character is 5, and the next character is 6 then the following two five-bit characters should be combined into one ten-bit character, which is then interpreted as an eight-bit ASCII code. (This is actually a slight lie; the coding used by the Z-machine does not exactly match ASCII. But I am going to ignore that fact for the purposes of this article.)
- Otherwise, it's just an ordinary lowercase code.
Wow. How on earth are we going to implement this mess?
A "state machine" is a simple abstract machine that takes in a sequence of inputs. Each input causes the machine to produce a result and transition to a new state. The result produced and the new state depend solely on the input and the current state. What are our possible states?
- Nothing special; we're processing characters drawn from alphabet 0 as usual. Call this state "alphabet 0"
- We've been asked to use alphabet 1 or 2. Call these states "alphabet 1" and "alphabet 2"
- We've been asked to fetch an abbreviation string from the first, second or third bank of 32 abbreviations. Call these states "abbrev 0", "abbrev 32" and "abbrev 64".
- We've been asked to concatenate two five-bit characters, and we've seen zero of them. Call this state "leading", as in, we are expecting the leading five bits.
- We've been asked to concatenate two five-bit characters, and we've seen one of them. Call this state "trailing x", as in, we have seen leading character x and are expecting a trailing character.
I want to represent all of these possible states in a single type. OCaml lets me do this!
type string_state = | Alphabet of int | Abbrev of abbreviation_number | Leading | Trailing of int
What I have done here is defined a new type - string_state - and said it has four distinct constructors. Two constructors take an int, one takes an abbreviation number - which recall is itself a wrapped int - and one takes no arguments at all. This is one of the nicest features of languages in the ML family like OCaml; that you can very cheaply say "things that have any of these structures are members of this type". As we'll see later on, this plays well into pattern matching.
Our wrapper types that we've already seen are of course just this kind of type, with only one constructor.
Here, let me construct a few states:
let abbrev0 = Abbrev (Abbreviation 0)let abbrev32 = Abbrev (Abbreviation 32)let abbrev64 = Abbrev (Abbreviation 64)let alphabet0 = Alphabet 0let alphabet1 = Alphabet 1let alphabet2 = Alphabet 2
We've got most of our states; the rest of them we'll construct on the fly as we need them.
Next time on FAIC: we'll implement the state transition function. That is, given a current state and a new five-bit zchar, what is the decoded string output and the new state?