Article 148EE Attic

Attic

by
ericlippert
from Fabulous adventures in coding on (#148EE)

You have moved into a dark place. It is pitch black. You are likely to be eaten by a grue.

> what is a grue

The grue is a sinister, lurking presence in the dark places of the earth. Its favorite diet is adventurers, but its insatiable appetite is tempered by its fear of light.

> light the lantern

The brass lantern is now on.

This is the attic. The only exit is the stairway down. A large coil of rope is lying in the corner. There is a nasty knife here.

> take the rope

Taken.

> go down
> go west
> go down

The trap door crashes shut, and you hear someone barring it.

Last time we got as far as getting the uncompressed 17 bit address of an abbreviation string fragment out of the abbreviation table. Today we'll make a start at decoding the thing. Code for this episode can be found at:

https://github.com/ericlippert/flathead/tree/blog4.

The zstring format is, roughly speaking, every 16 bits of string contains three 5-bit characters and one bit that indicates whether the string is done or not. Let's make a rudimentary string decoder; this will not be correct, but it will be enough to see whether we're on the right track here.

The first thing I'm going to do is make a lookup table. This is how OCaml represents straight up old fashioned mutable arrays:

let alphabet_table = [|"_"; "?"; "?"; "?"; "?"; "?"; "a"; "b"; "c"; "d"; "e"; "f"; "g"; "h"; "i"; "j";"k"; "l"; "m"; "n"; "o"; "p"; "q"; "r"; "s"; "t"; "u"; "v"; "w"; "x"; "y"; "z" |]

Array elements are separated by semis and arrays are delimited by this cute [| |] syntax. The type of an array is, of course, deduced by the type inferencer; obviously this is an array of 32 strings. Notice that the first string is an underbar; this should actually be a space but that does not show up well on the blog for obvious reasons. The next five characters are special; we'll come back to those later. And of course the remaining 26 are the lower-case Roman alphabet.

In order to decode a string that starts at some address and ends when the first bit of the current word is on, we're going to need to loop. But in a world where every variable is in fact a value that can only be assigned once, how does a loop work? An ordinary "for" loop where we mutate a variable every time we go through a loop until a condition fails to be met, that's not going to fly.

Now, there is actually a for-loop-like construct in OCaml, but I'm not going to use it. Rather, I'm going to use recursion. The pattern for constructing a loop in a purely functional language goes like this:

let rec looper loop_variable = if loop_condition_is_not_met then produce_result else looper incremented_loop_variable

Study the structure of this thing and make sure that it makes sense. The idea here is that when we enter the function, we check to see if the loop condition is no longer met. If it is not, then we produce a result, and the function ends. If it is still met then we produce an incremented loop variable and call ourselves recursively.

Now, I can hear your objection. We might have to loop thousands or millions of times in some programs. Surely this will blow the stack!

The trick here is that looping functions like this can always be written as tail recursive functions. That is, the very last thing the method does is recurse, and therefore the runtime does not need to maintain the stack frame of the current activation! We're never coming back here except to return the result to our caller, so we can just delete the stack frame entirely and have the recursive activation return the result for us!

That might be a bit hard to comprehend, being so abstract. Let's look at a real example. We'll do it wrong the first time and then fix it up:

let rec looper current = let word = Story.read_word story current in let is_end = fetch_bits bit15 size1 word in let zchar1 = fetch_bits bit14 size5 word in let zchar2 = fetch_bits bit9 size5 word in let zchar3 = fetch_bits bit4 size5 word in let decoded = Printf.sprintf "%02x %s %02x %s %02x %s " zchar1 alphabet_table.(zchar1) zchar2 alphabet_table.(zchar2) zchar3 alphabet_table.(zchar3) in if is_end = 1 then decoded else decoded ^ (looper (inc_word_addr current))

To make sense of this code you need to know first that ^ is the OCaml string concatenation operator, and second, that .() is the OCaml array indexing operator, and third, that functions which recursively call themselves must be declared with "let rec", not "let".

Let's follow along. We call looper with a word address, in current. We fetch the word at that address, decompose it into four numbers, and then turn those numbers into a little string fragment. If we're at the end then the string fragment is the entirity of the text we're going to display. If not, then we recursively get the string associated with the next address and concatenate that with the current decoded string. The result is the decoded version of the entire zstring.

I said this was the wrong way to do it; why? This looks right, and in fact if you run it, it works.

The trouble is that this is not tail recursive because the recursion is not the last thing this function does. The last thing this function does is a string concatenation. So this will have to maintain one activation frame for every two bytes of string being decoded.

How can we fix this up? By passing along an accumulation of results!

let rec looper current acc = let word = Story.read_word story current in let is_end = fetch_bits bit15 size1 word in let zchar1 = fetch_bits bit14 size5 word in let zchar2 = fetch_bits bit9 size5 word in let zchar3 = fetch_bits bit4 size5 word in let decoded = Printf.sprintf "%02x %s %02x %s %02x %s " zchar1 alphabet_table.(zchar1) zchar2 alphabet_table.(zchar2) zchar3 alphabet_table.(zchar3) in if is_end = 1 then acc ^ decoded else looper (inc_word_addr current) (acc ^ decoded)

Now we do the concatenation before the recursive call. The recursive call is the last thing we do, and so this is a nice efficient tail recursion.

I note that this is NOT a nice efficient string concatenation! This is the standard n-squared Schlemiel the painter string concatenation algorithm. There are ways to fix this; OCaml has string builders similar to C#'s, or we could accumulate a list of strings and concatenate them all in one go later, or whatever. There are lots of possible strategies here and I'm going to use none of them. The strings we're working with here are going to be sufficiently small that I don't really care about the inefficiency of concatenating lots of them.

By that same logic I might also say that I don't care about the inefficiency of non-tail-recursive algorithms, and that would be a good point. But it is usually so easy to write an elegant tail-recursive algorithm that you might as well just do so. I'll try to write tail-recursive algorithms throughout this project, and call out when I've failed to do so.

So this is good, but the interface to this thing is now a mess. It takes a word address; we want it to take a zstring address. It takes an accumulator which must be initialized to an empty string. Let's make a more pleasant interface by making this thing a nested function of a function that has the interface we want. We'll rename the nested function "aux", as is traditional, and make a few other small cleanups:

let display_bytes story (Zstring addr) = let rec aux current acc = let word = Story.read_word story current in let is_end = fetch_bits bit15 size1 word in let zchar1 = fetch_bits bit14 size5 word in let zchar2 = fetch_bits bit9 size5 word in let zchar3 = fetch_bits bit4 size5 word in let s = Printf.sprintf "%02x %s %02x %s %02x %s " zchar1 alphabet_table.(zchar1) zchar2 alphabet_table.(zchar2) zchar3 alphabet_table.(zchar3) in let acc = acc ^ s in if is_end = 1 then acc else aux (inc_word_addr current) acc in aux (Word_address addr) ""

Now that we have a function that does the right thing we can take a look at some of those abbreviations:

let () = let story = Story.load "minizork.z3" in let zstring = Zstring.abbreviation_zstring story (Abbreviation 0) in let text = Zstring.display_bytes story zstring in Printf.printf "%s\n" text; let zstring = Zstring.abbreviation_zstring story (Abbreviation 4) in let text = Zstring.display_bytes story zstring in Printf.printf "%s\n" text

And we discover that abbreviation 0 and 4 are the string fragments:

19 t 0d h 0a e 00 _ 05 ? 05 ?1e y 14 o 1a u 17 r 00 _ 05 ?

Fragments "the " and "your " seem to be likely candidates for commonly-found substrings in large sections of text, so I think we're on to something here.

However, there should still be some nagging doubts in your mind. First, what is up with character codes 1 through 6? Why are there fives at the end of "the " and "your "? And second, hey, we have no way to represent capital letters, numbers or symbols here! There must be more to this string format.

Next time on FAIC: We'll see how to actually understand the zstring compression format.


3792 b.gif?host=ericlippert.com&blog=67759120
External Content
Source RSS or Atom Feed
Feed Location http://ericlippert.com/feed
Feed Title Fabulous adventures in coding
Feed Link https://ericlippert.com/
Reply 0 comments