An interesting list structure, part 2
Last time on FAIC I gave a standard implementation of a persistent immutable linked list, with cheap - O(1) - Push, Pop, Peek and IsEmpty operations, and expensive - O(n) - Append and Concatenate operations. In this episode we're going to implement a Hughes list, which "flips" those costs: Pop, Peek and IsEmpty become O(n), Append and Concatenate become O(1). Push stays O(1) also. This data structure is useful for scenarios where we are accumulating a large list "from both ends" or out of smaller lists, and then want to read the whole thing out as a normal list once we're done the accumulation.
Before we get into it, I'll repeat my caveat from last time: the implementations I'm showing today are not intended to be production-ready, heavily tested, performant or in any way bulletproof; the point of this exercise is to learn about the data structure, not to give a tutorial on good industrial design.
Last time I gave the data structure of a Hughes list, which is surprisingly simple:
public struct HughesList<T>
{
private readonly Func<SimpleList<T>, SimpleList<T>> f;
Even if it is not at all clear to you how this is going to work - and it certainly was not clear to me when I learned about these! - this probably reminds you of your experience with LINQ. Constructing a query object with LINQ is very fast, even if the data set is extremely large, because we use the Func and IEnumerable types to defer computations to the future. A Hughes list gets its good performance in the same way; it represents a sequence of operations on simple lists to be executed in the future, when they can be done in the highest-performance order, once.
But I am getting ahead of myself. What is that function? What are its semantics? They are both simple and subtle:
- A Hughes list represents a list. Say, 1, 2, 3, 4.
- The function is a function which can concatenate a simple list to the represented list; the returned value is the resulting simple list.
So if we have the Hughes list h representing 1, 2, 3, 4, and we call h.f(g) where g is the simple list representing 5, 6, 7, 8, then the result will be a simple list representing 1, 2, 3, 4, 5, 6, 7, 8.
Let's see where this takes us. The first thing I'm going to do is make a few little one-liner helpers to decrease the size of the code later. (In fact, every method I write is going to be a one-expression method!)
private HughesList(Func<SimpleList<T>, SimpleList<T>> f) =>
this.f = f;
private static HughesList<T> Make(
Func<SimpleList<T>, SimpleList<T>> f) =>
new HughesList<T>(f);
Nothing interesting there. The first interesting question is: what does an empty Hughes list look like? Well, remember our basic principle: if we have the Hughes list that represents an empty list, then the function must be the function that concatenates any list to the empty list. But" that's the identity function! Appending any list to the empty list is just the list. So this one is easy:
public static readonly HughesList<T> Empty = Make(ts => ts);
We should expect that the empty list case is pretty straightforward; let's get more complicated. How do we represent a Hughes list that contains a single element? That is, I want a function that takes a single T and gets back a Hughes list which represents the list containing only that value. Let's suppose the value is 10. Our representation is a function which takes a list, say, 1, 2, 3, and returns that list concatenated to the list that contains only 10, which is 10, 1, 2, 3.
But" that's just Push!
public static HughesList<T> Single(T t) => Make(ts => ts.Push(t));
Let's make one more constructor. Suppose we have a simple list and we would like the equivalent Hughes list. Again, the Hughes list that represents a list is the function that concatenates something to that list. But" that's Concatenate! We don't even need a lambda for this one; we can turn the method directly into a delegate bound to the receiver.
public static HughesList<T> FromSimple(SimpleList<T> ts) =>
Make(ts.Concatenate);
So far every operation has been O(1) and has been a factory method. Now let's look at our first O(n) operation, an instance method: how do we turn an instance of a Hughes list back into a simple list?
Well, what have we got? A Hughes list represents a list. It has a function which takes any simple list and returns the simple list that consists of the Hughes list concatenated with the simple list. The empty simple list is a legal argument, and it is the identity of concatenation, so we can simply write:
public SimpleList<T> ToSimple() => this.f(SimpleList<T>.Empty);
This is like doing a foreach over an OrderBy query in C#; the whole thing gets built at this point, and all the deferred work executes. As we'll see, that work is O(n) in the size of the represented list.
Let's blow right through the other O(n) operations. Again, the point of a Hughes list is that we want to decrease and defer the cost of Push, Append and Concatenate, making those operations O(1), but the down side is that we make Pop, IsEmpty and Peek all O(n). There is no way to tell from a function that can concatenate a list what the contents of the represented list are, so we'll have to realize them as simple lists:
public T Peek => this.ToSimple().Peek;
public bool IsEmpty => this.ToSimple().IsEmpty;
public HughesList<T> Pop => FromSimple(this.ToSimple().Pop);
public override string ToString() => this.ToSimple().ToString();
(Exercise: We could make IsEmpty O(1); do you see how?)
(Exercise: I could have made Pop O(1). Can you do so? Why might this be a bad idea?)
All right, that leaves Push, Append and Concatenate. We will take each in turn.
Suppose we have the Hughes list for 1, 2, 3, and we want to push 10.
What have we got? We've got a function that takes a list, say, 4, 5, 6, and concatenates it to 1, 2, 3, to produce 1, 2, 3, 4, 5, 6.
What do we want? We want a function that takes a list, say, 4, 5, 6, and concatenates it with 10, 1, 2, 3, to produce 10, 1, 2, 3, 4, 5, 6.
That is the function "call the original concatenation function and then push 10 onto the result":
private static HughesList<T> Push(HughesList<T> ts1, T t) =>
Make(ts2 => ts1.f(ts2).Push(t));
public HughesList<T> Push(T t) => Push(this, t);
Why do I have this little indirection into a static method? Because this in a value type is logically a ref parameter and cannot be captured in a lambda, but a formal parameter is just a normal local variable. I wish C# had a way to say "treat this as an ordinary value, not a ref, in this lambda".
But that's not the interesting part; make sure the meaning of this is clear. When we turn this thing back into a simple list, we're going to pass the empty list to the new function. That's going to pass the empty list to the previous function, which will realize the old list. We'll then push a value onto the old list.
Calling Push on the Hughes list is O(1), because all we're doing here is allocating a delegate. What cost are we deferring here? Suppose the old list is of length n-1 and the new list is of length n; calling ToSimple on this thing will call the underlying concatenation to construct the tail, which takes n-1 steps, and then adds a single O(1) simple-list Push to that, for a total cost of O(n).
One of the nice things about Hughes lists is that the Append operation is almost exactly the same as the Push operation, in sharp contrast to the SimpleList implementation where they could not be more different.
private static HughesList<T> Append(HughesList<T> ts1, T t) =>
Make(ts2 => ts1.f(ts2.Push(t)));
public HughesList<T> Append(T t) => Append(this, t);
Again, let's talk through this. Suppose we have a Hughes list that represents 1, 2, 3 and we append 10. We want the Hughes list that represents 1, 2, 3, 10.
What do we have? A function that can take 4, 5, 6 and produce 1, 2, 3, 4, 5, 6.
What do we want? A function that can take 4, 5, 6 and produce 1, 2, 3, 10, 4, 5, 6.
So what do we do? We make a function that pushes 10 onto 4, 5, 6, and we know how to concatenate that to 1, 2, 3, which gives us the desired result!
Again, the Hughes list Append is obviously O(1) because it is just a delegate allocation. Suppose the original list is of length n-1. The call to ToSimple will do the O(1) Push onto the empty simple list, and then the n-1 steps to concatenate 10, 4, 5, 6 to 1, 2, 3, for a total cost of O(n).
We've only got one more, and I'm sure you can guess how it goes:
private static HughesList<T> Concatenate(
HughesList<T> ts1, HughesList<T> ts2) =>
Make(ts3 => ts1.f(ts2.f(ts3)));
public HughesList<T> Concatenate(HughesList<T> ts2) =>
Concatenate(this, ts2);
Going through that to verify that it does what it is supposed to is left as an exercise, as is the exercise to determine that doing a ToSimple on the result is O(n) in the size of the this list.
I said last time that it seems bizarre that there is no "list structure" here - no head and tail - but of course there is, it's just encoded. This becomes more clear when you realize that our three operations, Push, Append, and Concatenate, are all the composition of f with another function. (Recall that the "composition" of two functions f1 and f2 is the function "call f2, and then call f1 with whatever f2 returns".) Let's rewrite our final three methods in that form, and we'll notice something.
First we'll make a helper method to do the composition for us:
private static HughesList<T> Make2(
Func<SimpleList<T>, SimpleList<T>> f1,
Func<SimpleList<T>, SimpleList<T>> f2) =>
new HughesList<T>(a => f1(f2(a)));
And now we can rewrite Push, Append and Concatenate to use this helper. When we do that we no longer have any lambdas that close over "this", so we can get rid of our static helper methods too!
public HughesList<T> Push(T t) =>
Make2(ts2 => ts2.Push(t), this.f);
public HughesList<T> Append(T t) =>
Make2(this.f, ts2 => ts2.Push(t));
public HughesList<T> Concatenate(HughesList<T> ts2) =>
Make2(this.f, ts2.f);
And now it becomes considerably more clear what is going on here: if you squint a little, it looks like Make2 has the same structure as though it was a constructor of a binary tree class! Everything in the left argument is going on the left side of the list, and everything in the right argument is going in the right half! That's why we can get away with such a "thin" data structure in a Hughes list; the real data structures are in the closures generated by the compiler!
Or, another way to look at it is: when we call ToSimple, all the operations in the right argument will be done first, and its result will be the argument to the function in the left side; in this manner we will build up the resulting list from right to left, which is the efficient way to do it.
Next time on FAIC: A relevant story about my Microsoft full-time interviews.