Generating De Bruijn cycles with primitive polynomials
Last week I wrote about a way to use De Bruijn cycles. Today I'll write about a way to generate De Bruijn cycles.
De Bruijn cyclesStart with an alphabet of k symbols. B(k, n) is the set of cycles of that contain every sequence of n symbols, and is as short as possible. Since there are kn possible sequences of n symbols, and every one corresponds to some starting position in the De Bruijn cycle, an element of B(k, n) has to have at least kn symbols. In fact, the elements of B(k, n) have exactly kn symbols.
It's not obvious that B(k, n) should be non-empty for all k and n, but it is. And there are algorithms that will take a k and an n and return an element of B(k, n).
The post from last week gave the example of a sequence in B(4, 3) that contains all triples of DNA base pairs:
AAACAAGAATACCACGACTAGCAGGAGTATCATGATTCCCGCCTCGGCGTCTGCTTGGGTGTTT
Generating De Bruijn cyclesWhen k is a prime number, i.e. we're working over an alphabet with a prime number of symbols, it is particularly simple generate De Bruijn sequences [1]. For example, let's work over the alphabet {0, 1, 2}. Then the following code will produce the next symbol in a sequence in B(3, 4).
def B34(a,b,c,d): if (a,b,c,d) == (0,0,0,0): return 1 if (a,b,c,d) == (1,0,0,0): return 0 return (a+b) % 3
We can initialize the sequence wherever we like since it produces a cycle, but if we start with 0000 we get the following:
000010011012110021020122101011112220112120002002202122001201021120202222111022121
You can verify that every sequence of four elements from {0, 1, 2} is in there somewhere, provided you wrap the end around. For example, 1000 can be found by starting in the last position.
Where did the algorithm above come from? How would you create an analogous algorithm for other values of k and n?
The algorithm goes back to Willem Mantel in 1894, and can be found, for example, in Knuth's TAOCP Volume 4. Here is Mantel's algorithm to generate an element of B(k, n) where k is prime. The function takes the latest n symbols in the De Bruijn cycle and returns the next symbol.
- If (x1, x2, ", xn ) = (0,0, ", 0), return c1.
- If (x1, x2, ", xn ) = (1,0, ", 0), return 0.
- Otherwise return c1x1 + c2x2 + " cnxn mod k.
In our example above, c1 = c2 = 1 and c3 = c4 = 0, but how do you find the c's in general?
Primitive polynomialsTo find the c's, first find a primitive polynomial of degree n over the prime field with k elements. Then the c's are the coefficients of the polynomial, with a sign change, if you write the polynomial in the following form.
In the example above, I started with the polynomial
We can read the coefficients off this polynomial: c1 = c2 = -2 and c3 = c4 = 0. Since -1 and 2 are the same working mod 3, I used c1 = c2 = 1 above.
Backing up a bit, what is a primitive polynomial and how do you find them?
A primitive polynomial of degree n with coefficients in GF(k), the finite field with k elements, has leading coefficient 1 and has a root I that generates the multiplicative group of GF(kn). That is, every nonzero element of GF(kn) can be written as a power of I.
In my example, I found a primitive polynomial in GF(34) by typing polynomials into Mathematica until I found one that worked.
In[1]:= PrimitivePolynomialQ[x^4 + 2 x + 2, 3] Out[1]= True
Since coefficients can only be 0, 1, or 2 when you're working mod 3, it only took a few guesses to find a primitive polynomial [2].
Brute force guessing works fine k and n are small, but clearly isn't practical in general. There are algorithms for searching for primitive polynomials, but I'm not familiar with them.
The case where k = 2 and n may be large is particularly important in applications, and you can find where people have tabulated primitive binary polynomials, primitive polynomials in GF(2n). It's especially useful to find primitive polynomials with a lot of zero coefficients because, for example, this leads to less computation when producing De Bruijn cycles.
Finding sparse primitive binary polynomials is its own field of research. See, for example, Richard Brent's project to find primitive binary trinomials, i.e. primitive binary polynomials with only three non-zero coefficients.
More on binary sequences in the next post on linear feedback shift registers.
***
[1] The same algorithm can be used when k is not prime but a prime power because you can equate sequence of length n from an alphabet with k = pm elements with a sequence of length mn from an alphabet with p elements. For example, 4 is not prime, but we could have generated a De Bruijn sequence for DNA basepair triples by looking for binary sequences of length 6 and using 00 = A, 01 = C, 10 = G, and 11 = T.
[2] The number of primitive polynomials in GF(pn) is I(pn - 1)/m where I is Euler's totient function. So in our case there were 8 polynomials we could have found.