Article 43M1K Sequence alignment

Sequence alignment

by
John
from John D. Cook on (#43M1K)

In my previous post I illustrated the Levenshtein edit distance by comparing the opening paragraphs of Finnegans Wake by James Joyce and a parody by Adam Roberts.

In this post I'll show how to align two sequences using the sequence alignment algorithms of Needleman-Wunsch and Hirschberg. These algorithms can be used to compare any sequences, though they are more often used to compare DNA sequences than impenetrable novels and parodies.

I'll be using Gang Li's implementation of these algorithms, available on github. I believe the two algorithms are supposed to produce the same results, that Hirschberg's algorithm is a more space-efficient implementation of the Needleman-Wunsch algorithm, though the two algorithms below produce slightly different results. I'll give the output of Hirschberg's algorithm.

Li's alignment code uses lists of characters for input and output. I wrote a simple wrapper to take in strings and output strings.

 from alignment import Needleman, Hirschberg def compare(str1, str2): seq1 = list(str1) seq2 = list(str2) for algorithm in [Needleman(), Hirschberg()]: a, b = algorithm.align(seq1, seq2) print("".join(a)) print("".join(b)) print()

The code inserts vertical bars to indicate spaces added for alignment. Here's the result of using the Needleman-Wunsch algorithm on the opening paragraphs of Finnegans Wake and the parody Finnegans Ewok.

 |||riverrun, past Ev|e| and Adam'||||s, mov|i|er|un, past ||new and |||||hopes, from swe|rv||e of shore|||| to bend of from s||tr|ike of |||||back to bend of b|||ay, brings us by a commodius |jeday, brings us by a commodius vic|u||s of recirculation back to |||lucas of recirculation back to H|owth Ca||stle|||| and E|nvi||r|ons. |fo||||||rest||moon and |en||dor.||||

I mentioned in my previous post that I could compare the first four paragraphs easily, but I had some trouble aligning the fifth paragraphs. The fifth paragraphs of each version start out quite simiar:

 Bygme||ster Fi|nnega||n, of the Bygm|onster ||Ann||akin, of the Stutte||r|||||||ing Hand, f|re|emen'|s ||||||Throatchokin| Hand, for|cemen|'s mau-rer, lived in the broadest way mau-rer, lived in the broadest way immarginable in his rushlit immarginable in his rushlit toofar-|||back for messuages before toofar| - back for messuages before 

but then Roberts takes the liberty of skipping over a large section of the original. This is what I suspected by looking at the two texts, but Hirschberg's algorithm makes the edits obvious by showing two long sequences of vertical bars, one about 600 characters long and another about 90 characters long.

ykEdaaaKnV0
External Content
Source RSS or Atom Feed
Feed Location http://feeds.feedburner.com/TheEndeavour?format=xml
Feed Title John D. Cook
Feed Link https://www.johndcook.com/blog
Reply 0 comments