Comparing files with very long lines

by John, from John D. Cook

Suppose you need to compare two files with very long lines. Maybe each file is one line.

The diff utility will show you which lines differ. But if the files are each one line, this tells you almost nothing. It confirms that there is a difference, but it doesn't show you where the difference is. It just dumps both files to the command line.

For example, I created two files, each containing the opening paragraph of Moby Dick. The file temp1.txt contains the paragraph verbatim, and temp2.txt contains a typo. I ran

 diff temp1.txt temp2.txt

and this is what I saw:

[Screenshot moby_diff1.png: diff prints 1c1 followed by the full contents of both files]

The 1c1 output tells us that the difference occurs on line 1, but that's no help because the whole file is line 1.

There are no doubt many ways to fix this problem, but the solution I thought of was to use fold to break the files into more lines before running diff. Without any optional arguments, fold will split a file into segments of 80 characters.

Here's my first attempt.

 $ diff <(fold temp1.txt) <(fold temp2.txt)
 7c7
 < and bringing up the rear of every funeral I meet; and especially whenever my hy
 ---
 > and bringing up the rear of every funeral I meet; and especially whenever my ty

That's better, but we lost context when we chopped the file into 80-character segments. We can't see that the last words should be "hypos" and "typos". This is a minor inconvenience, but there's a bigger problem.

In the example above I simply changed one character, turning an 'h' into a 't'. But suppose I had changed "hypos" into "typoes", not only changing a letter, but also inserting a letter. Then the rest of the files would split differently into 80-character segments. Even though the rest of the text is identical, we would be seeing differences created by fold.
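The effect is easy to reproduce on a toy input. The sketch below is only an illustration: a.txt, b.txt, and b.folded are hypothetical file names, and the 5-column width is chosen to keep the example small. It writes one folded copy to a file and feeds the other to diff on standard input, rather than using process substitution, so that it also runs in a plain POSIX shell.

```shell
# Hypothetical test files: b.txt is a.txt with one character inserted.
printf 'abcdefghijklmnopqrstuvwxyz\n' > a.txt
printf 'abcXdefghijklmnopqrstuvwxyz\n' > b.txt

# Fold both files to 5 columns, diff the results, and count the
# segments of a.txt that diff reports as changed ('<' lines).
fold -w 5 b.txt > b.folded
changed=$(fold -w 5 a.txt | diff - b.folded | grep -c '^<')
echo "segments of a.txt reported as changed: $changed"
```

A single inserted character shifts every later break point, so diff flags all six segments of a.txt even though only one character was added.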

This problem can be mitigated by giving fold the option -s, telling it to split lines at word boundaries.

 $ diff <(fold -s temp1.txt) <(fold -s temp2.txt)
 8,9c8,9
 < whenever my hypos get such an upper hand of me, that it requires a strong moral
 < principle to prevent me from deliberately stepping into the street, and
 ---
 > whenever my typoes get such an upper hand of me, that it requires a strong
 > moral principle to prevent me from deliberately stepping into the street, and

The extra letter that I added changed the way the first line was broken; putting "moral" on the first line of the second output would have made the line 81 characters long. But other than pushing "moral" to the next line, the rest of the chopped lines are the same.

You can use the -w option to have fold split lines at lengths other than 80 characters. If I split the files into lines 20 characters long, the differences between the two files are more obvious.

 $ diff <(fold -s -w 20 temp1.txt) <(fold -s -w 20 temp2.txt)
 34c34
 < my hypos get such
 ---
 > my typoes get such

Ah, much better.

Not only do shorter lines make it easier to see where the differences are, in this case changing the segment length fixed the problem of a word being pushed to the next line. That won't always be the case. It would not have been the case, for example, if I had used 15-character lines. But you can always change the number of characters slightly to eliminate the problem if you'd like.
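If you find yourself adjusting the width repeatedly, the fold-then-diff step can be wrapped in a small shell function. This is a minimal sketch, not part of either utility: the name folddiff and its argument order are made up for illustration, and it assumes mktemp is available. It uses a temporary file in place of process substitution so that it works in a plain POSIX shell as well as in bash.

```shell
# folddiff FILE1 FILE2 [WIDTH]
# Hypothetical helper: fold both files at word boundaries to WIDTH columns
# (80, fold's default, if WIDTH is omitted), then diff the folded text.
folddiff() {
    width=${3:-80}
    tmp=$(mktemp) || return 2
    fold -s -w "$width" "$2" > "$tmp"
    # FILE1 arrives on diff's standard input, so '<' lines come from FILE1.
    fold -s -w "$width" "$1" | diff - "$tmp"
    status=$?          # diff's exit status: 0 same, 1 different, >1 trouble
    rm -f "$tmp"
    return "$status"
}
```

For example, folddiff temp1.txt temp2.txt 20 should give the same comparison as the 20-character command above. Rerunning with a slightly different width is a quick way to check whether a reported difference is real or just a side effect of where the lines happened to break.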

The post Comparing files with very long lines first appeared on John D. Cook.