Munging CSV files with standard Unix tools

John

from John D. Cook on 2019-08-30 13:50 (#4P9BJ)

This post briefly discusses working with CSV (comma separated value) files using command line tools that are usually available on any Unix-like system. This will raise two objections: why CSV and why dusty old tools?

Why CSV?

In theory, and occasionally in practice, CSV can be a mess. But CSV is the de facto standard format for exchanging data. Some people like this, some lament this, but that's the way it is.

A minor variation on comma-separated values is tab-separated values [1].

Why standard utilities?

Why use standard Unix utilities? I'll point out some of their quirks, which are arguments for using something else. But the assumption here is that you don't want to use something else.

Maybe you already know the standard utilities and don't think that learning more specialized tools is worth the effort.

Maybe you're already at the command line and in a command line state of mind, and don't want to interrupt your work flow by doing something else.

Maybe you're on a computer where you don't have the ability to install any software and so you need to work with what's there.

Whatever your reasons, we'll go with the assumption that we're committed to using commands that have been around for decades.

cut, sort, and awk

The tools I want to look at are cut, sort, and awk. I wrote about cut the other day and apparently the post struck a chord with some readers. This post is a follow-up to that one.

These three utilities are standard on Unix-like systems. You can also download them for Windows from GOW. The port of sort will be named gsort in order to not conflict with the native Windows sort function. There's no need to rename the other two utilities since they don't have counterparts that ship with Windows.

The sort command is simple and useful. There are just a few options you'll need to know about. The utility sorts fields as text by default, but the -n tells it to sort numerically.

Since we're talking about CSV files, you'll need to know that -t, is the option to tell sort that fields are separated by commas rather than white space. And to specify which field to sort on, you give it the -k option.

The last utility, awk, is more than a utility. It's a small programming language. But it works so well from the command line that you can almost think of it as a command line utility. It's very common to pipe output to an awk program that's only a few characters long.

You can get started quickly with awk by reading Greg Grothous' article Why you should learn just a little awk.

Inconsistencies

Now for the bad news: these programs are inconsistent in their options. The two most common things you'll need to do when working with CSV files is to set your field delimiter to a comma and specify what field you want to grab. Unfortunately this is done differently in every utility.

cut uses -d or --delimiter to specify the field delimiter and -f or --fields to specify fields. Makes sense.

sort uses -t or --field-separator to specify the field delimiter and -k or --key to specify the field. When you're talking about sorting things, it's common to call the fields keys, and so the way sort specifies fields makes sense in context. I see no reason for -t other than -f was already taken. (In sorting, you talk about folding upper case to lower case, so -f stands for fold.)

awk uses -F or --field-separator to specify the field delimiter. At least the verbose option is consistent with sort. Why -F for the short option instead of -f? The latter was already taken for file. To tell awk to read a program from a file rather than the command line you use the -f option.

awk handles fields differently than cut and sort. Because it is a programming language designed to parse delimited text files, each field has a built-in variable: $1 holds the content of the first field, $2 the second, etc.

The following compact table summarizes how you tell each utility that you're working with comma-separated files and that you're interested in the second field.

 |------+-----+-----| | cut | -d, | -f2 | | sort | -t, | -k2 | | awk | -F, | $2 | |------+-----+-----|

Trade-offs

Some will object that the inconsistencies documented above are a good example of why you shouldn't work with CSV files using cut, sort, and awk. You could use other command line utilities designed for working with CSV files. Or pull your CSV file into R or Pandas. Or import it somewhere to work with it in SQL. Etc.

The alternatives are all appropriate for different uses. The premise here is that in some circumstances, the inconsistencies cataloged above are a regrettable but acceptable price to pay to stay at the command line.

[1] Things get complicated if you have a CSV file and fields contain commas inside strings. Tab-separated files are more convenient in this case, unless, of course, your strings contain tabs. The utilities mentioned here all support tab as a delimiter by default.

Source	RSS or Atom Feed
Feed Location	http://feeds.feedburner.com/TheEndeavour?format=xml
Feed Title	John D. Cook
Feed Link	https://www.johndcook.com/blog