A shell one-liner to search directories

by
John
from John D. Cook on (#52C8P)

I started this post by wanting to look at the frequency of LaTeX commands, but then thought that some people might find the code to find the frequencies more interesting than the frequencies themselves.

So I'm splitting this into two posts. This post will look at the shell one-liner to find command frequencies, and the next post will look at the actual frequencies.

I want to explore LaTeX files, so I'll start by using find to find such files.

 find . -name "*.tex"

This searches for files ending in .tex, starting with the current directory (hence .) and searching recursively into subdirectories. The find command explores subdirectories by default; you have to tell it not to if that's not what you want.
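As a rough illustration of the recursion, the sketch below builds a throwaway directory (the file names are made up for the demo) and compares the default behavior with -maxdepth 1. The -maxdepth option is assumed to be available; it is supported by GNU and BSD find but may be missing from more minimal implementations.

```shell
# Hypothetical scratch directory with one nested .tex file
demo=$(mktemp -d)
mkdir -p "$demo/chapters"
touch "$demo/main.tex" "$demo/notes.txt" "$demo/chapters/intro.tex"

# Default: find descends into subdirectories
all_tex=$(cd "$demo" && find . -name "*.tex" | sort)
printf '%s\n' "$all_tex"

# -maxdepth 1: stay in the starting directory only
top_tex=$(cd "$demo" && find . -maxdepth 1 -name "*.tex" | sort)
printf '%s\n' "$top_tex"

rm -rf "$demo"
```

The first listing should include the file under chapters; the second should not.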

Next, I want to use grep to search the LaTeX files. If I pipe the output of find to grep it will search the file names, but I want it to search the file contents. The xargs command takes care of this: it reads the file names from standard input and passes them to grep as command-line arguments, so grep opens the files rather than treating the names as text to search.

 find . -name "*.tex" | xargs grep ...
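The difference can be seen on a small made-up example. The sketch below counts matching lines both ways. It also uses -print0 with xargs -0, which is not part of the one-liner in this post; it is a common variation, assumed here to be available (GNU and BSD tools support it), that keeps file names containing spaces in one piece.

```shell
# Hypothetical scratch directory; the file name deliberately contains a space
demo=$(mktemp -d)
printf '%s\n' '\alpha + \beta' > "$demo/my notes.tex"

# Without xargs, grep searches the list of file names (no match here)
names_hit=$(cd "$demo" && find . -name "*.tex" | grep -c 'alpha' || true)

# With xargs, grep opens the files; -print0 / -0 separate names with NUL bytes
contents_hit=$(cd "$demo" && find . -name "*.tex" -print0 | xargs -0 grep -c 'alpha')

echo "names: $names_hit, contents: $contents_hit"
rm -rf "$demo"
```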

LaTeX commands have the form of a backslash followed by letters, so the regular expression I'll pass is \\[a-z]+. This says to look for a literal backslash followed by one or more letters.
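To see what the pattern picks out, here is a small sketch that feeds one made-up line of LaTeX to grep. The -o and -E flags are explained in the following paragraphs; the sample text is purely illustrative.

```shell
# A made-up line of LaTeX; single quotes keep the backslashes literal
sample='\frac{a}{b} + \alpha_1 \\ \textbf{x}'

# Each match is a backslash followed by one or more lowercase letters
matches=$(printf '%s\n' "$sample" | grep -oE '\\[a-z]+')
printf '%s\n' "$matches"
```

The double backslash (a LaTeX line break) is skipped because neither backslash is followed by a letter.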

I'll give grep four option flags. I'll use -i to ask it to use case-insensitive matching, because LaTeX commands can contain capital letters. I'll use -E to tell it I want to use extended regular expressions [1].
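A quick sketch of what -i changes, using a made-up input line. LC_ALL=C is set here only so that the range [a-z] means plain ASCII lowercase regardless of the local environment; that setting is an assumption of the demo, not something the one-liner requires.

```shell
# Without -i, a command starting with a capital letter is missed
no_i=$(printf '%s\n' '\Gamma(x) \to \infty' | LC_ALL=C grep -oE '\\[a-z]+')
printf '%s\n' "$no_i"

# With -i, \Gamma is matched as well
with_i=$(printf '%s\n' '\Gamma(x) \to \infty' | LC_ALL=C grep -oiE '\\[a-z]+')
printf '%s\n' "$with_i"
```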

I'm after just the commands, not the lines containing commands, and so I use the -o option to tell grep to return just the commands, one per line. But that's not enough. It would be enough if we were only searching one file, but since we're searching multiple files, the default behavior is for grep to return the file name as well. The -h option tells it to only return the matches, no file names.
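The effect of -h is easy to check on two small files. This is a minimal sketch; the file names and contents are invented for the demo.

```shell
# Two hypothetical LaTeX files in a scratch directory
demo=$(mktemp -d)
printf '%s\n' '\section{Intro} uses \alpha' > "$demo/a.tex"
printf '%s\n' 'see \alpha here' > "$demo/b.tex"

# With several files, -o alone prefixes each match with its file name
with_names=$(cd "$demo" && grep -oE '\\[a-z]+' a.tex b.tex)
printf '%s\n' "$with_names"

# Adding -h drops the prefixes, leaving only the matches
bare=$(cd "$demo" && grep -ohE '\\[a-z]+' a.tex b.tex)
printf '%s\n' "$bare"

rm -rf "$demo"
```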

So now we're up to this:

 find . -name "*.tex" | xargs grep -oihE '\\[a-z]+'

Next I want to count how many times each command occurs. The uniq command only collapses adjacent duplicate lines, so I need to sort the output first for uniq -c to count correctly.

 find . -name "*.tex" | xargs grep -oihE '\\[a-z]+' | sort | uniq -c
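Here is a small sketch of why the sort step matters, using three made-up lines in place of real grep output. The exact width of the leading padding printed by uniq -c varies between implementations.

```shell
# Unsorted input: the two \alpha lines are not adjacent, so each is counted once
unsorted=$(printf '%s\n' '\alpha' '\beta' '\alpha' | uniq -c)
printf '%s\n' "$unsorted"

# Sorted input: duplicates become adjacent and are counted together
sorted=$(printf '%s\n' '\alpha' '\beta' '\alpha' | sort | uniq -c)
printf '%s\n' "$sorted"
```

The first form reports three lines with a count of 1 each; the second reports \alpha with a count of 2.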

And finally I want to sort the output by frequency, in descending order. The -n option tells sort to sort numerically, and -r says to sort in descending order rather than the default ascending order. This produces a lot of output, so I pipe everything to less to view it one screen at a time.

 find . -name "*.tex" | xargs grep -oihE '\\[a-z]+' | sort | uniq -c | sort -rn | less
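To try the whole pipeline without a directory full of LaTeX, the sketch below runs it against two invented files in a scratch directory. The final less is left off so the result can be captured in a variable; everything else matches the one-liner above.

```shell
# Hypothetical sample files, one of them in a subdirectory
demo=$(mktemp -d)
mkdir -p "$demo/sub"
printf '%s\n' '\alpha + \beta = \alpha' > "$demo/a.tex"
printf '%s\n' '\alpha \beta \gamma' > "$demo/sub/b.tex"

# Same pipeline as above, minus the pager
counts=$(cd "$demo" && find . -name "*.tex" | xargs grep -oihE '\\[a-z]+' | sort | uniq -c | sort -rn)
printf '%s\n' "$counts"

rm -rf "$demo"
```

With these files, \alpha should come first with a count of 3, followed by \beta with 2 and \gamma with 1.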

That's my one-liner. In the next post I'll look at the results.

[1] I learned regular expressions from writing Perl long ago. What I think of as simply a regular expression is what grep calls "extended" regular expressions, so adding the -E option keeps me out of trouble in case I use a feature that grep considers an extension. You could use egrep instead, which is essentially the same as grep -E.