Extract text from a PDF
Arshad Khan left a comment on my post on the less and more utilities saying "on ubuntu if I do less on a pdf file, it shows me the text contents of the pdf."
Apparently this is an undocumented feature of GNU less. It works, but I don't see anything about it in the man page documentation [1].
Not all versions of less do this. On my Mac, less applied to a PDF gives a warning saying "... may be a binary file. See it anyway?" If you insist, it will dump gibberish to the command line.
A more portable way to extract text from a PDF would be to use something like the pypdf Python module:
    from pypdf import PdfReader

    reader = PdfReader("myfile.pdf")
    for page in reader.pages:
        print(page.extract_text())
The pypdf documentation gives several options for how to extract text. The documentation also gives a helpful discussion of why it's not always clear what extracting text from a PDF should mean. Should captions and page numbers be extracted? What about tables? In what order should text elements be extracted?
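As a rough illustration of one such option, here is a minimal sketch that switches between the default extraction and pypdf's layout mode. The function name extract_pages and its layout flag are my own, not part of pypdf, and I'm assuming a version of pypdf recent enough to accept the extraction_mode argument; older releases may reject it.

```python
def extract_pages(path, layout=False):
    """Return a list with the extracted text of each page in the PDF at `path`."""
    # Imported here so the function can be defined even if pypdf is not installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    if layout:
        # Assumption: this pypdf version supports extraction_mode="layout",
        # which tries to preserve the visual arrangement of text on the page.
        return [page.extract_text(extraction_mode="layout") for page in reader.pages]
    # Default extraction, the same call used in the loop above.
    return [page.extract_text() for page in reader.pages]
```

Calling extract_pages("myfile.pdf", layout=True) and comparing the result with the default output is a quick way to see how much the answer to "what is the text of this PDF?" depends on the choices made.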
PDF files are notoriously bad as a data exchange format. When you extract text from a PDF, you're likely not using the file in a way its author intended, maybe even in a way the author tried to discourage.
Related post: Your PDF may reveal more than you intend
[1] Update: Several people have responded saying that less isn't extracting the text from a PDF, but lesspipe is. That would explain why it's not a documented feature of less. But it's not clear how lesspipe is implicitly inserting itself.
Further update: Thanks to Konrad Hinsen for pointing me to this explanation. less reads an environment variable LESSOPEN for a preprocessor to run on its arguments, and that variable is, on some systems, set to lesspipe.
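To check whether this is happening on a given system, you can look at the variable directly. Here is a minimal sketch using only the Python standard library. The function name describe_lessopen is made up for illustration, and the example value in the usage note is what I'd expect on an Ubuntu-like system, not something guaranteed.

```python
import os


def describe_lessopen(env=None):
    """Return a short description of the LESSOPEN preprocessor, if any."""
    env = os.environ if env is None else env
    value = env.get("LESSOPEN", "")
    if not value:
        return "LESSOPEN is not set: less reads files directly"
    # A leading "|" tells less to read the preprocessor's output from a pipe;
    # otherwise the preprocessor is expected to name a replacement file.
    kind = "input pipe" if value.startswith("|") else "replacement file"
    return f"LESSOPEN preprocessor ({kind}): {value}"


if __name__ == "__main__":
    print(describe_lessopen())
```

If the variable is set to something like "| /usr/bin/lesspipe %s", then less runs that command on each file name, with %s replaced by the name, and displays whatever the command writes out.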
The post Extract text from a PDF first appeared on John D. Cook.