Confidential OCR

by John, from John D. Cook

A client emailed me a screenshot of a table rather than pasting the table as text into an email.

I thought about using an LLM to convert it to text, but the table is confidential client information and so I shouldn't upload it anywhere.

I searched for a command line utility to do OCR and found tesseract. I installed it with

 sudo apt install tesseract-ocr libtesseract-dev tesseract-ocr-eng

and ran it with the default settings

 tesseract screenshot.png textfile

It worked remarkably well. I had to change a C to a U and delete a few extraneous parentheses generated by the software, but otherwise I didn't have to add or change any text.
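For repeated use, the run and the cleanup could be wrapped in a couple of small shell functions. What follows is a minimal sketch, not the method used above: the function names `ocr_to_stdout` and `strip_parens` are hypothetical, it assumes Tesseract 4 or later (which spells the page segmentation option `--psm`), and it assumes the table contains no parentheses worth keeping. Whether `--psm 6` reads a given table better than the default settings is something to check on the actual image.

```shell
#!/bin/sh

# Hypothetical helper: OCR an image locally and print the text.
# Passing "stdout" as the output base makes tesseract write to standard
# output instead of creating a .txt file next to the given base name.
ocr_to_stdout() {
    # --psm 6 treats the image as a single uniform block of text;
    # -l eng selects the English language data installed above.
    tesseract "$1" stdout --psm 6 -l eng
}

# Hypothetical helper: delete every parenthesis from the text on stdin.
# Assumption: none of the parentheses are legitimate table content.
strip_parens() {
    tr -d '()'
}

# Example usage (file names as in the commands above):
#   ocr_to_stdout screenshot.png | strip_parens > textfile.txt
```

Nothing in this pipeline leaves the machine, so it keeps the privacy property that motivated running OCR locally. Character substitutions such as the C that should have been a U still need a manual pass.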

I work locally in part out of habit; it was the only way to work when I started using a computer. It has numerous advantages, such as being able to keep working when a hurricane knocks out my internet connection, but above all it is private.

I pay more attention to privacy than is convenient because I work in data privacy. And aside from my privacy, I have to protect our clients' privacy.

The post Confidential OCR first appeared on John D. Cook.