OCR vs. text extraction: when to use which

PDFs come in two flavors: digital (created directly by software) and scanned (images of paper pages). The two can look identical on screen, but they're as different as a Word doc and a JPEG of a Word doc.

Quick test: can you copy-paste the text?

Open the PDF and try to select a sentence. If you can copy it into a text editor and read it, you have a digital PDF and no OCR is needed. If selection produces nothing or a string of garbage, the "text" is most likely pixels in an image (or a broken text layer), and you need OCR.
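If you have many files to check, the copy-paste test can be approximated in code. The following is a minimal sketch, not part of our pipeline: the function names, the 20-character minimum, and the 0.8 readability threshold are all illustrative assumptions, and it assumes the third-party pypdf library is installed for the extraction step.

```python
import string


def needs_ocr(extracted_text: str,
              min_chars: int = 20,
              min_readable_ratio: float = 0.8) -> bool:
    """Return True if the extracted text looks empty or garbled.

    The thresholds are illustrative guesses, not tuned values.
    """
    text = extracted_text.strip()
    if len(text) < min_chars:
        # Nothing (or almost nothing) could be "selected" on the page.
        return True
    # Count characters a person could plausibly read: letters, digits,
    # whitespace, and ASCII punctuation.
    readable = sum(
        1 for c in text
        if c.isalnum() or c.isspace() or c in string.punctuation
    )
    return readable / len(text) < min_readable_ratio


def first_page_text(pdf_path: str) -> str:
    """Extract the first page's text layer using pypdf (assumed installed)."""
    from pypdf import PdfReader  # third-party; imported lazily on purpose

    reader = PdfReader(pdf_path)
    if len(reader.pages) == 0:
        return ""
    return reader.pages[0].extract_text() or ""
```

Calling needs_ocr(first_page_text("scan.pdf")) on a hypothetical file scan.pdf mirrors the manual test: empty or mostly unreadable output suggests the page needs OCR. Checking only the first page is a shortcut; mixed documents may need a per-page check.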

What OCR actually does

Optical Character Recognition (OCR) analyzes the image of each page, recognizes the characters in it, and outputs machine-readable text. Modern OCR (we use Tesseract via OCRmyPDF) handles 100+ languages and reaches 95%+ accuracy on clean scans. Accuracy drops sharply on crooked or low-resolution scans.

A handy trick: deskew (straighten) the page before OCR. Our pipeline does this automatically, so you typically don't need to do it yourself.
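For anyone running OCRmyPDF on their own machine, deskewing can be requested with its --deskew option. The sketch below builds and runs such a command from Python; it is an illustration under assumptions, not a description of how our pipeline is configured. The helper names and file names are hypothetical, and it assumes the ocrmypdf command-line tool is installed and on the PATH.

```python
import subprocess


def build_ocr_command(input_pdf: str, output_pdf: str,
                      language: str = "eng", deskew: bool = True) -> list[str]:
    """Build an ocrmypdf command line (hypothetical helper)."""
    cmd = ["ocrmypdf", "--language", language]
    if deskew:
        # Ask OCRmyPDF to straighten crooked pages before recognition.
        cmd.append("--deskew")
    cmd += [input_pdf, output_pdf]
    return cmd


def run_ocr(input_pdf: str, output_pdf: str,
            language: str = "eng", deskew: bool = True) -> None:
    """Run ocrmypdf and raise CalledProcessError if it exits with an error."""
    subprocess.run(
        build_ocr_command(input_pdf, output_pdf, language, deskew),
        check=True,
    )
```

For example, run_ocr("scan.pdf", "scan_ocr.pdf") would write a copy of the scan with a text layer added. Keeping the command builder separate from the runner makes it easy to inspect the exact command before executing it.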