OCR a scanned PDF and get clean Markdown — built for AI ingestion

Got a scanned PDF you want to feed to ChatGPT or your RAG pipeline? This chain runs OCR (with deskew + auto-rotate), extracts the recognized text, and emits clean Markdown with headings and lists detected. One job, two steps you'd otherwise have to chain manually.

If you're doing anything with AI and PDFs (RAG indexing, ChatGPT analysis, Claude summarization), Markdown is the lingua franca. But most legacy documents are scans, and scans need OCR before any text extraction works. PDFOnly's OCR to Markdown tool runs the full chain in one job: ocrmypdf for OCR (with deskew and page-rotation correction), poppler's pdftotext for extraction, and our heuristic structure pass for Markdown formatting. Output is a single .md file ready to paste into your favorite AI tool. Multi-language support comes via Tesseract: English, Spanish, French, German, Arabic, Chinese, Japanese, Russian, and more, 90+ languages in total.
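For readers who want to reproduce the first two stages locally, here is a minimal sketch that builds the two command lines. It does not run anything. The helper names and file names are hypothetical, and the -layout option on pdftotext is an assumption; only the deskew, page-rotation, and --skip-text behavior is stated on this page.

```python
import shlex


def build_ocr_command(src: str, dst: str, languages: str = "eng") -> list[str]:
    """Build an ocrmypdf command line (hypothetical helper, not PDFOnly's code)."""
    return [
        "ocrmypdf",
        "--deskew",        # straighten skewed pages
        "--rotate-pages",  # fix pages scanned sideways or upside down
        "--skip-text",     # leave pages that already have a text layer alone
        "-l", languages,   # Tesseract codes, e.g. "eng" or "eng+ara"
        src,
        dst,
    ]


def build_extract_command(pdf: str, txt: str) -> list[str]:
    """Build a poppler pdftotext command line.

    -layout (keep the physical layout) is an assumption for this sketch.
    """
    return ["pdftotext", "-layout", pdf, txt]


if __name__ == "__main__":
    # Print the commands so they can be inspected or pasted into a shell.
    print(shlex.join(build_ocr_command("scan.pdf", "scan.ocr.pdf", "eng+ara")))
    print(shlex.join(build_extract_command("scan.ocr.pdf", "scan.txt")))
```

Keeping the commands as argument lists (rather than one shell string) means they can be passed straight to subprocess.run without quoting problems in file names.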

How to OCR to Markdown, step by step

1. Upload a scanned PDF
Up to 100 MB free, 200 MB on Pro. Works with any PDF: scans run through OCR; born-digital PDFs pass through without re-OCR (the --skip-text flag avoids redoing work).
2. Pick language(s)
Default is 'eng' (English). Use Tesseract codes: 'spa' for Spanish, 'fra' for French, 'deu' for German, 'ara' for Arabic, 'chi_sim' for Simplified Chinese, 'jpn' for Japanese, 'rus' for Russian. Combine with '+' for multilingual docs (e.g. 'eng+ara').
3. Process and download
OCR can take 30-60 seconds per page on dense documents. We deskew and auto-rotate as part of the OCR pass. Then we extract text, apply Markdown structure heuristics, and return the .md file.

Why OCR to Markdown on PDFOnly

End-to-end chain

Most tools do OCR, text extraction, or Markdown conversion, but never all three. We run the full chain in one job, optimized for the AI/RAG use case.

Skips OCR if already searchable

If the input is a born-digital PDF that already has a text layer, we skip the OCR step. That saves time and money on documents that are already searchable.

Same Markdown heuristic as PDF to Markdown

Heading detection (font/casing cues), list detection (bullets, numbers), and horizontal rules at page breaks. Output is consistent whether the source was scanned or not.
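To make the idea concrete, here is a rough sketch of what a structure pass of this kind could look like. It is not PDFOnly's actual heuristic: the all-caps heading rule, the 60-character cutoff, the bullet characters, and the function names are all assumptions chosen for illustration. It relies on pdftotext separating pages with a form feed character.

```python
import re

# Assumed bullet glyphs; a dash or asterisk only counts when followed by a space.
BULLET = re.compile(r"^\s*[•·▪*-]\s+(.*)$")
NUMBERED = re.compile(r"^\s*(\d+)[.)]\s+(.*)$")


def looks_like_heading(line: str) -> bool:
    """Casing cue only: a short line that contains letters and is all upper-case."""
    text = line.strip()
    return 0 < len(text) <= 60 and any(c.isalpha() for c in text) and text == text.upper()


def text_to_markdown(text: str) -> str:
    """Convert extracted plain text to Markdown with simple line-level rules."""
    out = []
    # pdftotext ends each page with a form feed; drop empty trailing pages.
    pages = [p for p in text.split("\f") if p.strip()]
    for i, page in enumerate(pages):
        if i > 0:
            out.append("---")  # horizontal rule at each page break
        for line in page.splitlines():
            stripped = line.strip()
            if not stripped:
                out.append("")
            elif (m := BULLET.match(line)):
                out.append(f"- {m.group(1)}")
            elif (m := NUMBERED.match(line)):
                out.append(f"{m.group(1)}. {m.group(2)}")
            elif looks_like_heading(line):
                out.append(f"## {stripped}")
            else:
                out.append(stripped)
    return "\n".join(out)
```

Plain text carries no font information, so this sketch can only use casing and punctuation cues; a pass that also uses font size would need richer input than pdftotext's default output.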

Privacy-first

Files auto-deleted within an hour. Self-host for air-gapped processing if your scans are sensitive.

What people use OCR to Markdown for

A few common scenarios. If your workflow looks like one of these, this tool is a good fit.

Feed legacy scanned reports to ChatGPT

You have a folder of scanned reports from 2010. Convert them all to Markdown so you can pipe them into ChatGPT for summarization or analysis without copy-paste hell.

Build a RAG knowledge base from scanned manuals

Old vendor manuals, scanned compliance docs, scanned research papers — convert to Markdown and chunk by heading, embed, index. Pipeline-ready output.
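Chunking by heading can be sketched as follows. This is one simple way to do it, assuming ATX-style headings (lines starting with one to six # characters); the function name and the dict layout are hypothetical, and the embedding and indexing steps are left to whatever RAG stack you use.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per heading, keeping any preamble."""
    chunks = []
    heading, lines = "", []

    def flush():
        # Emit the current chunk unless it is an empty preamble.
        text = "\n".join(lines).strip()
        if heading or text:
            chunks.append({"heading": heading, "text": text})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    doc = "intro\n## Safety\nWear gloves.\n\n## Setup\nPlug it in.\n"
    for chunk in chunk_by_heading(doc):
        print(chunk)
```

Each chunk keeps its heading alongside the body text, which gives the retriever some context when a section is returned on its own.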

Translate a scanned foreign-language document

OCR a Spanish scan to Markdown, paste into ChatGPT, ask for English. Two steps instead of five.

Migrate a paper-based document archive to digital + Markdown

Scan all the docs, run them through this tool, deposit the .md files in your knowledge base or wiki. Searchable, AI-ready, version-controllable.

What you get

End-to-end: scan in → Markdown out, no manual chaining
Auto-deskew and auto-rotate as part of the OCR pass
Multi-language OCR (90+ languages via Tesseract)
Markdown output is LLM-ready: headings + lists detected
Free, no signup, files auto-deleted in 1 hour
Pass through if your PDF already has a text layer (no double-OCR)

Frequently asked questions

How accurate is the OCR?

Tesseract 5 is accurate on clean scans (95-99% on print-quality 300 DPI black-and-white). Quality drops on low-resolution scans, handwriting (often unusable), unusual fonts, or noisy backgrounds. For mission-critical accuracy, run a manual proofreading pass on the Markdown output.

What languages are supported?

Tesseract supports 90+ languages. The most common: English (eng), Spanish (spa), French (fra), German (deu), Italian (ita), Portuguese (por), Russian (rus), Chinese Simplified (chi_sim), Chinese Traditional (chi_tra), Japanese (jpn), Korean (kor), Arabic (ara), Hindi (hin), Hebrew (heb). Pass multiple separated by '+': 'eng+spa+fra'.
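As an illustration of the '+' syntax, here is a small sketch that turns a list of codes into a language argument. The helper name is hypothetical, and the set of known codes is only the subset named in this answer, not the full Tesseract list.

```python
# Subset of Tesseract codes named on this page; the real list is longer.
KNOWN_CODES = {
    "eng", "spa", "fra", "deu", "ita", "por", "rus",
    "chi_sim", "chi_tra", "jpn", "kor", "ara", "hin", "heb",
}


def language_arg(codes: list[str]) -> str:
    """Join language codes with '+', dropping duplicates and rejecting unknowns."""
    seen = []
    for code in codes:
        code = code.strip().lower()
        if code not in KNOWN_CODES:
            raise ValueError(f"unknown language code: {code!r}")
        if code not in seen:
            seen.append(code)
    # Fall back to English, the default stated on this page.
    return "+".join(seen) if seen else "eng"


if __name__ == "__main__":
    print(language_arg(["eng", "spa", "fra"]))
```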

Can it handle handwritten text?

Generally no — Tesseract is trained on printed text. Some recent ML-based OCR engines handle handwriting better; we may add an optional handwriting mode in the future.

How long does it take?

Roughly 5-30 seconds per page for typical scans, depending on density and language complexity; very dense pages can take 30-60 seconds. Multi-language passes are slower. A 50-page scanned report typically completes in 1-3 minutes.

Is this different from just running OCR PDF then PDF to Markdown manually?

Same end state, fewer clicks. Both this chain and the standalone OCR tool deskew and auto-rotate during the OCR pass; the difference is that here you don't have to manage two separate jobs.

What about tables and images?

Tables get extracted as plain text rows (no Markdown table grids — those need column detection that's not reliable). Images in the source are dropped from the Markdown output. For true table extraction, use Extract Tables.

Is this safe for confidential documents?

Files are auto-deleted within an hour and never used to train AI. For maximum security on sensitive scans, self-host the stack (it's straightforward open-source software).

Ready to OCR to Markdown?

Free to use for the basics. Files are auto-deleted within an hour and never used to train AI.

Open OCR to Markdown