PDF to Markdown for LLM and RAG pipelines: what works

If you've tried building a RAG (retrieval-augmented generation) pipeline over PDF source material, you've hit the conversion problem. PDFs encode visual layout, not semantic structure. LLMs want semantic structure. Most direct PDF → text dumps lose the structure entirely.

The solution most teams converge on: PDF → Markdown → chunks → embeddings → vector store.

Why Markdown specifically

LLMs are trained on enormous amounts of Markdown. ChatGPT and Claude both default to outputting Markdown when given the choice. Their sense of "what a document looks like" appears to lean heavily on heading hierarchies and bullet lists.

When you give them a clean .md file, they reason about the document structure naturally. When you give them raw text dumps, they have to re-infer the structure, often inconsistently.

What "good Markdown from a PDF" looks like

You want:

- Heading levels detected: H1 for the document title, H2 for sections, H3 for subsections. This lets you chunk by heading boundary later.
- Lists preserved: bullet and numbered lists become `-` and `1.` in Markdown.
- Page breaks marked: ideally as Markdown horizontal rules, so chunkers can preserve page context.
- Tables either as Markdown tables (when reliable) or as plain rows: it's better to be honest about what couldn't be reconstructed than to emit broken Markdown.
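As a small illustration of the page-break point, here is a minimal sketch that turns page boundaries into horizontal rules. It assumes the text came from poppler's pdftotext, which by default separates pages with form feed characters; the function name `mark_page_breaks` is hypothetical.

```python
def mark_page_breaks(text: str) -> str:
    """Replace form feed page separators with Markdown horizontal rules."""
    # pdftotext ends each page with "\f", so splitting on it yields one
    # string per page (plus an empty trailing piece, which is dropped).
    pages = [page.strip() for page in text.split("\f")]
    pages = [page for page in pages if page]
    # Blank lines around "---" matter: a "---" directly under a text line
    # would be parsed as a setext H2 heading rather than a rule.
    return "\n\n---\n\n".join(pages)
```

A chunker can then split on the rule to recover page numbers, or ignore it and still get valid Markdown.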

What you don't need:

- Pixel-perfect visual fidelity. This is the wrong axis for AI ingestion. Don't optimize for it.
- Embedded images. LLMs can't see them anyway (without separate vision API calls). Drop them and link out if needed.

The chunking strategy that works

Once you have clean Markdown:

1. Split by H1/H2 boundaries, which are natural semantic boundaries.
2. For long sections, secondary-split by paragraph to keep each chunk under your embedding model's input token limit.
3. Preserve heading context in each chunk by prepending the heading hierarchy as metadata. This helps retrieval relevance.

Common gotcha: chunking by character count alone (e.g. "1000 chars per chunk") destroys the structure benefits. Chunk by structure first, then trim if too long.
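The three steps above can be sketched roughly as follows. This is a minimal illustration, not a production chunker: the names `Chunk`, `chunk_markdown`, and `split_paragraphs` are hypothetical, `max_chars` is a crude character-based stand-in for a real token limit, and the heading regex will misread `#` comment lines inside fenced code blocks.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    headings: list[str]  # heading path, e.g. ["Guide", "Setup"]
    text: str

    def embedding_text(self) -> str:
        """Chunk text with its heading path prepended as context."""
        prefix = " > ".join(self.headings)
        return f"{prefix}\n\n{self.text}" if prefix else self.text


def split_paragraphs(body: str, max_chars: int) -> list[str]:
    """Greedily pack whole paragraphs into pieces of at most max_chars.

    A single paragraph longer than max_chars is kept whole, not cut.
    """
    pieces: list[str] = []
    current = ""
    for para in re.split(r"\n\s*\n", body):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            pieces.append(current)
            current = para
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


def chunk_markdown(md: str, split_level: int = 2, max_chars: int = 2000) -> list[Chunk]:
    """Split at headings up to split_level, then by paragraph if too long."""
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []  # (level, title) of enclosing headings
    buf: list[str] = []

    def flush() -> None:
        body = "\n".join(buf).strip()
        buf.clear()
        if not body:
            return
        headings = [title for _, title in path]
        for piece in split_paragraphs(body, max_chars):
            chunks.append(Chunk(list(headings), piece))

    for line in md.splitlines():
        match = HEADING_RE.match(line)
        if match and len(match.group(1)) <= split_level:
            flush()  # close the previous section under its own heading path
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            buf.append(line)  # deeper headings (H3+) stay inside the chunk
    flush()
    return chunks
```

Calling `chunk.embedding_text()` gives a string such as `Guide > Setup` followed by the section body, so the heading hierarchy travels with the chunk into the embedding step. Structure decides the boundaries first; the length limit only triggers the secondary paragraph split.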

OCR before Markdown if scanned

If your source PDFs are scans (image-only), you need OCR before any of this works. The chain is:

1. OCR (we use Tesseract via OCRmyPDF) → searchable PDF
2. Extract text (poppler's pdftotext) → plain text
3. Heuristic structure pass → Markdown
4. Chunk → embeddings → vector store
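Steps 1 and 2 of this chain can be driven from a short script. The sketch below assumes the `ocrmypdf` and `pdftotext` command-line tools are installed and on your PATH, and uses their most basic invocations with no extra flags; the function names and output file names are hypothetical, and this is not necessarily how any packaged tool runs them.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def build_commands(scan: Path, workdir: Path) -> tuple[list[str], list[str], Path]:
    """Return the OCR command, the text-extraction command, and the text path."""
    searchable = workdir / f"{scan.stem}.ocr.pdf"  # hypothetical naming scheme
    text_path = workdir / f"{scan.stem}.txt"
    ocr_cmd = ["ocrmypdf", str(scan), str(searchable)]
    text_cmd = ["pdftotext", str(searchable), str(text_path)]
    return ocr_cmd, text_cmd, text_path


def ocr_to_text(scan: Path, workdir: Path) -> str:
    """Run OCR, then extract plain text from the searchable PDF."""
    workdir.mkdir(parents=True, exist_ok=True)
    ocr_cmd, text_cmd, text_path = build_commands(scan, workdir)
    # check=True raises CalledProcessError if either tool exits non-zero.
    subprocess.run(ocr_cmd, check=True)
    subprocess.run(text_cmd, check=True)
    return text_path.read_text(encoding="utf-8", errors="replace")
```

The returned plain text is the input to the heuristic structure pass in step 3.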

We've packaged steps 1–3 into a single tool: [OCR to Markdown](/tools/ocr-to-markdown). Drop the scanned PDF, get a .md file, feed it to your pipeline.

What we got wrong (and what we fixed)

An early version of our PDF-to-Markdown converter was too aggressive at promoting lines to headings. Random short capitalized lines (figure captions, page footers) became H1s, fragmenting the document. We tightened the heuristic: a line is a heading only if it's short, title-cased or all-caps, and not terminated by sentence punctuation.
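A heuristic along these lines might look like the sketch below. The three conditions come from the description above, but the specifics are assumptions for illustration: the function name `looks_like_heading`, the 10-word cutoff, and the rule that only words longer than three letters must be capitalized to count as title case.

```python
def looks_like_heading(line: str, max_words: int = 10) -> bool:
    """Guess whether a plain-text line is a heading rather than body text."""
    text = line.strip()
    words = text.split()
    # Condition 1: short.
    if not words or len(words) > max_words:
        return False
    # Condition 2: not terminated by sentence punctuation.
    if text[-1] in ".!?":
        return False
    # Condition 3a: all-caps (str.isupper needs at least one cased letter).
    if text.isupper():
        return True
    # Condition 3b: title case. Short words such as "and" or "the" may stay
    # lowercase, but the first word and every longer word must be capitalized.
    alpha_words = [w for w in words if w[0].isalpha()]
    if not alpha_words or not alpha_words[0][0].isupper():
        return False
    return all(w[0].isupper() for w in alpha_words if len(w) > 3)
```

With these thresholds, `Quarterly Results Overview` and `APPENDIX A` pass, while `Once you have clean Markdown` and any line ending in a period do not. Short title-cased captions and footers can still slip through, so the cutoffs are worth tuning against your own corpus.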

The heuristic is still imperfect on some layouts (multi-column magazines, ad-heavy designs). For those, you may need a manual cleaning pass on the output, but for standard prose documents it's reliable.

A pragmatic test

Before you invest in a complex PDF parsing pipeline, try the simple chain on 10 representative documents from your corpus:

1. Run them through PDFOnly's [PDF to Markdown](/tools/pdf-to-markdown).
2. Open the output in your text editor.
3. Ask: "If I were a chatbot reading this, would I know what's a section vs. a paragraph?"

If yes, you're done — the chain works. If no, the documents have layout patterns the heuristic misses, and you'll need a custom solution (PyMuPDF or LlamaParse). For most prose-heavy corpora, the simple chain is enough.