Convert PDF to clean Markdown — built for LLM workflows
Convert a PDF to clean Markdown that's ready to feed into ChatGPT, Claude, or any RAG pipeline. We extract the text, detect headings and lists, and emit valid Markdown — much cleaner than copy-paste.
Markdown has become a de facto input format for LLMs and RAG pipelines, but PDFs are still where most documents live. PDFOnly's PDF to Markdown converter extracts the text from your PDF and runs structure detection: short title-cased lines become headings, bulleted and numbered lines become Markdown lists, and runs of plain prose become paragraphs. The result is valid GitHub-flavored Markdown that loads cleanly into ChatGPT, Claude, Notion, Obsidian, or any RAG indexer. For scanned or image-only PDFs, run OCR first: the converter needs real selectable text to work with.
How to convert PDF to Markdown step by step
1. Upload your PDF
Up to 100 MB free, 200 MB on Pro. The PDF needs real selectable text. If it's a scan, run OCR PDF first to make it text-searchable, then come back here.
2. We extract and structure
Text is extracted with poppler's pdftotext, which generally preserves reading order for standard multi-column layouts. Heuristic structure detection then scans the result: short title-cased lines without trailing punctuation become headings; lines starting with '•', '-', '*', or a number such as '1.' become Markdown lists; runs of plain prose become paragraphs.
3. Download the .md file
Drop it into ChatGPT, Claude, Notion, Obsidian, or any tool that reads Markdown. The file is plain UTF-8 text: easy to diff, easy to version-control, easy to feed into any AI pipeline.
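The structure-detection step can be sketched in a few lines of Python. This is a minimal illustration of the heuristics described above, not PDFOnly's actual implementation: the function names, the 8-word heading limit, the list of minor words, and the choice to emit every heading as `##` are all assumptions made for the example.

```python
import re

# Lines starting with a bullet marker ('•', '-', '*') followed by whitespace.
BULLET_RE = re.compile(r"^[•\-*]\s+(.*)$")
# Lines starting with a number and a dot or parenthesis, e.g. "1." or "2)".
NUMBERED_RE = re.compile(r"^(\d+)[.)]\s+(.*)$")
# Small words that may stay lowercase in a title-cased line (assumed list).
MINOR_WORDS = {"a", "an", "and", "as", "at", "by", "for", "in", "of", "on", "or", "the", "to", "with"}


def is_heading(line: str, max_words: int = 8) -> bool:
    """Short, title-cased, and without trailing punctuation."""
    words = line.split()
    if not words or len(words) > max_words:
        return False
    if line.rstrip()[-1] in ".,;:!?":
        return False
    for index, word in enumerate(words):
        if index > 0 and word.lower() in MINOR_WORDS:
            continue
        if not (word[0].isupper() or word[0].isdigit()):
            return False
    return True


def to_markdown(text: str) -> str:
    """Turn extracted plain text into Markdown headings, lists, and paragraphs."""
    blocks = []      # finished Markdown blocks, joined with blank lines
    paragraph = []   # lines of the paragraph being collected
    items = []       # list items being collected

    def flush() -> None:
        if paragraph:
            blocks.append(" ".join(paragraph))
            paragraph.clear()
        if items:
            blocks.append("\n".join(items))
            items.clear()

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            flush()                      # blank line ends the current block
            continue
        bullet = BULLET_RE.match(line)
        numbered = NUMBERED_RE.match(line)
        if bullet or numbered:
            if paragraph:
                flush()                  # a list starts right after prose
            if bullet:
                items.append(f"- {bullet.group(1)}")
            else:
                items.append(f"{numbered.group(1)}. {numbered.group(2)}")
        elif is_heading(line):
            flush()
            blocks.append(f"## {line}")  # heading level is fixed in this sketch
        else:
            if items:
                flush()                  # prose starts right after a list
            paragraph.append(line)
    flush()
    return "\n\n".join(blocks) + "\n"
```

Calling `to_markdown()` on text extracted from a PDF returns a single Markdown string. Because the rules look only at the text of each line, a short title-cased sentence fragment can be mistaken for a heading, which is the kind of imperfection heuristic detection implies.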
Why convert PDF to Markdown on PDFOnly
Built for LLM and RAG workflows
We designed the output for AI ingestion, not for visual fidelity. Headings and lists are explicit, code-like blocks are preserved, and chunks are stable across runs.
Heuristic structure detection
Many tools just dump raw text. We try to detect H1/H2 from text characteristics so your headings end up as Markdown headings, not paragraphs.
Honest about limits
Tables, images, equations, and complex layouts don't translate well to Markdown — we tell you what was approximated and what was dropped, instead of pretending it's perfect.
Privacy-first
Your PDFs are deleted within an hour and never used to train AI. Self-host if you need air-gapped processing.
What people use PDF to Markdown for
A few common scenarios. If your workflow looks like one of these, this tool is a good fit.
Feed PDFs to ChatGPT or Claude for analysis
Pasting raw PDF text into an LLM gives messy results — the model has to re-infer structure. Convert to Markdown first so headings and lists are explicit, and the model can reason about the document's structure.
Build a RAG knowledge base from PDFs
RAG frameworks such as LangChain, LlamaIndex, and Haystack handle Markdown well. Convert your reference PDFs to .md, chunk by heading, embed, and index. The resulting chunks are cleaner than chunks cut from raw PDF text.
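As a rough illustration of the "chunk by heading" step, the sketch below splits a Markdown string at ATX headings (`#` through `######`) using only the standard library. The function name and the shape of the returned dictionaries are assumptions for this example; frameworks ship their own splitters, and this sketch ignores edge cases such as `#` lines inside fenced code.

```python
import re

# An ATX heading: one to six '#' characters, whitespace, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per heading, keeping the heading text."""
    chunks = []
    heading = None   # title of the section being collected (None before the first heading)
    lines = []       # body lines of the section being collected

    def close_section() -> None:
        body = "\n".join(lines).strip()
        # Skip the empty preamble before the first heading.
        if heading is not None or body:
            chunks.append({"heading": heading, "text": body})

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            close_section()
            heading = match.group(2).strip()
            lines = []
        else:
            lines.append(line)
    close_section()
    return chunks
```

Each chunk carries its heading alongside its text, so the heading can be stored as metadata or prepended to the text before embedding.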
Import documentation into Notion, Obsidian, or Markdown notes
Migrate vendor docs, manuals, or research papers from PDF into a Markdown-native knowledge base. Headings and lists are preserved, so the imported pages look right immediately.
Pipe PDFs into static-site generators or wikis
MkDocs, Hugo, Docusaurus, GitBook all consume Markdown. Convert legacy PDF documentation to .md, drop into the site's content directory, publish.
Translate or summarize PDFs by piping the .md to AI
Convert → feed to LLM → back as translated/summarized Markdown → optionally render to PDF again with HTML to PDF. Markdown is the lingua franca in the middle.
What you get
- Output is valid GitHub-flavored Markdown — no broken formatting
- Heading detection catches H1/H2/H3 from font and layout cues
- List detection preserves numbered and bullet lists
- Page breaks become Markdown horizontal rules so context stays intact
- Free, no signup, files auto-deleted in 1 hour
- Designed for LLM/RAG ingestion — predictable, parseable output
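The page-break behavior in the list above can be pictured with a short sketch. By default pdftotext separates pages with a form feed character (`\f`); the function below, whose name is hypothetical, turns those separators into Markdown horizontal rules. Whether PDFOnly does it exactly this way is an assumption.

```python
def pages_to_markdown(text: str) -> str:
    """Replace form-feed page separators with Markdown horizontal rules."""
    # Split on form feeds and drop empty pages (pdftotext also ends the
    # output with a form feed, which would otherwise yield an empty page).
    pages = [page.strip() for page in text.split("\f")]
    pages = [page for page in pages if page]
    # Blank lines around '---' keep it a horizontal rule; directly under a
    # text line it would be read as a setext heading underline instead.
    return "\n\n---\n\n".join(pages) + "\n"
```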
Frequently asked questions
Does it work on scanned PDFs?
No — scanned PDFs are images. There's no real text to extract. Run OCR PDF first to make the scan text-searchable, then convert. The output quality depends on the OCR quality.
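One way to tell whether a PDF needs OCR before conversion is to look at how much text the extraction step actually produced. The sketch below is an assumption-laden heuristic, not part of PDFOnly: the function name and the threshold of 25 visible characters per page are made up for illustration, and it expects pages to be separated by form feeds, as in pdftotext's default output.

```python
def looks_scanned(extracted_text: str, min_chars_per_page: int = 25) -> bool:
    """Guess whether a PDF is image-only from the text extracted from it."""
    pages = extracted_text.split("\f")
    # The output usually ends with a form feed, leaving an empty last piece.
    if pages and not pages[-1].strip():
        pages = pages[:-1]
    if not pages:
        return True  # nothing was extracted at all
    # Count non-whitespace characters across all pages.
    visible = sum(len("".join(page.split())) for page in pages)
    return visible / len(pages) < min_chars_per_page
```

A `True` result only suggests that OCR is worth running first; a document made of short captions or mostly blank pages could trip the same threshold.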
Are tables converted to Markdown tables?
Tables in PDFs don't have machine-readable structure unless they were tagged at export. We extract the text but don't try to reconstruct table grids — for true table extraction, use Extract Tables, which outputs CSV/Excel.
What about images and equations?
Images are dropped from the Markdown output. Equations rendered as images are also dropped. If equations were typeset as text (LaTeX-like inline math), they pass through as plain text.
Will it preserve heading levels (H1, H2, H3)?
We attempt to. Truly accurate heading detection requires the PDF to have been tagged at export (which most aren't). Our heuristics catch obvious cases (large/bold/short lines) but won't be perfect on every document.
Does it handle multi-column layouts?
Poppler's text extraction handles standard two-column academic layouts well. Highly creative magazine layouts may interleave columns — for those, you may need to manually reorder paragraphs.
Can I feed the output directly to ChatGPT?
Yes. The output is plain UTF-8 Markdown. Paste it into ChatGPT/Claude as the user message, or upload as an attachment. For very long documents, chunk by heading first.
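For long documents, one possible approach is to cut the Markdown at headings and then pack consecutive sections into pieces that stay under a size budget. The sketch below does this with the standard library; the function name and the 8000-character default are assumptions, and characters are only a rough stand-in for a model's token limit.

```python
import re


def split_for_prompt(markdown: str, max_chars: int = 8000) -> list[str]:
    """Pack heading-delimited sections into parts of at most max_chars."""
    # Zero-width split: cut immediately before every line that starts a heading.
    sections = re.split(r"(?m)^(?=#{1,6}\s)", markdown)
    parts = []
    current = ""
    for section in sections:
        if not section.strip():
            continue
        # Start a new part when adding this section would exceed the budget.
        if current and len(current) + len(section) > max_chars:
            parts.append(current.rstrip() + "\n")
            current = ""
        current += section
    if current.strip():
        parts.append(current.rstrip() + "\n")
    return parts
```

A single section longer than `max_chars` is kept whole in its own part rather than cut mid-section, so very long sections may still need a second pass.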
How is this different from PDF to Text?
PDF to Text gives you raw plaintext — no structure cues, just lines of text. PDF to Markdown applies heuristic structure detection so headings and lists are formatted, making the output much more useful for AI/LLM workflows or for direct import into knowledge tools.
Ready to convert PDF to Markdown?
Free to use for the basics. Files are auto-deleted within an hour and never used to train AI.