Extract text from PDF as plain .txt
Extract the text content from any PDF as a plain .txt file. Layout-aware extraction preserves columns and reading order. Works on scanned PDFs via OCR.
Extracting plain text from a PDF is one of the most common building blocks in document workflows — especially now that AI assistants, search engines, and analytics pipelines all want clean text inputs. Copy-pasting from a PDF reader produces broken paragraphs, scrambled multi-column reads, and ligatures merged into junk characters. PDFOnly's PDF-to-Text uses pdftotext (poppler) with layout-aware extraction, so columns stay separated, paragraphs stay intact, and special characters round-trip correctly. For scanned PDFs (image-only), OCR runs first to recover the text.
How to convert PDF to text, step by step
1. Upload your PDF
Drop the PDF you want as text. Both digital and scanned PDFs work; we auto-detect the type and route the file accordingly.
2. We extract the text
Digital PDFs: pdftotext with the -layout flag for column-aware extraction. Scanned PDFs: OCRmyPDF runs first to add a text layer, then we extract from that layer.
3. Download as .txt
Output is a UTF-8 encoded plain text file. It opens in Notepad, TextEdit, VS Code, or any other text editor, with no special software needed.
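The steps above can be approximated on your own machine. This is a minimal sketch, not PDFOnly's actual implementation. It assumes the pdftotext (Poppler) and ocrmypdf command-line tools are installed. The helper names and the "no extractable text means scanned" heuristic are illustrative assumptions.

```python
import subprocess
from pathlib import Path


def pdftotext_cmd(pdf: Path, txt: Path) -> list[str]:
    """Command line for layout-preserving extraction to UTF-8 text."""
    return ["pdftotext", "-layout", "-enc", "UTF-8", str(pdf), str(txt)]


def ocr_cmd(pdf: Path, ocr_pdf: Path) -> list[str]:
    """Command line that adds a text layer; --skip-text leaves pages
    that already contain text untouched."""
    return ["ocrmypdf", "--skip-text", str(pdf), str(ocr_pdf)]


def looks_scanned(text: str) -> bool:
    """Assumed heuristic: nothing but whitespace means image-only pages."""
    return not text.strip()


def pdf_to_text(pdf: Path, txt: Path) -> str:
    """Extract text, falling back to OCR when the first pass finds nothing."""
    subprocess.run(pdftotext_cmd(pdf, txt), check=True)
    text = txt.read_text(encoding="utf-8")
    if looks_scanned(text):
        # Hypothetical intermediate file name for the OCR'd copy.
        ocr_pdf = pdf.with_name(pdf.stem + ".ocr.pdf")
        subprocess.run(ocr_cmd(pdf, ocr_pdf), check=True)
        subprocess.run(pdftotext_cmd(ocr_pdf, txt), check=True)
        text = txt.read_text(encoding="utf-8")
    return text
```

Calling pdf_to_text(Path("report.pdf"), Path("report.txt")) writes the .txt file and returns its contents. A production detector would likely check per page rather than treating the whole document as either digital or scanned.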
Why convert PDF to text on PDFOnly
Layout-aware extraction
Most basic PDF-to-text tools dump the raw character stream, which scrambles multi-column documents. We use layout-aware extraction so each column is read top to bottom, and the columns are taken in order from left to right.
OCR for scans
Some tools fail silently on scanned PDFs and produce empty output. We detect scans and run OCR automatically so you always get usable text.
Clean ligature handling
PDFs often store 'fi', 'fl', 'ff' as single ligature characters that copy-paste mangles. Our extractor maps them back to separate letters in the output text.
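To illustrate what mapping ligatures back to letters means, here is a small sketch. It is not PDFOnly's code. It assumes the ligatures appear as Unicode presentation-form characters (U+FB00 to U+FB04), and the function name is hypothetical.

```python
# Unicode presentation-form ligatures and their plain-letter expansions.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def expand_ligatures(text: str) -> str:
    """Replace single-character ligatures with their separate letters."""
    return text.translate(str.maketrans(LIGATURES))


print(expand_ligatures("e\ufb03cient work\ufb02ow"))  # efficient workflow
```

Python's unicodedata.normalize("NFKC", text) expands the same characters, but it also rewrites other compatibility characters (superscripts, full-width forms), so an explicit table is the more conservative choice when only ligatures should change.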
What people use PDF to Text for
A few common scenarios. If your workflow looks like one of these, this tool is a good fit.
Feed text into an AI assistant
Most LLMs work much better with clean plain text than PDF uploads. Convert first, paste the .txt, get better answers.
Index PDFs in a search system
Elasticsearch, Algolia, Typesense, and most search engines want plain text inputs. Convert your PDF archive to .txt for full-text indexing.
Word-count or text-analyze a document
Word counters, readability tools, and NLP pipelines all expect plain text. Extract first, then analyze.
Quote large passages without copy-paste artifacts
Researchers quoting from PDF papers avoid the broken line breaks and ligature merges of copy-paste by extracting clean text first.
What you get
- Layout-aware extraction — multi-column documents read in correct order
- Plain UTF-8 .txt output — opens in any editor or feeds any pipeline
- Auto-OCR for scanned PDFs in 100+ languages
- Preserves paragraph breaks and basic structure
- Free for files up to 100 MB
- Files auto-deleted in 1 hour, never used to train AI
Frequently asked questions
Will the output preserve formatting like bold or headings?
No. Plain text has no formatting, and the output is .txt, not Word or Markdown. Paragraph breaks and the order of content are preserved, but bold, italic, and heading distinctions are lost. To preserve formatting, use PDF to Word instead.
Does it work on scanned PDFs?
Yes. We auto-detect scanned (image-only) PDFs and run OCR first. Quality of the extracted text depends on scan quality — clean 300 DPI scans give 95%+ character accuracy. Older or lower-resolution scans drop to 80-90%.
How are multi-column documents handled?
We use layout-aware extraction so each column is read top to bottom, and the columns are taken in order from left to right. Two-column journal articles, three-column newsletters, and similar layouts come out in correct reading order.
Can I get the text in another encoding?
Output is UTF-8 by default — works correctly with any modern editor and supports all 100+ languages we OCR. If you need a specific encoding (e.g. Windows-1252 for legacy systems), use any text editor to convert after extraction.
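If you would rather script the conversion than use an editor, a few lines of Python are enough. This is a sketch under assumptions: the function names are hypothetical, and the choice to replace unrepresentable characters with "?" instead of raising an error is illustrative.

```python
from pathlib import Path


def reencode(data: bytes, target: str = "cp1252") -> bytes:
    """Decode UTF-8 bytes and re-encode them in the target encoding.
    Characters the target cannot represent become '?'."""
    return data.decode("utf-8").encode(target, errors="replace")


def convert_file(src: Path, dst: Path, target: str = "cp1252") -> None:
    """Re-encode a UTF-8 text file into a new file."""
    dst.write_bytes(reencode(src.read_bytes(), target))
```

"cp1252" is Python's codec name for Windows-1252. Pass errors="strict" instead if you would rather fail loudly than lose characters silently.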
What about tables?
Tables in PDF have no semantic structure — they're just positioned text. The extractor preserves their visual position with whitespace, but the result isn't a structured table. For real table extraction, use PDF to Excel or Extract Tables.
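Because table cells arrive as whitespace-aligned text, a rough way to recover columns from the .txt output is to split each line on runs of two or more spaces. The sketch below is a heuristic with made-up sample data, not a feature of the tool. It breaks on empty cells and on cells that contain wide internal spacing.

```python
import re


def split_row(line: str) -> list[str]:
    """Split one layout-preserved line into cells on runs of 2+ spaces."""
    return [cell for cell in re.split(r" {2,}", line.strip()) if cell]


# Illustrative sample of layout-preserved table text.
sample = """\
Item          Qty    Unit price
Paper clips   200          0.02
Stapler         3         12.50
"""

rows = [split_row(line) for line in sample.splitlines() if line.strip()]
for row in rows:
    print(row)
```

Single spaces inside a cell ("Paper clips", "Unit price") survive because only runs of two or more spaces count as column separators.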
Will images be included?
No, only text content is extracted. Images are skipped because they contain no text. If you need the images, render the PDF page by page as JPGs using PDF to JPG.
Maximum PDF size?
100 MB free, 200 MB on Pro. There's no hard limit on the number of pages — we've successfully extracted from 10,000-page PDFs.
Ready to convert PDF to text?
Free to use for the basics. Files are auto-deleted within an hour and never used to train AI.
Open PDF to Text