OCR Japanese PDF — Kanji, Hiragana, Katakana
Make scanned Japanese PDFs searchable. Recognizes Kanji, Hiragana, and Katakana in mixed text. Free, 100+ languages.
Japanese text routinely mixes three scripts within a single sentence: Kanji (logographic characters of Chinese origin), Hiragana (a phonetic syllabary for native words and grammatical endings), and Katakana (a phonetic syllabary used mainly for loanwords). OCR has to recognize all three and segment them correctly, even though Japanese is written without spaces between words. PDFOnly uses Tesseract's Japanese language pack, which is trained on mixed-script content and handles both traditional vertical layouts and modern horizontal text.
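To make the three-script mix concrete, here is a minimal Python sketch that classifies each character by its Unicode block. It only illustrates what "mixed script" means and is not how Tesseract or PDFOnly works internally. The function names are hypothetical, and only the basic Hiragana, Katakana, and CJK Unified Ideographs blocks are covered.

```python
from collections import Counter

# Basic Unicode blocks for the three scripts (extension blocks are ignored here).
SCRIPT_RANGES = {
    "hiragana": (0x3040, 0x309F),
    "katakana": (0x30A0, 0x30FF),
    "kanji": (0x4E00, 0x9FFF),  # CJK Unified Ideographs
}


def script_of(ch: str) -> str:
    """Return the script name for one character, or 'other'."""
    code = ord(ch)
    for name, (low, high) in SCRIPT_RANGES.items():
        if low <= code <= high:
            return name
    return "other"


def script_counts(text: str) -> Counter:
    """Count characters per script, skipping whitespace."""
    return Counter(script_of(ch) for ch in text if not ch.isspace())


# "I drink coffee": one short sentence that uses all three scripts.
print(script_counts("私はコーヒーを飲みます"))
# 5 hiragana, 4 katakana, 2 kanji
```

A count like this can also serve as a quick sanity check on OCR output: a page that comes back with almost no Kanji or no kana was probably recognized with the wrong language setting.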
Use cases: digitizing scanned Japanese contracts and business documents; processing scanned manga, magazines, or technical manuals for archival and search; making Japanese academic papers searchable for citation; or feeding Japanese text into translation, NLP, or search pipelines.
Frequently asked questions
Does it handle vertical Japanese text?
Yes, for modern documents printed vertically. Scans of older books or magazines with right-to-left page order may also need our Rotate PDF tool to correct the page orientation before OCR.
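If you run Tesseract yourself, the sketch below shows one way to choose settings for vertical and horizontal pages. It assumes the tesseract command-line tool is installed along with the `jpn` and `jpn_vert` language data, and it is not necessarily how PDFOnly invokes Tesseract. The function name and file names are hypothetical.

```python
import shlex


def build_tesseract_command(image_path, output_base, vertical=False):
    """Build a tesseract CLI call that writes a searchable PDF.

    Assumes the 'jpn' and 'jpn_vert' traineddata files are installed.
    """
    lang = "jpn_vert" if vertical else "jpn"
    # PSM 5 treats the page as one block of vertically aligned text;
    # PSM 3 is Tesseract's default automatic page segmentation.
    psm = "5" if vertical else "3"
    # The trailing "pdf" config asks for <output_base>.pdf with a text layer.
    return ["tesseract", image_path, output_base, "-l", lang, "--psm", psm, "pdf"]


command = build_tesseract_command("page.png", "page", vertical=True)
print(shlex.join(command))
# tesseract page.png page -l jpn_vert --psm 5 pdf
```

To execute it, pass the list to `subprocess.run(command, check=True)`. Building the command as a list avoids shell quoting problems with file names that contain spaces or Japanese characters.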
How accurate is Japanese OCR?
On clean printed scans at 300 DPI, expect 90-95% character accuracy. Handwritten Japanese is much harder and beyond the scope of standard Tesseract; for handwritten documents, dedicated tools such as Google Document AI work better.
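Character accuracy is commonly computed as one minus the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The sketch below assumes that definition; the figures quoted above may have been measured differently, and the function names and sample strings are made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Fraction of reference characters recovered, clamped at zero."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


reference = "契約書を確認しました"   # 10 characters, typed by hand
ocr_output = "契約書を確認しまLた"  # one character misread
print(f"{character_accuracy(reference, ocr_output):.0%}")
```

Measuring a few representative pages this way gives a more reliable estimate for your own scans than any general figure, since accuracy depends heavily on font, resolution, and scan quality.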
Will furigana (small phonetic characters above Kanji) be recognized?
Yes, but furigana appears as ordinary inline characters in the OCR output, which can interrupt the flow of the main text. For text intended for Japanese learners, this is usually fine. For clean text extraction, you may want to filter the furigana out or post-process the output to associate each reading with its parent Kanji.
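One possible filtering heuristic, sketched below, drops kana-only tokens that are much shorter than the typical token on the page. It assumes you have token heights available, for example from the `height` column of Tesseract's TSV output. The function names, the sample data, and the 0.6 ratio are illustrative assumptions, not tuned values.

```python
from statistics import median


def is_kana(ch: str) -> bool:
    """True for characters in the Hiragana or Katakana Unicode blocks."""
    return "\u3040" <= ch <= "\u30ff"


def drop_furigana(tokens, ratio=0.6):
    """Remove small kana-only tokens from (text, height_px) pairs.

    Assumes body text dominates the page, so the median height
    reflects the body font rather than the furigana.
    """
    if not tokens:
        return []
    body_height = median(height for _, height in tokens)
    kept = []
    for text, height in tokens:
        kana_only = bool(text) and all(is_kana(ch) for ch in text)
        if kana_only and height < ratio * body_height:
            continue  # likely a furigana reading, skip it
        kept.append(text)
    return kept


# Hypothetical tokens: body text is about 40 px tall, furigana about 15 px.
tokens = [("漢字", 40), ("かんじ", 16), ("を", 38), ("読", 40), ("よ", 15), ("む", 39)]
print("".join(drop_furigana(tokens)))
# 漢字を読む
```

Filtering by size rather than by script matters because ordinary Hiragana in the body text must be kept. On pages where furigana tokens outnumber body tokens, the median would be skewed and a different baseline would be needed.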