How to extract text from a PDF (free online and programmatic methods)

Extract text from any PDF in seconds, free and online, or automate it with JavaScript, Python, and the PDF.js API. No signup required for the free tool.

benoitded · March 25, 2026 · 11 min read

PDF text extraction is the process of reading the text layer embedded in a PDF file and outputting it as plain, editable text. For digitally created PDFs, this is fast and accurate. For scanned PDFs (which are images), you need OCR first.

This guide covers the free online method, programmatic extraction in JavaScript and Python, and the difference between text extraction and OCR.

How to extract text from a PDF online (free, no signup)

The fastest way to extract text from a PDF is PDF4.dev's free PDF to Text tool at /tools/pdf-to-text. Upload your PDF, click extract, and copy or download the result. The entire process runs in your browser using PDF.js — your file never reaches a server.

Steps:

  1. Open pdf4.dev/tools/pdf-to-text
  2. Drop your PDF onto the upload area or click to browse
  3. Click Extract text
  4. Copy the text directly from the output panel or download as a .txt file

No account is required. The free tier allows 3 extractions per week. For unlimited use, create a free PDF4.dev account.

Text extraction vs OCR: which do you need?

Text extraction reads the text layer already embedded in a PDF. This layer exists in PDFs created by word processors (Word, Google Docs), design tools (InDesign, Figma), or programmatic generators (Playwright, PDF4.dev). Extraction is instantaneous and accurate.

OCR (Optical Character Recognition) is needed when the PDF contains scanned images of text with no underlying text layer. OCR uses machine learning to interpret pixel patterns as characters. It is slower (1-5 seconds per page) and less accurate than extraction.

| Property | Text extraction | OCR |
| --- | --- | --- |
| Input type | Digitally created PDFs | Scanned PDFs, images |
| Speed | Under 100 ms per page | 1-5 seconds per page |
| Accuracy | Near 100% for standard PDFs | 85-98% depending on scan quality |
| Requires ML model | No | Yes |
| Preserves layout | Approximate | Approximate |
| Free tools | PDF.js, pdfminer, pypdf | Tesseract, Google Drive, OCR.space |

To check if your PDF has a text layer: try selecting and copying text in any PDF viewer. If text highlights and copies correctly, use text extraction. If nothing selects, use OCR.
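The same check can be scripted: run any extractor from the sections below on the file and measure how much text comes back. A minimal heuristic sketch (the 10-character-per-page threshold is an assumption; tune it for your documents):

```python
def needs_ocr(page_texts: list[str], min_chars_per_page: int = 10) -> bool:
    """Heuristic: if extraction returns almost no text per page,
    the PDF is likely image-only and needs OCR instead."""
    if not page_texts:
        return True
    total = sum(len(t.strip()) for t in page_texts)
    return total / len(page_texts) < min_chars_per_page
```

Feed it the per-page strings from any of the extraction helpers below; a `True` result means the document should go through an OCR pipeline instead.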

How to extract text from a PDF in JavaScript (browser and Node.js)

PDF.js is the standard library for PDF text extraction in JavaScript. It runs in browsers and in Node.js, and it is the engine behind PDF4.dev's free tool.

Method 1: pdf-parse (Node.js, one-call API)

pdf-parse is a thin wrapper around PDF.js that returns the full text in one async call. It has over 1 million weekly downloads on npm.

import pdfParse from "pdf-parse";
import fs from "fs";
 
async function extractText(filePath: string): Promise<string> {
  const buffer = fs.readFileSync(filePath);
  const result = await pdfParse(buffer);
  return result.text;
}
 
const text = await extractText("./document.pdf");
console.log(text);

The result object includes result.text (full extracted text), result.numpages (page count), and result.info (PDF metadata).

Method 2: pdfjs-dist (browser or Node.js, page-by-page control)

For finer control — extracting specific pages, preserving text positions, or running in a browser — use pdfjs-dist directly.

import * as pdfjsLib from "pdfjs-dist";
 
// Node.js: disable worker
pdfjsLib.GlobalWorkerOptions.workerSrc = "";
 
async function extractTextFromPages(
  pdfBuffer: ArrayBuffer,
  pageNumbers?: number[]
): Promise<string> {
  const pdf = await pdfjsLib.getDocument({ data: pdfBuffer }).promise;
  const pages = pageNumbers ?? Array.from({ length: pdf.numPages }, (_, i) => i + 1);
 
  const texts: string[] = [];
  for (const pageNum of pages) {
    const page = await pdf.getPage(pageNum);
    const content = await page.getTextContent();
    const pageText = content.items
      .map((item) => ("str" in item ? item.str : ""))
      .join(" ");
    texts.push(pageText);
  }
 
  return texts.join("\n\n");
}

getTextContent() returns items with str (the text string), transform (the 6-element affine transformation matrix giving position and rotation), and width and height in PDF units. Use transform[4] and transform[5] as x/y coordinates for layout reconstruction.

Browser example: extract text from a user-uploaded file

import * as pdfjsLib from "pdfjs-dist";
pdfjsLib.GlobalWorkerOptions.workerSrc = `//cdnjs.cloudflare.com/ajax/libs/pdf.js/3.11.174/pdf.worker.min.js`;
 
async function extractFromFile(file: File): Promise<string> {
  const arrayBuffer = await file.arrayBuffer();
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise;
 
  let fullText = "";
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const content = await page.getTextContent();
    const pageText = content.items
      .map((item) => ("str" in item ? item.str : ""))
      .join(" ");
    fullText += pageText + "\n\n";
  }
  return fullText;
}
 
// Usage with a file input
document.getElementById("fileInput")?.addEventListener("change", async (e) => {
  const file = (e.target as HTMLInputElement).files?.[0];
  if (file) {
    const text = await extractFromFile(file);
    console.log(text);
  }
});

How to extract text from a PDF in Python

Three libraries dominate Python PDF text extraction in 2026. Each has different accuracy tradeoffs for complex PDFs.

pdfminer.six (best accuracy for complex fonts)

pdfminer.six is the most accurate Python library for PDFs with custom fonts, non-standard encodings, or unusual glyph mappings. It is slower than pypdf but handles edge cases better.

from pdfminer.high_level import extract_text, extract_pages
from pdfminer.layout import LTTextContainer
 
def extract_full_text(pdf_path: str) -> str:
    return extract_text(pdf_path)
 
def extract_text_by_page(pdf_path: str) -> list[str]:
    pages = []
    for page_layout in extract_pages(pdf_path):
        page_text = ""
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                page_text += element.get_text()
        pages.append(page_text)
    return pages
 
# Usage
text = extract_full_text("document.pdf")
print(text)
 
pages = extract_text_by_page("document.pdf")
for i, page in enumerate(pages, 1):
    print(f"--- Page {i} ---")
    print(page)

Install with: pip install pdfminer.six

pypdf (simple, fast, standard PDFs)

pypdf (formerly PyPDF2) is a pure-Python library with a simpler API. It is faster than pdfminer for standard PDFs but less accurate with complex font encodings.

import pypdf
 
def extract_text_pypdf(pdf_path: str) -> str:
    reader = pypdf.PdfReader(pdf_path)
    pages_text = []
    for page in reader.pages:
        pages_text.append(page.extract_text())
    return "\n\n".join(pages_text)
 
def extract_specific_pages(pdf_path: str, page_numbers: list[int]) -> str:
    """page_numbers is 0-indexed."""
    reader = pypdf.PdfReader(pdf_path)
    texts = []
    for page_num in page_numbers:
        texts.append(reader.pages[page_num].extract_text())
    return "\n\n".join(texts)
 
# Extract pages 1-3 (0-indexed: 0, 1, 2)
text = extract_specific_pages("document.pdf", [0, 1, 2])
print(text)

Install with: pip install pypdf

PyMuPDF (fastest, good accuracy)

PyMuPDF (imported as fitz) is the fastest option and handles a wide range of PDFs, including those with complex layouts.

import fitz  # PyMuPDF
 
def extract_text_pymupdf(pdf_path: str) -> str:
    doc = fitz.open(pdf_path)
    full_text = ""
    for page in doc:
        full_text += page.get_text() + "\n\n"
    doc.close()
    return full_text
 
def extract_with_blocks(pdf_path: str) -> list[dict]:
    """Returns structured blocks with position and text."""
    doc = fitz.open(pdf_path)
    blocks = []
    for page_num, page in enumerate(doc):
        for block in page.get_text("blocks"):
            x0, y0, x1, y1, text, block_no, block_type = block
            blocks.append({
                "page": page_num + 1,
                "text": text.strip(),
                "bbox": (x0, y0, x1, y1),
            })
    doc.close()
    return [b for b in blocks if b["text"]]
 
text = extract_text_pymupdf("document.pdf")
print(text)

Install with: pip install pymupdf

Comparison: Python PDF text extraction libraries

| Library | Speed | Accuracy (complex) | API simplicity | Active development |
| --- | --- | --- | --- | --- |
| pdfminer.six | Slow | Excellent | Medium | Yes |
| pypdf | Fast | Good | Simple | Yes |
| PyMuPDF (fitz) | Very fast | Very good | Simple | Yes |
| pdfplumber | Slow | Excellent (tables) | Medium | Yes |

For tables specifically, pdfplumber (which wraps pdfminer) offers the best structured table extraction. Install with pip install pdfplumber.
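A minimal pdfplumber sketch (assumes a file named document.pdf with at least one table on its first page):

```python
import pdfplumber

# Each extracted table is a list of rows; each row is a list of cell strings (or None).
with pdfplumber.open("document.pdf") as pdf:
    for table in pdf.pages[0].extract_tables():
        for row in table:
            print(row)
```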

Handling common text extraction problems

Problem: text is concatenated without spaces

PDF text items do not always include space characters between words, because the format positions each glyph by coordinate rather than relying on whitespace. Reconstruct spaces by checking whether the horizontal gap between the end of one item (its transform[4] x origin plus its width) and the start of the next exceeds a small threshold.

With pdfjs-dist, use includeMarkedContent: true and check item.hasEOL to detect line breaks. For most PDFs, joining with a space character rather than empty string gives better results.
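As a pure-Python sketch of that gap check — the item fields mirror what PDF.js gives you (`transform[4]` as `x`, plus `width`), and the 1.0-unit threshold is an assumption to tune:

```python
def join_with_spaces(items: list[dict], gap_threshold: float = 1.0) -> str:
    """Join text items on one line, inserting a space whenever the gap
    between one item's right edge and the next item's left edge exceeds
    gap_threshold. Each item: {"str": str, "x": float, "width": float}."""
    out = []
    prev_end = None
    for it in items:
        # Insert a space only when the items are visibly separated
        if prev_end is not None and it["x"] - prev_end > gap_threshold:
            out.append(" ")
        out.append(it["str"])
        prev_end = it["x"] + it["width"]
    return "".join(out)
```

Items split mid-word (tiny or zero gap) are joined directly, while word boundaries (larger gap) get a space.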

Problem: text is out of reading order

PDFs from design tools may store text objects in creation order, not reading order. To reconstruct reading order:

  1. Sort text items by vertical position (transform[5], descending for top-to-bottom)
  2. Group items within the same horizontal band (within ±2 units of y)
  3. Sort each band by horizontal position (transform[4], ascending)

pdfminer.six applies this algorithm automatically via LAParams.
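The three steps above can be sketched in pure Python, assuming each item carries its transform[4]/transform[5] origin as `x`/`y` (PDF y grows upward, so top-to-bottom reading means descending y):

```python
def reading_order_text(items: list[dict], band_tolerance: float = 2.0) -> str:
    """Reorder PDF text items into reading order.

    Each item is a dict {"str": str, "x": float, "y": float}, where x/y
    come from transform[4]/transform[5].
    """
    # Step 1: sort by vertical position, top of the page first
    ordered = sorted(items, key=lambda it: -it["y"])

    # Step 2: group items into horizontal bands within the y tolerance
    bands: list[list[dict]] = []
    for it in ordered:
        if bands and abs(bands[-1][0]["y"] - it["y"]) <= band_tolerance:
            bands[-1].append(it)
        else:
            bands.append([it])

    # Step 3: sort each band left to right, then join bands as lines
    lines = []
    for band in bands:
        band.sort(key=lambda it: it["x"])
        lines.append(" ".join(it["str"] for it in band))
    return "\n".join(lines)
```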

Problem: no text extracted (scanned PDF)

If getTextContent() returns zero items, the PDF contains raster images. You need OCR. Options:

  • Tesseract.js (browser/Node.js): free, open source, 100+ languages
  • Google Drive: upload the PDF, right-click → Open with → Google Docs. Google's OCR runs automatically
  • OCR.space API: free tier up to 25,000 requests/month

For programmatic OCR in Python, pytesseract wraps the Tesseract binary. Render each PDF page to an image first (PyMuPDF's page.get_pixmap()), then run OCR on the image.
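A sketch of that pipeline (assumes `pip install pymupdf pytesseract pillow` plus a system Tesseract install; the `dpi` value is a starting point, not a recommendation):

```python
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def ocr_pdf(pdf_path: str, dpi: int = 200) -> str:
    """Render each page to an image with PyMuPDF, then OCR it with Tesseract."""
    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        pix = page.get_pixmap(dpi=dpi)  # rasterize the page
        img = Image.open(io.BytesIO(pix.tobytes("png")))
        pages.append(pytesseract.image_to_string(img))
    doc.close()
    return "\n\n".join(pages)
```

Higher DPI improves OCR accuracy on small text at the cost of render time and memory.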

Problem: garbled characters or question marks

This happens with PDFs that use custom (Type 3) fonts or proprietary encodings where the glyph-to-Unicode mapping is not embedded. pdfminer.six handles more encoding edge cases than pypdf. For PDFs produced by older software (pre-2000), PyMuPDF often has better decoding logic.

Using PDF4.dev to generate text-searchable PDFs

When you generate PDFs with PDF4.dev, the output is always a native, text-searchable PDF. Playwright renders your HTML to a PDF with a full text layer embedded — every word is extractable without OCR.

This matters for document workflows where PDFs are stored and later searched, processed, or indexed. Scanned document workflows and image-to-PDF pipelines often skip the text layer; programmatic generation never does.

The PDF to Text tool at pdf4.dev/tools/pdf-to-text uses the same PDF.js extraction engine described in this guide, so you can verify the text layer of any generated PDF immediately after creation.

For bulk text extraction from many PDFs — for example, indexing a document archive — combine PDF4.dev's merge PDF tool (to consolidate files) with a Node.js or Python script using the extraction code above.
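A sketch of one way to structure that batch job. The extractor is injected as a callable so any library from this guide can be plugged in; the function name and error format here are illustrative, not a PDF4.dev API:

```python
from pathlib import Path
from typing import Callable

def extract_archive(root: str, extract: Callable[[Path], str]) -> dict[str, str]:
    """Walk a directory tree, run `extract` on every .pdf file, and
    return a mapping of path (relative to root) -> extracted text."""
    root_path = Path(root)
    results: dict[str, str] = {}
    for pdf_path in sorted(root_path.rglob("*.pdf")):
        try:
            text = extract(pdf_path)
        except Exception as exc:  # corrupt or encrypted PDF: record and keep going
            text = f"<error: {exc}>"
        results[str(pdf_path.relative_to(root_path))] = text
    return results
```

Injecting the extractor also makes the walker easy to unit-test with a stub before wiring in pypdf, PyMuPDF, or pdfminer.six.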

When to use a PDF to Word converter instead

Plain text extraction loses all formatting: headings become flat text, tables become a stream of values, bold and italic are gone. If you need to edit the PDF content in a word processor, use PDF to Word conversion instead.

| Goal | Use |
| --- | --- |
| Search text content of a PDF | Text extraction (PDF.js, pdfminer) |
| Feed PDF content to an LLM or search index | Text extraction |
| Edit PDF content in Word or Google Docs | PDF to Word converter |
| Extract data from PDF tables | pdfplumber (Python) |
| Extract text from a scanned document | OCR (Tesseract, Google Drive) |

Summary

PDF text extraction reads the text layer embedded in digitally created PDFs. For browser and Node.js, use PDF.js via the pdfjs-dist package or the pdf-parse wrapper. For Python, pdfminer.six is the most accurate, pypdf is the simplest, and PyMuPDF is the fastest.

Scanned PDFs require OCR — text extraction returns nothing from image-only documents.

Use PDF4.dev's free PDF to Text tool for quick one-off extractions, and the code examples in this guide for automated, programmatic workflows.

Free tools mentioned:

PDF to Text · PDF to PNG · Compress PDF

Start generating PDFs

Build PDF templates with a visual editor. Render them via API from any language in ~300ms.