How to extract text from a PDF (free online and programmatic methods)

Extract text from any PDF in seconds, free and online, or automate it with JavaScript, Python, and the PDF.js API. No signup required for the free tool.

benoitded · March 25, 2026 · 11 min read

PDF text extraction is the process of reading the text layer embedded in a PDF file and outputting it as plain, editable text. For digitally created PDFs, this is fast and accurate. For scanned PDFs (which are images), you need OCR first.

This guide covers the free online method, programmatic extraction in JavaScript and Python, and the difference between text extraction and OCR.

How to extract text from a PDF online (free, no signup)

The fastest way to extract text from a PDF is PDF4.dev's free PDF to Text tool at /tools/pdf-to-text. Upload your PDF, click extract, and copy or download the result. The entire process runs in your browser using PDF.js — your file never reaches a server.

Steps:

  1. Open pdf4.dev/tools/pdf-to-text
  2. Drop your PDF onto the upload area or click to browse
  3. Click Extract text
  4. Copy the text directly from the output panel or download as a .txt file

No account is required. The free tier allows 3 extractions per week. For unlimited use, create a free PDF4.dev account.

Text extraction vs OCR: which do you need?

Text extraction reads the text layer already embedded in a PDF. This layer exists in PDFs created by word processors (Word, Google Docs), design tools (InDesign, Figma), or programmatic generators (Playwright, PDF4.dev). Extraction is instantaneous and accurate.

OCR (Optical Character Recognition) is needed when the PDF contains scanned images of text with no underlying text layer. OCR uses machine learning to interpret pixel patterns as characters. It is slower (1-5 seconds per page) and less accurate than extraction.

| Property | Text extraction | OCR |
| --- | --- | --- |
| Input type | Digitally created PDFs | Scanned PDFs, images |
| Speed | Under 100 ms per page | 1-5 seconds per page |
| Accuracy | Near 100% for standard PDFs | 85-98% depending on scan quality |
| Requires ML model | No | Yes |
| Preserves layout | Approximate | Approximate |
| Free tools | PDF.js, pdfminer, pypdf | Tesseract, Google Drive, OCR.space |

To check if your PDF has a text layer: try selecting and copying text in any PDF viewer. If text highlights and copies correctly, use text extraction. If nothing selects, use OCR.
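The same check can be scripted: run any extractor from the sections below on the file and measure how much text comes back. A minimal heuristic sketch (the 10-character-per-page threshold is an assumption; tune it for your documents):

```python
def needs_ocr(page_texts: list[str], min_chars_per_page: int = 10) -> bool:
    """Heuristic: if extraction returns almost no text per page,
    the PDF is likely image-only and needs OCR instead."""
    if not page_texts:
        return True
    total = sum(len(t.strip()) for t in page_texts)
    return total / len(page_texts) < min_chars_per_page
```

Feed it the per-page strings from any of the extraction helpers below; a `True` result means the document should go through an OCR pipeline instead.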

How to extract text from a PDF in JavaScript (browser and Node.js)

PDF.js is the standard library for PDF text extraction in JavaScript. It runs in browsers and in Node.js, and it is the engine behind PDF4.dev's free tool.

Method 1: pdf-parse (Node.js, one-call API)

pdf-parse is a thin wrapper around PDF.js that returns the full text in one async call. It has over 1 million weekly downloads on npm.

import pdfParse from "pdf-parse";
import fs from "fs";
 
async function extractText(filePath: string): Promise<string> {
  const buffer = fs.readFileSync(filePath);
  const result = await pdfParse(buffer);
  return result.text;
}
 
const text = await extractText("./document.pdf");
console.log(text);

The result object includes result.text (full extracted text), result.numpages (page count), and result.info (PDF metadata).

Method 2: pdfjs-dist (browser or Node.js, page-by-page control)

For finer control — extracting specific pages, preserving text positions, or running in a browser — use pdfjs-dist directly.

import * as pdfjsLib from "pdfjs-dist";
 
// Node.js: disable worker
pdfjsLib.GlobalWorkerOptions.workerSrc = "";
 
async function extractTextFromPages(
  pdfBuffer: ArrayBuffer,
  pageNumbers?: number[]
): Promise<string> {
  const pdf = await pdfjsLib.getDocument({ data: pdfBuffer }).promise;
  const pages = pageNumbers ?? Array.from({ length: pdf.numPages }, (_, i) => i + 1);
 
  const texts: string[] = [];
  for (const pageNum of pages) {
    const page = await pdf.getPage(pageNum);
    const content = await page.getTextContent();
    const pageText = content.items
      .map((item) => ("str" in item ? item.str : ""))
      .join(" ");
    texts.push(pageText);
  }
 
  return texts.join("\n\n");
}

getTextContent() returns items with str (the text string), transform (the 6-element affine transformation matrix giving position and rotation), and width and height in PDF units. Use transform[4] and transform[5] as x/y coordinates for layout reconstruction.

Browser example: extract text from a user-uploaded file

import * as pdfjsLib from "pdfjs-dist";
pdfjsLib.GlobalWorkerOptions.workerSrc = `//cdnjs.cloudflare.com/ajax/libs/pdf.js/3.11.174/pdf.worker.min.js`;
 
async function extractFromFile(file: File): Promise<string> {
  const arrayBuffer = await file.arrayBuffer();
  const pdf = await pdfjsLib.getDocument({ data: arrayBuffer }).promise;
 
  let fullText = "";
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const content = await page.getTextContent();
    const pageText = content.items
      .map((item) => ("str" in item ? item.str : ""))
      .join(" ");
    fullText += pageText + "\n\n";
  }
  return fullText;
}
 
// Usage with a file input
document.getElementById("fileInput")?.addEventListener("change", async (e) => {
  const file = (e.target as HTMLInputElement).files?.[0];
  if (file) {
    const text = await extractFromFile(file);
    console.log(text);
  }
});

How to extract text from a PDF in Python

Three libraries dominate Python PDF text extraction in 2026. Each has different accuracy tradeoffs for complex PDFs.

pdfminer.six (best accuracy for complex fonts)

pdfminer.six is the most accurate Python library for PDFs with custom fonts, non-standard encodings, or unusual glyph mappings. It is slower than pypdf but handles edge cases better.

from pdfminer.high_level import extract_text, extract_pages
from pdfminer.layout import LTTextContainer
 
def extract_full_text(pdf_path: str) -> str:
    return extract_text(pdf_path)
 
def extract_text_by_page(pdf_path: str) -> list[str]:
    pages = []
    for page_layout in extract_pages(pdf_path):
        page_text = ""
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                page_text += element.get_text()
        pages.append(page_text)
    return pages
 
# Usage
text = extract_full_text("document.pdf")
print(text)
 
pages = extract_text_by_page("document.pdf")
for i, page in enumerate(pages, 1):
    print(f"--- Page {i} ---")
    print(page)

Install with: pip install pdfminer.six

pypdf (simple, fast, standard PDFs)

pypdf (formerly PyPDF2) is a pure-Python library with a simpler API. It is faster than pdfminer for standard PDFs but less accurate with complex font encodings.

import pypdf
 
def extract_text_pypdf(pdf_path: str) -> str:
    reader = pypdf.PdfReader(pdf_path)
    pages_text = []
    for page in reader.pages:
        pages_text.append(page.extract_text())
    return "\n\n".join(pages_text)
 
def extract_specific_pages(pdf_path: str, page_numbers: list[int]) -> str:
    """page_numbers is 0-indexed."""
    reader = pypdf.PdfReader(pdf_path)
    texts = []
    for page_num in page_numbers:
        texts.append(reader.pages[page_num].extract_text())
    return "\n\n".join(texts)
 
# Extract pages 1-3 (0-indexed: 0, 1, 2)
text = extract_specific_pages("document.pdf", [0, 1, 2])
print(text)

Install with: pip install pypdf

PyMuPDF (fastest, good accuracy)

PyMuPDF (imported as fitz) is the fastest option and handles a wide range of PDFs, including those with complex layouts.

import fitz  # PyMuPDF
 
def extract_text_pymupdf(pdf_path: str) -> str:
    doc = fitz.open(pdf_path)
    full_text = ""
    for page in doc:
        full_text += page.get_text() + "\n\n"
    doc.close()
    return full_text
 
def extract_with_blocks(pdf_path: str) -> list[dict]:
    """Returns structured blocks with position and text."""
    doc = fitz.open(pdf_path)
    blocks = []
    for page_num, page in enumerate(doc):
        for block in page.get_text("blocks"):
            x0, y0, x1, y1, text, block_no, block_type = block
            blocks.append({
                "page": page_num + 1,
                "text": text.strip(),
                "bbox": (x0, y0, x1, y1),
            })
    doc.close()
    return [b for b in blocks if b["text"]]
 
text = extract_text_pymupdf("document.pdf")
print(text)

Install with: pip install pymupdf

Comparison: Python PDF text extraction libraries

| Library | Speed | Accuracy (complex) | API simplicity | Active development |
| --- | --- | --- | --- | --- |
| pdfminer.six | Slow | Excellent | Medium | Yes |
| pypdf | Fast | Good | Simple | Yes |
| PyMuPDF (fitz) | Very fast | Very good | Simple | Yes |
| pdfplumber | Slow | Excellent (tables) | Medium | Yes |

For tables specifically, pdfplumber (which wraps pdfminer) offers the best structured table extraction. Install with pip install pdfplumber.
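A minimal pdfplumber sketch (assumes a file named document.pdf with at least one table on its first page):

```python
import pdfplumber

# Each extracted table is a list of rows; each row is a list of cell strings (or None).
with pdfplumber.open("document.pdf") as pdf:
    for table in pdf.pages[0].extract_tables():
        for row in table:
            print(row)
```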

Handling common text extraction problems

Problem: text is concatenated without spaces

PDF text items do not always include space characters between words, because the format positions each glyph by coordinate rather than relying on whitespace. Reconstruct spaces by checking whether the horizontal gap between the end of one item (its transform[4] x origin plus its width) and the start of the next exceeds a small threshold.

With pdfjs-dist, use includeMarkedContent: true and check item.hasEOL to detect line breaks. For most PDFs, joining with a space character rather than empty string gives better results.
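As a pure-Python sketch of that gap check — the item fields mirror what PDF.js gives you (`transform[4]` as `x`, plus `width`), and the 1.0-unit threshold is an assumption to tune:

```python
def join_with_spaces(items: list[dict], gap_threshold: float = 1.0) -> str:
    """Join text items on one line, inserting a space whenever the gap
    between one item's right edge and the next item's left edge exceeds
    gap_threshold. Each item: {"str": str, "x": float, "width": float}."""
    out = []
    prev_end = None
    for it in items:
        # Insert a space only when the items are visibly separated
        if prev_end is not None and it["x"] - prev_end > gap_threshold:
            out.append(" ")
        out.append(it["str"])
        prev_end = it["x"] + it["width"]
    return "".join(out)
```

Items split mid-word (tiny or zero gap) are joined directly, while word boundaries (larger gap) get a space.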

Problem: text is out of reading order

PDFs from design tools may store text objects in creation order, not reading order. To reconstruct reading order:

  1. Sort text items by vertical position (transform[5], descending for top-to-bottom)
  2. Group items within the same horizontal band (within ±2 units of y)
  3. Sort each band by horizontal position (transform[4], ascending)

pdfminer.six applies this algorithm automatically via LAParams.
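The three steps above can be sketched in pure Python, assuming each item carries its transform[4]/transform[5] origin as `x`/`y` (PDF y grows upward, so top-to-bottom reading means descending y):

```python
def reading_order_text(items: list[dict], band_tolerance: float = 2.0) -> str:
    """Reorder PDF text items into reading order.

    Each item is a dict {"str": str, "x": float, "y": float}, where x/y
    come from transform[4]/transform[5].
    """
    # Step 1: sort by vertical position, top of the page first
    ordered = sorted(items, key=lambda it: -it["y"])

    # Step 2: group items into horizontal bands within the y tolerance
    bands: list[list[dict]] = []
    for it in ordered:
        if bands and abs(bands[-1][0]["y"] - it["y"]) <= band_tolerance:
            bands[-1].append(it)
        else:
            bands.append([it])

    # Step 3: sort each band left to right, then join bands as lines
    lines = []
    for band in bands:
        band.sort(key=lambda it: it["x"])
        lines.append(" ".join(it["str"] for it in band))
    return "\n".join(lines)
```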

Problem: no text extracted (scanned PDF)

If getTextContent() returns zero items, the PDF contains raster images. You need OCR. Options:

  • Tesseract.js (browser/Node.js): free, open source, 100+ languages
  • Google Drive: upload the PDF, right-click → Open with → Google Docs. Google's OCR runs automatically
  • OCR.space API: free tier up to 25,000 requests/month

For programmatic OCR in Python, pytesseract wraps the Tesseract binary. Render each PDF page to an image first (PyMuPDF's page.get_pixmap()), then run OCR on the image.
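A sketch of that pipeline (assumes `pip install pymupdf pytesseract pillow` plus a system Tesseract install; the `dpi` value is a starting point, not a recommendation):

```python
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def ocr_pdf(pdf_path: str, dpi: int = 200) -> str:
    """Render each page to an image with PyMuPDF, then OCR it with Tesseract."""
    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        pix = page.get_pixmap(dpi=dpi)  # rasterize the page
        img = Image.open(io.BytesIO(pix.tobytes("png")))
        pages.append(pytesseract.image_to_string(img))
    doc.close()
    return "\n\n".join(pages)
```

Higher DPI improves OCR accuracy on small text at the cost of render time and memory.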

Problem: garbled characters or question marks

This happens with PDFs that use custom (Type 3) fonts or proprietary encodings where the glyph-to-Unicode mapping is not embedded. pdfminer.six handles more encoding edge cases than pypdf. For PDFs produced by older software (pre-2000), PyMuPDF often has better decoding logic.

Using PDF4.dev to generate text-searchable PDFs

When you generate PDFs with PDF4.dev, the output is always a native, text-searchable PDF. Playwright renders your HTML to a PDF with a full text layer embedded — every word is extractable without OCR.

This matters for document workflows where PDFs are stored and later searched, processed, or indexed. Scanned document workflows and image-to-PDF pipelines often skip the text layer; programmatic generation never does.

The PDF to Text tool at pdf4.dev/tools/pdf-to-text uses the same PDF.js extraction engine described in this guide, so you can verify the text layer of any generated PDF immediately after creation.

For bulk text extraction from many PDFs — for example, indexing a document archive — combine PDF4.dev's merge PDF tool (to consolidate files) with a Node.js or Python script using the extraction code above.
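A sketch of one way to structure that batch job. The extractor is injected as a callable so any library from this guide can be plugged in; the function name and error format here are illustrative, not a PDF4.dev API:

```python
from pathlib import Path
from typing import Callable

def extract_archive(root: str, extract: Callable[[Path], str]) -> dict[str, str]:
    """Walk a directory tree, run `extract` on every .pdf file, and
    return a mapping of path (relative to root) -> extracted text."""
    root_path = Path(root)
    results: dict[str, str] = {}
    for pdf_path in sorted(root_path.rglob("*.pdf")):
        try:
            text = extract(pdf_path)
        except Exception as exc:  # corrupt or encrypted PDF: record and keep going
            text = f"<error: {exc}>"
        results[str(pdf_path.relative_to(root_path))] = text
    return results
```

Injecting the extractor also makes the walker easy to unit-test with a stub before wiring in pypdf, PyMuPDF, or pdfminer.six.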

When to use a PDF to Word converter instead

Plain text extraction loses all formatting: headings become flat text, tables become a stream of values, bold and italic are gone. If you need to edit the PDF content in a word processor, use PDF to Word conversion instead.

| Goal | Use |
| --- | --- |
| Search text content of a PDF | Text extraction (PDF.js, pdfminer) |
| Feed PDF content to an LLM or search index | Text extraction |
| Edit PDF content in Word or Google Docs | PDF to Word converter |
| Extract data from PDF tables | pdfplumber (Python) |
| Extract text from a scanned document | OCR (Tesseract, Google Drive) |

Summary

PDF text extraction reads the text layer embedded in digitally created PDFs. For browser and Node.js, use PDF.js via the pdfjs-dist package or the pdf-parse wrapper. For Python, pdfminer.six is the most accurate, pypdf is the simplest, and PyMuPDF is the fastest.

Scanned PDFs require OCR — text extraction returns nothing from image-only documents.

Use PDF4.dev's free PDF to Text tool for quick one-off extractions, and the code examples in this guide for automated, programmatic workflows.

Free tools mentioned:

PDF to Text · PDF to PNG · Compress PDF

Start generating PDFs

Build PDF templates with a visual editor. Render them via API from any language in ~300ms.