What does it mean to repair a corrupted PDF?

Repairing a corrupted PDF means re-parsing the file to rebuild its internal structure, specifically the cross-reference table (xref) and the page tree. The repair process reads every surviving PDF object, discards any that are malformed, and writes a clean file with a new xref table that correctly maps each object to its byte offset.

What causes a PDF to become corrupted?

The most common causes are incomplete downloads (browser or network interruption), interrupted saves (application crash or power loss during write), email attachment truncation (some mail servers strip bytes past a size limit), disk errors (bad sectors on HDD or SSD), and incompatible PDF producers (software that writes non-conformant PDF structures).

Can every corrupted PDF be repaired?

No. If the file is severely truncated (missing more than 50% of its bytes), or if the content streams are overwritten with zeroes, the data is gone and no tool can recover it. Repair works when the structural metadata (xref, trailer) is damaged but the page content objects are still intact in the file.

How do I repair a PDF without Adobe Acrobat?

Use the free PDF4.dev repair PDF tool at /tools/repair-pdf. It runs entirely in your browser using pdf-lib, so the file never leaves your device. For programmatic repair, pdf-lib (Node.js), PyMuPDF (Python), Ghostscript, and qpdf all rebuild corrupted PDFs from the command line or a script.

Does repairing a PDF change its content?

Repairing should not change visible content. The process rebuilds the internal file structure (cross-reference table, page tree, object offsets) without modifying the actual page content streams. However, annotations, bookmarks, or form fields that reference corrupted objects may be lost during repair.

How do I repair a PDF in Python?

Use PyMuPDF: pymupdf.open("damaged.pdf"). There is no separate repair flag to pass; PyMuPDF's MuPDF engine detects a damaged xref on open and rebuilds the xref table and page tree automatically. Save with garbage=4, deflate=True to clean up orphaned objects. Code examples are in this article.

How do I repair a PDF on the command line?

Use qpdf (qpdf damaged.pdf repaired.pdf, which auto-recovers xref errors) or Ghostscript (gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=repaired.pdf damaged.pdf, which re-renders the entire file from scratch). Both are open source and available on Linux, macOS, and Windows.

Why does my PDF show blank pages?

Blank pages usually mean the page content stream references are broken but the page objects themselves exist. The viewer finds the page entry in the page tree but cannot locate the content stream it points to. Repairing the file rebuilds those references. If the content stream bytes are actually missing from the file, the pages will remain blank after repair.

How to repair a corrupted PDF

Repair a corrupted PDF online for free, or automate recovery with pdf-lib, PyMuPDF, Ghostscript, and qpdf. Covers xref rebuilding and common corruption causes.

benoitded | April 19, 2026 | 12 min read

Repairing a corrupted PDF means rebuilding the file's internal structure, primarily the cross-reference table (xref) and the page tree, so that PDF viewers can locate and render each page again. Use the PDF4.dev repair PDF tool for a free browser-based repair in seconds, or the code examples below for batch recovery in Node.js, Python, and the command line.

What causes PDF corruption?

PDF files become corrupted when bytes are lost or written out of order. The five most common causes are:

Cause	What happens
Incomplete download	Browser or network interruption cuts the file short, leaving a truncated trailer and missing xref entries
Interrupted save	Application crash or power loss during a write leaves the xref table pointing at old byte offsets that no longer match the updated objects
Email attachment truncation	Some mail servers or clients impose attachment size limits and silently strip bytes past the cap
Disk errors	Bad sectors on an HDD, or flash cell degradation on an SSD, corrupt arbitrary byte ranges in the file
Incompatible PDF producers	Software that writes non-conformant PDF structures (incorrect stream lengths, missing `endobj` markers, invalid object numbers) produces files that work in one viewer but fail in others

A 2023 study by the PDF Association found that 1 in 6 PDF files in enterprise document management systems contains at least one structural violation of ISO 32000-2. Most of these go unnoticed because viewers apply their own error recovery, but some trigger visible failures.

Signs of a corrupted PDF

A corrupted PDF typically shows one or more of these symptoms:

The viewer displays "This file is damaged and could not be repaired" or "There was an error opening this document"
Some or all pages render as blank white rectangles
Text appears as garbled characters or random symbols instead of readable words
Embedded images are missing or display as grey placeholders
The file opens but reports fewer pages than expected
The file refuses to open at all, with a generic I/O or parsing error

If the file opens but looks wrong (wrong fonts, shifted layout), that is usually a rendering compatibility issue rather than corruption. True corruption involves broken internal references, not just visual differences between viewers.
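
One cheap way to tell structural damage from a rendering issue is to look for the markers a complete PDF normally carries: a `%PDF-` header near the start, and `startxref` plus `%%EOF` near the end. The sketch below is a rough triage aid, not a validator. The function name `quick_pdf_check` and the 1024-byte windows are illustrative assumptions, and a file can pass all three checks and still be damaged elsewhere.

```python
def quick_pdf_check(data: bytes) -> dict:
    """Cheap structural triage on raw bytes; does not parse any objects."""
    head = data[:1024]   # the header is expected near the start of the file
    tail = data[-1024:]  # startxref and %%EOF are expected near the end
    return {
        "has_header": b"%PDF-" in head,
        "has_startxref": b"startxref" in tail,
        "has_eof_marker": b"%%EOF" in tail,
    }


if __name__ == "__main__":
    with open("damaged.pdf", "rb") as fh:
        print(quick_pdf_check(fh.read()))
```

A missing `%%EOF` or `startxref` usually points to truncation (an incomplete download or interrupted save), which is the case where an xref rebuild is most likely to help.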

How to repair a PDF online (free, no upload)

The PDF4.dev repair PDF tool repairs corrupted PDFs entirely in your browser using pdf-lib. The file never leaves your device, which matters for sensitive documents like contracts, medical records, or financial statements.

Open pdf4.dev/tools/repair-pdf
Drag your corrupted PDF onto the upload zone (or click to pick a file)
Click Repair PDF
Download the repaired result

The tool works by loading the damaged file with pdf-lib's lenient parser, which skips over malformed objects and rebuilds the xref table from whatever valid objects it finds. It then saves a clean copy with a new xref, new trailer, and correct byte offsets for every surviving object.

How PDF repair works under the hood

A PDF file has three structural layers that repair tools target:

Cross-reference table (xref). The xref table is an index at the end of the file that maps every object number to its byte offset. When the xref is corrupted, the viewer cannot find any objects and reports the file as damaged. Repair tools rebuild the xref by scanning the entire file byte-by-byte for obj markers and recording each object's actual position.
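
To make the scanning idea concrete, here is a minimal sketch of an xref rebuild: find every `N G obj` marker, record its byte offset, and format the 20-byte entries a classic xref table uses. It is a simplified illustration and not how any particular tool implements recovery. The helper names are hypothetical, and the regex can produce false positives if the same byte pattern happens to occur inside binary stream data.

```python
import re

# "<object number> <generation> obj", not preceded by another digit.
OBJ_MARKER = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")


def scan_object_offsets(data: bytes) -> dict[tuple[int, int], int]:
    """Map (object number, generation) to the byte offset of its marker."""
    offsets = {}
    for m in OBJ_MARKER.finditer(data):
        key = (int(m.group(1)), int(m.group(2)))
        # A later definition overrides an earlier one, matching how
        # incremental updates supersede older copies of an object.
        offsets[key] = m.start()
    return offsets


def xref_entry(offset: int, generation: int) -> bytes:
    """Format one in-use xref entry: 10-digit offset, 5-digit generation, 'n'."""
    return b"%010d %05d n \n" % (offset, generation)


if __name__ == "__main__":
    with open("damaged.pdf", "rb") as fh:
        found = scan_object_offsets(fh.read())
    for (num, gen), off in sorted(found.items()):
        print(num, gen, xref_entry(off, gen))
```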

Page tree. The page tree is a hierarchical structure that lists every page and its associated content stream, fonts, and images. If a page tree node references an object that no longer exists (because it was truncated or overwritten), the page appears blank. Repair tools rebuild the page tree from whatever page objects they find.

Stream lengths. Each content stream in a PDF has a declared byte length in its dictionary. If the actual bytes do not match the declared length (because of truncation or partial overwrite), the stream is unreadable. Some repair tools recalculate stream lengths from the actual data; others skip streams whose lengths do not match.
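
The length check can be sketched in a few lines: for each `stream` keyword, read the declared `/Length` from the preceding dictionary and confirm that `endstream` sits exactly where that length says it should. This is an illustrative sketch under simplifying assumptions. It only handles a direct integer `/Length` (indirect references such as `/Length 12 0 R` are skipped), the function name is hypothetical, and binary stream data that happens to contain the keyword can confuse it.

```python
import re

# "stream" keyword plus its end-of-line, but not the tail of "endstream".
STREAM_START = re.compile(rb"(?<!end)stream\r?\n")
# Direct integer /Length only; an indirect "/Length 12 0 R" does not match.
LENGTH_KEY = re.compile(rb"/Length\s+(\d+)(?!\d)(?!\s+\d+\s+R)")


def check_stream_lengths(data: bytes) -> list[tuple[int, int, bool]]:
    """Return (data_offset, declared_length, ok) per stream with a direct /Length."""
    results = []
    for m in STREAM_START.finditer(data):
        # Look back to the enclosing object's "obj" keyword for its dictionary.
        obj_start = data.rfind(b"obj", 0, m.start())
        header = data[max(obj_start, 0):m.start()]
        length = LENGTH_KEY.search(header)
        if length is None:
            continue
        declared = int(length.group(1))
        end = m.end() + declared
        # An optional end-of-line may precede "endstream".
        tail = data[end:end + 11]
        ok = tail.lstrip(b"\r\n").startswith(b"endstream")
        results.append((m.end(), declared, ok))
    return results
```

A stream that reports `ok=False` is one a strict parser would reject; a tool that recalculates lengths would instead search forward for the next `endstream`.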

How to repair a PDF with pdf-lib (Node.js)

pdf-lib is a pure JavaScript library with no native dependencies. Its PDFDocument.load() method accepts an { ignoreEncryption: true } option and applies lenient parsing by default, which recovers from many structural errors.

import { PDFDocument } from "pdf-lib";
import { readFileSync, writeFileSync } from "fs";
 
async function repairPdf(inputPath: string, outputPath: string) {
  const bytes = readFileSync(inputPath);
 
  // pdf-lib's parser rebuilds the xref table from raw object markers
  const doc = await PDFDocument.load(bytes, {
    ignoreEncryption: true,
    updateMetadata: false,
  });
 
  const repairedBytes = await doc.save();
  writeFileSync(outputPath, repairedBytes);
 
  const pages = doc.getPageCount();
  console.log(`Repaired ${inputPath} → ${outputPath} (${pages} pages)`);
}
 
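// load() throws if the file is too damaged to parse, so report the failure
repairPdf("damaged.pdf", "repaired.pdf").catch((err) => {
  console.error(`Repair failed: ${err.message}`);
  process.exit(1);
});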

Install pdf-lib:

npm install pdf-lib

The load() call does the repair work. pdf-lib scans the raw bytes for PDF object markers (N G obj ... endobj, where N is the object number and G the generation number), parses each one independently, and builds a new in-memory xref table from the results. The save() call writes a clean file with correct byte offsets and a valid trailer.

One limitation: pdf-lib cannot recover content streams whose bytes are physically missing or overwritten with zeroes. If the page content is gone, the page will exist in the repaired file but render as blank.

How to repair a PDF with PyMuPDF (Python)

PyMuPDF (pymupdf) is built on the MuPDF C library, which has a dedicated repair mode that rebuilds the xref table and page tree in a single pass.

pip install pymupdf

import pymupdf  # pip install pymupdf
 
def repair_pdf(input_path: str, output_path: str) -> None:
    # No flag is needed: MuPDF detects a damaged xref on open and rebuilds it
    doc = pymupdf.open(input_path)
 
    if doc.needs_pass:
        print("File is encrypted — cannot repair without password")
        doc.close()
        return
 
    # Save with cleanup options to remove orphaned objects
    doc.save(
        output_path,
        garbage=4,      # remove unreferenced objects
        deflate=True,   # recompress streams
        clean=True,     # sanitize content streams
    )
    print(f"Repaired → {output_path} ({doc.page_count} pages)")
    doc.close()
 
repair_pdf("damaged.pdf", "repaired.pdf")

PyMuPDF automatically detects xref corruption and switches to repair mode when pymupdf.open() encounters a broken trailer or invalid xref entries. The garbage=4 option on save removes all unreferenced objects, which cleans up orphans left behind by the corruption. The clean=True option sanitizes content streams by re-parsing and rewriting their operator sequences.

For batch repair of an entire directory:

import pymupdf
from pathlib import Path
 
input_dir = Path("damaged-files")
output_dir = Path("repaired-files")
output_dir.mkdir(exist_ok=True)
 
for pdf in input_dir.glob("*.pdf"):
    try:
        doc = pymupdf.open(str(pdf))
        doc.save(str(output_dir / pdf.name), garbage=4, deflate=True, clean=True)
        print(f"OK: {pdf.name} ({doc.page_count} pages)")
        doc.close()
    except Exception as e:
        print(f"FAIL: {pdf.name} — {e}")

How to repair a PDF on the command line

Two command-line tools handle PDF repair reliably: qpdf and Ghostscript. Both are open source, available on Linux, macOS, and Windows, and scriptable.

qpdf

qpdf is a structural PDF transformation tool. It automatically detects and recovers from xref table errors when processing any file.

# Install on macOS
brew install qpdf
 
# Install on Ubuntu/Debian
sudo apt install qpdf

# Basic repair: qpdf rebuilds the xref table automatically
qpdf damaged.pdf repaired.pdf

qpdf reads the file, detects broken xref entries, reconstructs the object map by scanning for obj markers, and writes a clean output. If the file has non-fatal errors that qpdf can recover from, it prints warnings to stderr but still produces output.

To repair a file in place, use the --replace-input flag. Add --warning-exit-0 when a script should treat a recovered file as a success: by default qpdf returns a non-zero exit code (3) when it emits warnings, even though it still writes the output:

qpdf --replace-input --warning-exit-0 damaged.pdf
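
When qpdf is called from a script, the exit code indicates whether the file was clean, recovered with warnings, or unrecoverable. The sketch below maps the exit codes documented in the qpdf manual (0, 2, 3) to labels; the function names and label strings are illustrative assumptions, and `run_qpdf` assumes a `qpdf` binary is on the PATH.

```python
import subprocess

# Exit codes as documented by qpdf: 0 = no problems, 2 = errors,
# 3 = warnings were issued but output was still produced.
QPDF_EXIT_MEANING = {0: "clean", 2: "error", 3: "warnings"}


def classify_qpdf_exit(code: int) -> str:
    """Translate a qpdf exit code into a short label."""
    return QPDF_EXIT_MEANING.get(code, "unknown")


def run_qpdf(src: str, dst: str) -> tuple[str, str]:
    """Run qpdf and return (label, stderr text). Requires qpdf to be installed."""
    proc = subprocess.run(["qpdf", src, dst], capture_output=True, text=True)
    return classify_qpdf_exit(proc.returncode), proc.stderr


if __name__ == "__main__":
    label, warnings = run_qpdf("damaged.pdf", "repaired.pdf")
    print(label)
    print(warnings)
```

A "warnings" result is the typical outcome for a successfully repaired file, so a batch script would usually accept it and log the stderr text for review.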

Additional repair-related flags:

Flag	Effect
`--replace-input`	Overwrite the input file with the repaired version
`--warning-exit-0`	Return exit code 0 even if warnings are emitted
`--linearize`	Optimize the output for web viewing (rewrites the entire structure)
`--object-streams=disable`	Unpack object streams into individual objects (helps viewers that cannot parse object streams)

For the most thorough repair, combine linearization with object stream unpacking:

qpdf --linearize --object-streams=disable damaged.pdf repaired.pdf

This rewrites every object individually, rebuilds the xref, and adds a linearization dictionary for fast web loading.

Ghostscript

Ghostscript takes a different approach: it re-renders the entire PDF through its interpreter and pdfwrite device, producing a completely new file from scratch.

# Install on macOS
brew install ghostscript
 
# Install on Ubuntu/Debian
sudo apt install ghostscript

gs -dNOPAUSE -dBATCH -dSAFER \
  -sDEVICE=pdfwrite \
  -sOutputFile=repaired.pdf \
  damaged.pdf

Because Ghostscript re-renders rather than restructures, it can recover from a wider range of corruption types than tools that only rebuild the xref. If a content stream is partially damaged, Ghostscript renders whatever it can parse and skips the rest, producing a page with partial content rather than a blank page.

The trade-off is speed and fidelity. Ghostscript is slower than qpdf because it interprets every PDF operator. It may also re-encode images, change font subsets, or alter metadata, so the output is not byte-identical to the original even for undamaged pages. For files where fidelity matters more than recovery depth, use qpdf first and fall back to Ghostscript only if qpdf fails.

Repair methods compared

Tool	Repair method	Handles truncated files	Handles broken xref	Handles damaged streams	Speed	Output fidelity
PDF4.dev repair tool	Re-parse + save	Partial	Yes	No	Fast	High
pdf-lib (Node.js)	Re-parse + save	Partial	Yes	No	Fast	High
PyMuPDF (Python)	MuPDF repair engine	Partial	Yes	Partial	Fast	High
Ghostscript	Full re-render	Yes (partial content)	Yes	Yes (partial)	Slow	Medium
qpdf	Structural rebuild	Partial	Yes	No	Very fast	Very high

"Partial" for truncated files means the tool recovers pages whose objects are present in the file but cannot reconstruct pages whose bytes are missing entirely.

For most cases, qpdf is the best first attempt because it is fast, preserves the original byte content of undamaged objects, and handles the most common corruption type (broken xref). If qpdf produces a file that still has blank or garbled pages, run Ghostscript as a second pass: it re-renders each page independently and can recover partial content from damaged streams.
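
The qpdf-first, Ghostscript-second strategy can be sketched as a small wrapper. This is a minimal illustration under stated assumptions: it falls back only when qpdf exits with an error, because blank or garbled pages cannot be detected from an exit code and still need a visual check. The function name and the injectable `run` parameter are hypothetical, and on Windows the Ghostscript binary is typically named `gswin64c` rather than `gs`.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code (raises if the binary is missing)."""
    return subprocess.run(list(cmd), capture_output=True).returncode


def repair_with_fallback(src: str, dst: str, run: Runner = _run) -> str:
    """Try qpdf first; fall back to Ghostscript if qpdf reports an error."""
    # qpdf: 0 = clean, 3 = recovered with warnings (output still written).
    if run(["qpdf", src, dst]) in (0, 3):
        return "qpdf"
    gs_cmd = [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=pdfwrite", f"-sOutputFile={dst}", src,
    ]
    if run(gs_cmd) == 0:
        return "ghostscript"
    return "failed"


if __name__ == "__main__":
    print(repair_with_fallback("damaged.pdf", "repaired.pdf"))
```

Passing the runner as a parameter keeps the decision logic separate from process execution, which makes it easy to dry-run the wrapper over an archive before touching any files.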

Common use cases

Recover a PDF from a failed download. Interrupted browser downloads are the most common source of corrupted PDFs. The file is truncated at the point where the connection dropped. If the truncation happened past the first few pages, qpdf or pdf-lib can usually recover all pages whose objects were fully downloaded.

Fix a PDF after a disk error. Bad sectors corrupt random byte ranges. If the damaged bytes fall in the xref table or trailer (the last few kilobytes of the file), repair is straightforward: rebuild the xref. If the damaged bytes fall in a content stream, that specific page will have missing content but other pages will be intact.

Repair PDFs from legacy software. Some older PDF generators (pre-2010 scanner drivers, early LibreOffice versions, certain ERP export modules) produce files with structural violations that modern viewers reject. Re-saving through pdf-lib, PyMuPDF, or Ghostscript normalizes the structure to conform to the current spec.

Batch-repair a document archive. Organizations migrating document archives sometimes find that 2 to 5 percent of stored PDFs are unreadable due to accumulated bit rot or inconsistent backup processes. A qpdf batch script can scan and repair thousands of files per minute.

Recover a PDF that "won't open" after email. Some email clients and web-based email services re-encode attachments (base64 round-trip, line wrapping changes) in ways that corrupt the byte stream. Re-downloading the attachment or asking the sender to share via a file transfer service avoids the issue; repairing with qpdf fixes files already received.

What cannot be repaired

Not every corrupted PDF can be recovered. The following types of damage are permanent:

Severe truncation. If the file is missing more than half its bytes (for example, a 10 MB PDF that is only 2 MB on disk), most page objects are simply not in the file. No repair tool can reconstruct data that was never written.

Encryption without password. If the PDF is encrypted and you do not have the password, repair tools cannot decrypt the content streams to re-render them. The structural repair may succeed (the xref is rebuilt), but the content remains encrypted and unreadable. Remove the password first if you have it, then repair.

Overwritten bytes. If the file was partially overwritten with other data (for example, a disk recovery tool placed unrelated bytes into the middle of the file), the overwritten content is destroyed. Pages whose streams fall in the overwritten region will be blank or garbled after repair.
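
One rough way to spot regions overwritten with zeroes before attempting a repair is to look for long runs of NUL bytes. The sketch below is only a heuristic: the function name and the 4096-byte default threshold are assumptions, and uncompressed image data can legitimately contain long zero runs, so a hit is a hint rather than proof of damage.

```python
import re


def find_zero_runs(data: bytes, min_len: int = 4096) -> list[tuple[int, int]]:
    """Return (offset, length) for each run of at least min_len NUL bytes."""
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]


if __name__ == "__main__":
    with open("damaged.pdf", "rb") as fh:
        for offset, length in find_zero_runs(fh.read()):
            print(f"{length} zero bytes at offset {offset}")
```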

XFA form data loss. XFA forms (used by some government and financial institutions) store form data in an XML stream separate from the page content. If the XFA stream is corrupted, the form data is lost even if the page backgrounds render correctly after repair.

Summary

PDF corruption is caused by incomplete downloads, interrupted saves, disk errors, email truncation, or non-conformant PDF producers.
Repair works by rebuilding the cross-reference table and page tree from surviving objects in the file.
For quick one-off repairs, use the PDF4.dev repair PDF tool, which runs in the browser with no upload.
For Node.js automation, pdf-lib with PDFDocument.load() re-parses and re-saves the file in one call.
For Python, PyMuPDF with its built-in MuPDF repair engine handles xref rebuilding and content stream sanitization.
For maximum recovery depth on the command line, try qpdf first (fast, high fidelity), then Ghostscript as a fallback (slower, re-renders everything).
Severely truncated files, encrypted files without a password, and overwritten byte ranges cannot be repaired by any tool.