Repairing a corrupted PDF means rebuilding the file's internal structure, primarily the cross-reference table (xref) and the page tree, so that PDF viewers can locate and render each page again. Use the PDF4.dev repair PDF tool for a free browser-based repair in seconds, or the code examples below for batch recovery in Node.js, Python, and the command line.
What causes PDF corruption?
PDF files become corrupted when bytes are lost or written out of order. The five most common causes are:
| Cause | What happens |
|---|---|
| Incomplete download | Browser or network interruption cuts the file short, leaving a truncated trailer and missing xref entries |
| Interrupted save | Application crash or power loss during a write leaves the xref table pointing at old byte offsets that no longer match the updated objects |
| Email attachment truncation | Some mail servers or clients impose attachment size limits and silently strip bytes past the cap |
| Disk errors | Bad sectors on an HDD, or flash cell degradation on an SSD, corrupt arbitrary byte ranges in the file |
| Incompatible PDF producers | Software that writes non-conformant PDF structures (incorrect stream lengths, missing endobj markers, invalid object numbers) produces files that work in one viewer but fail in others |
A 2023 study by the PDF Association found that 1 in 6 PDF files in enterprise document management systems contains at least one structural violation of ISO 32000-2. Most of these go unnoticed because viewers apply their own error recovery, but some trigger visible failures.
Signs of a corrupted PDF
A corrupted PDF typically shows one or more of these symptoms:
- The viewer displays "This file is damaged and could not be repaired" or "There was an error opening this document"
- Some or all pages render as blank white rectangles
- Text appears as garbled characters or random symbols instead of readable words
- Embedded images are missing or display as grey placeholders
- The file opens but reports fewer pages than expected
- The file refuses to open at all, with a generic I/O or parsing error
If the file opens but looks wrong (wrong fonts, shifted layout), that is usually a rendering compatibility issue rather than corruption. True corruption involves broken internal references, not just visual differences between viewers.
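Before reaching for a repair tool, a quick look at the raw bytes can help tell structural damage apart from a rendering issue. The sketch below is a rough triage under simplifying assumptions, not a validator: the function name `quick_pdf_check` and the 2048-byte tail window are illustrative choices, and a file can pass all three checks while still being damaged elsewhere.

```python
import re


def quick_pdf_check(data: bytes) -> dict:
    """Cheap structural triage of raw PDF bytes (hypothetical helper)."""
    tail = data[-2048:]  # trailer, startxref, and %%EOF normally sit near the end
    report = {
        "has_header": b"%PDF-" in data[:1024],
        "has_eof": b"%%EOF" in tail,
        "startxref_ok": False,
    }
    # startxref is followed by the byte offset of the most recent xref section
    matches = re.findall(rb"startxref\s+(\d+)", tail)
    if matches:
        offset = int(matches[-1])
        if offset < len(data):
            # A classic table begins with "xref"; an xref stream begins with "N G obj"
            target = data[offset:offset + 32].lstrip()
            report["startxref_ok"] = bool(re.match(rb"xref|\d+\s+\d+\s+obj", target))
    return report
```

A missing `%%EOF` usually points at truncation, while a `startxref` offset that lands on something other than an xref section suggests an interrupted save or stale offsets.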
How to repair a PDF online (free, no upload)
The PDF4.dev repair PDF tool repairs corrupted PDFs entirely in your browser using pdf-lib. The file never leaves your device, which matters for sensitive documents like contracts, medical records, or financial statements.
- Open pdf4.dev/tools/repair-pdf
- Drag your corrupted PDF onto the upload zone (or click to pick a file)
- Click Repair PDF
- Download the repaired result
The tool works by loading the damaged file with pdf-lib's lenient parser, which skips over malformed objects and rebuilds the xref table from whatever valid objects it finds. It then saves a clean copy with a new xref, new trailer, and correct byte offsets for every surviving object.
How PDF repair works under the hood
A PDF file has three structural layers that repair tools target:
Cross-reference table (xref). The xref table is an index at the end of the file that maps every object number to its byte offset. When the xref is corrupted, the viewer cannot find any objects and reports the file as damaged. Repair tools rebuild the xref by scanning the entire file byte-by-byte for obj markers and recording each object's actual position.
Page tree. The page tree is a hierarchical structure that lists every page and its associated content stream, fonts, and images. If a page tree node references an object that no longer exists (because it was truncated or overwritten), the page appears blank. Repair tools rebuild the page tree from whatever page objects they find.
Stream lengths. Each content stream in a PDF has a declared byte length in its dictionary. If the actual bytes do not match the declared length (because of truncation or partial overwrite), the stream is unreadable. Some repair tools recalculate stream lengths from the actual data; others skip streams whose lengths do not match.
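The xref rebuild described above can be sketched in a few lines of Python. This is a simplified illustration, not how any particular tool implements it: the names `OBJ_MARKER`, `scan_objects`, and `build_xref` are made up for the example, the scan can be fooled by byte sequences inside binary streams that happen to look like object markers, and it cannot see objects packed inside compressed object streams.

```python
import re

# "N G obj" at the start of the data or right after whitespace
OBJ_MARKER = re.compile(rb"(?<!\S)(\d+)\s+(\d+)\s+obj\b")


def scan_objects(data: bytes) -> dict[int, tuple[int, int]]:
    """Map object number -> (byte offset, generation) for every marker found."""
    found = {}
    for m in OBJ_MARKER.finditer(data):
        num, gen = int(m.group(1)), int(m.group(2))
        # A later definition replaces an earlier one, as with incremental updates
        found[num] = (m.start(), gen)
    return found


def build_xref(objects: dict[int, tuple[int, int]]) -> bytes:
    """Serialize a classic xref table with one 20-byte entry per object number."""
    size = max(objects, default=0) + 1
    out = [b"xref\n", b"0 %d\n" % size]
    # Object 0 is always the head of the free list
    out.append(b"0000000000 65535 f \n")
    for num in range(1, size):
        if num in objects:
            offset, gen = objects[num]
            out.append(b"%010d %05d n \n" % (offset, gen))
        else:
            # Simplification: gaps are marked free without chaining the free list
            out.append(b"0000000000 65535 f \n")
    return b"".join(out)
```

A real repair tool would go on to locate the document catalog among the recovered objects and write a new trailer pointing at it; that step is omitted here.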
How to repair a PDF with pdf-lib (Node.js)
pdf-lib is a pure JavaScript library with no native dependencies. Its PDFDocument.load() method accepts an { ignoreEncryption: true } option and applies lenient parsing by default, which recovers from many structural errors.
    import { PDFDocument } from "pdf-lib";
    import { readFileSync, writeFileSync } from "fs";

    async function repairPdf(inputPath: string, outputPath: string) {
      const bytes = readFileSync(inputPath);

      // pdf-lib's parser rebuilds the xref table from raw object markers
      const doc = await PDFDocument.load(bytes, {
        ignoreEncryption: true,
        updateMetadata: false,
      });

      // save() writes a fresh file with a new xref and trailer
      const repairedBytes = await doc.save();
      writeFileSync(outputPath, repairedBytes);

      const pages = doc.getPageCount();
      console.log(`Repaired ${inputPath} → ${outputPath} (${pages} pages)`);
    }

    // Report parse failures instead of leaving the promise rejection unhandled
    repairPdf("damaged.pdf", "repaired.pdf").catch(console.error);

Install pdf-lib:

    npm install pdf-lib

The load() call does the repair work. pdf-lib scans the raw bytes for PDF object markers (N N obj ... endobj), parses each one independently, and builds a new in-memory xref table from the results. The save() call writes a clean file with correct byte offsets and a valid trailer.
One limitation: pdf-lib cannot recover content streams whose bytes are physically missing or overwritten with zeroes. If the page content is gone, the page will exist in the repaired file but render as blank.
How to repair a PDF with PyMuPDF (Python)
PyMuPDF (pymupdf) is built on the MuPDF C library, which has a dedicated repair mode that rebuilds the xref table and page tree in a single pass.
Install PyMuPDF:

    pip install pymupdf

Then open the damaged file and save a clean copy:

    import pymupdf


    def repair_pdf(input_path: str, output_path: str) -> None:
        # No explicit flag is needed: MuPDF rebuilds the xref on open when it detects damage
        doc = pymupdf.open(input_path)

        if doc.needs_pass:
            print("File is encrypted: cannot repair without password")
            doc.close()
            return

        # Save with cleanup options to remove orphaned objects
        doc.save(
            output_path,
            garbage=4,     # remove unreferenced objects and merge duplicates
            deflate=True,  # recompress streams
            clean=True,    # sanitize content streams
        )
        print(f"Repaired → {output_path} ({doc.page_count} pages)")
        doc.close()


    repair_pdf("damaged.pdf", "repaired.pdf")

PyMuPDF automatically detects xref corruption and switches to repair mode when pymupdf.open() encounters a broken trailer or invalid xref entries. The garbage=4 option on save removes all unreferenced objects, which cleans up orphans left behind by the corruption, and also merges duplicate objects. The clean=True option sanitizes content streams by re-parsing and rewriting their operator sequences.
For batch repair of an entire directory:
    import pymupdf
    from pathlib import Path

    input_dir = Path("damaged-files")
    output_dir = Path("repaired-files")
    output_dir.mkdir(exist_ok=True)

    for pdf in input_dir.glob("*.pdf"):
        try:
            # The with-block closes the document even if save() raises
            with pymupdf.open(str(pdf)) as doc:
                doc.save(str(output_dir / pdf.name), garbage=4, deflate=True, clean=True)
                print(f"OK: {pdf.name} ({doc.page_count} pages)")
        except Exception as e:
            print(f"FAIL: {pdf.name}: {e}")

How to repair a PDF on the command line
Two command-line tools handle PDF repair reliably: qpdf and Ghostscript. Both are open source, available on Linux, macOS, and Windows, and scriptable.
qpdf
qpdf is a structural PDF transformation tool. It automatically detects and recovers from xref table errors when processing any file.
    # Install on macOS
    brew install qpdf

    # Install on Ubuntu/Debian
    sudo apt install qpdf

    # Basic repair: qpdf rebuilds the xref table automatically
    qpdf damaged.pdf repaired.pdf

qpdf reads the file, detects broken xref entries, reconstructs the object map by scanning for obj markers, and writes a clean output. If the file has non-fatal errors that qpdf can recover from, it prints warnings to stderr but still produces output.
To repair a file in place, use the --replace-input flag. qpdf exits with code 3 when it succeeds but emits warnings, which many scripts treat as a failure; add --warning-exit-0 to get exit code 0 in that case:

    qpdf --replace-input --warning-exit-0 damaged.pdf

Additional repair-related flags:
| Flag | Effect |
|---|---|
| --replace-input | Overwrite the input file with the repaired version |
| --warning-exit-0 | Return exit code 0 even if warnings are emitted |
| --linearize | Optimize the output for web viewing (rewrites the entire structure) |
| --object-streams=disable | Unpack object streams into individual objects (helps viewers that cannot parse object streams) |
For the most thorough repair, combine linearization with object stream unpacking:
    qpdf --linearize --object-streams=disable damaged.pdf repaired.pdf

This rewrites every object individually, rebuilds the xref, and adds a linearization dictionary for fast web loading.
Ghostscript
Ghostscript takes a different approach: it re-interprets the entire PDF and writes it back out through its pdfwrite device, producing a completely new file from scratch.
    # Install on macOS
    brew install ghostscript

    # Install on Ubuntu/Debian
    sudo apt install ghostscript

    # Re-render the damaged file into a new PDF
    gs -dNOPAUSE -dBATCH -dSAFER \
      -sDEVICE=pdfwrite \
      -sOutputFile=repaired.pdf \
      damaged.pdf

Because Ghostscript re-renders rather than restructures, it can recover from a wider range of corruption types than tools that only rebuild the xref. If a content stream is partially damaged, Ghostscript renders whatever it can parse and skips the rest, producing a page with partial content rather than a blank page.
The trade-off is speed and fidelity. Ghostscript is slower than qpdf because it interprets every PDF operator. It may also re-encode images, change font subsets, or alter metadata, so the output is not byte-identical to the original even for undamaged pages. For files where fidelity matters more than recovery depth, use qpdf first and fall back to Ghostscript only if qpdf fails.
Repair methods compared
| Tool | Repair method | Handles truncated files | Handles broken xref | Handles damaged streams | Speed | Output fidelity |
|---|---|---|---|---|---|---|
| PDF4.dev repair tool | Re-parse + save | Partial | Yes | No | Fast | High |
| pdf-lib (Node.js) | Re-parse + save | Partial | Yes | No | Fast | High |
| PyMuPDF (Python) | MuPDF repair engine | Partial | Yes | Partial | Fast | High |
| Ghostscript | Full re-render | Yes (partial content) | Yes | Yes (partial) | Slow | Medium |
| qpdf | Structural rebuild | Partial | Yes | No | Very fast | Very high |
"Partial" for truncated files means the tool recovers pages whose objects are present in the file but cannot reconstruct pages whose bytes are missing entirely.
For most cases, qpdf is the best first attempt because it is fast, preserves the original byte content of undamaged objects, and handles the most common corruption type (broken xref). If qpdf produces a file that still has blank or garbled pages, run Ghostscript as a second pass: it re-renders each page independently and can recover partial content from damaged streams.
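The qpdf-first, Ghostscript-second strategy can be scripted. The following is a minimal sketch, assuming both `qpdf` and `gs` are on the PATH (on Windows the Ghostscript binary usually has a different name); the helper names and the injectable `run` parameter are illustrative. It only falls back when qpdf reports a hard failure, because detecting blank or garbled pages in qpdf's output would need a separate rendering check.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]  # takes argv, returns the exit code


def qpdf_cmd(src: str, dst: str) -> list[str]:
    return ["qpdf", src, dst]


def gs_cmd(src: str, dst: str) -> list[str]:
    return ["gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
            "-sDEVICE=pdfwrite", f"-sOutputFile={dst}", src]


def run_quiet(argv: Sequence[str]) -> int:
    # Raises FileNotFoundError if the tool is not installed
    return subprocess.run(argv, capture_output=True).returncode


def repair_with_fallback(src: str, dst: str, run: Runner = run_quiet) -> str:
    """Try qpdf, then Ghostscript. Returns the tool that produced dst, or 'failed'."""
    # qpdf exits 0 on success and 3 when it succeeded with warnings
    if run(qpdf_cmd(src, dst)) in (0, 3):
        return "qpdf"
    if run(gs_cmd(src, dst)) == 0:
        return "ghostscript"
    return "failed"
```

Calling `repair_with_fallback("damaged.pdf", "repaired.pdf")` runs the real tools; passing a custom `run` makes the decision logic testable without either tool installed.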
Common use cases
Recover a PDF from a failed download. Interrupted browser downloads are the most common source of corrupted PDFs. The file is truncated at the point where the connection dropped. If the truncation happened past the first few pages, qpdf or pdf-lib can usually recover all pages whose objects were fully downloaded.
Fix a PDF after a disk error. Bad sectors corrupt random byte ranges. If the damaged bytes fall in the xref table or trailer (the last few kilobytes of the file), repair is straightforward: rebuild the xref. If the damaged bytes fall in a content stream, that specific page will have missing content but other pages will be intact.
Repair PDFs from legacy software. Some older PDF generators (pre-2010 scanner drivers, early LibreOffice versions, certain ERP export modules) produce files with structural violations that modern viewers reject. Re-saving through pdf-lib, PyMuPDF, or Ghostscript normalizes the structure to conform to the current spec.
Batch-repair a document archive. Organizations migrating document archives sometimes find that 2 to 5 percent of stored PDFs are unreadable due to accumulated bit rot or inconsistent backup processes. A qpdf batch script can scan and repair thousands of files per minute.
Recover a PDF that "won't open" after email. Some email clients and web-based email services re-encode attachments (base64 round-trip, line wrapping changes) in ways that corrupt the byte stream. Re-downloading the attachment or asking the sender to share via a file transfer service avoids the issue; repairing with qpdf fixes files already received.
What cannot be repaired
Not every corrupted PDF can be recovered. The following types of damage are permanent:
Severe truncation. If the file is missing more than half its bytes (for example, a 10 MB PDF that is only 2 MB on disk), most page objects are simply not in the file. No repair tool can reconstruct data that was never written.
Encryption without password. If the PDF is encrypted and you do not have the password, repair tools cannot decrypt the content streams to re-render them. The structural repair may succeed (the xref is rebuilt), but the content remains encrypted and unreadable. Remove the password first if you have it, then repair.
Overwritten bytes. If the file was partially overwritten with other data (for example, a disk recovery tool placed unrelated bytes into the middle of the file), the overwritten content is destroyed. Pages whose streams fall in the overwritten region will be blank or garbled after repair.
XFA form data loss. XFA forms (used by some government and financial institutions) store form data in an XML stream separate from the page content. If the XFA stream is corrupted, the form data is lost even if the page backgrounds render correctly after repair.
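Some of these cases can be flagged before spending time on a repair. The sketch below is a heuristic pre-flight check, with hypothetical helper names: it looks for an /Encrypt entry and for streams whose declared /Length does not land on an endstream keyword. It assumes /Length is a direct integer in the dictionary just before the stream (indirect references are skipped), and a byte-level search like this can produce false positives.

```python
import re

LENGTH = re.compile(rb"/Length\s+(\d+)(\s+\d+\s+R)?")
STREAM = re.compile(rb">>\s*stream\r?\n")


def has_encrypt_entry(data: bytes) -> bool:
    """True if an /Encrypt key appears anywhere in the file."""
    return re.search(rb"/Encrypt\b", data) is not None


def check_streams(data: bytes) -> list[tuple[int, str]]:
    """Return (offset, problem) for each stream whose declared /Length looks wrong."""
    problems = []
    for m in STREAM.finditer(data):
        # Inspect the dictionary just before the stream keyword
        window = data[max(0, m.start() - 512):m.start()]
        window = window[window.rfind(b"obj") + 1:]  # stay inside the current object
        lengths = LENGTH.findall(window)
        if not lengths or lengths[-1][1]:
            continue  # no /Length found, or an indirect reference not resolved here
        end = m.end() + int(lengths[-1][0])
        if end > len(data):
            problems.append((m.start(), "stream runs past end of file"))
        elif not data[end:end + 16].lstrip().startswith(b"endstream"):
            problems.append((m.start(), "endstream not found at declared length"))
    return problems
```

A stream reported as running past the end of the file is a strong hint of truncation; a mismatch elsewhere may be a producer bug that repair tools can work around, or overwritten bytes that they cannot.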
Summary
- PDF corruption is caused by incomplete downloads, interrupted saves, disk errors, email truncation, or non-conformant PDF producers.
- Repair works by rebuilding the cross-reference table and page tree from surviving objects in the file.
- For quick one-off repairs, use the PDF4.dev repair PDF tool, which runs in the browser with no upload.
- For Node.js automation, pdf-lib re-parses the file with PDFDocument.load() and writes a clean copy with save().
- For Python, PyMuPDF with its built-in MuPDF repair engine handles xref rebuilding and content stream sanitization.
- For maximum recovery depth on the command line, try qpdf first (fast, high fidelity), then Ghostscript as a fallback (slower, re-renders everything).
- Severely truncated files, encrypted files without a password, and overwritten byte ranges cannot be repaired by any tool.


