How to redact a PDF permanently (and why highlighting isn't enough)

Learn how to permanently redact sensitive text and images in PDFs. Covers free browser tools, Python scripts, and why simple highlighting fails security audits.

benoitded · April 1, 2026 · 9 min read

PDF redaction permanently removes sensitive content from a document. The most common mistake is placing a black rectangle on top of text — this hides the text visually but leaves the original data fully accessible in the file. This guide explains what true redaction is, how to do it correctly, and when a programmatic approach is the right choice.

What redaction actually means

Redaction is the permanent deletion of content from a PDF file. A properly redacted PDF has the sensitive content removed from the file's data structures, not just hidden behind an opaque overlay. The resulting file should be indistinguishable from a document that never contained the redacted content.

A PDF stores page text in content streams, images as compressed binary objects, and annotations as separate objects layered over the page. A black highlight is an annotation: it sits on top of the content layer and can be removed by anyone who opens the annotation panel in Acrobat, Preview, or another PDF editor.
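The same holds when the box is drawn into the page content instead of being added as an annotation. The sketch below illustrates this with a hand-written, uncompressed content stream; real PDFs usually compress their streams, and add_black_box is a hypothetical helper written for this example, not part of any library.

```python
# Hand-written, uncompressed page content: one line of text.
# Real PDF streams are usually compressed; this is an illustration only.
PAGE_CONTENT = b"BT /F1 12 Tf 72 700 Td (SSN: 123-45-6789) Tj ET\n"


def add_black_box(content: bytes, x: int, y: int, w: int, h: int) -> bytes:
    """Append a filled black rectangle, the way a draw-over 'redaction' does."""
    box = f"0 0 0 rg {x} {y} {w} {h} re f\n".encode("ascii")
    return content + box


covered = add_black_box(PAGE_CONTENT, 70, 696, 160, 16)

# The rectangle is painted after the text, so a viewer shows a black bar,
# but the text operator and its string are still present in the stream.
print(b"123-45-6789" in covered)
```

The rectangle operators are only appended; nothing in the original stream is rewritten, which is why text extraction still finds the covered string.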

What stays in the file if you only draw over it

When you draw a black box without true redaction, the original data remains:

| Method | Text removable? | Image removable? | Copy-paste blocked? |
|---|---|---|---|
| Black highlight annotation | No | No | No |
| Black image overlay | No | No | No |
| Comment/markup box | No | No | No |
| True redaction (pdf-lib, PyMuPDF) | Yes | Yes | Yes |

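A simple test: run pdftotext file.pdf - in your terminal on any "redacted" PDF. If the sensitive words appear in the output, the redaction is incomplete. Opening the file in a text editor can also reveal leftover text, but only in uncompressed streams, so a clean result there proves little.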
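This check can be scripted. The sketch below assumes the pdftotext command-line tool is installed and on your PATH; find_leaks and extract_text are hypothetical helper names, and the file name and search term in the usage comment are placeholders.

```python
import subprocess


def find_leaks(extracted_text: str, sensitive_terms: list[str]) -> list[str]:
    """Return the sensitive terms that still appear in the extracted text."""
    # Collapse whitespace so a term split across a line break is still found.
    haystack = " ".join(extracted_text.split()).casefold()
    return [
        term
        for term in sensitive_terms
        if " ".join(term.split()).casefold() in haystack
    ]


def extract_text(pdf_path: str) -> str:
    """Extract text with the pdftotext CLI (assumed to be on PATH)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Usage (placeholder file name and term):
# leaks = find_leaks(extract_text("contract_redacted.pdf"), ["John Smith"])
# if leaks:
#     raise SystemExit(f"Redaction incomplete: {leaks}")
```

An empty result only shows that the listed terms were not extracted as text. It does not cover text inside images or terms you forgot to list.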

How to redact a PDF in the browser (free, no upload)

PDF4.dev's redact PDF tool runs entirely in your browser using pdf-lib. No file leaves your device.

  1. Open pdf4.dev/tools/redact-pdf.
  2. Drop your PDF onto the upload area.
  3. Draw rectangles over the content you want to remove. Each rectangle becomes a permanent black block.
  4. Click "Apply redactions" to flatten the changes into the file.
  5. Download the result.

The tool renders each page using PDF.js, overlays your drawn rectangles as vector shapes, and then uses pdf-lib to write opaque black rectangles over those regions in the file's content stream. The intent is that the original text under each rectangle is not recoverable, but because pdf-lib's drawing operations paint over content instead of deleting it, always run the verification step below.

After downloading your redacted file, verify the result: open it, select all text (Cmd+A / Ctrl+A), and check whether any redacted content is selectable. If it is, the redaction did not work correctly.

Redacting a PDF programmatically with Python

For batch redaction or automated pipelines, PyMuPDF (installed from PyPI as pymupdf and imported as fitz in the examples below) is one of the most capable Python libraries for this task. It supports text search, image removal, and metadata scrubbing.

Install PyMuPDF

pip install pymupdf

Redact specific text

This script searches for a literal string (e.g., a person's name) and redacts every occurrence across all pages. Note that search_for() matches plain text, not regular expressions:

import fitz  # PyMuPDF
 
def redact_text_in_pdf(input_path: str, output_path: str, search_term: str) -> int:
    doc = fitz.open(input_path)
    total_redactions = 0
 
    for page in doc:
        # Find all instances of the search term
        instances = page.search_for(search_term)
        for rect in instances:
            # Add a redaction annotation with a black fill
            page.add_redact_annot(rect, fill=(0, 0, 0))
            total_redactions += 1
 
        # Apply redactions — this permanently removes the underlying text
        page.apply_redactions()
 
    doc.save(output_path, garbage=4, deflate=True)
    doc.close()
    return total_redactions
 
count = redact_text_in_pdf("contract.pdf", "contract_redacted.pdf", "John Smith")
print(f"Redacted {count} instance(s)")

The key step is page.apply_redactions(). Without it, the annotations are added but the underlying content is not removed. In doc.save(), garbage=4 drops unreferenced objects and merges duplicates, and deflate=True compresses the output streams.

Redact by regex pattern (phone numbers, emails, SSNs)

import fitz
import re
 
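PATTERNS = {
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    # [A-Za-z], not [A-Z|a-z]: a "|" inside a character class is a literal pipe
    "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
    # (?<!\w) instead of \b so a leading "+" or "(" is part of the match
    "phone": r"(?<!\w)(?:\+\d{1,3}[\s.-])?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}\b",
}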
 
def redact_patterns(input_path: str, output_path: str) -> dict:
    doc = fitz.open(input_path)
    counts = {k: 0 for k in PATTERNS}
 
    for page in doc:
        full_text = page.get_text("text")
 
        for pattern_name, pattern in PATTERNS.items():
            # Deduplicate matched strings: search_for() already returns every
            # occurrence of a string, so repeats would be counted twice
            matched_strings = {m.group() for m in re.finditer(pattern, full_text)}
            for matched in matched_strings:
                # Locate the matched string on the page
                instances = page.search_for(matched)
                for rect in instances:
                    page.add_redact_annot(rect, fill=(0, 0, 0))
                    counts[pattern_name] += 1
 
        page.apply_redactions()
 
    doc.save(output_path, garbage=4, deflate=True)
    doc.close()
    return counts
 
result = redact_patterns("document.pdf", "document_redacted.pdf")
print(result)
# {'ssn': 3, 'email': 7, 'phone': 2}
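Before pointing patterns at real documents, it helps to try them on sample text with the standard library alone. The sketch below uses patterns modeled on the ones above; the exact expressions, the find_matches helper, and the sample string are assumptions for illustration and will need adjusting for the formats you actually expect.

```python
import re

# Illustrative patterns; adjust to the formats that occur in your documents.
PATTERNS = {
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
    "phone": r"(?<!\w)(?:\+\d{1,3}[\s.-])?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}\b",
}


def find_matches(text: str) -> dict[str, list[str]]:
    """Return every match for each named pattern, in order of appearance."""
    return {
        name: [m.group() for m in re.finditer(pattern, text)]
        for name, pattern in PATTERNS.items()
    }


sample = "Contact jane.doe@example.com or (555) 123-4567. SSN 123-45-6789."
for name, hits in find_matches(sample).items():
    print(name, hits)
```

Testing on plain strings first separates regex mistakes from PDF extraction quirks such as values split across lines.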

For large-scale document processing, run redaction as part of your document ingestion pipeline before storing or sharing files. Redacting at generation time (before the document is ever stored with sensitive data) is more reliable than redacting after the fact.
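A pipeline step of that kind might look like the sketch below. redact_folder is a hypothetical helper; it takes whichever redaction function you use as a callable with the signature (input_path, output_path), and the _redacted suffix is an arbitrary naming choice.

```python
from pathlib import Path
from typing import Callable


def redact_folder(
    input_dir: str,
    output_dir: str,
    redact_one: Callable[[str, str], object],
) -> list[Path]:
    """Run redact_one(input_path, output_path) for every PDF in input_dir."""
    out_root = Path(output_dir)
    out_root.mkdir(parents=True, exist_ok=True)

    written = []
    # sorted() keeps the processing order stable between runs
    for src in sorted(Path(input_dir).glob("*.pdf")):
        dst = out_root / f"{src.stem}_redacted.pdf"
        redact_one(str(src), str(dst))
        written.append(dst)
    return written


# Usage (placeholder folder names; redact_patterns is the function shown earlier):
# redact_folder("incoming", "redacted", redact_patterns)
```

Writing to a separate output folder keeps the unredacted originals out of whatever location is later shared or indexed.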

Remove PDF metadata after redaction

Visible content is only part of the problem. PDF metadata can include the original author's name, revision history, and document title. Clear it after redacting:

import fitz
 
def scrub_metadata(input_path: str, output_path: str) -> None:
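    doc = fitz.open(input_path)
    doc.set_metadata({
        "author": "",
        "creator": "",
        "producer": "",
        "subject": "",
        "title": "",
        "keywords": "",
        "creationDate": "",
        "modDate": "",
    })
    # set_metadata() only updates the document info dictionary;
    # also drop the separate XMP (XML) metadata stream if present
    doc.del_xml_metadata()
    doc.save(output_path, garbage=4, deflate=True, clean=True)
    doc.close()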

Alternatively, use the PDF4.dev metadata editor tool to clear metadata in the browser without writing code.
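Whichever route you take, it is worth confirming that nothing was left behind. The sketch below assumes you pass in a metadata dictionary such as the one PyMuPDF exposes as doc.metadata; leftover_metadata and STRUCTURAL_KEYS are hypothetical names introduced for this example.

```python
# Keys that describe the file itself (PDF version, encryption) and not its
# authorship; they are ignored by the check below.
STRUCTURAL_KEYS = {"format", "encryption"}


def leftover_metadata(metadata: dict) -> dict:
    """Return the metadata fields that still hold a non-empty value."""
    return {
        key: value
        for key, value in metadata.items()
        if key not in STRUCTURAL_KEYS and value not in (None, "")
    }


# Usage (placeholder file name):
# import fitz
# remaining = leftover_metadata(fitz.open("document_redacted.pdf").metadata)
# if remaining:
#     print("Metadata still present:", remaining)
```

An empty dictionary means the info fields are blank. It says nothing about XMP metadata, which needs a separate check, for example with exiftool.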

Redacting with JavaScript and pdf-lib

For Node.js environments, pdf-lib can draw filled rectangles over specific regions. This approach requires knowing the coordinates of the content to redact — it does not support text search.

import { PDFDocument, rgb } from "pdf-lib";
import fs from "fs";
 
interface RedactionRect {
  page: number; // 0-indexed
  x: number;
  y: number;
  width: number;
  height: number;
}
 
async function redactPdf(
  inputPath: string,
  outputPath: string,
  redactions: RedactionRect[]
): Promise<void> {
  const pdfBytes = fs.readFileSync(inputPath);
  const pdfDoc = await PDFDocument.load(pdfBytes);
  const pages = pdfDoc.getPages();
 
  for (const rect of redactions) {
    const page = pages[rect.page];
    const { height: pageHeight } = page.getSize();
 
    // PDF coordinate system: origin is bottom-left
    // Convert from top-left coordinates to bottom-left
    page.drawRectangle({
      x: rect.x,
      y: pageHeight - rect.y - rect.height,
      width: rect.width,
      height: rect.height,
      color: rgb(0, 0, 0),
      opacity: 1,
    });
  }
 
  const outputBytes = await pdfDoc.save();
  fs.writeFileSync(outputPath, outputBytes);
}
 
// Example: redact a 200x20 pt rectangle on page 0, 50 pt from top, 100 pt from left
await redactPdf("input.pdf", "output.pdf", [
  { page: 0, x: 100, y: 50, width: 200, height: 20 },
]);
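The coordinate flip in the drawRectangle call is easy to get wrong, so here is the same arithmetic in isolation, written in Python for consistency with the other examples. top_left_to_pdf is a hypothetical helper, and the page height of 792 points assumes a US Letter page.

```python
def top_left_to_pdf(
    x: float,
    y_top: float,
    width: float,
    height: float,
    page_height: float,
) -> tuple[float, float, float, float]:
    """Convert a top-left-origin rectangle to PDF's bottom-left-origin form.

    PDF measures y upward from the bottom edge, and a rectangle is anchored
    at its lower-left corner, so the box height must be subtracted as well.
    """
    return (x, page_height - y_top - height, width, height)


# 200x20 pt box, 50 pt from the top and 100 pt from the left of a 792 pt page
print(top_left_to_pdf(100, 50, 200, 20, page_height=792))
# prints (100, 722, 200, 20)
```

Forgetting to subtract the box height places the rectangle one box-height too high, which can leave the bottom of the text line exposed.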

pdf-lib's drawRectangle() paints over content; on its own it does not remove the underlying text from the content stream. For PDFs where the text must be unrecoverable, use PyMuPDF with apply_redactions() or a dedicated redaction service. The PDF4.dev browser tool uses pdf-lib for client-side convenience; for regulated contexts (HIPAA, GDPR, legal discovery), use PyMuPDF or a dedicated solution.

Preventing the problem at generation time

If you generate PDFs programmatically (invoices, reports, contracts), the most secure approach is to never embed sensitive data in the final PDF in the first place. Design your template to exclude or abbreviate sensitive fields:

// Instead of embedding a full credit card number in the PDF
const rawInvoiceData = {
  customer_name: "Jane Doe",
  card_number: "4111111111111111", // ❌ don't embed this
};
 
// Mask it before passing to the PDF template
const invoiceData = {
  customer_name: rawInvoiceData.customer_name,
  card_last_four: rawInvoiceData.card_number.slice(-4), // ✅ only embed what's needed
};
 
const response = await fetch("https://pdf4.dev/api/v1/render", {
  method: "POST",
  headers: {
    Authorization: "Bearer p4_live_xxx",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    template_id: "invoice",
    data: invoiceData,
  }),
});

This eliminates the need for post-generation redaction entirely. See the guide on generating PDF invoices programmatically for a full example.
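The masking step itself is small enough to centralize in one helper so that no template ever receives the full number. The sketch below shows one way to do that in Python; mask_card_number and build_invoice_data are hypothetical names, and the field names simply mirror the example above.

```python
def mask_card_number(card_number: str) -> str:
    """Return only the last four digits of a card number."""
    # Strip spaces and dashes so formatted input is handled too
    digits = "".join(ch for ch in card_number if ch.isdigit())
    if len(digits) < 4:
        raise ValueError("card number too short to mask")
    return digits[-4:]


def build_invoice_data(customer_name: str, card_number: str) -> dict:
    """Build the template payload without ever including the full number."""
    return {
        "customer_name": customer_name,
        "card_last_four": mask_card_number(card_number),
    }


print(build_invoice_data("Jane Doe", "4111 1111 1111 1111"))
```

Because the payload is built from the masked value only, the full number never reaches the rendering request or the resulting file.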

Redaction checklist for regulated environments

For documents subject to GDPR, HIPAA, CCPA, or legal discovery, use this checklist before sharing a redacted PDF:

| Step | Action | Tool |
|---|---|---|
| 1 | Redact visible text and images | PyMuPDF apply_redactions() or PDF4.dev tool |
| 2 | Clear document metadata (author, title, dates) | PyMuPDF set_metadata() or PDF4.dev metadata tool |
| 3 | Remove embedded attachments | PyMuPDF embfile_del() if applicable |
| 4 | Remove JavaScript actions | Manual review or specialized tool |
| 5 | Verify: extract text and confirm no sensitive content remains | pdftotext file.pdf - or PyMuPDF get_text() |
| 6 | Verify: check metadata is cleared | exiftool file.pdf or PDF properties dialog |
| 7 | Password-protect if distributing externally | PDF4.dev protect PDF tool |

Step 5 is the most frequently skipped. Run it every time.

How redaction compares to other PDF security methods

| Method | What it does | Reversible? | Use case |
|---|---|---|---|
| Redaction | Permanently removes content | No | Sharing with third parties |
| Password protection | Restricts access to the file | Yes (if password known) | Controlling who opens the file |
| Watermark | Marks the document as confidential | Yes | Deterrence, not removal |
| Permission restrictions | Blocks printing, copying, editing | Partial | Reducing accidental sharing |
| Page deletion | Removes entire pages | No | Removing wholly sensitive pages |

Redaction and password protection solve different problems. Redaction removes the sensitive content; password protection controls who can read the remaining content. For the highest security, apply both. See how to password-protect a PDF for the second step.

Summary

Permanent PDF redaction requires removing content from the file's data structures, not just drawing over it. For browser-based redaction of individual files, the PDF4.dev redact tool handles this without uploading your document. For programmatic pipelines, PyMuPDF's apply_redactions() is the most reliable option for ensuring text cannot be recovered. After redacting, always verify by extracting text from the output file, and clear metadata as a second step.

For teams generating documents programmatically, the most secure approach is designing templates that never include sensitive data beyond what needs to appear in the final document — eliminating the redaction step entirely.

Related tools: redact PDF · protect PDF · edit PDF metadata · watermark PDF