PDF generation best practices for production

An operational guide to running PDF generation at scale: caching, retries, timeouts, browser pool lifecycle, memory, observability, and cost. Real numbers from the PDF4.dev production stack.

benoitded · 15 min read

PDF generation looks like a stateless function until you put it in production. Then you discover Chromium leaks, fonts go missing, the browser disconnects mid-request, and one stuck network call brings down the whole worker. This is the operational guide we wish existed when we built PDF4.dev.

The short answer: cache aggressively, retry with idempotency, never trust networkidle, recycle workers on a counter, and measure everything at p99.

What does production-grade PDF generation actually mean?

Production-grade means the system holds together under the full failure surface: cold starts, OOM kills, slow upstream assets, retries, malformed input, and traffic spikes. A prototype that renders an invoice in 300 ms is not production-grade. A service that holds 99.9 percent uptime with a p99 under 800 ms across 200k renders per day is.

Concern           | Prototype                 | Production
Browser lifecycle | launch() per request      | Singleton + worker recycle
Asset loading     | waitUntil: networkidle    | waitUntil: load + 30 s cap
Errors            | try/catch, log to stdout  | Sanitized, typed, retried
Caching           | None                      | Render cache by content hash
Observability     | Console logs              | p50/p95/p99, queue depth, RSS
Cost              | Free                      | $/1000 renders + on-call
Cold start        | 1.2 s                     | under 100 ms (warm pool)

This guide walks through each row of that table.

How do I cache PDF renders without serving stale documents?

Cache on a content hash of the inputs, never on the URL or the user. A PDF render is a pure function of HTML plus data plus format. Hash all three (SHA-256 over a canonical JSON representation) and use that as the cache key. Two requests with the same inputs return the same bytes; one byte difference produces a new key and a fresh render.

import { createHash } from "node:crypto";
 
function renderKey(html: string, data: object, format: object): string {
  const payload = JSON.stringify({
    html,
    data: sortKeys(data),
    format: sortKeys(format),
  });
  return "render_" + createHash("sha256").update(payload).digest("hex").slice(0, 32);
}
 
function sortKeys<T>(obj: T): T {
  if (Array.isArray(obj)) return obj.map(sortKeys) as T;
  if (obj && typeof obj === "object") {
    return Object.fromEntries(
      Object.entries(obj as Record<string, unknown>)
        .sort(([a], [b]) => a.localeCompare(b))
        .map(([k, v]) => [k, sortKeys(v)]),
    ) as T;
  }
  return obj;
}

Two layers of cache work well in practice:

  1. Render cache. Stores the actual PDF bytes for 24 hours, keyed by renderKey(). Disk-backed for cheap, S3-backed for shared. PDF4.dev uses disk under data/renders/ with a 2000-entry soft cap and an HMAC-signed URL for delivery.
  2. Template cache. Stores the compiled Handlebars function in memory, keyed by the template ID and version. Skips the parse step entirely. A Handlebars compile takes 5-15 ms, which adds up across thousands of renders per minute.

Cache invalidation happens automatically: any change to HTML, data, or format produces a new hash. You never need to call cache.delete() manually.

Do not cache by user ID or template name alone. Two users with the same invoice template and the same data will get the same PDF, but if you key by user the cache hit rate drops to zero. Key by content, not by identity.
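
To make the first layer concrete, here is a minimal sketch of a disk-backed render cache keyed by renderKey(). The function names are illustrative, not the PDF4.dev internals; a production version would add the 2000-entry eviction cap and the HMAC-signed delivery URL on top, but the get/set shape stays the same.

import { promises as fs } from "node:fs";
import { join } from "node:path";
 
const CACHE_DIR = "data/renders";           // same directory as described above
const CACHE_TTL_MS = 24 * 60 * 60 * 1000;   // 24-hour TTL
 
// Return cached PDF bytes for a renderKey(), or null on a miss or expiry.
export async function cacheGet(key: string): Promise<Buffer | null> {
  const file = join(CACHE_DIR, `${key}.pdf`);
  try {
    const stat = await fs.stat(file);
    if (Date.now() - stat.mtimeMs > CACHE_TTL_MS) return null; // stale entry
    return await fs.readFile(file);
  } catch {
    return null; // not cached yet
  }
}
 
// Store the rendered bytes under the content-hash key.
export async function cacheSet(key: string, pdf: Buffer): Promise<void> {
  await fs.mkdir(CACHE_DIR, { recursive: true });
  await fs.writeFile(join(CACHE_DIR, `${key}.pdf`), pdf);
}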

How do I implement retries and idempotency?

Accept an Idempotency-Key header on the render endpoint. Store the response keyed by that header for 24 hours. On retry, return the original response (same bytes, same HTTP status, same Content-Type). This is the Stripe pattern and it is the only retry strategy that holds under network failures.

import { NextRequest, NextResponse } from "next/server";
 
const idempotencyStore = new Map<string, { status: number; body: Buffer }>();
 
export async function POST(req: NextRequest) {
  const idempotencyKey = req.headers.get("idempotency-key");
  if (idempotencyKey && idempotencyStore.has(idempotencyKey)) {
    const cached = idempotencyStore.get(idempotencyKey)!;
    return new NextResponse(new Uint8Array(cached.body), {
      status: cached.status,
      headers: { "Content-Type": "application/pdf", "Idempotency-Replay": "true" },
    });
  }
 
  const { html, data } = await req.json();
  const pdf = await renderPdf(html, data);
 
  if (idempotencyKey) {
    idempotencyStore.set(idempotencyKey, { status: 200, body: pdf });
    setTimeout(() => idempotencyStore.delete(idempotencyKey), 24 * 60 * 60 * 1000);
  }
 
  return new NextResponse(new Uint8Array(pdf), {
    status: 200,
    headers: { "Content-Type": "application/pdf" },
  });
}

The in-memory Map is fine for a single instance. For multi-instance, swap it for Redis with a 24-hour TTL.
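
If you do swap the Map for Redis, the same get/set shape carries over. A minimal sketch using the ioredis client follows; the idem: key prefix and helper names are assumptions for illustration, not the PDF4.dev implementation.

import Redis from "ioredis";
 
const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");
const TTL_SECONDS = 24 * 60 * 60; // 24-hour expiry, matching the in-memory version
 
// Store the response bytes under the idempotency key and let Redis expire them.
export async function idempotencySet(key: string, body: Buffer): Promise<void> {
  await redis.set(`idem:${key}`, body.toString("base64"), "EX", TTL_SECONDS);
}
 
// Return the original bytes on a replay, or null if the key is unknown.
export async function idempotencyGet(key: string): Promise<Buffer | null> {
  const cached = await redis.get(`idem:${key}`);
  return cached ? Buffer.from(cached, "base64") : null;
}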

The retry policy on the client side should use exponential backoff with jitter, capped at 3 attempts. Anything more is masking a real problem.

async function renderWithRetry(
  payload: object,
  attempt = 0,
  // Generate the key once and reuse it on every retry; a fresh key per
  // attempt would make each retry look like a brand-new request to the server.
  idempotencyKey: string = crypto.randomUUID(),
): Promise<Buffer> {
  try {
    const res = await fetch("https://pdf4.dev/api/v1/render", {
      method: "POST",
      headers: {
        "Authorization": `Bearer ${process.env.PDF4_API_KEY}`,
        "Content-Type": "application/json",
        "Idempotency-Key": idempotencyKey,
      },
      body: JSON.stringify(payload),
    });
    if (res.status >= 500 && attempt < 2) throw new Error("retryable");
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    return Buffer.from(await res.arrayBuffer());
  } catch (err) {
    if (attempt >= 2) throw err;
    const delay = 200 * 2 ** attempt + Math.random() * 100;
    await new Promise((r) => setTimeout(r, delay));
    return renderWithRetry(payload, attempt + 1, idempotencyKey);
  }
}

Why does waitUntil matter for PDF timeouts?

The waitUntil value you pass to page.setContent(html, { waitUntil: ... }) is the most consequential single setting in the Playwright API for PDF work. Pick the wrong value and your worker hangs on a slow CDN or fires the PDF before the fonts arrive.

Value            | Fires when                                  | When to use
domcontentloaded | DOM tree parsed, no subresources loaded     | Fully self-contained HTML (no <img>, no @font-face)
load             | window.load event, all subresources loaded  | Default for inlined assets and embedded fonts
networkidle      | No network activity for 500 ms              | Heavy JS apps that wait on XHR. Hangs if any asset 404s slowly.
commit           | Navigation committed                        | Almost never useful for PDFs

Pair the waitUntil with an explicit hard timeout on the page operation:

const RENDER_TIMEOUT_MS = 30_000;
 
await page.setContent(html, { waitUntil: "load", timeout: RENDER_TIMEOUT_MS });
const pdf = await page.pdf({
  format: "A4",
  printBackground: true,
  timeout: RENDER_TIMEOUT_MS,
});

The 30-second cap is the safety net, not the target. Most PDFs render in 200-400 ms. Anything over 5 seconds is almost always a stuck <link rel="stylesheet"> to a slow CDN or a JavaScript animation that never settles.

For fonts specifically, prefer base64 data URIs over remote URLs. The CSS @font-face { src: url("data:font/woff2;base64,...") } pattern guarantees the font is "loaded" the moment the HTML is parsed, so waitUntil: load works without extra coordination.
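
One way to produce that rule is to inline the font at build time. The sketch below assumes a local WOFF2 file; the path, family name, and helper name are placeholders.

import { readFileSync } from "node:fs";
 
// Read a WOFF2 file and return a @font-face rule with the font inlined as a
// base64 data URI, so no network fetch happens during the render.
function inlineFontFace(family: string, woff2Path: string): string {
  const base64 = readFileSync(woff2Path).toString("base64");
  return `
    @font-face {
      font-family: "${family}";
      src: url("data:font/woff2;base64,${base64}") format("woff2");
      font-weight: 400;
      font-style: normal;
    }
  `;
}
 
// Inject the rule into the document head before calling page.setContent().
const fontCss = inlineFontFace("Inter", "./fonts/Inter-Regular.woff2");
const html = `<!doctype html><html><head><style>${fontCss}</style></head><body>Invoice #1042</body></html>`;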

How do I run a browser pool that does not crash?

One Chromium per worker process. One worker process per CPU core. One page per concurrent render. The browser stays alive forever, the page is created on demand and closed in a finally block.

import { chromium, type Browser } from "playwright";
 
let browser: Browser | null = null;
let renderCount = 0;
const MAX_RENDERS_PER_BROWSER = 5000;
 
async function getBrowser(): Promise<Browser> {
  if (browser && browser.isConnected() && renderCount < MAX_RENDERS_PER_BROWSER) {
    return browser;
  }
  if (browser) {
    await browser.close().catch(() => {});
  }
  browser = await chromium.launch({
    args: ["--no-sandbox", "--disable-dev-shm-usage", "--font-render-hinting=none"],
  });
  renderCount = 0;
 
  browser.on("disconnected", () => {
    browser = null;
  });
 
  return browser;
}
 
export async function renderPdf(html: string): Promise<Buffer> {
  const b = await getBrowser();
  const page = await b.newPage();
  try {
    await page.setContent(html, { waitUntil: "load", timeout: 30_000 });
    const pdf = await page.pdf({ format: "A4", printBackground: true });
    renderCount++;
    return pdf;
  } finally {
    await page.close().catch(() => {});
  }
}

The three things this pattern does right:

  1. Recycles the browser after 5000 renders to bound memory growth.
  2. Listens for disconnect so the next request creates a fresh browser instead of throwing on a dead handle.
  3. Closes the page in finally so a render error does not leak the page object.

Why not multiple browsers? Because Chromium uses ~250 MB of RSS at idle and concurrent pages inside one browser are essentially free. A single browser handles 30+ concurrent pages comfortably. Spawning multiple browsers per worker just multiplies the memory cost.
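
The one-page-per-render rule is easier to hold under a traffic spike with an explicit cap, so excess renders queue instead of opening hundreds of pages. This is not part of the pool code above; it is a sketch of a minimal in-process limiter, with the 30-page ceiling borrowed from the figure in the previous paragraph.

const MAX_CONCURRENT_PAGES = 30; // roughly what one Chromium handles comfortably
let activePages = 0;
const waiting: Array<() => void> = [];
 
// Acquire a slot before opening a page; callers beyond the cap wait in line.
export async function acquirePage(): Promise<void> {
  if (activePages < MAX_CONCURRENT_PAGES) {
    activePages++;
    return;
  }
  // The slot is handed over directly in releasePage(), so no increment here.
  await new Promise<void>((resolve) => waiting.push(resolve));
}
 
// Release the slot in the same finally block as page.close().
export function releasePage(): void {
  const next = waiting.shift();
  if (next) {
    next(); // hand the slot straight to the next waiting render
  } else {
    activePages--;
  }
}
 
// Queue depth for the metrics section below: how many renders are waiting.
export function queueDepth(): number {
  return waiting.length;
}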

On Linux containers, --disable-dev-shm-usage is mandatory. The default /dev/shm is 64 MB and Chromium uses it as scratch space. Without that flag you will get random Page crashed! errors on the second or third concurrent render. Increasing /dev/shm to 1 GB also works (--shm-size=1g in Docker), but the flag is simpler.

How do I avoid cold starts on serverless?

You don't. Cold starts on serverless platforms with bundled Chromium take 1.2-2.5 seconds because Lambda or Vercel must extract a 300 MB layer, mount it, and spawn the binary. Even with @sparticuz/chromium, the warm path is fast but the cold path is brutal.

Three strategies actually work:

  1. Provisioned concurrency (AWS Lambda) keeps N instances warm. Cost scales linearly with N, but cold start drops to zero.
  2. Long-lived workers (ECS, Fly Machines, Railway) run a singleton browser for hours or days. No cold starts after the first request.
  3. Hosted API keeps the warm pool on someone else's machine. PDF4.dev's /api/v1/render endpoint runs against a permanently-warm Chromium pool, so the only latency is the network round-trip plus the actual render.

The right answer depends on traffic shape. Bursty traffic with long idle periods favours option 3. Steady traffic favours option 2. If you need option 1, you are usually two months away from outgrowing serverless anyway.

How do I detect and fix memory leaks?

Memory leaks in PDF workers come from three places: forgotten pages, cached compiled templates that never expire, and Chromium itself leaking 200-500 KB per render even when you do everything right. The first two are bugs. The third is a property of the engine and you have to plan around it.

The detection pattern is straightforward: log RSS every minute and chart it. A healthy worker oscillates between 280 MB and 320 MB. A leaking worker climbs by ~100 KB per render, so 1000 renders later you are at 380 MB and rising.

setInterval(() => {
  const rss = process.memoryUsage().rss / 1024 / 1024;
  console.log(`[mem] rss=${rss.toFixed(0)}MB renders=${renderCount}`);
  if (rss > 1500) {
    console.warn("[mem] RSS over 1.5GB, exiting for restart");
    process.exit(0); // process manager restarts us
  }
}, 60_000);

The process.exit(0) line is the trick. Let your process manager (PM2, systemd, Kubernetes, the Railway runner) restart the worker. Kubernetes does this for you with a memory limit on the pod. PM2 does it with --max-memory-restart 1500M. Either way, the recycle is cheap because incoming requests just route to the other workers in the pool.
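
The checklist at the end of this guide also calls for a graceful shutdown that closes the browser before exit. A minimal sketch, assuming the browser singleton from the pool code above is in scope:

// Close the Chromium singleton before the process manager replaces us,
// so no zombie browser processes linger after the worker exits.
async function shutdown(signal: string): Promise<void> {
  console.log(`[shutdown] received ${signal}, closing browser`);
  if (browser) {
    await browser.close().catch(() => {});
    browser = null;
  }
  process.exit(0);
}
 
process.on("SIGTERM", () => void shutdown("SIGTERM"));
process.on("SIGINT", () => void shutdown("SIGINT"));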

What metrics should I track?

Render-time histograms, queue depth, error rate, and worker memory. Everything else is secondary. The three queries you should be able to answer in 30 seconds:

  1. What is the p99 render duration over the last hour?
  2. How many renders are queued right now?
  3. Which error types are climbing today?

A minimal Prometheus setup:

import { Histogram, Counter, Gauge } from "prom-client";
 
const renderDuration = new Histogram({
  name: "pdf_render_duration_seconds",
  help: "PDF render duration",
  buckets: [0.1, 0.2, 0.3, 0.5, 1, 2, 5, 10, 30],
});
 
const renderErrors = new Counter({
  name: "pdf_render_errors_total",
  help: "PDF render errors",
  labelNames: ["error_type"],
});
 
const browserPages = new Gauge({
  name: "pdf_browser_pages",
  help: "Number of open browser pages",
});
 
export async function renderPdfInstrumented(html: string): Promise<Buffer> {
  const end = renderDuration.startTimer();
  browserPages.inc();
  try {
    return await renderPdf(html);
  } catch (err) {
    renderErrors.inc({ error_type: classify(err) });
    throw err;
  } finally {
    end();
    browserPages.dec();
  }
}
 
function classify(err: unknown): string {
  const msg = err instanceof Error ? err.message : String(err);
  if (msg.includes("timeout")) return "timeout";
  if (msg.includes("Page crashed")) return "page_crash";
  if (msg.includes("Target closed")) return "browser_disconnect";
  return "unknown";
}

Alert thresholds we run in production: p99 above 2 seconds for 5 minutes, error rate above 1 percent for 5 minutes, queue depth above 50 for 1 minute, RSS above 1.4 GB on any single worker.

How do I sanitize errors before logging or returning them?

PDF rendering errors leak filesystem paths, container hostnames, and stack traces. Strip them before they hit logs or response bodies. The PDF4.dev pipeline runs every error through a sanitizeErrorMessage() helper that removes absolute paths, source locations, and stack tails:

export function sanitizeErrorMessage(input: unknown): string {
  let msg = input instanceof Error ? input.message : String(input);
 
  msg = msg
    .replace(/\/(?:app|home|workspaces|root|usr)\/[\w./-]+/g, "[path]")
    .replace(/[A-Z]:\\[\w\\.-]+/g, "[path]")
    .replace(/file:\/\/\/[^\s)]+/g, "[file]")
    .replace(/:\d+:\d+/g, "")
    .replace(/\n\s+at\s.+/g, "");
 
  msg = msg.split("\n").find((line) => line.trim()) ?? "";
  return msg.slice(0, 500);
}

Apply it at three points: writing to the log table, sending to your error tracker, and serializing the error response to the client. Skipping any of those leaks infrastructure detail to a customer or, worse, to another tenant in a multi-tenant deployment.
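
At the response-serialization point, the sanitized message pairs with the typed error shape from the checklist (error.type and error.code). A sketch of what that could look like; the type and code values here are illustrative, not the PDF4.dev wire format.

import { NextResponse } from "next/server";
 
// Map a render failure to a typed, sanitized JSON error response.
export function errorResponse(err: unknown): NextResponse {
  const message = sanitizeErrorMessage(err);
  const type = message.includes("timeout") ? "render_timeout" : "render_error";
  return NextResponse.json(
    { error: { type, code: "RENDER_FAILED", message } },
    { status: 500 },
  );
}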

How much does it cost to run at scale?

Self-hosting cost is dominated by RAM, not CPU. A single worker handles 30+ concurrent renders inside one Chromium and uses 300-400 MB of RSS. At 8 workers per machine you need 4 GB just for browsers, plus headroom for Node and the OS. A typical $40/month VPS with 8 GB of RAM handles roughly 1 million renders per month before saturation.

Volume / month | Self-host        | PDF4.dev API   | Engineering time
10k            | $40 (1 VPS)      | $0 (free tier) | 1-2 days setup, ongoing on-call
100k           | $40 (1 VPS)      | $20-50         | 3-5 days hardening, ongoing on-call
500k           | $80 (2 VPS + LB) | $100-200       | Dedicated browser-pool work
5M             | $400 (cluster)   | $800-1500      | Dedicated SRE work

The engineering-time column is the one that matters. The first 100k renders are cheap. The transition from "it works" to "it holds at p99 under 800 ms with 99.9 percent uptime" costs weeks. The transition from "it holds" to "it scales horizontally with on-call rotation" costs months. Most teams underestimate both.

Production-readiness checklist

A 24-item table you can paste into a runbook. Every item is a real failure mode we have hit or seen.

Area          | Item                                                   | Done?
Browser       | Singleton Chromium per worker                          | ☐
Browser       | --no-sandbox and --disable-dev-shm-usage flags         | ☐
Browser       | Worker recycle at 5000 renders or 1.5 GB RSS           | ☐
Browser       | disconnected event handler that nulls the singleton    | ☐
Browser       | page.close() in finally on every render                | ☐
Rendering     | waitUntil: "load" not networkidle                      | ☐
Rendering     | 30-second hard timeout on setContent and pdf           | ☐
Rendering     | Embedded fonts via base64 @font-face                   | ☐
Rendering     | Print stylesheet authored against @page                | ☐
Caching       | Content-hash render cache (24 h TTL)                   | ☐
Caching       | Compiled template cache (in-memory)                    | ☐
Reliability   | Idempotency-Key header support on POST                 | ☐
Reliability   | Exponential backoff with jitter, max 3 retries         | ☐
Reliability   | Sanitized error messages (no paths or stack tails)     | ☐
Reliability   | Typed error responses with error.type and error.code   | ☐
Observability | Render duration histogram (p50/p95/p99)                | ☐
Observability | Error counter labelled by error type                   | ☐
Observability | RSS gauge per worker, alerting at 1.4 GB               | ☐
Observability | Queue depth gauge                                      | ☐
Operations    | Process manager configured to restart on exit          | ☐
Operations    | Health check endpoint that calls browser.isConnected() | ☐
Operations    | Graceful shutdown closes browser before exit           | ☐
Security      | Render endpoint behind API key auth, never anonymous   | ☐
Security      | Per-tenant rate limits with Retry-After headers        | ☐

If any item is unchecked when you ship, it will eventually wake someone up at 3 a.m.

How does the PDF4.dev architecture handle all of this?

We built PDF4.dev because we wanted to stop solving these problems on every project. The production stack uses a singleton Chromium per Next.js process, a 24-hour disk-backed render cache with HMAC-signed delivery URLs, sanitized errors persisted to SQLite, per-org rate limits, and the same Idempotency-Key pattern shown above. The internals are documented in how we render PDFs in under 300 ms and how we built an MCP server for PDF generation.

The point of the architecture is not that it is novel. It is that all 24 checklist items are already done. You wire your application to one HTTPS endpoint and stop thinking about browser pools.

FAQ

How do I prevent Chromium memory leaks in long-lived workers?

Close the page in a finally block, never the browser. Recycle the worker after a fixed number of renders (5000) or at an RSS threshold (1.5 GB). Run one Chromium per worker.

What is the right waitUntil value?

load. Use domcontentloaded only for fully self-contained HTML. Avoid networkidle because it hangs on slow asset 404s.

How do I implement idempotency for a render endpoint?

Accept Idempotency-Key, store the response keyed by that header for 24 hours, return the cached bytes on retry.

What metrics should I track?

Render duration p50/p95/p99, queue depth, browser-page count, RSS per worker, error rate by type.

One browser per worker or one shared browser?

One browser per worker. Sharing across processes is error-prone and the second browser only costs 250 MB of RSS.

How long should a render take before timeout?

30 seconds hard cap. Most PDFs finish in 200-500 ms.

Self-host or use an API?

The crossover is around 100-200k renders per month, dominated by engineering time, not infra cost.

Try it without the operational burden

The fastest way to validate the patterns in this guide is to run them against a service that already implements them. The free HTML to PDF tool runs the same Playwright pipeline with the same browser pool, the same timeouts, and the same error sanitization. When you are ready to wire it into your application, the render endpoint takes the same HTML.

