Rendering 100 invoices in 2k tokens with code execution and MCP

Applying Anthropic's code-execution-with-MCP pattern to PDF4.dev for batch document generation. The architecture, the token math, the security model, and concrete TypeScript code.

A user opens Claude Desktop, drops in a CSV of 100 line items, and asks for an invoice PDF per row. The naive path is for the model to call render_pdf a hundred times. Each call costs context, each response costs context, and by row thirty the window is crowded with stale results. For scale, the direct tool-call workflow in Anthropic's published example consumed around 150,000 tokens.

In November 2025 Anthropic published a different pattern. Expose the MCP server as a code module, let the model write a ten-line script that does the loop, run the script in a sandbox, return only the final summary. Same workload, 2,000 tokens. A 98.7% reduction.

This article walks through what the pattern is, how it maps onto PDF4.dev's MCP server, the security model, and concrete TypeScript that drives a batch invoice run.

Why naive tool-call loops break at scale

The standard MCP tool-use loop looks like this: the model emits a tool call, the client forwards it to the server, the server runs, the result is appended to the conversation, the model reads it and decides the next call. Every step passes through the model's context window.

For "render one invoice", that loop is fine. For "render 100 invoices", the trace looks roughly like:

turn 1: model → tool_use(render_pdf, {row: 0})
turn 1: tool_result → { url: "https://...", size_bytes: 47000 }
turn 2: model → tool_use(render_pdf, {row: 1})
turn 2: tool_result → { url: "https://...", size_bytes: 48000 }
...
turn 100: model → tool_use(render_pdf, {row: 99})
turn 100: tool_result → { url: "https://...", size_bytes: 51000 }

Three things go wrong as the batch grows. First, every tool result stays in scope. The hundredth call sees the first ninety-nine results still in context. Second, the model has to re-read the schema and the tool description on every turn to pick the next call, so the system prompt overhead multiplies. Third, latency compounds because each round trip is sequential by design: the model waits for tool result N before deciding tool call N+1.

By invoice thirty the context is so polluted that the model starts hallucinating row data, dropping fields, or repeating renders. Push further and, depending on the model's window and the size of each result, the provider eventually refuses to continue because the window is full.

The code-execution-with-MCP pattern

Anthropic's writeup reframes the MCP server. Instead of a remote service the model pokes one call at a time, the server is surfaced as a code module the model imports inside a sandboxed runtime. The model writes a script, the script runs, only the script's return value lands back in the context window.

The flow becomes:

| Phase | Where it runs | What lands in model context |
| --- | --- | --- |
| Discover tools | Model + MCP server | Tool list and types, once |
| Plan and write script | Model | The script source, once |
| Execute script | Sandbox runtime | Nothing, runs locally |
| Loop and call MCP | Sandbox to MCP server | Nothing, results stay in sandbox memory |
| Return result | Sandbox to model | One summary object |

The model never sees the per-row tool calls. It never sees the per-row results. It sees the script it wrote, and the array of URLs the script returned. That is the whole reduction.

Applied to PDF4.dev: the architecture

The PDF4.dev MCP server already exposes render_pdf, list_templates, get_template, and the rest of the catalog over Streamable HTTP. The full server is documented in "How we built an MCP server for PDF generation".

For code execution, nothing about the server changes. The change is on the client side: instead of routing tool calls one at a time, the host (Claude Desktop, Cursor, a custom orchestrator) spawns a sandbox, hands it a typed wrapper around the MCP server, and lets the model script against it.

The shape is:

Model (Claude)
   |
   v
Sandbox runtime (Deno, Bun, Node-with-isolated-vm, or Anthropic sandbox-runtime)
   |  imports
   v
mcp:pdf4dev module  ->  HTTPS  ->  https://pdf4.dev/api/mcp
                                    render_pdf, list_templates, etc.

The model emits a script. The sandbox imports mcp:pdf4dev (or whatever name the host wires up), which is a typed proxy that forwards calls to https://pdf4.dev/api/mcp over JSON-RPC. The script runs the loop, awaits all 100 renders in parallel, and returns a small summary. Only that summary travels back to the model.
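
The typed proxy described above can be sketched roughly as follows. This is a minimal illustration, not PDF4.dev's or any host's actual implementation: createToolProxy, makePdf4devModule, the Transport type, and the RenderArgs/RenderResult shapes are all assumptions made for the example. The only protocol detail it relies on is that MCP tool invocations are JSON-RPC 2.0 requests with method tools/call and name/arguments params. The HTTP layer (session setup, auth headers) is left to an injected transport, and a real proxy would also unwrap the tool result's content envelope.

```typescript
// Hypothetical sketch of a typed proxy over an MCP server.
type ToolCallRequest = {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
};

type ToolCallResponse = {
  id: number;
  result?: unknown;
  error?: { code: number; message: string };
};

// The transport is injected so the sketch makes no claim about HTTP details.
type Transport = (req: ToolCallRequest) => Promise<ToolCallResponse>;

function createToolProxy(send: Transport) {
  let nextId = 1;
  return async function callTool<T>(
    name: string,
    args: Record<string, unknown>,
  ): Promise<T> {
    const req: ToolCallRequest = {
      jsonrpc: "2.0",
      id: nextId++,
      method: "tools/call",
      params: { name, arguments: args },
    };
    const res = await send(req);
    if (res.error) {
      throw new Error(`${name} failed (${res.error.code}): ${res.error.message}`);
    }
    // Assumption: the transport has already unwrapped the result payload.
    return res.result as T;
  };
}

// Assumed argument and result shapes for the render tool.
type RenderArgs = {
  template_id: string;
  data: Record<string, unknown>;
  delivery: "url" | "base64";
};
type RenderResult = { url: string };

// What the sandbox would expose to the model's script as a module.
function makePdf4devModule(send: Transport) {
  const callTool = createToolProxy(send);
  return {
    renderPdf: (args: RenderArgs) => callTool<RenderResult>("render_pdf", args),
  };
}
```

Injecting the transport keeps the proxy testable with a fake and lets the host decide where network policy is enforced.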

Walkthrough with code

Here is the script the model writes for the 100-invoice batch. The import specifier "mcp:pdf4dev" is a host-specific convention for exposing an MCP server as a module. Hosts differ here: Anthropic's writeup, for example, presents each server as a directory of generated TypeScript files that the script imports by path.

import { renderPdf } from "mcp:pdf4dev";

// 100 rows of invoice data, already loaded from CSV or DB.
// loadInvoiceData() stands in for whatever loader the host provides.
const invoices = await loadInvoiceData();

// Catch per row so one bad invoice cannot reject the whole Promise.all.
const results = await Promise.all(
  invoices.map(async (row) => {
    try {
      const res = await renderPdf({
        template_id: "tmpl_invoice",
        data: row,
        delivery: "url",
      });
      return { ok: true as const, id: row.id, url: res.url };
    } catch (err) {
      return { ok: false as const, id: row.id, error: String(err) };
    }
  }),
);

// Branch on the `ok` tag inside the callback so each side type-checks.
const urls = results.flatMap((r) => (r.ok ? [{ id: r.id, url: r.url }] : []));
const errors = results.flatMap((r) => (r.ok ? [] : [{ id: r.id, error: r.error }]));

// The top-level return assumes the host wraps the script in an async function.
return {
  total: invoices.length,
  succeeded: urls.length,
  failed: errors.length,
  urls,
  errors,
};

The contrast is the point. In code execution mode the model emits one message: the script. The sandbox emits one return value: the summary. The naive path emits a hundred messages and reads a hundred results.

Token math

The numbers below are illustrative, loosely anchored on Anthropic's published 150k-to-2k example. Per-call overheads vary by model and by tool schema, but the slope is consistent: linear growth in the naive path, roughly flat growth in the code-execution path.

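| Batch size | Naive input tokens | Naive output tokens | Code-exec input tokens | Code-exec output tokens | Reduction (total tokens) |
| --- | --- | --- | --- | --- | --- |
| 10 | 12,000 | 3,000 | 1,400 | 200 | ~89% |
| 50 | 65,000 | 14,000 | 1,800 | 300 | ~97% |
| 100 | 135,000 | 30,000 | 2,000 | 400 | ~98.5% |
| 500 | (window full) | (window full) | 3,000 | 800 | n/a |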
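
The reduction column can be reproduced from the other four, which makes the arithmetic explicit: reduction is one minus the ratio of total code-execution tokens to total naive tokens. A small sketch using the table's illustrative figures (the Row type and function name are made up for this example):

```typescript
type Row = {
  batch: number;
  naiveIn: number;
  naiveOut: number;
  execIn: number;
  execOut: number;
};

// Rows copied from the table above (illustrative figures, not measurements).
const rows: Row[] = [
  { batch: 10, naiveIn: 12_000, naiveOut: 3_000, execIn: 1_400, execOut: 200 },
  { batch: 50, naiveIn: 65_000, naiveOut: 14_000, execIn: 1_800, execOut: 300 },
  { batch: 100, naiveIn: 135_000, naiveOut: 30_000, execIn: 2_000, execOut: 400 },
];

// Reduction = 1 - (code-exec total / naive total), as a percentage.
function reductionPercent(r: Row): number {
  const naive = r.naiveIn + r.naiveOut;
  const exec = r.execIn + r.execOut;
  return (1 - exec / naive) * 100;
}

for (const r of rows) {
  console.log(`batch ${r.batch}: ${reductionPercent(r).toFixed(1)}%`);
}
// batch 10: 89.3%
// batch 50: 97.3%
// batch 100: 98.5%
```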

Latency follows the same shape. Naive mode is sequential by design: 100 round trips through the model. Code execution runs the renders in parallel inside the sandbox, so the wall-clock time is dominated by the slowest single render rather than the sum. On a typical PDF4.dev render of 400ms, a 100-row batch finishes in seconds instead of minutes.

The other column the table does not show is cost. At batch size 100 the table puts naive mode at about 165,000 tokens against 2,400 for code execution, roughly a 70-fold difference, and API spend scales accordingly. The break-even is around batch size 3 to 5: below that the sandbox startup overhead matters, above that code execution wins on every dimension.

Picking delivery mode for batch

PDF4.dev's render_pdf tool accepts delivery: "base64" | "url". The full design is covered in "How we built our MCP server".

| Delivery | Response size per render | Best for (in code execution) |
| --- | --- | --- |
| "base64" | PDF size times 1.33, plus JSON envelope | Under 50 PDFs and total payload under sandbox memory budget |
| "url" | A few hundred bytes | 50 PDFs or more, or any single PDF over 1MB |

Inside a sandbox the practical limit is memory, not context window. A hundred 200KB invoices in base64 is roughly 27MB of strings sitting in the runtime heap while the script awaits all promises. Most hosted sandboxes cap memory at 256MB to 1GB. Use delivery: "url" for fifty renders or more and let the agent return the list of links instead of the bytes. The signed URLs are valid for 24 hours, which is plenty for the user to download or for a downstream pipeline to fetch.
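
The selection rule in the table can be written down as a small helper. This is a sketch under stated assumptions: pickDelivery and memoryBudgetBytes are hypothetical names, the budget is whatever share of sandbox memory the host is willing to spend on PDF payloads, and the JSON envelope is ignored.

```typescript
type Delivery = "base64" | "url";

const MB = 1024 * 1024;

// Base64 encodes every 3 bytes as 4 characters (padded), hence the ~1.33 factor.
function base64Length(bytes: number): number {
  return Math.ceil(bytes / 3) * 4;
}

// Thresholds mirror the table above; memoryBudgetBytes is an assumed,
// host-specific budget for encoded PDF payloads held in the sandbox.
function pickDelivery(pdfSizesBytes: number[], memoryBudgetBytes: number): Delivery {
  if (pdfSizesBytes.length >= 50) return "url";
  if (pdfSizesBytes.some((size) => size > 1 * MB)) return "url";
  const totalEncoded = pdfSizesBytes.reduce(
    (sum, size) => sum + base64Length(size),
    0,
  );
  return totalEncoded < memoryBudgetBytes ? "base64" : "url";
}

// A hundred 200KB invoices: the count alone forces "url".
const hundred = new Array<number>(100).fill(200 * 1024);
console.log(pickDelivery(hundred, 256 * MB)); // "url"

// Ten 200KB invoices fit comfortably in a 64MB budget.
const ten = new Array<number>(10).fill(200 * 1024);
console.log(pickDelivery(ten, 64 * MB)); // "base64"
```

In practice PDF sizes are only known after rendering, so a script would more likely estimate from a sample render or simply default to "url" for batches.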

Security model: what the sandbox sees

Giving an agent a code runtime is not a free lunch. Anything the script can do, the agent can do, and prompt injection in upstream data can turn a benign batch job into an exfiltration attempt. The pattern only works if the sandbox is real.

Three boundaries matter:

| Boundary | What it blocks | How to enforce |
| --- | --- | --- |
| Filesystem | Reading host secrets, writing to system paths | OS primitives: bubblewrap on Linux, sandbox-exec on macOS. Anthropic ships sandbox-runtime as a research preview that wraps both. |
| Network | Hitting arbitrary hosts, leaking data | Allow-list only the MCP server origins the agent needs. For PDF4.dev that is pdf4.dev and *.pdf4.dev for the render-URL host. |
| Capabilities | Calling unintended MCP tools | Mount only the tools the job needs. For a render-only batch, render_pdf is enough. Hide delete_template and the CRUD tools. |
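
The capabilities boundary is the easiest of the three to illustrate in code. A minimal sketch, assuming the host holds its tools in a plain object keyed by name (mountTools, ToolFn, and ToolSet are hypothetical, not part of any host API):

```typescript
type ToolFn = (args: Record<string, unknown>) => Promise<unknown>;
type ToolSet = Record<string, ToolFn>;

// Returns a new frozen object holding only the allow-listed tools, so the
// script has no reference to anything else.
function mountTools(all: ToolSet, allowed: readonly string[]): ToolSet {
  const mounted: ToolSet = {};
  for (const name of allowed) {
    // Own-property check so inherited names like "toString" are rejected.
    if (!Object.prototype.hasOwnProperty.call(all, name)) {
      throw new Error(`Unknown tool: ${name}`);
    }
    mounted[name] = all[name];
  }
  return Object.freeze(mounted);
}

// Stub tools standing in for the real server catalog.
const catalog: ToolSet = {
  render_pdf: async () => ({ url: "https://example.test/a.pdf" }),
  delete_template: async () => ({ deleted: true }),
};

const renderOnly = mountTools(catalog, ["render_pdf"]);
console.log(Object.keys(renderOnly)); // ["render_pdf"]
```

Filtering on the host side is a convenience, not a security boundary on its own; the scoped API key is what actually stops a call that slips through.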

On the PDF4.dev side, the second line of defence is the API key scope. Issue a render_only key for the batch job. The key resolves to a user but cannot delete templates, cannot create new templates, cannot call any CRUD endpoint. Even if the sandbox is breached, the blast radius is one HTTP origin and one read-only-plus-render capability.

Three practical rules:

  1. One sandbox per job, torn down on exit. Do not reuse a long-lived runtime across user sessions.
  2. One MCP server per sandbox unless they genuinely need to compose. Two servers, two attack surfaces.
  3. Per-request CPU and memory caps. A model that writes while (true) renderPdf(...) should hit a limit, not eat your bill.
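
CPU and memory caps belong to the sandbox itself, but the host can add a cheap in-process guard on top by limiting how many times the script may call a tool. The wrapper below is a hypothetical sketch (withCallBudget is not a PDF4.dev or host API). It bounds billed renders only; a runaway loop still burns CPU until the sandbox limit stops it.

```typescript
// Wraps an async function so it can be invoked at most maxCalls times.
function withCallBudget<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  maxCalls: number,
): (...args: A) => Promise<R> {
  let used = 0;
  return async (...args: A) => {
    if (used >= maxCalls) {
      // Rejects without touching the underlying function.
      throw new Error(`Call budget of ${maxCalls} exhausted`);
    }
    used += 1;
    return fn(...args);
  };
}
```

For a 100-row batch the host might hand the script a renderPdf wrapped with a budget slightly above 100, leaving room for a retry pass but not for a loop that never ends.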

When NOT to use code execution

The pattern is not free, and it is not always right.

| Scenario | Why naive tool calls win |
| --- | --- |
| One PDF | Sandbox startup is a few hundred ms. A single render is faster as a direct tool call. |
| Interactive UX where the user wants progress | The script returns once. Mid-loop progress needs a streaming channel the host has to plumb. |
| Per-row reasoning | If the model needs to decide what to do with each failure, that decision is a model turn. A loop in code cannot reason. |
| Highly variable schemas per row | If every invoice has a different template and the model needs to pick it per row, that is a model call per row. Code execution adds no benefit there. |

The rough mental model: code execution is great for fan-out, bad for branching. If the script is a clean map, use it. If it is a state machine with model-level branches, stay in tool-call mode or split the job.

Error handling inside the sandbox

The big risk in a parallel batch is one bad row poisoning the whole run. Promise.all rejects on the first failure, which is the wrong shape for "render what you can, report the rest".
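
The difference is easy to see side by side. The sketch below uses a stand-in fakeRender (not the real render_pdf) and shows the standard Promise.allSettled as one way to get the "render what you can" shape; per-row try/catch achieves the same result.

```typescript
// Stand-in for a render call: rejects for one specific row.
async function fakeRender(id: number): Promise<string> {
  if (id === 2) throw new Error(`row ${id} failed`);
  return `https://example.test/${id}.pdf`;
}

async function renderAllOrNothing(ids: number[]): Promise<string[]> {
  // Rejects as soon as any render rejects; the other URLs are lost.
  return Promise.all(ids.map((id) => fakeRender(id)));
}

async function renderWhatYouCan(ids: number[]) {
  // Never rejects; each entry reports "fulfilled" or "rejected".
  const settled = await Promise.allSettled(ids.map((id) => fakeRender(id)));
  const urls: string[] = [];
  const errors: { id: number; error: string }[] = [];
  settled.forEach((outcome, i) => {
    if (outcome.status === "fulfilled") {
      urls.push(outcome.value);
    } else {
      errors.push({ id: ids[i], error: String(outcome.reason) });
    }
  });
  return { urls, errors };
}
```

With ids [1, 2, 3], renderAllOrNothing rejects, while renderWhatYouCan resolves with two URLs and one error entry for row 2.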

The script in the walkthrough above wraps each call in try/catch and returns a tagged result. The summary the model sees has one field for successes and one for failures, and the failure entries carry the row id and the error rendered as a string. The model can then decide whether to retry the failed rows in a second pass or hand the list back to the user.

return {
  total: 100,
  succeeded: 97,
  failed: 3,
  urls: [{ id: "INV-001", url: "..." }, /* ... 96 more */],
  errors: [
    { id: "INV-042", error: "render_failed: missing variable 'tax_rate'" },
    { id: "INV-058", error: "render_failed: invalid date in line items" },
    { id: "INV-071", error: "render_failed: total exceeds template max width" },
  ],
};

The PDF4.dev MCP server returns structured errors with type and code fields, so the script can branch on type === "invalid_request_error" and retry with a fix rather than treating every failure as fatal.
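
A retry-with-fix branch might look like the sketch below. Only the existence of type and code fields and the "invalid_request_error" value come from the text above. Everything else is an assumption for illustration: that the error surfaces as a thrown object, the ToolError shape, the renderWithOneFix helper, and any specific code values a caller matches on.

```typescript
// Assumed error shape: only `type` and `code` are stated in the article.
type ToolError = { type: string; code: string; message?: string };

function isToolError(err: unknown): err is ToolError {
  return (
    typeof err === "object" &&
    err !== null &&
    typeof (err as { type?: unknown }).type === "string" &&
    typeof (err as { code?: unknown }).code === "string"
  );
}

// Renders a row; on a request error, asks `fix` for a patched row and
// retries exactly once. Any other failure propagates unchanged.
async function renderWithOneFix<Row, Out>(
  render: (row: Row) => Promise<Out>,
  row: Row,
  fix: (row: Row, err: ToolError) => Row | null,
): Promise<Out> {
  try {
    return await render(row);
  } catch (err) {
    // Only request errors are worth retrying with changed input.
    if (!isToolError(err) || err.type !== "invalid_request_error") throw err;
    const patched = fix(row, err);
    if (patched === null) throw err;
    return render(patched); // single retry; a second failure propagates
  }
}
```

Keeping the fix function separate from the retry logic means the script decides what a repair looks like (for example, defaulting a missing field), while the helper guarantees the loop cannot retry forever.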

Connecting it to MCP 2025-11-25 async tasks

For batches over a few hundred PDFs the sandbox itself becomes the constraint: the script needs to stay running for minutes, and many host runtimes cap at 60 to 120 seconds. The MCP 2025-11-25 specification adds async tasks: a tool can return a task handle, the caller polls or subscribes for completion, the agent never blocks waiting for the long-running job.

The pattern composes cleanly with code execution. The script inside the sandbox fires the batch as an async task, gets a task ID back, returns the ID to the model, and the model polls or waits for the manifest URL. That gives:

  • Short script runtime (start the batch, return the handle)
  • Server-side fan-out (PDF4.dev runs the loop, no sandbox memory pressure)
  • Final manifest the model can hand to the user as one URL

PDF4.dev does not expose async tasks yet, but the design intention is for render_pdf to accept a task mode that pairs with the code-execution sandbox for batches in the thousands.
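
Since no task mode exists on PDF4.dev yet, the following is purely a hypothetical sketch of the polling side. The TaskStatus shape, its field names, and waitForManifest are invented for illustration, and the status lookup is injected rather than tied to any real endpoint.

```typescript
// Hypothetical task status; field names are assumptions for this sketch.
type TaskStatus =
  | { state: "working" }
  | { state: "completed"; manifestUrl: string }
  | { state: "failed"; error: string };

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls until the task completes or fails, giving up after maxAttempts.
async function waitForManifest(
  getStatus: () => Promise<TaskStatus>,
  intervalMs: number,
  maxAttempts: number,
): Promise<string> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await getStatus();
    if (status.state === "completed") return status.manifestUrl;
    if (status.state === "failed") throw new Error(status.error);
    await sleep(intervalMs);
  }
  throw new Error(`Task still running after ${maxAttempts} polls`);
}
```

Capping the number of polls keeps whoever is waiting (the host, or a later short script) within its own runtime limit rather than blocking indefinitely.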

How to try this today on PDF4.dev

The PDF4.dev MCP server has been live since early 2026 and exposes 14 tools, 4 resources, and 3 prompts. For code execution specifically, the relevant ones are:

  • render_pdf with delivery: "url" for batch fan-out
  • list_templates to discover what is available before scripting
  • get_template if the script needs to inspect a template's variables

The end-to-end setup:

  1. Sign up at pdf4.dev and create a render_only API key in Settings.
  2. Connect the MCP server to a host that supports code execution. Claude Desktop with the latest sandbox runtime works. Cursor and the OpenAI Apps SDK have analogous capabilities under different names.
  3. Drop a CSV in the chat, ask for "one invoice PDF per row using template tmpl_invoice, return a list of signed URLs".
  4. The model picks code execution, writes a script like the one above, the sandbox runs it, you get a list of 100 URLs in a single response.

The full setup walkthrough for every supported client is on the AI integration page.

The summary of the summary: regular tool calls are great when the agent reasons between every action. Code execution is great when the agent's job is to do the same thing a hundred times. The two compose, they do not compete, and on the PDF4.dev MCP server the same render_pdf tool works for both modes. Pick the one that matches the shape of the job.
