What is "code execution with MCP"?

A pattern Anthropic published in November 2025. Instead of letting the model call MCP tools one by one through the standard tool-use loop, MCP servers are exposed as code modules inside a sandboxed runtime. The model writes a short TypeScript or Python script that imports the server like a library, runs the loop inside the sandbox, and returns only the final result to its context window.

How is it different from regular MCP tool calls?

A regular tool call round-trips through the model context: arguments go in, results come back, every byte counts against the window. With code execution, the loop runs inside the sandbox and only the final summary reaches the model. A hundred renders become one return value instead of a hundred response frames.

How much does it actually save?

Anthropic reports an internal workload moving from 150,000 tokens to 2,000 tokens, a 98.7% reduction. The savings scale with batch size: a single render saves nothing, while a batch of 100 saves almost two orders of magnitude.

Is it Anthropic-specific or does it work with OpenAI and Gemini?

The pattern is provider-agnostic. Any model that can write code and that has a sandboxed code-execution tool can use it. Anthropic published the design first and ships the most polished tooling for it, but OpenAI's Code Interpreter and Google Gemini's code execution can host the same flow with different glue code.

What are the security implications of giving an agent a sandbox?

The sandbox needs strict filesystem isolation, network allow-lists that include only the MCP servers you trust, no access to host secrets, and limited CPU and memory. Anthropic ships sandbox-runtime, a research preview that uses OS-level primitives like bubblewrap on Linux and sandbox-exec on macOS to enforce these boundaries without containers.

When should I NOT use code execution with MCP?

Skip it for single-document jobs, interactive flows where the user wants to see progress between renders, and runs where the model needs to reason about each individual failure. The fixed cost of opening a sandbox and writing a script dwarfs the savings on jobs under five or six tool calls.

Does PDF4.dev support this today?

Yes. The PDF4.dev MCP server exposes render_pdf with both base64 and signed-URL delivery, plus list_templates and the rest of the catalog. Any client that supports code execution with MCP, including Claude Desktop with the recent runtime, can fan out a batch render from inside the sandbox.

Why use signed URLs instead of base64 inside the sandbox?

Base64 inflates a PDF by 33% and lives in the sandbox memory until the script returns. For batches over fifty renders the cumulative memory pressure can crash the runtime. Signed URLs keep each response a few hundred bytes regardless of PDF size and the agent hands the list back to the user without ever materializing the bytes.


Rendering 100 invoices in 2k tokens with code execution and MCP

Applying Anthropic's code-execution-with-MCP pattern to PDF4.dev for batch document generation. The architecture, the token math, the security model, and concrete TypeScript code.

benoitded · June 2, 2026 · 13 min read

A user opens Claude Desktop, drops in a CSV of 100 line items, and asks for an invoice PDF per row. The naive path is for the model to call render_pdf a hundred times. Each call costs context, each response costs context, and by row thirty the window is shredded. The total bill on Anthropic's internal benchmark for this kind of workload sat around 150,000 tokens.

In November 2025 Anthropic published a different pattern. Expose the MCP server as a code module, let the model write a ten-line script that does the loop, run the script in a sandbox, return only the final summary. Same workload, 2,000 tokens. A 98.7% reduction.

This article walks through what the pattern is, how it maps onto PDF4.dev's MCP server, the security model, and concrete TypeScript that drives a batch invoice run.

Why naive tool-call loops break at scale

The standard MCP tool-use loop looks like this: the model emits a tool call, the client forwards it to the server, the server runs, the result is appended to the conversation, the model reads it, decides the next call. Every step happens through the model context window.

For "render one invoice", that loop is fine. For "render 100 invoices", the trace looks roughly like:

turn 1: model → tool_use(render_pdf, {row: 0})
turn 1: tool_result → { url: "https://...", size_bytes: 47000 }
turn 2: model → tool_use(render_pdf, {row: 1})
turn 2: tool_result → { url: "https://...", size_bytes: 48000 }
...
turn 100: model → tool_use(render_pdf, {row: 99})
turn 100: tool_result → { url: "https://...", size_bytes: 51000 }

Three things go wrong as the batch grows. First, every tool result stays in scope. The hundredth call sees the first ninety-nine results still in context. Second, the model has to re-read the schema and the tool description on every turn to pick the next call, so the system prompt overhead multiplies. Third, latency compounds because each round trip is sequential by design: the model waits for tool result N before deciding tool call N+1.

By invoice thirty the context is so polluted that the model starts hallucinating row data, dropping fields, or repeating renders. Push much further and, depending on the model's context window, the provider rejects the request outright because the window is full.

The code-execution-with-MCP pattern

Anthropic's writeup reframes the MCP server. Instead of a remote service the model pokes one call at a time, the server is surfaced as a code module the model imports inside a sandboxed runtime. The model writes a script, the script runs, only the script's return value lands back in the context window.

The flow becomes:

Phase	Where it runs	What lands in model context
Discover tools	Model + MCP server	Tool list and types, once
Plan and write script	Model	The script source, once
Execute script	Sandbox runtime	Nothing, runs locally
Loop and call MCP	Sandbox to MCP server	Nothing, results stay in sandbox memory
Return result	Sandbox to model	One summary object

The model never sees the per-row tool calls. It never sees the per-row results. It sees the script it wrote, and the array of URLs the script returned. That is the whole reduction.

Applied to PDF4.dev: the architecture

The PDF4.dev MCP server already exposes render_pdf, list_templates, get_template, and the rest of the catalog over Streamable HTTP. The full server is documented in how we built an MCP server for PDF generation.

For code execution, nothing about the server changes. The change is on the client side: instead of routing tool calls one at a time, the host (Claude Desktop, Cursor, a custom orchestrator) spawns a sandbox, hands it a typed wrapper around the MCP server, and lets the model script against it.

The shape is:

Model (Claude)
   |
   v
Sandbox runtime (Deno, Bun, Node-with-isolated-vm, or Anthropic sandbox-runtime)
   |  imports
   v
mcp:pdf4dev module  ->  HTTPS  ->  https://pdf4.dev/api/mcp
                                    render_pdf, list_templates, etc.

The model emits a script. The sandbox imports mcp:pdf4dev (or whatever name the host wires up), which is a typed proxy that forwards calls to https://pdf4.dev/api/mcp over JSON-RPC. The script runs the loop, awaits all 100 renders in parallel, and returns a small summary. Only that summary travels back to the model.
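A minimal sketch of what such a typed proxy could look like, assuming the host injects a `send` function that performs the actual HTTPS POST. The names `createPdf4devModule`, `RenderPdfArgs`, `RenderPdfResult`, and `Send` are hypothetical. Only the `tools/call` JSON-RPC envelope comes from the MCP specification, and the result shape is simplified to the fields this article uses.

```typescript
// JSON-RPC 2.0 envelope for an MCP tools/call request.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical transport: the host decides how the request reaches the server
// and unwraps the tool result before handing it back.
type Send = (req: ToolCallRequest) => Promise<unknown>;

// Argument and result shapes are assumptions based on the fields used in this article.
type RenderPdfArgs = {
  template_id: string;
  data: Record<string, unknown>;
  delivery: "base64" | "url";
};
type RenderPdfResult = { url?: string; size_bytes?: number };

let nextId = 1;

// Builds the request body for one tool invocation.
function buildToolCall(name: string, args: Record<string, unknown>): ToolCallRequest {
  return { jsonrpc: "2.0", id: nextId++, method: "tools/call", params: { name, arguments: args } };
}

// Returns the object a script would see when it imports the module.
function createPdf4devModule(send: Send) {
  return {
    renderPdf: async (args: RenderPdfArgs): Promise<RenderPdfResult> =>
      (await send(buildToolCall("render_pdf", { ...args }))) as RenderPdfResult,
  };
}
```

Injecting the transport keeps the proxy testable without a network, and it gives the host a single place to attach the API key and enforce the network allow-list.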

Walkthrough with code

Here is the script the model writes for the 100-invoice batch. The import specifier "mcp:pdf4dev" is a host-specific convention for accessing an MCP server as a module. Anthropic's writeup presents each server as a directory of generated TypeScript files that the script imports by relative path, and other hosts may use a different specifier.

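import { renderPdf } from "mcp:pdf4dev";

// 100 rows of invoice data, already loaded from CSV or DB.
// loadInvoiceData() stands in for however the host exposes those rows.
const invoices = await loadInvoiceData();

// Each row resolves to a tagged result, so Promise.all never rejects
// and one bad row cannot abort the whole batch.
const results = await Promise.all(
  invoices.map(async (row) => {
    try {
      const res = await renderPdf({
        template_id: "tmpl_invoice",
        data: row,
        delivery: "url",
      });
      return { ok: true as const, id: row.id, url: res.url };
    } catch (err) {
      return { ok: false as const, id: row.id, error: String(err) };
    }
  }),
);

// flatMap narrows the tagged union; a bare filter((r) => r.ok) does not
// on older TypeScript versions.
const urls = results.flatMap((r) => (r.ok ? [{ id: r.id, url: r.url }] : []));
const errors = results.flatMap((r) => (r.ok ? [] : [{ id: r.id, error: r.error }]));

// Assumes the host wraps the script in an async function, so this return
// value is what gets handed back to the model.
return {
  total: invoices.length,
  succeeded: urls.length,
  failed: errors.length,
  urls,
  errors,
};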

The contrast is the point. In code execution mode the model emits one message: the script. The sandbox emits one return value: the summary. The naive path emits a hundred messages and reads a hundred results.

Token math

The numbers below are illustrative, anchored on Anthropic's published 150k-to-2k reduction at 100 tool calls. Per-call overheads vary by model and by tool schema, but the slope is consistent: linear growth in the naive path, roughly flat growth in the code-execution path.

Batch size	Naive input tokens	Naive output tokens	Code-exec input	Code-exec output	Reduction
10	12,000	3,000	1,400	200	~89%
50	65,000	14,000	1,800	300	~97%
100	135,000	30,000	2,000	400	~98.5%
500	(window full)	(window full)	3,000	800	n/a

Latency follows the same shape. Naive mode is sequential by design: 100 round trips through the model. Code execution runs the renders in parallel inside the sandbox, so the wall-clock time is dominated by the slowest single render rather than the sum. On a typical PDF4.dev render of 400ms, a 100-row batch finishes in seconds instead of minutes.
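Fully parallel fan-out assumes the server and the sandbox both tolerate 100 requests in flight. Where they do not, a bounded pool keeps most of the latency win. The sketch below is one way to do that; `mapWithConcurrency` and `estimateWallClockMs` are hypothetical helpers, not part of any MCP SDK, and the estimate ignores network jitter and sandbox startup.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight.
// Results keep the input order. A rejected worker rejects the whole call,
// so workers should catch their own errors as the walkthrough script does.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each lane claims the next unprocessed index until none are left.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// Rough wall-clock estimate: waves of `limit` renders, each taking `renderMs`.
function estimateWallClockMs(count: number, limit: number, renderMs: number): number {
  return Math.ceil(count / limit) * renderMs;
}
```

With the 400ms render time quoted above, `estimateWallClockMs(100, 10, 400)` gives 4,000ms for ten lanes, against 40,000ms of render time alone when the calls run one after another.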

The other column the table does not show is cost. At batch size 100, naive mode uses roughly 70 times as many tokens as code execution on the figures above, which puts API spend close to two orders of magnitude apart. The break-even is around batch size 3 to 5: below that the sandbox startup overhead matters, above that code execution wins on every dimension.
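The reduction column can be recomputed from the other four columns. The sketch below copies the illustrative figures from the table, so it checks the arithmetic rather than measuring anything; `reductionPercent` and `tokenRatio` are names made up for this example.

```typescript
interface TokenRow {
  batch: number;
  naiveIn: number;
  naiveOut: number;
  execIn: number;
  execOut: number;
}

// Illustrative figures copied from the table above.
const tokenRows: TokenRow[] = [
  { batch: 10, naiveIn: 12_000, naiveOut: 3_000, execIn: 1_400, execOut: 200 },
  { batch: 50, naiveIn: 65_000, naiveOut: 14_000, execIn: 1_800, execOut: 300 },
  { batch: 100, naiveIn: 135_000, naiveOut: 30_000, execIn: 2_000, execOut: 400 },
];

// Share of total tokens saved, as a percentage rounded to one decimal place.
function reductionPercent(r: TokenRow): number {
  const naive = r.naiveIn + r.naiveOut;
  const exec = r.execIn + r.execOut;
  return Math.round((1 - exec / naive) * 1000) / 10;
}

// How many times larger the naive total is than the code-execution total.
function tokenRatio(r: TokenRow): number {
  return (r.naiveIn + r.naiveOut) / (r.execIn + r.execOut);
}

for (const r of tokenRows) {
  console.log(`batch ${r.batch}: ${reductionPercent(r)}% fewer tokens`);
}
// prints 89.3%, 97.3% and 98.5% for batches of 10, 50 and 100
```

The ratio only counts tokens. Actual spend also depends on how a provider prices input against output tokens, which this sketch does not model.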

Picking delivery mode for batch

PDF4.dev's render_pdf tool accepts delivery: "base64" | "url". The full design is covered in how we built our MCP server.

Delivery	Response size per render	Best for in code execution
`"base64"`	PDF size times 1.33, plus JSON envelope	Under 50 PDFs and total payload under sandbox memory budget
`"url"`	A few hundred bytes	50 PDFs or more, or any single PDF over 1MB

Inside a sandbox the practical limit is memory, not context window. A hundred 200KB invoices in base64 is roughly 27MB of strings sitting in the runtime heap while the script awaits all promises. Most hosted sandboxes cap memory at 256MB to 1GB. Use delivery: "url" for anything over fifty renders and let the agent return the list of links instead of the bytes. The signed URLs are valid for 24 hours, which is plenty for the user to download or for a downstream pipeline to fetch.
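The 27MB figure and the thresholds in the table can be written down as a small helper. This is a sketch under stated assumptions: 200KB is taken as 200,000 bytes, 1MB as 1,000,000 bytes, and the function names are invented for illustration.

```typescript
type Delivery = "base64" | "url";

// Base64 encodes every 3 bytes as 4 characters, padding the final group.
function base64Length(bytes: number): number {
  return 4 * Math.ceil(bytes / 3);
}

// Total base64 characters held at once if every render returns inline bytes.
// This counts characters, not heap bytes; the runtime's string representation
// decides the real footprint.
function batchBase64Chars(count: number, pdfBytes: number): number {
  return count * base64Length(pdfBytes);
}

// Thresholds from the table above: 50 renders or more, or any PDF over 1MB.
function pickDelivery(count: number, largestPdfBytes: number): Delivery {
  return count >= 50 || largestPdfBytes > 1_000_000 ? "url" : "base64";
}

console.log(batchBase64Chars(100, 200_000)); // 26666800, roughly 27MB of characters
console.log(pickDelivery(100, 200_000)); // "url"
```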

Security model: what the sandbox sees

Giving an agent a code runtime is not a free lunch. Anything the script can do, the agent can do, and prompt injection in upstream data can turn a benign batch job into an exfiltration attempt. The pattern only works if the sandbox is real.

Three boundaries matter:

Boundary	What it blocks	How to enforce
Filesystem	Reading host secrets, writing to system paths	OS primitives: `bubblewrap` on Linux, `sandbox-exec` on macOS. Anthropic ships sandbox-runtime as a research preview that wraps both.
Network	Hitting arbitrary hosts, leaking data	Allow-list only the MCP server origins the agent needs. For PDF4.dev that is `pdf4.dev` and `*.pdf4.dev` for the render-URL host.
Capabilities	Calling unintended MCP tools	Mount only the tools the job needs. For a render-only batch, `render_pdf` is enough. Hide `delete_template` and the CRUD tools.

On the PDF4.dev side, the second line of defence is the API key scope. Issue a render_only key for the batch job. The key resolves to a user but cannot delete templates, cannot create new templates, cannot call any CRUD endpoint. Even if the sandbox is breached, the blast radius is one HTTP origin and one read-only-plus-render capability.

Three practical rules:

One sandbox per job, torn down on exit. Do not reuse a long-lived runtime across user sessions.
One MCP server per sandbox unless they genuinely need to compose. Two servers, two attack surfaces.
Per-request CPU and memory caps. A model that writes while (true) renderPdf(...) should hit a limit, not eat your bill.
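The network and capability boundaries can also be checked in the host's proxy layer, in addition to OS-level enforcement. The sketch below assumes the host controls every outbound request and every tool lookup; `isAllowedUrl`, `hostMatches`, and `mountTools` are hypothetical helpers, and a wildcard entry is treated as matching subdomains only.

```typescript
// True when `hostname` matches an allow-list entry.
// "*.pdf4.dev" matches any subdomain but not the bare domain.
function hostMatches(hostname: string, pattern: string): boolean {
  if (pattern.startsWith("*.")) {
    return hostname.endsWith(pattern.slice(1)); // compares against ".pdf4.dev"
  }
  return hostname === pattern;
}

function parseUrl(raw: string): URL | null {
  try {
    return new URL(raw);
  } catch {
    return null;
  }
}

// Network boundary: HTTPS only, and only to allow-listed hosts.
function isAllowedUrl(rawUrl: string, allowList: readonly string[]): boolean {
  const url = parseUrl(rawUrl);
  if (url === null || url.protocol !== "https:") return false;
  return allowList.some((pattern) => hostMatches(url.hostname, pattern));
}

type Tool = (args: Record<string, unknown>) => Promise<unknown>;

// Capability boundary: expose only the tools the job needs.
function mountTools(all: Record<string, Tool>, allowed: readonly string[]): Record<string, Tool> {
  const mounted: Record<string, Tool> = {};
  for (const name of allowed) {
    if (Object.prototype.hasOwnProperty.call(all, name)) mounted[name] = all[name];
  }
  return mounted;
}
```

A check like this is a second layer, not a replacement for the OS sandbox: code that escapes the proxy bypasses it entirely.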

When NOT to use code execution

The pattern is not free, and it is not always right.

Scenario	Why naive tool calls win
One PDF	Sandbox startup is a few hundred ms. A single render is faster as a direct tool call.
Interactive UX where the user wants progress	The script returns once. Mid-loop progress needs a streaming channel the host has to plumb.
Per-row reasoning	If the model needs to decide what to do with each failure, that decision is a model turn. A loop in code cannot reason.
Highly variable schemas per row	If every invoice has a different template and the model needs to pick it per row, that is a model call per row. Code execution adds no benefit there.

The rough mental model: code execution is great for fan-out, bad for branching. If the script is a clean map, use it. If it is a state machine with model-level branches, stay in tool-call mode or split the job.
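That rule of thumb can be written as a small decision function. It is a sketch, not a policy any host implements: the `JobShape` fields are invented for this example, and the threshold of five calls is an assumption taken from the break-even range quoted earlier.

```typescript
interface JobShape {
  toolCalls: number; // expected number of MCP calls
  needsPerRowReasoning: boolean; // the model must decide something per row
  needsLiveProgress: boolean; // the user wants updates between renders
}

type Mode = "tool-calls" | "code-execution";

// Assumed break-even point; tune it for the host's sandbox startup cost.
const BREAK_EVEN_CALLS = 5;

function pickMode(job: JobShape): Mode {
  // Branching and interactivity need a model turn per step.
  if (job.needsPerRowReasoning || job.needsLiveProgress) return "tool-calls";
  // A clean map over many rows is where the sandbox pays off.
  return job.toolCalls >= BREAK_EVEN_CALLS ? "code-execution" : "tool-calls";
}

console.log(pickMode({ toolCalls: 100, needsPerRowReasoning: false, needsLiveProgress: false }));
// prints "code-execution"
```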

Error handling inside the sandbox

The big risk in a parallel batch is one bad row poisoning the whole run. Promise.all rejects on the first failure, which is the wrong shape for "render what you can, report the rest".

The script in the walkthrough above wraps each call in try/catch and returns a tagged result. The summary the model sees has one field for successes and one for failures, and the failure entries carry the row id and an error string. The model can then decide whether to retry the failed rows in a second pass or hand the list back to the user.

return {
  total: 100,
  succeeded: 97,
  failed: 3,
  urls: [{ id: "INV-001", url: "..." }, /* ... 96 more */],
  errors: [
    { id: "INV-042", error: "render_failed: missing variable 'tax_rate'" },
    { id: "INV-058", error: "render_failed: invalid date in line items" },
    { id: "INV-071", error: "render_failed: total exceeds template max width" },
  ],
};

The PDF4.dev MCP server returns structured errors with type and code fields, so the script can branch on type === "invalid_request_error" and retry with a fix rather than treating every failure as fatal.
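A sketch of that branching, assuming the error object carries `type`, `code`, and a `message`. The article only confirms `type` and `code`, so the `message` field, the `ToolError` and `RowResult` shapes, and the retry policy itself are illustrative assumptions.

```typescript
// Assumed error shape: only `type` and `code` are confirmed by the text above.
interface ToolError {
  type: string;
  code: string;
  message: string;
}

type RowResult =
  | { ok: true; id: string; url: string }
  | { ok: false; id: string; error: ToolError };

interface FailedRow {
  id: string;
  error: ToolError;
}

// Hypothetical policy: request errors can be retried once the row data is
// fixed; every other failure is reported back as-is.
function partitionFailures(results: readonly RowResult[]): { fixable: FailedRow[]; fatal: FailedRow[] } {
  const fixable: FailedRow[] = [];
  const fatal: FailedRow[] = [];
  for (const r of results) {
    if (r.ok) continue;
    const bucket = r.error.type === "invalid_request_error" ? fixable : fatal;
    bucket.push({ id: r.id, error: r.error });
  }
  return { fixable, fatal };
}
```

Returning the two lists separately lets the model spend a second pass only on rows it can actually repair. `Promise.allSettled` is the standard-library alternative to wrapping each call in try/catch when the script only needs to know which promises rejected.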

Connecting it to MCP 2025-11-25 async tasks

For batches over a few hundred PDFs the sandbox itself becomes the constraint: the script needs to stay running for minutes, and many host runtimes cap at 60 to 120 seconds. The MCP 2025-11-25 specification adds async tasks as an experimental feature: a tool can return a task handle, the caller polls or subscribes for completion, and the agent never blocks waiting for the long-running job.

The pattern composes cleanly with code execution. The script inside the sandbox fires the batch as an async task, gets a task ID back, returns the ID to the model, and the model polls or waits for the manifest URL. That gives:

Short script runtime (start the batch, return the handle)
Server-side fan-out (PDF4.dev runs the loop, no sandbox memory pressure)
Final manifest the model can hand to the user as one URL

PDF4.dev does not expose async tasks yet, but the design intention is for render_pdf to accept a task mode that pairs with the code-execution sandbox for batches in the thousands.
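Since the task mode does not exist yet, the following is only a sketch of the polling side, for whichever component ends up waiting on the handle. The `TaskStatus` shape, its state names, the `manifestUrl` field, and `waitForTask` are all assumptions made for illustration, not PDF4.dev's API or the MCP wire format.

```typescript
// Assumed status shape for a long-running batch.
type TaskStatus =
  | { state: "working" }
  | { state: "completed"; manifestUrl: string }
  | { state: "failed"; reason: string };

interface PollOptions {
  intervalMs: number;
  maxAttempts: number;
  sleep?: (ms: number) => Promise<void>; // injectable so tests do not wait
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls `getStatus` until the task leaves the "working" state or attempts run out.
async function waitForTask(
  getStatus: () => Promise<TaskStatus>,
  { intervalMs, maxAttempts, sleep = defaultSleep }: PollOptions,
): Promise<string> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await getStatus();
    if (status.state === "completed") return status.manifestUrl;
    if (status.state === "failed") throw new Error(`task failed: ${status.reason}`);
    await sleep(intervalMs);
  }
  throw new Error(`task still working after ${maxAttempts} polls`);
}
```

Capping the attempts matters for the same reason as the CPU and memory caps: a poll loop with no upper bound is another way for a script to run forever.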

How to try this today on PDF4.dev

The PDF4.dev MCP server has been live since early 2026 and exposes 14 tools, 4 resources, and 3 prompts. For code execution specifically, the relevant ones are:

render_pdf with delivery: "url" for batch fan-out
list_templates to discover what is available before scripting
get_template if the script needs to inspect a template's variables

The end-to-end setup:

Sign up at pdf4.dev and create a render_only API key in Settings.
Connect the MCP server to a host that supports code execution. Claude Desktop with the latest sandbox runtime works. Cursor and the OpenAI Apps SDK have analogous capabilities under different names.
Drop a CSV in the chat, ask for "one invoice PDF per row using template tmpl_invoice, return a list of signed URLs".
The model picks code execution and writes a script like the one above. The sandbox runs it, and you get a list of 100 URLs in a single response.

The full setup walkthrough for every supported client is on the AI integration page.

The summary of the summary: regular tool calls are great when the agent reasons between every action. Code execution is great when the agent's job is to do the same thing a hundred times. The two compose, they do not compete, and on the PDF4.dev MCP server the same render_pdf tool works for both modes. Pick the one that matches the shape of the job.
