How well do the major LLMs actually read a PDF? In June 2026 we ran the same 50-page PDF (mixed text, tables, a chart, a scanned section) through ChatGPT GPT-4o, Claude 3.7 Sonnet, Gemini 1.5 Pro, and Perplexity Pro. Claude scored highest on factual accuracy for text-heavy PDFs (10 out of 10). Gemini won on cost-per-page and long-document context. ChatGPT won on multimodal layout (charts and diagrams). Perplexity won on cited human-facing Q&A. No provider was best across all axes.
This benchmark is intentionally pragmatic. It is not a leaderboard of model intelligence; it measures how each product handles the kind of PDF that a SaaS team would actually hand to an LLM.
The benchmark setup
We picked a single 50-page test PDF, ran the same 10 questions on each provider, and scored the answers manually.
The PDF mixes content types on purpose so that no provider can win by handling only its preferred input:
- Pages 1 to 20: native text (a corporate annual report)
- Pages 21 to 30: a 12-column financial table with merged cells
- Pages 31 to 35: a stacked-bar revenue chart with axis labels
- Pages 36 to 45: a contract section with footnotes and cross-references
- Pages 46 to 50: a scanned appendix at roughly 200 DPI
The 10 questions split across extraction (3), summarization (2), table reading (2), chart reading (1), scanned-section OCR (1), and citation accuracy (1).
We measured four things on every run: cost in USD (computed from each provider's published token pricing), end-to-end latency (time-to-first-token + total response), factual accuracy (manual 0/1 scoring against ground truth), and citation correctness (did the provider quote the right page).
The numbers below are point-in-time. Pricing and model capabilities for all four providers change every quarter. Recompute on current public pricing before committing infra.
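To make the "recompute on current pricing" step concrete, here is a minimal sketch of the per-query cost arithmetic. The token counts and the output price are placeholders, not figures from our runs; the only number taken from this article is the roughly 0.075 USD per 1M input tokens quoted for Gemini 1.5 Flash in the FAQ.

```python
def cost_per_query(input_tokens: int, output_tokens: int,
                   input_usd_per_m: float, output_usd_per_m: float) -> float:
    """USD cost of one round trip, given per-million-token prices."""
    return (input_tokens * input_usd_per_m
            + output_tokens * output_usd_per_m) / 1_000_000


# Placeholder measurements: swap in your own token counts and the
# provider's current published prices before trusting the result.
example = cost_per_query(
    input_tokens=40_000,      # assumed size of the PDF plus the question
    output_tokens=500,        # assumed answer length
    input_usd_per_m=0.075,    # input price quoted in the FAQ for Gemini 1.5 Flash
    output_usd_per_m=0.30,    # assumed output price, check the pricing page
)
print(f"{example:.5f} USD per query")
```

Because input tokens dominate for a 50-page document, the input price is usually the number to watch when a provider changes its pricing.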
How each provider actually reads a PDF
Key insight: each provider handles the same PDF differently under the hood. Some convert every page to an image and use vision, some run a native text parser, some do both. This is the single biggest source of accuracy variance in the benchmark below. Two providers can have identical context windows and still disagree on the same question, because they are looking at different representations of the input.
| Provider | Reading method | Max file size | Max pages | Context window | Image awareness |
|---|---|---|---|---|---|
| ChatGPT (GPT-4o / GPT-4.1) | Hybrid: text parse + vision on page renders | 32MB per file | 20 files per chat | 128K tokens | Yes (multimodal) |
| Claude 3.7 Sonnet | Vision-rendered pages + extracted text | 32MB | 100 pages | 200K tokens | Yes (multimodal) |
| Gemini 1.5 Pro | Native PDF parser, vision on figures | 50MB via Files API | About 3,600 pages | 1M+ tokens | Yes (multimodal) |
| Perplexity Pro | Text parse plus retrieval over chunks | 25MB | Not published | Not published | Limited |
Source: Anthropic PDF support docs, OpenAI Assistants file_search, Gemini document processing, Perplexity help center.
The practical implication: if your PDFs are text-native (modern HTML-to-PDF output, exported from Word, generated by an API), all four providers can read them. If your PDFs are scanned or image-heavy, Claude and ChatGPT have the cleanest vision integration. If your PDFs are very long (500+ pages), only Gemini fits the document in one request.
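The limits in the table can be turned into a simple pre-flight check. The sketch below only encodes the published file-size and page caps listed above; it does not check whether the document fits the context window, and the names `Limits`, `LIMITS`, and `providers_that_fit` are ours, not part of any provider SDK.

```python
from typing import NamedTuple, Optional


class Limits(NamedTuple):
    max_mb: float
    max_pages: Optional[int]  # None = no page cap listed in the table


# Values copied from the table above (point-in-time, re-check before use).
LIMITS = {
    "ChatGPT": Limits(32, None),            # table lists 20 files per chat, not a page cap
    "Claude 3.7 Sonnet": Limits(32, 100),
    "Gemini 1.5 Pro": Limits(50, 3600),     # "about 3,600 pages"
    "Perplexity Pro": Limits(25, None),     # page cap not published
}


def providers_that_fit(size_mb: float, pages: int) -> list[str]:
    """Return providers whose published size and page caps admit this PDF."""
    return [
        name
        for name, lim in LIMITS.items()
        if size_mb <= lim.max_mb
        and (lim.max_pages is None or pages <= lim.max_pages)
    ]


print(providers_that_fit(size_mb=30, pages=150))
```

A provider passing this check can still truncate or skip content if the document exceeds its context window, so treat the result as a necessary condition, not a guarantee.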
Side-by-side results table
We scored each of the 10 questions on a 0/1 basis: correct or incorrect against ground truth. Cost per question is the API cost for one round trip (input + output tokens), using public pricing as of June 2026.
| # | Question type | ChatGPT GPT-4o | Claude 3.7 Sonnet | Gemini 1.5 Pro | Perplexity Pro |
|---|---|---|---|---|---|
| 1 | Extract CEO name from p.2 | 1 | 1 | 1 | 1 |
| 2 | Extract fiscal year revenue | 1 | 1 | 1 | 1 |
| 3 | Extract publication date | 1 | 1 | 1 | 1 |
| 4 | Summarize pages 1 to 20 | 1 | 1 | 1 | 1 |
| 5 | Summarize the contract section | 0 | 1 | 1 | 0 |
| 6 | Read cell (row 7, col 4) of table | 0 | 1 | 1 | 0 |
| 7 | Sum revenue across regions | 1 | 1 | 0 | 0 |
| 8 | Read chart Y-axis max value | 1 | 1 | 1 | 0 |
| 9 | OCR scanned page 48 paragraph 2 | 1 | 1 | 0 | 0 |
| 10 | Cite exact page for footnote 12 | 0 | 1 | 1 | 0 |
| | Total | 7/10 | 10/10 | 8/10 | 4/10 |
| | Cost per question (USD) | 0.019 | 0.025 | 0.004 | n/a (web only) |
| | Latency (median) | 8.2s | 11.4s | 6.1s | 9.3s |
Claude swept the test. Gemini missed the cross-region sum (it pulled the wrong row from the table) and the OCR question on the noisier scanned page. ChatGPT missed the merged-cell table read, the page citation for footnote 12, and the contract summarization, where it skipped a key clause. Perplexity scored lowest because its file-upload Q&A is tuned for short answers with citations, not for full-document table or chart extraction.
The headline: on a text-heavy mixed PDF, Claude is the most reliable extractor. Gemini is close behind at roughly a sixth of the price (0.004 vs 0.025 USD per question).
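For readers who want to re-score the run, the 0/1 tally is easy to reproduce. The sketch below transcribes the ten per-question rows from the results table and recomputes each total and list of missed questions from those rows, rather than copying the Total line.

```python
# Per-question 0/1 scores, transcribed row by row from the results table.
SCORES = {
    "ChatGPT GPT-4o":    [1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
    "Claude 3.7 Sonnet": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "Gemini 1.5 Pro":    [1, 1, 1, 1, 1, 1, 0, 1, 0, 1],
    "Perplexity Pro":    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
}


def total(provider: str) -> int:
    """Number of questions answered correctly."""
    return sum(SCORES[provider])


def missed(provider: str) -> list[int]:
    """1-based question numbers the provider got wrong."""
    return [i + 1 for i, score in enumerate(SCORES[provider]) if score == 0]


for name in SCORES:
    print(f"{name}: {total(name)}/10, missed {missed(name)}")
```

Keeping the raw rows rather than only the totals makes it possible to see which question types drive a provider's score, which matters more than the headline number when your workload is dominated by one content type.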
Cost-per-1,000-PDF analysis
Single-question benchmarks are useful but rarely match real workloads. The realistic case for a SaaS is something like 1,000 incoming PDFs per month, each one queried 5 times by your agent (extract fields, summarize, check signatures, verify totals, classify). That is 5,000 queries per month.
Using June 2026 public pricing and our average input + output token counts per question on the 50-page PDF:
| Provider | Model | Cost per query | Monthly cost (5,000 queries) |
|---|---|---|---|
| Gemini | 1.5 Flash | 0.0036 USD | 18 USD |
| Gemini | 1.5 Pro | 0.0042 USD | 21 USD |
| OpenAI | GPT-4.1 mini | 0.0072 USD | 36 USD |
| OpenAI | GPT-4o | 0.019 USD | 95 USD |
| Anthropic | Claude 3.5 Haiku | 0.011 USD | 55 USD |
| Anthropic | Claude 3.7 Sonnet | 0.025 USD | 125 USD |
| Perplexity | Pro (web only) | 20 USD flat | 20 USD (1 seat) |
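The monthly column is plain multiplication: cost per query times 1,000 PDFs times 5 queries each. A short sketch of that arithmetic, using the per-query costs from the table above, lets you plug in your own volume. The function and constant names are ours.

```python
PDFS_PER_MONTH = 1_000
QUERIES_PER_PDF = 5

# Per-query costs in USD, copied from the table above (June 2026 pricing).
COST_PER_QUERY_USD = {
    "Gemini 1.5 Flash": 0.0036,
    "Gemini 1.5 Pro": 0.0042,
    "GPT-4.1 mini": 0.0072,
    "GPT-4o": 0.019,
    "Claude 3.5 Haiku": 0.011,
    "Claude 3.7 Sonnet": 0.025,
}


def monthly_cost(cost_per_query: float,
                 pdfs: int = PDFS_PER_MONTH,
                 queries_per_pdf: int = QUERIES_PER_PDF) -> float:
    """Monthly USD cost for a metered API at a given volume."""
    return cost_per_query * pdfs * queries_per_pdf


for model, cost in COST_PER_QUERY_USD.items():
    print(f"{model}: {monthly_cost(cost):.0f} USD/month")
```

Perplexity Pro is left out on purpose: it is a flat per-seat subscription, so it does not scale with query volume the way the metered APIs do.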
Three observations from real workloads we have run.
First, Gemini Flash is in a different cost class. At 18 USD/month for 5,000 queries, it makes batch PDF processing viable for use cases that would not pay 125 USD/month at Claude Sonnet pricing.
Second, the cost-vs-accuracy trade-off is real. Gemini Flash drops 2 to 3 points of accuracy vs Pro on the same questions. If your queries are user-facing, the accuracy difference matters more than the cost. If your queries are internal classification or first-pass triage, the cheaper model is fine.
Third, Perplexity Pro is a flat 20 USD per seat per month with no per-query metering. That makes sense for human analysts, not for backend pipelines. There is no Sonar API endpoint that ingests files at this writing.
Latency comparison
Latency matters when the LLM is in a user-facing path. For background batch jobs it matters less.
| Provider | Time to first token (median) | Total response (median) |
|---|---|---|
| Gemini 1.5 Pro | 1.2s | 6.1s |
| ChatGPT GPT-4o | 1.8s | 8.2s |
| Perplexity Pro | 2.1s | 9.3s |
| Claude 3.7 Sonnet | 2.6s | 11.4s |
A few notes on the shapes of these distributions.
Gemini consistently emits the first token fastest. It also streams the most evenly, with a smooth token-per-second curve. For a chat UI where the user reads as the model writes, this feels the most responsive.
Claude has the largest gap between first-token and total time. It tends to "think then burst", producing a long stretch of silence (the document is being processed and reasoned over) before emitting the final answer in a fast continuous stream. For agent backends this is fine. For chat UIs it can feel slow even when the answer is correct.
ChatGPT and Perplexity sit between the two. ChatGPT's latency is dominated by the file_search retrieval step on the first message of a thread; subsequent messages on the same file are 2 to 4 seconds faster because the vector index is warm.
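If you want to reproduce these two numbers on your own traffic, the measurement is the same for any streaming client: start a timer, record when the first chunk arrives, and record when the stream ends. The sketch below is provider-agnostic and uses a fake stream with hard-coded delays as a stand-in; `measure_stream` and `fake_stream` are illustrative names, not part of any SDK.

```python
import time
from typing import Iterable, Iterator


def measure_stream(chunks: Iterable[str]) -> tuple[float, float, str]:
    """Return (time_to_first_token, total_time, full_text) for a stream."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for chunk in chunks:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    if first_token_at is None:  # empty stream: no token ever arrived
        first_token_at = total
    return first_token_at, total, "".join(parts)


def fake_stream(first_delay: float = 0.05, chunk_delay: float = 0.01,
                n_chunks: int = 5) -> Iterator[str]:
    """Stand-in for a provider's streaming response (delays are made up)."""
    time.sleep(first_delay)
    yield "tok0 "
    for i in range(1, n_chunks):
        time.sleep(chunk_delay)
        yield f"tok{i} "


ttft, total, text = measure_stream(fake_stream())
print(f"first token after {ttft:.3f}s, done after {total:.3f}s")
```

To get a median like the ones in the table, repeat the measurement over several runs and pass the collected values to `statistics.median`.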
When to pick each
There is no universally best provider for PDFs. There are clear winners by workload.
| Workload | Pick | Reason |
|---|---|---|
| Cheap, long-context, batch processing | Gemini 1.5 Flash or Pro | Lowest cost per page, 1M+ token context, native PDF parser |
| Highest factual accuracy on text-heavy PDFs | Claude 3.7 Sonnet | Best score in the benchmark, strong citation accuracy |
| Multimodal accuracy (charts, diagrams, scanned content) | ChatGPT GPT-4o or Gemini 2.x | Mature vision stacks, file_search index for retrieval |
| Sourced Q&A for end users | Perplexity Pro | Built-in source citation UX, but web only |
| Very long PDFs (200+ pages in one shot) | Gemini 1.5 Pro | Only provider with a context window large enough |
| Generating a new PDF from the model's output | None of the above directly | Use the LLM to produce HTML or structured data, then a hosted render API |
The last row is the one most teams underestimate. LLMs are good at reading PDFs and reasoning about them. They are bad at producing pixel-accurate PDFs as output. Asking ChatGPT to "make me a PDF invoice" tends to produce a markdown table dressed up as a fake PDF, not a real one. The output is also not reproducible: the same prompt produces a slightly different layout every time, which is unacceptable for invoices, contracts, or any document with a legal or accounting footprint.
The right pattern is to separate reading from writing.
The hybrid workflow PDF4.dev recommends
In production, agents that touch PDFs always need two halves: a model that reads and reasons (one of the four providers above), and a deterministic render service that writes the final document.
The PDF4.dev side of that loop is exposed two ways:
- The REST API: POST /api/v1/render takes a template ID plus a JSON data object, applies Handlebars variables, runs Playwright headless Chromium, and returns the PDF as a binary, a base64 payload, or a signed URL. The output is reproducible: the same input always produces byte-identical PDFs.
- The MCP server: an AI agent calls render_pdf, create_template, update_template, and 11 other tools directly through the Model Context Protocol. No glue code, no wrapper. The agent decides when to call the tool based on the conversation.
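As a rough illustration of the REST side, the sketch below builds (but does not send) a render request with the Python standard library. Only the POST /api/v1/render path, the template ID, and the JSON data object come from the description above. The host, the `template_id` and `data` field names, and the bearer-token header are assumptions made for illustration, so check the API reference for the real request shape.

```python
import json
import urllib.request

BASE_URL = "https://pdf4.dev"  # assumed host, not confirmed by this article


def build_render_request(template_id: str, data: dict,
                         api_key: str) -> urllib.request.Request:
    """Build a POST /api/v1/render request without sending it."""
    body = json.dumps({
        "template_id": template_id,  # assumed field name
        "data": data,                # assumed field name
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/render",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
    )


req = build_render_request("purchase-order", {"vendor": "Example Supplies Ltd"}, "YOUR_API_KEY")
print(req.get_method(), req.full_url)
```

Sending the request is one more call, `urllib.request.urlopen(req)`, once the field names and credentials match the real API.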
A concrete agent loop:
- User uploads an incoming PDF invoice.
- Claude or Gemini reads it, extracts vendor, line items, totals (this is the "read" half, where the benchmark above is decisive).
- The agent calls the PDF4.dev MCP render_pdf tool with a purchase-order template ID and the extracted data.
- PDF4.dev returns a deterministic, pixel-accurate purchase-order PDF.
Two different jobs, two different tools, one agent orchestrating both. This is the pattern that actually ships, and it sidesteps the "make me a PDF" failure mode of asking an LLM to render layout.
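The hand-off between the two halves can be sketched in a few lines. MCP tool invocations are JSON-RPC 2.0 messages with the method `tools/call`; the sketch below builds one for render_pdf after a deterministic sanity check on the extracted totals. The invoice data is made up, and the argument names `template_id` and `data` are assumptions, since the tool's real input schema is not shown here.

```python
import json

# Made-up placeholder data standing in for what the "read" model extracted.
extracted = {
    "vendor": "Example Supplies Ltd",
    "line_items": [
        {"description": "Paper, A4", "quantity": 10, "unit_price_cents": 450},
        {"description": "Toner", "quantity": 2, "unit_price_cents": 3900},
    ],
    "total_cents": 12300,
}


def totals_match(invoice: dict) -> bool:
    """Deterministic check: line items must add up to the stated total."""
    computed = sum(item["quantity"] * item["unit_price_cents"]
                   for item in invoice["line_items"])
    return computed == invoice["total_cents"]


def render_tool_call(request_id: int, template_id: str, data: dict) -> str:
    """Build the JSON-RPC message an MCP client would send for render_pdf."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "render_pdf",
            # Argument names are assumed for illustration.
            "arguments": {"template_id": template_id, "data": data},
        },
    }
    return json.dumps(message)


if totals_match(extracted):
    print(render_tool_call(1, "purchase-order", extracted))
else:
    print("Extracted totals do not add up; send to human review instead of rendering.")
```

In practice an MCP client library builds this message for you. The point of the sketch is the ordering: validate what the LLM extracted with ordinary code, then pass it to the deterministic renderer.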
Frequently asked questions
Which AI is best at reading PDFs? No provider is universally best. Claude 3.7 Sonnet wins on factual accuracy for text-heavy native PDFs. Gemini 1.5 Pro wins on cost-per-page and long-document context. ChatGPT GPT-4o wins on multimodal layout. Perplexity wins for one-off cited Q&A. Pick by workload.
Can ChatGPT read 100-page PDFs? Yes, up to a point. The consumer app's 32MB-per-file and 20-file caps apply. For PDFs above 100 pages, GPT-4o tends to skip middle sections, so Gemini's 1M-token context is the better fit.
Does Claude analyze scanned PDFs? Yes. Claude renders every page as an image and runs vision plus text extraction, so scanned PDFs work without a separate OCR step. The cap is 32MB and 100 pages per request.
How does Gemini handle long PDFs differently from Claude? Gemini holds 1M+ tokens of context, large enough for 500+ page PDFs in one request. Claude caps at 100 pages. For batch summarization across a single very long document, Gemini wins. For accuracy on a normal-length document, Claude scores higher.
Is there an API for ChatGPT PDF reading? Yes. OpenAI's Assistants API plus the file_search tool ingests PDFs and builds a vector index. You can also pass PDF pages as images to GPT-4o via the Chat Completions API.
Which AI is cheapest for batch PDF processing? Gemini 1.5 Flash, at roughly 0.075 USD per 1M input tokens. For 5,000 PDF queries per month, that comes to about 18 USD vs 95 USD for GPT-4o and 125 USD for Claude 3.7 Sonnet.
Can Perplexity be used in a SaaS pipeline? Not for PDF ingestion at this writing. Perplexity's file upload is a Pro web feature. The Sonar API does not accept files. Use Perplexity for human-facing Q&A, not backend pipelines.
How accurate are LLMs at extracting tables from PDFs? Above 90% on clean native tables, below 75% on scanned tables or tables with merged cells. For mission-critical table extraction, run a deterministic tool first, then ask the LLM to interpret the result.
Should I use AI or Tabula/pdfplumber for table extraction? Use deterministic tools (Tabula, pdfplumber, Camelot) for native-text tables: free, exact, fast. Use an LLM for scanned or unusual layouts. The best stack often uses both.
How does the MCP protocol fit into a PDF + AI workflow? MCP lets an AI agent call external tools through a standard JSON-RPC interface. PDF4.dev exposes an MCP server with render_pdf, create_template, and 12 other tools. The pattern: read the input PDF with the LLM, render the output PDF through MCP.
Wrap-up
If you are choosing one provider for PDF reading in mid-2026: Claude for accuracy, Gemini for cost and context, ChatGPT for multimodal layout, Perplexity for human-facing sourced Q&A. The benchmark above will shift every quarter as the four providers ship new model versions. The architectural lesson is more durable: read with an LLM, write with a deterministic render API. Mixing the two is the source of most of the "AI can't do PDFs" frustration.



