PDF to Excel conversion works in two ways: text-layer extraction reads a digital PDF's positioned glyphs and reconstructs cells (fast, free, accurate for ~80% of cases), and OCR converts scanned page images to text first, then extracts the table. The three most reliable free tools are Tabula, pdfplumber, and Camelot. For most digital PDFs, this one-liner gets the job done:
python -c "import tabula; tabula.convert_into('input.pdf', 'output.csv', output_format='csv', pages='all')"

This guide covers the free online path, Python and Node.js code that works on real data, when to reach for a paid API like Adobe Extract, and how to handle scanned PDFs.
The free way: convert PDF to Excel online in 30 seconds
For a single PDF with a clean table, the fastest free path is text extraction in the browser:
- Open pdf4.dev/tools/pdf-to-text
- Drop your PDF onto the upload area
- Click Extract text
- Copy the output, paste it into Excel column A, then use Data > Text to Columns with the appropriate delimiter (tab, space, or comma)
- Save the file as XLSX
The extraction runs in your browser with PDF.js, so the file never reaches a server. This path is best for one-off conversions; for repeat work or scanned PDFs, jump to the programmatic sections below.
Why "PDF to Excel" is harder than "PDF to text"
A PDF file does not store tables as tables. It stores a list of glyphs (single characters), each placed at an absolute x/y coordinate on a page. There is no row index, no column index, no cell concept. The PDF spec, ISO 32000, only defines how to draw graphics and text; it does not define document structure beyond optional tagging.
Excel needs the opposite: a structured grid of rows and columns with typed values. Converting from PDF to Excel is therefore a layout reconstruction problem. The extractor reads glyph positions, infers where columns start and end based on whitespace gaps or ruling lines, groups glyphs into cells, and then groups cells into rows.
Two extractors run on the same PDF can produce different cell boundaries. That is not a bug. It reflects different heuristics for the same ambiguous input. The practical consequence: expect to fix a few cells by hand on complex tables, and pick the tool whose defaults match your input.
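The reconstruction steps described above can be sketched in a few lines of Python. This is an illustration of the idea, not how Tabula or pdfplumber work internally: the `Word` tuple, the `y_tol` and `min_gap` defaults, and the function names are all assumptions made for the example. Coordinates follow the pdfplumber convention, where `top` grows downward from the top of the page.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x0: float   # left edge, in PDF points
    x1: float   # right edge
    top: float  # distance from the top of the page


def group_rows(words: list[Word], y_tol: float = 2.0) -> list[list[Word]]:
    """Group words whose vertical position is within y_tol of the row's first word."""
    rows: list[list[Word]] = []
    for w in sorted(words, key=lambda w: (w.top, w.x0)):
        if rows and abs(rows[-1][0].top - w.top) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    return [sorted(r, key=lambda w: w.x0) for r in rows]


def column_edges(words: list[Word], min_gap: float = 10.0) -> list[float]:
    """Return the left edge of each column: an x position preceded by a gap no word covers."""
    edges: list[float] = []
    right = None
    for x0, x1 in sorted((w.x0, w.x1) for w in words):
        if right is None or x0 - right >= min_gap:
            edges.append(x0)
        right = x1 if right is None else max(right, x1)
    return edges


def to_grid(words: list[Word], y_tol: float = 2.0, min_gap: float = 10.0) -> list[list[str]]:
    """Place every word into the cell given by its row and its column edge."""
    edges = column_edges(words, min_gap)
    grid = []
    for row in group_rows(words, y_tol):
        cells = [""] * len(edges)
        for w in row:
            # The word belongs to the last column that starts at or before it
            col = max(i for i, e in enumerate(edges) if e <= w.x0)
            cells[col] = (cells[col] + " " + w.text).strip()
        grid.append(cells)
    return grid
```

Changing `y_tol` or `min_gap` changes which words end up in the same cell, which is the same reason two extractors can disagree on one PDF.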
Digital PDF vs scanned PDF. Digital PDFs (made by Word, InDesign, Playwright, or any programmatic generator) contain a text layer with real glyph data. Scanned PDFs are bitmap images of pages with no text layer. Standard extractors return nothing from scanned PDFs; you need OCR first. To check: try selecting text in any PDF viewer. If nothing highlights, the PDF is scanned.
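The same check can be automated before choosing an extractor. The sketch below assumes pdfplumber is installed; the `min_chars` threshold and the three-page sample are arbitrary assumptions, and both function names are hypothetical.

```python
from typing import Optional


def has_text_layer(page_texts: list[Optional[str]], min_chars: int = 20) -> bool:
    """Treat the PDF as digital when the sampled pages yield enough real characters."""
    total = sum(len((t or "").strip()) for t in page_texts)
    return total >= min_chars


def classify_pdf(path: str, max_pages: int = 3) -> str:
    """Return "digital" or "scanned" based on the first few pages."""
    # Third-party import kept local so has_text_layer has no dependencies
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        texts = [page.extract_text() for page in pdf.pages[:max_pages]]
    return "digital" if has_text_layer(texts) else "scanned"
```

A scan that went through an OCR step may carry a hidden text layer and classify as digital here; the quality of that layer still depends on the OCR that produced it.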
Three extraction paths, ranked by accuracy
The right tool depends on the input. Most production pipelines mix two paths to cover the range of PDFs they receive.
| Path | Best for | Free tools | Paid tools | Accuracy |
|---|---|---|---|---|
| Text-layer + heuristic table detector | Digital PDFs with ruled or aligned tables | Tabula, pdfplumber, Camelot | None needed | High |
| Layout-aware ML | Scanned or complex multi-column PDFs | Unstructured.io (open core) | Adobe Extract API, Microsoft Document Intelligence | Very high |
| Pure OCR + post-processing | Scanned PDFs as fallback | Tesseract, PaddleOCR | Mistral OCR, AWS Textract | Medium |
If you control the input format and the PDFs are digital, start with text-layer extraction. If you receive scanned bank statements, invoices, or third-party documents, a paid layout-aware API saves more engineering hours than it costs.
How to convert PDF to Excel in Python (free)
Python has the strongest open-source ecosystem for PDF table extraction. Two libraries cover the common cases: tabula-py for fast extraction of well-formed tables, and pdfplumber when you need finer control or the table is irregular.
Option 1: tabula-py (fastest, simplest)
tabula-py is a thin Python wrapper around tabula-java, the most mature open-source table extractor. It requires a JDK on the system.
pip install tabula-py pandas openpyxl
# Java JDK 8 or higher must also be installed

import tabula
import pandas as pd
# Read all tables from a PDF, one DataFrame per table
tables = tabula.read_pdf("invoice.pdf", pages="all")
print(f"Found {len(tables)} table(s)")
# Write each table as a separate sheet in one XLSX file
with pd.ExcelWriter("output.xlsx", engine="openpyxl") as writer:
    for i, df in enumerate(tables, start=1):
        df.to_excel(writer, sheet_name=f"Table_{i}", index=False)
print("Saved to output.xlsx")

For tables with visible ruling lines, switch to lattice mode. For tables that rely on whitespace alignment, use stream mode:
import tabula
# Lattice: use when every cell has a visible border
tables_lattice = tabula.read_pdf(
    "ruled_table.pdf",
    pages="all",
    lattice=True,
)
# Stream: use when columns are aligned by whitespace only
tables_stream = tabula.read_pdf(
    "report.pdf",
    pages="all",
    stream=True,
)
# Specify a region (top, left, bottom, right) in PDF points
tables_area = tabula.read_pdf(
    "invoice.pdf",
    pages=1,
    area=[200, 50, 750, 550],
    stream=True,
)

When read_pdf returns an empty list, the heuristic failed. Switch modes (lattice to stream or vice versa), pass guess=False with an explicit area, or fall back to pdfplumber.
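That retry order can be written as a small fallback chain. This is a sketch under stated assumptions: `first_nonempty` and `tabula_attempts` are hypothetical helper names, and the order (lattice, then stream, then an explicit area) is one reasonable choice rather than a rule from the Tabula documentation.

```python
from typing import Callable, Sequence


def first_nonempty(attempts: Sequence[tuple[str, Callable[[], list]]]) -> tuple[str, list]:
    """Run each extraction attempt in order and return the first that finds tables."""
    for label, run in attempts:
        tables = run()
        if tables:
            return label, tables
    return "none", []


def tabula_attempts(path: str, area=None) -> list:
    """Build the attempt list for one PDF; tabula is imported only when this is called."""
    import tabula

    attempts = [
        ("lattice", lambda: tabula.read_pdf(path, pages="all", lattice=True)),
        ("stream", lambda: tabula.read_pdf(path, pages="all", stream=True)),
    ]
    if area is not None:
        # Explicit region with guessing disabled, as a last resort
        attempts.append(
            ("area", lambda: tabula.read_pdf(path, pages="all", stream=True, guess=False, area=area))
        )
    return attempts
```

Calling `first_nonempty(tabula_attempts("invoice.pdf"))` returns the label of the mode that worked, which is worth logging so recurring document types can skip straight to the right mode.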
Option 2: pdfplumber (most flexible)
pdfplumber wraps pdfminer.six and exposes the underlying page geometry. It is slower than Tabula but handles tables that Tabula misses, especially ones without consistent column spacing.
pip install pdfplumber pandas openpyxl

import pdfplumber
import pandas as pd
rows = []
with pdfplumber.open("statement.pdf") as pdf:
    for page in pdf.pages:
        # extract_tables returns a list of tables, each a list of rows
        for table in page.extract_tables():
            for row in table:
                rows.append(row)
# Use the first row as headers
df = pd.DataFrame(rows[1:], columns=rows[0])
df.to_excel("statement.xlsx", index=False)

For irregular tables, pdfplumber exposes extract_table options to tune the detection. The relevant settings are documented in the pdfplumber table extraction guide:
import pdfplumber
with pdfplumber.open("complex.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table(
        table_settings={
            "vertical_strategy": "text",  # or "lines"
            "horizontal_strategy": "text",  # or "lines"
            "snap_tolerance": 3,
            "join_tolerance": 3,
            "edge_min_length": 3,
        }
    )
    # extract_table returns None when no table is found on the page
    for row in table or []:
        print(row)

The vertical_strategy and horizontal_strategy options switch between lattice-style ("lines") and stream-style ("text") detection independently per axis, which is more flexible than Tabula's all-or-nothing flag.
Option 3: Camelot (best for ruled tables)
Camelot ships two algorithms: lattice (the same idea as Tabula's lattice) and stream. Its output is the same DataFrame shape as Tabula, with a richer accuracy score that helps you decide whether to fall back.
pip install "camelot-py[base]" pandas openpyxl
# Also requires Ghostscript and OpenCV system packages

import camelot
# Lattice mode for ruled tables
tables = camelot.read_pdf("ruled.pdf", pages="all", flavor="lattice")
print(f"Total tables extracted: {tables.n}")
# Inspect the accuracy score for each table
for i, t in enumerate(tables):
    print(f"Table {i + 1}: accuracy={t.accuracy:.1f}%, whitespace={t.whitespace:.1f}%")
# Export all tables to one XLSX with one sheet per table
tables.export("output.xlsx", f="excel")

The accuracy score is a useful trigger for automated pipelines: if it drops below a threshold (say, 80), retry with the other flavor or send the file to manual review.
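The threshold rule can be sketched as a small selector. The reader is passed in as a callable so the logic stays independent of Camelot; `pick_flavor` and `mean_accuracy` are hypothetical names, and averaging the per-table scores is an assumption (a stricter pipeline might use the minimum instead).

```python
from typing import Callable, Iterable


def mean_accuracy(tables: list) -> float:
    """Average the per-table accuracy scores; an empty result scores 0."""
    return sum(t.accuracy for t in tables) / len(tables) if tables else 0.0


def pick_flavor(
    read: Callable[[str], Iterable],
    threshold: float = 80.0,
    flavors: tuple = ("lattice", "stream"),
):
    """Try each flavor in order and return (flavor, tables, score).

    flavor is None when no attempt reaches the threshold, which the caller can
    treat as "send to manual review"; the best-scoring attempt is still returned.
    """
    best_tables, best_score = [], 0.0
    for flavor in flavors:
        tables = list(read(flavor))
        score = mean_accuracy(tables)
        if score >= threshold:
            return flavor, tables, score
        if score > best_score:
            best_tables, best_score = tables, score
    return None, best_tables, best_score
```

With Camelot, the reader would be `lambda flavor: camelot.read_pdf("ruled.pdf", pages="all", flavor=flavor)`.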
Batch convert a folder of PDFs
import os
import tabula
import pandas as pd
input_dir = "./pdfs"
output_dir = "./xlsx"
os.makedirs(output_dir, exist_ok=True)
for filename in os.listdir(input_dir):
    # Match the extension case-insensitively so "REPORT.PDF" is not skipped
    if not filename.lower().endswith(".pdf"):
        continue
    pdf_path = os.path.join(input_dir, filename)
    # splitext swaps only the final extension, unlike str.replace
    xlsx_path = os.path.join(output_dir, os.path.splitext(filename)[0] + ".xlsx")
    try:
        tables = tabula.read_pdf(pdf_path, pages="all")
        if not tables:
            print(f"No tables found in {filename}")
            continue
        with pd.ExcelWriter(xlsx_path, engine="openpyxl") as writer:
            for i, df in enumerate(tables, start=1):
                df.to_excel(writer, sheet_name=f"Table_{i}", index=False)
        print(f"Converted: {filename}")
    except Exception as e:
        print(f"Failed on {filename}: {e}")

This pattern handles a few hundred PDFs on a laptop in well under a minute. For larger batches, parallelize with concurrent.futures.ProcessPoolExecutor.
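A parallel version of the same loop might look like the sketch below. The worker count of 4 and the function names are assumptions; the third-party imports sit inside the worker so each process loads them on its own.

```python
import os
from concurrent.futures import ProcessPoolExecutor, as_completed


def xlsx_path_for(pdf_path: str, output_dir: str) -> str:
    """Map ./pdfs/report.pdf to <output_dir>/report.xlsx."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(output_dir, stem + ".xlsx")


def convert_one(pdf_path: str, output_dir: str) -> str:
    """Worker: extract every table from one PDF and write one XLSX."""
    import pandas as pd
    import tabula

    tables = tabula.read_pdf(pdf_path, pages="all")
    if not tables:
        return f"no tables: {pdf_path}"
    out = xlsx_path_for(pdf_path, output_dir)
    with pd.ExcelWriter(out, engine="openpyxl") as writer:
        for i, df in enumerate(tables, start=1):
            df.to_excel(writer, sheet_name=f"Table_{i}", index=False)
    return f"converted: {out}"


def convert_folder(input_dir: str, output_dir: str, workers: int = 4) -> list[str]:
    """Convert every PDF in input_dir, one process per file, and collect status lines."""
    os.makedirs(output_dir, exist_ok=True)
    pdfs = [
        os.path.join(input_dir, name)
        for name in sorted(os.listdir(input_dir))
        if name.lower().endswith(".pdf")
    ]
    results = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(convert_one, p, output_dir): p for p in pdfs}
        for fut in as_completed(futures):
            try:
                results.append(fut.result())
            except Exception as exc:  # one bad PDF should not stop the batch
                results.append(f"failed: {futures[fut]}: {exc}")
    return results
```

Call `convert_folder("./pdfs", "./xlsx")` from under an `if __name__ == "__main__":` guard; process pools need it on platforms that start workers by spawning a fresh interpreter.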
How to convert PDF to Excel in Node.js (free)
Native Node.js table extraction is weaker than Python's. There is no tabula-py equivalent in pure JavaScript. Two practical paths:
Path 1: call tabula-java via child_process
The most reliable Node.js option is to shell out to the original tabula-java JAR. This treats Tabula as a CLI binary, which it supports first-class.
import { spawn } from "node:child_process";
async function pdfToCsv(pdfPath: string, csvPath: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const proc = spawn("java", [
      "-jar",
      "./tabula.jar",
      "--pages",
      "all",
      "--format",
      "CSV",
      "--outfile",
      csvPath,
      pdfPath,
    ]);
    let stderr = "";
    proc.stderr.on("data", (data) => {
      stderr += data.toString();
    });
    // Fires when java itself cannot be started (for example, not on PATH)
    proc.on("error", reject);
    proc.on("close", (code) => {
      if (code === 0) resolve();
      else reject(new Error(`tabula exited ${code}: ${stderr}`));
    });
  });
}
await pdfToCsv("./invoice.pdf", "./invoice.csv");

Download the JAR from the Tabula releases page. Once you have a CSV, convert it to XLSX with exceljs or xlsx from npm.
Path 2: pdfjs-dist + custom column detection
For pure-Node workflows without a JVM, use pdfjs-dist to read positioned text items, then group them into rows and columns yourself. This is more code but no system dependencies.
import * as pdfjsLib from "pdfjs-dist";
import { promises as fs } from "node:fs";
pdfjsLib.GlobalWorkerOptions.workerSrc = "";
interface PositionedItem {
  text: string;
  x: number;
  y: number;
}
async function extractRows(
  pdfPath: string,
  pageNum = 1,
  yTolerance = 2,
): Promise<string[][]> {
  // Recent pdfjs-dist releases expect a Uint8Array rather than a Node Buffer
  const data = new Uint8Array(await fs.readFile(pdfPath));
  const pdf = await pdfjsLib.getDocument({ data }).promise;
  const page = await pdf.getPage(pageNum);
  const content = await page.getTextContent();
  const items: PositionedItem[] = content.items
    .filter((it) => "str" in it && it.str.trim().length > 0)
    .map((it) => ({
      text: (it as { str: string }).str.trim(),
      x: (it as { transform: number[] }).transform[4],
      y: (it as { transform: number[] }).transform[5],
    }));
  // Group items into rows by y coordinate (PDF y grows upward, so sort descending)
  items.sort((a, b) => b.y - a.y);
  const rows: PositionedItem[][] = [];
  for (const it of items) {
    const last = rows[rows.length - 1];
    if (last && Math.abs(last[0].y - it.y) <= yTolerance) {
      last.push(it);
    } else {
      rows.push([it]);
    }
  }
  // Sort each row left to right and project to plain text
  return rows.map((row) =>
    row.sort((a, b) => a.x - b.x).map((it) => it.text),
  );
}
const rows = await extractRows("./invoice.pdf");
console.table(rows);

This reconstructs rows correctly but does not infer column boundaries: every row is a flat list of text fragments. Adding real column detection (clustering x positions across all rows) is another 30 lines. For production, the tabula-java path is faster to ship.
Node.js vs Python: honest comparison
| Concern | Python | Node.js |
|---|---|---|
| Free library quality | Excellent (Tabula, pdfplumber, Camelot) | Weak (no equivalent) |
| Setup friction | pip install + Java for Tabula | Java for Tabula, or write reconstruction code |
| Speed per page | Fast | Comparable when shelling to Java |
| Production maintenance | Mature ecosystem | More custom code |
If your stack is mostly Node.js but PDF-to-Excel is a critical path, the cleanest answer is a small Python microservice that the Node app calls over HTTP. The alternative is a paid API.
When to use a paid API (and which one)
Free libraries cover most digital PDFs. Paid APIs earn their cost on scanned PDFs, complex multi-column layouts, and pipelines where engineering time matters more than per-page cost.
| API | Pricing | Strength | Weakness |
|---|---|---|---|
| Adobe Extract API | ~0.05 USD per page | Industry-leading layout reconstruction, structured JSON output | Per-page cost adds up at scale |
| Microsoft Document Intelligence (Layout) | ~0.01 USD per page | Excellent on scanned + handwritten input, prebuilt models for invoices and receipts | Azure setup overhead |
| Aspose.PDF Cloud | Tiered (credits) | Wide format support, table extraction with cell types | Older API ergonomics |
| PDF.co | ~0.005-0.02 USD per page | Cheapest paid option, REST API | Lower accuracy on edge cases |
| ConvertAPI | ~0.0025 USD per page | Direct PDF to XLSX endpoint | Quality varies by input |
Rule of thumb. Under 100 PDFs per day with clean digital tables: stay free. Over 1,000 PDFs per day or scanned input: a paid API saves more engineering hours than it costs. Mixed input where 80% are digital and 20% scanned: extract with Tabula first, route failures to a paid API.
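To put numbers on the rule of thumb, a rough monthly estimate is enough. The sketch below uses the approximate per-page prices from the table above; the page count per PDF, the 30-day month, and the function name are assumptions to replace with your own figures.

```python
def monthly_api_cost(
    pdfs_per_day: int,
    pages_per_pdf: float,
    price_per_page: float,
    routed_share: float = 1.0,
    days: int = 30,
) -> float:
    """Estimate monthly spend in USD when routed_share of the PDFs go to a paid API."""
    pages = pdfs_per_day * pages_per_pdf * routed_share * days
    return round(pages * price_per_page, 2)


# Mixed input: 1,000 PDFs per day, assumed 3 pages each, 20% routed at ~0.05 USD per page
mixed = monthly_api_cost(1000, 3, 0.05, routed_share=0.2)

# Same volume with every PDF sent to the paid API
everything = monthly_api_cost(1000, 3, 0.05)
```

Comparing the two figures shows what the Tabula-first routing saves; the estimate ignores free tiers and volume discounts.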
PDF4.dev does not currently offer a /extract endpoint. The product runs in the generation direction (data to PDF). For PDF-to-Excel at scale, the Python libraries above and Adobe Extract API are the right tools.
OCR for scanned PDFs (when text-layer is empty)
If pdfplumber or Tabula returns nothing on a PDF that visually contains a table, the text layer is empty. Confirm by trying to select text in any viewer; if nothing highlights, the PDF is a scan. The fix is OCR: convert the page image to text, then extract the table.
Free: Tesseract
Tesseract is the standard open-source OCR engine. It does not detect tables; it returns a flat text stream. Combine it with pdfplumber-style position reconstruction or with a layout-aware wrapper like unstructured.io.
import fitz # PyMuPDF
import pytesseract
from PIL import Image
import io
def ocr_pdf_page(pdf_path: str, page_num: int = 0, dpi: int = 300) -> str:
    doc = fitz.open(pdf_path)
    page = doc[page_num]
    pix = page.get_pixmap(dpi=dpi)
    img = Image.open(io.BytesIO(pix.tobytes("png")))
    return pytesseract.image_to_string(img)
text = ocr_pdf_page("scanned.pdf", page_num=0)
print(text)

Run OCR at 300 DPI or higher. Lower resolution drops accuracy fast. The Tesseract input format docs confirm 300 DPI as the practical floor for character recognition.
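For table work, word positions matter more than the flat string. pytesseract can return per-word boxes through `image_to_data`, and those can be grouped into rows. The sketch below assumes the dict output format; the `min_conf` and `y_tol` defaults (pixels at roughly 300 DPI) and the function name are assumptions to tune per document.

```python
def ocr_rows(data: dict, min_conf: float = 40.0, y_tol: float = 8.0) -> list[list[str]]:
    """Group OCR word boxes into rows by vertical center, each row ordered left to right."""
    words = []
    columns = zip(data["text"], data["conf"], data["left"], data["top"], data["height"])
    for text, conf, left, top, height in columns:
        # Skip structural entries (empty text) and low-confidence noise
        if text.strip() and float(conf) >= min_conf:
            words.append((top + height / 2, left, text.strip()))
    words.sort()

    rows = []  # each entry: [anchor_y, [(left, text), ...]]
    for y, left, text in words:
        if rows and abs(rows[-1][0] - y) <= y_tol:
            rows[-1][1].append((left, text))
        else:
            rows.append([y, [(left, text)]])
    return [[text for _, text in sorted(cells)] for _, cells in rows]
```

The input comes from `pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)`. Like the pdfjs-dist example, this recovers rows but not column boundaries.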
Paid: Mistral OCR, AWS Textract
Mistral OCR, announced in March 2025, is a general-purpose OCR model with strong table-aware output. It returns structured Markdown with tables preserved as Markdown tables, which is much easier to convert to Excel than a flat text stream.
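Turning Markdown tables into rows needs no extra dependencies. The sketch below assumes pipe tables with a leading and trailing pipe on every row and no escaped pipes inside cells; it is not tied to any particular OCR service, and the function names are hypothetical.

```python
import csv
import io


def markdown_tables(md: str) -> list[list[list[str]]]:
    """Return every pipe table in a Markdown string as a list of rows of cells."""
    tables, current = [], []
    for line in md.splitlines():
        s = line.strip()
        if len(s) > 1 and s.startswith("|") and s.endswith("|"):
            cells = [c.strip() for c in s[1:-1].split("|")]
            # Skip the header separator row, e.g. |---|:---:|
            if all(c and set(c) <= set("-:") for c in cells):
                continue
            current.append(cells)
        elif current:
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables


def to_csv(table: list[list[str]]) -> str:
    """Serialize one table as CSV text, ready for Excel or pandas.read_csv."""
    buf = io.StringIO()
    csv.writer(buf).writerows(table)
    return buf.getvalue()
```

Each parsed table can also go straight into `pd.DataFrame(table[1:], columns=table[0])` and then `to_excel`.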
AWS Textract and Microsoft Document Intelligence offer dedicated table extraction modes that return cell-level JSON with row and column indices. For scanned bank statements and invoices at scale, these services beat Tesseract by a wide margin.
Common pitfalls
Real-world PDFs break in predictable ways. Plan for each.
| Problem | What happens | Fix |
|---|---|---|
| Merged cells | Two rows collapse into one | Detect by row count mismatch; un-merge by duplicating value |
| Multi-line cells | One logical cell wraps across two y positions | Increase pdfplumber join_tolerance or merge rows by detecting empty cells |
| Headers spanning multiple rows | Header gets parsed as data | Skip the first N rows on read, set columns manually |
| Negative numbers in parentheses | (1,234.50) reads as string | Post-process with regex: df["amt"] = df["amt"].str.replace(r"\((.*)\)", r"-\1", regex=True) |
| Currency symbols | $1,234.50 reads as string | Strip non-numeric characters before casting to float |
| Locale decimal separators | 1.234,56 (European) read incorrectly | Pass decimal="," and thousands="." to pd.read_csv |
| Dates parsed as strings | "03/05/2026" stays a string | Use pd.to_datetime(df["date"], format="%d/%m/%Y") |
| Scientific notation | 1.23E+05 collapses to a single cell | Set Excel cell format to "Number" after writing |
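The multi-line cell fix from the table can be sketched as a pass over the extracted rows. It assumes one key column (for example the date) is filled on the first line of every logical row and empty on wrapped lines; `merge_wrapped_rows` and `key_col` are hypothetical names.

```python
def merge_wrapped_rows(rows: list[list], key_col: int = 0) -> list[list[str]]:
    """Fold rows whose key column is empty into the row above them."""
    merged: list[list[str]] = []
    for row in rows:
        # pdfplumber reports empty cells as None, so normalize first
        cells = [(c or "").strip() for c in row]
        if not any(cells):
            continue  # drop fully empty rows
        if merged and not cells[key_col]:
            prev = merged[-1]
            for i, c in enumerate(cells):
                if c and i < len(prev):
                    prev[i] = (prev[i] + " " + c).strip()
        else:
            merged.append(cells)
    return merged
```

Run it on the raw list of rows before building the DataFrame, so the header row and data rows are cleaned the same way.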
A short post-processing layer in pandas catches most of these in one pass:
import pandas as pd
import re
def clean_amount(value):
    if pd.isna(value) or not isinstance(value, str):
        return value
    s = value.strip()
    # Negative in parentheses
    neg = bool(re.match(r"^\(.*\)$", s))
    s = re.sub(r"[^\d.,-]", "", s.strip("()"))
    if "," in s and "." in s:
        # Both separators present: whichever comes last is the decimal mark
        if s.rfind(",") > s.rfind("."):
            s = s.replace(".", "").replace(",", ".")  # European: 1.234,56
        else:
            s = s.replace(",", "")  # US: 1,234.56
    else:
        # Assumption: a lone comma is a thousands separator (1,234)
        s = s.replace(",", "")
    try:
        return -float(s) if neg else float(s)
    except ValueError:
        return None
df["amount"] = df["amount"].apply(clean_amount)

Choosing the right tool: decision tree
The shortest decision path:
| Input | Volume | Recommendation |
|---|---|---|
| Digital PDF, ruled table | Any | Tabula lattice or Camelot lattice |
| Digital PDF, whitespace table | Any | Tabula stream or pdfplumber |
| Digital PDF, irregular layout | Any | pdfplumber with tuned table_settings |
| Scanned PDF | Low | Tesseract + manual review |
| Scanned PDF | High | Adobe Extract or Microsoft Document Intelligence |
| Mixed digital + scanned | Any | Tabula first, route failures to paid API |
| Node.js stack, free path | Low | pdfjs-dist + custom reconstruction |
| Node.js stack, production | Any | Shell to tabula-java or call a Python microservice |
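For pipelines that pick a tool automatically, the table translates into a lookup. The key names below ("digital-ruled", "node-free", and so on) are labels invented for this sketch; the recommendations are copied from the table.

```python
RULES = {
    ("digital-ruled", "any"): "Tabula lattice or Camelot lattice",
    ("digital-whitespace", "any"): "Tabula stream or pdfplumber",
    ("digital-irregular", "any"): "pdfplumber with tuned table_settings",
    ("scanned", "low"): "Tesseract + manual review",
    ("scanned", "high"): "Adobe Extract or Microsoft Document Intelligence",
    ("mixed", "any"): "Tabula first, route failures to paid API",
    ("node-free", "low"): "pdfjs-dist + custom reconstruction",
    ("node-production", "any"): "Shell to tabula-java or call a Python microservice",
}


def recommend(input_kind: str, volume: str = "low") -> str:
    """Look up the recommendation for an input kind, falling back to the 'any' volume row."""
    for key in ((input_kind, volume), (input_kind, "any")):
        if key in RULES:
            return RULES[key]
    raise ValueError(f"no rule for input={input_kind!r}, volume={volume!r}")
```

Combinations the table does not cover, such as a free Node.js path at high volume, raise an error rather than guessing.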
Frequently asked questions
Can I convert PDF to Excel for free?
Yes. For digital PDFs with clear tables, free Python libraries like Tabula and pdfplumber convert tables to Excel in a few lines of code. For one-shot conversions in the browser, free online tools extract the text layer and let you paste it into Excel, then split into columns. Both paths work without an account or paid API.
Is Tabula better than Adobe Extract API?
Tabula is free and works well for digital PDFs with ruled or whitespace-aligned tables. Adobe Extract API costs around 0.05 USD per page but handles scanned PDFs, multi-column layouts, and inferred cell structure that Tabula misses. Tabula is the right tool for clean digital tables; Adobe Extract is worth the cost for messy or scanned input.
Why does my PDF to Excel conversion mess up cells?
PDFs do not store tables as tables. They store positioned glyphs on a page, and the extractor guesses cell boundaries from whitespace, ruling lines, or column alignment. Merged cells, multi-line cells, headers that span rows, and decorative borders confuse the heuristics. Switching between lattice mode (ruled tables) and stream mode (whitespace tables) fixes most issues; the rest needs manual layout hints.
How do I convert a scanned PDF to Excel?
Scanned PDFs are images, so the text layer is empty and standard extractors return nothing. Run OCR first to generate a text layer, then extract the table. Tesseract is the free open-source option; Adobe Extract API, Microsoft Document Intelligence, and Mistral OCR are paid services with much higher table accuracy on scanned input.
Can I batch-convert hundreds of PDFs to Excel?
Yes. A short Python script with tabula-py or pdfplumber loops over a folder, extracts each PDF, and writes one XLSX per file (or one combined workbook with one sheet per PDF). On modest hardware, a few hundred single-page PDFs convert in under a minute.
Is there a Node.js library for PDF to Excel?
Native Node.js libraries for table extraction are weaker than their Python counterparts. Two practical paths: call tabula-java from Node via child_process, or use pdfjs-dist to read positioned text items and reconstruct cells with custom logic. For production workloads, calling a Python microservice or a paid API is usually faster than writing the reconstruction code yourself.
What is the best free PDF to Excel converter for Mac and Windows?
For a no-install path, upload your PDF to a browser-based extractor and paste the result into Excel. For local conversion, install Python and run tabula-py or pdfplumber; both work on macOS, Windows, and Linux. Excel for Microsoft 365 on Windows can also import PDF tables through Data > Get Data > From File > From PDF, which uses Power Query; other Excel editions may not include this option.
How do I convert PDF bank statements to Excel?
Bank statements are usually digital PDFs with a consistent table layout, which is the easiest case for tabula-py or pdfplumber. Use stream mode if the table has no visible ruling lines, lattice mode if it does. The common pitfall is negative amounts in parentheses, which pandas reads as strings; convert them with a small post-processing step.
Can I convert PDF tables to Google Sheets directly?
Not natively. Google Sheets has no PDF import. The reliable path is to extract the table to CSV first (Tabula, pdfplumber, or a browser tool), then import the CSV into Sheets via File then Import. Sheets also reads XLSX files generated by pandas.to_excel.
Does PDF4.dev offer a PDF to Excel API?
No. PDF4.dev renders PDFs from HTML templates, so the API runs in the generation direction (data to PDF), not extraction (PDF to data). For PDF-to-Excel at scale, use tabula-py or pdfplumber for digital PDFs and Adobe Extract API or Microsoft Document Intelligence for scanned or complex layouts.
Summary
PDF to Excel is a layout reconstruction problem, not a deterministic conversion. For digital PDFs, Tabula and pdfplumber cover most cases for free, with Camelot as a strong alternative for ruled tables. Choose lattice mode for ruled tables, stream mode for whitespace-aligned tables, and tune table_settings in pdfplumber when both fail. For scanned PDFs, run OCR with Tesseract (free) or use a paid, table-aware service such as Adobe Extract API or Mistral OCR. Node.js stacks shell out to tabula-java or call a Python microservice; native JavaScript libraries are not yet competitive for this task.
PDF4.dev's free PDF to text tool handles quick text-layer extraction in the browser. For the inverse direction, generating PDFs from structured data, PDF4.dev's render API renders HTML templates to PDFs in milliseconds, with a full text layer that any extractor can read back later.