
How to convert PDF to Excel (free, programmatic, accurate)

How to convert PDF tables to Excel: best free tools, when client-side extraction works, when you need OCR, and how to automate it from Node.js or Python.

benoitded · May 24, 2026 · 18 min read


PDF to Excel conversion works in two ways: text-layer extraction reads a digital PDF's positioned glyphs and reconstructs cells (fast, free, accurate for ~80% of cases), and OCR converts scanned page images to text first, then extracts the table. The three most reliable free tools are Tabula, pdfplumber, and Camelot. For most digital PDFs, this one-liner gets the job done:

python -c "import tabula; tabula.convert_into('input.pdf', 'output.csv', output_format='csv', pages='all')"

This guide covers the free online path, Python and Node.js code that works on real data, when to reach for a paid API like Adobe Extract, and how to handle scanned PDFs.

The free way: convert PDF to Excel online in 30 seconds

For a single PDF with a clean table, the fastest free path is text extraction in the browser:

1. Open pdf4.dev/tools/pdf-to-text
2. Drop your PDF onto the upload area
3. Click Extract text
4. Copy the output, paste it into Excel column A, then use Data > Text to Columns with the appropriate delimiter (tab, space, or comma)
5. Save the file as XLSX

The extraction runs in your browser with PDF.js, so the file never reaches a server. This path is best for one-off conversions; for repeat work or scanned PDFs, jump to the programmatic sections below.

Why "PDF to Excel" is harder than "PDF to text"

A PDF file does not store tables as tables. It stores a list of glyphs (single characters), each placed at an absolute x/y coordinate on a page. There is no row index, no column index, no cell concept. The PDF spec, ISO 32000, only defines how to draw graphics and text; it does not define document structure beyond optional tagging.

Excel needs the opposite: a structured grid of rows and columns with typed values. Converting from PDF to Excel is therefore a layout reconstruction problem. The extractor reads glyph positions, infers where columns start and end based on whitespace gaps or ruling lines, groups glyphs into cells, and then groups cells into rows.

Two extractors run on the same PDF can produce different cell boundaries. That is not a bug. It reflects different heuristics for the same ambiguous input. The practical consequence: expect to fix a few cells by hand on complex tables, and pick the tool whose defaults match your input.
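To make the reconstruction steps concrete, here is a minimal sketch in plain Python. It assumes each word has already been extracted with its left edge, right edge, and baseline; the Word type, the tolerance values, and the sample coordinates are illustrative assumptions, not what Tabula or pdfplumber use internally.

```python
from typing import NamedTuple

class Word(NamedTuple):
    text: str
    x0: float  # left edge, in PDF points
    x1: float  # right edge
    y: float   # baseline

def words_to_rows(words, y_tol=2.0, gap=10.0):
    """Group words into rows by y, then into cells by horizontal gaps."""
    rows = []
    # PDF y grows upward, so descending y reads top to bottom
    for w in sorted(words, key=lambda w: (-w.y, w.x0)):
        if rows and abs(rows[-1][0].y - w.y) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])

    table = []
    for row in rows:
        row.sort(key=lambda w: w.x0)
        cells, current = [], [row[0]]
        for prev, w in zip(row, row[1:]):
            if w.x0 - prev.x1 > gap:   # wide gap: start a new cell
                cells.append(" ".join(c.text for c in current))
                current = [w]
            else:                      # narrow gap: same cell
                current.append(w)
        cells.append(" ".join(c.text for c in current))
        table.append(cells)
    return table

words = [
    Word("Item", 50, 75, 700), Word("Qty", 200, 220, 700), Word("Price", 300, 330, 700),
    Word("Blue", 50, 72, 680), Word("widget", 75, 110, 680),
    Word("2", 200, 206, 680), Word("9.50", 300, 325, 680),
]
print(words_to_rows(words))
```

Changing gap from 10 to 2 would split "Blue widget" into two cells, which is exactly the kind of heuristic difference that makes two extractors disagree on the same page.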

Digital PDF vs scanned PDF. Digital PDFs (made by Word, InDesign, Playwright, or any programmatic generator) contain a text layer with real glyph data. Scanned PDFs are bitmap images of pages with no text layer. Standard extractors return nothing from scanned PDFs; you need OCR first. To check: try selecting text in any PDF viewer. If nothing highlights, the PDF is scanned.
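The same check can be automated before choosing an extraction path. The sketch below assumes pdfplumber is installed for the file-reading helper; the function names and the 20-character threshold are arbitrary choices for illustration.

```python
def looks_scanned(page_texts, min_chars=20):
    """Return True when every page yields almost no extractable text."""
    return all(len((t or "").strip()) < min_chars for t in page_texts)

def pdf_page_texts(pdf_path):
    # Imported lazily so looks_scanned() can be used without pdfplumber
    import pdfplumber
    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() for page in pdf.pages]

# Usage (assumes the file exists):
# if looks_scanned(pdf_page_texts("statement.pdf")):
#     print("No text layer: run OCR first")
```

A PDF that mixes digital and scanned pages will pass this check, so a stricter pipeline might test each page separately.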

Three extraction paths, ranked by accuracy

The right tool depends on the input. Most production pipelines mix two paths to cover the range of PDFs they receive.

Path	Best for	Free tools	Paid tools	Accuracy
Text-layer + heuristic table detector	Digital PDFs with ruled or aligned tables	Tabula, pdfplumber, Camelot	None needed	High
Layout-aware ML	Scanned or complex multi-column PDFs	Unstructured.io (open core)	Adobe Extract API, Microsoft Document Intelligence	Very high
Pure OCR + post-processing	Scanned PDFs as fallback	Tesseract, PaddleOCR	Mistral OCR, AWS Textract	Medium

If you control the input format and the PDFs are digital, start with text-layer extraction. If you receive scanned bank statements, invoices, or third-party documents, a paid layout-aware API saves more engineering hours than it costs.

How to convert PDF to Excel in Python (free)

Python has the strongest open-source ecosystem for PDF table extraction. Three libraries cover the common cases: tabula-py for fast extraction of well-formed tables, pdfplumber when you need finer control or the table is irregular, and Camelot for ruled tables where you want a quality score per table.

Option 1: tabula-py (fastest, simplest)

tabula-py is a thin Python wrapper around tabula-java, the most mature open-source table extractor. It requires a Java runtime on the system.

pip install tabula-py pandas openpyxl
# Java 8 or higher must also be installed

import tabula
import pandas as pd
 
# Read all tables from a PDF, one DataFrame per table
tables = tabula.read_pdf("invoice.pdf", pages="all")
print(f"Found {len(tables)} table(s)")
 
# Write each table as a separate sheet in one XLSX file
with pd.ExcelWriter("output.xlsx", engine="openpyxl") as writer:
    for i, df in enumerate(tables, start=1):
        df.to_excel(writer, sheet_name=f"Table_{i}", index=False)
 
print("Saved to output.xlsx")

For tables with visible ruling lines, switch to lattice mode. For tables that rely on whitespace alignment, use stream mode:

import tabula
 
# Lattice: use when every cell has a visible border
tables_lattice = tabula.read_pdf(
    "ruled_table.pdf",
    pages="all",
    lattice=True,
)
 
# Stream: use when columns are aligned by whitespace only
tables_stream = tabula.read_pdf(
    "report.pdf",
    pages="all",
    stream=True,
)
 
# Specify a region (top, left, bottom, right) in PDF points
tables_area = tabula.read_pdf(
    "invoice.pdf",
    pages=1,
    area=[200, 50, 750, 550],
    stream=True,
)

When read_pdf returns an empty list, the heuristic failed. Switch modes (lattice to stream or vice versa), pass guess=False with an explicit area, or fall back to pdfplumber.
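One way to encode that fallback order is a small helper that tries each strategy in turn. This is a sketch, not part of tabula-py: first_nonempty is a hypothetical name, and the lambdas in the demo are stand-ins for real extraction calls.

```python
def first_nonempty(extractors):
    """Run (name, callable) pairs in order; return the first non-empty result."""
    for name, extract in extractors:
        tables = extract()
        if tables:
            return name, tables
    return None, []

# With tabula-py installed, the pairs could look like this:
# ("lattice", lambda: tabula.read_pdf("report.pdf", pages="all", lattice=True)),
# ("stream",  lambda: tabula.read_pdf("report.pdf", pages="all", stream=True)),

result = first_nonempty([
    ("lattice", lambda: []),            # stand-in: lattice pass found nothing
    ("stream", lambda: [["a", "b"]]),   # stand-in: stream pass found one table
])
print(result)
```

Returning the strategy name alongside the tables makes it easy to log which mode worked for each file.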

Option 2: pdfplumber (most flexible)

pdfplumber wraps pdfminer.six and exposes the underlying page geometry. It is slower than Tabula but handles tables that Tabula misses, especially ones without consistent column spacing.

pip install pdfplumber pandas openpyxl

import pdfplumber
import pandas as pd
 
rows = []
with pdfplumber.open("statement.pdf") as pdf:
    for page in pdf.pages:
        # extract_tables returns a list of lists of lists
        for table in page.extract_tables():
            for row in table:
                rows.append(row)
 
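# Use the first row as headers (assumes one logical table with a single header row)
if not rows:
    raise SystemExit("No tables found in statement.pdf")
df = pd.DataFrame(rows[1:], columns=rows[0])
df.to_excel("statement.xlsx", index=False)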

For irregular tables, pdfplumber exposes extract_table options to tune the detection. The relevant settings are documented in the pdfplumber table extraction guide:

import pdfplumber
 
with pdfplumber.open("complex.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table(
        table_settings={
            "vertical_strategy": "text",     # or "lines"
            "horizontal_strategy": "text",   # or "lines"
            "snap_tolerance": 3,
            "join_tolerance": 3,
            "edge_min_length": 3,
        }
    )
 
# extract_table returns None when no table is detected
for row in table or []:
    print(row)

The vertical_strategy and horizontal_strategy options switch between lattice-style ("lines") and stream-style ("text") detection independently per axis, which is more flexible than Tabula's all-or-nothing flag.

Option 3: Camelot (best for ruled tables)

Camelot ships two algorithms: lattice (the same idea as Tabula's lattice) and stream. Each table exposes a pandas DataFrame, as with Tabula, plus accuracy and whitespace scores that help you decide whether to fall back.

pip install "camelot-py[base]" pandas openpyxl
# Also requires Ghostscript and OpenCV system packages

import camelot
 
# Lattice mode for ruled tables
tables = camelot.read_pdf("ruled.pdf", pages="all", flavor="lattice")
print(f"Total tables extracted: {tables.n}")
 
# Inspect the accuracy score for each table
for i, t in enumerate(tables):
    print(f"Table {i + 1}: accuracy={t.accuracy:.1f}%, whitespace={t.whitespace:.1f}%")
 
# Export all tables to one XLSX with one sheet per table
tables.export("output.xlsx", f="excel")

The accuracy score is a useful trigger for automated pipelines: if it drops below a threshold (say, 80), retry with the other flavor or send the file to manual review.
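A hedged sketch of that trigger: it assumes you have already collected the accuracy of every table per flavor (for example from t.accuracy in the loop above). The function name, the "manual_review" label, and the sample scores are made up for illustration.

```python
from statistics import mean

def choose_flavor(scores_by_flavor, threshold=80.0):
    """Pick the flavor with the best mean accuracy, or flag the file for review."""
    means = {
        flavor: mean(scores)
        for flavor, scores in scores_by_flavor.items()
        if scores  # a flavor that found no tables has no score
    }
    if not means:
        return "manual_review"
    best = max(means, key=means.get)
    return best if means[best] >= threshold else "manual_review"

print(choose_flavor({"lattice": [41.0, 55.5], "stream": [93.2, 88.1]}))
```

Averaging per file is only one option; a stricter pipeline could use the minimum score so that a single bad table sends the whole file to review.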

Batch convert a folder of PDFs

import os
import tabula
import pandas as pd
 
input_dir = "./pdfs"
output_dir = "./xlsx"
os.makedirs(output_dir, exist_ok=True)
 
for filename in os.listdir(input_dir):
    # Match the extension case-insensitively (.pdf, .PDF)
    stem, ext = os.path.splitext(filename)
    if ext.lower() != ".pdf":
        continue
 
    pdf_path = os.path.join(input_dir, filename)
    xlsx_path = os.path.join(output_dir, stem + ".xlsx")
 
    try:
        tables = tabula.read_pdf(pdf_path, pages="all")
        if not tables:
            print(f"No tables found in {filename}")
            continue
 
        with pd.ExcelWriter(xlsx_path, engine="openpyxl") as writer:
            for i, df in enumerate(tables, start=1):
                df.to_excel(writer, sheet_name=f"Table_{i}", index=False)
        print(f"Converted: {filename}")
    except Exception as e:
        print(f"Failed on {filename}: {e}")

This pattern handles a few hundred PDFs on a laptop in well under a minute. For larger batches, parallelize with concurrent.futures.ProcessPoolExecutor.
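A parallel version might look like the sketch below. It assumes tabula-py, pandas, and openpyxl are installed in the worker processes; the helper names are hypothetical, and the worker returns an error message instead of raising so that one bad file does not stop the batch.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def xlsx_path_for(pdf_path, output_dir):
    """Map ./pdfs/report.pdf to <output_dir>/report.xlsx."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(output_dir, stem + ".xlsx")

def convert_one(job):
    """Worker: convert one PDF. Returns (pdf_path, error message or None)."""
    pdf_path, xlsx_path = job
    try:
        # Imported inside the worker so each process loads the libraries itself
        import pandas as pd
        import tabula
        tables = tabula.read_pdf(pdf_path, pages="all")
        if not tables:
            return pdf_path, "no tables found"
        with pd.ExcelWriter(xlsx_path, engine="openpyxl") as writer:
            for i, df in enumerate(tables, start=1):
                df.to_excel(writer, sheet_name=f"Table_{i}", index=False)
        return pdf_path, None
    except Exception as e:
        return pdf_path, str(e)

def convert_folder(input_dir, output_dir, max_workers=None):
    os.makedirs(output_dir, exist_ok=True)
    jobs = [
        (os.path.join(input_dir, name), xlsx_path_for(name, output_dir))
        for name in sorted(os.listdir(input_dir))
        if name.lower().endswith(".pdf")
    ]
    if not jobs:
        return []
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(convert_one, jobs))
```

Call convert_folder("./pdfs", "./xlsx") from inside an if __name__ == "__main__": guard, since process pools re-import the main module on Windows and macOS.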

How to convert PDF to Excel in Node.js (free)

Native Node.js table extraction is weaker than Python's. There is no tabula-py equivalent in pure JavaScript. Two practical paths:

Path 1: call tabula-java via child_process

The most reliable Node.js option is to shell out to the original tabula-java JAR. This treats Tabula as a CLI binary, which it supports first-class.

import { spawn } from "node:child_process";
 
async function pdfToCsv(pdfPath: string, csvPath: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const proc = spawn("java", [
      "-jar",
      "./tabula.jar",
      "--pages",
      "all",
      "--format",
      "CSV",
      "--outfile",
      csvPath,
      pdfPath,
    ]);
 
    let stderr = "";
    proc.stderr.on("data", (data) => {
      stderr += data.toString();
    });
 
    // Fires when the java binary is missing or cannot be started
    proc.on("error", reject);
 
    proc.on("close", (code) => {
      if (code === 0) resolve();
      else reject(new Error(`tabula exited ${code}: ${stderr}`));
    });
  });
}
 
await pdfToCsv("./invoice.pdf", "./invoice.csv");

Download the JAR from the Tabula releases page. Once you have a CSV, convert it to XLSX with exceljs or xlsx from npm.

Path 2: pdfjs-dist + custom column detection

For pure-Node workflows without a JVM, use pdfjs-dist to read positioned text items, then group them into rows and columns yourself. This is more code but no system dependencies.

import * as pdfjsLib from "pdfjs-dist";
import { promises as fs } from "node:fs";
 
pdfjsLib.GlobalWorkerOptions.workerSrc = "";
 
interface PositionedItem {
  text: string;
  x: number;
  y: number;
}
 
async function extractRows(
  pdfPath: string,
  pageNum = 1,
  yTolerance = 2,
): Promise<string[][]> {
  // pdf.js expects a Uint8Array rather than a Node Buffer
  const data = new Uint8Array(await fs.readFile(pdfPath));
  const pdf = await pdfjsLib.getDocument({ data }).promise;
  const page = await pdf.getPage(pageNum);
  const content = await page.getTextContent();
 
  const items: PositionedItem[] = content.items
    .filter((it) => "str" in it && it.str.trim().length > 0)
    .map((it) => ({
      text: (it as { str: string }).str.trim(),
      x: (it as { transform: number[] }).transform[4],
      y: (it as { transform: number[] }).transform[5],
    }));
 
  // Group items into rows by y coordinate
  items.sort((a, b) => b.y - a.y);
  const rows: PositionedItem[][] = [];
  for (const it of items) {
    const last = rows[rows.length - 1];
    if (last && Math.abs(last[0].y - it.y) <= yTolerance) {
      last.push(it);
    } else {
      rows.push([it]);
    }
  }
 
  // Sort each row left to right and project to plain text
  return rows.map((row) =>
    row.sort((a, b) => a.x - b.x).map((it) => it.text),
  );
}
 
const rows = await extractRows("./invoice.pdf");
console.table(rows);

This reconstructs rows correctly but does not infer column boundaries: every row is a flat list of text fragments. Adding real column detection (clustering x positions across all rows) is another 30 lines. For production, the tabula-java path is faster to ship.
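The column-clustering idea is language-agnostic, so here it is as a short Python sketch to keep it next to the other examples. It assumes each row is a list of (x, text) pairs where x is the left edge of the fragment; the 15-point tolerance and the sample data are illustrative, and the logic ports to TypeScript almost line for line.

```python
def cluster_columns(rows, x_tol=15.0):
    """rows: list of rows, each a list of (x, text). Returns a rectangular grid."""
    # 1. Collect every x position and merge nearby ones into column groups
    xs = sorted(x for row in rows for x, _ in row)
    groups = []
    for x in xs:
        if groups and x - groups[-1][-1] <= x_tol:
            groups[-1].append(x)
        else:
            groups.append([x])
    centers = [sum(g) / len(g) for g in groups]

    # 2. Assign each fragment to the nearest column center
    grid = []
    for row in rows:
        cells = [""] * len(centers)
        for x, text in sorted(row):
            col = min(range(len(centers)), key=lambda i: abs(centers[i] - x))
            cells[col] = (cells[col] + " " + text).strip()
        grid.append(cells)
    return grid

rows = [
    [(50, "Item"), (200, "Qty"), (300, "Price")],
    [(52, "Widget"), (301, "9.50")],   # the Qty cell is empty in this row
]
print(cluster_columns(rows))
```

Because the clustering uses left edges, right-aligned numeric columns cluster less cleanly; using the right edge or the midpoint for those columns is a common adjustment.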

Node.js vs Python: honest comparison

Concern	Python	Node.js
Free library quality	Excellent (Tabula, pdfplumber, Camelot)	Weak (no equivalent)
Setup friction	pip install + Java for Tabula	Java for Tabula, or write reconstruction code
Speed per page	Fast	Comparable when shelling to Java
Production maintenance	Mature ecosystem	More custom code

If your stack is mostly Node.js but PDF-to-Excel is a critical path, the cleanest answer is a small Python microservice that the Node app calls over HTTP. The alternative is a paid API.

When to use a paid API (and which one)

Free libraries cover most digital PDFs. Paid APIs earn their cost on scanned PDFs, complex multi-column layouts, and pipelines where engineering time matters more than per-page cost.

API	Pricing	Strength	Weakness
Adobe Extract API	~0.05 USD per page	Industry-leading layout reconstruction, structured JSON output	Per-page cost adds up at scale
Microsoft Document Intelligence (Layout)	~0.01 USD per page	Excellent on scanned + handwritten input, prebuilt models for invoices and receipts	Azure setup overhead
Aspose.PDF Cloud	Tiered (credits)	Wide format support, table extraction with cell types	Older API ergonomics
PDF.co	~0.005-0.02 USD per page	Cheapest paid option, REST API	Lower accuracy on edge cases
ConvertAPI	~0.0025 USD per page	Direct PDF to XLSX endpoint	Quality varies by input

Rule of thumb. Under 100 PDFs per day with clean digital tables: stay free. Over 1,000 PDFs per day or scanned input: a paid API saves more engineering hours than it costs. Mixed input where 80% are digital and 20% scanned: extract with Tabula first, route failures to a paid API.
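To put numbers on the mixed-input case, a rough cost estimate is simple arithmetic. The sketch below uses the approximate Adobe Extract price from the table above; the volume, the three pages per PDF, and the 30-day month are assumptions for illustration.

```python
def monthly_api_cost(pdfs_per_day, pages_per_pdf, paid_share, usd_per_page, days=30):
    """Estimate the monthly bill when only a share of pages goes to a paid API."""
    paid_pages = pdfs_per_day * pages_per_pdf * paid_share * days
    return round(paid_pages * usd_per_page, 2)

# 1,000 PDFs per day, 3 pages each, 20% routed to an API at ~0.05 USD per page
print(monthly_api_cost(1000, 3, 0.20, 0.05))
```

With these inputs, 18,000 pages per month reach the paid API, which comes to about 900 USD; routing every page instead would cost five times as much.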

PDF4.dev does not currently offer a /extract endpoint. The product runs in the generation direction (data to PDF). For PDF-to-Excel at scale, the Python libraries above and Adobe Extract API are the right tools.

OCR for scanned PDFs (when text-layer is empty)

If pdfplumber or Tabula returns nothing on a PDF that visually contains a table, the text layer is empty. Confirm by trying to select text in any viewer; if nothing highlights, the PDF is a scan. The fix is OCR: convert the page image to text, then extract the table.

Free: Tesseract

Tesseract is the standard open-source OCR engine. It does not detect tables; by default it returns a flat text stream, although it can also report a bounding box for each word. Combine it with pdfplumber-style position reconstruction or with a layout-aware wrapper like unstructured.io.

import fitz  # PyMuPDF
import pytesseract
from PIL import Image
import io
 
def ocr_pdf_page(pdf_path: str, page_num: int = 0, dpi: int = 300) -> str:
    doc = fitz.open(pdf_path)
    page = doc[page_num]
    pix = page.get_pixmap(dpi=dpi)
    img = Image.open(io.BytesIO(pix.tobytes("png")))
    return pytesseract.image_to_string(img)
 
text = ocr_pdf_page("scanned.pdf", page_num=0)
print(text)

Run OCR at 300 DPI or higher. Lower resolution drops accuracy fast. The Tesseract input format docs confirm 300 DPI as the practical floor for character recognition.
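For position reconstruction on OCR output, pytesseract.image_to_data with output_type=pytesseract.Output.DICT returns parallel lists of words and pixel coordinates. The sketch below groups such a dictionary into rows; the function name, the tolerance, and the sample values are assumptions, and the sample dict only mimics the keys used here.

```python
def rows_from_ocr(data, y_tol=10, min_conf=0):
    """Group OCR words into text rows using their pixel positions."""
    words = []
    for text, left, top, conf in zip(data["text"], data["left"], data["top"], data["conf"]):
        # Non-word entries come back with empty text and a confidence of -1
        if text.strip() and float(conf) >= min_conf:
            words.append((top, left, text.strip()))
    words.sort()

    rows = []
    for word in words:
        if rows and abs(rows[-1][0][0] - word[0]) <= y_tol:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Order each row left to right and keep only the text
    return [[w[2] for w in sorted(row, key=lambda w: w[1])] for row in rows]

# With Tesseract installed, the input would come from:
# data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
sample = {
    "text": ["", "Date", "Amount", "01/02", "12.50"],
    "left": [0, 40, 400, 40, 402],
    "top": [0, 100, 102, 160, 158],
    "conf": [-1, 96, 95, 93, 91],
}
print(rows_from_ocr(sample))
```

Image y coordinates grow downward, unlike PDF coordinates, so ascending order already reads top to bottom.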

Paid: Mistral OCR, AWS Textract

Mistral OCR, announced in March 2025, is a general-purpose OCR model with strong table-aware output. It returns structured Markdown with tables preserved as Markdown tables, which is much easier to convert to Excel than a flat text stream.
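Turning a Markdown table into rows takes only the standard library. This sketch assumes a simple pipe table with no escaped pipes inside cells; the sample Markdown is invented, not real API output.

```python
import csv
import io

def markdown_table_to_rows(markdown):
    """Parse pipe-delimited Markdown table lines into a list of rows."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        inner = line[1:-1] if line.endswith("|") else line[1:]
        cells = [c.strip() for c in inner.split("|")]
        # Skip the |---|---:| separator under the header
        if all("-" in c and set(c) <= set("-:") for c in cells):
            continue
        rows.append(cells)
    return rows

def rows_to_csv(rows):
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

md = """
| Date  | Amount |
|-------|-------:|
| 01/02 | 12.50  |
| 02/02 |        |
"""
table = markdown_table_to_rows(md)
print(rows_to_csv(table))
```

The resulting CSV opens directly in Excel, or the rows can go into pandas.DataFrame(table[1:], columns=table[0]) and then to_excel.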

AWS Textract and Microsoft Document Intelligence offer dedicated table extraction modes that return cell-level JSON with row and column indices. For scanned bank statements and invoices at scale, these services beat Tesseract by a wide margin.
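Cell-level output maps onto a grid almost mechanically. The sketch below uses a simplified, hypothetical cell shape (1-based row and col plus text) rather than either vendor's actual response format, so a real integration would first flatten the API response into this shape.

```python
def cells_to_grid(cells):
    """cells: dicts with 1-based 'row' and 'col' plus 'text'. Returns a list of rows."""
    if not cells:
        return []
    n_rows = max(c["row"] for c in cells)
    n_cols = max(c["col"] for c in cells)
    grid = [[""] * n_cols for _ in range(n_rows)]
    for c in cells:
        grid[c["row"] - 1][c["col"] - 1] = c["text"]
    return grid

cells = [
    {"row": 1, "col": 1, "text": "Date"},
    {"row": 1, "col": 2, "text": "Amount"},
    {"row": 2, "col": 2, "text": "12.50"},   # row 2, col 1 is missing
]
print(cells_to_grid(cells))
```

Missing cells become empty strings, so every row has the same width and the grid can be passed to a DataFrame without padding.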

Common pitfalls

Real-world PDFs break in predictable ways. Plan for each.

Problem	What happens	Fix
Merged cells	Two rows collapse into one	Detect by row count mismatch; un-merge by duplicating value
Multi-line cells	One logical cell wraps across two y positions	Increase pdfplumber `join_tolerance` or merge rows by detecting empty cells
Headers spanning multiple rows	Header gets parsed as data	Skip the first N rows on read, set columns manually
Negative numbers in parentheses	`(1,234.50)` reads as string	Post-process with regex: `df["amt"] = df["amt"].str.replace(r"^\((.*)\)$", r"-\1", regex=True)`
Currency symbols	`$1,234.50` reads as string	Strip non-numeric characters before casting to float
Locale decimal separators	`1.234,56` (European) read incorrectly	Pass `decimal=","` and `thousands="."` to `pd.read_csv`
Dates parsed as strings	`"03/05/2026"` stays a string	Use `pd.to_datetime(df["date"], format="%d/%m/%Y")`
Scientific notation	`1.23E+05` is kept as text or shown in exponent form	Cast with `pd.to_numeric`, then set the Excel cell format to "Number" after writing

A short post-processing layer in pandas catches most of these in one pass:

import pandas as pd
import re
 
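def clean_amount(value):
    if pd.isna(value) or not isinstance(value, str):
        return value
    s = value.strip()
    # Negative in parentheses, e.g. "(1,234.50)"
    neg = bool(re.match(r"^\(.*\)$", s))
    s = re.sub(r"[^\d.,-]", "", s.strip("()"))
    # The separator that appears last is the decimal mark:
    # "1.234,56" is European, "1,234.56" is US
    if s.rfind(",") > s.rfind("."):
        if "." in s or not re.fullmatch(r"-?\d{1,3}(,\d{3})+", s):
            s = s.replace(".", "").replace(",", ".")
        else:
            s = s.replace(",", "")  # comma-only thousands, e.g. "1,234"
    else:
        s = s.replace(",", "")
    try:
        return -float(s) if neg else float(s)
    except ValueError:
        return None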
 
df["amount"] = df["amount"].apply(clean_amount)
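The multi-line cell pitfall can be handled the same way, before the rows reach pandas. This sketch treats any row whose key column is empty as a continuation of the row above; the function name, the choice of key column, and the sample rows are assumptions for illustration.

```python
def merge_continuation_rows(rows, key_col=0):
    """Fold rows with an empty key column into the previous row, cell by cell."""
    merged = []
    for row in rows:
        is_continuation = bool(merged) and not (row[key_col] or "").strip()
        if is_continuation:
            prev = merged[-1]
            merged[-1] = [
                " ".join(p.strip() for p in (a, b) if p and p.strip())
                for a, b in zip(prev, row)
            ]
        else:
            merged.append(list(row))
    return merged

rows = [
    ["01/02", "Card payment", "12.50"],
    [None, "ACME Store", ""],          # wrapped second line of the description
    ["02/02", "Transfer", "100.00"],
]
print(merge_continuation_rows(rows))
```

This works for statements where every real row starts with a date or an ID; tables with legitimately empty first cells need a different signal.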

Choosing the right tool: decision tree

The shortest decision path:

Input	Volume	Recommendation
Digital PDF, ruled table	Any	Tabula lattice or Camelot lattice
Digital PDF, whitespace table	Any	Tabula stream or pdfplumber
Digital PDF, irregular layout	Any	pdfplumber with tuned `table_settings`
Scanned PDF	Low	Tesseract + manual review
Scanned PDF	High	Adobe Extract or Microsoft Document Intelligence
Mixed digital + scanned	Any	Tabula first, route failures to paid API
Node.js stack, free path	Low	`pdfjs-dist` + custom reconstruction
Node.js stack, production	Any	Shell to tabula-java or call a Python microservice

Frequently asked questions

Can I convert PDF to Excel for free?

Yes. For digital PDFs with clear tables, free Python libraries like Tabula and pdfplumber convert tables to Excel in a few lines of code. For one-shot conversions in the browser, free online tools extract the text layer and let you paste it into Excel, then split into columns. Both paths work without an account or paid API.

Is Tabula better than Adobe Extract API?

Tabula is free and works well for digital PDFs with ruled or whitespace-aligned tables. Adobe Extract API costs around 0.05 USD per page but handles scanned PDFs, multi-column layouts, and inferred cell structure that Tabula misses. Tabula is the right tool for clean digital tables; Adobe Extract is worth the cost for messy or scanned input.

Why does my PDF to Excel conversion mess up cells?

PDFs do not store tables as tables. They store positioned glyphs on a page, and the extractor guesses cell boundaries from whitespace, ruling lines, or column alignment. Merged cells, multi-line cells, headers that span rows, and decorative borders confuse the heuristics. Switching between lattice mode (ruled tables) and stream mode (whitespace tables) fixes most issues; the rest needs manual layout hints.

How do I convert a scanned PDF to Excel?

Scanned PDFs are images, so the text layer is empty and standard extractors return nothing. Run OCR first to generate a text layer, then extract the table. Tesseract is the free open-source option; Adobe Extract API, Microsoft Document Intelligence, and Mistral OCR are paid services with much higher table accuracy on scanned input.

Can I batch-convert hundreds of PDFs to Excel?

Yes. A short Python script with tabula-py or pdfplumber loops over a folder, extracts each PDF, and writes one XLSX per file (or one combined workbook with one sheet per PDF). On modest hardware, a few hundred single-page PDFs convert in under a minute.

Is there a Node.js library for PDF to Excel?

Native Node.js libraries for table extraction are weaker than their Python counterparts. Two practical paths: call tabula-java from Node via child_process, or use pdfjs-dist to read positioned text items and reconstruct cells with custom logic. For production workloads, calling a Python microservice or a paid API is usually faster than writing the reconstruction code yourself.

What is the best free PDF to Excel converter for Mac and Windows?

For a no-install path, upload your PDF to a browser-based extractor and paste the result into Excel. For local conversion, install Python and run tabula-py or pdfplumber; both work on macOS, Windows, and Linux. Excel for Microsoft 365 on Windows can also import PDF tables through Data > Get Data > From File > From PDF (Power Query), although that connector is not available in every Excel edition.

How do I convert PDF bank statements to Excel?

Bank statements are usually digital PDFs with a consistent table layout, which is the easiest case for tabula-py or pdfplumber. Use stream mode if the table has no visible ruling lines, lattice mode if it does. The common pitfall is negative amounts in parentheses, which pandas reads as strings; convert them with a small post-processing step.

Can I convert PDF tables to Google Sheets directly?

Not natively. Google Sheets has no PDF import. The reliable path is to extract the table to CSV first (Tabula, pdfplumber, or a browser tool), then import the CSV into Sheets via File then Import. Sheets also reads XLSX files generated by pandas.to_excel.

Does PDF4.dev offer a PDF to Excel API?

No. PDF4.dev renders PDFs from HTML templates, so the API runs in the generation direction (data to PDF), not extraction (PDF to data). For PDF-to-Excel at scale, use tabula-py or pdfplumber for digital PDFs and Adobe Extract API or Microsoft Document Intelligence for scanned or complex layouts.

Summary

PDF to Excel is a layout reconstruction problem, not a deterministic conversion. For digital PDFs, Tabula and pdfplumber cover most cases for free, with Camelot as a strong alternative for ruled tables. Choose lattice mode for ruled tables, stream mode for whitespace-aligned tables, and tune table_settings in pdfplumber when both fail. For scanned PDFs, run OCR with Tesseract (free) or Adobe Extract API and Mistral OCR (paid, table-aware). Node.js stacks shell out to tabula-java or call a Python microservice; native JavaScript libraries are not yet competitive for this task.

PDF4.dev's free PDF to text tool handles quick text-layer extraction in the browser. For the inverse direction, generating PDFs from structured data, PDF4.dev's render API renders HTML templates to PDFs in milliseconds, with a full text layer that any extractor can read back later.
