Benchmark data

Document extraction benchmarks

We tested 27 language models on real document extraction tasks so you do not have to guess which one to use. Below are model results by quality, cost, and speed, followed by a separate engine comparison for the document-processing layer.

LLM benchmark

Which model extracts best - and at what cost

March 2026 - 27 models, 4 document types

Key findings

DeepSeek V3 is the first open-source model to reach proprietary-level extraction quality, at 97.5%. Previously only proprietary models held this tier.

Around 10 open-source models score 94.4% quality. Several have free API tiers. The most expensive model in the set scores the same.

A 72x cost spread exists between the cheapest and most expensive models at comparable quality. Content-aware routing captures most of that gap automatically.

Multilingual gap: open-source models average 87.5% on Cyrillic and Arabic scripts versus 100% for proprietary models. Routing Cyrillic and Arabic documents to a proprietary model closes the gap without changing cost for the rest of the workload.

Tier 1 - Top quality (97.5-100%)

Model	Quality	Cost per run	Suite time	Type
GPT-4.1-mini	100%	$0.09	~990 s	Proprietary
Gemini 2.5 Flash	100%	-	Fast	Proprietary
Gemini 2.5 Pro	100%	-	Moderate	Proprietary
GPT-4.1	97.5%	$0.46	Moderate	Proprietary
DeepSeek V3	97.5%	Low	Moderate	Open-source

Tier 2 - Production quality (94.4%)

Model	Quality	Cost per run	Suite time	Type
Kimi K2 (Groq)	94.4%	$0.04	Fast	Open-source
Llama 3.3 70B (Groq)	94.4%	Free tier	Fastest OSS	Open-source
Qwen 3 235B (Groq)	94.4%	Free tier	Fast	Open-source
7 additional OSS models	94.4%	Free tier available	Varies	Open-source

Tier 3 - Speed-optimized

Model	Quality	Cost per run	Suite time	Type
Step 3.5 Flash	94.4%	Free tier	171 s (5.8x faster than GPT-4.1-mini)	Open-source
Groq-hosted models	94.4%	Free tier available	Hardware-accelerated	Open-source

Methodology

We tested 27 models on four real documents: a simple CSV, a 332-row financial table, a Cyrillic and Arabic invoice processed via OCR, and a complex multi-section investor PDF. Each run was scored on extraction quality (field accuracy against ground truth), API cost, and wall-clock time for the full suite.
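To make "field accuracy against ground truth" concrete, here is a minimal sketch of one way such a score could be computed. The normalization rule (case and whitespace folding), the field names, and the sample values are assumptions for illustration only; they are not the scoring code or data used in this benchmark.

```python
from typing import Mapping


def normalize(value: object) -> str:
    """Fold case and collapse whitespace so cosmetic differences are not counted as errors."""
    return " ".join(str(value).split()).casefold()


def field_accuracy(predicted: Mapping[str, object], truth: Mapping[str, object]) -> float:
    """Share of ground-truth fields whose extracted value matches after normalization."""
    if not truth:
        raise ValueError("ground truth must contain at least one field")
    correct = sum(
        1
        for name, expected in truth.items()
        if name in predicted and normalize(predicted[name]) == normalize(expected)
    )
    return correct / len(truth)


# Hypothetical invoice fields: three of four match, the currency is wrong.
truth = {"invoice_no": "INV-1042", "total": "1 250.00", "currency": "EUR", "vendor": "Acme GmbH"}
predicted = {"invoice_no": "inv-1042", "total": "1 250.00", "currency": "USD", "vendor": "Acme  GmbH"}

score = field_accuracy(predicted, truth)
print(f"{score:.1%}")  # prints 75.0%
```

Missing fields count as errors here because the denominator is the number of ground-truth fields. A stricter scorer might also penalize extra fields that the model invented.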

No single model is optimal for all documents. Content-aware routing - pre-scanning each document for type, language, and complexity before selecting a model - is how Datatera handles this in production.
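The pre-scan-then-route idea can be sketched in a few lines. This is not Datatera's production router: the `DocProfile` type, the thresholds, and the specific model assigned to each branch are hypothetical choices made for illustration, loosely following the tiers in the tables above.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class DocProfile:
    doc_type: str          # e.g. "csv", "table", "pdf"
    scripts: frozenset     # Unicode script prefixes seen, e.g. {"LATIN", "CYRILLIC"}
    complex_layout: bool


def detect_scripts(text: str) -> frozenset:
    """Collect the first word of each letter's Unicode name (LATIN, CYRILLIC, ARABIC, ...)."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return frozenset(scripts)


def prescan(text: str, doc_type: str, row_count: int = 0, section_count: int = 1) -> DocProfile:
    """Build a profile from a text sample. The thresholds are illustrative, not measured."""
    complex_layout = row_count > 100 or section_count > 3
    return DocProfile(doc_type, detect_scripts(text), complex_layout)


def choose_model(profile: DocProfile) -> str:
    """Hypothetical routing rules; the model names come from the benchmark tables."""
    if profile.scripts & {"CYRILLIC", "ARABIC"}:
        return "GPT-4.1-mini"            # proprietary tier for non-Latin scripts
    if profile.complex_layout:
        return "DeepSeek V3"             # higher-quality tier for large or multi-section documents
    return "Llama 3.3 70B (Groq)"        # low-cost tier for simple documents


print(choose_model(prescan("Счёт на оплату", "pdf")))
print(choose_model(prescan("id,name\n1,Ann", "csv")))
```

Detecting scripts from Unicode character names keeps the sketch dependency-free. A production pre-scan would more likely use a language-identification model and real layout signals.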

A note on model selection: These results reflect document extraction tasks specifically - structured field extraction from PDFs, tables, and scanned documents. General-purpose benchmarks measuring code, reasoning, or chat quality will show different rankings.

Engine benchmark

Document-processing engine comparison

April 2026 - 10 local engines measured, 19 engines rated

The engine layer sits below the LLM: it converts raw files to text before extraction begins. Engine choice affects OCR quality, table structure, multi-language coverage, and cost. This comparison covers 19 engines across local and cloud deployment.

Ratings reflect general tendencies, not absolute scores. Your results will vary based on document quality, language, layout complexity, and hardware. Test on your own data with the docfold compare command.

Rating scale: ★★★ Excellent / ★★☆ Good / ★☆☆ Basic / None = Not supported

Engine	Type	License	OCR	Tables	Multi-lang	Speed	Cost
Docling	Local	MIT	★★☆	★★★	★★★	Medium	Free
MinerU	Local	AGPL-3.0	★★★	★★★	★★☆	Slow	Free
Chandra OCR 2	Local / VLM	OpenRAIL-M	★★★	★★★	★★★	Slow	Free*
Marker	SaaS	Paid	★★★	★★★	★★★	Fast	~$1/1K pages
PyMuPDF	Local	AGPL-3.0	None	★☆☆	★★★	Ultra-fast	Free
PaddleOCR	Local	Apache-2.0	★★★	★★☆	★★★	Medium	Free
Tesseract	Local	Apache-2.0	★★☆	★☆☆	★★★	Medium	Free
EasyOCR	Local	Apache-2.0	★★★	None	★★★	Medium	Free
Unstructured	Local / SaaS	Apache-2.0	★★☆	★★☆	★★☆	Medium	Free / Paid API
LlamaParse	SaaS	Paid	★★★	★★★	★★☆	Fast	~$3/1K pages
LiteParse	Local	Apache-2.0	★★☆	★★☆	★★☆	Fast	Free
Mistral OCR	SaaS	Paid	★★★	★★★	★★★	Fast	~$1/1K pages
Zerox	VLM	MIT	★★★	★★☆	★★★	Slow	VLM API cost
Nougat	Local	MIT	★★☆	★★☆	★☆☆	Slow	Free
Surya	Local	GPL-3.0	★★★	★★☆	★★★	Medium	Free
AWS Textract	SaaS	Paid	★★★	★★★	★★☆	Fast	~$1.50/1K pages
Google Document AI	SaaS	Paid	★★★	★★★	★★★	Fast	~$1.50/1K pages
Azure Document Intelligence	SaaS	Paid	★★★	★★★	★★★	Fast	~$1.50/1K pages
Firecrawl	SaaS	Paid	★★☆	★★☆	★★★	Fast	~$1/1K pages

Measured results (10 local engines, CPU)

Benchmarked on 2026-04-05 using synthetic PDF documents with known ground truth. All engines ran on CPU; no GPU. Times include model loading on first document.

Engine	Avg time	Avg CER	Avg WER	Notes
PyMuPDF	4 ms	0.000	0.000	Perfect on digital PDFs. No OCR capability.
Unstructured	597 ms	0.036	0.135	Fast hybrid engine. Higher error on formatting.
Tesseract	1,190 ms	0.000	0.000	Perfect on clean digital PDFs. 100+ languages.
PaddleOCR	1,617 ms	0.002	0.013	Near-perfect accuracy. Best OCR at this speed.
Docling	2,601 ms	0.017	0.041	Best accuracy-speed tradeoff among ML engines.
MinerU	18,305 ms	0.012	0.080	Good CER among ML engines. Slow on CPU.
Surya	32,959 ms	0.014	0.027	Lowest WER among ML engines. Slow on CPU.
Marker Local	39,091 ms	0.019	0.094	Full document conversion with layout analysis.
EasyOCR	30,311 ms	0.000	0.000	Perfect on single-page documents. Hangs on multi-page documents on CPU.
Nougat	127,380 ms	1.000	1.000	Designed for academic papers only. Wrong domain for simple text.

CER = Character Error Rate. WER = Word Error Rate. Lower is better. 0.000 = perfect match.
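For readers who want to reproduce these metrics on their own documents, the sketch below uses the standard definitions: edit distance between extracted text and ground truth, divided by the ground-truth length in characters (CER) or words (WER). It assumes no text normalization beyond whitespace splitting for WER; the normalization used for the table above is not stated, so absolute numbers may differ.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate relative to the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate on whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "invoice total"
print(cer(reference, "invoice tota1"), wer(reference, "invoice tota1"))  # one wrong character, one wrong word of two
print(cer(reference, ""), wer(reference, ""))                            # empty output scores 1.0 on both
```

An engine that returns empty output scores exactly 1.000 on both metrics, which is how a wrong-domain result shows up in a table like the one above. Both rates can exceed 1.0 when the output is much longer than the reference.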

Key takeaways

PyMuPDF is unmatched for digital PDFs: perfect accuracy at 4 ms average. Use it as the first-pass engine for text-layer documents.

PaddleOCR is the strongest OCR engine at this speed tier: 0.002 CER at 1.6 s on CPU.

Docling offers the best balance of speed and accuracy among ML-powered engines.

ML engines (Surya, Marker, MinerU) are designed for GPU and can be 5-20x faster with CUDA. CPU times shown here are worst-case.

Nougat is purpose-built for arXiv academic papers. It produces empty output on ordinary documents - not a defect, just wrong domain.
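The first-pass takeaway above can be sketched as a simple fallback: read the text layer first and call an OCR engine only when the text layer is missing or nearly empty. The 20-character threshold, the function names, and the `ocr_engine` callable are assumptions for illustration. The PyMuPDF calls (`fitz.open`, `page.get_text`) are imported lazily, so the decision logic can be exercised without the library installed.

```python
from typing import Callable, Optional, Sequence


def has_text_layer(pages: Sequence[str], min_chars_per_page: int = 20) -> bool:
    """Treat a document as digital if every page yields enough non-whitespace text."""
    return bool(pages) and all(
        len("".join(page.split())) >= min_chars_per_page for page in pages
    )


def read_text_layer(path: str) -> list[str]:
    """Read the embedded text of each page with PyMuPDF."""
    import fitz  # PyMuPDF; imported here so the rest of the sketch runs without it

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]


def extract(
    path: str,
    text_layer_reader: Callable[[str], Sequence[str]] = read_text_layer,
    ocr_engine: Optional[Callable[[str], Sequence[str]]] = None,
) -> tuple[str, list[str]]:
    """Return (engine_used, page_texts), preferring the fast text-layer path."""
    pages = text_layer_reader(path)
    if has_text_layer(pages):
        return "text-layer", list(pages)
    if ocr_engine is None:
        raise RuntimeError("no usable text layer and no OCR engine configured")
    return "ocr", list(ocr_engine(path))


# Usage with stand-in readers (hypothetical; swap in real engines for actual files).
digital = extract("report.pdf", text_layer_reader=lambda p: ["Invoice 1042, total due 1250 EUR"])
scanned = extract(
    "scan.pdf",
    text_layer_reader=lambda p: [""],
    ocr_engine=lambda p: ["text recovered by OCR"],
)
print(digital[0], scanned[0])  # prints: text-layer ocr
```

Requiring every page to pass the threshold is deliberately conservative. Mixed documents with some scanned pages would need a per-page decision instead.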

Test on your own data: docfold on GitHub

Bring your documents. We will find the right stack for them.

Book a 15-min discovery call. We will map extraction quality, cost, and data-residency requirements for your workload.
