Benchmark data

Document extraction benchmarks

We tested 27 language models on real document extraction tasks so you do not have to guess which one to use. Below: model results by quality, cost, and speed - and a separate engine comparison for the document-processing layer.

LLM benchmark

Which model extracts best - and at what cost

March 2026 - 27 models, 4 document types

Key findings

DeepSeek V3 is the first open-source model to reach proprietary-tier extraction quality, scoring 97.5%. Previously only commercial models held this tier.

Around 10 open-source models score 94.4% quality. Several have free API tiers. The most expensive model in the set scores the same.

A 72x cost spread exists between the cheapest and most expensive models at comparable quality. Content-aware routing captures most of that gap automatically.

Multilingual gap: open-source models average 87.5% on Cyrillic and Arabic scripts versus 100% for proprietary models. Routing Cyrillic and Arabic documents to a proprietary model closes the gap without changing cost for the rest of the workload.

Tier 1 - Top quality (97.5-100%)

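| Model | Quality | Cost per run | Suite time | Type |
|---|---|---|---|---|
| GPT-4.1-mini | 100% | $0.09 | ~990 s | Proprietary |
| Gemini 2.5 Flash | 100% | - | Fast | Proprietary |
| Gemini 2.5 Pro | 100% | - | Moderate | Proprietary |
| GPT-4.1 | 97.5% | $0.46 | Moderate | Proprietary |
| DeepSeek V3 | 97.5% | Low | Moderate | Open-source |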

Tier 2 - Production quality (94.4%)

| Model | Quality | Cost per run | Suite time | Type |
|---|---|---|---|---|
| Kimi K2 (Groq) | 94.4% | $0.04 | Fast | Open-source |
| Llama 3.3 70B (Groq) | 94.4% | Free tier | Fastest OSS | Open-source |
| Qwen 3 235B (Groq) | 94.4% | Free tier | Fast | Open-source |
| 7 additional OSS models | 94.4% | Free tier available | Varies | Open-source |

Tier 3 - Speed-optimized

| Model | Quality | Cost per run | Suite time | Type |
|---|---|---|---|---|
| Step 3.5 Flash | 94.4% | Free tier | 171 s (5.8x faster than GPT-4.1-mini) | Open-source |
| Groq-hosted models | 94.4% | Free tier available | Hardware-accelerated | Open-source |

Methodology

We tested 27 models on four real documents: a simple CSV, a 332-row financial table, a Cyrillic and Arabic invoice processed via OCR, and a complex multi-section investor PDF. Each run was scored on extraction quality (field accuracy against ground truth), API cost, and wall-clock time for the full suite.

No single model is optimal for all documents. Content-aware routing - pre-scanning each document for type, language, and complexity before selecting a model - is how Datatera handles this in production.
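As a rough illustration of content-aware routing, the sketch below pre-scans a text sample for Cyrillic or Arabic characters and then picks a model. It is a minimal sketch under stated assumptions, not Datatera's production logic: the names `DocProfile`, `prescan`, and `choose_model` are hypothetical, the two model names are simply picked from the tables above, and sending complex layouts to the proprietary model is an assumed rule.

```python
from dataclasses import dataclass

# Illustrative picks from the tier tables; not an actual routing config.
PROPRIETARY_MODEL = "GPT-4.1-mini"
LOW_COST_MODEL = "Llama 3.3 70B (Groq)"

# Basic Unicode blocks only; a crude script check for illustration.
SCRIPT_RANGES = {
    "cyrillic": (0x0400, 0x04FF),
    "arabic": (0x0600, 0x06FF),
}


@dataclass(frozen=True)
class DocProfile:
    """Hypothetical result of the pre-scan step."""
    doc_type: str          # e.g. "csv", "table", "invoice", "pdf"
    scripts: frozenset     # non-Latin scripts found in the sample
    complex_layout: bool   # supplied by the caller in this sketch


def detect_scripts(text):
    """Return the set of tracked scripts that appear in the text."""
    found = set()
    for ch in text:
        code_point = ord(ch)
        for name, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                found.add(name)
    return frozenset(found)


def prescan(doc_type, sample_text, complex_layout=False):
    """Build a profile from a document type and a text sample."""
    return DocProfile(doc_type, detect_scripts(sample_text), complex_layout)


def choose_model(profile):
    """Pick a model name from the profile."""
    if profile.scripts:
        # Cyrillic or Arabic present: route to a proprietary model.
        return PROPRIETARY_MODEL
    if profile.complex_layout:
        # Assumption: complex layouts also go to the higher-quality tier.
        return PROPRIETARY_MODEL
    return LOW_COST_MODEL
```

With this rule set, a plain Latin-script CSV goes to the low-cost model, while an invoice containing Cyrillic or Arabic text goes to the proprietary one, so the rest of the workload keeps its lower cost.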

A note on model selection: These results reflect document extraction tasks specifically - structured field extraction from PDFs, tables, and scanned documents. General-purpose benchmarks measuring code, reasoning, or chat quality will show different rankings.

Engine benchmark

Document-processing engine comparison

April 2026 - 10 local engines measured, 19 engines rated

The engine layer sits below the LLM: it converts raw files to text before extraction begins. Engine choice affects OCR quality, table structure, multi-language coverage, and cost. This comparison covers 19 engines across local and cloud deployment.

Ratings reflect general tendencies, not absolute scores. Your results will vary based on document quality, language, layout complexity, and hardware. Test on your own data with the docfold compare command.

Rating scale: ★★★ Excellent / ★★☆ Good / ★☆☆ Basic / None = Not supported

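| Engine | Type | License | OCR | Tables | Multi-lang | Speed | Cost |
|---|---|---|---|---|---|---|---|
| Docling | Local | MIT | ★★☆ | ★★★ | ★★★ | Medium | Free |
| MinerU | Local | AGPL-3.0 | ★★★ | ★★★ | ★★☆ | Slow | Free |
| Chandra OCR 2 | Local / VLM | OpenRAIL-M | ★★★ | ★★★ | ★★★ | Slow | Free* |
| Marker | SaaS | Paid | ★★★ | ★★★ | ★★★ | Fast | ~$1/1K pages |
| PyMuPDF | Local | AGPL-3.0 | None | ★☆☆ | ★★★ | Ultra-fast | Free |
| PaddleOCR | Local | Apache-2.0 | ★★★ | ★★☆ | ★★★ | Medium | Free |
| Tesseract | Local | Apache-2.0 | ★★☆ | ★☆☆ | ★★★ | Medium | Free |
| EasyOCR | Local | Apache-2.0 | ★★★ | None | ★★★ | Medium | Free |
| Unstructured | Local / SaaS | Apache-2.0 | ★★☆ | ★★☆ | ★★☆ | Medium | Free / Paid API |
| LlamaParse | SaaS | Paid | ★★★ | ★★★ | ★★☆ | Fast | ~$3/1K pages |
| LiteParse | Local | Apache-2.0 | ★★☆ | ★★☆ | ★★☆ | Fast | Free |
| Mistral OCR | SaaS | Paid | ★★★ | ★★★ | ★★★ | Fast | ~$1/1K pages |
| Zerox | VLM | MIT | ★★★ | ★★☆ | ★★★ | Slow | VLM API cost |
| Nougat | Local | MIT | ★★☆ | ★★☆ | ★☆☆ | Slow | Free |
| Surya | Local | GPL-3.0 | ★★★ | ★★☆ | ★★★ | Medium | Free |
| AWS Textract | SaaS | Paid | ★★★ | ★★★ | ★★☆ | Fast | ~$1.50/1K pages |
| Google Document AI | SaaS | Paid | ★★★ | ★★★ | ★★★ | Fast | ~$1.50/1K pages |
| Azure Document Intelligence | SaaS | Paid | ★★★ | ★★★ | ★★★ | Fast | ~$1.50/1K pages |
| Firecrawl | SaaS | Paid | ★★☆ | ★★☆ | ★★★ | Fast | ~$1/1K pages |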

Measured results (10 local engines, CPU)

Benchmarked on 2026-04-05 using synthetic PDF documents with known ground truth. All engines ran on CPU; no GPU. Times include model loading on first document.

| Engine | Avg time | Avg CER | Avg WER | Notes |
|---|---|---|---|---|
| PyMuPDF | 4 ms | 0.000 | 0.000 | Perfect on digital PDFs. No OCR capability. |
| Unstructured | 597 ms | 0.036 | 0.135 | Fast hybrid engine. Higher error on formatting. |
| Tesseract | 1,190 ms | 0.000 | 0.000 | Perfect on clean digital PDFs. 100+ languages. |
| PaddleOCR | 1,617 ms | 0.002 | 0.013 | Near-perfect accuracy. Best OCR at this speed. |
| Docling | 2,601 ms | 0.017 | 0.041 | Best accuracy-speed tradeoff among ML engines. |
| MinerU | 18,305 ms | 0.012 | 0.080 | Good CER among ML engines. Slow on CPU. |
| Surya | 32,959 ms | 0.014 | 0.027 | Lowest WER among ML engines. Slow on CPU. |
| Marker Local | 39,091 ms | 0.019 | 0.094 | Full document conversion with layout analysis. |
| EasyOCR | 30,311 ms | 0.000 | 0.000 | Perfect on single-page. Hangs on multi-page CPU. |
| Nougat | 127,380 ms | 1.000 | 1.000 | Designed for academic papers only. Wrong domain for simple text. |

CER = Character Error Rate. WER = Word Error Rate. Lower is better. 0.000 = perfect match.
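For readers who want to reproduce these metrics on their own ground truth, here is a minimal sketch of the conventional definitions: edit distance between reference and output, divided by the reference length, computed over characters for CER and over whitespace-separated words for WER. It assumes the benchmark follows these standard definitions; any text normalization applied before scoring is not specified here, so none is applied.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_item == hyp_item else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def error_rate(ref_items, hyp_items):
    """Edit distance normalized by reference length."""
    if not ref_items:
        # Convention chosen for this sketch: an empty reference is perfect
        # only when the output is empty too.
        return 0.0 if not hyp_items else 1.0
    return edit_distance(ref_items, hyp_items) / len(ref_items)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits per reference character."""
    return error_rate(reference, hypothesis)


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits per reference word."""
    return error_rate(reference.split(), hypothesis.split())
```

Under these definitions an engine that returns empty output scores exactly 1.000 on both metrics, since every reference character and word counts as a deletion.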

Key takeaways

PyMuPDF is unmatched for digital PDFs: perfect accuracy at 4 ms average. Use it as the first-pass engine for text-layer documents.

PaddleOCR is the strongest OCR engine at this speed tier: 0.002 CER at 1.6 s on CPU.

Docling offers the best balance of speed and accuracy among ML-powered engines.

ML engines (Surya, Marker, MinerU) are designed for GPU and can be 5-20x faster with CUDA. CPU times shown here are worst-case.

Nougat is purpose-built for arXiv academic papers. It produces empty output on ordinary documents, which is why it scores 1.000 on both CER and WER here. That is not a defect, just the wrong domain.
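The first takeaway, using a text-layer engine as the first pass, can be sketched as a per-page fallback: keep the text-layer result when a page has enough visible characters, otherwise hand that page to an OCR engine. This is a minimal sketch under assumptions. `extract_pages`, `text_layer`, `ocr_page`, and the 20-character threshold are all hypothetical; in practice `text_layer` might be a thin wrapper around PyMuPDF and `ocr_page` a wrapper around an OCR engine such as PaddleOCR.

```python
from typing import Callable, List, Tuple

# Assumed threshold: pages with fewer visible characters are treated as scanned.
MIN_VISIBLE_CHARS = 20


def extract_pages(
    path: str,
    text_layer: Callable[[str], List[str]],
    ocr_page: Callable[[str, int], str],
    min_visible_chars: int = MIN_VISIBLE_CHARS,
) -> List[Tuple[str, str]]:
    """Return (engine_label, text) for each page, using OCR only where needed.

    text_layer(path) returns one string per page from the PDF text layer.
    ocr_page(path, page_index) returns OCR text for a single page.
    Both callables are supplied by the caller.
    """
    results = []
    for index, text in enumerate(text_layer(path)):
        visible = sum(1 for ch in text if not ch.isspace())
        if visible >= min_visible_chars:
            results.append(("text-layer", text))
        else:
            # Little or no text layer: likely a scanned page, so fall back.
            results.append(("ocr", ocr_page(path, index)))
    return results
```

Passing the engines in as callables keeps the sketch independent of any one library, so the same function can be tried with whichever fast and OCR engines perform best on your own documents.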

Test on your own data: docfold on GitHub

Bring your documents. We will find the right stack for them.

Book a 15-min discovery call. We will map extraction quality, cost, and data-residency requirements for your workload.