Benchmark data
Document extraction benchmarks
We tested 27 language models on real document extraction tasks so you do not have to guess which one to use. Below: model results by quality, cost, and speed - and a separate engine comparison for the document-processing layer.
LLM benchmark
Which model extracts best - and at what cost
March 2026 - 27 models, 4 document types
Key findings
DeepSeek V3 is the first open-source model to reach proprietary extraction quality at 97.5% - previously only commercial models held this tier.
Around 10 open-source models score 94.4% quality. Several have free API tiers. The most expensive model in the set scores the same.
A 72x cost spread exists between the cheapest and most expensive models at comparable quality. Content-aware routing captures most of that gap automatically.
Multilingual gap: open-source models average 87.5% on Cyrillic and Arabic scripts versus 100% for proprietary models. Routing Cyrillic and Arabic documents to a proprietary model closes the gap without changing cost for the rest of the workload.
Tier 1 - Top quality (97.5-100%)
| Model | Quality | Cost per run | Suite time | Type |
|---|---|---|---|---|
| GPT-4.1-mini | 100% | $0.09 | ~990 s | Proprietary |
| Gemini 2.5 Flash | 100% | - | Fast | Proprietary |
| Gemini 2.5 Pro | 100% | - | Moderate | Proprietary |
| GPT-4.1 | 97.5% | $0.46 | Moderate | Proprietary |
| DeepSeek V3 | 97.5% | Low | Moderate | Open-source |
Tier 2 - Production quality (94.4%)
| Model | Quality | Cost per run | Suite time | Type |
|---|---|---|---|---|
| Kimi K2 (Groq) | 94.4% | $0.04 | Fast | Open-source |
| Llama 3.3 70B (Groq) | 94.4% | Free tier | Fastest OSS | Open-source |
| Qwen 3 235B (Groq) | 94.4% | Free tier | Fast | Open-source |
| 7 additional OSS models | 94.4% | Free tier available | Varies | Open-source |
Tier 3 - Speed-optimized
| Model | Quality | Cost per run | Suite time | Type |
|---|---|---|---|---|
| Step 3.5 Flash | 94.4% | Free tier | 171 s (5.8x faster than GPT-4.1-mini) | Open-source |
| Groq-hosted models | 94.4% | Free tier available | Hardware-accelerated | Open-source |
Methodology
We tested 27 models on four real documents: a simple CSV, a 332-row financial table, a Cyrillic and Arabic invoice processed via OCR, and a complex multi-section investor PDF. Each run was scored on extraction quality (field accuracy against ground truth), API cost, and wall-clock time for the full suite.
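To make "field accuracy against ground truth" concrete, here is a minimal sketch of one way such a score could be computed. The function name `field_accuracy`, the whitespace and case normalization, and the sample invoice fields are all assumptions for illustration; the benchmark's actual scoring rules may differ.

```python
def field_accuracy(extracted: dict, truth: dict) -> float:
    """Share of ground-truth fields whose extracted value matches.

    Assumption: values are compared as strings after trimming
    whitespace and lowercasing. Missing fields count as wrong.
    """
    if not truth:
        raise ValueError("ground truth must contain at least one field")

    def norm(value) -> str:
        return str(value).strip().lower()

    correct = sum(
        1
        for key, expected in truth.items()
        if key in extracted and norm(extracted[key]) == norm(expected)
    )
    return correct / len(truth)


# Hypothetical sample data, not taken from the benchmark documents.
truth = {"invoice_no": "INV-001", "total": "1250.00", "currency": "EUR", "vendor": "Acme"}
extracted = {"invoice_no": "inv-001 ", "total": "1250.00", "currency": "USD"}

print(f"{field_accuracy(extracted, truth):.1%}")  # prints 50.0%
```

Two of the four fields match (one is wrong, one is missing), so the score is 50.0%. A stricter scorer might skip the normalization step, and a more lenient one might tolerate numeric formatting differences.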
No single model is optimal for all documents. Content-aware routing - pre-scanning each document for type, language, and complexity before selecting a model - is how Datatera handles this in production.
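The pre-scan idea can be sketched in a few lines. This is an illustrative policy only, not Datatera's production router: the `DocProfile` fields, the 0.2 threshold, and the mapping from profile to model are assumptions, with model names borrowed from the tables above.

```python
from dataclasses import dataclass

# Unicode blocks used for a crude script check: Cyrillic and Arabic.
NON_LATIN_RANGES = ((0x0400, 0x04FF), (0x0600, 0x06FF))


@dataclass
class DocProfile:
    doc_type: str           # e.g. "csv", "table", "scan", "pdf"
    non_latin_ratio: float  # share of letters in Cyrillic or Arabic blocks
    complex_layout: bool    # e.g. multi-section PDF with nested tables


def non_latin_ratio(text: str) -> float:
    """Fraction of alphabetic characters in the Cyrillic or Arabic blocks."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(
        1
        for ch in letters
        if any(lo <= ord(ch) <= hi for lo, hi in NON_LATIN_RANGES)
    )
    return hits / len(letters)


def pick_model(profile: DocProfile) -> str:
    """Hypothetical routing rules; thresholds are illustrative."""
    if profile.non_latin_ratio > 0.2:
        return "GPT-4.1-mini"          # proprietary tier for Cyrillic/Arabic
    if profile.complex_layout:
        return "DeepSeek V3"           # top-tier open-source model
    return "Llama 3.3 70B (Groq)"      # free-tier model for simple documents


sample = "Счет на оплату"
profile = DocProfile("scan", non_latin_ratio(sample), complex_layout=False)
print(pick_model(profile))  # prints GPT-4.1-mini
```

The design choice here is to send only the documents that need it to the costlier model, so the rest of the workload stays on the cheaper tier.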
A note on model selection: These results reflect document extraction tasks specifically - structured field extraction from PDFs, tables, and scanned documents. General-purpose benchmarks measuring code, reasoning, or chat quality will show different rankings.
Engine benchmark
Document-processing engine comparison
April 2026 - 10 local engines measured, 19 engines rated
The engine layer sits below the LLM: it converts raw files to text before extraction begins. Engine choice affects OCR quality, table structure, multi-language coverage, and cost. This comparison covers 19 engines across local and cloud deployment.
Rating scale: ★★★ Excellent / ★★☆ Good / ★☆☆ Basic / None = Not supported
| Engine | Type | License | OCR | Tables | Multi-lang | Speed | Cost |
|---|---|---|---|---|---|---|---|
| Docling | Local | MIT | ★★☆ | ★★★ | ★★★ | Medium | Free |
| MinerU | Local | AGPL-3.0 | ★★★ | ★★★ | ★★☆ | Slow | Free |
| Chandra OCR 2 | Local / VLM | OpenRAIL-M | ★★★ | ★★★ | ★★★ | Slow | Free* |
| Marker | SaaS | Paid | ★★★ | ★★★ | ★★★ | Fast | ~$1/1K pages |
| PyMuPDF | Local | AGPL-3.0 | None | ★☆☆ | ★★★ | Ultra-fast | Free |
| PaddleOCR | Local | Apache-2.0 | ★★★ | ★★☆ | ★★★ | Medium | Free |
| Tesseract | Local | Apache-2.0 | ★★☆ | ★☆☆ | ★★★ | Medium | Free |
| EasyOCR | Local | Apache-2.0 | ★★★ | None | ★★★ | Medium | Free |
| Unstructured | Local / SaaS | Apache-2.0 | ★★☆ | ★★☆ | ★★☆ | Medium | Free / Paid API |
| LlamaParse | SaaS | Paid | ★★★ | ★★★ | ★★☆ | Fast | ~$3/1K pages |
| LiteParse | Local | Apache-2.0 | ★★☆ | ★★☆ | ★★☆ | Fast | Free |
| Mistral OCR | SaaS | Paid | ★★★ | ★★★ | ★★★ | Fast | ~$1/1K pages |
| Zerox | VLM | MIT | ★★★ | ★★☆ | ★★★ | Slow | VLM API cost |
| Nougat | Local | MIT | ★★☆ | ★★☆ | ★☆☆ | Slow | Free |
| Surya | Local | GPL-3.0 | ★★★ | ★★☆ | ★★★ | Medium | Free |
| AWS Textract | SaaS | Paid | ★★★ | ★★★ | ★★☆ | Fast | ~$1.50/1K pages |
| Google Document AI | SaaS | Paid | ★★★ | ★★★ | ★★★ | Fast | ~$1.50/1K pages |
| Azure Document Intelligence | SaaS | Paid | ★★★ | ★★★ | ★★★ | Fast | ~$1.50/1K pages |
| Firecrawl | SaaS | Paid | ★★☆ | ★★☆ | ★★★ | Fast | ~$1/1K pages |
Measured results (10 local engines, CPU)
Benchmarked on 2026-04-05 using synthetic PDF documents with known ground truth. All engines ran on CPU only, with no GPU. Times include model loading on the first document.
| Engine | Avg time | Avg CER | Avg WER | Notes |
|---|---|---|---|---|
| PyMuPDF | 4 ms | 0.000 | 0.000 | Perfect on digital PDFs. No OCR capability. |
| Unstructured | 597 ms | 0.036 | 0.135 | Fast hybrid engine. Higher error on formatting. |
| Tesseract | 1,190 ms | 0.000 | 0.000 | Perfect on clean digital PDFs. 100+ languages. |
| PaddleOCR | 1,617 ms | 0.002 | 0.013 | Near-perfect accuracy. Best OCR at this speed. |
| Docling | 2,601 ms | 0.017 | 0.041 | Best accuracy-speed tradeoff among ML engines. |
| MinerU | 18,305 ms | 0.012 | 0.080 | Good CER among ML engines. Slow on CPU. |
| EasyOCR | 30,311 ms | 0.000 | 0.000 | Perfect on single-page documents. Hangs on multi-page documents on CPU. |
| Surya | 32,959 ms | 0.014 | 0.027 | Lowest WER among ML engines. Slow on CPU. |
| Marker Local | 39,091 ms | 0.019 | 0.094 | Full document conversion with layout analysis. |
| Nougat | 127,380 ms | 1.000 | 1.000 | Designed for academic papers only. Wrong domain for simple text. |
CER = Character Error Rate. WER = Word Error Rate. Lower is better. 0.000 = perfect match.
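For readers who want to reproduce these metrics, the sketch below uses the standard definitions: edit distance between reference and engine output, divided by the reference length in characters (CER) or words (WER). It assumes no extra text normalization; the benchmark harness may have normalized whitespace or casing differently.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion from the reference
                curr[j - 1] + 1,         # insertion into the reference
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(cer("total amount due", "total amont due"))  # prints 0.0625
print(wer("total amount due", ""))                 # prints 1.0
```

Under these definitions an engine that returns empty output scores exactly 1.000 on both metrics, since every reference character and word counts as a deletion.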
Key takeaways
PyMuPDF is unmatched for digital PDFs: perfect accuracy at 4 ms average. Use it as the first-pass engine for text-layer documents.
PaddleOCR is the strongest OCR engine at this speed tier: 0.002 CER at 1.6 s on CPU.
Docling offers the best balance of speed and accuracy among ML-powered engines.
ML engines (Surya, Marker, MinerU) are designed for GPU and can be 5-20x faster with CUDA. CPU times shown here are worst-case.
Nougat is purpose-built for arXiv academic papers. It produces empty output on ordinary documents - not a defect, just the wrong domain.
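The first-pass pattern from these takeaways can be sketched as a simple fallback: try the fast text-layer extractor, and only invoke OCR when it returns too little text. The function `extract_text`, the `min_chars` threshold, and the stand-in extractors are hypothetical; in practice the two callables might wrap PyMuPDF and an OCR engine such as PaddleOCR.

```python
from typing import Callable


def extract_text(
    path: str,
    text_layer: Callable[[str], str],
    ocr: Callable[[str], str],
    min_chars: int = 20,  # illustrative threshold, tune per workload
) -> tuple[str, str]:
    """Return (engine_used, text), preferring the cheap text-layer pass."""
    text = text_layer(path)
    if len(text.strip()) >= min_chars:
        return "text-layer", text
    # Little or no embedded text: likely a scan, so fall back to OCR.
    return "ocr", ocr(path)


# Stand-in extractors for demonstration only.
def fake_text_layer(path: str) -> str:
    return "" if path.endswith("scan.pdf") else "Invoice INV-001, total 1250.00 EUR"


def fake_ocr(path: str) -> str:
    return "Recognized text from scanned page"


print(extract_text("report.pdf", fake_text_layer, fake_ocr)[0])  # prints text-layer
print(extract_text("scan.pdf", fake_text_layer, fake_ocr)[0])    # prints ocr
```

Passing the extractors in as callables keeps the fallback logic independent of any one engine, so the OCR stage can be swapped without touching the routing.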
Bring your documents. We will find the right stack for them.
Book a 15-min discovery call. We will map extraction quality, cost, and data-residency requirements for your workload.