
You need to extract invoice data, purchase order line items, or form fields from PDFs — and get clean, structured JSON on the other side. The problem? PDFs weren't designed for machines. They're a presentation format built for human eyes, and beneath the surface they're a nightmare of floating text objects, embedded images, and zero semantic structure.
Developers trying to solve this problem for the first time typically go through the same journey: start with a text extraction library, hit a wall when layouts vary, add OCR for scanned documents, hit another wall, reach for a general LLM, and eventually realize that production-grade document extraction is a harder problem than it looks.
This guide cuts through that journey. We'll compare every major approach — text extraction libraries, OCR, general LLMs, and purpose-built document AI APIs — with honest trade-offs, code examples, and a decision framework for choosing the right tool for your use case.
1. Why PDF Extraction Is Harder Than It Looks
PDFs encode documents as a collection of positioned drawing instructions, not as semantic content. There is no concept of a "table," a "field," or a "label-value pair" at the file format level. What looks like a clean table to your eyes is often dozens of text objects positioned to appear aligned — with no machine-readable relationship between them.
Developers encounter three categories of documents in practice:
- Digital PDFs: Generated by software (accounting systems, ERPs, Word exports). Text is selectable. Layouts can still vary wildly between vendors.
- Scanned documents: Images of paper documents. No selectable text at all — you need OCR before you can even start extracting fields.
- Semi-structured documents: The hardest category. Invoices, purchase orders, and contracts that follow a general pattern but differ in layout, field position, terminology, and page count across issuers.
The real cost of building document extraction isn't the happy path — it's the edge cases. Your regex works for the first 10 vendors. The 11th puts the invoice number in the header instead of the body. The 12th sends a two-page PDF. The 13th sends a photo taken on a phone. At scale, edge cases become the rule, not the exception.
2. Approach 1 — Text Extraction Libraries (pdfplumber, PyPDF2)
Text extraction libraries parse the underlying PDF structure and return the text content of digital PDFs. They're fast, require no API calls, and work offline.
```python
import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    print(text)
    # Returns raw text — all words on the page, roughly top-to-bottom

    # For tables, pdfplumber has a table extractor:
    tables = page.extract_tables()
    for table in tables:
        for row in table:
            print(row)
```
What works well: Consistently structured, software-generated PDFs with stable layouts. Great for bank statements, payslips, or any document where you control the source.
Where it breaks down:
- Fails entirely on scanned documents (returns empty or garbage text)
- Layout-dependent: column ordering shifts when vendors change their invoice template
- No semantic understanding — you still need custom logic to identify which text is an "invoice number" vs a "PO number"
- Multi-page documents require manual stitching logic
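To make the "no semantic understanding" point concrete, here is a minimal sketch of the custom parsing logic you end up writing on top of raw extracted text. The label patterns are illustrative assumptions, and this is exactly the code that breaks when vendor number 11 words its labels differently:

```python
import re

# Hypothetical label patterns -- every new vendor layout tends to need new ones
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|Number|#)\s*[:\-]?\s*(\S+)", re.IGNORECASE),
    "total_amount": re.compile(r"Total\s*(?:Due|Amount)?\s*[:\-]?\s*\$?([\d,]+\.\d{2})", re.IGNORECASE),
}

def parse_fields(raw_text: str) -> dict:
    """Pull labelled values out of raw page text with regexes; None if not found."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(raw_text)
        fields[name] = match.group(1) if match else None
    return fields

sample = "ACME Corp\nInvoice No: INV-2025-0042\nTotal Due: $4,250.00"
print(parse_fields(sample))
# {'invoice_number': 'INV-2025-0042', 'total_amount': '4,250.00'}
```

Each new layout variation means another pattern, another test case, another maintenance item.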
Verdict: A solid starting point for internal tools with known, consistent document sources. Not appropriate for handling documents from external parties.
3. Approach 2 — OCR (Tesseract, EasyOCR)
OCR converts images of text — scanned pages, photographs of documents — into machine-readable characters. It's essential whenever your input includes scanned documents, but it only solves the "text recognition" problem, not the "field extraction" problem.
```python
import pytesseract
import pdf2image

# Convert PDF pages to images first (pdf2image requires poppler installed)
pages = pdf2image.convert_from_path("scanned_invoice.pdf")
image = pages[0]

# Run OCR
text = pytesseract.image_to_string(image)
print(text)
# Returns raw text — same problem as pdfplumber: you still need to parse it
```
What works well: Getting text out of scanned or image-based documents. EasyOCR handles more languages and degraded images better than Tesseract.
Where it breaks down:
- Image quality matters significantly — low-res scans produce garbled text
- Returns raw text with no field understanding, just like pdfplumber
- Requires additional post-processing logic to extract specific fields
- Struggles with complex layouts, rotated text, or mixed print/handwriting
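Because OCR output is noisy, the "additional post-processing logic" above usually starts with character-level cleanup before any field logic runs. A minimal sketch for currency amounts; the substitution table is an illustrative subset, not an exhaustive one:

```python
import re

# Common OCR digit confusions (illustrative subset)
OCR_DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})

def parse_ocr_amount(token: str):
    """Best-effort parse of a currency amount from a noisy OCR token.

    Only safe on fields known to be numeric; applying the substitutions
    to free text would corrupt real words.
    """
    cleaned = token.translate(OCR_DIGIT_FIXES)
    cleaned = re.sub(r"[^\d.,]", "", cleaned)  # drop currency symbols and stray marks
    cleaned = cleaned.replace(",", "")         # assumes US-style thousands separators
    try:
        return float(cleaned)
    except ValueError:
        return None                            # unparseable: route to human review

print(parse_ocr_amount("$4,25O.0O"))  # -> 4250.0
```

Multiply this by every numeric field, every locale, and every scanner quirk, and the post-processing layer quickly outgrows the OCR call itself.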
Verdict: A necessary component in any pipeline that handles scanned documents, but not a complete solution. OCR gives you text; you still need intelligence to turn that text into structured data.
4. Approach 3 — General LLMs (GPT-4 Vision, Claude)
Modern multimodal LLMs can "see" document images and understand their content contextually. Pass a PDF page as an image, ask for structured output, and you get surprisingly good results — especially for prototyping.
```python
import base64
import json

import openai
from pdf2image import convert_from_path

# Convert first page to image and encode
pages = convert_from_path("invoice.pdf")
image = pages[0]
image.save("/tmp/page.png")
with open("/tmp/page.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_data}"}},
                {"type": "text", "text": """Extract the following fields from this invoice as JSON:
{
  "invoice_number": "",
  "invoice_date": "",
  "vendor_name": "",
  "total_amount": "",
  "line_items": []
}
Return only the JSON object."""},
            ],
        }
    ],
)
data = json.loads(response.choices[0].message.content)
```
What works well: Handles almost any layout without training. Great for prototyping fast, one-off extraction tasks, and documents with unstructured narrative content.
Production pitfalls:
- Inconsistent field names: The same field might come back as `total`, `total_amount`, or `amount_due` on different runs
- Hallucinations on edge cases: LLMs occasionally invent values for fields they can't confidently read, especially on degraded scans
- No built-in validation: You get raw text strings — currency formatting, date format normalization, and range validation are all your problem
- Token cost at scale: Processing a 5-page invoice at GPT-4 Vision rates adds up fast at 500+ documents/month
- Rate limits: Real-time document processing pipelines need predictable throughput; OpenAI rate limits are tier-dependent and can block production workloads
- Multi-page handling: Each page is a separate API call; you need to stitch results together across pages
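The inconsistent-naming pitfall can be partially mitigated client-side by mapping known aliases onto a canonical schema before the data goes anywhere downstream. A sketch, where the alias table is an assumption you would grow from observed model outputs:

```python
# Aliases vary by model and prompt; this table is illustrative only
FIELD_ALIASES = {
    "total": "total_amount",
    "amount_due": "total_amount",
    "invoice_no": "invoice_number",
    "invoice_id": "invoice_number",
    "vendor": "vendor_name",
    "supplier_name": "vendor_name",
}

def normalize_fields(raw: dict) -> dict:
    """Rename LLM-returned keys onto a canonical schema, passing unknowns through."""
    return {FIELD_ALIASES.get(key, key): value for key, value in raw.items()}

messy = {"invoice_no": "INV-0042", "amount_due": "4250.00", "vendor": "Acme"}
print(normalize_fields(messy))
# {'invoice_number': 'INV-0042', 'total_amount': '4250.00', 'vendor_name': 'Acme'}
```

Note the limitation: if a response contains two aliases of the same field, one silently overwrites the other, which is precisely the kind of edge case that accumulates in DIY pipelines.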
Verdict: The best choice for prototyping, low-volume use cases (under 50 documents/month), or documents where layout is too varied for any pre-trained model. For production systems, the reliability and cost issues compound quickly.
5. Approach 4 — Purpose-Built Document Extraction APIs
Purpose-built document AI APIs solve the problems that general LLMs introduce at scale. They're trained specifically on business documents, return validated structured data with confidence scores, and are designed to handle the edge cases that break DIY pipelines.
What they solve over general LLMs:
- Consistent output schema: The same fields, the same names, the same format — every time
- Built-in validation: Dates are normalized, currency values are numeric, required fields are flagged when missing
- Confidence scores: Know when a field was extracted with high confidence vs when it needs human review
- Multi-page documents: Handled natively, with line item tables aggregated across pages
- Multi-language support: Process invoices in German, French, Spanish, and 50+ other languages without prompt changes
- Scans + digital PDFs: One API handles both — no separate OCR pipeline to maintain
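The "built-in validation" point is worth unpacking: in a DIY pipeline, date normalization alone is code you write and maintain. A minimal sketch of what a single field needs, assuming US-style day ordering for the ambiguous slash formats:

```python
from datetime import datetime

# Formats seen across vendors; "%d/%m/%Y" vs "%m/%d/%Y" is genuinely ambiguous --
# first match wins here, which is a real source of silent errors
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y", "%d %b %Y", "%B %d, %Y"]

def normalize_date(raw: str):
    """Coerce a date string in any known format to ISO 8601, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unrecognized format: flag for human review

print(normalize_date("03 Mar 2025"))    # -> "2025-03-03"
print(normalize_date("March 3, 2025"))  # -> "2025-03-03"
```

A purpose-built API does this (plus currency coercion, required-field checks, and range validation) before the response reaches you.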
API Comparison
| API | Best For | Training Required | Validated JSON | Pricing |
|-----|----------|-------------------|----------------|---------|
| DocumentPro | Invoices, POs, any business doc | No | Yes, with confidence scores | Credit-based |
| Google Document AI | Teams in Google Cloud ecosystem | Yes (some processors) | Yes | Per-page |
| AWS Textract | Teams in AWS ecosystem | No (forms/tables only) | Partial | Per-page |
| Azure Form Recognizer | Teams in Azure ecosystem | Yes | Yes | Per-page |
The major cloud providers (Google, AWS, Azure) are strong choices if you're already embedded in their ecosystems — but they require significant configuration, and some processors need template training before they handle new document layouts reliably. For teams that want to go live quickly without training data or cloud vendor lock-in, DocumentPro is built for that use case specifically.
6. How to Integrate DocumentPro in Your Application
DocumentPro exposes a REST API that accepts document uploads and returns structured JSON. The integration flow is straightforward:
- Upload the document as `multipart/form-data`
- Receive a JSON response with extracted fields and confidence scores
- Apply your validation logic — flag low-confidence fields for human review or trigger a re-process
- Push validated data to your ERP, database, or downstream workflow
```javascript
// Node.js example — extract fields from an invoice
const FormData = require('form-data');
const fs = require('fs');
const axios = require('axios');

async function extractInvoice(filePath) {
  const form = new FormData();
  form.append('file', fs.createReadStream(filePath));

  const response = await axios.post(
    'https://api.documentpro.ai/v1/extract',
    form,
    {
      headers: {
        ...form.getHeaders(),
        'Authorization': `Bearer ${process.env.DOCUMENTPRO_API_KEY}`,
      }
    }
  );

  const { fields } = response.data;

  // Fields come back with values and confidence scores
  console.log(fields.invoice_number); // { value: "INV-2025-0042", confidence: 0.99 }
  console.log(fields.total_amount);   // { value: 4250.00, confidence: 0.97 }
  console.log(fields.vendor_name);    // { value: "Acme Supplies Ltd", confidence: 0.98 }
  console.log(fields.line_items);     // Array of line items with qty, description, unit_price

  return fields;
}
```
```python
# Python example
import os

import requests

def extract_invoice(file_path):
    with open(file_path, 'rb') as f:
        response = requests.post(
            'https://api.documentpro.ai/v1/extract',
            headers={'Authorization': f"Bearer {os.environ['DOCUMENTPRO_API_KEY']}"},
            files={'file': f}
        )
    fields = response.json()['fields']

    # Handle low-confidence fields
    for field_name, field_data in fields.items():
        if field_data['confidence'] < 0.85:
            print(f"Low confidence on {field_name}: {field_data['value']} ({field_data['confidence']})")
            # Route to human review queue

    return fields
```
DocumentPro handles scanned PDFs, digital PDFs, and multi-page documents through the same endpoint — no separate processing pipelines to maintain. See the DocumentPro documentation for the full API reference, including webhook support for asynchronous processing of large document batches.
7. Decision Framework — Which Approach Is Right for You?
Use this to cut through the noise:
Text extraction library (pdfplumber, PyPDF2): → Digital PDFs only, consistent layouts, you control the source, < 50 documents/month
OCR (Tesseract, EasyOCR): → Scanned documents with simple structure, as part of a larger custom pipeline, budget-constrained
General LLM (GPT-4, Claude): → Prototyping quickly, low volume (< 50 docs/month), highly varied or narrative-heavy documents, no tolerance for vendor lock-in
Purpose-built Document AI API (DocumentPro): → Production application, 100+ documents/month, varied layouts from external sources, need validated JSON output, want to go live in days not months, embedding document processing in a customer-facing product
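The framework above can be sketched as a decision function, purely to make the branching explicit. The thresholds are the rough figures from this guide, not hard rules:

```python
def recommend_approach(docs_per_month: int, has_scans: bool,
                       varied_layouts: bool, production: bool) -> str:
    """Map the decision framework above onto a single recommendation."""
    if production and (docs_per_month >= 100 or varied_layouts):
        return "purpose-built document AI API"
    if varied_layouts:
        return "general LLM"             # prototyping / low volume only
    if has_scans:
        return "OCR plus custom parsing"
    return "text extraction library"

print(recommend_approach(500, has_scans=True, varied_layouts=True, production=True))
# -> "purpose-built document AI API"
print(recommend_approach(20, has_scans=False, varied_layouts=False, production=False))
# -> "text extraction library"
```

In practice the inputs drift over time (a new vendor introduces scans, volume grows past 100/month), which is why prototypes built on the earlier approaches tend to migrate rightward through this function.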
The build-in-house path is genuinely the right choice in a small number of cases — but teams that choose it consistently underestimate the maintenance burden. Every new vendor brings a new layout variation. Scanned documents degrade in quality. New languages appear. A purpose-built API absorbs those edge cases; a custom pipeline requires you to chase them forever.
Frequently Asked Questions
What is a Document AI API? A Document AI API is a cloud service that accepts uploaded documents (PDFs, images, scans) and returns structured data — typically JSON — without requiring you to write custom parsing logic. Unlike general OCR, modern document AI APIs use large language models to understand document context, handle layout variations, and extract the right fields even from unstructured or semi-structured documents.
How do I extract structured data from a PDF programmatically? You have four main options: (1) Text extraction libraries like pdfplumber or PyPDF2 — fast but brittle for varied layouts, (2) OCR tools like Tesseract — handles scans but poor at understanding field context, (3) General LLMs like GPT-4 Vision — flexible but requires significant prompt engineering and error handling, or (4) Purpose-built document extraction APIs like DocumentPro — pre-trained on document types, returns validated JSON, handles scans and layout variations out of the box.
What is the best API for extracting data from invoices? For invoice extraction specifically, purpose-built APIs outperform general-purpose tools. DocumentPro, Google Document AI, and AWS Textract are the leading options. DocumentPro differentiates with no-template-training required, JSON output with field validation, and a simple REST API that goes live in days rather than weeks. Google Document AI and AWS Textract require more configuration and are better suited to teams already embedded in those cloud ecosystems.
What is the difference between OCR and Document AI? OCR converts images of text into machine-readable characters — it tells you what letters are on the page. Document AI goes further: it understands the structure and meaning of a document, identifies which text is a vendor name vs an invoice total, handles layout variations across different document formats, and returns structured data rather than raw text. OCR is a step inside Document AI, not a replacement for it.
Can I use ChatGPT or Claude to extract data from documents? Yes, but with significant caveats for production use. General LLMs can extract data from documents when given the right prompts, but they introduce reliability issues at scale: inconsistent field naming, hallucinated values on edge cases, no built-in validation, unpredictable token costs, and rate limits. For production applications processing hundreds of documents, a purpose-built document extraction API provides the consistency, error handling, and cost predictability that general LLMs lack.
How do I choose between building document parsing in-house vs using an API? Build in-house if: you process fewer than 50 documents per month with consistent, well-structured layouts, or you have unique security constraints preventing third-party APIs. Use a document extraction API if: you need to handle varied layouts, scans, or handwriting; you want to go live in days not months; you process 100+ documents per month where manual QA becomes a bottleneck; or you are embedding document processing in a customer-facing product. The build-in-house path typically underestimates the long-tail of edge cases — layout variations, multi-page documents, degraded scans — that a trained API handles automatically.
Conclusion
PDF extraction is a deceptively hard problem. Text libraries are fast but fragile. OCR covers scans but not semantics. General LLMs are flexible but unreliable at production scale. Purpose-built document AI APIs — trained specifically on business documents, returning validated JSON with confidence scores — are the production-ready choice for applications that process documents from external parties at any meaningful volume.
If you're building a product that needs to process invoices, purchase orders, or any structured business document, DocumentPro offers a free tier to start with — no implementation team, no template training, live in days.
Also Read: How to Extract Data from Documents Using LLMs | Document Data Extraction with OCR and LLMs | How to Extract Data from Invoices
