Guide to converting PDF to Excel using Python
Developers building solutions to automate business processes often face the challenge of extracting critical information from documents. While data extraction from file formats like Excel and CSV are more straightforward, doing the same from PDFs becomes much more challenging. PDF documents often contain semi-structured information or varying formats that makes creating rules to extract data impractical.
At DocumentPro, our goal is building solutions that can make it very simple for developers to extract structured JSON data from PDFs using our GPT document parsers. Our document parsers extract consistent data from documents even when the information layouts vary. Furthermore, exporting document data extracts to Excel makes the data more accessible for uses like data wrangling, analysis and record keeping.
In this guide, we will walk you through using DocumentPro’s Rest API with Python to extract data from Invoice PDFs (as an example) and exporting the extracted data to Excel.
Getting started with DocumentPro
To begin, sign up to DocumentPro with a free account (you’ll have 30 credits to spend on document parsing). For this tutorial, we will use our pre-built Invoice parser, but you can easily create your own custom parser.
Click on the Invoice parser on the left side navigation menu.
A parser is a JSON schema that describes the structure of the output of a document and the facts and tables to extract from a document. On the right of the image above, you see the fields extracted by this Invoice parser.
Parsers can extracts facts, tables, and repeating data as fields. Below we show example of fields extracted by this parser.
{
"fields": [
{
"name": "purchase_order_number",
"type": "text",
"description": null,
"required": true
},
{
"name": "invoice_number",
"type": "text",
"description": null,
"required": false
},
{
"name": "invoice_date",
"type": "text",
"description": null,
"required": false
},
{
"name": "invoice_due_date",
"type": "text",
"description": null,
"required": false
},
{
"name": "payment_terms",
"type": "text",
"description": null,
"required": false
},
{
"name": "currency",
"type": "text",
"description": null,
"required": false
},
{
"name": "invoice_total_amount",
"type": "text",
"description": null,
"required": false
},
{
"name": "invoice_total_tax_amount",
"type": "text",
"description": null,
"required": false
},
{
"name": "invoice_tax_rate",
"type": "text",
"description": null,
"required": false
},
{
"name": "invoice_subtotal_amount",
"type": "text",
"description": null,
"required": false
},
{
"name": "seller_name",
"type": "text",
"description": null,
"required": false
},
{
"name": "seller_address",
"type": "text",
"description": null,
"required": false
},
{
"name": "seller_email",
"type": "text",
"description": null,
"required": false
},
{
"name": "seller_phone",
"type": "text",
"description": null,
"required": false
},
{
"name": "seller_vat_number",
"type": "text",
"description": null,
"required": false
},
{
"name": "seller_bank_name",
"type": "text",
"description": null,
"required": false
},
{
"name": "seller_bank_account_number",
"type": "text",
"description": null,
"required": false
},
{
"name": "buyer_name",
"type": "text",
"description": null,
"required": false
},
{
"name": "buyer_address",
"type": "text",
"description": null,
"required": false
},
{
"name": "buyer_vat_number",
"type": "text",
"description": null,
"required": false
},
{
"name": "items",
"type": "table",
"description": null,
"subFields": [
{
"name": "description",
"type": "text",
"description": null,
"required": false
},
{
"name": "quantity",
"type": "number",
"description": null,
"required": false
},
{
"name": "unit_price",
"type": "number",
"description": null,
"required": false
},
{
"name": "subtotal",
"type": "number",
"description": null,
"required": false
}
],
"required": false
}
]
}
The invoice_date and invoice_due_date are facts that the invoice parser will extract. The items of an invoice is usually a table in the Invoice document that the parser will extract. And the description and and quantity are two of the columns of the table that parser will extract.
You will need the parser id
so that you can upload PDF invoices against this parser with the Rest API. Go the parser settings to copy the parser id.
You will also need your API Key. Go to the app settings by clicking on the settings button at the bottom left of the navigation menu.
Ingesting an Invoice using the Rest API and Python
Here’s the Invoice we will use for this guide.
Now we can create a Python script that will make a http POST call to our Rest API using your API Key to start processing documents. Let’s start by importing requests
import requests
Let’s set the API key and the parser we want to use as variables
PARSER_ID = YOUR_PARSER_ID
API_KEY = YOUR_API_KEY
Making a call to the API
file_name = 'filename.pdf'
file_path = 'path/to/your/file/{}'.format(file_name)
response = requests.request(
'POST',
'https://api.documentpro.ai/files/upload/{}'.format(PARSER_ID),
headers={
'x-api-key': API_KEY
},
data={},
files=[(
'file',(
file_name,
open(file_path,'rb')
'application/pdf'
)
)]
)
if response.status_code == 200:
print('File uploaded successfully')
print(response.json())
When you upload a PDF document to the API, you will get back a request_id
. The document will start processing immediately. Your response will look like this:
{
"request_id": "unique_document_identifier",
"success": true
}
Retrieving a parsed PDF
The document parsing process is asynchronous and usually takes less than 20 seconds. You can periodically check whether a document has completed processing using the following API to retrieve the document with it’s status and extracted data.
import requests
Let’s set the request_id
you got when you ingested your document and your API Key
REQUEST_ID = YOUR_REQUEST_ID
API_KEY = YOUR_API_KEY
Make the GET request
response = requests.request(
'GET',
'https://api.documentpro.ai/files?request_id={}'.format(REQUEST_ID),
headers={
'x-api-key': API_KEY
},
data={}
)
print(response.json())
If the file is still processing
, you will get a response like this, where the result_json_data
field is empty.
{
"request_id": "your-request-id",
"request_status": "processing",
"response_body": {
"file_name": "your-file-name",
"file_presigned_url": "temporary-url-to-your-file",
"user_error_msg": null,
"template_id": "your-template-id",
"template_type": "receipt",
"template_title": "Receipt",
"num_pages": 2,
"result_json_data": {}
},
"created_at": "2023-07-30T19:05:10.696893",
"updated_at": "2023-07-30T19:05:29.565249"
}
If the file has completed processing, your result_json_data
field will be populated with a JSON of the extracted data. Like this:
{
"request_id": "your-request-id",
"request_status": "completed",
"response_body": {
"file_name": "your-file-name",
"file_presigned_url": "temporary-url-to-your-file",
"user_error_msg": null,
"template_id": "your-template-id",
"template_type": "receipt",
"template_title": "Receipt",
"num_pages": 2,
"result_json_data": {
"purchase_order_number": "5300016211",
"invoice_number": "P007757",
"invoice_date": "04/06/20",
"invoice_due_date": null,
"payment_terms": "30 Days",
"currency": "USD",
"invoice_total_amount": "14,070.00",
"invoice_total_tax_amount": "670.00",
"invoice_tax_rate": "5.000%",
"invoice_subtotal_amount": "13,400.00",
"seller_name": "Grayling Momentum Ltd",
"seller_address": "HSBC Bank Middle East Limited, P.O.Box: 3766 Dubai , UAE",
"seller_email": null,
"seller_phone": null,
"seller_vat_number": "100390502100003",
"seller_bank_name": "HSBC Bank Middle East Limited",
"seller_bank_account_number": "AE900 2000 0003 0033 0210 01",
"buyer_name": "Audi Volkswagen Middle East FZE",
"buyer_address": "Dubai Airport Freezone, East Wing Building 4E/B, 8th Floor 27758 Dubai, United Arab Emirates",
"buyer_vat_number": null,
"items": [
{
"description": "PR Retainer for the month of June 2020",
"quantity": 1,
"unit_price": 13400,
"subtotal": 13400
}
]
}
},
"created_at": "2023-07-30T19:05:10.696893",
"updated_at": "2023-07-30T19:05:29.565249"
}
Export to Excel
Next let’s take a look at exporting the document extract to Excel. Currently, DocumentPro only supports file downloads through the web platform. To get an extract for your document you need to go into the documents section for your parser and select the document that you just parsed with the API.
Export single document to Excel
Go into the documents section of your Invoice parser.
Click view
on the document you just parsed.
Select the download tab and click export to file to open the options for file export.
Now select the Excel file format. For data layout, selecting merged
returns the entire extracted data in one row, while selecting line items
will create a row for each row in your table extract. Clicking Export to file
will start downloading a file as soon as it’s ready.
Export multiple documents to Excel
DocumentPro also has the ability to export data from multiple parsed documents to Excel. For this, you need to click on the preview
section of your Invoice parser. Here you will see a preview of the latest documents that were parsed.
Click Export to file
to see the export options.
The Date Range
option allows you to export documents parsed between one to ninety days in the past in one file. Clicking Export to file
will start the process of preparing a file and send you a temporary link to the file to your email which can be used to download the file.
You have now successfully extracted the data from one or many Invoice PDFs and exported them to Excel.
Conclusion
In this guide, you converted a PDF document to Excel. You used the DocumentPro web app to select the Invoice document parser and your API Key. In addition, you used Python to interact with DocumentPro’s API to extract data from your PDF with the Invoice document parser.
DocumentPro helps developers simplify the extraction of information like facts, tables and repeating data from PDF and Image files using powerful GPT document parsers. If you’re looking at processing larger volume of documents check out our various plans.