Guide to converting PDF to Excel using Python

DocumentPro

Developers building solutions to automate business processes often face the challenge of extracting critical information from documents. While data extraction from file formats like Excel and CSV are more straightforward, doing the same from PDFs becomes much more challenging. PDF documents often contain semi-structured information or varying formats that makes creating rules to extract data impractical.

At DocumentPro, our goal is building solutions that can make it very simple for developers to extract structured JSON data from PDFs using our GPT document parsers. Our document parsers extract consistent data from documents even when the information layouts vary. Furthermore, exporting document data extracts to Excel makes the data more accessible for uses like data wrangling, analysis and record keeping.

In this guide, we will walk you through using DocumentPro’s Rest API with Python to extract data from Invoice PDFs (as an example) and exporting the extracted data to Excel.

Getting started with DocumentPro

To begin, sign up to DocumentPro with a free account (you’ll have 30 credits to spend on document parsing). For this tutorial, we will use our pre-built Invoice parser, but you can easily create your own custom parser.

DocumentPro

Click on the Invoice parser on the left side navigation menu.

DocumentPro

A parser is a JSON schema that describes the structure of the output of a document and the facts and tables to extract from a document. On the right of the image above, you see the fields extracted by this Invoice parser.

Parsers can extracts facts, tables, and repeating data as fields. Below we show example of fields extracted by this parser.

{
  "fields": [
    {
      "name": "purchase_order_number",
      "type": "text",
      "description": null,
      "required": true
    },
    {
      "name": "invoice_number",
      "type": "text",
      "description": null,
      "required": false
    },
    {
      "name": "invoice_date",
      "type": "text",
      "description": null,
      "required": false
    },
    {
      "name": "invoice_due_date",
      "type": "text",
      "description": null,
      "required": false
    },
    {
      "name": "payment_terms",
      "type": "text",
      "description": null,
      "required": false
    },
    {
      "name": "currency",
      "type": "text",
      "description": null,
      "required": false
    },
    {
      "name": "invoice_total_amount",
      "type": "text",
      "description": null,
      "required": false
    },
    {
      "name": "invoice_total_tax_amount",
      "type": "text",
      "description": null,
      "required": false
    },
    {
      "name": "invoice_tax_rate",
      "type": "text",
      "description": null,
      "required": false
    },
    {
      "name": "invoice_subtotal_amount",
      "type": "text",
      "description": null,
      "required": false
    },
    {
      "name": "seller_name",
      "type": "text",
      "description": null,
      "required": false
    },
    {
      "name": "seller_address",
      "type": "text",
      "description": null,
      "required": false
    },
    {
      "name": "seller_email",
      "type": "text",
      "description": null,
      "required": false
    },
    {
      "name": "seller_phone",
      "type": "text",
      "description": null,
      "required": false
    },
    {
      "name": "seller_vat_number",
      "type": "text",
      "description": null,
      "required": false
    },
    {
      "name": "seller_bank_name",
      "type": "text",
      "description": null,
      "required": false
    },
    {
      "name": "seller_bank_account_number",
      "type": "text",
      "description": null,
      "required": false
    },
    {
      "name": "buyer_name",
      "type": "text",
      "description": null,
      "required": false
    },
    {
      "name": "buyer_address",
      "type": "text",
      "description": null,
      "required": false
    },
    {
      "name": "buyer_vat_number",
      "type": "text",
      "description": null,
      "required": false
    },
    {
      "name": "items",
      "type": "table",
      "description": null,
      "subFields": [
        {
          "name": "description",
          "type": "text",
          "description": null,
          "required": false
        },
        {
          "name": "quantity",
          "type": "number",
          "description": null,
          "required": false
        },
        {
          "name": "unit_price",
          "type": "number",
          "description": null,
          "required": false
        },
        {
          "name": "subtotal",
          "type": "number",
          "description": null,
          "required": false
        }
      ],
      "required": false
    }
  ]
}

The invoice_date and invoice_due_date are facts that the invoice parser will extract. The items of an invoice is usually a table in the Invoice document that the parser will extract. And the description and and quantity are two of the columns of the table that parser will extract.

You will need the parser id so that you can upload PDF invoices against this parser with the Rest API. Go the parser settings to copy the parser id.

DocumentPro

You will also need your API Key. Go to the app settings by clicking on the settings button at the bottom left of the navigation menu.

DocumentPro

Ingesting an Invoice using the Rest API and Python

Here’s the Invoice we will use for this guide.

DocumentPro

Now we can create a Python script that will make a http POST call to our Rest API using your API Key to start processing documents. Let’s start by importing requests

import requests

Let’s set the API key and the parser we want to use as variables

PARSER_ID = YOUR_PARSER_ID
API_KEY = YOUR_API_KEY

Making a call to the API

file_name = 'filename.pdf'
file_path = 'path/to/your/file/{}'.format(file_name)

response = requests.request(
	'POST',
	'https://api.documentpro.ai/files/upload/{}'.format(PARSER_ID),
	headers={
		'x-api-key': API_KEY
	},
	data={},
	files=[(
		'file',(
			file_name,
			open(file_path,'rb')
			'application/pdf'
		)
	)]
)

if response.status_code == 200:
	print('File uploaded successfully')
	print(response.json())

When you upload a PDF document to the API, you will get back a request_id . The document will start processing immediately. Your response will look like this:

{
  "request_id": "unique_document_identifier",
  "success": true
}

Retrieving a parsed PDF

The document parsing process is asynchronous and usually takes less than 20 seconds. You can periodically check whether a document has completed processing using the following API to retrieve the document with it’s status and extracted data.

import requests

Let’s set the request_id you got when you ingested your document and your API Key

REQUEST_ID = YOUR_REQUEST_ID
API_KEY = YOUR_API_KEY

Make the GET request

response = requests.request(
	'GET',
	'https://api.documentpro.ai/files?request_id={}'.format(REQUEST_ID),
	headers={
		'x-api-key': API_KEY
	},
	data={}
)

print(response.json())

If the file is still processing , you will get a response like this, where the result_json_data field is empty.

{
    "request_id": "your-request-id",
    "request_status": "processing",
    "response_body": {
        "file_name": "your-file-name",
        "file_presigned_url": "temporary-url-to-your-file",
        "user_error_msg": null,
        "template_id": "your-template-id",
        "template_type": "receipt",
        "template_title": "Receipt",
        "num_pages": 2,
        "result_json_data": {}
    },
    "created_at": "2023-07-30T19:05:10.696893",
    "updated_at": "2023-07-30T19:05:29.565249"
}

If the file has completed processing, your result_json_data field will be populated with a JSON of the extracted data. Like this:

{
    "request_id": "your-request-id",
    "request_status": "completed",
    "response_body": {
        "file_name": "your-file-name",
        "file_presigned_url": "temporary-url-to-your-file",
        "user_error_msg": null,
        "template_id": "your-template-id",
        "template_type": "receipt",
        "template_title": "Receipt",
        "num_pages": 2,
        "result_json_data": {
            "purchase_order_number": "5300016211",
            "invoice_number": "P007757",
            "invoice_date": "04/06/20",
            "invoice_due_date": null,
            "payment_terms": "30 Days",
            "currency": "USD",
            "invoice_total_amount": "14,070.00",
            "invoice_total_tax_amount": "670.00",
            "invoice_tax_rate": "5.000%",
            "invoice_subtotal_amount": "13,400.00",
            "seller_name": "Grayling Momentum Ltd",
            "seller_address": "HSBC Bank Middle East Limited, P.O.Box: 3766 Dubai , UAE",
            "seller_email": null,
            "seller_phone": null,
            "seller_vat_number": "100390502100003",
            "seller_bank_name": "HSBC Bank Middle East Limited",
            "seller_bank_account_number": "AE900 2000 0003 0033 0210 01",
            "buyer_name": "Audi Volkswagen Middle East FZE",
            "buyer_address": "Dubai Airport Freezone, East Wing Building 4E/B, 8th Floor 27758 Dubai, United Arab Emirates",
            "buyer_vat_number": null,
            "items": [
               {
               "description": "PR Retainer for the month of June 2020",
               "quantity": 1,
               "unit_price": 13400,
               "subtotal": 13400
               }
            ]
         }
    },
    "created_at": "2023-07-30T19:05:10.696893",
    "updated_at": "2023-07-30T19:05:29.565249"
}

Export to Excel

Next let’s take a look at exporting the document extract to Excel. Currently, DocumentPro only supports file downloads through the web platform. To get an extract for your document you need to go into the documents section for your parser and select the document that you just parsed with the API.

Export single document to Excel

Go into the documents section of your Invoice parser.

DocumentPro

Click view on the document you just parsed.

DocumentPro

Select the download tab and click export to file to open the options for file export.

DocumentPro

Now select the Excel file format. For data layout, selecting merged returns the entire extracted data in one row, while selecting line items will create a row for each row in your table extract. Clicking Export to file will start downloading a file as soon as it’s ready.

Export multiple documents to Excel

DocumentPro also has the ability to export data from multiple parsed documents to Excel. For this, you need to click on the preview section of your Invoice parser. Here you will see a preview of the latest documents that were parsed.

DocumentPro

Click Export to file to see the export options.

DocumentPro

The Date Range option allows you to export documents parsed between one to ninety days in the past in one file. Clicking Export to file will start the process of preparing a file and send you a temporary link to the file to your email which can be used to download the file.

You have now successfully extracted the data from one or many Invoice PDFs and exported them to Excel.

Conclusion

In this guide, you converted a PDF document to Excel. You used the DocumentPro web app to select the Invoice document parser and your API Key. In addition, you used Python to interact with DocumentPro’s API to extract data from your PDF with the Invoice document parser.

DocumentPro helps developers simplify the extraction of information like facts, tables and repeating data from PDF and Image files using powerful GPT document parsers. If you’re looking at processing larger volume of documents check out our various plans.