How to Extract Receipt Data with OCR, Regex and AI
Our journey of developing a high-accuracy receipt extraction solution.
Learn more about the common use cases of PDF data extraction and how you can extract data from PDF files with manual or automated methods.
PDF, or Portable Document Format, is widely used across industries as a more efficient, reliable, and cost-effective replacement for paper. There are trillions of PDF files in the world, and the number keeps growing every day. However, this convenient file format has a drawback when it comes to data extraction.
Unlike the content on webpages, which is structured by HTML tags, the data within PDF files is not organized in a way that a computer can understand. Even if you manage to pull the data out, the result is likely to be messy.
Let’s make it a bit clearer with an example.
If you wish to analyze the data from several receipts, you will need to organize all the information into an Excel file or a table as shown below.
If you extract the data from PDF files using OCR technologies without layout extraction AI, the result might be plain text with no organization.
Take this sample Purchase Order from TemplateLab:
After you extract the data using OCR technologies, the result will look like:
All the data is jumbled together and not structured in any way. If you want to compute the sum of 10 purchase orders, you will have to parse each PDF and copy and paste the subtotals one by one into a spreadsheet.
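To make the problem concrete, here is a minimal sketch of pulling subtotals out of unstructured OCR text with a regular expression and summing them. The sample text, the `SUBTOTAL` label, and the function names are assumptions made for illustration; real OCR output varies from document to document, which is why a pattern like this tends to break as soon as the layout changes.

```python
import re
from decimal import Decimal

# Hypothetical OCR output: plain text with no structure (illustrative only).
SAMPLE_OCR_TEXT = """PURCHASE ORDER
Item Qty Unit Price Total
Widget 2 10.00 20.00
Gadget 1 1,250.50 1,250.50
SUBTOTAL $1,270.50
TAX 101.64
TOTAL $1,372.14"""

# A line that starts with "subtotal" (or "sub-total"), followed by an amount
# with two decimal places. Lines starting with plain "TOTAL" do not match.
SUBTOTAL_RE = re.compile(
    r"^\s*sub\s*-?\s*total\b[^\d\n]*([\d,]+\.\d{2})",
    re.IGNORECASE | re.MULTILINE,
)


def extract_subtotal(ocr_text):
    """Return the subtotal found in the OCR text as a Decimal, or None."""
    match = SUBTOTAL_RE.search(ocr_text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return Decimal(match.group(1).replace(",", ""))


def sum_subtotals(ocr_texts):
    """Add up the subtotals of several documents, skipping any without one."""
    total = Decimal("0")
    for text in ocr_texts:
        subtotal = extract_subtotal(text)
        if subtotal is not None:
            total += subtotal
    return total
```

Calling `sum_subtotals` on a list of OCR strings returns a single `Decimal`. Documents where the pattern finds nothing are silently skipped here, so in practice you would want to log them for manual review.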
There should be an easier way to extract data from PDF files to structured data, right?
In this blog post, we will be talking about the:
Since PDF was introduced in the early 1990s, businesses of all kinds, in finance, logistics, retail, and more, have used it as the main format for internal and external information exchange. PDF files can contain important business data in the form of text, tables, and even images. Common documents digitized into PDF files include:
Piles of documents with important business information can now be digitized into one single file and sent to others within seconds. However, you usually cannot edit the content of PDF files, since they very often contain images of scanned documents. Even if you manage to copy and paste data stored as text or tables in PDF files into other file formats like Excel or CSV, there will often be formatting and ordering issues that render the results unusable.
This is probably the go-to solution when dealing with a small number of PDF files. If your PDF files contain images, you just open the files and rekey everything from the images into a spreadsheet, or even create a JSON file.
Although this method is scalable, as you can always hire more people to complete the tasks, it is error-prone and inefficient. Furthermore, your employees will spend precious time on repetitive tasks when they could be doing more strategic and productive work.
In the long run, extracting data from PDFs through in-house manual data entry can be quite costly and is certainly not ideal. Outsourcing it to other companies might reduce the cost and allow employees to focus on their original tasks. However, outsourcing manual data entry raises a few common issues, such as quality control and data security, since you are handing business data to a third party.
Involving PDF converters can speed up the in-house manual data entry process as employees will not have to rekey everything from scratch. PDF converters come in different forms, such as on-premises or web-based software, and usually convert PDFs into Excel or CSV.
Nevertheless, they cannot convert images into Excel or CSV or handle batches of documents. To handle a significant number of documents more efficiently, it is better to look for an automated solution.
Compared to the aforementioned methods, automated data extraction solutions like FormX are more scalable, cost-effective, efficient and accurate.
FormX comes with a set of pre-trained templates, such as receipts, business registration, food license, passport, Hong Kong ID, etc., for users to extract data from images or PDF files. All you have to do is upload the files and let FormX take care of the rest. The extracted data is available as a JSON or CSV file that can be sent to other systems for further use.
For example, suppose you want to extract the data from Coyote Bar & Grill’s receipts stored in a PDF file to analyze consumer behavior. You can use FormX’s pre-trained receipt template.
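As a rough illustration of passing structured output on to another system, the sketch below flattens a receipt in JSON form into CSV rows using only the Python standard library. The field names (`date`, `time`, `total_amount`, `items`, `name`, `amount`, `price`) are assumptions chosen for this example and are not the actual FormX output schema.

```python
import csv
import io
import json

# Hypothetical extraction result; the field names are illustrative only.
SAMPLE_JSON = """{
  "date": "2021-05-01",
  "time": "19:42",
  "total_amount": "36.50",
  "items": [
    {"name": "Burger", "amount": 2, "price": "12.00"},
    {"name": "Iced Tea", "amount": 1, "price": "12.50"}
  ]
}"""


def receipt_json_to_csv(receipt_json):
    """Flatten one receipt into CSV text with one row per line item."""
    receipt = json.loads(receipt_json)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["date", "time", "item_name", "amount", "price"])
    for item in receipt.get("items", []):
        # Repeat the receipt-level fields on every row so each row stands alone.
        writer.writerow([
            receipt.get("date", ""),
            receipt.get("time", ""),
            item.get("name", ""),
            item.get("amount", ""),
            item.get("price", ""),
        ])
    return buffer.getvalue()
```

Repeating the date and time on each row keeps the CSV easy to filter and pivot in a spreadsheet, at the cost of some redundancy.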
To test the result, you can upload it to our portal and scroll down to the JSON Output section as shown below. As you can see, FormX has extracted data such as the total amount, date, time, and the price, name and amount of each product.
Let’s demonstrate with another receipt from MUJI. Although the layouts of MUJI’s and Coyote Bar & Grill’s receipts are very different, FormX can still extract the data with high precision.
The specific information about each product and the receipt has been extracted and organized by FormX within a few seconds as shown below.
Check out our use cases to see how businesses have benefited from incorporating FormX into their workflow.
If you have other types of PDF files from which you wish to extract data, you can also collect some samples of documents and use them to train the software. Thereafter, the only manual process would be to test and verify the extracted data from your PDF files.
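Part of that verification step can be automated with simple consistency checks. The sketch below compares the stated total with the sum of the line items and flags any mismatch for manual review. It assumes a hypothetical receipt structure in which `price` is a unit price and `amount` is a quantity, and it ignores tax, tips, and discounts; for receipts that include those, compare against a subtotal field instead.

```python
from decimal import Decimal


def find_total_mismatch(receipt, tolerance=Decimal("0.01")):
    """Return stated total minus summed line items, or None if they agree.

    `receipt` is a dict with hypothetical keys: "total_amount" and "items",
    where each item has a unit "price" and a quantity "amount".
    """
    items_sum = sum(
        (Decimal(str(item["price"])) * int(item["amount"])
         for item in receipt["items"]),
        Decimal("0"),
    )
    stated_total = Decimal(str(receipt["total_amount"]))
    difference = stated_total - items_sum
    # Allow a small tolerance for rounding on the receipt.
    return None if abs(difference) <= tolerance else difference


# Example usage with made-up data: a total whose digits were swapped.
suspect = {
    "total_amount": "63.50",
    "items": [
        {"name": "Burger", "amount": 2, "price": "12.00"},
        {"name": "Iced Tea", "amount": 1, "price": "12.50"},
    ],
}
mismatch = find_total_mismatch(suspect)
```

A non-`None` result does not say which field is wrong, only that the receipt deserves a second look.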
Need to automate your data entry process and extract data from images or PDF files with ease? Talk to us about your workflow and needs so that we can help maximize your productivity.