A Technical Overview of Automated Data Extraction from Documents

Financial institutions face challenging decisions as part of their daily routines. Lenders decide whether to grant applicant X a loan and on what terms, while investors decide whether to invest in company Y and how much capital to commit.

What all financial institutions have in common is that their decisions rest on the financial data contained in documents filed by companies and individuals. These documents usually include numerous figures, text passages, tables, and graphs. Together they amount to a large volume of data that a hard-working (and very careful) employee must enter into a computer before it can be used, for example to estimate future profits.

Financial institutions nowadays use computers and algorithms to analyze this data and to support decision-making; however, extracting the data and feeding it to those computers is still (surprisingly!) manual. Take a moment to think of the time spent by the hard-working employee who still uses a ruler to avoid errors when extracting data from tables. This manual process is tedious and repetitive, and it should be avoided because it is time-consuming, prone to errors, and, above all, because an automated alternative is at hand.

This short article provides an overview of the components needed for an automated solution that extracts data from financial documents, giving financial institutions a fast, accurate, and effortless way to obtain the financial information they need for their decisions.

Component 1: An OCR engine capable of identifying text

One of the earlier successes of neural networks was identifying handwritten and typed characters and digits. Nowadays, identifying a single digit or character is considered a trivial task from an AI perspective, but recognizing an entire text is still more challenging. This is mainly because the algorithm must distinguish between the document's different parts (words, sentences, paragraphs, empty areas, figures, etc.) and define the borders between them. In addition, processing large amounts of text inevitably brings more "noise" into the input, forcing the trained model to distinguish actual characters from "ink stains" caused by a poor-quality copy and to differentiate text from nearby graphs and photos.

Fortunately, state-of-the-art AI models can now extract text from large and diverse documents. Their output is detailed enough to reconstruct the physical document electronically, because each recognized fragment of text (a word, a sentence, or a paragraph) is returned together with its physical position on the page, expressed as coordinates.

From this geometrical data, we can extract many other types of structured objects, such as tables and questionnaires, that contain much of the data required by financial institutions. When the extraction of tables and key-value pairs is not successful, the data can still be recovered either by employing another AI model trained on features engineered from the initial result or by applying geometric rules that combine the coordinates with logical conditions.
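
As a rough illustration of such a geometric rule, the Python sketch below groups recognized words into table rows by comparing the vertical centers of their bounding boxes. The `Word` class, its field names, and the `tolerance` parameter are assumptions made for this example; each OCR engine returns its own response format, and this heuristic is only one simple way to use the coordinates.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical OCR fragment: text plus bounding box (top-left origin)."""
    text: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def group_into_rows(words: List[Word], tolerance: float = 0.5) -> List[List[str]]:
    """Group words into rows, then order each row from left to right.

    A word joins the current row when its vertical center lies within
    `tolerance * height` of the center of the word that started the row.
    """
    rows = []
    for word in sorted(words, key=lambda w: w.y + w.height / 2):
        center = word.y + word.height / 2
        if rows and abs(center - rows[-1]["center"]) <= tolerance * word.height:
            rows[-1]["words"].append(word)
        else:
            rows.append({"center": center, "words": [word]})
    return [[w.text for w in sorted(r["words"], key=lambda w: w.x)] for r in rows]


# Invented sample data standing in for an OCR response.
sample = [
    Word("Revenue", 10, 100, 60, 12),
    Word("1,200", 200, 101, 40, 12),
    Word("Expenses", 10, 120, 70, 12),
    Word("800", 200, 119, 30, 12),
]
table_rows = group_into_rows(sample)
```

With the sample above, each label ends up in the same row as the number printed beside it, even though their top edges differ by a pixel.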

The choice between several base OCR (optical character recognition) engines for a particular task should be made after carefully considering the advantages of each algorithm. However, the discussion of the specific features of each algorithm is beyond the scope of this short overview.

Component 2: Parsing valuable data from the OCR response

Even after the OCR engines and AI models described above have transformed the physical document into machine-readable text, one more step remains before we reach the goal of extracting specific values: parsing the values of interest from the results of the OCR and AI algorithms.

To identify and parse these values, we may employ the following tools separately or in combination:

1. Identifying values using regular expressions – This is probably the most conventional and simplest approach. We search the text for expressions that indicate where the value is located. For example, if on a particular certificate the holder's ID number always comes after the label "No", we can search the text for this pattern and return the string that follows it. We can also validate the parsed value by checking, for example, whether the ID passes the Luhn checksum (where the ID scheme uses one) or by looking the ID up in government databases.

2. Identifying values by their physical position on the page – In cases where it's not possible to parse a value by the regex patterns of its nearest neighbors or in cases where the OCR response does not directly reflect the true ordering of words in the text, we can make use of the fact that each "part of text" is returned together with its "bounding box" coordinates. This information allows us to parse values according to their position relative to other objects on the page.

3. Training NLP models to enable parsing from unstructured documents – Some document types lack a standard and consistent format. For example, US utility bills (which are commonly required by financial institutions as a confirmation of a business address) are provided by hundreds of different US utility companies. Consequently, there are numerous utility bill formats from which we should be able to extract addresses. In such cases, extracting addresses with the techniques above is infeasible. To tackle this problem, we turn to natural language processing (NLP).
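
The first two approaches in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production parser: the `Box` layout, the function names `extract_id` and `value_right_of`, and the exact pattern for the "No" label are hypothetical, and the Luhn check only makes sense for ID schemes that actually use that checksum.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# Assumed pattern: the label "No", an optional "." or ":", then the number.
ID_PATTERN = re.compile(r"\bNo\.?\s*:?\s*(\d[\d\- ]*\d)")


def extract_id(text: str) -> Optional[str]:
    """Approach 1: find the value after the label, then validate it."""
    match = ID_PATTERN.search(text)
    if match is None:
        return None
    candidate = re.sub(r"\D", "", match.group(1))  # keep digits only
    return candidate if luhn_valid(candidate) else None


@dataclass
class Box:
    """Hypothetical OCR fragment with its bounding box (top-left origin)."""
    text: str
    x: float
    y: float
    width: float
    height: float


def value_right_of(label: str, boxes: List[Box]) -> Optional[str]:
    """Approach 2: return the nearest fragment to the right of `label`
    whose box spans the label's vertical center."""
    anchor = next((b for b in boxes if b.text == label), None)
    if anchor is None:
        return None
    mid = anchor.y + anchor.height / 2
    candidates = [
        b for b in boxes
        if b is not anchor
        and b.x >= anchor.x + anchor.width
        and b.y <= mid <= b.y + b.height
    ]
    return min(candidates, key=lambda b: b.x).text if candidates else None
```

For instance, `extract_id("Certificate No 79927398713 issued")` returns the digit string because it passes the checksum, while a number with a wrong final digit is rejected. `value_right_of("Total", boxes)` picks the fragment printed on the same line as the label, regardless of the order in which the OCR engine listed the fragments.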

Natural language processing (NLP) enables computers to process and interpret human language. It uses AI to take real-world input, such as the text of a document, and make sense of it. During processing, the input is converted into a numerical representation that can be fed into machine learning algorithms based on statistical methods. The algorithm learns to perform a task from the training data it is given, adjusting its parameters as more data is processed. In this way, models built on machine learning, deep learning, and neural networks refine their own rules through repeated processing and learning.

One NLP capability, Named Entity Recognition (NER), allows a trained model to extract named entities from unstructured text, including addresses when the training data contains them. A widely used foundation for such models is BERT (Bidirectional Encoder Representations from Transformers), a language model published by Google in 2018. BERT is a pre-trained neural network designed to be fine-tuned for NLP tasks such as NER.
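
To show how NER output turns into an extracted address, the sketch below assumes a model that labels each token with a tag in the common BIO scheme ("B-" begins an entity, "I-" continues it, "O" is outside any entity). The article does not say which tagging scheme is used, so this is an assumption, and the tags here are hard-coded stand-ins for the predictions of a fine-tuned model; `decode_entities` and the `ADDRESS` label are hypothetical names.

```python
from typing import List, Tuple


def decode_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level BIO tags into (label, text) entity spans."""
    entities = []
    current_label, current_tokens = None, []

    def flush():
        # Store the entity collected so far, if any.
        if current_tokens:
            entities.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts (an orphan "I-" tag is treated as a start).
            flush()
            current_label, current_tokens = label, [token]
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O": outside any entity, so close the open one.
            flush()
            current_label, current_tokens = None, []
    flush()
    return entities


# Invented tokens and tags standing in for a model's predictions.
tokens = ["Service", "address", ":", "12", "Main", "Street", "Springfield", "Amount", "due"]
tags = ["O", "O", "O", "B-ADDRESS", "I-ADDRESS", "I-ADDRESS", "I-ADDRESS", "O", "O"]
found = decode_entities(tokens, tags)
```

Because the model labels tokens by their meaning in context rather than by a fixed layout, the same decoding step applies to bills from different utility companies.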

Conclusion

Once these two components are implemented satisfactorily, important data can be extracted from financial documents accurately and efficiently. To be harnessed for financial analysis, the output of the extraction process should be presented as key-value pairs, with each field name paired with its extracted value.
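
A hypothetical illustration of such key-value output is sketched below in Python. Every field name and value is invented for the example and does not come from a real document or a specific product's schema; JSON is used here only as one convenient serialization.

```python
import json

# Hypothetical extraction result; all field names and values are invented.
extracted_fields = {
    "document_type": "utility_bill",
    "account_holder": "Example Bakery LLC",
    "service_address": "12 Main Street, Springfield",
    "statement_date": "2021-03-31",
    "amount_due": "184.20",
    "currency": "USD",
}


def to_key_value_json(fields: dict) -> str:
    """Serialize the extracted fields as a JSON object with sorted keys."""
    return json.dumps(fields, indent=2, sort_keys=True)


print(to_key_value_json(extracted_fields))
```

Keeping the values as strings preserves exactly what was read from the page; converting amounts and dates to typed values can then happen in a separate, explicit validation step.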

Once the financial data of an individual or a company is organized in this form, the process of making a financial decision is shortened considerably.
