One of the more positive changes embraced over the past year is digitization, especially in the mortgage industry, where piles of paperwork have long filled desks and filing cabinets.

Digitization lets us sign documents electronically and store digital files instead of paper, all without leaving the comfort and safety of home. Many people are already familiar with Optical Character Recognition (OCR) and its ability to convert paper-based documents into electronic ones. However, they might not be aware that AI technologies can turn unstructured data into indexed data, saving companies significant overhead.

You may be wondering how digitizing documents results in cost savings. The answer lies in data extraction. For tasks involving unstructured paper documents, it takes time to find the document, identify the data point(s), and then take action or log the results. Simply converting the paper documents with OCR saves time on the first step but still requires manual effort for the other two.

An AI-enabled data extraction solution can actively learn to extract those data points over time, allowing employees to focus on higher-value tasks. The process can be completed without any complex programming and is designed to be user-friendly so that anyone can benefit from its accuracy.

FPT’s Data Extraction Solution

FPT recently helped one of our clients develop a system built on the IBM Watson Natural Language Understanding (NLU) platform, which was chosen for its comprehensive text analytics features and its ability to establish complex data relationships. Our team has implemented many pre- and post-processing customizations specifically for mortgage-related datasets to improve the accuracy of results.

The system begins by reformatting the OCR results from a scanned document to remove noise and other artefacts. Then it uses Natural Language Processing (NLP) to identify the important data points (items) and groups of data (objects). Initially, it must be calibrated manually to recognize the items and objects, but it then begins using Named Entity Recognition (NER) and Relation Extraction (RE) to identify them automatically. It also incorporates a dictionary to bootstrap the annotation task, provide equivalent words, and reduce errors. Finally, the results of the data extraction are captured and output in the preferred format.
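To make the first steps more concrete, here is a minimal Python sketch of OCR clean-up followed by dictionary-based pre-annotation. It is an illustration only and not the production system or the IBM Watson API: the noise characters, the EQUIVALENTS dictionary, the field labels, and both function names are assumptions chosen for the example.

```python
import re

# Hypothetical dictionary: maps equivalent words (abbreviations, variants)
# to one canonical field label. The real dictionary is not published.
EQUIVALENTS = {
    "lot": "LOT",
    "lt": "LOT",
    "block": "BLOCK",
    "blk": "BLOCK",
    "subdivision": "SUBDIVISION",
    "subd": "SUBDIVISION",
}

# Longest keys first so "subdivision" is tried before "subd".
_KEYS = sorted(EQUIVALENTS, key=len, reverse=True)
_ITEM = re.compile(
    r"\b(" + "|".join(map(re.escape, _KEYS)) + r")\b\.?\s+(\w+)",
    re.IGNORECASE,
)


def clean_ocr(text: str) -> str:
    """Remove simple OCR noise: stray marks and irregular whitespace."""
    text = re.sub(r"[|~^`_]+", " ", text)  # assumed noise characters
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip()


def pre_annotate(text: str) -> list[tuple[str, str]]:
    """Return (label, value) pairs for every dictionary term found."""
    return [(EQUIVALENTS[word.lower()], value)
            for word, value in _ITEM.findall(text)]


raw = "Lt  12 | Blk. 3 ~~ Subd   Oakwood"
print(pre_annotate(clean_ocr(raw)))
```

In a real deployment the dictionary output would only seed the annotations; the trained NER and RE models would take over once enough examples exist.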

Complicated as it might sound, the solution is optimized for mortgage-related documents and tailored for non-technical personnel. Based on customers' feedback, the latest update includes a Pattern Extractor feature that identifies specific patterns under-represented in the dataset. This allows users to create rules to quickly find patterns that would otherwise require larger amounts of training data.
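As a rough illustration of the idea behind rule-based pattern extraction, the Python sketch below applies a user-defined rule to text. It does not show the actual Pattern Extractor feature: the RULES table, the PARCEL_NUMBER label, and the number format are hypothetical and chosen only for the example.

```python
import re

# Hypothetical rule: a parcel number written as three hyphen-separated
# digit groups. A rule like this covers a format that appears too rarely
# in the training data for a statistical model to learn it.
RULES = {
    "PARCEL_NUMBER": re.compile(r"\b\d{3}-\d{2}-\d{3}\b"),
}


def apply_rules(text: str) -> list[tuple[str, str]]:
    """Return (label, matched text) for every rule match in the text."""
    found = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            found.append((label, match.group()))
    return found


print(apply_rules("Parcel 123-45-678 recorded in Book 9"))
```

The point of such a rule is that one line written by a user can stand in for the many annotated examples a model would otherwise need.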

Data Extraction in Action

A client recently approached us for help combining OCR with AI to see whether the technology could replace the manual work currently done on property image files of varying clarity. The client had already been down the OCR path and wanted to improve the results with AI. A pilot program was run to determine how viable the solution would be.

The pilot used a data extraction tool with Named Entity Recognition (NER) and Relation Extraction (RE) models specially configured for the dataset. The models were set up using the fields relating to subdivision property and abstract property, with the relationships among them pre-defined. Pre- and post-processing steps were added to enhance the models’ output.

After the documents were scanned and pre-processed, the client began to train the model by identifying and annotating the correct data points. This took about 7 minutes per document initially, then dropped to 5 minutes once the tool had enough training data to pre-annotate automatically. Approximately 2,500 documents were annotated, and then the tool began extracting from a much larger dataset. Within 8 weeks the pilot was a success, and further phases were completed to enhance the OCR and improve the model.
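The figures above allow a quick estimate of the annotation effort. The Python sketch below uses only the numbers quoted (about 2,500 documents, 7 and 5 minutes per document). The article does not say how many documents were annotated before pre-annotation started, so that count is left as a parameter and the example only computes the two extremes; the function name and parameters are assumptions made for the illustration.

```python
def annotation_hours(total_docs: int, docs_at_initial_rate: int,
                     initial_minutes: float = 7.0,
                     assisted_minutes: float = 5.0) -> float:
    """Total annotation time in hours for a two-phase annotation effort."""
    if not 0 <= docs_at_initial_rate <= total_docs:
        raise ValueError("docs_at_initial_rate must be within 0..total_docs")
    assisted_docs = total_docs - docs_at_initial_rate
    minutes = (docs_at_initial_rate * initial_minutes
               + assisted_docs * assisted_minutes)
    return minutes / 60


# Bounds only: the real split between the two phases is not stated.
lower = annotation_hours(2500, 0)     # every document at 5 minutes
upper = annotation_hours(2500, 2500)  # every document at 7 minutes
print(f"Between {lower:.0f} and {upper:.0f} hours of annotation")
```

On these assumptions the annotation effort falls somewhere between roughly 208 and 292 hours, and each pre-annotated document saves 2 of the original 7 minutes.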

A Customizable Tool

With an innovative and customizable extraction tool, it is now possible to have a digital index of a wide range of unstructured paper documents. For more information about digitization solutions or other services, contact FPT today.

Author Trang Nguyen