Feedback Learning: Automating the Process of Correcting and Completing the Extracted Information

EasyChair Preprint 992
6 pages • Date: May 12, 2019

Abstract

In recent years, with the increasing use of digital media and advances in deep learning architectures, most paper-based documents have been converted into digital versions. These advances have helped state-of-the-art Optical Character Recognition (OCR) and digital mailroom technologies become progressively more efficient. Commercially, there already exist end-to-end systems that use OCR and digital mailroom technologies to extract relevant information from financial documents such as invoices. However, there is plenty of room for improvement in automating the correction of errors that remain after information extraction. This paper describes a user-involved self-correction concept, based on sequence-to-sequence Neural Machine Translation (NMT), applied to rectify errors in the results of information extraction. Even though many efficient post-OCR error rectification methods have been introduced in the recent past to improve the quality of digitized documents, they are still imperfect and demand improvement in context-based error correction, specifically for documents involving sensitive information. This paper further illustrates that sequence learning, with the help of feedback provided during each training cycle, yields relatively better results and outperforms state-of-the-art OCR error correction methods.

Keyphrases: Post IE Error Correction and Completeness, Sequence to Sequence Neural Machine Translation (NMT), document understanding