An Ensemble Regression Approach for OCR Error Correction
Abstract
This thesis addresses error correction for text generated by Optical Character Recognition (OCR), also known as OCR post-processing: how to detect erroneous words in OCR output and how to suggest the most appropriate candidates to correct them. The thesis demonstrates that OCR errors are inherently more varied and volatile than handwriting or typing errors, and that existing OCR post-processing approaches each have limitations. By analyzing recent developments in error correction techniques, we show that compositional approaches, which combine multiple correction inferences, are widely researched and practically useful. We therefore propose an ensemble regression approach that combines correction inferences to rank correction candidates for complex OCR errors. On the practical side, we make available a benchmark dataset for this task and conduct a comprehensive performance analysis across different correction inferences and ensemble algorithms. In particular, the experimental results show that the proposed ensemble method is robust, handles complex OCR errors, and outperforms various baselines.