Improving OCR Post Processing with Machine Learning Tools
Optical Character Recognition (OCR) post-processing involves data-cleaning steps for digitized documents, such as books or newspaper articles. One step in this process is the identification and correction of spelling and grammar errors introduced by flaws in the OCR system. This work is a report on our efforts to enhance the post-processing for large repositories of documents.
The main contributions of this work are:
• Development of tools and methodologies to build both OCR and ground truth text correspondence for training and testing of proposed techniques in our experiments. In particular, we will explain the alignment problem and tackle it with our de novo algorithm that has shown a high success rate.
• Exploration of the Google Web 1T corpus to correct errors using context. We show that over half of the errors in the OCR text can be detected and corrected.
• Applications of machine learning tools to generalize the past ad hoc approaches to OCR error corrections. As an example, we investigate the use of logistic regression to select the correct replacement for misspellings in the OCR text.
• Use of container technology to address the state of reproducible research in OCR and Computer Science as a whole. Many of the past experiments in the field of OCR are not considered reproducible research, which raises the question of whether the original results were outliers or finessed.
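The alignment problem mentioned in the first contribution — pairing each OCR token with its ground-truth counterpart for training and testing — can be sketched with a standard edit-distance traceback. This is a generic Needleman-Wunsch-style dynamic program, not the de novo algorithm the report develops:

```python
def align(ocr_tokens, truth_tokens):
    """Align two token sequences by minimum edit distance.

    Returns a list of (ocr, truth) pairs; None marks an insertion
    or deletion. Generic dynamic-programming alignment, shown only
    to illustrate the problem, not the report's algorithm.
    """
    n, m = len(ocr_tokens), len(truth_tokens)
    # dp[i][j] = edit distance between the length-i and length-j prefixes
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ocr_tokens[i - 1] == truth_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # drop OCR token
                           dp[i][j - 1] + 1,        # drop truth token
                           dp[i - 1][j - 1] + cost) # match / substitute
    # Trace back from the bottom-right corner to recover the pairing.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (
                0 if ocr_tokens[i - 1] == truth_tokens[j - 1] else 1):
            pairs.append((ocr_tokens[i - 1], truth_tokens[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((ocr_tokens[i - 1], None))
            i -= 1
        else:
            pairs.append((None, truth_tokens[j - 1]))
            j -= 1
    return pairs[::-1]
```

The resulting (OCR, ground-truth) pairs are exactly the supervision signal needed to train and evaluate correction techniques.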
Hybrid model of post-processing techniques for Arabic optical character recognition
Optical character recognition (OCR) is used to extract the text contained in an image. One of the stages in OCR is post-processing, which corrects the errors in the OCR output text. The OCR multiple-outputs approach consists of three processes: differentiation, alignment, and voting. Existing differentiation techniques suffer from the loss of important features because they use N-versions of input images. On the other hand, alignment techniques in the literature are based on approximation, while the voting process is not context-aware. These drawbacks lead to a high error rate in OCR. This research proposed three improved techniques for differentiation, alignment, and voting to overcome the identified drawbacks. These techniques were later combined into a hybrid model that can recognize optical characters in the Arabic language. Each of the proposed techniques was separately evaluated against three other relevant existing techniques. The performance measures used in this study were Word Error Rate (WER), Character Error Rate (CER), and Non-word Error Rate (NWER). Experimental results showed a relative decrease in error rate on all measurements for the evaluated techniques. Similarly, the hybrid model lowered WER, CER, and NWER by 30.35%, 52.42%, and 47.86%, respectively, compared to the three relevant existing models. This study contributes to the OCR domain in that the proposed hybrid model of post-processing techniques could facilitate the automatic recognition of Arabic text and hence lead to better information retrieval.
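The WER and CER measures used in this evaluation are conventionally defined as edit distance divided by reference length. A minimal sketch of both, using a plain Levenshtein distance (the standard definitions, not code from the study):

```python
def levenshtein(a, b):
    """Edit distance between two sequences (lists of words, or strings)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,             # deletion
                            curr[j - 1] + 1,         # insertion
                            prev[j - 1] + (x != y))) # substitution
        prev = curr
    return prev[-1]

def wer(hypothesis, reference):
    """Word Error Rate: word-level edits / reference word count."""
    ref = reference.split()
    return levenshtein(hypothesis.split(), ref) / len(ref)

def cer(hypothesis, reference):
    """Character Error Rate: character-level edits / reference length."""
    return levenshtein(hypothesis, reference) / len(reference)
```

For example, correcting one word out of three in a hypothesis gives a WER of 1/3; the reported percentages above are relative reductions in these rates.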
An improved Levenshtein algorithm for spelling correction word candidate list generation
Candidate-list generation in spelling correction is the process of finding words in a lexicon that are close to an incorrect word. The most widely used algorithm for generating a candidate list for an incorrect word is based on the Levenshtein distance. However, this algorithm takes too much time when there are many spelling errors. The reason is that computing the Levenshtein distance involves creating an array and filling its cells by comparing the characters of the incorrect word with the characters of a word from the lexicon. Since most lexicons contain millions of words, these operations are repeated millions of times for each incorrect word to generate its candidate list. This dissertation improved the Levenshtein algorithm by designing an operational technique that is incorporated into the algorithm. The proposed operational technique reduces the algorithm's processing time without affecting its accuracy: it cuts the operations required to compute cell values in the first, second, and third rows and columns of the Levenshtein array. The improved Levenshtein algorithm was evaluated against the original algorithm. Experimental results show that the proposed algorithm outperforms the original in processing time by 36.45% while the accuracy of both algorithms remains the same.
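The baseline this dissertation speeds up — scanning the whole lexicon and keeping every word within a distance threshold — can be sketched as follows. This is the unoptimized generate-and-filter loop, not the proposed operational technique:

```python
def levenshtein(a, b):
    """Standard edit distance via the full dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def candidates(misspelling, lexicon, max_distance=2):
    """Return lexicon words within max_distance edits of the misspelling.

    Baseline scan: the full table is computed per lexicon word, which
    is exactly the cost the dissertation's technique reduces.
    """
    result = []
    for word in lexicon:
        # Cheap pre-filter: distance is at least the length difference.
        if abs(len(word) - len(misspelling)) > max_distance:
            continue
        if levenshtein(misspelling, word) <= max_distance:
            result.append(word)
    return result
```

Even with the length pre-filter, the per-word table fill dominates for million-word lexicons, which is why reducing per-cell work pays off at scale.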
Utilizing Big Data in Identification and Correction of OCR Errors
In this thesis, we report on our experiments for detection and correction of OCR errors with web data. More specifically, we utilize Google search to access the big data resources available to identify possible candidates for correction. We then use a combination of the Longest Common Subsequences (LCS) and Bayesian estimates to automatically pick the proper candidate.
Our experimental results on a small set of historical newspaper data show a recall of 51% and a precision of 100%. The work in this thesis further provides a detailed classification and analysis of all errors. In particular, we point out the shortcomings of our approach in its ability to suggest proper candidates to correct the remaining errors.
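The Longest Common Subsequence component of the candidate-picking step can be sketched as a standard dynamic program; the Bayesian estimate combined with it for the final choice is not reproduced here:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_score(error, candidate):
    """Similarity in [0, 1]: a longer shared subsequence scores higher.

    Illustrative normalization (by the longer string); the thesis may
    combine LCS with Bayesian estimates differently.
    """
    return lcs_length(error, candidate) / max(len(error), len(candidate))
```

Candidates returned by the web search can then be ranked by this score, with ties broken by the probabilistic estimate.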
Toward the Optimized Crowdsourcing Strategy for OCR Post-Correction
Digitization of historical documents is a challenging task in many digital
humanities projects. A popular approach for digitization is to scan the
documents into images, and then convert images into text using Optical
Character Recognition (OCR) algorithms. However, the outcome of OCR processing
of historical documents is usually inaccurate and requires post-processing
error correction. This study investigates how crowdsourcing can be utilized to
correct OCR errors in historical text collections, and which crowdsourcing
methodology is the most effective in different scenarios and for various
research objectives. A series of experiments with different micro-task
structures and text lengths was conducted with 753 workers on the Amazon
Mechanical Turk platform. The workers had to fix OCR errors in a selected
historical text. To analyze the results, new accuracy and efficiency measures
have been devised. The analysis suggests that in terms of accuracy, the optimal
text length is medium (paragraph-size) and the optimal structure of the
experiment is two-phase with a scanned image. In terms of efficiency, the best
results were obtained when using longer text in the single-stage structure with
no image. The study provides practical recommendations to researchers on how to
build the optimal crowdsourcing task for OCR post-correction. The developed
methodology can also be utilized to create gold-standard historical texts for
automatic OCR post-correction. This is the first attempt to systematically
investigate the influence of various factors on crowdsourcing-based OCR
post-correction and propose an optimal strategy for this process.
Generating a training corpus for OCR post-correction using encoder-decoder model
In this paper we present a novel approach to the automatic correction of OCR-induced orthographic errors in a given text. While current systems depend heavily on large training corpora or external information, such as domain-specific lexicons or confidence scores from the OCR process, our system requires only a small amount of relatively clean training data from a representative corpus to learn a character-based statistical language model using bidirectional Long Short-Term Memory networks (biLSTMs). We demonstrate the versatility and adaptability of our system on different text corpora with varying degrees of textual noise, including a real-life OCR corpus in the medical domain.
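The corpus-construction step named in the title — turning a small amount of clean text into (noisy, clean) training pairs — can be sketched by injecting character-level confusions of the kind OCR typically makes. The confusion table below is an illustrative assumption, not taken from the paper; real systems derive confusions from aligned OCR/ground-truth data:

```python
import random

# Illustrative OCR-style confusions (hand-made for this sketch).
CONFUSIONS = {"e": "c", "l": "1", "o": "0", "m": "rn", "h": "b"}

def add_ocr_noise(text, rate=0.1, seed=None):
    """Corrupt clean text with OCR-like substitutions at the given rate.

    Pairing each corrupted line with its clean original yields
    training examples for a character-based correction model.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < rate:
            out.append(CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)
```

Sweeping the noise rate also gives corpora with "varying degrees of textual noise" on which such a model's robustness can be tested.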