11 research outputs found

    An improved Levenshtein algorithm for spelling correction word candidate list generation

    Candidate-list generation in spelling correction is the process of finding words in a lexicon that are close to an incorrect word. The most widely used algorithm for generating this list is based on Levenshtein distance. However, the algorithm takes too much time when there are many spelling errors, because computing the Levenshtein distance requires creating an array and filling its cells by comparing the characters of the incorrect word with the characters of a lexicon word. Since most lexicons contain millions of words, these operations are repeated millions of times for each incorrect word to generate its candidate list. This dissertation improves the Levenshtein algorithm with an operational technique that reduces the operations required to compute the cell values in the first, second, and third rows and columns of the Levenshtein array, shortening processing time without affecting accuracy. The improved algorithm was evaluated against the original. Experimental results show that it outperforms the Levenshtein algorithm in processing time by 36.45% while the accuracy of both algorithms remains the same
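    For reference, the baseline being improved is the classic dynamic-programming formulation. A minimal Python sketch (the brute-force lexicon scan and the max_dist threshold are illustrative assumptions, not details from the dissertation):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between strings a and b."""
    m, n = len(a), len(b)
    # dist[i][j] = edits needed to turn a[:i] into b[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i          # i deletions
    for j in range(n + 1):
        dist[0][j] = j          # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]

def candidates(word: str, lexicon: list[str], max_dist: int = 2) -> list[str]:
    # Candidate-list generation: keep lexicon words within a small distance.
    # This scans the whole lexicon, which is exactly the cost the paper targets.
    return [w for w in lexicon if levenshtein(word, w) <= max_dist]
```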

    Ontologies and Bigram-based approach for Isolated Non-word Errors Correction in OCR System

    In this paper, we describe a new approach to the post-processing step of an OCR system. It is based on a spelling-correction method that automatically corrects misspelled words produced by the character-recognition step on scanned documents, combining ontologies with a bigram code to build a robust system that resolves the weaknesses of classical approaches. The approach is a hybrid method in two stages: character recognition using an ontological model, followed by word recognition using a spelling-correction approach based on bigram codification for error detection and correction. Spelling errors fall into two broad categories, non-word errors and real-word errors; this paper addresses only the detection and correction of non-word errors, since these are the only errors produced by an OCR system. In addition, an online external resource such as WordNet proves useful for improving performance
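    The abstract does not spell out the bigram codification, so the following is only one plausible reading: words are compared by their sets of character bigrams, with a Dice similarity used to rank correction candidates for a detected non-word. The function names and the top-k cutoff are illustrative assumptions:

```python
def char_bigrams(word: str) -> set[str]:
    # Character bigrams act as a coarse "code" for the word's shape.
    return {word[i:i + 2] for i in range(len(word) - 1)}

def bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient over character bigrams, in [0, 1]."""
    ba, bb = char_bigrams(a), char_bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

def correct_nonword(token: str, lexicon: set[str], k: int = 5) -> list[str]:
    # Non-word detection: any OCR token absent from the lexicon.
    if token in lexicon:
        return [token]
    # Rank lexicon entries by bigram overlap with the misrecognized token.
    return sorted(lexicon, key=lambda w: bigram_similarity(token, w), reverse=True)[:k]
```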

    Hybrid model of post-processing techniques for Arabic optical character recognition

    Optical character recognition (OCR) extracts the text contained in an image. One of its stages is post-processing, which corrects errors in the OCR output text. The multiple-outputs approach to OCR consists of three processes: differentiation, alignment, and voting. Existing differentiation techniques lose important features because they use N versions of the input images; alignment techniques in the literature are based on approximation, and the voting process is not context-aware. These drawbacks lead to a high OCR error rate. This research proposes improved differentiation, alignment, and voting techniques to overcome them, combined into a hybrid model that recognizes optical characters in Arabic. Each proposed technique was evaluated separately against three relevant existing techniques, using Word Error Rate (WER), Character Error Rate (CER), and Non-word Error Rate (NWER) as performance measures. Experimental results showed a relative decrease in error rate on all measurements for the evaluated techniques. The hybrid model likewise lowered WER, CER, and NWER by 30.35%, 52.42%, and 47.86% respectively compared with the three relevant existing models. This study contributes to the OCR domain: the proposed hybrid model of post-processing techniques facilitates automatic recognition of Arabic text, leading to better information retrieval
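    To make the voting step concrete, here is a minimal context-free majority vote over already-aligned outputs; the thesis's own voting is context-aware, which this sketch deliberately does not reproduce:

```python
from collections import Counter

def vote(aligned_outputs: list[str]) -> str:
    """Majority vote over character-aligned OCR outputs of equal length.

    A plain (context-free) baseline, i.e. the drawback the research addresses.
    """
    assert len({len(s) for s in aligned_outputs}) == 1, "outputs must be aligned"
    result = []
    for chars in zip(*aligned_outputs):
        best, _ = Counter(chars).most_common(1)[0]  # most frequent character
        result.append(best)
    return "".join(result)

# Example: three OCR versions of the same line, already aligned.
print(vote(["recogn1tion", "recognition", "recoqnition"]))  # -> "recognition"
```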

    Implementation of improved Levenshtein algorithm for spelling correction word candidate list generation

    Candidate-list generation in spelling correction is the process of finding words in a lexicon that are close to the incorrect word. The most widely used algorithm for generating the candidate list is the Levenshtein algorithm. However, it carries a high computational cost, especially when there are many spelling errors, because computing the Levenshtein distance requires creating an array and filling its cells by comparing the characters of the incorrect word with the characters of a lexicon word. Since most lexicons contain millions of words, these operations are repeated millions of times for each incorrect word in order to generate its candidate list. This study proposes an improved Levenshtein algorithm that reduces the steps needed to compare characters between the query and lexicon words. Experimental results show that the proposed algorithm outperformed the Levenshtein algorithm in processing time, achieving a 32.43% decrease
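    The abstract gives only the goal (fewer comparison steps), so the sketch below shows two generic cuts in the same spirit, a length-difference lower bound and per-row early termination, rather than the paper's exact technique:

```python
def levenshtein_bounded(a: str, b: str, max_dist: int) -> int | None:
    """Edit distance with two cheap cuts; returns None once past max_dist.

    Illustrative optimizations only, not a reconstruction of the paper's method.
    """
    # Cut 1: the length difference is a lower bound on the distance.
    if abs(len(a) - len(b)) > max_dist:
        return None
    prev = list(range(len(b) + 1))   # first row needs no character comparisons
    for i, ca in enumerate(a, start=1):
        curr = [i]                   # first column likewise
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        # Cut 2: if every cell in this row exceeds max_dist, stop early.
        if min(curr) > max_dist:
            return None
        prev = curr
    return prev[-1] if prev[-1] <= max_dist else None
```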

    Improvement and expansion of a decoder module for an OCR system

    OCR (Optical Character Recognition) systems are becoming increasingly popular with the growing digitization of everything: books, textbooks, magazines, and many other paper documents are being converted into electronic versions that a computer can manipulate. Likewise, instant image-based translation is becoming a reality with the rise of smartphones. Nonetheless, OCR systems are still not perfect: the real world contains a great deal of extra information and noise that current OCR systems struggle to remove completely, in addition to the immense variability of handwritten characters and paper documents. This project improves a decoder module that uses a graph-based algorithm to predict optimal words, and increases its overall accuracy by generating synthetic datasets for testing and applying improvements to the base algorithm
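    As a rough illustration of graph-based word decoding, the sketch below scores paths through a per-position character lattice and keeps the best lexicon word; the project's actual algorithm is not described in enough detail here to reproduce, so everything below is an assumption:

```python
from itertools import product

def decode_best_word(lattice: list[dict[str, float]], lexicon: set[str]) -> str | None:
    """Pick the most probable lexicon word from a per-position character lattice.

    lattice[i] maps candidate characters at position i to OCR confidences.
    Brute force over all paths; a real decoder would search the graph lazily.
    """
    best_word, best_score = None, 0.0
    for chars in product(*(d.items() for d in lattice)):
        word = "".join(c for c, _ in chars)
        if word not in lexicon:
            continue  # only word paths that exist in the lexicon compete
        score = 1.0
        for _, p in chars:
            score *= p
        if score > best_score:
            best_word, best_score = word, score
    return best_word

lattice = [{"c": 0.9}, {"a": 0.6, "o": 0.4}, {"t": 0.7, "l": 0.3}]
print(decode_best_word(lattice, {"cat", "cot", "col"}))  # -> "cat"
```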

    Two bigrams based language model for auto correction of Arabic OCR errors

    In optical character recognition (OCR), the characteristics of Arabic text cause more errors than in English text. In this paper, a language model based on two bigrams, drawing on Wikipedia's database, is presented. The method performs automatic detection and correction of non-word errors in Arabic OCR text, and automatic detection of real-word errors. It consists of two parts: extracting context information from Wikipedia's database, and implementing the automatic detection and correction of incorrect words. The method can be applied to any language with few modifications. Experimental results show successful extraction of context information from Wikipedia articles, and that using the method reduces the error rate of Arabic OCR text
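    The abstract suggests the "two bigrams" are those a word forms with its left and right neighbours; under that assumption, a minimal sketch of extracting bigram counts from a tokenized corpus and scoring candidates in context (function names are illustrative):

```python
from collections import Counter

def bigram_counts(corpus: list[list[str]]) -> Counter:
    """Count word bigrams from tokenized sentences (e.g. Wikipedia article text)."""
    counts = Counter()
    for sent in corpus:
        counts.update(zip(sent, sent[1:]))
    return counts

def score_in_context(word: str, left: str, right: str, counts: Counter) -> int:
    # The "two bigrams": (left, word) and (word, right).
    return counts[(left, word)] + counts[(word, right)]

def correct(left: str, token: str, right: str,
            candidates: list[str], counts: Counter) -> str:
    # Keep whichever candidate the surrounding context supports best.
    return max(candidates, key=lambda w: score_in_context(w, left, right, counts))
```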

    Accessibility-as-a-service an open-source reading assistive tool for education

    As technology evolves, more and more articles and materials become readily available on the internet for the world to use. This project proposes and demonstrates an application that further increases the accessibility of web pages through image-recognition techniques, object detection, and optical character recognition (OCR). Users input a URL, and the application processes the web page in under a minute, outputting a modified page with translated words detected in its images
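    A minimal sketch of such a pipeline using common libraries (requests, BeautifulSoup, Pillow, and pytesseract, the last of which needs a local Tesseract install); the object-detection and translation stages of the actual application are omitted here:

```python
import io
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from PIL import Image
import pytesseract  # wraps a locally installed Tesseract OCR engine

def add_text_from_images(url: str) -> str:
    """Fetch a page, OCR each image, and expose the text via the alt attribute."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    for img in soup.find_all("img", src=True):
        try:
            raw = requests.get(urljoin(url, img["src"]), timeout=30).content
            text = pytesseract.image_to_string(Image.open(io.BytesIO(raw)))
        except Exception:
            continue  # skip images that fail to download or decode
        if text.strip():
            img["alt"] = text.strip()  # make the image's text accessible
    return str(soup)
```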

    Utilizing Big Data in Identification and Correction of OCR Errors

    In this thesis, we report on our experiments in detecting and correcting OCR errors with web data. More specifically, we use Google search to tap the big-data resources available for identifying possible correction candidates. We then use a combination of Longest Common Subsequences (LCS) and Bayesian estimates to automatically pick the proper candidate. Our experimental results on a small set of historical newspaper data show a recall of 51% and a precision of 100%. The thesis further provides a detailed classification and analysis of all errors; in particular, we point out the shortcomings of our approach in suggesting proper candidates for the remaining errors
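    LCS is standard; the exact form of the Bayesian estimate is not given in the abstract, so the sketch below stands in a simple frequency prior multiplied by a normalized LCS score (an assumption, not the thesis's model):

```python
def lcs_len(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def pick(ocr_token: str, candidates: dict[str, float]) -> str:
    """Choose the candidate balancing LCS overlap against a prior probability.

    candidates maps each web-derived suggestion to a prior (e.g. relative
    frequency); the product below is a stand-in for the Bayesian estimate.
    """
    return max(candidates,
               key=lambda w: (lcs_len(ocr_token, w) / max(len(w), 1)) * candidates[w])

print(pick("Bostcn", {"Boston": 0.8, "Bastion": 0.2}))  # -> "Boston"
```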

    Advancements and Challenges in Arabic Optical Character Recognition: A Comprehensive Survey

    Optical character recognition (OCR) extracts handwritten or printed text from scanned or printed images and converts it into a format that machines can understand and process, enabling further activities such as searching and editing. Automatic text extraction through OCR plays a crucial role in digitizing documents, enhancing productivity, improving accessibility, and preserving historical records. This paper offers an exhaustive review of contemporary applications, methodologies, and challenges in Arabic OCR. Prevailing techniques used throughout the OCR process are analyzed, with particular attention to the approaches that demonstrate the best outcomes. To ensure a thorough evaluation, a systematic keyword-search methodology is adopted, covering articles relevant to Arabic OCR together with backward and forward citation reviews. Beyond presenting state-of-the-art techniques and methods, the paper critically identifies research gaps in Arabic OCR, highlighting potential areas for future exploration and development and guiding researchers toward promising avenues in the field. The outcomes of this study provide valuable insights for researchers, practitioners, and stakeholders in Arabic OCR, fostering advances in the field and facilitating the creation of more accurate and efficient OCR systems for the Arabic language