
    Ontologies and Bigram-based approach for Isolated Non-word Errors Correction in OCR System

    In this paper, we describe a new approach to the post-processing step of an OCR system. The approach is based on a new spelling-correction method that automatically corrects misspelled words produced by the character-recognition stage applied to scanned documents, combining ontologies and bigram codes to build a robust system that automatically resolves the shortcomings of classical approaches. The proposed approach is a hybrid method spread over two stages: the first is character recognition using an ontological model, and the second is word recognition based on a spelling-correction approach that uses bigram codification to detect and correct errors. Spelling errors fall broadly into two categories, non-word errors and real-word errors. In this paper, we are interested only in the detection and correction of non-word errors, since this is the only type of error produced by an OCR system. In addition, an online external resource such as WordNet proves necessary to improve performance.
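    The abstract gives no code; purely as a rough illustration of the non-word error handling it describes, the sketch below flags out-of-vocabulary tokens and ranks dictionary candidates by character-bigram overlap (Dice coefficient). The lexicon, the bigram_code helper and the scoring scheme are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of non-word error detection with bigram-based candidate
# ranking; the lexicon and the Dice scoring are illustrative assumptions.

def bigram_code(word):
    """Set of character bigrams, e.g. 'word' -> {'wo', 'or', 'rd'}."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def correct_non_word(token, lexicon):
    """Return the token unchanged if it is in the lexicon (no non-word error),
    otherwise the lexicon entry whose bigram set overlaps most."""
    if token in lexicon:
        return token
    tb = bigram_code(token)
    def dice(candidate):
        cb = bigram_code(candidate)
        return 2 * len(tb & cb) / (len(tb) + len(cb) or 1)
    return max(lexicon, key=dice)

lexicon = {"recognition", "character", "document", "system"}
print(correct_non_word("recognltion", lexicon))  # -> 'recognition'
```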

    Sentence Level N-Gram Context Feature in Real-Word Spelling Error Detection and Correction: Unsupervised Corpus Based Approach

    Spell checking is the process of finding misspelled words and possibly correcting them. Most modern commercial spell checkers use a straightforward approach to finding misspellings, in which a word is considered erroneous when it is not found in the dictionary. However, this approach cannot check the correctness of words in their context, which gives rise to real-word spelling errors. To address this, state-of-the-art work uses context features at a fixed n-gram size (i.e. trigram), which limits the available features and reduces the effectiveness of the model. In this paper, we address this issue by adopting sentence-level n-gram features for real-word spelling error detection and correction. In this technique, all possible word n-grams are used to teach the proposed model the properties of the target language, which enhances its effectiveness. The only corpus required to train the proposed model is an unsupervised corpus (raw text), which makes the model flexible enough to be adapted to any natural language. For demonstration purposes we adopt under-resourced languages, namely Amharic, Afaan Oromo and Tigrigna. The model has been evaluated in terms of recall, precision and F-measure, and a comparison with the literature (i.e. fixed n-gram context features) was made to assess whether the technique performs as well. The experimental results indicate that the proposed model with sentence-level n-gram context features achieves better results: for real-word error detection and correction it achieves an average F-measure of 90.03%, 85.95%, and 84.24% for Amharic, Afaan Oromo and Tigrigna respectively. Keywords: sentence level n-gram, real-word spelling error, spell checker, unsupervised corpus based spell checker. DOI: 10.7176/JIEA/10-4-02. Publication date: September 30th 202
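    As a rough, hypothetical sketch of the kind of context check the abstract describes (not the authors' code), the snippet below counts word n-grams over raw text and flags words none of whose surrounding n-grams were ever observed; the n-gram sizes and the zero-count threshold are illustrative assumptions.

```python
# Illustrative real-word error detection using word n-gram counts learned
# from a raw (unsupervised) corpus; the cutoff is an assumption.
from collections import Counter

def count_ngrams(sentences, n_max=3):
    counts = Counter()
    for sent in sentences:
        words = sent.split()
        for n in range(1, n_max + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

def flag_real_word_errors(sentence, counts, n_max=3):
    """Flag words none of whose surrounding n-grams (n >= 2) occur in the corpus."""
    words = sentence.split()
    flagged = []
    for i, w in enumerate(words):
        seen = False
        for n in range(2, n_max + 1):
            for start in range(max(0, i - n + 1), min(i + 1, len(words) - n + 1)):
                if counts[tuple(words[start:start + n])] > 0:
                    seen = True
        if not seen:
            flagged.append(w)
    return flagged

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
counts = count_ngrams(corpus)
print(flag_real_word_errors("the cat sat in the mat", counts))  # -> ['in']
```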

    Spelling Correction for Estonian Learner Language


    Fast and Accurate Spelling Correction Using Trie and Damerau-Levenshtein Distance Bigram

    This research was intended to create a fast and accurate spelling correction system able to handle both kinds of spelling errors, non-word and real-word errors. An existing spelling correction system was analysed and modifications were then applied to improve its accuracy and speed. The proposed spelling correction system was then built on the method and intuition of the existing system, together with the modifications made in the previous step. The result is a set of spelling correction systems using different methods. The best result is achieved by the system that uses a bigram with a Trie and Damerau-Levenshtein distance, with a word-level accuracy of 84.62% and an average processing speed of 18.89 ms per sentence.
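    For reference, the following is a standard Damerau-Levenshtein (optimal string alignment) distance, the metric named in the title; it is only a minimal sketch of that metric and does not reproduce the paper's trie-based candidate search or bigram component.

```python
# Minimal Damerau-Levenshtein (optimal string alignment) distance;
# a sketch of the metric named in the title, not the paper's full system.
def damerau_levenshtein(a, b):
    """Edit distance counting insertions, deletions, substitutions and
    transpositions of adjacent characters."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("teh", "the"))           # -> 1 (transposition)
print(damerau_levenshtein("speling", "spelling"))  # -> 1 (insertion)
```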

    Essay auto-scoring using N-Gram and Jaro Winkler based Indonesian Typos

    Writing errors in e-essay exams reduce scores, so it is necessary to detect and correct errors in written answers automatically. Levenshtein distance and N-grams can detect writing errors, but the process takes a long time because of the distance method used. This research therefore aims to combine the Jaro-Winkler and N-Gram methods to detect and correct writing errors automatically. The process requires preprocessing and finding the best word recommendations with the Jaro-Winkler method, which refers to the Kamus Besar Bahasa Indonesia (KBBI); the N-Gram method refers to the corpus. Final scoring uses the Vector Space Model (VSM) method, based on the similarity of words between the answer keys and the respondents' answers. The dataset consists of 115 answers from 23 respondents containing some writing errors. The Jaro-Winkler and N-Gram methods detect and correct Indonesian words well, with an average detection accuracy of 83.64% (minimum 57.14%, maximum 100.00%), while the error-correction accuracy averages 78.44% (minimum 40.00%, maximum 100.00%). However, further Natural Language Processing (NLP) work is needed to improve these results for word recommendation.
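    For context, this is a compact Jaro-Winkler similarity, the string measure the paper uses to rank dictionary (KBBI) candidates. It is a generic sketch of the standard formulation, not the authors' scoring pipeline, and the prefix scaling factor of 0.1 is the usual default rather than a value taken from the paper.

```python
# Generic Jaro-Winkler similarity sketch (standard formulation).
def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(i + window + 1, len(s2))):
            if not m2[j] and s2[j] == c:
                m1[i], m2[j] = True, True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len(s1)):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Boost the Jaro score for strings sharing a common prefix."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("mahasiswa", "mahasiwa"), 3))  # -> 0.978
```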

    REAL-WORD ERROR DETECTION AND CORRECTION IN ARABIC TEXT


    Correction of Optical Recognition Errors Based on a Rating-Rank Text Model

    OCR results for archival documents have to be corrected in order to improve their accuracy. The paper describes a correction algorithm that takes into account peculiarities of the Russian language and can handle large text corpora in a fully automatic mode. The correction process is divided into stages: analysis of the entire text corpus, preparation of data structures, selection of candidate words, and their final ranking. Using a rating-rank text model to generate corrections makes it possible to handle texts containing highly specialised terminology from different subject areas.
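    The abstract gives no formulas; purely as an illustration of the candidate-selection-then-ranking pipeline it outlines, the sketch below proposes corrections for an out-of-vocabulary OCR token by filtering corpus words above a similarity cutoff and ranking them by similarity and corpus frequency. The cutoff and the ranking are assumptions and do not reproduce the paper's rating-rank model.

```python
# Hypothetical candidate selection and ranking for a misrecognised OCR token;
# the similarity cutoff and frequency-based ranking are illustrative
# assumptions, not the paper's rating-rank model.
from collections import Counter
from difflib import SequenceMatcher

def rank_candidates(token, corpus_counts, max_candidates=5):
    """Return corpus words similar to the token, most similar and most
    frequent first."""
    scored = []
    for word, freq in corpus_counts.items():
        sim = SequenceMatcher(None, token, word).ratio()
        if sim >= 0.75:                      # assumed similarity cutoff
            scored.append((sim, freq, word))
    scored.sort(reverse=True)
    return [w for _, _, w in scored[:max_candidates]]

corpus_counts = Counter({"Π°Ρ€Ρ…ΠΈΠ²": 120, "Π°Ρ€Ρ…ΠΈΠ²Π½Ρ‹ΠΉ": 45, "Π°Ρ€Ρ…ΠΈΡ‚Π΅ΠΊΡ‚ΡƒΡ€Π°": 30})
print(rank_candidates("Π°Ρ€Ρ…нв", corpus_counts))  # -> ['Π°Ρ€Ρ…ΠΈΠ²']
```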