Ontologies and Bigram-based approach for Isolated Non-word Errors Correction in OCR System
In this paper, we describe a new approach to the post-processing step of an OCR system. It is based on a spelling-correction method that automatically corrects misspelled words produced by the character recognition step on scanned documents, combining ontologies with bigram codes to build a robust system that automatically resolves the shortcomings of classical approaches. The proposed hybrid method proceeds in two stages: character recognition using an ontological model, followed by word recognition based on a spelling-correction approach that uses bigram codification to detect and correct errors. Spelling errors fall broadly into two categories, non-word errors and real-word errors. Here we focus only on detecting and correcting non-word errors, since this is the only type of error produced by an OCR system. In addition, an online external resource such as WordNet proves necessary to improve performance.
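The bigram-code matching described above can be sketched as follows. This is a minimal illustration with an assumed toy lexicon and a Dice-coefficient similarity; the paper's ontological model and WordNet integration are omitted.

```python
# Hypothetical sketch of bigram-based candidate matching for non-word errors.
# The lexicon and the Dice-coefficient scoring are illustrative assumptions,
# not the paper's exact bigram codification.

def bigrams(word):
    """Character bigrams of a word, e.g. 'cat' -> {'ca', 'at'}."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def bigram_similarity(a, b):
    """Dice coefficient over character bigrams of two words."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

def correct_non_word(token, lexicon):
    """A token absent from the lexicon is a non-word error; replace it
    with the lexicon entry sharing the most bigrams."""
    if token in lexicon:
        return token
    return max(lexicon, key=lambda w: bigram_similarity(token, w))

lexicon = {"recognition", "character", "document", "system"}
print(correct_non_word("recogniton", lexicon))  # -> recognition
```

Bigram overlap is robust to the single-character substitutions and deletions typical of OCR output, since most of a word's bigrams survive one such error.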
Sentence Level N-Gram Context Feature in Real-Word Spelling Error Detection and Correction: Unsupervised Corpus Based Approach
Spell checking is the process of finding misspelled words and, where possible, correcting them. Most modern commercial spell checkers use a straightforward approach: a word is considered erroneous when it is not found in the dictionary. However, this approach cannot check the correctness of words in context, which is where so-called real-word spelling errors occur. State-of-the-art work addresses this with context features of a fixed n-gram size (i.e. trigrams), which limits the available features and reduces the model's effectiveness. In this paper, we address this issue by adopting a sentence-level n-gram feature for real-word spelling error detection and correction. In this technique, all possible word n-grams are used to teach the proposed model the properties of the target language, which enhances its effectiveness. The only corpus required to train the proposed model is an unsupervised corpus (raw text), which makes the model flexible enough to be adapted to any natural language. For demonstration purposes we adopt the under-resourced languages Amharic, Afaan Oromo, and Tigrigna. The model has been evaluated in terms of recall, precision, and F-measure, and compared with the literature (i.e. the fixed n-gram context feature) to assess whether the technique performs as well. The experimental results indicate that the proposed model with the sentence-level n-gram context feature achieves better results: for real-word error detection and correction it achieves average F-measures of 90.03%, 85.95%, and 84.24% for Amharic, Afaan Oromo, and Tigrigna, respectively. Keywords: sentence-level n-gram, real-word spelling error, spell checker, unsupervised corpus-based spell checker. DOI: 10.7176/JIEA/10-4-02. Publication date: September 30th 202
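The idea of scoring a word by all n-gram orders covering it, trained on raw text only, can be sketched as below. The toy corpus, the confusion set, and the summed-frequency scoring are illustrative assumptions, not the paper's exact model.

```python
# Sketch of real-word error detection with sentence-level (variable-order)
# n-gram context, trained on an unsupervised corpus (raw text).
from collections import Counter

def all_ngrams(tokens, max_n):
    # Every word n-gram of order 2..max_n, not a single fixed order.
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def train(sentences, max_n=5):
    # Unsupervised training: just count n-grams in raw text.
    counts = Counter()
    for sent in sentences:
        counts.update(all_ngrams(sent.split(), max_n))
    return counts

def context_score(tokens, idx, word, counts, max_n=5):
    # Sum corpus frequencies of every n-gram covering position idx
    # when `word` is substituted there.
    t = tokens[:idx] + [word] + tokens[idx + 1:]
    score = 0
    for n in range(2, max_n + 1):
        for i in range(max(0, idx - n + 1), min(idx, len(t) - n) + 1):
            score += counts[tuple(t[i:i + n])]
    return score

corpus = ["the piece of cake was sweet", "a piece of advice"]
counts = train(corpus)
tokens = "the peace of cake".split()
# A real-word error is flagged when a confusion-set alternative has
# stronger sentence-level n-gram support than the written word:
best = max(["peace", "piece"], key=lambda w: context_score(tokens, 1, w, counts))
```

Because every order from bigram upward contributes, long matching contexts dominate the score, which is the advantage claimed over a fixed trigram window.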
Fast and Accurate Spelling Correction Using Trie and Damerau-Levenshtein Distance Bigram
This research set out to create a fast and accurate spelling correction system able to handle both kinds of spelling errors, non-word and real-word. An existing spelling correction system was analyzed and then modified to improve its accuracy and speed. The proposed system was built on the methods and intuitions of the existing system together with those modifications, yielding several spelling correction systems using different methods. The best result is achieved by the system that combines bigrams with a Trie and Damerau-Levenshtein distance, with a word-level accuracy of 84.62% and an average processing speed of 18.89 ms per sentence.
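The Trie-plus-Damerau-Levenshtein combination can be sketched as follows, with an assumed toy dictionary. The trie lets words sharing a prefix reuse distance-matrix rows and lets whole branches be pruned once a row exceeds the bound; the paper's bigram context step is omitted here.

```python
# Sketch of dictionary lookup over a trie, pruned by restricted
# Damerau-Levenshtein (optimal string alignment) distance.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # full word stored at terminal nodes

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root

def search(root, target, max_dist):
    """Return (word, distance) pairs within max_dist of target."""
    n = len(target)
    results = []
    first_row = list(range(n + 1))

    def recurse(node, ch, prev_row, prev_prev_row, prev_ch):
        # One distance-matrix row per trie edge, shared across all words
        # that continue through this node.
        row = [prev_row[0] + 1]
        for j in range(1, n + 1):
            cost = 0 if target[j - 1] == ch else 1
            best = min(row[j - 1] + 1,          # insertion
                       prev_row[j] + 1,         # deletion
                       prev_row[j - 1] + cost)  # substitution
            if (j > 1 and prev_prev_row is not None
                    and target[j - 1] == prev_ch and target[j - 2] == ch):
                best = min(best, prev_prev_row[j - 2] + 1)  # transposition
            row.append(best)
        if node.word is not None and row[n] <= max_dist:
            results.append((node.word, row[n]))
        if min(row) <= max_dist:  # prune branches that can no longer match
            for nxt_ch, nxt in node.children.items():
                recurse(nxt, nxt_ch, row, prev_row, ch)

    for ch, child in root.children.items():
        recurse(child, ch, first_row, None, None)
    return sorted(results, key=lambda r: r[1])

trie = build_trie(["hello", "help", "hold", "world"])
print(search(trie, "hlelo", 1))  # transposed 'le' -> hello at distance 1
```

Pruning on `min(row)` is what makes this fast: the minimum of a row is a lower bound on the distance of every word in the subtree, so branches are cut as soon as that bound exceeds the threshold.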
Essay auto-scoring using N-Gram and Jaro Winkler based Indonesian Typos
Writing errors in e-essay exams reduce scores, so detecting and correcting errors in written answers automatically is necessary. Levenshtein distance combined with N-grams can detect writing errors, but the process takes a long time because of the distance method used. This research therefore hybridizes the Jaro-Winkler and N-gram methods to detect and correct writing errors automatically. The process requires preprocessing and then finds the best word recommendations with the Jaro-Winkler method, which refers to the Kamus Besar Bahasa Indonesia (KBBI); the N-gram method refers to the corpus. Final scoring uses the Vector Space Model (VSM) based on the similarity of words between the answer keys and the respondent's answers. The dataset comprises 115 answers from 23 respondents containing writing errors. The Jaro-Winkler and N-gram methods detect and correct Indonesian words well, with an average detection accuracy of 83.64% (minimum 57.14%, maximum 100.00%), while the error correction accuracy averages 78.44% (minimum 40.00%, maximum 100.00%). However, Natural Language Processing (NLP) techniques are needed to improve these word recommendation results.
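The word-recommendation step can be sketched with a standard Jaro-Winkler implementation. The tiny word list stands in for the KBBI dictionary and is an illustrative assumption; the N-gram and VSM scoring stages are omitted.

```python
# Sketch of Jaro-Winkler word recommendation against a dictionary
# (a toy stand-in for KBBI).

def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    match1 = [False] * len(s1)
    match2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):  # count matches within the window
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0  # count transpositions among matched characters
    for i, flag in enumerate(match1):
        if flag:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    # Boost the Jaro score for a shared prefix of up to 4 characters.
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def recommend(token, dictionary):
    """Pick the dictionary word most similar to a misspelled token."""
    return max(dictionary, key=lambda w: jaro_winkler(token, w))

print(recommend("mkan", ["makan", "minum", "mandi"]))  # -> makan
```

Jaro-Winkler is cheaper than Levenshtein-family distances (no full dynamic-programming matrix) and favors shared prefixes, which suits typo correction where the start of the word is usually typed correctly.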
Correction of Optical Character Recognition Errors Based on a Rating-Rank Model of Text
OCR results for archival documents must be corrected to improve accuracy. We describe a correction algorithm that takes the peculiarities of the Russian language into account and can process large text corpora fully automatically. The correction process is divided into stages: analysis of the entire text corpus, preparation of data structures, selection of candidate words, and their final ranking. Using a rating-rank model of the text to generate corrections makes it possible to handle texts containing specialized terminology from different subject areas.
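The staged pipeline of corpus analysis, candidate selection, and final ranking can be loosely sketched as below. The frequency-plus-similarity scoring and the use of `difflib` are illustrative assumptions, not the paper's rating-rank model, and the Russian-specific handling is omitted.

```python
# Loose sketch of a staged OCR-correction pipeline: corpus frequency
# analysis, candidate selection by string similarity, and a final ranking
# that combines similarity with corpus frequency.
from collections import Counter
import difflib

def build_frequency_table(tokens):
    # Stage 1-2: analyze the whole recognized corpus and prepare a
    # word-frequency structure.
    return Counter(tokens)

def select_candidates(token, vocab, n=5, cutoff=0.7):
    # Stage 3: select candidate words similar to the misrecognized token.
    return difflib.get_close_matches(token, vocab, n=n, cutoff=cutoff)

def final_ranking(token, freq):
    # Stage 4: rank candidates by similarity first, then corpus frequency,
    # so domain terms frequent in this corpus are preferred.
    cands = select_candidates(token, list(freq))
    return sorted(
        cands,
        key=lambda w: (difflib.SequenceMatcher(None, token, w).ratio(), freq[w]),
        reverse=True)

freq = build_frequency_table(
    "document archive document page document archive page page".split())
print(final_ranking("documnet", freq))  # 'document' ranks first
```

Ranking against the corpus's own vocabulary, rather than a fixed dictionary, is what lets such a pipeline recover narrow domain terminology that a general-purpose lexicon would miss.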