Ontologies and Bigram-based approach for Isolated Non-word Errors Correction in OCR System
In this paper, we describe a new approach to the post-processing step of an OCR system. It is based on a spelling-correction method that automatically corrects misspelled words produced by the character recognition step on scanned documents, combining ontologies with bigram codes to build a robust system that automatically resolves the shortcomings of classical approaches. The proposed hybrid method proceeds in two stages: character recognition using an ontological model, followed by word recognition based on a spelling-correction approach that uses bigram codification to detect and correct errors. Spelling errors fall broadly into two categories, non-word errors and real-word errors. Here we focus only on detecting and correcting non-word errors, since this is the only type of error produced by an OCR system. In addition, an online external resource such as WordNet proves necessary to improve performance.
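The bigram-code matching described above can be sketched as follows. This is a minimal illustration with an assumed toy lexicon and a Dice-coefficient similarity; the paper's ontological model and WordNet integration are omitted.

```python
# Hypothetical sketch of bigram-based candidate matching for non-word errors.
# The lexicon and the Dice-coefficient scoring are illustrative assumptions,
# not the paper's exact bigram codification.

def bigrams(word):
    """Character bigrams of a word, e.g. 'cat' -> {'ca', 'at'}."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def bigram_similarity(a, b):
    """Dice coefficient over character bigrams of two words."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

def correct_non_word(token, lexicon):
    """A token absent from the lexicon is a non-word error; replace it
    with the lexicon entry sharing the most bigrams."""
    if token in lexicon:
        return token
    return max(lexicon, key=lambda w: bigram_similarity(token, w))

lexicon = {"recognition", "character", "document", "system"}
print(correct_non_word("recogniton", lexicon))  # -> recognition
```

Bigram overlap is robust to the single-character substitutions and deletions typical of OCR output, since most of a word's bigrams survive one such error.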
Sentence Level N-Gram Context Feature in Real-Word Spelling Error Detection and Correction: Unsupervised Corpus Based Approach
Spell checking is the process of finding misspelled words and, where possible, correcting them. Most modern commercial spell checkers use a straightforward approach: a word is considered erroneous when it is not found in the dictionary. However, this approach cannot check the correctness of words in context, which is where so-called real-word spelling errors occur. State-of-the-art work addresses this with context features of a fixed n-gram size (i.e. trigrams), which limits the available features and reduces the model's effectiveness. In this paper, we address this issue by adopting a sentence-level n-gram feature for real-word spelling error detection and correction. In this technique, all possible word n-grams are used to teach the proposed model the properties of the target language, which enhances its effectiveness. The only corpus required to train the proposed model is an unsupervised corpus (raw text), which makes the model flexible enough to be adapted to any natural language. For demonstration purposes we adopt the under-resourced languages Amharic, Afaan Oromo, and Tigrigna. The model has been evaluated in terms of recall, precision, and F-measure, and compared with the literature (i.e. the fixed n-gram context feature) to assess whether the technique performs as well. The experimental results indicate that the proposed model with the sentence-level n-gram context feature achieves better results: for real-word error detection and correction it achieves average F-measures of 90.03%, 85.95%, and 84.24% for Amharic, Afaan Oromo, and Tigrigna, respectively. Keywords: sentence-level n-gram, real-word spelling error, spell checker, unsupervised corpus-based spell checker. DOI: 10.7176/JIEA/10-4-02. Publication date: September 30th 202
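The idea of scoring a word by all n-gram orders covering it, trained on raw text only, can be sketched as below. The toy corpus, the confusion set, and the summed-frequency scoring are illustrative assumptions, not the paper's exact model.

```python
# Sketch of real-word error detection with sentence-level (variable-order)
# n-gram context, trained on an unsupervised corpus (raw text).
from collections import Counter

def all_ngrams(tokens, max_n):
    # Every word n-gram of order 2..max_n, not a single fixed order.
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def train(sentences, max_n=5):
    # Unsupervised training: just count n-grams in raw text.
    counts = Counter()
    for sent in sentences:
        counts.update(all_ngrams(sent.split(), max_n))
    return counts

def context_score(tokens, idx, word, counts, max_n=5):
    # Sum corpus frequencies of every n-gram covering position idx
    # when `word` is substituted there.
    t = tokens[:idx] + [word] + tokens[idx + 1:]
    score = 0
    for n in range(2, max_n + 1):
        for i in range(max(0, idx - n + 1), min(idx, len(t) - n) + 1):
            score += counts[tuple(t[i:i + n])]
    return score

corpus = ["the piece of cake was sweet", "a piece of advice"]
counts = train(corpus)
tokens = "the peace of cake".split()
# A real-word error is flagged when a confusion-set alternative has
# stronger sentence-level n-gram support than the written word:
best = max(["peace", "piece"], key=lambda w: context_score(tokens, 1, w, counts))
```

Because every order from bigram upward contributes, long matching contexts dominate the score, which is the advantage claimed over a fixed trigram window.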
Fast and Accurate Spelling Correction Using Trie and Damerau-Levenshtein Distance Bigram
This research set out to create a fast and accurate spelling correction system able to handle both kinds of spelling errors, non-word and real-word. An existing spelling correction system was analyzed and then modified to improve its accuracy and speed. The proposed system was built on the methods and intuitions of the existing system together with those modifications, yielding several spelling correction systems using different methods. The best result is achieved by the system that combines bigrams with a Trie and Damerau-Levenshtein distance, with a word-level accuracy of 84.62% and an average processing speed of 18.89 ms per sentence.
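The Trie-plus-Damerau-Levenshtein combination can be sketched as follows, with an assumed toy dictionary. The trie lets words sharing a prefix reuse distance-matrix rows and lets whole branches be pruned once a row exceeds the bound; the paper's bigram context step is omitted here.

```python
# Sketch of dictionary lookup over a trie, pruned by restricted
# Damerau-Levenshtein (optimal string alignment) distance.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # full word stored at terminal nodes

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root

def search(root, target, max_dist):
    """Return (word, distance) pairs within max_dist of target."""
    n = len(target)
    results = []
    first_row = list(range(n + 1))

    def recurse(node, ch, prev_row, prev_prev_row, prev_ch):
        # One distance-matrix row per trie edge, shared across all words
        # that continue through this node.
        row = [prev_row[0] + 1]
        for j in range(1, n + 1):
            cost = 0 if target[j - 1] == ch else 1
            best = min(row[j - 1] + 1,          # insertion
                       prev_row[j] + 1,         # deletion
                       prev_row[j - 1] + cost)  # substitution
            if (j > 1 and prev_prev_row is not None
                    and target[j - 1] == prev_ch and target[j - 2] == ch):
                best = min(best, prev_prev_row[j - 2] + 1)  # transposition
            row.append(best)
        if node.word is not None and row[n] <= max_dist:
            results.append((node.word, row[n]))
        if min(row) <= max_dist:  # prune branches that can no longer match
            for nxt_ch, nxt in node.children.items():
                recurse(nxt, nxt_ch, row, prev_row, ch)

    for ch, child in root.children.items():
        recurse(child, ch, first_row, None, None)
    return sorted(results, key=lambda r: r[1])

trie = build_trie(["hello", "help", "hold", "world"])
print(search(trie, "hlelo", 1))  # transposed 'le' -> hello at distance 1
```

Pruning on `min(row)` is what makes this fast: the minimum of a row is a lower bound on the distance of every word in the subtree, so branches are cut as soon as that bound exceeds the threshold.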
Essay auto-scoring using N-Gram and Jaro Winkler based Indonesian Typos
Writing errors in e-essay exams reduce scores, so detecting and correcting errors in written answers automatically is necessary. Levenshtein distance combined with N-grams can detect writing errors, but the process takes a long time because of the distance method used. This research therefore hybridizes the Jaro-Winkler and N-gram methods to detect and correct writing errors automatically. The process requires preprocessing and then finds the best word recommendations with the Jaro-Winkler method, which refers to the Kamus Besar Bahasa Indonesia (KBBI); the N-gram method refers to the corpus. Final scoring uses the Vector Space Model (VSM) based on the similarity of words between the answer keys and the respondent's answers. The dataset comprises 115 answers from 23 respondents containing writing errors. The Jaro-Winkler and N-gram methods detect and correct Indonesian words well, with an average detection accuracy of 83.64% (minimum 57.14%, maximum 100.00%), while the error correction accuracy averages 78.44% (minimum 40.00%, maximum 100.00%). However, Natural Language Processing (NLP) techniques are needed to improve these word recommendation results.
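The word-recommendation step can be sketched with a standard Jaro-Winkler implementation. The tiny word list stands in for the KBBI dictionary and is an illustrative assumption; the N-gram and VSM scoring stages are omitted.

```python
# Sketch of Jaro-Winkler word recommendation against a dictionary
# (a toy stand-in for KBBI).

def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    match1 = [False] * len(s1)
    match2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):  # count matches within the window
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0  # count transpositions among matched characters
    for i, flag in enumerate(match1):
        if flag:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    # Boost the Jaro score for a shared prefix of up to 4 characters.
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def recommend(token, dictionary):
    """Pick the dictionary word most similar to a misspelled token."""
    return max(dictionary, key=lambda w: jaro_winkler(token, w))

print(recommend("mkan", ["makan", "minum", "mandi"]))  # -> makan
```

Jaro-Winkler is cheaper than Levenshtein-family distances (no full dynamic-programming matrix) and favors shared prefixes, which suits typo correction where the start of the word is usually typed correctly.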
Correction of Optical Character Recognition Errors Based on a Rating-Rank Model of Text
OCR results for archival documents must be corrected to improve accuracy. We describe a correction algorithm that takes the peculiarities of the Russian language into account and can process large text corpora fully automatically. The correction process is divided into stages: analysis of the entire text corpus, preparation of data structures, selection of candidate words, and their final ranking. Using a rating-rank model of the text to generate corrections makes it possible to handle texts containing specialized terminology from different subject areas.
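The staged pipeline of corpus analysis, candidate selection, and final ranking can be loosely sketched as below. The frequency-plus-similarity scoring and the use of `difflib` are illustrative assumptions, not the paper's rating-rank model, and the Russian-specific handling is omitted.

```python
# Loose sketch of a staged OCR-correction pipeline: corpus frequency
# analysis, candidate selection by string similarity, and a final ranking
# that combines similarity with corpus frequency.
from collections import Counter
import difflib

def build_frequency_table(tokens):
    # Stage 1-2: analyze the whole recognized corpus and prepare a
    # word-frequency structure.
    return Counter(tokens)

def select_candidates(token, vocab, n=5, cutoff=0.7):
    # Stage 3: select candidate words similar to the misrecognized token.
    return difflib.get_close_matches(token, vocab, n=n, cutoff=cutoff)

def final_ranking(token, freq):
    # Stage 4: rank candidates by similarity first, then corpus frequency,
    # so domain terms frequent in this corpus are preferred.
    cands = select_candidates(token, list(freq))
    return sorted(
        cands,
        key=lambda w: (difflib.SequenceMatcher(None, token, w).ratio(), freq[w]),
        reverse=True)

freq = build_frequency_table(
    "document archive document page document archive page page".split())
print(final_ranking("documnet", freq))  # 'document' ranks first
```

Ranking against the corpus's own vocabulary, rather than a fixed dictionary, is what lets such a pipeline recover narrow domain terminology that a general-purpose lexicon would miss.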