
    Ontologies and Bigram-based approach for Isolated Non-word Errors Correction in OCR System

    In this paper, we describe a new and original approach to the post-processing step of an OCR system. The approach rests on a new spelling-correction method that automatically corrects misspelled words produced by the character-recognition stage on scanned documents, combining ontologies and bigram codes to build a robust system able to resolve the shortcomings of classical approaches. The proposed approach is a hybrid method spread over two stages: first, character recognition using an ontological model; second, word recognition based on a spelling-correction approach that uses bigram codification to detect and correct errors. Spelling errors fall into two broad categories, namely non-word errors and real-word errors. In this paper, we focus only on the detection and correction of non-word errors, because this is the only type of error produced by an OCR system. In addition, an online external resource such as WordNet proves necessary to improve performance.
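    The abstract does not detail its bigram codification, so the following is only a rough sketch of the general idea: detect non-word errors against a lexicon and rank correction candidates by the Dice similarity of their character-bigram sets. The lexicon, padding scheme, and threshold here are illustrative assumptions, not the paper's actual resources.

```python
def bigrams(word):
    w = f"#{word}#"  # boundary padding so first/last characters count
    return {w[i:i + 2] for i in range(len(w) - 1)}

def dice(a, b):
    # Dice coefficient between two bigram sets
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

# toy lexicon; a real OCR post-processor would use a full dictionary
LEXICON = {"recognition", "correction", "character", "document", "system"}
BIGRAM_INDEX = {w: bigrams(w) for w in LEXICON}

def correct(token, threshold=0.4):
    if token in LEXICON:          # real word: nothing to do
        return token
    tb = bigrams(token)           # non-word: find the closest bigram code
    best = max(LEXICON, key=lambda w: dice(tb, BIGRAM_INDEX[w]))
    return best if dice(tb, BIGRAM_INDEX[best]) >= threshold else token
```

    For example, an OCR output like "recogniton" shares almost all of its bigrams with "recognition" and is mapped back to it, while tokens with no close match are left untouched.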

    Detection of semantic errors in Arabic texts

    Detecting semantic errors in a text is still a challenging area of investigation. Much research has been done on lexical and syntactic errors, while fewer studies have tackled semantic errors, as they are more difficult to treat. Compared to other languages, Arabic poses a special challenge for this problem: because words are graphically very similar to one another, the risk of semantic errors in Arabic texts is greater. Moreover, the language has special cases and unique complexities. This paper deals with the detection of semantic errors in Arabic texts, but the approach we have adopted can also be applied to texts in other languages. It combines four contextual methods (using statistics and linguistic information) to decide on the semantic validity of a word in a sentence. We chose to implement our approach on a distributed architecture, namely a Multi-Agent System (MAS). The implemented system achieved a precision of about 90% and a recall of about 83%.
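    The abstract does not spell out its four contextual methods. As a minimal sketch of just one statistical ingredient such a system might use, the snippet below scores a word's semantic fit by how many of its sentence neighbours co-occur with it in a training corpus; the toy corpus and the scoring rule are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations

# toy corpus; a real system would use a large corpus plus linguistic methods
CORPUS = [
    "the cat drinks milk", "the cat eats fish",
    "the dog eats meat", "the dog drinks water",
]

# count how often each unordered word pair appears in the same sentence
cooc = Counter()
for sent in CORPUS:
    for a, b in combinations(sent.split(), 2):
        cooc[frozenset((a, b))] += 1

def context_score(word, sentence):
    # fraction of context words that co-occur with `word` in the corpus;
    # a low score suggests the word may be semantically out of place
    ctx = [w for w in sentence.split() if w != word]
    hits = sum(1 for c in ctx if cooc[frozenset((word, c))] > 0)
    return hits / len(ctx) if ctx else 0.0
```

    In this toy setup, "milk" fits the context "the cat drinks ..." better than "meat" does, which is the kind of signal a voting combination of contextual methods could aggregate.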

    Paronyms for Accelerated Correction of Semantic Errors

    * Work done under partial support of the Mexican Government (CONACyT, SNI), IPN (CGPI, COFAA), and the Korean Government (KIPA Professorship for Visiting Faculty Positions). The second author is currently on sabbatical leave at Chung-Ang University.

    The errors usually made by authors during text preparation are classified. The notion of semantic errors is elaborated, and malapropisms are singled out among them as words "similar" to the intended word but essentially distorting the meaning of the text. For any method of malapropism correction, we propose to compile dictionaries of paronyms beforehand, i.e. of words similar to each other in letters, sounds, or morphs. The proposed classification of errors and paronyms is illustrated with English and Russian examples that are valid for many languages. Specific dictionaries of literal and morphemic paronyms are compiled for Russian. It is shown that literal paronyms drastically cut down (by up to 360 times) the search for correction candidates, while morphemic paronyms make it possible to correct errors not studied so far and characteristic of foreign speakers.
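    One way to precompile a dictionary of literal paronyms, in the spirit described above, is to index every word under all of its single-character deletions: words within a small edit distance then share an index key, so candidate lookup avoids scanning the whole vocabulary. The word list and the deletion-key scheme below are illustrative assumptions; the paper's own compilation method may differ.

```python
from collections import defaultdict

# toy vocabulary of confusable word pairs
WORDS = ["ingenious", "ingenuous", "affect", "effect", "adapt", "adopt"]

def deletions(word):
    # all strings formed by deleting one character, plus the word itself
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}

# precompiled paronym index: each deletion key maps to the words producing it
index = defaultdict(set)
for w in WORDS:
    for key in deletions(w):
        index[key].add(w)

def paronyms(word):
    # nearby words share at least one deletion key with the query,
    # so only a handful of buckets are inspected instead of the full list
    cands = set()
    for key in deletions(word):
        cands |= index.get(key, set())
    return cands - {word}
```

    For instance, "adopt" and "adapt" both produce the key "adpt", so each is retrieved as a paronym of the other without comparing against every vocabulary entry.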

    An Emergent Approach to Text Analysis Based on a Connectionist Model and the Web

    In this paper, we present a method for proactive assistance in text checking, based on usage relationships between words as structured on the Web. For a given sentence, the method builds a connectionist structure of relationships between word n-grams. This structure is then parameterized by means of an unsupervised, language-agnostic optimization process. Finally, the method provides a representation of the sentence that lets the least prominent usage-based relational patterns emerge, making it easy to spot badly written and unpopular text. The study includes the problem statement and its characterization in the literature, as well as the proposed approach and some experimental uses.

    Correcting input noise in SMT as a char-based translation problem

    Misspelled words have a direct impact on the final quality obtained by Statistical Machine Translation (SMT) systems, as the input becomes noisy and unpredictable. This paper presents several improvement strategies for translating real-life noisy input. The proposed strategies are based on a preprocessing step consisting of a character-based translator.

    The Effects of a Corpus on isiZulu Spellcheckers based on N-grams

    Correct spelling contributes to good content accessibility and readability of textual documents. However, there are few spellcheckers for Bantu languages such as isiZulu, the major language in South Africa. The objective of this research is to investigate the development of spellcheckers for isiZulu and, more generally, an approach that can be reused across Bantu languages. To fill this gap in an extensible way, we used data-driven statistical language models with trigrams and quadrigrams. The models were trained on three different isiZulu corpora: Ukwabelana, a selection of the isiZulu National Corpus, and a small corpus of news items. The system performed better with trigrams than with quadrigrams, and performance depended on the training and testing corpora. When the system was trained on old text (the Bible in isiZulu), it did not perform well when tested on the two corpora that contain more recent texts, such as the constitution and news items. The highest accuracy obtained was 89%. Given that data-driven statistical language models constitute a language-independent approach, we conclude that data-driven spellcheckers for all Bantu languages are indeed feasible. They are, however, sensitive to the training and testing data. This approach is less resource-intensive than the manual specification of rules, so the potential impact on realising spellcheckers for Bantu languages is now practically within reach. The potential societal impact of spellchecker-supported tools and apps is incalculable.
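    A character-trigram spellchecker of the kind described above can be sketched as follows: each word is scored by the smoothed probability of its character trigrams under a model trained on a corpus, and words scoring far below the norm are flagged as likely misspellings. The training words, padding scheme, smoothing, and threshold below are illustrative assumptions, not the paper's actual configuration.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    t = f"##{text}#"  # boundary padding for a trigram model
    return [t[i:i + n] for i in range(len(t) - n + 1)]

class TrigramChecker:
    def __init__(self, corpus_words):
        self.counts = Counter()    # trigram counts
        self.context = Counter()   # counts of each two-character context
        for w in corpus_words:
            for g in char_ngrams(w):
                self.counts[g] += 1
                self.context[g[:2]] += 1
        # size of the observed final-character vocabulary, for smoothing
        self.vocab = len({g[2] for g in self.counts}) or 1

    def logprob(self, word):
        lp = 0.0
        for g in char_ngrams(word):
            # add-one smoothing so unseen trigrams get nonzero probability
            p = (self.counts[g] + 1) / (self.context[g[:2]] + self.vocab)
            lp += math.log(p)
        return lp / max(len(word), 1)   # length-normalised score

    def is_error(self, word, threshold=-2.0):
        # flag words whose trigrams are improbable under the trained model
        return self.logprob(word) < threshold
```

    Words made of trigrams the model has seen score much higher than random character strings, which is the statistical signal the corpus-trained spellchecker relies on; the quadrigram variant only changes the n-gram order.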

    Spellcheckers

    Techniques of computer spellchecking from the 1950s to the 2000s