
    Ontologies and Bigram-based approach for Isolated Non-word Errors Correction in OCR System

    In this paper, we describe a new and original approach to the post-processing step of an OCR system. The approach rests on a new spelling-correction method that automatically corrects misspelled words produced by the character-recognition stage on scanned documents, combining ontologies and bigram codes to build a robust system able to resolve the shortcomings of classical approaches. The proposed approach is a hybrid method spread over two stages: first, character recognition using an ontological model; second, word recognition based on a spelling-correction approach that uses bigram codification to detect and correct errors. Spelling errors fall into two broad categories, namely non-word errors and real-word errors. In this paper, we focus only on the detection and correction of non-word errors, because this is the only type of error produced by an OCR system. In addition, an online external resource such as WordNet proves necessary to improve performance.
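    The abstract does not detail its bigram codification, so the following is only a rough sketch of the general idea: detect non-word errors against a lexicon and rank correction candidates by the Dice similarity of their character-bigram sets. The lexicon, padding scheme, and threshold here are illustrative assumptions, not the paper's actual resources.

```python
def bigrams(word):
    w = f"#{word}#"  # boundary padding so first/last characters count
    return {w[i:i + 2] for i in range(len(w) - 1)}

def dice(a, b):
    # Dice coefficient between two bigram sets
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

# toy lexicon; a real OCR post-processor would use a full dictionary
LEXICON = {"recognition", "correction", "character", "document", "system"}
BIGRAM_INDEX = {w: bigrams(w) for w in LEXICON}

def correct(token, threshold=0.4):
    if token in LEXICON:          # real word: nothing to do
        return token
    tb = bigrams(token)           # non-word: find the closest bigram code
    best = max(LEXICON, key=lambda w: dice(tb, BIGRAM_INDEX[w]))
    return best if dice(tb, BIGRAM_INDEX[best]) >= threshold else token
```

    For example, an OCR output like "recogniton" shares almost all of its bigrams with "recognition" and is mapped back to it, while tokens with no close match are left untouched.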

    Detection of semantic errors in Arabic texts

    Detecting semantic errors in a text is still a challenging area of investigation. Much research has been done on lexical and syntactic errors, while fewer studies have tackled semantic errors, as they are more difficult to treat. Compared to other languages, Arabic poses a special challenge for this problem: because words are graphically very similar to one another, the risk of semantic errors in Arabic texts is greater. Moreover, the language has special cases and unique complexities. This paper deals with the detection of semantic errors in Arabic texts, but the approach we have adopted can also be applied to texts in other languages. It combines four contextual methods (using statistics and linguistic information) to decide on the semantic validity of a word in a sentence. We chose to implement our approach on a distributed architecture, namely a Multi-Agent System (MAS). The implemented system achieved a precision of about 90% and a recall of about 83%.
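    The abstract does not spell out its four contextual methods. As a minimal sketch of just one statistical ingredient such a system might use, the snippet below scores a word's semantic fit by how many of its sentence neighbours co-occur with it in a training corpus; the toy corpus and the scoring rule are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations

# toy corpus; a real system would use a large corpus plus linguistic methods
CORPUS = [
    "the cat drinks milk", "the cat eats fish",
    "the dog eats meat", "the dog drinks water",
]

# count how often each unordered word pair appears in the same sentence
cooc = Counter()
for sent in CORPUS:
    for a, b in combinations(sent.split(), 2):
        cooc[frozenset((a, b))] += 1

def context_score(word, sentence):
    # fraction of context words that co-occur with `word` in the corpus;
    # a low score suggests the word may be semantically out of place
    ctx = [w for w in sentence.split() if w != word]
    hits = sum(1 for c in ctx if cooc[frozenset((word, c))] > 0)
    return hits / len(ctx) if ctx else 0.0
```

    In this toy setup, "milk" fits the context "the cat drinks ..." better than "meat" does, which is the kind of signal a voting combination of contextual methods could aggregate.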

    Paronyms for Accelerated Correction of Semantic Errors

    * Work done under partial support of the Mexican Government (CONACyT, SNI), IPN (CGPI, COFAA), and the Korean Government (KIPA Professorship for Visiting Faculty Positions). The second author is currently on sabbatical leave at Chung-Ang University.

    The errors usually made by authors during text preparation are classified. The notion of semantic errors is elaborated, and malapropisms are singled out among them as words "similar" to the intended word but essentially distorting the meaning of the text. For any method of malapropism correction, we propose to compile dictionaries of paronyms beforehand, i.e. of words similar to each other in letters, sounds, or morphs. The proposed classification of errors and paronyms is illustrated with English and Russian examples that are valid for many languages. Specific dictionaries of literal and morphemic paronyms are compiled for Russian. It is shown that literal paronyms drastically cut down (by up to 360 times) the search for correction candidates, while morphemic paronyms make it possible to correct errors not studied so far and characteristic of foreign speakers.
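    One way to precompile a dictionary of literal paronyms, in the spirit described above, is to index every word under all of its single-character deletions: words within a small edit distance then share an index key, so candidate lookup avoids scanning the whole vocabulary. The word list and the deletion-key scheme below are illustrative assumptions; the paper's own compilation method may differ.

```python
from collections import defaultdict

# toy vocabulary of confusable word pairs
WORDS = ["ingenious", "ingenuous", "affect", "effect", "adapt", "adopt"]

def deletions(word):
    # all strings formed by deleting one character, plus the word itself
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}

# precompiled paronym index: each deletion key maps to the words producing it
index = defaultdict(set)
for w in WORDS:
    for key in deletions(w):
        index[key].add(w)

def paronyms(word):
    # nearby words share at least one deletion key with the query,
    # so only a handful of buckets are inspected instead of the full list
    cands = set()
    for key in deletions(word):
        cands |= index.get(key, set())
    return cands - {word}
```

    For instance, "adopt" and "adapt" both produce the key "adpt", so each is retrieved as a paronym of the other without comparing against every vocabulary entry.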

    An Emergent Approach to Text Analysis Based on a Connectionist Model and the Web

    In this paper, we present a method for proactive assistance in text checking, based on usage relationships between words as structured on the Web. For a given sentence, the method builds a connectionist structure of relationships between word n-grams. This structure is then parameterized by means of an unsupervised, language-agnostic optimization process. Finally, the method provides a representation of the sentence that lets the least prominent usage-based relational patterns emerge, making it easy to spot badly written and unpopular text. The study includes the problem statement and its characterization in the literature, as well as the proposed approach and some experimental uses.

    Correcting input noise in SMT as a char-based translation problem

    Misspelled words have a direct impact on the final quality obtained by Statistical Machine Translation (SMT) systems, as the input becomes noisy and unpredictable. This paper presents several improvement strategies for translating real-life noisy input. The proposed strategies are based on a preprocessing step consisting of a character-based translator.

    The Effects of a Corpus on isiZulu Spellcheckers based on N-grams

    Correct spelling contributes to good content accessibility and readability of textual documents. However, there are few spellcheckers for Bantu languages such as isiZulu, the major language in South Africa. The objective of this research is to investigate the development of spellcheckers for isiZulu and, more generally, an approach that can be reused across Bantu languages. To fill this gap in an extensible way, we used data-driven statistical language models with trigrams and quadrigrams. The models were trained on three different isiZulu corpora: Ukwabelana, a selection of the isiZulu National Corpus, and a small corpus of news items. The system performed better with trigrams than with quadrigrams, and performance depended on the training and testing corpora. When the system was trained on old text (the Bible in isiZulu), it did not perform well when tested on the two corpora that contain more recent texts, such as the constitution and news items. The highest accuracy obtained was 89%. Given that data-driven statistical language models constitute a language-independent approach, we conclude that data-driven spellcheckers for all Bantu languages are indeed feasible. They are, however, sensitive to the training and testing data. This approach is less resource-intensive than the manual specification of rules, so the potential impact on realising spellcheckers for Bantu languages is now practically within reach. The potential societal impact of spellchecker-supported tools and apps is incalculable.
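    A character-trigram spellchecker of the kind described above can be sketched as follows: each word is scored by the smoothed probability of its character trigrams under a model trained on a corpus, and words scoring far below the norm are flagged as likely misspellings. The training words, padding scheme, smoothing, and threshold below are illustrative assumptions, not the paper's actual configuration.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    t = f"##{text}#"  # boundary padding for a trigram model
    return [t[i:i + n] for i in range(len(t) - n + 1)]

class TrigramChecker:
    def __init__(self, corpus_words):
        self.counts = Counter()    # trigram counts
        self.context = Counter()   # counts of each two-character context
        for w in corpus_words:
            for g in char_ngrams(w):
                self.counts[g] += 1
                self.context[g[:2]] += 1
        # size of the observed final-character vocabulary, for smoothing
        self.vocab = len({g[2] for g in self.counts}) or 1

    def logprob(self, word):
        lp = 0.0
        for g in char_ngrams(word):
            # add-one smoothing so unseen trigrams get nonzero probability
            p = (self.counts[g] + 1) / (self.context[g[:2]] + self.vocab)
            lp += math.log(p)
        return lp / max(len(word), 1)   # length-normalised score

    def is_error(self, word, threshold=-2.0):
        # flag words whose trigrams are improbable under the trained model
        return self.logprob(word) < threshold
```

    Words made of trigrams the model has seen score much higher than random character strings, which is the statistical signal the corpus-trained spellchecker relies on; the quadrigram variant only changes the n-gram order.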

    Spellcheckers

    Techniques of computer spellchecking from the 1950s to the 2000s