
    Two bigrams based language model for auto correction of Arabic OCR errors

    In optical character recognition (OCR), the characteristics of Arabic text cause more errors than in English text. This paper presents a language model based on two bigrams that uses Wikipedia's database. The method performs automatic detection and correction of non-word errors in Arabic OCR text, and automatic detection of real-word errors. It consists of two parts: extracting context information from Wikipedia's database, and implementing the automatic detection and correction of incorrect words. The method can be applied to any language with little modification. The experimental results show successful extraction of context information from Wikipedia's articles and demonstrate that the method reduces the error rate of Arabic OCR text.
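The two-bigram correction idea described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: a toy word list stands in for the context information extracted from Wikipedia, and all names and data are assumptions. For a token outside the vocabulary (a non-word error), edit-distance-1 candidates are scored by the frequencies of their two bigrams, one with the left neighbour and one with the right neighbour.

```python
from collections import Counter

# Toy corpus standing in for context information extracted from Wikipedia
# (illustrative only; the paper works on Arabic text at a much larger scale).
corpus = "the cat sat on the mat the cat ate the rat".split()
vocab = set(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def candidates(word):
    """Generate edit-distance-1 candidates that are valid vocabulary words."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    edits = {l + c + r[1:] for l, r in splits if r for c in letters}  # substitution
    edits |= {l + r[1:] for l, r in splits if r}                      # deletion
    edits |= {l + c + r for l, r in splits for c in letters}          # insertion
    return edits & vocab

def correct(prev, word, nxt):
    """Detect a non-word error and pick the candidate whose two bigrams
    (with the left and right neighbours) are most frequent."""
    if word in vocab:
        return word  # not a non-word error; real-word errors need detection only
    return max(candidates(word),
               key=lambda c: bigrams[(prev, c)] + bigrams[(c, nxt)],
               default=word)

print(correct("the", "caf", "sat"))  # → "cat"
```

Real-word errors (valid words that are wrong in context) can be flagged with the same two-bigram scores, which is why the method detects them without attempting correction.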

    Unsupervised learning of pictures by genetic hybridization of hidden Markov chains

    This paper presents a learning algorithm that combines hidden Markov models (HMMs) and genetic algorithms (GAs). Two standard problems when working with HMMs are determining the probabilities and the number of hidden states of the learned models. Generally, the number of states is determined either by trial and error, which requires experimentation, or from available background knowledge. The presented algorithm uses a GA to determine both the number of states and the probabilities of the learned HMMs at the same time. This hybrid algorithm uses the Baum-Welch algorithm to precisely optimise the probabilities of the HMMs. Several algorithms, hybrid or not, are compared on a face recognition task. The results highlight the strength of this approach for the problem at hand.

    Corpus-consulting probabilistic approach to parsing: the CCPX parser and its complementary components

    Corpus linguistics is now a major field in the study of language. In recent years corpora that are syntactically analysed have become available to researchers, and these clearly have great potential for use in the field of parsing natural language. This thesis describes a project that exploits this possibility. It makes four distinct contributions to these two fields. The first is an updated version of a corpus that is (a) analysed in terms of the rich syntax of Systemic Functional Grammar (SFG), and (b) annotated using the Extensible Markup Language (XML). The second contribution is a native XML corpus database, and the third is a sophisticated corpus query tool for accessing it. The fourth contribution is a new type of parser that is both corpus-consulting and probabilistic. It draws its knowledge of syntactic probabilities from the corpus database, and it stores its working data within the database, so that it is strongly database-oriented. SFG has been widely used in natural language generation for nearly two decades, but it has been used far less frequently in parsing (the first stage in natural language understanding). Previous SFG corpus-based parsers have utilised traditional parsing algorithms, but they have experienced problems of efficiency and coverage, due to (a) the richness of the syntax and (b) the challenge of parsing unrestricted spoken and written texts. The present research overcomes these problems by introducing a new type of parsing algorithm that is 'semi-deterministic' (as human readers are), and utilises its knowledge of the rules—including probabilities—of English syntax. A language, however, is constantly evolving. New words and uses are added, while others become less frequent and drop out altogether. The new parsing system seeks to replicate this. As new sentences are parsed they are added to the corpus, and this slowly changes the frequencies of the words and the syntactic patterns. The corpus is in this sense dynamic, and so simulates a human's changing knowledge of words and syntax.
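The dynamic-corpus idea can be sketched as a probability store that the parser both consults and feeds: rule probabilities are relative frequencies in the corpus, and each newly parsed sentence is added back, shifting those frequencies over time. The class, rule names, and data below are illustrative assumptions, not the thesis's actual SFG categories or database schema.

```python
from collections import Counter

class DynamicCorpus:
    """Toy corpus-consulting probability store: the parser reads rule
    probabilities from it, and every newly parsed sentence is added
    back, slowly changing those probabilities."""

    def __init__(self):
        self.rules = Counter()    # (parent, children) -> count
        self.parents = Counter()  # parent -> total expansions seen

    def add_parse(self, productions):
        """Add one parsed sentence's productions back into the corpus."""
        for parent, children in productions:
            self.rules[(parent, tuple(children))] += 1
            self.parents[parent] += 1

    def prob(self, parent, children):
        """Relative frequency of this expansion among all expansions of parent."""
        if self.parents[parent] == 0:
            return 0.0
        return self.rules[(parent, tuple(children))] / self.parents[parent]

corpus = DynamicCorpus()
corpus.add_parse([("S", ["NP", "VP"]), ("NP", ["Det", "N"])])
corpus.add_parse([("S", ["NP", "VP"]), ("NP", ["Pron"])])
print(corpus.prob("NP", ["Det", "N"]))  # → 0.5
```

Adding a third parse that expands NP a different way would lower this probability again, which is the sense in which the corpus simulates a reader's changing knowledge.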