4 research outputs found

    Vers une meilleure modélisation du langage : la prise en compte des séquences dans les modèles statistiques [Towards Better Language Modeling: Taking Sequences into Account in Statistical Models]

    Conference paper with proceedings and peer review. National audience. In natural language we find many key word sequences that reflect the structure of a sentence. These sequences are of variable length and make for natural elocution. To account for them during speech recognition, we treated these sequences as units and added them to the base vocabulary. Language models built on this new vocabulary therefore rely on a history of units, each of which can be either a word or a sequence. In this paper we present an original method for extracting linguistically viable word sequences; the method rests on principles from information theory. We also describe several language models built on these sequences. Evaluation was carried out with a 20,000-word dictionary and a corpus of 43 million words. Using the sequences improved perplexity by about 23% and the error rate of our MAUD speech recognition system by about 20%.
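
    The abstract does not spell out the extraction criterion, but a common information-theoretic starting point is pointwise mutual information over adjacent words. The sketch below is only an illustration of that general idea, not the authors' implementation; the function name, thresholds, and the `_`-joining convention are all assumptions.

```python
import math
from collections import Counter

def extract_sequences(corpus_tokens, min_count=5, pmi_threshold=3.0):
    """Score adjacent word pairs by pointwise mutual information and
    return those frequent and cohesive enough to be treated as single
    lexicon units (thresholds are illustrative, not from the paper)."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    total = len(corpus_tokens)
    candidates = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        # PMI = log [ P(w1, w2) / (P(w1) * P(w2)) ]
        pmi = math.log((count / total) /
                       ((unigrams[w1] / total) * (unigrams[w2] / total)))
        if pmi >= pmi_threshold:
            candidates.append((f"{w1}_{w2}", pmi))  # joined pair becomes one unit
    return sorted(candidates, key=lambda item: -item[1])

# Toy usage: repeated passes let merged units combine into longer sequences.
tokens = ("parce que " * 30 + "le chat dort " * 10).split()
print(extract_sequences(tokens, min_count=5, pmi_threshold=1.0))
```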

    Variable-Length Sequence Language Model for Large Vocabulary Continuous Dictation Machine

    Conference paper with proceedings and peer review. In natural language, some sequences of words are very frequent. A classical language model such as an n-gram does not adequately take such sequences into account, because it underestimates their probabilities. A better approach is to model word sequences as if they were individual dictionary elements. Sequences are treated as additional entries of the word lexicon, on which language models are computed. In this paper, we present two methods for automatically determining frequent phrases in unlabeled corpora of written sentences. These methods are based on information-theoretic criteria that ensure high statistical consistency. Our models reach a local optimum since they minimize perplexity. One procedure uses only the n-gram language model to extract word sequences; the second is based on a class n-gram model trained on 233 classes derived from the eight grammatical classes of French. Experimental tests, in terms of perplexity and recognition rate, are carried out on a vocabulary of 20,000 words and a corpus of 43 million words extracted from the "Le Monde" newspaper. Our models reduce perplexity by more than 20% compared with n-gram (n ≥ 3) and multigram models, and they outperform both in terms of recognition rate.
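
    The abstract states that sequences are accepted only while they reduce perplexity. A minimal way to make that criterion concrete, assuming an add-one-smoothed bigram model in place of whichever estimator the authors actually used, is sketched below; all names and the smoothing choice are assumptions.

```python
import math
from collections import Counter

def perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one-smoothed bigram model (a simple
    stand-in for whichever estimator the authors used)."""
    vocab_size = len(set(train_tokens) | set(test_tokens))
    unigrams = Counter(train_tokens)
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    log_prob, n = 0.0, 0
    for w1, w2 in zip(test_tokens, test_tokens[1:]):
        log_prob += math.log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size))
        n += 1
    return math.exp(-log_prob / n)

def merge(tokens, pair, joined):
    """Rewrite a token stream so each occurrence of `pair` is one unit."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(joined)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def keep_sequence(train, heldout, pair):
    """Accept a candidate sequence only if merging it lowers held-out
    perplexity, mirroring the abstract's selection criterion."""
    joined = "_".join(pair)
    before = perplexity(train, heldout)
    after = perplexity(merge(train, pair, joined), merge(heldout, pair, joined))
    return after < before
```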

    Beyond the Conventional Statistical Language Models: The Variable-Length Sequences Approach

    Conference paper with proceedings and peer review. International audience. In natural language, several sequences of words are very frequent. A classical language model such as an n-gram does not adequately take such sequences into account, because it underestimates their probabilities. A better approach is to model word sequences as if they were individual dictionary elements. Sequences are treated as additional entries of the word lexicon, on which language models are computed. In this paper, we present an original method for automatically determining the most important phrases in corpora. The method is based on information-theoretic criteria, which ensure high statistical consistency, and on French grammatical classes, which capture additional types of linguistic dependency. In addition, perplexity is used to make the decision to select a potential sequence more accurate. We also propose several variants of language models with and without word sequences. Among them, we present a model in which the trigger pairs are linguistically more significant. The originality of this model, compared with commonly used trigger approaches, is that it uses word sequences to estimate the trigger pairs rather than limiting itself to single words. Experimental tests, in terms of perplexity and recognition rate, are carried out on a vocabulary of 20,000 words and a corpus of 43 million words. The word sequences proposed by our algorithm reduce perplexity by more than 16% compared with models limited to single words, and introducing them into our dictation machine improves accuracy by approximately 15%.
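
    The trigger-pair idea here generalizes from words to units (single words or extracted sequences). As a hedged illustration, not the paper's estimator, one can rank candidate trigger pairs by the pointwise mutual information of their co-occurrence within a window of recent units; the window size, count cutoff, and normalization below are all assumptions.

```python
import math
from collections import Counter

def trigger_pairs(units, window=10, min_cooc=3, top_k=20):
    """Rank (a, b) trigger pairs by the pointwise mutual information of
    their co-occurrence within `window` units; `units` may mix single
    words and multi-word sequences. Parameters are illustrative."""
    unit_counts = Counter(units)
    cooc = Counter()
    for i, a in enumerate(units):
        for b in units[i + 1 : i + 1 + window]:
            cooc[(a, b)] += 1
    total = len(units)
    scored = []
    for (a, b), c in cooc.items():
        if c < min_cooc:
            continue
        # Approximate PMI; a proper estimate would normalize by the
        # number of co-occurrence windows rather than by `total`.
        pmi = math.log(c * total / (unit_counts[a] * unit_counts[b]))
        scored.append(((a, b), pmi))
    return sorted(scored, key=lambda item: -item[1])[:top_k]
```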

    An Hybrid Language Model for a Continuous Dictation Prototype

    International audience. This paper describes the combination of a stochastic language model and a formal grammar modeled as a unification grammar. The stochastic model is trained on 42 million words extracted from the Le Monde newspaper and is based on smoothed 3-gram and 3-class models; the 3-class model is represented by a four-state Markov chain. Several experiments were run to determine which values work best for specific training and test corpora. The experiments indicate that the unification grammar sharply reduces the number of hypotheses (sentences) produced by the stochastic model.
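
    The paper does not spell out its unification formalism, but the filtering effect can be illustrated with a toy feature-unification check: the stochastic model proposes scored hypotheses, and any hypothesis whose words carry conflicting agreement features is discarded. Everything below (the `lexicon` dictionary, the feature names `gen` and `num`, accumulating features across the whole hypothesis) is a hypothetical simplification.

```python
def unify(features_a, features_b):
    """Unify two feature dictionaries; return None on a value clash."""
    merged = dict(features_a)
    for key, value in features_b.items():
        if key in merged and merged[key] != value:
            return None
        merged[key] = value
    return merged

def filter_hypotheses(hypotheses, lexicon):
    """Discard hypotheses whose words carry conflicting agreement
    features (a toy stand-in for a full unification grammar)."""
    kept = []
    for words, score in hypotheses:
        features = {}
        for word in words:
            features = unify(features, lexicon.get(word, {}))
            if features is None:
                break
        if features is not None:
            kept.append((words, score))
    return kept

# Hypothetical lexicon: "le maison" clashes on gender, "la maison" unifies.
lexicon = {
    "le": {"gen": "m", "num": "sg"},
    "la": {"gen": "f", "num": "sg"},
    "maison": {"gen": "f", "num": "sg"},
}
hypotheses = [(["la", "maison"], -4.2), (["le", "maison"], -4.0)]
print(filter_hypotheses(hypotheses, lexicon))  # only "la maison" survives
```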