7 research outputs found

    Improving speech recognition and keyword search for low resource languages using web data

    Get PDF
    We describe the use of text data scraped from the web to augment language models for Automatic Speech Recognition and Keyword Search for Low Resource Languages. We scrape text from multiple genres including blogs, online news, translated TED talks, and subtitles. Using linearly interpolated language models, we find that blogs and movie subtitles are more relevant for language modeling of conversational telephone speech and obtain large reductions in out-of-vocabulary keywords. Furthermore, we show that the web data can improve Term Error Rate Performance by 3.8% absolute and Maximum Term-Weighted Value in Keyword Search by 0.0076-0.1059 absolute points. Much of the gain comes from the reduction of out-of-vocabulary items

    Unsupervised learning of Arabic non-concatenative morphology

    Get PDF
    Unsupervised approaches to learning the morphology of a language play an important role in computer processing of language from a practical and theoretical perspective, due their minimal reliance on manually produced linguistic resources and human annotation. Such approaches have been widely researched for the problem of concatenative affixation, but less attention has been paid to the intercalated (non-concatenative) morphology exhibited by Arabic and other Semitic languages. The aim of this research is to learn the root and pattern morphology of Arabic, with accuracy comparable to manually built morphological analysis systems. The approach is kept free from human supervision or manual parameter settings, assuming only that roots and patterns intertwine to form a word. Promising results were obtained by applying a technique adapted from previous work in concatenative morphology learning, which uses machine learning to determine relatedness between words. The output, with probabilistic relatedness values between words, was then used to rank all possible roots and patterns to form a lexicon. Analysis using trilateral roots resulted in correct root identification accuracy of approximately 86% for inflected words. Although the machine learning-based approach is effective, it is conceptually complex. So an alternative, simpler and computationally efficient approach was then devised to obtain morpheme scores based on comparative counts of roots and patterns. In this approach, root and pattern scores are defined in terms of each other in a mutually recursive relationship, converging to an optimized morpheme ranking. This technique gives slightly better accuracy while being conceptually simpler and more efficient. The approach, after further enhancements, was evaluated on a version of the Quranic Arabic Corpus, attaining a final accuracy of approximately 93%. A comparative evaluation shows this to be superior to two existing, well used manually built Arabic stemmers, thus demonstrating the practical feasibility of unsupervised learning of non-concatenative morphology

    Utilisation de la linguistique en reconnaissance de la parole : un Ă©tat de l'art

    Get PDF
    To transcribe speech, automatic speech recognition systems use statistical methods, particularly hidden Markov model and N-gram models. Although these techniques perform well and lead to efficient systems, they approach their maximum possibilities. It seems thus necessary, in order to outperform current results, to use additional information, especially bound to language. However, introducing such knowledge must be realized taking into account specificities of spoken language (hesitations for example) and being robust to possible misrecognized words. This document presents a state of the art of these researches, evaluating the impact of the insertion of linguistic information on the quality of the transcription. ––– Pour transcrire des documents sonores, les systèmes de reconnaissance de la parole font appel à des méthodes statistiques, notamment aux chaînes de Markov cachées et aux modèles N-grammes. Même si ces techniques se sont révélées performantes, elles approchent du maximum de leurs possibilités avec la mise à disposition de corpus de taille suffisante et il semble nécessaire, pour tenter d'aller au-delà des résultats actuels, d'utiliser des informations supplémentaires, en particulier liées au langage. Intégrer de telles connaissances linguistiques doit toutefois se faire en tenant compte des spécificités de l'oral (présence d'hésitations par exemple) et en étant robuste à d'éventuelles erreurs de reconnaissance de certains mots. Ce document présente un état de l'art des recherches de ce type, en évaluant l'impact de l'insertion des informations linguistiques sur la qualité de la transcription

    Modèles de langage ad hoc pour la reconnaissance automatique de la parole

    Get PDF
    Les trois piliers d un système de reconnaissance automatique de la parole sont le lexique,le modèle de langage et le modèle acoustique. Le lexique fournit l ensemble des mots qu il est possible de transcrire, associés à leur prononciation. Le modèle acoustique donne une indication sur la manière dont sont réalisés les unités acoustiques et le modèle de langage apporte la connaissance de la manière dont les mots s enchaînent.Dans les systèmes de reconnaissance automatique de la parole markoviens, les modèles acoustiques et linguistiques sont de nature statistique. Leur estimation nécessite de gros volumes de données sélectionnées, normalisées et annotées.A l heure actuelle, les données disponibles sur le Web constituent de loin le plus gros corpus textuel disponible pour les langues française et anglaise. Ces données peuvent potentiellement servir à la construction du lexique et à l estimation et l adaptation du modèle de langage. Le travail présenté ici consiste à proposer de nouvelles approches permettant de tirer parti de cette ressource.Ce document est organisé en deux parties. La première traite de l utilisation des données présentes sur le Web pour mettre à jour dynamiquement le lexique du moteur de reconnaissance automatique de la parole. L approche proposée consiste à augmenter dynamiquement et localement le lexique du moteur de reconnaissance automatique de la parole lorsque des mots inconnus apparaissent dans le flux de parole. Les nouveaux mots sont extraits du Web grâce à la formulation automatique de requêtes soumises à un moteur de recherche. La phonétisation de ces mots est obtenue grâce à un phonétiseur automatique.La seconde partie présente une nouvelle manière de considérer l information que représente le Web et des éléments de la théorie des possibilités sont utilisés pour la modéliser. Un modèle de langage possibiliste est alors proposé. Il fournit une estimation de la possibilité d une séquence de mots à partir de connaissances relatives à existence de séquences de mots sur le Web. Un modèle probabiliste Web reposant sur le compte de documents fourni par un moteur de recherche Web est également présenté. Plusieurs approches permettant de combiner ces modèles avec des modèles probabilistes classiques estimés sur corpus sont proposées. Les résultats montrent que combiner les modèles probabilistes et possibilistes donne de meilleurs résultats que es modèles probabilistes classiques. De plus, les modèles estimés à partir des données Web donnent de meilleurs résultats que ceux estimés sur corpus.The three pillars of an automatic speech recognition system are the lexicon, the languagemodel and the acoustic model. The lexicon provides all the words that can betranscribed, associated with their pronunciation. The acoustic model provides an indicationof how the phone units are pronounced, and the language model brings theknowledge of how words are linked. In modern automatic speech recognition systems,the acoustic and language models are statistical. Their estimation requires large volumesof data selected, standardized and annotated.At present, the Web is by far the largest textual corpus available for English andFrench languages. The data it holds can potentially be used to build the vocabularyand the estimation and adaptation of language model. The work presented here is topropose new approaches to take advantage of this resource in the context of languagemodeling.The document is organized into two parts. The first deals with the use of the Webdata to dynamically update the lexicon of the automatic speech recognition system.The proposed approach consists on increasing dynamically and locally the lexicon onlywhen unknown words appear in the speech. New words are extracted from the Webthrough the formulation of queries submitted toWeb search engines. The phonetizationof the words is obtained by an automatic grapheme-to-phoneme transcriber.The second part of the document presents a new way of handling the informationcontained on the Web by relying on possibility theory concepts. A Web-based possibilisticlanguage model is proposed. It provides an estition of the possibility of a wordsequence from knowledge of the existence of its sub-sequences on the Web. A probabilisticWeb-based language model is also proposed. It relies on Web document countsto estimate n-gram probabilities. Several approaches for combining these models withclassical models are proposed. The results show that combining probabilistic and possibilisticmodels gives better results than classical probabilistic models alone. In addition,the models estimated from Web data perform better than those estimated on corpus.AVIGNON-Bib. numérique (840079901) / SudocSudocFranceF

    Just-in-Time Language Modelling

    No full text
    Traditional approaches to language modelling have relied on a fixed corpus of text to inform the parameters of a probability distribution over word sequences. Increasing the corpus size often leads to better-performing language models, but no matter how large, the corpus is a static entity, unable to reflect information about events which postdate it. In these pages we introduce an online paradigm which interleaves the estimation and application of a language model. We present a Bayesian approach to online language modelling, in which the marginal probabilities of a static trigram model are dynamically updated to match the topic being dictated to the system. We also describe the architecture of a prototype we have implemented which uses the World Wide Web (WWW) as a source of information, and provide results from some initial proof of concept experiments
    corecore