1,188 research outputs found

    Improving Searchability of Automatically Transcribed Lectures Through Dynamic Language Modelling

    Get PDF
    Recording university lectures through lecture capture systems is increasingly common. However, a single continuous audio recording is often unhelpful for users, who may wish to navigate quickly to a particular part of a lecture, or locate a specific lecture within a set of recordings. A transcript of the recording can enable faster navigation and searching. Automatic speech recognition (ASR) technologies may be used to create automated transcripts, to avoid the significant time and cost involved in manual transcription. Low accuracy of ASR-generated transcripts may however limit their usefulness. In particular, ASR systems optimized for general speech recognition may not recognize the many technical or discipline-specific words occurring in university lectures. To improve the usefulness of ASR transcripts for the purposes of information retrieval (search) and navigating within recordings, the lexicon and language model used by the ASR engine may be dynamically adapted for the topic of each lecture. A prototype is presented which uses the English Wikipedia as a semantically dense, large language corpus to generate a custom lexicon and language model for each lecture from a small set of keywords. Two strategies for extracting a topic-specific subset of Wikipedia articles are investigated: a naïve crawler which follows all article links from a set of seed articles produced by a Wikipedia search from the initial keywords, and a refinement which follows only links to articles sufficiently similar to the parent article. Pair-wise article similarity is computed from a pre-computed vector space model of Wikipedia article term scores generated using latent semantic indexing. The CMU Sphinx4 ASR engine is used to generate transcripts from thirteen recorded lectures from Open Yale Courses, using the English HUB4 language model as a reference and the two topic-specific language models generated for each lecture from Wikipedia

    Improving searchability of automatically transcribed lectures through dynamic language modelling

    Get PDF
    Recording university lectures through lecture capture systems is increasingly common. However, a single continuous audio recording is often unhelpful for users, who may wish to navigate quickly to a particular part of a lecture, or locate a specific lecture within a set of recordings. A transcript of the recording can enable faster navigation and searching. Automatic speech recognition (ASR) technologies may be used to create automated transcripts, to avoid the significant time and cost involved in manual transcription. Low accuracy of ASR-generated transcripts may however limit their usefulness. In particular, ASR systems optimized for general speech recognition may not recognize the many technical or discipline-specific words occurring in university lectures. To improve the usefulness of ASR transcripts for the purposes of information retrieval (search) and navigating within recordings, the lexicon and language model used by the ASR engine may be dynamically adapted for the topic of each lecture. A prototype is presented which uses the English Wikipedia as a semantically dense, large language corpus to generate a custom lexicon and language model for each lecture from a small set of keywords. Two strategies for extracting a topic-specific subset of Wikipedia articles are investigated: a naïve crawler which follows all article links from a set of seed articles produced by a Wikipedia search from the initial keywords, and a refinement which follows only links to articles sufficiently similar to the parent article. Pair-wise article similarity is computed from a pre-computed vector space model of Wikipedia article term scores generated using latent semantic indexing. The CMU Sphinx4 ASR engine is used to generate transcripts from thirteen recorded lectures from Open Yale Courses, using the English HUB4 language model as a reference and the two topic-specific language models generated for each lecture from Wikipedia. Three standard metrics – Perplexity, Word Error Rate and Word Correct Rate – are used to evaluate the extent to which the adapted language models improve the searchability of the resulting transcripts, and in particular improve the recognition of specialist words. Ranked Word Correct Rate is proposed as a new metric better aligned with the goals of improving transcript searchability and specialist word recognition. Analysis of recognition performance shows that the language models derived using the similarity-based Wikipedia crawler outperform models created using the naïve crawler, and that transcripts using similarity-based language models have better perplexity and Ranked Word Correct Rate scores than those created using the HUB4 language model, but worse Word Error Rates. It is concluded that English Wikipedia may successfully be used as a language resource for unsupervised topic adaptation of language models to improve recognition performance for better searchability of lecture recording transcripts, although possibly at the expense of other attributes such as readability

    Language Modeling for limited-data domains

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009.This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.Cataloged from student submitted PDF version of thesis.Includes bibliographical references (p. 99-109).With the increasing focus of speech recognition and natural language processing applications on domains with limited amount of in-domain training data, enhanced system performance often relies on approaches involving model adaptation and combination. In such domains, language models are often constructed by interpolating component models trained from partially matched corpora. Instead of simple linear interpolation, we introduce a generalized linear interpolation technique that computes context-dependent mixture weights from features that correlate with the component confidence and relevance for each n-gram context. Since the n-grams from partially matched corpora may not be of equal relevance to the target domain, we propose an n-gram weighting scheme to adjust the component n-gram probabilities based on features derived from readily available corpus segmentation and metadata to de-emphasize out-of-domain n-grams. In scenarios without any matched data for a development set, we examine unsupervised and active learning techniques for tuning the interpolation and weighting parameters. Results on a lecture transcription task using the proposed generalized linear interpolation and n-gram weighting techniques yield up to a 1.4% absolute word error rate reduction over a linearly interpolated baseline language model. As more sophisticated models are only as useful as they are practical, we developed the MIT Language Modeling (MITLM) toolkit, designed for efficient iterative parameter optimization, and released it to the research community.(cont.) With a compact vector-based n-gram data structure and optimized algorithm implementations, the toolkit not only improves the running time of common tasks by up to 40x, but also enables the efficient parameter tuning for language modeling techniques that were previously deemed impractical.by Bo-June (Paul) Hsu.Ph.D

    Transformer Models for Machine Translation and Streaming Automatic Speech Recognition

    Full text link
    [ES] El procesamiento del lenguaje natural (NLP) es un conjunto de problemas computacionales con aplicaciones de máxima relevancia, que junto con otras tecnologías informáticas se ha beneficiado de la revolución que ha significado el aprendizaje profundo. Esta tesis se centra en dos problemas fundamentales para el NLP: la traducción automática (MT) y el reconocimiento automático del habla o transcripción automática (ASR); así como en una arquitectura neuronal profunda, el Transformer, que pondremos en práctica para mejorar las soluciones de MT y ASR en algunas de sus aplicaciones. El ASR y MT pueden servir para obtener textos multilingües de alta calidad a un coste razonable para una diversidad de contenidos audiovisuales. Concre- tamente, esta tesis aborda problemas como el de traducción de noticias o el de subtitulación automática de televisión. El ASR y MT también se pueden com- binar entre sí, generando automáticamente subtítulos traducidos, o con otras soluciones de NLP: resumen de textos para producir resúmenes de discursos, o síntesis del habla para crear doblajes automáticos. Estas aplicaciones quedan fuera del alcance de esta tesis pero pueden aprovechar las contribuciones que contiene, en la meduda que ayudan a mejorar el rendimiento de los sistemas automáticos de los que dependen. Esta tesis contiene una aplicación de la arquitectura Transformer al MT tal y como fue concebida, mediante la que obtenemos resultados de primer nivel en traducción de lenguas semejantes. En capítulos subsecuentes, esta tesis aborda la adaptación del Transformer como modelo de lenguaje para sistemas híbri- dos de ASR en vivo. Posteriormente, describe la aplicación de este tipus de sistemas al caso de uso de subtitulación de televisión, participando en una com- petición pública de RTVE donde obtenemos la primera posición con un marge importante. También demostramos que la mejora se debe principalmenta a la tecnología desarrollada y no tanto a la parte de los datos.[CA] El processament del llenguage natural (NLP) és un conjunt de problemes com- putacionals amb aplicacions de màxima rellevància, que juntament amb al- tres tecnologies informàtiques s'ha beneficiat de la revolució que ha significat l'impacte de l'aprenentatge profund. Aquesta tesi se centra en dos problemes fonamentals per al NLP: la traducció automàtica (MT) i el reconeixement automàtic de la parla o transcripció automàtica (ASR); així com en una ar- quitectura neuronal profunda, el Transformer, que posarem en pràctica per a millorar les solucions de MT i ASR en algunes de les seues aplicacions. l'ASR i MT poden servir per obtindre textos multilingües d'alta qualitat a un cost raonable per a un gran ventall de continguts audiovisuals. Concretament, aquesta tesi aborda problemes com el de traducció de notícies o el de subtitu- lació automàtica de televisió. l'ASR i MT també es poden combinar entre ells, generant automàticament subtítols traduïts, o amb altres solucions de NLP: amb resum de textos per produir resums de discursos, o amb síntesi de la parla per crear doblatges automàtics. Aquestes altres aplicacions es troben fora de l'abast d'aquesta tesi però poden aprofitar les contribucions que conté, en la mesura que ajuden a millorar els resultats dels sistemes automàtics dels quals depenen. Aquesta tesi conté una aplicació de l'arquitectura Transformer al MT tal com va ser concebuda, mitjançant la qual obtenim resultats de primer nivell en traducció de llengües semblants. En capítols subseqüents, aquesta tesi aborda l'adaptació del Transformer com a model de llenguatge per a sistemes híbrids d'ASR en viu. Posteriorment, descriu l'aplicació d'aquest tipus de sistemes al cas d'ús de subtitulació de continguts televisius, participant en una competició pública de RTVE on obtenim la primera posició amb un marge significant. També demostrem que la millora es deu principalment a la tecnologia desen- volupada i no tant a la part de les dades[EN] Natural language processing (NLP) is a set of fundamental computing prob- lems with immense applicability, as language is the natural communication vehicle for people. NLP, along with many other computer technologies, has been revolutionized in recent years by the impact of deep learning. This thesis is centered around two keystone problems for NLP: machine translation (MT) and automatic speech recognition (ASR); and a common deep neural architec- ture, the Transformer, that is leveraged to improve the technical solutions for some MT and ASR applications. ASR and MT can be utilized to produce cost-effective, high-quality multilin- gual texts for a wide array of media. Particular applications pursued in this thesis are that of news translation or that of automatic live captioning of tele- vision broadcasts. ASR and MT can also be combined with each other, for instance generating automatic translated subtitles from audio, or augmented with other NLP solutions: text summarization to produce a summary of a speech, or speech synthesis to create an automatic translated dubbing, for in- stance. These other applications fall out of the scope of this thesis, but can profit from the contributions that it contains, as they help to improve the performance of the automatic systems on which they depend. This thesis contains an application of the Transformer architecture to MT as it was originally conceived, achieving state-of-the-art results in similar language translation. In successive chapters, this thesis covers the adaptation of the Transformer as a language model for streaming hybrid ASR systems. After- wards, it describes how we applied the developed technology for a specific use case in television captioning by participating in a competitive challenge and achieving the first position by a large margin. We also show that the gains came mostly from the improvement in technology capabilities over two years including that of the Transformer language model adapted for streaming, and the data component was minor.Baquero Arnal, P. (2023). Transformer Models for Machine Translation and Streaming Automatic Speech Recognition [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/19368

    Modeling the user state for context-aware spoken interaction in ambient assisted living

    Get PDF
    Ambient Assisted Living (AAL) systems must provide adapted services easily accessible by a wide variety of users. This can only be possible if the communication between the user and the system is carried out through an interface that is simple, rapid, effective, and robust. Natural language interfaces such as dialog systems fulfill these requisites, as they are based on a spoken conversation that resembles human communication. In this paper, we enhance systems interacting in AAL domains by means of incorporating context-aware conversational agents that consider the external context of the interaction and predict the user's state. The user's state is built on the basis of their emotional state and intention, and it is recognized by means of a module conceived as an intermediate phase between natural language understanding and dialog management in the architecture of the conversational agent. This prediction, carried out for each user turn in the dialog, makes it possible to adapt the system dynamically to the user's needs. We have evaluated our proposal developing a context-aware system adapted to patients suffering from chronic pulmonary diseases, and provide a detailed discussion of the positive influence of our proposal in the success of the interaction, the information and services provided, as well as the perceived quality.This work was supported in part by Projects MINECO TEC2012-37832-C02-01, CICYT TEC2011-28626-C02- 02, CAM CONTEXTS (S2009/TIC-1485

    A practical guide to conversation research: how to study what people say to each other

    Get PDF
    Conversation—a verbal interaction between two or more people—is a complex, pervasive, and consequential human behavior. Conversations have been studied across many academic disciplines. However, advances in recording and analysis techniques over the last decade have allowed researchers to more directly and precisely examine conversations in natural contexts and at a larger scale than ever before, and these advances open new paths to understand humanity and the social world. Existing reviews of text analysis and conversation research have focused on text generated by a single author (e.g., product reviews, news articles, and public speeches) and thus leave open questions about the unique challenges presented by interactive conversation data (i.e., dialogue). In this article, we suggest approaches to overcome common challenges in the workflow of conversation science, including recording and transcribing conversations, structuring data (to merge turn-level and speaker-level data sets), extracting and aggregating linguistic features, estimating effects, and sharing data. This practical guide is meant to shed light on current best practices and empower more researchers to study conversations more directly—to expand the community of conversation scholars and contribute to a greater cumulative scientific understanding of the social world

    Deriving and Exploiting Situational Information in Speech: Investigations in a Simulated Search and Rescue Scenario

    Get PDF
    The need for automatic recognition and understanding of speech is emerging in tasks involving the processing of large volumes of natural conversations. In application domains such as Search and Rescue, exploiting automated systems for extracting mission-critical information from speech communications has the potential to make a real difference. Spoken language understanding has commonly been approached by identifying units of meaning (such as sentences, named entities, and dialogue acts) for providing a basis for further discourse analysis. However, this fine-grained identification of fundamental units of meaning is sensitive to high error rates in the automatic transcription of noisy speech. This thesis demonstrates that topic segmentation and identification techniques can be employed for information extraction from spoken conversations by being robust to such errors. Two novel topic-based approaches are presented for extracting situational information within the search and rescue context. The first approach shows that identifying the changes in the context and content of first responders' report over time can provide an estimation of their location. The second approach presents a speech-based topological map estimation technique that is inspired, in part, by automatic mapping algorithms commonly used in robotics. The proposed approaches are evaluated on a goal-oriented conversational speech corpus, which has been designed and collected based on an abstract communication model between a first responder and a task leader during a search process. Results have confirmed that a highly imperfect transcription of noisy speech has limited impact on the information extraction performance compared with that obtained on the transcription of clean speech data. This thesis also shows that speech recognition accuracy can benefit from rescoring its initial transcription hypotheses based on the derived high-level location information. A new two-pass speech decoding architecture is presented. In this architecture, the location estimation from a first decoding pass is used to dynamically adapt a general language model which is used for rescoring the initial recognition hypotheses. This decoding strategy has resulted in a statistically significant gain in the recognition accuracy of the spoken conversations in high background noise. It is concluded that the techniques developed in this thesis can be extended to more application domains that deal with large volumes of natural spoken conversations

    Searching Spontaneous Conversational Speech:Proceedings of ACM SIGIR Workshop (SSCS2008)

    Get PDF
    corecore