7 research outputs found

    Spoken language processing: piecing together the puzzle

    Attempting to understand the fundamental mechanisms underlying spoken language processing, whether it is viewed as behaviour exhibited by human beings or as a faculty simulated by machines, is one of the greatest scientific challenges of our age. Despite tremendous achievements over the past 50 or so years, there is still a long way to go before we reach a comprehensive explanation of human spoken language behaviour and can create a technology with performance approaching or exceeding that of a human being. It is argued that progress is hampered by the fragmentation of the field across many different disciplines, coupled with a failure to create an integrated view of the fundamental mechanisms that underpin one organism's ability to communicate with another. This paper weaves together accounts from a wide variety of different disciplines concerned with the behaviour of living systems - many of them outside the normal realms of spoken language - and compiles them into a new model: PRESENCE (PREdictive SENsorimotor Control and Emulation). It is hoped that the results of this research will provide a sufficient glimpse into the future to give breath to a new generation of research into spoken language processing by mind or machine.

    Evaluation of preprocessors for neural network speaker verification

    Spoken content retrieval: A survey of techniques and technologies

    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
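
    A minimal sketch of the ASR-plus-IR combination the survey describes: spoken documents are first turned into text by an ASR system (represented here by ready-made transcript strings), then indexed and ranked with standard IR machinery. Everything below - the function names, the TF-IDF weighting, the toy data - is illustrative and not taken from the survey.

        import math
        from collections import Counter

        def tf_idf_index(transcripts):
            """Build per-document TF-IDF vectors over whitespace tokens."""
            docs = [Counter(t.lower().split()) for t in transcripts]
            n = len(docs)
            df = Counter(term for d in docs for term in d)
            idf = {t: math.log(n / df[t]) for t in df}
            return [{t: c * idf[t] for t, c in d.items()} for d in docs], idf

        def search(query, vectors, idf):
            """Rank transcript indices by cosine similarity to the query."""
            qv = {t: c * idf.get(t, 0.0) for t, c in Counter(query.lower().split()).items()}
            qnorm = math.sqrt(sum(w * w for w in qv.values())) or 1.0
            def cosine(v):
                dot = sum(w * v.get(t, 0.0) for t, w in qv.items())
                vnorm = math.sqrt(sum(w * w for w in v.values())) or 1.0
                return dot / (vnorm * qnorm)
            return sorted(range(len(vectors)), key=lambda i: cosine(vectors[i]), reverse=True)

        # Toy spoken "collection": two ASR transcripts.
        transcripts = ["the lecture covers hidden markov models",
                       "today we discuss information retrieval and indexing"]
        vectors, idf = tf_idf_index(transcripts)
        print(search("markov models", vectors, idf))  # -> [0, 1]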

    Multi-tape finite-state transducer for asynchronous multi-stream pattern recognition with application to speech

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. Includes bibliographical references (p. 119-127). In this thesis, we have focused on improving the acoustic modeling of speech recognition systems to increase the overall recognition performance. We formulate a novel multi-stream speech recognition framework using multi-tape finite-state transducers (FSTs). The multi-dimensional input labels of the multi-tape FST transitions specify the acoustic models to be used for the individual feature streams. An additional auxiliary field is used to model the degree of asynchrony among the feature streams. The individual feature streams can be linear sequences, such as fixed-frame-rate features in traditional hidden Markov model (HMM) systems, or directed acyclic graphs, such as segment features in segment-based systems. In a single-tape mode, this multi-stream framework also unifies the frame-based HMM and the segment-based approach. Systems using the multi-stream speech recognition framework were evaluated on an audio-only and an audio-visual speech recognition task. On the Wall Street Journal speech recognition task, the multi-stream framework combined a traditional frame-based HMM with segment-based landmark features. The system achieved a word error rate (WER) of 8.0%, improving on both the 8.8% WER of the baseline HMM-only system and the 10.4% WER of the landmark-only system. On the AV-TIMIT audio-visual speech recognition task, the multi-stream framework combined a landmark model, a segment model, and a visual HMM. The system achieved a WER of 0.9%, which also improved on the baseline systems. These results demonstrate the feasibility and versatility of the multi-stream speech recognition framework. By Han Shu, Ph.D.
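
    The transition structure at the heart of this framework lends itself to a small sketch. The dataclass and toy traversal below illustrate the idea only - one acoustic-model label per stream on each transition, plus an auxiliary field bounding inter-stream asynchrony - and are not the thesis implementation; every name in them is invented for the example.

        from dataclasses import dataclass

        @dataclass(frozen=True)
        class MultiTapeArc:
            src: int               # source state
            dst: int               # destination state
            labels: tuple          # one label per stream; None = stream does not advance
            max_asynchrony: int    # auxiliary field: allowed inter-stream time skew
            weight: float = 0.0

        def accepts(arcs, start, finals, streams):
            """Greedy toy traversal: consume one label per advancing stream."""
            state, pos = start, [0] * len(streams)
            while state not in finals:
                for a in (x for x in arcs if x.src == state):
                    ok = all(l is None or (pos[i] < len(streams[i]) and streams[i][pos[i]] == l)
                             for i, l in enumerate(a.labels))
                    if ok and max(pos) - min(pos) <= a.max_asynchrony:
                        pos = [p + (0 if l is None else 1) for p, l in zip(pos, a.labels)]
                        state = a.dst
                        break
                else:
                    return False
            return all(p == len(s) for p, s in zip(pos, streams))

        # Two streams; the second lags by one arc before catching up.
        arcs = [MultiTapeArc(0, 1, ("a", None), 1), MultiTapeArc(1, 2, ("b", "a"), 1),
                MultiTapeArc(2, 3, (None, "b"), 1)]
        print(accepts(arcs, 0, {3}, [["a", "b"], ["a", "b"]]))  # -> True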

    Improving searchability of automatically transcribed lectures through dynamic language modelling

    Recording university lectures through lecture capture systems is increasingly common. However, a single continuous audio recording is often unhelpful for users, who may wish to navigate quickly to a particular part of a lecture, or locate a specific lecture within a set of recordings. A transcript of the recording can enable faster navigation and searching. Automatic speech recognition (ASR) technologies may be used to create automated transcripts, to avoid the significant time and cost involved in manual transcription. Low accuracy of ASR-generated transcripts may however limit their usefulness. In particular, ASR systems optimized for general speech recognition may not recognize the many technical or discipline-specific words occurring in university lectures. To improve the usefulness of ASR transcripts for the purposes of information retrieval (search) and navigating within recordings, the lexicon and language model used by the ASR engine may be dynamically adapted for the topic of each lecture. A prototype is presented which uses the English Wikipedia as a semantically dense, large language corpus to generate a custom lexicon and language model for each lecture from a small set of keywords. Two strategies for extracting a topic-specific subset of Wikipedia articles are investigated: a naïve crawler which follows all article links from a set of seed articles produced by a Wikipedia search from the initial keywords, and a refinement which follows only links to articles sufficiently similar to the parent article. Pair-wise article similarity is computed from a pre-computed vector space model of Wikipedia article term scores generated using latent semantic indexing. The CMU Sphinx4 ASR engine is used to generate transcripts from thirteen recorded lectures from Open Yale Courses, using the English HUB4 language model as a reference and the two topic-specific language models generated for each lecture from Wikipedia. Three standard metrics – Perplexity, Word Error Rate and Word Correct Rate – are used to evaluate the extent to which the adapted language models improve the searchability of the resulting transcripts, and in particular improve the recognition of specialist words. Ranked Word Correct Rate is proposed as a new metric better aligned with the goals of improving transcript searchability and specialist word recognition. Analysis of recognition performance shows that the language models derived using the similarity-based Wikipedia crawler outperform models created using the naïve crawler, and that transcripts using similarity-based language models have better perplexity and Ranked Word Correct Rate scores than those created using the HUB4 language model, but worse Word Error Rates. It is concluded that English Wikipedia may successfully be used as a language resource for unsupervised topic adaptation of language models to improve recognition performance for better searchability of lecture recording transcripts, although possibly at the expense of other attributes such as readability.
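
    The similarity-gated crawl is the part of this pipeline that fits in a compact sketch. In the code below, get_links(title) and lsi_vector(title) are assumed stand-ins for a Wikipedia link extractor and the pre-computed LSI article vectors the abstract mentions; the threshold and limit values are likewise arbitrary placeholders.

        from collections import deque
        import numpy as np

        def cosine(u, v):
            denom = np.linalg.norm(u) * np.linalg.norm(v)
            return float(u @ v / denom) if denom else 0.0

        def similarity_crawl(seeds, get_links, lsi_vector, threshold=0.5, limit=1000):
            """Breadth-first crawl that follows a link only when the target
            article is sufficiently similar, in LSI space, to its parent."""
            collected, frontier = set(seeds), deque(seeds)
            while frontier and len(collected) < limit:
                parent = frontier.popleft()
                pv = lsi_vector(parent)
                for child in get_links(parent):
                    if child not in collected and cosine(pv, lsi_vector(child)) >= threshold:
                        collected.add(child)
                        frontier.append(child)
            return collected  # articles from which the lexicon and LM are then built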

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation of speech signals and methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other applications that build on the output of automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and speech processing systems able to operate in real-world environments such as mobile communication services and smart homes.
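
    Of the component techniques this blurb lists, the search step is the easiest to show in a few lines. The sketch below is a generic Viterbi recursion over per-frame acoustic log-likelihoods, offered only as a minimal illustration of "searching the hypothesis space"; real recognizers add pruning, a pronunciation lexicon, and a language model on top, and the toy numbers are invented.

        import numpy as np

        def viterbi(log_obs, log_trans, log_init):
            """Best state path. log_obs: (T, S) frame log-likelihoods;
            log_trans: (S, S) transition log-probs; log_init: (S,)."""
            T, S = log_obs.shape
            delta = log_init + log_obs[0]        # best score ending in each state
            back = np.zeros((T, S), dtype=int)   # backpointers
            for t in range(1, T):
                scores = delta[:, None] + log_trans   # scores[i, j]: from i to j
                back[t] = scores.argmax(axis=0)
                delta = scores.max(axis=0) + log_obs[t]
            path = [int(delta.argmax())]
            for t in range(T - 1, 0, -1):        # trace back the best path
                path.append(int(back[t, path[-1]]))
            return path[::-1]

        # Two states, three frames: the observations pull the path from 0 to 1.
        log_obs = np.log(np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]))
        log_trans = np.log(np.array([[0.7, 0.3], [0.3, 0.7]]))
        print(viterbi(log_obs, log_trans, np.log(np.array([0.5, 0.5]))))  # -> [0, 0, 1]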

    Predicting the performance of a speech recognition task

    Yau Pui Yuk. Thesis (M.Phil.)--Chinese University of Hong Kong, 2002. Includes bibliographical references (leaves 147-152). Abstracts in English and Chinese. Contents:
    Chapter 1: Introduction
        1.1 Overview
        1.2 Speech Recognition
            1.2.1 How Speech Recognition Works
            1.2.2 Types of Speech Recognition Tasks
            1.2.3 Variabilities in Speech - a Challenge for Speech Recognition
        1.3 Performance Prediction of Speech Recognition Task
        1.4 Thesis Goals
        1.5 Thesis Organization
    Chapter 2: Background
        2.1 The Acoustic-phonetic Approach
            2.1.1 Prediction based on the Degree of Mismatch
            2.1.2 Prediction based on Acoustic Similarity
            2.1.3 Prediction based on Between-Word Distance
        2.2 The Lexical Approach
            2.2.1 Perplexity
            2.2.2 SMR-perplexity
        2.3 The Combined Acoustic-phonetic and Lexical Approach
            2.3.1 Speech Decoder Entropy (SDE)
            2.3.2 Ideal Speech Decoding Difficulty (ISDD)
        2.4 Chapter Summary
    Chapter 3: Components for Predicting the Performance of Speech Recognition Task
        3.1 Components of Speech Recognizer
        3.2 Word Similarity Measure
            3.2.1 Universal Phoneme Symbol (UPS)
            3.2.2 Definition of Phonetic Distance
            3.2.3 Definition of Word Pair Phonetic Distance
            3.2.4 Definition of Word Similarity Measure
        3.3 Word Occurrence Measure
        3.4 Chapter Summary
    Chapter 4: Formulation of Recognition Error Predictive Index (REPI)
        4.1 Formulation of Recognition Error Predictive Index (REPI)
        4.2 Characteristics of Recognition Error Predictive Index (REPI)
            4.2.1 Weakness of Ideal Speech Decoding Difficulty (ISDD)
            4.2.2 Advantages of Recognition Error Predictive Index (REPI)
        4.3 Chapter Summary
    Chapter 5: Experimental Design and Setup
        5.1 Objectives
        5.2 Experiments Preparation
            5.2.1 Speech Corpus and Speech Recognizers
            5.2.2 Speech Recognition Tasks
            5.2.3 Evaluation Criterion
        5.3 Experiment Categories and their Setup
            5.3.1 Experiment Category 1 - Investigating and comparing the overall prediction performance of the two predictive indices
            5.3.2 Experiment Category 2 - Comparing the applicability of the word similarity measures of the two predictive indices on predicting the recognition performance
            5.3.3 Experiment Category 3 - Comparing the applicability of the formulation method of the two predictive indices on predicting the recognition performance
            5.3.4 Experiment Category 4 - Comparing the performance of different phonetic distance definitions
        5.4 Chapter Summary
    Chapter 6: Experimental Results and Analysis
        6.1 Experimental Results and Analysis (Experiment Categories 1-4)
        6.2 Experimental Summary
        6.3 Chapter Summary
    Chapter 7: Conclusions
        7.1 Contributions
        7.2 Future Directions
    Bibliography
    Appendix A: Table of Universal Phoneme Symbol
    Appendix B: Vocabulary Lists
    Appendix C: Experimental Results of Two-words Speech Recognition Tasks
    Appendix D: Experimental Results of Three-words Speech Recognition Tasks
    Appendix E: Significance Testing
        E.1 Procedures of Significance Testing
        E.2 Results of the Significance Testing (Experiment Categories 1-4)
    Appendix F: Linear Regression Models
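
    The thesis builds its word similarity measure on a purpose-defined phonetic distance (Chapter 3 of the contents above). As a rough stand-in for that idea, the sketch below scores word-pair confusability with plain Levenshtein distance over phoneme sequences, normalized to a 0-1 similarity; the phoneme transcriptions and the normalization are assumptions for illustration, not the thesis definitions.

        def edit_distance(a, b):
            """Levenshtein distance between two phoneme sequences."""
            prev = list(range(len(b) + 1))
            for i, x in enumerate(a, 1):
                cur = [i]
                for j, y in enumerate(b, 1):
                    cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (x != y)))
                prev = cur
            return prev[-1]

        def word_similarity(phones_a, phones_b):
            """1.0 for identical pronunciations, approaching 0.0 for disjoint ones."""
            d = edit_distance(phones_a, phones_b)
            return 1.0 - d / max(len(phones_a), len(phones_b))

        # "two" /t uw/ vs. "to" /t ax/: one substitution over length 2 -> 0.5
        print(word_similarity(["t", "uw"], ["t", "ax"]))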