81 research outputs found

    Speech Recognition

    Get PDF
    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals, methods for speech feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications able to operate in real-world environments, such as mobile communication services and smart homes.

    Unsupervised Stream-Weights Computation in Classification and Recognition Tasks

    Get PDF
    In this paper, we provide theoretical results on the problem of optimal stream weight selection for the multi-stream classification problem. It is shown that, in the presence of estimation or modeling errors, using stream weights can decrease the total classification error. Stream weight estimates are computed for various conditions. We then turn our attention to the problem of unsupervised stream weight computation. Based on the theoretical results, we propose to use models and “anti-models” (class-specific background models) to estimate stream weights. A non-linear function of the ratio of the inter- to intra-class distance is used for stream weight estimation. The proposed unsupervised stream weight estimation algorithm is evaluated both on artificial data and on the problem of audio-visual speech classification. Finally, the proposed algorithm is extended to the problem of audio-visual speech recognition. It is shown that the proposed algorithms achieve results comparable to the supervised minimum-error training approach under most testing conditions.
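
    The core idea lends itself to a compact illustration. Below is a minimal Python sketch of weighted two-stream score combination and of an unsupervised weight estimate driven by the inter- to intra-class distance ratio; the function names, the sigmoid mapping, and its slope parameter are illustrative assumptions, not the paper's exact formulation.

        import numpy as np

        def combined_score(log_p_audio, log_p_visual, w):
            # Weighted multi-stream score per class: the audio stream
            # contributes with weight w, the visual stream with 1 - w.
            return w * log_p_audio + (1.0 - w) * log_p_visual

        def stream_confidence(inter_dist, intra_dist, slope=1.0):
            # Map a stream's inter- to intra-class distance ratio to a
            # confidence in (0, 1) through a sigmoid; the sigmoid and its
            # slope stand in for the paper's non-linear function.
            ratio = inter_dist / max(intra_dist, 1e-9)
            return 1.0 / (1.0 + np.exp(-slope * (ratio - 1.0)))

        def estimate_audio_weight(sep_audio, sep_visual):
            # Unsupervised weighting: the more separable stream (larger
            # inter/intra ratio) receives the larger weight.
            c_a = stream_confidence(*sep_audio)
            c_v = stream_confidence(*sep_visual)
            return c_a / (c_a + c_v)

    For classification, one would pick the class maximizing combined_score; here sep_audio and sep_visual are (inter-class, intra-class) distance pairs, e.g. measured between model and anti-model scores.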

    Enhancing posterior based speech recognition systems

    Get PDF
    The use of local phoneme posterior probabilities has been increasingly explored for improving speech recognition systems. Hybrid hidden Markov model / artificial neural network (HMM/ANN) and Tandem systems are the most successful examples. In this thesis, we present a principled framework for enhancing the estimation of local posteriors by integrating phonetic and lexical knowledge, as well as long contextual information. This framework allows for hierarchical estimation, integration and use of local posteriors from the phoneme up to the word level. We propose two approaches for enhancing the posteriors. In the first approach, phoneme posteriors estimated with an ANN (in particular, a multi-layer perceptron, MLP) are used as emission probabilities in HMM forward-backward recursions. This yields new enhanced posterior estimates integrating HMM topological constraints (encoding specific phonetic and lexical knowledge) and long context. In the second approach, a temporal context of the regular MLP posteriors is post-processed by a secondary MLP in order to learn the inter- and intra-dependencies among the phoneme posteriors. The learned knowledge is integrated into the posterior estimation during the inference (forward pass) of the second MLP, resulting in enhanced posteriors. The use of the resulting enhanced local posteriors is investigated in a wide range of posterior-based speech recognition systems (e.g. Tandem and hybrid HMM/ANN), as a replacement for or in combination with the regular MLP posteriors. The enhanced posteriors consistently outperform the regular posteriors in different applications over small- and large-vocabulary databases.
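
    The first approach can be pictured with a short sketch. The Python fragment below runs standard forward-backward smoothing over an HMM whose emission scores are MLP posteriors divided by class priors (the usual hybrid HMM/ANN scaled-likelihood trick); the array names, the per-frame normalization, and the uniform handling of priors are illustrative assumptions, not the thesis's exact recursions.

        import numpy as np

        def enhanced_posteriors(mlp_post, priors, trans, init):
            # mlp_post: (T, Q) frame-level MLP posteriors p(q | x_t)
            # priors:   (Q,)   phoneme priors p(q)
            # trans:    (Q, Q) HMM transition matrix (topological constraints)
            # init:     (Q,)   initial state distribution
            emis = mlp_post / priors           # scaled likelihoods, up to a constant
            T, Q = emis.shape
            alpha = np.zeros((T, Q))
            beta = np.zeros((T, Q))
            alpha[0] = init * emis[0]
            alpha[0] /= alpha[0].sum()         # per-frame normalization vs. underflow
            for t in range(1, T):
                alpha[t] = (alpha[t - 1] @ trans) * emis[t]
                alpha[t] /= alpha[t].sum()
            beta[-1] = 1.0
            for t in range(T - 2, -1, -1):
                beta[t] = trans @ (beta[t + 1] * emis[t + 1])
                beta[t] /= beta[t].sum()
            gamma = alpha * beta               # smoothed, "enhanced" posteriors
            return gamma / gamma.sum(axis=1, keepdims=True)

    The returned gamma integrates the HMM topology and the whole utterance context, and can replace (or be combined with) the raw MLP posteriors downstream.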

    Utterance verification in large vocabulary spoken language understanding system

    Get PDF
    Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998. Includes bibliographical references (leaves 87-89). By Huan Yao, M.Eng.

    Spoken command recognition for robotics

    Get PDF
    In this thesis, I investigate spoken command recognition technology for robotics. While high robustness is expected, the distant and noisy conditions in which the system has to operate make the task very challenging. Unlike commercial systems, which all rely on a "wake-up" word to initiate the interaction, the pipeline proposed here directly detects and recognizes commands from the continuous audio stream. To keep the task manageable despite low-resource conditions, I propose to focus on a limited set of commands, thus trading off the flexibility of the system against robustness. Domain and speaker adaptation strategies based on a multi-task regularization paradigm are first explored. More precisely, two different methods are proposed, both relying on a tied loss function that penalizes the distance between the outputs of several networks. The first method considers each speaker or domain as a task. A canonical task-independent network is jointly trained with task-dependent models, allowing both types of networks to improve by learning from one another. While an improvement of 3.2% on the frame error rate (FER) of the task-independent network is obtained, this only partially carries over to the phone error rate (PER), with a 1.5% improvement. Similarly, a second method explores the parallel training of the canonical network with a privileged model having access to i-vectors. This method proves less effective, with only a 1.2% improvement on the FER. To make the developed technology more accessible, I also investigate the use of a sequence-to-sequence (S2S) architecture for command classification. An attention-based encoder-decoder model reduces the classification error by 40% relative to a strong convolutional neural network (CNN)-hidden Markov model (HMM) baseline, showing the relevance of S2S architectures in this context. To improve the flexibility of the trained system, I also explore strategies for few-shot learning, which allow the set of commands to be extended with minimal data requirements. Retraining a model on the combination of original and new commands, I achieve 40.5% accuracy on the new commands with only 10 examples of each. This score rises to 81.5% accuracy with a larger set of 100 examples per new command. An alternative strategy, based on model adaptation, achieves even better scores, with 68.8% and 88.4% accuracy for 10 and 100 examples respectively, while being faster to train. This high performance comes at the expense of the original categories, though, on which accuracy deteriorates. These results are very promising, as the methods make it easy to extend an existing S2S model with minimal resources. Finally, a full spoken command recognition system (named iCubrec) has been developed for the iCub platform. The pipeline relies on a voice activity detection (VAD) system to offer a fully hands-free experience. By segmenting only the regions that are likely to contain commands, the VAD module also greatly reduces the computational cost of the pipeline. Command candidates are then passed to the deep neural network (DNN)-HMM command recognition system for transcription. The VoCub dataset has been specifically gathered to train a DNN-based acoustic model for our task. Through multi-condition training with the CHiME4 dataset, an accuracy of 94.5% is reached on the VoCub test set. A filler model, complemented by a rejection mechanism based on a confidence score, is finally added to the system to reject non-command speech in a live demonstration of the system.
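
    The tied loss underlying the adaptation experiments admits a compact sketch. The PyTorch-style Python fragment below jointly penalizes the classification error of a canonical (task-independent) network and a task-dependent network, plus a term tying their outputs together; the choice of a squared distance between log-softmax outputs and the weight lam are illustrative assumptions, since the abstract only states that a distance between network outputs is penalized.

        import torch
        import torch.nn.functional as F

        def tied_multitask_loss(canon_logits, task_logits, targets, lam=0.1):
            # Both networks fit the same frame labels...
            ce_canon = F.cross_entropy(canon_logits, targets)
            ce_task = F.cross_entropy(task_logits, targets)
            # ...while the tying term keeps their outputs close, letting the
            # canonical and task-dependent models learn from one another.
            tie = F.mse_loss(F.log_softmax(canon_logits, dim=-1),
                             F.log_softmax(task_logits, dim=-1))
            return ce_canon + ce_task + lam * tie

    In the speaker-adaptation setting, canon_logits would come from the shared network and task_logits from the per-speaker (or per-domain) model processing the same frames.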

    Articulatory features for conversational speech recognition

    Get PDF

    Decision fusion for multi-modal person authentication.

    Get PDF
    Hui Pak Sum Henry. Thesis (M.Phil.)--Chinese University of Hong Kong, 2006. Includes bibliographical references (leaves [147]-152). Abstracts in English and Chinese. Contents:
    Chapter 1. Introduction
        1.1. Objectives
        1.2. Thesis Outline
    Chapter 2. Background
        2.1. User Authentication Systems
        2.2. Biometric Authentication
            2.2.1. Speaker Verification System
            2.2.2. Face Verification System
            2.2.3. Fingerprint Verification System
        2.3. Verbal Information Verification (VIV)
        2.4. Combining SV and VIV
        2.5. Biometric Decision Fusion Techniques
        2.6. Fuzzy Logic
            2.6.1. Fuzzy Membership Function and Fuzzy Set
            2.6.2. Fuzzy Operators
            2.6.3. Fuzzy Rules
            2.6.4. Defuzzification
            2.6.5. Advantage of Using Fuzzy Logic in Biometric Fusion
        2.7. Chapter Summary
    Chapter 3. Experimental Data
        3.1. Data for Multi-biometric Fusion
            3.1.1. Speech Utterances
            3.1.2. Face Movement Video Frames
            3.1.3. Fingerprint Images
        3.2. Data for Speech Authentication Fusion
            3.2.1. SV Training Data for Speaker Model
            3.2.2. VIV Training Data for Speaker Independent Model
            3.2.3. Validation Data
        3.3. Chapter Summary
    Chapter 4. Authentication Modules
        4.1. Biometric Authentication
            4.1.1. Speaker Verification
            4.1.2. Face Verification
            4.1.3. Fingerprint Verification
            4.1.4. Individual Biometric Performance
        4.2. Verbal Information Verification (VIV)
        4.3. Chapter Summary
    Chapter 5. Weighted Average Fusion for Multi-Modal Biometrics
        5.1. Experimental Setup and Results
        5.2. Analysis of Weighted Average Fusion Results
        5.3. Chapter Summary
    Chapter 6. Fully Adaptive Fuzzy Logic Decision Fusion Framework
        6.1. Factors Considered in the Estimation of Biometric Sample Quality
            6.1.1. Factors for Speech
            6.1.2. Factors for Face
            6.1.3. Factors for Fingerprint
        6.2. Fuzzy Logic Decision Fusion Framework
            6.2.1. Speech Fuzzy Sets
            6.2.2. Face Fuzzy Sets
            6.2.3. Fingerprint Fuzzy Sets
            6.2.4. Output Fuzzy Sets
            6.2.5. Fuzzy Rules and Other Information
        6.3. Experimental Setup and Results
        6.4. Comparison Between Weighted Average and Fuzzy Logic Decision Fusion
        6.5. Chapter Summary
    Chapter 7. Factors Affecting VIV Performance
        7.1. Factors from Verbal Messages
            7.1.1. Number of Distinct-Unique Responses
            7.1.2. Distribution of Distinct-Unique Responses
            7.1.3. Inter-person Lexical Choice Variations
            7.1.4. Intra-person Lexical Choice Variations
        7.2. Factors from Utterance Verification
            7.2.1. Thresholding
            7.2.2. Background Noise
        7.3. VIV Weight Estimation Using PDP
        7.4. Chapter Summary
    Chapter 8. Adaptive Fusion for SV and VIV
        8.1. Weighted Average Fusion of SV and VIV
            8.1.1. Scores Normalization
            8.1.2. Experimental Setup
        8.2. Adaptive Fusion for SV and VIV
            8.2.1. Components of Adaptive Fusion
            8.2.2. Three Categories Design
            8.2.3. Fusion Strategy for Each Category
            8.2.4. SV Driven Approach
        8.3. SV and Fixed-Pass Phrase VIV Fusion Results
        8.4. SV and Key-Pass Phrase VIV Fusion Results
        8.5. Chapter Summary
    Chapter 9. Conclusions and Future Work
        9.1. Conclusions
        9.2. Future Work
    Bibliography
    Appendix A. Detail of BSC Speech
    Appendix B. Fuzzy Rules for Multimodal Biometric Fusion
    Appendix C. Full Example for Multimodal Biometrics Fusion
    Appendix D. Reason for Having a Flat Error Surface
    Appendix E. Reason for Having a Relative Peak Point in the Middle of the Error Surface
    Appendix F. Illustration on Fuzzy Logic Weight Estimation
    Appendix G. Examples for SV and Key-Pass Phrase VIV Fusion

    Selected Topics in Audio-based Recommendation of TV Content

    Get PDF

    Machine Learning for Information Retrieval

    Get PDF
    In this thesis, we explore the use of machine learning techniques for information retrieval. More specifically, we focus on ad-hoc retrieval, which is concerned with searching large corpora to identify the documents relevant to user queries. This identification is performed through a ranking task: given a user query, an ad-hoc retrieval system ranks the corpus documents so that the documents relevant to the query ideally appear above the others. In a machine learning framework, we are interested in proposing learning algorithms that can benefit from limited training data in order to identify a ranker likely to achieve high retrieval performance over unseen documents and queries. This problem presents novel challenges compared to traditional learning tasks, such as regression or classification. First, our task is a ranking problem, which means that the loss for a given query cannot be measured as a sum of individual losses suffered for each corpus document. Second, most retrieval queries present a highly unbalanced setup, with the set of relevant documents accounting for only a very small fraction of the corpus. Third, ad-hoc retrieval corresponds to a kind of “double” generalization problem, since the learned model should generalize not only to new documents but also to new queries. Finally, our task also presents challenging efficiency constraints, since ad-hoc retrieval is typically applied to large corpora.
    The main objective of this thesis is to investigate the discriminative learning of ad-hoc retrieval models. For that purpose, we propose different models based on kernel machines or neural networks adapted to different retrieval contexts. The proposed approaches rely on different online learning algorithms that allow efficient learning over large corpora. The first part of the thesis focuses on text retrieval. In this case, we adopt a classical approach to the retrieval ranking problem and order the text documents according to their estimated similarity to the text query. The assessment of semantic similarity between text items plays a key role in this setup, and we propose a learning approach to identify an effective measure of text similarity. This identification does not rely on a set of queries with their corresponding relevant document sets, since such data are especially expensive to label and hence rare. Instead, we propose to rely on hyperlink data, since hyperlinks convey semantic proximity information that is relevant to similarity learning. This is hence a transfer learning setup, in which we benefit from the proximity information encoded by hyperlinks to improve performance on the ad-hoc retrieval task. We then investigate another retrieval problem: the retrieval of images from text queries. Our approach introduces a learning procedure that optimizes a criterion related to ranking performance, adapting our previous objective for learning textual similarity to the image retrieval problem. This yields an image ranking model that addresses the retrieval problem directly, in contrast with previous research that relies on an intermediate image annotation task. Moreover, our learning procedure builds upon recent work on the online learning of kernel-based classifiers, which yields an efficient, scalable algorithm that can benefit from recent kernels developed for image comparison.
    In the last part of the thesis, we show that the objective function used in the previous retrieval problems can be applied to the task of keyword spotting, i.e. the detection of given keywords in speech utterances. For that purpose, we formalize this problem as a ranking task: given a keyword, the keyword spotter should order the utterances so that those containing the keyword appear above the others. Interestingly, this formulation yields an objective that directly maximizes the area under the receiver operating characteristic curve, the most common evaluation measure for keyword spotters. This objective is used to train a model adapted to this intrinsically sequential problem, learned with a procedure derived from the algorithm previously introduced for the image retrieval task. To conclude, this thesis introduces machine learning approaches for ad-hoc retrieval. We propose learning models for various multi-modal retrieval setups: the retrieval of text documents from text queries, the retrieval of images from text queries, and the retrieval of speech recordings from written keywords. Our approaches rely on discriminative learning and enjoy efficient training procedures, which yields effective and scalable models. In all cases, links with prior approaches were investigated and experimental comparisons were conducted.
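
    The shared ranking objective can be sketched concretely. In the Python fragment below, a linear scorer is trained online on sampled (positive, negative) utterance pairs with a hinge surrogate of 1 - AUC; the linear model, the toy feature vectors, and the margin and learning-rate values are illustrative assumptions rather than the thesis's actual kernel- or network-based models.

        import numpy as np

        def pairwise_hinge_loss(s_pos, s_neg, margin=1.0):
            # Hinge surrogate of 1 - AUC for one (positive, negative) pair:
            # the utterance containing the keyword should outscore the one
            # that does not by at least `margin`.
            return max(0.0, margin - (s_pos - s_neg))

        def sgd_step(w, feat_pos, feat_neg, lr=0.01, margin=1.0):
            # One online update of a linear scorer s(x) = w . x on a sampled
            # pair; averaging the hinge loss over all pairs upper-bounds
            # 1 - AUC, so minimizing it pushes the AUC toward 1.
            if pairwise_hinge_loss(w @ feat_pos, w @ feat_neg, margin) > 0.0:
                w = w + lr * (feat_pos - feat_neg)  # push the pair apart
            return w

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            w = np.zeros(4)
            for _ in range(100):
                pos = rng.normal(1.0, 1.0, 4)   # toy features of a hit
                neg = rng.normal(-1.0, 1.0, 4)  # toy features of a miss
                w = sgd_step(w, pos, neg)

    Sampling one pair per step keeps each update cheap, which is what makes this family of objectives practical over large corpora.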