ARTICULATORY INFORMATION FOR ROBUST SPEECH RECOGNITION
Current Automatic Speech Recognition (ASR) systems fall well short of human speech recognition performance because they lack robustness against speech variability and noise contamination. The goal of this dissertation is to investigate these critical robustness issues, propose ways to address them, and finally present an ASR architecture built upon these robustness criteria.
Acoustic variations adversely affect the performance of current phone-based ASR systems, in which speech is modeled as 'beads-on-a-string', where the beads are the individual phone units. While phone units are distinct in the cognitive domain, they vary in the physical domain, and this variation arises from a combination of factors including speaking style and speaking rate; the phenomenon is commonly known as 'coarticulation'. Traditional ASR systems address such coarticulatory variations by using contextualized phone units such as triphones. Articulatory phonology accounts for coarticulatory variations by modeling speech as a constellation of constricting actions known as articulatory gestures. In such a framework, speech variations such as coarticulation and lenition are accounted for by gestural overlap in time and gestural reduction in space. To realize a gesture-based ASR system, articulatory gestures have to be inferred from the acoustic signal. At the initial stage of this research, a study was performed using synthetically generated speech to obtain a proof of concept that articulatory gestures can indeed be recognized from the speech signal. It was observed that using vocal tract constriction trajectories (TVs) as an intermediate representation facilitated recognizing gestures from the speech signal.
Presently no natural speech database contains articulatory gesture annotation; hence an automated iterative time-warping architecture is proposed that can annotate any natural speech database with articulatory gestures and TVs. Two natural speech databases, X-ray microbeam and Aurora-2, were annotated; the former was used to train a TV estimator and the latter was used to train a Dynamic Bayesian Network (DBN) based ASR architecture. The DBN architecture used two sets of observations: (a) acoustic features in the form of mel-frequency cepstral coefficients (MFCCs) and (b) TVs estimated from the acoustic speech signal. In this setup the articulatory gestures were modeled as hidden random variables, eliminating the need for explicit gesture recognition. Word recognition results using the DBN architecture indicate that articulatory representations not only help to account for coarticulatory variations but also significantly improve the noise robustness of the ASR system.
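The sketch below illustrates, under stated assumptions, the two observation streams such a DBN recognizer would consume: MFCCs extracted from the waveform, and TVs predicted from those MFCCs by a separately trained estimator. librosa, the scikit-learn regressor, the eight-TV dimensionality, and the random placeholder data are illustrative choices, not the dissertation's actual front end.

```python
# Minimal sketch (not the dissertation's actual system): building the two
# observation streams for the DBN word recognizer -- MFCCs computed from
# the acoustic signal, and TVs estimated from those MFCCs.
import numpy as np
import librosa
from sklearn.neural_network import MLPRegressor

sr = 16000
y = 0.1 * np.random.randn(sr)              # placeholder 1-second waveform

# Stream (a): acoustic features (13 MFCCs per 10 ms frame).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            hop_length=160, n_fft=400).T   # (frames, 13)

# Stream (b): tract-variable (TV) trajectories estimated from the acoustics.
# A real TV estimator would be trained on articulatory data such as the
# X-ray microbeam corpus; random targets are used here purely to show shapes.
n_tvs = 8                                   # e.g. lip aperture, tongue-tip constriction, ...
fake_tv_targets = np.random.randn(mfcc.shape[0], n_tvs)
tv_estimator = MLPRegressor(hidden_layer_sizes=(64,), max_iter=200)
tv_estimator.fit(mfcc, fake_tv_targets)
tvs = tv_estimator.predict(mfcc)            # (frames, n_tvs)

# Both streams would then feed the DBN, in which articulatory gestures are
# hidden variables linking TVs, acoustics, and words.
observations = np.hstack([mfcc, tvs])       # (frames, 13 + n_tvs)
print(observations.shape)
```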
Applying Facial Emotion Recognition to Usability Evaluations to Reduce Analysis Time
Usability testing is an important part of product design that offers developers insight into a product’s ability to help users achieve their goals. Despite the usefulness of usability testing, human usability evaluations are costly and time-intensive processes. Developing methods to reduce the time and costs of usability evaluations is important for organizations to improve the usability of their products without expensive investments. One prospective solution to this is the application of facial emotion recognition to automate the collection of qualitative metrics normally identified by human usability evaluators.
In this paper, facial emotion recognition (FER) was applied to mock usability recordings to evaluate how well FER could identify moments of emotional significance. To determine the accuracy of FER in this context, the output of a FER Python library created by Justin Shenk was compared with data tags produced by human reporters. This study found that the facial emotion recognizer matched its output with fewer than 40% of the human-reported emotion timestamps, and fewer than 78% of the emotion data tags were recognized at all. The current lack of consistency with the human-reported emotions found in this thesis makes it difficult to recommend FER for identifying moments of emotional significance over conventional human usability evaluators.
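A minimal sketch of that kind of comparison is shown below, assuming Justin Shenk's fer package (pip install fer) and OpenCV. The video path, the once-per-second sampling, the human-annotated tags, and the two-second matching window are illustrative placeholders rather than the study's actual protocol.

```python
# Sketch: sample frames from a usability recording, run FER on each sampled
# frame, and count how many human-reported moments have a nearby detection.
import cv2
from fer import FER

detector = FER(mtcnn=True)                        # MTCNN face-detection backend
human_tags = {12.0: "frustration", 47.5: "confusion"}   # hypothetical annotations
tolerance_s = 2.0                                 # match window around each tag

cap = cv2.VideoCapture("usability_session.mp4")   # placeholder recording
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
detections = []                                    # (timestamp, emotion, score)

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % int(fps) == 0:                  # sample roughly once per second
        emotion, score = detector.top_emotion(frame)
        if emotion is not None:
            detections.append((frame_idx / fps, emotion, score))
    frame_idx += 1
cap.release()

# One crude notion of "match rate": human-reported timestamps that have any
# FER detection within the tolerance window.
matched = sum(
    any(abs(t - tag_t) <= tolerance_s for t, _, _ in detections)
    for tag_t in human_tags
)
print(f"matched {matched} of {len(human_tags)} human-reported moments")
```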
PHONOTACTIC AND ACOUSTIC LANGUAGE RECOGNITION
This thesis deals with phonotactic and acoustic techniques for automatic language recognition (LRE). The first part of the thesis deals with phonotactic language recognition based on the co-occurrence of phone sequences in speech. A thorough study of phone recognition as a tokenization technique for LRE is presented, with focus on the amount of training data for the phone recognizer and on the combination of phone recognizers trained on several languages (Parallel Phone Recognition followed by Language Models, PPRLM). The thesis also deals with a novel technique of anti-models in PPRLM and investigates the use of phone lattices instead of one-best strings. The work on the phonotactic approach is concluded by a comparison of classical n-gram modeling techniques and binary decision trees. Acoustic LRE is addressed as well, with the main focus on discriminative techniques for training target-language acoustic models and on initial (but successful) experiments with removing channel dependencies. We also investigated the fusion of the phonotactic and acoustic approaches. All experiments were performed on standard data from the NIST 2003, 2005, and 2007 evaluations, so the results are directly comparable to those of other laboratories in the LRE community. With the above-mentioned techniques, the fused systems defined the state of the art in the LRE field and reached excellent results in the NIST evaluations.
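The sketch below illustrates the phonotactic idea behind PPRLM in its simplest form, not the thesis's system: train a phone-bigram language model per target language from phone strings emitted by a phone recognizer, then score a test utterance's phone string under each model and pick the highest log-likelihood. The phone strings and the add-alpha smoothing are invented placeholders.

```python
# Toy phonotactic language recognition with per-language phone-bigram models.
from collections import Counter
from math import log

def train_bigram_lm(phone_strings, alpha=0.5):
    """Add-alpha smoothed phone-bigram model built from phone sequences."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for seq in phone_strings:
        for a, b in zip(seq, seq[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
            vocab.update((a, b))
    V = len(vocab)
    def logprob(seq):
        return sum(
            log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * V))
            for a, b in zip(seq, seq[1:])
        )
    return logprob

# Placeholder training data: phone sequences per language.
models = {
    "english": train_bigram_lm([["dh", "ax", "k", "ae", "t"], ["s", "ih", "t"]]),
    "czech":   train_bigram_lm([["p", "r", "a", "ts", "e"], ["r", "e", "ch"]]),
}

test = ["k", "ae", "t", "s", "ih", "t"]            # phone string from a recognizer
scores = {lang: lm(test) for lang, lm in models.items()}
print(max(scores, key=scores.get), scores)
```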
Multi-Sensory Emotion Recognition with Speech and Facial Expression
Emotion plays an important role in human beings’ daily lives. Understanding emotions and recognizing how to react to others’ feelings are fundamental to engaging in successful social interactions. Currently, emotion recognition is not only significant in daily life but also a hot topic in academic research, as new techniques such as emotion recognition from speech offer insight into how emotions relate to the content we utter.
The demand for and importance of emotion recognition have increased greatly in recent years across many applications, such as video games, human-computer interaction, cognitive computing, and affective computing. Emotion recognition can draw on many sources, including text, speech, hand and body gestures, and facial expression. Presently, most emotion recognition methods use only one of these sources. Human emotion changes from moment to moment, and relying on a single source may not capture it correctly. This research is motivated by the desire to understand and evaluate human emotion through multiple modalities, such as speech and facial expressions.
In this dissertation, multi-sensory emotion recognition is explored. The proposed framework can recognize emotion from speech, from facial expression, or from both. The system design has three main parts: the facial emotion recognizer, the speech emotion recognizer, and the information fusion. The information fusion part takes the results from the speech emotion recognition and the facial emotion recognition; a novel weighted method then integrates the results, and a final emotion decision is produced after fusion.
The experiments show that the weighted fusion method improves accuracy by an average of 3.66% compared with fusion without weighting. The improvement in recognition rate reaches 18.27% and 5.66% compared with speech emotion recognition alone and facial expression recognition alone, respectively. By improving emotion recognition accuracy, the proposed multi-sensory emotion recognition system can help improve the naturalness of human-computer interaction.
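The sketch below shows weighted decision-level fusion in the spirit of the framework described above, not its actual weighting scheme: each recognizer emits a probability per emotion class, and the fused score is a weighted sum. The emotion classes, probabilities, and weights are illustrative.

```python
# Toy weighted fusion of speech and facial emotion recognizer outputs.
import numpy as np

emotions = ["anger", "happiness", "sadness", "neutral"]

speech_probs = np.array([0.10, 0.55, 0.05, 0.30])   # from the speech recognizer
face_probs   = np.array([0.20, 0.35, 0.10, 0.35])   # from the facial recognizer

# Per-modality weights, e.g. reflecting each recognizer's validation accuracy.
w_speech, w_face = 0.6, 0.4

fused = w_speech * speech_probs + w_face * face_probs
fused /= fused.sum()                                  # renormalize to a distribution

print(dict(zip(emotions, fused.round(3))))
print("fused decision:", emotions[int(np.argmax(fused))])
```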
Jointly optimizing sensing pipelines for multimodal mixed reality interaction
National Research Foundation (NRF) Singapore under International Research Centres in Singapore Funding Initiative; Ministry of Education, Singapore under its Academic Research Funding Tier
- …