
    Speech Recognition

    Chapters in the first part of the book cover the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications able to operate in real-world environments, such as mobile communication services and smart homes.
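As a concrete illustration of the feature-extraction stage mentioned above, here is a minimal MFCC (mel-frequency cepstral coefficient) sketch in NumPy; the frame sizes, filterbank count, and the synthetic sine-wave input are illustrative assumptions, not taken from the book:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC sketch: frame -> Hamming window -> power spectrum
    -> mel filterbank -> log -> DCT-II. For illustration only."""
    frames = np.array([signal[s:s + n_fft] * np.hamming(n_fft)
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced evenly on the mel scale.
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    imel = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log mel energies; keep the first n_ceps.
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                  (2 * np.arange(n_mels) + 1) / (2.0 * n_mels)))
    return logmel @ dct.T

# One second of a 440 Hz tone as a stand-in for real speech.
feats = mfcc(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0))
```

With 512-sample frames and a 160-sample hop over one second of 16 kHz audio, this yields one 13-dimensional feature vector per frame.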

    Triphone clustering in Finnish continuous speech recognition (Trifoniklusterointi suomenkielisessä jatkuvassa puheentunnistuksessa)

    This master's thesis investigates the use of context-dependent phoneme models (triphones) in a speaker-dependent continuous speech recognizer for Finnish. The first part of the thesis reviews the human speech production and auditory systems and the properties of the Finnish language from a speech recognition perspective, and presents the general structure and operation of speech recognition systems. The discussion emphasizes the context dependence of phonemes and coarticulatory effects. In the second part, a speaker-dependent recognizer is trained using hidden Markov models (HMMs) and the Hidden Markov Model Toolkit (HTK) software. For triphone clustering, a data-driven method based on binary decision trees is tested, along with methods that exploit knowledge of the phonemes' sound types and places of articulation. The best recognition results are achieved with the tree-clustering method, which also yields the largest number of models. The errors in the recognition experiments are examined extensively: typical phoneme-specific errors and the contexts that produced the most errors are analyzed.
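The decision-tree clustering described above can be sketched as a greedy split chosen by likelihood gain; the question set, the single-Gaussian likelihood criterion, and the toy triphone labels below are illustrative assumptions, not HTK's actual implementation:

```python
import numpy as np

# Hypothetical broad phonetic classes used as decision-tree questions;
# the names and classes are illustrative, not HTK's standard question set.
QUESTIONS = {
    "L_vowel": lambda tri: tri[0] in "aeiouy",   # left context is a vowel
    "R_vowel": lambda tri: tri[2] in "aeiouy",   # right context is a vowel
    "L_nasal": lambda tri: tri[0] in "mn",
    "R_nasal": lambda tri: tri[2] in "mn",
}

def loglik(vecs):
    """Log-likelihood of pooled data under one diagonal Gaussian (up to constants)."""
    v = np.vstack(vecs)
    var = v.var(axis=0) + 1e-6
    return -0.5 * v.shape[0] * np.sum(np.log(var))

def best_split(triphones, data):
    """One greedy top-down step: pick the question whose yes/no split
    of the triphone set gives the largest log-likelihood gain."""
    parent = loglik([data[t] for t in triphones])
    best_name, best_gain = None, 0.0
    for name, q in QUESTIONS.items():
        yes = [t for t in triphones if q(t)]
        no = [t for t in triphones if not q(t)]
        if not yes or not no:
            continue
        gain = (loglik([data[t] for t in yes])
                + loglik([data[t] for t in no]) - parent)
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain

# Toy acoustic vectors: a triphone is written as a 3-character string
# (left context, center, right context); vowel-left contexts are
# centered well apart from nasal-left contexts.
rng = np.random.default_rng(0)
data = {t: (2.0 if t[0] in "aeiouy" else -2.0) + 0.1 * rng.normal(size=(5, 2))
        for t in ["ata", "eti", "mtn", "ntm"]}
name, gain = best_split(["ata", "eti", "mtn", "ntm"], data)
```

Applied recursively until the gain falls below a threshold, this kind of split produces the tied triphone clusters the abstract refers to.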

    Using a deep learning-based framework for child speech emotion recognition

    The body has many biological channels through which human emotion can be detected, including heart rate, facial expressions, movement of the eyelids and dilation of the eyes, body posture, skin conductance, and even the speech we produce. Speech emotion recognition research started some three decades ago, and the popular Interspeech Emotion Challenge has helped to propagate the research area. However, most speech emotion recognition research has focused on adults, and there is very little research on child speech. This dissertation describes the development and evaluation of a child speech emotion recognition framework. The higher-level components of the framework are designed to sort and separate speech by speaker age, ensuring that the focus is only on speech produced by children. The framework uses Baddeley's Theory of Working Memory to model a Working Memory Recurrent Network that can process and recognize emotions from speech. Baddeley's theory offers one of the best explanations of how the human brain holds and manipulates temporary information, which is crucial for developing neural networks that learn effectively. Experiments were designed and performed to answer the research questions, evaluate the proposed framework, and benchmark its performance against other methods. Satisfactory results were obtained, and in many cases our framework outperformed other popular approaches. This study has implications for various applications of child speech emotion recognition, such as child abuse detection and child learning robots.
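The abstract does not specify the Working Memory Recurrent Network's architecture; as a generic point of reference only, a plain Elman RNN that classifies an utterance's frame-level features into emotion classes looks like this (all dimensions, weights, and inputs are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 13 features per frame, 16 hidden units, 4 emotion classes.
F, H, C = 13, 16, 4
Wx = rng.normal(0, 0.1, (H, F))   # input-to-hidden weights
Wh = rng.normal(0, 0.1, (H, H))   # hidden-to-hidden (recurrent) weights
Wo = rng.normal(0, 0.1, (C, H))   # hidden-to-output weights

def classify(frames):
    """Run a plain Elman RNN over an utterance and return emotion probabilities."""
    h = np.zeros(H)
    for x in frames:                  # one feature vector per frame
        h = np.tanh(Wx @ x + Wh @ h)  # recurrent state update
    logits = Wo @ h                   # classify from the final state
    e = np.exp(logits - logits.max())
    return e / e.sum()                # softmax over emotion classes

probs = classify(rng.normal(size=(50, F)))  # 50 frames of random "features"
```

The recurrent state `h` is the part such architectures elaborate on; a working-memory-inspired model would add mechanisms for holding and manipulating that temporary information.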

    Rationality, pragmatics, and sources

    This thesis contributes to the Great Rationality Debate in cognitive science. It introduces and explores a triangular scheme for understanding the relationship between rationality and two key abilities: pragmatics – roughly, inferring implicit intended utterance meanings – and learning from sources. The thesis argues that these three components – rationality, pragmatics, and sources – should be considered together: that each one informs the others. The thesis makes this case through literature review and theoretical work (principally, in Chapters 1 and 8) and through a series of empirical chapters focusing on different parts of the triangular scheme. Chapters 2 to 4 address the relationship between pragmatics and sources, focusing on how people change their beliefs when they read a conditional with a partially reliable source. The data bear on theories of the conditional and on the literature assessing people’s rationality with conditionals. Chapter 5 addresses the relationship between rationality and pragmatics, focusing on conditionals ‘in action’ in a framing effect known as goal framing. The data suggest a complex relationship between pragmatics and utilities, and support a new approach to goal framing. Chapter 6 addresses the relationship between rationality and sources, using normative Bayesian models to explore how people respond to simple claims from sources of different reliabilities. The data support a two-way relationship between claims and source information and, perhaps most strikingly, suggest that people readily treat sources as ‘anti-reliable’: as negatively correlated with the truth. Chapter 7 extends these experiments to test the theory that speakers can guard against reputational damage using hedging. The data do not support this theory, and raise questions about whether trust and vigilance against deception are prerequisites for pragmatics. 
Lastly, Chapter 8 synthesizes the results; argues for new ways of understanding the relationships between rationality, pragmatics, and sources; and relates the findings to emerging formal methods in psychology.
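The normative Bayesian treatment of source reliability described for Chapter 6 can be illustrated with a one-line Bayes update; the parameter names below are assumptions for illustration, and an "anti-reliable" source corresponds to a false-alarm rate that exceeds the hit rate:

```python
def posterior(prior, p_say_if_true, p_say_if_false):
    """Bayesian belief in a claim after a source asserts it.

    prior          : P(H) before hearing the claim
    p_say_if_true  : P(source asserts H | H)      -- the source's hit rate
    p_say_if_false : P(source asserts H | not H)  -- the false-alarm rate
    """
    num = p_say_if_true * prior
    return num / (num + p_say_if_false * (1 - prior))

reliable = posterior(0.5, 0.9, 0.1)  # trustworthy source: belief rises
anti = posterior(0.5, 0.1, 0.9)      # anti-reliable source: belief falls
```

From a flat prior of 0.5, a claim from the reliable source raises belief to 0.9, while the same claim from the anti-reliable source lowers it to 0.1 — treating the source as negatively correlated with the truth.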

    Methods for pronunciation assessment in computer aided language learning

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 149-176). Learning a foreign language is a challenging endeavor that entails acquiring a wide range of new knowledge, including words, grammar, gestures, sounds, etc. Mastering these skills requires extensive practice, and opportunities to practice may not always be available. Computer Aided Language Learning (CALL) systems provide non-threatening environments where foreign language skills can be practiced wherever and whenever a student desires. These systems often employ several technologies to identify the different types of errors a student makes. This thesis focuses on the problem of identifying mispronunciations made by a foreign language student using a CALL system. We make several assumptions about the nature of the learning activity: it takes place using a dialogue system, it is a task- or game-oriented activity, the student should not be interrupted by the pronunciation feedback system, and the goal of the feedback system is to identify severe mispronunciations with high reliability. Detecting mispronunciations requires a corpus of speech with human judgements of pronunciation quality. Typical approaches to collecting such a corpus use an expert phonetician to both phonetically transcribe and assign judgements of quality to each phone in the corpus; this is time consuming and expensive, and it places an extra burden on the transcriber. We describe a novel method for obtaining phone-level judgements of pronunciation quality by utilizing non-expert, crowd-sourced, word-level judgements of pronunciation. Foreign language learners typically exhibit high variation, and their pronunciations form distributions distinct from those of native speakers, which makes analysis for mispronunciation difficult.
We detail a simple but effective method for transforming the vowel space of non-native speakers to make mispronunciation detection more robust and accurate. We show that this transformation not only enhances performance on a simple classification task, but also results in distributions that can be better exploited for mispronunciation detection. This transformation of the vowel space is exploited to train a mispronunciation detector using a variety of features derived from acoustic model scores and vowel class distributions. We confirm that the transformation technique results in a more robust and accurate identification of mispronunciations than traditional acoustic models. by Mitchell A. Peabody. Ph.D.
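A vowel-space transformation of the general kind described above can be sketched as a per-speaker renormalization toward a native reference distribution; this z-score mapping and the synthetic formant data are illustrative assumptions, not the thesis's actual method:

```python
import numpy as np

def map_to_reference(speaker_vowels, ref_mean, ref_std):
    """Shift and scale one speaker's vowel measurements (e.g. F1/F2 formants)
    so their distribution matches a reference (native) vowel space.
    A simple per-speaker z-score renormalization, for illustration only."""
    mu = speaker_vowels.mean(axis=0)
    sd = speaker_vowels.std(axis=0) + 1e-9
    return (speaker_vowels - mu) / sd * ref_std + ref_mean

# Synthetic F1/F2 samples: a "native" reference and a shifted "learner".
rng = np.random.default_rng(1)
native = rng.normal([500.0, 1500.0], [80.0, 200.0], size=(200, 2))
learner = rng.normal([650.0, 1300.0], [120.0, 260.0], size=(200, 2))
mapped = map_to_reference(learner, native.mean(axis=0), native.std(axis=0))
```

After the mapping, the learner's vowel measurements share the reference distribution's mean and spread, so deviations of individual tokens are more directly comparable to native-speaker variability.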