406 research outputs found

    Error Correction based on Error Signatures applied to automatic speech recognition

    Get PDF

    Holistic Vocabulary Independent Spoken Term Detection

    Get PDF
    Within this thesis, we aim at designing a loosely coupled holistic system for Spoken Term Detection (STD) on heterogeneous German broadcast data in selected application scenarios. Starting from STD on the 1-best output of a word-based speech recognizer, we study the performance of several subword units for vocabulary independent STD on a linguistically and acoustically challenging German corpus. We explore the typical error sources in subword STD, and find that they differ from the error sources in word-based speech search. We select, extend and combine a set of state-of-the-art methods for error compensation in STD in order to explicitly merge the corresponding STD error spaces through anchor-based approximate lattice retrieval. Novel methods for STD result verification are proposed in order to increase retrieval precision by exploiting external knowledge at search time. Error-compensating methods for STD typically suffer from high response times on large scale databases, and we propose scalable approaches suitable for large corpora. Highest STD accuracy is obtained by combining anchor-based approximate retrieval from both syllable lattice ASR and syllabified word ASR into a hybrid STD system, and pruning the result list using external knowledge with hybrid contextual and anti-query verification.Die vorliegende Arbeit beschreibt ein lose gekoppeltes, ganzheitliches System zur Sprachsuche auf heterogenenen deutschen Sprachdaten in unterschiedlichen Anwendungsszenarien. Ausgehend von einer wortbasierten Sprachsuche auf dem Transkript eines aktuellen Wort-Erkenners werden zunächst unterschiedliche Subwort-Einheiten für die vokabularunabhängige Sprachsuche auf deutschen Daten untersucht. Auf dieser Basis werden die typischen Fehlerquellen in der Subwort-basierten Sprachsuche analysiert. Diese Fehlerquellen unterscheiden sich vom Fall der klassichen Suche im Worttranskript und müssen explizit adressiert werden. Die explizite Kompensation der unterschiedlichen Fehlerquellen erfolgt durch einen neuartigen hybriden Ansatz zur effizienten Ankerbasierten unscharfen Wortgraph-Suche. Darüber hinaus werden neuartige Methoden zur Verifikation von Suchergebnissen vorgestellt, die zur Suchzeit verfügbares externes Wissen einbeziehen. Alle vorgestellten Verfahren werden auf einem umfangreichen Satz von deutschen Fernsehdaten mit Fokus auf ausgewählte, repräsentative Einsatzszenarien evaluiert. Da Methoden zur Fehlerkompensation in der Sprachsuchforschung typischerweise zu hohen Laufzeiten bei der Suche in großen Archiven führen, werden insbesondere auch Szenarien mit sehr großen Datenmengen betrachtet. Die höchste Suchleistung für Archive mittlerer Größe wird durch eine unscharfe und Anker-basierte Suche auf einem hybriden Index aus Silben-Wortgraphen und silbifizierter Wort-Erkennung erreicht, bei der die Suchergebnisse mit hybrider Verifikation bereinigt werden

    Speech Recognition

    Get PDF
    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes

    A Multimodal Sensor Fusion Architecture for Audio-Visual Speech Recognition

    Get PDF
    A key requirement for developing any innovative system in a computing environment is to integrate a sufficiently friendly interface with the average end user. Accurate design of such a user-centered interface, however, means more than just the ergonomics of the panels and displays. It also requires that designers precisely define what information to use and how, where, and when to use it. Recent advances in user-centered design of computing systems have suggested that multimodal integration can provide different types and levels of intelligence to the user interface. The work of this thesis aims at improving speech recognition-based interfaces by making use of the visual modality conveyed by the movements of the lips. Designing a good visual front end is a major part of this framework. For this purpose, this work derives the optical flow fields for consecutive frames of people speaking. Independent Component Analysis (ICA) is then used to derive basis flow fields. The coefficients of these basis fields comprise the visual features of interest. It is shown that using ICA on optical flow fields yields better classification results than the traditional approaches based on Principal Component Analysis (PCA). In fact, ICA can capture higher order statistics that are needed to understand the motion of the mouth. This is due to the fact that lips movement is complex in its nature, as it involves large image velocities, self occlusion (due to the appearance and disappearance of the teeth) and a lot of non-rigidity. Another issue that is of great interest to audio-visual speech recognition systems designers is the integration (fusion) of the audio and visual information into an automatic speech recognizer. For this purpose, a reliability-driven sensor fusion scheme is developed. A statistical approach is developed to account for the dynamic changes in reliability. This is done in two steps. The first step derives suitable statistical reliability measures for the individual information streams. These measures are based on the dispersion of the N-best hypotheses of the individual stream classifiers. The second step finds an optimal mapping between the reliability measures and the stream weights that maximizes the conditional likelihood. For this purpose, genetic algorithms are used. The addressed issues are challenging problems and are substantial for developing an audio-visual speech recognition framework that can maximize the information gather about the words uttered and minimize the impact of noise

    Singing voice analysis/synthesis

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2003.Includes bibliographical references (p. 109-115).The singing voice is the oldest and most variable of musical instruments. By combining music, lyrics, and expression, the voice is able to affect us in ways that no other instrument can. As listeners, we are innately drawn to the sound of the human voice, and when present it is almost always the focal point of a musical piece. But the acoustic flexibility of the voice in intimating words, shaping phrases, and conveying emotion also makes it the most difficult instrument to model computationally. Moreover, while all voices are capable of producing the common sounds necessary for language understanding and communication, each voice possesses distinctive features independent of phonemes and words. These unique acoustic qualities are the result of a combination of innate physical factors and expressive characteristics of performance, reflecting an individual's vocal identity. A great deal of prior research has focused on speech recognition and speaker identification, but relatively little work has been performed specifically on singing. There are significant differences between speech and singing in terms of both production and perception. Traditional computational models of speech have focused on the intelligibility of language, often sacrificing sound quality for model simplicity. Such models, however, are detrimental to the goal of singing, which relies on acoustic authenticity for the non-linguistic communication of expression and emotion. These differences between speech and singing dictate that a different and specialized representation is needed to capture the sound quality and musicality most valued in singing.(cont.) This dissertation proposes an analysis/synthesis framework specifically for the singing voice that models the time-varying physical and expressive characteristics unique to an individual voice. The system operates by jointly estimating source-filter voice model parameters, representing vocal physiology, and modeling the dynamic behavior of these features over time to represent aspects of expression. This framework is demonstrated to be useful for several applications, such as singing voice coding, automatic singer identification, and voice transformation.by Youngmoo Edmund Kim.Ph.D

    Deep Transfer Learning for Automatic Speech Recognition: Towards Better Generalization

    Full text link
    Automatic speech recognition (ASR) has recently become an important challenge when using deep learning (DL). It requires large-scale training datasets and high computational and storage resources. Moreover, DL techniques and machine learning (ML) approaches in general, hypothesize that training and testing data come from the same domain, with the same input feature space and data distribution characteristics. This assumption, however, is not applicable in some real-world artificial intelligence (AI) applications. Moreover, there are situations where gathering real data is challenging, expensive, or rarely occurring, which can not meet the data requirements of DL models. deep transfer learning (DTL) has been introduced to overcome these issues, which helps develop high-performing models using real datasets that are small or slightly different but related to the training data. This paper presents a comprehensive survey of DTL-based ASR frameworks to shed light on the latest developments and helps academics and professionals understand current challenges. Specifically, after presenting the DTL background, a well-designed taxonomy is adopted to inform the state-of-the-art. A critical analysis is then conducted to identify the limitations and advantages of each framework. Moving on, a comparative study is introduced to highlight the current challenges before deriving opportunities for future research

    The use of acoustic cues in phonetic perception: Effects of spectral degradation, limited bandwidth and background noise

    Get PDF
    Hearing impairment, cochlear implantation, background noise and other auditory degradations result in the loss or distortion of sound information thought to be critical to speech perception. In many cases, listeners can still identify speech sounds despite degradations, but understanding of how this is accomplished is incomplete. Experiments presented here tested the hypothesis that listeners would utilize acoustic-phonetic cues differently if one or more cues were degraded by hearing impairment or simulated hearing impairment. Results supported this hypothesis for various listening conditions that are directly relevant for clinical populations. Analysis included mixed-effects logistic modeling of contributions of individual acoustic cues for various contrasts. Listeners with cochlear implants (CIs) or normal-hearing (NH) listeners in CI simulations showed increased use of acoustic cues in the temporal domain and decreased use of cues in the spectral domain for the tense/lax vowel contrast and the word-final fricative voicing contrast. For the word-initial stop voicing contrast, NH listeners made less use of voice-onset time and greater use of voice pitch in conditions that simulated high-frequency hearing impairment and/or masking noise; influence of these cues was further modulated by consonant place of articulation. A pair of experiments measured phonetic context effects for the "s/sh" contrast, replicating previously observed effects for NH listeners and generalizing them to CI listeners as well, despite known deficiencies in spectral resolution for CI listeners. For NH listeners in CI simulations, these context effects were absent or negligible. Audio-visual delivery of this experiment revealed enhanced influence of visual lip-rounding cues for CI listeners and NH listeners in CI simulations. Additionally, CI listeners demonstrated that visual cues to gender influence phonetic perception in a manner consistent with gender-related voice acoustics. All of these results suggest that listeners are able to accommodate challenging listening situations by capitalizing on the natural (multimodal) covariance in speech signals. Additionally, these results imply that there are potential differences in speech perception by NH listeners and listeners with hearing impairment that would be overlooked by traditional word recognition or consonant confusion matrix analysis
    corecore