264 research outputs found

    Speech Recognition

    Get PDF
    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes

    Pattern Recognition

    Get PDF
    Pattern recognition is a very wide research field. It involves factors as diverse as sensors, feature extraction, pattern classification, decision fusion, applications and others. The signals processed are commonly one, two or three dimensional, the processing is done in real- time or takes hours and days, some systems look for one narrow object class, others search huge databases for entries with at least a small amount of similarity. No single person can claim expertise across the whole field, which develops rapidly, updates its paradigms and comprehends several philosophical approaches. This book reflects this diversity by presenting a selection of recent developments within the area of pattern recognition and related fields. It covers theoretical advances in classification and feature extraction as well as application-oriented works. Authors of these 25 works present and advocate recent achievements of their research related to the field of pattern recognition

    Multimedia Retrieval

    Get PDF

    ACII 2009: Affective Computing and Intelligent Interaction. Proceedings of the Doctoral Consortium 2009

    Get PDF

    A survey of the application of soft computing to investment and financial trading

    Get PDF

    Robust speech recognition under band-limited channels and other channel distortions

    Full text link
    Tesis doctoral inédita. Universidad Autónoma de Madrid, Escuela Politécnica Superior, junio de 200

    Voice Modeling Methods for Automatic Speaker Recognition

    Get PDF
    Building a voice model means to capture the characteristics of a speaker´s voice in a data structure. This data structure is then used by a computer for further processing, such as comparison with other voices. Voice modeling is a vital step in the process of automatic speaker recognition that itself is the foundation of several applied technologies: (a) biometric authentication, (b) speech recognition and (c) multimedia indexing. Several challenges arise in the context of automatic speaker recognition. First, there is the problem of data shortage, i.e., the unavailability of sufficiently long utterances for speaker recognition. It stems from the fact that the speech signal conveys different aspects of the sound in a single, one-dimensional time series: linguistic (what is said?), prosodic (how is it said?), individual (who said it?), locational (where is the speaker?) and emotional features of the speech sound itself (to name a few) are contained in the speech signal, as well as acoustic background information. To analyze a specific aspect of the sound regardless of the other aspects, analysis methods have to be applied to a specific time scale (length) of the signal in which this aspect stands out of the rest. For example, linguistic information (i.e., which phone or syllable has been uttered?) is found in very short time spans of only milliseconds of length. On the contrary, speakerspecific information emerges the better the longer the analyzed sound is. Long utterances, however, are not always available for analysis. Second, the speech signal is easily corrupted by background sound sources (noise, such as music or sound effects). Their characteristics tend to dominate a voice model, if present, such that model comparison might then be mainly due to background features instead of speaker characteristics. Current automatic speaker recognition works well under relatively constrained circumstances, such as studio recordings, or when prior knowledge on the number and identity of occurring speakers is available. Under more adverse conditions, such as in feature films or amateur material on the web, the achieved speaker recognition scores drop below a rate that is acceptable for an end user or for further processing. For example, the typical speaker turn duration of only one second and the sound effect background in cinematic movies render most current automatic analysis techniques useless. In this thesis, methods for voice modeling that are robust with respect to short utterances and background noise are presented. The aim is to facilitate movie analysis with respect to occurring speakers. Therefore, algorithmic improvements are suggested that (a) improve the modeling of very short utterances, (b) facilitate voice model building even in the case of severe background noise and (c) allow for efficient voice model comparison to support the indexing of large multimedia archives. The proposed methods improve the state of the art in terms of recognition rate and computational efficiency. Going beyond selective algorithmic improvements, subsequent chapters also investigate the question of what is lacking in principle in current voice modeling methods. By reporting on a study with human probands, it is shown that the exclusion of time coherence information from a voice model induces an artificial upper bound on the recognition accuracy of automatic analysis methods. A proof-of-concept implementation confirms the usefulness of exploiting this kind of information by halving the error rate. This result questions the general speaker modeling paradigm of the last two decades and presents a promising new way. The approach taken to arrive at the previous results is based on a novel methodology of algorithm design and development called “eidetic design". It uses a human-in-the-loop technique that analyses existing algorithms in terms of their abstract intermediate results. The aim is to detect flaws or failures in them intuitively and to suggest solutions. The intermediate results often consist of large matrices of numbers whose meaning is not clear to a human observer. Therefore, the core of the approach is to transform them to a suitable domain of perception (such as, e.g., the auditory domain of speech sounds in case of speech feature vectors) where their content, meaning and flaws are intuitively clear to the human designer. This methodology is formalized, and the corresponding workflow is explicated by several use cases. Finally, the use of the proposed methods in video analysis and retrieval are presented. This shows the applicability of the developed methods and the companying software library sclib by means of improved results using a multimodal analysis approach. The sclib´s source code is available to the public upon request to the author. A summary of the contributions together with an outlook to short- and long-term future work concludes this thesis
    • …
    corecore