
    Speech recognition in noise using weighted matching algorithms


    Discriminative preprocessing of speech: towards improving biometric authentication

    In the context of the SecurePhone project, a multimodal user authentication system was developed for implementation on a PDA. Extending this system, we investigate biometric feature enhancement and multi-feature fusion with the aim of improving user authentication accuracy. In this dissertation, a general framework for feature enhancement is proposed which uses a multilayer perceptron (MLP) to achieve optimal speaker discrimination. First, to train this MLP, a subset of speakers (the speaker basis) is used to represent the underlying characteristics of the given acoustic feature space. Second, the size of the speaker basis is found to be among the crucial factors affecting the performance of a speaker recognition system. Third, the selection of the speaker basis is found to influence system performance as well. Based on this observation, an automatic speaker selection approach is proposed on the basis of the maximal average between-class variance. Tests in a variety of conditions, including clean, noisy and telephone speech, show that this approach can improve the performance of speaker recognition systems.
This approach, which is applied here to feature enhancement for speaker recognition, can be expected to be effective with other biometric modalities besides speech as well. Further, an alternative feature representation is proposed in this dissertation, derived from what we call speaker voice signatures (SVS). These are trajectories in a Kohonen self-organising map (SOM) which has been trained to represent the acoustic space. This feature representation is found to be somewhat complementary to the baseline feature set, suggesting that the two can be fused to achieve improved performance in speaker recognition. Finally, the dissertation closes with a number of potential extensions of the proposed approaches. Keywords: feature enhancement, MLP, SOM, speaker basis selection, speaker recognition, biometric, authentication, verification
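The basis-selection criterion described above, picking the subset of speakers whose class means are maximally spread out, can be illustrated with a small greedy sketch. The function names and the use of per-speaker mean vectors are illustrative assumptions, not the dissertation's exact features or selection procedure:

```python
import numpy as np

def between_class_variance(means, global_mean):
    """Average squared distance of class (speaker) means from the global mean."""
    diffs = means - global_mean
    return float(np.mean(np.sum(diffs * diffs, axis=1)))

def select_speaker_basis(speaker_means, basis_size):
    """Greedily grow a speaker basis: at each step add the speaker whose
    inclusion yields the largest average between-class variance."""
    selected = []
    remaining = list(range(len(speaker_means)))
    while len(selected) < basis_size:
        best, best_score = None, -1.0
        for cand in remaining:
            means = speaker_means[selected + [cand]]
            score = between_class_variance(means, means.mean(axis=0))
            if score > best_score:
                best, best_score = cand, score
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two tight clusters of speaker means, the greedy pass selects speakers from distinct clusters, which is the intended effect of maximising between-class spread.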

    Voice biometrics under mismatched noise conditions

    This thesis describes research into effective voice biometrics (speaker recognition) under mismatched noise conditions. Over the last two decades, this class of biometrics has been the subject of considerable research due to its various applications in such areas as telephone banking, remote access control and surveillance. One of the main challenges associated with the deployment of voice biometrics in practice is that of undesired variations in speech characteristics caused by environmental noise. Such variations can in turn lead to a mismatch between the corresponding test and reference material from the same speaker. This is found to adversely affect the performance of speaker recognition in terms of accuracy. To address the above problem, a novel approach is introduced and investigated. The proposed method is based on minimising the noise mismatch between reference speaker models and the given test utterance, and involves a new form of Test-Normalisation (T-Norm) for further enhancing matching scores under the aforementioned adverse operating conditions. Through experimental investigations, based on the two main classes of speaker recognition (i.e. verification and open-set identification), it is shown that the proposed approach can significantly improve the performance accuracy under mismatched noise conditions. In order to further improve the recognition accuracy in severe mismatch conditions, an enhancement of the above method is proposed. This enhancement, which provides a closer adjustment of the reference speaker models to the noise condition in the test utterance, is shown to considerably increase the accuracy in extreme cases of noisy test data. Moreover, to tackle the computational burden associated with the use of the enhanced approach with open-set identification, an efficient algorithm for its realisation in this context is introduced and evaluated.
The thesis presents a detailed description of the research undertaken, describes the experimental investigations and provides a thorough analysis of the outcomes.
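The T-Norm step mentioned above can be shown in isolation: the claimed-speaker score for a test utterance is centred and scaled by the statistics of the scores that same utterance obtains against a cohort of other speaker models. This is a minimal textbook sketch; the thesis's contribution is a noise-adapted variant of this normalisation, which is not reproduced here:

```python
import statistics

def t_norm(raw_score, cohort_scores):
    """Test-normalisation: normalise the claimed-speaker score by the mean
    and standard deviation of cohort-model scores on the same utterance."""
    mu = statistics.mean(cohort_scores)
    sigma = statistics.stdev(cohort_scores)
    return (raw_score - mu) / sigma
```

Because the cohort is scored on the exact test utterance, the normalisation absorbs utterance-dependent effects (such as background noise) that shift all model scores together.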

    Evaluation of preprocessors for neural network speaker verification


    Singing voice analysis/synthesis

    Thesis (Ph. D.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2003. Includes bibliographical references (p. 109-115). The singing voice is the oldest and most variable of musical instruments. By combining music, lyrics, and expression, the voice is able to affect us in ways that no other instrument can. As listeners, we are innately drawn to the sound of the human voice, and when present it is almost always the focal point of a musical piece. But the acoustic flexibility of the voice in intimating words, shaping phrases, and conveying emotion also makes it the most difficult instrument to model computationally. Moreover, while all voices are capable of producing the common sounds necessary for language understanding and communication, each voice possesses distinctive features independent of phonemes and words. These unique acoustic qualities are the result of a combination of innate physical factors and expressive characteristics of performance, reflecting an individual's vocal identity. A great deal of prior research has focused on speech recognition and speaker identification, but relatively little work has been performed specifically on singing. There are significant differences between speech and singing in terms of both production and perception. Traditional computational models of speech have focused on the intelligibility of language, often sacrificing sound quality for model simplicity. Such models, however, are detrimental to the goal of singing, which relies on acoustic authenticity for the non-linguistic communication of expression and emotion. These differences between speech and singing dictate that a different and specialized representation is needed to capture the sound quality and musicality most valued in singing.
This dissertation proposes an analysis/synthesis framework specifically for the singing voice that models the time-varying physical and expressive characteristics unique to an individual voice. The system operates by jointly estimating source-filter voice model parameters, representing vocal physiology, and modeling the dynamic behavior of these features over time to represent aspects of expression. This framework is demonstrated to be useful for several applications, such as singing voice coding, automatic singer identification, and voice transformation. By Youngmoo Edmund Kim, Ph.D.
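As a rough illustration of the source-filter view underlying this framework, the sketch below drives an all-pole "vocal tract" filter with a periodic impulse train standing in for the glottal source. The function names, filter order, and coefficients are illustrative assumptions, not the dissertation's actual voice model:

```python
import numpy as np

def impulse_train(n_samples, period):
    """Glottal-source stand-in: a periodic impulse train at the pitch period."""
    exc = np.zeros(n_samples)
    exc[::period] = 1.0
    return exc

def all_pole_filter(excitation, a):
    """Vocal-tract stand-in: y[n] = x[n] - sum_k a[k] * y[n-k-1],
    i.e. a direct-form recursion with denominator coefficients a."""
    y = np.zeros_like(excitation)
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, ak in enumerate(a):
            if n - k - 1 >= 0:
                acc -= ak * y[n - k - 1]
        y[n] = acc
    return y
```

Analysis inverts this picture: estimating the filter coefficients (vocal physiology) and the excitation (pitch, voicing) jointly from the recorded waveform, then tracking how they evolve over time to capture expression.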

    Robot environment learning with a mixed-linear probabilistic state-space model

    This thesis proposes the use of a probabilistic state-space model with mixed-linear dynamics for learning to predict a robot's experiences. It is motivated by a desire to bridge the gap between traditional models with predefined objective semantics on the one hand, and the biologically-inspired "black box" behavioural paradigm on the other. A novel EM-type algorithm for the model is presented, which is less computationally demanding than the Monte Carlo techniques developed for use in (for example) visual applications. The algorithm's E-step is slightly approximative, but an extension is described which would in principle make it asymptotically correct. Investigation using synthetically sampled data shows that the uncorrected E-step can in any case make correct inferences about quite complicated systems. Results collected from two simulated mobile robot environments support the claim that mixed-linear models can capture both discontinuous and continuous structure in the world in an intuitively natural manner; while they proved to perform only slightly better than simpler autoregressive hidden Markov models on these simple tasks, it is possible to claim tentatively that they might scale more effectively to environments in which trends over time play a larger role. Bayesian confidence regions, which are easily obtained from the mixed-linear model, proved to be an effective guard against making over-confident predictions outside its area of competence. A section on future extensions discusses how the model's easy invertibility could be harnessed to the ultimate aim of choosing actions, from a continuous space of possibilities, which maximise the robot's expected payoff over several steps into the future.
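A single linear-Gaussian component of such a state-space model can be tracked with the standard Kalman predict/update recursion; a mixed-linear model can be viewed as switching among several such components, which is what makes exact inference hard and motivates the EM-type approximation. The one-component sketch below uses the usual textbook matrix names (A, C, Q, R), which are assumptions for illustration, not notation taken from the thesis:

```python
import numpy as np

def kalman_step(mu, P, A, C, Q, R, y):
    """One predict/update cycle for a linear-Gaussian state-space model:
    state mean mu and covariance P, dynamics A, observation map C,
    process noise Q, observation noise R, new observation y."""
    # Predict the next state distribution under the linear dynamics.
    mu_pred = A @ mu
    P_pred = A @ P @ A.T + Q
    # Update with the observation.
    S = C @ P_pred @ C.T + R              # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)   # Kalman gain
    mu_new = mu_pred + K @ (y - C @ mu_pred)
    P_new = (np.eye(len(mu)) - K @ C) @ P_pred
    return mu_new, P_new
```

The posterior covariance P also yields the Bayesian confidence regions mentioned above: a prediction far outside the region implied by P signals that the model is operating beyond its competence.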

    Hidden Markov Models

    Hidden Markov Models (HMMs), although known for decades, have seen a surge of applications in recent years and are still under active development. This book presents theoretical issues and a variety of HMM applications in speech recognition and synthesis, medicine, neurosciences, computational biology, bioinformatics, seismology, environment protection and engineering. I hope that the reader will find this book useful and helpful for their own research.
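As a concrete taste of the basic machinery common to all of these applications, the forward algorithm below computes the likelihood of an observation sequence under a discrete HMM. The toy parameters in the usage test are illustrative only:

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Forward algorithm: P(obs) under an HMM with initial distribution pi,
    state-transition matrix A and emission matrix B (states x symbols)."""
    alpha = pi * B[:, obs[0]]            # joint prob of state and first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate, then weight by emission
    return float(alpha.sum())
```

Each step costs O(N^2) for N states, so the whole pass is linear in the sequence length, which is why the forward recursion (and its backward twin) underlies both decoding and Baum-Welch training.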

    Fast incremental learning of stochastic context-free grammars in radar electronic support

    Radar Electronic Support (ES) involves the passive search for, interception, location, analysis and identification of radiated electromagnetic energy for military purposes. Although Stochastic Context-Free Grammars (SCFGs) appear promising for the recognition of radar emitters, and for the estimation of their respective level of threat in radar ES systems, the well-known techniques for learning their production rule probabilities are very computationally demanding. The most popular methods for this task are the Inside-Outside (IO) algorithm, which maximizes the likelihood of a data set, and the Viterbi Score (VS) algorithm, which maximizes the likelihood of its best parse trees. For each iteration, their time complexity is cubic in the length of the sequences in the training set and in the number of non-terminal symbols in the grammar. Since applications of radar ES require timely protection against threats, fast techniques for learning SCFG probabilities are needed. Moreover, in radar ES applications, new information from a battlefield or other sources often becomes available at different points in time. In order to rapidly reflect changes in operational environments, fast incremental learning of SCFG probabilities is therefore an undisputed asset. Several techniques have been developed to accelerate the computation of the production rule probabilities of SCFGs. In the first part of this thesis, three fast alternatives, called graphical EM (gEM), Tree Scanning (TS) and HOLA, are compared from several perspectives: perplexity, state estimation, ability to detect MFRs, time and memory complexity, and convergence time. Estimation of the average-case and worst-case execution time and storage requirements allows for the assessment of complexity, while computer simulations, performed using radar pulse data, facilitate the assessment of the other performance measures.
An experimental protocol has been defined such that the impact on performance of factors like training set size and the level of ambiguity of the grammars may be observed. In addition, since VS is known to have a lower overall computational cost in practice, VS versions of the original IO-based gEM and TS have also been proposed and compared. Results indicate that both gEM(IO) and TS(IO) provide the same level of accuracy, yet their resource requirements mostly vary as a function of the ambiguity of the grammars. Furthermore, for a similar quality of results, the gEM(VS) and TS(VS) techniques provide significantly lower convergence times and time complexities per iteration in practice than do gEM(IO) and TS(IO). All of these algorithms may provide a greater level of accuracy than HOLA, yet their computational complexity may be orders of magnitude higher. Finally, HOLA is an on-line technique that naturally allows for incremental learning of production rule probabilities. In the second part of this thesis, two new incremental versions of gEM, called Incremental gEM (igEM) and On-Line Incremental gEM (oigEM), are proposed and compared to HOLA. They allow an SCFG to efficiently learn new training sequences incrementally, without retraining from scratch on all training data. An experimental protocol has been defined such that the impact on performance of factors like the size of new data blocks for incremental learning, and the level of ambiguity of the MFR grammars, may be observed. Results indicate that, unlike HOLA, incremental learning of training data blocks with igEM and oigEM provides the same level of accuracy as learning from all cumulative data from scratch, even for relatively small data blocks. As expected, incremental learning significantly reduces the overall time and memory complexities associated with updating SCFG probabilities.
Finally, it appears that while the computational complexity and memory requirements of igEM and oigEM may be greater than those of HOLA, they both provide a higher level of accuracy.
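The cubic per-iteration cost mentioned above comes from the inside (CYK-style) pass over all spans and split points of each sequence, which IO, VS, gEM and TS all build on. The sketch below computes inside probabilities for a grammar in Chomsky normal form; the dictionary-based rule encoding and the toy grammar in the test are illustrative assumptions, not an MFR grammar from the thesis:

```python
from collections import defaultdict

def inside_probability(sequence, binary_rules, unary_rules, start="S"):
    """Inside algorithm for a SCFG in Chomsky normal form.
    binary_rules maps (A, B, C) -> P(A -> B C);
    unary_rules maps (A, w) -> P(A -> w).
    Returns the total likelihood of the terminal sequence."""
    n = len(sequence)
    inside = defaultdict(float)  # (i, j, A) -> prob that A yields span i..j
    for i, w in enumerate(sequence):
        for (A, t), p in unary_rules.items():
            if t == w:
                inside[(i, i, A)] += p
    for span in range(2, n + 1):          # widen spans bottom-up
        for i in range(n - span + 1):
            j = i + span - 1
            for (A, B, C), p in binary_rules.items():
                for k in range(i, j):     # every split point: cubic overall
                    inside[(i, j, A)] += p * inside[(i, k, B)] * inside[(k + 1, j, C)]
    return inside[(0, n - 1, start)]
```

The triple loop over span, start position and split point is where the cubic factor in sequence length arises, which is precisely the cost the gEM and TS variants aim to reduce in practice.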

    Acoustical measurements on stages of nine U.S. concert halls
