397 research outputs found

    Bio-motivated features and deep learning for robust speech recognition

    International Mention in the doctoral degree (Mención Internacional en el título de doctor).
    In spite of the enormous leap forward that Automatic Speech Recognition (ASR) technologies have experienced over the last five years, their performance under adverse environmental conditions is still far from that of humans, which prevents their adoption in several real applications. This thesis addresses the robustness of modern ASR systems along two main research lines.
    The first line focuses on modeling the human auditory system to improve the robustness of the feature extraction stage, yielding novel auditory-motivated features. It produces two main contributions. On the one hand, a model of the masking behaviour of the Human Auditory System (HAS) is introduced, based on non-linear filtering of a speech spectro-temporal representation applied simultaneously in the frequency and time domains. This filtering is accomplished with image processing techniques, in particular mathematical morphology operations with a specifically designed Structuring Element (SE) that closely resembles the masking phenomena that take place in the cochlea. On the other hand, the temporal patterns of auditory-nerve firings are modeled. Most conventional acoustic features are based on short-time energy per frequency band and discard the information contained in the temporal patterns. Our contribution is the design of several types of feature extraction schemes based on the synchrony effect of auditory-nerve activity, showing that modeling this effect can indeed improve speech recognition accuracy in the presence of additive noise. Both models are further integrated into the well-known Power Normalized Cepstral Coefficients (PNCC).
    The second line addresses robustness in noisy environments by means of Deep Neural Network (DNN)-based acoustic modeling and, in particular, Convolutional Neural Network (CNN) architectures. A deep residual network scheme is proposed and adapted so that Residual Networks (ResNets), originally intended for image processing tasks, can be used in speech recognition, where the network input is small in comparison with usual image dimensions. We have observed that ResNets on their own already enhance the robustness of the whole system against noisy conditions. Moreover, our experiments demonstrate that their combination with the auditory-motivated features devised in this thesis provides significant improvements in recognition accuracy over other state-of-the-art CNN-based ASR systems under mismatched conditions, while maintaining performance in matched scenarios. The proposed methods have been thoroughly tested and compared with other state-of-the-art proposals on a variety of datasets and conditions. The results obtained show that our methods outperform other state-of-the-art approaches and that they are suitable for practical applications, especially where the operating conditions are unknown.
    [Translated from Spanish] The objective of this thesis is to propose solutions to the problem of robust speech recognition; to that end, two research lines have been pursued. In the first, novel feature extraction schemes are proposed, based on modeling the behaviour of the human auditory system, in particular the masking and synchrony phenomena. In the second, recognition rates are improved through deep learning techniques used together with the proposed features. The main goal of the proposed methods is to improve the accuracy of the recognition system when the operating conditions are unknown, although the opposite case has also been addressed. Specifically, our main proposals are the following.
    First, to simulate the human auditory system in order to improve the recognition rate under difficult conditions, mainly in high-noise situations, by proposing novel feature extraction schemes. Along this direction, our main proposals are:
    • Modeling the masking behaviour of the human auditory system using image processing techniques on the spectrum, specifically by designing a morphological filter that captures this effect.
    • Modeling the synchrony effect that takes place in the auditory nerve.
    • Integrating both models into the well-known Power Normalized Cepstral Coefficients (PNCC).
    Second, the application of deep learning techniques to make the system more robust to noise, in particular deep convolutional neural networks such as residual networks.
    Finally, the application of the proposed features in combination with deep neural networks, with the main goal of obtaining significant improvements when training and test conditions do not match.
    Doctoral programme: Programa Oficial de Doctorado en Multimedia y Comunicaciones. President: Javier Ferreiros López. Secretary: Fernando Díaz de María. Member: Rubén Solera Ureñ
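    The morphological filtering mentioned in this abstract can be illustrated with a minimal sketch. It assumes a flat 3x3 structuring element (`SE_3X3`, a hypothetical stand-in for the specifically designed SE of the thesis) and a tiny hand-made matrix in place of a real spectro-temporal representation. It shows only the generic grayscale operations (dilation, erosion, opening, closing), not the author's masking model.

```python
from typing import Callable, List, Sequence, Tuple

Matrix = List[List[float]]
Offsets = Sequence[Tuple[int, int]]

# Hypothetical flat structuring element: a 3x3 neighbourhood expressed as
# (frequency offset, time offset) pairs. The thesis designs its own SE.
SE_3X3: Offsets = [(df, dt) for df in (-1, 0, 1) for dt in (-1, 0, 1)]


def _morph(spec: Matrix, se: Offsets, pick: Callable) -> Matrix:
    """Apply `pick` (max or min) over the SE neighbourhood of every bin."""
    n_freq, n_time = len(spec), len(spec[0])
    out = []
    for f in range(n_freq):
        row = []
        for t in range(n_time):
            # Neighbours that fall outside the matrix are ignored.
            values = [
                spec[f + df][t + dt]
                for df, dt in se
                if 0 <= f + df < n_freq and 0 <= t + dt < n_time
            ]
            row.append(pick(values))
        out.append(row)
    return out


def dilate(spec: Matrix, se: Offsets = SE_3X3) -> Matrix:
    return _morph(spec, se, max)


def erode(spec: Matrix, se: Offsets = SE_3X3) -> Matrix:
    return _morph(spec, se, min)


def opening(spec: Matrix, se: Offsets = SE_3X3) -> Matrix:
    """Erosion followed by dilation: removes peaks smaller than the SE."""
    return dilate(erode(spec, se), se)


def closing(spec: Matrix, se: Offsets = SE_3X3) -> Matrix:
    """Dilation followed by erosion: fills valleys smaller than the SE."""
    return erode(dilate(spec, se), se)


# Rows are frequency bands, columns are time frames (toy values).
valley = [[5, 1, 5], [5, 1, 5], [5, 1, 5]]
filled = closing(valley)   # the narrow low-energy column is filled in

peak = [[0] * 5 for _ in range(5)]
peak[2][2] = 9
flattened = opening(peak)  # the isolated peak is removed
```

    Because dilation lets a strong component dominate its neighbours in both frequency and time, operations of this family are a natural building block for masking-like filtering; how the thesis shapes its SE is not specified here.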

    Model-based and data-driven techniques for environment-robust speech recognition

    Thesis (Ph.D.), Seoul National University Graduate School, Department of Electrical and Computer Engineering, August 2015. 김남수.
    In this thesis, we propose model-based and data-driven techniques for environment-robust automatic speech recognition. The model-based technique is a feature enhancement method for reverberant noisy environments that improves the performance of a Gaussian mixture model-hidden Markov model (GMM-HMM) system. It is based on the interacting multiple model (IMM) algorithm, which was originally developed for the single-channel scenario. We extend the single-channel IMM algorithm so that it can handle multi-channel inputs under a Bayesian framework. The multi-channel IMM algorithm can track time-varying room impulse responses and background noise by updating the relevant parameters on-line. To limit the growth in computation as the number of microphones increases, a computationally efficient variant is also devised. The performance gain of the proposed method has been confirmed in various simulated and real environmental conditions.
    The data-driven techniques are based on the deep neural network (DNN)-HMM hybrid system. To enhance the performance of the DNN-HMM system in adverse environments, we propose three techniques. Firstly, we propose a novel supervised pre-training technique for the DNN-HMM system. The aim is to initialize the DNN parameters so that they yield abstract features that are robust to variations in the acoustic environment. To achieve this, we first derive the abstract features from an early fine-tuned DNN model trained on a clean speech database. Using the derived abstract features as target values, the standard error back-propagation algorithm with stochastic gradient descent is run to estimate the initial parameters of the DNN. The proposed algorithm was evaluated on the Aurora-4 DB and gave better results than a number of conventional pre-training methods.
    Secondly, new DNN-based robust speech recognition approaches that take advantage of noise estimates are proposed. The novel part of these approaches is that time-varying noise estimates are applied to the DNN as additional inputs. For this, we extract the noise estimates frame by frame with the IMM algorithm, which is known to perform well in tracking slowly varying background noise. The proposed approaches were evaluated on the Aurora-4 DB and performed better than conventional DNN-based robust speech recognition algorithms.
    Finally, a new approach to DNN-based robust speech recognition using soft target labels is proposed. With soft target labeling, each target value of the DNN output is not restricted to 0 or 1 but takes a non-negative value in (0,1), and the target values sum to 1. In this study, the soft target labels are obtained from the forward-backward algorithm, which is well known from HMM training. The proposed method makes DNN training more robust in noisy and unseen conditions. It was evaluated on the Aurora-4 DB and on various mismatched noise test conditions, and performed better than the conventional hard target labeling method.
    Furthermore, an integrated technique that combines the three data-driven algorithms above with the model-based technique is described, and its results are discussed for matched and mismatched noise conditions. In matched noise conditions, the initialization method for the DNN was effective in enhancing recognition performance.
    In mismatched noise conditions, using the noise estimates as a DNN input together with soft target labels gave the best recognition results among all tested combinations of the proposed techniques.
    Contents:
    Abstract
    Contents
    List of Figures
    List of Tables
    1 Introduction
    2 Experimental Environments and Database
        2.1 ASR in Hands-Free Scenario and Feature Extraction
        2.2 Relationship between Clean and Distorted Speech in Feature Domain
        2.3 Database
            2.3.1 TI Digits Corpus
            2.3.2 Aurora-4 DB
    3 Previous Robust ASR Approaches
        3.1 IMM-Based Feature Compensation in Noise Environment
        3.2 Single-Channel Reverberation and Noise-Robust Feature Enhancement Based on IMM
        3.3 Multi-Channel Feature Enhancement for Robust Speech Recognition
        3.4 DNN-Based Robust Speech Recognition
    4 Multi-Channel IMM-Based Feature Enhancement for Robust Speech Recognition
        4.1 Introduction
        4.2 Observation Model in Multi-Channel Reverberant Noisy Environment
        4.3 Multi-Channel Feature Enhancement in a Bayesian Framework
            4.3.1 A Priori Clean Speech Model
            4.3.2 A Priori Model for RIR
            4.3.3 A Priori Model for Background Noise
            4.3.4 State Transition Formulation
            4.3.5 Function Linearization
        4.4 Feature Enhancement Algorithm
        4.5 Incremental State Estimation
        4.6 Experiments
            4.6.1 Simulation Data
            4.6.2 Live Recording Data
            4.6.3 Computational Complexity
        4.7 Summary
    5 Supervised Denoising Pre-Training for Robust ASR with DNN-HMM
        5.1 Introduction
        5.2 Deep Neural Networks
        5.3 Supervised Denoising Pre-Training
        5.4 Experiments
            5.4.1 Feature Extraction and GMM-HMM System
            5.4.2 DNN Structures
            5.4.3 Performance Evaluation
        5.5 Summary
    6 DNN-Based Frameworks for Robust Speech Recognition Using Noise Estimates
        6.1 Introduction
        6.2 DNN-Based Frameworks for Robust ASR
            6.2.1 Robust Feature Enhancement
            6.2.2 Robust Model Training
        6.3 IMM-Based Noise Estimation
        6.4 Experiments
            6.4.1 DNN Structures
            6.4.2 Performance Evaluations
        6.5 Summary
    7 DNN-Based Robust Speech Recognition Using Soft Target Labels
        7.1 Introduction
        7.2 DNN-HMM Hybrid System
        7.3 Soft Target Label Estimation
        7.4 Experiments
            7.4.1 DNN Structures
            7.4.2 Performance Evaluation
            7.4.3 Effects of Control Parameter ξ
            7.4.4 An Integration with SDPT and ESTN Methods
            7.4.5 Performance Evaluation on Various Noise Types
            7.4.6 DNN Training and Decoding Time
        7.5 Summary
    8 Conclusions
    Bibliography
    Summary (in Korean)
    Docto
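    The difference between hard and soft target labels described in this abstract can be made concrete with a small sketch. The soft label values below are invented for illustration; in the thesis they come from the forward-backward algorithm, which is not reproduced here. The helper names (`softmax`, `cross_entropy`, `logit_gradient`) are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over the DNN output layer."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(targets: List[float], probs: List[float]) -> float:
    """Cross-entropy between a target distribution and the DNN output."""
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0.0)


def logit_gradient(targets: List[float], probs: List[float]) -> List[float]:
    """Gradient of the cross-entropy with respect to the logits."""
    return [p - t for p, t in zip(probs, targets)]


# One frame, three HMM states (toy numbers, assumed for illustration).
hard = [0.0, 1.0, 0.0]   # conventional one-hot label
soft = [0.1, 0.7, 0.2]   # hypothetical per-state posteriors, summing to 1

probs = softmax([0.5, 2.0, 1.0])
loss_hard = cross_entropy(hard, probs)
loss_soft = cross_entropy(soft, probs)
grad_soft = logit_gradient(soft, probs)
```

    With the hard label only the correct state contributes to the loss, whereas the soft label also assigns some target mass to competing states; the training loop itself is unchanged because the gradient keeps the same `probs - targets` form.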

    Recognition of human activities and expressions in video sequences using shape context descriptor

    The recognition of objects and classes of objects is important in computer vision because of its applicability in areas such as video surveillance, medical imaging, and retrieval of images and videos from large databases on the Internet. Effective recognition of object classes is still a challenge in vision; hence, there is much interest in improving the recognition rate to keep up with the rising demands of the fields where these techniques are applied. This thesis investigates the recognition of activities and expressions in video sequences using a new descriptor called the spatiotemporal shape context. The shape context is a well-known algorithm that describes the shape of an object based on the mutual distribution of points on its contour; however, it falls short when the distinctive property of an object is not just its shape but also its movement across frames in a video sequence. Since actions and expressions tend to have a motion component that helps distinguish them, the shape-based information from the shape context alone proves insufficient. This thesis proposes new 3D and 4D spatiotemporal shape context descriptors that incorporate changes in motion across frames into the original shape context. Classification results for actions and expressions demonstrate that the spatiotemporal shape context improves on the original shape context for recognizing classes in the activity and expression domains.
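    As a rough illustration of the original 2D shape context that this thesis extends, the sketch below builds a log-polar histogram of contour points relative to one reference point. The bin counts, the base-2 radial spacing, and the function name `shape_context` are assumptions made for this example; the 3D and 4D spatiotemporal variants proposed in the thesis are not reproduced.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def shape_context(points: Sequence[Point], index: int,
                  n_r: int = 3, n_theta: int = 4) -> List[List[int]]:
    """Log-polar histogram of all other points relative to points[index].

    Rows are radial bins (inner to outer), columns are angular bins
    (counter-clockwise from the positive x axis). Needs at least two points.
    """
    px, py = points[index]
    others = [(x - px, y - py)
              for i, (x, y) in enumerate(points) if i != index]
    dists = [math.hypot(dx, dy) for dx, dy in others]
    r_max = max(dists)  # normalising by r_max gives scale invariance
    width = 2 * math.pi / n_theta
    hist = [[0] * n_theta for _ in range(n_r)]
    for (dx, dy), d in zip(others, dists):
        if d == 0.0:
            continue  # a coincident point has no direction
        # Radial bins halve in size towards the centre (assumed base-2 spacing).
        steps_in = min(n_r - 1, int(math.log2(r_max / d)))
        r_bin = n_r - 1 - steps_in
        theta = math.atan2(dy, dx) % (2 * math.pi)
        t_bin = min(n_theta - 1, int(theta / width))
        hist[r_bin][t_bin] += 1
    return hist


# Toy contour: the reference point is the origin.
contour = [(0, 0), (1, 1), (-1, 1), (-3, -3), (3, -3)]
descriptor = shape_context(contour, index=0)
# descriptor == [[0, 0, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
```

    Each contour point gets one such histogram, and two shapes are compared by matching their histograms. A descriptor of this kind captures only spatial layout within a single frame, which is the limitation the abstract points to.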

    Design of reservoir computing systems for the recognition of noise corrupted speech and handwriting


    Reconstruction, Classification, and Segmentation for Computational Microscopy

    This thesis treats two fundamental problems in computational microscopy: image reconstruction for magnetic resonance force microscopy (MRFM) and image classification for electron backscatter diffraction (EBSD).
    In MRFM, as in many inverse problems, the true point spread function (PSF) that blurs the image may be only partially known, and image quality may suffer from this mismatch when standard image reconstruction techniques are applied. To deal with the mismatch, we develop novel Bayesian sparse reconstruction methods that account for possible errors in the PSF of the microscope and for the inherent sparsity of MRFM images. Two methods are proposed: a stochastic method and a variational method. Both jointly estimate the unknown PSF and the unknown image. The proposed reconstruction framework can incorporate sparsity-inducing priors, which address the ill-posedness of this non-convex problem, as well as Markov random field priors, and it can be extended to other image models. To obtain scalable and tractable solutions, a dimensionality reduction technique is applied to the highly nonlinear PSF space. The experiments demonstrate that the proposed methods outperform previous methods.
    In EBSD, we develop novel and robust dictionary-based methods for segmentation and classification of grain and sub-grain structures in polycrystalline materials. Our work is the first in EBSD analysis to use a physics-based forward model (the dictionary), to use full diffraction patterns, and to efficiently classify patterns into grains, boundaries, and anomalies. In particular, unlike previous methods, our method incorporates anomaly detection directly into the segmentation process. The proposed approach also permits super-resolution of grain mantle and grain boundary locations. Finally, the proposed dictionary-based segmentation method performs uncertainty quantification, i.e. it provides p-values, for the classified grain interiors and grain boundaries. We demonstrate that the dictionary-based approach is robust to instrument drift and to material differences that produce small amounts of dictionary mismatch.
    PhD, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/102296/1/seunpark_1.pd

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes

    Development and application of a quantitative analysis method for fluorescence resonance energy transfer localization experiments


    Classification of linear and nonlinear modulations using Bayesian methods

    [Translated from French] Digital modulation recognition consists of identifying, at the receiver of a transmission chain, the alphabet to which the symbols of the transmitted message belong. This recognition is needed in many communication scenarios, for example to secure transmissions by detecting possible unauthorized users, or to determine which terminal is jamming the others. The signal observed at the receiver is generally affected by a number of impairments caused by imperfect synchronization between transmitter and receiver, imperfect demodulation, or imperfect equalization of the transmission channel. We propose several classification methods that cancel the effects of these imperfections in the transmission chain. The received symbols are then corrected and compared with those of the dictionary of transmitted symbols. More precisely, we study three techniques for estimating the posterior distribution of a modulation at the receiver. The first technique estimates the unknown parameters associated with the various impairments affecting the receiver using a Bayesian approach coupled with a Markov Chain Monte Carlo (MCMC) simulation method. A second technique uses the Baum-Welch algorithm, which recursively estimates the posterior distribution of the received signal and determines the most probable modulation from a given catalogue. The last method studied in this thesis corrects phase and frequency synchronization errors with a phase-locked loop. The algorithms considered in this thesis made it possible to recognize a number of linear modulations of the QAM (Quadrature Amplitude Modulation) and PSK (Phase Shift Keying) types, as well as nonlinear modulations of the GMSK (Gaussian Minimum Shift Keying) type.
    ABSTRACT: This thesis studies the classification of digital linear and nonlinear modulations using Bayesian methods. Modulation recognition consists of identifying, at the receiver, the type of modulation used by the transmitter. It is important in many communication scenarios, for example to secure transmissions by detecting unauthorized users, or to determine which transmitter interferes with the others. The received signal is generally affected by a number of impairments. We propose several classification methods that can mitigate the effects of imperfections in transmission channels. More specifically, we study three techniques to estimate the posterior probabilities of the received signals conditional on each modulation. The first technique estimates the unknown parameters associated with various imperfections using a Bayesian approach coupled with Markov Chain Monte Carlo (MCMC) methods. A second technique uses the Baum-Welch (BW) algorithm to estimate the posterior probabilities recursively and determine the most likely modulation type from a catalogue. The last method studied in this thesis corrects synchronization errors (phase and frequency offsets) with a phase-locked loop (PLL). The classification algorithms considered in this thesis can recognize a number of linear modulations such as Quadrature Amplitude Modulation (QAM) and Phase Shift Keying (PSK), and nonlinear modulations such as Gaussian Minimum Shift Keying (GMSK).

    Graph-Based Offline Signature Verification

    Graphs provide a powerful representation formalism that offers great promise to benefit tasks like handwritten signature verification. While most state-of-the-art approaches to signature verification rely on fixed-size representations, graphs are flexible in size and allow modeling local features as well as the global structure of the handwriting. In this article, we present two recent graph-based approaches to offline signature verification: keypoint graphs with approximated graph edit distance and inkball models. We provide a comprehensive description of the methods, propose improvements both in terms of computational time and accuracy, and report experimental results for four benchmark datasets. The proposed methods achieve top results for several benchmarks, highlighting the potential of graph-based signature verification