397 research outputs found
Bio-motivated features and deep learning for robust speech recognition
International Mention in the doctoral degree.
In spite of the enormous leap forward that Automatic Speech
Recognition (ASR) technologies have experienced over the last five years,
their performance under adverse environmental conditions is still far from
that of humans, preventing their adoption in several real applications.
In this thesis the challenge of robustness of modern automatic speech
recognition systems is addressed following two main research lines.
The first one focuses on modeling the human auditory system to
improve the robustness of the feature extraction stage, yielding novel
auditory-motivated features. Two main contributions are produced.
On the one hand, a model of the masking behaviour of the Human
Auditory System (HAS) is introduced, based on the non-linear filtering
of a speech spectro-temporal representation applied simultaneously
to both frequency and time domains. This filtering is accomplished
by using image processing techniques, in particular mathematical
morphology operations with a specifically designed Structuring Element
(SE) that closely resembles the masking phenomena that take
place in the cochlea. On the other hand, the temporal patterns of
auditory-nerve firings are modeled. Most conventional acoustic features
are based on short-time energy per frequency band discarding
the information contained in the temporal patterns. Our contribution
is the design of several types of feature extraction schemes based on
the synchrony effect of auditory-nerve activity, showing that the modeling
of this effect can indeed improve speech recognition accuracy in
the presence of additive noise. Both models are further integrated into
the well known Power Normalized Cepstral Coefficients (PNCC).
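The masking model described above relies on grey-scale morphological filtering of a time-frequency representation. The following is a minimal sketch of that general idea, not the thesis's actual filter: the structuring element `EXAMPLE_SE`, the choice of dilation and closing as the operations, and all function names are assumptions made for illustration. The example SE spreads energy further forward in time than across frequency, loosely mimicking forward masking.

```python
# Hypothetical flat structuring element, given as (d_freq, d_time) offsets.
# It must contain the origin (0, 0) so every neighbourhood is non-empty.
EXAMPLE_SE = [(0, 0), (0, 1), (0, 2), (1, 0), (-1, 0)]


def dilate(spec, se):
    """Grey-scale dilation: each cell takes the maximum over the reflected SE.

    spec is a list of rows (frequency bands), each a list of time frames.
    Offsets that fall outside the matrix are ignored.
    """
    n_f, n_t = len(spec), len(spec[0])
    return [[max(spec[f - df][t - dt] for df, dt in se
                 if 0 <= f - df < n_f and 0 <= t - dt < n_t)
             for t in range(n_t)]
            for f in range(n_f)]


def erode(spec, se):
    """Grey-scale erosion: each cell takes the minimum over the SE."""
    n_f, n_t = len(spec), len(spec[0])
    return [[min(spec[f + df][t + dt] for df, dt in se
                 if 0 <= f + df < n_f and 0 <= t + dt < n_t)
             for t in range(n_t)]
            for f in range(n_f)]


def closing(spec, se):
    """Morphological closing: dilation followed by erosion."""
    return erode(dilate(spec, se), se)


# Toy spectrogram: 3 frequency bands x 5 frames with one strong component.
toy = [[0.0] * 5 for _ in range(3)]
toy[1][1] = 1.0
spread = dilate(toy, EXAMPLE_SE)  # the component now covers its SE footprint
```

With this SE, the single component at band 1, frame 1 is spread to the two following frames and to the two adjacent bands, which is the sense in which the shape of the SE encodes a masking pattern.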
The second research line addresses the problem of robustness in
noisy environments by means of Deep Neural Network (DNN)-based
acoustic modeling and, in particular, Convolutional Neural Network
(CNN) architectures. A deep residual network
scheme is proposed and adapted for our purposes, allowing Residual
Networks (ResNets), originally intended for image processing tasks,
to be used in speech recognition where the network input is small
in comparison with usual image dimensions. We have observed that
ResNets on their own already enhance the robustness of the whole system
against noisy conditions. Moreover, our experiments demonstrate
that their combination with the auditory motivated features devised
in this thesis provides significant improvements in recognition accuracy
in comparison to other state-of-the-art CNN-based ASR systems
under mismatched conditions, while maintaining the performance in
matched scenarios.
The proposed methods have been thoroughly tested and compared
with other state-of-the-art proposals for a variety of datasets and
conditions. The obtained results prove that our methods outperform
other state-of-the-art approaches and reveal that they are suitable for
practical applications, especially where the operating conditions are
unknown.
The aim of this thesis is to propose solutions to the problem of
robust speech recognition; to this end, two lines of research have
been pursued.
In the first line, novel feature extraction schemes are proposed, based
on modeling the behaviour of the human auditory system, in particular
the masking and synchrony phenomena. In the second, we propose
improving recognition rates through deep learning techniques used
together with the proposed features.
The main goal of the proposed methods is to improve the accuracy of
the recognition system when the operating conditions are unknown,
although the opposite case has also been addressed.
Specifically, our main proposals are the following:
• Simulating the human auditory system in order to improve the
recognition rate under difficult conditions, mainly in high-noise
situations, by proposing novel feature extraction schemes.
Following this direction, our main proposals are detailed below:
  • Modeling the masking behaviour of the human auditory system
  using image processing techniques on the spectrum, specifically
  by designing a morphological filter that captures this effect.
  • Modeling the synchrony effect that takes place in the auditory
  nerve.
  • Integrating both models into the well-known Power
  Normalized Cepstral Coefficients (PNCC).
• Applying deep learning techniques in order to make the system
more robust to noise, in particular through deep convolutional
neural networks such as residual networks.
• Finally, applying the proposed features in combination with deep
neural networks, with the main goal of obtaining significant
improvements when training and test conditions do not match.
Official Doctoral Programme in Multimedia and Communications
Chair: Javier Ferreiros López. Secretary: Fernando Díaz de María. Member: Rubén Solera Ureñ
Model-based and data-driven techniques for environment-robust speech recognition
Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, August 2015. Nam Soo Kim (김남수).
In this thesis, we propose model-based and data-driven techniques for environment-robust automatic speech recognition. The model-based technique is a feature enhancement method for reverberant noisy environments that improves the performance of a Gaussian mixture model (GMM)-hidden Markov model (HMM) system. It is based on the interacting multiple model (IMM), which was originally developed for the single-channel scenario. We extend the single-channel IMM algorithm so that it can handle multi-channel inputs under the Bayesian framework. The multi-channel IMM algorithm is capable of tracking time-varying room impulse responses and background noises by updating the relevant parameters in an on-line manner. In order to reduce the computation as the number of microphones increases, a computationally efficient algorithm is also devised. The performance gain of the proposed method has been confirmed in various simulated and real environmental conditions.
The data-driven techniques are based on a deep neural network (DNN)-HMM hybrid system. In order to enhance the performance of the DNN-HMM system in adverse environments, we propose three techniques. Firstly, we propose a novel supervised pre-training technique for the DNN-HMM system to achieve robust speech recognition in adverse environments. In the proposed approach, our aim is to initialize the DNN parameters such that they yield abstract features robust to acoustic environment variations. In order to achieve this, we first derive the abstract features from an early fine-tuned DNN model which is trained on a clean speech database. By using the derived abstract features as the target values, the standard error back-propagation algorithm with the stochastic gradient descent method is performed to estimate the initial parameters of the DNN. The performance of the proposed algorithm was evaluated on the Aurora-4 DB and better results were observed compared to a number of conventional pre-training methods.
Secondly, new DNN-based robust speech recognition approaches that take advantage of noise estimates are proposed. The novel part of the proposed approaches is that the time-varying noise estimates are applied to the DNN as additional inputs. For this, we extract the noise estimates in a frame-by-frame manner from the IMM algorithm, which is known to perform well in tracking slowly varying background noise. The performance of the proposed approaches is evaluated on the Aurora-4 DB and better performance is observed compared to conventional DNN-based robust speech recognition algorithms.
Finally, a new approach to DNN-based robust speech recognition using soft target labels is proposed. Soft target labeling means that each target value of the DNN output is not restricted to 0 or 1 but takes a non-negative value in (0,1), with the values summing to 1. In this study, the soft target labels are obtained from the forward-backward algorithm well known in HMM training. The proposed method makes DNN training more robust in noisy and unseen conditions. The performance of the proposed approach was evaluated on the Aurora-4 DB and various mismatched noise test conditions, and was found to be better than that of the conventional hard target labeling method.
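As an illustration of how per-frame soft labels can come out of the forward-backward algorithm, here is a minimal sketch for a tiny discrete-state HMM. It is a generic textbook computation of state posteriors, not the thesis's training recipe; the function name `state_posteriors` and the toy parameters are assumptions, and the per-frame emission likelihoods are simply given as numbers.

```python
def state_posteriors(pi, A, B):
    """Return gamma[t][i] = P(state i at frame t | all observations).

    pi[i]   : initial state probabilities
    A[i][j] : transition probability from state i to state j
    B[t][i] : likelihood of the observation at frame t under state i
    """
    T, N = len(B), len(pi)

    # Forward pass, rescaled at every frame to avoid numerical underflow.
    alpha = [[0.0] * N for _ in range(T)]
    scale = [0.0] * T
    alpha[0] = [pi[i] * B[0][i] for i in range(N)]
    scale[0] = sum(alpha[0])
    alpha[0] = [a / scale[0] for a in alpha[0]]
    for t in range(1, T):
        alpha[t] = [B[t][j] * sum(alpha[t - 1][i] * A[i][j] for i in range(N))
                    for j in range(N)]
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]

    # Backward pass, using the same scaling factors.
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[t + 1][j] * beta[t + 1][j] for j in range(N))
                   / scale[t + 1]
                   for i in range(N)]

    # Combine and normalise: each row is a soft label that sums to 1.
    gamma = []
    for t in range(T):
        row = [alpha[t][i] * beta[t][i] for i in range(N)]
        z = sum(row)
        gamma.append([g / z for g in row])
    return gamma


# Toy two-state example with three frames (all numbers are made up).
soft_labels = state_posteriors(
    pi=[0.6, 0.4],
    A=[[0.7, 0.3], [0.4, 0.6]],
    B=[[0.9, 0.2], [0.5, 0.5], [0.1, 0.8]],
)
# A hard label would keep only the most probable state per frame.
hard_labels = [row.index(max(row)) for row in soft_labels]
```

Each row of `soft_labels` is a distribution over states that could serve as a DNN training target in place of the one-hot vector implied by `hard_labels`.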
Furthermore, for the data-driven approaches, an integrated technique using the above three algorithms and the model-based technique is described, and the performance results in matched and mismatched noise conditions are discussed. In matched noise conditions, the initialization method for the DNN was effective in enhancing recognition performance. In mismatched noise conditions, the combination of noise estimates as a DNN input and soft target labels showed the best recognition results among all the tested combinations of the proposed techniques.
Contents
List of Figures
List of Tables
1 Introduction
2 Experimental Environments and Database
2.1 ASR in Hands-Free Scenario and Feature Extraction
2.2 Relationship between Clean and Distorted Speech in Feature Domain
2.3 Database
2.3.1 TI Digits Corpus
2.3.2 Aurora-4 DB
3 Previous Robust ASR Approaches
3.1 IMM-Based Feature Compensation in Noise Environment
3.2 Single-Channel Reverberation and Noise-Robust Feature Enhancement Based on IMM
3.3 Multi-Channel Feature Enhancement for Robust Speech Recognition
3.4 DNN-Based Robust Speech Recognition
4 Multi-Channel IMM-Based Feature Enhancement for Robust Speech Recognition
4.1 Introduction
4.2 Observation Model in Multi-Channel Reverberant Noisy Environment
4.3 Multi-Channel Feature Enhancement in a Bayesian Framework
4.3.1 A Priori Clean Speech Model
4.3.2 A Priori Model for RIR
4.3.3 A Priori Model for Background Noise
4.3.4 State Transition Formulation
4.3.5 Function Linearization
4.4 Feature Enhancement Algorithm
4.5 Incremental State Estimation
4.6 Experiments
4.6.1 Simulation Data
4.6.2 Live Recording Data
4.6.3 Computational Complexity
4.7 Summary
5 Supervised Denoising Pre-Training for Robust ASR with DNN-HMM
5.1 Introduction
5.2 Deep Neural Networks
5.3 Supervised Denoising Pre-Training
5.4 Experiments
5.4.1 Feature Extraction and GMM-HMM System
5.4.2 DNN Structures
5.4.3 Performance Evaluation
5.5 Summary
6 DNN-Based Frameworks for Robust Speech Recognition Using Noise Estimates
6.1 Introduction
6.2 DNN-Based Frameworks for Robust ASR
6.2.1 Robust Feature Enhancement
6.2.2 Robust Model Training
6.3 IMM-Based Noise Estimation
6.4 Experiments
6.4.1 DNN Structures
6.4.2 Performance Evaluations
6.5 Summary
7 DNN-Based Robust Speech Recognition Using Soft Target Labels
7.1 Introduction
7.2 DNN-HMM Hybrid System
7.3 Soft Target Label Estimation
7.4 Experiments
7.4.1 DNN Structures
7.4.2 Performance Evaluation
7.4.3 Effects of Control Parameter ξ
7.4.4 An Integration with SDPT and ESTN Methods
7.4.5 Performance Evaluation on Various Noise Types
7.4.6 DNN Training and Decoding Time
7.5 Summary
8 Conclusions
Bibliography
Summary (in Korean)
Recognition of human activities and expressions in video sequences using shape context descriptor
The recognition of objects and classes of objects is important in computer vision because of its applicability in areas such as video surveillance, medical imaging and retrieval of images and videos from large databases on the Internet. Effective recognition of object classes is still a challenge in vision; hence, there is much interest in improving the recognition rate to keep up with the rising demands of the fields where these techniques are applied. This thesis investigates the recognition of activities and expressions in video sequences using a new descriptor called the spatiotemporal shape context. The shape context is a well-known algorithm that describes the shape of an object based on the mutual distribution of points on the contour of the object; however, it falls short when the distinctive property of an object is not just its shape but also its movement across frames in a video sequence. Since actions and expressions tend to have a motion component that makes them easier to distinguish, the shape-based information from the shape context proves insufficient. This thesis proposes new 3D and 4D spatiotemporal shape context descriptors that incorporate changes in motion across frames into the original shape context. Classification results for actions and expressions demonstrate that the spatiotemporal shape context is better than the original shape context at recognizing classes in the activity and expression domains.
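For readers unfamiliar with the original (2D) shape context mentioned above, the following minimal sketch computes the descriptor for one contour point as a log-polar histogram of the relative positions of the other points. It covers only the classic descriptor, not the 3D and 4D spatiotemporal extensions proposed in the thesis; the function name, bin counts and radial limits are assumptions chosen for illustration.

```python
import math


def shape_context(points, index, n_r=5, n_theta=12, r_min=0.125, r_max=2.0):
    """Log-polar histogram of the other points as seen from points[index].

    points : list of (x, y) contour samples
    Returns hist[r_bin][theta_bin] with integer counts.
    """
    n = len(points)

    # Mean pairwise distance, used to normalise radii for scale invariance.
    pair_dists = [math.hypot(points[a][0] - points[b][0],
                             points[a][1] - points[b][1])
                  for a in range(n) for b in range(a + 1, n)]
    mean_d = sum(pair_dists) / len(pair_dists)

    hist = [[0] * n_theta for _ in range(n_r)]
    px, py = points[index]
    log_span = math.log(r_max / r_min)
    for k, (x, y) in enumerate(points):
        if k == index:
            continue
        dx, dy = x - px, y - py
        r = math.hypot(dx, dy) / mean_d
        if r > r_max:
            continue  # too far away to contribute to this descriptor
        r = max(r, r_min)  # very close points fall into the innermost ring
        r_bin = min(int(n_r * math.log(r / r_min) / log_span), n_r - 1)
        theta = math.atan2(dy, dx) % (2.0 * math.pi)
        t_bin = int(theta * n_theta / (2.0 * math.pi)) % n_theta
        hist[r_bin][t_bin] += 1
    return hist


# Toy contour: the four corners of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
descriptor = shape_context(square, 0)
```

Because only relative positions and normalised distances enter the histogram, the descriptor is unchanged when the whole contour is translated, which is part of what makes it a shape (rather than position) description; it carries no information about motion across frames.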
Reconstruction, Classification, and Segmentation for Computational Microscopy
This thesis treats two fundamental problems in computational microscopy: image reconstruction for magnetic resonance force microscopy (MRFM) and image classification for electron backscatter diffraction (EBSD). In MRFM, as in many inverse problems, the true point spread function (PSF) that blurs the image may be only partially known. The image quality may suffer from this possible mismatch when standard image reconstruction techniques are applied. To deal with the mismatch, we develop novel Bayesian sparse reconstruction methods that account for possible errors in the PSF of the microscope and for the inherent sparsity of MRFM images. Two methods are proposed: a stochastic method and a variational method. Both jointly estimate the unknown PSF and the unknown image. Our proposed reconstruction framework has the flexibility to incorporate sparsity-inducing priors, thus addressing the ill-posedness of this non-convex problem, as well as Markov random field priors, and can be extended to other image models. To obtain scalable and tractable solutions, a dimensionality reduction technique is applied to the highly nonlinear PSF space. The experiments clearly demonstrate that the proposed methods have superior performance compared to previous methods.
In EBSD we develop novel and robust dictionary-based methods for segmentation and classification of grain and sub-grain structures in polycrystalline materials. Our work is the first in EBSD analysis to use a physics-based forward model, called the dictionary, to use full diffraction patterns, and to efficiently classify patterns into grains, boundaries, and anomalies. In particular, unlike previous methods, our method incorporates anomaly detection directly into the segmentation process. The proposed approach also permits super-resolution of grain mantle and grain boundary locations. Finally, the proposed dictionary-based segmentation method performs uncertainty quantification, i.e. p-values, for the classified grain interiors and grain boundaries. We demonstrate that the dictionary-based approach is robust to instrument drift and material differences that produce small amounts of dictionary mismatch.
PhD, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/102296/1/seunpark_1.pd
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and the methods for speech feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems, and in other speech processing applications that are able to operate in real-world environments, such as mobile communication services and smart homes.
Classification of linear and nonlinear modulations using Bayesian methods
This thesis studies the classification of digital linear and nonlinear modulations using Bayesian methods. Modulation recognition consists of identifying, at the receiver of a transmission chain, the alphabet to which the symbols of the transmitted message belong. It is needed in many communication scenarios, for example to secure transmissions by detecting unauthorized users, or to determine which terminal is jamming the others. The received signal is generally affected by a number of impairments caused by imperfect synchronization between transmitter and receiver, imperfect demodulation, and imperfect equalization of the transmission channel. We propose several classification methods that mitigate the effects of these imperfections in the transmission chain. The received symbols are corrected and then compared with those of the dictionary of transmitted symbols. More specifically, we study three techniques to estimate the posterior probabilities of the received signals conditional on each modulation. The first technique estimates the unknown parameters associated with the various imperfections using a Bayesian approach coupled with Markov Chain Monte Carlo (MCMC) methods. A second technique uses the Baum-Welch (BW) algorithm to estimate the posterior probabilities recursively and to determine the most likely modulation type from a given catalogue. The last method studied in this thesis corrects synchronization errors (phase and frequency offsets) with a phase-locked loop (PLL). The classification algorithms considered in this thesis can recognize a number of linear modulations, such as Quadrature Amplitude Modulation (QAM) and Phase Shift Keying (PSK), as well as nonlinear modulations such as Gaussian Minimum Shift Keying (GMSK).
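To make the idea of a posterior over a catalogue of modulations concrete, here is a minimal sketch under strong simplifying assumptions that the thesis does not make: perfect synchronization, an additive white Gaussian noise channel with known noise variance, equiprobable symbols and a uniform prior over modulations. None of the MCMC, Baum-Welch or PLL machinery is reproduced, and all function names are illustrative.

```python
import cmath
import math
import random


def psk(m):
    """Unit-energy M-PSK constellation."""
    return [cmath.exp(2j * math.pi * k / m) for k in range(m)]


def qam16():
    """16-QAM constellation normalised to unit average energy."""
    levels = [-3, -1, 1, 3]
    pts = [complex(a, b) for a in levels for b in levels]
    scale = math.sqrt(sum(abs(p) ** 2 for p in pts) / len(pts))
    return [p / scale for p in pts]


def log_likelihood(received, constellation, noise_var):
    """Log-likelihood of i.i.d. received symbols under one constellation."""
    total = 0.0
    for y in received:
        exps = [-abs(y - s) ** 2 / noise_var for s in constellation]
        m = max(exps)  # log-sum-exp trick for numerical stability
        total += m + math.log(sum(math.exp(e - m) for e in exps))
        total -= math.log(len(constellation) * math.pi * noise_var)
    return total


def posterior(received, catalogue, noise_var):
    """Posterior over modulation names, assuming a uniform prior."""
    ll = {name: log_likelihood(received, c, noise_var)
          for name, c in catalogue.items()}
    m = max(ll.values())
    weights = {name: math.exp(v - m) for name, v in ll.items()}
    z = sum(weights.values())
    return {name: w / z for name, w in weights.items()}


def simulate(constellation, n, noise_var):
    """Draw n noisy symbols; noise_var is the total complex noise variance."""
    sigma = math.sqrt(noise_var / 2.0)
    return [random.choice(constellation)
            + complex(random.gauss(0.0, sigma), random.gauss(0.0, sigma))
            for _ in range(n)]


catalogue = {"BPSK": psk(2), "QPSK": psk(4), "8PSK": psk(8), "16QAM": qam16()}
random.seed(0)
rx = simulate(catalogue["QPSK"], 200, 0.05)
post = posterior(rx, catalogue, 0.05)
best = max(post, key=post.get)  # most probable modulation in the catalogue
```

In practice the phase, frequency offset and channel are unknown, which is exactly why the methods summarized above estimate or marginalize those parameters before comparing modulations.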
Graph-Based Offline Signature Verification
Graphs provide a powerful representation formalism that offers great promise
to benefit tasks like handwritten signature verification. While most
state-of-the-art approaches to signature verification rely on fixed-size
representations, graphs are flexible in size and allow modeling local features
as well as the global structure of the handwriting. In this article, we present
two recent graph-based approaches to offline signature verification: keypoint
graphs with approximated graph edit distance and inkball models. We provide a
comprehensive description of the methods, propose improvements both in terms of
computational time and accuracy, and report experimental results for four
benchmark datasets. The proposed methods achieve top results for several
benchmarks, highlighting the potential of graph-based signature verification.
- …