
    Robust Raw Waveform Speech Recognition Using Relevance Weighted Representations

    Speech recognition in noisy and channel-distorted scenarios is often challenging, as current acoustic modeling schemes are not adaptive to changes in the signal distribution in the presence of noise. In this work, we develop a novel acoustic modeling framework for noise-robust speech recognition based on a relevance weighting mechanism. The relevance weighting is achieved using a sub-network approach that performs feature selection. A relevance sub-network is applied to the output of the first layer of a convolutional network model operating on raw speech signals, while a second relevance sub-network is applied to the output of the second convolutional layer. The relevance weights for the first layer correspond to an acoustic filterbank selection, while the relevance weights in the second layer perform modulation filter selection. The model is trained for a speech recognition task on noisy and reverberant speech. Speech recognition experiments on multiple datasets (Aurora-4, CHiME-3, VOiCES) reveal that incorporating relevance weighting in the neural network architecture significantly improves speech recognition word error rates, with average relative improvements of 10% over the baseline systems.
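    As a rough illustration of the relevance-weighting idea, the sketch below (assumed shapes, layer sizes and class names; not the authors' implementation) re-weights the outputs of a first convolutional layer operating on raw waveforms using soft selection weights produced by a small sub-network.

```python
# Minimal sketch of relevance weighting on a learned acoustic filterbank.
# All sizes and names (RelevanceSubnet, n_filters, ...) are illustrative assumptions.
import torch
import torch.nn as nn

class RelevanceSubnet(nn.Module):
    """Maps per-filter summary statistics to soft selection weights in [0, 1]."""
    def __init__(self, n_filters: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_filters, hidden), nn.ReLU(),
            nn.Linear(hidden, n_filters), nn.Sigmoid())

    def forward(self, feats):                  # feats: (batch, n_filters, time)
        weights = self.net(feats.mean(dim=-1)) # summarize each filter over time
        return feats * weights.unsqueeze(-1)   # re-weight (soft filterbank selection)

class RawWaveformFrontEnd(nn.Module):
    """First conv layer acts as a learned filterbank; relevance weights gate its output."""
    def __init__(self, n_filters: int = 80, kernel: int = 129):
        super().__init__()
        self.filterbank = nn.Conv1d(1, n_filters, kernel, stride=10, padding=kernel // 2)
        self.relevance = RelevanceSubnet(n_filters)

    def forward(self, wav):                    # wav: (batch, 1, samples)
        return self.relevance(torch.relu(self.filterbank(wav)))

x = torch.randn(2, 1, 16000)                   # two one-second 16 kHz waveforms
print(RawWaveformFrontEnd()(x).shape)          # torch.Size([2, 80, 1600])
```

    A second sub-network of the same form, applied to the second convolutional layer's output, would play the role of the modulation filter selection described above.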

    Sound Object Recognition

    Humans are constantly exposed to a variety of acoustic stimuli, ranging from music and speech to more complex acoustic scenes like a noisy marketplace. The human auditory perception mechanism is able to analyze these different kinds of sounds and extract meaningful information, suggesting that the same processing mechanism is capable of representing different sound classes. In this thesis, we test this hypothesis by proposing a high-dimensional sound object representation framework that captures the various modulations of sound by performing a multi-resolution mapping. We then show that this model is able to capture a wide variety of sound classes (speech, music, soundscapes) by applying it to the tasks of speech recognition, speaker verification, musical instrument recognition and acoustic soundscape recognition. We propose a multi-resolution analysis approach that captures the detailed variations in the spectral characteristics as a basis for recognizing sound objects. We then show how such a system can be fine-tuned to capture both the message information (speech content) and the messenger information (speaker identity). This system is shown to outperform state-of-the-art systems for noise robustness at both automatic speech recognition and speaker verification tasks. The proposed analysis scheme, with its ability to analyze temporal modulations, was used to capture musical sound objects. We showed that, using a model of cortical processing, we were able to accurately replicate human perceptual similarity judgments and obtain good classification performance on a large set of musical instruments. We also show that neither the spectral features alone nor the marginals of the proposed model are sufficient to capture human perception. Moreover, we were able to extend this model to continuous musical recordings by proposing a new method to extract notes from the recordings. Complex acoustic scenes like a sports stadium have multiple sources producing sounds at the same time. We show that the proposed representation scheme can not only capture these complex acoustic scenes, but also provides a flexible mechanism to adapt to target sources of interest. The human auditory perception system is known to be a complex system with both bottom-up analysis pathways and top-down feedback mechanisms. The top-down feedback enhances the output of the bottom-up system to better realize the target sounds. In this thesis we propose an implementation of a top-down attention module which is complementary to the high-dimensional acoustic feature extraction mechanism. This attention module is a distributed system operating at multiple stages of representation, effectively acting as a retuning mechanism that adapts the same system to different tasks. We showed that such an adaptation mechanism substantially improves the performance of the system at detecting the target source in the presence of various distracting background sources.
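    The multi-resolution modulation mapping at the heart of this representation can be sketched as filtering a log-spectrogram with a bank of 2-D Gabor filters tuned to different temporal rates and spectral scales. The snippet below is only an illustrative approximation with assumed rates, scales and frame parameters, not the cortical model used in the thesis.

```python
# Toy multi-resolution (rate/scale) modulation analysis of a spectrogram.
# The rate/scale values and kernel size are illustrative assumptions.
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(rate_hz, scale_cpo, frame_rate=100.0, bins_per_oct=12, size=33):
    """2-D Gabor tuned to a temporal rate (Hz) and spectral scale (cycles/octave)."""
    t = (np.arange(size) - size // 2) / frame_rate       # seconds
    f = (np.arange(size) - size // 2) / bins_per_oct     # octaves
    T, F = np.meshgrid(t, f)
    envelope = np.exp(-0.5 * ((T * rate_hz) ** 2 + (F * scale_cpo) ** 2))
    return envelope * np.cos(2 * np.pi * (rate_hz * T + scale_cpo * F))

def modulation_features(log_spec, rates=(2, 4, 8, 16), scales=(0.5, 1, 2, 4)):
    """Stack of |modulation-filtered| spectrograms, shape (rate, scale, freq, time)."""
    return np.stack([
        np.stack([np.abs(fftconvolve(log_spec, gabor_kernel(r, s), mode="same"))
                  for s in scales])
        for r in rates])

log_spec = np.random.randn(64, 300)            # stand-in log-spectrogram (freq x frames)
print(modulation_features(log_spec).shape)     # (4, 4, 64, 300)
```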

    Computational Models of Representation and Plasticity in the Central Auditory System

    The performance for automated speech processing tasks like speech recognition and speech activity detection rapidly degrades in challenging acoustic conditions. It is therefore necessary to engineer systems that extract meaningful information from sound while exhibiting invariance to background noise, different speakers, and other disruptive channel conditions. In this thesis, we take a biomimetic approach to these problems, and explore computational strategies used by the central auditory system that underlie neural information extraction from sound. In the first part of this thesis, we explore coding strategies employed by the central auditory system that yield neural responses that exhibit desirable noise robustness. We specifically demonstrate that a coding strategy based on sustained neural firings yields richly structured spectro-temporal receptive fields (STRFs) that reflect the structure and diversity of natural sounds. The emergent receptive fields are comparable to known physiological neuronal properties and can be employed as a signal processing strategy to improve noise invariance in a speech recognition task. Next, we extend the model of sound encoding based on spectro-temporal receptive fields to incorporate the cognitive effects of selective attention. We propose a framework for modeling attention-driven plasticity that induces changes to receptive fields driven by task demands. We define a discriminative cost function whose optimization and solution reflect a biologically plausible strategy for STRF adaptation that helps listeners better attend to target sounds. Importantly, the adaptation patterns predicted by the framework have a close correspondence with known neurophysiological data. We next generalize the framework to act on the spectro-temporal dynamics of task-relevant stimuli, and make predictions for tasks that have yet to be experimentally measured. We argue that our generalization represents a form of object-based attention, which helps shed light on the current debate about auditory attentional mechanisms. Finally, we show how attention-modulated STRFs form a high-fidelity representation of the attended target, and we apply our results to obtain improvements in a speech activity detection task. Overall, the results of this thesis improve our general understanding of central auditory processing, and our computational frameworks can be used to guide further studies in animal models. Furthermore, our models inspire signal processing strategies that are useful for automated speech and sound processing tasks.
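    The attention-driven STRF plasticity can be conveyed, in heavily simplified form, as adapting a receptive field so that its response to a target spectrogram grows relative to its response to a background spectrogram. The toy sketch below (numerical gradients, random stand-in spectrograms, an assumed contrast objective) only illustrates the idea and is not the discriminative cost function or optimization used in the thesis.

```python
# Toy gradient-based STRF adaptation toward a target/background contrast.
# Shapes, step sizes and the objective are illustrative assumptions.
import numpy as np
from scipy.signal import fftconvolve

def response(strf, spec):
    """Mean squared response of an STRF filtered over a spectrogram."""
    return np.mean(fftconvolve(spec, strf, mode="same") ** 2)

def adapt_strf(strf, target_spec, background_spec, lr=1e-2, steps=5, eps=1e-4):
    """Finite-difference ascent on response(target) - response(background)."""
    strf = strf.copy()
    contrast = lambda w: response(w, target_spec) - response(w, background_spec)
    for _ in range(steps):
        grad = np.zeros_like(strf)
        for idx in np.ndindex(strf.shape):     # crude numerical gradient
            bumped = strf.copy()
            bumped[idx] += eps
            grad[idx] = (contrast(bumped) - contrast(strf)) / eps
        strf += lr * grad / (np.linalg.norm(grad) + 1e-12)
    return strf

rng = np.random.default_rng(0)
strf0 = 0.1 * rng.standard_normal((8, 8))      # initial spectro-temporal receptive field
target, background = rng.standard_normal((32, 100)), rng.standard_normal((32, 100))
adapted = adapt_strf(strf0, target, background)
print(response(adapted, target) - response(adapted, background))  # contrast after adaptation
```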

    Modeling speech intelligibility based on the signal-to-noise envelope power ratio


    Low latency modeling of temporal contexts for speech recognition

    This thesis focuses on the development of neural network acoustic models for large vocabulary continuous speech recognition (LVCSR) that satisfy the design goals of low latency and low computational complexity. Low latency enables online speech recognition, and low computational complexity helps reduce the computational cost during both training and inference. Long-span sequential dependencies and sequential distortions in the input vector sequence are a major challenge in acoustic modeling. Recurrent neural networks have been shown to effectively model these dependencies. Specifically, bidirectional long short-term memory (BLSTM) networks provide state-of-the-art performance across several LVCSR tasks. However, the deployment of bidirectional models for online LVCSR is non-trivial due to their large latency, and unidirectional LSTM models are typically preferred. In this thesis we explore the use of hierarchical temporal convolution to model long-span temporal dependencies. These temporal convolution neural networks, termed time-delay neural networks (TDNNs), are extended with a sub-sampled variant that reduces the computational complexity by ~5x, compared to standard TDNNs, during frame-randomized pre-training. These models are shown to be effective in modeling long-span temporal contexts; however, there is a performance gap compared to (B)LSTMs. As recent advancements in acoustic model training have eliminated the need for frame-randomized pre-training, we modify the TDNN architecture to use higher sampling rates, as the increased computation can be amortized over the sequence. These variants of sub-sampled TDNNs provide performance superior to unidirectional LSTM networks, while also affording a lower real-time factor (RTF) during inference. However, we show that BLSTM models outperform both the TDNN and LSTM models. We propose a hybrid architecture interleaving temporal convolution and LSTM layers, which is shown to outperform the BLSTM models. Further, we improve these BLSTM models by using higher frame rates at lower layers, and show that the proposed TDNN-LSTM model performs similarly to these superior BLSTM models while reducing the overall latency to 200 ms. Finally, we describe an online system for reverberation-robust ASR, using the above-described models in conjunction with other data augmentation techniques such as reverberation simulation, which simulates far-field environments, and volume perturbation, which helps tackle volume variation even without gain normalization.
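    The two architectural ideas above can be sketched, for instance in PyTorch, as dilated 1-D convolutions over frames (the sub-sampled TDNN layers) interleaved with a unidirectional LSTM to keep latency low. The layer widths, dilations and output dimension below are illustrative assumptions, not the configurations evaluated in the thesis.

```python
# Minimal TDNN-LSTM sketch: dilated temporal convolutions + unidirectional LSTM.
# All hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class TDNNLayer(nn.Module):
    """Temporal context via a dilated 1-D convolution over frames (sub-sampled splicing)."""
    def __init__(self, in_dim, out_dim, context=3, dilation=1):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=context,
                              dilation=dilation, padding=(context // 2) * dilation)
        self.act = nn.ReLU()

    def forward(self, x):                        # x: (batch, feat, time)
        return self.act(self.conv(x))

class TDNNLSTM(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, n_targets=2000):
        super().__init__()
        self.tdnn1 = TDNNLayer(feat_dim, hidden, dilation=1)
        self.tdnn2 = TDNNLayer(hidden, hidden, dilation=3)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)   # unidirectional: low latency
        self.tdnn3 = TDNNLayer(hidden, hidden, dilation=3)
        self.out = nn.Linear(hidden, n_targets)

    def forward(self, x):                        # x: (batch, frames, feat)
        h = self.tdnn2(self.tdnn1(x.transpose(1, 2)))
        h, _ = self.lstm(h.transpose(1, 2))
        h = self.tdnn3(h.transpose(1, 2)).transpose(1, 2)
        return self.out(h)                       # per-frame acoustic scores

frames = torch.randn(4, 200, 40)                 # (batch, frames, features)
print(TDNNLSTM()(frames).shape)                  # torch.Size([4, 200, 2000])
```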

    Models and Analysis of Vocal Emissions for Biomedical Applications

    The MAVEBA Workshop proceedings, held on a biannual basis, collect the scientific papers presented as both oral and poster contributions during the conference. The main subjects are: the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, as well as biomedical engineering methods for the analysis of voice signals and images, as a support to clinical diagnosis and the classification of vocal pathologies.

    Discriminative features for GMM and i-vector based speaker diarization

    Speaker diarization has received considerable research attention over the last decade. Among the different domains of speaker diarization, diarization in the meeting domain is the most challenging: it usually contains spontaneous speech and is, for example, susceptible to reverberation. The appropriate selection of speech features is one of the factors that affect the performance of speaker diarization systems. Mel Frequency Cepstral Coefficients (MFCC) are the most widely used short-term speech features in speaker diarization. Other factors that affect the performance of speaker diarization systems are the techniques employed to perform both speaker segmentation and speaker clustering. In this thesis, we have proposed the use of jitter and shimmer long-term voice-quality features for both Gaussian Mixture Modeling (GMM) and i-vector based speaker diarization systems. The voice-quality features are used together with the state-of-the-art short-term cepstral and long-term speech features. The long-term features consist of prosody and Glottal-to-Noise Excitation ratio (GNE) descriptors. Firstly, the voice-quality, prosodic and GNE features are stacked in the same feature vector. Then, they are fused with the cepstral coefficients at the score likelihood level for both the proposed GMM and i-vector based speaker diarization systems. For the proposed GMM based speaker diarization system, independent HMM models are estimated from the short-term and long-term speech feature sets. The fusion of the short-term descriptors with the long-term ones in speaker segmentation is carried out by linearly weighting the log-likelihood scores of Viterbi decoding. In the case of speaker clustering, the fusion of the short-term cepstral features with the long-term ones is carried out by linearly fusing the Bayesian Information Criterion (BIC) scores corresponding to these feature sets. For the proposed i-vector based speaker diarization system, speaker segmentation is carried out exactly as in the GMM based system described above. However, the speaker clustering technique is based on the recently introduced factor analysis paradigm. Two sets of i-vectors are extracted from the speaker segmentation hypothesis. Whilst the first i-vector is extracted from the short-term cepstral features, the second one is extracted from the voice-quality, prosody and GNE descriptors. Then, the cosine-distance and Probabilistic Linear Discriminant Analysis (PLDA) scores of the i-vectors are linearly weighted to obtain a fused similarity score. Finally, the fused score is used as the speaker clustering distance. We have also proposed the use of delta dynamic features for speaker clustering. The motivation for using deltas in clustering is that delta dynamic features capture the transitional characteristics of the speech signal, which contain speaker-specific information that is not captured by the static cepstral coefficients. The delta features are used together with the short-term static cepstral coefficients and the long-term speech features (i.e., voice-quality, prosody and GNE) for both GMM and i-vector based speaker diarization systems. The experiments have been carried out on the Augmented Multi-party Interaction (AMI) meeting corpus.
The experimental results show that the use of voice-quality, prosody, GNE and delta dynamic features improves the performance of both GMM and i-vector based speaker diarization systems.
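    The clustering-stage fusion described above can be sketched as a linear combination of per-stream similarity scores. The example below uses cosine similarity between the two i-vector streams with an assumed fusion weight; in the thesis the fusion also covers PLDA scores, and the weight would be tuned on development data.

```python
# Minimal sketch of linear score fusion for speaker clustering.
# Dimensions and the fusion weight alpha are illustrative assumptions.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def fused_similarity(cep_a, cep_b, vq_a, vq_b, alpha=0.75):
    """alpha weights the cepstral stream; (1 - alpha) the voice-quality/prosody/GNE stream."""
    return alpha * cosine(cep_a, cep_b) + (1 - alpha) * cosine(vq_a, vq_b)

rng = np.random.default_rng(0)
seg_a_cep, seg_b_cep = rng.standard_normal(400), rng.standard_normal(400)  # cepstral i-vectors
seg_a_vq, seg_b_vq = rng.standard_normal(200), rng.standard_normal(200)    # long-term-feature i-vectors
print(fused_similarity(seg_a_cep, seg_b_cep, seg_a_vq, seg_b_vq))          # fused clustering similarity
```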