138 research outputs found
ASR Feature Extraction with Morphologically-Filtered Power-Normalized Cochleograms
Proceedings of: 15th Annual Conference of the International Speech Communication Association. Singapore, September 14-18, 2014.In this paper we present advances in the modeling of the masking behavior of the Human Auditory System to enhance the robustness of the feature extraction stage in Automatic Speech Recognition. The solution adopted is based on a non-linear filtering of a spectro-temporal representation applied simultaneously on both the frequency and time domains, by processing it using mathematical morphology operations as if it were an image. A particularly important component of this architecture is the so called structuring element: biologically-based considerations are addressed in the present contribution to design an element that closely resembles the masking phenomena taking place in the cochlea. The second feature of this contribution is the choice of underlying spectro-temporal representation. The best results were achieved by the representation introduced as part of the Power Normalized Cepstral Coefficients together with a spectral subtraction step. On the Aurora 2 noisy continuous digits task, we report relative error reductions of 18.7% compared to PNCC and 39.5% compared to MFCC.This contribution has been supported by an Airbus Defense and Space Grant (Open Innovation - SAVIER) and Spanish Government-CICYT project 2011-26807/TEC.Publicad
ROBUST SPEAKER RECOGNITION BASED ON LATENT VARIABLE MODELS
Automatic speaker recognition in uncontrolled environments is a very challenging task due to channel distortions, additive noise and reverberation. To address these issues, this thesis studies probabilistic latent variable models of short-term spectral information that leverage large amounts of data to achieve robustness in challenging conditions.
Current speaker recognition systems represent an entire speech utterance as a single point in a high-dimensional space. This representation is known as "supervector". This thesis starts by analyzing the properties of this representation. A novel visualization procedure of supervectors is presented by which qualitative insight about the information being captured is obtained. We then propose the use of an overcomplete dictionary to explicitly decompose a supervector into a speaker-specific component and an undesired variability component. An algorithm to learn the dictionary from a large collection of data is discussed and analyzed. A subset of the entries of the dictionary is learned to represent speaker-specific information and another subset to represent distortions. After encoding the supervector as a linear combination of the dictionary entries, the undesired variability is removed by discarding the contribution of the distortion components. This paradigm is closely related to the previously proposed paradigm of Joint Factor Analysis modeling of supervectors. We establish a connection between the two approaches and show how our proposed method provides improvements in terms of computation and recognition accuracy.
An alternative way to handle undesired variability in supervector representations is to first project them into a lower dimensional space and then to model them in the reduced subspace. This low-dimensional projection is known as "i-vector". Unfortunately, i-vectors exhibit non-Gaussian behavior, and direct statistical modeling requires the use of heavy-tailed distributions for optimal performance. These approaches lack closed-form solutions, and therefore are hard to analyze. Moreover, they do not scale well to large datasets. Instead of directly modeling i-vectors, we propose to first apply a non-linear transformation and then use a linear-Gaussian model. We present two alternative transformations and show experimentally that the transformed i-vectors can be optimally modeled by a simple linear-Gaussian model (factor analysis). We evaluate our method on a benchmark dataset with a large amount of channel variability and show that the results compare favorably against the competitors. Also, our approach has closed-form solutions and scales gracefully to large datasets.
Finally, a multi-classifier architecture trained on a multicondition fashion is proposed to address the problem of speaker recognition in the presence of additive noise. A large number of experiments are conducted to analyze the proposed architecture and to obtain guidelines for optimal performance in noisy environments. Overall, it is shown that multicondition training of multi-classifier architectures not only produces great robustness in the anticipated conditions, but also generalizes well to unseen conditions
Bio-motivated features and deep learning for robust speech recognition
Mención Internacional en el tÃtulo de doctorIn spite of the enormous leap forward that the Automatic Speech
Recognition (ASR) technologies has experienced over the last five years
their performance under hard environmental condition is still far from
that of humans preventing their adoption in several real applications.
In this thesis the challenge of robustness of modern automatic speech
recognition systems is addressed following two main research lines.
The first one focuses on modeling the human auditory system to
improve the robustness of the feature extraction stage yielding to novel
auditory motivated features. Two main contributions are produced.
On the one hand, a model of the masking behaviour of the Human
Auditory System (HAS) is introduced, based on the non-linear filtering
of a speech spectro-temporal representation applied simultaneously
to both frequency and time domains. This filtering is accomplished
by using image processing techniques, in particular mathematical
morphology operations with an specifically designed Structuring Element
(SE) that closely resembles the masking phenomena that take
place in the cochlea. On the other hand, the temporal patterns of
auditory-nerve firings are modeled. Most conventional acoustic features
are based on short-time energy per frequency band discarding
the information contained in the temporal patterns. Our contribution
is the design of several types of feature extraction schemes based on
the synchrony effect of auditory-nerve activity, showing that the modeling
of this effect can indeed improve speech recognition accuracy in
the presence of additive noise. Both models are further integrated into
the well known Power Normalized Cepstral Coefficients (PNCC).
The second research line addresses the problem of robustness in
noisy environments by means of the use of Deep Neural Networks
(DNNs)-based acoustic modeling and, in particular, of Convolutional
Neural Networks (CNNs) architectures. A deep residual network
scheme is proposed and adapted for our purposes, allowing Residual
Networks (ResNets), originally intended for image processing tasks,
to be used in speech recognition where the network input is small
in comparison with usual image dimensions. We have observed that
ResNets on their own already enhance the robustness of the whole system
against noisy conditions. Moreover, our experiments demonstrate
that their combination with the auditory motivated features devised
in this thesis provide significant improvements in recognition accuracy
in comparison to other state-of-the-art CNN-based ASR systems
under mismatched conditions, while maintaining the performance in
matched scenarios.
The proposed methods have been thoroughly tested and compared
with other state-of-the-art proposals for a variety of datasets and
conditions. The obtained results prove that our methods outperform
other state-of-the-art approaches and reveal that they are suitable for
practical applications, specially where the operating conditions are
unknown.El objetivo de esta tesis se centra en proponer soluciones al problema
del reconocimiento de habla robusto; por ello, se han llevado a cabo
dos lÃneas de investigación.
En la primera lÃınea se han propuesto esquemas de extracción de caracterÃsticas novedosos, basados en el modelado del comportamiento
del sistema auditivo humano, modelando especialmente los fenómenos
de enmascaramiento y sincronÃa. En la segunda, se propone mejorar
las tasas de reconocimiento mediante el uso de técnicas de
aprendizaje profundo, en conjunto con las caracterÃsticas propuestas.
Los métodos propuestos tienen como principal objetivo, mejorar la
precisión del sistema de reconocimiento cuando las condiciones de
operación no son conocidas, aunque el caso contrario también ha sido
abordado.
En concreto, nuestras principales propuestas son los siguientes:
Simular el sistema auditivo humano con el objetivo de mejorar
la tasa de reconocimiento en condiciones difÃciles, principalmente
en situaciones de alto ruido, proponiendo esquemas de
extracción de caracterÃsticas novedosos.
Siguiendo esta dirección, nuestras principales propuestas se detallan a continuación:
• Modelar el comportamiento de enmascaramiento del sistema
auditivo humano, usando técnicas del procesado de
imagen sobre el espectro, en concreto, llevando a cabo el
diseño de un filtro morfológico que captura este efecto.
• Modelar el efecto de la sincronà que tiene lugar en el nervio
auditivo.
• La integración de ambos modelos en los conocidos Power
Normalized Cepstral Coefficients (PNCC).
La aplicación de técnicas de aprendizaje profundo con el objetivo
de hacer el sistema más robusto frente al ruido, en particular
con el uso de redes neuronales convolucionales profundas, como
pueden ser las redes residuales.
Por último, la aplicación de las caracterÃsticas propuestas en
combinación con las redes neuronales profundas, con el objetivo
principal de obtener mejoras significativas, cuando las condiciones
de entrenamiento y test no coinciden.Programa Oficial de Doctorado en Multimedia y ComunicacionesPresidente: Javier Ferreiros López.- Secretario: Fernando DÃaz de MarÃa.- Vocal: Rubén Solera Ureñ
Perceptual compensation for reverberation in human listeners and machines
This thesis explores compensation for reverberation in human listeners and machines. Late reverberation is typically understood as a distortion which degrades intelligibility. Recent research, however, shows that late reverberation is not always detrimental to human speech perception. At times, prolonged exposure to reverberation can provide a helpful acoustic context which improves identification of reverberant speech sounds. The physiology underpinning our robustness to reverberation has not yet been elucidated, but is speculated in this thesis to include efferent processes which have previously been shown to improve discrimination of noisy speech. These efferent pathways descend from higher auditory centres, effectively recalibrating the encoding of sound in the cochlea. Moreover, this thesis proposes that efferent-inspired computational models based on psychoacoustic principles may also improve performance for machine listening systems in reverberant environments.
A candidate model for perceptual compensation for reverberation is proposed in which efferent suppression derives from the level of reverberation detected in the simulated auditory nerve response. The model simulates human performance in a phoneme-continuum identification task under a range of reverberant conditions, where a synthetically controlled test-word and its surrounding context phrase are independently reverberated. Addressing questions which arose from the model, a series of perceptual experiments used naturally spoken speech materials to investigate aspects of the psychoacoustic mechanism underpinning compensation. These experiments demonstrate a monaural compensation mechanism that is influenced by both the preceding context (which need not be intelligible speech) and by the test-word itself, and which depends on the time-direction of reverberation. Compensation was shown to act rapidly (within a second or so), indicating a monaural mechanism that is likely to be effective in everyday listening. Finally, the implications of these findings for the future development of computational models of auditory perception are considered
Evaluating automatic speaker recognition systems: an overview of the nist speaker recognition evaluations (1996-2014)
2014 CSIC. Manuscripts published in this Journal are the property of the Consejo Superior de Investigaciones CientÃficas, and quoting this source is a requirement for any partial or full reproduction.Automatic Speaker Recognition systems show interesting properties, such as speed of processing or repeatability of results, in contrast to speaker recognition by humans. But they will be usable just if they are reliable. Testability, or the ability to extensively evaluate the goodness of the speaker detector decisions, becomes then critical. In the last 20 years, the US National Institute of Standards and Technology (NIST) has organized, providing the proper speech data and evaluation protocols, a series of text-independent Speaker Recognition Evaluations (SRE).
Those evaluations have become not just a periodical benchmark test, but also a meeting point of a collaborative community of scientists that have been deeply involved in the cycle of evaluations, allowing tremendous progress in a specially complex task where the speaker information is spread across different information levels (acoustic, prosodic, linguistic…) and is strongly affected by speaker intrinsic and extrinsic variability factors. In this paper, we outline how the evaluations progressively challenged the technology including new speaking conditions and sources of variability, and how the scientific community gave answers to those demands. Finally, NIST SREs will be shown to be not free of inconveniences, and future challenges to speaker recognition assessment will also be discussed
Vocal tract constancy in birds and humans
Humans perceive speech as relatively stable despite acoustic variation caused by vocal tract (VT) differences between speakers. Humans use perceptual ‘vocal tract normalisation’ (VTN) and other processes to achieve this stability. Similarity in vocal apparatus/acoustics between birds and humans means that birds might also experience VT variation. This has the potential to impede bird communication. No known studies have explicitly examined this, but a number of studies show perceptual stability or ‘perceptual constancy’ in birds similar to that seen in humans when dealing with VT variation. This review explores similarities between birds and humans and concludes that birds show sufficient evidence of perceptual constancy to warrant further research in this area. Future work should 1) quantify the multiple sources of variation in bird vocalisations, including, but not limited to VT variations, 2) determine whether vocalisations are perniciously disrupted by any of these and 3) investigate how birds reduce variation to maintain perceptual constancy and perceptual efficiency.PostprintPeer reviewe
Towards Single-Channel Speech Separation in Noise and Reverberation
Many speech technologies, such as automatic speech recognition and speaker identification, are conventionally designed to only work on single speech streams. As a result, these systems can suffer severely degraded performance in cases of overlapping speech, i.e. when two or more people are speaking at the same time. Speech separation systems aim to address this problem by taking a recording of a speech mixture and outputting a single recording for each speaker in the mixture, where the interfering speech has been removed. The advancements in speech technology provided by deep neural networks have extended to speech separation, resulting in the first effectively functional single-channel speech separation systems. As performance of these systems has improved, there has been a desire to extend their capabilities beyond the clean studio recordings using close-talking microphones that the technology was initially developed on. In this dissertation, we focus on the extension of these technologies to the noisy and reverberant conditions more representative of real-world applications. Contributions of this dissertation include producing and releasing new data appropriate for training and evaluation of single-channel speech separation techniques, performing benchmark experiments to establish the degradation of conventional methods in more realistic settings, theoretical analysis of the impact, and development of new techniques targeted at improving system performance in these adverse conditions
- …