Bio-motivated features and deep learning for robust speech recognition
Mención Internacional en el título de doctor.
In spite of the enormous leap forward that Automatic Speech
Recognition (ASR) technologies have experienced over the last five years,
their performance under hard environmental conditions is still far from
that of humans, preventing their adoption in several real applications.
In this thesis the challenge of robustness of modern automatic speech
recognition systems is addressed following two main research lines.
The first one focuses on modeling the human auditory system to
improve the robustness of the feature extraction stage, yielding novel
auditory-motivated features. Two main contributions are made.
On the one hand, a model of the masking behaviour of the Human
Auditory System (HAS) is introduced, based on the non-linear filtering
of a speech spectro-temporal representation applied simultaneously
to both frequency and time domains. This filtering is accomplished
by using image processing techniques, in particular mathematical
morphology operations with a specifically designed Structuring Element
(SE) that closely resembles the masking phenomena that take
place in the cochlea. On the other hand, the temporal patterns of
auditory-nerve firings are modeled. Most conventional acoustic features
are based on short-time energy per frequency band discarding
the information contained in the temporal patterns. Our contribution
is the design of several types of feature extraction schemes based on
the synchrony effect of auditory-nerve activity, showing that the modeling
of this effect can indeed improve speech recognition accuracy in
the presence of additive noise. Both models are further integrated into
the well known Power Normalized Cepstral Coefficients (PNCC).
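The masking model above amounts to a non-linear (morphological) filtering of a spectro-temporal representation. As a rough illustration only, the sketch below applies a grey-scale dilation with a flat 3x3 structuring element to a toy "spectrogram"; the thesis instead designs a cochlea-shaped SE, so both the SE and the data here are illustrative assumptions:

```python
# Toy sketch of morphological filtering of a spectro-temporal representation.
# The thesis designs a cochlea-shaped Structuring Element (SE); here a flat
# 3x3 SE is used purely for illustration.

def grey_dilate(spec, se):
    """Grey-scale dilation: each time-frequency bin is replaced by the
    maximum over the neighbourhood defined by the SE offsets."""
    n_f, n_t = len(spec), len(spec[0])
    out = [[0.0] * n_t for _ in range(n_f)]
    for f in range(n_f):
        for t in range(n_t):
            vals = []
            for df, dt in se:
                ff, tt = f + df, t + dt
                if 0 <= ff < n_f and 0 <= tt < n_t:
                    vals.append(spec[ff][tt])
            out[f][t] = max(vals)
    return out

# Flat 3x3 SE: offsets along (frequency, time).
SE = [(df, dt) for df in (-1, 0, 1) for dt in (-1, 0, 1)]

# A tiny spectrogram with one strong component: dilation spreads its
# influence to neighbouring bins, loosely mimicking spectro-temporal masking.
spec = [[0.0, 0.0, 0.0],
        [0.0, 9.0, 0.0],
        [0.0, 0.0, 0.0]]
masked = grey_dilate(spec, SE)
```

The dilation operates simultaneously over frequency (rows) and time (columns), which is the key property the thesis exploits to capture masking in both domains at once.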
The second research line addresses the problem of robustness in
noisy environments by means of the use of Deep Neural Networks
(DNNs)-based acoustic modeling and, in particular, of Convolutional
Neural Networks (CNNs) architectures. A deep residual network
scheme is proposed and adapted for our purposes, allowing Residual
Networks (ResNets), originally intended for image processing tasks,
to be used in speech recognition where the network input is small
in comparison with usual image dimensions. We have observed that
ResNets on their own already enhance the robustness of the whole system
against noisy conditions. Moreover, our experiments demonstrate
that their combination with the auditory-motivated features devised
in this thesis provides significant improvements in recognition accuracy
in comparison to other state-of-the-art CNN-based ASR systems
under mismatched conditions, while maintaining the performance in
matched scenarios.
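The essential mechanism of a ResNet is the skip connection, which lets each block learn only a residual F(x) added back to its input. The following is a minimal sketch of that idea using a toy fully connected block rather than the convolutional architecture of the thesis; the weights and layer shapes are illustrative assumptions:

```python
# Minimal sketch of a residual connection, the core idea behind ResNets:
# the block learns a residual F(x) that is added back to its input, so the
# block can fall back to the identity mapping when F(x) is near zero.

def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, w, b):
    """Toy fully connected layer; w is a list of weight rows."""
    return [sum(wi * xi for wi, xi in zip(row, v)) + bi
            for row, bi in zip(w, b)]

def residual_block(x, w1, b1, w2, b2):
    # F(x): two small transformations with a non-linearity in between.
    f = linear(relu(linear(x, w1, b1)), w2, b2)
    # Skip connection: output = F(x) + x.
    return [fi + xi for fi, xi in zip(f, x)]

# With zero weights, F(x) = 0 and the block reduces exactly to the identity.
x = [1.0, -2.0, 3.0]
zeros_w = [[0.0] * 3 for _ in range(3)]
zeros_b = [0.0] * 3
y = residual_block(x, zeros_w, zeros_b, zeros_w, zeros_b)
```

This identity fallback is what makes very deep networks trainable, and it is independent of input size, which is why the scheme can be adapted from large images to the comparatively small spectro-temporal inputs used in ASR.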
The proposed methods have been thoroughly tested and compared
with other state-of-the-art proposals for a variety of datasets and
conditions. The obtained results prove that our methods outperform
other state-of-the-art approaches and reveal that they are suitable for
practical applications, especially where the operating conditions are
unknown.
The objective of this thesis is to propose solutions to the problem
of robust speech recognition; to that end, two lines of research have
been pursued.
In the first line, novel feature extraction schemes have been proposed,
based on modeling the behaviour of the human auditory system,
especially the masking and synchrony phenomena. In the second,
recognition rates are improved through the use of deep learning
techniques in combination with the proposed features.
The main objective of the proposed methods is to improve the
accuracy of the recognition system when the operating conditions are
unknown, although the opposite case has also been addressed.
Specifically, our main proposals are the following:
Simulating the human auditory system in order to improve
recognition rates under difficult conditions, mainly in high-noise
situations, by proposing novel feature extraction schemes.
In this direction, our main proposals are detailed below:
• Modeling the masking behaviour of the human auditory
system, using image processing techniques on the spectrum;
in particular, designing a morphological filter that captures
this effect.
• Modeling the synchrony effect that takes place in the auditory
nerve.
• Integrating both models into the well-known Power
Normalized Cepstral Coefficients (PNCC).
Applying deep learning techniques in order to make the system
more robust against noise, in particular through the use of deep
convolutional neural networks such as residual networks.
Finally, applying the proposed features in combination with deep
neural networks, with the main objective of obtaining significant
improvements when training and test conditions do not match.
Programa Oficial de Doctorado en Multimedia y Comunicaciones. Thesis committee: President, Javier Ferreiros López; Secretary, Fernando Díaz de María; Member, Rubén Solera Ureña.
Optimization of data-driven filterbank for automatic speaker verification
Most speech processing applications use triangular filters spaced on the
mel scale for feature extraction. In this paper, we propose a new data-driven
filter design method which optimizes filter parameters from given speech
data. First, we introduce a frame-selection-based approach for developing a
speech-signal-based frequency warping scale. Then, we propose a new method for
computing the filter frequency responses by using principal component analysis
(PCA). The main advantage of the proposed method over the recently introduced
deep learning-based methods is that it requires only a very limited amount of
unlabeled speech data. We demonstrate that the proposed filterbank has more
speaker-discriminative power than the commonly used mel filterbank as well as
an existing data-driven filterbank. We conduct automatic speaker verification
(ASV) experiments on different corpora using various classifier back-ends. We
show that the acoustic features created with the proposed filterbank are better
than existing mel-frequency cepstral coefficients (MFCCs) and
speech-signal-based frequency cepstral coefficients (SFCCs) in most cases.
In the experiments with VoxCeleb1 and the popular i-vector back-end, we
observe a 9.75% relative improvement in equal error rate (EER) over MFCCs.
Similarly, the relative improvement is 4.43% with the recently introduced
x-vector system. We obtain further improvement by fusing the proposed method
with the standard MFCC-based approach.
Comment: Published in Digital Signal Processing journal (Elsevier).
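The PCA step can be pictured as extracting a dominant "filter shape" from a collection of spectral frames. The sketch below computes the first principal component by power iteration over a covariance matrix; it is only a rough illustration under assumed toy data, and omits the paper's frame selection and learned warping scale entirely:

```python
# Toy sketch: derive a data-driven "filter shape" as the first principal
# component of a set of spectral frames, via power iteration. Illustrative
# only; the paper's full procedure also involves frame selection and a
# speech-signal-based frequency warping scale.

def first_principal_component(frames, iters=200):
    dim = len(frames[0])
    n = len(frames)
    # Mean-centre the frames.
    mean = [sum(f[i] for f in frames) / n for i in range(dim)]
    centred = [[f[i] - mean[i] for i in range(dim)] for f in frames]
    # Sample covariance matrix (dim x dim).
    cov = [[sum(c[i] * c[j] for c in centred) / n for j in range(dim)]
           for i in range(dim)]
    # Power iteration for the dominant eigenvector (unit norm).
    v = [1.0] * dim
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Hypothetical frames that vary almost entirely along the first spectral bin:
frames = [[3.0, 0.1], [-3.0, -0.1], [2.5, 0.0], [-2.5, 0.05]]
pc = first_principal_component(frames)
```

The recovered component points along the direction of maximum variance, which is the sense in which the resulting filter response is "data-driven" rather than fixed by the mel scale.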
Glottal-synchronous speech processing
Glottal-synchronous speech processing is a field of speech science where the pseudoperiodicity
of voiced speech is exploited. Traditionally, speech processing involves segmenting
and processing short speech frames of predefined length; this may fail to exploit the inherent
periodic structure of voiced speech, which glottal-synchronous speech frames have
the potential to harness. Glottal-synchronous frames are often derived from the glottal
closure instants (GCIs) and glottal opening instants (GOIs).
The SIGMA algorithm was developed for the detection of GCIs and GOIs from
the Electroglottograph signal with a measured accuracy of up to 99.59%. For GCI and
GOI detection from speech signals, the YAGA algorithm provides a measured accuracy
of up to 99.84%. Multichannel speech-based approaches are shown to be more robust to
reverberation than single-channel algorithms.
The GCIs are applied to real-world applications including speech dereverberation,
where SNR is improved by up to 5 dB, and to prosodic manipulation where the importance
of voicing detection in glottal-synchronous algorithms is demonstrated by subjective
testing. The GCIs are further exploited in a new area of data-driven speech modelling,
providing new insights into speech production and a set of tools to aid deployment into
real-world applications. The technique is shown to be applicable in areas of speech coding,
identification, and artificial bandwidth extension of telephone speech.
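The core framing step described above can be sketched very simply: instead of fixed-length frames, the signal is cut between consecutive glottal closure instants. In the sketch below, the GCI positions are hand-picked assumptions; in practice they would come from a detector such as SIGMA or YAGA:

```python
# Sketch of glottal-synchronous framing: one frame per glottal cycle,
# delimited by consecutive glottal closure instants (GCIs), instead of
# frames of predefined length. GCI sample indices here are hypothetical.

def glottal_synchronous_frames(signal, gcis):
    """Return the samples between each pair of consecutive GCIs
    (GCIs given as sorted sample indices)."""
    return [signal[a:b] for a, b in zip(gcis, gcis[1:])]

signal = list(range(20))   # stand-in for speech samples
gcis = [2, 7, 13, 19]      # hypothetical GCI sample indices
frames = glottal_synchronous_frames(signal, gcis)
```

Note that the frames have varying lengths (one pitch period each), which is precisely what lets downstream processing track the pseudoperiodic structure of voiced speech.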
Single-Microphone Speech Enhancement and Separation Using Deep Learning
The cocktail party problem comprises the challenging task of understanding a
speech signal in a complex acoustic environment, where multiple speakers and
background noise signals simultaneously interfere with the speech signal of
interest. A signal processing algorithm that can effectively increase the
speech intelligibility and quality of speech signals in such complicated
acoustic situations is highly desirable, especially for applications
involving mobile communication devices and hearing assistive devices. Due
to the re-emergence of machine learning techniques, today known as deep
learning, the challenges involved with such algorithms might be overcome.
In this PhD thesis,
we study and develop deep learning-based techniques for two sub-disciplines of
the cocktail party problem: single-microphone speech enhancement and
single-microphone multi-talker speech separation. Specifically, we conduct
in-depth empirical analysis of the generalizability of modern deep
learning-based single-microphone speech enhancement algorithms. We show
that the performance of such algorithms is closely linked to the training
data, and that good
generalizability can be achieved with carefully designed training data.
Furthermore, we propose uPIT, a deep learning-based algorithm for
single-microphone speech separation and we report state-of-the-art results on a
speaker-independent multi-talker speech separation task. Additionally, we show
that uPIT works well for joint speech separation and enhancement without
explicit prior knowledge about the noise type or number of speakers. Finally,
we show that deep learning-based speech enhancement algorithms designed to
minimize the classical short-time spectral amplitude mean squared error lead
to enhanced speech signals which are essentially optimal in terms of STOI, a
state-of-the-art speech intelligibility estimator.
Comment: PhD thesis, 233 pages.
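The permutation-invariant idea behind uPIT can be sketched concisely: the separation loss is the minimum, over all assignments of network outputs to reference speakers, of the total reconstruction error, with utterance-level PIT fixing one assignment for the whole utterance. The sketch below uses toy signals and plain MSE as illustrative assumptions, not the thesis' actual network or loss:

```python
# Sketch of a permutation-invariant training (PIT) loss, the idea behind
# uPIT: since output channels have no fixed speaker identity, the loss is
# minimized over all output-to-speaker assignments. Toy data and plain MSE
# are used here purely for illustration.
from itertools import permutations

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def pit_loss(estimates, references):
    """Minimum total MSE over all output-to-speaker permutations."""
    return min(
        sum(mse(estimates[i], references[p[i]]) for i in range(len(p)))
        for p in permutations(range(len(references)))
    )

# Two toy "speaker signals"; the estimates come out in swapped order, yet
# the permutation-invariant loss is still zero.
refs = [[1.0, 1.0, 1.0], [0.0, 2.0, 0.0]]
ests = [[0.0, 2.0, 0.0], [1.0, 1.0, 1.0]]
loss = pit_loss(ests, refs)
```

A naive per-channel loss would penalize the swapped assignment heavily; resolving this label ambiguity at the utterance level is what enables speaker-independent multi-talker separation.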
Recent Advances in Signal Processing
Signal processing is a critical task in the majority of new technological inventions and challenges across a variety of applications in both science and engineering. Classical signal processing techniques have largely relied on mathematical models that are linear, local, stationary, and Gaussian, and have always favored closed-form tractability over real-world accuracy; these constraints were imposed by the lack of powerful computing tools. During the last few decades, signal processing theories, developments, and applications have matured rapidly and now draw on tools from many areas of mathematics, computer science, physics, and engineering. This book is targeted primarily toward students and researchers who want to be exposed to a wide variety of signal processing techniques and algorithms. It includes 27 chapters that can be grouped into five areas depending on the application at hand, addressing image processing, speech processing, communication systems, time-series analysis, and educational packages, respectively. The book has the advantage of providing a collection of applications that are completely independent and self-contained, so the interested reader can choose any chapter and skip to another without losing continuity.