117 research outputs found
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic for automatic speech recognition that still
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and, with sufficient
training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks.
Efficient Data Utilization and Weakly Supervised Learning Techniques for Audio Event Detection
Doctoral thesis (Ph.D.), Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, February 2020. 김남수.
Conventional audio event detection (AED) models are based on supervised approaches. For supervised approaches, strongly labeled data is required. However, collecting large-scale strongly labeled data of audio events is challenging due to the diversity of audio event types and labeling difficulties. In this thesis, we propose data-efficient and weakly supervised techniques for AED.
In the first approach, a data-efficient AED system is proposed. In the proposed system, data augmentation is performed to deal with the data sparsity problem and generate polyphonic event examples. An exemplar-based noise reduction algorithm is proposed for feature enhancement. For polyphonic event detection, a multi-labeled deep neural network (DNN) classifier is employed. An adaptive thresholding algorithm is applied as a post-processing method for robust event detection in noisy conditions. From the experimental results, the proposed algorithm has shown promising performance for AED on a low-resource dataset.
In the second approach, a convolutional neural network (CNN)-based audio tagging system is proposed. The proposed model consists of a local detector and a global classifier. The local detector detects local audio words that contain distinct characteristics of events, and the global classifier summarizes the information to predict audio events on the recording. From the experimental results, we have found that the proposed model outperforms conventional artificial neural network models.
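In the final approach, we propose a weakly supervised AED model. The proposed model takes advantage of strengthening feature propagation from DenseNet and modeling channel-wise relationships by SENet. Also, the correlations among segments in audio recordings are represented by a recurrent neural network (RNN) and conditional random field (CRF). RNN utilizes contextual information and CRF post-processing helps to refine segment-level predictions. We evaluate our proposed method and compare its performance with a CNN-based baseline approach. From a number of experiments, it has been shown that the proposed method is effective both on audio tagging and weakly supervised AED.
Typical audio event detection systems are trained through supervised learning. Supervised learning requires strongly labeled data. However, it is difficult to build a large database of strongly labeled data because of the diversity of audio events and the difficulty of labeling. To address this problem, this thesis proposes data-efficient and weakly supervised learning techniques for audio event detection.
As the first approach, a data-efficient audio event detection system is proposed. In the proposed system, data augmentation is used to address the data sparsity problem and to generate overlapping event data. A noise suppression technique is used for feature enhancement, and a multi-label deep neural network (DNN) classifier is used for overlapping audio event detection. Experimental results show that the proposed algorithm achieves strong audio event detection performance even with insufficient data.
As the second approach, a convolutional neural network (CNN)-based audio tagging system is proposed. The proposed model consists of a local detector and a global classifier. The local detector detects local audio words that contain distinctive audio event characteristics, and the global classifier summarizes the detected information to predict audio events. Experimental results show that the proposed model outperforms conventional artificial neural network methods.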
As the final approach, a weakly supervised audio event detection model is proposed. The proposed model uses the DenseNet structure to enable a smooth flow of information and uses SENet to model the correlations among channels. In addition, correlation information among segments of the audio signal is exploited using a recurrent neural network (RNN) and a conditional random field (CRF). Through a number of experiments, the proposed model is shown to outperform the conventional CNN-based method on both audio tagging and audio event detection.
1 Introduction
2 Audio Event Detection
2.1 Data-Efficient Audio Event Detection
2.2 Audio Tagging
2.3 Weakly Supervised Audio Event Detection
2.4 Metrics
3 Data-Efficient Techniques for Audio Event Detection
3.1 Introduction
3.2 DNN-Based AED system
3.2.1 Data Augmentation
3.2.2 Exemplar-Based Approach for Noise Reduction
3.2.3 DNN Classifier
3.2.4 Post-Processing
3.3 Experiments
3.4 Summary
4 Audio Tagging using Local Detector and Global Classifier
4.1 Introduction
4.2 CNN-Based Audio Tagging Model
4.2.1 Local Detector and Global Classifier
4.2.2 Temporal Localization of Events
4.3 Experiments
4.3.1 Dataset and Feature
4.3.2 Model Training
4.3.3 Results
4.4 Summary
5 Deep Convolutional Neural Network with Structured Prediction for Weakly Supervised Audio Event Detection
5.1 Introduction
5.2 CNN with Structured Prediction for Weakly Supervised AED
5.2.1 DenseNet
5.2.2 Squeeze-and-Excitation
5.2.3 Global Pooling for Aggregation
5.2.4 Structured Prediction for Accurate Event Localization
5.3 Experiments
5.3.1 Dataset
5.3.2 Feature Extraction
5.3.3 DSNet and DSNet-RNN Structures
5.3.4 Baseline CNN Structure
5.3.5 Training and Evaluation
5.3.6 Metrics
5.3.7 Results and Discussion
5.3.8 Comparison with the DCASE 2017 task 4 Results
5.4 Summary
6 Conclusions
Bibliography
Abstract (in Korean)
Acknowledgments
Docto
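As a rough illustration of the weakly supervised setting this record describes, the sketch below aggregates segment-level event probabilities into a clip-level score by global pooling, then thresholds the segments to localize an event. The function names, the fixed 0.5 threshold, and the example probabilities are assumptions made for illustration; the thesis's DSNet, RNN, and CRF components are not reproduced here.

```python
from typing import List, Tuple


def global_pool(segment_probs: List[float], mode: str = "max") -> float:
    """Aggregate segment-level probabilities into one clip-level score."""
    if not segment_probs:
        raise ValueError("segment_probs must not be empty")
    if mode == "max":
        return max(segment_probs)
    if mode == "mean":
        return sum(segment_probs) / len(segment_probs)
    raise ValueError(f"unknown pooling mode: {mode}")


def localize(segment_probs: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return (onset, offset) segment index pairs where prob >= threshold.

    The offset is exclusive, so (2, 5) covers segments 2, 3, and 4.
    """
    events = []
    onset = None
    for i, p in enumerate(segment_probs):
        if p >= threshold and onset is None:
            onset = i  # an event starts here
        elif p < threshold and onset is not None:
            events.append((onset, i))  # the event ended just before i
            onset = None
    if onset is not None:
        events.append((onset, len(segment_probs)))  # event runs to the end
    return events


# Hypothetical segment-level outputs for one event class in one clip.
probs = [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.6]
clip_score_max = global_pool(probs, "max")
clip_score_mean = global_pool(probs, "mean")
events = localize(probs)
```

Max pooling lets a single confident segment decide the clip-level tag, while mean pooling rewards events that persist across the clip; which one suits a given dataset is an empirical question.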
Deep Learning for Distant Speech Recognition
Deep learning is an emerging technology that is considered one of the most
promising directions for reaching higher levels of artificial intelligence.
Among the other achievements, building computers that understand speech
represents a crucial leap towards intelligent machines. Despite the great
efforts of the past decades, however, a natural and robust human-machine speech
interaction still appears to be out of reach, especially when users interact
with a distant microphone in noisy and reverberant environments. The latter
disturbances severely hamper the intelligibility of a speech signal, making
Distant Speech Recognition (DSR) one of the major open challenges in the field.
This thesis addresses the latter scenario and proposes some novel techniques,
architectures, and algorithms to improve the robustness of distant-talking
acoustic models. We first elaborate on methodologies for realistic data
contamination, with a particular emphasis on DNN training with simulated data.
We then investigate approaches for better exploiting speech contexts,
proposing some original methodologies for both feed-forward and recurrent
neural networks. Lastly, inspired by the idea that cooperation across different
DNNs could be the key for counteracting the harmful effects of noise and
reverberation, we propose a novel deep learning paradigm called network of deep
neural networks. The analysis of the original concepts was based on extensive
experimental validations conducted on both real and simulated data, considering
different corpora, microphone configurations, environments, noisy conditions,
and ASR tasks.
Comment: PhD Thesis Unitn, 201
Automated Speaker Independent Visual Speech Recognition: A Comprehensive Survey
Speaker-independent VSR is a complex task that involves identifying spoken
words or phrases from video recordings of a speaker's facial movements. Over
the years, there has been a considerable amount of research in the field of VSR
involving different algorithms and datasets to evaluate system performance.
These efforts have resulted in significant progress in developing effective VSR
models, creating new opportunities for further research in this area. This
survey provides a detailed examination of the progression of VSR over the past
three decades, with a particular emphasis on the transition from
speaker-dependent to speaker-independent systems. We also provide a
comprehensive overview of the various datasets used in VSR research and the
preprocessing techniques employed to achieve speaker independence. The survey
covers the works published from 1990 to 2023, thoroughly analyzing each work
and comparing them on various parameters. This survey provides an in-depth
analysis of the evolution of speaker-independent VSR systems from 1990 to 2023. It
outlines the development of VSR systems over time and highlights the need to
develop end-to-end pipelines for speaker-independent VSR. The pictorial
representation offers a clear and concise overview of the techniques used in
speaker-independent VSR, thereby aiding in the comprehension and analysis of
the various methodologies. The survey also highlights the strengths and
limitations of each technique and provides insights into developing novel
approaches for analyzing visual speech cues. Overall, this comprehensive review
provides insights into the current state-of-the-art speaker-independent VSR and
highlights potential areas for future research.
Joint Training Methods for Tandem and Hybrid Speech Recognition Systems using Deep Neural Networks
Hidden Markov models (HMMs) have been the mainstream acoustic modelling approach for state-of-the-art automatic speech recognition (ASR) systems over the
past few decades. Recently, due to the rapid development of deep learning technologies, deep neural networks (DNNs) have become an essential part of nearly all kinds of ASR approaches. Among HMM-based ASR approaches, DNNs are most commonly used to extract features (tandem system configuration) or to directly produce HMM output probabilities (hybrid system configuration).
Although DNN tandem and hybrid systems have been shown to have superior
performance to traditional ASR systems without any DNN models, there are still
issues with such systems. First, some of the DNN settings, such as the choice of
the context-dependent (CD) output targets set and hidden activation functions, are
usually determined independently from the DNN training process. Second, different
ASR modules are separately optimised based on different criteria following a greedy
build strategy. For instance, for tandem systems, the features are often extracted by a
DNN trained to classify individual speech frames while acoustic models are built upon
such features according to a sequence level criterion. These issues mean that the best performance is not theoretically guaranteed.
This thesis focuses on alleviating both issues using joint training methods. In DNN
acoustic model joint training, the decision tree HMM state tying approach is extended
to cluster DNN-HMM states. Based on this method, an alternative CD-DNN training
procedure without relying on any additional system is proposed, which can produce
DNN acoustic models comparable in word error rate (WER) with those trained by the
conventional procedure. Meanwhile, the most common hidden activation functions,
the sigmoid and rectified linear unit (ReLU), are parameterised to enable automatic
learning of function forms. Experiments using conversational telephone speech (CTS)
Mandarin data result in an average of 3.4% and 2.2% relative character error rate (CER) reduction with sigmoid and ReLU parameterisations. Such parameterised functions can also be applied to speaker adaptation tasks.
At the ASR system level, DNN acoustic model and corresponding speaker dependent (SD) input feature transforms are jointly learned through minimum phone error
(MPE) training as an example of hybrid system joint training, which outperforms the
conventional hybrid system speaker adaptive training (SAT) method. MPE based speaker independent (SI) tandem system joint training is also studied. Experiments on
multi-genre broadcast (MGB) English data show that this method gives a reduction
in tandem system WER of 11.8% (relative), and the resulting tandem systems are
comparable to MPE hybrid systems in both WER and the number of parameters. In
addition, all approaches in this thesis have been implemented using the hidden Markov model toolkit (HTK) and the related source code has been or will be made publicly available with either recent or future HTK releases, to increase the reproducibility of the work presented in this thesis.
Cambridge International Scholarship, Cambridge Overseas Trust
Research funding, EPSRC Natural Speech Technology Project
Research funding, DARPA BOLT Program
Research funding, iARPA Babel Progra
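The error rate reductions quoted in this abstract are relative, not absolute. As a reminder of the arithmetic, the sketch below computes a relative reduction from a baseline and an improved error rate. The two error rates used in the example are hypothetical and are not taken from the thesis; they were chosen only so that the result lands on a figure of the same size as the one quoted.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error rate reduction, in percent of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return 100.0 * (baseline - improved) / baseline


# Hypothetical WERs (not from the thesis): 25.0% baseline, 22.05% improved.
reduction = relative_reduction(25.0, 22.05)
print(f"{reduction:.1f}% relative")
```

With these assumed numbers, an absolute drop of 2.95 WER points corresponds to a relative reduction of about 11.8%.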
Sparse and Low-rank Modeling for Automatic Speech Recognition
This thesis deals with exploiting the low-dimensional multi-subspace structure of speech towards the goal of improving acoustic modeling for automatic speech recognition (ASR). Leveraging the parsimonious hierarchical nature of speech, we hypothesize that whenever a speech signal is measured in a high-dimensional feature space, the true class information is embedded in low-dimensional subspaces whereas noise is scattered as random high-dimensional erroneous estimations in the features. In this context, the contribution of this thesis is twofold: (i) identify sparse and low-rank modeling approaches as excellent tools for extracting the class-specific low-dimensional subspaces in speech features, and (ii) employ these tools under novel ASR frameworks to enrich the acoustic information present in the speech features towards the goal of improving ASR. Techniques developed in this thesis focus on deep neural network (DNN) based posterior features which, under the sparse and low-rank modeling approaches, unveil the underlying class-specific low-dimensional subspaces very elegantly.
In this thesis, we tackle ASR tasks of varying difficulty, ranging from isolated word recognition (IWR) and connected digit recognition (CDR) to large-vocabulary continuous speech recognition (LVCSR). For IWR and CDR, we propose a novel Compressive Sensing (CS) perspective towards ASR. Here exemplar-based speech recognition is posed as a problem of recovering sparse high-dimensional word representations from compressed low-dimensional phonetic representations. In the context of LVCSR, this thesis argues that despite their power in representation learning, DNN based acoustic models still have room for improvement in exploiting the union of low-dimensional subspaces structure of speech data. Therefore, this thesis proposes to enhance DNN posteriors by projecting them onto the manifolds of the underlying classes using principal component analysis (PCA) or compressive sensing based dictionaries. Projected posteriors are shown to be more accurate training targets for learning better acoustic models, resulting in improved ASR performance. The proposed approach is evaluated on both close-talk and far-field conditions, confirming the importance of sparse and low-rank modeling of speech in building a robust ASR framework. Finally, the conclusions of this thesis are further consolidated by an information theoretic analysis approach which explicitly quantifies the contribution of proposed techniques in improving ASR.
A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for unparalleled advancements in speech
recognition, text-to-speech synthesis, automatic speech recognition, and
emotion recognition, propelling the performance of these tasks to unprecedented
heights. The power of deep learning techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches, such as MFCC and HMM, to
more recent advances in deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field.
Bio-motivated features and deep learning for robust speech recognition
International Mention in the doctoral degree
In spite of the enormous leap forward that Automatic Speech
Recognition (ASR) technologies have experienced over the last five years,
their performance under adverse environmental conditions is still far from
that of humans, preventing their adoption in several real applications.
In this thesis the challenge of robustness of modern automatic speech
recognition systems is addressed following two main research lines.
The first one focuses on modeling the human auditory system to
improve the robustness of the feature extraction stage, yielding novel
auditory motivated features. Two main contributions are produced.
On the one hand, a model of the masking behaviour of the Human
Auditory System (HAS) is introduced, based on the non-linear filtering
of a speech spectro-temporal representation applied simultaneously
to both frequency and time domains. This filtering is accomplished
by using image processing techniques, in particular mathematical
morphology operations with a specifically designed Structuring Element
(SE) that closely resembles the masking phenomena that take
place in the cochlea. On the other hand, the temporal patterns of
auditory-nerve firings are modeled. Most conventional acoustic features
are based on short-time energy per frequency band discarding
the information contained in the temporal patterns. Our contribution
is the design of several types of feature extraction schemes based on
the synchrony effect of auditory-nerve activity, showing that the modeling
of this effect can indeed improve speech recognition accuracy in
the presence of additive noise. Both models are further integrated into
the well known Power Normalized Cepstral Coefficients (PNCC).
The second research line addresses the problem of robustness in
noisy environments by means of the use of Deep Neural Networks
(DNNs)-based acoustic modeling and, in particular, of Convolutional
Neural Networks (CNNs) architectures. A deep residual network
scheme is proposed and adapted for our purposes, allowing Residual
Networks (ResNets), originally intended for image processing tasks,
to be used in speech recognition where the network input is small
in comparison with usual image dimensions. We have observed that
ResNets on their own already enhance the robustness of the whole system
against noisy conditions. Moreover, our experiments demonstrate
that their combination with the auditory motivated features devised
in this thesis provides significant improvements in recognition accuracy
in comparison to other state-of-the-art CNN-based ASR systems
under mismatched conditions, while maintaining the performance in
matched scenarios.
The proposed methods have been thoroughly tested and compared
with other state-of-the-art proposals for a variety of datasets and
conditions. The obtained results prove that our methods outperform
other state-of-the-art approaches and reveal that they are suitable for
practical applications, especially where the operating conditions are
unknown.
The aim of this thesis is to propose solutions to the problem of
robust speech recognition; to this end, two lines of research have been
pursued.
In the first line, novel feature extraction schemes have been proposed, based on modeling the behavior
of the human auditory system, in particular the masking and synchrony
phenomena. In the second, we propose improving
recognition rates through deep learning techniques,
in conjunction with the proposed features.
The main goal of the proposed methods is to improve the
accuracy of the recognition system when the operating conditions
are unknown, although the opposite case has also been
addressed.
Specifically, our main proposals are the following:
Simulating the human auditory system in order to improve
the recognition rate under difficult conditions, mainly
in high-noise situations, by proposing novel
feature extraction schemes.
Following this direction, our main proposals are detailed below:
• Modeling the masking behavior of the human
auditory system using image processing techniques
applied to the spectrum, specifically by designing
a morphological filter that captures this effect.
• Modeling the synchrony effect that takes place in the auditory
nerve.
• Integrating both models into the well-known Power
Normalized Cepstral Coefficients (PNCC).
Applying deep learning techniques in order
to make the system more robust to noise, in particular
through deep convolutional neural networks such as
residual networks.
Finally, applying the proposed features in
combination with deep neural networks, with the main
aim of obtaining significant improvements when the training
and test conditions do not match.
Official Doctoral Program in Multimedia and Communications
President: Javier Ferreiros López. Secretary: Fernando Díaz de María. Member: Rubén Solera Ureñ
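The abstract describes filtering a spectro-temporal representation with mathematical morphology, jointly over frequency and time. A minimal sketch of that general idea follows, using grayscale erosion, dilation, and opening over a small spectrogram stored as a list of rows. The flat rectangular structuring element and all function names are assumptions for illustration; the thesis uses a specifically designed structuring element that models cochlear masking, which is not reproduced here.

```python
from typing import Callable, List

Matrix = List[List[float]]


def _morph(spec: Matrix, half_f: int, half_t: int,
           op: Callable[[List[float]], float]) -> Matrix:
    """Apply op (min or max) over a flat rectangular neighbourhood.

    Rows index frequency bands and columns index time frames. The window
    is truncated at the borders of the spectrogram.
    """
    n_f, n_t = len(spec), len(spec[0])
    out = [[0.0] * n_t for _ in range(n_f)]
    for f in range(n_f):
        for t in range(n_t):
            window = [
                spec[i][j]
                for i in range(max(0, f - half_f), min(n_f, f + half_f + 1))
                for j in range(max(0, t - half_t), min(n_t, t + half_t + 1))
            ]
            out[f][t] = op(window)
    return out


def erode(spec: Matrix, half_f: int = 1, half_t: int = 1) -> Matrix:
    """Grayscale erosion: local minimum over the structuring element."""
    return _morph(spec, half_f, half_t, min)


def dilate(spec: Matrix, half_f: int = 1, half_t: int = 1) -> Matrix:
    """Grayscale dilation: local maximum over the structuring element."""
    return _morph(spec, half_f, half_t, max)


def opening(spec: Matrix, half_f: int = 1, half_t: int = 1) -> Matrix:
    """Erosion followed by dilation; removes peaks smaller than the element."""
    return dilate(erode(spec, half_f, half_t), half_f, half_t)


# Hypothetical 3-band, 5-frame spectrogram with one isolated spike.
spike = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 9.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
opened = opening(spike)
```

In this toy case the opening removes the isolated spike because it is smaller than the 3x3 element, while energy regions at least as large as the element would be preserved.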