A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for unparalleled advancements in automatic
speech recognition, text-to-speech synthesis, and
emotion recognition, propelling the performance of these tasks to unprecedented
heights. The power of deep learning techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches, such as MFCC and HMM, to
more recent advances in deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field.
An Unsupervised Autoregressive Model for Speech Representation Learning
This paper proposes a novel unsupervised autoregressive neural model for
learning generic speech representations. In contrast to other speech
representation learning methods that aim to remove noise or speaker
variabilities, ours is designed to preserve information for a wide range of
downstream tasks. In addition, the proposed model does not require any phonetic
or word boundary labels, allowing the model to benefit from large quantities of
unlabeled data. Speech representations learned by our model significantly
improve performance on both phone classification and speaker verification over
the surface features and other supervised and unsupervised approaches. Further
analysis shows that different levels of speech information are captured by our
model at different layers. In particular, the lower layers tend to be more
discriminative for speakers, while the upper layers provide more phonetic
content.
Comment: Accepted to Interspeech 2019. Code available at:
https://github.com/iamyuanchung/Autoregressive-Predictive-Codin
QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning
This paper proposes a novel semi-supervised TTS framework, QS-TTS, to improve
TTS quality with lower supervised data requirements via Vector-Quantized
Self-Supervised Speech Representation Learning (VQ-S3RL) utilizing more
unlabeled speech audio. This framework comprises two VQ-S3R learners: first,
the principal learner aims to provide a generative Multi-Stage Multi-Codebook
(MSMC) VQ-S3R via the MSMC-VQ-GAN combined with the contrastive S3RL, while
decoding it back to the high-quality audio; then, the associate learner further
abstracts the MSMC representation into a highly-compact VQ representation
through a VQ-VAE. These two generative VQ-S3R learners provide useful
speech representations and pre-trained models for TTS, significantly improving
synthesis quality while requiring less supervised data. QS-TTS is
evaluated comprehensively under various scenarios via subjective and objective
tests. The results demonstrate the superior performance of QS-TTS, which
achieves a higher MOS than supervised or semi-supervised baseline TTS
approaches, especially in low-resource scenarios.
Moreover, comparing various speech representations and transfer learning
methods in TTS further validates the notable improvement of the proposed
VQ-S3RL to TTS, showing the best audio quality and intelligibility metrics. The
trend of slower decay in the synthesis quality of QS-TTS with decreasing
supervised data further highlights its lower requirements for supervised data,
indicating its great potential in low-resource scenarios.
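The vector-quantization step that the abstract relies on can be illustrated with a minimal sketch. This is not the QS-TTS implementation: it shows only the generic nearest-codebook lookup used by vector quantizers such as a VQ-VAE, with no training, no multi-stage or multi-codebook structure, and no GAN. The function names `quantize` and `dequantize`, the toy codebook, and the toy frames are all hypothetical and chosen for illustration.

```python
def quantize(frames, codebook):
    """Return, for each frame, the index of the nearest codebook vector.

    Distance is squared Euclidean; ties go to the lowest index.
    """
    indices = []
    for frame in frames:
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, code))
            for code in codebook
        ]
        indices.append(distances.index(min(distances)))
    return indices


def dequantize(indices, codebook):
    """Map discrete indices back to their codebook vectors."""
    return [codebook[i] for i in indices]


# Toy data (assumed for illustration): three 2-D codes, three 2-D frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

indices = quantize(frames, codebook)
reconstructed = dequantize(indices, codebook)
print(indices)        # discrete token sequence
print(reconstructed)  # quantized approximation of the frames
```

The discrete indices are the compact representation; a real system would learn the codebook jointly with an encoder and decoder rather than fixing it by hand as done here.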
Transfer Learning for Speech and Language Processing
Transfer learning is a vital technique that generalizes models trained for
one setting or task to other settings or tasks. For example, in speech
recognition, an acoustic model trained for one language can be used to
recognize speech in another language, with little or no re-training data.
Transfer learning is closely related to multi-task learning (cross-lingual vs.
multilingual), and is traditionally studied under the name of 'model adaptation'.
Recent advances in deep learning show that transfer learning becomes much
easier and more effective with high-level abstract features learned by deep
models, and the 'transfer' can be conducted not only between data distributions
and data types, but also between model structures (e.g., shallow nets and deep
nets) or even model types (e.g., Bayesian models and neural models). This
review paper summarizes some recent prominent research towards this direction,
particularly for speech and language processing. We also report some results
from our group and highlight the potential of this very interesting research
field.
Comment: 13 pages, APSIPA 201