A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for unparalleled advancements in automatic
speech recognition, text-to-speech synthesis, and
emotion recognition, propelling the performance of these tasks to unprecedented
heights. The power of deep learning techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches, such as MFCC and HMM, to
more recent advances in deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field.
Low-resource speech translation
We explore the task of speech-to-text translation (ST), where speech in one language
(source) is converted to text in a different one (target). Traditional ST systems go
through an intermediate step where the source language speech is first converted to
source language text using an automatic speech recognition (ASR) system, which
is then converted to target language text using a machine translation (MT) system.
However, this pipeline-based approach is impractical for unwritten languages spoken by
millions of people around the world, leaving them without access to free and automated
translation services such as Google Translate. The lack of such translation services can
have important real-world consequences. For example, in the aftermath of a disaster
scenario, easily available translation services can help better co-ordinate relief efforts.
How can we expand the coverage of automated ST systems to include scenarios which
lack source language text? In this thesis we investigate one possible solution: we
build ST systems to directly translate source language speech into target language text,
thereby forgoing the dependency on source language text. To build such a system, we
use only speech data paired with text translations as training data. We also specifically
focus on low-resource settings, where we expect at most tens of hours of training data
to be available for unwritten or endangered languages.
Our work can be broadly divided into three parts. First, we explore how we can leverage
prior work to build ST systems. We find that neural sequence-to-sequence models are
an effective and convenient method for ST, but produce poor quality translations when
trained in low-resource settings.
In the second part of this thesis, we explore methods that improve the translation performance
of our neural ST systems without labeling additional speech
data in the low-resource language, a potentially tedious and expensive process. Instead,
we exploit labeled speech data for high-resource languages, which is widely available
and relatively easy to obtain. We show that pretraining a neural model with ASR data
from a high-resource language, different from both the source and target ST languages,
improves ST performance.
In the final part of our thesis, we study whether ST systems can be used to build
applications which have traditionally relied on the availability of ASR systems, such
as information retrieval, clustering audio documents, or question answering. We build
proof-of-concept systems for two downstream applications: topic prediction for speech
and cross-lingual keyword spotting. Our results indicate that low-resource ST systems
can still outperform simple baselines for these tasks, leaving the door open for further
exploratory work.
This thesis provides, for the first time, an in-depth study of neural models for the
task of direct ST across a range of training data settings on a realistic multi-speaker
speech corpus. Our contributions include a set of open-source tools to encourage further
research.
Conditioning Text-to-Speech synthesis on dialect accent: a case study
Modern text-to-speech systems are modular in many different ways. In recent years, end-users have gained the ability to control speech attributes such as degree of emotion, rhythm and timbre, along with other suprasegmental features. More ambitious objectives involve modelling a combination of speakers and languages, e.g. to enable cross-speaker language transfer. However, no prior work has been done on the more fine-grained analysis of regional accents. To fill this gap, in this thesis we present practical end-to-end solutions to synthesise speech while controlling within-country variations of the same language, and we do so for six different dialects of the British Isles. In particular, we first conduct an extensive study of the speaker verification field and tweak state-of-the-art embedding models to work with dialect accents. Then, we adapt standard acoustic models and voice conversion systems by conditioning them on dialect accent representations, and finally compare our custom pipelines with a cutting-edge end-to-end architecture from the multi-lingual world. Results show that the adopted models are suitable and have enough capacity to accomplish the task of regional accent conversion. Indeed, we are able to produce speech closely resembling the selected speaker and dialect accent, and the most accurate synthesis is obtained via careful fine-tuning of the multi-lingual model to the multi-dialect case. Finally, we delineate limitations of our multi-stage approach and propose practical mitigations, to be explored in future work.
Natural Language Processing: Emerging Neural Approaches and Applications
This Special Issue highlights the most recent research being carried out in the NLP field and discusses the related open issues, with a particular focus on emerging approaches for language learning, understanding, production, and grounding, interactively or autonomously from data in cognitive and neural systems, and on their potential or real applications in different domains.