2,329 research outputs found
DCTX-Conformer: Dynamic context carry-over for low latency unified streaming and non-streaming Conformer ASR
Conformer-based end-to-end models have become ubiquitous and are commonly
used in both streaming and non-streaming automatic speech recognition (ASR).
Techniques such as dual-mode and dynamic chunk training have helped unify
streaming and non-streaming systems. However, a performance gap remains between
streaming with full past context and streaming with limited past context. To
address this issue,
we propose the integration of a novel dynamic contextual carry-over mechanism
in a state-of-the-art (SOTA) unified ASR system. Our proposed dynamic context
Conformer (DCTX-Conformer) utilizes a non-overlapping contextual carry-over
mechanism that takes into account both the left context of a chunk and one or
more preceding context embeddings. We outperform the SOTA by a 25.0% relative
word error rate reduction, with negligible latency impact from the additional
context embeddings.
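Since the abstract only sketches the mechanism, here is a minimal, hypothetical illustration (in PyTorch) of how a chunk-wise attention layer might carry context embeddings across chunks. All names and dimensions are assumptions, not the authors' implementation, and the real model additionally attends to the left-context frames of each chunk.

```python
# A minimal sketch of chunked attention with context carry-over.
# Hypothetical names and shapes; not the DCTX-Conformer implementation.
import torch
import torch.nn as nn

class ContextCarryOverAttention(nn.Module):
    """Chunk-wise self-attention in which each chunk also attends to
    context embeddings summarizing the preceding chunks."""

    def __init__(self, d_model: int, n_heads: int, n_ctx: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Learned initial context embeddings (one or more per chunk).
        self.init_ctx = nn.Parameter(torch.zeros(1, n_ctx, d_model))

    def forward(self, chunks):  # chunks: list of (B, T_chunk, d_model)
        outputs, ctx = [], None
        for x in chunks:
            b = x.size(0)
            c = self.init_ctx.expand(b, -1, -1) if ctx is None else ctx
            # Keys/values include the carried-over context plus the chunk.
            kv = torch.cat([c, x], dim=1)
            y, _ = self.attn(kv, kv, kv)
            # The updated context embeddings flow into the next chunk.
            ctx, out = y[:, : c.size(1)], y[:, c.size(1):]
            outputs.append(out)
        return torch.cat(outputs, dim=1)
```

Carrying a handful of embeddings forward approximates an unbounded left context at near-constant per-chunk cost, which is the latency argument the abstract makes.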
Features of hearing: applications of machine learning to uncover the building blocks of hearing
Recent advances in machine learning have instigated a renewed interest in using machine learning approaches to better understand human sensory processing. This line of research is particularly interesting for speech research, since speech comprehension is uniquely human, which complicates obtaining detailed neural recordings. In this thesis, I explore how machine learning can be used to uncover new knowledge about the auditory system, with a focus on discovering robust auditory features. The resulting increased understanding of the noise robustness of human hearing may help to better assist those with hearing loss and improve Automatic Speech Recognition (ASR) systems.

First, I show how computational neuroscience and machine learning can be combined to generate hypotheses about auditory features. I introduce a neural feature detection model with a modest number of parameters that is compatible with auditory physiology. By testing feature detector variants in a speech classification task, I confirm the importance of both well-studied and lesser-known auditory features.

Second, I investigate whether ASR software is a good candidate model of the human auditory system. By comparing several state-of-the-art ASR systems to the results from humans on a range of psychometric experiments, I show that these ASR systems diverge markedly from humans in at least some psychometric tests. This implies that none of these systems act as a strong proxy for human speech recognition, although some may be useful when asking more narrowly defined questions.

For neuroscientists, this thesis exemplifies how machine learning can be used to generate new hypotheses about human hearing, while also highlighting the caveats of investigating systems that may work fundamentally differently from the human brain. For machine learning engineers, I point to tangible directions for improving ASR systems. To motivate the continued cross-fertilization between these fields, a toolbox that allows researchers to assess new ASR systems has been released.
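As an illustration of the psychometric comparison described above, the sketch below measures an ASR system's word accuracy as a function of signal-to-noise ratio, the kind of curve that can be set against human listeners' sigmoidal psychometric functions. The `recognize` callable and all names are hypothetical stand-ins, not the thesis's released toolbox.

```python
# A minimal sketch of an SNR-sweep psychometric test for an ASR system.
import numpy as np

def add_noise(speech: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix white noise into a speech waveform at a target SNR (dB)."""
    noise = np.random.randn(len(speech))
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def psychometric_curve(speech, transcript, recognize, snrs_db):
    """Word accuracy at each SNR; `recognize` is any ASR under test."""
    accuracies = []
    for snr in snrs_db:
        hyp = recognize(add_noise(speech, snr))
        # Crude word-match proxy for accuracy; real studies score properly.
        hits = sum(w in hyp.split() for w in transcript.split())
        accuracies.append(hits / len(transcript.split()))
    return accuracies
```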
A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for unparalleled advances in automatic
speech recognition, text-to-speech synthesis, and emotion recognition,
propelling the performance of these tasks to unprecedented
heights. The power of deep learning techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches, such as MFCC and HMM, to
more recent advances in deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field.
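To make the review's starting point concrete, here is a minimal sketch of the classical MFCC front end it traces the field's evolution from, using librosa; the file path and parameter values are assumptions for illustration.

```python
# A minimal sketch of the classical MFCC front end (hypothetical inputs).
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)     # assumed audio file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
# Classical pipelines fed such frame-level features to HMM/GMM acoustic
# models; the deep architectures surveyed (CNNs, RNNs, transformers,
# conformers, diffusion models) instead learn representations from
# spectrograms or raw waveforms.
print(mfcc.shape)
```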
Radio2Text: Streaming Speech Recognition Using mmWave Radio Signals
Millimeter wave (mmWave) based speech recognition opens new possibilities
for audio-related applications, such as conference speech transcription and
eavesdropping. However, for practical deployment in real scenarios, latency
and recognizable vocabulary size are two critical factors that cannot be
overlooked. In this paper, we propose Radio2Text, the first mmWave-based system
for streaming automatic speech recognition (ASR) with a vocabulary size
exceeding 13,000 words. Radio2Text is based on a tailored streaming Transformer
that is capable of effectively learning representations of speech-related
features, paving the way for streaming ASR with a large vocabulary. To
compensate for a streaming network's inability to access future inputs in
their entirety, we propose Guidance Initialization, which transfers feature
knowledge of the global context from a non-streaming Transformer to the
tailored streaming Transformer through weight inheritance.
Further, we propose a cross-modal structure based on knowledge distillation
(KD), named cross-modal KD, to mitigate the negative effect of low-quality
mmWave signals on recognition performance. In cross-modal KD, the audio
streaming Transformer provides feature-level and response-level guidance that
carries rich, accurate speech information to supervise the training of the
tailored radio streaming Transformer. The experimental results show that
Radio2Text achieves a character error rate of 5.7% and a word error rate of
9.4% when recognizing a vocabulary of over 13,000 words.
Comment: Accepted by Proceedings of the ACM on Interactive, Mobile, Wearable
and Ubiquitous Technologies (ACM IMWUT/UbiComp 2023).
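As a rough illustration of the two techniques named in the abstract, the sketch below shows what weight inheritance and a feature-plus-response distillation loss could look like in PyTorch. The function names, loss choices, and temperature are hypothetical; the paper's actual losses and layer selections may differ.

```python
# A minimal sketch of cross-modal KD under assumed tensor shapes.
import torch.nn.functional as F

# Guidance Initialization as weight inheritance (hypothetical models):
#   streaming.load_state_dict(non_streaming.state_dict(), strict=False)

def cross_modal_kd_loss(student_feats, teacher_feats,
                        student_logits, teacher_logits,
                        temperature: float = 2.0, alpha: float = 0.5):
    """Feature guidance (L2 between hidden states) plus response guidance
    (KL divergence between temperature-softened output distributions)."""
    feat_loss = F.mse_loss(student_feats, teacher_feats.detach())
    resp_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * feat_loss + (1 - alpha) * resp_loss
```

Here the audio streaming Transformer plays the teacher (its features and logits are detached) and the radio streaming Transformer is the student, matching the direction of supervision described in the abstract.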