104 research outputs found
Attention-Inspired Artificial Neural Networks for Speech Processing: A Systematic Review
Artificial Neural Networks (ANNs) were created inspired by the neural networks in the human brain and have been widely applied in speech processing. The application areas of ANN include: Speech recognition, speech emotion recognition, language identification, speech enhancement, and speech separation, amongst others. Likewise, given that speech processing performed by humans involves complex cognitive processes known as auditory attention, there has been a growing amount of papers proposing ANNs supported by deep learning algorithms in conjunction with some mechanism to achieve symmetry with the human attention process. However, while these ANN approaches include attention, there is no categorization of attention integrated into the deep learning algorithms and their relation with human auditory attention. Therefore, we consider it necessary to have a review of the different ANN approaches inspired in attention to show both academic and industry experts the available models for a wide variety of applications. Based on the PRISMA methodology, we present a systematic review of the literature published since 2000, in which deep learning algorithms are applied to diverse problems related to speech processing. In this paper 133 research works are selected and the following aspects are described: (i) Most relevant features, (ii) ways in which attention has been implemented, (iii) their hypothetical relationship with human attention, and (iv) the evaluation metrics used. Additionally, the four publications most related with human attention were analyzed and their strengths and weaknesses were determined
Audio-visual fine-tuning of audio-only ASR models
Audio-visual automatic speech recognition (AV-ASR) models are very effective
at reducing word error rates on noisy speech, but require large amounts of
transcribed AV training data. Recently, audio-visual self-supervised learning
(SSL) approaches have been developed to reduce this dependence on transcribed
AV data, but these methods are quite complex and computationally expensive. In
this work, we propose replacing these expensive AV-SSL methods with a simple
and fast \textit{audio-only} SSL method, and then performing AV supervised
fine-tuning. We show that this approach is competitive with state-of-the-art
(SOTA) AV-SSL methods on the LRS3-TED benchmark task (within 0.5% absolute
WER), while being dramatically simpler and more efficient (12-30x faster to
pre-train). Furthermore, we show we can extend this approach to convert a SOTA
audio-only ASR model into an AV model. By doing so, we match SOTA AV-SSL
results, even though no AV data was used during pre-training
Make More of Your Data: Minimal Effort Data Augmentation for Automatic Speech Recognition and Translation
Data augmentation is a technique to generate new training data based on
existing data. We evaluate the simple and cost-effective method of
concatenating the original data examples to build new training instances.
Continued training with such augmented data is able to improve off-the-shelf
Transformer and Conformer models that were optimized on the original data only.
We demonstrate considerable improvements on the LibriSpeech-960h test sets (WER
2.83 and 6.87 for test-clean and test-other), which carry over to models
combined with shallow fusion (WER 2.55 and 6.27). Our method of continued
training also leads to improvements of up to 0.9 WER on the ASR part of
CoVoST-2 for four non English languages, and we observe that the gains are
highly dependent on the size of the original training data. We compare
different concatenation strategies and found that our method does not need
speaker information to achieve its improvements. Finally, we demonstrate on two
datasets that our methods also works for speech translation tasks
TASE: Task-Aware Speech Enhancement for Wake-Up Word Detection in Voice Assistants
Wake-up word spotting in noisy environments is a critical task for an excellent user experience with voice assistants. Unwanted activation of the device is often due to the presence of noises coming from background conversations, TVs, or other domestic appliances. In this work, we propose the use of a speech enhancement convolutional autoencoder, coupled with on-device keyword spotting, aimed at improving the trigger word detection in noisy environments. The end-to-end system learns by optimizing a linear combination of losses: a reconstruction-based loss, both at the log-mel spectrogram and at the waveform level, as well as a specific task loss that accounts for the cross-entropy error reported along the keyword spotting detection. We experiment with several neural network classifiers and report that deeply coupling the speech enhancement together with a wake-up word detector, e.g., by jointly training them, significantly improves the performance in the noisiest conditions. Additionally, we introduce a new publicly available speech database recorded for the Telefónica's voice assistant, Aura. The OK Aura Wake-up Word Dataset incorporates rich metadata, such as speaker demographics or room conditions, and comprises hard negative examples that were studiously selected to present different levels of phonetic similarity with respect to the trigger words 'OK Aura'. Keywords: speech enhancement; wake-up word; keyword spotting; deep learning; convolutional neural networ
- …