How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition
Audio-Visual Speech Recognition (AVSR) seeks to model, and thereby exploit,
the dynamic relationship between a human voice and the corresponding mouth
movements. A recently proposed multimodal fusion strategy, AV Align, based on
state-of-the-art sequence-to-sequence neural networks, attempts to model this
relationship by explicitly aligning the acoustic and visual representations of
speech. This study investigates the inner workings of AV Align and visualises
the audio-visual alignment patterns. Our experiments are performed on two of
the largest publicly available AVSR datasets, TCD-TIMIT and LRS2. We find that
AV Align learns to align acoustic and visual representations of speech at the
frame level on TCD-TIMIT in a generally monotonic pattern. We also identify why AV Align initially shows no improvement over audio-only speech recognition on the more challenging LRS2 dataset. We propose a regularisation method which involves
predicting lip-related Action Units from visual representations. Our
regularisation method leads to better exploitation of the visual modality, with
performance improvements between 7% and 30% depending on the noise level.
Furthermore, we show that the alternative Watch, Listen, Attend, and Spell
network is affected by the same problem as AV Align, and that our proposed
approach can effectively help it learn visual representations. Our findings
validate the suitability of the regularisation method to AVSR and encourage
researchers to rethink the multimodal convergence problem when one modality dominates.
Comment: in IEEE/ACM Transactions on Audio, Speech, and Language Processing (to appear).
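To make the regularisation idea concrete, the following is a minimal sketch of how an auxiliary Action Unit (AU) prediction head could be attached to a visual encoder and added as a weighted term to the recognition loss. It assumes a PyTorch setting; the module names, dimensions, and the weight lambda_au are illustrative and not the paper's exact implementation.

```python
# Minimal sketch of lip Action Unit (AU) regularisation for the visual branch.
# All names (VisualEncoderWithAU, num_action_units, lambda_au) are illustrative;
# this is not the authors' exact implementation.
import torch
import torch.nn as nn

class VisualEncoderWithAU(nn.Module):
    def __init__(self, feat_dim=256, hidden_dim=256, num_action_units=8):
        super().__init__()
        # Stand-in visual encoder: a single LSTM over per-frame lip features.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Auxiliary head: predict lip-related AU activations per video frame.
        self.au_head = nn.Linear(hidden_dim, num_action_units)

    def forward(self, video_feats):
        # video_feats: (batch, time, feat_dim)
        enc, _ = self.encoder(video_feats)
        au_logits = self.au_head(enc)  # (batch, time, num_action_units)
        return enc, au_logits

def total_loss(asr_loss, au_logits, au_targets, lambda_au=0.1):
    # AU targets are treated here as per-frame binary activations; the
    # regulariser simply adds a weighted auxiliary term to the ASR loss.
    au_loss = nn.functional.binary_cross_entropy_with_logits(au_logits, au_targets)
    return asr_loss + lambda_au * au_loss
```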
Should we hard-code the recurrence concept or learn it instead? Exploring the Transformer architecture for Audio-Visual Speech Recognition
The audio-visual speech fusion strategy AV Align has shown significant performance improvements in audio-visual speech recognition (AVSR) on the challenging LRS2 dataset, with gains between 7% and 30% depending on the noise level when the visual modality of speech is leveraged in addition to the auditory one. This work presents a variant of AV Align where
the recurrent Long Short-term Memory (LSTM) computation block is replaced by
the more recently proposed Transformer block. We compare the two methods,
discussing in greater detail their strengths and weaknesses. We find that
Transformers also learn cross-modal monotonic alignments, but suffer from the
same visual convergence problems as the LSTM model, calling for a deeper
investigation into the dominant modality problem in machine learning.
Comment: Submitted to INTERSPEECH 2020
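As a rough illustration of the cross-modal alignment described above, the sketch below uses a single Transformer-style attention block in which acoustic frames act as queries over visual frames; the returned attention weights are the audio-visual alignments that can be visualised. Layer sizes and the surrounding encoder stack are assumptions, not the paper's configuration.

```python
# Minimal sketch of cross-modal alignment with a Transformer-style attention
# block: audio frames attend over video frames. Dimensions are illustrative.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, video):
        # audio: (batch, T_audio, dim), video: (batch, T_video, dim).
        # Each acoustic frame attends to all video frames; the attention
        # weights form the audio-visual alignment pattern.
        fused, align = self.attn(query=audio, key=video, value=video)
        return self.norm(audio + fused), align  # residual connection

audio = torch.randn(2, 100, 256)   # e.g. 100 acoustic frames
video = torch.randn(2, 25, 256)    # e.g. 25 video frames
fused, alignment = CrossModalAttention()(audio, video)
print(alignment.shape)             # (2, 100, 25): per-frame alignment weights
```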
End-to-end Audio-visual Speech Recognition with Conformers
In this work, we present a hybrid CTC/Attention model based on a ResNet-18
and Convolution-augmented Transformer (Conformer), which can be trained in an
end-to-end manner. In particular, the audio and visual encoders learn to extract features directly from audio waveforms and raw pixels, respectively; these features are then fed to Conformer blocks, and fusion takes place via a multi-layer perceptron (MLP). The model learns to recognise characters using a combination
of CTC and an attention mechanism. We show that end-to-end training (instead of using pre-computed visual features, as is common in the literature), the use of a Conformer (instead of a recurrent network), and the use of a Transformer-based language model significantly improve the performance of our
model. We present results on the largest publicly available datasets for sentence-level speech recognition, Lip Reading Sentences 2 (LRS2) and Lip Reading Sentences 3 (LRS3). The results show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
Comment: Accepted to ICASSP 2021
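The fusion step can be pictured with the short sketch below, where per-frame audio and visual encoder outputs are concatenated and passed through a small MLP. The dimensions and the two-layer MLP are assumptions; the Conformer encoders and the hybrid CTC/attention heads are omitted for brevity.

```python
# Minimal sketch of MLP fusion of per-frame audio and visual encoder outputs.
# The actual Conformer encoders are replaced by random placeholders here.
import torch
import torch.nn as nn

class MLPFusion(nn.Module):
    def __init__(self, audio_dim=256, video_dim=256, fused_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim + video_dim, fused_dim),
            nn.ReLU(),
            nn.Linear(fused_dim, fused_dim),
        )

    def forward(self, audio_enc, video_enc):
        # audio_enc, video_enc: (batch, time, dim); assumed time-synchronised
        # (video features upsampled to the acoustic frame rate beforehand).
        return self.mlp(torch.cat([audio_enc, video_enc], dim=-1))

# The fused features would then feed a CTC head and an attention decoder,
# trained with a weighted sum of the two losses (hybrid CTC/attention).
fused = MLPFusion()(torch.randn(2, 80, 256), torch.randn(2, 80, 256))
print(fused.shape)  # (2, 80, 256)
```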
Learning to Count Words in Fluent Speech enables Online Speech Recognition
Sequence-to-sequence models, in particular the Transformer, achieve state-of-the-art results in automatic speech recognition. Practical usage is, however, limited to cases where full-utterance latency is acceptable. In this work we
introduce Taris, a Transformer-based online speech recognition system aided by
an auxiliary task of incremental word counting. We use the cumulative word sum
to dynamically segment speech and enable its eager decoding into words.
Experiments performed on the LRS2, LibriSpeech, and Aishell-1 datasets of
English and Mandarin speech show that the online system performs comparably to the offline one with a dynamic algorithmic delay of 5 segments.
Furthermore, we show that the estimated segment length distribution resembles
the word length distribution obtained with forced alignment, although our
system does not require an exact segment-to-word equivalence. Taris introduces
a negligible overhead compared to a standard Transformer, while the local
relationship modelling between inputs and outputs grants invariance to sequence
length by design.
Comment: Accepted at the 8th IEEE Spoken Language Technology Workshop (SLT 2021).
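To illustrate the word-counting idea, the sketch below shows a per-frame head whose non-negative increments are accumulated into a running word count, with a segment boundary emitted whenever the count crosses the next integer. All module names and thresholds are illustrative rather than Taris's exact formulation.

```python
# Minimal sketch of cumulative word counting used for dynamic segmentation.
# Names (WordCounter, segment_boundaries) are illustrative placeholders.
import torch
import torch.nn as nn

class WordCounter(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Predict a non-negative per-frame increment to the word count.
        self.head = nn.Sequential(nn.Linear(dim, 1), nn.Softplus())

    def forward(self, encoder_states):
        # encoder_states: (batch, time, dim)
        increments = self.head(encoder_states).squeeze(-1)  # (batch, time)
        return increments.cumsum(dim=1)                      # cumulative word sum

def segment_boundaries(cumulative_counts):
    # A frame ends a segment whenever the cumulative count passes an integer,
    # allowing eager decoding of the words accumulated so far.
    return cumulative_counts.floor().diff(dim=1) >= 1  # (batch, time-1) mask

counts = WordCounter()(torch.randn(1, 50, 256))
print(segment_boundaries(counts).sum().item(), "segment boundaries detected")
```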