The listening talker: A review of human and algorithmic context-induced modifications of speech
Speech output technology is finding widespread application, including in scenarios where intelligibility might be compromised - at least for some listeners - by adverse conditions. Unlike most current algorithms, talkers continually adapt their speech patterns as a response to the immediate context of spoken communication, where the type of interlocutor and the environment are the dominant situational factors influencing speech production. Observations of talker behaviour can motivate the design of more robust speech output algorithms. Starting with a listener-oriented categorisation of possible goals for speech modification, this review article summarises the extensive set of behavioural findings related to human speech modification, identifies which factors appear to be beneficial, and goes on to examine previous computational attempts to improve intelligibility in noise. The review concludes by tabulating 46 speech modifications, many of which have yet to be perceptually or algorithmically evaluated. Consequently, the review provides a roadmap for future work in improving the robustness of speech output.
Investigating the Lombard Effect Influence on End-to-End Audio-Visual Speech Recognition
Several audio-visual speech recognition models have been recently proposed
which aim to improve the robustness over audio-only models in the presence of
noise. However, almost all of them ignore the impact of the Lombard effect,
i.e., the change in speaking style in noisy environments which aims to make
speech more intelligible and affects both the acoustic characteristics of
speech and the lip movements. In this paper, we investigate the impact of the
Lombard effect on audio-visual speech recognition. To the best of our
knowledge, this is the first work to do so using end-to-end deep
architectures and to present results on unseen speakers. Our results show that
properly modelling Lombard speech is always beneficial: even when a relatively small amount of Lombard speech is added to the training set, performance in a real scenario, where noisy Lombard speech is present, can be significantly improved. We also show that the standard approach followed in the literature, where a model is trained and tested on noisy plain speech, provides a correct estimate of the video-only performance and slightly underestimates the audio-visual performance. In the case of audio-only approaches, performance is overestimated for SNRs higher than -3 dB and underestimated for lower SNRs.
Comment: Accepted for publication at Interspeech 201
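The evaluation above compares models across SNR conditions, which requires mixing speech and noise at a controlled level. A minimal sketch of SNR-controlled mixing, assuming NumPy arrays of equal length (the function name and interface are illustrative, not from the paper):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then mix."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    noise_scaled = noise * np.sqrt(target_p_noise / p_noise)
    return speech + noise_scaled
```

For example, mixing at -3 dB (the crossover point reported above) yields a signal whose noise component carries twice the power of the speech.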
The impact of automatic exaggeration of the visual articulatory features of a talker on the intelligibility of spectrally distorted speech
Visual speech information plays a key role in supporting speech perception, especially
when acoustic features are distorted or inaccessible. Recent research suggests that for
spectrally distorted speech, the use of visual speech in auditory training improves not only
subjects’ audiovisual speech recognition, but also their subsequent auditory-only speech
recognition. Visual speech cues, however, can be affected by a number of facial visual signals
that vary across talkers, such as lip emphasis and speaking style. In a previous study, we
enhanced the visual speech videos used in perception training by automatically tracking
and colouring a talker’s lips. This improved the subjects’ audiovisual and subsequent
auditory speech recognition compared with those who were trained via unmodified videos or
audio-only methods. In this paper, we report on two issues related to automatic exaggeration of the movement of the lips/mouth area. First, we investigate subjects’ ability to adapt to the conflict between the articulation energy in the visual signals and the vocal effort in the acoustic signals (since the acoustic signals remain unexaggerated). Second, we examine whether this visual exaggeration can improve subjects’ auditory and audiovisual speech recognition when used in perception training. To test this concept, we used spectrally
distorted speech to train groups of listeners using four different training regimes: (1) audio
only, (2) audiovisual, (3) audiovisual visually exaggerated, and (4) audiovisual visually
exaggerated and lip-coloured. We used spectrally distorted speech (cochlear-implant-simulated
speech) because the longer-term aim of our work is to employ these concepts in a training
system for cochlear-implant (CI) users.
The results suggest that after exposure to visually exaggerated speech, listeners were able to adapt to the conflicting audiovisual signals. In addition, subjects trained with enhanced visual cues (regimes 3 and 4) achieved better audiovisual recognition for a number of phoneme classes than those trained with unmodified visual speech (regime 2). There was, however, no evidence of an improvement in subsequent audio-only listening skills. The subjects’ adaptation to the conflicting audiovisual signals may have slowed auditory perceptual learning and impeded the ability of the visual speech to improve the training gains.
Deep audio-visual speech recognition
Decades of research in acoustic speech recognition have led to systems that we use in our everyday life. However, even the most advanced speech recognition systems fail in the presence of noise. The degraded performance can be compensated for by introducing visual speech information. However, Visual Speech Recognition (VSR) in naturalistic conditions is very challenging, in part due to the lack of suitable architectures and annotated data.
This thesis contributes to the problem of Audio-Visual Speech Recognition (AVSR) from different aspects. Firstly, we develop AVSR models for isolated words. In contrast to previous state-of-the-art methods, which consist of a two-step approach of feature extraction and recognition, we present an End-to-End (E2E) approach inside a deep neural network, and this has led to significant improvements in audio-only, visual-only and audio-visual experiments. We further replace the Bi-directional Gated Recurrent Unit (BGRU) layers with Temporal Convolutional Networks (TCNs) to greatly simplify the training procedure.
Secondly, we extend our AVSR model to continuous speech by presenting a hybrid Connectionist Temporal Classification (CTC)/Attention model that can be trained in an end-to-end manner. We then propose the addition of prediction-based auxiliary tasks to a VSR model and highlight the importance of hyper-parameter optimisation and appropriate data augmentation.
Next, we present a self-supervised framework, Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech, and find that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading.
We also investigate the influence of the Lombard effect in an end-to-end AVSR system; this is the first work to do so using end-to-end deep architectures and to present results on unseen speakers. We show that even when a relatively small amount of Lombard speech is added to the training set, performance in a real scenario, where noisy Lombard speech is present, can be significantly improved.
Lastly, we propose a detection method against adversarial examples in an AVSR system that leverages the strong correlation between the audio and visual streams. A synchronisation confidence score serves as a proxy for audio-visual correlation, and on this basis we can detect adversarial attacks. We apply recent adversarial attacks to two AVSR models, and the experimental results demonstrate that the proposed approach is an effective way to detect such attacks.
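The detection idea can be illustrated with a simplified stand-in for the synchronisation confidence score: mean cosine similarity between frame-level audio and video embeddings, thresholded to flag inputs whose streams no longer agree. The embeddings, threshold, and function names below are hypothetical, not the thesis's actual model:

```python
import numpy as np

def sync_confidence(audio_emb, video_emb):
    """Mean per-frame cosine similarity between audio and video embeddings.

    Both inputs are (n_frames, dim) arrays; a high score indicates the two
    streams are well synchronised (a proxy for audio-visual correlation).
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * v, axis=1)))

def is_adversarial(audio_emb, video_emb, threshold=0.5):
    """Flag the input as adversarial when the confidence falls below a threshold."""
    return sync_confidence(audio_emb, video_emb) < threshold
```

An attack perturbing only one modality tends to lower the cross-modal agreement, which is what the threshold test exploits; the threshold value here is arbitrary.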
Visual Speech Enhancement and its Application in Speech Perception Training
This thesis investigates methods for visual speech enhancement to support auditory and audiovisual speech perception. Normal-hearing non-native listeners receiving cochlear implant (CI) simulated speech are used as ‘proxy’ listeners for CI users, a proposed user group who could benefit from such enhancement methods in speech perception training. Both CI users and non-native listeners share similarities with regards to audiovisual speech perception, including increased sensitivity to visual speech cues.
Two enhancement methods are proposed: (i) an appearance-based method, which modifies the appearance of a talker’s lips using colour and luminance blending to apply a ‘lipstick effect’ that increases the saliency of mouth shapes; and (ii) a kinematics-based method, which amplifies the kinematics of the talker’s mouth to create the effect of more pronounced speech (an ‘exaggeration effect’). The application used to test the enhancements is speech perception training, or audiovisual training, which can be used to improve listening skills.
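The colour-and-luminance blending behind the ‘lipstick effect’ can be sketched as an alpha blend of a target colour into a lip-region mask (which the thesis obtains by automatic lip tracking). The colour, blending weight, and function name below are illustrative assumptions:

```python
import numpy as np

def apply_lipstick_effect(frame, lip_mask, colour=(180, 30, 60), alpha=0.5):
    """Alpha-blend `colour` into the pixels selected by the boolean `lip_mask`.

    `frame` is an H x W x 3 uint8 image; `lip_mask` is an H x W boolean array.
    Returns a new image with the lip region tinted, leaving other pixels intact.
    """
    out = frame.astype(np.float32)
    target = np.array(colour, dtype=np.float32)
    out[lip_mask] = (1 - alpha) * out[lip_mask] + alpha * target
    return out.astype(np.uint8)
```

Applied per video frame, this increases the chromatic contrast between the lips and the surrounding skin, which is the saliency effect described above.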
An audiovisual training framework is presented which structures the evaluation of the effectiveness of these methods. It is used in two studies. The first study, which evaluates the effectiveness of the lipstick effect, found a significant improvement in audiovisual and auditory perception. The second study, which evaluates the effectiveness of the exaggeration effect, found improvement in the audiovisual perception of a number of phoneme classes; no evidence was found of improvements in the subsequent auditory perception, as audiovisual recalibration to visually exaggerated speech may have impeded learning when used in the audiovisual training.
The thesis also investigates an example of kinematics-based enhancement observed in Lombard speech, by studying the behaviour of visual Lombard phonemes in different contexts. Owing to the lack of suitable datasets for this analysis, the thesis presents a novel audiovisual Lombard speech dataset recorded under high SNR, which offers two fixed head-pose, synchronised views of each talker.
Individual and environment-related acoustic-phonetic strategies for communicating in adverse conditions
In many situations it is necessary to produce speech in ‘adverse conditions’: that is, conditions that make speech communication difficult. Research has demonstrated that speaker strategies, as described by a range of acoustic-phonetic measures, can vary both at the individual level and according to the environment, and are argued to facilitate communication. There has been debate as to the environmental specificity of these adaptations and their effectiveness in overcoming communication difficulty. Furthermore, the manner and extent to which adaptation strategies differ between individuals is not yet well understood. This thesis presents three studies that explore the acoustic-phonetic adaptations of speakers in noisy and degraded communication conditions and their relationship with intelligibility. Study 1 investigated the effects of temporally fluctuating maskers on global acoustic-phonetic measures associated with speech in noise (Lombard speech). The results replicated findings of increased power in the modulation spectrum of Lombard speech, but showed little evidence of adaptation to masker fluctuations via the temporal envelope. Study 2 collected a larger corpus of semi-spontaneous communicative speech produced in noise and under other degradations perturbing specific acoustic dimensions. Speakers showed different adaptations across the environments, likely suited to overcoming noise (steady and temporally fluctuating), spectral and pitch information restricted by a noise-excited vocoder, and a sensorineural hearing loss simulation. Analyses of inter-speaker variation in studies 1 and 2 showed that behaviour was highly variable, and some strategy combinations were identified.
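The modulation-spectrum measure used in Study 1 can be approximated with a crude estimator: full-wave rectify the signal to obtain an amplitude envelope, then take the FFT magnitude of that envelope at low modulation frequencies. This is a simplified sketch (analyses of this kind typically use band-filtered Hilbert envelopes, and the cutoff below is an assumption):

```python
import numpy as np

def modulation_spectrum(signal, fs, env_cutoff_hz=32.0):
    """Crude modulation spectrum: rectified amplitude envelope -> FFT magnitude.

    Returns modulation frequencies (Hz) and envelope-spectrum magnitudes
    up to `env_cutoff_hz`, the range where speech envelope energy lies.
    """
    env = np.abs(signal)          # full-wave-rectified amplitude envelope
    env = env - np.mean(env)      # remove the DC component before the FFT
    spec = np.abs(np.fft.rfft(env))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fs)
    keep = freqs <= env_cutoff_hz
    return freqs[keep], spec[keep]
```

A signal amplitude-modulated at a given rate shows a peak at that modulation frequency, which is how increased Lombard modulation power would register under this measure.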
Study 3 investigated the intelligibility of strategies ‘tailored’ to specific environments and the relationship between intelligibility and speaker acoustics, finding a benefit of tailored speech adaptations and discussing the potential roles of speaker flexibility, adaptation level, and intrinsic intelligibility. The overall results are discussed in relation to models of communication in adverse conditions, and a model accounting for individual variability in these conditions is proposed.
Acoustic-phonetic characteristics of speech produced with communicative intent to counter adverse listening conditions
This study investigated whether speech produced in spontaneous interactions when addressing a talker experiencing actual challenging conditions differs in acoustic-phonetic characteristics from speech produced: (a) with communicative intent under more ideal conditions, and (b) without communicative intent under imaginary challenging conditions (read, clear speech). It also investigated whether acoustic-phonetic modifications made to counteract the effects of a challenging listening condition are tailored to the condition under which communication occurs. Forty talkers were recorded in pairs while engaged in ‘spot the difference’ picture tasks in good and challenging conditions. In the challenging conditions, one talker heard the other: (1) via a three-channel noise vocoder (VOC); or (2) with simultaneous babble noise (BABBLE). Read, clear speech showed more extreme changes in median F0, F0 range and speaking rate than speech produced to counter the effects of a challenging listening condition. In the VOC condition, where F0 and intensity enhancements are unlikely to aid intelligibility, talkers did not change their F0 median and range; mean energy and vowel F1 increased less than in the BABBLE condition. This suggests that speech production is listener-focused, and that talkers modulate their speech according to their interlocutors’ needs, even when not directly experiencing the challenging listening condition.
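The VOC condition's noise-excited channel vocoder works by splitting speech into a few frequency bands, extracting each band's amplitude envelope, and using it to modulate band-limited noise, which discards F0 and fine spectral detail while preserving envelope cues. A toy sketch under stated assumptions (band edges, envelope window, and FFT-based filtering are illustrative, not the study's implementation):

```python
import numpy as np

def noise_vocode(signal, fs, n_channels=3, f_lo=100.0, f_hi=4000.0, env_win=0.01):
    """Toy noise-excited channel vocoder.

    Splits `signal` into `n_channels` log-spaced bands, extracts each band's
    amplitude envelope (rectify + moving average), and uses the envelope to
    modulate band-limited noise. The modulated bands are summed to produce
    speech with envelope cues but no pitch or fine spectral structure.
    """
    rng = np.random.default_rng(0)
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    kernel = np.ones(max(1, int(env_win * fs)))
    kernel /= kernel.size
    out = np.zeros(len(signal))
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_mask = (freqs >= lo) & (freqs < hi)
        band = np.fft.irfft(spec * band_mask, n=len(signal))
        env = np.convolve(np.abs(band), kernel, mode="same")
        noise = rng.standard_normal(len(signal))
        noise_band = np.fft.irfft(np.fft.rfft(noise) * band_mask, n=len(signal))
        out += env * noise_band
    return out
```

With only three channels, F0 and intensity detail are largely destroyed, which is why the study predicts F0 and energy enhancements would not help the VOC listener.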