Improved Speech Reconstruction from Silent Video
Speechreading is the task of inferring phonetic information from visually
observed articulatory facial movements, and is a notoriously difficult task for
humans to perform. In this paper we present an end-to-end model based on a
convolutional neural network (CNN) for generating an intelligible and
natural-sounding acoustic speech signal from silent video frames of a speaking
person. We train our model on speakers from the GRID and TCD-TIMIT datasets,
and evaluate the quality and intelligibility of reconstructed speech using
common objective measurements. We show that speech predictions from the
proposed model attain scores which indicate significantly improved quality over
existing models. In addition, we show promising results towards reconstructing
speech from an unconstrained dictionary.
Comment: Accepted to ICCV 2017 Workshop on Computer Vision for Audio-Visual Media. Supplementary video: https://www.youtube.com/watch?v=Xjbn7h7tpg0. arXiv admin note: text overlap with arXiv:1701.0049
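To make the pipeline concrete, here is a minimal PyTorch sketch of the kind of CNN such a model could use: a window of K consecutive grayscale face frames, stacked as channels, is mapped to one acoustic feature vector per window. The layer sizes, window length and feature dimension are illustrative assumptions, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class VideoToAcousticCNN(nn.Module):
    """Maps a stack of K grayscale face frames to one acoustic feature vector.
    Hypothetical sketch: K, the channel widths and the output dimension are
    placeholders, not the configuration used in the paper."""
    def __init__(self, k_frames: int = 9, feat_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(k_frames, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),        # global pooling -> (B, 128, 1, 1)
        )
        self.head = nn.Linear(128, feat_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, K, H, W) -- K consecutive frames stacked as channels
        return self.head(self.encoder(frames).flatten(1))

model = VideoToAcousticCNN()
windows = torch.randn(2, 9, 128, 128)   # two windows of nine 128x128 frames
print(model(windows).shape)             # torch.Size([2, 128])
```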
Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks
Speech is a rich biometric signal that contains information about the
identity, gender and emotional state of the speaker. In this work, we explore
its potential to generate face images of a speaker by conditioning a Generative
Adversarial Network (GAN) with raw speech input. We propose a deep neural
network that is trained from scratch in an end-to-end fashion, generating a
face directly from the raw speech waveform without any additional identity
information (e.g., a reference image or one-hot encoding). Our model is trained
in a self-supervised manner by exploiting the audio and visual signals naturally
aligned in videos. With the purpose of training from video data, we present a
novel dataset collected for this work, with high-quality videos of YouTubers with notable expressiveness in both the speech and visual signals.
Comment: ICASSP 2019. Project website at https://imatge-upc.github.io/wav2pix
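A hedged sketch of the speech-conditioned generator idea described above: a 1D convolutional encoder compresses a raw waveform into an embedding, and a transposed-convolution decoder renders an RGB face from it. All dimensions and layer choices below are illustrative assumptions; this is not the Wav2Pix architecture.

```python
import torch
import torch.nn as nn

class SpeechToFaceGenerator(nn.Module):
    """Hypothetical sketch of a speech-conditioned GAN generator: a 1D conv
    encoder turns a raw waveform into an embedding, and a transposed-conv
    decoder renders a 64x64 RGB face. Dimensions are illustrative only."""
    def __init__(self, emb_dim: int = 128):
        super().__init__()
        self.audio_enc = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=64, stride=16), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=16, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.project = nn.Linear(64, emb_dim)
        self.decoder = nn.Sequential(        # (B, emb_dim, 1, 1) -> (B, 3, 64, 64)
            nn.ConvTranspose2d(emb_dim, 256, 4, 1, 0), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, 1, samples) raw speech waveform
        z = self.project(self.audio_enc(wav).squeeze(-1))
        return self.decoder(z[:, :, None, None])

g = SpeechToFaceGenerator()
print(g(torch.randn(2, 1, 16000)).shape)  # torch.Size([2, 3, 64, 64])
```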
Dynamic Temporal Alignment of Speech to Lips
Many speech segments in movies are re-recorded in a studio during
postproduction, to compensate for poor sound quality as recorded on location.
Manual alignment of the newly-recorded speech with the original lip movements
is a tedious task. We present an audio-to-video alignment method for automating
speech-to-lips alignment, stretching and compressing the audio signal to match
the lip movements. This alignment is based on deep audio-visual features,
mapping the lips video and the speech signal to a shared representation. Using this shared representation, we compute the lip-sync error between every short speech period and every video frame, and then determine the optimal corresponding frame for each short sound period over the entire video clip. We demonstrate successful alignment both quantitatively, using a human perception-inspired metric, and qualitatively. The strongest advantage of our audio-to-video approach is in cases where the original voice is unclear, and where a constant shift of the sound cannot give a perfect alignment. In these cases, state-of-the-art methods will fail.
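The alignment stage can be illustrated with a small NumPy sketch. It assumes the audio periods and video frames have already been embedded into the shared audio-visual space (the embedding networks are not shown), builds the pairwise lip-sync error matrix, and recovers a monotonic minimal-cost path by dynamic time warping; the distance measure and the DP formulation are stand-ins, not the paper's exact procedure.

```python
import numpy as np

def align_audio_to_video(a_emb: np.ndarray, v_emb: np.ndarray) -> list[tuple[int, int]]:
    """Hypothetical alignment sketch. a_emb: (Ta, D) embeddings of short audio
    periods; v_emb: (Tv, D) embeddings of video frames, both assumed to live
    in a shared audio-visual space. Returns a monotonic (audio, video) index
    path minimising the accumulated lip-sync error."""
    # Pairwise lip-sync error between every audio period and video frame.
    cost = np.linalg.norm(a_emb[:, None, :] - v_emb[None, :, :], axis=-1)
    ta, tv = cost.shape
    acc = np.full((ta + 1, tv + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tv + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j - 1],  # keep pace
                                                 acc[i - 1, j],      # compress audio
                                                 acc[i, j - 1])      # stretch audio
    # Backtrack the optimal monotonic path.
    path, i, j = [], ta, tv
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy usage with random embeddings in a shared 32-dim space.
rng = np.random.default_rng(0)
print(align_audio_to_video(rng.normal(size=(50, 32)),
                           rng.normal(size=(40, 32)))[:5])
```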
MobiVSR: A Visual Speech Recognition Solution for Mobile Devices
Visual speech recognition (VSR) is the task of recognizing spoken language
from video input only, without any audio. VSR has many applications as an
assistive technology, especially if it could be deployed in mobile devices and
embedded systems. The need for intensive computational resources and a large memory footprint are two of the major obstacles to developing neural network models for VSR in resource-constrained environments. We propose a novel end-to-end deep neural network architecture for word-level VSR called MobiVSR, with a design parameter that aids in balancing the model's accuracy and
parameter count. We use depthwise-separable 3D convolution for the first time
in the domain of VSR and show how it makes our model efficient. MobiVSR
achieves an accuracy of 73% on the challenging Lip Reading in the Wild dataset with 6 times fewer parameters and a 20 times smaller memory footprint than the current state of the art. MobiVSR can also be compressed to 6 MB by applying post-training quantization.
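Depthwise-separable 3D convolution, the efficiency trick named above, is easy to show in PyTorch: a per-channel spatio-temporal convolution followed by a 1x1x1 pointwise convolution that mixes channels. The channel counts below are arbitrary; this is a sketch of the building block, not MobiVSR itself.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv3d(nn.Module):
    """Sketch of a depthwise-separable 3D convolution: a per-channel
    (depthwise) spatio-temporal convolution followed by a 1x1x1 pointwise
    convolution that mixes channels. Parameter count drops from roughly
    C_in*C_out*k^3 to C_in*k^3 + C_in*C_out."""
    def __init__(self, c_in: int, c_out: int, k: int = 3):
        super().__init__()
        self.depthwise = nn.Conv3d(c_in, c_in, kernel_size=k,
                                   padding=k // 2, groups=c_in)
        self.pointwise = nn.Conv3d(c_in, c_out, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        return self.pointwise(self.depthwise(x))

layer = DepthwiseSeparableConv3d(16, 32)
video = torch.randn(1, 16, 8, 32, 32)     # 8 frames of 32x32, 16 channels
print(layer(video).shape)                 # torch.Size([1, 32, 8, 32, 32])
std = nn.Conv3d(16, 32, 3, padding=1)
print(sum(p.numel() for p in layer.parameters()),
      sum(p.numel() for p in std.parameters()))  # far fewer parameters
```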
Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
We present a joint audio-visual model for isolating a single speech signal
from a mixture of sounds such as other speakers and background noise. Solving
this task using only audio as input is extremely challenging and does not
provide an association of the separated speech signals with speakers in the
video. In this paper, we present a deep network-based model that incorporates
both visual and auditory signals to solve this task. The visual features are
used to "focus" the audio on desired speakers in a scene and to improve the
speech separation quality. To train our joint audio-visual model, we introduce
AVSpeech, a new dataset comprising thousands of hours of video segments from
the Web. We demonstrate the applicability of our method to classic speech
separation tasks, as well as real-world scenarios involving heated interviews,
noisy bars, and screaming children, only requiring the user to specify the face
of the person in the video whose speech they want to isolate. Our method shows
clear advantage over state-of-the-art audio-only speech separation in cases of
mixed speech. In addition, our model, which is speaker-independent (trained
once, applicable to any speaker), produces better results than recent
audio-visual speech separation methods that are speaker-dependent (require
training a separate model for each speaker of interest).
Comment: Accepted to SIGGRAPH 2018. Project webpage: https://looking-to-listen.github.i
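A minimal sketch of the fusion idea: per-frame face embeddings (from some visual front-end, not shown) are concatenated with audio features, and a recurrent head predicts a time-frequency mask for the chosen speaker. The paper predicts complex masks from a more elaborate network; this simplified version uses a magnitude mask, and all shapes are assumptions.

```python
import torch
import torch.nn as nn

class AVSeparator(nn.Module):
    """Hypothetical sketch of joint audio-visual separation: face embeddings
    for the chosen speaker are concatenated with spectrogram features, and a
    BiLSTM head predicts a [0, 1] time-frequency mask that isolates that
    speaker. Shapes and layer sizes are illustrative."""
    def __init__(self, n_freq: int = 257, face_dim: int = 256, hidden: int = 256):
        super().__init__()
        self.audio_proj = nn.Linear(n_freq, hidden)
        self.face_proj = nn.Linear(face_dim, hidden)
        self.rnn = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.mask_head = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())

    def forward(self, spec: torch.Tensor, face: torch.Tensor) -> torch.Tensor:
        # spec: (B, T, F) mixture magnitude spectrogram
        # face: (B, T, face_dim) face embeddings, already resampled to T steps
        fused = torch.cat([self.audio_proj(spec), self.face_proj(face)], dim=-1)
        h, _ = self.rnn(fused)
        mask = self.mask_head(h)          # (B, T, F) in [0, 1]
        return mask * spec                # masked spectrogram of target speaker

sep = AVSeparator()
out = sep(torch.rand(2, 100, 257), torch.randn(2, 100, 256))
print(out.shape)  # torch.Size([2, 100, 257])
```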
Video-Driven Speech Reconstruction using Generative Adversarial Networks
Speech is a means of communication which relies on both audio and visual
information. The absence of one modality can often lead to confusion or
misinterpretation of information. In this paper we present an end-to-end
temporal model capable of directly synthesising audio from silent video,
without needing to transform to-and-from intermediate features. Our proposed approach, based on GANs, is capable of producing natural-sounding, intelligible speech which is synchronised with the video. The performance of our model is evaluated on the GRID dataset for both speaker-dependent and speaker-independent scenarios. To the best of our knowledge, this is the first method
that maps video directly to raw audio and the first to produce intelligible
speech when tested on previously unseen speakers. We evaluate the synthesised
audio not only based on the sound quality but also on the accuracy of the
spoken words.
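A hedged PyTorch sketch of a generator that maps silent video directly to raw audio, in the spirit of the approach above: a per-frame encoder, a GRU for temporal context, and transposed 1D convolutions that upsample each frame vector to its share of waveform samples (here 640 samples per frame, i.e. 16 kHz audio at 25 fps). The sizes are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class VideoToWaveformGenerator(nn.Module):
    """Hypothetical sketch of a video-to-raw-audio generator: a per-frame
    encoder produces one vector per frame, a GRU adds temporal context, and
    transposed 1D convolutions upsample each frame vector to 640 waveform
    samples. All dimensions are placeholders."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.frame_enc = nn.Sequential(
            nn.Conv2d(1, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.gru = nn.GRU(64, hidden, batch_first=True)
        self.upsample = nn.Sequential(   # T steps -> 640*T samples (8*8*10)
            nn.ConvTranspose1d(hidden, 64, kernel_size=16, stride=8, padding=4), nn.ReLU(),
            nn.ConvTranspose1d(64, 32, kernel_size=16, stride=8, padding=4), nn.ReLU(),
            nn.ConvTranspose1d(32, 1, kernel_size=20, stride=10, padding=5), nn.Tanh(),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, T, 1, H, W) grayscale mouth crops
        b, t = video.shape[:2]
        f = self.frame_enc(video.flatten(0, 1)).view(b, t, -1)  # (B, T, 64)
        h, _ = self.gru(f)                                      # (B, T, hidden)
        return self.upsample(h.transpose(1, 2))                 # (B, 1, 640*T)

g = VideoToWaveformGenerator()
print(g(torch.randn(2, 25, 1, 64, 64)).shape)  # torch.Size([2, 1, 16000])
```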
Lip to Speech Synthesis with Visual Context Attentional GAN
In this paper, we propose a novel lip-to-speech generative adversarial
network, Visual Context Attentional GAN (VCA-GAN), which can jointly model
local and global lip movements during speech synthesis. Specifically, the
proposed VCA-GAN synthesizes the speech from local lip visual features by
finding a mapping function of viseme-to-phoneme, while global visual context is
embedded into the intermediate layers of the generator to clarify the ambiguity
in the mapping induced by homophenes. To achieve this, a visual context attention module is proposed that encodes global representations from the local visual features and, through audio-visual attention, provides the generator with the global visual context corresponding to the given coarse speech representation. In addition to the explicit modelling of local
and global visual representations, synchronization learning is introduced as a
form of contrastive learning that guides the generator to synthesize speech in sync with the given input lip movements. Extensive experiments demonstrate that the proposed VCA-GAN outperforms the existing state of the art and is able to effectively synthesize speech in the multi-speaker setting, which has barely been handled in previous works.
Comment: Published at NeurIPS 202
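The visual context attention idea can be sketched with standard multi-head attention: the coarse speech representation queries the global visual context, and the attended context is folded back into the speech features so the generator can disambiguate homophenes. This stand-in module is an assumption; the paper's exact attention design is not reproduced.

```python
import torch
import torch.nn as nn

class VisualContextAttention(nn.Module):
    """Hypothetical sketch of visual context attention: coarse speech
    features query global visual context (keys/values), and the attended
    context is added back with a residual connection. Built from standard
    nn.MultiheadAttention; dimensions are illustrative."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, speech: torch.Tensor, visual_ctx: torch.Tensor) -> torch.Tensor:
        # speech:     (B, T_a, dim) coarse speech representation (queries)
        # visual_ctx: (B, T_v, dim) global visual context (keys/values)
        ctx, _ = self.attn(query=speech, key=visual_ctx, value=visual_ctx)
        return self.norm(speech + ctx)   # context-refined speech features

vca = VisualContextAttention()
out = vca(torch.randn(2, 80, 256), torch.randn(2, 20, 256))
print(out.shape)  # torch.Size([2, 80, 256])
```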
Discriminative Multi-modality Speech Recognition
Vision is often used as a complementary modality for audio speech recognition (ASR), especially in noisy environments where the performance of the audio modality alone deteriorates significantly. With the visual modality added, ASR is upgraded to multi-modality speech recognition (MSR). In this paper, we
propose a two-stage speech recognition model. In the first stage, the target
voice is separated from background noise with the help of the corresponding visual information of lip movements, making the model 'listen' clearly. In the second stage, the audio modality is combined with the visual modality again by an MSR sub-network to better understand the speech, further improving the recognition rate. There are some other key contributions: we introduce a pseudo-3D residual
convolution (P3D)-based visual front-end to extract more discriminative
features; we replace the 1D ResNet temporal convolution block with a temporal convolutional network (TCN), which is better suited to temporal tasks; and the MSR sub-network is built on top of the Element-wise-Attention Gated Recurrent Unit (EleAtt-GRU), which is more effective than the Transformer on long sequences. We conducted extensive experiments on the LRS3-TED and the LRW
datasets. Our two-stage model (audio-enhanced multi-modality speech recognition, AE-MSR) consistently achieves state-of-the-art performance by a significant margin, which demonstrates the necessity and effectiveness of AE-MSR.
Comment: CVPR202
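A minimal sketch of the two-stage wiring described above: stage one denoises the audio with help from the lip features ('listen' clearly), stage two fuses the enhanced audio with the same visual stream for recognition. The stand-in GRU sub-networks below are placeholders for the paper's P3D front-end, TCN blocks and EleAtt-GRU.

```python
import torch
import torch.nn as nn

class TwoStageAVSR(nn.Module):
    """Hypothetical sketch of two-stage audio-visual recognition: stage 1
    enhances noisy audio using lip features; stage 2 fuses the enhanced
    audio with the same visual stream for word-level recognition. The GRUs
    are stand-ins for the paper's sub-networks."""
    def __init__(self, a_dim: int = 80, v_dim: int = 256, hidden: int = 256,
                 vocab: int = 500):
        super().__init__()
        self.enhance = nn.GRU(a_dim + v_dim, a_dim, batch_first=True)      # stage 1
        self.recognize = nn.GRU(a_dim + v_dim, hidden, batch_first=True)   # stage 2
        self.classifier = nn.Linear(hidden, vocab)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio: (B, T, a_dim) noisy features; visual: (B, T, v_dim) lip features
        enhanced, _ = self.enhance(torch.cat([audio, visual], dim=-1))
        h, _ = self.recognize(torch.cat([enhanced, visual], dim=-1))
        return self.classifier(h[:, -1])  # word logits for word-level recognition

model = TwoStageAVSR()
print(model(torch.randn(2, 29, 80), torch.randn(2, 29, 256)).shape)  # [2, 500]
```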
Latent Variable Algorithms for Multimodal Learning and Sensor Fusion
Multimodal learning has lacked principled ways of combining information from different modalities and learning a low-dimensional manifold of meaningful
representations. We study multimodal learning and sensor fusion from a latent
variable perspective. We first present a regularized recurrent attention filter
for sensor fusion. This algorithm can dynamically combine information from
different types of sensors in a sequential decision-making task. Each sensor is paired with a modular neural network to maximize the utility of its own information. A gating modular neural network dynamically generates a set of
mixing weights for outputs from sensor networks by balancing utility of all
sensors' information. We design a co-learning mechanism to encourage co-adaptation and independent learning of each sensor at the same time, and propose a regularization-based co-learning method. In the second part, we focus
on recovering the manifold of latent representations. We propose a co-learning approach using a probabilistic graphical model that imposes a structural prior on the generative model, the multimodal variational RNN (MVRNN), and derive a variational lower bound for its objective function. In the third part, we
extend the Siamese structure to sensor fusion for robust acoustic event detection. We perform experiments to investigate the extracted latent representations; further work will be done in the following months. Our experiments
show that the recurrent attention filter can dynamically combine different
sensor inputs according to the information carried in the inputs. We believe MVRNN can identify latent representations that are useful for many downstream tasks such as speech synthesis, activity recognition, and control and planning.
Both algorithms are general frameworks that can be applied to other tasks where different types of sensors are jointly used for decision making.
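The gating mechanism of the recurrent attention filter can be illustrated as follows: each sensor has its own small network, and a gating network emits softmax mixing weights over the sensor outputs, so the fused representation leans on whichever sensors currently carry useful information. The per-sensor MLPs and dimensions are illustrative assumptions, and the recurrent and co-learning parts are omitted.

```python
import torch
import torch.nn as nn

class GatedSensorFusion(nn.Module):
    """Hypothetical sketch of gated sensor fusion: one small network per
    sensor, plus a gating network that produces softmax mixing weights over
    the sensor outputs. Dimensions and the per-sensor MLPs are placeholders,
    not the paper's modules."""
    def __init__(self, sensor_dims: list[int], out_dim: int = 64):
        super().__init__()
        self.sensor_nets = nn.ModuleList(
            nn.Sequential(nn.Linear(d, out_dim), nn.ReLU()) for d in sensor_dims
        )
        self.gate = nn.Linear(sum(sensor_dims), len(sensor_dims))

    def forward(self, inputs: list[torch.Tensor]) -> torch.Tensor:
        # inputs: one (B, d_i) tensor per sensor
        outs = torch.stack([net(x) for net, x in zip(self.sensor_nets, inputs)], dim=1)
        weights = torch.softmax(self.gate(torch.cat(inputs, dim=-1)), dim=-1)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)   # (B, out_dim)

fusion = GatedSensorFusion([32, 128, 16])  # e.g. IMU, audio, proximity features
fused = fusion([torch.randn(4, 32), torch.randn(4, 128), torch.randn(4, 16)])
print(fused.shape)  # torch.Size([4, 64])
```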
Multi-modal Multi-channel Target Speech Separation
Target speech separation refers to extracting a target speaker's voice from
an overlapped audio recording of simultaneous talkers. Previously, the use of the visual modality for target speech separation has demonstrated great potential. This
work proposes a general multi-modal framework for target speech separation by
utilizing all the available information of the target speaker, including
his/her spatial location, voice characteristics and lip movements. Also, under this framework, we investigate fusion methods for multi-modal joint modeling. A factorized attention-based fusion method is proposed to aggregate the high-level semantic information of multiple modalities at the embedding level. This method first factorizes the mixture audio into a set of acoustic
subspaces, then leverages the target's information from other modalities to
enhance these subspace acoustic embeddings with a learnable attention scheme.
To validate the robustness of the proposed multi-modal separation model in practical scenarios, the system was evaluated under conditions where one of the modalities is temporarily missing, invalid or corrupted. Experiments are conducted on a large-scale audio-visual dataset collected from YouTube (to be released), spatialized by simulated room impulse responses (RIRs). Experimental results illustrate that our proposed multi-modal framework significantly outperforms single-modal and bi-modal speech separation approaches, while still supporting real-time processing.
Comment: Accepted in IEEE Journal of Selected Topics in Signal Processing