Combining Multiple Views for Visual Speech Recognition
Visual speech recognition is a challenging research problem with a particular
practical application of aiding audio speech recognition in noisy scenarios.
Multiple camera setups can be beneficial for the visual speech recognition
systems in terms of improved performance and robustness. In this paper, we
explore this aspect and provide a comprehensive study on combining multiple
views for visual speech recognition. The thorough analysis covers fusion of all
possible view angle combinations both at feature level and decision level. The
employed visual speech recognition system in this study extracts features
through a PCA-based convolutional neural network, followed by an LSTM network.
Finally, these features are fed into a GMM-HMM scheme as a tandem system. Decision
fusion is then applied by combining the Viterbi path log-likelihoods.
The results show that the complementary
information contained in recordings from different view angles improves the
results significantly. For example, the sentence correctness on the test set is
increased from 76% for the highest performing single view () to up to
83% when combining this view with the frontal and view angles
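The decision-level fusion described above can be sketched as a (log-domain) combination of per-view hypothesis scores. The function name, the optional per-view weights, and the toy numbers below are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def fuse_viterbi_scores(view_scores, weights=None):
    """Decision-level fusion of per-view Viterbi path log-likelihoods.

    view_scores: (num_views, num_hypotheses) log-likelihoods, one row per
    camera view. weights: optional per-view reliability weights (a
    hypothetical extension; the paper combines the views' scores directly).
    Returns the index of the hypothesis with the highest combined score.
    """
    scores = np.asarray(view_scores, dtype=float)
    if weights is None:
        weights = np.ones(scores.shape[0])
    # Multiplying independent view likelihoods becomes a (weighted) sum
    # in the log domain.
    combined = np.asarray(weights) @ scores
    return int(np.argmax(combined))

# Two views, three candidate sentences: the fused scores prefer the
# hypothesis that is plausible in both views.
frontal = [-120.0, -95.0, -110.0]
profile = [-100.0, -98.0, -130.0]
best = fuse_viterbi_scores([frontal, profile])  # → 1
```

Summing log-likelihoods assumes the views' observation streams are conditionally independent given the hypothesis, which is the usual simplification in decision fusion.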
Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss
We devise a cascade GAN approach to generate talking face video, which is
robust to different face shapes, view angles, facial characteristics, and noisy
audio conditions. Instead of learning a direct mapping from audio to video
frames, we propose first to transfer audio to high-level structure, i.e., the
facial landmarks, and then to generate video frames conditioned on the
landmarks. Compared to a direct audio-to-image approach, our cascade approach
avoids fitting spurious correlations between audiovisual signals that are
irrelevant to the speech content. We, humans, are sensitive to temporal
discontinuities and subtle artifacts in video. To avoid those pixel jittering
problems and to enforce the network to focus on audiovisual-correlated regions,
we propose a novel dynamically adjustable pixel-wise loss with an attention
mechanism. Furthermore, to generate a sharper image with well-synchronized
facial movements, we propose a novel regression-based discriminator structure,
which considers sequence-level information along with frame-level information.
Thorough experiments on several datasets and real-world samples demonstrate
that our method obtains significantly better results than state-of-the-art
methods in both quantitative and qualitative comparisons.
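The idea of an attention-reweighted pixel-wise loss can be illustrated with a minimal sketch. The fixed floor weight `base`, the function name, and the static attention map are simplifying assumptions; the paper adjusts its pixel-wise weights dynamically:

```python
import numpy as np

def attention_weighted_l1(pred, target, attention, base=0.5):
    """Pixel-wise L1 loss reweighted by an attention map.

    attention: values in [0, 1], high where image motion correlates with
    the audio (e.g. the mouth region); `base` keeps a floor weight so the
    rest of the face still contributes. A simplified stand-in for the
    paper's dynamically adjustable pixel-wise loss.
    """
    weights = base + (1.0 - base) * attention
    return float(np.mean(weights * np.abs(pred - target)))

# With base=0.5, errors in the attended (mouth) rows are penalized
# twice as much as errors elsewhere.
pred = np.ones((2, 2))
target = np.zeros((2, 2))
attention = np.array([[1.0, 1.0], [0.0, 0.0]])
loss = attention_weighted_l1(pred, target, attention)  # → 0.75
```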
Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge
This paper proposes a novel lip reading framework for low-resource languages,
a setting that has not been well addressed in the previous literature. Since
low-resource languages do not have enough video-text paired
data to train the model to have sufficient power to model lip movements and
language, it is regarded as challenging to develop lip reading models for
low-resource languages. In order to mitigate the challenge, we try to learn
general speech knowledge, the ability to model lip movements, from a
high-resource language through the prediction of speech units. It is known that
different languages partially share common phonemes, thus general speech
knowledge learned from one language can be extended to other languages. Then,
we try to learn language-specific knowledge, the ability to model language, by
proposing Language-specific Memory-augmented Decoder (LMDecoder). LMDecoder
saves language-specific audio features into memory banks and can be trained on
audio-text paired data which is more easily accessible than video-text paired
data. Therefore, with LMDecoder, we can transform the input speech units into
language-specific audio features and translate them into texts by utilizing the
learned rich language knowledge. Finally, by combining general speech knowledge
and language-specific knowledge, we can efficiently develop lip reading models
even for low-resource languages. Through extensive experiments using five
languages, English, Spanish, French, Italian, and Portuguese, the effectiveness
of the proposed method is evaluated.
Comment: Accepted at ICCV 202
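The memory read at the heart of a memory-augmented decoder can be sketched as a soft key-value lookup. This toy version is an illustrative simplification, not LMDecoder's actual implementation; all names and shapes here are assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

def memory_read(query, keys, values):
    """Soft key-value memory lookup.

    query: embedding of a speech unit; keys/values: memory bank rows
    pairing key embeddings with stored language-specific audio features.
    Attention over the keys returns a weighted mix of the values, i.e. a
    language-specific audio feature for the query.
    """
    attn = softmax(keys @ query)  # similarity over memory slots
    return attn @ values          # weighted read from the bank

# A query close to key 0 retrieves (approximately) value 0.
keys = np.array([[10.0, 0.0], [0.0, 10.0]])
values = np.array([[1.0, 2.0], [3.0, 4.0]])
feature = memory_read(np.array([1.0, 0.0]), keys, values)
```

Because the bank stores audio-derived features, such a memory can be trained on audio-text pairs alone, which is the property the paper exploits for low-resource languages.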
Visual speech recognition: from traditional to deep learning frameworks
Speech is the most natural means of communication for humans. Therefore, since the beginning of computers it has been a goal to interact with machines via speech. While there have been gradual improvements in this field over the decades, and with recent drastic progress more and more commercial software is available that allow voice commands, there are still many ways in which it can be improved.
One way to do this is with visual speech information, more specifically, the visible articulations of the mouth. Based on the information contained in these articulations, visual speech recognition (VSR) transcribes an utterance from a video sequence. It thus extends speech recognition beyond audio-only settings: to silent or whispered speech (e.g. in cybersecurity), to mouthings in sign language, as an additional modality for audio-visual automatic speech recognition in noisy scenarios, to better understand speech production and disorders, and by itself for human-machine interaction and transcription.
In this thesis, we present and compare different ways to build systems for VSR: We start with the traditional hidden Markov models that have been used in the field for decades, especially in combination with handcrafted features. These are compared to models taking into account recent developments in the fields of computer vision and speech recognition through deep learning. While their superior performance is confirmed, certain limitations with respect to computing power for these systems are also discussed.
This thesis also addresses multi-view processing and fusion, which is an important topic for many current applications. This is due to the fact that a single camera view often cannot provide enough flexibility with speakers moving in front of the camera. Technology companies are willing to integrate more cameras into their products, such as cars and mobile devices, due to lower hardware cost for both cameras and processing units, as well as the availability of higher processing power and high performance algorithms. Multi-camera and multi-view solutions are thus becoming more common, which means that algorithms can benefit from taking these into account. In this work we propose several methods of fusing the views of multiple cameras to improve the overall results.
We show that both relying on deep learning-based approaches for feature extraction and sequence modelling, and taking into account the complementary information contained in several views, improve performance considerably. To further improve the results, it would be necessary to move from data recorded in a lab environment to multi-view data in realistic scenarios. Furthermore, the findings and models could be transferred to other domains such as audio-visual speech recognition or the study of speech production and disorders.
Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis
This paper presents ER-NeRF, a novel conditional Neural Radiance Fields
(NeRF) based architecture for talking portrait synthesis that can concurrently
achieve fast convergence, real-time rendering, and state-of-the-art performance
with small model size. Our idea is to explicitly exploit the unequal
contribution of spatial regions to guide talking portrait modeling.
Specifically, to improve the accuracy of dynamic head reconstruction, a compact
and expressive NeRF-based Tri-Plane Hash Representation is introduced by
pruning empty spatial regions with three planar hash encoders. For speech
audio, we propose a Region Attention Module to generate region-aware condition
features via an attention mechanism. Unlike existing methods that
utilize an MLP-based encoder to learn the cross-modal relation implicitly, the
attention mechanism builds an explicit connection between audio features and
spatial regions to capture the priors of local motions. Moreover, a direct and
fast Adaptive Pose Encoding is introduced to optimize the head-torso separation
problem by mapping the complex transformation of the head pose into spatial
coordinates. Extensive experiments demonstrate that our method renders
high-fidelity, audio-lip-synchronized talking portrait videos with realistic
details and high efficiency compared to previous methods.
Comment: Accepted by ICCV 202
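The tri-plane decomposition can be illustrated with a toy dense-grid lookup. ER-NeRF itself uses multiresolution *hash* encoders per plane, so the dense grids, nearest-neighbour sampling, and tiny shapes below are simplifying assumptions made only for brevity:

```python
import numpy as np

def tri_plane_features(point, plane_xy, plane_yz, plane_xz):
    """Project a 3D point in [0, 1)^3 onto the XY, YZ and XZ planes,
    fetch a feature from each 2D grid, and concatenate the results."""
    x, y, z = point

    def fetch(grid, u, v):
        h, w = grid.shape[:2]
        i = min(int(u * h), h - 1)  # nearest-neighbour cell index
        j = min(int(v * w), w - 1)
        return grid[i, j]

    return np.concatenate([fetch(plane_xy, x, y),
                           fetch(plane_yz, y, z),
                           fetch(plane_xz, x, z)])

# 2x2 grids with a 1-dim feature per cell, queried at (0.6, 0.2, 0.9).
plane_xy = np.array([[[0.0], [1.0]], [[2.0], [3.0]]])
plane_yz = np.array([[[0.0], [1.0]], [[2.0], [3.0]]])
plane_xz = np.array([[[0.0], [1.0]], [[2.0], [3.0]]])
feats = tri_plane_features((0.6, 0.2, 0.9), plane_xy, plane_yz, plane_xz)
```

Factorizing 3D space into three 2D lookups is what lets empty spatial regions be pruned cheaply: storage grows with the plane resolutions rather than with a full 3D grid.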
Less is More: Facial Landmarks can Recognize a Spontaneous Smile
Smile veracity classification is a task of interpreting social interactions.
Broadly, it distinguishes between spontaneous and posed smiles. Previous
approaches used hand-engineered features from facial landmarks or considered
raw smile videos in an end-to-end manner to perform smile classification tasks.
Feature-based methods require intervention from human experts on feature
engineering and heavy pre-processing steps. On the contrary, raw smile video
inputs fed into end-to-end models bring more automation to the process with the
cost of considering many redundant facial features (beyond landmark locations)
that are mainly irrelevant to smile veracity classification. It remains unclear
how to establish discriminative features from landmarks in an end-to-end manner. We
present a MeshSmileNet framework, a transformer architecture, to address the
above limitations. To eliminate redundant facial features, our landmarks input
is extracted from Attention Mesh, a pre-trained landmark detector. Further, to
discover discriminative features, we consider the relativity and trajectory of
the landmarks. For relativity, we aggregate facial landmarks that conceptually
form a curve at each frame to establish local spatial features. For the
trajectory, we estimate the movements of landmark-composed features across time
with a self-attention mechanism, which captures pairwise dependencies along the
trajectory of the same landmark. This idea allows us to achieve
state-of-the-art performances on UVA-NEMO, BBC, MMI Facial Expression, and SPOS
datasets.
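Self-attention over a single landmark's trajectory can be sketched as follows. Learned query/key/value projections and multiple heads are omitted, which is a deliberate simplification of what a transformer such as MeshSmileNet would actually use:

```python
import numpy as np

def trajectory_self_attention(x):
    """Scaled dot-product self-attention over a landmark trajectory.

    x: (T, d) per-frame features for one landmark across T frames. Each
    output frame is a softmax-weighted mix of all frames, so pairwise
    dependencies along the trajectory are captured.
    """
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)                # (T, T) frame similarities
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over time
    return attn @ x

# With identical frames the attention is uniform and the output
# equals the input.
out = trajectory_self_attention(np.ones((3, 2)))
```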