Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis
Humans involuntarily tend to infer parts of the conversation from lip
movements when the speech is absent or corrupted by external noise. In this
work, we explore the task of lip to speech synthesis, i.e., learning to
generate natural speech given only the lip movements of a speaker.
Acknowledging the importance of contextual and speaker-specific cues for
accurate lip-reading, we take a different path from existing works. We focus on
learning accurate lip-to-speech mappings for individual speakers in
unconstrained, large vocabulary settings. To this end, we collect and release a
large-scale benchmark dataset, the first of its kind, specifically to train and
evaluate the single-speaker lip to speech task in natural settings. We propose
a novel approach with key design choices to achieve accurate, natural lip to
speech synthesis in such unconstrained scenarios for the first time. Extensive
evaluation using quantitative, qualitative metrics and human evaluation shows
that our method is four times more intelligible than previous works in this
space. Please check out our demo video for a quick overview of the paper,
method, and qualitative results.
https://www.youtube.com/watch?v=HziA-jmlk_4&feature=youtu.be
Comment: 10 pages (including references), 5 figures, Accepted in CVPR, 202
Compressing Video Calls using Synthetic Talking Heads
We leverage the modern advancements in talking head generation to propose an
end-to-end system for talking head video compression. Our algorithm transmits
pivot frames intermittently while the rest of the talking head video is
generated by animating them. We use a state-of-the-art face reenactment network
to detect key points in the non-pivot frames and transmit them to the receiver.
A dense flow is then calculated to warp a pivot frame to reconstruct the
non-pivot ones. Transmitting key points instead of full frames leads to
significant compression. We propose a novel algorithm to adaptively select the
best-suited pivot frames at regular intervals to provide a smooth experience.
We also propose a frame interpolator at the receiver's end to improve the
compression levels further. Finally, a face enhancement network improves
reconstruction quality, significantly improving several aspects like the
sharpness of the generations. We evaluate our method both qualitatively and
quantitatively on benchmark datasets and compare it with multiple compression
techniques. We release a demo video and additional information at
https://cvit.iiit.ac.in/research/projects/cvit-projects/talking-video-compression.
Comment: British Machine Vision Conference (BMVC), 202
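As a rough illustration of why transmitting key points instead of full frames compresses so well, consider raw byte counts (illustrative numbers only; the actual system also entropy-codes the pivot frames and key points, and the pivot interval is chosen adaptively):

```python
# Back-of-the-envelope bandwidth comparison: full frames vs. pivot
# frames plus per-frame key points. Frame size, key-point count, and
# pivot interval below are hypothetical, not the paper's settings.

def frame_bytes(width, height, channels=3):
    """Raw size of one uncompressed frame."""
    return width * height * channels

def keypoint_bytes(num_keypoints, bytes_per_coord=4):
    """Size of one frame's key points: an (x, y) float pair per point."""
    return num_keypoints * 2 * bytes_per_coord

def stream_bytes(num_frames, pivot_interval, w, h, k):
    """Pivot frames are sent in full; every other frame as key points."""
    pivots = (num_frames + pivot_interval - 1) // pivot_interval
    non_pivots = num_frames - pivots
    return pivots * frame_bytes(w, h) + non_pivots * keypoint_bytes(k)

full = 300 * frame_bytes(256, 256)                          # send every frame raw
ours = stream_bytes(300, pivot_interval=30, w=256, h=256, k=10)
ratio = full / ours                                         # roughly 30x smaller
```

Even with one full pivot frame every 30 frames, the key-point stream is an order of magnitude smaller than sending raw frames, which is where the headroom for the reenactment-based codec comes from.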
A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild
In this work, we investigate the problem of lip-syncing a talking face video
of an arbitrary identity to match a target speech segment. Current works excel
at producing accurate lip movements on a static image or videos of specific
people seen during the training phase. However, they fail to accurately morph
the lip movements of arbitrary identities in dynamic, unconstrained talking
face videos, resulting in significant parts of the video being out-of-sync with
the new audio. We identify the key reasons for this failure and resolve
them by learning from a powerful lip-sync discriminator. Next, we propose new,
rigorous evaluation benchmarks and metrics to accurately measure lip
synchronization in unconstrained videos. Extensive quantitative evaluations on
our challenging benchmarks show that the lip-sync accuracy of the videos
generated by our Wav2Lip model is almost as good as real synced videos. We
provide a demo video clearly showing the substantial impact of our Wav2Lip
model and evaluation benchmarks on our website:
\url{cvit.iiit.ac.in/research/projects/cvit-projects/a-lip-sync-expert-is-all-you-need-for-speech-to-lip-generation-in-the-wild}.
The code and models are released at this GitHub repository:
\url{github.com/Rudrabha/Wav2Lip}. You can also try out the interactive demo at
this link: \url{bhaasha.iiit.ac.in/lipsync}.
Comment: 9 pages (including references), 3 figures, Accepted in ACM Multimedia, 202
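The core of such a lip-sync discriminator can be sketched in a few lines: score the agreement between a video-window embedding and an audio-window embedding, and penalize in-sync pairs with low scores. This is a minimal SyncNet-style sketch with toy embedding vectors, not the authors' actual network or training code:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def sync_loss(video_emb, audio_emb, eps=1e-7):
    """Binary-cross-entropy-style loss for a pair labelled in-sync:
    cosine similarity is mapped to a probability in [0, 1], and the
    loss is small only when the two embeddings agree."""
    p = (cosine(video_emb, audio_emb) + 1) / 2
    return -math.log(p + eps)
```

During training, real (in-sync) audio-video windows are pushed toward low loss and deliberately offset windows toward high loss; the generator is then trained against this frozen expert.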
DualLip: A System for Joint Lip Reading and Generation
Lip reading aims to recognize text from talking lips, while lip generation
aims to synthesize talking lips from text; lip generation is a key component of
talking face generation and is the dual task of lip reading. In this paper, we
develop DualLip, a system that jointly improves lip reading and generation by
leveraging the task duality and using unlabeled text and lip video data. The
key ideas of DualLip are: 1) Generate lip video from unlabeled text
with a lip generation model, and use the pseudo pairs to improve lip reading;
2) Generate text from unlabeled lip video with a lip reading model, and use the
pseudo pairs to improve lip generation. We further extend DualLip to talking
face generation with two additionally introduced components: lip to face
generation and text to speech generation. Experiments on GRID and TCD-TIMIT
demonstrate the effectiveness of DualLip on improving lip reading, lip
generation, and talking face generation by utilizing unlabeled data.
Specifically, the lip generation model in our DualLip system trained with
only 10% of the paired data surpasses the performance of the model trained with
the whole paired data. On the GRID lip reading benchmark, we achieve a 1.16%
character error rate and a 2.71% word error rate, outperforming
state-of-the-art models that use the same amount of paired data.
Comment: Accepted by ACM Multimedia 202
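The two pseudo-pair directions described above can be sketched as a single training round. The `lip_reader` and `lip_generator` callables here are hypothetical stand-ins for the two trained models, shown only to make the data flow of the duality concrete:

```python
# Toy sketch of DualLip's dual pseudo-labelling round. Each model
# labels the other's unlabeled data, producing extra training pairs.

def dual_training_round(lip_reader, lip_generator,
                        paired, unlabeled_text, unlabeled_video):
    # 1) text -> synthetic lip video: extra (video, text) pairs
    #    that augment the training set of the lip READER.
    pseudo_for_reader = [(lip_generator(t), t) for t in unlabeled_text]
    # 2) lip video -> pseudo text: extra (video, text) pairs
    #    that augment the training set of the lip GENERATOR.
    pseudo_for_generator = [(v, lip_reader(v)) for v in unlabeled_video]
    return paired + pseudo_for_reader, paired + pseudo_for_generator
```

In the real system, each model is then retrained on its augmented set and the round repeats, so improvements in one direction feed the other.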
Revisiting Low Resource Status of Indian Languages in Machine Translation
Indian language machine translation performance is hampered due to the lack
of large scale multi-lingual sentence aligned corpora and robust benchmarks.
Through this paper, we provide and analyse an automated framework to obtain
such a corpus for Indian language neural machine translation (NMT) systems. Our
pipeline consists of a baseline NMT system, a retrieval module, and an
alignment module that is used to work with publicly available websites such as
press releases by the government. Our main contribution is an incremental
method that uses this pipeline to iteratively grow the corpus while improving
each component of the system. We also evaluate design choices such as the
choice of pivot language and the effect of incrementally increasing the corpus
size. Besides providing an automated framework, our work produces a corpus
larger than the existing corpora available for Indian languages. This corpus
yields substantially improved results on the publicly available WAT evaluation
benchmark and other standard benchmarks.
Comment: 10 pages, few figures, Preprint under review
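One pass of the retrieval-and-alignment stage can be sketched as score-and-threshold sentence mining. The `score` function here is a hypothetical placeholder for the paper's actual alignment module (which uses the baseline NMT system to compare candidates):

```python
# Toy sketch of one corpus-mining pass: for each source sentence,
# retrieve the best-scoring target candidate and keep the pair only
# if it clears an alignment threshold. Mined pairs then retrain the
# baseline system, and the pass repeats with better scores.

def mine_parallel(score, src_sents, tgt_sents, threshold=0.8):
    """Greedy one-pass mining of sentence pairs above `threshold`."""
    pairs = []
    for s in src_sents:
        best = max(tgt_sents, key=lambda t: score(s, t))
        if score(s, best) >= threshold:
            pairs.append((s, best))
    return pairs
```

The incremental aspect of the framework comes from looping this: a larger mined corpus yields a stronger baseline NMT system, which in turn scores candidates more reliably on the next pass.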
Visual speech enhancement without a real visual stream
In this work, we re-think the task of speech enhancement in unconstrained real-world environments. Current state-of-the-art methods use only the audio stream and are limited in their performance in a wide range of real-world noises. Recent works using lip movements as additional cues improve the quality of generated speech over "audio-only" methods. But these methods cannot be used in several applications where the visual stream is unreliable or completely absent. We propose a new paradigm for speech enhancement by exploiting recent breakthroughs in speech-driven lip synthesis. Using one such model as a teacher network, we train a robust student network to produce accurate lip movements that mask away the noise, thus acting as a "visual noise filter". The intelligibility of the speech enhanced by our pseudo-lip approach is comparable (< 3% difference) to the case of using real lips. This implies that we can exploit the advantages of using lip movements even in the absence of a real video stream. We rigorously evaluate our model using quantitative metrics as well as human evaluations. Additional ablation studies and a demo video on our website containing qualitative comparisons and results clearly illustrate the effectiveness of our approach.
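The teacher-student step described above reduces, at its core, to a regression loss between the two models' lip outputs. This is a minimal sketch over flat lists of lip-landmark coordinates (a hypothetical representation; the actual networks operate on video frames and spectrograms):

```python
# Distillation sketch: the student sees noisy audio, the teacher sees
# clean speech; the student is trained to match the teacher's lips.

def lip_distillation_loss(student_lips, teacher_lips):
    """Mean absolute error between student-predicted lip coordinates
    (from noisy audio) and teacher-predicted ones (from clean audio)."""
    n = len(student_lips)
    return sum(abs(s - t) for s, t in zip(student_lips, teacher_lips)) / n
```

Driving this loss to zero is what lets the pseudo-lips stand in for a real visual stream at enhancement time.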
NTIRE 2018 Challenge on Spectral Reconstruction from RGB Images
This paper reviews the first challenge on spectral image reconstruction from RGB images, i.e., the recovery of whole-scene hyperspectral (HS) information from a 3-channel RGB image. The challenge was divided into 2 tracks: the “Clean” track sought HS recovery from noiseless RGB images obtained from a known response function (representing a spectrally-calibrated camera), while the “Real World” track challenged participants to recover HS cubes from JPEG-compressed RGB images generated by an unknown response function. To facilitate the challenge, the BGU Hyperspectral Image Database was extended to provide participants with 256 natural HS training images, and 5+10 additional images for validation and testing, respectively. The “Clean” and “Real World” tracks had 73 and 63 registered participants respectively, with 12 teams competing in the final testing phase. Proposed methods and their corresponding results are reported in this review.
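The forward model that both tracks invert is a linear projection of each hyperspectral pixel through the camera response. A toy per-pixel sketch (using 3 bands for brevity; the real data has tens of spectral bands, and the response matrix is known only in the “Clean” track):

```python
# Forward model of HS -> RGB: each RGB channel is a response-weighted
# sum over the spectral bands of one pixel. Spectral reconstruction
# is the (ill-posed) inverse of this mapping.

def hs_to_rgb(hs_pixel, response):
    """Project one hyperspectral pixel (a list of N band intensities)
    to RGB using a 3xN camera response matrix."""
    return [sum(r * h for r, h in zip(row, hs_pixel)) for row in response]
```

Because a 3xN matrix collapses N bands to 3 numbers, many spectra map to the same RGB triple, which is why learned priors over natural spectra are needed to recover the HS cube.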
NTIRE 2019 Challenge on Video Super-Resolution: Methods and Results
This paper reviews the first NTIRE challenge on video super-resolution (restoration of rich details in low-resolution video frames) with focus on proposed solutions and results. A new REalistic and Diverse Scenes dataset (REDS) was employed. The challenge was divided into 2 tracks. Track 1 employed standard bicubic downscaling setup while Track 2 had realistic dynamic motion blurs.
The two tracks had 124 and 104 registered participants, respectively, and 14 teams competed in the final testing phase. The results gauge the state-of-the-art in video super-resolution.
NTIRE 2018 Challenge on Single Image Super-Resolution: Methods and Results
This paper reviews the 2nd NTIRE challenge on single image super-resolution (restoration of rich details in a low-resolution image) with focus on proposed solutions and results. The challenge had 4 tracks. Track 1 employed the standard bicubic downscaling setup, while Tracks 2, 3 and 4 had realistic unknown downgrading operators simulating the camera image acquisition pipeline. The operators were learnable through provided pairs of low- and high-resolution training images. The tracks had 145, 114, 101, and 113 registered participants, respectively, and 31 teams competed in the final testing phase. The results gauge the state-of-the-art in single image super-resolution.
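Super-resolution challenges like these rank entries primarily by fidelity metrics such as PSNR. A minimal implementation over flattened pixel lists (the official evaluations typically also crop image borders and report SSIM alongside PSNR):

```python
import math

def psnr(ref, out, peak=255.0):
    """Peak signal-to-noise ratio between a reference image and a
    reconstruction, both given as flat lists of pixel values."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(peak * peak / mse)
```

Higher is better; a one-intensity-level average error on 8-bit images already corresponds to roughly 48 dB, which is why leaderboard gaps of fractions of a dB are meaningful.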