Cross-language Speech Dependent Lip-synchronization
Understanding videos of people speaking across international borders is hard, as audiences from different demographics do not understand the language. Such speech videos are often supplemented with language subtitles; however, these hamper the viewing experience because the viewer's attention is divided. Simple audio dubbing in a different language makes the video appear unnatural due to unsynchronized lip motion. In this paper, we propose a system for automated cross-language lip synchronization of re-dubbed videos. Our model generates photorealistic lip synchronization superior to that of the current re-dubbing method. With the help of a user study, we verify that our method is preferred over unsynchronized videos.
A Multilingual Parallel Corpora Collection Effort for Indian Languages
We present sentence aligned parallel corpora across 10 Indian Languages -
Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi,
Punjabi, and English - many of which are categorized as low resource. The
corpora are compiled from online sources which have content shared across
languages. The presented corpora significantly extend existing resources, which
are either not large enough or are restricted to a specific domain (such as
health). We also provide a separate test corpus, compiled from an independent
online source, that can be used to validate performance in all
10 Indian languages. Alongside, we report on the methods of constructing such
corpora using tools enabled by recent advances in machine translation and
cross-lingual retrieval using deep neural network based methods.
Comment: 9 pages. Accepted in LREC 202
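The abstract above mentions corpus construction via cross-lingual retrieval with deep neural methods. A toy sketch of the retrieval step, assuming sentences have already been encoded by some cross-lingual sentence encoder (the actual encoder and alignment criteria are not specified in the abstract, so random vectors stand in for real embeddings): pair each source sentence with its nearest target sentence by cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Stand-in embeddings: 5 source sentences in a 16-dim space.
src = rng.normal(size=(5, 16))
# "Translations": the same sentences, shuffled and slightly perturbed,
# mimicking near-identical cross-lingual embeddings of parallel text.
perm = [2, 0, 4, 1, 3]
tgt = src[perm] + rng.normal(scale=0.01, size=(5, 16))

sim = cosine_matrix(src, tgt)
pairs = sim.argmax(axis=1)  # nearest target index for each source sentence
print(pairs)  # recovers the shuffle: [1 3 0 4 2]
```

Real pipelines typically add a margin or mutual-nearest-neighbor check to filter false matches, but the nearest-neighbor core is the same.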
INR-V: A Continuous Representation Space for Video-based Generative Tasks
Generating videos is a complex task that is accomplished by generating a set
of temporally coherent images frame-by-frame. This limits the expressivity of
videos to image-based operations on the individual frames, requiring
network designs that obtain temporally coherent trajectories in the underlying
image space. We propose INR-V, a video representation network that learns a
continuous space for video-based generative tasks. INR-V parameterizes videos
using implicit neural representations (INRs): a multi-layer perceptron that
predicts an RGB value for each input pixel location of the video. The INR is
predicted using a meta-network which is a hypernetwork trained on neural
representations of multiple video instances. Later, the meta-network can be
sampled to generate diverse novel videos enabling many downstream video-based
generative tasks. Interestingly, we find that conditional regularization and
progressive weight initialization play a crucial role in obtaining INR-V. The
representation space learned by INR-V is more expressive than an image space,
showcasing many interesting properties not possible with existing works.
For instance, INR-V can smoothly interpolate intermediate videos between known
video instances (such as intermediate identities, expressions, and poses in
face videos). It can also in-paint missing portions in videos to recover
temporally coherent full videos. In this work, we evaluate the space learned by
INR-V on diverse generative tasks such as video interpolation, novel video
generation, video inversion, and video inpainting against the existing
baselines. INR-V significantly outperforms the baselines on several of these
demonstrated tasks, clearly showcasing the potential of the proposed
representation space.
Comment: Published in Transactions on Machine Learning Research (10/2022);
https://openreview.net/forum?id=aIoEkwc2o
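The INR described in the abstract above, an MLP that maps a pixel's coordinates to an RGB value, can be sketched minimally as follows. The layer sizes and random weights here are illustrative stand-ins, not INR-V's actual meta-network or trained parameters; a real INR would be fit to (or predicted for) a specific video.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random weights for a fully connected network (hypothetical sizes)."""
    return [(rng.normal(0, 0.1, (i, o)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def inr_forward(params, coords):
    """Map (N, 3) normalized (x, y, t) coordinates to (N, 3) RGB values."""
    h = coords
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)           # ReLU hidden layers
    w, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))    # sigmoid -> RGB in [0, 1]

params = init_mlp([3, 64, 64, 3])                # (x, y, t) in, RGB out

# Query every pixel of a 4x4 frame at time t = 0.5.
xs, ys = np.meshgrid(np.linspace(0, 1, 4), np.linspace(0, 1, 4))
coords = np.stack([xs.ravel(), ys.ravel(), np.full(16, 0.5)], axis=1)
rgb = inr_forward(params, coords)
print(rgb.shape)  # (16, 3)
```

Because the network is queried per coordinate, frames at arbitrary resolutions and time steps come from the same continuous function, which is what lets a hypernetwork over such INRs interpolate or inpaint whole videos.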
Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis
Humans involuntarily tend to infer parts of the conversation from lip
movements when the speech is absent or corrupted by external noise. In this
work, we explore the task of lip to speech synthesis, i.e., learning to
generate natural speech given only the lip movements of a speaker.
Acknowledging the importance of contextual and speaker-specific cues for
accurate lip-reading, we take a different path from existing works. We focus on
learning accurate lip sequences to speech mappings for individual speakers in
unconstrained, large vocabulary settings. To this end, we collect and release a
large-scale benchmark dataset, the first of its kind, specifically to train and
evaluate the single-speaker lip to speech task in natural settings. We propose
a novel approach with key design choices to achieve accurate, natural lip to
speech synthesis in such unconstrained scenarios for the first time. Extensive
evaluation using quantitative, qualitative metrics and human evaluation shows
that our method is four times more intelligible than previous works in this
space. Please check out our demo video for a quick overview of the paper,
method, and qualitative results.
https://www.youtube.com/watch?v=HziA-jmlk_4&feature=youtu.be
Comment: 10 pages (including references), 5 figures, Accepted in CVPR, 202