Talking Face Generation by Adversarially Disentangled Audio-Visual Representation
Talking face generation aims to synthesize a sequence of face images that
correspond to a clip of speech. This is challenging because face appearance
variation and the semantics of speech are coupled together in the subtle
movements of the talking face regions. Existing works either construct a face
appearance model for a specific subject or model the transformation between
lip motion and speech. In this work, we integrate both aspects and enable
arbitrary-subject talking face generation by learning disentangled audio-visual
representation. We find that a talking face sequence is in fact a composition
of subject-related information and speech-related information.
These two spaces are then explicitly disentangled through a novel
associative-and-adversarial training process. The disentangled representation
has the advantage that either audio or video can serve as the input for
generation.
Extensive experiments show that the proposed approach generates realistic
talking face sequences on arbitrary subjects with much clearer lip motion
patterns than previous work. We also demonstrate that the learned audio-visual
representation is highly useful for the tasks of automatic lip reading and
audio-video retrieval.

Comment: AAAI Conference on Artificial Intelligence (AAAI 2019) Oral
Presentation. Code, models, and video results are available on our webpage:
https://liuziwei7.github.io/projects/TalkingFace.htm
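The core idea of the adversarial disentanglement above can be illustrated with a toy sketch. This is a hypothetical, minimal illustration (all encoder weights, dimensions, and names are invented, not the paper's architecture): a classifier tries to read speech content out of the subject code, while the encoder is trained to defeat it, pushing speech information out of the subject space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy linear "encoders" splitting an 8-d frame feature into two codes.
W_subj = rng.normal(size=(4, 8))    # frame -> subject (identity) code
W_speech = rng.normal(size=(4, 8))  # frame -> speech-content code
W_adv = rng.normal(size=(2, 4))     # adversary: subject code -> speech logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

frame = rng.normal(size=8)          # stand-in for a face-image feature vector
subj_code = W_subj @ frame
speech_code = W_speech @ frame

# The adversarial classifier tries to predict the speech label from the
# SUBJECT code; the encoder receives the negated loss (gradient reversal),
# so it learns to remove speech information from the subject space.
probs = softmax(W_adv @ subj_code)
speech_label = 1
adv_loss = -np.log(probs[speech_label])  # adversary minimizes this
enc_adv_loss = -adv_loss                 # encoder maximizes it
```

Once the two spaces are disentangled and the audio encoder maps into the same speech-content space, either modality can supply `speech_code` to the generator, which is the property the abstract highlights.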
URNet : User-Resizable Residual Networks with Conditional Gating Module
Convolutional Neural Networks are widely used to process spatial scenes, but
their computational cost is fixed and depends on the structure of the network
used. Existing methods reduce this cost by compressing networks or by varying
their computational paths dynamically according to the input image. However,
because a user cannot control the size of the learned model, it is difficult
to respond dynamically when the volume of service requests suddenly increases.
We
propose User-Resizable Residual Networks (URNet), which allows users to adjust
the scale of the network as needed during evaluation. URNet includes a
Conditional Gating Module (CGM) that determines whether each residual block is
used
according to the input image and the desired scale. CGM is trained in a
supervised manner using the newly proposed scale loss and its corresponding
training methods. URNet can control the amount of computation according to the
user's demand without significantly degrading accuracy. It can also be used
as a general compression method by fixing the scale size during training. In
the experiments on ImageNet, URNet based on ResNet-101 maintains the accuracy
of the baseline even when resizing it to approximately 80% of the original
network, and shows only about 1% accuracy degradation when using about 65% of
the computation.

Comment: 12 pages
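The two mechanisms the abstract describes, a per-block gate conditioned on the input and the requested scale, plus a scale loss tying average gate activity to that scale, can be sketched as follows. This is a hypothetical toy (all function names, shapes, and parameterizations are invented, not URNet's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def cgm_gate(x, scale, w, b):
    # Toy Conditional Gating Module: a soft gate in (0, 1) computed from the
    # block's input features and the user-requested scale.
    return 1.0 / (1.0 + np.exp(-(w @ x + b * scale)))

def urnet_forward(x, blocks, gate_params, scale):
    # Run each residual block through its gate: a gate near 0 effectively
    # skips the block, reducing computation at that scale.
    gates = []
    for block, (w, b) in zip(blocks, gate_params):
        g = cgm_gate(x, scale, w, b)
        gates.append(g)
        x = x + g * block(x)  # gated residual connection
    return x, np.array(gates)

def scale_loss(gates, scale):
    # Supervise the gates so their mean activity matches the requested scale.
    return (gates.mean() - scale) ** 2

dim = 6
blocks = [lambda x, W=rng.normal(size=(dim, dim)) / dim: np.tanh(W @ x)
          for _ in range(3)]
gate_params = [(rng.normal(size=dim), rng.normal()) for _ in range(3)]

x = rng.normal(size=dim)
out, gates = urnet_forward(x, blocks, gate_params, scale=0.8)
loss = scale_loss(gates, 0.8)
```

Setting `scale=0.8` asks the network to use roughly 80% of its blocks on average; fixing one scale throughout training would recover the static-compression use the abstract mentions.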