A Generative Adversarial Approach for Zero-Shot Learning from Noisy Texts
Most existing zero-shot learning methods treat the problem as one of visual-semantic embedding. Given the demonstrated capability of Generative Adversarial Networks (GANs) to generate images, we instead leverage GANs to imagine unseen categories from text descriptions and hence recognize novel classes without having seen any examples. Specifically, we propose a simple yet effective generative model that takes as input noisy text descriptions of an unseen class (e.g. Wikipedia articles) and generates synthesized visual features for that class. With this added pseudo data, zero-shot learning is naturally converted into a traditional classification problem. Additionally, to preserve the inter-class discrimination of the generated features, a visual pivot regularization is proposed as an explicit form of supervision. Unlike previous methods that rely on complex engineered regularizers, our approach suppresses noise well without additional regularization. Empirically, we show that our method consistently outperforms the state of the art on the largest available benchmarks for text-based zero-shot learning.
Comment: To appear in CVPR1
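The visual pivot idea lends itself to a compact illustration: a generator synthesizes visual features conditioned on a text embedding, and the mean of the generated features for each class is pulled toward that class's real feature mean (its "pivot"). Below is a minimal PyTorch sketch under assumed dimensions and a stand-in generator; the network layout and loss weighting are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch: generate visual features from text embeddings, with a
# visual-pivot regularizer. Dimensions and architecture are
# illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn

TEXT_DIM, NOISE_DIM, FEAT_DIM = 300, 100, 2048

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(TEXT_DIM + NOISE_DIM, 1024),
            nn.LeakyReLU(0.2),
            nn.Linear(1024, FEAT_DIM),
            nn.ReLU(),  # assume non-negative CNN-style visual features
        )

    def forward(self, text_emb, noise):
        return self.net(torch.cat([text_emb, noise], dim=1))

def visual_pivot_loss(fake_feats, class_ids, pivots):
    """Pull the mean generated feature of each class toward that
    class's pivot (the mean of its real visual features)."""
    classes = class_ids.unique()
    loss = 0.0
    for c in classes:
        gen_mean = fake_feats[class_ids == c].mean(dim=0)
        loss = loss + ((gen_mean - pivots[c]) ** 2).mean()
    return loss / len(classes)

# Toy usage: 4 seen classes, random stand-ins for text embeddings
# and per-class visual pivots.
G = Generator()
pivots = torch.rand(4, FEAT_DIM)              # real class means (stand-in)
text_emb = torch.randn(32, TEXT_DIM)          # noisy text embeddings
class_ids = torch.randint(0, 4, (32,))
fake = G(text_emb, torch.randn(32, NOISE_DIM))
reg = visual_pivot_loss(fake, class_ids, pivots)  # added to the GAN loss
```

The regularizer supplies explicit per-class supervision without hand-engineering a noise model, which is the discriminative property the abstract emphasizes.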
Text-based Editing of Talking-head Video
Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e. no jump cuts). Our method automatically annotates an input talking-head video with phonemes, visemes, 3D face pose and geometry, reflectance, expression, and scene illumination per frame. To edit a video, the user only has to edit the transcript; an optimization strategy then chooses segments of the input corpus as base material. The annotated parameters corresponding to the selected segments are seamlessly stitched together and used to produce an intermediate video representation in which the lower half of the face is rendered with a parametric face model. Finally, a recurrent video generation network transforms this representation into a photorealistic video that matches the edited transcript. We demonstrate a large variety of edits, such as the addition, removal, and alteration of words, as well as convincing language translation and full sentence synthesis.
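To make the segment-selection step concrete, consider covering the phoneme sequence of the edited transcript with subsequences taken from the annotated input corpus. The greedy longest-match strategy and flat data layout below are simplifying assumptions for illustration; the paper itself formulates this as an optimization over the per-frame annotated parameters.

```python
# Sketch: greedily cover an edited phoneme sequence with the longest
# matching runs from a phoneme-labelled corpus. Greedy matching and
# the list-of-strings layout are illustrative assumptions.
from typing import List, Tuple

def longest_match(corpus: List[str], target: List[str]) -> Tuple[int, int]:
    """Return (start, length) of the longest corpus run matching a prefix of target."""
    best = (0, 0)
    for s in range(len(corpus)):
        n = 0
        while (s + n < len(corpus) and n < len(target)
               and corpus[s + n] == target[n]):
            n += 1
        if n > best[1]:
            best = (s, n)
    return best

def select_segments(corpus: List[str], target: List[str]) -> List[Tuple[int, int]]:
    """Cover `target` with (start, length) spans taken from `corpus`."""
    segments = []
    while target:
        start, length = longest_match(corpus, target)
        if length == 0:
            raise ValueError(f"phoneme {target[0]!r} not in corpus")
        segments.append((start, length))
        target = target[length:]
    return segments

# Toy usage: phoneme strings stand in for annotated video frames.
corpus = ["HH", "EH", "L", "OW", "W", "ER", "L", "D"]  # "hello world"
edited = ["W", "EH", "L", "D"]                         # hypothetical edit
print(select_segments(corpus, edited))                 # [(4, 1), (1, 2), (7, 1)]
```

Each returned span would then index the annotated pose, geometry, and illumination parameters that get stitched and rendered, rather than raw frames.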
Where and Who? Automatic Semantic-Aware Person Composition
Image compositing is a method for generating realistic yet fake imagery by inserting content from one image into another. Previous work in compositing has focused on improving the appearance compatibility of a user-selected foreground segment and a background image (i.e. color and illumination consistency). In this work, we instead develop a fully automated compositing model that additionally learns to select and transform compatible foreground segments from a large collection, given only an input background image. To simplify the task, we restrict the problem to human instance composition, because human segments exhibit strong correlations with their background and because large annotated datasets are available. We develop a novel branching Convolutional Neural Network (CNN) that jointly predicts candidate person locations given a background image. We then use pre-trained deep feature representations to retrieve person instances from a large segment database. Experimental results show that our model can generate composite images that look visually convincing. We also develop a user interface to demonstrate a potential application of our method.
Comment: 10 pages, 9 figures
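The retrieval step reduces to a nearest-neighbor search in deep feature space. The sketch below ranks a segment database by cosine similarity to a query feature taken at a predicted person location; the feature dimension and the random stand-in features are assumptions, since in practice both would come from a pre-trained CNN.

```python
# Sketch: rank person segments by cosine similarity between deep
# features of the query context and each candidate segment.
# Random features stand in for pre-trained CNN activations.
import numpy as np

FEAT_DIM = 2048
rng = np.random.default_rng(0)

seg_db = rng.standard_normal((10_000, FEAT_DIM))  # segment features (stand-in)
query = rng.standard_normal(FEAT_DIM)             # feature at predicted location

def top_k_segments(query: np.ndarray, db: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k database segments most similar to the query."""
    db_n = db / np.linalg.norm(db, axis=1, keepdims=True)
    q_n = query / np.linalg.norm(query)
    sims = db_n @ q_n           # cosine similarities against the whole database
    return np.argsort(-sims)[:k]  # best-first

print(top_k_segments(query, seg_db))
```

Normalizing both sides makes the dot product a cosine similarity, so retrieval is insensitive to feature magnitude and depends only on appearance direction in the embedding space.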