Variational Inference for Learning Representations of Natural Language Edits
Document editing has become a pervasive component of the production of
information, with version control systems enabling edits to be efficiently
stored and applied. In light of this, the task of learning distributed
representations of edits has recently been proposed. Building on this, we
propose a novel approach that employs variational inference to learn a
continuous latent space of vector representations to capture the underlying
semantic information with regard to the document editing process. We achieve
this by introducing a latent variable to explicitly model the aforementioned
features. This latent variable is then combined with a document representation
to guide the generation of an edited version of this document. Additionally, to
facilitate standardized automatic evaluation of edit representations, which has
so far relied heavily on direct human input, we propose PEER, a suite of
downstream tasks specifically designed to measure the quality of edit
representations in the context of natural language processing.
Comment: Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)
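The core recipe here is a conditional VAE over edits. Below is a minimal PyTorch sketch of that pattern, not the authors' implementation; the module choices (GRU encoders, a shared embedding) and all sizes are illustrative assumptions. A latent edit code z is inferred from the (original, edited) pair, then broadcast alongside the original document's tokens to decode the edited version under the usual ELBO objective.

```python
import torch
import torch.nn as nn

class EditVAE(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256, z_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Inference network q(z | original, edited): encode both versions.
        self.enc = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.to_mu = nn.Linear(2 * hid_dim, z_dim)
        self.to_logvar = nn.Linear(2 * hid_dim, z_dim)
        # Generator p(edited | original, z): decode conditioned on the edit code.
        self.dec = nn.GRU(emb_dim + z_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def encode(self, orig, edited):
        _, h_o = self.enc(self.embed(orig))
        _, h_e = self.enc(self.embed(edited))
        h = torch.cat([h_o[-1], h_e[-1]], dim=-1)
        return self.to_mu(h), self.to_logvar(h)

    def forward(self, orig, edited):
        mu, logvar = self.encode(orig, edited)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        # Combine the edit code with the original document at every decode step.
        z_seq = z.unsqueeze(1).expand(-1, orig.size(1), -1)
        h, _ = self.dec(torch.cat([self.embed(orig), z_seq], dim=-1))
        logits = self.out(h)  # reconstruction term of the ELBO
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return logits, kl
```

In a full system the decoder would be autoregressive over the edited sequence; conditioning on the original tokens here stands in for the document representation the abstract mentions.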
DORSal: Diffusion for Object-centric Representations of Scenes et al.
Recent progress in 3D scene understanding enables scalable learning of
representations across large datasets of diverse scenes. As a consequence,
generalization to unseen scenes and objects, rendering novel views from just a
single or a handful of input images, and controllable scene generation that
supports editing are now possible. However, training jointly on a large number
of scenes typically compromises rendering quality when compared to single-scene
optimized models such as NeRFs. In this paper, we leverage recent progress in
diffusion models to equip 3D scene representation learning models with the
ability to render high-fidelity novel views, while retaining benefits such as
object-level scene editing to a large degree. In particular, we propose DORSal,
which adapts a video diffusion architecture for 3D scene generation conditioned
on frozen object-centric slot-based representations of scenes. On both complex
synthetic multi-object scenes and on the real-world large-scale Street View
dataset, we show that DORSal enables scalable neural rendering of 3D scenes
with object-level editing and improves upon existing approaches.
Comment: Project page: https://www.sjoerdvansteenkiste.com/dorsa
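The conditioning pattern the abstract describes, a diffusion denoiser driven by frozen object-centric slots, can be illustrated with a short sketch. This is an assumption-laden toy, not DORSal's video diffusion architecture; every name, the timestep embedding, and the dimensions are invented for illustration.

```python
import torch
import torch.nn as nn

class SlotConditionedDenoiser(nn.Module):
    def __init__(self, dim=256, slot_dim=128, heads=4):
        super().__init__()
        self.t_embed = nn.Linear(1, dim)                # toy timestep embedding
        self.slot_proj = nn.Linear(slot_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.noise_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                        nn.Linear(dim, dim))

    def forward(self, noisy_tokens, t, slots):
        # noisy_tokens: (B, N, dim) noised view tokens; t: (B, 1) timesteps;
        # slots: (B, K, slot_dim) from a pretrained slot encoder, kept frozen.
        h = noisy_tokens + self.t_embed(t).unsqueeze(1)
        cond = self.slot_proj(slots.detach())           # no gradient into slots
        attended, _ = self.cross_attn(h, cond, cond)    # view tokens query slots
        return self.noise_head(h + attended)            # predicted noise
```

Because the slots are detached, the denoiser learns to render from them without fine-tuning the object-centric encoder, and object-level edits amount to swapping or removing individual slot vectors before generation.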
Language-Based Image Editing with Recurrent Attentive Models
We investigate the problem of Language-Based Image Editing (LBIE). Given a
source image and a natural language description, we want to generate a target
image by editing the source image based on the description. We propose a
generic modeling framework for two sub-tasks of LBIE: language-based image
segmentation and image colorization. The framework uses recurrent attentive
models to fuse image and language features. Instead of using a fixed step size,
we introduce for each region of the image a termination gate to dynamically
determine after each inference step whether to continue extrapolating
additional information from the textual description. The effectiveness of the
framework is validated on three datasets. First, we introduce a synthetic
dataset, called CoSaL, to evaluate the end-to-end performance of our LBIE
system. Second, we show that the framework leads to state-of-the-art
performance on image segmentation on the ReferIt dataset. Third, we present the
first language-based colorization result on the Oxford-102 Flowers dataset.
Comment: Accepted to CVPR 2018 as a Spotlight
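The dynamic-step idea can be made concrete with a small sketch of a per-region termination gate; the names, the gating threshold, and the hard-stop behavior below are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class GatedRecurrentFusion(nn.Module):
    def __init__(self, dim=128, max_steps=5):
        super().__init__()
        self.max_steps = max_steps
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.gru = nn.GRUCell(dim, dim)
        self.gate = nn.Linear(dim, 1)  # per-region stop probability

    def forward(self, regions, text):
        # regions: (B, R, dim) image-region features; text: (B, T, dim) word features.
        B, R, D = regions.shape
        h = regions.reshape(B * R, D)
        alive = torch.ones(B * R, 1, device=regions.device)  # 1 while still reading
        for _ in range(self.max_steps):
            ctx, _ = self.attn(h.view(B, R, D), text, text)  # regions attend to words
            h_new = self.gru(ctx.reshape(B * R, D), h)
            h = alive * h_new + (1 - alive) * h              # frozen once terminated
            p_stop = torch.sigmoid(self.gate(h))
            alive = alive * (p_stop < 0.5).float()           # hard stop (inference-style)
        return h.view(B, R, D)
```

A training-time version would usually replace the hard threshold with a soft, differentiable weighting of the per-step states.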
Disentangling Content and Motion for Text-Based Neural Video Manipulation
Giving machines the ability to imagine possible new objects or scenes from
linguistic descriptions and produce their realistic renderings is arguably one
of the most challenging problems in computer vision. Recent advances in deep
generative models have led to new approaches that give promising results
towards this goal. In this paper, we introduce a new method called DiCoMoGAN
for manipulating videos with natural language, aiming to perform local and
semantic edits on a video clip to alter the appearances of an object of
interest. Our GAN architecture allows for better utilization of multiple
observations by disentangling content and motion to enable controllable
semantic edits. To this end, we introduce two tightly coupled networks: (i) a
representation network for constructing a concise understanding of motion
dynamics and temporally invariant content, and (ii) a translation network that
exploits the extracted latent content representation to actuate the
manipulation according to the target description. Our qualitative and
quantitative evaluations demonstrate that DiCoMoGAN significantly outperforms
existing frame-based methods, producing temporally coherent and semantically
more meaningful results.
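The content/motion factorization can be sketched in a few lines; the module names and shapes below are assumptions, not the DiCoMoGAN architecture, but they show the division of labor between the representation and translation networks.

```python
import torch
import torch.nn as nn

class ContentMotionEncoder(nn.Module):
    """Representation network: factor frames into a time-invariant content
    code and per-frame motion codes."""
    def __init__(self, feat_dim=256, c_dim=64, m_dim=32):
        super().__init__()
        self.to_content = nn.Linear(feat_dim, c_dim)
        self.to_motion = nn.Linear(feat_dim, m_dim)

    def forward(self, frame_feats):
        # frame_feats: (B, T, feat_dim) per-frame features.
        content = self.to_content(frame_feats).mean(dim=1)  # shared across time
        motion = self.to_motion(frame_feats)                # one code per frame
        return content, motion

class TextTranslator(nn.Module):
    """Translation network: edit only the content code, guided by the text."""
    def __init__(self, c_dim=64, txt_dim=128):
        super().__init__()
        self.edit = nn.Sequential(nn.Linear(c_dim + txt_dim, c_dim), nn.Tanh())

    def forward(self, content, text_emb):
        # Motion codes are reused unchanged by the video decoder (not shown).
        return self.edit(torch.cat([content, text_emb], dim=-1))
```

Because the per-frame motion codes pass through unedited, decoding the edited content with the original motion changes appearance while keeping the clip's dynamics, which is the source of the temporal coherence the abstract claims.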
LADIS: Language Disentanglement for 3D Shape Editing
Natural language interaction is a promising direction for democratizing 3D
shape design. However, existing methods for text-driven 3D shape editing face
challenges in producing decoupled, local edits to 3D shapes. We address this
problem by learning disentangled latent representations that ground language in
3D geometry. To this end, we propose a complementary tool set including a novel
network architecture, a disentanglement loss, and a new editing procedure.
Additionally, to measure edit locality, we define a new metric that we call
part-wise edit precision. We show that our method outperforms existing SOTA
methods by 20% in terms of edit locality, and up to 6.6% in terms of language
reference resolution accuracy. Our work suggests that by solely disentangling
language representations, downstream 3D shape editing can become more local to
relevant parts, even if the model was never given explicit part-based
supervision.
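The abstract introduces part-wise edit precision by name only; the function below is a hypothetical formulation of an edit-locality score of this kind, assuming corresponded point clouds with per-point part labels, to make the notion of "local to relevant parts" concrete.

```python
import numpy as np

def part_wise_edit_precision(before_pts, after_pts, part_labels, target_part):
    """Fraction of total point displacement that lands on the part the edit
    was meant to change (hypothetical formulation, not the paper's metric).

    before_pts, after_pts: (N, 3) corresponded point clouds before/after the edit.
    part_labels: (N,) integer part id per point; target_part: id of the edited part.
    """
    disp = np.linalg.norm(after_pts - before_pts, axis=1)  # per-point movement
    total = disp.sum()
    if total == 0:
        return 1.0  # nothing moved at all: trivially local
    return float(disp[part_labels == target_part].sum() / total)
```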