Manipulating Attributes of Natural Scenes via Hallucination
In this study, we explore building a two-stage framework that enables users
to directly manipulate high-level attributes of a natural scene. The key to our
approach is a deep generative network that can hallucinate images of a scene
as if they were taken in a different season (e.g., during winter), weather
condition (e.g., on a cloudy day), or time of day (e.g., at sunset). Once the
scene is hallucinated with the given attributes, the corresponding look is
transferred to the input image while keeping the semantic details intact,
yielding a photo-realistic manipulation result. Because the proposed framework
hallucinates what the scene will look like, it does not require a reference
style image, as is commonly needed in most appearance or style transfer
approaches. Moreover, it can manipulate a given scene according to a diverse
set of transient attributes simultaneously within a single model, eliminating
the need to train a separate network for each translation task. Our
comprehensive set of qualitative and quantitative results demonstrates the
effectiveness of our approach against competing methods.

Comment: Accepted for publication in ACM Transactions on Graphics
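As a rough illustration of the two-stage idea, the sketch below pairs a conditional generator (stage 1: hallucinate the scene under target transient attributes) with a crude look-transfer step (stage 2). All class and function names, the 40-dimensional attribute vector, and the mean/std statistics matching are illustrative assumptions; the paper's actual networks and its photorealistic style transfer are more elaborate.

import torch
import torch.nn as nn

class AttributeGenerator(nn.Module):
    # Stage 1 (assumed architecture): hallucinate the scene under target
    # transient attributes by conditioning a small encoder-decoder on an
    # attribute vector (attr_dim=40 is an illustrative choice).
    def __init__(self, attr_dim=40, latent_dim=128):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_dim, 4, stride=2, padding=1), nn.ReLU())
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(latent_dim + attr_dim, 64, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, img, attrs):
        z = self.encode(img)
        # Broadcast the attribute vector over spatial positions, then decode.
        a = attrs[:, :, None, None].expand(-1, -1, z.size(2), z.size(3))
        return self.decode(torch.cat([z, a], dim=1))

def transfer_look(input_img, hallucinated):
    # Stage 2 (crude stand-in): match per-channel mean/std of the input to the
    # hallucinated look; the paper instead uses a photorealistic style transfer
    # that keeps semantic details intact.
    mu_i = input_img.mean(dim=(2, 3), keepdim=True)
    sd_i = input_img.std(dim=(2, 3), keepdim=True)
    mu_h = hallucinated.mean(dim=(2, 3), keepdim=True)
    sd_h = hallucinated.std(dim=(2, 3), keepdim=True)
    return (input_img - mu_i) / (sd_i + 1e-5) * sd_h + mu_h

In this toy pipeline, transfer_look(x, AttributeGenerator()(x, a)) with x of shape (B, 3, H, W) and a of shape (B, 40) would produce the manipulated result.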
Text-to-image Editing by Image Information Removal
Diffusion models have demonstrated impressive performance in text-guided
image generation. To leverage the knowledge of text-guided image generation
models in image editing, current approaches either fine-tune the pretrained
models using the input image (e.g., Imagic) or incorporate structural
information as additional constraints into the pretrained models (e.g.,
ControlNet). However, fine-tuning large-scale diffusion models on a single
image can lead to severe overfitting issues and lengthy inference time. The
information leakage from pretrained models makes it challenging to preserve the
text-irrelevant content of the input image while generating new features guided
by language descriptions. On the other hand, methods that incorporate
structural guidance (e.g., edge maps, semantic maps, keypoints) as additional
constraints face limitations in preserving other attributes of the original
image, such as colors or textures. A straightforward way to incorporate the
original image is to directly use it as an additional control. However, since
image editing methods are typically trained on an image reconstruction task,
doing so can lead to the identical mapping issue, where the model learns to
output an image identical to the input, limiting its editing capability. To
address these challenges, we propose a text-to-image editing model with an
Image Information Removal (IIR) module that selectively erases color- and
texture-related information from the original image, allowing us to better
preserve the text-irrelevant content and avoid the identical mapping issue. We
evaluate our model on three benchmark datasets: CUB, Outdoor Scenes, and COCO.
Our approach achieves the best editability-fidelity trade-off, and our edited
images are preferred by annotators approximately 35% more often than prior art
on COCO.
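The abstract describes the IIR module as selectively erasing color- and texture-related information. A hand-coded stand-in for that idea might look like the following, where grayscale conversion drops color and a blur suppresses texture before the image is used as a control input. The function name and the exact operations are assumptions; the paper's module is learned rather than fixed.

import torch
import torch.nn.functional as F

def remove_image_information(img, blur_kernel=9):
    # img: (B, 3, H, W) in [0, 1]. Returns a structure-only conditioning image.
    # Remove color-related information: luminance-weighted grayscale.
    weights = torch.tensor([0.299, 0.587, 0.114],
                           device=img.device).view(1, 3, 1, 1)
    gray = (img * weights).sum(dim=1, keepdim=True)
    # Remove texture-related information: an average-pool blur keeps only
    # the coarse spatial layout.
    pad = blur_kernel // 2
    blurred = F.avg_pool2d(F.pad(gray, (pad,) * 4, mode="reflect"),
                           kernel_size=blur_kernel, stride=1)
    # Repeat to 3 channels so the result can stand in for the original
    # image as an additional control.
    return blurred.expand(-1, 3, -1, -1)

Because this conditioning image no longer carries the input's colors or textures, a model trained to reconstruct from it cannot fall back on copying the input, which is the intuition behind avoiding the identical mapping issue.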
Learning to Map the Visual and Auditory World
The appearance of the world varies dramatically not only from place to place but also from hour to hour and month to month. Billions of images that capture this complex relationship are uploaded to social-media websites every day, often with precise time and location metadata. This rich source of data can improve our understanding of the globe. In this work, we propose a general framework that uses these publicly available images to construct dense maps of different ground-level attributes from overhead imagery. In particular, we use well-defined probabilistic models and a weakly-supervised, multi-task training strategy to estimate the expected visual and auditory ground-level attributes at a location, namely the types of scenes, objects, and sounds a person can experience there. Through a large-scale evaluation on real data, we show that our learned models support applications including mapping, image localization, image retrieval, and metadata verification.
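A minimal sketch of the multi-task setup the abstract describes: one shared encoder over an overhead image, with separate heads predicting distributions over scene, object, and sound categories. The backbone, head sizes, and category counts here are illustrative assumptions, not details from the paper.

import torch
import torch.nn as nn

class GroundLevelAttributeNet(nn.Module):
    def __init__(self, n_scenes=205, n_objects=80, n_sounds=50):
        super().__init__()
        # Shared features from the overhead image (toy backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # One head per ground-level attribute type; a softmax over each head
        # gives a distribution interpretable as the expected attributes at
        # that location.
        self.scene_head = nn.Linear(64, n_scenes)
        self.object_head = nn.Linear(64, n_objects)
        self.sound_head = nn.Linear(64, n_sounds)

    def forward(self, overhead_img):
        feat = self.backbone(overhead_img)
        return {
            "scenes": self.scene_head(feat).softmax(dim=-1),
            "objects": self.object_head(feat).softmax(dim=-1),
            "sounds": self.sound_head(feat).softmax(dim=-1),
        }

Under the weakly-supervised scheme the abstract outlines, the training targets for each head would come from the noisy labels attached to geotagged ground-level media rather than from hand-annotated overhead imagery.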