MAGIC: Mask-Guided Image Synthesis by Inverting a Quasi-Robust Classifier
We offer a method for one-shot mask-guided image synthesis that allows
controlling manipulations of a single image by inverting a quasi-robust
classifier equipped with strong regularizers. Our proposed method, entitled
MAGIC, leverages structured gradients from a pre-trained quasi-robust
classifier to better preserve the input semantics while retaining its
classification accuracy, thereby guaranteeing credibility in the synthesis.
Unlike current methods that use complex primitives to supervise the process or
use attention maps as a weak supervisory signal, MAGIC aggregates gradients
over the input, driven by a guide binary mask that enforces a strong, spatial
prior. MAGIC implements a series of manipulations with a single framework
achieving shape and location control, intense non-rigid shape deformations, and
copy/move operations in the presence of repeating objects, and gives users firm
control over the synthesis by requiring them only to specify binary guide masks.
Our study and findings are supported by various qualitative comparisons with
the state-of-the-art on the same images sampled from ImageNet and quantitative
analysis using machine perception along with a user survey of 100+ participants
that endorse our synthesis quality. Project page at
https://mozhdehrouhsedaghat.github.io/magic.html. Code is available at
https://github.com/mozhdehrouhsedaghat/magic.
Comment: Accepted to the Thirty-Seventh Conference on Artificial Intelligence
(AAAI) 2023 - 12 pages, 9 figures
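The idea of letting a binary guide mask decide where gradients may change the image can be illustrated with a toy update rule. This is only a minimal sketch under strong assumptions, not the MAGIC implementation: the image is a flat list of pixel values, the gradient comes from a hypothetical linear score rather than a quasi-robust classifier, and the regularizers mentioned in the abstract are not modeled. The names `masked_gradient_step` and `linear_score_grad` are illustrative.

```python
from typing import List


def linear_score_grad(weights: List[float]) -> List[float]:
    """Gradient of a toy linear score s(x) = sum(w_i * x_i) with respect to x.

    Assumption: stands in for the classifier gradient; for a linear score the
    gradient is simply the weight vector.
    """
    return list(weights)


def masked_gradient_step(
    image: List[float],
    grad: List[float],
    mask: List[int],
    step_size: float = 0.1,
) -> List[float]:
    """Apply one gradient-ascent step only where the binary mask is 1."""
    if not (len(image) == len(grad) == len(mask)):
        raise ValueError("image, grad and mask must have the same length")
    if any(m not in (0, 1) for m in mask):
        raise ValueError("mask must be binary")
    # Pixels with mask 0 receive a zero update and keep their original value.
    return [x + step_size * m * g for x, g, m in zip(image, grad, mask)]


image = [0.5, 0.5, 0.5, 0.5]
weights = [1.0, -2.0, 3.0, -4.0]   # hypothetical classifier weights
mask = [1, 0, 1, 0]                # guide mask: edit pixels 0 and 2 only

updated = masked_gradient_step(image, linear_score_grad(weights), mask)
```

In this toy setting, the pixels at positions 1 and 3 are left untouched because the mask is 0 there, which is the spatial-prior behavior the abstract describes at a much larger scale.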
How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions?
Text-to-image generative models have achieved unprecedented success in
generating high-quality images based on natural language descriptions. However,
it has been shown that these models tend to favor specific social groups when
prompted with neutral text descriptions (e.g., 'a photo of a lawyer').
Following Zhao et al. (2021), we study the effect on the diversity of the
generated images when adding an ethical intervention that supports equitable
judgment (e.g., 'if all individuals can be a lawyer irrespective of their
gender') in the input prompts. To this end, we introduce an Ethical NaTural
Language Interventions in Text-to-Image GENeration (ENTIGEN) benchmark dataset
to evaluate the change in image generations conditional on ethical
interventions across three social axes -- gender, skin color, and culture.
Through the ENTIGEN framework, we find that the generations from minDALL.E,
DALL.E-mini and Stable Diffusion cover diverse social groups while preserving
the image quality. Preliminary studies indicate that a large change in the
model predictions is triggered by certain phrases such as 'irrespective of
gender' in the context of gender bias in the ethical interventions. We release
code and annotated data at https://github.com/Hritikbansal/entigen_emnlp.
Comment: 13 pages, 8 figures, 6 tables. Accepted as Oral Presentation at EMNLP
202
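The prompt construction described in this abstract can be sketched as a small helper that appends an ethical intervention to a neutral prompt. This is an illustrative sketch only: the function name `add_intervention` and the exact template are assumptions modeled on the single example quoted above, and the benchmark itself may phrase interventions differently.

```python
def add_intervention(neutral_prompt: str, role: str, axis: str) -> str:
    """Append an ethical intervention to a neutral text-to-image prompt.

    Assumption: the template mirrors the example in the abstract
    ('if all individuals can be a lawyer irrespective of their gender').
    """
    intervention = f"if all individuals can be a {role} irrespective of their {axis}"
    return f"{neutral_prompt} {intervention}"


neutral = "a photo of a lawyer"
# The three social axes named in the abstract.
axes = ["gender", "skin color", "culture"]
prompts = [add_intervention(neutral, "lawyer", axis) for axis in axes]
```

Each resulting prompt would then be sent to the text-to-image model in place of the neutral prompt, so that the generated images can be compared with and without the intervention.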
MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models
Large-scale language models have shown the ability to adapt to a new task via
conditioning on a few demonstrations (i.e., in-context learning). However, in
the vision-language domain, most large-scale pre-trained vision-language (VL)
models do not possess the ability to conduct in-context learning. How can we
enable in-context learning for VL models? In this paper, we study an
interesting hypothesis: can we transfer the in-context learning ability from
the language domain to the VL domain? Specifically, we first meta-train a language
model to perform in-context learning on NLP tasks (as in MetaICL); then we
transfer this model to perform VL tasks by attaching a visual encoder. Our
experiments suggest that in-context learning ability can indeed be transferred
across modalities: our model considerably improves the in-context learning
capability on VL tasks and can even compensate significantly for the size of
the model. On VQA, OK-VQA, and GQA, our method can outperform the
baseline model while having 20 times fewer parameters.
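In-context learning, as referred to in this abstract, conditions a model on a few demonstrations followed by a query. The sketch below shows only the text side of that idea, assuming a simple "Question/Answer" layout; the function name `build_icl_prompt` and the layout are hypothetical, and the visual encoder that MetaVL attaches is not modeled here.

```python
from typing import List, Tuple


def build_icl_prompt(demos: List[Tuple[str, str]], query: str) -> str:
    """Concatenate (question, answer) demonstrations and an unanswered query."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in demos]
    # The final block leaves the answer empty for the model to complete.
    blocks.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(blocks)


demos = [
    ("What color is the sky on a clear day?", "blue"),
    ("How many legs does a cat have?", "four"),
]
prompt = build_icl_prompt(demos, "What color is grass?")
```

In a vision-language setting, each demonstration would also carry image features, which is the part this text-only sketch leaves out.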
GeoMLAMA: Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models
Recent work has shown that Pre-trained Language Models (PLMs) have the
ability to store the relational knowledge from pre-training data in their model
parameters. However, it is not clear to what extent PLMs store
geo-diverse commonsense knowledge, the knowledge associated with a culture and
only shared locally. For instance, the color of a bridal dress is white in
American weddings whereas it is red in Chinese weddings. Here, we wish to probe
if PLMs can predict white and red as the color of the bridal dress when queried
for American and Chinese weddings, respectively. To this end, we introduce a
framework for geo-diverse commonsense probing on multilingual PLMs (mPLMs) and
introduce a corresponding benchmark dataset, Geo-diverse Commonsense Multilingual
Language Model Analysis (GeoMLAMA). GeoMLAMA contains 3125 prompts in
English, Chinese, Hindi, Persian, and Swahili, with a wide coverage of concepts
shared by people from American, Chinese, Indian, Iranian and Kenyan cultures.
We benchmark 11 standard mPLMs which include variants of mBERT, XLM, mT5, and
XGLM on GeoMLAMA. Interestingly, we find that 1) larger mPLM variants do not
necessarily store geo-diverse concepts better than their smaller variants; 2)
mPLMs are not intrinsically biased towards knowledge from Western countries
(the United States); 3) the native language of a country may not be the best
language to probe its knowledge; and 4) a language may better probe knowledge
about a non-native country than about its native country.
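The probing setup in this abstract, querying a model with a country-specific prompt and checking which candidate answer it prefers, can be sketched as a ranking over candidates. This is a minimal sketch under stated assumptions: the template wording, the `[X]` and `[MASK]` placeholders, and the `toy_score` lookup table are all hypothetical stand-ins, and the toy numbers are not results from any model. A real probe would replace `toy_score` with scores from an mPLM.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical prompt template; [X] is the country slot, [MASK] the answer slot.
TEMPLATE = "The color of the bridal dress in [X] weddings is [MASK]."


def fill_country(template: str, country: str) -> str:
    """Insert the country adjective into the prompt template."""
    return template.replace("[X]", country)


def rank_candidates(
    template: str,
    country: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
) -> List[str]:
    """Return candidates sorted from most to least preferred by score_fn."""
    prompt = fill_country(template, country)
    return sorted(candidates, key=lambda c: score_fn(prompt, c), reverse=True)


# Made-up scores standing in for model probabilities (assumption, not data).
TOY_SCORES: Dict[Tuple[str, str], float] = {
    (fill_country(TEMPLATE, "American"), "white"): 0.7,
    (fill_country(TEMPLATE, "American"), "red"): 0.2,
    (fill_country(TEMPLATE, "Chinese"), "white"): 0.3,
    (fill_country(TEMPLATE, "Chinese"), "red"): 0.6,
}


def toy_score(prompt: str, candidate: str) -> float:
    """Look up a made-up score; unseen pairs score 0.0."""
    return TOY_SCORES.get((prompt, candidate), 0.0)


american = rank_candidates(TEMPLATE, "American", ["red", "white"], toy_score)
chinese = rank_candidates(TEMPLATE, "Chinese", ["red", "white"], toy_score)
```

Keeping the scorer as a parameter makes it straightforward to swap in different models or prompt languages, which is the kind of comparison the findings above are based on.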