ILLUME: Rationalizing Vision-Language Models through Human Interactions
Bootstrapping from pre-trained language models has been proven to be an
efficient approach for building vision-language models (VLMs) for tasks such as
image captioning or visual question answering. However, outputs of these models
rarely align with users' rationales for specific answers. To improve
this alignment and reinforce commonsense reasoning, we propose a tuning paradigm
based on human interactions with machine-generated data. Our method, ILLUME,
executes the following loop: given an image-question-answer prompt, the VLM samples
multiple candidate rationales, and a human critic provides minimal feedback via
preference selection, used for fine-tuning. This loop increases the training
data and gradually carves out the VLM's rationalization capabilities aligned
with human intent. Our exhaustive experiments demonstrate that ILLUME is
competitive with standard supervised fine-tuning while using significantly less
training data and requiring only minimal feedback.
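The loop described above can be sketched as follows. This is a minimal simulation, not the paper's implementation: `sample_rationales`, `human_selects`, and `fine_tune` are hypothetical placeholders standing in for VLM sampling, the human critic's preference selection, and a supervised fine-tuning step.

```python
import random

random.seed(0)

def sample_rationales(vlm_state, prompt, k=4):
    # Stand-in for VLM sampling: draw k candidate rationales.
    # In ILLUME this would be sampling from the actual model.
    return [f"rationale-{random.randint(0, 9)} for {prompt}" for _ in range(k)]

def human_selects(candidates):
    # Stand-in for the human critic: minimal feedback via preference
    # selection, simulated here by keeping a single candidate.
    return candidates[:1]

def fine_tune(vlm_state, train_set):
    # Stand-in for a fine-tuning step on the human-preferred rationales.
    return vlm_state + len(train_set)

def illume_loop(prompts, iterations=3):
    vlm_state, train_set = 0, []
    for _ in range(iterations):
        for prompt in prompts:
            candidates = sample_rationales(vlm_state, prompt)
            selected = human_selects(candidates)
            train_set.extend((prompt, r) for r in selected)
        vlm_state = fine_tune(vlm_state, train_set)
    return vlm_state, train_set

state, data = illume_loop(["img1-q1-a1", "img2-q2-a2"])
print(len(data))  # the pool of preferred rationales grows each iteration
```

The point of the sketch is the data flywheel: each pass adds human-preferred rationales to the training set before the next fine-tuning step.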
Mitigating Inappropriateness in Image Generation: Can there be Value in Reflecting the World's Ugliness?
Text-conditioned image generation models have recently achieved astonishing
results in image quality and text alignment and are consequently employed in a
fast-growing number of applications. Since they are highly data-driven, relying
on billion-sized datasets randomly scraped from the web, they also reproduce
inappropriate human behavior. Specifically, we demonstrate inappropriate
degeneration at a large scale for various generative text-to-image models, thus
motivating the need for monitoring and moderating them at deployment. To this
end, we evaluate mitigation strategies at inference to suppress the generation
of inappropriate content. Our findings show that we can use models'
representations of the world's ugliness to align them with human preferences
Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models
Text-conditioned image generation models have recently achieved astonishing
results in image quality and text alignment and are consequently employed in a
fast-growing number of applications. Since they are highly data-driven, relying
on billion-sized datasets randomly scraped from the internet, they also suffer,
as we demonstrate, from degenerated and biased human behavior. In turn, they
may even reinforce such biases. To help combat these undesired side effects, we
present safe latent diffusion (SLD). Specifically, to measure the inappropriate
degeneration due to unfiltered and imbalanced training sets, we establish a
novel image generation test bed, inappropriate image prompts (I2P), containing
dedicated, real-world image-to-text prompts covering concepts such as nudity
and violence. As our exhaustive empirical evaluation demonstrates, the
introduced SLD removes and suppresses inappropriate image parts during the
diffusion process, with no additional training required and no adverse effect
on overall image quality or text alignment.
Comment: Proceedings of the 22nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 202
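The safety-guidance idea behind SLD can be illustrated numerically. This is a sketch under stated assumptions, not the paper's exact formulation: standard classifier-free guidance pushes the noise estimate toward the text prompt, and an extra term pushes it away from an unsafe concept; the scales and the elementwise gating below are illustrative choices.

```python
import numpy as np

def sld_step(eps_uncond, eps_text, eps_unsafe,
             guidance_scale=7.5, safety_scale=5.0):
    """One guidance update in the spirit of safe latent diffusion.

    Adds a suppression term that steers the estimate away from an
    unsafe concept, gated so only components actually drifting toward
    that concept are modified. Scales and gating are assumptions.
    """
    text_dir = eps_text - eps_uncond
    unsafe_dir = eps_unsafe - eps_uncond
    # Suppress only where the prompt direction overlaps the unsafe direction.
    gate = (unsafe_dir * text_dir) > 0.0
    return eps_uncond + guidance_scale * text_dir - safety_scale * gate * unsafe_dir

# Toy 4-component noise estimates in place of latent tensors.
eps_u = np.zeros(4)
eps_t = np.array([1.0, 1.0, -1.0, 0.0])
eps_s = np.array([1.0, -1.0, 0.0, 2.0])
out = sld_step(eps_u, eps_t, eps_s)
print(out)
```

Because the correction is applied during the diffusion process itself, no retraining is needed, matching the abstract's claim that inappropriate parts are suppressed at inference time.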
Does CLIP Know My Face?
With the rise of deep learning in various applications, privacy concerns
around the protection of training data have become a critical area of research.
Whereas prior studies have focused on privacy risks in single-modal models, we
introduce a novel method to assess privacy for multi-modal models, specifically
vision-language models like CLIP. The proposed Identity Inference Attack (IDIA)
reveals whether an individual was included in the training data by querying the
model with images of the same person. Letting the model choose from a wide
variety of possible text labels, the model reveals whether it recognizes the
person and, therefore, was used for training. Our large-scale experiments on
CLIP demonstrate that individuals used for training can be identified with very
high accuracy. We confirm that the model has learned to associate names with
depicted individuals, implying the existence of sensitive information that can
be extracted by adversaries. Our results highlight the need for stronger
privacy protection in large-scale models and suggest that IDIAs can be used to
prove the unauthorized use of data for training and to enforce privacy laws.
Comment: 15 pages, 6 figures
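The attack's decision logic can be sketched with stand-in embeddings. This is a toy illustration, not the paper's protocol: random vectors replace real CLIP image and name embeddings, and the hit-rate threshold is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def idia(image_embs, name_embs, true_name_idx, min_hits=0.5):
    """Identity Inference Attack sketch: query a CLIP-like model with
    several images of one person against a wide set of candidate name
    labels; if the true name is the top match often enough, infer that
    the person appeared in the training data."""
    sims = image_embs @ name_embs.T  # similarity of each image to each name
    hits = (sims.argmax(axis=1) == true_name_idx).mean()
    return bool(hits >= min_hits)

# Toy setup: 100 candidate names, 5 query images, true name at index 7.
names = rng.normal(size=(100, 32))
member_imgs = names[7] + 0.1 * rng.normal(size=(5, 32))  # model "knows" them
stranger_imgs = rng.normal(size=(5, 32))                 # model does not

print(idia(member_imgs, names, 7), idia(stranger_imgs, names, 7))
```

The wide label set is what makes the inference meaningful: consistently picking one name out of many candidates indicates memorized name-face association rather than chance.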
LEDITS++: Limitless Image Editing using Text-to-Image Models
Text-to-image diffusion models have recently received increasing interest for
their astonishing ability to produce high-fidelity images from solely text
inputs. Subsequent research efforts aim to exploit and apply their capabilities
to real image editing. However, existing image-to-image methods are often
inefficient, imprecise, and of limited versatility. They either require
time-consuming fine-tuning, deviate unnecessarily strongly from the input
image, and/or lack support for multiple, simultaneous edits. To address these
issues, we introduce LEDITS++, an efficient yet versatile and precise textual
image manipulation technique. First, LEDITS++'s novel inversion approach
requires neither tuning nor optimization and produces high-fidelity results
with only a few diffusion steps. Second, our methodology supports multiple
simultaneous edits and is
architecture-agnostic. Third, we use a novel implicit masking technique that
limits changes to relevant image regions. We propose the novel TEdBench++
benchmark as part of our exhaustive evaluation. Our results demonstrate the
capabilities of LEDITS++ and its improvements over previous methods. The
project page is available at https://leditsplusplus-project.static.hf.space
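The implicit-masking idea can be illustrated with a small numerical sketch. This is an assumption-laden toy, not the LEDITS++ construction: the mask here is derived from a quantile of the edit-direction magnitude, so guidance is applied only where the edit signal is strongest.

```python
import numpy as np

def masked_edit_guidance(eps_uncond, eps_edit, scale=10.0, q=0.75):
    """Sketch of implicit masking: keep edits confined to relevant
    regions by masking the guidance term to the components where the
    edit direction is largest. The quantile threshold is illustrative."""
    direction = eps_edit - eps_uncond
    mask = np.abs(direction) >= np.quantile(np.abs(direction), q)
    return eps_uncond + scale * mask * direction

# Toy 8-pixel example: only two components carry a strong edit signal.
eps_u = np.zeros(8)
eps_e = np.array([0.1, 0.0, 2.0, 0.2, -1.5, 0.0, 0.1, 0.3])
out = masked_edit_guidance(eps_u, eps_e)
print(out)
```

The effect matches the abstract's claim: regions with weak edit signal are left untouched, so the result does not deviate unnecessarily from the input image.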
Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness
Generative AI models have recently achieved astonishing results in quality
and are consequently employed in a fast-growing number of applications.
However, since they are highly data-driven, relying on billion-sized datasets
randomly scraped from the internet, they also suffer from degenerated and
biased human behavior, as we demonstrate. In fact, they may even reinforce such
biases. To not only uncover but also combat these undesired effects, we present
a novel strategy, called Fair Diffusion, to attenuate biases after the
deployment of generative text-to-image models. Specifically, we demonstrate
shifting a bias in any direction based on human instructions, yielding
arbitrary new proportions for, e.g., identity groups. As our empirical
evaluation demonstrates, this introduced control enables instructing generative
image models on fairness, with no data filtering or additional training
required.
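Deployment-time control over group proportions can be illustrated with a toy sampler. This is a sketch, not Fair Diffusion itself: the real method steers the diffusion process with fair guidance, whereas here a biased generator is simply resampled toward user-specified target proportions; all names are illustrative.

```python
import random

random.seed(1)

def fair_sample(generate, group_of, target, n=1000):
    """Steer generations toward target identity-group proportions at
    deployment, without retraining or data filtering. Emulated here by
    accepting a sample only while its group is at or below its target
    share."""
    samples, counts = [], {g: 0 for g in target}
    while len(samples) < n:
        x = generate()
        g = group_of(x)
        share = counts[g] / max(1, len(samples))
        if share <= target[g]:  # reject over-represented groups
            samples.append(x)
            counts[g] += 1
    return counts

# A 90/10 biased generator, instructed to produce a 50/50 split.
biased = lambda: "A" if random.random() < 0.9 else "B"
counts = fair_sample(biased, lambda x: x, {"A": 0.5, "B": 0.5})
print(counts)
```

The key property mirrored from the abstract is that the target proportions are arbitrary user instructions applied after deployment, not a property of the training data.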
MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation
The recent popularity of text-to-image diffusion models (DM) can largely be
attributed to the intuitive interface they provide to users. The intended
generation can be expressed in natural language, with the model producing
faithful interpretations of text prompts. However, expressing complex or
nuanced ideas in text alone can be difficult. To ease image generation, we
propose MultiFusion that allows one to express complex and nuanced concepts
with arbitrarily interleaved inputs of multiple modalities and languages.
MultiFusion leverages pre-trained models and aligns them for integration into a
cohesive system, thereby avoiding the need for extensive training from scratch.
Our experimental results demonstrate the efficient transfer of capabilities
from individual modules to the downstream model. Specifically, the fusion of
all independent components allows the image generation module to utilize
multilingual, interleaved multimodal inputs despite being trained solely on
monomodal data in a single language.
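The interleaving of modalities can be sketched as follows. This is a minimal illustration under assumptions: random projections stand in for the pre-trained, aligned modality encoders, and the token counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16

# Stand-ins for frozen, pre-aligned encoders mapping each modality into a
# shared embedding space (the alignment itself is assumed, not shown).
text_encoder = lambda s: rng.normal(size=(len(s.split()), DIM))
image_encoder = lambda im: rng.normal(size=(4, DIM))  # e.g. 4 patch tokens

def encode_interleaved(prompt_parts):
    """Fuse arbitrarily interleaved text/image inputs into a single token
    sequence for the downstream image-generation module."""
    chunks = []
    for kind, content in prompt_parts:
        enc = text_encoder if kind == "text" else image_encoder
        chunks.append(enc(content))
    return np.concatenate(chunks, axis=0)

seq = encode_interleaved([("text", "a cat wearing"),
                          ("image", "hat.png"),
                          ("text", "in watercolor style")])
print(seq.shape)  # (3 + 4 + 3 tokens, 16)
```

Because every modality lands in one shared sequence, the downstream generator can condition on interleaved prompts even though each encoder was trained on a single modality.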
Measurement of the cosmic ray spectrum above eV using inclined events detected with the Pierre Auger Observatory
A measurement of the cosmic-ray spectrum for energies exceeding
eV is presented, which is based on the analysis of showers
with zenith angles greater than detected with the Pierre Auger
Observatory between 1 January 2004 and 31 December 2013. The measured spectrum
confirms a flux suppression at the highest energies. Above
eV, the "ankle", the flux can be described by a power law with
index followed by
a smooth suppression region. For the energy at which the
spectral flux has fallen to one-half of its extrapolated value in the absence
of suppression, we find
eV.
Comment: Replaced with published version. Added journal reference and DOI
Energy Estimation of Cosmic Rays with the Engineering Radio Array of the Pierre Auger Observatory
The Auger Engineering Radio Array (AERA) is part of the Pierre Auger
Observatory and is used to detect the radio emission of cosmic-ray air showers.
These observations are compared to the data of the surface detector stations of
the Observatory, which provide well-calibrated information on the cosmic-ray
energies and arrival directions. The response of the radio stations in the 30
to 80 MHz regime has been thoroughly calibrated to enable the reconstruction of
the incoming electric field. For the latter, the energy deposit per area is
determined from the radio pulses at each observer position and is interpolated
using a two-dimensional function that takes into account signal asymmetries due
to interference between the geomagnetic and charge-excess emission components.
The spatial integral over the signal distribution gives a direct measurement of
the energy transferred from the primary cosmic ray into radio emission in the
AERA frequency range. We measure 15.8 MeV of radiation energy for a 1 EeV air
shower arriving perpendicularly to the geomagnetic field. This radiation energy
-- corrected for geometrical effects -- is used as a cosmic-ray energy
estimator. Performing an absolute energy calibration against the
surface-detector information, we observe that this radio-energy estimator
scales quadratically with the cosmic-ray energy as expected for coherent
emission. We find an energy resolution of the radio reconstruction of 22% for
the data set and 17% for a high-quality subset containing only events with at
least five radio stations with signal.
Comment: Replaced with published version. Added journal reference and DOI
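The energy estimator described above can be written down directly from the abstract's calibration point: 15.8 MeV of radiation energy for a 1 EeV shower perpendicular to the geomagnetic field, with quadratic scaling as expected for coherent emission. The inversion below is a sketch; it assumes the geometry correction has already been applied.

```python
def cosmic_ray_energy(e_rad_mev, e_rad_ref_mev=15.8):
    """Invert the quadratic radiation-energy calibration.

    For coherent emission E_rad scales as E_CR^2, so with the abstract's
    reference point (15.8 MeV radiated at 1 EeV, perpendicular geometry,
    geometry corrections assumed applied) the cosmic-ray energy in EeV is
    the square root of the radiation-energy ratio.
    """
    return (e_rad_mev / e_rad_ref_mev) ** 0.5

print(cosmic_ray_energy(15.8))  # reference shower: 1 EeV
print(cosmic_ray_energy(63.2))  # 4x the radiation energy -> 2x the CR energy
```

The quadratic relation is also why the square root appears: doubling the cosmic-ray energy quadruples the coherently emitted radiation energy.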