    ILLUME: Rationalizing Vision-Language Models through Human Interactions

    Bootstrapping from pre-trained language models has proven to be an efficient approach for building vision-language models (VLMs) for tasks such as image captioning or visual question answering. However, the outputs of these models rarely align with users' rationales for specific answers. To improve this alignment and reinforce commonsense reasons, we propose a tuning paradigm based on human interactions with machine-generated data. Our ILLUME executes the following loop: given an image-question-answer prompt, the VLM samples multiple candidate rationales, and a human critic provides minimal feedback via preference selection, which is used for fine-tuning. This loop increases the training data and gradually carves out the VLM's rationalization capabilities that are aligned with human intent. Our exhaustive experiments demonstrate that ILLUME is competitive with standard supervised fine-tuning while using significantly less training data and requiring only minimal feedback.
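
    The loop described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `sample_rationales`, `pick_preferred`, and `fine_tune` are hypothetical callables standing in for the VLM sampler, the human critic, and the training step, and accumulating the selected rationales across rounds is one possible reading of "increases the training data".

```python
from typing import Callable, List, Optional, Tuple

# (image id, question, answer); a hypothetical prompt representation.
Prompt = Tuple[str, str, str]
Example = Tuple[Prompt, str]


def illume_loop(
    prompts: List[Prompt],
    sample_rationales: Callable[[Prompt, int], List[str]],
    pick_preferred: Callable[[Prompt, List[str]], Optional[int]],
    fine_tune: Callable[[List[Example]], None],
    rounds: int = 3,
    n_candidates: int = 5,
) -> List[Example]:
    """Sample rationales, collect human preferences, fine-tune, repeat."""
    training_data: List[Example] = []
    for _ in range(rounds):
        for prompt in prompts:
            # The model proposes several candidate rationales for the prompt.
            candidates = sample_rationales(prompt, n_candidates)
            # The critic returns the index of a fitting rationale,
            # or None if no candidate is acceptable.
            choice = pick_preferred(prompt, candidates)
            if choice is not None:
                training_data.append((prompt, candidates[choice]))
        # Assumption: fine-tune on everything selected so far.
        fine_tune(training_data)
    return training_data
```

    Passing the three steps in as callables keeps the sketch independent of any particular model or annotation tool.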

    Mitigating Inappropriateness in Image Generation: Can there be Value in Reflecting the World's Ugliness?

    Text-conditioned image generation models have recently achieved astonishing results in image quality and text alignment and are consequently employed in a fast-growing number of applications. Since they are highly data-driven, relying on billion-sized datasets randomly scraped from the web, they also reproduce inappropriate human behavior. Specifically, we demonstrate inappropriate degeneration on a large scale for various generative text-to-image models, thus motivating the need for monitoring and moderating them at deployment. To this end, we evaluate mitigation strategies at inference to suppress the generation of inappropriate content. Our findings show that we can use models' representations of the world's ugliness to align them with human preferences.

    Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models

    Text-conditioned image generation models have recently achieved astonishing results in image quality and text alignment and are consequently employed in a fast-growing number of applications. Since they are highly data-driven, relying on billion-sized datasets randomly scraped from the internet, they also suffer, as we demonstrate, from degenerated and biased human behavior. In turn, they may even reinforce such biases. To help combat these undesired side effects, we present safe latent diffusion (SLD). Specifically, to measure the inappropriate degeneration due to unfiltered and imbalanced training sets, we establish a novel image generation test bed, inappropriate image prompts (I2P), containing dedicated, real-world image-to-text prompts covering concepts such as nudity and violence. As our exhaustive empirical evaluation demonstrates, the introduced SLD removes and suppresses inappropriate image parts during the diffusion process, with no additional training required and no adverse effect on overall image quality or text alignment. Comment: Proceedings of the 22nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 202

    Does CLIP Know My Face?

    With the rise of deep learning in various applications, privacy concerns around the protection of training data have become a critical area of research. Whereas prior studies have focused on privacy risks in single-modal models, we introduce a novel method to assess privacy for multi-modal models, specifically vision-language models like CLIP. The proposed Identity Inference Attack (IDIA) reveals whether an individual was included in the training data by querying the model with images of the same person. When the model is allowed to choose from a wide variety of possible text labels, its choice reveals whether it recognizes the person and, therefore, whether that person's data was used for training. Our large-scale experiments on CLIP demonstrate that individuals used for training can be identified with very high accuracy. We confirm that the model has learned to associate names with depicted individuals, implying the existence of sensitive information that can be extracted by adversaries. Our results highlight the need for stronger privacy protection in large-scale models and suggest that IDIAs can be used to prove the unauthorized use of data for training and to enforce privacy laws. Comment: 15 pages, 6 figures
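
    The attack described in this abstract can be illustrated with a small sketch. It is a simplification under stated assumptions, not the paper's procedure: `predict_name` is a hypothetical stand-in for zero-shot classification with a vision-language model, and the `min_correct` decision threshold is an assumed parameter that the abstract does not specify.

```python
from typing import Callable, List, Sequence


def identity_inference(
    images: Sequence[str],
    true_name: str,
    candidate_names: Sequence[str],
    predict_name: Callable[[str, List[str]], str],
    min_correct: int = 1,
) -> bool:
    """Guess whether a person was part of the training data.

    Each image of the person is shown to the model together with a pool
    of candidate names. If the model picks the person's real name for at
    least `min_correct` images, the person is flagged as likely seen
    during training.
    """
    names = list(candidate_names)
    if true_name not in names:
        # The real name must be among the labels the model can choose from.
        names.append(true_name)
    correct = sum(1 for image in images if predict_name(image, names) == true_name)
    return correct >= min_correct
```

    A larger pool of candidate names makes a correct prediction by chance less likely, which is presumably why the abstract stresses a wide variety of possible labels.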

    LEDITS++: Limitless Image Editing using Text-to-Image Models

    Text-to-image diffusion models have recently received increasing interest for their astonishing ability to produce high-fidelity images from solely text inputs. Subsequent research efforts aim to exploit and apply their capabilities to real image editing. However, existing image-to-image methods are often inefficient, imprecise, and of limited versatility. They either require time-consuming fine-tuning, deviate unnecessarily strongly from the input image, and/or lack support for multiple, simultaneous edits. To address these issues, we introduce LEDITS++, an efficient yet versatile and precise textual image manipulation technique. First, LEDITS++'s novel inversion approach requires neither tuning nor optimization and produces high-fidelity results with a few diffusion steps. Second, our methodology supports multiple simultaneous edits and is architecture-agnostic. Third, we use a novel implicit masking technique that limits changes to relevant image regions. We propose the novel TEdBench++ benchmark as part of our exhaustive evaluation. Our results demonstrate the capabilities of LEDITS++ and its improvements over previous methods. The project page is available at https://leditsplusplus-project.static.hf.space

    Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness

    Generative AI models have recently achieved astonishing results in quality and are consequently employed in a fast-growing number of applications. However, since they are highly data-driven, relying on billion-sized datasets randomly scraped from the internet, they also suffer from degenerated and biased human behavior, as we demonstrate. In fact, they may even reinforce such biases. To not only uncover but also combat these undesired effects, we present a novel strategy, called Fair Diffusion, to attenuate biases after the deployment of generative text-to-image models. Specifically, we demonstrate shifting a bias, based on human instructions, in any direction, yielding arbitrarily new proportions for, e.g., identity groups. As our empirical evaluation demonstrates, this introduced control enables instructing generative image models on fairness, with no data filtering or additional training required.

    MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation

    The recent popularity of text-to-image diffusion models (DM) can largely be attributed to the intuitive interface they provide to users. The intended generation can be expressed in natural language, with the model producing faithful interpretations of text prompts. However, expressing complex or nuanced ideas in text alone can be difficult. To ease image generation, we propose MultiFusion, which allows one to express complex and nuanced concepts with arbitrarily interleaved inputs of multiple modalities and languages. MultiFusion leverages pre-trained models and aligns them for integration into a cohesive system, thereby avoiding the need for extensive training from scratch. Our experimental results demonstrate the efficient transfer of capabilities from individual modules to the downstream model. Specifically, the fusion of all independent components allows the image generation module to utilize multilingual, interleaved multimodal inputs despite being trained solely on monomodal data in a single language.

    Measurement of the cosmic ray spectrum above $4{\times}10^{18}$ eV using inclined events detected with the Pierre Auger Observatory

    A measurement of the cosmic-ray spectrum for energies exceeding $4{\times}10^{18}$ eV is presented, which is based on the analysis of showers with zenith angles greater than $60^{\circ}$ detected with the Pierre Auger Observatory between 1 January 2004 and 31 December 2013. The measured spectrum confirms a flux suppression at the highest energies. Above $5.3{\times}10^{18}$ eV, the "ankle", the flux can be described by a power law $E^{-\gamma}$ with index $\gamma=2.70 \pm 0.02 \,\text{(stat)} \pm 0.1\,\text{(sys)}$ followed by a smooth suppression region. For the energy ($E_\text{s}$) at which the spectral flux has fallen to one-half of its extrapolated value in the absence of suppression, we find $E_\text{s}=(5.12\pm0.25\,\text{(stat)}^{+1.0}_{-1.2}\,\text{(sys)}){\times}10^{19}$ eV. Comment: Replaced with published version. Added journal reference and DOI
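
    To make the definition of $E_\text{s}$ concrete, the following sketch evaluates a power law above the ankle multiplied by a smooth suppression factor. The abstract does not give the functional form of the suppression, so the logistic factor in $\log_{10} E$ and its width `log10_width` are assumptions chosen only so that the flux is exactly half of the extrapolated power law at $E_\text{s}$. The normalization is arbitrary.

```python
import math

GAMMA = 2.70      # spectral index above the ankle (from the abstract)
E_ANKLE = 5.3e18  # ankle energy in eV (from the abstract)
E_S = 5.12e19     # energy of half suppression in eV (from the abstract)


def power_law_flux(energy_ev: float, norm: float = 1.0) -> float:
    """Unsuppressed power law E^-gamma, in arbitrary units."""
    return norm * (energy_ev / E_ANKLE) ** (-GAMMA)


def suppressed_flux(energy_ev: float, norm: float = 1.0,
                    log10_width: float = 0.15) -> float:
    """Power law times an assumed logistic suppression in log10(E).

    log10_width is a placeholder value, not a measured quantity.
    """
    x = (math.log10(energy_ev) - math.log10(E_S)) / log10_width
    return power_law_flux(energy_ev, norm) / (1.0 + math.exp(x))


if __name__ == "__main__":
    ratio = suppressed_flux(E_S) / power_law_flux(E_S)
    print(f"flux ratio at E_s: {ratio:.2f}")
```

    Whatever width is chosen, the ratio of suppressed to extrapolated flux at `E_S` is one-half in this parameterization, which matches the definition given in the abstract.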

    Energy Estimation of Cosmic Rays with the Engineering Radio Array of the Pierre Auger Observatory

    The Auger Engineering Radio Array (AERA) is part of the Pierre Auger Observatory and is used to detect the radio emission of cosmic-ray air showers. These observations are compared to the data of the surface detector stations of the Observatory, which provide well-calibrated information on the cosmic-ray energies and arrival directions. The response of the radio stations in the 30 to 80 MHz regime has been thoroughly calibrated to enable the reconstruction of the incoming electric field. For the latter, the energy deposit per area is determined from the radio pulses at each observer position and is interpolated using a two-dimensional function that takes into account signal asymmetries due to interference between the geomagnetic and charge-excess emission components. The spatial integral over the signal distribution gives a direct measurement of the energy transferred from the primary cosmic ray into radio emission in the AERA frequency range. We measure 15.8 MeV of radiation energy for a 1 EeV air shower arriving perpendicularly to the geomagnetic field. This radiation energy (corrected for geometrical effects) is used as a cosmic-ray energy estimator. Performing an absolute energy calibration against the surface-detector information, we observe that this radio-energy estimator scales quadratically with the cosmic-ray energy, as expected for coherent emission. We find an energy resolution of the radio reconstruction of 22% for the data set and 17% for a high-quality subset containing only events with at least five radio stations with signal. Comment: Replaced with published version. Added journal reference and DOI
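
    The quoted calibration point and the quadratic scaling can be combined into a small worked example. This is a sketch under stated assumptions: it takes the exponent to be exactly 2, applies only to showers arriving perpendicularly to the geomagnetic field, and treats the radiation energy as already corrected for geometrical effects. The function names are illustrative.

```python
import math

E_RAD_REF_MEV = 15.8  # radiation energy at the reference point (from the abstract)
E_CR_REF_EEV = 1.0    # reference cosmic-ray energy in EeV (from the abstract)


def radiation_energy_mev(e_cr_eev: float) -> float:
    """Radiation energy in MeV, assuming exactly quadratic scaling."""
    return E_RAD_REF_MEV * (e_cr_eev / E_CR_REF_EEV) ** 2


def cosmic_ray_energy_eev(e_rad_mev: float) -> float:
    """Invert the scaling to estimate the cosmic-ray energy in EeV."""
    return E_CR_REF_EEV * math.sqrt(e_rad_mev / E_RAD_REF_MEV)


if __name__ == "__main__":
    for e_cr in (0.5, 1.0, 2.0):
        print(f"{e_cr} EeV -> {radiation_energy_mev(e_cr):.2f} MeV")
```

    Because of the square root in the inverse relation, a relative uncertainty on the radiation energy translates into roughly half that relative uncertainty on the estimated cosmic-ray energy.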