ILLUME: Rationalizing Vision-Language Models through Human Interactions
Bootstrapping from pre-trained language models has been proven to be an
efficient approach for building vision-language models (VLMs) for tasks such as
image captioning or visual question answering. However, outputs of these models
rarely align with users' rationales for specific answers. In order to improve
this alignment and reinforce commonsense reasons, we propose a tuning paradigm
based on human interactions with machine generated data. Our ILLUME executes
the following loop: Given an image-question-answer prompt, the VLM samples
multiple candidate rationales, and a human critic provides minimal feedback via
preference selection, used for fine-tuning. This loop increases the training
data and gradually carves out the VLM's rationalization capabilities that are
aligned with human intent. Our exhaustive experiments demonstrate that ILLUME
is competitive with standard supervised fine-tuning while using significantly
less training data and only requiring minimal feedback.
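
The loop described in the ILLUME abstract (sample candidate rationales, let a human critic select preferred ones, fine-tune on the growing set) can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the function names `illume_style_loop`, `sample_fn`, `critic_fn`, and `finetune_fn`, the number of rounds, and the choice to fine-tune on all accumulated data each round are assumptions made here for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical types: a prompt is (image_id, question, answer).
Prompt = Tuple[str, str, str]
Example = Tuple[Prompt, str]  # (prompt, accepted rationale)


def illume_style_loop(
    prompts: List[Prompt],
    sample_fn: Callable[[Prompt, int], List[str]],
    critic_fn: Callable[[Prompt, List[str]], List[str]],
    finetune_fn: Callable[[List[Example]], None],
    rounds: int = 3,
    num_candidates: int = 5,
) -> List[Example]:
    """Sample rationales, keep the ones the critic prefers, then fine-tune."""
    training_data: List[Example] = []
    for _ in range(rounds):
        for prompt in prompts:
            # The model proposes several candidate rationales.
            candidates = sample_fn(prompt, num_candidates)
            # The critic gives minimal feedback by selecting a subset.
            accepted = critic_fn(prompt, candidates)
            training_data.extend((prompt, r) for r in accepted)
        # Assumption: fine-tune on everything accepted so far.
        finetune_fn(training_data)
    return training_data


# Toy stand-ins for the model and the human critic (purely illustrative).
def toy_sample(prompt: Prompt, n: int) -> List[str]:
    return [f"because reason {i}" for i in range(n)]


def toy_critic(prompt: Prompt, candidates: List[str]) -> List[str]:
    return [c for c in candidates if c.endswith("0")]
```

In a real setting, `sample_fn` would call the VLM, `critic_fn` would collect a human's selections, and `finetune_fn` would run a training step; the stubs above exist only so the control flow can be exercised.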
Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models
Text-conditioned image generation models have recently achieved astonishing
results in image quality and text alignment and are consequently employed in a
fast-growing number of applications. Since they are highly data-driven, relying
on billion-sized datasets randomly scraped from the internet, they also suffer,
as we demonstrate, from degenerated and biased human behavior. In turn, they
may even reinforce such biases. To help combat these undesired side effects, we
present safe latent diffusion (SLD). Specifically, to measure the inappropriate
degeneration due to unfiltered and imbalanced training sets, we establish a
novel image generation test bed, inappropriate image prompts (I2P), containing
dedicated, real-world image-to-text prompts covering concepts such as nudity
and violence. As our exhaustive empirical evaluation demonstrates, the
introduced SLD removes and suppresses inappropriate image parts during the
diffusion process, with no additional training required and no adverse effect
on overall image quality or text alignment.
Comment: Proceedings of the 22nd IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 202
Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization
Large Language Models (LLMs) have reshaped natural language processing with
their impressive capabilities. Their ever-increasing size, however, has raised
concerns about their effective deployment and the need for LLM compression.
This study introduces the Divergent Token metrics (DTMs), a novel approach for
assessing compressed LLMs, addressing the limitations of traditional perplexity
or accuracy measures that fail to accurately reflect text generation quality.
DTMs focus on token divergences, which allow deeper insights into the subtleties
of model compression, in particular when evaluating components' impacts individually.
Utilizing the First Divergent Token metric (FDTM) in model sparsification
reveals that a quarter of all attention components can be pruned beyond 90% on
the Llama-2 model family while still keeping SOTA performance. For quantization, FDTM
suggests that over 80% of parameters can naively be transformed to int8 without
special outlier management. These evaluations indicate the necessity of
choosing appropriate compressions for parameters individually (and that FDTM can
identify those), while standard metrics result in deteriorated outcomes.
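
The abstract does not spell out how the First Divergent Token metric is computed. One plausible reading, assumed here purely for illustration, is that the original and the compressed model each generate a token sequence from the same prompt, and the metric is the index of the first position at which the two sequences differ. The function name `first_divergent_token` and the convention for identical prefixes are hypothetical choices, not taken from the paper.

```python
from typing import Sequence


def first_divergent_token(reference: Sequence[int], candidate: Sequence[int]) -> int:
    """Return the index of the first position where the two token sequences differ.

    Assumption: if one sequence is a prefix of the other (or they are equal),
    the length of the shorter sequence is returned.
    """
    for i, (ref_tok, cand_tok) in enumerate(zip(reference, candidate)):
        if ref_tok != cand_tok:
            return i
    return min(len(reference), len(candidate))


# Example: token ids from an uncompressed and a compressed model (made-up values).
original_tokens = [12, 7, 99, 4, 31]
compressed_tokens = [12, 7, 42, 4, 31]
divergence_index = first_divergent_token(original_tokens, compressed_tokens)
```

Under this reading, a larger index means the compressed model follows the original for longer before drifting, which is the kind of generation-level signal the abstract contrasts with perplexity or accuracy.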
Speaking Multiple Languages Affects the Moral Bias of Language Models
Pre-trained multilingual language models (PMLMs) are commonly used when
dealing with data from multiple languages and cross-lingual transfer. However,
PMLMs are trained on varying amounts of data for each language. In practice,
this means their performance is often much better on English than on many other
languages. We explore to what extent this also applies to moral norms. Do the
models capture moral norms from English and impose them on other languages? Do
the models exhibit random and thus potentially harmful beliefs in certain
languages? Both these issues could negatively impact cross-lingual transfer and
potentially lead to harmful outcomes. In this paper, we (1) apply the
MoralDirection framework to multilingual models, comparing results in German,
Czech, Arabic, Chinese, and English, (2) analyse model behaviour on filtered
parallel subtitles corpora, and (3) apply the models to a Moral Foundations
Questionnaire, comparing with human responses from different countries. Our
experiments demonstrate that, indeed, PMLMs encode differing moral biases, but
these do not necessarily correspond to cultural differences or commonalities in
human opinions. We release our code and models.
Comment: To appear in ACL Findings 202
Computing on Authenticated Data for Adjustable Predicates
The notion of P-homomorphic signatures, introduced by Ahn et al. (TCC 2012), generalizes various approaches for public computations on authenticated data. For a given predicate P anyone can derive a signature for a message m' from the signatures of a set of messages M, as long as P(M,m')=1. This definition hence comprises notions and constructions for concrete predicates P such as homomorphic signatures and redactable signatures.
In our work we address the question of how to combine Pi-homomorphic schemes for different predicates P1,P2,... to create a richer and more flexible class of supported predicates. One approach is to statically combine schemes for predicates into new schemes for logical formulas over the predicates, such as a scheme for AND (P1 AND P2). The other approach for more flexibility is to derive schemes which allow the signer to dynamically decide which predicate to use when signing a message, instead of supporting only a single, fixed predicate.
We present two main results. The first shows that one can indeed devise solutions for the static combination for AND, as well as dynamically adjustable solutions for choosing the predicate on the fly. Moreover, our constructions are practical and add only a negligible overhead. The second is an impossibility result for static combinations. Namely, we prove that, in contrast to the case of AND, many other formulas like the logical OR (P1 OR P2) and the NOT (NOT P) do not admit generic combinations through so-called canonical constructions. This implies that one cannot rely on general constructions in these cases, but must use other methods instead, like finding new predicate-specific solutions from scratch.
MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation
The recent popularity of text-to-image diffusion models (DM) can largely be
attributed to the intuitive interface they provide to users. The intended
generation can be expressed in natural language, with the model producing
faithful interpretations of text prompts. However, expressing complex or
nuanced ideas in text alone can be difficult. To ease image generation, we
propose MultiFusion that allows one to express complex and nuanced concepts
with arbitrarily interleaved inputs of multiple modalities and languages.
MultiFusion leverages pre-trained models and aligns them for integration into a
cohesive system, thereby avoiding the need for extensive training from scratch.
Our experimental results demonstrate the efficient transfer of capabilities
from individual modules to the downstream model. Specifically, the fusion of
all independent components allows the image generation module to utilize
multilingual, interleaved multimodal inputs despite being trained solely on
monomodal data in a single language.