34 research outputs found
Self-Supervised Learning of Machine Ethics
In recent years, Artificial Intelligence (AI), and deep learning in particular, has proven to be a technology driver in industry. However, while advancing existing technologies, creating novel ones, automating processes, and assisting humans in essential areas such as drug discovery, it also raises many concerns, as groundbreaking technologies have before. In this case, these concerns include, for instance, models producing stereotypical and derogatory content as well as gender and racial biases. Since AI technologies will permeate more of our lives in the coming years, these concerns need to be addressed. This thesis examines recent data-driven approaches, which often suffer from degenerated and biased behavior due to their self-supervised training on large-scale, noisy web data containing potentially inappropriate content. While this is well established, we investigate and demonstrate the promise of the knowledge and capabilities that deep models acquire precisely through exposure to such data. Importantly, we present the first approaches for learning ethics from data. Our findings suggest that if we build an AI system that learns an improved representation of data and is better able to understand and produce it, it will, in the process, also acquire more accurate societal knowledge, in this case, historical cultural associations, enabling human-like "right" and "wrong" choices. Based on these findings, we then ask the arguably "circular" question of whether a machine can help us mitigate the very concerns it raises. Importantly, we demonstrate the importance of models' ability to distinguish between "right" and "wrong" and show how utilizing this ability can mitigate the risks surrounding large-scale models themselves. However, we also highlight the role of human-machine interaction in exploring and reinforcing AI systems' properties, including their flaws and merits, and present how human feedback on explanations can align deep-learning-based models with our precepts. We present these algorithms and corresponding findings, providing important insights toward the goal of putting human values into AI systems, a goal which, in summary, may not be insurmountable in the long run.
A Typology to Explore the Mitigation of Shortcut Behavior
As machine learning models become increasingly large and are trained weakly supervised on large, possibly uncurated data sets, it becomes increasingly important to establish mechanisms for inspecting, interacting with, and revising models in order to mitigate shortcut learning and to guarantee that their learned knowledge is aligned with human knowledge. The recently proposed eXplanatory Interactive Learning (XIL) framework was developed for this purpose, and several such methods have been introduced, each with individual motivations and methodological details. In this work, we provide a unification of various XIL methods into a single typology by establishing a common set of basic modules. In doing so, we pave the way for a principled comparison of existing and, importantly, future XIL approaches. In addition, we discuss existing measures and benchmarks, and introduce novel ones, for evaluating the overall abilities of an XIL method. Given this extensive toolbox, including our typology, measures, and benchmarks, we finally compare several recent XIL methods methodologically and quantitatively. In our evaluations, all methods successfully revise a model. However, we found remarkable differences on individual benchmark tasks, revealing valuable, application-relevant aspects worth considering when integrating these benchmarks into the development of future methods.
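To make the typology concrete, a minimal sketch of such a modular XIL loop is given below; the four modules and their signatures (select, explain, obtain_feedback, revise) are hypothetical placeholders for illustration, not the paper's actual API.

```python
# Illustrative skeleton of an XIL loop assembled from interchangeable
# modules; all names and signatures are hypothetical placeholders.
from typing import Callable, List, Tuple

def xil_loop(
    model,                            # any trainable classifier
    pool: List[Tuple[object, int]],   # pool of (sample, label) pairs
    select: Callable,                 # picks samples worth inspecting
    explain: Callable,                # produces an explanation, e.g. a saliency map
    obtain_feedback: Callable,        # queries the user for corrections
    revise: Callable,                 # updates the model using the feedback
    iterations: int = 10,
):
    """Generic XIL skeleton: concrete methods fill in the four modules."""
    for _ in range(iterations):
        batch = select(model, pool)                      # e.g. most uncertain samples
        explanations = [explain(model, x) for x, _ in batch]
        feedback = obtain_feedback(batch, explanations)  # e.g. "ignore this region" masks
        model = revise(model, batch, feedback)           # e.g. right-for-the-right-reasons loss
    return model
```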
Revision Transformers: Instructing Language Models to Change their Values
Current transformer language models (LMs) are large-scale models with billions of parameters. They have been shown to provide high performance on a variety of tasks but are also prone to shortcut learning and bias. Addressing such incorrect model behavior via parameter adjustments is very costly. This is particularly problematic when updating dynamic concepts, such as moral values, which vary culturally and interpersonally. In this work, we question the current common practice of storing all information in the model parameters and propose the Revision Transformer (RiT) to facilitate easy model updating. The specific combination of a large-scale pre-trained LM, which inherently but diffusely encodes world knowledge, with a clearly structured revision engine makes it possible to update the model's knowledge with little effort and with the help of user interaction. We exemplify RiT on a moral dataset and simulate user feedback, demonstrating strong performance in model revision even with small amounts of data. This way, users can easily design a model according to their preferences, paving the way for more transparent AI models.
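As an illustration of the retrieval-based revision idea, the following sketch stores user corrections outside the LM and prepends the most similar one to the prompt at inference time. The encoder checkpoint, similarity threshold, and prompt format are assumptions for the sketch, not RiT's actual implementation.

```python
# Minimal sketch of a retrieval-based revision engine: corrections live in
# an external store instead of the LM's weights.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative checkpoint
revisions = []  # list of (query_embedding, corrected_answer) pairs

def add_revision(query: str, corrected_answer: str) -> None:
    """Store a user-provided correction instead of changing LM weights."""
    revisions.append((encoder.encode(query), corrected_answer))

def revised_prompt(query: str, threshold: float = 0.7) -> str:
    """Prepend the closest stored revision, if any, to steer the LM."""
    q = encoder.encode(query)
    if revisions:
        sims = [np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e))
                for e, _ in revisions]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            return f"Context: {revisions[best][1]}\nQuestion: {query}\nAnswer:"
    return f"Question: {query}\nAnswer:"

add_revision("Is it okay to lie to a friend?",
             "Lying to a friend is generally wrong.")
print(revised_prompt("Should I lie to my friends?"))
```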
Mitigating Inappropriateness in Image Generation: Can there be Value in Reflecting the World's Ugliness?
Text-conditioned image generation models have recently achieved astonishing
results in image quality and text alignment and are consequently employed in a
fast-growing number of applications. Since they are highly data-driven, relying
on billion-sized datasets randomly scraped from the web, they also reproduce
inappropriate human behavior. Specifically, we demonstrate inappropriate degeneration at a large scale for various generative text-to-image models, thus motivating the need to monitor and moderate them at deployment. To this end, we evaluate mitigation strategies at inference time to suppress the generation of inappropriate content. Our findings show that we can use models' representations of the world's ugliness to align them with human preferences.
Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models
Text-conditioned image generation models have recently achieved astonishing
results in image quality and text alignment and are consequently employed in a
fast-growing number of applications. Since they are highly data-driven, relying
on billion-sized datasets randomly scraped from the internet, they also suffer,
as we demonstrate, from degenerated and biased human behavior. In turn, they
may even reinforce such biases. To help combat these undesired side effects, we
present safe latent diffusion (SLD). Specifically, to measure the inappropriate
degeneration due to unfiltered and imbalanced training sets, we establish a
novel image generation test bed, inappropriate image prompts (I2P), containing dedicated, real-world text-to-image prompts covering concepts such as nudity and violence. As our exhaustive empirical evaluation demonstrates, the introduced SLD removes and suppresses inappropriate image parts during the diffusion process, with no additional training required and no adverse effect on overall image quality or text alignment.
Comment: Proceedings of the 22nd IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
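For illustration, a simplified sketch of the guidance idea behind SLD follows: besides the usual classifier-free guidance toward the prompt, a third noise estimate conditioned on an unsafe-concept prompt steers each denoising step away from that concept. The full method additionally uses element-wise scaling and warm-up scheduling, which this sketch omits; all parameter values and the concept prompt are illustrative.

```python
# Simplified sketch of safety-guided denoising: three noise estimates from
# the same U-Net (unconditional, prompt-conditioned, unsafe-concept-
# conditioned) are combined so the sample moves toward the prompt and away
# from the unsafe concept.
import torch

def guided_noise(
    eps_uncond: torch.Tensor,   # eps(z_t, empty prompt)
    eps_prompt: torch.Tensor,   # eps(z_t, user prompt)
    eps_unsafe: torch.Tensor,   # eps(z_t, concept prompt, e.g. "violence, nudity")
    guidance_scale: float = 7.5,
    safety_scale: float = 5.0,
) -> torch.Tensor:
    text_dir = eps_prompt - eps_uncond     # pulls the sample toward the prompt
    safety_dir = eps_unsafe - eps_uncond   # points toward inappropriate content
    return eps_uncond + guidance_scale * text_dir - safety_scale * safety_dir

# Dummy tensors standing in for a U-Net's three conditional noise predictions.
shape = (1, 4, 64, 64)
eps = guided_noise(torch.randn(shape), torch.randn(shape), torch.randn(shape))
```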
ILLUME: Rationalizing Vision-Language Models through Human Interactions
Bootstrapping from pre-trained language models has proven to be an efficient approach for building vision-language models (VLMs) for tasks such as image captioning or visual question answering. However, the outputs of these models rarely align with users' rationales for specific answers. To improve this alignment and reinforce commonsense reasoning, we propose a tuning paradigm based on human interactions with machine-generated data. Our method, ILLUME, executes the following loop: given an image-question-answer prompt, the VLM samples multiple candidate rationales, and a human critic provides minimal feedback via preference selection, which is then used for fine-tuning. This loop increases the training data and gradually carves out the VLM's rationalization capabilities so that they are aligned with human intent. Our exhaustive experiments demonstrate that ILLUME is competitive with standard supervised fine-tuning while using significantly less training data and requiring only minimal feedback.
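A minimal sketch of such a sample-select-tune loop is given below; the vlm and critic interfaces are hypothetical placeholders for illustration, not ILLUME's actual implementation.

```python
# Illustrative sketch of the ILLUME-style loop; `vlm` and `critic` are
# hypothetical placeholders (a VLM with .generate()/.finetune(), and a
# human critic returning the preferred rationales).
from typing import Callable, List, Tuple

def illume_loop(
    vlm,
    prompts: List[str],                             # image-question-answer prompts
    critic: Callable[[str, List[str]], List[str]],  # keeps acceptable rationales
    rounds: int = 3,
    samples_per_prompt: int = 8,
):
    approved: List[Tuple[str, str]] = []            # (prompt, rationale) pairs
    for _ in range(rounds):
        for prompt in prompts:
            # Sample diverse candidate rationales from the current model.
            candidates = [vlm.generate(prompt, temperature=0.9)
                          for _ in range(samples_per_prompt)]
            # Minimal human feedback: preference selection among candidates.
            for rationale in critic(prompt, candidates):
                approved.append((prompt, rationale))
        # Fine-tune on the growing set of self-generated, human-approved data.
        vlm.finetune(approved)
    return vlm
```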
Language Models have a Moral Dimension
Artificial writing is permeating our lives due to recent advances in large-scale, transformer-based language models (LMs) such as BERT, its variants, GPT-2/3, and others. Using them as pre-trained models and fine-tuning them for specific tasks, researchers have extended the state of the art for many NLP tasks and shown that they not only capture linguistic knowledge but also retain general knowledge implicitly present in the data. These and other successes are exciting. Unfortunately, LMs trained on unfiltered text corpora suffer from degenerate and biased behaviour. While this is well established, we show that recent improvements to LMs also store ethical and moral values of society and actually bring a ``moral dimension'' to the surface: the values are captured geometrically by a direction in the embedding space, reflecting well the agreement of phrases with social norms implicitly expressed in the training texts. This provides a path toward attenuating or even preventing toxic degeneration in LMs. Since one can now rate the (non-)normativity of arbitrary phrases without explicitly training the LM for this task, the moral dimension can be used as a ``moral compass'' guiding (even other) LMs toward producing normative text, as we show.
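To illustrate the geometric idea, the following simplified sketch scores phrases along such a moral direction using a sentence encoder and a one-dimensional PCA; the seed phrases, encoder checkpoint, and sign convention are assumptions for the sketch rather than the paper's exact setup.

```python
# Simplified sketch of a "moral direction": fit a 1-D PCA on embeddings of
# seed phrases with known polarity, then score new phrases by projection.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative checkpoint
seeds = ["help people", "be kind", "tell the truth",      # normative seeds
         "harm people", "steal money", "tell a lie"]      # non-normative seeds
embeddings = encoder.encode(seeds)

pca = PCA(n_components=1)
scores = pca.fit_transform(embeddings).ravel()
# Orient the axis so that the normative seeds score positive.
sign = 1.0 if scores[:3].mean() > scores[3:].mean() else -1.0

def moral_score(phrase: str) -> float:
    """Project a phrase onto the moral direction; > 0 ~ normative."""
    return sign * float(pca.transform(encoder.encode([phrase]))[0, 0])

print(moral_score("comfort a friend"), moral_score("hurt a friend"))
```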