Mind your Language (Model): Fact-Checking LLMs and their Role in NLP Research and Practice
Much of the recent discourse within the NLP research community has been
centered around Large Language Models (LLMs), their functionality and potential
-- yet not only do we not have a working definition of LLMs, but much of this
discourse relies on claims and assumptions that are worth re-examining. This
position paper contributes a definition of LLMs, explicates some of the
assumptions made regarding their functionality, and outlines the existing
evidence for and against them. We conclude with suggestions for research
directions and their framing in future work.
Counting Carbon: A Survey of Factors Influencing the Emissions of Machine Learning
Machine learning (ML) requires using energy to carry out computations during
the model training process. The generation of this energy comes with an
environmental cost in terms of greenhouse gas emissions, depending on quantity
used and the energy source. Existing research on the environmental impacts of
ML has been limited to analyses covering a small number of models and does not
adequately represent the diversity of ML models and tasks. In the current
study, we present a survey of the carbon emissions of 95 ML models across time
and different tasks in natural language processing and computer vision. We
analyze them in terms of the energy sources used, the amount of CO2 emissions
produced, how these emissions evolve across time and how they relate to model
performance. We conclude with a discussion regarding the carbon footprint of
our field and propose the creation of a centralized repository for reporting
and tracking these emissions.
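The core relationship the survey relies on, that emissions scale with both the quantity of energy used and the carbon intensity of its source, can be sketched as follows. The function name and the sample intensity values are illustrative, not figures from the paper:

```python
def co2_kg(energy_kwh: float, carbon_intensity_g_per_kwh: float) -> float:
    """Estimate CO2-equivalent emissions (kg) for a training run.

    energy_kwh: total energy consumed during training.
    carbon_intensity_g_per_kwh: grams of CO2eq emitted per kWh by the
    grid or energy source powering the hardware (varies by region).
    """
    return energy_kwh * carbon_intensity_g_per_kwh / 1000.0

# Illustrative values only: the same 500 kWh run emits very different
# amounts of carbon depending on the energy source powering it.
coal_heavy_grid = co2_kg(500, 820)   # 410.0 kg CO2eq
hydro_heavy_grid = co2_kg(500, 24)   # 12.0 kg CO2eq
```

This is why two models with identical energy use can have very different carbon footprints, one of the factors the survey analyzes.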
Power Hungry Processing: Watts Driving the Cost of AI Deployment?
Recent years have seen a surge in the popularity of commercial AI products
based on generative, multi-purpose AI systems promising a unified approach to
building machine learning (ML) models into technology. However, this ambition
of "generality" comes at a steep cost to the environment, given the amount of
energy these systems require and the amount of carbon that they emit. In this
work, we propose the first systematic comparison of the ongoing inference cost
of various categories of ML systems, covering both task-specific (i.e.
finetuned models that carry out a single task) and `general-purpose' models,
(i.e. those trained for multiple tasks). We measure deployment cost as the
amount of energy and carbon required to perform 1,000 inferences on
a representative benchmark dataset using these models. We find that
multi-purpose, generative architectures are orders of magnitude more expensive
than task-specific systems for a variety of tasks, even when controlling for
the number of model parameters. We conclude with a discussion around the
current trend of deploying multi-purpose generative ML systems, and caution
that their utility should be more intentionally weighed against increased costs
in terms of energy and emissions. All the data from our study can be accessed
via an interactive demo to carry out further exploration and analysis.
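The deployment-cost metric described above, energy and carbon per 1,000 inferences, amounts to a simple unit conversion once per-inference energy has been measured. A minimal sketch, where the function name and sample numbers are hypothetical rather than the paper's measurements:

```python
def cost_per_1000_inferences(joules_per_inference: float,
                             carbon_intensity_g_per_kwh: float):
    """Convert measured per-inference energy into energy (kWh) and
    carbon (g CO2eq) per 1,000 inferences."""
    kwh = joules_per_inference * 1000 / 3_600_000  # 1 kWh = 3.6e6 J
    return kwh, kwh * carbon_intensity_g_per_kwh

# Hypothetical numbers contrasting a task-specific classifier with a
# far more energy-hungry multi-purpose generative model.
small_kwh, small_g = cost_per_1000_inferences(36.0, 400)
large_kwh, large_g = cost_per_1000_inferences(3600.0, 400)
```

Comparing such figures across task-specific and general-purpose systems, with carbon intensity held fixed, is the kind of controlled comparison the study performs at scale.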
Stable Bias: Analyzing Societal Representations in Diffusion Models
As machine learning-enabled Text-to-Image (TTI) systems are becoming
increasingly prevalent and seeing growing adoption as commercial services,
characterizing the social biases they exhibit is a necessary first step to
lowering their risk of discriminatory outcomes. This evaluation, however, is
made more difficult by the synthetic nature of these systems' outputs: common
definitions of diversity are grounded in social categories of people living in
the world, whereas the artificial depictions of fictive humans created by these
systems have no inherent gender or ethnicity. To address this need, we propose
a new method for exploring the social biases in TTI systems. Our approach
relies on characterizing the variation in generated images triggered by
enumerating gender and ethnicity markers in the prompts, and comparing it to
the variation engendered by spanning different professions. This allows us to
(1) identify specific bias trends, (2) provide targeted scores to directly
compare models in terms of diversity and representation, and (3) jointly model
interdependent social variables to support a multidimensional analysis. We
leverage this method to analyze images generated by 3 popular TTI systems
(DALL-E 2, Stable Diffusion v1.4 and v2) and find that while all of their
outputs show correlations with US labor demographics, they also consistently
under-represent marginalized identities to different extents. We also release
the datasets and low-code interactive bias exploration platforms developed for
this work, as well as the necessary tools to similarly evaluate additional TTI
systems.
Comment: Accepted to NeurIPS Datasets and Benchmarks 2023 (spotlight).
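The prompt-enumeration idea at the heart of the method, varying identity markers while holding the rest of the prompt fixed and then comparing with the variation across professions, can be sketched as follows. The marker lists and prompt template here are illustrative stand-ins, not the paper's curated sets:

```python
import itertools

# Illustrative marker sets; the paper defines its own lists.
gender_markers = ["woman", "man", "non-binary person"]
ethnicity_markers = ["Black", "East Asian", "Hispanic", "White"]
professions = ["doctor", "janitor", "CEO"]

def identity_prompts():
    """Prompts that vary identity markers with the scene held fixed."""
    return [f"Portrait photo of a {e} {g}"
            for e, g in itertools.product(ethnicity_markers, gender_markers)]

def profession_prompts():
    """Prompts that vary the profession with no identity markers."""
    return [f"Portrait photo of a {p}" for p in professions]

# Generate images for each prompt set with a TTI system, then compare
# the variation in outputs across the two axes.
```

Comparing the image variation induced by the first axis against the second is what lets the method score diversity and representation without assigning a gender or ethnicity to any individual synthetic image.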
Energy and Carbon Considerations of Fine-Tuning BERT
Despite the popularity of the `pre-train then fine-tune' paradigm in the NLP
community, existing work quantifying energy costs and associated carbon
emissions has largely focused on language model pre-training. Although a single
pre-training run draws substantially more energy than fine-tuning, fine-tuning
is performed more frequently by many more individual actors, and thus must be
accounted for when considering the energy and carbon footprint of NLP. In order
to better characterize the role of fine-tuning in the landscape of energy and
carbon emissions in NLP, we perform a careful empirical study of the
computational costs of fine-tuning across tasks, datasets, hardware
infrastructure and measurement modalities. Our experimental results allow us to
place fine-tuning energy and carbon costs into perspective with respect to
pre-training and inference, and outline recommendations to NLP researchers and
practitioners who wish to improve their fine-tuning energy efficiency.
Comment: EMNLP 2023 Findings; first two authors contributed equally; 12 pages.
Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness
Generative AI models have recently achieved astonishing results in quality
and are consequently employed in a fast-growing number of applications.
However, since they are highly data-driven, relying on billion-sized datasets
randomly scraped from the internet, they also suffer from degenerated and
biased human behavior, as we demonstrate. In fact, they may even reinforce such
biases. To not only uncover but also combat these undesired effects, we present
a novel strategy, called Fair Diffusion, to attenuate biases after the
deployment of generative text-to-image models. Specifically, we demonstrate
shifting a bias in any direction, based on human instructions, to yield
arbitrary new proportions for, e.g., identity groups. As our empirical
evaluation demonstrates, this introduced control enables instructing generative
image models on fairness, with no data filtering or additional training
required.
Measuring Data
We identify the task of measuring data to quantitatively characterize the
composition of machine learning data and datasets. Similar to an object's
height, width, and volume, data measurements quantify different attributes of
data along common dimensions that support comparison. Several lines of research
have proposed what we refer to as measurements, with differing terminology; we
bring some of this work together, particularly in fields of computer vision and
language, and build from it to motivate measuring data as a critical component
of responsible AI development. Measuring data aids in systematically building
and analyzing machine learning (ML) data towards specific goals and gaining
better control of what modern ML systems will learn. We conclude with a
discussion of the many avenues of future work, the limitations of data
measurements, and how to leverage these measurement approaches in research and
practice.
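As a concrete illustration of what measuring data might look like for a text dataset, here is a minimal sketch. The chosen measurements and names are examples of the idea, not an API from the paper:

```python
from collections import Counter

def measure_text_data(texts):
    """Compute a few simple measurements of a text dataset along
    common dimensions that support comparison across datasets."""
    tokens = [tok for t in texts for tok in t.split()]
    counts = Counter(tokens)
    return {
        "num_examples": len(texts),
        "num_tokens": len(tokens),
        "vocab_size": len(counts),
        "mean_tokens_per_example": len(tokens) / max(len(texts), 1),
        "top_tokens": counts.most_common(3),
    }

stats = measure_text_data(["the cat sat", "the dog ran fast"])
```

Like an object's height and width, such measurements say nothing about quality on their own, but they make two datasets directly comparable along shared dimensions.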
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Large language models (LLMs) have been shown to be able to perform new tasks
based on a few demonstrations or natural language instructions. While these
capabilities have led to widespread adoption, most LLMs are developed by
resource-rich organizations and are frequently kept from the public. As a step
towards democratizing this powerful technology, we present BLOOM, a
176B-parameter open-access language model designed and built thanks to a
collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer
language model that was trained on the ROOTS corpus, a dataset comprising
hundreds of sources in 46 natural and 13 programming languages (59 in total).
We find that BLOOM achieves competitive performance on a wide variety of
benchmarks, with stronger results after undergoing multitask prompted
finetuning. To facilitate future research and applications using LLMs, we
publicly release our models and code under the Responsible AI License.
A Practical Guide to Quantifying Carbon Emissions for Machine Learning researchers and practitioners
The goal of this short guide is to help the Machine Learning (ML) community better understand their carbon impact and to take steps to mitigate it.
Carbon Tracking
At the center of the climate crisis is a commonplace but very important concept: that of carbon dioxide (CO2), low amounts of which occur naturally in the Earth's atmosphere, but whose concentration has been rapidly increasing due to human activity. This increase is dangerous because of CO2's effect as a greenhouse gas, which contributes to global warming. It is therefore important to: 1) quantify the carbon impact of our actions; and 2) reduce, or mitigate, that impact in order to help slow down global warming and climate change more broadly.
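One of the simplest mitigation levers following from the guide's framing of quantify-then-reduce is choosing where a job runs, since grid carbon intensity varies widely by region. A hypothetical sketch, with intensity values that are illustrative rather than taken from the guide:

```python
def emissions_kg(energy_kwh, intensity_g_per_kwh):
    """CO2-equivalent emissions (kg) for a job drawing energy_kwh
    on a grid with the given carbon intensity (g CO2eq per kWh)."""
    return energy_kwh * intensity_g_per_kwh / 1000.0

def savings_from_moving(energy_kwh, intensity_here, intensity_there):
    """kg CO2eq avoided by running the same job in a cleaner region."""
    return (emissions_kg(energy_kwh, intensity_here)
            - emissions_kg(energy_kwh, intensity_there))

# Illustrative: a 1,000 kWh training job moved from a 700 g/kWh grid
# to a 50 g/kWh grid avoids 650 kg CO2eq with no change to the code.
saved = savings_from_moving(1000, 700, 50)
```

The same two-step logic, quantify first, then act on the largest factor, applies equally to hardware choice and scheduling.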