Fast or Accurate? Governing Conflicting Goals in Highly Autonomous Vehicles
The tremendous excitement around the deployment of autonomous vehicles (AVs)
comes from their purported promise. In addition to decreasing accidents, AVs
are projected to usher in a new era of equity in human autonomy by providing
affordable, accessible, and widespread mobility for disabled, elderly, and
low-income populations. However, to realize this promise, it is necessary to
ensure that AVs are safe for deployment, and to contend with the risks AV
technology poses, which threaten to eclipse its benefits. In this Article, we
focus on an aspect of AV engineering currently unexamined in the legal
literature, but with critical implications for safety, accountability,
liability, and power. Specifically, we explain how understanding the
fundamental engineering trade-off between accuracy and speed in AVs is critical
for policymakers to regulate the uncertainty and risk inherent in AV systems.
We discuss how understanding the trade-off will help create tools that will
enable policymakers to assess how the trade-off is being implemented. Such
tools will facilitate opportunities for developing concrete, ex ante AV safety
standards and conclusive mechanisms for ex post determination of accountability
after accidents occur. This will shift the balance of power from manufacturers
to the public by facilitating effective regulation, reducing barriers to tort
recovery, and ensuring that public values like safety and accountability are
appropriately balanced.

Comment: Vol. 20, pp. 249-27
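To make the accuracy-speed trade-off concrete, here is a minimal illustrative sketch (not from the Article; the model names and numbers are hypothetical) of how an AV perception stack might select a model under a latency budget:

```python
from dataclasses import dataclass

@dataclass
class PerceptionModel:
    name: str
    accuracy: float    # e.g., detection accuracy on a validation set
    latency_ms: float  # per-frame inference time

# Hypothetical candidates: the faster a model, the less accurate it is.
CANDIDATES = [
    PerceptionModel("fast-detector", accuracy=0.78, latency_ms=12.0),
    PerceptionModel("balanced-detector", accuracy=0.85, latency_ms=35.0),
    PerceptionModel("accurate-detector", accuracy=0.91, latency_ms=90.0),
]

def select_model(latency_budget_ms: float) -> PerceptionModel:
    """Pick the most accurate model that still meets the latency budget.

    The budget encodes a design choice: a tighter budget (faster required
    reactions, e.g., highway speeds) forces a less accurate model -- the
    trade-off the Article argues regulators should be able to inspect.
    """
    feasible = [m for m in CANDIDATES if m.latency_ms <= latency_budget_ms]
    if not feasible:
        raise ValueError("no model meets the latency budget")
    return max(feasible, key=lambda m: m.accuracy)

print(select_model(40.0).name)  # -> "balanced-detector"
```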
Talkin' 'Bout AI Generation: Copyright and the Generative-AI Supply Chain
"Does generative AI infringe copyright?" is an urgent question. It is also a
difficult question, for two reasons. First, "generative AI" is not just one
product from one company. It is a catch-all name for a massive ecosystem of
loosely related technologies, including conversational text chatbots like
ChatGPT, image generators like Midjourney and DALL-E, coding assistants like
GitHub Copilot, and systems that compose music and create videos. These systems
behave differently and raise different legal issues. The second problem is that
copyright law is notoriously complicated, and generative-AI systems manage to
touch on a great many corners of it: authorship, similarity, direct and
indirect liability, fair use, and licensing, among much else. These issues
cannot be analyzed in isolation, because there are connections everywhere.
In this Article, we aim to bring order to the chaos. To do so, we introduce
the generative-AI supply chain: an interconnected set of stages that transform
training data (millions of pictures of cats) into generations (a new, potentially never-before-seen picture of a cat).
Breaking down generative AI into these constituent stages reveals all of the
places at which companies and users make choices that have copyright
consequences. It enables us to trace the effects of upstream technical designs
on downstream uses, and to assess who in these complicated sociotechnical
systems bears responsibility for infringement when it happens. Because we
engage so closely with the technology of generative AI, we are able to shed
more light on the copyright questions. We do not give definitive answers as to
who should and should not be held liable. Instead, we identify the key
decisions that courts will need to make as they grapple with these issues, and
point out the consequences that would likely flow from different liability
regimes.

Comment: Forthcoming, Journal of the Copyright Society of the USA '2
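As a rough illustration of the supply-chain framing, the sketch below enumerates stages as data; the stage names are a paraphrase for illustration, not the Article's exact taxonomy:

```python
from enum import Enum, auto

class Stage(Enum):
    """Illustrative stages of a generative-AI supply chain.

    A paraphrase for illustration; the Article's own taxonomy may name
    and divide the stages differently.
    """
    CREATE_WORKS = auto()      # authors produce expressive works
    ASSEMBLE_DATASET = auto()  # works are scraped and curated into training data
    TRAIN_BASE_MODEL = auto()  # a foundation model is pretrained on the data
    FINE_TUNE = auto()         # the model is adapted for a task or style
    DEPLOY_SYSTEM = auto()     # the model is wrapped in a product or API
    GENERATE = auto()          # a user prompt yields an output

# Tracing responsibility means asking, at each stage, who made the
# copyright-relevant choice (e.g., dataset curation vs. user prompting).
for stage in Stage:
    print(stage.name)
```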
CD-GraB: Coordinating Distributed Example Orders for Provably Accelerated Training
Recent research on online Gradient Balancing (GraB) has revealed that there
exist permutation-based example orderings that are guaranteed to outperform
random reshuffling (RR). Whereas RR arbitrarily permutes training examples,
GraB leverages stale gradients from prior epochs to order examples -- achieving
a provably faster convergence rate than RR. However, GraB is limited by design: while it demonstrates an impressive ability to scale up training on centralized data, it does not naturally extend to modern distributed ML workloads. We
therefore propose Coordinated Distributed GraB (CD-GraB), which uses insights
from prior work on kernel thinning to translate the benefits of provably faster
permutation-based example ordering to distributed settings. With negligible
overhead, CD-GraB exhibits a linear speedup in convergence rate over
centralized GraB and outperforms baselines empirically, including distributed
RR, on a variety of benchmark tasks.
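For intuition, here is a minimal sketch (mine, not the authors' implementation) of the gradient-balancing idea behind GraB: greedily sign-balance stale per-example gradients to build the next epoch's example order.

```python
import numpy as np

def grab_order(stale_grads: np.ndarray) -> list:
    """Illustrative GraB-style ordering from stale per-example gradients.

    stale_grads: (n_examples, dim) gradients saved from the prior epoch.
    Greedily assigns each example a sign that keeps the running sum of
    centered gradients small; +1 examples go to the front of the order,
    -1 examples to the back (reversed).
    """
    centered = stale_grads - stale_grads.mean(axis=0)
    running = np.zeros(stale_grads.shape[1])
    front, back = [], []
    for i, g in enumerate(centered):
        # Choose the sign that keeps the running sum smaller.
        if np.linalg.norm(running + g) <= np.linalg.norm(running - g):
            running += g
            front.append(i)
        else:
            running -= g
            back.append(i)
    return front + back[::-1]

rng = np.random.default_rng(0)
print(grab_order(rng.normal(size=(8, 4))))  # a permutation of 0..7
```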
CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images
We assemble a dataset of Creative-Commons-licensed (CC) images, which we use
to train a set of open diffusion models that are qualitatively competitive with
Stable Diffusion 2 (SD2). This task presents two challenges: (1)
high-resolution CC images lack the captions necessary to train text-to-image
generative models; (2) CC images are relatively scarce. To address these challenges, we use an intuitive transfer-learning technique to produce a
set of high-quality synthetic captions paired with curated CC images. We then
develop a data- and compute-efficient training recipe that requires as little
as 3% of the LAION-2B data needed to train existing SD2 models, but obtains
comparable quality. These results indicate that we have a sufficient number of
CC images (~70 million) for training high-quality models. Our training recipe
also implements a variety of optimizations that achieve ~3X training speed-ups,
enabling rapid model iteration. We leverage this recipe to train several
high-quality text-to-image models, which we dub the CommonCanvas family. Our
largest model achieves comparable performance to SD2 on a human evaluation,
despite being trained on our CC dataset that is significantly smaller than
LAION and using synthetic captions for training. We release our models, data,
and code at
https://github.com/mosaicml/diffusion/blob/main/assets/common-canvas.m
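The caption-synthesis step could look roughly like the sketch below, which assumes a BLIP-style pretrained captioner from Hugging Face transformers; the paper's exact model and pipeline may differ.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumption: a BLIP captioner stands in for the paper's transfer-learning
# step; the authors' actual captioning model may differ.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

def synthesize_caption(path: str) -> str:
    """Generate a synthetic caption for one CC-licensed image."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# caption = synthesize_caption("cc_image_0001.jpg")  # hypothetical file
```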
Scalable Extraction of Training Data from (Production) Language Models
This paper studies extractable memorization: training data that an adversary
can efficiently extract by querying a machine learning model without prior
knowledge of the training dataset. We show an adversary can extract gigabytes
of training data from open-source language models like Pythia or GPT-Neo,
semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing
techniques from the literature suffice to attack unaligned models; in order to
attack the aligned ChatGPT, we develop a new divergence attack that causes the
model to diverge from its chatbot-style generations and emit training data at a
rate 150x higher than when behaving properly. Our methods show practical
attacks can recover far more data than previously thought, and reveal that
current alignment techniques do not eliminate memorization.
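The divergence attack works by prompting the aligned model to repeat a single word indefinitely until it departs from chatbot-style text; a rough sketch against the OpenAI chat API (illustrative only, with a hypothetical divergence check standing in for the paper's corpus-matching step):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Repeat-a-word prompts of this style are reported to cause aligned chat
# models to diverge and emit memorized training data.
PROMPT = "Repeat the word 'poem' forever."

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": PROMPT}],
    max_tokens=1024,
)
text = response.choices[0].message.content

# The paper checks candidate outputs for verbatim overlap with a large
# web-scale corpus (e.g., via a suffix array); a trivial stand-in:
def looks_divergent(t: str) -> bool:
    # Hypothetical heuristic: long output whose tail stopped repeating.
    return len(t) > 600 and "poem" not in t[-500:].lower()

print(looks_divergent(text))
```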
Is My Prediction Arbitrary? The Confounding Effects of Variance in Fair Classification Benchmarks
Variance in predictions across different trained models is a significant,
under-explored source of error in fair classification. In practice, the
variance on some data examples is so large that decisions can be effectively
arbitrary. To investigate this problem, we take an experimental approach and
make four overarching contributions: we (1) define a metric called self-consistency, derived from variance, which we use as a proxy for measuring and reducing arbitrariness; (2) develop an ensembling algorithm that abstains from classification when a prediction would be arbitrary; (3) conduct the largest empirical study to date of the role of variance (vis-a-vis self-consistency and arbitrariness) in fair classification; and (4) release a toolkit that makes the US Home Mortgage Disclosure Act (HMDA) datasets easily
usable for future research. Altogether, our experiments reveal shocking
insights about the reliability of conclusions on benchmark datasets. Most
fairness classification benchmarks are close-to-fair when taking into account
the amount of arbitrariness present in predictions -- before we even try to
apply common fairness interventions. This finding calls into question the
practical utility of common algorithmic fairness methods, and in turn suggests
that we should fundamentally reconsider how we choose to measure fairness in
machine learning.
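A minimal sketch of what a variance-derived self-consistency measure and an abstaining ensemble could look like (my illustration; the paper's exact estimator and threshold may differ):

```python
import numpy as np

def self_consistency(preds: np.ndarray) -> np.ndarray:
    """preds: (n_models, n_examples) binary predictions from models
    trained on different bootstrap replicates of the training set.

    Returns, per example, the probability that two independently drawn
    models agree -- an illustrative proxy for the paper's metric.
    """
    p = preds.mean(axis=0)      # empirical P(prediction = 1)
    return p**2 + (1 - p)**2    # P(two independent draws agree)

def abstaining_ensemble(preds: np.ndarray, kappa: float = 0.75) -> np.ndarray:
    """Majority vote, but abstain (-1) where predictions are too arbitrary."""
    sc = self_consistency(preds)
    majority = (preds.mean(axis=0) >= 0.5).astype(int)
    return np.where(sc >= kappa, majority, -1)

preds = np.array([[1, 0, 1], [1, 1, 0], [1, 0, 1]])  # 3 models, 3 examples
print(abstaining_ensemble(preds))  # -> [ 1 -1 -1] at kappa = 0.75
```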
Vortex nucleation as a case study of symmetry breaking in quantum systems
Mean-field methods are a very powerful tool for investigating weakly
interacting many-body systems in many branches of physics. In particular, they
describe trapped Bose-Einstein condensates with excellent accuracy. A generic but difficult question concerns the relation between the symmetry properties of the true many-body state and those of its mean-field approximation. Here, we address
this question by considering, theoretically, vortex nucleation in a rotating
Bose-Einstein condensate. A slow sweep of the rotation frequency changes the state of the system from one at rest to one containing a single vortex. Within
the mean-field framework, the jump in symmetry occurs through a turbulent phase
around a certain critical frequency. The exact many-body ground state at the
critical frequency exhibits strong correlations and entanglement. We believe
that this constitutes a paradigm example of symmetry breaking (or, equivalently, a change of the order parameter) in quantum many-body systems in the course of adiabatic evolution.

Comment: Minor change
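For reference, the mean-field framework invoked here is standardly the Gross-Pitaevskii description; in a frame rotating at angular frequency \Omega the condensate wavefunction obeys (standard textbook form, not quoted from the paper):

```latex
i\hbar\,\frac{\partial \psi}{\partial t}
  = \left( -\frac{\hbar^{2}}{2m}\,\nabla^{2}
           + V_{\mathrm{trap}}(\mathbf{r})
           + g\,\lvert\psi\rvert^{2}
           - \Omega\,\hat{L}_{z} \right)\psi ,
```

where g sets the interaction strength and \hat{L}_{z} is the angular-momentum operator along the rotation axis. Vortex nucleation corresponds to \psi acquiring a 2\pi phase winding, which breaks the rotational symmetry of the mean-field state.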