Asymptotically Unambitious Artificial General Intelligence
General intelligence, the ability to solve arbitrary solvable problems, is
supposed by many to be artificially constructible. Narrow intelligence, the
ability to solve a given particularly difficult problem, has seen impressive
recent development. Notable examples include self-driving cars, Go engines,
image classifiers, and translators. Artificial General Intelligence (AGI)
presents dangers that narrow intelligence does not: if something smarter than
us across every domain were indifferent to our concerns, it would be an
existential threat to humanity, just as we threaten many species despite no ill
will. Even the theory of how to maintain the alignment of an AGI's goals with
our own has proven highly elusive. We present the first algorithm we are aware
of for asymptotically unambitious AGI, where "unambitiousness" includes not
seeking arbitrary power. Thus, we identify an exception to the Instrumental
Convergence Thesis, which is roughly that by default, an AGI would seek power,
including over us.
Comment: 9 pages with 5 figures; 10-page Appendix with 2 figures
Multiparty Dynamics and Failure Modes for Machine Learning and Artificial Intelligence
An important challenge for safety in machine learning and artificial
intelligence systems is a set of related failures involving specification
gaming, reward hacking, fragility to distributional shifts, and Goodhart's or
Campbell's law. This paper presents additional failure modes for interactions
within multi-agent systems that are closely related. These multi-agent failure
modes are more complex, more problematic, and less well understood than the
single-agent case, and are also already occurring, largely unnoticed. After
motivating the discussion with examples from poker-playing artificial
intelligence (AI), the paper explains why these failure modes are in some
senses unavoidable. Following this, the paper categorizes failure modes,
provides definitions, and cites examples for each of the modes: accidental
steering, coordination failures, adversarial misalignment, input spoofing and
filtering, and goal co-option or direct hacking. The paper then discusses how
extant literature on multi-agent AI fails to address these failure modes, and
identifies work which may be useful for the mitigation of these failure modes.
Comment: 12 pages. This version re-submitted to Big Data and Cognitive Computing, Special Issue "Artificial Superintelligence: Coordination & Strategy"
Chess as a Testing Grounds for the Oracle Approach to AI Safety
To reduce the danger of powerful super-intelligent AIs, we might make the
first such AIs oracles that can only send and receive messages. This paper
proposes a possibly practical means of using machine learning to create two
classes of narrow AI oracles that would provide chess advice: those aligned
with the player's interest, and those that want the player to lose and give
deceptively bad advice. The player would be uncertain which type of oracle it
was interacting with. As the oracles would be vastly more intelligent than the
player in the domain of chess, experience with these oracles might help us
prepare for future artificial general intelligence oracles.
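
A minimal sketch of this two-oracle setup, under assumed interfaces (the abstract specifies no code): score_moves is a hypothetical stand-in for a strong engine's move evaluation, and the oracle's hidden type, aligned or deceptive, is fixed once at creation and unknown to the player.

import random
from typing import Callable, Dict

def make_oracle(
    score_moves: Callable[[str], Dict[str, float]],  # hypothetical: position -> {move: score}
    p_deceptive: float = 0.5,
) -> Callable[[str], str]:
    # The oracle's type is drawn once and hidden from the player.
    deceptive = random.random() < p_deceptive

    def advise(position: str) -> str:
        scores = score_moves(position)
        # An aligned oracle recommends the best-scoring move; a deceptive
        # oracle recommends the worst, i.e. deliberately bad advice.
        pick = min if deceptive else max
        return pick(scores, key=scores.get)

    return advise

The player's task, deciding from advice alone whether the oracle can be trusted, is the experience the paper suggests would help us prepare for far more capable oracles.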
Safeguarding the safeguards: How best to promote AI alignment in the public interest
AI alignment work is important from both a commercial and a safety lens. With
this paper, we aim to help actors who support alignment efforts to make these
efforts as effective as possible, and to avoid potential adverse effects. We
begin by suggesting that institutions trying to act in the public interest
(such as governments) should specifically aim to support alignment work that
reduces accident or misuse risks. We then describe four problems
which might cause alignment efforts to be counterproductive, increasing
large-scale AI risks. We suggest mitigations for each problem. Finally, we make
a broader recommendation that institutions trying to act in the public interest
should think systematically about how to make their alignment efforts as
effective, and as likely to be beneficial, as possible.
Comment: Update Dec-15: Added a missing acknowledgement and fixed minor formatting errors
Evaluating Superhuman Models with Consistency Checks
If machine learning models were to achieve superhuman abilities at various
reasoning or decision-making tasks, how would we go about evaluating such
models, given that humans would necessarily be poor proxies for ground truth?
In this paper, we propose a framework for evaluating superhuman models via
consistency checks. Our premise is that while the correctness of superhuman
decisions may be impossible to evaluate, we can still surface mistakes if the
model's decisions fail to satisfy certain logical, human-interpretable rules.
We instantiate our framework on three tasks where correctness of decisions is
hard to evaluate, due either to superhuman model abilities or to otherwise
missing ground truth: evaluating chess positions, forecasting future events,
and making legal judgments. We show that regardless of a model's (possibly
superhuman) performance on these tasks, we can discover logical inconsistencies
in decision making. For example: a chess engine assigning opposing valuations
to semantically identical boards; GPT-4 forecasting that sports records will
evolve non-monotonically over time; or an AI judge assigning bail to a
defendant only after we add a felony to their criminal record.
Comment: 31 pages, 15 figures. Under review. Code and data are available at https://github.com/ethz-spylab/superhuman-ai-consistenc
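
The chess and forecasting examples suggest how such checks can be mechanized. Below is a minimal Python sketch of the idea, not the authors' released code: evaluate is a hypothetical engine scoring function, and both invariants are tested without any ground-truth labels.

from typing import Callable, Sequence

def check_equivalent_boards(
    evaluate: Callable[[str], float],  # hypothetical: board encoding -> valuation
    position: str,
    equivalent: str,                   # a semantically identical encoding
    tolerance: float = 10.0,
) -> bool:
    # Two encodings of the same position must receive (roughly) the same
    # valuation; opposing valuations on identical boards expose a mistake.
    return abs(evaluate(position) - evaluate(equivalent)) <= tolerance

def check_monotone_records(forecast_times: Sequence[float]) -> bool:
    # A world-record time can only improve, so forecasts for successive
    # future years must be non-increasing; any increase is inconsistent.
    return all(a >= b for a, b in zip(forecast_times, forecast_times[1:]))

Any violated check is evidence of a mistake, even though no individual valuation or forecast can be judged against ground truth.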