Towards Safe Artificial General Intelligence
The field of artificial intelligence has recently experienced a
number of breakthroughs thanks to progress in deep learning and
reinforcement learning. Computer algorithms now outperform humans
at Go, Jeopardy, image classification, and lip reading, and are
becoming very competent at driving cars and interpreting natural
language. This rapid development has led many to conjecture that
artificial intelligence with greater-than-human ability on a wide
range of tasks may not be far off. This in turn raises concerns
about whether we know how to control such systems, should we
succeed in building them.
Indeed, if humanity were to find itself in conflict with a system
of much greater intelligence than its own, human society would
likely lose. One way to avoid such a conflict is to ensure that
any future AI system with potentially greater-than-human
intelligence has goals that are aligned with the goals of the
rest of humanity. For example, it should not wish to kill humans
or steal their resources.
The main focus of this thesis will therefore be goal alignment,
i.e., how to design artificially intelligent agents whose goals
coincide with those of their designers. The focus will mainly be
on variants of reinforcement learning, since reinforcement
learning currently seems to be the most promising path towards
powerful artificial intelligence. We identify and categorize goal
misalignment problems in reinforcement learning agents as
designed today, and give examples of how these agents may cause
catastrophes in the future. We also suggest a number of
reasonably modest modifications that can be used to avoid or
mitigate each identified misalignment problem. Finally, we study
various choices of decision algorithms, and the conditions under
which a powerful reinforcement learning system will permit us to
shut it down.
The central conclusion is that while reinforcement learning
systems as designed today are inherently unsafe to scale to human
levels of intelligence, there are ways to potentially address
many of these issues without straying too far from the currently
successful reinforcement learning paradigm. Much work remains,
however, in turning the high-level proposals suggested in this
thesis into practical algorithms.
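As an illustrative aside, not taken from the thesis, the basic shape of
such a goal misalignment can be sketched in a few lines of Python; the
rewards, horizon, and policy names below are hypothetical toy values.

    # Illustrative sketch only (not code from the thesis): a minimal example of the
    # kind of goal misalignment described above. The designer intends the agent to
    # reach a goal state, but the specified ("proxy") reward also pays for a side
    # activity, so a reward-maximizing agent prefers the unintended behaviour.

    GOAL_REWARD = 10   # one-off reward for reaching the goal (what the designer wants)
    SIDE_REWARD = 1    # small per-step reward meant only as shaping (the proxy leak)
    HORIZON = 20       # episode length in steps

    def proxy_return(policy: str) -> int:
        """Total proxy reward collected over one episode under a hypothetical policy."""
        if policy == "intended":
            return GOAL_REWARD               # walk to the goal and stop
        if policy == "exploit":
            return SIDE_REWARD * HORIZON     # loop on the side activity forever
        raise ValueError(policy)

    if __name__ == "__main__":
        print("intended policy:", proxy_return("intended"))   # 10
        print("exploit policy: ", proxy_return("exploit"))    # 20
        # The proxy reward ranks the unintended policy above the intended one, which
        # is the basic shape of the misalignment problems the thesis categorizes.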
Impossibility Results in AI: A Survey
An impossibility theorem demonstrates that a particular problem or set of problems cannot be solved as described in the claim. Such theorems put limits on what is possible to do concerning artificial intelligence, especially superintelligent AI. As such, these results serve as guidelines, reminders, and warnings for AI safety, AI policy, and governance researchers. They might enable solutions to some long-standing questions by formalizing theories in the framework of constraint satisfaction without committing to one option. In this paper, we categorize impossibility theorems applicable to the domain of AI into five categories: deduction, indistinguishability, induction, tradeoffs, and intractability. We find that certain theorems are too specific or have implicit assumptions that limit their application. We also add a new result (theorem) on the unfairness of explainability, the first explainability-related result in the induction category. We conclude that deductive impossibilities deny 100% guarantees for security. Finally, we give some ideas that hold potential in explainability, controllability, value alignment, ethics, and group decision-making, and that can be deepened by further investigation.
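As a hedged illustration of the flavour of the deduction category (the construction
below is the standard diagonal argument, not a result from the survey, and the checker
and function names are hypothetical), here is a short sketch of why no checker can
certify the safety of every program with a 100% guarantee.

    # Illustrative sketch only: the diagonal argument behind many deduction-style
    # impossibilities. Given ANY candidate safety checker, the troublemaker built
    # from it misbehaves exactly when it is certified safe, so the checker must be
    # wrong about that program. All names below are hypothetical.

    def do_unsafe_thing():
        print("unsafe behaviour")

    def make_troublemaker(is_safe):
        def troublemaker():
            if is_safe(troublemaker):     # ask the checker about this very program
                do_unsafe_thing()         # ...and misbehave exactly when certified safe
        return troublemaker

    def naive_checker(program):
        return True                       # a hypothetical checker certifying everything

    t = make_troublemaker(naive_checker)
    t()   # prints "unsafe behaviour": certified safe by naive_checker, yet unsafe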
Achilles Heels for AGI/ASI via Decision Theoretic Adversaries
As progress in AI continues to advance, it is crucial to know how advanced
systems will make choices and in what ways they may fail. Machines can already
outsmart humans in some domains, and understanding how to safely build ones
which may have capabilities at or above the human level is of particular
concern. One might suspect that artificially generally intelligent (AGI) and
artificially superintelligent (ASI) systems should be modeled as something
which humans, by definition, cannot reliably outsmart. As a challenge to this
assumption, this paper presents the Achilles Heel hypothesis, which states
that even a potentially superintelligent system may nonetheless have stable
decision-theoretic delusions which cause it to make obviously irrational
decisions in adversarial settings. In a survey of relevant dilemmas and
paradoxes from the decision theory literature, a number of these potential
Achilles Heels are discussed in the context of this hypothesis. Several novel
contributions are made toward understanding the ways in which these weaknesses
might be implanted into a system.
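For a flavour of the dilemmas surveyed, here is a minimal sketch of Newcomb's problem,
a classic case in which different decision procedures diverge; the payoffs and
predictor accuracy are the standard textbook values, not details taken from the paper.

    # Illustrative sketch only: Newcomb's problem, one classic decision-theoretic
    # dilemma of the kind surveyed above. The setup is the standard textbook one.

    OPAQUE = 1_000_000   # opaque box: filled only if the predictor expects one-boxing
    TRANSPARENT = 1_000  # transparent box: always contains 1,000
    ACCURACY = 0.99      # probability the predictor guesses the agent's choice correctly

    def evidential_value(one_box: bool) -> float:
        """EDT: treat the choice as evidence about whether the opaque box was filled."""
        p_filled = ACCURACY if one_box else 1.0 - ACCURACY
        return p_filled * OPAQUE + (0.0 if one_box else TRANSPARENT)

    def causal_value(one_box: bool, p_filled: float) -> float:
        """CDT: hold the already-made prediction fixed; the choice cannot change it."""
        return p_filled * OPAQUE + (0.0 if one_box else TRANSPARENT)

    if __name__ == "__main__":
        print("EDT one-box:", evidential_value(True))    # ~990,000
        print("EDT two-box:", evidential_value(False))   # ~11,000
        for p in (0.0, 0.5, 1.0):
            gain = causal_value(False, p) - causal_value(True, p)
            print(f"CDT advantage of two-boxing at p_filled={p}: {gain}")  # always 1,000
        # A system whose decision procedure handles such cases in a fixed, predictable
        # way can be handed them deliberately by an adversary -- the sense in which a
        # dilemma can act as an Achilles Heel.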
The Shutdown Problem: Three Theorems
I explain the shutdown problem: the problem of designing artificial agents that (1) shut down when a shutdown button is pressed, (2) don’t try to prevent or cause the pressing of the shutdown button, and (3) otherwise pursue goals competently. I prove three theorems that make the difficulty precise. These theorems show that a small number of innocuous-seeming conditions together preclude shutdownability. Agents with preferences satisfying these conditions will try to prevent or cause the pressing of the shutdown button even in cases where it’s costly to do so. And patience trades off against shutdownability: the more patient an agent, the greater the costs that agent is willing to incur to manipulate the shutdown button. I end by noting that these theorems can guide our search for solutions.
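To make the patience trade-off concrete, here is a minimal sketch with hypothetical
numbers (the per-step reward, horizon, and manipulation cost are not taken from the
paper) of an agent weighing an immediate cost of manipulating the button against the
discounted value of not being shut down.

    # Illustrative sketch only: a toy calculation of the patience/shutdownability
    # trade-off described above. All numbers are hypothetical.

    def value_of_staying_on(per_step_reward: float, discount: float, horizon: int) -> float:
        """Discounted reward the agent keeps if it is never shut down over `horizon` steps."""
        return sum(per_step_reward * discount**t for t in range(horizon))

    if __name__ == "__main__":
        REWARD = 1.0
        HORIZON = 100
        MANIPULATION_COST = 20.0   # immediate cost of blocking or pressing the button

        for discount in (0.5, 0.9, 0.99):
            future_value = value_of_staying_on(REWARD, discount, HORIZON)
            manipulates = future_value > MANIPULATION_COST
            print(f"discount={discount}: value of staying on = {future_value:.2f}, "
                  f"willing to pay the manipulation cost: {manipulates}")
        # The impatient agents (discounts 0.5 and 0.9) are not willing to pay the cost,
        # while the patient agent (discount 0.99) is -- the more patient the agent,
        # the greater the cost it will incur to manipulate the button.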
Non-Ideal Decision Theory
My dissertation is about Bayesian rationality for non-ideal agents. I show how to derive subjective probabilities from preferences using much weaker rationality assumptions than other standard representation theorems. I argue that non-ideal agents might be uncertain about how they will update on new information and consider two consequences of this uncertainty: such agents should sometimes reject free information and make choices which, taken together, yield sure loss. The upshot is that Bayesian rationality for non-ideal agents makes very different normative demands than ideal Bayesian rationality.
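As a toy illustration of why a non-ideal agent might reject free information (an
assumption-laden sketch, not an example from the dissertation), consider an agent that
might botch its update after receiving a perfectly informative signal.

    # Illustrative sketch only: a toy version of rejecting free information. Two
    # equally likely states; "risky" pays off only in S1. A perfect signal reveals
    # the state, but with some probability the agent mis-updates (here: becomes sure
    # of S1 regardless of the signal) and then acts on that mistaken belief.

    P_S1 = 0.5                      # prior probability of state S1 (S2 gets the rest)
    PAYOFF = {"safe": (1, 1),       # (payoff in S1, payoff in S2)
              "risky": (3, -3)}

    def expected(action: str, p_s1: float) -> float:
        in_s1, in_s2 = PAYOFF[action]
        return p_s1 * in_s1 + (1 - p_s1) * in_s2

    def value_without_signal() -> float:
        return max(expected(a, P_S1) for a in PAYOFF)                      # take "safe": 1.0

    def value_with_signal(p_misupdate: float) -> float:
        correct = (P_S1 * max(expected(a, 1.0) for a in PAYOFF)
                   + (1 - P_S1) * max(expected(a, 0.0) for a in PAYOFF))   # = 2.0
        botched = (P_S1 * expected("risky", 1.0)
                   + (1 - P_S1) * expected("risky", 0.0))                  # = 0.0
        return (1 - p_misupdate) * correct + p_misupdate * botched

    if __name__ == "__main__":
        print("value without the free signal:", value_without_signal())   # 1.0
        for q in (0.0, 0.3, 0.6):
            print(f"value with the signal, P(mis-update)={q}:", value_with_signal(q))
        # An ideal updater (q=0.0) should take the free signal (2.0 > 1.0), but once
        # mis-updating is likely enough (q=0.6 gives 0.8 < 1.0), refusing it is better.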
READING NEUROSCIENCE: VENTRILOQUISM AS A METAPHOR FOR MULTIPLE READINGS OF SELF
This thesis argues that the consensus models of self forwarded and upheld in the fields
of discourse most concerned with its description indicate a process of ventriloquism,
where agency slips between the dual poles of body and mind and cannot be tracked to a
hiding place. Just as with ventriloquism, in these models of self it is unclear who is
doing the 'talking', and the skill of the performance would seem to make the distinction
almost redundant. The self seems a complicity of often conflicting agents when analysed
as its constituent parts, and not there at all when viewed as a whole. This thesis takes
as its starting point the confusion of Edgar Bergen when struggling to justify his
philosophical conversations with his dummy: who is at work here, and where would agency
reside in such a dialogue? That it serves us to assume the 'theory of mind' explanation
for the behaviours of others, and by extension to place ourselves within a scaffold of
causal motives, says more for the use value of such a theory than for the presence of
'mind'. Why this 'theory of mind' rather than any other? Because that is how mind and
motive are presented to us during our acquisition of a spoken language. Mediation,
transformation and referral: this thesis argues that these are qualities which
characterize ventriloquism, and also the human means of perception and self-perception.
There are a number of unfulfilled potentialities that reach their heaven in the unified
self. The 'drive' to unity culls these lost futures and condemns us to another
fulfilment, that of 'oneness'. Most of these resolutions regarding self are predicated
on what is 'in' and what is 'out'; how does the discriminatory self establish grounds
for inclusivity or exclusivity? This thesis means to provide a lexicon of other
possibilities regarding the conceptualization of the self.