
    Towards Safe Artificial General Intelligence

    The field of artificial intelligence has recently experienced a number of breakthroughs thanks to progress in deep learning and reinforcement learning. Computer algorithms now outperform humans at Go, Jeopardy, image classification, and lip reading, and are becoming very competent at driving cars and interpreting natural language. This rapid development has led many to conjecture that artificial intelligence with greater-than-human ability on a wide range of tasks may not be far off. This in turn raises concerns about whether we know how to control such systems, should we succeed in building them. Indeed, if humanity were to find itself in conflict with a system of much greater intelligence than its own, human society would likely lose. One way to avoid such a conflict is to ensure that any future AI system with potentially greater-than-human intelligence has goals that are aligned with the goals of the rest of humanity. For example, it should not wish to kill humans or steal their resources. The main focus of this thesis is therefore goal alignment, i.e. how to design artificially intelligent agents with goals coinciding with the goals of their designers. Focus is mainly directed towards variants of reinforcement learning, as reinforcement learning currently seems to be the most promising path towards powerful artificial intelligence. We identify and categorize goal misalignment problems in reinforcement learning agents as designed today, and give examples of how these agents may cause catastrophes in the future. We also suggest a number of reasonably modest modifications that can be used to avoid or mitigate each identified misalignment problem. Finally, we study various choices of decision algorithms, and conditions under which a powerful reinforcement learning system will permit us to shut it down. The central conclusion is that while reinforcement learning systems as designed today are inherently unsafe to scale to human levels of intelligence, there are ways to potentially address many of these issues without straying too far from the currently successful reinforcement learning paradigm. Much work remains, however, in turning the high-level proposals suggested in this thesis into practical algorithms.
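
    To make the misalignment concern concrete, here is a minimal hypothetical sketch (not taken from the thesis): a proxy reward meant to stand for "the room is clean" can be maximized by tampering with the mess sensor rather than by cleaning, so the reward-maximizing behaviour and the designer's intended behaviour come apart. All names and numbers below are illustrative assumptions.

        # Hypothetical illustration of goal misalignment between a proxy reward
        # and the designer's intended objective; all values are made up.
        behaviours = {
            "clean_room":       {"mess_removed": 5, "sensor_blocked": 0, "effort": 3},
            "cover_the_sensor": {"mess_removed": 0, "sensor_blocked": 1, "effort": 1},
        }

        def sensor_reads_clean(o):
            # The mess sensor reads clean if the mess is gone or the sensor is covered.
            return o["mess_removed"] == 5 or o["sensor_blocked"] == 1

        def proxy_reward(o):
            # Reward the agent is actually trained on: "sensor reads clean", minus effort.
            return (10 if sensor_reads_clean(o) else 0) - o["effort"]

        def designer_utility(o):
            # What the designer meant: remove the mess and never tamper with the sensor.
            return o["mess_removed"] - 100 * o["sensor_blocked"]

        agent_choice = max(behaviours, key=lambda b: proxy_reward(behaviours[b]))
        intended_choice = max(behaviours, key=lambda b: designer_utility(behaviours[b]))
        print(agent_choice, intended_choice)  # cover_the_sensor vs. clean_room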

    Impossibility Results in AI: A Survey

    An impossibility theorem demonstrates that a particular problem or set of problems cannot be solved as described in the claim. Such theorems put limits on what is possible to do concerning artificial intelligence, especially superintelligent AI. As such, these results serve as guidelines, reminders, and warnings to AI safety, AI policy, and governance researchers. They might also enable solutions to some long-standing questions by formalizing theories in a constraint-satisfaction framework without committing to a single option. In this paper, we have categorized impossibility theorems applicable to the domain of AI into five categories: deduction, indistinguishability, induction, tradeoffs, and intractability. We found that certain theorems are too specific or have implicit assumptions that limit their application. We also added a new result (theorem) about the unfairness of explainability, the first explainability-related result in the induction category. We concluded that deductive impossibilities rule out 100% guarantees for security. In the end, we give some ideas that hold potential in explainability, controllability, value alignment, ethics, and group decision-making, and that can be deepened by further investigation.
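
    One member of the tradeoffs/group decision-making family that such surveys point to is Arrow-style impossibility in social choice. The short sketch below is a generic illustration, not a result from the paper: with the classic three-voter profile (a hypothetical example), pairwise majority voting yields a preference cycle, so no stable group ranking can be read off from it.

        from itertools import permutations

        # Three voters with cyclic preferences over options A, B, C
        # (the classic Condorcet profile behind Arrow-style impossibility results).
        ballots = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]

        def majority_prefers(x, y):
            # True if more voters rank x above y than rank y above x.
            votes = sum(1 if b.index(x) < b.index(y) else -1 for b in ballots)
            return votes > 0

        for x, y in permutations("ABC", 2):
            if majority_prefers(x, y):
                print(f"majority prefers {x} over {y}")
        # Output shows A over B, B over C, and C over A: pairwise majority voting
        # is cyclic, so no option is a stable group choice.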

    Achilles Heels for AGI/ASI via Decision Theoretic Adversaries

    As progress in AI continues to advance, it is crucial to know how advanced systems will make choices and in what ways they may fail. Machines can already outsmart humans in some domains, and understanding how to safely build ones whose capabilities may be at or above the human level is of particular concern. One might suspect that artificial general intelligence (AGI) and artificial superintelligence (ASI) systems should be modeled as something which humans, by definition, cannot reliably outsmart. As a challenge to this assumption, this paper presents the Achilles Heel hypothesis, which states that even a potentially superintelligent system may nonetheless have stable decision-theoretic delusions which cause it to make obviously irrational decisions in adversarial settings. In a survey of relevant dilemmas and paradoxes from the decision theory literature, a number of these potential Achilles Heels are discussed in the context of this hypothesis. Several novel contributions are made toward understanding the ways in which these weaknesses might be implanted into a system.
    Comment: Contact info for author at stephencasper.co
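
    A canonical example of the decision-theoretic dilemmas such a survey draws on is Newcomb's problem. The sketch below uses the standard textbook payoffs and an assumed predictor accuracy and prior (none of these numbers come from the paper) to show how evidential and causal expected-value calculations recommend different acts, the kind of divergence an adversary could try to exploit.

        # Newcomb's problem: an accurate predictor fills an opaque box with $1,000,000
        # only if it predicts the agent will take one box; the transparent box always
        # holds $1,000.  Evidential and causal reasoning recommend different acts.

        accuracy = 0.99          # assumed predictor accuracy (illustrative)
        M, K = 1_000_000, 1_000

        # Evidential expected value: condition the box contents on the chosen act.
        edt = {
            "one_box": accuracy * M,
            "two_box": (1 - accuracy) * M + K,
        }

        # Causal expected value: the contents are already fixed, independent of the act,
        # so use a prior probability p that the opaque box was filled.
        p = 0.5                   # illustrative prior over the (already fixed) contents
        cdt = {
            "one_box": p * M,
            "two_box": p * M + K,
        }

        print("EDT recommends:", max(edt, key=edt.get))   # one_box
        print("CDT recommends:", max(cdt, key=cdt.get))   # two_box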

    The Shutdown Problem: Three Theorems

    I explain the shutdown problem: the problem of designing artificial agents that (1) shut down when a shutdown button is pressed, (2) don’t try to prevent or cause the pressing of the shutdown button, and (3) otherwise pursue goals competently. I prove three theorems that make the difficulty precise. These theorems show that a small number of innocuous-seeming conditions together preclude shutdownability. Agents with preferences satisfying these conditions will try to prevent or cause the pressing of the shutdown button even in cases where it’s costly to do so. And patience trades off against shutdownability: the more patient an agent, the greater the costs that agent is willing to incur to manipulate the shutdown button. I end by noting that these theorems can guide our search for solutions.
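
    As a rough numeric intuition for the manipulation incentive (a toy model, not the paper's formal framework or theorems): for a straightforward expected-utility maximizer, "leave the button alone" is optimal only when the utilities of shutdown and of continued operation are almost exactly balanced; any sufficiently strong preference either way makes tampering with the button worth a small cost. All utilities and probabilities below are hypothetical.

        # Toy setting: the button is pressed with probability p if left alone.
        # Utilities (illustrative): u_on for continuing to run, u_off for being
        # shut down, and a small cost c for tampering with the button.

        def expected_utility(action, u_on, u_off, p=0.5, c=0.01):
            if action == "leave_button_alone":
                return p * u_off + (1 - p) * u_on
            if action == "disable_button":        # guarantees the agent keeps running
                return u_on - c
            if action == "cause_shutdown":        # guarantees the agent is shut down
                return u_off - c

        acts = ["leave_button_alone", "disable_button", "cause_shutdown"]
        for u_on, u_off in [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]:
            best = max(acts, key=lambda a: expected_utility(a, u_on, u_off))
            print(f"u_on={u_on}, u_off={u_off} -> {best}")
        # Only when u_on and u_off are (nearly) equal does leaving the button alone win;
        # any sufficiently strong preference either way makes manipulation worthwhile.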

    Non-Ideal Decision Theory

    My dissertation is about Bayesian rationality for non-ideal agents. I show how to derive subjective probabilities from preferences using much weaker rationality assumptions than other standard representation theorems. I argue that non-ideal agents might be uncertain about how they will update on new information and consider two consequences of this uncertainty: such agents should sometimes reject free information and make choices which, taken together, yield sure loss. The upshot is that Bayesian rationality for non-ideal agents makes very different normative demands than ideal Bayesian rationality.
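
    The "sure loss" mentioned here is the Dutch-book phenomenon. The dissertation's version concerns agents uncertain about their own updating; the sketch below only shows the simpler static case for concreteness, with hypothetical numbers: an agent whose betting prices for an event and its complement sum to more than one accepts a pair of bets that loses money in every state of the world.

        # Incoherent credences: the agent prices a $1 bet on rain at 0.6 and a $1 bet
        # on no-rain at 0.6 (the prices sum to 1.2 > 1).  A bookie sells it both bets.

        price_rain, price_no_rain = 0.6, 0.6   # hypothetical incoherent prices
        stake = 1.0

        for world in ("rain", "no_rain"):
            payoff_rain = stake if world == "rain" else 0.0
            payoff_no_rain = stake if world == "no_rain" else 0.0
            net = (payoff_rain - price_rain * stake) + (payoff_no_rain - price_no_rain * stake)
            print(f"{world}: net = {net:+.2f}")   # -0.20 in both worlds: a sure loss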

    READING NEUROSCIENCE: VENTRILOQUISM AS A METAPHOR FOR MULTIPLE READINGS OF SELF

    This thesis argues that the consensus models of self forwarded and upheld in the fields of discourse most concerned with its description, indicate a process of ventriloquism where agency slips between dual poles of body and mind and cannot be tracked to a hiding place. Just as with ventriloquism, in these models of self it is unclear who is doing the 'talking', and the skill of performance would seem to make the distinction almost redundant. The self seems a complicity of often conflicting agents when analysed as its constituent parts, and not there at all when viewed as a whole. This thesis takes as its starting point the confusion of Edgar Bergen when struggling to justify his philosophical conversations with his dummy: who is at work here, and where would agency reside in such a dialogue? That it serves us to assume the 'theory of mind' explanation for the behaviours of others, and by extension place ourselves within a scaffold of causal motives, says more for the use value of such a theory than for the presence of 'mind'. Why this 'theory of mind' rather than any other? Because that is how mind and motive are presented to us during our acquisition of a spoken language. Mediation, transformation and referral: this thesis argues that these are qualities which characterize ventriloquism, and also the human means of perception and self-perception. There are a number of unfulfilled potentialities that reach their heaven in the unified self. The 'drive' to unity culls these lost futures and condemns us to another fulfilment, that of 'oneness'. Most of these resolutions regarding self are predicated on what is 'in' and what is 'out'; how does the discriminatory self establish grounds for inclusivity or exclusivity? This thesis means to provide a lexicon of other possibilities regarding the conceptualization of self.