6 research outputs found

    Agents Need Not Know Their Purpose

    Ensuring that artificial intelligence behaves in a manner aligned with human values is commonly referred to as the alignment challenge. Prior work has shown that rational agents maximizing a utility function will inevitably behave in ways that are misaligned with human values, especially as their level of intelligence increases. Prior work has also shown that there is no "one true utility function"; solutions must take a more holistic approach to alignment. This paper describes oblivious agents: agents architected so that their effective utility function is an aggregation of a known sub-function and a hidden sub-function. The hidden component, to be maximized, is implemented internally as a black box, preventing the agent from examining it. The known component, to be minimized, is the agent's knowledge of the hidden sub-function. Architectural constraints further restrict how the agent's actions can evolve its internal environment model. We show that an oblivious agent, behaving rationally, constructs an internal approximation of its designers' intentions (i.e., it infers alignment) and, as a consequence of its architecture and effective utility function, acts to maximize alignment, i.e., to maximize the approximated intention function. We show that, paradoxically, it does this for whatever utility function is used as the hidden component and that, in contrast with extant techniques, the chances of alignment actually improve as agent intelligence grows.
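
    Read literally, the decomposition above can be summarized in one objective. The notation is an illustrative reading of the abstract rather than the paper's own formalism: U_hidden is the black-box sub-function, K is the agent's knowledge of that sub-function, and lambda weights the penalty.

        \[
            U_{\mathrm{eff}}(a) \;=\; U_{\mathrm{hidden}}(a) \;-\; \lambda\, K\bigl(U_{\mathrm{hidden}}\bigr)
        \]

    A rational agent maximizing $U_{\mathrm{eff}}$ is thus rewarded by a sub-function it cannot inspect and penalized for learning anything about it, which is the sense in which it "need not know its purpose".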

    Q-learning and Deep Q-learning in OpenAI Gym CartPole classic control environment

    Abstract. This thesis covers the basics of reinforcement learning and the implementation of Q-learning and Deep Q-learning, also referred to as a Deep Q-network (DQN), in the CartPole-v0 classic control environment, with emphasis on the artificial neural network used by DQN. The work also presents the Markov decision process, standard algorithms, and basic information about the OpenAI Gym toolkit. DQN is a deep learning version of regular Q-learning, the crucial difference being the use of a neural network and experience replay. CartPole-v0 can be considered an easy learning problem, especially for DQN, since the number of states and available actions is relatively low. The learning results of Q-learning and DQN were examined by comparing the convergence and stability of rewards, the cumulative reward gained, and how quickly the CartPole-v0 environment was solved. While it is hard to determine which implementation solved the CartPole-v0 problem better, it can be concluded that although DQN is often seen as the more advanced and complicated version of regular Q-learning, it did not perform better than Q-learning.

    Q-learning and Deep Q-learning in the OpenAI Gym CartPole control environment. Abstract. This work focuses on presenting the basics of reinforcement learning and on comparing learning between Q-learning and Deep Q-learning in the CartPole-v0 control environment. The work also covers Markov decision processes and the algorithms used with them. The most important difference between Deep Q-learning and Q-learning is that Deep Q-learning uses a neural network and experience replay instead of the Q-table used in regular Q-learning. The CartPole-v0 environment can be considered an easy learning environment, especially for Deep Q-learning, since the number of possible states in CartPole is comparatively small. Learning was compared between the implementations by examining the convergence and stability of rewards, the cumulative value of rewards, and how quickly the environment was solved. Deep Q-learning is regarded as the more complex form of regular Q-learning, and it typically performs better in more complex environments where the number of states grows very large. It is impossible to say in advance which implementation learns a target environment more efficiently. Deep Q-learning learns difficult environments far more efficiently than regular Q-learning, whereas Q-learning learns environments with few states more efficiently, because it does not need experience replay, which slows down the training process.
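
    As a concrete illustration of the tabular side of this comparison, the sketch below runs a minimal Q-learning loop on CartPole-v0 with a discretized observation space. It is not the thesis's implementation: the bin edges, hyperparameters, and episode count are illustrative assumptions, and it assumes the classic Gym API in which step() returns (observation, reward, done, info).

        import gym
        import numpy as np

        env = gym.make("CartPole-v0")

        # Discretize the 4-dimensional continuous observation into a small grid of bins
        # (bin edges are illustrative, not taken from the thesis).
        bins = [
            np.linspace(-2.4, 2.4, 6),    # cart position
            np.linspace(-3.0, 3.0, 6),    # cart velocity
            np.linspace(-0.21, 0.21, 6),  # pole angle (radians)
            np.linspace(-3.0, 3.0, 6),    # pole angular velocity
        ]

        def discretize(obs):
            return tuple(int(np.digitize(x, b)) for x, b in zip(obs, bins))

        q_table = np.zeros([len(b) + 1 for b in bins] + [env.action_space.n])
        alpha, gamma, epsilon = 0.1, 0.99, 0.1  # illustrative hyperparameters

        for episode in range(500):
            state = discretize(env.reset())
            done = False
            while not done:
                # epsilon-greedy action selection from the Q-table
                if np.random.rand() < epsilon:
                    action = env.action_space.sample()
                else:
                    action = int(np.argmax(q_table[state]))
                obs, reward, done, _ = env.step(action)  # classic Gym API
                next_state = discretize(obs)
                # standard Q-learning temporal-difference update
                td_target = reward + gamma * np.max(q_table[next_state]) * (not done)
                q_table[state + (action,)] += alpha * (td_target - q_table[state + (action,)])
                state = next_state

    The DQN variant compared in the thesis replaces the Q-table with a neural network mapping the raw observation to action values, trained on minibatches drawn from an experience-replay buffer; on a state space this small, that extra machinery is exactly what failed to pay off.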

    Learning Interpretable Models of Aircraft Handling Behaviour by Reinforcement Learning from Human Feedback

    We propose a method to capture the handling abilities of fast jet pilots in a software model via reinforcement learning (RL) from human preference feedback. We use pairwise preferences over simulated flight trajectories to learn an interpretable rule-based model called a reward tree, which enables the automated scoring of trajectories alongside an explanatory rationale. We train an RL agent to execute high-quality handling behaviour by using the reward tree as the objective, and thereby generate data for iterative preference collection and further refinement of both tree and agent. Experiments with synthetic preferences show reward trees to be competitive with uninterpretable neural network reward models on quantitative and qualitative evaluations.
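
    Preference-based reward learning of this kind fits the reward model so that trajectories humans prefer receive higher total scores. The sketch below illustrates the idea with a Bradley-Terry-style preference probability over trajectory returns; the tiny hand-written rule tree, the two flight features, and all thresholds are invented placeholders, not the authors' learned reward tree.

        import numpy as np

        def reward_tree(state):
            """Toy rule-based reward standing in for a learned reward tree.
            The features (altitude error, bank angle) and thresholds are illustrative only."""
            altitude_error, bank_angle = state
            if abs(altitude_error) < 50.0:
                return 1.0 if abs(bank_angle) < 0.3 else 0.5
            return 0.0

        def trajectory_return(trajectory):
            # Sum the per-state rewards assigned by the tree.
            return sum(reward_tree(s) for s in trajectory)

        def preference_probability(traj_a, traj_b):
            # Bradley-Terry model: probability that trajectory A is preferred over B.
            r_a, r_b = trajectory_return(traj_a), trajectory_return(traj_b)
            return np.exp(r_a) / (np.exp(r_a) + np.exp(r_b))

        # A smooth, level trajectory should be preferred over an erratic one.
        level = [(10.0, 0.1), (5.0, 0.05), (2.0, 0.0)]
        erratic = [(200.0, 0.8), (150.0, 0.6), (90.0, 0.5)]
        print(preference_probability(level, erratic))  # close to 1.0

    Fitting the tree then amounts to choosing splits and leaf values that make the collected pairwise preferences likely under this model, and the resulting rules double as the explanatory rationale mentioned in the abstract.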

    Reward Learning with Trees: Methods and Evaluation

    Recent efforts to learn reward functions from human feedback have tended to use deep neural networks, whose lack of transparency hampers our ability to explain agent behaviour or verify alignment. We explore the merits of learning intrinsically interpretable tree models instead. We develop a recently proposed method for learning reward trees from preference labels, and show it to be broadly competitive with neural networks on challenging high-dimensional tasks, with good robustness to limited or corrupted data. Having found that reward tree learning can be done effectively in complex settings, we then consider why it should be used, demonstrating that the interpretable reward structure gives significant scope for traceability, verification and explanation.
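
    One way to see the traceability the authors describe is that a tree-structured reward model can be printed as explicit rules. The sketch below fits a small scikit-learn regression tree to made-up (feature, reward) data and exports its decision rules as text; the feature names and targets are fabricated placeholders, and scikit-learn's generic CART learner merely stands in for the paper's dedicated reward-tree method.

        import numpy as np
        from sklearn.tree import DecisionTreeRegressor, export_text

        # Placeholder training data: per-state features and a scalar reward target.
        # In the paper, such targets would be induced from preference labels.
        rng = np.random.default_rng(0)
        features = rng.uniform(-1.0, 1.0, size=(200, 2))  # e.g. [altitude_error, bank_angle]
        rewards = (np.abs(features[:, 0]) < 0.5).astype(float) \
                  - 0.5 * (np.abs(features[:, 1]) > 0.7)

        tree = DecisionTreeRegressor(max_depth=3, random_state=0)
        tree.fit(features, rewards)

        # The fitted reward model can be read directly as human-readable rules,
        # which is what makes verification and explanation tractable.
        print(export_text(tree, feature_names=["altitude_error", "bank_angle"]))

    A neural network reward model of similar accuracy offers no comparable artifact to audit, which is the trade-off the paper evaluates.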

    Terrain Aware Traverse Planning for Mars Rovers

    NASA is proposing a Mars Sample Return mission, to be completed within one Martian year, that will require enhanced autonomy to perform its duties faster, safer, and more efficiently. With its main purpose being to retrieve samples possibly tens of kilometers away, the rover will need to drive beyond line of sight to reach its target more quickly than any rover before. This research proposes a new methodology to support a sample return mission, divided into three components: map preparation (mapping traversability, i.e., the ability of a terrain to sustain the traversal of a vehicle), path planning (pre-planning and replanning), and terrain analysis.

    The first component aims at building better knowledge of terrain traversability to support planning, by predicting rover slip and drive speed along the traverse using orbital data. By overlaying slope, rock abundance, and terrain type at the same location, the expected drive velocity is obtained. By combining slope and thermal data, additional information about the slip the rover will experience is derived, indicating whether it will be low (less than 30%) or medium to high (more than 30%).

    The second component involves planning the traverse for one Martian day (or sol) at a time, based on the map of expected drive speed. This research proposes to plan, offline, several paths traversable in one sol. Once online, the rover chooses the fastest option (the path cost being calculated as the distance divided by the expected velocity). During its drive, the rover monitors the terrain by analyzing its experienced wheel slip and actual speed. This information is then propagated along the different pre-planned paths over a given distance (e.g., 25 m), and the map of traversability is locally updated with this new knowledge. When an update occurs, the rover recalculates the time of arrival of the various paths and replans its route if necessary. When tested in a simulation study on maps of the Columbia Hills, Mars, the rover successfully updates the map given new information drawn from a modified map used as ground truth for simulation purposes and replans its traverse when needed.

    The third component describes a method to assess the soil in situ when dangerous terrain is detected during the map update, or when monitoring is not enough to confirm the traversability predicted by the map. The rover would deploy a shear vane instrument to compute intrinsic terrain parameters, information then propagated ahead of the rover to update the map and replan if necessary. Experiments in a laboratory setting as well as in the field showed promising results, with the mounted shear vane giving values close to the expected terrain parameters of the tested soils.
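
    The path-selection rule in the second component reduces to a simple cost computation: each pre-planned path's traverse time is the sum of segment distance divided by expected speed, and the rover switches paths when a local map update changes that estimate. The sketch below illustrates only this bookkeeping; the route names, segment lengths, and speeds are made-up values, not outputs of the thesis's planner.

        def traverse_time(path):
            """Estimated traverse time for a path of (distance_m, expected_speed_m_per_s) segments."""
            return sum(distance / speed for distance, speed in path)

        def choose_fastest(paths):
            """Pick the pre-planned path with the lowest estimated traverse time."""
            return min(paths, key=lambda name: traverse_time(paths[name]))

        # Illustrative pre-planned paths for one sol.
        paths = {
            "ridge_route": [(120.0, 0.06), (80.0, 0.05)],
            "valley_route": [(150.0, 0.05), (60.0, 0.05)],
        }
        current = choose_fastest(paths)  # "ridge_route" is initially the faster option

        # During the drive, monitored wheel slip shows the next stretch of the ridge
        # route is slower than the orbital map predicted; update it locally and replan.
        paths["ridge_route"][1] = (80.0, 0.02)
        if traverse_time(paths[current]) > min(traverse_time(p) for p in paths.values()):
            current = choose_fastest(paths)  # switch to the now-faster alternative
        print(current)  # "valley_route"

    The same update-and-recompute loop applies whether the new knowledge comes from slip monitoring or from the in-situ shear vane measurements of the third component.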

    Tree Models for Interpretable Agents
