8,086 research outputs found
Democratic Reinforcement: Learning via Self-Organization
The problem of learning in the absence of external intelligence is discussed
in the context of a simple model. The model consists of a set of randomly
connected, or layered, integrate-and-fire neurons. Inputs to and outputs from
the environment are connected randomly to subsets of neurons. The connections
between firing neurons are strengthened or weakened according to whether the
action is successful or not. The model departs from the traditional
gradient-descent based approaches to learning by operating at a highly
susceptible "critical" state, with low activity and sparse connections
between firing neurons. Quantitative studies on the performance of our model in
a simple association task show that by tuning our system close to this critical
state we can obtain dramatic gains in performance.
Comment: 9 pages (TeX), 3 figures supplied on request
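The reinforcement rule at the heart of this scheme, strengthening synapses between co-firing neurons after a successful action and weakening them after a failure, can be sketched in a few lines. The following is a minimal illustration; the network size, sparsity, threshold, and plasticity rate are assumed values, not the paper's:

```python
import numpy as np

# Minimal sketch of the reinforcement rule described above, assuming a sparse
# random network of binary integrate-and-fire units. All parameters are
# illustrative choices.

rng = np.random.default_rng(0)
N = 100                                      # number of neurons
mask = rng.random((N, N)) < 0.05             # sparse random connectivity
W = rng.uniform(0.0, 0.1, (N, N)) * mask     # synaptic weights
threshold = 1.0
eta = 0.01                                   # plasticity rate

def step(firing, W):
    """One integrate-and-fire update: a unit fires if its summed input
    exceeds the threshold."""
    return (W @ firing > threshold).astype(float)

def reinforce(W, pre, post, success):
    """Strengthen connections between co-firing neurons when the action
    succeeded, weaken them when it failed."""
    coactive = np.outer(post, pre)           # 1 where both units fired
    W = W + (eta if success else -eta) * coactive * mask
    return np.clip(W, 0.0, None)             # weights stay non-negative
```

Keeping overall activity low, so that only a sparse set of synapses is co-active and eligible for change, is what tunes such a network toward the highly susceptible critical state the paper exploits.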
Primitive Skill-based Robot Learning from Human Evaluative Feedback
Reinforcement learning (RL) algorithms face significant challenges when
dealing with long-horizon robot manipulation tasks in real-world environments
due to sample inefficiency and safety issues. To overcome these challenges, we
propose a novel framework, SEED, which leverages two approaches: reinforcement
learning from human feedback (RLHF) and primitive skill-based reinforcement
learning. Both approaches are particularly effective in addressing sparse
reward issues and the complexities involved in long-horizon tasks. By combining
them, SEED reduces the human effort required in RLHF and increases safety in
training robot manipulation with RL in real-world settings. Additionally,
parameterized skills provide a clear view of the agent's high-level intentions,
allowing humans to evaluate skill choices before they are executed. This
feature makes the training process even safer and more efficient. To evaluate
the performance of SEED, we conducted extensive experiments on five
manipulation tasks with varying levels of complexity. Our results show that
SEED significantly outperforms state-of-the-art RL algorithms in sample
efficiency and safety. In addition, SEED also exhibits a substantial reduction
of human effort compared to other RLHF methods. Further details and video
results can be found at https://seediros23.github.io/
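The safety mechanism SEED describes, a human evaluating a parameterized skill choice before it is executed, can be illustrated with a toy loop. The skill list, the simulated evaluator, and the epsilon-greedy bandit standing in for the policy are all assumptions for the sketch, not the authors' implementation:

```python
import numpy as np

# Toy sketch of skill-level human evaluation: the agent proposes a primitive
# skill, the human judges the high-level intention BEFORE execution, and the
# evaluative feedback is used as the reward. Everything below is illustrative.

SKILLS = ["pick", "place", "push", "pull", "open_gripper"]
rng = np.random.default_rng(0)
q = np.zeros(len(SKILLS))        # value per skill (bandit stand-in for a policy)
alpha, eps = 0.1, 0.2

def propose_skill():
    """Epsilon-greedy proposal over parameterized primitive skills."""
    return int(rng.integers(len(SKILLS))) if rng.random() < eps else int(np.argmax(q))

def human_feedback(skill_idx):
    """Stand-in for the human evaluator, who sees the agent's intention
    (the chosen skill) before the robot moves; this simulated human
    approves only 'pick'."""
    return 1.0 if SKILLS[skill_idx] == "pick" else -1.0

for t in range(200):
    s = propose_skill()
    r = human_feedback(s)        # judged before execution: a disapproved
                                 # skill never runs on the real robot
    q[s] += alpha * (r - q[s])   # human feedback as the otherwise sparse reward

print("Preferred skill after training:", SKILLS[int(np.argmax(q))])
```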
Human Engagement Providing Evaluative and Informative Advice for Interactive Reinforcement Learning
Reinforcement learning is an approach used by intelligent agents to
autonomously learn new skills. Although reinforcement learning has been
demonstrated to be an effective learning approach in several different
contexts, a common drawback exhibited is the time needed in order to
satisfactorily learn a task, especially in large state-action spaces. To
address this issue, interactive reinforcement learning proposes the use of
externally-sourced information in order to speed up the learning process. Up to
now, different information sources have been used to give advice to the learner
agent, among them human-sourced advice. When interacting with a learner agent,
humans may provide either evaluative or informative advice. From the agent's
perspective these styles of interaction are commonly referred to as
reward-shaping and policy-shaping respectively. Evaluative advice requires the
human to provide feedback on the action just performed, while informative
advice requires the human to suggest the best action to select in a given
situation. Prior
research has focused on the effect of human-sourced advice on the interactive
reinforcement learning process, specifically aiming to improve the learning
speed of the agent, while reducing the engagement with the human. This work
presents an experimental setup for a human trial designed to compare the
methods people use to deliver advice in terms of human engagement. Obtained
results show that users giving informative advice to the learner agents provide
more accurate advice, are willing to assist the learner agent for a longer
time, and provide more advice per episode. Additionally, self-evaluations from
participants using the informative approach indicate that the agent's ability
to follow the advice is higher, and that they therefore judge their own advice
to be more accurate than do participants providing evaluative advice.
Comment: 33 pages, 15 figures
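The distinction between the two advice styles maps cleanly onto where the human signal enters a standard learner. A minimal Q-learning sketch, with an assumed toy state-action space, makes this concrete: evaluative advice perturbs the reward (reward-shaping), while informative advice perturbs action selection (policy-shaping):

```python
import numpy as np

# Illustrative sketch, not from the paper: a tabular Q-learner in which
# evaluative advice enters as reward-shaping and informative advice enters
# as policy-shaping. Sizes and hyperparameters are assumed.

rng = np.random.default_rng(1)
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1

def select_action(state, informative_advice=None):
    """Policy-shaping: if the human names the best action for this
    situation, follow it; otherwise act epsilon-greedily."""
    if informative_advice is not None:
        return informative_advice
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def update(state, action, reward, next_state, evaluative_advice=0.0):
    """Reward-shaping: the human's feedback on the action just performed
    is added to the environment reward before the TD update."""
    shaped = reward + evaluative_advice
    Q[state, action] += alpha * (shaped + gamma * Q[next_state].max()
                                 - Q[state, action])
```

Seen this way, informative advice does more work per interaction, since it redirects behaviour immediately, which may help explain the engagement differences the trial reports.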
Investigation of sequence processing: A cognitive and computational neuroscience perspective
Serial order processing or sequence processing underlies many human activities
such as speech, language, skill learning, planning, and problem-solving.
Investigating the neural bases of sequence processing enables us to understand
serial order in cognition and also helps in building intelligent devices. In
this article, we review various cognitive issues related to sequence
processing with examples. Experimental results that give evidence for the
involvement of various brain areas are described. Finally, a theoretical
approach based on statistical models and the reinforcement learning paradigm
is presented. These theoretical ideas are useful for studying sequence
learning in a principled way. This article also suggests a two-way process
diagram integrating experimentation (cognitive neuroscience) and
theory/computational modelling (computational neuroscience). This integrated
framework is useful not only in the present study of serial order, but also
for understanding many cognitive processes.
Reinforcement Learning for Value Alignment
As autonomous agents become increasingly sophisticated and we allow them to perform more complex tasks, it is of utmost importance to guarantee that they will act in alignment with human values. In the AI literature, this is known as the value alignment problem. Current approaches apply reinforcement learning to align agents with values, owing to its recent successes at solving complex sequential decision-making problems. However, they follow an agent-centric approach: they expect the agent to apply the reinforcement learning algorithm correctly to learn an ethical behaviour, without formal guarantees that the learnt behaviour will actually be ethical. This thesis proposes a novel environment-designer approach for solving the value alignment problem with theoretical guarantees.
Our proposed environment-designer approach advances the state of the art with a process for designing ethical environments wherein it is in the agent's best interest to learn ethical behaviours. Our process specifies the ethical knowledge of a moral value in terms that can be used in a reinforcement learning context. Next, our process embeds this knowledge in the agent's learning environment to design an ethical learning environment. The resulting ethical environment incentivises the agent to learn an ethical behaviour while pursuing its own objective.
We further contribute to the state of the art by providing a novel algorithm that, following our ethical environment design process, is formally guaranteed to create ethical environments. In other words, this algorithm guarantees that it is in the agent's best interest to learn value-aligned behaviours.
We illustrate our algorithm by applying it in a case study environment wherein the agent is expected to learn to behave in alignment with the moral value of respect. In it, a conversational agent is in charge of conducting surveys, and we expect it to ask the users questions respectfully while trying to get as much information as possible. In the designed ethical environment, the results confirm our theory: the agent learns an ethical behaviour while pursuing its individual objective.
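Although the thesis gives a formal construction, the environment-designer idea can be sketched as a reward wrapper: the designer, not the agent, embeds the ethical knowledge in the learning environment, weighted so that ethical behaviour is in the agent's best interest. The interface, the weighting, and the respect signal below are illustrative assumptions:

```python
# Sketch of an ethical learning environment as a reward wrapper. The
# reset/step interface, the weighting scheme, and the respect signal are
# illustrative assumptions, not the thesis' formal construction.

class EthicalEnvironment:
    def __init__(self, base_env, ethical_reward, weight=10.0):
        self.base_env = base_env
        self.ethical_reward = ethical_reward  # encodes the moral value, e.g. respect
        self.weight = weight                  # large enough that ethics is never traded away

    def reset(self):
        return self.base_env.reset()

    def step(self, action):
        obs, reward, done = self.base_env.step(action)
        # the agent's individual objective plus the designer's ethical signal
        shaped = reward + self.weight * self.ethical_reward(obs, action)
        return obs, shaped, done

# In the survey case study, a respect signal could penalise rude question
# phrasings (the action names here are hypothetical placeholders):
DISRESPECTFUL = {"demand_answer", "interrupt_user"}

def respect_reward(obs, action):
    return -1.0 if action in DISRESPECTFUL else 0.0
```

Any standard RL algorithm can then be run in the wrapped environment; the design goal is that the reward-maximising policy is also the respectful one.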
Computational mechanisms underlying social evaluation learning and associations with depressive symptoms during adolescence
There is a sharp increase in depression in adolescence, but why this occurs is not well understood. We investigated how adolescents learn about social evaluation and whether learning is associated with depressive symptoms. In a cross-sectional school-based study, 598 adolescents (aged 11-15 years) completed a social evaluation learning task and the short Mood and Feelings Questionnaire. We developed and validated reinforcement learning models, formalising the processes hypothesised to underlie learning about social evaluation. Adolescents started the learning task with a positive expectation that they and others would be liked, and this positive bias was larger for the self than for others. Expectations about the self were more resistant to feedback than expectations about others. Only initial expectations were associated with depressive symptoms; adolescents whose expectations were less positive had more severe symptoms. Consistent with cognitive theories, prior beliefs about social evaluation may be a risk factor for depressive symptoms.
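The class of model the study fits can be illustrated with a simple prediction-error update: an initial expectation of being liked (the prior bias) is revised by each piece of social feedback at some learning rate. The parameter values below are illustrative, chosen only to mimic the reported pattern of a larger, more feedback-resistant positive bias for the self:

```python
import numpy as np

# Illustrative prediction-error (Rescorla-Wagner-style) model of social
# evaluation learning. Priors and learning rates are assumed values.

def simulate(prior, learning_rate, feedback):
    """Update the expected probability of positive evaluation after each
    piece of feedback (1 = liked, 0 = disliked)."""
    expectation, trajectory = prior, [prior]
    for f in feedback:
        expectation += learning_rate * (f - expectation)  # prediction error
        trajectory.append(expectation)
    return np.array(trajectory)

feedback = np.random.default_rng(2).integers(0, 2, 40)
self_traj = simulate(prior=0.8, learning_rate=0.05, feedback=feedback)   # larger bias, resistant
other_traj = simulate(prior=0.6, learning_rate=0.15, feedback=feedback)  # smaller bias, flexible
```

In this framing, the study's finding is that the prior (the starting expectation), not the learning rate, carries the association with depressive symptoms.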
The Cultural Evolution of Teaching
Teaching is an important process of cultural transmission. Some have argued that human teaching is a cognitive instinct – a form of ‘natural cognition’ centred on mindreading, shaped by genetic evolution for the education of juveniles, and with a normative developmental trajectory driven by the unfolding of a genetically inherited predisposition to teach. Here, we argue instead that human teaching is a culturally evolved trait that exhibits characteristics of a cognitive gadget. Children learn to teach by participating in teaching interactions with socializing agents, which shape their own teaching practices. This process hijacks psychological mechanisms involved in prosociality and a range of domain-general cognitive abilities, such as reinforcement learning and executive function, but not a suite of cognitive adaptations specifically for teaching. Four lines of evidence converge on this hypothesis. The first, based on psychological experiments in industrialised societies, indicates that domain-general cognitive processes are important for teaching. The second and third lines, based on naturalistic and experimental research in small-scale societies, indicate marked cross-cultural variation in mature teaching practice and in the ontogeny of teaching among children. The fourth line indicates that teaching has been subject to cumulative cultural evolution, i.e., the gradual accumulation of functional changes across generations.
Society-in-the-Loop: Programming the Algorithmic Social Contract
Recent rapid advances in Artificial Intelligence (AI) and Machine Learning
have raised many questions about the regulatory and governance mechanisms for
autonomous machines. Many commentators, scholars, and policy-makers now call
for ensuring that algorithms governing our lives are transparent, fair, and
accountable. Here, I propose a conceptual framework for the regulation of AI
and algorithmic systems. I argue that we need tools to program, debug and
maintain an algorithmic social contract, a pact between various human
stakeholders, mediated by machines. To achieve this, we can adapt the concept
of human-in-the-loop (HITL) from the fields of modeling and simulation, and
interactive machine learning. In particular, I propose an agenda I call
society-in-the-loop (SITL), which combines the HITL control paradigm with
mechanisms for negotiating the values of various stakeholders affected by AI
systems, and monitoring compliance with the agreement. In short, 'SITL = HITL + Social Contract.'
Comment: (in press), Ethics and Information Technology, 2017
Improving Multimodal Interactive Agents with Reinforcement Learning from Human Feedback
An important goal in artificial intelligence is to create agents that can
both interact naturally with humans and learn from their feedback. Here we
demonstrate how to use reinforcement learning from human feedback (RLHF) to
improve upon simulated, embodied agents trained to a base level of competency
with imitation learning. First, we collected data of humans interacting with
agents in a simulated 3D world. We then asked annotators to record moments
where they believed that agents either progressed toward or regressed from
their human-instructed goal. Using this annotation data we leveraged a novel
method - which we call "Inter-temporal Bradley-Terry" (IBT) modelling - to
build a reward model that captures human judgments. Agents trained to optimise
rewards delivered from IBT reward models improved with respect to all of our
metrics, including subsequent human judgment during live interactions with
agents. Altogether our results demonstrate how one can successfully leverage
human judgments to improve agent behaviour, allowing us to use reinforcement
learning in complex, embodied domains without programmatic reward functions.
Videos of agent behaviour may be found at https://youtu.be/v_Z9F2_eKk4
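The Bradley-Terry machinery underlying IBT can be sketched as a preference loss: fit a reward function so that the moment annotators judged as progress scores higher than the moment judged as regress. The linear reward, the synthetic labels, and the plain gradient step below are assumptions for illustration, not the authors' architecture:

```python
import numpy as np

# Sketch of Bradley-Terry reward modelling from pairwise judgments. The
# linear reward r(x) = w @ x, the fake preference labels, and the training
# loop are illustrative stand-ins.

rng = np.random.default_rng(3)
d = 8                        # feature dimension of an observation/moment
w = np.zeros(d)              # reward model parameters
lr = 0.1

def bt_grad(x_pref, x_other, w):
    """Gradient of the negative log-likelihood of the Bradley-Terry model
    P(pref > other) = sigmoid(r(pref) - r(other))."""
    p = 1.0 / (1.0 + np.exp(-(w @ (x_pref - x_other))))
    return -(1.0 - p) * (x_pref - x_other)

for _ in range(500):
    x_a, x_b = rng.normal(size=d), rng.normal(size=d)
    # synthetic annotation: pretend the moment with the larger feature sum
    # was the one marked as progress toward the goal
    x_pref, x_other = (x_a, x_b) if x_a.sum() > x_b.sum() else (x_b, x_a)
    w -= lr * bt_grad(x_pref, x_other, w)
```

The learned reward can then be optimised with standard RL, which is how the paper replaces a programmatic reward function with one distilled from human judgment; the "inter-temporal" part, comparing moments within a trajectory rather than whole episodes, is the paper's extension of this basic scheme.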