768 research outputs found
Improving Search through A3C Reinforcement Learning based Conversational Agent
We develop a reinforcement-learning-based search assistant that guides
users through a set of actions and a sequence of interactions to help them
realize their intent. Our approach caters to subjective search, where the user
is seeking digital assets such as images, which is fundamentally different from
tasks that have objective and limited search modalities. Labeled
conversational data is generally not available in such search tasks and
training the agent through human interactions can be time-consuming. We propose
a stochastic virtual user that impersonates a real user and can be used to
sample user behavior efficiently, which accelerates the bootstrapping of the
agent. We develop an A3C-based, context-preserving architecture that enables
the agent to provide contextual assistance to the user. We compare the A3C
agent with Q-learning and evaluate its performance on the average rewards and
state values it obtains with the virtual user in validation episodes. Our
experiments show that the agent learns to achieve higher rewards and better
states. Comment: 17 pages, 7 figures
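As a rough illustration of the quantities an A3C-style agent works with (rewards and state values), the sketch below computes bootstrapped discounted returns and advantages for one rollout. It is a minimal sketch of the generic advantage actor-critic calculation, not the architecture from the abstract above; the function names, the discount factor, and the sample numbers are assumptions made for illustration.

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted returns R_t = r_t + gamma * R_{t+1}, seeded with V(s_T)."""
    returns = []
    running = bootstrap_value  # critic's value estimate for the state after the rollout
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Advantage A_t = R_t - V(s_t), the signal that weights the policy gradient."""
    returns = n_step_returns(rewards, bootstrap_value, gamma)
    return [ret - value for ret, value in zip(returns, values)]


# Hypothetical three-step rollout with a terminal state (bootstrap value 0).
print(advantages([1.0, 0.0, 2.0], [1.0, 1.0, 1.0], 0.0, gamma=0.5))  # [0.5, 0.0, 1.0]
```

A positive advantage means the action did better than the critic expected, so its probability is pushed up; a negative one pushes it down.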
Reinforcement Learning for Generative AI: State of the Art, Opportunities and Open Research Challenges
Generative Artificial Intelligence (AI) is one of the most exciting
developments in Computer Science of the last decade. At the same time,
Reinforcement Learning (RL) has emerged as a very successful paradigm for a
variety of machine learning tasks. In this survey, we discuss the state of the
art, opportunities and open research questions in applying RL to generative AI.
In particular, we will discuss three types of applications, namely, RL as an
alternative way for generation without specified objectives; as a way for
generating outputs while concurrently maximizing an objective function; and,
finally, as a way of embedding desired characteristics, which cannot be easily
captured by means of an objective function, into the generative process. We
conclude the survey with an in-depth discussion of the opportunities and
challenges in this fascinating emerging area. Comment: Published in JAIR at
https://www.jair.org/index.php/jair/article/view/1527
Causal-aware Safe Policy Improvement for Task-oriented dialogue
The recent success of reinforcement learning (RL) in solving complex tasks
is most often attributed to its capacity to explore and exploit an environment
where it has been trained. Sample efficiency is usually not an issue since
cheap simulators are available to sample data on-policy. On the other hand,
task oriented dialogues are usually learnt from offline data collected using
human demonstrations. Collecting diverse demonstrations and annotating them is
expensive. Unfortunately, RL methods trained on off-policy data are
prone to issues of bias and generalization, which are further exacerbated by
stochasticity in human responses and the non-Markovian belief state of a dialogue
management system. To this end, we propose a batch RL framework for task
oriented dialogue policy learning: causal aware safe policy improvement
(CASPI). This method gives guarantees on the dialogue policy's performance and
also learns to shape rewards according to the intentions behind human responses,
rather than just mimicking demonstration data; this, coupled with batch RL,
helps the overall sample efficiency of the framework. We demonstrate the
effectiveness of this framework on the dialogue-context-to-text generation and
end-to-end dialogue tasks of the MultiWOZ 2.0 dataset. The proposed method
outperforms the current state of the art on these metrics in both cases. In the
end-to-end case, our method trained on only 10% of the data was able to
outperform the current state of the art in three out of four evaluation metrics.
Adaptive Policy Learning for Offline-to-Online Reinforcement Learning
Conventional reinforcement learning (RL) needs an environment to collect
fresh data, which is impractical when online interactions are costly. Offline
RL provides an alternative by learning directly from a previously
collected dataset. However, it yields unsatisfactory performance if the
quality of the offline dataset is poor. In this paper, we consider an
offline-to-online setting where the agent is first trained on the offline
dataset and then trained online, and propose a framework called Adaptive Policy
Learning for effectively taking advantage of offline and online data.
Specifically, we explicitly consider the difference between the online and
offline data and apply an adaptive update scheme accordingly, that is, a
pessimistic update strategy for the offline dataset and an optimistic/greedy
update scheme for the online dataset. Such a simple and effective method
provides a way to mix offline and online RL and achieve the best of both
worlds. We further provide two detailed algorithms for implementing the
framework by embedding value-based or policy-based RL algorithms into it.
Finally, we conduct extensive experiments on popular continuous control tasks,
and the results show that our algorithm can learn the expert policy with high
sample efficiency even when the quality of the offline dataset is poor, e.g., a
random dataset. Comment: AAAI202
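The idea of treating offline and online samples differently can be sketched with a toy temporal-difference target. This is only an illustration under stated assumptions, not the algorithm from the abstract above: the function `td_target`, the `source` labels, and the fixed `penalty` used as the pessimistic correction are all hypothetical choices.

```python
def td_target(reward, next_q_values, gamma, source, penalty=1.0):
    """Bootstrapped target that depends on where the transition came from.

    source == "online":  greedy target, r + gamma * max_a Q(s', a).
    source == "offline": pessimistic target, where the bootstrapped value is
                         lowered by a fixed penalty (an assumed, simplistic
                         stand-in for a conservative update).
    """
    best_next = max(next_q_values)
    if source == "offline":
        best_next -= penalty  # distrust value estimates backed only by logged data
    elif source != "online":
        raise ValueError(f"unknown data source: {source!r}")
    return reward + gamma * best_next


# The same transition yields a lower target when it comes from the offline buffer.
online_target = td_target(1.0, [0.0, 2.0], gamma=0.5, source="online")
offline_target = td_target(1.0, [0.0, 2.0], gamma=0.5, source="offline")
print(online_target, offline_target)
```

Keeping the switch inside the target computation means a single replay buffer can hold both kinds of transitions, as long as each one is tagged with its source.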
Reinforcement Learning for Generative AI: A Survey
Deep generative AI has been a long-standing, essential topic in the machine
learning community, with impact on application areas such as text
generation and computer vision. The major paradigm for training a generative
model is maximum likelihood estimation, which pushes the learner to capture and
approximate the target data distribution by decreasing the divergence between
the model distribution and the target distribution. This formulation
successfully establishes the objective of generative tasks, but it cannot
satisfy all the requirements that a user might expect from a
generative model. Reinforcement learning, serving as a competitive option for
injecting new training signals by creating new objectives that exploit novel
signals, has demonstrated its power and flexibility to incorporate human
inductive bias from multiple angles, such as adversarial learning,
hand-designed rules, and learned reward models, to build a performant model.
Reinforcement learning has thereby become a trending research field and has
stretched the limits of generative AI in both model design and application. It
is reasonable to summarize the advances of recent years in a
comprehensive review. Although surveys of individual application areas have
appeared recently, this survey aims to provide a high-level review that spans a
range of application areas. We provide a rigorous taxonomy of this area and
cover a broad range of models and applications. Notably, we also
survey the fast-developing large language model area. We conclude this survey
by showing potential directions that might address the limits of current
models and expand the frontiers of generative AI.
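The link between maximum likelihood estimation and divergence minimization mentioned in the abstract above can be written out explicitly. The derivation below assumes the divergence in question is the Kullback-Leibler divergence from the data distribution to the model; the symbols p_data, p_theta, and theta* are notation introduced here for illustration.

```latex
\begin{aligned}
\mathrm{KL}\left(p_{\mathrm{data}} \,\|\, p_\theta\right)
  &= \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log p_{\mathrm{data}}(x)\right]
   - \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log p_\theta(x)\right], \\
\theta^{*}
  &= \arg\max_{\theta}\; \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log p_\theta(x)\right]
   = \arg\min_{\theta}\; \mathrm{KL}\left(p_{\mathrm{data}} \,\|\, p_\theta\right).
\end{aligned}
```

The second line follows because the first expectation in the KL divergence does not depend on theta, so maximizing the expected log-likelihood and minimizing the divergence select the same parameters.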