Deep Reinforcement Learning in Parameterized Action Space
Recent work has shown that deep neural networks are capable of approximating
both value functions and policies in reinforcement learning domains featuring
continuous state and action spaces. However, to the best of our knowledge no
previous work has succeeded at using deep neural networks in structured
(parameterized) continuous action spaces. To fill this gap, this paper focuses
on learning within the domain of simulated RoboCup soccer, which features a
small set of discrete action types, each of which is parameterized with
continuous variables. The best learned agent can score goals more reliably than
the 2012 RoboCup champion agent. As such, this paper represents a successful
extension of deep reinforcement learning to the class of parameterized action
space MDPs.
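To make the action structure concrete, here is a minimal sketch (in PyTorch, which is an assumption rather than the paper's stated framework) of an actor for a parameterized action space: it scores each discrete action type and emits a continuous parameter vector for every type, and the executed action is the chosen type together with that type's parameters. The action names and sizes are illustrative, loosely modeled on RoboCup-style commands, and are not taken from the paper.

```python
# A minimal, illustrative actor for a parameterized action space: one score per
# discrete action type plus a continuous parameter head per type. Not the paper's
# exact architecture; names and dimensions are placeholders.
import torch
import torch.nn as nn

class ParameterizedActor(nn.Module):
    def __init__(self, state_dim, action_param_sizes):
        super().__init__()
        self.action_param_sizes = action_param_sizes   # e.g. {"dash": 2, "turn": 1, "kick": 2}
        self.body = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                  nn.Linear(128, 128), nn.ReLU())
        self.type_head = nn.Linear(128, len(action_param_sizes))   # discrete action-type scores
        self.param_heads = nn.ModuleDict({
            name: nn.Linear(128, size) for name, size in action_param_sizes.items()
        })                                                          # continuous parameters per type

    def forward(self, state):
        h = self.body(state)
        type_logits = self.type_head(h)
        params = {name: torch.tanh(head(h)) for name, head in self.param_heads.items()}
        return type_logits, params

# Example: pick the highest-scoring action type and its parameters for one state.
actor = ParameterizedActor(state_dim=58, action_param_sizes={"dash": 2, "turn": 1, "kick": 2})
logits, params = actor(torch.randn(1, 58))
chosen = list(actor.action_param_sizes)[logits.argmax(dim=-1).item()]
print(chosen, params[chosen])
```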
Deep Recurrent Q-Learning for Partially Observable MDPs
Deep Reinforcement Learning has yielded proficient controllers for complex
tasks. However, these controllers have limited memory and rely on being able to
perceive the complete game screen at each decision point. To address these
shortcomings, this article investigates the effects of adding recurrency to a
Deep Q-Network (DQN) by replacing the first post-convolutional fully-connected
layer with a recurrent LSTM. The resulting Deep Recurrent Q-Network
(DRQN), although capable of seeing only a single frame at each timestep,
successfully integrates information through time and replicates DQN's
performance on standard Atari games and partially observed equivalents
featuring flickering game screens. Additionally, when trained with partial
observations and evaluated with incrementally more complete observations,
DRQN's performance scales as a function of observability. Conversely, when
trained with full observations and evaluated with partial observations, DRQN's
performance degrades less than DQN's. Thus, given the same length of history,
recurrency is a viable alternative to stacking a history of frames in the DQN's
input layer. While recurrency confers no systematic advantage when learning
to play the game, the recurrent net can better adapt at evaluation time if the
quality of observations changes.
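The architectural change is small enough to sketch. Below is a minimal, illustrative DRQN in PyTorch (an assumed framework): the standard DQN convolutional stack processes a single frame per timestep, and an LSTM stands in for the first post-convolutional fully-connected layer before the Q-value head. Layer sizes follow the usual DQN convention but are otherwise placeholders.

```python
# A minimal DRQN sketch: DQN-style convolutions over one frame per timestep,
# then an LSTM in place of the first fully-connected layer, then a Q-value head.
import torch
import torch.nn as nn

class DRQN(nn.Module):
    def __init__(self, n_actions, hidden_size=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64 * 7 * 7, hidden_size, batch_first=True)  # recurrent layer replacing the first FC layer
        self.q_head = nn.Linear(hidden_size, n_actions)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 1, 84, 84) -- one frame per timestep, not a stacked history
        b, t = frames.shape[:2]
        feats = self.conv(frames.reshape(b * t, *frames.shape[2:])).reshape(b, t, -1)
        out, hidden = self.lstm(feats, hidden)
        return self.q_head(out), hidden            # Q-values for every timestep, plus the LSTM state

q_values, state = DRQN(n_actions=18)(torch.randn(2, 4, 1, 84, 84))
print(q_values.shape)   # (2, 4, 18)
```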
Multi-Preference Actor Critic
Policy gradient algorithms typically combine discounted future rewards with
an estimated value function, to compute the direction and magnitude of
parameter updates. However, for most Reinforcement Learning tasks, humans can
provide additional insight to constrain the policy learning. We introduce a
general method to incorporate multiple different feedback channels into a
single policy gradient loss. In our formulation, the Multi-Preference Actor
Critic (M-PAC), these different types of feedback are implemented as
constraints on the policy. We use a Lagrangian relaxation to satisfy these
constraints using gradient descent while learning a policy that maximizes
rewards. Experiments in Atari and Pendulum verify that constraints are being
respected and can accelerate the learning process.
Comment: NeurIPS Workshop on Deep RL, 201
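As a rough illustration of the Lagrangian relaxation, the sketch below treats each feedback channel as a constraint term weighted by a non-negative multiplier, and pairs gradient descent on the policy parameters with gradient ascent on the multipliers. The constraint functions, thresholds, and optimizer settings are placeholders, not the paper's actual feedback channels.

```python
# Lagrangian-relaxation sketch: minimize the policy loss plus multiplier-weighted
# constraint violations over the policy, while ascending in the multipliers.
import torch

log_lambdas = torch.zeros(2, requires_grad=True)        # one multiplier per constraint, kept positive via exp
policy_params = torch.randn(10, requires_grad=True)
opt_policy = torch.optim.Adam([policy_params], lr=1e-3)
opt_lambda = torch.optim.Adam([log_lambdas], lr=1e-2)

def policy_loss(p):                    # placeholder for the usual actor-critic loss
    return (p ** 2).mean()

def constraint_violations(p):          # placeholders: each entry should be <= 0 when satisfied
    return torch.stack([p.abs().mean() - 0.5, p.sum() - 1.0])

for step in range(1000):
    lambdas = log_lambdas.exp()
    lagrangian = policy_loss(policy_params) + (lambdas * constraint_violations(policy_params)).sum()
    opt_policy.zero_grad()
    opt_lambda.zero_grad()
    lagrangian.backward()
    opt_policy.step()                  # descent step on the policy parameters
    log_lambdas.grad.neg_()            # flip the sign so the multiplier step is ascent
    opt_lambda.step()
```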
Graph Constrained Reinforcement Learning for Natural Language Action Spaces
Interactive Fiction games are text-based simulations in which an agent
interacts with the world purely through natural language. They are ideal
environments for studying how to extend reinforcement learning agents to meet
the challenges of natural language understanding, partial observability, and
action generation in combinatorially-large text-based action spaces. We present
KG-A2C, an agent that builds a dynamic knowledge graph while exploring and
generates actions using a template-based action space. We contend that the dual
uses of the knowledge graph to reason about game state and to constrain natural
language generation are the keys to scalable exploration of combinatorially
large natural language actions. Results across a wide variety of IF games show
that KG-A2C outperforms current IF agents despite the exponential increase in
action space size.
Comment: Accepted to ICLR 202
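A minimal sketch of the template-based action space the abstract describes: actions are verb templates with object slots, and the candidate fills are restricted to entities the knowledge graph currently contains, which keeps the otherwise combinatorial action set tractable. The templates and graph contents below are invented for illustration.

```python
# Graph-constrained template actions: enumerate verb templates whose slots are
# filled only with entities the agent's knowledge graph currently knows about.
from itertools import product

templates = ["take {}", "open {}", "put {} in {}", "go {}"]
kg_entities = {"lamp", "mailbox", "leaflet", "north"}        # objects the knowledge graph says exist right now

def candidate_actions(templates, entities):
    """Enumerate every template with its slots filled by known entities."""
    actions = []
    for t in templates:
        n_slots = t.count("{}")
        for fill in product(entities, repeat=n_slots):
            if len(set(fill)) == n_slots:                    # skip "put lamp in lamp"-style fills
                actions.append(t.format(*fill))
    return actions

print(len(candidate_actions(templates, kg_entities)))        # constrained set is small enough to score
```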
ScriptNet: Neural Static Analysis for Malicious JavaScript Detection
Malicious scripts are an important computer infection threat vector in the
wild. For web-scale processing, static analysis offers substantial computing
efficiencies. We propose the ScriptNet system for neural malicious JavaScript
detection which is based on static analysis. We use the Convoluted Partitioning
of Long Sequences (CPoLS) model, which processes JavaScript files as byte
sequences. Lower layers capture the sequential nature of these byte sequences
while higher layers classify the resulting embedding as malicious or benign.
Unlike previously proposed solutions, our model variants are trained in an
end-to-end fashion allowing discriminative training even for the sequential
processing layers. Evaluating this model on a large corpus of 212,408
JavaScript files indicates that the best-performing CPoLS model offers a 97.20%
true positive rate (TPR) for the first 60K-byte subsequence at a false positive
rate (FPR) of 0.50%, and it significantly outperforms several baseline models.
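The sketch below illustrates the general pattern the abstract describes, trained end to end: an embedding and recurrent layer model the raw byte sequence, and a small classifier maps the pooled embedding to a malicious/benign score. It is not the paper's exact CPoLS architecture; the layer sizes, pooling scheme, and input length are placeholders.

```python
# Byte-sequence classifier sketch: embed raw bytes, model them sequentially,
# pool, and classify the file as malicious or benign, trained end to end.
import torch
import torch.nn as nn

class ByteSequenceClassifier(nn.Module):
    def __init__(self, embed_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(256, embed_dim)            # one embedding per possible byte value
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.classifier = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, byte_ids):
        # byte_ids: (batch, length) integer tensor of raw JavaScript bytes
        h, _ = self.lstm(self.embed(byte_ids))
        pooled = h.max(dim=1).values                         # pool over the sequence before classifying
        return self.classifier(pooled).squeeze(-1)           # one malicious/benign logit per file

logits = ByteSequenceClassifier()(torch.randint(0, 256, (4, 1024)))
print(torch.sigmoid(logits))                                 # probabilities of "malicious"
```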
Leveraging Grammar and Reinforcement Learning for Neural Program Synthesis
Program synthesis is the task of automatically generating a program
consistent with a specification. Recent years have seen the proposal of a number of
neural approaches for program synthesis, many of which adopt a sequence
generation paradigm similar to neural machine translation, in which
sequence-to-sequence models are trained to maximize the likelihood of known
reference programs. While achieving impressive results, this strategy has two
key limitations. First, it ignores Program Aliasing: the fact that many
different programs may satisfy a given specification (especially with
incomplete specifications such as a few input-output examples). By maximizing
the likelihood of only a single reference program, it penalizes many
semantically correct programs, which can adversely affect the synthesizer's
performance. Second, this strategy overlooks the fact that programs have a
strict syntax that can be efficiently checked. To address the first limitation,
we perform reinforcement learning on top of a supervised model with an
objective that explicitly maximizes the likelihood of generating semantically
correct programs. To address the second limitation, we introduce a training
procedure that directly maximizes the probability of generating syntactically
correct programs that fulfill the specification. We show that our contributions
lead to improved accuracy of the models, especially in cases where the training
data is limited.
Comment: ICLR 201
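As a rough sketch of the RL objective on top of the supervised model: sample candidate programs, reward any sample that satisfies the specification's I/O examples (rather than only the single reference program), and apply a REINFORCE-style update. The model.sample and execute interfaces below are assumed placeholders, not a real API.

```python
# REINFORCE-style fine-tuning sketch that rewards semantic correctness: any
# sampled program consistent with all I/O examples gets reward 1, others 0.
import torch

def rl_finetune_step(model, optimizer, spec, execute, n_samples=8):
    """One policy-gradient step over sampled programs for a single specification."""
    programs, log_probs = model.sample(spec, n_samples)       # assumed: programs plus their log-likelihoods
    rewards = torch.tensor([
        float(all(execute(p, x) == y for x, y in spec["examples"]))   # 1.0 iff the program fits every example
        for p in programs
    ])
    baseline = rewards.mean()                                  # simple variance-reducing baseline
    loss = -((rewards - baseline) * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()
```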
Learning Calibratable Policies using Programmatic Style-Consistency
We study the important and challenging problem of controllable generation of
long-term sequential behaviors. Solutions to this problem would impact many
applications, such as calibrating behaviors of AI agents in games or predicting
player trajectories in sports. In contrast to the well-studied areas of
controllable generation of images, text, and speech, there are significant
challenges that are unique to or exacerbated by generating long-term behaviors:
how should we specify the factors of variation to control, and how can we
ensure that the generated temporal behavior faithfully demonstrates diverse
styles? In this paper, we leverage large amounts of raw behavioral data to
learn policies that can be calibrated to generate a diverse range of behavior
styles (e.g., aggressive versus passive play in sports). Inspired by recent
work on leveraging programmatic labeling functions, we present a novel
framework that combines imitation learning with data programming to learn
style-calibratable policies. Our primary technical contribution is a formal
notion of style-consistency as a learning objective, and its integration with
conventional imitation learning approaches. We evaluate our framework using
demonstrations from professional basketball players and agents in the MuJoCo
physics environment, and show that our learned policies can be accurately
calibrated to generate interesting behavior styles in both domains.
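The core idea of style-consistency can be illustrated with a toy programmatic labeling function: a hand-written rule assigns a style label to a trajectory, and a calibrated policy is judged by how often trajectories generated under a requested label actually receive that label. Everything below, including the labeling rule and trajectories, is an invented placeholder rather than the paper's setup.

```python
# Toy programmatic labeling function and a style-consistency check over
# generated trajectories; the rule and data are illustrative placeholders.
import numpy as np

def speed_label(trajectory, threshold=1.0):
    """Label a trajectory by its average step size (a stand-in labeling function)."""
    steps = np.linalg.norm(np.diff(trajectory, axis=0), axis=1)
    return "aggressive" if steps.mean() > threshold else "passive"

def style_consistency(trajectories, requested_labels, label_fn):
    """Fraction of trajectories whose programmatic label matches the requested style."""
    matches = [label_fn(traj) == lab for traj, lab in zip(trajectories, requested_labels)]
    return sum(matches) / len(matches)

# Usage: two toy 2-D trajectories, one fast and one slow.
fast = np.cumsum(np.ones((10, 2)) * 2.0, axis=0)
slow = np.cumsum(np.ones((10, 2)) * 0.1, axis=0)
print(style_consistency([fast, slow], ["aggressive", "passive"], speed_label))   # 1.0
```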
Neural Program Meta-Induction
Most recently proposed methods for Neural Program Induction work under the
assumption of having a large set of input/output (I/O) examples for learning
any underlying input-output mapping. This paper aims to address the problem of
data and computation efficiency of program induction by leveraging information
from related tasks. Specifically, we propose two approaches for cross-task
knowledge transfer to improve program induction in limited-data scenarios. In
our first proposal, portfolio adaptation, a set of induction models is
pretrained on a set of related tasks, and the best model is adapted towards the
new task using transfer learning. In our second approach, meta program
induction, a k-shot learning approach is used to make a model generalize to
new tasks without additional training. To test the efficacy of our methods, we
constructed a new benchmark of programs written in the Karel programming
language. Using an extensive experimental evaluation on the Karel benchmark, we
demonstrate that our proposals dramatically outperform the baseline induction
method that does not use knowledge transfer. We also analyze the relative
performance of the two approaches and study conditions in which they perform
best. In particular, meta induction outperforms all existing approaches under
extreme data sparsity (when a very small number of examples are available),
i.e., fewer than ten. As the number of available I/O examples increases (i.e., a
thousand or more), portfolio adapted program induction becomes the best
approach. For intermediate data sizes, we demonstrate that the combined method
of adapted meta program induction has the strongest performance.
Comment: 8 Pages + 1 page appendix
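A toy sketch of the portfolio-adaptation idea: given models pretrained on related tasks, score each on the new task's handful of I/O examples, keep the best, and adapt it. The callables and the no-op fine-tuning step below are placeholders standing in for the paper's induction models.

```python
# Portfolio adaptation sketch: select the best-transferring pretrained model on
# the new task's few examples, then fine-tune it (fine-tuning is a no-op here).
def portfolio_adapt(pretrained, new_examples, finetune):
    """pretrained: list of callables mapping input -> output; new_examples: list of (x, y)."""
    def accuracy(model):
        return sum(model(x) == y for x, y in new_examples) / len(new_examples)
    best = max(pretrained, key=accuracy)          # pick the related-task model that transfers best
    return finetune(best, new_examples)           # then adapt it to the new task

# Toy usage: "models" are simple arithmetic functions standing in for induction models.
models = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
examples = [(2, 4), (5, 10), (7, 14)]
adapted = portfolio_adapt(models, examples, finetune=lambda m, ex: m)
print(adapted(9))   # 18 -- the doubling model was selected
```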
Reading and Acting while Blindfolded: The Need for Semantics in Text Game Agents
Text-based games simulate worlds and interact with players using natural
language. Recent work has used them as a testbed for autonomous
language-understanding agents, motivated by the idea that understanding the
meanings of words, or semantics, is a key component of how humans understand,
reason, and act in these worlds. However, it remains unclear to what extent
artificial agents utilize semantic understanding of the text. To this end, we
perform experiments to systematically reduce the amount of semantic information
available to a learning agent. Surprisingly, we find that an agent is capable
of achieving high scores even in the complete absence of language semantics,
indicating that the currently popular experimental setup and models may be
poorly designed to understand and leverage game texts. To remedy this
deficiency, we propose an inverse dynamics decoder to regularize the
representation space and encourage exploration, which shows improved
performance on several games including Zork I. We discuss the implications of
our findings for designing future agents with stronger semantic understanding.
Comment: NAACL 202
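The proposed regularizer can be sketched as a standard inverse-dynamics auxiliary objective: a decoder predicts which action was taken from the representations of consecutive observations, and its loss is added to the agent's main objective to shape the representation space. The simplified discrete-action version below (in PyTorch, with placeholder sizes) is for illustration; in the text-game setting the actions are themselves natural language.

```python
# Inverse-dynamics auxiliary loss sketch: predict the action taken between
# consecutive observation representations and add the loss to the RL objective.
import torch
import torch.nn as nn

class InverseDynamicsDecoder(nn.Module):
    def __init__(self, repr_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * repr_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))

    def forward(self, obs_repr, next_obs_repr):
        return self.net(torch.cat([obs_repr, next_obs_repr], dim=-1))   # logits over the action taken

decoder = InverseDynamicsDecoder(repr_dim=64, n_actions=10)
obs_t, obs_tp1 = torch.randn(32, 64), torch.randn(32, 64)
actions = torch.randint(0, 10, (32,))
aux_loss = nn.functional.cross_entropy(decoder(obs_t, obs_tp1), actions)   # added to the main RL loss
print(aux_loss.item())
```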
Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents
The Arcade Learning Environment (ALE) is an evaluation platform that poses
the challenge of building AI agents with general competency across dozens of
Atari 2600 games. It supports a variety of different problem settings and it
has been receiving increasing attention from the scientific community, leading
to some high-profile success stories such as the much publicized Deep
Q-Networks (DQN). In this article, we take a big-picture look at how the ALE is
being used by the research community. We show how diverse the evaluation
methodologies in the ALE have become with time, and highlight some key concerns
when evaluating agents in the ALE. We use this discussion to present some
methodological best practices and provide new benchmark results using these
best practices. To further the progress in the field, we introduce a new
version of the ALE that supports multiple game modes and provides a form of
stochasticity we call sticky actions. We conclude this big-picture look by
revisiting the challenges posed when the ALE was introduced, summarizing the
state of the art on various problems, and highlighting problems that remain
open.
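Sticky actions are simple enough to sketch: at every step the environment repeats the agent's previous action with some probability instead of the newly chosen one, which injects stochasticity and discourages open-loop memorization. The Gym-style wrapper below is an illustration, with the repeat probability left as a free parameter.

```python
# Sticky-actions wrapper sketch: with probability repeat_prob, ignore the new
# action and repeat the previous one before stepping the wrapped environment.
import random

class StickyActions:
    def __init__(self, env, repeat_prob=0.25):
        self.env = env
        self.repeat_prob = repeat_prob
        self.prev_action = None

    def reset(self):
        self.prev_action = None
        return self.env.reset()

    def step(self, action):
        if self.prev_action is not None and random.random() < self.repeat_prob:
            action = self.prev_action            # the environment "sticks" to the old action
        self.prev_action = action
        return self.env.step(action)
```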