Learning in Reactive Environments with Arbitrary Dependence
In reinforcement learning the task for an agent is to attain the best possible
asymptotic reward where the true generating environment is unknown but belongs
to a known countable family of environments. This task generalises the sequence
prediction problem, in which the environment does not react to the behaviour of
the agent. Solomonoff induction solves the sequence prediction problem for any
countable class of measures; however, it is easy to see that such a result is
impossible for reinforcement learning: not every countable class of
environments can be learnt. We find some sufficient conditions on the class of
environments under which an agent exists which attains the best asymptotic
reward for any environment in the class. We analyze how tight these conditions
are and how they relate to different probabilistic assumptions known in
reinforcement learning and related fields, such as Markov Decision Processes
and mixing conditions.
On the Possibility of Learning in Reactive Environments with Arbitrary Dependence
We address the problem of reinforcement learning in which observations may
exhibit an arbitrary form of stochastic dependence on past observations and
actions, i.e. environments more general than (PO)MDPs. The task for an agent is
to attain the best possible asymptotic reward where the true generating
environment is unknown but belongs to a known countable family of environments.
We find some sufficient conditions on the class of environments under which an
agent exists which attains the best asymptotic reward for any environment in
the class. We analyze how tight these conditions are and how they relate to
different probabilistic assumptions known in reinforcement learning and related
fields, such as Markov Decision Processes and mixing conditions. Comment: 20 pages
Invariant Manifolds and Rate Constants in Driven Chemical Reactions
Reaction rates of chemical reactions under nonequilibrium conditions can be
determined through the construction of the normally hyperbolic invariant
manifold (NHIM) [and moving dividing surface (DS)] associated with the
transition state trajectory. Here, we extend our recent methods by constructing
points on the NHIM accurately even for multidimensional cases. We also advance
the implementation of machine learning approaches to construct smooth versions
of the NHIM from a known high-accuracy set of its points. That is, we expand on
our earlier use of neural nets, and introduce the use of Gaussian process
regression for the determination of the NHIM. Finally, we compare and contrast
all of these methods for a challenging two-dimensional model barrier case so as
to illustrate their accuracy and general applicability. Comment: 28 pages, 13 figures, table of contents figure
Security Evaluation of Support Vector Machines in Adversarial Environments
Support Vector Machines (SVMs) are among the most popular classification
techniques adopted in security applications like malware detection, intrusion
detection, and spam filtering. However, if SVMs are to be incorporated in
real-world security systems, they must be able to cope with attack patterns
that can either mislead the learning algorithm (poisoning), evade detection
(evasion), or gain information about their internal parameters (privacy
breaches). The main contributions of this chapter are twofold. First, we
introduce a formal general framework for the empirical evaluation of the
security of machine-learning systems. Second, according to our framework, we
demonstrate the feasibility of evasion, poisoning and privacy attacks against
SVMs in real-world security problems. For each attack technique, we evaluate
its impact and discuss whether (and how) it can be countered through an
adversary-aware design of SVMs. Our experiments are easily reproducible thanks
to open-source code that we have made available, together with all the employed
datasets, on a public repository. Comment: 47 pages, 9 figures; chapter accepted into book 'Support Vector
Machine Applications'
Evolutionary Tournament-Based Comparison of Learning and Non-Learning Algorithms for Iterated Games
Evolutionary tournaments have been used effectively as a tool for comparing game-playing algorithms. For instance, in the late 1970s, Axelrod organized tournaments to compare algorithms for playing the iterated prisoner's dilemma (PD) game. These tournaments capture the dynamics in a population of agents that periodically adopt relatively successful algorithms in the environment. While these tournaments have provided us with a better understanding of the relative merits of algorithms for iterated PD, our understanding is less clear about algorithms for playing iterated versions of arbitrary single-stage games in an environment of heterogeneous agents. While the Nash equilibrium solution concept has been used to recommend Nash equilibrium strategies for rational players playing general-sum games, learning algorithms like fictitious play may be preferred for playing against sub-rational players. In this paper, we study the relative performance of learning and non-learning algorithms in an evolutionary tournament where agents periodically adopt relatively successful algorithms in the population. The tournament is played over a testbed composed of all possible structurally distinct 2×2 conflicted games with ordinal payoffs: a baseline, neutral testbed for comparing algorithms. Before analyzing results from the evolutionary tournament, we discuss the testbed, our choice of representative learning and non-learning algorithms, and the relative rankings of these algorithms in a round-robin competition. The results from the tournament highlight the advantage of learning algorithms over players using static equilibrium strategies for repeated plays of arbitrary single-stage games. The results are likely to be of more benefit than static analysis of equilibrium strategies when choosing decision procedures for an open, adapting agent society consisting of a variety of competitors.
Keywords: Repeated Games, Evolution, Simulation
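As an illustration of the fictitious play learner mentioned in the abstract above, here is a minimal sketch. It assumes a row player in a 2×2 game who best-responds to the empirical frequency of the opponent's past actions. The function names, the payoff matrix, and the fixed opponent are hypothetical and are not taken from the paper's testbed.

```python
def fictitious_play_action(payoff, opponent_counts):
    """Best response to the empirical distribution of opponent actions.

    payoff[a][b] is the row player's payoff when it plays a and the
    opponent plays b (hypothetical ordinal payoffs).
    """
    total = sum(opponent_counts)
    if total == 0:
        # No observations yet: assume a uniform belief over opponent actions.
        beliefs = [1.0 / len(opponent_counts)] * len(opponent_counts)
    else:
        beliefs = [count / total for count in opponent_counts]
    expected = [sum(p * q for p, q in zip(row, beliefs)) for row in payoff]
    # Ties are broken in favour of the lowest-indexed action.
    return max(range(len(payoff)), key=lambda a: expected[a])


def play_against_fixed(payoff, opponent_action, rounds):
    """Run fictitious play against an opponent that always plays one action."""
    counts = [0] * len(payoff[0])
    history = []
    for _ in range(rounds):
        history.append(fictitious_play_action(payoff, counts))
        counts[opponent_action] += 1  # update beliefs after observing the opponent
    return history


# Hypothetical 2x2 game with ordinal payoffs 1..4 for the row player.
payoff = [[4, 1],
          [2, 3]]
history = play_against_fixed(payoff, opponent_action=1, rounds=4)
```

Against this static opponent the learner starts from the uniform belief and then switches to the best response once it has observed a single opponent move, which is the kind of adaptation a fixed equilibrium strategy cannot make.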
The Sample-Complexity of General Reinforcement Learning
We present a new algorithm for general reinforcement learning where the true
environment is known to belong to a finite class of N arbitrary models. The
algorithm is shown to be near-optimal for all but O(N log^2 N) time-steps with
high probability. Infinite classes are also considered where we show that
compactness is a key criterion for determining the existence of uniform
sample-complexity bounds. A matching lower bound is given for the finite case. Comment: 16 pages
Learning from Scarce Experience
Searching the space of policies directly for the optimal policy has been one
popular method for solving partially observable reinforcement learning
problems. Typically, with each change of the target policy, its value is
estimated from the results of following that very policy. This requires a large
number of interactions with the environment as different policies are
considered. We present a family of algorithms based on likelihood ratio
estimation that use data gathered when executing one policy (or collection of
policies) to estimate the value of a different policy. The algorithms combine
estimation and optimization stages. The former utilizes experience to build a
non-parametric representation of an optimized function. The latter performs
optimization on this estimate. We show positive empirical results and provide
the sample complexity bound. Comment: 8 pages, 4 figures
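The likelihood-ratio idea in the abstract above can be illustrated with a minimal sketch. It assumes a one-step problem in which each sample is an (action, reward) pair and both policies are known action distributions. This is the basic importance-sampling estimator, not the authors' family of algorithms, and all names and numbers here are hypothetical.

```python
import random


def importance_sampling_value(samples, behavior, target):
    """Estimate the target policy's expected reward from (action, reward)
    pairs that were collected while following the behavior policy."""
    total = 0.0
    for action, reward in samples:
        # The likelihood ratio reweights each sample toward the target policy.
        total += (target[action] / behavior[action]) * reward
    return total / len(samples)


def collect(behavior, reward_of, n, rng):
    """Draw n one-step interactions under the behavior policy."""
    actions = list(behavior)
    weights = [behavior[a] for a in actions]
    samples = []
    for _ in range(n):
        action = rng.choices(actions, weights=weights)[0]
        samples.append((action, reward_of[action]))
    return samples


# Hypothetical setup: data gathered under a uniform policy is reused to
# evaluate a different policy without any further interaction.
rng = random.Random(0)
behavior = {"left": 0.5, "right": 0.5}
target = {"left": 0.1, "right": 0.9}
reward_of = {"left": 0.0, "right": 1.0}

samples = collect(behavior, reward_of, 20000, rng)
estimate = importance_sampling_value(samples, behavior, target)
```

In this setup the target policy's true value is 0.9, so `estimate` should land close to that figure. The behavior policy must give nonzero probability to every action the target policy can take, otherwise the ratio is undefined.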
Optimistic Agents are Asymptotically Optimal
We use optimism to introduce generic asymptotically optimal reinforcement
learning agents. They achieve, with an arbitrary finite or compact class of
environments, asymptotically optimal behavior. Furthermore, in the finite
deterministic case we provide finite error bounds. Comment: 13 LaTeX pages
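The optimism principle for a finite deterministic class can be sketched roughly as follows. This is a strongly simplified illustration: it assumes stateless environments, each a hypothetical table mapping an action to a fixed reward, and it assumes the true environment is in the class. It is not the agent defined in the paper.

```python
def optimistic_agent(environments, true_env, steps):
    """Act optimistically over a finite class of deterministic hypotheses.

    environments: list of dicts mapping action -> reward (hypothetical,
    stateless). true_env must be one of them, so that at least one
    hypothesis always remains consistent.
    """
    consistent = list(environments)
    rewards = []
    for _ in range(steps):
        # Optimism: trust the surviving hypothesis that promises the most reward.
        best_env = max(consistent, key=lambda env: max(env.values()))
        action = max(best_env, key=best_env.get)
        reward = true_env[action]
        rewards.append(reward)
        # Discard every hypothesis contradicted by the observed reward.
        consistent = [env for env in consistent if env[action] == reward]
    return rewards


# Hypothetical class of three environments; the second one is the truth.
envs = [
    {"a": 1.0, "b": 0.0},
    {"a": 0.0, "b": 0.5},
    {"a": 0.0, "b": 0.2},
]
rewards = optimistic_agent(envs, true_env=envs[1], steps=3)
```

In this simplified setting each suboptimal step eliminates the hypothesis the agent was trusting, so the number of such steps is at most the class size minus one.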
Chasing Ghosts: Competing with Stateful Policies
We consider sequential decision making in a setting where regret is measured
with respect to a set of stateful reference policies, and feedback is limited
to observing the rewards of the actions performed (the so called "bandit"
setting). If either the reference policies are stateless rather than stateful,
or the feedback includes the rewards of all actions (the so called "expert"
setting), previous work shows that the optimal regret grows like
$\Theta(\sqrt{T})$ in terms of the number of decision rounds $T$.
The difficulty in our setting is that the decision maker unavoidably loses
track of the internal states of the reference policies, and thus cannot
reliably attribute rewards observed in a certain round to any of the reference
policies. In fact, in this setting it is impossible for the algorithm to
estimate which policy gives the highest (or even approximately highest) total
reward. Nevertheless, we design an algorithm that achieves expected regret that
is sublinear in $T$, of the form $O(T/\log^{1/4} T)$. Our algorithm is based
on a certain local repetition lemma that may be of independent interest. We
also show that no algorithm can guarantee expected regret better than
- …