Reward is enough for convex MDPs
Maximising a cumulative reward function that is Markov and stationary, i.e.,
defined over state-action pairs and independent of time, is sufficient to
capture many kinds of goals in a Markov decision process (MDP). However, not
all goals can be captured in this manner. In this paper we study convex MDPs in
which goals are expressed as convex functions of the stationary distribution
and show that, in general, they cannot be formulated using stationary reward functions.
Convex MDPs generalize the standard reinforcement learning (RL) problem
formulation to a larger framework that includes many supervised and
unsupervised RL problems, such as apprenticeship learning, constrained MDPs,
and so-called 'pure exploration'. Our approach is to reformulate the convex MDP
problem as a min-max game involving policy and cost (negative reward)
'players', using Fenchel duality. We propose a meta-algorithm for solving this
problem and show that it unifies many existing algorithms in the literature.
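To make the game concrete (notation mine, following the standard Fenchel-duality construction the abstract invokes): let d_pi be the stationary state-action distribution induced by policy pi, and let f be the convex objective. Expressing f through its convex conjugate f* gives the min-max reformulation:

```latex
\min_{\pi} f(d_{\pi})
  \;=\; \min_{\pi} \max_{\lambda} \Big( \langle \lambda, d_{\pi} \rangle - f^{*}(\lambda) \Big)
```

For any fixed lambda, the inner problem \min_{\pi} \langle \lambda, d_{\pi} \rangle is an ordinary MDP with stationary reward -lambda; so lambda plays the role of the cost player, pi the policy player, and any standard RL algorithm can serve as the policy player's best-response oracle.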
Successor Feature Sets: Generalizing Successor Representations Across Policies
Successor-style representations have many advantages for reinforcement
learning: for example, they can help an agent generalize from past experience
to new goals, and they have been proposed as explanations of behavioral and
neural data from human and animal learners. They also form a natural bridge
between model-based and model-free RL methods: like the former they make
predictions about future experiences, and like the latter they allow efficient
prediction of total discounted rewards. However, successor-style
representations are not optimized to generalize across policies: typically, we
maintain a limited-length list of policies, and share information among them by
representation learning or generalized policy improvement (GPI). Successor-style representations also typically
make no provision for gathering information or reasoning about latent
variables. To address these limitations, we bring together ideas from
predictive state representations, belief space value iteration, successor
features, and convex analysis: we develop a new, general successor-style
representation, together with a Bellman equation that connects multiple sources
of information within this representation, including different latent states,
policies, and reward functions. The new representation is highly expressive:
for example, it lets us efficiently read off an optimal policy for a new reward
function, or a policy that imitates a new demonstration. For this paper, we
focus on exact computation of the new representation in small, known
environments, since even this restricted setting offers plenty of interesting
questions. Our implementation does not scale to large, unknown environments --
nor would we expect it to, since it generalizes POMDP value iteration, which is
difficult to scale. However, we believe that future work will allow us to
extend our ideas to approximate reasoning in large, unknown environments.
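For readers unfamiliar with the successor-feature machinery this abstract builds on, here is a minimal tabular sketch (function names, array shapes, and the linear-reward assumption r = phi @ w are illustrative, not the paper's API). Given successor features psi for a fixed policy, the value of that policy under any new linear reward is a dot product, which is what lets one "read off" a policy for a new reward function:

```python
import numpy as np

def successor_features(P, phi, policy, gamma=0.99):
    """Tabular successor features for a fixed policy.

    P      : (S, A, S) transition probabilities
    phi    : (S, A, d) state-action features
    policy : (S, A) action probabilities
    Returns psi : (S, A, d) satisfying psi = phi + gamma * E[psi(s', a')].
    """
    S, A, d = phi.shape
    # M[(s,a), (s',a')] = P(s'|s,a) * policy(a'|s')
    M = (P[:, :, :, None] * policy[None, None, :, :]).reshape(S * A, S * A)
    psi = np.linalg.solve(np.eye(S * A) - gamma * M, phi.reshape(S * A, d))
    return psi.reshape(S, A, d)

def gpi_policy(psis, w):
    """Generalized policy improvement: act greedily w.r.t. the best stored
    policy's value under a *new* linear reward r(s, a) = phi(s, a) @ w."""
    q = np.max([psi @ w for psi in psis], axis=0)  # (S, A)
    return q.argmax(axis=1)                        # greedy action per state
```

With a library of psi's in hand, evaluating a brand-new reward weight w costs only a few dot products; the representation developed in the paper generalizes this beyond a fixed list of policies.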
Provably Efficient Maximum Entropy Exploration
Suppose an agent is in a (possibly unknown) Markov decision process in the
absence of a reward signal: what might we hope that the agent can efficiently
learn to do? This work studies a broad class of objectives that are defined
solely as functions of the state-visitation frequencies that are induced by how
the agent behaves. For example, one natural, intrinsically defined, objective
problem is for the agent to learn a policy which induces a distribution over
state space that is as uniform as possible, which can be measured in an
entropic sense. We provide an efficient algorithm to optimize such such
intrinsically defined objectives, when given access to a black box planning
oracle (which is robust to function approximation). Furthermore, when
restricted to the tabular setting where we have sample based access to the MDP,
our proposed algorithm is provably efficient, both in terms of its sample and
computational complexities. Key to our algorithmic methodology is the conditional gradient method (a.k.a. the Frank-Wolfe algorithm), which calls an approximate MDP solver.
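A minimal tabular sketch of the conditional-gradient loop the abstract describes (the `plan` oracle, function names, and the exact step-size schedule are my assumptions). The gradient of the entropy H(d) = -sum_s d(s) log d(s) points along the synthetic reward r(s) = -log d(s), so each Frank-Wolfe step plans in the MDP with that reward and mixes the returned policy into the current mixture:

```python
import numpy as np

def state_distribution(P, policy, gamma, s0):
    """Discounted state-visitation distribution of a tabular policy.
    P: (S, A, S) transitions, policy: (S, A), s0: initial state index."""
    S = P.shape[0]
    P_pi = np.einsum('sat,sa->st', P, policy)  # (S, S) state-to-state kernel
    mu0 = np.zeros(S); mu0[s0] = 1.0
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu0)
    return d / d.sum()

def max_entropy_exploration(P, plan, gamma=0.9, s0=0, iters=50):
    """Frank-Wolfe loop. `plan(reward)` is the (approximate) MDP-solver
    oracle: it returns an (S, A) policy maximizing the given state reward."""
    S, A, _ = P.shape
    policies, weights = [np.full((S, A), 1.0 / A)], [1.0]
    for t in range(1, iters + 1):
        # State distribution of the current mixture of policies.
        d = sum(w * state_distribution(P, pi, gamma, s0)
                for w, pi in zip(weights, policies))
        # Entropy gradient: linearize H(d) via the reward r(s) = -log d(s).
        reward = -np.log(d + 1e-8)
        pi_new = plan(reward)
        eta = 2.0 / (t + 2)  # standard Frank-Wolfe step size
        weights = [w * (1 - eta) for w in weights] + [eta]
        policies.append(pi_new)
    return policies, weights
```

The mixture's state distribution is the weighted average of the component policies' distributions, which is why maintaining a growing list of planned policies with Frank-Wolfe weights suffices.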
What is the Mathematics in Mathematics Education?
In this paper I tackle the question 'What is the mathematics in mathematics education?' by providing three different frames for the word mathematics. Frame 1: mathematics as an abstract body of knowledge/ideas, the organization of that knowledge into systems and structures, and a set of methods for reaching conclusions. Frame 2: mathematics as contextual, ever present, a lens or language for making sense of the world. Frame 3: mathematics as a verb (not a noun), a human activity, part of one's identity.
After introducing the frames and examining their distinctions and overlaps, I discuss their implications with respect to student-centered classrooms, context, and culture.