227 research outputs found
Diverse Exploration via Conjugate Policies for Policy Gradient Methods
We address the challenge of effective exploration while maintaining good
performance in policy gradient methods. As a solution, we propose diverse
exploration (DE) via conjugate policies. DE learns and deploys a set of
conjugate policies which can be conveniently generated as a byproduct of
conjugate gradient descent. We provide both theoretical and empirical results
showing the effectiveness of DE at achieving exploration, improving policy
performance, and the advantage of DE over exploration by random policy
perturbations.Comment: AAAI 201
Re-examining assumptions in fair and unbiased learning to rank
In this thesis, we re-examine the assumptions of existing methods for bias correction and fairness optimization in ranking. Consequently, we propose methods that are more general than the existing ones, in the sense that they rely on less assumptions, or they are applicable in more situations. On the bias side, we first show that the click model assumption matters and propose cascade model-based inverse propensity scoring (IPS). Next, we prove that the unbiasedness of IPS relies on the assumption that the clicks do not suffer from trust bias. When trust bias exists, we extend IPS and propose the affine correction (AC) method and prove that, in contrast to IPS, it gives unbiased estimates of the relevance. Finally, we show that the unbiasedness proofs of IPS and AC are conditioned on an accurate estimation of the bias parameters, and propose a bias correction method that does not rely on relevance estimation. On the fairness side, we re-examine the implicit assumption that fair distribution of exposure leads to fair treatment by the users. We argue that fairness of exposure is necessary but not enough for a fair treatment and propose a correction method for this type of bias. Finally, we notice that the existing general post-processing framework for optimizing fairness of ranking metrics is based on the Plackett-Luce distribution, the optimization of which has room for improvement for queries with a small number of repeating sessions. To close this gap, we propose a new permutation distribution based on permutation graphs
Information Theory and Machine Learning
The recent successes of machine learning, especially regarding systems based on deep neural networks, have encouraged further research activities and raised a new set of challenges in understanding and designing complex machine learning algorithms. New applications require learning algorithms to be distributed, have transferable learning results, use computation resources efficiently, convergence quickly on online settings, have performance guarantees, satisfy fairness or privacy constraints, incorporate domain knowledge on model structures, etc. A new wave of developments in statistical learning theory and information theory has set out to address these challenges. This Special Issue, "Machine Learning and Information Theory", aims to collect recent results in this direction reflecting a diverse spectrum of visions and efforts to extend conventional theories and develop analysis tools for these complex machine learning systems
Policy-Gradient Algorithms for Partially Observable Markov Decision Processes
Partially observable Markov decision processes are interesting because of their ability to model most conceivable real-world learning problems, for example, robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable. Such computational complexity motivates approximate approaches. One such class of algorithms are the so-called policy-gradient methods from reinforcement learning. They seek to adjust the parameters of an agent in the direction that maximises the long-term average of a reward signal. Policy-gradient methods are attractive as a \emph{scalable} approach for controlling partially observable Markov decision processes (POMDPs). In the most general case POMDP policies require some form of internal state, or memory, in order to act optimally. Policy-gradient methods have shown promise for problems admitting memory-less policies but have been less successful when memory is required. This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting. Directly, when the dynamics of the world are known, and via Monte-Carlo methods otherwise. The algorithms simultaneously learn how to act and what to remember. ..
Policy-Gradient Algorithms for Partially Observable Markov Decision Processes
Partially observable Markov decision processes are interesting because of their ability to model most conceivable real-world learning problems, for example, robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable. Such computational complexity motivates approximate approaches. One such class of algorithms are the so-called policy-gradient methods from reinforcement learning. They seek to adjust the parameters of an agent in the direction that maximises the long-term average of a reward signal. Policy-gradient methods are attractive as a \emph{scalable} approach for controlling partially observable Markov decision processes (POMDPs). In the most general case POMDP policies require some form of internal state, or memory, in order to act optimally. Policy-gradient methods have shown promise for problems admitting memory-less policies but have been less successful when memory is required. This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting. Directly, when the dynamics of the world are known, and via Monte-Carlo methods otherwise. The algorithms simultaneously learn how to act and what to remember. ..
- …