352 research outputs found
Online Optimization of Teaching Sequences with Multi-Armed Bandits
International audienceWe present an approach to Intelligent Tutoring Systems which adaptively personalizes sequences of learning activities to maximize skills acquired by each student, taking into account limited time and motivational resources. At a given point in time, the system tries to propose to the student the activity which makes him progress best. We introduce two algorithms that rely on the empirical estimation of the learning progress, one that uses information about the difficulty of each exercise RiARiT and another that does not use any knowledge about the problem ZPDES
Multi-Armed Bandits for Intelligent Tutoring Systems
We present an approach to Intelligent Tutoring Systems which adaptively
personalizes sequences of learning activities to maximize skills acquired by
students, taking into account the limited time and motivational resources. At a
given point in time, the system proposes to the students the activity which
makes them progress faster. We introduce two algorithms that rely on the
empirical estimation of the learning progress, RiARiT that uses information
about the difficulty of each exercise and ZPDES that uses much less knowledge
about the problem.
The system is based on the combination of three approaches. First, it
leverages recent models of intrinsically motivated learning by transposing them
to active teaching, relying on empirical estimation of learning progress
provided by specific activities to particular students. Second, it uses
state-of-the-art Multi-Arm Bandit (MAB) techniques to efficiently manage the
exploration/exploitation challenge of this optimization process. Third, it
leverages expert knowledge to constrain and bootstrap initial exploration of
the MAB, while requiring only coarse guidance information of the expert and
allowing the system to deal with didactic gaps in its knowledge. The system is
evaluated in a scenario where 7-8 year old schoolchildren learn how to
decompose numbers while manipulating money. Systematic experiments are
presented with simulated students, followed by results of a user study across a
population of 400 school children
A Methodology for Discovering how to Adaptively Personalize to Users using Experimental Comparisons
We explain and provide examples of a formalism that supports the methodology
of discovering how to adapt and personalize technology by combining randomized
experiments with variables associated with user models. We characterize a
formal relationship between the use of technology to conduct A/B experiments
and use of technology for adaptive personalization. The MOOClet Formalism [11]
captures the equivalence between experimentation and personalization in its
conceptualization of modular components of a technology. This motivates a
unified software design pattern that enables technology components that can be
compared in an experiment to also be adapted based on contextual data, or
personalized based on user characteristics. With the aid of a concrete use
case, we illustrate the potential of the MOOClet formalism for a methodology
that uses randomized experiments of alternative micro-designs to discover how
to adapt technology based on user characteristics, and then dynamically
implements these personalized improvements in real time
New Models qnd Algorithms for Bandits and Markets
Inspired by advertising markets, we consider large-scale sequential decision making problems in which a learner must deploy an algorithm to behave optimally under uncertainty. Although many of these problems can be modeled as contextual bandit problems, we argue that the tools and techniques for analyzing bandit problems with large numbers of actions and contexts can be greatly expanded. While convexity and metric-similarity assumptions on the process generating rewards have yielded some algorithms in existing literature, certain types of assumptions that have been fruitful in offline supervised learning settings have yet to even be considered. Notably missing, for example, is any kind of graphical model approach to assuming structured rewards, despite the success such assumptions have
achieved in inducing scalable learning and inference with high-dimensional distributions. Similarly, we observe that there are countless tools for understanding the relationship between a choice of model class in supervised learning, and the generalization error of the best fit from that class, such as the celebrated VC-theory. However, an analogous notion of dimensionality, which relates a generic structural assumption on rewards to regret rates in an online optimization problem, is not fully developed. The primary goal of this dissertation, therefore, will be to fill out the space of models, algorithms, and assumptions used in sequential decision making problems. Toward this end, we will develop a theory for bandit problems with structured rewards that permit a graphical model representation. We will give an efficient algorithm for regret-minimization in such a setting, and along the way will develop a deeper connection between online supervised learning and regret-minimization. This dissertation will also introduce a complexity measure for generic structural assumptions on reward functions, which we call the Haystack Dimension. We will prove that the Haystack Dimension characterizes the optimal rates achievable up to log factors. Finally, we will describe more application-oriented techniques for solving problems in advertising markets, which again demonstrate how methods from traditional disciplines, such as statistical survival analysis, can be leveraged to design novel algorithms for optimization in markets
Tutoring Students with Adaptive Strategies
Adaptive learning is a crucial part in intelligent tutoring systems. It provides students with appropriate tutoring interventions, based on students’ characteristics, status, and other related features, in order to optimize their learning outcomes. It is required to determine students’ knowledge level or learning progress, based on which it then uses proper techniques to choose the optimal interventions. In this dissertation work, I focus on these aspects related to the process in adaptive learning: student modeling, k-armed bandits, and contextual bandits. Student modeling. The main objective of student modeling is to develop cognitive models of students, including modeling content skills and knowledge about learning. In this work, we investigate the effect of prerequisite skill in predicting students’ knowledge in post skills, and we make use of the prerequisite performance in different student models. As a result, this makes them superior to traditional models. K-armed bandits. We apply k-armed bandit algorithms to personalize interventions for students, to optimize their learning outcomes. Due to the lack of diverse interventions and small difference of intervention effectiveness in educational experiments, we also propose a simple selection strategy, and compare it with several k-armed bandit algorithms. Contextual bandits. In contextual bandit problem, additional side information, also called context, can be used to determine which action to select. First, we construct a feature evaluation mechanism, which determines which feature to be combined with bandits. Second, we propose a new decision tree algorithm, which is capable of detecting aptitude treatment effect for students. Third, with combined bandits with the decision tree, we apply the contextual bandits to make personalization in two different types of data, simulated data and real experimental data
Robust Bandit Learning with Imperfect Context
A standard assumption in contextual multi-arm bandit is that the true context
is perfectly known before arm selection. Nonetheless, in many practical
applications (e.g., cloud resource management), prior to arm selection, the
context information can only be acquired by prediction subject to errors or
adversarial modification. In this paper, we study a contextual bandit setting
in which only imperfect context is available for arm selection while the true
context is revealed at the end of each round. We propose two robust arm
selection algorithms: MaxMinUCB (Maximize Minimum UCB) which maximizes the
worst-case reward, and MinWD (Minimize Worst-case Degradation) which minimizes
the worst-case regret. Importantly, we analyze the robustness of MaxMinUCB and
MinWD by deriving both regret and reward bounds compared to an oracle that
knows the true context. Our results show that as time goes on, MaxMinUCB and
MinWD both perform as asymptotically well as their optimal counterparts that
know the reward function. Finally, we apply MaxMinUCB and MinWD to online edge
datacenter selection, and run synthetic simulations to validate our theoretical
analysis
- …