Automatic discovery of ranking formulas for playing with multi-armed bandits
We propose an approach for automatically discovering formulas for ranking arms while playing multi-armed bandits. The approach works by defining a grammar made of basic elements, such as addition, subtraction, the max operator, the average reward collected by an arm, its standard deviation, etc., and by exploiting this grammar to generate and test a large number of formulas. The systematic search for good candidate formulas is carried out by a purpose-built optimization algorithm that navigates this large set of candidates towards those that perform well on a collection of multi-armed bandit problems. We applied this approach to a set of bandit problems with Bernoulli, Gaussian, and truncated Gaussian reward distributions and identified a few simple ranking formulas that provide interesting results on every problem in this set. In particular, they clearly outperform several reference policies previously introduced in the literature. We argue that these newly found formulas, as well as the procedure for generating them, may suggest new directions for studying bandit problems.
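The idea of enumerating a grammar of index formulas and scoring each candidate on bandit problems can be illustrated with a small sketch. This is not the authors' actual grammar or search algorithm: the terminals, operators, and the brute-force enumeration below are illustrative assumptions (the paper uses a larger grammar and a dedicated optimization algorithm).

```python
import math, random

# Illustrative only: a tiny grammar of ranking formulas, enumerated
# exhaustively and scored by regret on a Bernoulli bandit.

# Terminals: per-arm statistics (empirical mean, a UCB-style exploration term).
TERMINALS = {
    "mean":    lambda s: s["mean"],
    "explore": lambda s: math.sqrt(math.log(s["t"]) / s["n"]),
}
# Binary operators combining two terminals.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "max": max,
}

def make_formula(op, left, right):
    return lambda s: OPS[op](TERMINALS[left](s), TERMINALS[right](s))

def run_bandit(index, probs, horizon=2000, seed=0):
    """Play a Bernoulli bandit ranking arms by `index`; return (rough) regret."""
    rng = random.Random(seed)
    n = [0] * len(probs); total = [0.0] * len(probs); reward = 0.0
    for t in range(1, horizon + 1):
        if t <= len(probs):          # pull each arm once to initialize stats
            a = t - 1
        else:
            stats = [{"mean": total[i] / n[i], "n": n[i], "t": t}
                     for i in range(len(probs))]
            a = max(range(len(probs)), key=lambda i: index(stats[i]))
        r = 1.0 if rng.random() < probs[a] else 0.0
        n[a] += 1; total[a] += r; reward += r
    return horizon * max(probs) - reward

# Generate every formula the grammar can produce and rank them by regret.
candidates = [(op, l, r) for op in OPS for l in TERMINALS for r in TERMINALS]
scores = sorted(
    (run_bandit(make_formula(op, l, r), [0.9, 0.6, 0.5]), (op, l, r))
    for op, l, r in candidates)
best = scores[0][1]
print("best formula:", best)
```

With these particular terminals, the grammar happens to contain a UCB-like formula (mean plus exploration term), so the search rediscovers index policies of that family.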
Contributions to Monte Carlo Search
This research is motivated by improving decision making under uncertainty, in particular for games and symbolic regression. The present dissertation gathers research contributions in the field of Monte Carlo Search (MCS). These contributions focus on the selection, simulation, and recommendation policies. Moreover, we develop a methodology to automatically generate an MCS algorithm for a given problem.
For the selection policy, most of the bandit literature assumes that there is no structure or similarity between arms, so that each arm is independent of the others. In several instances, however, arms can be closely related. We show, both theoretically and empirically, that a significant improvement over the state-of-the-art selection policies is then possible.
For the contribution on the simulation policy, we focus on the symbolic regression problem and study how to consistently generate different expressions by changing the probability of drawing each symbol. We formalize the situation as an optimization problem and try different approaches. We show a clear improvement in the sampling process for any expression length. We further test the best approach by embedding it into an MCS algorithm, where it still shows an improvement.
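The effect of symbol-draw probabilities on expression diversity can be sketched in a few lines. This is a minimal illustration, not the dissertation's method: the grammar, weights, and diversity measure below are assumptions chosen only to show how re-weighting symbols changes how many distinct expressions a sampler produces.

```python
import random

# Toy grammar: two binary operators and two leaves.
SYMBOLS = ["+", "*", "x", "1"]
ARITY = {"+": 2, "*": 2, "x": 0, "1": 0}

def sample_expr(weights, rng, depth=0, max_depth=4):
    """Grow a random expression, drawing symbols with the given weights."""
    # Force a leaf at max depth so expressions stay finite.
    pool = [s for s in SYMBOLS if depth < max_depth or ARITY[s] == 0]
    s = rng.choices(pool, weights=[weights[p] for p in pool])[0]
    if ARITY[s] == 0:
        return s
    kids = [sample_expr(weights, rng, depth + 1, max_depth)
            for _ in range(ARITY[s])]
    return "(" + kids[0] + s + kids[1] + ")"

def distinct_count(weights, n=500, seed=0):
    """How many distinct expressions do n samples yield?"""
    rng = random.Random(seed)
    return len({sample_expr(weights, rng) for _ in range(n)})

uniform = {s: 1.0 for s in SYMBOLS}                      # balanced draws
leaf_heavy = {"+": 0.5, "*": 0.5, "x": 2.0, "1": 2.0}    # mostly trivial exprs
print(distinct_count(uniform), distinct_count(leaf_heavy))
```

Over-weighting leaves collapses most samples into a handful of trivial expressions, which is the kind of degeneracy that tuning the draw probabilities is meant to avoid.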
For the contribution on the recommendation policy, we study the most common recommendation policies in combination with selection policies. A good recommendation policy is one that works well with a given selection policy. We observe a trend that seems to favor a robust recommendation policy over a riskier one.
We also present a contribution where we automatically generate several MCS algorithms from a list of core components upon which most MCS algorithms are built, and compare them to generic algorithms. The results show that this often enables discovering new MCS variants that significantly outperform generic MCS algorithms.
Optimization of Item Selection with Prediction Uncertainty
Selecting items from a candidate pool to maximize the total return is a classical problem, faced frequently by people in real life and by engineers in the information technology industry, e.g., digital advertising, e-commerce, and web search. For example, web UI designers try to find the best web design among many candidates to display to users, and Google needs to select personalized, engaging ads based on users' historical online behavior. Each of these industries is a market worth hundreds of billions of dollars, which means that even a small improvement in item selection efficiency can drive hundreds of millions of dollars of growth in the real world. In these applications, the true value of each item is unknown and can only be estimated from observed historical data. There is a large volume of significant research on building prediction models, trained on historical data, to estimate item values. Given data volume and computational resource restrictions, engineers choose different models, e.g., deep neural networks, gradient boosted trees, or logistic regression. We will not dive into this area in this dissertation. Instead, our focus is how to maximize the total return given these predictions, especially taking into account the prediction uncertainties during value optimization. In large-scale real applications, the candidate pool can be extraordinarily large. It is infeasible to pick items from the pool to get interactive feedback for exploration. In fact, not only is exploration infeasible, but even estimating the value of each item through a complex estimation model is almost impossible given the need for real-time response. For example, Apple needs to estimate users' favorite apps and recommend them when users visit the App Store, and Google needs to select ads to display given a user's search query.
There are millions of candidates whose values need to be estimated by prediction models, and it is very challenging to support model prediction at such scale under low-latency constraints. Moreover, to achieve good prediction accuracy, the models used in industry keep getting more complex, e.g., the number of hidden neurons and layers in deep neural networks grows rapidly in real applications, which also increases latency significantly. All of this makes it infeasible to evaluate all candidates through one single complex model in a large-scale application. To solve this problem, engineers usually leverage a cascading waterfall filtering method: instead of using one complex model to estimate the values of all candidates, multiple stages filter out candidates sequentially. For example, a simple model in the first stage estimates candidate values to choose a small subset of all candidates; the selected items are then passed to another stage to be estimated by a more complex model. Intuitively, this cascading waterfall filtering method provides a good trade-off between infrastructure cost and prediction accuracy: it can substantially reduce computational resource use while still accurately selecting the most promising items. However, there has been no systematic study of how to choose the number of stages and how many items to keep at each stage. Engineers tune these settings heuristically through personal experience or online experiments, which is very inefficient, especially when the system is dynamic and changes rapidly. In this dissertation, we propose a theoretical framework for the cascading waterfall filtering problem and develop a mathematical algorithm to obtain the optimal solutions.
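The cascading waterfall idea can be sketched with a two-stage toy pipeline. Everything here is an illustrative assumption (the bid of true values, the noise levels, and the per-call costs), not the dissertation's system; the point is only the cost/accuracy trade-off of pruning with a cheap model before running an expensive one.

```python
import random

rng = random.Random(0)
# True (unknown) item values; in practice these are what we try to estimate.
POOL = [rng.random() for _ in range(100_000)]

def cheap_score(v):      # noisy, fast estimate: costs 1 compute unit per item
    return v + rng.gauss(0, 0.2)

def expensive_score(v):  # accurate, slow estimate: costs 100 units per item
    return v + rng.gauss(0, 0.01)

def cascade(pool, keep_stage1=500, keep_final=10):
    """Stage 1 prunes with the cheap model; stage 2 re-scores survivors."""
    stage1 = sorted(pool, key=cheap_score, reverse=True)[:keep_stage1]
    final = sorted(stage1, key=expensive_score, reverse=True)[:keep_final]
    cost = len(pool) * 1 + len(stage1) * 100
    return final, cost

selected, cost = cascade(POOL)
full_cost = len(POOL) * 100   # running the complex model on every candidate
print(f"cascade cost {cost} vs full {full_cost}; "
      f"avg selected value {sum(selected)/len(selected):.3f}")
```

The open question the dissertation addresses is precisely the one this sketch hard-codes: how many stages to use and how many items each stage should keep.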
Our method achieves a dramatic improvement in an important real-world application that adopts a cascading waterfall filtering system to select a few items from tens of millions of candidates. There are also cases in which the candidate pool is relatively small. For instance, the number of web UI candidates is usually less than one hundred, so we are able to explore during the item selection process. A typical exploration setting is online experimentation, which is widely used to test and select items in real applications and lets us gather interactive feedback to evaluate items. In an online experiment, we usually segment users randomly into several groups, show them different candidates, and then compare the overall performance of each candidate to find the item with the largest value. Among all designs, A/B testing, which usually segments users into two statistically equivalent groups to measure the difference between two versions of a single variable, is the most popular. For instance, to compare the impact of one ad versus another, we need to see the effect of exposing a user to the first ad and not the second, and then compare with the converse situation. However, a user cannot both see the first ad and not see it. Consequently, we create two "statistically equivalent populations" and expose users randomly to one or the other. This method is straightforward, but its defect is also obvious: to measure both versions, it cannot expose all users to the best version, which leads to potential value loss. Some multi-armed bandit algorithms, e.g., Randomized Probability Matching (RPM) and Upper Confidence Bounds (UCB), whose objective is maximizing the total return during the experiment, have been proposed as improvements.
However, these methods do not take into account the statistical confidence level of the experiment's final result and its impact on item selection in the post-experimental stage. To solve this problem, we develop algorithms that achieve a good trade-off between reducing statistical uncertainty and maximizing cumulative reward, aiming to maximize the total expected reward of item selection over the whole duration, which includes both the current experimental stage and the post-experimental stage. The proposed algorithms demonstrate consistent and statistically significant improvements across different settings, outperforming both A/B testing and multi-armed bandit algorithms.
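The value loss of uniform A/B allocation versus a bandit policy, mentioned above, is easy to demonstrate. This sketch compares a 50/50 split with standard UCB1 on two Bernoulli items; the conversion rates and horizon are illustrative assumptions, and this is the textbook UCB1, not the dissertation's proposed algorithm.

```python
import math, random

def ab_test(probs, horizon, rng):
    """Uniform A/B split: alternate between the two items."""
    return sum(rng.random() < probs[t % 2] for t in range(horizon))

def ucb1(probs, horizon, rng):
    """UCB1: pull the arm maximizing mean + sqrt(2 ln t / n)."""
    n = [0, 0]; s = [0.0, 0.0]; total = 0
    for t in range(1, horizon + 1):
        if t <= 2:            # initialize each arm once
            a = t - 1
        else:
            a = max((0, 1),
                    key=lambda i: s[i] / n[i] + math.sqrt(2 * math.log(t) / n[i]))
        r = 1 if rng.random() < probs[a] else 0
        n[a] += 1; s[a] += r; total += r
    return total

probs = [0.6, 0.4]            # true conversion rates (unknown to the policies)
print("A/B reward:", ab_test(probs, 10_000, random.Random(1)))
print("UCB reward:", ucb1(probs, 10_000, random.Random(1)))
```

UCB1 concentrates traffic on the better item and earns noticeably more; what it does not provide, and what the dissertation targets, is a controlled statistical confidence level at the end of the experiment.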
Supply Side Optimisation in Online Display Advertising
On the Internet, publishers (the supply side) provide free content (e.g., news) and services (e.g., email) to attract users. Publishers get paid by selling ad display opportunities (i.e., impressions) to advertisers, and advertisers in turn sell products to users who are converted by ads. Better supply-side revenue allows more free content and services to be created, benefiting the entire online advertising ecosystem. This thesis addresses several optimisation problems for the supply side. When a publisher creates an ad-supported website, he first needs to decide the percentage of ads. The thesis reports a large-scale empirical study of Internet ad density over the past seven years, then presents a model that includes many factors, especially competition among similar publishers, and gives an optimal dynamic ad density that generates the maximum revenue over time. This study also unveils a tragedy of the commons in online advertising, where users' attention has been overgrazed, resulting in a global sub-optimum. After deciding the ad density, the publisher retrieves ads from various sources, including contracts, ad networks, and ad exchanges. This forms an exploration-exploitation problem, as ad sources are typically unknown before trial. The problem is modelled as a Partially Observable Markov Decision Process (POMDP), and exploration efficiency is increased by exploiting the correlation between ads. The proposed method performs 23.4% better than the best-performing baseline in experiments based on real-world data. Since some ad networks allow (or expect) an input of keywords, the thesis also presents an adaptive keyword extraction system using the BM25F algorithm and a multi-armed bandit model. This system has been tested by a domain service provider in crowdsourcing-based experiments. If the publisher selects a Real-Time Bidding (RTB) ad source, he can use a reserve price to manipulate auctions for a better payoff.
This thesis proposes a simplified game model that treats the competition between seller and buyer as one-shot rather than repeated, and gives heuristics that can be easily implemented. The model has been evaluated in a production environment and yielded a 12.3% average increase in revenue. The documentation of a prototype system for reserve price optimisation is also presented in the appendix of the thesis.
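Why a reserve price can raise the seller's payoff in an RTB-style second-price auction can be shown with a toy simulation. This is a standard auction-theory illustration, not the thesis's game model: the uniform bid distribution and the grid search below are assumptions.

```python
import random

def revenue(bids, reserve):
    """Second-price auction with a reserve: the top bidder wins and pays
    max(second-highest bid, reserve), but only if the top bid clears it."""
    top, second = sorted(bids, reverse=True)[:2]
    return max(second, reserve) if top >= reserve else 0.0

rng = random.Random(0)
# Simulate many two-bidder auctions with i.i.d. uniform [0, 1] bids.
auctions = [[rng.uniform(0, 1) for _ in range(2)] for _ in range(20_000)]

def avg_revenue(reserve):
    return sum(revenue(b, reserve) for b in auctions) / len(auctions)

# Grid-search the reserve price that maximizes average revenue.
best = max((r / 100 for r in range(100)), key=avg_revenue)
print(f"best reserve ~{best:.2f}, revenue {avg_revenue(best):.3f} "
      f"vs no reserve {avg_revenue(0.0):.3f}")
```

For this distribution, theory predicts an optimal reserve of 0.5, raising expected revenue from 1/3 to 5/12; the simulation recovers that, while the thesis tackles the harder setting where buyers react strategically to the reserve over time.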
Machine Learning for SAT Solvers
Boolean SAT solvers are indispensable tools in a variety of domains in computer science and engineering where efficient search is required. Not only do they relieve users of the burden of implementing their own search algorithms, they also leverage the surprising effectiveness of modern SAT solving. Thanks to decades of cumulative effort, researchers have made persistent improvements to SAT technology, to the point where the best solvers are now routinely used to solve extremely large instances with millions of variables. Even though the current paradigm of SAT solvers runs in worst-case exponential time, the techniques and heuristics embedded in these solvers appear to avert the worst case in practice. The implementations of these various heuristics and techniques are vital to the solvers' effectiveness.
The state-of-the-art heuristics and techniques gather data during the run of the solver to inform choices such as which variable to branch on next or when to invoke a restart. The goal of these choices is to minimize solving time. The methods by which these heuristics and techniques process the data generally lack theoretical underpinnings. Consequently, understanding why they perform so well in practice remains a challenge, and systematically improving them is difficult. This is the heart of this thesis: to utilize machine learning to process the data as part of an optimization problem whose objective is to minimize solving time. Research in machine learning has exploded over the past decade thanks to its success in extracting useful information from large volumes of data. Machine learning outclasses manual hand-coding in a wide variety of complex tasks where data are plentiful. This is also the case in modern SAT solvers, where propagation, conflict analysis, and clause learning produce plenty of data to analyze, and exploiting this data to the fullest is naturally where machine learning comes in. Many machine learning techniques have a theoretical basis that makes them easy to analyze and helps explain why they perform well.
The branching heuristic is the first target for injecting machine learning. First we studied extant branching heuristics to understand what makes a branching heuristic good empirically. The fundamental observation is that good branching heuristics cause lots of clause learning by triggering conflicts as quickly as possible. This suggests that variables that cause conflicts are a valuable source of data. Another important observation is that the state-of-the-art VSIDS branching heuristic internally implements an exponential moving average, which highlights the importance of accounting for the temporal nature of the data when deciding where to branch. These observations led to our proposal of a series of machine learning-based branching heuristics with the common goal of selecting branching variables that increase the probability of inducing conflicts. These branching heuristics are shown empirically to either match or outcompete the current state of the art.
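The exponential-moving-average view of branching scores can be sketched concretely. This is a minimal illustration of EMA scoring of conflict participation, not the thesis's actual heuristics; the step size and the toy conflict trace are assumptions.

```python
ALPHA = 0.4  # EMA step size: larger means recent conflicts count for more

def ema_scores(num_vars, conflicts, alpha=ALPHA):
    """conflicts: one set of involved variables per learnt conflict, in order.
    Each variable's score is an EMA of its 0/1 conflict participation."""
    score = [0.0] * num_vars
    for involved in conflicts:
        for v in range(num_vars):
            reward = 1.0 if v in involved else 0.0
            # Standard EMA update: move the score toward this step's reward.
            score[v] += alpha * (reward - score[v])
    return score

# Variable 2 appears in recent conflicts; variable 0 only in old ones.
trace = [{0, 1}, {0}, {1, 2}, {2}, {2, 3}]
scores = ema_scores(4, trace)
branch_var = max(range(4), key=scores.__getitem__)
print(scores, "-> branch on", branch_var)
```

Because the EMA discounts old conflicts geometrically, variable 2 outscores variable 0 even though both appeared in two conflicts, capturing the temporal bias the observation above attributes to VSIDS-style scoring.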
The second area of interest for machine learning is the restart policy. As in the branching heuristic work, we first study restarts to understand why they are effective in practice. The important observation here is that restarts shrink the assignment stack, as conjectured by other researchers. We show that this leads to better clause learning by lowering the LBD of learnt clauses. Machine learning is used to predict the LBD of the next clause, and a restart is triggered when the predicted LBD is excessively high. This policy is shown to be on par with the state of the art. The success of incorporating machine learning into branching and restarts shows that machine learning has an important role in the future of heuristic and technique design for SAT solvers.
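A simple version of an LBD-driven restart trigger can be sketched as follows. This is not the thesis's learned predictor: it uses a hand-set window and threshold factor (both assumptions) to flag when recent learnt-clause LBDs are excessively high relative to the long-run average, in the spirit of the policy described above.

```python
from collections import deque

class LBDRestart:
    """Restart when the short-term average LBD of learnt clauses is much
    higher than the long-term average (window and factor are assumptions)."""

    def __init__(self, window=50, factor=1.25):
        self.recent = deque(maxlen=window)  # short-term LBD window
        self.total = 0.0                    # running sum over the whole run
        self.count = 0
        self.factor = factor

    def on_learnt_clause(self, lbd):
        """Record one learnt clause's LBD; return True to trigger a restart."""
        self.recent.append(lbd)
        self.total += lbd
        self.count += 1
        if len(self.recent) < self.recent.maxlen:
            return False                    # not enough data yet
        short = sum(self.recent) / len(self.recent)
        long_avg = self.total / self.count
        return short > self.factor * long_avg   # recent clauses unusually poor

policy = LBDRestart(window=5, factor=1.2)
stream = [3, 3, 4, 3, 3, 9, 10, 11, 12, 13]   # LBDs deteriorate near the end
restarts = [i for i, lbd in enumerate(stream) if policy.on_learnt_clause(lbd)]
print("restart after clauses:", restarts)
```

The trigger stays quiet while LBDs are stable and fires once the recent window is dominated by high-LBD clauses, which is when shrinking the assignment stack is most likely to help.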
Dynamic Difficulty Adjustment
One of the challenges a computer game developer faces when creating a new game is getting the difficulty right. Providing a game with the ability to automatically scale its difficulty to the current player would keep games engaging over a longer time. In this work we aim at a dynamic difficulty adjustment algorithm that can be used as a black box: universal, nonintrusive, and with guarantees on its performance. While there are a few commercial games that boast such a system, as well as a few published results on this topic, to the best of our knowledge none of them satisfy all three of these properties. On the way to our destination we first consider a game as an interaction between a player and her opponent. In this context, assuming their goals are mutually exclusive, difficulty adjustment consists of tuning the skill of the opponent to match the skill of the player. We propose a way to estimate the latter and adjust the former based on ranking the moves available to each player. Two sets of empirical experiments demonstrate the power, but also the limitations, of this approach. Most importantly, the assumptions we make restrict the class of games it can be applied to. Looking for universality, we drop the constraints on the types of games we consider. We rely on the power of supervised learning and use data collected from game testers to learn models of difficulty adjustment, as well as a mapping from game traces to models. Given a short game trace, the corresponding model tells the game what difficulty adjustment should be used. Using a self-developed game, we show that the predicted adjustments match players' preferences. The quality of the difficulty models depends on the quality of the existing training data; the desire to dispense with this need leads us to the last approach.
We propose a formalization of dynamic difficulty adjustment as a novel learning problem in the context of online learning and provide an algorithm to solve it, together with an upper bound on its performance. We show empirical results obtained in simulation and in two qualitatively different games with human participants. Due to its general nature, this algorithm can indeed be used as a black box for dynamic difficulty adjustment: it is applicable to any game with various difficulty states; it does not interfere with the player's experience; and it has a theoretical guarantee on how many mistakes it can possibly make.