Diverse Conventions for Human-AI Collaboration
Conventions are crucial for strong performance in cooperative multi-agent
games, because they allow players to coordinate on a shared strategy without
explicit communication. Unfortunately, standard multi-agent reinforcement
learning techniques, such as self-play, converge to conventions that are
arbitrary and non-diverse, leading to poor generalization when interacting with
new partners. In this work, we present a technique for generating diverse
conventions by (1) maximizing their rewards during self-play, while (2)
minimizing their rewards when playing with previously discovered conventions
(cross-play), stimulating conventions to be semantically different. To ensure
that learned policies act in good faith despite the adversarial optimization of
cross-play, we introduce \emph{mixed-play}, where an initial state is randomly
generated by sampling self-play and cross-play transitions and the player
learns to maximize the self-play reward from this initial state. We analyze the
benefits of our technique on various multi-agent collaborative games, including
Overcooked, and find that our technique can adapt to the conventions of humans,
surpassing human-level performance when paired with real users.Comment: 25 pages, 9 figures, 37th Conference on Neural Information Processing
Systems (NeurIPS 2023
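To make the self-play/cross-play trade-off above concrete, here is a minimal, self-contained Python sketch of that kind of objective: the self-play return is maximized while the cross-play return against previously discovered conventions is penalized. The one-shot coordination game, the expected_return helper, and the xp_weight parameter are illustrative assumptions, and the mixed-play component is omitted; this is not the authors' implementation.

import numpy as np

def expected_return(pi_a, pi_b):
    # Expected reward in a one-shot coordination game: both players score 1
    # only when they choose the same action.
    return float(np.dot(pi_a, pi_b))

def diversity_objective(policy, prior_conventions, xp_weight=1.0):
    # Maximize self-play return while penalizing cross-play return against
    # previously discovered conventions, pushing the new convention to differ.
    self_play = expected_return(policy, policy)
    cross_play = (np.mean([expected_return(policy, q) for q in prior_conventions])
                  if prior_conventions else 0.0)
    return self_play - xp_weight * cross_play

# With the convention "always pick action 0" already found, committing to
# action 1 scores higher under the diversity objective than copying it.
prior = [np.array([1.0, 0.0])]
print(diversity_objective(np.array([0.0, 1.0]), prior))  # 1.0
print(diversity_objective(np.array([1.0, 0.0]), prior))  # 0.0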
Optimization of Item Selection with Prediction Uncertainty
Selecting items from a candidate pool to maximize the total return is a classical problem, faced frequently by people in everyday life and by engineers in the information technology industry, e.g., digital advertising, e-commerce, and web search. For example, web UI designers try to find the best design among many candidates to display to users, and Google needs to select personalized, engaging ads based on users' historical online behavior. Each of these industries represents a market of hundreds of billions of dollars, so even a small improvement in item selection efficiency can drive hundreds of millions of dollars of growth in the real world. In these applications, the true value of each item is unknown and can only be estimated from observed historical data. There is a large volume of research on building prediction models trained on historical data to estimate item values. Depending on data volume and computational resource restrictions, engineers choose different models, e.g., deep neural networks, gradient boosted trees, or logistic regression. We do not dive deeply into that area in this dissertation. Instead, our focus is how to maximize the total return given these predictions, especially by taking the prediction uncertainties into account during value optimization.
In large-scale real applications, the candidate pool can be extraordinarily large. It is infeasible to pick items from the pool and gather interactive feedback for exploration. In fact, not only is exploration infeasible, but even estimating the value of each item with a complex model is nearly impossible given the need for real-time responses. For example, Apple needs to estimate users' favorite apps and recommend them when users visit the App Store, and Google needs to select ads to display given a user's search query. Millions of candidates must be scored by prediction models, and it is very challenging to support model prediction at such scale under low-latency constraints. Moreover, to achieve good prediction accuracy, the models used in industry keep growing more complex, e.g., the numbers of hidden neurons and layers in deep neural networks increase rapidly in real applications, which also increases latency significantly. All of this makes it infeasible to evaluate every candidate with a single complex model in large-scale applications. To solve this problem, engineers usually adopt a cascading waterfall filtering method that filters items sequentially: instead of one complex model estimating the values of all candidates, multiple stages filter candidates one after another. For example, a simple model in the first stage estimates candidate values to choose a small subset of all candidates; the selected items are then passed to a later stage and scored by a more complex model, as sketched below. Intuitively, cascading waterfall filtering provides a good trade-off between infrastructure cost and prediction accuracy: it can save computational resources substantially while still selecting the most promising items accurately. However, there is no systematic study of how to efficiently choose the number of waterfall stages and how many items to keep at each stage.
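The following is a minimal sketch of the cascading waterfall filtering idea just described, not the dissertation's system: a cheap scoring function prunes a large pool before a more expensive one re-ranks the survivors. The stage models, budgets, and toy scoring functions are made-up placeholders.

def waterfall_select(candidates, stages):
    # stages: list of (score_fn, keep_k) pairs, ordered cheap -> expensive.
    pool = list(candidates)
    for score_fn, keep_k in stages:
        pool.sort(key=score_fn, reverse=True)   # rank by this stage's model
        pool = pool[:keep_k]                    # pass only the top-k onward
    return pool

# Toy scoring functions standing in for a simple and a complex value model.
cheap_model = lambda x: x % 7
complex_model = lambda x: -(x - 50) ** 2
print(waterfall_select(range(10_000), [(cheap_model, 500), (complex_model, 10)]))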
Engineers tune the settings of such systems heuristically, through personal experience or online experiments, which is very inefficient, especially when the system is dynamic and changes rapidly. In this dissertation, we propose a theoretical framework for the cascading waterfall filtering problem and develop a mathematical algorithm that obtains the optimal solutions. Our method achieves a dramatic improvement in an important real-world application that uses a cascading waterfall filtering system to select a few items from tens of millions of candidates.
There are also cases in which the candidate pool is relatively small. For instance, the number of web UI candidates is usually less than one hundred. In that setting we are able to explore during the item selection process. A typical form of exploration is online experimentation, widely used to test and select items in real applications, where interactive feedback is available to evaluate items. In an online experiment, we usually segment users randomly into several groups, show each group a different candidate, and then compare the candidates' overall performance to find the item with the largest value. Among all designs, A/B testing, which segments users into two statistically equivalent groups to measure the difference between two versions of a single variable, is the most popular. For instance, to compare the impact of one ad versus another, we need to see the effect of exposing a user to the first ad and not the second, and then compare with the converse situation. However, a user cannot both see the first ad and not see it. Consequently, we need to create two "statistically equivalent populations" and expose users randomly to one or the other. This method is straightforward, but its defect is also obvious: to measure both versions, it cannot expose all users to the best version, which leads to potential value loss. Multi-armed bandit algorithms, e.g., Randomized Probability Matching (RPM) and Upper Confidence Bounds (UCB), whose objective is to maximize the total return during the experiment, have been proposed as improvements. However, these methods do not take into account the statistical confidence level of the experiment's final result and its impact on the subsequent item selection in the post-experimental stage. To solve this problem, we develop algorithms that strike a good trade-off between reducing statistical uncertainty and maximizing cumulative reward, with the goal of maximizing the total expected reward of item selection over the entire duration, including both the experimental stage and the post-experimental stage. The proposed algorithms demonstrate consistent and statistically significant improvements across different settings, significantly outperforming both A/B testing and multi-armed bandit algorithms.
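For concreteness, below is a minimal Python sketch of the UCB baseline mentioned above for the small-pool, interactive-feedback setting, assuming Bernoulli rewards. It only illustrates the exploration/exploitation trade-off; the helper names, horizon, and toy success rates are illustrative assumptions, not the dissertation's proposed algorithm.

import math, random

def ucb_pick(counts, means, t, c=2.0):
    # Choose the arm with the highest upper confidence bound (UCB1-style).
    def bound(i):
        if counts[i] == 0:
            return float("inf")          # force one initial pull per arm
        return means[i] + math.sqrt(c * math.log(t) / counts[i])
    return max(range(len(counts)), key=bound)

def run_ucb(true_rates, horizon=10_000, seed=0):
    random.seed(seed)
    k = len(true_rates)
    counts, means, total = [0] * k, [0.0] * k, 0.0
    for t in range(1, horizon + 1):
        i = ucb_pick(counts, means, t)
        reward = 1.0 if random.random() < true_rates[i] else 0.0
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]   # incremental mean
        total += reward
    return total, counts

# Toy success rates; most pulls should end up on the best (0.6) arm.
total, counts = run_ucb([0.1, 0.3, 0.6])
print(total, counts)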
How Technology Impacts and Compares to Humans in Socially Consequential Arenas
One of the main promises of technology development is for it to be adopted by
people, organizations, societies, and governments -- incorporated into their
life, work stream, or processes. Often, this is socially beneficial as it
automates mundane tasks, frees up more time for other more important things, or
otherwise improves the lives of those who use the technology. However, these
beneficial results do not apply in every scenario and may not impact everyone
in a system the same way. Sometimes a technology is developed that both
produces benefits and inflicts some harm. These harms may come at a higher cost to
some people than others, raising the question: {\it how are benefits and harms
weighed when deciding if and how a socially consequential technology gets
developed?} The most natural way to answer this question, and in fact how
people first approach it, is to compare the new technology to what used to
exist. As such, in this work, I make comparative analyses between humans and
machines in three scenarios and seek to understand how sentiment about a
technology, performance of that technology, and the impacts of that technology
combine to influence how one decides to answer my main research question.
Comment: Doctoral thesis proposal. arXiv admin note: substantial text overlap with arXiv:2110.08396, arXiv:2108.12508, arXiv:2006.1262
Multi-Objective Learning for Multi-Modal Natural Language Generation
One of the important goals of Artificial Intelligence (AI) is to mimic the human ability to leverage knowledge or skills from previously learned tasks to quickly learn a new task. For example, humans can reapply the learned skill of balancing a bicycle when learning to ride a motorbike. In a similar spirit, the field of Natural Language Processing (NLP) has several tasks including machine translation, textual summarization, image/video captioning, sentiment analysis, dialog systems, natural language inference, question answering, etc. While these different NLP tasks are often trained separately, leveraging knowledge or skills from related tasks, via joint training or by training one task after another in a sequential fashion, can have potential advantages. To this end, this dissertation explores various NLP tasks (especially multi-modal text generation and pair-wise classification tasks, covering both natural language generation (NLG) and natural language understanding (NLU)) that leverage information from related auxiliary tasks in an effective way via novel multi-objective learning strategies. These proposed learning strategies can be broadly classified into three paradigms: multi-task learning, multi-reward reinforcement learning, and continual learning. In multi-task learning, we mainly focus on intuitively finding which related auxiliary tasks can benefit the multi-modal video caption generation task and the textual summarization task, and we explore effective ways of sharing parameters across these related tasks via joint training. In multi-reward reinforcement learning, we teach various skills to multi-modal text generation models in the form of rewards; for example, we teach the entailment skill to a video captioning model with entailment rewards. Further, we propose novel and effective ways of inducing multiple skills by `dynamically' choosing the auxiliary tasks (in MTL) or rewards (in RL) during training in an automatic way using multi-armed bandit based approaches, as sketched below. Finally, in continual learning, we explore sharing information across various tasks in a sequential way, where the model continually evolves during sequential training without losing performance on previously learned tasks. This kind of sharing allows later tasks to benefit from previously trained tasks, and in some cases vice versa. For this, we propose a novel method that continually changes the model architecture to accommodate new tasks while retaining performance on old tasks. We empirically evaluate our method on three natural language inference tasks.
Doctor of Philosophy
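Below is a minimal sketch of how a multi-armed bandit can dynamically pick which auxiliary task (or reward) to train on at each step, as referenced in the abstract above. The EXP3-style update, learning rate, and toy gains are illustrative assumptions, not the dissertation's exact method.

import math, random

class TaskBandit:
    # EXP3-style sampler over auxiliary tasks (illustrative only): tasks whose
    # recent use improved the main objective are sampled more often.
    def __init__(self, n_tasks, lr=0.1):
        self.weights = [0.0] * n_tasks
        self.lr = lr

    def probs(self):
        m = max(self.weights)
        exp_w = [math.exp(w - m) for w in self.weights]
        z = sum(exp_w)
        return [e / z for e in exp_w]

    def sample(self):
        return random.choices(range(len(self.weights)), weights=self.probs())[0]

    def update(self, task, gain):
        # Importance-weighted update keeps the feedback unbiased even for
        # rarely sampled tasks.
        p = self.probs()[task]
        self.weights[task] += self.lr * gain / p

# Toy demo standing in for a training loop: pretend auxiliary task 2
# consistently yields the largest validation gain for the main task.
random.seed(0)
bandit = TaskBandit(n_tasks=3)
for _ in range(200):
    task = bandit.sample()
    gain = 1.0 if task == 2 else 0.2   # placeholder for a measured gain
    bandit.update(task, gain)
print([round(p, 2) for p in bandit.probs()])  # mass shifts toward task 2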
Auditing black-box prediction models for data minimization compliance
In this paper, we focus on auditing black-box prediction models for compliance with
the GDPR’s data minimization principle. This principle restricts prediction models
to use the minimal information that is necessary for performing the task at hand.
Given the challenge of the black-box setting, our key idea is to check if each of the
prediction model’s input features is individually necessary by assigning it some
constant value (i.e., applying a simple imputation) across all prediction instances,
and measuring the extent to which the model outcomes would change. We introduce
a metric for data minimization that is based on model instability under simple
imputations. We extend the applicability of this metric from a finite sample model
to a distributional setting by introducing a probabilistic data minimization guarantee,
which we derive using a Bayesian approach. Furthermore, we address the auditing
problem under a constraint on the number of queries to the prediction system. We
formulate the problem of allocating a budget of system queries to feasible simple
imputations (for investigating model instability) as a multi-armed bandit framework
with probabilistic success metrics. We define two bandit problems for providing a
probabilistic data minimization guarantee at a given confidence level: a decision
problem given a data minimization level, and a measurement problem given a fixed
query budget. We design efficient algorithms for these auditing problems using
novel exploration strategies that expand classical bandit strategies. Our experiments
with real-world prediction systems show that our auditing algorithms significantly
outperform simpler benchmarks in both measurement and decision problems.
Published version
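As a rough illustration of the simple-imputation idea in the abstract above, the sketch below fixes one feature to a constant across all instances and measures how often a black-box model's predictions change. The instability function, the toy model, and the raw flip rate are illustrative assumptions rather than the paper's exact metric or its bandit-based auditing procedure.

import numpy as np

def instability(predict, X, feature, constant):
    # Fraction of instances whose prediction flips when `feature` is
    # replaced by `constant` across the whole sample (simple imputation).
    X_imputed = X.copy()
    X_imputed[:, feature] = constant
    return float(np.mean(predict(X) != predict(X_imputed)))

# Toy "black box" that only uses feature 0: imputing feature 1 changes
# nothing, hinting that feature 1 may not be necessary for the task.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
black_box = lambda Z: (Z[:, 0] > 0).astype(int)
print(instability(black_box, X, feature=1, constant=0.0))  # 0.0
print(instability(black_box, X, feature=0, constant=0.0))  # about 0.5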
Estimating Dependency, Monitoring and Knowledge Discovery in High-Dimensional Data Streams
Data Mining – known as the process of extracting knowledge from massive data sets – leads to phenomenal impacts on our society, and now affects nearly every aspect of our lives: from the layout in our local grocery store, to the ads and product recommendations we receive, the availability of treatments for common diseases, the prevention of crime, or the efficiency of industrial production processes.
However, Data Mining remains difficult when (1) data is high-dimensional, i.e., has many attributes, and when (2) data comes as a stream. Extracting knowledge from high-dimensional data streams is impractical because one must cope with two orthogonal sets of challenges. On the one hand, the effects of the so-called "curse of dimensionality" bog down the performance of statistical methods and give rise to increasingly complex Data Mining problems. On the other hand, the statistical properties of data streams may evolve in unexpected ways, a phenomenon known in the community as "concept drift". Thus, one needs to update one's knowledge about the data over time, i.e., to monitor the stream.
While previous work addresses high-dimensional data sets and data streams to some extent, the intersection of both has received much less attention. Nevertheless, extracting knowledge in this setting is advantageous for many industrial applications: identifying patterns from high-dimensional data streams in real-time may lead to larger production volumes, or reduce operational costs. The goal of this dissertation is to bridge this gap.
We first focus on dependency estimation, a fundamental task of Data Mining. Typically, one estimates dependency by quantifying the strength of statistical relationships. We identify the requirements for dependency estimation in high-dimensional data streams and propose a new estimation framework, Monte Carlo Dependency Estimation (MCDE), that fulfils them all. We show that MCDE leads to efficient dependency monitoring.
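For intuition, here is a minimal sketch of a Monte Carlo, slicing-based dependency estimate in the spirit of MCDE, assuming a two-sample Kolmogorov-Smirnov statistic as the contrast measure; the function name, slice scheme, and parameters are illustrative stand-ins, not the MCDE framework itself.

import numpy as np
from scipy.stats import ks_2samp

def mc_dependency(data, n_iter=200, slice_frac=0.5, rng=None):
    # Average, over random slices, how much conditioning on one dimension
    # shifts the distribution of another dimension (KS statistic).
    rng = rng if rng is not None else np.random.default_rng(0)
    n, d = data.shape
    scores = []
    for _ in range(n_iter):
        target = int(rng.integers(d))                            # dimension to test
        cond = int(rng.choice([j for j in range(d) if j != target]))
        order = np.argsort(data[:, cond])                        # rank by the conditioning dim
        width = max(2, int(slice_frac * n))
        start = int(rng.integers(0, n - width + 1))
        inside = order[start:start + width]                      # points inside the slice
        res = ks_2samp(data[inside, target], data[:, target])
        scores.append(res.statistic)
    return float(np.mean(scores))

# Example: a dependent pair should score clearly higher than an independent one.
rng = np.random.default_rng(1)
x = rng.normal(size=2000)
dependent = np.column_stack([x, x + 0.1 * rng.normal(size=2000)])
independent = rng.normal(size=(2000, 2))
print(mc_dependency(dependent), mc_dependency(independent))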
Then, we generalise the task of monitoring by introducing the Scaling Multi-Armed Bandit (S-MAB) algorithms, extending the Multi-Armed Bandit (MAB) model. We show that our algorithms can efficiently monitor statistics by leveraging user-specific criteria.
Finally, we describe applications of our contributions to Knowledge Discovery. We propose an algorithm, Streaming Greedy Maximum Random Deviation (SGMRD), which exploits our new methods to extract patterns, e.g., outliers, from high-dimensional data streams. We also present a new approach, which we name kj-Nearest Neighbours (kj-NN), to detect outlying documents within massive text corpora.
We support our algorithmic contributions with theoretical guarantees, as well as extensive experiments against both synthetic and real-world data. We demonstrate the benefits of our methods against real-world use cases. Overall, this dissertation establishes fundamental tools for Knowledge Discovery in high-dimensional data streams, which help with many applications in the industry, e.g., anomaly detection, or predictive maintenance.
To facilitate the application of our results and future research, we publicly release our implementations, experiments, and benchmark data via open-source platforms.