7 research outputs found
Discovering Valuable Items from Massive Data
Suppose there is a large collection of items, each with an associated cost
and an inherent utility that is revealed only once we commit to selecting it.
Given a budget on the cumulative cost of the selected items, how can we pick a
subset of maximal value? This task generalizes several important problems such
as multi-armed bandits, active search, and the knapsack problem. We present an
algorithm, GP-Select, which utilizes prior knowledge about similarity between
items, expressed as a kernel function. GP-Select uses Gaussian process
prediction to balance exploration (estimating the unknown value of items) and
exploitation (selecting items of high value). We extend GP-Select to be able to
discover sets that simultaneously have high utility and are diverse. Our
preference for diversity can be specified as an arbitrary monotone submodular
function that quantifies the diminishing returns obtained when selecting
similar items. Furthermore, we exploit the structure of the model updates to
achieve an order of magnitude (up to 40X) speedup in our experiments without
resorting to approximations. We provide strong guarantees on the performance of
GP-Select and apply it to three real-world case studies of industrial
relevance: (1) Refreshing a repository of prices in a Global Distribution
System for the travel industry, (2) Identifying diverse, binding-affine
peptides in a vaccine design task, and (3) Maximizing clicks in a web-scale
recommender system by recommending items to users
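The exploration/exploitation rule described above can be sketched as a budgeted greedy loop over Gaussian process upper confidence bounds. This is an illustrative reconstruction, not the paper's exact algorithm: the RBF kernel, the confidence weight `beta`, and all function names are assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, ls=1.0):
    """Squared-exponential kernel over item feature vectors (assumed form)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls ** 2))

def gp_select(X, costs, budget, reveal, beta=2.0, noise=1e-2):
    """Greedy budgeted selection with a GP upper-confidence rule.

    reveal(i) returns the true utility of item i once we commit to it.
    """
    n = X.shape[0]
    chosen, ys, spent = [], [], 0.0
    while True:
        rest = [i for i in range(n) if i not in chosen and spent + costs[i] <= budget]
        if not rest:
            break
        if chosen:
            # GP posterior over the remaining items, conditioned on revealed values
            K = rbf_kernel(X[chosen], X[chosen]) + noise * np.eye(len(chosen))
            Ks = rbf_kernel(X[rest], X[chosen])
            mu = Ks @ np.linalg.solve(K, np.array(ys))
            var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
        else:
            mu, var = np.zeros(len(rest)), np.ones(len(rest))
        # exploration bonus (uncertainty) plus exploitation (predicted value)
        ucb = mu + beta * np.sqrt(np.clip(var, 0.0, None))
        best = rest[int(np.argmax(ucb))]
        chosen.append(best)
        ys.append(reveal(best))
        spent += costs[best]
    return chosen
```

The diversity extension would subtract a submodular redundancy penalty from `ucb` before the argmax; that term is omitted here for brevity.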
Trip Prediction by Leveraging Trip Histories from Neighboring Users
We propose a novel approach for trip prediction by analyzing users' trip
histories. We augment users' (self-) trip histories by adding 'similar' trips
from other users, which could be informative and useful for predicting future
trips for a given user. This also helps to cope with noisy or sparse trip
histories, where the self-history by itself does not provide a reliable
prediction of future trips. We show empirical evidence that by enriching the
users' trip histories with additional trips, one can reduce the prediction
error by 15%-40%, evaluated on multiple subsets of the Nancy2012 dataset. This
real-world dataset is collected from public transportation ticket validations
in the city of Nancy, France. Our prediction tool is a central component of a
trip simulator system designed to analyze the functionality of public
transportation in the city of Nancy
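The history-augmentation idea can be sketched as follows: measure user similarity by overlap of trip sets, borrow trips from the most similar neighbors, and predict the most frequent trip in the enriched history. The Jaccard similarity, the representation of a trip as an (origin, destination) pair, and the function names are assumptions, not the paper's exact method.

```python
from collections import Counter

def jaccard(a, b):
    """Overlap between two users' trip sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def augment_history(histories, user, k=2):
    """Enrich `user`'s (possibly sparse) history with trips from the
    k most similar other users."""
    self_trips = histories[user]
    sims = sorted(
        ((jaccard(self_trips, h), u) for u, h in histories.items() if u != user),
        reverse=True,
    )
    extra = [t for _, u in sims[:k] for t in histories[u]]
    return list(self_trips) + extra

def predict_next_trip(histories, user, k=2):
    """Predict the most frequent trip in the augmented history."""
    return Counter(augment_history(histories, user, k)).most_common(1)[0][0]
```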
Fake News Detection in Social Networks via Crowd Signals
Our work considers leveraging crowd signals for detecting fake news and is
motivated by tools recently introduced by Facebook that enable users to flag
fake news. By aggregating users' flags, our goal is to select a small subset of
news every day, send them to an expert (e.g., via a third-party fact-checking
organization), and stop the spread of news identified as fake by an expert. The
main objective of our work is to minimize the spread of misinformation by
stopping the propagation of fake news in the network. It is especially
challenging to achieve this objective as it requires detecting fake news with
high confidence as quickly as possible. We show that, in order to leverage
users' flags efficiently, it is crucial to learn about users' flagging
accuracy. We develop a novel algorithm, DETECTIVE, that performs Bayesian
inference for detecting fake news and jointly learns about users' flagging
accuracy over time. Our algorithm employs posterior sampling to actively trade
off exploitation (selecting news that maximize the objective value at a given
epoch) and exploration (selecting news that maximize the value of information
towards learning about users' flagging accuracy). We demonstrate the
effectiveness of our approach via extensive experiments and show the power of
leveraging community signals for fake news detection
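The interplay of posterior sampling and accuracy learning can be sketched with a Beta posterior per user: sample an accuracy for each flagger (Thompson sampling), score each news item by its sampled probability of being fake, send the top items to the expert, and fold the verdict back into the posteriors. The Beta parameterization and function names are illustrative assumptions, not DETECTIVE's exact formulation.

```python
import random

def detective_step(flags, acc_params, n_select, seed=0):
    """Pick which news items to send to the expert via posterior sampling.

    flags[i] is a list of (user, flagged_as_fake) votes on news item i;
    acc_params[u] = [alpha, beta] is a Beta posterior over user u's accuracy.
    """
    rng = random.Random(seed)
    scores = []
    for votes in flags:
        p_fake = 0.0
        for user, flagged in votes:
            a, b = acc_params[user]
            acc = rng.betavariate(a, b)  # Thompson sample of the user's accuracy
            p_fake += acc if flagged else (1.0 - acc)
        scores.append(p_fake / max(len(votes), 1))
    order = sorted(range(len(flags)), key=lambda i: scores[i], reverse=True)
    return order[:n_select]

def update_accuracy(acc_params, votes, expert_says_fake):
    """The expert's verdict updates each voter's Beta posterior."""
    for user, flagged in votes:
        if flagged == expert_says_fake:
            acc_params[user][0] += 1  # counted as a correct flag
        else:
            acc_params[user][1] += 1
```

Sampling from the posterior, rather than using its mean, is what trades off exploiting reliable flaggers against learning about uncertain ones.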
Data Summarization with Social Contexts
While social data is being widely used in applications such as sentiment analysis and trend prediction, its sheer size also presents great challenges for storing, sharing and processing such data. These challenges can be addressed by data summarization, which transforms the original dataset into a smaller, yet still useful, subset. Existing methods find such subsets with objective functions based on data properties such as representativeness or informativeness, but do not exploit social contexts, which are distinct characteristics of social data. Further, to date little work has focused on topic-preserving data summarization, despite the abundant work on topic modeling. This is a challenging task for two reasons. First, since topic models are based on latent variables, existing methods are not well suited to capturing latent topics. Second, it is difficult to find social contexts that provide valuable information for building an effective topic-preserving summarization model. To tackle these challenges, in this paper we focus on exploiting social contexts to summarize social data while preserving the topics of the original dataset. We take Twitter data as a case study. Through analyzing Twitter data, we discover two social contexts that are important for topic generation and dissemination, namely (i) the CrowdExp topic score, which captures the influence of both crowd and expert users in Twitter, and (ii) the Retweet topic score, which captures the influence of Twitter users' actions. We conduct extensive experiments on two real-world Twitter datasets using two applications. The experimental results show that, by leveraging social contexts, our proposed solution can enhance topic-preserving data summarization and improve application performance by up to 18%
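Topic-preserving summarization of this kind is commonly cast as greedy maximization of a monotone submodular coverage objective. The sketch below assumes per-tweet topic scores are already computed (e.g., a combined CrowdExp/Retweet score); the coverage objective and function names are illustrative, not the paper's exact model.

```python
def summarize(tweets, topic_scores, k):
    """Greedily pick k tweets that maximize covered topic mass.

    topic_scores[i][t] = importance of topic t in tweet i (assumed given).
    Coverage of a topic is the max score among selected tweets, so the
    objective is monotone submodular and greedy enjoys a (1 - 1/e) guarantee.
    """
    covered = {}   # topic -> best score achieved so far
    summary = []
    for _ in range(min(k, len(tweets))):
        def gain(i):
            # marginal increase in covered topic mass from adding tweet i
            return sum(max(0.0, s - covered.get(t, 0.0))
                       for t, s in topic_scores[i].items())
        rest = [i for i in range(len(tweets)) if i not in summary]
        best = max(rest, key=gain)
        summary.append(best)
        for t, s in topic_scores[best].items():
            covered[t] = max(covered.get(t, 0.0), s)
    return [tweets[i] for i in summary]
```

Note how the second pick skips near-duplicates of the first: once a topic is covered, further tweets on it contribute little marginal gain.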
Methodology for planning and reducing the number of experiments in active search problems
Many engineering problems involve the optimization of an unknown objective function.
Recently, active search has emerged as a powerful tool for solving problems of this
nature, in which evaluating the objective function is costly, whether computationally
or experimentally. This thesis proposal seeks to find an object (x) with an optimal
value of a given property (y). However, directly determining this property of interest
for all available objects may not be viable given the resources, workload and/or time
required. Thus, this study proposes an active machine learning approach, called active
search, to find an optimal solution, using design of experiments for the initial
search. Two regression techniques were used to apply this method: k-nearest neighbours
and Gaussian processes. Furthermore, a stopping criterion was defined for the Gaussian
process regression technique to reduce the algorithm's processing time. The originality
of the work lies in the proposed methodology, in the use of experimental design, in the
active search algorithm using regression techniques that converge quickly to a global
optimum, and in the use of a stopping criterion for the algorithm based on statistical
criteria. The studies were carried out with simulated data and with real data on the
production of medicines and agrochemicals and an application in electrical microgrids.
In all cases, active search reduced the number of experiments and simulations needed to
obtain the property of interest, compared to traditional algorithms such as Optimal
Experiment Design and Kennard-Stone
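The active search loop with a Gaussian process surrogate and a stopping criterion can be sketched as follows. The RBF kernel, lengthscale, upper-confidence acquisition, and the particular stopping rule (stop when the largest exploration bonus falls below a tolerance) are assumptions standing in for the thesis's exact choices.

```python
import numpy as np

def active_search(X, evaluate, n_init=3, max_iter=30, ls=0.2, tol=1e-3, seed=0):
    """Active search over a finite candidate pool with a GP surrogate.

    Starts from a small random design (the initial design of experiments),
    then repeatedly evaluates the candidate with the highest upper confidence
    bound, stopping once the largest exploration bonus drops below `tol`.
    """
    rng = np.random.default_rng(seed)
    tried = list(rng.choice(len(X), size=n_init, replace=False))
    ys = [evaluate(i) for i in tried]
    for _ in range(max_iter):
        rest = [i for i in range(len(X)) if i not in tried]
        if not rest:
            break
        # GP posterior with an RBF kernel (lengthscale ls) and a small jitter
        d2 = ((X[tried][:, None] - X[tried][None, :]) ** 2).sum(-1)
        K = np.exp(-d2 / (2 * ls ** 2)) + 1e-6 * np.eye(len(tried))
        d2s = ((X[rest][:, None] - X[tried][None, :]) ** 2).sum(-1)
        Ks = np.exp(-d2s / (2 * ls ** 2))
        mu = Ks @ np.linalg.solve(K, np.array(ys))
        var = np.clip(1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T)),
                      0.0, None)
        bonus = 2.0 * np.sqrt(var)
        if bonus.max() < tol:  # stopping criterion: model is confident everywhere
            break
        nxt = rest[int(np.argmax(mu + bonus))]
        tried.append(nxt)
        ys.append(evaluate(nxt))
    best = int(np.argmax(ys))
    return tried[best], ys[best]
```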
Automating Active Learning for Gaussian Processes
In many problems in science, technology, and engineering, unlabeled data is abundant but acquiring labeled observations is expensive -- it requires a human annotator, a costly laboratory experiment, or a time-consuming computer simulation. Active learning is a machine learning paradigm designed to minimize the cost of obtaining labeled data by carefully selecting which new data should be gathered next. However, excessive machine learning expertise is often required to apply these techniques effectively in their current form. In this dissertation, we propose solutions that further automate active learning. Our core contributions are active learning algorithms that are easy for non-experts to use but that deliver results competitive with or better than human-expert solutions. We begin by introducing a novel active search algorithm that automatically and dynamically balances exploration against exploitation, without relying on a parameter to control this tradeoff. We also provide a theoretical investigation of the hardness of this problem, proving that no polynomial-time policy can achieve a constant-factor approximation of the expected utility of the optimal policy. Next, we introduce a novel information-theoretic approach to active model selection. Our method is based on maximizing the mutual information between the output variable and the model class. This is the first active-model-selection approach that does not require updating each model for every candidate point. Using this approach, we developed an automated audiometry test for rapid screening of noise-induced hearing loss, a widespread disability that is preventable if diagnosed early. We proceed by introducing a novel model selection algorithm for fixed-size datasets, called Bayesian optimization for model selection (BOMS). Our proposed model search method is based on Bayesian optimization in model space, where we reason about the model evidence as a function to be maximized. 
BOMS is capable of finding a model that explains the dataset well without any human assistance. Finally, we extend BOMS to active learning, creating a fully automatic active learning framework. We apply this framework to Bayesian optimization, creating a sample-efficient automated system for black-box optimization. Crucially, we account for the uncertainty in the choice of model; our method uses multiple and carefully-selected models to represent its current belief about the latent objective function. Our algorithms are completely general and can be extended to any class of probabilistic models. In this dissertation, however, we mainly use the powerful class of Gaussian process models to perform inference. Extensive experimental evidence is provided to demonstrate that all proposed algorithms outperform previously developed solutions to these problems
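The information-theoretic selection criterion described above can be sketched for a small discrete model class with binary responses: score each candidate point by the mutual information between its (unobserved) label and the model identity, i.e., the entropy of the model-averaged prediction minus the average per-model entropy. The models, weights, and function names below are illustrative assumptions, not the dissertation's exact setup.

```python
import math

def entropy(p):
    """Binary entropy in nats."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def model_mi_score(x, models, weights):
    """Mutual information between the response at x and the model class:
    entropy of the posterior-weighted mixture prediction, minus the
    posterior-weighted average of each model's predictive entropy."""
    preds = [m(x) for m in models]
    mixed = sum(w * p for w, p in zip(weights, preds))
    return entropy(mixed) - sum(w * entropy(p) for w, p in zip(weights, preds))

def next_query(candidates, models, weights):
    """Pick the candidate point most informative about which model is right."""
    return max(candidates, key=lambda x: model_mi_score(x, models, weights))
```

The score is high exactly where confident models disagree, which is why such a criterion concentrates queries on the decision-relevant region (e.g., near a listener's hearing threshold in the audiometry application).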