7 research outputs found

    Discovering Valuable Items from Massive Data

    Full text link
    Suppose there is a large collection of items, each with an associated cost and an inherent utility that is revealed only once we commit to selecting it. Given a budget on the cumulative cost of the selected items, how can we pick a subset of maximal value? This task generalizes several important problems such as multi-armed bandits, active search and the knapsack problem. We present an algorithm, GP-Select, which utilizes prior knowledge about similarity between items, expressed as a kernel function. GP-Select uses Gaussian process prediction to balance exploration (estimating the unknown value of items) and exploitation (selecting items of high value). We extend GP-Select to discover sets that simultaneously have high utility and are diverse. Our preference for diversity can be specified as an arbitrary monotone submodular function that quantifies the diminishing returns obtained when selecting similar items. Furthermore, we exploit the structure of the model updates to achieve an order-of-magnitude (up to 40X) speedup in our experiments without resorting to approximations. We provide strong guarantees on the performance of GP-Select and apply it to three real-world case studies of industrial relevance: (1) refreshing a repository of prices in a Global Distribution System for the travel industry, (2) identifying diverse, binding-affine peptides in a vaccine design task and (3) maximizing clicks in a web-scale recommender system by recommending items to users.
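    The abstract describes GP-Select's selection rule only at a high level. As a rough illustration of the exploration/exploitation balance it mentions, the following minimal Python sketch greedily selects items by a GP upper confidence bound under a budget; the kernel choice, the UCB form with parameter beta, and all function names are our own simplifications, not the paper's exact algorithm (which also covers the submodular diversity extension and the fast model updates).

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel expressing similarity between items."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X_sel, y_sel, X_all, noise=1e-2):
    """GP posterior mean/std over all items, given observed utilities."""
    K = rbf_kernel(X_sel, X_sel) + noise * np.eye(len(X_sel))
    Ks = rbf_kernel(X_sel, X_all)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_sel))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.clip(1.0 - (v * v).sum(0), 1e-12, None)  # prior variance is 1
    return mu, np.sqrt(var)

def gp_select(X, costs, budget, reveal_utility, beta=2.0):
    """Greedy budgeted selection by upper confidence bound (UCB)."""
    n, chosen, ys, spent = len(X), [], [], 0.0
    while True:
        if chosen:
            mu, sigma = gp_posterior(X[chosen], np.array(ys), X)
        else:
            mu, sigma = np.zeros(n), np.ones(n)   # prior belief
        ucb = mu + beta * sigma                   # optimism under uncertainty
        ucb[chosen] = -np.inf                     # never pick an item twice
        ucb[costs + spent > budget] = -np.inf     # respect the budget
        i = int(np.argmax(ucb))
        if np.isinf(ucb[i]):
            break                                 # no feasible item left
        chosen.append(i)
        ys.append(reveal_utility(i))              # utility revealed on commitment
        spent += costs[i]
    return chosen

# Toy usage: 50 items with 3 features and a linear true utility.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
costs = rng.uniform(0.5, 2.0, size=50)
util = X @ np.array([1.0, -0.5, 0.2])
print(gp_select(X, costs, budget=10.0, reveal_utility=lambda i: util[i]))
```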

    Trip Prediction by Leveraging Trip Histories from Neighboring Users

    Full text link
    We propose a novel approach to trip prediction based on analyzing users' trip histories. We augment each user's own trip history by adding 'similar' trips from other users, which can be informative for predicting that user's future trips. This also helps cope with noisy or sparse trip histories, where the self-history alone does not support reliable prediction of future trips. We show empirical evidence that enriching users' trip histories with additional trips reduces the prediction error by 15%-40%, evaluated on multiple subsets of the Nancy2012 dataset. This real-world dataset was collected from public transportation ticket validations in the city of Nancy, France. Our prediction tool is a central component of a trip simulator system designed to analyze the functionality of public transportation in the city of Nancy.
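    The neighbour-based augmentation can be illustrated with a small sketch. The trip representation (origin, destination, hour), the similarity criterion, and the frequency-based predictor below are illustrative assumptions, not the paper's actual model.

```python
from collections import Counter

def similar(t1, t2):
    """Illustrative criterion: same origin/destination, start times
    within one hour of each other."""
    return (t1[0], t1[1]) == (t2[0], t2[1]) and abs(t1[2] - t2[2]) <= 1

def augment_history(self_trips, other_users):
    """Enrich a user's own history with 'similar' trips from other users."""
    enriched = list(self_trips)
    for trips in other_users:
        enriched += [t for t in trips if any(similar(t, s) for s in self_trips)]
    return enriched

def predict_next(enriched, hour):
    """Predict the most frequent (origin, destination) near this hour."""
    counts = Counter((o, d) for (o, d, h) in enriched if abs(h - hour) <= 1)
    return counts.most_common(1)[0][0] if counts else None

# Toy usage: trips are (origin, destination, hour-of-day) tuples.
self_trips = [("A", "B", 8), ("B", "A", 17)]
others = [[("A", "B", 9), ("C", "D", 12)], [("B", "A", 17)]]
print(predict_next(augment_history(self_trips, others), hour=8))  # ('A', 'B')
```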

    Fake News Detection in Social Networks via Crowd Signals

    Full text link
    Our work considers leveraging crowd signals for detecting fake news and is motivated by tools recently introduced by Facebook that enable users to flag fake news. By aggregating users' flags, our goal is to select a small subset of news every day, send it to an expert (e.g., via a third-party fact-checking organization), and stop the spread of news identified as fake by the expert. The main objective of our work is to minimize the spread of misinformation by stopping the propagation of fake news in the network. Achieving this objective is especially challenging, as it requires detecting fake news with high confidence as quickly as possible. We show that in order to leverage users' flags efficiently, it is crucial to learn about users' flagging accuracy. We develop a novel algorithm, DETECTIVE, that performs Bayesian inference for detecting fake news and jointly learns about users' flagging accuracy over time. Our algorithm employs posterior sampling to actively trade off exploitation (selecting news that maximize the objective value at a given epoch) and exploration (selecting news that maximize the value of information toward learning about users' flagging accuracy). We demonstrate the effectiveness of our approach via extensive experiments and show the power of leveraging community signals for fake news detection.
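    As a hedged sketch of the posterior-sampling idea described above, the snippet below maintains Beta posteriors over each user's flagging accuracy, samples accuracies to score flagged news (Thompson sampling), and updates the posteriors after expert verdicts. The Beta-Bernoulli model, the additive scoring rule and all names are simplifying assumptions; DETECTIVE's actual model is specified in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

class FlagAggregator:
    """Beta posteriors over per-user flagging accuracy, used to rank
    flagged news items via posterior (Thompson) sampling."""

    def __init__(self, n_users):
        self.a = np.ones(n_users)  # pseudo-counts: correct flags
        self.b = np.ones(n_users)  # pseudo-counts: false alarms

    def select(self, flaggers_per_news, k):
        """Pick k news items to send to the expert: sample accuracies
        from the posterior, score each item by its flaggers' summed
        sampled accuracy, and take the top k."""
        acc = rng.beta(self.a, self.b)
        scores = np.array([acc[u].sum() for u in flaggers_per_news])
        return list(np.argsort(-scores)[:k])

    def update(self, flaggers, expert_says_fake):
        """After the expert's verdict, update the flaggers' posteriors
        (assumes each user flagged the item at most once)."""
        if expert_says_fake:
            self.a[flaggers] += 1  # flags were correct
        else:
            self.b[flaggers] += 1  # flags were false alarms

# Toy usage: three flagged news items, one expert query per day.
agg = FlagAggregator(n_users=100)
flags = [np.array([1, 5, 7]), np.array([2, 5]), np.array([9])]
checked = agg.select(flags, k=1)[0]
agg.update(flags[checked], expert_says_fake=True)
```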

    Data Summarization with Social Contexts

    Get PDF
    While social data is widely used in applications such as sentiment analysis and trend prediction, its sheer size also presents great challenges for storing, sharing and processing such data. These challenges can be addressed by data summarization, which transforms the original dataset into a smaller, yet still useful, subset. Existing methods find such subsets using objective functions based on data properties such as representativeness or informativeness, but they do not exploit social contexts, which are distinct characteristics of social data. Further, to date very little work has focused on topic-preserving data summarization, despite the abundant work on topic modeling. This is a challenging task for two reasons. First, since topic models are based on latent variables, existing methods are not well suited to capturing latent topics. Second, it is difficult to find social contexts that provide valuable information for building an effective topic-preserving summarization model. To tackle these challenges, in this paper we focus on exploiting social contexts to summarize social data while preserving the topics of the original dataset. We take Twitter data as a case study. Through analyzing Twitter data, we discover two social contexts that are important for topic generation and dissemination, namely (i) the CrowdExp topic score, which captures the influence of both the crowd and expert users on Twitter, and (ii) the Retweet topic score, which captures the influence of Twitter users' actions. We conduct extensive experiments on two real-world Twitter datasets using two applications. The experimental results show that, by leveraging social contexts, our proposed solution can enhance topic-preserving data summarization and improve application performance by up to 18%.
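    The CrowdExp and Retweet topic scores are defined in the paper itself. As a loose illustration of topic-preserving summarization with a social-context bonus, the sketch below greedily picks tweets whose combined topic mixture stays close (in KL divergence) to the corpus mixture, plus a weighted social score. The KL objective, the weight lam and the stand-in social_score are our assumptions, not the paper's actual formulation.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two topic distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def summarize(topic_dists, social_score, k, lam=0.1):
    """Greedy topic-preserving summary: choose k tweets so the summary's
    average topic mixture stays close to the corpus mixture, with a
    bonus for socially influential tweets."""
    corpus = topic_dists.mean(axis=0)
    chosen = []
    for _ in range(k):
        best, best_val = None, -np.inf
        for i in range(len(topic_dists)):
            if i in chosen:
                continue
            mix = topic_dists[chosen + [i]].mean(axis=0)
            val = -kl(corpus, mix) + lam * social_score[i]
            if val > best_val:
                best, best_val = i, val
        chosen.append(best)
    return chosen

# Toy usage: 200 tweets over 5 latent topics (e.g. from LDA).
rng = np.random.default_rng(0)
topic_dists = rng.dirichlet(np.ones(5), size=200)
social_score = rng.uniform(size=200)  # stand-in for a social-context score
print(summarize(topic_dists, social_score, k=10))
```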

    Methodology for experimental planning and reduction of the number of experiments in active search problems

    Get PDF
    Many engineering problems involve optimizing an unknown objective function. Recently, active search has emerged as a powerful tool for problems of this nature, where evaluating the objective function is costly, whether computationally or experimentally. This thesis seeks to find an object (x) with an optimal value of a given property (y). However, directly determining this property of interest for all available objects may not be viable given the resources, workload and/or time required. We therefore propose an active machine learning approach, called active search, to find an optimal solution, using design of experiments for the initial search. Two regression techniques were applied within this method: k-nearest neighbours and Gaussian processes. Furthermore, a stopping criterion was defined for the Gaussian process regression technique to reduce the algorithm's processing time. The originality of the work lies in the proposed methodology: the use of experimental design, an active search algorithm based on regression techniques that converge quickly to a global optimum, and a statistically grounded stopping criterion for the algorithm. The studies were carried out with simulated data and with real data from pharmaceutical and agrochemical production and from applications in electrical microgrids. In all cases, active search reduced the number of experiments and simulations needed to obtain the property of interest, compared with traditional algorithms such as Optimal Experiment Design and Kennard-Stone.
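    The thesis details its own design-of-experiments initialization and statistical stopping rule; the following sketch shows one plausible version of such a loop, using scikit-learn's Gaussian process regressor and a probability-of-improvement stopping threshold. The random initial design, the acquisition rule and the threshold pi_stop are our assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def active_search(X_pool, evaluate, n_init=5, max_iter=50, pi_stop=0.01):
    """GP-based active search over a finite candidate pool, stopping
    when no candidate is likely to improve on the incumbent."""
    rng = np.random.default_rng(0)
    # Random initial design (stand-in for a design-of-experiments plan).
    idx = list(rng.choice(len(X_pool), n_init, replace=False))
    y = [evaluate(i) for i in idx]
    for _ in range(max_iter):
        gp = GaussianProcessRegressor(normalize_y=True).fit(X_pool[idx], y)
        mu, sd = gp.predict(X_pool, return_std=True)
        pi = norm.sf(max(y), loc=mu, scale=np.maximum(sd, 1e-9))
        pi[idx] = 0.0                 # already evaluated
        if pi.max() < pi_stop:        # statistical stopping criterion
            break
        i = int(np.argmax(pi))        # query the most promising candidate
        idx.append(i)
        y.append(evaluate(i))
    return idx[int(np.argmax(y))], max(y)

# Toy usage: maximize a noisy 1-D property over 200 candidate objects.
X_pool = np.linspace(-3, 3, 200).reshape(-1, 1)
prop = lambda i: float(np.exp(-X_pool[i, 0] ** 2) + 0.01 * np.random.randn())
print(active_search(X_pool, prop))
```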

    Automating Active Learning for Gaussian Processes

    Get PDF
    In many problems in science, technology, and engineering, unlabeled data is abundant but acquiring labeled observations is expensive -- it requires a human annotator, a costly laboratory experiment, or a time-consuming computer simulation. Active learning is a machine learning paradigm designed to minimize the cost of obtaining labeled data by carefully selecting which new data should be gathered next. However, considerable machine learning expertise is often required to apply these techniques effectively in their current form. In this dissertation, we propose solutions that further automate active learning. Our core contributions are active learning algorithms that are easy for non-experts to use but that deliver results competitive with or better than human-expert solutions. We begin by introducing a novel active search algorithm that automatically and dynamically balances exploration against exploitation, without relying on a parameter to control this tradeoff. We also provide a theoretical investigation of the hardness of this problem, proving that no polynomial-time policy can achieve a constant-factor approximation of the expected utility of the optimal policy. Next, we introduce a novel information-theoretic approach to active model selection, based on maximizing the mutual information between the output variable and the model class. This is the first active model selection approach that does not require updating each model for every candidate point. Using this method, we developed an automated audiometry test for rapid screening of noise-induced hearing loss, a widespread disability that is preventable if diagnosed early. We proceed by introducing a novel model selection algorithm for fixed-size datasets, called Bayesian optimization for model selection (BOMS). Our proposed model search method is based on Bayesian optimization in model space, where we reason about the model evidence as a function to be maximized. BOMS is capable of finding a model that explains the dataset well without any human assistance. Finally, we extend BOMS to active learning, creating a fully automatic active learning framework. We apply this framework to Bayesian optimization, creating a sample-efficient automated system for black-box optimization. Crucially, we account for uncertainty in the choice of model; our method uses multiple, carefully selected models to represent its current belief about the latent objective function. Our algorithms are completely general and can be extended to any class of probabilistic models. In this dissertation, however, we mainly use the powerful class of Gaussian process models to perform inference. Extensive experimental evidence is provided to demonstrate that all proposed algorithms outperform previously developed solutions to these problems.
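    BOMS searches a large, structured model space with Bayesian optimization; as a much simpler stand-in that illustrates the underlying objective (model evidence as the function to maximize), the sketch below brute-forces a small set of candidate GP kernels by optimized log marginal likelihood using scikit-learn. The candidate set and names are ours; BOMS's model space and search strategy are described in the dissertation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF, Matern, RationalQuadratic, DotProduct)

def select_model(X, y, kernels):
    """Score each candidate GP model by its optimized log marginal
    likelihood (a proxy for the model evidence that BOMS maximizes
    over model space) and return the best-scoring kernel."""
    scores = []
    for k in kernels:
        gp = GaussianProcessRegressor(kernel=k, normalize_y=True).fit(X, y)
        scores.append(gp.log_marginal_likelihood(gp.kernel_.theta))
    return kernels[int(np.argmax(scores))], scores

# Toy usage: pick the covariance structure that best explains the data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=40)
best, scores = select_model(X, y, [RBF(), Matern(nu=1.5),
                                   RationalQuadratic(), DotProduct()])
print(best, np.round(scores, 1))
```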