13 research outputs found
Active Learning with Logged Data
We consider active learning with logged data, where labeled examples are
drawn conditioned on a predetermined logging policy, and the goal is to learn a
classifier on the entire population, not just conditioned on the logging
policy. Prior work addresses this problem either when only logged data is
available, or purely in a controlled random experimentation setting where the
logged data is ignored. In this work, we combine both approaches to provide an
algorithm that uses logged data to bootstrap and inform experimentation, thus
achieving the best of both worlds. Our work is inspired by a connection between
controlled random experimentation and active learning, and modifies existing
disagreement-based active learning algorithms to exploit logged data. Comment: ICML 201
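One standard way to correct for a logging policy when estimating error on the whole population is inverse propensity weighting. The sketch below illustrates that general idea only; it is not the algorithm from the paper. The function name `ips_error`, the tuple layout `(x, y, observed, propensity)`, and the threshold classifier are all hypothetical choices made for this example, and the propensities are assumed to be known and strictly positive.

```python
def ips_error(examples, classifier):
    """Inverse-propensity estimate of a classifier's population error.

    Each example is (x, y, observed, propensity), where `propensity` is the
    probability that the logging policy revealed the label of x.
    Unobserved examples contribute zero; observed ones are up-weighted.
    """
    total = 0.0
    for x, y, observed, propensity in examples:
        if observed:
            # A mistake on a rarely logged example counts for more.
            total += (classifier(x) != y) / propensity
    return total / len(examples)


def threshold(x):
    """Hypothetical classifier: predict 1 when x is at least 0.5."""
    return 1 if x >= 0.5 else 0


logged = [
    (0.2, 0, True, 0.5),      # label revealed, prediction correct
    (0.8, None, False, 0.5),  # label never revealed by the logging policy
    (0.6, 0, True, 0.5),      # label revealed, prediction wrong
    (0.4, 0, True, 1.0),      # label revealed, prediction correct
]
print(ips_error(logged, threshold))
```

The estimate is unbiased when the propensities are correct, but its variance grows as propensities shrink, which is one reason to complement logged data with actively queried labels.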
Active learning of driving scenario trajectories
Annotated driving scenario trajectories are crucial for the verification and validation of autonomous vehicles. However, annotating such trajectories based only on explicit rules (i.e., knowledge-based methods) may be prone to errors, such as false positive or false negative classification of scenarios that lie on the border between two scenario classes, missed unknown scenario classes, or undetected anomalies. On the other hand, having annotators verify every label is not cost-efficient. Active learning (AL) could improve the annotation procedure by involving an annotator or expert efficiently. In this study, we develop a generic active learning framework to annotate driving trajectory time series data. We first compute an embedding of the trajectories into a latent space in order to capture the temporal nature of the data. Given such an embedding, the framework becomes task agnostic, since active learning can be performed using any classification method and any query strategy, regardless of the structure of the original time series data. Furthermore, we use our active learning framework to discover unknown driving scenario trajectories, so that previously unknown trajectory types can be detected and included in the labeled dataset. We evaluate the proposed framework in different settings on novel real-world datasets consisting of driving trajectories collected by Volvo Cars Corporation. We observe that active learning is an effective tool for labeling driving trajectories as well as for detecting unknown classes. As expected, the quality of the embedding plays an important role in the success of the proposed framework.
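Once trajectories are embedded as fixed-length vectors, a pool-based active learning loop can be written independently of the original time series. The following is a minimal sketch under stated assumptions, not the framework from the study: it assumes the embeddings are already computed, swaps in a nearest-centroid classifier with a distance-margin query strategy for simplicity, and needs at least two labeled classes to start. All names (`centroids`, `margin`, `active_learning`, `oracle`) are hypothetical.

```python
import math


def centroids(labeled):
    """Mean embedding per class, from (vector, label) pairs."""
    sums, counts = {}, {}
    for vec, label in labeled:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}


def margin(vec, cents):
    """Gap between the two nearest centroid distances; small means uncertain."""
    dists = sorted(math.dist(vec, c) for c in cents.values())
    return dists[1] - dists[0]


def active_learning(labeled, pool, oracle, budget):
    """Repeatedly query the label of the most ambiguous pool point."""
    labeled, pool = list(labeled), list(pool)
    for _ in range(min(budget, len(pool))):
        cents = centroids(labeled)
        query = min(pool, key=lambda v: margin(v, cents))
        pool.remove(query)
        labeled.append((query, oracle(query)))  # the annotator supplies the label
    return labeled


# Toy 2-D "embeddings"; the oracle stands in for a human annotator.
seed = [((0.0, 0.0), "a"), ((4.0, 0.0), "b")]
unlabeled = [(0.5, 0.0), (2.1, 0.0), (3.8, 0.0)]
result = active_learning(seed, unlabeled, lambda v: "a" if v[0] < 2 else "b", 1)
print(result[-1])
```

Because the loop only sees vectors, the classifier and the query strategy can be replaced without touching the embedding step.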
A Unified Active Learning Framework for Annotating Graph Data with Application to Software Source Code Performance Prediction
Most machine learning and data analytics applications, including performance
engineering in software systems, require a large number of annotations and
labelled data, which might not be available in advance. Acquiring annotations
often requires significant time, effort, and computational resources, making it
challenging. We develop a unified active learning framework, specializing in
software performance prediction, to address this task. We begin by parsing the
source code to an Abstract Syntax Tree (AST) and augmenting it with data and
control flow edges. Then, we convert the tree representation of the source code
to a Flow Augmented-AST graph (FA-AST) representation. Based on the graph
representation, we construct various graph embeddings (unsupervised and
supervised) into a latent space. Given such an embedding, the framework becomes
task agnostic since active learning can be performed using any regression
method and query strategy suited for regression. Within this framework, we
investigate the impact of using different levels of information for active and
passive learning, e.g., partially available labels and unlabeled test data. Our
approach aims to improve the investment in AI models for different software
performance predictions (execution time) based on the structure of the source
code. Our real-world experiments reveal that respectable performance can be
achieved by querying labels for only a small subset of all the data.
Model-Centric and Data-Centric Aspects of Active Learning for Neural Network Models
We study different data-centric and model-centric aspects of active learning
with neural network models. i) We investigate incremental and cumulative
training modes that specify how the currently labeled data are used for
training. ii) Neural networks are models with a large capacity. Thus, we study
how active learning depends on the number of epochs and neurons as well as the
choice of batch size. iii) We analyze in detail the behavior of query
strategies and their corresponding informativeness measures and accordingly
propose more efficient querying and active learning paradigms. iv) We perform
statistical analyses, e.g., on actively learned classes and test error
estimation, that reveal several insights about active learning.
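For readers unfamiliar with informativeness measures, three common ones computed from a model's predicted class probabilities are least confidence, margin, and entropy. The sketch below shows these textbook definitions and a simple batch selection; it is not taken from the paper, and the function names and the toy probabilities are hypothetical.

```python
import math


def least_confidence(probs):
    """One minus the probability of the most likely class."""
    return 1.0 - max(probs)


def margin_score(probs):
    """One minus the gap between the two most likely classes."""
    top, second = sorted(probs, reverse=True)[:2]
    return 1.0 - (top - second)


def entropy(probs):
    """Shannon entropy (in nats) of the predicted distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_batch(pool_probs, measure, batch_size):
    """Indices of the `batch_size` pool items scored most informative."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: measure(pool_probs[i]),
                    reverse=True)
    return ranked[:batch_size]


# Hypothetical predicted probabilities for three unlabeled items.
pool_probs = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
print(select_batch(pool_probs, entropy, 2))
```

For binary problems the three measures produce the same ranking; they can differ once there are more than two classes.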
Algorithms for Query-Efficient Active Learning
Recent decades have witnessed great success in machine learning, especially for tasks where large annotated datasets are available for training models. However, in many applications, raw data, such as images, are abundant, but annotations, such as descriptions of images, are scarce. Annotating data requires human effort and can be expensive. Consequently, one of the central problems in machine learning is how to train an accurate model with as few human annotations as possible. Active learning addresses this problem by having the annotator work together with the learner during the learning process. In active learning, a learner can sequentially select examples and ask the annotator for labels, so it may require fewer annotations if the learning algorithm avoids querying less informative examples. This dissertation focuses on designing provably query-efficient active learning algorithms. The main contributions are as follows. First, we study noise-tolerant active learning in the standard stream-based setting. We propose a computationally efficient algorithm for actively learning homogeneous halfspaces under bounded noise, and prove that it achieves nearly optimal label complexity. Second, we theoretically investigate a novel interactive model where the annotator can not only return noisy labels, but also abstain from labeling. We propose an algorithm that utilizes abstention responses, and analyze its statistical consistency and query complexity under different conditions on the noise and abstention rate. Finally, we study how to utilize auxiliary datasets in active learning. We consider a scenario where the learner has access to a logged observational dataset in which labeled examples are observed conditioned on a selection policy. We propose algorithms that effectively take advantage of both auxiliary datasets and active learning.
We prove that these algorithms are statistically consistent, and that they achieve a lower label requirement than alternative methods, both theoretically and empirically.
Active learning for treatment effects
Companies introduce new features to their websites by carefully testing different variations. In its most basic form, they do this by randomly splitting the population into groups and showing a different version to each group. After the experiment has finished, they can compare the performance of the versions and decide which one to keep. They decide this by evaluating the treatment effect, the value associated with the change. This product development process is called A/B testing and is widely used in many industries. However, A/B testing is not always the best approach. For example, when the treatment effect is small, the sample size required to detect the effect can be prohibitively large. Furthermore, when different costs are associated with the units, A/B tests can be suboptimal. This thesis focuses on the problem of active learning for treatment effects, where the goal is to learn the treatment effect of a new feature as quickly as possible by selecting the most informative units for treatment. The thesis consists of three studies. The first two introduce new algorithms for actively selecting units for experiments, while the third introduces a programming package written in Python, called Asbe, that helps researchers and practitioners develop and evaluate new active learning algorithms. The thesis contains several simulations on both simulated and real-world data.
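The claim that small treatment effects demand very large A/B tests can be made concrete with the textbook sample-size formula for a two-sided z-test comparing two group means with equal variance: n per group = 2 ((z_{1-alpha/2} + z_{power}) sigma / effect)^2. The sketch below evaluates that formula; it is an illustration of the standard calculation, not something taken from the thesis or from the Asbe package, and the function name and default values are assumptions.

```python
import math
from statistics import NormalDist


def samples_per_group(effect, sigma, alpha=0.05, power=0.8):
    """Units needed in each arm to detect a mean difference of `effect`.

    Assumes a two-sided z-test, equal group sizes, and a common known
    standard deviation `sigma` in both groups.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) * sigma / effect) ** 2)


# Shrinking the effect by a factor of 10 multiplies the requirement by about 100.
n_large_effect = samples_per_group(effect=0.1, sigma=1.0)
n_small_effect = samples_per_group(effect=0.01, sigma=1.0)
print(n_large_effect, n_small_effect)
```

The required sample size scales with the inverse square of the effect, which is the motivation for choosing which units to treat rather than sampling them uniformly at random.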