Active learning for fraud detection
Integrated master's dissertation in Informatics Engineering.
A problem that arises in many domains when preparing a machine learning (ML) model is label
scarcity. In various real world applications, somewhere in the loop of building a dataset, there is a
human expert manually annotating each dataset entry with the class label it belongs to. In fraud
detection, which is usually addressed as a supervised machine learning problem, having fraud
experts carefully reviewing every single transaction is often too expensive, so only a subset of them
can be manually annotated. The sub-field of ML known as active learning (AL) has emerged to
address this problem. AL implements policies that intelligently choose which instances should be
labeled by a human annotator in order to optimize the data labelling costs. The ultimate goal of this
procedure is to create a robust predictive model with as little data as possible [Settles (2009)].
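The pool-based AL loop described above can be sketched in a few lines. The acquisition function below is plain uncertainty sampling on a toy pool of predicted fraud probabilities; the probabilities and batch size are illustrative, not the dissertation's actual setup.

```python
import numpy as np

def uncertainty_sampling(probs, k):
    """Pick the k pool instances whose predicted P(fraud) is closest to 0.5,
    i.e. the instances the current model is least sure about."""
    return np.argsort(np.abs(probs - 0.5))[:k]

# Toy pool: predicted fraud probabilities from some current model.
probs = np.array([0.02, 0.48, 0.97, 0.55, 0.10, 0.51])
to_label = uncertainty_sampling(probs, 2)  # indices sent to the human annotator
```

In a full loop, the chosen instances would be labeled by an analyst, added to the training set, and the model retrained before the next query round.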
In this work, we present a detailed study of various proposed AL strategies by performing
experiments with real-world data. We focus primarily on the scenario where annotation starts
from day one, with no previous data to build historical user profiles and, hence, no labeled data. We
present evaluations of several new and already existing types of AL policies and AL configurations
through various sets of experiments. The analysis is performed in a streaming setup (as required by
the production systems under study) where transactions are processed in real-time.
Besides the choice of a policy, there are other parameters that must be chosen in our AL setup.
We conduct dedicated studies to assess the most suitable choices for several such parameters. These
studies include assessing the impact of the choice of data pre-processing methods
and of the ML model used in evaluations.
Since most AL policies proposed in the literature require that the pool of labeled instances contains
labels from all classes, the extreme class imbalance in the fraud detection domain can limit how fast
a fully supervised AL policy can start being used in the first iterations of an AL process. To address
this issue, we introduce a three-phase AL framework, which uses an intermediate stage policy that
does not resort to the label values but can still exploit the labeled pool. This improves the overall
performance of all policies used.
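The phase-switching logic of such a three-phase framework can be sketched as follows. The switching thresholds and the intermediate heuristic (here, distance to the labeled pool) are illustrative stand-ins; the dissertation does not specify these exact rules.

```python
import numpy as np

rng = np.random.default_rng(0)

def choose_phase(labels):
    """Decide the current AL phase from the labels gathered so far."""
    if len(labels) == 0:
        return "cold-start"        # phase 1: no labels yet, sample at random
    if len(set(labels)) < 2:
        return "unsupervised"      # phase 2: labeled pool exists, one class seen
    return "supervised"            # phase 3: both classes observed

def query(pool, labeled, labels, probs, k=1):
    phase = choose_phase(labels)
    if phase == "cold-start":
        return rng.choice(len(pool), size=k, replace=False)
    if phase == "unsupervised":
        # Exploit the structure of the labeled pool without reading label
        # values: pick the pool points farthest from any labeled point
        # (a diversity heuristic standing in for the actual policy).
        dists = np.linalg.norm(pool[:, None, :] - labeled[None, :, :], axis=-1)
        return np.argsort(dists.min(axis=1))[-k:]
    # Phase 3: classic uncertainty sampling on model probabilities.
    return np.argsort(np.abs(probs - 0.5))[:k]
```

The key property is that phase 2 only consumes the *identities* of labeled instances, so it works even before the first fraudulent transaction is found.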
Based on the hypothesis that two AL policies can be combined to produce one that outperforms
each part, we also develop and study several policy combination methods. We perform a comparison
on a large set of combinations that leads us to the conclusion that these do not increase performance
when compared to the individual policies in a three-phase setup.
Finally, we perform a set of large-scale experiments that cover several business cases for fraud
detection. The results support that AL is an appropriate solution for the banking and merchant
business cases, especially when using uncertainty sampling as the final policy. However, our study did
not demonstrate good results for a banking dataset with an extremely low fraud prevalence, nor for
a merchant acquirer dataset.
Learning Objective-Specific Active Learning Strategies with Attentive Neural Processes
Pool-based active learning (AL) is a promising technology for increasing
data-efficiency of machine learning models. However, surveys show that
performance of recent AL methods is very sensitive to the choice of dataset and
training setting, making them unsuitable for general application. To tackle
this problem, the field of Learning Active Learning (LAL) proposes learning
the active learning strategy itself, allowing it to adapt to the given setting.
In this work, we propose a novel LAL method for classification that exploits
symmetry and independence properties of the active learning problem with an
Attentive Conditional Neural Process model. Our approach is based on learning
from a myopic oracle, which gives our model the ability to adapt to
non-standard objectives, such as those that do not equally weight the error on
all data points. We experimentally verify that our Neural Process model
outperforms a variety of baselines in these settings. Finally, our experiments
show that our model exhibits a tendency towards improved stability to changing
datasets. However, performance is sensitive to the choice of classifier, and more
work is necessary to reduce the performance gap with the myopic oracle and
to improve scalability. We present our work as a proof-of-concept for LAL on
non-standard objectives and hope our analysis and modelling considerations
inspire future LAL work.
Comment: Accepted at ECML 202
How to Select Which Active Learning Strategy is Best Suited for Your Specific Problem and Budget
In Active Learning (AL), a learner actively chooses which unlabeled examples
to query for labels from an oracle, under some budget constraints. Different AL
query strategies are more suited to different problems and budgets. Therefore,
in practice, knowing in advance which AL strategy is best suited to the
problem at hand remains an open question. To tackle this challenge, we propose a
practical derivative-based method that dynamically identifies the best strategy
for each budget. We provide theoretical analysis of a simplified case to
motivate our approach and build intuition. We then introduce a method to
dynamically select an AL strategy based on the specific problem and budget.
Empirical results showcase the effectiveness of our approach across diverse
budgets and computer vision tasks.
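One simple reading of a "derivative-based" selector is to estimate, for each candidate strategy, the recent slope of its accuracy-versus-budget curve and switch to whichever strategy is improving fastest. The finite-difference estimator below is an illustrative interpretation, not the paper's exact method.

```python
import numpy as np

def select_strategy(histories, window=3):
    """Choose the AL strategy whose recent accuracy-vs-budget slope is largest.

    histories maps strategy name -> list of (budget, accuracy) points.
    A degree-1 least-squares fit over the last `window` points serves as a
    crude derivative estimate.
    """
    best_name, best_slope = None, -np.inf
    for name, points in histories.items():
        pts = sorted(points)[-window:]
        budgets = np.array([b for b, _ in pts], dtype=float)
        accs = np.array([a for _, a in pts], dtype=float)
        slope = np.polyfit(budgets, accs, 1)[0] if len(pts) > 1 else 0.0
        if slope > best_slope:
            best_name, best_slope = name, slope
    return best_name
```

With this rule, a strategy that plateaus early (e.g., random sampling) is abandoned in favor of one whose curve is still climbing at the current budget.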
Active Learning to Classify Macromolecular Structures in situ for Less Supervision in Cryo-Electron Tomography
Motivation: Cryo-Electron Tomography (cryo-ET) is a 3D bioimaging tool that
visualizes the structural and spatial organization of macromolecules at a
near-native state in single cells, which has broad applications in life
science. However, the systematic structural recognition and recovery of
macromolecules captured by cryo-ET are difficult due to high structural
complexity and imaging limits. Deep learning based subtomogram classification
has played a critical role in such tasks. As a supervised approach, however,
its performance relies on sufficient and laborious annotation of a large
training dataset.
Results: To alleviate this major labeling burden, we proposed a Hybrid Active
Learning (HAL) framework for querying subtomograms for labelling from a large
unlabeled subtomogram pool. Firstly, HAL adopts uncertainty sampling to select
the subtomograms that have the most uncertain predictions. Moreover, to
mitigate the sampling bias caused by such a strategy, a discriminator is
introduced to judge whether a given subtomogram is labeled or unlabeled, and
the model subsequently queries the subtomograms that are more likely
to be unlabeled. Additionally, HAL introduces a subset sampling strategy to
improve the diversity of the query set, so that the information overlap
between queried batches is decreased and algorithmic efficiency is
improved. Our experiments on subtomogram classification tasks using both
simulated and real data demonstrate that we can achieve comparable testing
performance (on average only a 3% accuracy drop) using less than 30% of the
labeled subtomograms, which is a very promising result for the subtomogram
classification task with limited labeling resources.
Comment: Statement on authorship changes: Dr. Eric Xing was an academic
advisor of Mr. Haohan Wang. Dr. Xing was not directly involved in this work
and has no direct interaction or collaboration with any other authors on this
work. Therefore, Dr. Xing is removed from the author list according to his
request. Mr. Zhenxi Zhu's affiliation is updated to his current affiliation.
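The HAL query step described above (uncertainty, weighted by the discriminator, over a diverse subset) can be sketched as a single acquisition function. The product weighting and uniform random subset below are assumptions for illustration, not the paper's exact formulation, and the binary-probability uncertainty stands in for the multi-class case.

```python
import numpy as np

def hal_query(probs, p_unlabeled, k, subset_size=None, rng=None):
    """Illustrative HAL-style acquisition: rank candidates by classifier
    uncertainty weighted by the discriminator's belief that the instance
    is unlabeled, optionally over a random subset for batch diversity."""
    rng = rng or np.random.default_rng(0)
    idx = np.arange(len(probs))
    if subset_size is not None and subset_size < len(idx):
        idx = rng.choice(len(idx), size=subset_size, replace=False)
    uncertainty = 1.0 - np.abs(2.0 * probs[idx] - 1.0)  # 1 at p=0.5, 0 at p in {0,1}
    score = uncertainty * p_unlabeled[idx]
    return idx[np.argsort(score)[-k:]]
```

Weighting by the discriminator's "unlabeled" probability pushes queries away from regions the labeled set already covers, which is the bias-mitigation role it plays in HAL.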
Parameter-Efficient Language Model Tuning with Active Learning in Low-Resource Settings
Pre-trained language models (PLMs) have ignited a surge in demand for
effective fine-tuning techniques, particularly in low-resource domains and
languages. Active learning (AL), a set of algorithms designed to decrease
labeling costs by minimizing label complexity, has shown promise in confronting
the labeling bottleneck. In parallel, adapter modules designed for
parameter-efficient fine-tuning (PEFT) have demonstrated notable potential in
low-resource settings. However, the interplay between AL and adapter-based PEFT
remains unexplored. We present an empirical study of PEFT behavior with AL in
low-resource settings for text classification tasks. Our findings affirm the
superiority of PEFT over full fine-tuning (FFT) in low-resource settings and
demonstrate that this advantage persists in AL setups. We further examine the
properties of PEFT and FFT through the lens of forgetting dynamics and
instance-level representations, where we find that PEFT yields more stable
representations of early and middle layers compared to FFT. Our research
underscores the synergistic potential of AL and PEFT in low-resource settings,
paving the way for advancements in efficient and effective fine-tuning.
Comment: Accepted at EMNLP 202