Effects of sampling skewness of the importance-weighted risk estimator on model selection
Importance-weighting is a popular and well-researched technique for dealing
with sample selection bias and covariate shift. It has desirable
characteristics such as unbiasedness, consistency and low computational
complexity. However, weighting can have a detrimental effect on an estimator as
well. In this work, we empirically show that the sampling distribution of an
importance-weighted estimator can be skewed. For sample selection bias
settings, and for small sample sizes, the importance-weighted risk estimator
produces overestimates for datasets in the body of the sampling distribution,
i.e. the majority of cases, and large underestimates for data sets in the tail
of the sampling distribution. These over- and underestimates of the risk lead
to suboptimal regularization parameters when used for importance-weighted
validation. Comment: Conference paper, 6 pages, 5 figures
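The estimator discussed in this abstract can be illustrated with a minimal sketch. It is not the paper's experimental setup: the Gaussian source and target densities, the squared stand-in loss, and every function name below are assumptions chosen for illustration. The sketch draws many small source samples, computes one importance-weighted risk estimate per sample, and measures the skewness of the resulting sampling distribution. The direction and size of the skew in this toy setting need not match what the paper reports.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def importance_weighted_risk(losses, weights):
    """Average of per-sample losses, each scaled by its importance weight."""
    return float(np.mean(weights * losses))


def simulate_estimates(n_samples=20, n_repeats=5000, seed=0):
    """Draw many small source samples and return one risk estimate per sample."""
    rng = np.random.default_rng(seed)
    # Assumed toy setting: source N(0, 1), target N(0.5, 1).
    x = rng.normal(0.0, 1.0, size=(n_repeats, n_samples))
    # Importance weight = target density / source density.
    weights = gaussian_pdf(x, 0.5, 1.0) / gaussian_pdf(x, 0.0, 1.0)
    losses = x ** 2  # stand-in loss; its expectation under the target is 1.25
    return np.mean(weights * losses, axis=1)


def sample_skewness(values):
    """Third standardized moment of a set of estimates."""
    centered = values - values.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)
```

Calling `sample_skewness(simulate_estimates())` gives a single number summarizing how asymmetric the estimates are; a value far from zero means the typical estimate and the average estimate differ, even though the estimator is unbiased.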
On Regularization Parameter Estimation under Covariate Shift
This paper identifies a problem with the usual procedure for
L2-regularization parameter estimation in a domain adaptation setting. In such
a setting, there are differences between the distributions generating the
training data (source domain) and the test data (target domain). The usual
cross-validation procedure requires validation data, which cannot be obtained
from the unlabeled target data. The problem is that if one decides to use
source validation data, the regularization parameter is underestimated. One
possible solution is to scale the source validation data through importance
weighting, but we show that this correction is not sufficient. We conclude the
paper with an empirical analysis of the effect of several importance weight
estimators on the estimation of the regularization parameter. Comment: 6 pages, 2 figures, 2 tables. Accepted to ICPR 201
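The procedure examined in this abstract, choosing an L2-regularization parameter from importance-weighted source validation data, might look roughly like the sketch below. It assumes ridge regression with a closed-form solution and importance weights that are already given; the function names and the candidate grid are hypothetical. Since the abstract reports that this correction is not sufficient, the sketch shows the procedure under study rather than a recommended fix.

```python
import numpy as np


def fit_ridge(X, y, lam):
    """Closed-form ridge regression: solve (X^T X + lam * I) theta = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


def weighted_validation_error(X_val, y_val, theta, weights):
    """Importance-weighted mean squared error on source validation data."""
    residuals = X_val @ theta - y_val
    return float(np.mean(weights * residuals ** 2))


def select_lambda(X_train, y_train, X_val, y_val, weights, lambdas):
    """Return the candidate with the lowest weighted validation error."""
    errors = [
        weighted_validation_error(X_val, y_val, fit_ridge(X_train, y_train, lam), weights)
        for lam in lambdas
    ]
    return lambdas[int(np.argmin(errors))], errors
```

Passing `weights = np.ones(len(y_val))` recovers ordinary source validation, which makes it easy to compare the parameter chosen with and without weighting.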
Transfer Learning Strategies for Credit Card Fraud Detection.
Credit card fraud jeopardizes the trust of customers in e-commerce transactions. In recent years, this has led to major advances in the design of automatic Fraud Detection Systems (FDSs) able to detect fraudulent transactions with short reaction time and high precision. Nevertheless, the heterogeneous nature of the fraud behavior makes it difficult to tailor existing systems to different contexts (e.g. new payment systems, different countries and/or population segments). Given the high cost (research, prototype development, and implementation in production) of designing data-driven FDSs, it is crucial for transactional companies to define procedures able to adapt existing pipelines to new challenges. From an AI/machine learning perspective, this is known as the problem of transfer learning. This paper discusses the design and implementation of transfer learning approaches for e-commerce credit card fraud detection and their assessment in a real setting. The case study, based on a six-month dataset (more than 200 million e-commerce transactions) provided by the industrial partner, relates to the transfer of detection models developed for a European country to another country. In particular, we present and discuss 15 transfer learning techniques (ranging from naive baselines to state-of-the-art and new approaches), making a critical and quantitative comparison in terms of precision for different transfer scenarios. Our contributions are twofold: (i) we show that the accuracy of many transfer methods is strongly dependent on the number of labeled samples in the target domain and (ii) we propose an ensemble solution to this problem based on self-supervised and semi-supervised domain adaptation classifiers. The thorough experimental assessment shows that this solution is both highly accurate and hardly sensitive to the number of labeled samples.
A review of domain adaptation without target labels
Domain adaptation has become a prominent problem setting in machine learning
and related fields. This review asks the question: how can a classifier learn
from a source domain and generalize to a target domain? We present a
categorization of approaches, divided into what we refer to as sample-based,
feature-based and inference-based methods. Sample-based methods focus on
weighting individual observations during training based on their importance to
the target domain. Feature-based methods revolve around mapping, projecting
and representing features such that a source classifier performs well on the
target domain, and inference-based methods incorporate adaptation into the
parameter estimation procedure, for instance through constraints on the
optimization procedure. Additionally, we review a number of conditions that
allow for formulating bounds on the cross-domain generalization error. Our
categorization highlights recurring ideas and raises questions important to
further research. Comment: 20 pages, 5 figures
SPINEX: Similarity-based Predictions and Explainable Neighbors Exploration for Regression and Classification Tasks in Machine Learning
The field of machine learning (ML) has witnessed significant advancements in
recent years. However, many existing algorithms lack interpretability and
struggle with high-dimensional and imbalanced data. This paper proposes SPINEX,
a novel similarity-based interpretable neighbor exploration algorithm designed
to address these limitations. This algorithm combines ensemble learning and
feature interaction analysis to achieve accurate predictions and meaningful
insights by quantifying each feature's contribution to predictions and
identifying interactions between features, thereby enhancing the
interpretability of the algorithm. To evaluate the performance of SPINEX,
extensive experiments on 59 synthetic and real datasets were conducted for both
regression and classification tasks. The results demonstrate that SPINEX
achieves comparable performance and, in some scenarios, may outperform
commonly adopted ML algorithms. The same findings demonstrate the effectiveness
and competitiveness of SPINEX, making it a promising approach for various
real-world applications.
Stratified Learning: a general-purpose statistical method for improved learning under Covariate Shift
Covariate shift arises when the labelled training (source) data is not
representative of the unlabelled (target) data due to systematic differences in
the covariate distributions. A supervised model trained on the source data
subject to covariate shift may suffer from poor generalization on the target
data. We propose a novel, statistically principled and theoretically justified
method to improve learning under covariate shift conditions, based on
propensity score stratification, a well-established methodology in causal
inference. We show that the effects of covariate shift can be reduced or
altogether eliminated by conditioning on propensity scores. In practice, this
is achieved by fitting learners on subgroups ("strata") constructed by
partitioning the data based on the estimated propensity scores, leading to
balanced covariates and much-improved target prediction. We demonstrate the
effectiveness of our general-purpose method on contemporary research questions
in observational cosmology, and on additional benchmark examples, matching or
outperforming state-of-the-art importance weighting methods, widely studied in
the covariate shift literature. We obtain the best reported AUC (0.958) on the
updated "Supernovae photometric classification challenge" and improve upon
existing conditional density estimation of galaxy redshift from Sloan Digital Sky
Survey (SDSS) data.
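The stratification step described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: propensity scores are estimated here with a plain gradient-descent logistic regression for source membership, strata are cut at pooled quantiles of the scores, and the function names, step count, learning rate, and number of strata are all hypothetical choices. A learner would then be fitted separately on the source rows of each stratum and applied to the target rows in the same stratum.

```python
import numpy as np


def estimate_propensity(X_source, X_target, n_steps=500, lr=0.1):
    """Estimate P(source | x) with logistic regression fitted by gradient descent."""
    X = np.vstack([X_source, X_target])
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend an intercept column
    # Membership labels: 1 for source rows, 0 for target rows.
    s = np.concatenate([np.ones(len(X_source)), np.zeros(len(X_target))])
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta -= lr * X.T @ (p - s) / len(s)  # gradient of the mean log-loss
    scores = 1.0 / (1.0 + np.exp(-X @ beta))
    return scores[: len(X_source)], scores[len(X_source):]


def assign_strata(scores_source, scores_target, n_strata=5):
    """Partition both samples into strata using pooled propensity quantiles."""
    pooled = np.concatenate([scores_source, scores_target])
    # Interior quantile edges; n_strata - 1 cut points give n_strata groups.
    edges = np.quantile(pooled, np.linspace(0.0, 1.0, n_strata + 1)[1:-1])
    return np.digitize(scores_source, edges), np.digitize(scores_target, edges)
```

Stratum labels run from `0` to `n_strata - 1`. Within a stratum the source and target rows have similar propensity scores, which is the balancing property the abstract relies on.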