8,012 research outputs found
Learning Latent Representations of Bank Customers With The Variational Autoencoder
Learning data representations that reflect the customers' creditworthiness
can improve marketing campaigns, customer relationship management, data and
process management, or credit risk assessment in retail banks. In this
research, we adopt the Variational Autoencoder (VAE), which has the ability to
learn latent representations that contain useful information. We show that it
is possible to steer the latent representations in the latent space of the VAE
using the Weight of Evidence, forming a specific grouping of the data that
reflects the customers' creditworthiness. Our proposed method learns a latent
representation of the data, which shows a well-defined clustering structure
capturing the customers' creditworthiness. These clusters are well suited for
the aforementioned banks' activities. Further, our methodology generalizes to
new customers, captures high-dimensional and complex financial data, and scales
to large data sets.
Comment: arXiv admin note: substantial text overlap with arXiv:1806.0253
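The Weight of Evidence used to steer the latent space is a standard credit-scoring statistic. A minimal sketch of computing it per bin is below; the binning and the good/bad label convention here are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def weight_of_evidence(feature_bins, labels):
    """Per-bin Weight of Evidence: ln(P(bin | good) / P(bin | bad)).

    feature_bins: integer bin index per customer (assumed pre-binned)
    labels: 0 = good (non-default), 1 = bad (default) -- an assumed convention
    """
    good = labels == 0
    bad = labels == 1
    woe = {}
    for b in np.unique(feature_bins):
        in_bin = feature_bins == b
        p_good = in_bin[good].mean()  # share of good customers in this bin
        p_bad = in_bin[bad].mean()    # share of bad customers in this bin
        woe[b] = np.log(p_good / p_bad)
    return woe
```

Positive WoE marks bins where good customers are over-represented, which is the kind of creditworthiness signal the VAE's latent grouping is steered towards.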
Inferring Class Label Distribution of Training Data from Classifiers: An Accuracy-Augmented Meta-Classifier Attack
Property inference attacks against machine learning (ML) models aim to infer
properties of the training data that are unrelated to the primary task of the
model, and have so far been formulated as binary decision problems, i.e.,
whether or not the training data have a certain property. However, in
industrial and healthcare applications, the proportion of labels in the
training data is quite often also considered sensitive information. In this
paper we introduce a new type of property inference attack that, unlike the
binary decision problems in the literature, aims at inferring the class label distribution
of the training data from parameters of ML classifier models. We propose a
method based on \emph{shadow training} and a \emph{meta-classifier} trained on
the parameters of the shadow classifiers augmented with the accuracy of the
classifiers on auxiliary data. We evaluate the proposed approach for ML
classifiers with fully connected neural network architectures. We find that the
proposed \emph{meta-classifier} attack provides a maximum relative improvement
over the state of the art.
Comment: 12 pages, 2022 Trustworthy and Socially Responsible Machine Learning
(TSRML 2022) co-located with NeurIPS 202
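The shadow-training pipeline the abstract describes can be sketched end to end. Everything below is an illustrative stand-in, not the paper's implementation: the shadow models are small logistic regressions on synthetic data, and a linear meta-regressor plays the role of the meta-classifier mapping (parameters + accuracy) to the training-label proportion:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_shadow(p, n=300, steps=200, lr=0.5):
    """Train one shadow classifier (logistic regression via gradient descent)
    on synthetic data whose positive-label proportion is p; return its
    parameters augmented with its accuracy, as the meta-model's input."""
    y = (rng.random(n) < p).astype(float)
    X = rng.normal(loc=y[:, None], scale=1.0, size=(n, 2))
    Xb = np.column_stack([X, np.ones(n)])  # features + intercept column
    w = np.zeros(3)
    for _ in range(steps):
        z = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (z - y) / n
    acc = ((Xb @ w > 0).astype(float) == y).mean()
    return np.concatenate([w, [acc]])

# Shadow training: many shadow models with known label distributions.
props = rng.uniform(0.1, 0.9, size=150)
feats = np.stack([train_shadow(p) for p in props])

# Meta-model (a linear meta-regressor here, standing in for the paper's
# meta-classifier) fit by least squares on parameters + accuracy.
A = np.column_stack([feats, np.ones(len(feats))])
coef, *_ = np.linalg.lstsq(A, props, rcond=None)

def infer_proportion(shadow_feats):
    """Estimate the training-label proportion of an unseen target model."""
    return float(np.append(shadow_feats, 1.0) @ coef)
```

The signal exploited here is that a classifier's parameters (notably the intercept) and its accuracy both shift with the class prior of its training data.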
Using Feature Selection with Machine Learning for Generation of Insurance Insights
Insurance is a data-rich sector, hosting large volumes of customer data that are analysed to evaluate risk. Machine learning techniques are increasingly used in the effective management of insurance risk. Insurance datasets by their nature, however, are often of poor quality, with noisy subsets of data (or features). Choosing the right features of data is a significant pre-processing step in the creation of machine learning models. The inclusion of irrelevant and redundant features has been demonstrated to degrade the performance of learning models. In this article, we propose a framework for improving predictive machine learning techniques in the insurance sector via the selection of relevant features. The experimental results, based on five publicly available real insurance datasets, show the importance of applying feature selection to remove noisy features before applying machine learning techniques, allowing the algorithm to focus on influential features. An additional business benefit is the revelation of the most and least important features in the datasets. These insights can prove useful for decision making and strategy development in areas/business problems that are not limited to the direct target of the downstream algorithms. In our experiments, machine learning techniques based on the feature subsets suggested by feature selection algorithms outperformed the full feature set on a set of real insurance datasets. Specifically, subsets containing between 20% and 50% of the features in our five datasets improved downstream clustering and classification performance when compared to the whole datasets. This indicates the potential for feature selection in the insurance sector both to improve model performance and to highlight influential features for business insights.
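A minimal sketch of the filter-style feature selection described above, assuming a simple correlation-based ranking (the article's actual selection algorithms are not specified here; the data and scoring rule are illustrative):

```python
import numpy as np

def select_top_k_features(X, y, k):
    """Filter feature selection: rank features by absolute Pearson
    correlation with the target and keep the k strongest."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return np.argsort(-np.abs(corr))[:k]
```

Downstream models are then trained on `X[:, selected]` only, which is how noisy, irrelevant columns are kept away from the learner.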
Federated Learning for Tabular Data: Exploring Potential Risk to Privacy
Federated Learning (FL) has emerged as a potentially powerful privacy-preserving machine learning methodology, since it avoids exchanging data between participants and instead exchanges model parameters. FL has traditionally been applied to image, voice and similar data, but recently it has started to draw attention from domains, including financial services, where the data is predominantly tabular. However, the work on tabular data has not yet considered potential attacks, in particular attacks using Generative Adversarial Networks (GANs), which have been successfully applied to FL for non-tabular data. This paper is the first to explore leakage of private data in Federated Learning systems that process tabular data. We design a GAN-based attack model which can be deployed on a malicious client to reconstruct data and its properties from other participants. As a side effect of considering tabular data, we are able to statistically assess the efficacy of the attack, without relying on human observation as is done for FL on images. We implement our attack model in a recently developed generic FL software framework for tabular data processing. The experimental results demonstrate the effectiveness of the proposed attack model, suggesting that further research is required to counter GAN-based privacy attacks.
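The attack surface described above exists because FL shares parameters every round. A minimal federated-averaging sketch makes that concrete; the linear model, data, and three-client setup are assumptions for illustration, and the GAN attack itself is not implemented here:

```python
import numpy as np

rng = np.random.default_rng(1)

def local_update(w, X, y, lr=0.1, steps=20):
    """One client's local training: plain gradient steps on a linear model."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

# Three honest clients holding private tabular rows; only parameters move.
true_w = np.array([1.0, -2.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + 0.01 * rng.normal(size=50)))

w_global = np.zeros(2)
for _ in range(10):
    updates = [local_update(w_global, X, y) for X, y in clients]
    w_global = np.mean(updates, axis=0)  # FedAvg: server averages parameters
# A malicious participant receives w_global every round and can try to
# reconstruct other clients' data or its properties from these parameters,
# e.g. by training a GAN against them, as the paper's attack does.
```

Because the shared parameters converge towards a function of all clients' private data, each round leaks information that an attack model can exploit.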
Variable-Based Calibration for Machine Learning Classifiers
The deployment of machine learning classifiers in high-stakes domains
requires well-calibrated confidence scores for model predictions. In this paper
we introduce the notion of variable-based calibration to characterize
calibration properties of a model with respect to a variable of interest,
generalizing traditional score-based calibration and metrics such as expected
calibration error (ECE). In particular, we find that models with near-perfect
ECE can exhibit significant variable-based calibration error as a function of
features of the data. We demonstrate this phenomenon both theoretically and in
practice on multiple well-known datasets, and show that it can persist after
the application of existing recalibration methods. To mitigate this issue, we
propose strategies for detection, visualization, and quantification of
variable-based calibration error. We then examine the limitations of current
score-based recalibration methods and explore potential modifications. Finally,
we discuss the implications of these findings, emphasizing that an
understanding of calibration beyond simple aggregate measures is crucial for
endeavors such as fairness and model interpretability.
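The gap between aggregate and variable-based calibration can be sketched directly. Below, standard binned ECE is computed alongside a per-group calibration gap over quantile bins of a feature of interest; the grouping rule and group count are illustrative assumptions, not the paper's exact metric:

```python
import numpy as np

def ece(conf, correct, n_bins=10):
    """Expected calibration error: weighted gap between mean confidence
    and accuracy over equal-width confidence bins."""
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return err

def variable_based_ce(conf, correct, variable, n_groups=4):
    """Calibration gap within quantile groups of a feature of interest,
    in the spirit of variable-based calibration."""
    edges = np.quantile(variable, np.linspace(0, 1, n_groups + 1))
    groups = np.clip(np.searchsorted(edges, variable, side="right") - 1,
                     0, n_groups - 1)
    return [abs(conf[groups == g].mean() - correct[groups == g].mean())
            for g in range(n_groups)]
```

A model can be accurate 80% of the time with uniform 0.8 confidence (near-zero ECE) while being overconfident for low values of the variable and underconfident for high ones, which is exactly the phenomenon the abstract highlights.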