A very simple safe-Bayesian random forest
Random forests work by averaging the predictions of several de-correlated trees. We present a conceptually radical approach to generating a random forest: randomly sampling many trees from a prior distribution and then forming a weighted ensemble of their predictive probabilities. Our approach uses priors that allow decision trees to be sampled even before looking at the data, and a power likelihood that explores the space spanned by combinations of decision trees. While each tree performs Bayesian inference to compute its predictions, our aggregation procedure uses the power likelihood rather than the likelihood and is therefore, strictly speaking, not Bayesian. Nonetheless, we refer to it as a Bayesian random forest, but with built-in safety: it retains good predictive performance even if the underlying probabilistic model is wrong. We demonstrate empirically that our Safe-Bayesian random forest outperforms MCMC- or SMC-based Bayesian decision trees in terms of speed and accuracy, achieves performance competitive with entropy- or Gini-optimised random forests, and yet is very simple to construct.
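As a rough, hedged illustration of this recipe (not the authors' implementation), the Python sketch below samples shallow trees at random before seeing any labels, equips each tree with Bayesian leaf probabilities under a Dirichlet prior, and weights the trees by a power likelihood with exponent eta < 1; the tree prior, the function names, and the choice of eta are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def sample_random_tree(n_features, depth=3):
    # A "prior" over trees: split features and thresholds drawn without looking at the data.
    return [(rng.integers(n_features), rng.normal()) for _ in range(depth)]

def leaf_index(tree, x):
    # Route a sample to one of 2**depth leaves by following the random splits.
    idx = 0
    for feat, thr in tree:
        idx = 2 * idx + int(x[feat] > thr)
    return idx

def tree_predict_proba(tree, X_train, y_train, X, n_classes, alpha=1.0):
    # Bayesian per-leaf class probabilities under a symmetric Dirichlet(alpha) prior.
    counts = np.full((2 ** len(tree), n_classes), alpha)
    for x, y in zip(X_train, y_train):
        counts[leaf_index(tree, x), y] += 1
    probs = counts / counts.sum(axis=1, keepdims=True)
    return np.array([probs[leaf_index(tree, x)] for x in X])

def safe_bayes_forest(X_train, y_train, X_test, n_trees=100, eta=0.5):
    n_classes = int(y_train.max()) + 1
    preds, log_w = [], []
    for _ in range(n_trees):
        tree = sample_random_tree(X_train.shape[1])
        p_train = tree_predict_proba(tree, X_train, y_train, X_train, n_classes)
        # Power likelihood: the log-likelihood is scaled by eta before weighting the tree.
        log_w.append(eta * np.log(p_train[np.arange(len(y_train)), y_train]).sum())
        preds.append(tree_predict_proba(tree, X_train, y_train, X_test, n_classes))
    w = np.exp(np.array(log_w) - max(log_w))
    w /= w.sum()
    return np.tensordot(w, np.array(preds), axes=1)  # weighted ensemble of predictive probabilities

# Toy usage: random features, labels from a simple threshold rule.
X = rng.normal(size=(200, 4)); y = (X[:, 0] + X[:, 1] > 0).astype(int)
test_probs = safe_bayes_forest(X[:150], y[:150], X[150:])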
Algorithmic trust and regulation: Governance, ethics, legal, and social implications blueprint for Indonesia's central banking
Algorithm-driven financial systems significantly influence monetary stability and payment transactions. While these systems bring opportunities like automation and predictive analytics, they also raise ethical concerns, particularly biases embedded in historical data. Recognizing the critical role of governance, ethics, legal considerations, and social implications (GELSI), this study introduces a framework tailored for algorithmic systems in financial services, focusing on Indonesia's evolving regulatory environment. Using the Multiple Streams Approach (MSA) as our theoretical lens, we offer a framework that augments existing quantitative methodologies. Our study provides a nuanced, qualitative perspective on algorithmic trust and regulation. We proffer actionable strategies for the Central Bank of Indonesia (BI), emphasizing stringent data governance, system resilience, and cross-sector collaboration. Our findings highlight the critical importance of ethical guidelines and robust governmental policies in mitigating algorithmic risks. We combine theory and practical advice to show how to align problems, policies, and politics to create practical opportunities for algorithmic governance. This study contributes to the evolving discourse on responsible financial technology and recommends a balanced way to manage the challenges of innovation, regulation, and ethics in the age of algorithms.
Water physics aware semantic segmentation through texture-biased U-Net architectures
Reliably identifying water bodies is an important step towards automating the identification of potable water. This work therefore investigates water scene segmentation, with a focus on the physical properties of water that give it features distinguishing it from other scene elements. We thus propose a physics-aware water segmentation method in which we adapt both a U-Net and a MacUNet model so that they are biased towards texture information, by using a Gabor convolutional layer as the first layer combined with a mixture of average and maximum pooling layers in the encoder. To train the neural networks, a dataset of annotated water images was created, comprising water bodies under diverse light, atmospheric, geographic, and environmental conditions. We show that the physics-aware, texture-biased models result in effective water segmentation. We then test the texture-biased models on three standard aerial scene segmentation benchmarks and show that in all cases they outperform the standard U-Net and MacUNet models. We suggest this is because the new models are robust to light variations, meaning they can extract information from scenes affected by canopy light or shadows.
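A minimal PyTorch sketch of the texture-biased first stage described above, offered as an assumption-laden illustration rather than the authors' released model: the first convolution uses fixed Gabor kernels, and the pooling step averages the outputs of average and maximum pooling. The kernel size, orientations, and other parameters are guesses, and the block would sit in front of a standard U-Net or MacUNet encoder.

import math
import torch
import torch.nn as nn

def gabor_kernel(size=7, sigma=2.0, theta=0.0, lam=4.0, gamma=0.5):
    # Real part of a Gabor filter at orientation theta (parameters are illustrative).
    ax = torch.arange(size) - size // 2
    y, x = torch.meshgrid(ax, ax, indexing="ij")
    xr = x * math.cos(theta) + y * math.sin(theta)
    yr = -x * math.sin(theta) + y * math.cos(theta)
    return torch.exp(-(xr**2 + gamma**2 * yr**2) / (2 * sigma**2)) * torch.cos(2 * math.pi * xr / lam)

class GaborTextureStem(nn.Module):
    """First encoder block biased towards texture: fixed Gabor convolution + mixed pooling."""
    def __init__(self, in_ch=3, n_orientations=8, size=7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, in_ch * n_orientations, size,
                              padding=size // 2, groups=in_ch, bias=False)
        kernels = torch.stack([gabor_kernel(size, theta=k * math.pi / n_orientations)
                               for k in range(n_orientations)])
        weight = kernels.repeat(in_ch, 1, 1).unsqueeze(1)            # (in_ch*n_or, 1, k, k)
        self.conv.weight = nn.Parameter(weight, requires_grad=False)  # keep the Gabor filters fixed
        self.avg = nn.AvgPool2d(2)
        self.max = nn.MaxPool2d(2)

    def forward(self, x):
        f = torch.relu(self.conv(x))
        return 0.5 * (self.avg(f) + self.max(f))  # mixture of average and maximum pooling

# Example: plug the stem in front of any U-Net-style encoder.
stem = GaborTextureStem()
features = stem(torch.randn(1, 3, 128, 128))  # -> (1, 24, 64, 64)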
Are compressed language models less subgroup robust?
To reduce the inference cost of large language models, model compression is increasingly used to create smaller, scalable models. However, little is known about their robustness to minority subgroups defined by the labels and attributes of a dataset. In this paper, we investigate the effects of 18 different compression methods and settings on the subgroup robustness of BERT language models. We show that worst-group performance does not depend on model size alone, but also on the compression method used. Additionally, we find that model compression does not always worsen the performance on minority subgroups. Altogether, our analysis serves to further research into the subgroup robustness of model compression.
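As a concrete, purely illustrative reading of the evaluation, the helper below (ours, not the paper's code) groups test examples by (label, attribute) pairs and reports per-group and worst-group accuracy for a model's predictions; the toy arrays at the end are made up.

import numpy as np

def worst_group_accuracy(preds, labels, attributes):
    """Accuracy of the weakest subgroup defined by (label, attribute) pairs."""
    preds, labels, attributes = map(np.asarray, (preds, labels, attributes))
    groups = {}
    for y in np.unique(labels):
        for a in np.unique(attributes):
            mask = (labels == y) & (attributes == a)
            if mask.any():
                groups[(int(y), int(a))] = float((preds[mask] == labels[mask]).mean())
    return min(groups.values()), groups

# Toy example: overall accuracy can hide a weak subgroup.
preds  = [1, 1, 0, 0, 1, 0, 1, 1]
labels = [1, 1, 0, 0, 1, 1, 1, 0]
attrs  = [0, 0, 0, 0, 1, 1, 1, 1]
worst, per_group = worst_group_accuracy(preds, labels, attrs)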
Addressing attribute bias with adversarial support-matching
When trained on diverse labelled data, machine learning models have proven themselves to be a powerful tool in all facets of society. However, due to budget limitations, deliberate or non-deliberate censorship, and other problems during data collection, certain groups may be under-represented in the labelled training set. We investigate a scenario in which the absence of certain data is linked to the second level of a two-level hierarchy in the data. Inspired by the idea of protected attributes from algorithmic fairness, we consider generalised secondary "attributes" which subdivide the classes into smaller partitions. We refer to the partitions defined by the combination of an attribute and a class label, or the leaf nodes in the aforementioned hierarchy, as groups. To characterise the problem, we introduce the concept of classes with incomplete attribute support. The representational bias in the training set can give rise to spurious correlations between the classes and the attributes, which cause standard classification models to generalise poorly to unseen groups. To overcome this bias, we make use of an additional, diverse but unlabelled dataset, called the deployment set, to learn a representation that is invariant to the attributes. This is done by adversarially matching the support of the training and deployment sets in representation space, using a set discriminator that operates on sets, or bags, of samples. In order to learn the desired invariance, it is paramount that the bags are balanced by class; this is easily achieved for the training set, but requires semi-supervised clustering for the deployment set. We demonstrate the effectiveness of our method on several datasets and realisations of the problem.
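The condensed PyTorch sketch below illustrates the adversarial support-matching idea in broad strokes; it is not the authors' code, and the encoder, discriminator, optimisers, and bag construction are all stand-in assumptions. A discriminator acting on bag-level summaries tries to tell training bags from deployment bags, and the encoder is updated to fool it, pushing the supports of the two sets together in representation space.

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
set_discriminator = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

def bag_embedding(x_bag):
    # Permutation-invariant summary of a bag: mean of the encoded samples.
    return encoder(x_bag).mean(dim=0)

opt_enc = torch.optim.Adam(encoder.parameters(), lr=1e-3)
opt_disc = torch.optim.Adam(set_discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def adversarial_step(train_bag, deploy_bag):
    # 1) The discriminator tries to tell training bags (label 1) from deployment bags (label 0).
    logits = torch.stack([set_discriminator(bag_embedding(b)).squeeze()
                          for b in (train_bag, deploy_bag)])
    opt_disc.zero_grad()
    bce(logits, torch.tensor([1.0, 0.0])).backward()
    opt_disc.step()
    # 2) The encoder tries to fool it, matching the supports of the two sets.
    opt_enc.zero_grad()
    fool = bce(set_discriminator(bag_embedding(deploy_bag)).squeeze(), torch.tensor(1.0))
    fool.backward()
    opt_enc.step()

# Toy bags: in practice the training bag is balanced by class and the deployment
# bag is built with semi-supervised clustering, as described above.
adversarial_step(torch.randn(8, 32), torch.randn(8, 32))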
Okapi: generalising better by making statistical matches match
We propose Okapi, a simple, efficient, and general method for robust semi-supervised learning based on online statistical matching. Our method uses a nearest-neighbours-based matching procedure to generate cross-domain views for a consistency loss, while eliminating statistical outliers. In order to perform the online matching in a runtime- and memory-efficient way, we draw upon the self-supervised literature and combine a memory bank with a slow-moving momentum encoder. The consistency loss is applied within the feature space, rather than on the predictive distribution, making the method agnostic to both the modality and the task in question. We experiment on the WILDS 2.0 datasets (Sagawa et al., 2022), which significantly expand the range of modalities, applications, and shifts available for studying and benchmarking real-world unsupervised adaptation. Contrary to Sagawa et al. (2022), we show that it is in fact possible to leverage additional unlabelled data to improve upon empirical risk minimisation (ERM) results with the right method. Our method outperforms the baseline methods in terms of out-of-distribution (OOD) generalisation on the iWildCam (a multi-class classification task) and PovertyMap (a regression task) image datasets, as well as the CivilComments (a binary classification task) text dataset. Furthermore, from a qualitative perspective, we show that the matches obtained from the learned encoder are strongly semantically related. Code for our paper is publicly available at https://github.com/wearepal/okapi/
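An illustrative PyTorch sketch of this online matching loop, not the released code at the URL above: a slow-moving momentum encoder produces queries against a memory bank of past features, each sample is paired with its nearest neighbour in the bank, a crude similarity threshold stands in for the outlier filtering, and a consistency loss is applied in feature space. The network sizes, momentum value, and filtering rule are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32))
momentum_encoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32))
momentum_encoder.load_state_dict(encoder.state_dict())
memory_bank = F.normalize(torch.randn(4096, 32), dim=1)  # placeholder for features of past batches

@torch.no_grad()
def momentum_update(m=0.999):
    # Slow-moving copy of the encoder, as in self-supervised momentum methods.
    for p, p_m in zip(encoder.parameters(), momentum_encoder.parameters()):
        p_m.mul_(m).add_(p, alpha=1 - m)

def okapi_style_step(x, opt):
    z = F.normalize(encoder(x), dim=1)
    with torch.no_grad():
        z_m = F.normalize(momentum_encoder(x), dim=1)
        sims = z_m @ memory_bank.T                # cosine similarity to every bank entry
        best_sim, idx = sims.max(dim=1)           # nearest neighbour per sample
        keep = best_sim > best_sim.median()       # crude outlier filter (assumption)
    matches = memory_bank[idx]
    # Consistency loss in feature space between each sample and its matched view.
    loss = F.mse_loss(z[keep], matches[keep])
    opt.zero_grad(); loss.backward(); opt.step()
    momentum_update()
    return loss.item()

opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
okapi_style_step(torch.randn(16, 64), opt)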
Scalable Gaussian process structured prediction for grid factor graph applications
Structured prediction is an important and well-studied problem with many applications across machine learning. GPstruct is a recently proposed structured prediction model that offers appealing properties such as being kernelised, non-parametric, and supporting Bayesian inference (Bratières et al., 2013). The model places a Gaussian process prior over energy functions which describe relationships between input variables and structured output variables. However, the memory demand of GPstruct is quadratic in the number of latent variables, and training runtime scales cubically. This prevents GPstruct from being applied to problems involving grid factor graphs, which are prevalent in computer vision and spatial statistics applications. Here we explore a scalable approach to learning GPstruct models based on ensemble learning, with weak learners (predictors) trained on subsets of the latent variables and bootstrap data, which can easily be distributed. We present image segmentation experiments with 4M latent variables. Our method outperforms widely used conditional random field models trained with pseudo-likelihood. Moreover, in image segmentation problems it improves over recent state-of-the-art marginal optimisation methods in terms of predictive performance and uncertainty calibration. Finally, it generalises well across all training set sizes.
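A schematic NumPy sketch of the ensemble strategy, not GPstruct itself: each weak learner is fit on a bootstrap subset of per-pixel training data, each fit is independent and therefore trivially distributable, and the ensemble averages predictive marginals. The nearest-mean weak learner is a stand-in assumption; in the paper the weak learners are GP-based energy models over subsets of the latent variables.

import numpy as np

rng = np.random.default_rng(0)

class WeakLearner:
    """Placeholder per-subset predictor; a stand-in for a GP-based energy model."""
    def fit(self, X, y):
        self.means_ = np.array([X[y == c].mean(axis=0) for c in (0, 1)])
        return self
    def predict_proba(self, X):
        d = np.stack([np.linalg.norm(X - m, axis=1) for m in self.means_], axis=1)
        p = np.exp(-d)
        return p / p.sum(axis=1, keepdims=True)

def ensemble_predict(X, y, X_test, n_learners=10, subset_frac=0.2):
    probs = np.zeros((len(X_test), 2))
    for _ in range(n_learners):  # each iteration is independent, so it can be distributed
        subset = rng.choice(len(X), size=int(subset_frac * len(X)), replace=True)  # bootstrap subset
        probs += WeakLearner().fit(X[subset], y[subset]).predict_proba(X_test)
    return probs / n_learners    # averaged predictive marginals

# Toy per-pixel features and binary labels standing in for grid latent variables.
X = rng.normal(size=(1000, 5)); y = (X[:, 0] > 0).astype(int)
pred = ensemble_predict(X, y, rng.normal(size=(10, 5)))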
Privacy and accuracy implications of model complexity and integration in heterogeneous federated learning
Federated Learning (FL) has been proposed as a privacy-preserving solution for distributed machine learning, particularly in heterogeneous FL settings where clients have varying computational capabilities and thus train models with different complexities compared to the server’s model. However, FL is not without vulnerabilities: recent studies have shown that it is susceptible to membership inference attacks (MIA), which can compromise the privacy of client data. In this paper, we examine the intersection of these two aspects, heterogeneous FL and its privacy vulnerabilities, by focusing on the role of client model integration, the process through which the server integrates parameters from clients’ smaller models into its larger model. To better understand this process, we first propose a taxonomy that categorizes existing heterogeneous FL methods and enables the design of seven novel heterogeneous FL model integration strategies. Using the CIFAR-10, CIFAR-100, and FEMNIST vision datasets, we evaluate the privacy and accuracy trade-offs of these approaches under three types of MIAs. Our findings reveal significant differences in privacy leakage and performance depending on the integration method. Notably, introducing randomness in the model integration process enhances client privacy while maintaining competitive accuracy for both the clients and the server. This work sheds quantitative light on the privacy-accuracy implications of client model integration in heterogeneous FL settings, paving the way towards more secure and efficient FL systems.
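As a simplified illustration of why randomised integration can matter, the NumPy sketch below blends a client's smaller weight matrix into a randomly chosen sub-block of the corresponding server layer, so the client-to-server unit mapping changes from round to round; this is a toy strategy of our own, not one of the seven strategies evaluated in the paper, and the shapes and averaging rule are assumptions.

import numpy as np

rng = np.random.default_rng(0)

def integrate_client_layer(server_w, client_w):
    """Blend a client's smaller weight matrix into randomly selected rows/columns of the server's."""
    rows = rng.choice(server_w.shape[0], size=client_w.shape[0], replace=False)
    cols = rng.choice(server_w.shape[1], size=client_w.shape[1], replace=False)
    updated = server_w.copy()
    # Average the client update into the randomly selected sub-block.
    updated[np.ix_(rows, cols)] = 0.5 * (updated[np.ix_(rows, cols)] + client_w)
    return updated

server_layer = rng.normal(size=(64, 128))  # larger server-model layer
client_layer = rng.normal(size=(16, 32))   # smaller client-model layer
new_server_layer = integrate_client_layer(server_layer, client_layer)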