Fair Data Representation for Machine Learning at the Pareto Frontier
As machine learning powered decision making plays an increasingly important role in our daily lives, it is imperative to strive for fairness in the underlying data processing and algorithms. We propose a pre-processing algorithm for fair data representation via which L2-objective supervised learning algorithms yield an estimation of the Pareto frontier between prediction error and statistical disparity. In particular, the present work applies optimal positive definite affine transport maps to approach the post-processing Wasserstein barycenter characterization of optimal fair L2-objective supervised learning via a pre-processing data deformation. We call the resulting data the Wasserstein pseudo-barycenter. Furthermore, we show that the Wasserstein geodesics from the learning outcome marginals to the barycenter characterize the Pareto frontier between L2-loss and total Wasserstein distance among the learning outcome marginals. Thereby, an application of McCann interpolation generalizes the pseudo-barycenter to a family of data representations via which L2-objective supervised learning algorithms trace out the Pareto frontier. Numerical simulations underscore the advantages of the proposed data representation: (1) the pre-processing step composes with arbitrary L2-objective supervised learning methods and with unseen data; (2) the fair representation protects data privacy by preventing the training machine from direct or indirect access to the sensitive information in the data; (3) the optimal affine map enables efficient computation of fair supervised learning on high-dimensional data; (4) experimental results shed light on the fairness of L2-objective unsupervised learning via the proposed fair data representation.

Comment: 57 pages, 9 figures
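As a rough illustration of the affine-transport idea, the univariate Gaussian case admits a closed-form optimal map. The sketch below is a minimal, assumed setup (synthetic data, two groups, illustrative parameters) rather than the paper's general construction: each group's feature is pushed onto the 1-D Wasserstein-2 barycenter, and McCann interpolation traces the path between the original data and the fully fair representation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D feature for two sensitive groups with different distributions
# (group labels and parameters are illustrative assumptions).
x_a = rng.normal(0.0, 1.0, 500)
x_b = rng.normal(2.0, 2.0, 500)

m_a, s_a = x_a.mean(), x_a.std()
m_b, s_b = x_b.mean(), x_b.std()
# Wasserstein-2 barycenter of 1-D Gaussians: average the means and the stds.
m_bar, s_bar = 0.5 * (m_a + m_b), 0.5 * (s_a + s_b)

def transport(x, m, s, t):
    """McCann interpolation between the identity (t=0) and the optimal
    affine map onto the barycenter (t=1), univariate Gaussian case."""
    full = m_bar + (s_bar / s) * (x - m)  # closed-form optimal transport map
    return (1 - t) * x + t * full

# t=1 matches the group marginals exactly; 0 < t < 1 traces the trade-off
# between statistical disparity and distortion of the original data.
z_a = transport(x_a, m_a, s_a, 1.0)
z_b = transport(x_b, m_b, s_b, 1.0)
```

At t=1 the two transformed groups share the same mean and standard deviation, so a downstream L2-objective learner cannot distinguish them through this feature; intermediate t values sweep out the fairness/accuracy frontier described in the abstract.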
Semi-supervised learning and fairness-aware learning under class imbalance
With the advent of Web 2.0 and rapid technological advances, there is a plethora of data in every field; however, more data does not necessarily imply more information; rather, the quality of the data (its veracity) plays a key role. Data quality is a major issue, since machine learning algorithms rely solely on historical data to derive novel hypotheses. Data may contain noise, outliers, missing values and/or class labels, and skewed data distributions. The last case, the so-called class-imbalance problem, is quite old and still dramatically affects machine learning algorithms. Class imbalance causes classification models to learn one particular class (the majority) effectively while ignoring the other classes (the minority). In addition to this issue, machine learning models applied in domains of high societal impact have become biased towards groups of people or individuals who are not well represented in the data. Direct and indirect discriminatory behavior is prohibited by international laws; thus, there is an urgent need to mitigate discriminatory outcomes from machine learning algorithms.
In this thesis, we address the aforementioned issues and propose methods that tackle class imbalance and mitigate discriminatory outcomes in machine learning algorithms. As part of this thesis, we make the following contributions:
• Tackling class imbalance in semi-supervised learning: The class-imbalance problem is very often encountered in classification. There is a variety of methods that tackle this problem; however, there is a lack of methods that deal with class imbalance in the semi-supervised setting. We address this problem by employing data augmentation in the semi-supervised learning process in order to equalize class distributions. We show that semi-supervised learning coupled with data augmentation methods can overcome class-imbalance propagation and significantly outperform the standard semi-supervised annotation process.
• Mitigating unfairness in supervised models: Fairness in supervised learning has received a lot of attention in recent years. A growing body of pre-, in- and post-processing approaches has been proposed to mitigate algorithmic bias; however, these methods take the error rate as the performance measure of the machine learning algorithm, which causes high error rates on the under-represented class. To deal with this problem, we propose approaches that operate in the pre-, in- and post-processing layers while accounting for all classes. Our proposed methods outperform state-of-the-art methods in terms of performance while being able to mitigate unfair outcomes.
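The first contribution, oversampling inside a self-training loop so that each round trains on a balanced labelled set, can be sketched as follows. Everything here (random oversampling, logistic regression as the base learner, the 0.95 confidence threshold, and the synthetic two-cluster data) is an illustrative assumption, not the thesis's exact method:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Imbalanced labelled seed (40 majority vs 5 minority) plus an unlabelled pool.
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (5, 2))])
y = np.array([0] * 40 + [1] * 5)
X_unl = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])

def oversample(X, y, rng):
    """Random oversampling of the minority class to equalize class counts."""
    counts = np.bincount(y)
    need = counts.max() - counts.min()
    idx = rng.choice(np.flatnonzero(y == counts.argmin()), size=need, replace=True)
    return np.vstack([X, X[idx]]), np.concatenate([y, y[idx]])

# Self-training rounds: balance, fit, pseudo-label the most confident points.
for _ in range(3):
    if len(X_unl) == 0:
        break
    X_bal, y_bal = oversample(X, y, rng)
    clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    confident = clf.predict_proba(X_unl).max(axis=1) > 0.95
    if not confident.any():
        break
    X = np.vstack([X, X_unl[confident]])
    y = np.concatenate([y, clf.predict(X_unl[confident])])
    X_unl = X_unl[~confident]
```

Balancing before each round prevents the initial skew from propagating into the pseudo-labels, which is the failure mode the thesis identifies in the standard semi-supervised annotation process.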
Semantic Data Pre-Processing for Machine Learning Based Bankruptcy Prediction Computational Model
This paper studies a Bankruptcy Prediction Computational Model (BPCM model): a comprehensive methodology for evaluating companies' bankruptcy level, which combines the storing, structuring and pre-processing of raw financial data using semantic methods with machine learning analysis techniques. Raw financial data are interconnected, diverse, often potentially inconsistent, and open to duplication. The main goal of our research is to develop data pre-processing techniques in which ontologies play a central role. We show how ontologies are used to extract and integrate information from different sources, prepare data for further processing, and enable communication in natural language. Using an ontology, we give meaning to disparate and raw business data, build logical relationships between data in various formats and sources, and establish relevant context. Our Ontology of Bankruptcy Prediction (OBP Ontology), which provides a conceptual framework for companies' financial analysis, is built in the widely established Protégé environment. An OBP Ontology can be effectively described with a graph database. A graph database expands the capabilities of traditional databases by tackling the interconnected nature of economic data and providing graph-based structures to store information, allowing the effective selection of the most relevant input features for the machine learning algorithm. To create and manage the BPCM Graph Database (Graph DB), we use the Neo4j environment and the Neo4j query language, Cypher, to perform feature selection on the structured data. Selected key features are used for the Machine Learning Engine: a supervised MLP Neural Network with a sigmoid activation function. This component is programmed in Python. We illustrate the approach and the advantages of semantic data pre-processing by applying it to a representative use case.
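The final stage of the pipeline, a supervised MLP with sigmoid activation trained on the selected features, might look roughly like the sketch below. The feature names, synthetic data, and label rule are invented stand-ins for the ratios the paper would extract from the BPCM Graph DB via Cypher; only the model family (MLP, sigmoid activation, Python) comes from the abstract:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n = 400
# Hypothetical stand-ins for key financial ratios selected from the graph DB.
liquidity = rng.normal(1.5, 0.5, n)
leverage = rng.normal(0.5, 0.2, n)
# Synthetic bankruptcy label: high leverage relative to liquidity, plus noise.
bankrupt = (leverage - 0.5 * liquidity + rng.normal(0, 0.1, n) > -0.1).astype(int)

X = StandardScaler().fit_transform(np.column_stack([liquidity, leverage]))

# Supervised MLP with sigmoid ("logistic") activation, as named in the abstract.
clf = MLPClassifier(hidden_layer_sizes=(8,), activation="logistic",
                    max_iter=2000, random_state=0).fit(X, bankrupt)
acc = clf.score(X, bankrupt)
```

In the paper's pipeline the `X` matrix would instead be populated from Cypher query results over the Neo4j graph, with the semantic layer guaranteeing that the ratios are consistently defined across data sources.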
Group invariant machine learning by fundamental domain projections
We approach the well-studied problem of supervised group invariant and
equivariant machine learning from the point of view of geometric topology. We
propose a novel approach using a pre-processing step, which involves projecting
the input data into a geometric space which parametrises the orbits of the
symmetry group. This new data can then be the input for an arbitrary machine
learning model (neural network, random forest, support-vector machine etc).
We give an algorithm to compute the geometric projection, which is efficient
to implement, and we illustrate our approach on some example machine learning
problems (including the well-studied problem of predicting Hodge numbers of
CICY matrices), in each case finding an improvement in accuracy versus others
in the literature. The geometric topology viewpoint also allows us to give a
unified description of so-called intrinsic approaches to group equivariant
machine learning, which encompasses many other approaches in the literature.

Comment: 21 pages, 4 figures
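The simplest instance of the pre-processing step is the symmetric group permuting coordinates, where sorting is the projection onto a fundamental domain. The sketch below shows only this toy case; the paper's projections for richer group actions (e.g. on CICY matrices) are more involved:

```python
import numpy as np

# For S_n acting on R^n by permuting coordinates, the set of sorted vectors is
# a fundamental domain: sorting maps every orbit to its unique representative,
# so any downstream model trained on the projections is permutation-invariant.
def project_to_fundamental_domain(x):
    return np.sort(x)

x = np.array([3.0, 1.0, 2.0])
x_permuted = np.array([2.0, 3.0, 1.0])  # same S_3 orbit
assert np.array_equal(project_to_fundamental_domain(x),
                      project_to_fundamental_domain(x_permuted))
```

Because the projection happens before training, the downstream learner (neural network, random forest, SVM, etc.) needs no architectural modification, which is the point of the paper's pre-processing viewpoint.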
Annotated dataset creation through large language models for non-English medical NLP
Obtaining text datasets with semantic annotations is an effortful process, yet crucial for supervised training in natural language processing (NLP). In general, developing and applying new NLP pipelines in domain-specific contexts often requires custom-designed datasets to address NLP tasks in a supervised machine learning fashion. When operating in non-English languages for medical data processing, this exposes several minor and major interconnected problems, such as the lack of task-matching datasets as well as of task-specific pre-trained models. In our work, we suggest leveraging pre-trained large language models for training data acquisition in order to retrieve sufficiently large datasets for training smaller and more efficient models for use-case-specific tasks. To demonstrate the effectiveness of our approach, we create a custom dataset that we use to train a medical NER model for German texts, GPTNERMED, yet our method remains language-independent in principle. Our obtained dataset as well as our pre-trained models are publicly available at https://github.com/frankkramer-lab/GPTNERMED
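One practical step in such a pipeline is converting LLM output into token-level NER training data. The sketch below assumes an inline XML-style tag format in the prompt; this format, the tag name `med`, and the helper name are illustrative assumptions, not the GPTNERMED pipeline's actual output format:

```python
import re

def inline_to_bio(text):
    """Convert text with inline <tag>...</tag> entity spans (an assumed LLM
    output format) into parallel token and BIO-label lists for NER training."""
    tokens, labels = [], []
    for m in re.finditer(r"<(\w+)>(.*?)</\1>|(\S+)", text):
        if m.group(3):  # plain, untagged token
            tokens.append(m.group(3))
            labels.append("O")
        else:           # tagged entity span: first word B-, rest I-
            for i, word in enumerate(m.group(2).split()):
                tokens.append(word)
                labels.append(("B-" if i == 0 else "I-") + m.group(1).upper())
    return tokens, labels

toks, labs = inline_to_bio("Der Patient erhielt <med>200 mg Ibuprofen</med> taeglich")
```

Parsing the tags in post-processing keeps the prompt simple for the LLM while producing the standard BIO format most NER trainers expect.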