    Fair Data Representation for Machine Learning at the Pareto Frontier

    As machine-learning-powered decision making plays an increasingly important role in our daily lives, it is imperative to strive for fairness in the underlying data processing and algorithms. We propose a pre-processing algorithm for fair data representation via which L2-objective supervised learning algorithms result in an estimation of the Pareto frontier between prediction error and statistical disparity. In particular, the present work applies the optimal positive definite affine transport maps to approach the post-processing Wasserstein barycenter characterization of the optimal fair L2-objective supervised learning via a pre-processing data deformation. We call the resulting data the Wasserstein pseudo-barycenter. Furthermore, we show that the Wasserstein geodesics from the learning outcome marginals to the barycenter characterize the Pareto frontier between L2-loss and total Wasserstein distance among learning outcome marginals. Thereby, an application of McCann interpolation generalizes the pseudo-barycenter to a family of data representations via which L2-objective supervised learning algorithms result in the Pareto frontier. Numerical simulations underscore the advantages of the proposed data representation: (1) the pre-processing step is compositive with arbitrary L2-objective supervised learning methods and unseen data; (2) the fair representation protects data privacy by preventing the training machine from direct or indirect access to the sensitive information of the data; (3) the optimal affine map results in efficient computation of fair supervised learning on high-dimensional data; (4) experimental results shed light on the fairness of L2-objective unsupervised learning via the proposed fair data representation.
    Comment: 57 pages, 9 figures
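
The abstract above does not spell out its algorithm, so the following is only a minimal sketch of the two standard ingredients it names: the closed-form optimal affine transport map between two Gaussian distributions, and McCann interpolation between the identity and that map. The function names, the Gaussian assumption, and the example means and covariances are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sqrtm_psd(S):
    """Symmetric square root of a symmetric positive semi-definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def affine_transport(m1, S1, m2, S2):
    """Return (A, b) such that T(x) = A x + b is the optimal (L2-cost)
    map from N(m1, S1) to N(m2, S2). A is symmetric positive definite."""
    R = sqrtm_psd(S1)
    R_inv = np.linalg.inv(R)
    A = R_inv @ sqrtm_psd(R @ S2 @ R) @ R_inv
    b = m2 - A @ m1
    return A, b

def mccann(A, b, t):
    """McCann interpolation T_t = (1 - t) * Id + t * T, returned as (A_t, b_t)."""
    d = A.shape[0]
    return (1.0 - t) * np.eye(d) + t * A, t * b

# Hypothetical example parameters (not taken from the paper).
m1, S1 = np.array([0.0, 0.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
m2, S2 = np.array([1.0, -1.0]), np.array([[1.0, 0.2], [0.2, 3.0]])

A, b = affine_transport(m1, S1, m2, S2)
A_half, b_half = mccann(A, b, 0.5)  # halfway along the geodesic
```

Pushing the source covariance through the map recovers the target covariance (`A @ S1 @ A.T` equals `S2` up to rounding), and varying `t` from 0 to 1 traces the displacement interpolation between the two distributions.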

    Semi-supervised learning and fairness-aware learning under class imbalance

    With the advent of Web 2.0 and rapid technological advances, there is a plethora of data in every field; however, more data does not necessarily imply more information. Rather, the quality of the data (the veracity aspect) plays a key role. Data quality is a major issue, since machine learning algorithms rely solely on historical data to derive novel hypotheses. Data may contain noise, outliers, missing values and/or class labels, and skewed data distributions. The latter case, the so-called class-imbalance problem, is quite old and still dramatically affects machine learning algorithms. Class imbalance causes classification models to learn one particular class (the majority) effectively while ignoring other classes (the minority). Beyond this issue, machine learning models applied in domains of high societal impact have become biased towards groups of people or individuals who are not well represented in the data. Direct and indirect discriminatory behavior is prohibited by international laws; thus, there is an urgent need to mitigate discriminatory outcomes from machine learning algorithms. In this thesis, we address the aforementioned issues and propose methods that tackle class imbalance and mitigate discriminatory outcomes in machine learning algorithms. As part of this thesis, we make the following contributions:
    • Tackling class imbalance in semi-supervised learning: The class-imbalance problem is very often encountered in classification. There is a variety of methods that tackle this problem; however, there is a lack of methods that deal with class imbalance in semi-supervised learning. We address this problem by employing data augmentation in the semi-supervised learning process in order to equalize class distributions. We show that semi-supervised learning coupled with data augmentation methods can overcome class-imbalance propagation and significantly outperform the standard semi-supervised annotation process.
    • Mitigating unfairness in supervised models: Fairness in supervised learning has received a lot of attention over the last years. A growing body of pre-, in- and post-processing approaches has been proposed to mitigate algorithmic bias; however, these methods consider error rate as the performance measure of the machine learning algorithm, which causes high error rates on the under-represented class. To deal with this problem, we propose approaches that operate in the pre-, in- and post-processing layers while accounting for all classes. Our proposed methods outperform state-of-the-art methods in terms of performance while being able to mitigate unfair outcomes.
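
The abstract does not say which augmentation method the thesis uses to equalize class distributions. As a rough illustration of the general idea only, the sketch below swaps in plain random oversampling: minority-class rows are resampled with replacement until every class matches the majority count. The function name `oversample` and the toy data are assumptions made for this example.

```python
import numpy as np

def oversample(X, y, seed=0):
    """Randomly duplicate minority-class rows until all classes have as
    many samples as the largest class. A stand-in for the (unspecified)
    augmentation method in the abstract above."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = int(counts.max())
    selected = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        # Draw the missing number of samples with replacement.
        extra = rng.choice(members, size=target - int(n), replace=True)
        selected.append(np.concatenate([members, extra]))
    idx = np.concatenate(selected)
    return X[idx], y[idx]

# Toy imbalanced data set: 8 samples of class 0, 2 samples of class 1.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = oversample(X, y)
```

In a semi-supervised loop, a balancing step of this kind would be applied to the labeled pool before each round of pseudo-labeling, so that the imbalance is not propagated to the newly annotated samples.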

    Semantic Data Pre-Processing for Machine Learning Based Bankruptcy Prediction Computational Model

    This paper studies a Bankruptcy Prediction Computational Model (BPCM), a comprehensive methodology for evaluating companies' bankruptcy level, which combines storing, structuring and pre-processing of raw financial data using semantic methods with machine learning analysis techniques. Raw financial data are interconnected, diverse, often potentially inconsistent, and open to duplication. The main goal of our research is to develop data pre-processing techniques in which ontologies play a central role. We show how ontologies are used to extract and integrate information from different sources, prepare data for further processing, and enable communication in natural language. Using an ontology, we give meaning to disparate, raw business data, build logical relationships between data in various formats and sources, and establish relevant context. Our Ontology of Bankruptcy Prediction (OBP Ontology), which provides a conceptual framework for companies' financial analysis, is built in the widely established Protégé environment. An OBP Ontology can be effectively described with a graph database. A graph database expands the capabilities of traditional databases by tackling the interconnected nature of economic data and providing graph-based structures to store information, allowing the effective selection of the most relevant input features for the machine learning algorithm. To create and manage the BPCM Graph Database (Graph DB), we use the Neo4j environment and the Neo4j query language, Cypher, to perform feature selection on the structured data. The selected key features are used as input to the Machine Learning Engine, a supervised MLP neural network with a sigmoid activation function. This component is programmed in Python. We illustrate the approach and the advantages of semantic data pre-processing by applying it to a representative use case.

    Group invariant machine learning by fundamental domain projections

    We approach the well-studied problem of supervised group invariant and equivariant machine learning from the point of view of geometric topology. We propose a novel approach using a pre-processing step, which involves projecting the input data into a geometric space which parametrises the orbits of the symmetry group. This new data can then be the input for an arbitrary machine learning model (neural network, random forest, support-vector machine, etc.). We give an algorithm to compute the geometric projection, which is efficient to implement, and we illustrate our approach on some example machine learning problems (including the well-studied problem of predicting Hodge numbers of CICY matrices), in each case finding an improvement in accuracy versus others in the literature. The geometric topology viewpoint also allows us to give a unified description of so-called intrinsic approaches to group equivariant machine learning, which encompasses many other approaches in the literature.
    Comment: 21 pages, 4 figures
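
The abstract does not describe its projection algorithm, so the following is just the simplest textbook instance of the idea, offered as an assumption-laden sketch: when the symmetric group acts on a vector by permuting its coordinates, the set of vectors with non-decreasing entries is a fundamental domain, and sorting sends every vector to the representative of its orbit. The function name is hypothetical.

```python
import numpy as np

def project_to_fundamental_domain(x):
    """Map each vector to the unique sorted representative of its orbit
    under coordinate permutations (the symmetric group S_n)."""
    return np.sort(np.asarray(x), axis=-1)

# Two inputs in the same orbit (one is a permutation of the other).
a = np.array([3.0, 1.0, 2.0])
b = np.array([2.0, 3.0, 1.0])

pa = project_to_fundamental_domain(a)
pb = project_to_fundamental_domain(b)
```

Because `pa` and `pb` coincide, any model trained on the projected data is permutation invariant by construction, regardless of its architecture.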

    Annotated dataset creation through large language models for non-english medical NLP

    Obtaining text datasets with semantic annotations is an effortful process, yet crucial for supervised training in natural language processing (NLP). In general, developing and applying new NLP pipelines in domain-specific contexts often requires custom-designed datasets to address tasks in a supervised machine learning fashion. When operating in non-English languages for medical data processing, this exposes several minor and major interconnected problems, such as the lack of task-matching datasets as well as task-specific pre-trained models. In our work, we propose leveraging pre-trained large language models for training data acquisition in order to obtain sufficiently large datasets for training smaller and more efficient models for use-case-specific tasks. To demonstrate the effectiveness of our approach, we create a custom dataset that we use to train a medical NER model for German texts, GPTNERMED, yet our method remains language-independent in principle. Our obtained dataset as well as our pre-trained models are publicly available at https://github.com/frankkramer-lab/GPTNERMED