374 research outputs found

    Learning Fair Naive Bayes Classifiers by Discovering and Eliminating Discrimination Patterns

    As machine learning is increasingly used to make real-world decisions, recent research efforts aim to define and ensure fairness in algorithmic decision making. Existing methods often assume a fixed set of observable features to define individuals, but lack a discussion of what happens when some features are not observed at test time. In this paper, we study fairness of naive Bayes classifiers, which allow partial observations. In particular, we introduce the notion of a discrimination pattern, which refers to an individual receiving different classifications depending on whether some sensitive attributes are observed. A model is then considered fair if it exhibits no such pattern. We propose an algorithm to mine for discrimination patterns in a naive Bayes classifier, and show how to learn maximum-likelihood parameters subject to these fairness constraints. Our approach iteratively discovers and eliminates discrimination patterns until a fair model is learned. An empirical evaluation on three real-world datasets demonstrates that we can remove exponentially many discrimination patterns by adding only a small fraction of them as constraints.
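    The notion of a discrimination pattern can be illustrated with a toy example. The sketch below is not the paper's algorithm; the two-feature naive Bayes model and all its parameter values are made up for illustration. It shows an individual whose classification flips depending on whether the sensitive attribute S is observed:

```python
# Toy naive Bayes with one sensitive attribute S and one ordinary feature X.
# A "discrimination pattern" exists when the classification of the same
# individual flips depending on whether S is observed. All parameters here
# are fabricated for illustration.

p_y1 = 0.5
p_s1_given_y = {0: 0.2, 1: 0.7}   # P(S=1 | Y), the sensitive attribute
p_x1_given_y = {0: 0.4, 1: 0.6}   # P(X=1 | Y), an ordinary feature

def posterior_y1(x, s=None):
    """P(Y=1 | evidence); if s is None, S is unobserved and marginalized out
    (naive Bayes handles missing features by simply dropping their factors)."""
    scores = {}
    for y, prior in ((0, 1 - p_y1), (1, p_y1)):
        lik = p_x1_given_y[y] if x == 1 else 1 - p_x1_given_y[y]
        if s is not None:
            lik *= p_s1_given_y[y] if s == 1 else 1 - p_s1_given_y[y]
        scores[y] = prior * lik
    return scores[1] / (scores[0] + scores[1])

# The same individual (x=1), classified with and without observing S:
with_s = posterior_y1(x=1, s=0)   # S observed as 0
without_s = posterior_y1(x=1)     # S not observed
print(with_s >= 0.5, without_s >= 0.5)  # differing decisions = a pattern
```

    The paper's contribution is in efficiently finding all such patterns over the exponentially many partial observations and constraining parameter learning to eliminate them; the sketch only checks a single one.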

    Group Fairness by Probabilistic Modeling with Latent Fair Decisions

    Machine learning systems are increasingly being used to make impactful decisions such as loan applications and criminal justice risk assessments, and as such, ensuring the fairness of these systems is critical. This is often challenging because the labels in the data are biased. This paper studies learning fair probability distributions from biased data by explicitly modeling a latent variable that represents a hidden, unbiased label. In particular, we aim to achieve demographic parity by enforcing certain independencies in the learned model. We also show that group fairness guarantees are meaningful only if the distribution used to provide those guarantees indeed captures the real-world data. In order to closely model the data distribution, we employ probabilistic circuits, an expressive and tractable probabilistic model, and propose an algorithm to learn them from incomplete data. We evaluate our approach on a synthetic dataset in which observed labels indeed come from fair labels but with added bias, and demonstrate that the fair labels are successfully retrieved. Moreover, we show on real-world datasets that our approach not only models how the data was generated better than existing methods but also achieves competitive accuracy.
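    The demographic-parity criterion the abstract enforces can be sketched as a simple check on model outputs. The decisions, group labels, and tolerance below are fabricated for illustration; the paper enforces parity structurally in the model rather than post hoc:

```python
# Demographic parity: the predicted-positive rate should be (near-)equal
# across sensitive groups. Data below is synthetic.

def positive_rates(preds, groups):
    """Predicted-positive rate per sensitive group."""
    rates = {}
    for g in set(groups):
        members = [p for p, gg in zip(preds, groups) if gg == g]
        rates[g] = sum(members) / len(members)
    return rates

def satisfies_parity(preds, groups, tol=0.05):
    """tol is an illustrative slack, not a value from the paper."""
    r = positive_rates(preds, groups)
    return max(r.values()) - min(r.values()) <= tol

preds  = [1, 0, 1, 1, 0, 1, 0, 1]   # model decisions
groups = [0, 0, 0, 0, 1, 1, 1, 1]   # sensitive attribute
print(positive_rates(preds, groups))   # group 0: 0.75, group 1: 0.5
print(satisfies_parity(preds, groups)) # False
```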

    Hybrid of K-means clustering and naive Bayes classifier for predicting performance of an employee

    Predicting the future performance of an employee is a requirement for companies to succeed. Employees are an organization's main component, and its success or failure depends on their performance; consequently, decision-makers and managers in almost all types of companies have a strong interest in correctly identifying highly skilled employees when implementing plans. Management thus becomes involved in the success of these employees, particularly in guaranteeing that the right employee is assigned to the right job at the right time. Predictive analytics is a modern human-resource trend, and data mining plays a useful role in this field. To obtain a highly precise model, the proposed framework combines the K-Means clustering approach with Naïve Bayes (NB) classification for better results in processing employee performance data. Implemented in WEKA, it enables personnel professionals and decision-makers to predict and optimize their employees' performance. The data were taken from previous works and used as a test case to illustrate how combining the K-Means and Naïve Bayes algorithms increases the accuracy of employee performance prediction: compared with K-Means and Naïve Bayes used separately, the proposed framework increases prediction accuracy.
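    One plausible reading of such a hybrid (the paper's exact WEKA pipeline may differ, and all data below is synthetic) is to let 1-D k-means assign each record a cluster id, append it as an extra categorical feature, and train a count-based naive Bayes on the result:

```python
# Sketch of a k-means + naive Bayes hybrid on fabricated appraisal scores.
import math
import random
from collections import Counter, defaultdict

def kmeans_1d(values, k=2, iters=20, seed=0):
    """Plain Lloyd's algorithm on scalar values; returns cluster assignments."""
    centers = random.Random(seed).sample(values, k)
    assign = [0] * len(values)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: abs(v - centers[c]))
                  for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return assign

def train_nb(rows, labels):
    """Categorical naive Bayes with add-one smoothing; returns a predictor."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, y)][v] += 1
    def predict(row):
        def score(y):
            s = math.log(class_counts[y] / len(labels))
            for i, v in enumerate(row):
                c = value_counts[(i, y)]
                s += math.log((c[v] + 1) / (class_counts[y] + len(c) + 1))
            return s
        return max(class_counts, key=score)
    return predict

scores = [55, 60, 58, 90, 85, 92]                    # synthetic scores
perf = ["low", "low", "low", "high", "high", "high"]  # synthetic labels
clusters = kmeans_1d(scores)
rows = [(s // 10, c) for s, c in zip(scores, clusters)]  # bin + cluster id
predict = train_nb(rows, perf)
print(predict(rows[0]), predict(rows[3]))
```

    The design choice here, using the cluster id as an additional feature, is one common way to combine the two algorithms; training a separate NB model per cluster is another.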

    Achieving Causal Fairness in Machine Learning

    Fairness is a social norm and a legal requirement in today's society. Many laws and regulations (e.g., the Equal Credit Opportunity Act of 1974) have been established to prohibit discrimination and enforce fairness on several grounds, such as gender, age, sexual orientation, race, and religion, referred to as sensitive attributes. Nowadays machine learning algorithms are extensively applied to make important decisions in many real-world applications, e.g., employment, admission, and loans. Traditional machine learning algorithms aim to maximize predictive performance, e.g., accuracy. Consequently, certain groups may be treated unfairly when those algorithms are applied for decision-making. Therefore, it is an imperative task to develop fairness-aware machine learning algorithms such that the decisions made by them are not only accurate but also subject to fairness requirements. In the literature, machine learning researchers have proposed association-based fairness notions, e.g., statistical parity, disparate impact, and equality of opportunity, and developed respective discrimination mitigation approaches. However, these works did not consider that fairness should be treated as a causal relationship. Although it is well known that association does not imply causation, the gap between association and causation has not received sufficient attention from fairness researchers and stakeholders. The goal of this dissertation is to study fairness in machine learning, define appropriate fairness notions, and develop novel discrimination mitigation approaches from a causal perspective. Based on Pearl's structural causal model, we propose to formulate discrimination as causal effects of the sensitive attribute on the decision. We consider different types of causal effects to cope with different situations, including the path-specific effect for direct/indirect discrimination, the counterfactual effect for group/individual discrimination, and the path-specific counterfactual effect for general cases. In attempting to measure discrimination, unidentifiable situations pose an inevitable barrier to accurate causal inference. To address this challenge, we propose novel bounding methods to accurately estimate the strength of unidentifiable fairness notions, including path-specific fairness, counterfactual fairness, and path-specific counterfactual fairness. Based on this estimation of fairness, we develop novel and efficient algorithms for learning fair classification models. Besides classification, we also investigate discrimination issues in other machine learning scenarios, such as ranked data analysis.
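    The framing of discrimination as a causal effect of the sensitive attribute can be made concrete with a toy linear structural causal model. The model and its coefficients are fabricated for illustration and are not from the dissertation:

```python
# Toy SCM: sensitive attribute A affects the decision D directly and through
# a mediator X. Holding an individual's exogenous noise fixed and intervening
# on A gives the counterfactual decision. Coefficients are made up.

def scm(a, u_x, u_d):
    x = 0.5 * a + u_x             # mediator influenced by A (indirect path)
    d = 0.8 * x + 0.3 * a + u_d   # decision: direct A path plus mediated path
    return d

u_x, u_d = 0.1, 0.05              # this individual's exogenous noise terms
factual = scm(1, u_x, u_d)
counterfactual = scm(0, u_x, u_d)  # "had A been 0, all else equal"
effect = factual - counterfactual  # total causal effect of A for this person
print(round(effect, 3))  # 0.7 = 0.8*0.5 (indirect) + 0.3 (direct)
```

    In this linear toy case the effect decomposes exactly into the direct and path-specific (mediated) parts; the dissertation's bounding methods address the realistic setting where such effects are not identifiable from data.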

    The discriminant power of RNA features for pre-miRNA recognition

    Computational discovery of microRNAs (miRNA) is based on pre-determined sets of features from miRNA precursors (pre-miRNA). The feature sets used by current tools for pre-miRNA recognition differ in construction and dimension. Some feature sets are composed of sequence-structure patterns commonly found in pre-miRNAs, while others are a combination of more sophisticated RNA features. Current tools achieve similar predictive performance even though the feature sets used, and their computational cost, differ widely. In this work, we analyze the discriminant power of seven feature sets, which are used in six pre-miRNA prediction tools. The analysis is based on the classification performance achieved with these feature sets for the training algorithms used in these tools. We also evaluate feature discrimination through the F-score and feature importance in the induction of random forests. More diverse feature sets produce classifiers with significantly higher classification performance compared to feature sets composed only of sequence-structure patterns. However, small or non-significant differences were found among the estimated classification performances of classifiers induced using sets with diversified features, despite the wide differences in their dimension. Based on these results, we applied a feature selection method to reduce the cost of computing the feature set while maintaining discriminant power. The resulting lower-dimensional feature set achieves a sensitivity of 90% and a specificity of 95%, within 0.1% of the maximal values obtained with any feature set, yet it is 34x faster to compute, even compared to the computationally least expensive feature set from the literature that performs within 0.1% of those maximal values. Comment: Submitted to BMC Bioinformatics on October 25, 2013.
    The material to reproduce the main results from this paper can be downloaded from http://bioinformatics.rutgers.edu/Static/Software/discriminant.tar.g
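    The F-score used for feature discrimination can be sketched as follows. This is the common Fisher-style F-score for binary-class feature ranking; whether the paper uses exactly this variant is an assumption, and the one-feature data below is synthetic, not pre-miRNA features:

```python
# Fisher-style F-score: between-class separation of a feature's means,
# normalized by within-class variance. Higher = more discriminative.

def f_score(pos, neg):
    """F-score of one feature given its values in the two classes."""
    mean_all = (sum(pos) + sum(neg)) / (len(pos) + len(neg))
    mp, mn = sum(pos) / len(pos), sum(neg) / len(neg)
    var = lambda xs, m: sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return (((mp - mean_all) ** 2 + (mn - mean_all) ** 2)
            / (var(pos, mp) + var(neg, mn)))

# Feature A separates the classes well; feature B barely does.
a = f_score([0.9, 1.0, 1.1], [0.1, 0.0, -0.1])
b = f_score([0.5, 0.6, 0.4], [0.45, 0.55, 0.5])
print(a > b)  # True
```

    Filter-style selection then simply ranks all features by this score and keeps the top ones, which is what makes the reduced set cheap to compute.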

    Enhanced classification of network traffic data captured by intrusion prevention systems

    A common practice in modern computer networks is the deployment of Intrusion Prevention Systems (IPSs) to identify security threats. Such systems raise alerts on suspicious activities based on a predefined set of rules. These alerts almost always contain high percentages of false positives and false negatives, which may impede their usefulness: with many false alerts, the analysis of network traffic data becomes ineffective for decision makers, who normally require concise and, preferably, visual forms to base their decisions upon. Machine learning techniques can help extract useful information from large datasets, and, combined with visualisation, classification could provide a solution to false alerts and the text-based outputs of IPSs. This research developed two new classification techniques that outperformed traditional classification methods in accurately classifying computer network traffic captured by an IPS framework. Their main purpose was the effective identification of malicious network traffic, which was demonstrated via extensive experimental evaluation reported in this thesis. In addition, an enhancement of principal component analysis (PCA) was presented as part of this study; this enhancement proved to outperform classical PCA on classification of IPS data. One of the classification methods described in this thesis achieved accuracy values of 98.51% and 99.76% on two computer network traffic dataset settings, whereas the Class-balanced Similarity Based Instance Transfer Learning (CB-SBIT) algorithm achieves accuracy values of 93.56% and 96.25% respectively on the same settings; the proposed method therefore outperforms the state-of-the-art algorithm.
    As for the PCA enhancement, using its resulting principal components as inputs to classifiers leads to improved accuracy compared to classical PCA.
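    For reference, the classical-PCA baseline the enhancement is compared against can be sketched by extracting the leading principal component via power iteration on the covariance matrix (pure Python, 2-D synthetic points; the thesis's enhancement itself is not reproduced here):

```python
# Leading principal component of 2-D data via power iteration on the
# sample covariance matrix. Data is synthetic.

def first_pc(data, iters=100):
    """Return the (unit-length) first principal component of 2-D data."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    centered = [[x - m for x, m in zip(row, means)] for row in data]
    # 2x2 sample covariance matrix
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(2)]
           for i in range(2)]
    v = [1.0, 1.0]
    for _ in range(iters):  # power iteration converges to the top eigenvector
        w = [cov[0][0] * v[0] + cov[0][1] * v[1],
             cov[1][0] * v[0] + cov[1][1] * v[1]]
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = [w[0] / norm, w[1] / norm]
    return v

# Points lying noisily along y = x: the first PC should be near [0.71, 0.71].
pts = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1)]
pc = first_pc(pts)
print(round(abs(pc[0]), 2), round(abs(pc[1]), 2))
```

    Projecting records onto the top components and feeding the projections to a classifier is the standard PCA-preprocessing setup the thesis improves upon.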