
    Fairness and Interpretability in Machine Learning Models

    Machine learning has become increasingly prominent in our daily lives as the Information Age and the Fourth Industrial Revolution progress. Many machine learning systems are evaluated by how accurately they predict the outcomes recorded in existing historical datasets. In recent years we have observed how evaluating machine learning systems in this way allows decision-making systems to treat certain groups unfairly. Several authors have proposed methods to overcome this, including new metrics that incorporate measures of unfair treatment of individuals based on group affiliation, probabilistic graphical models that assume dataset labels are inherently unfair and use the data to infer the true fair labels, and tree-based methods that introduce new splitting criteria for fairness. We evaluated these methods on datasets commonly used in fairness research and examined whether the results claimed by the authors are reproducible. Additionally, we implemented new interpretability methods on top of the proposed methods to explain their behaviour more explicitly. We found that some of the models do not achieve their claimed results and do not learn behaviour that achieves fairness, while other models do achieve fairer predictions through affirmative action. This thesis shows that machine learning interpretability, together with new machine learning models and approaches, is necessary to achieve fairer decision-making systems.
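    The group-fairness metrics referenced above can be made concrete with a small sketch. The function below computes the statistical parity difference, one standard group-fairness measure (not necessarily the exact metric any of the surveyed authors propose); the predictions and group labels are invented toy data.

```python
import numpy as np

def statistical_parity_difference(y_pred, group):
    """Difference in positive-prediction rates between two groups.

    A value of 0 means both groups receive positive predictions at the
    same rate; larger magnitudes indicate more group-level disparity.
    """
    rate_a = y_pred[group == 0].mean()
    rate_b = y_pred[group == 1].mean()
    return rate_a - rate_b

# Invented example: binary predictions and a binary protected attribute.
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(statistical_parity_difference(y_pred, group))  # 0.75 - 0.25 = 0.5
```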

    Comparison of classification algorithms to predict outcomes of feedlot cattle identified and treated for Bovine Respiratory Disease

    Bovine respiratory disease (BRD) continues to be the primary cause of morbidity and mortality in feedyard cattle. Accurate identification of those animals that will not finish the production cycle normally following initial treatment for BRD would provide feedyard managers with opportunities to manage those animals more effectively. Our objectives were to assess the ability of different classification algorithms to accurately predict an individual calf’s outcome based on data available at first identification of and treatment for BRD, and to identify characteristics of calves for which predictive models performed well as gauged by accuracy. Data from 23 feedyards in multiple geographic locations within the U.S. from 2000 to 2009, representing over one million animals, were analyzed to identify animals clinically diagnosed with BRD and treated with an antimicrobial. These data were analyzed both as a single dataset and as multiple datasets based on individual feedyards, and partitioned into training, testing, and validation datasets. Classifiers were trained and optimized to identify calves that did not finish the production cycle with their cohort. Following classifier training, accuracy was evaluated using the validation data. Analysis was also done to identify sub-groups of calves within populations where classifiers performed better than on other sub-groups. Accuracy of individual classifiers varied by dataset. The accuracy of the best performing classifier by dataset ranged from a low of 63% in one dataset up to 95% in another. Sub-groups of calves were identified within some datasets where the accuracy of a classifier exceeded 98%; however, these accuracies must be interpreted in relation to the prevalence of the class of interest within those populations. We found that by pairing the correct classifier with the data available, accurate predictions could be made that would provide feedlot managers with valuable information.
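    The caveat about interpreting high accuracies against class prevalence can be illustrated with a short sketch. Assuming a hypothetical 2% positive rate (not a figure from the study), a trivial classifier that never flags a calf already reaches about 98% accuracy:

```python
import numpy as np

def accuracy_and_baseline(y_true, y_pred):
    """Classifier accuracy next to the majority-class baseline.

    With a rare positive class, always predicting the majority class
    already scores highly, so raw accuracy alone can mislead.
    """
    accuracy = np.mean(y_true == y_pred)
    prevalence = np.mean(y_true)                 # fraction of positives
    baseline = max(prevalence, 1 - prevalence)   # accuracy of majority vote
    return accuracy, baseline

# Hypothetical 2% of calves failing to finish with their cohort.
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.02).astype(int)
y_pred = np.zeros_like(y_true)   # a "classifier" that never flags a calf
acc, base = accuracy_and_baseline(y_true, y_pred)
print(f"accuracy={acc:.3f}  majority-class baseline={base:.3f}")
```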

    MULTIVALUED SUBSETS UNDER INFORMATION THEORY

    In the fields of finance, engineering, and the varied sciences, data mining and machine learning hold an eminent position in predictive analysis. Complex algorithms and adaptive decision models have contributed towards streamlining directed research as well as improving forecasting accuracy. Researchers in mathematics and computer science have made significant contributions to the development of this field. Classification-based modeling, which holds a significant position among the different rule-based algorithms, is one of the most widely used decision-making tools. The decision tree has a place of profound significance in classification-based modeling, and a number of heuristics have been developed over the years to prune the decision-making process. Some key benchmarks in the evolution of the decision tree can be attributed to researchers such as Quinlan (ID3 and C4.5) and Fayyad (GID3/3*, continuous value discretization). The most common heuristic applied in these trees is the entropy measure discussed under information theory by Shannon. Its current application, covered under the term 'Information Gain', is directed towards individual assessment of attribute-value sets. The proposed study examines the effects of combining attribute-value sets, aimed at improving the information gain. A couple of key applications have been tested and presented with statistical conclusions: the first is the application to the feature selection process, a key step in data mining, while the second is targeted towards the discretization of data. A search-based heuristic tool is applied to identify subsets sharing a better gain value than the ones presented in the GID approach.
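    As a minimal illustration of the entropy-based information gain discussed above, and of how merging attribute values into subsets changes it, consider the sketch below; the six-row dataset is invented, and the merge shown is just one candidate a search heuristic might evaluate.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a sequence of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Entropy reduction from partitioning labels by an attribute."""
    n = len(labels)
    partitions = {}
    for value, label in zip(attribute_values, labels):
        partitions.setdefault(value, []).append(label)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

# Invented six-row dataset: merging values 'b' and 'c' into one subset
# changes the gain, which is the effect the study investigates.
labels = ['+', '+', '-', '-', '+', '-']
attr   = ['a', 'a', 'b', 'b', 'c', 'c']
merged = ['a', 'a', 'bc', 'bc', 'bc', 'bc']
print(information_gain(labels, attr))    # ~0.667 bits
print(information_gain(labels, merged))  # ~0.459 bits: this merge hurts
```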

    A COMPARISON OF MACHINE LEARNING TECHNIQUES: E-MAIL SPAM FILTERING FROM COMBINED SWAHILI AND ENGLISH EMAIL MESSAGES

    Technology is changing faster now than it did ten to fifteen years ago. It changes the way people live and pushes them to adopt the latest devices to keep pace. From a communication perspective, the use of electronic mail (e-mail) is now unavoidable for people who want to communicate with friends, companies, or universities. This makes e-mail a prime target for spammers, hackers, and others who seek to benefit by sending spam messages. Reports show that more than 10 billion emails can be sent through the internet in a day, of which around 45% are spam; the amount is not constant and sometimes rises higher. This clearly indicates the magnitude of the problem and calls for more effort to reduce this volume and minimize the effects of spam messages. Various measures have been taken to eliminate the problem: people once relied on social methods, that is, legislative means of control, and now use technological methods, which are more effective and timely in catching spam because they analyze message content. In this paper we compare the performance of machine learning algorithms by testing an English-language dataset, a Swahili-language dataset, and the two datasets combined into one, and by comparing the results on the combined dataset with the Gmail classifier. The classifiers used are Naïve Bayes (NB), Sequential Minimal Optimization (SMO), and k-Nearest Neighbour (k-NN). On the combined dataset, the SMO classifier led the others with 98.60% accuracy, followed by the k-NN classifier with 97.20% and the Naïve Bayes classifier with 92.89%. From this result we conclude that the SMO classifier works best on a dataset combining English and Swahili. On the English dataset, SMO again led the other algorithms with 97.51% accuracy, followed by k-NN with 93.52% and Naïve Bayes with 87.78%. On the Swahili dataset, Naïve Bayes led with 99.12% accuracy, followed by SMO with 98.69% and k-NN with 98.47%.
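    A rough sketch of this kind of classifier comparison, using scikit-learn stand-ins rather than the WEKA implementations the paper used: MultinomialNB for Naïve Bayes, LinearSVC as a loose analogue of SMO (both train linear SVMs), and KNeighborsClassifier for k-NN. The eight bilingual messages are invented placeholders, not the paper's corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented placeholder messages mixing English and Swahili.
messages = [
    "win free money now", "you won a lottery prize",
    "cheap offer click here", "bonasi ya bure pata sasa",
    "meeting moved to monday", "please review the attached report",
    "karibu kwenye mkutano kesho", "tutaonana shuleni kesho asubuhi",
]
labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]

for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("Linear SVM (SMO analogue)", LinearSVC()),
                  ("k-NN", KNeighborsClassifier(n_neighbors=3))]:
    pipeline = make_pipeline(TfidfVectorizer(), clf)  # text -> TF-IDF -> model
    scores = cross_val_score(pipeline, messages, labels, cv=2)
    print(f"{name}: mean CV accuracy {scores.mean():.2f}")
```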

    Using Machine Learning and Graph Mining Approaches to Improve Software Requirements Quality: An Empirical Investigation

    Software development is prone to software faults due to the involvement of multiple stakeholders, especially during the fuzzy phases (requirements and design). Software inspections are commonly used in industry to detect and fix problems in requirements and design artifacts, thereby mitigating the propagation of faults to later phases, where the same faults are harder to find and fix. The output of an inspection process is a list of faults present in the software requirements specification (SRS) document. The artifact author must manually read through the reviews and differentiate between true faults and false positives before fixing them. The first goal of this research is to automate the detection of useful vs. non-useful reviews. Next, post-inspection, the requirements author has to manually extract key problematic topics from useful reviews and map them to individual requirements in the SRS to identify fault-prone requirements. The second goal of this research is to automate this mapping by employing keyphrase extraction (KPE) algorithms and semantic analysis (SA) approaches to identify fault-prone requirements. During fault fixation, the author has to manually verify the requirements that could have been impacted by a fix. The third goal of my research is to assist authors post-inspection with change impact analysis (CIA) during fault fixation, using natural language processing with semantic analysis and mining solutions from graph theory. The selection of quality inspectors during inspections is also pertinent to carrying out post-inspection tasks accurately. The fourth goal of this research is to identify skilled inspectors using various classification and feature selection approaches. The dissertation has led to the development of an automated solution that can identify useful reviews, help identify skilled inspectors, extract the most prominent topics/keyphrases from fault logs, and assist the RE author during post-inspection fault fixation.
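    One step of the described pipeline, mapping a review comment to its most similar requirement, can be sketched with plain TF-IDF cosine similarity. This is a generic bag-of-words stand-in for the dissertation's keyphrase-extraction and semantic-analysis approach, and the requirement and review texts are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented SRS requirements and one inspection review comment.
requirements = [
    "REQ-1: The system shall encrypt all stored user passwords.",
    "REQ-2: The system shall respond to search queries within 2 seconds.",
    "REQ-3: The system shall log every failed login attempt.",
]
review = "The response time requirement for queries is ambiguous."

# Vectorize requirements and review together so they share a vocabulary.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(requirements + [review])

# Similarity of the review (last row) against each requirement.
similarities = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
best = similarities.argmax()
print(f"Most likely fault-prone requirement: {requirements[best][:6]} "
      f"(similarity {similarities[best]:.2f})")
```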

    Comparative Analysis of Data Mining Tools and Classification Techniques using WEKA in Medical Bioinformatics

    The availability of huge amounts of data has created a great need for data mining techniques that can generate useful knowledge. In the present study we provide detailed information about data mining techniques, with a focus on classification as an important supervised learning technique. We also discuss the WEKA software as a tool of choice for performing classification analysis on different kinds of available data, and provide a detailed methodology to help a wide range of users utilize the software. The main features of WEKA are 49 data preprocessing tools, 76 classification/regression algorithms, 8 clustering algorithms, 3 algorithms for finding association rules, and 15 attribute/subset evaluators plus 10 search algorithms for feature selection. WEKA extracts useful information from data and enables a suitable algorithm for generating an accurate predictive model from it to be identified. Moreover, medical bioinformatics analyses have been performed to illustrate the usage of WEKA in the diagnosis of Leukemia. Keywords: Data mining, WEKA, Bioinformatics, Knowledge discovery, Gene Expression.
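    As a minimal sketch of driving WEKA from a script rather than its GUI, the snippet below invokes WEKA's standard command-line interface to run the J48 learner (WEKA's C4.5 implementation) with 10-fold cross-validation on an ARFF file; the weka.jar path and dataset name are placeholders for a local setup, not files from the study.

```python
import subprocess

WEKA_JAR = "/path/to/weka.jar"   # assumed local WEKA install
DATASET = "leukemia.arff"        # placeholder gene-expression dataset

result = subprocess.run(
    ["java", "-cp", WEKA_JAR,
     "weka.classifiers.trees.J48",   # C4.5 decision-tree learner
     "-t", DATASET,                  # training file
     "-x", "10"],                    # 10-fold cross-validation
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # WEKA prints accuracy and a confusion matrix
```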