690 research outputs found

    Genetic rule extraction optimizing Brier score

    Most highly accurate predictive modeling techniques produce opaque models. When comprehensible models are required, rule extraction is sometimes used to generate a transparent model based on the opaque one. Naturally, the extracted model should be as similar as possible to the opaque model. This criterion, called fidelity, is therefore a key part of the optimization function in most rule extraction algorithms. To the best of our knowledge, all existing rule extraction algorithms targeting fidelity use 0/1 fidelity, i.e., they maximize the number of identical classifications. In this paper, we suggest and evaluate a rule extraction algorithm utilizing a more informed fidelity criterion. More specifically, the novel algorithm, which is based on genetic programming, minimizes the difference in probability estimates between the extracted and the opaque models by using the generalized Brier score as the fitness function. Experimental results from 26 UCI data sets show that the suggested algorithm obtained considerably higher accuracy and significantly better AUC than both the exact same rule extraction algorithm maximizing 0/1 fidelity and the standard tree inducer J48. Somewhat surprisingly, rule extraction using the more informed fidelity metric normally resulted in less complex models, ensuring that the improved predictive performance was not achieved at the expense of comprehensibility.
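    The generalized Brier score fidelity criterion described above can be sketched in a few lines. This is an illustrative reconstruction (function name and data are invented for the example), not the authors' exact implementation:

```python
def generalized_brier_fidelity(p_opaque, p_extracted):
    """Mean squared difference between the class-probability estimates of
    the opaque model and the extracted model, averaged over instances.
    Lower is better; a GP rule extractor could use the negative of this
    value as its fitness."""
    total = 0.0
    for po, pe in zip(p_opaque, p_extracted):
        total += sum((a - b) ** 2 for a, b in zip(po, pe))
    return total / len(p_opaque)

# Opaque-model probability estimates for three instances (two classes each)
p_opaque = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
# The extracted rule model's estimates for the same instances
p_extracted = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
print(round(generalized_brier_fidelity(p_opaque, p_extracted), 6))  # → 0.04
```

    Unlike 0/1 fidelity, this criterion still rewards the extracted model for matching the opaque model's confidence, not just its hard classifications.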

    Hypothesis Testing with Classifier Systems

    This thesis presents a new machine learning algorithm, HCS, taking inspiration from Learning Classifier Systems, Decision Trees, and Statistical Hypothesis Testing, and aimed at providing clearly understandable models of medical datasets. Analysis of medical datasets has some specific requirements not always fulfilled by standard Machine Learning methods: in particular, heterogeneous and missing data must be tolerated, and the results should be easily interpretable. Moreover, the combination of two or more attributes often leads to non-linear effects not detectable for each attribute on its own. Although it has been designed specifically for medical datasets, HCS can be applied to a broad range of data types, making it suitable for many domains. We describe the details of the algorithm and test its effectiveness on five real-world datasets.
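    As an illustration of the hypothesis-testing ingredient, the association between a candidate rule and an outcome can be checked with a Pearson chi-square statistic on a 2x2 contingency table. This is a generic sketch of that idea, not HCS's actual test procedure:

```python
def chi2_stat(table):
    """Pearson chi-square statistic for a 2x2 contingency table
    (rows: rule fires / does not fire; columns: outcome present / absent).
    Larger values indicate a stronger rule-outcome association."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# The rule fires for 30 of 40 positive cases but only 10 of 40 negative ones
print(chi2_stat([[30, 10], [10, 30]]))  # → 20.0
```

    A rule whose statistic falls below the critical value for the chosen significance level would be rejected as not informative.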

    Machine Learning Approaches for Improving Prediction Performance of Structure-Activity Relationship Models

    In silico bioactivity prediction studies are designed to complement in vivo and in vitro efforts to assess the activity and properties of small molecules. In silico methods such as Quantitative Structure-Activity/Property Relationship (QSAR) are used to correlate the structure of a molecule to its biological property in drug design and toxicological studies. In this body of work, I started with two in-depth reviews into the application of machine learning based approaches and feature reduction methods to QSAR, and then investigated solutions to three common challenges faced in machine learning based QSAR studies. First, to improve the prediction accuracy of learning from imbalanced data, the Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms combined with bagging as an ensemble strategy were evaluated. The Friedman's aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that this method significantly outperformed other conventional methods, although SMOTEENN with bagging became less effective when the imbalance ratio (IR) exceeded a certain threshold (e.g., >40). Second, the ability to separate the few active compounds from the vast amounts of inactive ones is of great importance in computational toxicology. Deep neural networks (DNN) and random forest (RF), representing deep and shallow learning algorithms, respectively, were chosen to carry out structure-activity relationship-based chemical toxicity prediction. Results suggest that DNN significantly outperformed RF (p < 0.001, ANOVA) by 22-27% for four metrics (precision, recall, F-measure, and AUPRC) and by 11% for another (AUROC). Lastly, current features used for QSAR-based machine learning are often very sparse and limited by the logic and mathematical processes used to compute them. Transformer embedding features (TEF) were developed as new continuous vector descriptors/features using the latent-space embedding from a multi-head self-attention network. The significance of TEF as new descriptors was evaluated by applying them to tasks such as predictive modeling, clustering, and similarity search. An accuracy of 84% on the Ames mutagenicity test indicates that these new features have a correlation with biological activity. Overall, the findings in this study can be applied to improve the performance of machine learning based QSAR efforts for enhanced drug discovery and toxicology assessments.
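    The SMOTE oversampling step used in the strategy above interpolates synthetic minority samples between real ones. The study combines it with ENN cleaning and bagging; this toy version (function name and parameters are illustrative) shows only the interpolation idea:

```python
import random

def smote_like(minority, k=2, n_new=4, seed=0):
    """Toy SMOTE: pick a random minority point, then one of its k nearest
    minority neighbours (squared Euclidean distance), and emit a synthetic
    point at a random position on the segment between them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p != x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, x)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# A tiny minority class in 2-D feature space
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2)]
new_points = smote_like(minority)
```

    Because every synthetic point lies between two real minority points, the oversampled class stays inside the region the minority class already occupies; ENN would then remove borderline points that disagree with their neighbours.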

    Credit scoring using genetic programming

    Internship Report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics. Growing numbers of e-commerce orders lead to an increased need for risk management to prevent default in payment. Default in payment is the failure of a customer to settle a bill within 90 days of receipt. Frequently, credit scoring is employed to identify customers' default probability. Credit scoring has been widely studied, and many different methods in different fields of research have been proposed. The primary aim of this work is to develop a credit scoring model as a replacement for the pre-risk check of the e-commerce risk management system risk solution services (rss). The pre-risk check uses data from the order process and includes exclusion rules and a generic credit scoring model. The new model is supposed to work as a replacement for the whole pre-risk check and has to be able to work both on its own and in unison with the rss main risk check. An application of Genetic Programming to credit scoring is presented. The model is developed on a real-world data set provided by Arvato Financial Solutions. The data set contains order requests processed by rss. Results show that Genetic Programming outperforms the generic credit scoring model of the pre-risk check in both classification accuracy and profit. Compared with Logistic Regression, Support Vector Machines, and Boosted Trees, Genetic Programming achieved a similar classification accuracy. Furthermore, the Genetic Programming model can be used in combination with the rss main risk check to create a combined model with higher discriminatory power than either of its individual models.
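    Since the models above are compared on profit as well as accuracy, a profit-based evaluation of a scoring threshold might look like the following sketch. The margin and loss amounts are hypothetical placeholders, not Arvato's figures:

```python
def expected_profit(scores, defaulted, threshold, margin=10.0, loss=90.0):
    """Accept orders whose default score is below `threshold`: an accepted
    good order earns `margin`, an accepted defaulting order costs `loss`,
    and rejected orders contribute nothing."""
    profit = 0.0
    for s, d in zip(scores, defaulted):
        if s < threshold:
            profit += -loss if d else margin
    return profit

scores    = [0.05, 0.20, 0.70, 0.90]      # model's default scores
defaulted = [False, False, False, True]   # observed outcomes
print(expected_profit(scores, defaulted, threshold=0.5))  # → 20.0
```

    Sweeping the threshold over a validation set and keeping the most profitable one is a simple way to compare scoring models on profit rather than accuracy alone, since a rare default costs far more than a typical order earns.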

    Process improvement approaches for increasing the response of emergency departments against the Covid-19 pandemic: a systematic review

    The COVID-19 pandemic has strongly affected the dynamics of Emergency Departments (EDs) worldwide and has accentuated the need to tackle different operational inefficiencies that decrease the quality of care provided to infected patients. EDs continue to struggle against this outbreak by implementing strategies maximizing their performance within an uncertain healthcare environment. These efforts, however, have remained insufficient in view of the growing number of admissions and the increased severity of the coronavirus disease. Therefore, the primary aim of this paper is to review the literature on process improvement interventions focused on increasing the ED response to the current COVID-19 outbreak, in order to delineate future research lines based on the gaps detected in practice. To this end, we applied the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines to perform a review of the research papers published between December 2019 and April 2021, using the ISI Web of Science, Scopus, PubMed, IEEE, Google Scholar, and Science Direct databases. The articles were further classified taking into account the research domain, primary aim, journal, and publication year. A total of 65 papers disseminated in 51 journals were found to satisfy the inclusion criteria. Our review found that most applications have been directed towards predicting health outcomes in COVID-19 patients through machine learning and data analytics techniques. In the ongoing pandemic, healthcare decision makers are strongly recommended to integrate artificial intelligence techniques with approaches from the operations research (OR) and quality management domains to upgrade ED performance under social-economic restrictions.

    Addressing class imbalance for logistic regression

    The challenge of class imbalance arises in classification problems when the minority class is observed much less frequently than the majority class. This characteristic is endemic in many domains. Work by [Owen, 2007] has shown that, in a theoretical context related to infinite imbalance, logistic regression behaves such that all data in the rare class can be replaced by their mean vector while achieving the same coefficient estimates. Such results suggest that cluster structure within the minority class may be a specific problem in highly imbalanced logistic regression. In this thesis, we focus on highly imbalanced logistic regression and develop mitigation methods and diagnostic tools. Theoretically, we extend the [Owen, 2007] results to show that the phenomenon remains true for both weighted and penalized likelihood methods in the infinitely imbalanced regime, which suggests that these alternatives to plain logistic regression are not sufficient for the highly imbalanced setting. As a mitigation method, we propose a novel solution based on relabeling the minority class, which essentially assigns new labels to the minority class observations. Two algorithms (a genetic algorithm and an expectation-maximization algorithm) are formalized as tools for computing this relabeling. In simulation and real-data experiments, we show that plain logistic regression does not provide the best out-of-sample predictive performance, and that our relabeling approach, which can capture underlying structure in the minority class, is often superior. As diagnostic tools for detecting problematic highly imbalanced logistic regression, different hypothesis testing methods, along with a graphical tool, are proposed, based on the mathematical insights about highly imbalanced logistic regression. Simulation studies provide evidence that combining our diagnostic tools with the mitigation methods as a systematic strategy has the potential to alleviate the class imbalance problem in logistic regression.
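    The [Owen, 2007] mean-vector comparison can be set up with a toy gradient-descent logistic regression: in the infinitely imbalanced limit the fitted coefficients depend on the minority class only through its mean, so the relevant experiment fits once on the raw data and once with the minority points replaced by copies of their mean. The sketch below only sets up that comparison on small finite data; names and data are illustrative:

```python
import math

def fit_logistic(X, y, lr=0.1, epochs=3000):
    """Plain stochastic-gradient-descent logistic regression with intercept."""
    w = [0.0] * (len(X[0]) + 1)  # w[0] is the intercept
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi  # gradient of the log loss w.r.t. z
            w[0] -= lr * g
            for j, xj in enumerate(xi):
                w[j + 1] -= lr * g * xj
    return w

def predict(w, xi):
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1.0 / (1.0 + math.exp(-z))

# Imbalanced 1-D data: four majority (y=0) points, two minority (y=1) points
X = [(-2.0,), (-1.5,), (-1.0,), (-0.5,), (1.5,), (2.5,)]
y = [0, 0, 0, 0, 1, 1]
w_full = fit_logistic(X, y)

# Owen-style variant: minority points replaced by copies of their mean (2.0)
X_mean = [(-2.0,), (-1.5,), (-1.0,), (-0.5,), (2.0,), (2.0,)]
w_mean = fit_logistic(X_mean, y)
```

    On finite, mildly imbalanced data the two fits differ; the theoretical result says the difference vanishes as the imbalance ratio grows, which is exactly why within-minority structure is invisible to the model in that regime.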

    Democratizing machine learning

    Machine learning artifacts are increasingly embedded in society, often in the form of automated decision-making processes. One major reason for this, along with methodological improvements, is the increasing accessibility of data, but also of machine learning toolkits that enable access to machine learning methodology for non-experts. The core focus of this thesis is exactly this: democratizing access to machine learning in order to enable a wider audience to benefit from its potential. Contributions in this manuscript stem from several different areas within this broader field. A major section is dedicated to the field of automated machine learning (AutoML), with the goal of abstracting away the tedious task of obtaining an optimal predictive model for a given dataset. This process mostly consists of finding said optimal model, often through hyperparameter optimization, while the user in turn only selects the appropriate performance metric(s) and validates the resulting models. This process can be improved or sped up by learning from previous experiments. Three such methods are presented in this thesis: one with the goal of obtaining a fixed set of possible hyperparameter configurations that likely contain good solutions for any new dataset, and two using dataset characteristics to propose new configurations. The thesis furthermore presents a collection of required experiment metadata and shows how such metadata can be used for the development of, and as a test bed for, new hyperparameter optimization methods. The pervasion of models derived from ML in many aspects of society simultaneously calls for increased scrutiny with respect to how such models shape society and the eventual biases they exhibit. Therefore, this thesis presents an AutoML tool that allows incorporating fairness considerations into the search for an optimal model. This requirement for fairness simultaneously poses the question of whether we can reliably estimate a model's fairness, which is studied in a further contribution in this thesis. Since access to machine learning methods also heavily depends on access to software and toolboxes, several contributions in the form of software are part of this thesis. The mlr3pipelines R package allows for embedding models in so-called machine learning pipelines that include pre- and postprocessing steps often required in machine learning and AutoML. The mlr3fairness R package on the other hand enables users to audit models for potential biases as well as reduce those biases through different debiasing techniques. One such technique, multi-calibration, is published as a separate software package, mcboost.
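    The "fixed set of configurations" idea can be sketched as a greedy portfolio built from a configuration-by-dataset performance matrix collected in previous experiments. This is an illustrative sketch of the general idea, not the thesis's exact algorithm:

```python
def greedy_portfolio(perf, k):
    """Greedily choose k configurations from a performance matrix.
    perf[c][d] = score of configuration c on dataset d (higher is better).
    Each round adds the configuration that most improves the summed
    best-in-portfolio score across datasets."""
    chosen = []
    best = [float("-inf")] * len(perf[0])  # best score per dataset so far
    for _ in range(k):
        def coverage(c):
            # Total best-per-dataset score if configuration c were added
            return sum(max(perf[c][d], best[d]) for d in range(len(best)))
        c_star = max((c for c in range(len(perf)) if c not in chosen),
                     key=coverage)
        chosen.append(c_star)
        best = [max(perf[c_star][d], best[d]) for d in range(len(best))]
    return chosen

# Three candidate configurations evaluated on two datasets
print(greedy_portfolio([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]], 2))  # → [0, 1]
```

    The greedy choice naturally favours complementary configurations: once a specialist for one dataset family is in the portfolio, the next pick covers the datasets it handles poorly.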

    Benchmarking environmental machine-learning models: methodological progress and an application to forest health

    Geospatial machine learning is a versatile approach to analyzing environmental data and can help to better understand the interactions and current state of our environment. Because these algorithms learn patterns directly from the data, they can discover complex relationships that might be missed by other analysis methods. Modeling the interaction of creatures with their environment is referred to as ecological modeling, a subcategory of environmental modeling. A subfield of ecological modeling is species distribution modeling (SDM), which aims to understand the relation between the presence or absence of certain species and their environments. SDM is different from classical mapping/detection analysis: while the latter primarily aims for a visual representation of a species' spatial distribution, the former focuses on using the available data to build models and on interpreting them. Because no single best option exists to build such models, different settings need to be evaluated and compared against each other. When conducting such modeling comparisons, commonly referred to as benchmarking, care needs to be taken throughout the analysis steps to achieve meaningful and unbiased results. These steps comprise data preprocessing, model optimization, and performance assessment. While these general principles apply to any modeling analysis, their application in an environmental context often requires additional care with respect to data handling, possibly hidden underlying data effects, and model selection. To conduct all of these steps in a programmatic (and efficient) way, toolboxes in the form of programming modules or packages are needed. This work makes methodological contributions focused on efficient, machine-learning based analysis of environmental data. In addition, research software to generalize and simplify the described process has been created throughout this work.
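    One common precaution in the performance-assessment step of geospatial benchmarking (a technique the abstract alludes to with "hidden underlying data effects", though it does not name it) is to split observations into spatially disjoint blocks rather than random folds, so spatial autocorrelation does not leak between training and test sets. A minimal sketch; the block size and function names are illustrative:

```python
import math

def spatial_block_folds(coords, block_size=1.0):
    """Assign each (x, y) coordinate to a spatial block by flooring its
    coordinates on a regular grid; each distinct block becomes one
    cross-validation fold of observation indices."""
    blocks = {}
    for i, (x, y) in enumerate(coords):
        key = (math.floor(x / block_size), math.floor(y / block_size))
        blocks.setdefault(key, []).append(i)
    return list(blocks.values())

# Four observation locations; the first two share a grid cell
coords = [(0.2, 0.3), (0.8, 0.1), (1.5, 0.4), (1.9, 1.7)]
folds = spatial_block_folds(coords)
print(folds)  # → [[0, 1], [2], [3]]
```

    Holding out whole blocks forces the model to predict at locations far from any training point, which gives a more honest estimate of how it will perform on genuinely new areas.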

    Machine learning for predicting soil classes in three semi-arid landscapes

    Mapping the spatial distribution of soil taxonomic classes is important for informing soil use and management decisions. Digital soil mapping (DSM) can quantitatively predict the spatial distribution of soil taxonomic classes. Key components of DSM are the method and the set of environmental covariates used to predict soil classes. Machine learning is a general term for a broad set of statistical modeling techniques. Many different machine learning models have been applied in the literature and there are different approaches for selecting covariates for DSM. However, there is little guidance as to which, if any, machine learning model and covariate set might be optimal for predicting soil classes across different landscapes. Our objective was to compare multiple machine learning models and covariate sets for predicting soil taxonomic classes at three geographically distinct areas in the semi-arid western United States of America (southern New Mexico, southwestern Utah, and northeastern Wyoming). All three areas were the focus of digital soil mapping studies. Sampling sites at each study area were selected using conditioned Latin hypercube sampling (cLHS). We compared models that had been used in other DSM studies, including clustering algorithms, discriminant analysis, multinomial logistic regression, neural networks, tree based methods, and support vector machine classifiers. Tested machine learning models were divided into three groups based on model complexity: simple, moderate, and complex. We also compared environmental covariates derived from digital elevation models and Landsat imagery that were divided into three different sets: 1) covariates selected a priori by soil scientists familiar with each area and used as input into cLHS, 2) the covariates in set 1 plus 113 additional covariates, and 3) covariates selected using recursive feature elimination. 
    Overall, complex models were consistently more accurate than simple or moderately complex models. Random forests (RF) using covariates selected via recursive feature elimination was consistently the most accurate, or among the most accurate, of the classifiers tested within each study area. We therefore recommend that complex models and covariates selected by recursive feature elimination be used for soil taxonomic class prediction. Overall classification accuracy in each study area was largely dependent upon the number of soil taxonomic classes and the frequency distribution of pedon observations between taxonomic classes. Individual subgroup class accuracy was generally dependent upon the number of soil pedon observations in each taxonomic class. The number of soil classes is related to the inherent variability of a given area, while the imbalance of soil pedon observations between classes is likely related to cLHS. Imbalanced frequency distributions of soil pedon observations between classes must be addressed to improve model accuracy; solutions include increasing the number of soil pedon observations in classes with few observations or decreasing the number of classes. Spatial predictions using the most accurate models generally agree with expected soil-landscape relationships. Spatial prediction uncertainty was lowest in areas of relatively low relief in each study area.
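    Recursive feature elimination, the covariate-selection approach that performed best here, can be sketched generically: repeatedly rank features by importance and drop the weakest until the desired number remains. The importance function below is a toy stand-in for the model-derived importances (e.g. from a refit random forest) used in the study:

```python
def recursive_feature_elimination(features, importance, n_keep):
    """Iteratively remove the feature with the lowest importance until only
    n_keep features remain. `importance` maps the current feature subset to
    a {feature: score} dict, mimicking a model refit in each round."""
    selected = list(features)
    while len(selected) > n_keep:
        scores = importance(selected)
        weakest = min(selected, key=lambda f: scores[f])
        selected.remove(weakest)
    return selected

# Toy importances standing in for a fitted model's covariate importances
fixed = {"slope": 0.9, "ndvi": 0.7, "aspect": 0.2, "band7": 0.4}
result = recursive_feature_elimination(
    fixed, lambda sel: {f: fixed[f] for f in sel}, 2)
print(result)  # → ['slope', 'ndvi']
```

    Refitting the model at each round (rather than ranking once) matters when covariates are correlated, as terrain and Landsat derivatives typically are: removing one covariate can change the importances of the rest.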