3,498 research outputs found

    Interpretable Predictions of Tree-based Ensembles via Actionable Feature Tweaking

    Machine-learned models are often described as "black boxes". In many real-world applications, however, models may have to sacrifice predictive power in favour of human interpretability. When this is the case, feature engineering becomes a crucial task, requiring significant and time-consuming human effort. While some features are inherently static, representing properties that cannot be influenced (e.g., the age of an individual), others capture characteristics that could be adjusted (e.g., the daily amount of carbohydrates consumed). Nonetheless, once a model is learned from the data, each prediction it makes on new instances is irreversible, since every instance is assumed to be a static point in the chosen feature space. There are many circumstances, however, where it is important to understand (i) why a model outputs a certain prediction on a given instance, (ii) which adjustable features of that instance should be modified, and (iii) how to alter such a prediction once the mutated instance is fed back to the model. In this paper, we present a technique that exploits the internals of a tree-based ensemble classifier to offer recommendations for transforming true negative instances into positively predicted ones. We demonstrate the validity of our approach on an online advertising application. First, we design a Random Forest classifier that effectively separates two types of ads: low (negative) and high (positive) quality ads (instances). Then, we introduce an algorithm that provides recommendations aimed at transforming a low quality ad (negative instance) into a high quality one (positive instance). Finally, we evaluate our approach on a subset of the active inventory of a large ad network, Yahoo Gemini.
    Comment: 10 pages, KDD 2017
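
    The core idea, nudging only the adjustable features of a negatively predicted instance until the ensemble's prediction flips, can be sketched in a few lines of Python. The snippet below is a simplified greedy search over candidate tweaks rather than the paper's own procedure (which derives candidate tweaks from the positive prediction paths of each tree); the synthetic dataset, the step size and the choice of adjustable feature indices are illustrative assumptions.

    # Simplified sketch of actionable feature tweaking on a tree ensemble:
    # greedily nudge the adjustable features of a negatively predicted instance
    # until the Random Forest flips its prediction to positive.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=5, random_state=0)
    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

    def tweak(instance, model, adjustable, step=0.25, max_iter=50):
        """Greedily move adjustable features toward a positive prediction."""
        x = instance.copy()
        for _ in range(max_iter):
            if model.predict(x.reshape(1, -1))[0] == 1:
                return x  # prediction flipped to positive
            best, best_prob = None, model.predict_proba(x.reshape(1, -1))[0, 1]
            for j in adjustable:  # only adjustable features may be changed
                for delta in (step, -step):
                    cand = x.copy()
                    cand[j] += delta
                    prob = model.predict_proba(cand.reshape(1, -1))[0, 1]
                    if prob > best_prob:
                        best, best_prob = cand, prob
            if best is None:
                return None  # no single-feature tweak improves the score
            x = best
        return None

    negative = X[clf.predict(X) == 0][0]  # a negatively predicted instance
    print(tweak(negative, clf, adjustable=[0, 2, 4]))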

    Tree Ring Disturbance Clustering for the Collapse of Long Tree-ring Chronologies

    The Disturbance-Clustering hypothesis, first introduced here, posits that geographically-demarcated, subtly-perturbed tree rings caused the affected trees to crossmatch not according to climatic signals, as conventional dendrochronology assumes, but only within a geographic cluster of like-perturbed trees, and not with trees of other clusters or with any of the remaining unaffected, climatically-governed trees. During chronology-building, these clusters became connected to each other, into an artificially long chronology, by means of rarely-occurring, fortuitously-crossmatching “bridge” series. An experiment involving fifteen ostensibly heterochronous ancient trees graphically supports this hypothesis. Merely one-per-decade individual-ring perturbations induce all fifteen series to form a self-clustering, robust, false master chronology (common variance), to which, moreover, each series crossmatches to an almost entirely convincing degree (nearly all featuring all the important statistics, including segment-by-segment correspondence of the curves). Significantly, and as experimentally demonstrated in this paper, at least 3 of every 10 disturbances can be omitted from some series and a robust master chronology still develops. What is more, the construction of the master chronology does not depend on the presence of any series that has the full complement of disturbances. Clearly, modestly-disturbed series could adequately have served as the “core” of a cluster of disturbed trees, just as the Disturbance-Clustering hypothesis requires. Furthermore, the previously-introduced, now-called Migrating-Disturbance hypothesis does not require a literal repetition of events in time; a lateral movement of disturbances over centuries is sufficient, as is illustrated. The Swedish and Finnish (Lapland) long Scots pine chronologies have a number of internal discontinuities. While not invalidating the chronologies, these discontinuities provide possible clues to their deconstruction.

    Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI

    In the last few years, Artificial Intelligence (AI) has achieved notable momentum that, if harnessed appropriately, may deliver on the best of expectations across many application sectors. Before this can occur in Machine Learning, however, the entire community faces the barrier of explainability, an inherent problem of the latest techniques brought by sub-symbolism (e.g. ensembles or Deep Neural Networks) that was not present in the previous wave of AI (namely, expert systems and rule-based models). Paradigms underlying this problem fall within the so-called eXplainable AI (XAI) field, which is widely acknowledged as a crucial feature for the practical deployment of AI models. The overview presented in this article examines the existing literature and contributions already made in the field of XAI, together with a prospect of what is yet to be reached. For this purpose, we summarize previous efforts made to define explainability in Machine Learning, establishing a novel definition of explainable Machine Learning that covers such prior conceptual propositions with a major focus on the audience for which explainability is sought. Starting from this definition, we propose and discuss a taxonomy of recent contributions related to the explainability of different Machine Learning models, including those aimed at explaining Deep Learning methods, for which a second dedicated taxonomy is built and examined in detail. This critical literature analysis serves as the motivating background for a series of challenges faced by XAI, such as the interesting crossroads of data fusion and explainability. Our prospects lead toward the concept of Responsible Artificial Intelligence, namely, a methodology for the large-scale implementation of AI methods in real organizations with fairness, model explainability and accountability at its core. Our ultimate goal is to provide newcomers to the field of XAI with a thorough taxonomy that can serve as reference material to stimulate future research advances, but also to encourage experts and professionals from other disciplines to embrace the benefits of AI in their activity sectors, without any prior bias for its lack of interpretability.
    Funding: Basque Government; Consolidated Research Group MATHMODE, Department of Education of the Basque Government (IT1294-19); Spanish Government and European Commission (TIN2017-89517-P); BBVA Foundation through its Ayudas Fundacion BBVA a Equipos de Investigacion Cientifica 2018 call (DeepSCOP project); European Commission 82561

    Tree-based ensembles unveil the microhabitat suitability for the invasive bleak (Alburnus alburnus L.) and pumpkinseed (Lepomis gibbosus L.): Introducing XGBoost to eco-informatics

    Random Forests (RFs) and Gradient Boosting Machines (GBMs) are popular approaches for habitat suitability modelling in environmental flow assessment. However, both present some limitations theoretically solved by alternative tree-based ensemble techniques (e.g. conditional RFs or oblique RFs). Among them, eXtreme Gradient Boosting machines (XGBoost) have proven to be another promising technique that mixes subroutines developed for RFs and GBMs. To inspect the capabilities of these alternative techniques, RFs and GBMs were compared with conditional RFs, oblique RFs and XGBoost by modelling, at the micro-scale, the habitat suitability for the invasive bleak (Alburnus alburnus L.) and pumpkinseed (Lepomis gibbosus L.). XGBoost outperformed the other approaches, particularly conditional and oblique RFs, although there were no statistical differences with standard RFs and GBMs. The partial dependence plots highlighted the lacustrine origins of pumpkinseed and the preference for lentic habitats of bleak. However, the latter showed a larger tolerance for rapid microhabitats found in run-type river segments, which is likely to hinder the management of flow regimes to control its invasion. The difference in computational burden and, especially, the characteristics of datasets on microhabitat use (low data prevalence and high overlap between categories) led us to conclude that, in the short term, XGBoost is not destined to replace properly optimised RFs and GBMs in habitat suitability modelling at the micro-scale.
    This project had the support of Fundacion Biodiversidad of the Spanish Ministry for Ecological Transition. We want to thank the volunteering students of the Universitat Politecnica de Valencia: Marina de Miguel, Carlos A. Puig-Mengual, Cristina Barea, Rares Hugianu, and Pau Rodriguez. R. Munoz-Mas benefitted from a postdoctoral Juan de la Cierva fellowship from the Spanish Ministry of Science, Innovation and Universities (ref. FJCI-2016-30829). This research was supported by the Government of Catalonia (ref. 2017 SGR 548).
    Muñoz-Mas, R.; Gil-Martínez, E.; Oliva-Paterna, FJ.; Belda, E.; Martinez-Capel, F. (2019). Tree-based ensembles unveil the microhabitat suitability for the invasive bleak (Alburnus alburnus L.) and pumpkinseed (Lepomis gibbosus L.): Introducing XGBoost to eco-informatics. Ecological Informatics 53:1-12. https://doi.org/10.1016/j.ecoinf.2019.100974
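
    The workflow described above, comparing several tree-based ensembles on presence/absence data and then inspecting partial dependence, can be illustrated with the short sketch below. It uses synthetic, low-prevalence data in place of the study's microhabitat observations, arbitrary hyperparameters, and assumes the xgboost Python package is installed; it shows the general procedure rather than the paper's actual models.

    # Illustrative comparison of tree-based ensembles on synthetic
    # presence/absence data with low prevalence (not the study's dataset).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.inspection import partial_dependence
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier  # assumes the xgboost package is installed

    X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                               n_redundant=0, weights=[0.85, 0.15], random_state=1)

    models = {
        "RF": RandomForestClassifier(n_estimators=300, random_state=1),
        "GBM": GradientBoostingClassifier(random_state=1),
        "XGBoost": XGBClassifier(n_estimators=300, eval_metric="logloss", random_state=1),
    }
    for name, model in models.items():
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
        print(f"{name}: mean ROC-AUC = {auc.mean():.3f}")

    # Partial dependence of the fitted XGBoost model on the first variable,
    # analogous to the partial dependence plots discussed in the paper.
    pdp = partial_dependence(models["XGBoost"].fit(X, y), X, [0])
    print(pdp["average"].shape)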

    An assessment of the effectiveness of using data analytics to predict death claim seasonality and protection policy review lapses in a life insurance company

    Data analytics tools are becoming increasingly common in the life insurance industry. This research considers two use cases for predictive analytics in a life insurance company based in Ireland.
    The first case study relates to the use of time series models to forecast the seasonality of death claim notifications. The baseline model predicted no seasonal variation in death claim notifications over a calendar year, reflecting the life insurance company’s current approach, whereby claims are assumed to be notified linearly over a calendar year. More accurate forecasting of death claim seasonality would enhance the company’s cashflow planning and analysis of financial results. The performance of five time series models was compared against the baseline model: a simple historical average model, a classical SARIMA model, the Random Forest Regressor and Prophet machine learning models, and the LSTM deep learning model. The models were trained both on the life insurance company’s historical death claims data and on Irish population deaths data for the 25-74 age cohort over the same observation periods. The results demonstrated that machine learning time series models were generally more effective than the baseline model in forecasting death claim seasonality, and that models trained on Irish population deaths as well as on the company’s historical death claims could outperform the baseline model. The best forecaster was Facebook’s Prophet model, trained on the life insurance company’s claims data. Each of the models trained on Irish population deaths data outperformed the baseline model, while the SARIMA and LSTM models consistently underperformed the baseline model when trained on death claims data. All models performed better when claims directly related to Covid-19 were removed from the testing data.
    The second case study relates to the use of classification models to predict protection policy lapse behaviour following a policy review. The life insurance company currently has no method of predicting individual policy lapses, so the baseline model assumed that all policies had an equal probability of lapsing. More accurate prediction of policy review lapse outcomes would enhance the company’s profit forecasting and would provide an opportunity to reduce lapse rates at policy review by tailoring alternative options for certain groups of policyholders. The performance of 12 classification models was assessed against the baseline model, including KNN, Naïve Bayes, Support Vector Machine, Decision Tree, Random Forest, Extra Trees, XGBoost, LightGBM, AdaBoost and Multi-Layer Perceptron (MLP). To address class imbalance in the data, 11 rebalancing techniques were assessed: cost-sensitive algorithms (Class Weight Balancing), oversampling (Random Oversampling, ADASYN, SMOTE, Borderline SMOTE), undersampling (Random Undersampling and Near Miss versions 1 to 3), and a combination of oversampling and undersampling (SMOTETomek and SMOTEENN). When combined with rebalancing methods, the classification models outperformed the baseline model in almost every case, although results varied by train/test split and by evaluation metric. Oversampling models performed best on F1 Score and ROC-AUC, while SMOTEENN and the undersampling models generated the highest levels of Recall. The top F1 Score was generated by the Naïve Bayes model when combined with SMOTE, and the MLP model generated the highest ROC-AUC when combined with Borderline SMOTE.
    The results of both case studies demonstrate that data analytics techniques can enhance a life insurance company’s predictive toolkit. It is recommended that further opportunities to enhance the predictive ability of the time series and classification models be explored.
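
    As a concrete illustration of the rebalanced classification pipelines compared in the second case study, the sketch below wraps SMOTE oversampling and one of the listed classifiers in an imbalanced-learn pipeline, so that only the training folds are resampled, and scores it on F1 and ROC-AUC. The synthetic data, the roughly 5% positive rate and the choice of Random Forest are assumptions made purely for the example; the thesis's real policy features and tuning are not reproduced.

    # Minimal sketch of a SMOTE-rebalanced classification pipeline, scored on
    # the same metrics discussed above (F1 and ROC-AUC). Synthetic data only.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_validate
    from imblearn.over_sampling import SMOTE  # assumes imbalanced-learn is installed
    from imblearn.pipeline import Pipeline    # pipeline variant that supports resamplers

    X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05],
                               random_state=0)  # heavily imbalanced target

    pipe = Pipeline([
        ("smote", SMOTE(random_state=0)),  # resamples the training folds only
        ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ])
    scores = cross_validate(pipe, X, y, cv=5, scoring=["f1", "roc_auc"])
    print("F1:", scores["test_f1"].mean(), "ROC-AUC:", scores["test_roc_auc"].mean())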

    Applying Machine Learning to Biological Status (QValues) from Physio-chemical Conditions of Irish Rivers

    This thesis evaluates and optimises a variety of predictive models for assessing biological classification status, with an emphasis on water quality monitoring. Grounded in previous pertinent studies, it builds on the findings of Arrighi and Castelli (2023) concerning Tuscany’s river catchments, which highlighted a solid correlation between river ecological status and parameters such as summer climate and land use. That study achieved 80% prediction precision using the Random Forest algorithm, which proved particularly adept at identifying good ecological conditions, while leveraging a dataset devoid of chemical data.

    Definitions, methods, and applications in interpretable machine learning.

    Machine-learning models have demonstrated great success in learning complex patterns that enable them to make predictions about unobserved data. In addition to using models for prediction, the ability to interpret what a model has learned is receiving an increasing amount of attention. However, this increased focus has led to considerable confusion about the notion of interpretability. In particular, it is unclear how the wide array of proposed interpretation methods are related and what common concepts can be used to evaluate them. We aim to address these concerns by defining interpretability in the context of machine learning and introducing the predictive, descriptive, relevant (PDR) framework for discussing interpretations. The PDR framework provides 3 overarching desiderata for evaluation: predictive accuracy, descriptive accuracy, and relevancy, with relevancy judged relative to a human audience. Moreover, to help manage the deluge of interpretation methods, we introduce a categorization of existing techniques into model-based and post hoc categories, with subgroups including sparsity, modularity, and simulatability. To demonstrate how practitioners can use the PDR framework to evaluate and understand interpretations, we provide numerous real-world examples. These examples highlight the often underappreciated role played by human audiences in discussions of interpretability. Finally, based on our framework, we discuss limitations of existing methods and directions for future work. We hope that this work will provide a common vocabulary that will make it easier for both practitioners and researchers to discuss and choose from the full range of interpretation methods.
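
    The model-based versus post hoc distinction drawn above can be made concrete with two small, generic examples, neither taken from the paper: a sparse linear model, whose few nonzero coefficients are interpretable by construction, and permutation importance computed after the fact for a fitted random forest. The dataset and model settings are arbitrary choices made only for illustration.

    # Model-based interpretation: sparsity makes the model itself readable.
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import permutation_importance
    from sklearn.linear_model import Lasso

    X, y = load_diabetes(return_X_y=True)

    sparse = Lasso(alpha=1.0).fit(X, y)
    print("nonzero coefficients:", (sparse.coef_ != 0).sum())

    # Post hoc interpretation: explain an already-fitted black-box model.
    forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
    result = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
    print("most important feature index:", result.importances_mean.argmax())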