92 research outputs found

    Interpretable Predictions of Tree-based Ensembles via Actionable Feature Tweaking

    Full text link
Machine-learned models are often described as "black boxes". In many real-world applications, however, models may have to sacrifice predictive power in favour of human-interpretability. When this is the case, feature engineering becomes a crucial task, which requires significant and time-consuming human effort. Whilst some features are inherently static, representing properties that cannot be influenced (e.g., the age of an individual), others capture characteristics that could be adjusted (e.g., the daily amount of carbohydrates taken). Nonetheless, once a model is learned from the data, each prediction it makes on new instances is irreversible - assuming every instance to be a static point located in the chosen feature space. There are many circumstances, however, where it is important to understand (i) why a model outputs a certain prediction on a given instance, (ii) which adjustable features of that instance should be modified, and finally (iii) how to alter such a prediction when the mutated instance is input back to the model. In this paper, we present a technique that exploits the internals of a tree-based ensemble classifier to offer recommendations for transforming true negative instances into positively predicted ones. We demonstrate the validity of our approach using an online advertising application. First, we design a Random Forest classifier that effectively separates between two types of ads: low (negative) and high (positive) quality ads (instances). Then, we introduce an algorithm that provides recommendations that aim to transform a low quality ad (negative instance) into a high quality one (positive instance). Finally, we evaluate our approach on a subset of the active inventory of a large ad network, Yahoo Gemini. Comment: 10 pages, KDD 2017
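    The recommendation step described in the abstract can be illustrated with a small sketch. The following is a minimal, hypothetical Python implementation of epsilon-based feature tweaking on a scikit-learn RandomForestClassifier; the function names (`positive_leaf_paths`, `feature_tweak`), the epsilon value, the Euclidean cost, and the choice to treat every feature as adjustable are illustrative assumptions, not the authors' code.

    ```python
    # Minimal sketch of actionable feature tweaking on a scikit-learn random forest.
    # Assumptions (not from the paper): positive class is 1, cost is Euclidean
    # distance, and all features are treated as adjustable.
    import numpy as np

    def positive_leaf_paths(tree, positive_class=1):
        """Yield, for each leaf predicting the positive class, the list of
        (feature, threshold, direction) conditions on the root-to-leaf path."""
        def recurse(node, path):
            left, right = tree.children_left[node], tree.children_right[node]
            if left == -1:  # leaf node
                if np.argmax(tree.value[node][0]) == positive_class:
                    yield list(path)
                return
            f, t = tree.feature[node], tree.threshold[node]
            yield from recurse(left, path + [(f, t, "<=")])
            yield from recurse(right, path + [(f, t, ">")])
        yield from recurse(0, [])

    def feature_tweak(clf, x, eps=0.1, positive_class=1):
        """Return the lowest-cost tweaked copy of x that the whole ensemble
        classifies as positive, or None if no candidate flips the prediction."""
        best, best_cost = None, np.inf
        for est in clf.estimators_:
            for path in positive_leaf_paths(est.tree_, positive_class):
                cand = x.astype(float)
                for f, t, op in path:  # force the candidate to satisfy the leaf's path
                    if op == "<=" and cand[f] > t:
                        cand[f] = t - eps
                    elif op == ">" and cand[f] <= t:
                        cand[f] = t + eps
                if clf.predict(cand.reshape(1, -1))[0] == positive_class:
                    cost = np.linalg.norm(cand - x)
                    if cost < best_cost:
                        best, best_cost = cand, cost
        return best
    ```

    Applied to a negatively predicted instance of a fitted forest, `feature_tweak` returns the nearest candidate that satisfies some positive leaf's path and that the whole ensemble labels positive; a practical version would restrict the search to the adjustable features and prune candidates by cost.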

    Understanding Random Forests: From Theory to Practice

    Get PDF
Data analysis and machine learning have become an integral part of the modern scientific methodology, offering automated procedures for the prediction of a phenomenon based on past observations, unraveling underlying patterns in data and providing insights about the problem. Yet, one should avoid using machine learning as a black-box tool, and rather consider it as a methodology, with a rational thought process that is entirely dependent on the problem under study. In particular, the use of algorithms should ideally require a reasonable understanding of their mechanisms, properties and limitations, in order to better apprehend and interpret their results. Accordingly, the goal of this thesis is to provide an in-depth analysis of random forests, consistently calling into question each and every part of the algorithm, in order to shed new light on its learning capabilities, inner workings and interpretability. The first part of this work studies the induction of decision trees and the construction of ensembles of randomized trees, motivating their design and purpose whenever possible. Our contributions follow with an original complexity analysis of random forests, showing their good computational performance and scalability, along with an in-depth discussion of their implementation details, as contributed within Scikit-Learn. In the second part of this work, we analyse and discuss the interpretability of random forests through the lens of variable importance measures. The core of our contributions rests in the theoretical characterization of the Mean Decrease of Impurity variable importance measure, from which we prove and derive some of its properties in the case of multiway totally randomized trees and in asymptotic conditions. In consequence of this work, our analysis demonstrates that variable importances [...]. Comment: PhD thesis. Source code available at https://github.com/glouppe/phd-thesis
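    Since the thesis characterizes the Mean Decrease of Impurity (MDI) measure as implemented in Scikit-Learn, a minimal usage sketch may help; the dataset and forest settings below are illustrative assumptions, not taken from the thesis.

    ```python
    # Minimal sketch: MDI variable importances of a random forest in scikit-learn.
    # Dataset and hyperparameters are illustrative only.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

    # feature_importances_ is scikit-learn's Mean Decrease of Impurity (MDI):
    # the impurity reduction of each split, weighted by the fraction of samples
    # reaching the node, averaged over all trees in the forest.
    mdi = sorted(zip(X.columns, forest.feature_importances_),
                 key=lambda kv: kv[1], reverse=True)
    for name, score in mdi[:10]:
        print(f"{name:30s} {score:.4f}")
    ```

    The thesis's theoretical results concern totally randomized trees in asymptotic conditions; the values printed above are the finite-sample empirical estimates of that quantity.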

    Data Structures & Algorithm Analysis in C++

    Get PDF
This is the textbook for CSIS 215 at Liberty University.

    An introduction to explainable artificial intelligence with LIME and SHAP

    Full text link
Bachelor's thesis in Mathematics, Faculty of Mathematics, Universitat de Barcelona, Year: 2022, Advisors: Albert Clapés i Sintes and Sergio Escalera Guerrero. [en] Artificial intelligence (AI) and more specifically machine learning (ML) have shown their potential by approaching or even exceeding human levels of accuracy for a variety of real-world problems. However, the highest accuracy for large modern datasets is often achieved by complex models that even experts struggle to interpret, creating a tradeoff between accuracy and interpretability. These models are known for being "black box" and opaque, which is especially problematic in industries like healthcare. Therefore, understanding the reasons behind predictions is crucial in establishing trust, which is fundamental if one plans to take action based on a prediction, or when deciding whether or not to implement a new model. Here is where explainable artificial intelligence (XAI) comes in by helping humans to comprehend and trust the results and output created by a machine learning model. This project is organised in 3 chapters with the aim of introducing the reader to the field of explainable artificial intelligence. Machine learning and some related concepts are introduced in the first chapter. The second chapter focuses on the theory of the random forest model in detail. Finally, in the third chapter, the theory behind two contemporary and influential XAI methods, LIME and SHAP, is formalised. Additionally, a public diabetes tabular dataset is used to illustrate an application of these two methods in the medical sector. The project concludes with a discussion of possible future work.
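    As a companion to the abstract, here is a minimal sketch of applying LIME and SHAP to a random forest on a tabular classification task; the dataset, model settings, and variable names are assumptions for illustration (the thesis itself uses a public diabetes dataset).

    ```python
    # Minimal sketch: local explanations of a random forest with LIME and SHAP.
    # Dataset and settings are illustrative only.
    import shap
    from lime.lime_tabular import LimeTabularExplainer
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    data = load_breast_cancer()
    X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, random_state=0)
    model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

    # LIME: fit a sparse local surrogate model around a single instance.
    lime_exp = LimeTabularExplainer(X_tr, feature_names=list(data.feature_names),
                                    class_names=list(data.target_names),
                                    mode="classification")
    print(lime_exp.explain_instance(X_te[0], model.predict_proba,
                                    num_features=5).as_list())

    # SHAP: Shapley value attributions for tree ensembles via TreeExplainer.
    shap_values = shap.TreeExplainer(model).shap_values(X_te)
    # Depending on the shap version this is a list of per-class arrays or a single
    # array with a class axis; either way, each entry gives per-feature additive
    # contributions that sum to the model output for that instance.
    ```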

    Analysis of technical & tactical performances for success in the National Rugby League

    Get PDF
Corey Wedding investigated the ability of novel analytical techniques to identify success in the Australian National Rugby League. These techniques were effective in identifying unique playing styles and positions important for team success within the competition. These results will help elite teams to better prepare for successful matches and seasons.

    Techniques for Constructing Efficient Lock-free Data Structures

    Full text link
Building a library of concurrent data structures is an essential way to simplify the difficult task of developing concurrent software. Lock-free data structures, in which processes can help one another to complete operations, offer the following progress guarantee: if processes take infinitely many steps, then infinitely many operations are performed. Handcrafted lock-free data structures can be very efficient, but are notoriously difficult to implement. We introduce numerous tools that support the development of efficient lock-free data structures, and especially trees. Comment: PhD thesis, Univ Toronto (2017)

    Application of Pattern Recognition Methods to Identify Dietary Patterns in Longitudinal Studies: A Novel approach in Nutritional Epidemiology

    Get PDF
With the increasing prevalence of longitudinal nutritional data applications in medical science, there is a need for complex statistical models for the identification of dietary patterns in the longitudinal setting. Advances are constantly being made in our understanding of the interpretability and application of statistical methodologies for longitudinal data. However, little guidance on these matters is available in most nutritional contexts. One of the most important features of longitudinal data is that the observations repeatedly collected over time are correlated with each other. This time-varying association among observations, which cannot be captured by cross-sectional data analysis methods alone, is the main analytic challenge in longitudinal studies. This challenge is particularly relevant in the nutritional field, where researchers strive to identify useful and understandable dietary patterns from large-scale nutritional data. In nutritional epidemiology, dietary patterns are derived using pattern recognition (PR) methods. Generally, there are two types of PR methods: supervised and unsupervised. Although many nutritional studies have applied cross-sectional PR methods to the identification of dietary patterns, the assumptions of these methods might not be suitable for identifying patterns in longitudinal data. Currently, extensions to both supervised and unsupervised cross-sectional PR methods for revealing patterns in longitudinal data exist in the literature. However, none of these methods have been applied to the identification of dietary patterns where nutritional data are collected repeatedly over time. Recently, longitudinal principal component analysis (LPCA) and the unbiased random effects expectation-maximization (RE-EM) tree have been developed as substitutes for principal component analysis (PCA) and regression tree (RT) analysis for revealing patterns in longitudinal studies. This thesis introduces the first application of LPCA and the unbiased RE-EM tree, as unsupervised and supervised PR methods respectively, to the analysis of longitudinal nutritional data. To illustrate these methods, an analysis of dietary patterns in a representative sub-sample of the Saskatchewan Bone Mineral Accrual Study (BMAS) is presented. Results showed that the models presented in this thesis seem feasible and useful for the identification of dietary patterns and their trajectories where nutritional data are collected longitudinally. In this sense, this thesis assists nutritional epidemiologists and researchers in understanding the importance, role, and meaning of considering time-varying associations in diet. It also introduces new dietary pattern analysis methods for longitudinal nutritional studies using LPCA and the unbiased RE-EM tree.
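    To make the supervised longitudinal method concrete, here is a heavily simplified sketch of the RE-EM tree idea: alternate between fitting a regression tree to responses adjusted for subject-level random intercepts and re-estimating those intercepts from the tree's residuals. The shrinkage-free intercept update, function name, and data layout are illustrative assumptions, not the thesis's implementation (a real RE-EM tree estimates the random effects with a linear mixed model).

    ```python
    # Simplified sketch of an RE-EM-style tree for longitudinal data:
    # (1) fit a tree to y adjusted for current subject intercepts,
    # (2) re-estimate the intercepts from the tree's residuals, and repeat.
    # Plain per-subject residual means replace the mixed-model (shrinkage)
    # estimation used by the real algorithm, only to keep the sketch short.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def reem_tree(X, y, subjects, max_depth=3, n_iter=20, tol=1e-6):
        subjects = np.asarray(subjects)
        intercepts = {s: 0.0 for s in np.unique(subjects)}
        tree = DecisionTreeRegressor(max_depth=max_depth)
        for _ in range(n_iter):
            b = np.array([intercepts[s] for s in subjects])
            tree.fit(X, y - b)                       # fixed part on adjusted response
            resid = y - tree.predict(X)              # residual attributable to subjects
            new = {s: resid[subjects == s].mean() for s in intercepts}
            converged = max(abs(new[s] - intercepts[s]) for s in new) < tol
            intercepts = new
            if converged:
                break
        return tree, intercepts
    ```

    A prediction for a known subject adds that subject's intercept to the tree output; for a new subject, the tree output alone is used.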

    LIPIcs, Volume 274, ESA 2023, Complete Volume

    Get PDF
LIPIcs, Volume 274, ESA 2023, Complete Volume

    High-Quality Hypergraph Partitioning

    Get PDF
This dissertation focuses on computing high-quality solutions for the NP-hard balanced hypergraph partitioning problem: Given a hypergraph and an integer k, partition its vertex set into k disjoint blocks of bounded size, while minimizing an objective function over the hyperedges. Here, we consider the two most commonly used objectives: the cut-net metric and the connectivity metric. Since the problem is computationally intractable, heuristics are used in practice - the most prominent being the three-phase multi-level paradigm: During coarsening, the hypergraph is successively contracted to obtain a hierarchy of smaller instances. After applying an initial partitioning algorithm to the smallest hypergraph, contraction is undone and, at each level, refinement algorithms try to improve the current solution. With this work, we give a brief overview of the field and present several algorithmic improvements to the multi-level paradigm. Instead of using a logarithmic number of levels like traditional algorithms, we present two coarsening algorithms that create a hierarchy of (nearly) n levels, where n is the number of vertices. This makes consecutive levels as similar as possible and provides many opportunities for refinement algorithms to improve the partition. This approach is made feasible in practice by tailoring all algorithms and data structures to the n-level paradigm, and developing lazy-evaluation techniques, caching mechanisms and early stopping criteria to speed up the partitioning process. Furthermore, we propose a sparsification algorithm based on locality-sensitive hashing that improves the running time for hypergraphs with large hyperedges, and show that incorporating global information about the community structure into the coarsening process improves quality. Moreover, we present a portfolio-based initial partitioning approach, and propose three refinement algorithms. Two are based on the Fiduccia-Mattheyses (FM) heuristic, but perform a highly localized search at each level. While one is designed for two-way partitioning, the other is the first FM-style algorithm that can be efficiently employed in the multi-level setting to directly improve k-way partitions. The third algorithm uses max-flow computations on pairs of blocks to refine k-way partitions. Finally, we present the first memetic multi-level hypergraph partitioning algorithm for an extensive exploration of the global solution space. All contributions are made available through our open-source framework KaHyPar. In a comprehensive experimental study, we compare KaHyPar with hMETIS, PaToH, Mondriaan, Zoltan-AlgD, and HYPE on a wide range of hypergraphs from several application areas. Our results indicate that KaHyPar, already without the memetic component, computes better solutions than all competing algorithms for both the cut-net and the connectivity metric, while being faster than Zoltan-AlgD and equally fast as hMETIS. Moreover, KaHyPar compares favorably with the current best graph partitioning system KaFFPa - both in terms of solution quality and running time
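    For readers unfamiliar with the two objectives, a small sketch of how the cut-net and connectivity metrics are evaluated for a given k-way partition may be useful; the hypergraph representation (a list of weighted vertex sets), the unit vertex weights, and the balance check are illustrative assumptions, not KaHyPar's data structures.

    ```python
    # Minimal sketch: evaluating a k-way hypergraph partition.
    # hyperedges: list of (set_of_vertices, weight); partition: dict vertex -> block id.
    from math import ceil

    def partition_objectives(hyperedges, partition, k, epsilon=0.03):
        cut_net, connectivity = 0, 0
        for pins, weight in hyperedges:
            blocks = {partition[v] for v in pins}   # blocks the hyperedge touches
            lam = len(blocks)
            if lam > 1:
                cut_net += weight                   # cut-net metric: weight of cut nets
            connectivity += weight * (lam - 1)      # connectivity (lambda - 1) metric
        # balance: each block holds at most (1 + epsilon) * ceil(n / k) vertices
        n = len(partition)
        sizes = {}
        for v, b in partition.items():
            sizes[b] = sizes.get(b, 0) + 1
        balanced = all(s <= (1 + epsilon) * ceil(n / k) for s in sizes.values())
        return cut_net, connectivity, balanced

    # Tiny example: 6 vertices, 3 hyperedges, k = 2.
    edges = [({0, 1, 2}, 1), ({2, 3, 4}, 2), ({4, 5}, 1)]
    part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
    print(partition_objectives(edges, part, k=2))   # (2, 2, True)
    ```

    Only the middle hyperedge spans both blocks, so the cut-net value is its weight (2) and the connectivity value is its weight times (lambda - 1) = 2; with vertex weights, the balance bound would use the total vertex weight instead of n.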