167 research outputs found

    Understanding Random Forests: From Theory to Practice

    Get PDF
    Data analysis and machine learning have become an integrative part of the modern scientific methodology, offering automated procedures for the prediction of a phenomenon based on past observations, unraveling underlying patterns in data and providing insights about the problem. Yet, caution should avoid using machine learning as a black-box tool, but rather consider it as a methodology, with a rational thought process that is entirely dependent on the problem under study. In particular, the use of algorithms should ideally require a reasonable understanding of their mechanisms, properties and limitations, in order to better apprehend and interpret their results. Accordingly, the goal of this thesis is to provide an in-depth analysis of random forests, consistently calling into question each and every part of the algorithm, in order to shed new light on its learning capabilities, inner workings and interpretability. The first part of this work studies the induction of decision trees and the construction of ensembles of randomized trees, motivating their design and purpose whenever possible. Our contributions follow with an original complexity analysis of random forests, showing their good computational performance and scalability, along with an in-depth discussion of their implementation details, as contributed within Scikit-Learn. In the second part of this work, we analyse and discuss the interpretability of random forests in the eyes of variable importance measures. The core of our contributions rests in the theoretical characterization of the Mean Decrease of Impurity variable importance measure, from which we prove and derive some of its properties in the case of multiway totally randomized trees and in asymptotic conditions. In consequence of this work, our analysis demonstrates that variable importances [...].Comment: PhD thesis. Source code available at https://github.com/glouppe/phd-thesi

    Computational Geometric and Algebraic Topology

    Get PDF
    Computational topology is a young, emerging field of mathematics that seeks out practical algorithmic methods for solving complex and fundamental problems in geometry and topology. It draws on a wide variety of techniques from across pure mathematics (including topology, differential geometry, combinatorics, algebra, and discrete geometry), as well as applied mathematics and theoretical computer science. In turn, solutions to these problems have a wide-ranging impact: already they have enabled significant progress in the core area of geometric topology, introduced new methods in applied mathematics, and yielded new insights into the role that topology has to play in fundamental problems surrounding computational complexity. At least three significant branches have emerged in computational topology: algorithmic 3-manifold and knot theory, persistent homology and surfaces and graph embeddings. These branches have emerged largely independently. However, it is clear that they have much to offer each other. The goal of this workshop was to be the first significant step to bring these three areas together, to share ideas in depth, and to pool our expertise in approaching some of the major open problems in the field

    On the Huffman and Alphabetic Tree Problem with General Cost Functions

    Get PDF
    We address generalized versions of the Huffman and Alphabetic Tree Problem where the cost caused by each individual leaf i, instead of being linear, depends on its depth in the tree by an arbitrary function. The objective is to minimize either the total cost or the maximum cost among all leaves. We review and extend the known results in this direction and devise a number of new algorithms and hardness proofs. It turns out that the Dynamic Programming approach for the Alphabetic Tree Problem can be extended to arbitrary cost functions, resulting in a time O(n (4)) optimal algorithm using space O(n (3)). We identify classes of cost functions where the well-known trick to reduce the runtime by a factor of n via a "monotonicity" property can be applied. For the generalized Huffman Tree Problem we show that even the k-ary version can be solved by a generalized version of the Coin Collector Algorithm of Larmore and Hirschberg (in Proc. SODA'90, pp. 310-318, 1990) when the cost functions are nondecreasing and convex. Furthermore, we give an O(n (2)logn) algorithm for the worst case minimization variants of both the Huffman and Alphabetic Tree Problem with nondecreasing cost functions. Investigating the limits of computational tractability, we show that the Huffman Tree Problem in its full generality is inapproximable unless P = NP, no matter if the objective function is the sum of leaf costs or their maximum. The alphabetic version becomes NP-hard when the leaf costs are interdependent.ArticleALGORITHMICA. 69(3): 582-604 (2014)journal articl

    Approximating Pandora's Box with Correlations

    Full text link
    We revisit the classic Pandora's Box (PB) problem under correlated distributions on the box values. Recent work of arXiv:1911.01632 obtained constant approximate algorithms for a restricted class of policies for the problem that visit boxes in a fixed order. In this work, we study the complexity of approximating the optimal policy which may adaptively choose which box to visit next based on the values seen so far. Our main result establishes an approximation-preserving equivalence of PB to the well studied Uniform Decision Tree (UDT) problem from stochastic optimization and a variant of the Min-Sum Set Cover (MSSCf\mathcal{MSSC}_f) problem. For distributions of support mm, UDT admits a logm\log m approximation, and while a constant factor approximation in polynomial time is a long-standing open problem, constant factor approximations are achievable in subexponential time (arXiv:1906.11385). Our main result implies that the same properties hold for PB and MSSCf\mathcal{MSSC}_f. We also study the case where the distribution over values is given more succinctly as a mixture of mm product distributions. This problem is again related to a noisy variant of the Optimal Decision Tree which is significantly more challenging. We give a constant-factor approximation that runs in time nO~(m2/ε2)n^{ \tilde O( m^2/\varepsilon^2 ) } when the mixture components on every box are either identical or separated in TV distance by ε\varepsilon

    An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests

    Get PDF
    Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Especially random forests, that can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine and bioinformatics within the past few years. High dimensional problems are common not only in genetics, but also in some areas of psychological research, where only few subjects can be measured due to time or cost constraints, yet a large amount of data is generated for each subject. Random forests have been shown to achieve a high prediction accuracy in such applications, and provide descriptive variable importance measures reflecting the impact of each variable in both main effects and interactions. The aim of this work is to introduce the principles of the standard recursive partitioning methods as well as recent methodological improvements, to illustrate their usage for low and high dimensional data exploration, but also to point out limitations of the methods and potential pitfalls in their practical application. Application of the methods is illustrated using freely available implementations in the R system for statistical computing
    corecore