
    Learning Optimal Decision Trees from Large Datasets

    Inferring a decision tree from a given dataset is one of the classic problems in machine learning. The problem consists of building, from a labelled dataset, a tree in which each leaf corresponds to a class and each path from the root to a leaf corresponds to a conjunction of features to be satisfied by that class. Following the principle of parsimony, we want to infer a minimal tree consistent with the dataset. Unfortunately, inferring an optimal decision tree is known to be NP-complete for several definitions of optimality. Hence, the majority of existing approaches rely on heuristics, and the few exact inference approaches do not scale to large datasets. In this paper, we propose a novel approach for inferring a decision tree of minimum depth based on the incremental generation of Boolean formulas. The experimental results indicate that the approach scales well and that its running time grows slowly with the size of the dataset.
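
    The paper's actual method encodes the search as Boolean formulas generated incrementally for a solver; the sketch below is only a hypothetical illustration of the outer minimal-depth objective: iteratively deepen a depth budget and exhaustively search for a tree consistent with a binary-feature dataset. All function names are invented for illustration.

```python
from itertools import count

def consistent_tree(rows, labels, depth, features):
    """Exhaustively search for a tree of depth <= `depth` that
    classifies every (row, label) pair correctly; returns a nested
    tuple or None. Exponential time: for illustration only."""
    if len(set(labels)) <= 1:
        return ("leaf", labels[0])   # pure node
    if depth == 0:
        return None                  # impure, but no splits left
    for f in features:
        lo = [(r, y) for r, y in zip(rows, labels) if not r[f]]
        hi = [(r, y) for r, y in zip(rows, labels) if r[f]]
        if not lo or not hi:
            continue                  # split separates nothing
        lt = consistent_tree([r for r, _ in lo], [y for _, y in lo],
                             depth - 1, features)
        rt = lt and consistent_tree([r for r, _ in hi], [y for _, y in hi],
                                    depth - 1, features)
        if lt and rt:
            return ("split", f, lt, rt)
    return None

def minimal_depth_tree(rows, labels):
    """Iterative deepening: the first depth that admits a consistent
    tree is, by construction, minimal. Assumes a consistent dataset
    (no identical rows with different labels), else it never halts."""
    for depth in count(0):
        tree = consistent_tree(rows, labels, depth, range(len(rows[0])))
        if tree is not None:
            return depth, tree

# XOR over two binary features requires depth 2:
print(minimal_depth_tree([(0, 0), (0, 1), (1, 0), (1, 1)], [0, 1, 1, 0])[0])
```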

    An Online Tree-Based Approach for Mining Non-Stationary High-Speed Data Streams

    This paper presents a new learning algorithm for inducing decision trees from data streams. In these domains, large amounts of data arrive continuously over time, possibly at high speed. The proposed algorithm builds trees by top-down induction, splitting leaf nodes recursively until none of them can be expanded. It combines two split methods. The first method guarantees, with statistical significance, that each chosen split would be the same as the one chosen with infinitely many examples; this aims to ensure that the tree induced online is close to the optimal model. However, this method often needs too many examples to decide on the best split, which delays the accuracy improvement of the online predictive model. Therefore, a second method is used to split nodes more quickly, speeding up tree growth; it is based on the observation that larger trees can store more information about the training examples and represent more complex concepts. The first split method is also used to correct splits previously suggested by the second one, once sufficient evidence is available. Finally, an additional procedure rebuilds the tree model according to the suggestions made with an adequate level of statistical significance. The proposed algorithm is empirically compared with several well-known algorithms for inducing decision trees from data streams, and on various synthetic and real-world datasets it proves competitive in terms of accuracy and model size.
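
    The abstract does not name the statistical test, but a standard way to obtain this kind of guarantee in streaming decision trees is the Hoeffding bound used by VFDT-style learners. Under that assumption, the sketch below shows the split decision: split only when the observed gap between the two best candidate splits exceeds the bound epsilon. The function names and the tie-breaking threshold are illustrative, not taken from the paper.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """With probability 1 - delta, the mean of n observations of a
    variable with range `value_range` is within epsilon of its
    true mean: epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(gains, value_range, delta, n, tie_threshold=0.05):
    """Split when the observed gain of the best candidate exceeds the
    runner-up's by more than epsilon, or when epsilon has shrunk below
    the tie-breaking threshold (the candidates are near-equivalent)."""
    best, second = sorted(gains, reverse=True)[:2]
    eps = hoeffding_bound(value_range, delta, n)
    return best - second > eps or eps < tie_threshold

# The same gain gap is inconclusive at 500 examples but decisive at
# 5000, which is exactly the delay the paper's second method addresses:
print(should_split([0.32, 0.25], 1.0, 1e-7, 500))    # False (eps ~ 0.127)
print(should_split([0.32, 0.25], 1.0, 1e-7, 5000))   # True  (eps ~ 0.040)
```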

    ANALYZING BIG DATA WITH DECISION TREES


    Robust Decision Trees Against Adversarial Examples

    Although adversarial examples and model robustness have been extensively studied in the context of linear models and neural networks, research on this issue in tree-based models, and on how to make tree-based models robust against adversarial examples, is still limited. In this paper, we show that tree-based models are also vulnerable to adversarial examples and develop a novel algorithm to learn robust trees. At its core, our method aims to optimize performance under the worst-case perturbation of input features, which leads to a max-min saddle point problem. Incorporating this saddle point objective into the decision tree building procedure is non-trivial due to the discrete nature of trees: a naive approach that finds the best split according to this saddle point objective would take exponential time. To make our approach practical and scalable, we propose efficient tree building algorithms that approximate the inner minimizer in the saddle point problem, and present efficient implementations for classical information-gain-based trees as well as state-of-the-art tree boosting models such as XGBoost. Experimental results on real-world datasets demonstrate that the proposed algorithms can substantially improve the robustness of tree-based models against adversarial examples.
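
    As a rough illustration of the worst-case split objective, and not the authors' algorithm, the sketch below scores a numeric threshold split under a perturbation budget eps with binary labels: examples within eps of the threshold can be pushed to either side by the adversary, and the inner maximization over placements is approximated by trying only the four per-class extremes. All names and the impurity choice (Gini) are assumptions made for the example.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2p(1-p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def weighted_impurity(left, right):
    """Size-weighted impurity of a candidate split (lower is better)."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

def robust_split_score(x, y, threshold, eps):
    """Approximate worst-case impurity of splitting values x (labels y
    in {0, 1}) at `threshold` when each value may be shifted by up to
    eps. Points within eps of the threshold can land on either side;
    instead of all 2^k placements, only the four per-class extremes
    are tried, a cheap stand-in for the exact inner maximization."""
    left = [yi for xi, yi in zip(x, y) if xi < threshold - eps]
    right = [yi for xi, yi in zip(x, y) if xi >= threshold + eps]
    amb = [yi for xi, yi in zip(x, y)
           if threshold - eps <= xi < threshold + eps]
    worst = 0.0
    for zeros_left in (False, True):
        for ones_left in (False, True):
            l, r = list(left), list(right)
            (l if zeros_left else r).extend([0] * amb.count(0))
            (l if ones_left else r).extend([1] * amb.count(1))
            worst = max(worst, weighted_impurity(l, r))
    return worst

# A robust learner then picks the threshold with the smallest
# worst-case score: the outer step of the saddle point objective.
```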