Column generation based math-heuristic for classification trees
This paper explores the use of Column Generation (CG) techniques in
constructing univariate binary decision trees for classification tasks. We
propose a novel Integer Linear Programming (ILP) formulation, based on
root-to-leaf paths in decision trees. The model is solved via a Column
Generation based heuristic. To speed up the heuristic, we restrict the
instance data to a subset of decision splits, sampled from the
solutions of the well-known CART algorithm. Extensive numerical experiments
show that our approach is competitive with the state-of-the-art ILP-based
algorithms. In particular, the proposed approach is capable of handling big
data sets with tens of thousands of data rows. Moreover, for large data sets,
it finds solutions competitive with those of CART.
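The restricted split pool described above can be illustrated with a toy CART-style greedy split search; the function names and data below are illustrative, not taken from the paper. A column-generation heuristic would draw its candidate splits from thresholds like the one found here:

```python
# Hypothetical sketch: a CART-style greedy search over one numeric feature,
# whose chosen thresholds could seed a restricted pool of candidate splits
# for a column-generation heuristic.

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 2 * p1 * (1 - p1)  # binary Gini impurity

def best_split(xs, ys):
    """Return (threshold, impurity) minimizing the weighted Gini of the
    two children, scanning midpoints between consecutive sorted values."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= thr]
        right = [y for x, y in pairs if x > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (thr, score)
    return best

# Toy data: feature value, binary class label.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [0, 0, 0, 1, 1, 1]
thr, score = best_split(xs, ys)
print(thr, score)  # → 6.5 0.0: a perfect split between the two clusters
```

Sampling such thresholds from CART runs, rather than enumerating all possible splits, is what keeps the restricted instance data small.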
An Automatic Interaction Detection Hybrid Model for Bankcard Response Classification
In this paper, we propose a hybrid bankcard response model, which integrates
decision tree based chi-square automatic interaction detection (CHAID) into
logistic regression. In the first stage of the hybrid model, CHAID analysis is
used to detect potential variable interactions. In the second
stage, these potential interactions serve as additional input
variables in logistic regression. The motivation for the proposed hybrid model
is that adding variable interactions may improve the performance of logistic
regression. To demonstrate the effectiveness of the proposed hybrid model, it
is evaluated on a real credit customer response data set. As the results
reveal, by identifying potential interactions among independent variables, the
proposed hybrid approach outperforms the logistic regression without searching
for interactions in terms of classification accuracy, the area under the
receiver operating characteristic curve (ROC), and Kolmogorov-Smirnov (KS)
statistics. Furthermore, CHAID analysis for interaction detection is much more
computationally efficient than a stepwise search for interactions, and some
identified interactions are shown to have statistically significant predictive
power on the target variable. Last but not least, the customer profile created
based on the CHAID tree provides a reasonable interpretation of the
interactions, which is required by regulations of the credit industry.
Hence, this study provides an alternative approach for handling bankcard
classification tasks.
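The second-stage augmentation can be sketched in a few lines; the snippet below is not the paper's CHAID implementation, just an illustration of how variable pairs flagged by a first-stage tree analysis would be turned into product terms for the logistic regression design matrix (all names and data are invented):

```python
# Illustrative sketch: interaction pairs detected in stage one become extra
# product-term columns in the design matrix passed to logistic regression.

def add_interactions(rows, pairs):
    """rows: list of feature lists; pairs: (i, j) index pairs flagged by the
    first-stage tree analysis. Returns rows augmented with x_i * x_j terms."""
    return [row + [row[i] * row[j] for i, j in pairs] for row in rows]

X = [[1.0, 2.0, 3.0],
     [0.5, 4.0, 1.0]]
detected = [(0, 1), (1, 2)]   # e.g. interactions surfaced by CHAID splits
X_aug = add_interactions(X, detected)
print(X_aug)  # each row gains two interaction columns
```

The augmented matrix would then be fit with any standard logistic regression routine.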
Stratification Trees for Adaptive Randomization in Randomized Controlled Trials
This paper proposes an adaptive randomization procedure for two-stage
randomized controlled trials. The method uses data from a first-wave experiment
in order to determine how to stratify in a second wave of the experiment, where
the objective is to minimize the variance of an estimator for the average
treatment effect (ATE). We consider selection from a class of stratified
randomization procedures which we call stratification trees: these are
procedures whose strata can be represented as decision trees, with differing
treatment assignment probabilities across strata. By using the first wave to
estimate a stratification tree, we simultaneously select which covariates to
use for stratification, how to stratify over these covariates, as well as the
assignment probabilities within these strata. Our main result shows that using
this randomization procedure with an appropriate estimator results in an
asymptotic variance which is minimal in the class of stratification trees.
Moreover, the results we present are able to accommodate a large class of
assignment mechanisms within strata, including stratified block randomization.
In a simulation study, we find that our method, paired with an appropriate
cross-validation procedure, can improve on ad hoc choices of stratification. We
conclude by applying our method to the study in Karlan and Wood (2017), where
we estimate stratification trees using the first wave of their experiment.
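The paper's full procedure estimates the stratification tree itself; as a minimal sketch of only the within-stratum allocation step, the snippet below applies the classical Neyman allocation rule (treatment probability s1/(s1+s0)) to hypothetical first-wave outcomes. The data are invented and this is a textbook rule, not the paper's exact estimator:

```python
# Sketch of variance-minimizing allocation within one stratum: assign
# treatment with probability s1/(s1+s0), where s1 and s0 are first-wave
# outcome standard deviations under treatment and control. The strata
# themselves would come from an estimated stratification tree.
import statistics

def neyman_probability(treated_outcomes, control_outcomes):
    s1 = statistics.pstdev(treated_outcomes)
    s0 = statistics.pstdev(control_outcomes)
    return s1 / (s1 + s0)

# First-wave outcomes in one hypothetical stratum:
treated = [3.0, 7.0, 5.0, 9.0]   # more variable under treatment...
control = [4.0, 5.0, 4.5, 5.5]
p = neyman_probability(treated, control)
print(round(p, 3))  # > 0.5: oversample the noisier arm
```

Intuitively, the noisier arm gets more observations because its mean is harder to pin down.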
An overview of decision table literature 1982-1995.
This report gives an overview of the literature on decision tables over the past 15 years. As much as possible, for each reference an author-supplied abstract, a number of keywords and a classification are provided. In some cases our own comments are added. The purpose of these comments is to show where, how and why decision tables are used. The literature is classified according to application area, theoretical versus practical character, year of publication, country of origin (not necessarily country of publication) and the language of the document. After a description of the scope of the review, classification results and the classification by topic are presented. The main body of the paper is the ordered list of publications with abstract, classification and comments.
Efficient, reliable and fast high-level triggering using a bonsai boosted decision tree
High-level triggering is a vital component in many modern particle physics
experiments. This paper describes a modification to the standard boosted
decision tree (BDT) classifier, the so-called "bonsai" BDT, that has the
following important properties: it is more efficient than traditional
cut-based approaches; it is robust against detector instabilities; and it is
very fast.
Thus, it is fit-for-purpose for the online running conditions faced by any
large-scale data acquisition system.
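One way to read the speed claim is the discretize-and-precompute strategy: if each input is binned, the classifier response over the finite grid can be computed offline and served online as a single table lookup. The sketch below assumes that reading; the scoring function is a toy stand-in, not a trained BDT:

```python
# Sketch: discretize each input so the classifier response can be
# precomputed over the finite grid and evaluated online by one lookup,
# independent of the ensemble's depth or size.
import bisect, itertools

bin_edges = [[1.0, 5.0], [0.2, 0.8]]        # per-feature discretization edges

def to_bins(x):
    return tuple(bisect.bisect(edges, xi) for edges, xi in zip(bin_edges, x))

def bdt_response(bins):                     # toy stand-in for a trained BDT
    return sum(bins) / sum(len(e) for e in bin_edges)

# Offline: precompute every possible response (3 x 3 grid here).
table = {b: bdt_response(b)
         for b in itertools.product(*(range(len(e) + 1) for e in bin_edges))}

# Online: one binning pass plus one dictionary lookup.
score = table[to_bins([3.2, 0.9])]
print(score)  # → 0.75
```

The table grows with the product of bin counts, so coarse binning is what keeps this feasible in a trigger.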
Decision diagrams in machine learning: an empirical study on real-life credit-risk data.
Decision trees are a widely used knowledge representation in machine learning. However, one of their main drawbacks is the inherent replication of isomorphic subtrees, as a result of which the produced classifiers might become too large to be comprehensible by the human experts that have to validate them. Alternatively, decision diagrams, a generalization of decision trees taking on the form of a rooted, acyclic digraph instead of a tree, have occasionally been suggested as a potentially more compact representation. Their application in machine learning has nonetheless been criticized, because the theoretical size advantages of subgraph sharing did not always directly materialize in the relatively scarce reported experiments on real-world data. Therefore, in this paper, starting from a series of rule sets extracted from three real-life credit-scoring data sets, we will empirically assess to what extent decision diagrams are able to provide a compact visual description. Furthermore, we will investigate the practical impact of finding a good attribute ordering on the achieved size savings.
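The subgraph sharing that motivates decision diagrams can be demonstrated with a small bottom-up merge: hashing each subtree lets structurally identical subtrees collapse into one shared node, turning the tree into a rooted acyclic digraph. The encoding below is invented for illustration:

```python
# Sketch: merge isomorphic subtrees via hashing. A node is either
# ('leaf', label) or (attribute, low_child, high_child); identical
# subtrees end up represented by a single shared object.

def merge(node, cache):
    if node[0] == 'leaf':
        key = node
    else:
        attr, lo, hi = node
        # Children are merged first, so their identities are canonical.
        key = (attr, id(merge(lo, cache)), id(merge(hi, cache)))
    if key not in cache:
        cache[key] = node
    return cache[key]

# A tree that replicates the same subtree under both branches of the root:
sub = ('x2', ('leaf', 0), ('leaf', 1))
tree = ('x1', sub, ('x2', ('leaf', 0), ('leaf', 1)))
cache = {}
merge(tree, cache)
print(len(cache))  # → 4 shared diagram nodes vs. 7 tree nodes
```

Whether such savings materialize on real rule sets, and how much the attribute ordering matters, is exactly what the paper tests empirically.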
Bayesian Additive Regression Trees using Bayesian Model Averaging
Bayesian Additive Regression Trees (BART) is a statistical sum of trees
model. It can be considered a Bayesian version of machine learning tree
ensemble methods where the individual trees are the base learners. However, for
data sets where the number of variables p is large, the
algorithm can become prohibitively expensive computationally.
Another method which is popular for high dimensional data is random forests,
a machine learning algorithm which grows trees using a greedy search for the
best split points. However, as it is not a statistical model, it cannot produce
probabilistic estimates or predictions.
We propose an alternative algorithm for BART called BART-BMA, which uses
Bayesian Model Averaging and a greedy search algorithm to produce a model which
is much more efficient than BART for datasets with large p. BART-BMA
incorporates elements of both BART and random forests to offer a model-based
algorithm which can deal with high-dimensional data.
We have found that BART-BMA can be run in a reasonable time on a standard
laptop for the "small n, large p" scenario which is common in many areas of
bioinformatics. We showcase this method using simulated data and data from two
real proteomic experiments; one to distinguish between patients with
cardiovascular disease and controls and another to classify aggressive from
non-aggressive prostate cancer. We compare our results to their main
competitors.
Open source code written in R and Rcpp to run BART-BMA can be found at:
https://github.com/BelindaHernandez/BART-BMA.gi
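This is not the BART-BMA algorithm itself (see the repository above for that); as a rough illustration of the sum-of-trees idea it builds on, the toy sketch below greedily fits depth-1 stumps to the residuals of the current ensemble, with all names and data invented:

```python
# Toy sum-of-trees sketch: each tree (a depth-1 stump here) is fit
# greedily to the residuals of the current ensemble; predictions are the
# sum over trees. BART/BART-BMA place this idea in a Bayesian framework.

def fit_stump(xs, rs):
    """Best single-split stump minimizing squared error on residuals rs."""
    best = None
    for thr in sorted(set(xs)):
        left = [r for x, r in zip(xs, rs) if x <= thr]
        right = [r for x, r in zip(xs, rs) if x > thr]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, ml, mr)
    return best[1:]

def predict(stumps, x):
    return sum(ml if x <= thr else mr for thr, ml, mr in stumps)

xs = [1, 2, 3, 4]
ys = [1.0, 1.0, 3.0, 3.0]
stumps, resid = [], ys[:]
for _ in range(2):                       # two greedy rounds
    stump = fit_stump(xs, resid)
    stumps.append(stump)
    resid = [r - predict([stump], x) for x, r in zip(xs, resid)]
print([predict(stumps, x) for x in xs])  # → [1.0, 1.0, 3.0, 3.0]
```

The greedy search over split points is the random-forest-like ingredient; the model averaging and uncertainty quantification are what the Bayesian side adds.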
ADAPTS: An Intelligent Sustainable Conceptual Framework for Engineering Projects
This paper presents a conceptual framework for the optimization of environmental sustainability in engineering projects, both for products and for industrial facilities or processes. The main objective of this work is to propose a conceptual framework that helps researchers approach the optimization of engineering projects under sustainability criteria, making use of current Machine Learning techniques. For the development of this conceptual framework, a bibliographic search was carried out on the Web of Science. The selected documents were analyzed through a hermeneutic procedure, from which the conceptual framework was derived. A pyramid-shaped graphic representation clearly defines the variables of the proposed conceptual framework and their relationships. The conceptual framework consists of 5 dimensions; its acronym is ADAPTS. At the base are: (1) the Application for which it is intended, (2) the available DAta, (3) the APproach under which it is operated, and (4) the machine learning Tool used. At the top of the pyramid is (5) the necessary Sensing. A case study is proposed to show its applicability. This work is part of a broader line of research on optimization under sustainability criteria. Telefónica Chair "Intelligence in Networks" of the University of Seville (Spain).
A framework for redescription set construction
Redescription mining is a field of knowledge discovery that aims at finding
different descriptions of similar subsets of instances in the data. These
descriptions are represented as rules inferred from one or more disjoint sets
of attributes, called views. As such, they support knowledge discovery process
and help domain experts in formulating new hypotheses or constructing new
knowledge bases and decision support systems. In contrast to previous
approaches that typically create one smaller set of redescriptions satisfying a
pre-defined set of constraints, we introduce a framework that creates a large
and heterogeneous redescription set from which users or experts can extract
compact sets with differing properties, according to their own preferences.
Construction of the large and heterogeneous redescription set relies on the
CLUS-RM algorithm and a novel conjunctive refinement procedure that facilitates
generation of larger and more accurate redescription sets. The work also
addresses the variability of redescription accuracy when missing values are
present in the data, which significantly extends the applicability of the
method. A crucial part of the framework is redescription set extraction, based
on a heuristic multi-objective optimization procedure that allows the user to
define importance levels for one or more redescription quality criteria. We
provide both a theoretical and an empirical comparison of the novel framework
against current state-of-the-art redescription mining algorithms and show that
it represents a more efficient and versatile approach for mining redescriptions
from data.
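The core quantity behind a redescription can be sketched concretely: two rules, one per view, each select a support set of instances, and redescription accuracy is commonly measured by the Jaccard index of the two supports. The rule encoding and data below are invented for illustration:

```python
# Sketch: two rules from disjoint attribute views; their quality as a
# redescription is the Jaccard similarity of the instance sets they select.

def support(rows, rule):
    """rule: attribute -> predicate; returns indices of matching rows."""
    return {i for i, row in enumerate(rows)
            if all(pred(row[a]) for a, pred in rule.items())}

view1 = [{'habitat': 'wet'}, {'habitat': 'wet'}, {'habitat': 'dry'}]
view2 = [{'rain_mm': 900}, {'rain_mm': 300}, {'rain_mm': 100}]

s1 = support(view1, {'habitat': lambda v: v == 'wet'})
s2 = support(view2, {'rain_mm': lambda v: v > 500})
jaccard = len(s1 & s2) / len(s1 | s2)
print(jaccard)  # → 0.5: how well the two descriptions redescribe each other
```

A framework like the one above generates many such rule pairs and then extracts subsets trading off accuracy against other quality criteria.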
Bridge type classification: supervised learning on a modified NBI dataset
A key phase in the bridge design process is the selection of the structural
system. Due to budget and time constraints, engineers typically rely on
engineering judgment and prior experience when selecting a structural system,
often considering a limited range of design alternatives. The objective of this
study was to explore the suitability of supervised machine learning as a
preliminary design aid that provides guidance to engineers with regards to the
statistically optimal bridge type to choose, ultimately improving the
likelihood of optimized design, design standardization, and reduced maintenance
costs. In order to devise this supervised learning system, data for over
600,000 bridges from the National Bridge Inventory database were analyzed. Key
attributes for determining the bridge structure type were identified through
three feature selection techniques. Potentially useful attributes like seismic
intensity and historic data on the cost of materials (steel and concrete) were
then added from the US Geological Survey (USGS) database and Engineering News
Record. Decision tree, Bayes network and Support Vector Machines were used for
predicting the bridge design type. Due to state-to-state variations in material
availability, material costs, and design codes, supervised learning models
based on the complete data set did not yield favorable results. Supervised
learning models were then trained and tested using 10-fold cross validation on
data for each state. Inclusion of seismic data improved the model performance
noticeably. The data was then resampled to reduce the bias of the models
towards more common design types, and the supervised learning models thus
constructed showed further improvements in performance. The average recall and
precision for the state models was 88.6% and 88.0% using Decision Trees, 84.0%
and 83.7% using Bayesian Networks, and 80.8% and 75.6% using SVM.
Comment: Preprint of paper published in the ASCE Journal of Computing
(https://ascelibrary.org/doi/full/10.1061/(ASCE)CP.1943-5487.0000712). 6
figures and 8 tables, provided at end of document.
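The averaged recall and precision figures reported above can be reproduced in form with a small macro-averaging routine; the bridge-type labels below are toy stand-ins, not NBI data:

```python
# Sketch of the evaluation metric: per-class precision and recall,
# macro-averaged across bridge design types.

def macro_prec_recall(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    precs, recs = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)
        actual = sum(t == c for t in y_true)
        precs.append(tp / predicted if predicted else 0.0)
        recs.append(tp / actual if actual else 0.0)
    return sum(precs) / len(precs), sum(recs) / len(recs)

y_true = ['girder', 'girder', 'truss', 'arch']
y_pred = ['girder', 'truss', 'truss', 'arch']
prec, rec = macro_prec_recall(y_true, y_pred)
print(prec, rec)  # one misclassified girder drags both averages below 1.0
```

Macro-averaging weights every design type equally, which is why the resampling step against common types matters for these numbers.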