An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests
Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Random forests in particular, which can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine, and bioinformatics within the past few years.
High-dimensional problems are common not only in genetics but also in some areas of psychological research, where only a few subjects can be measured due to time or cost constraints, yet a large amount of data is generated for each subject. Random forests have been shown to achieve high prediction accuracy in such applications, and they provide descriptive variable importance measures that reflect the impact of each variable through both main effects and interactions.
The aim of this work is to introduce the principles of the standard recursive partitioning methods as well as recent methodological improvements, to illustrate their usage for low- and high-dimensional data exploration, and also to point out limitations of the methods and potential pitfalls in their practical application.
Application of the methods is illustrated using freely available implementations in the R system for statistical computing.
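A minimal sketch of the kind of exploratory workflow the abstract describes, fitting a random forest to a "few subjects, many variables" problem and inspecting permutation-based variable importances. It uses scikit-learn rather than the R implementations the authors refer to, and the synthetic data set is purely illustrative.

```python
# Minimal sketch (not from the paper): random forest with permutation variable
# importance, using scikit-learn instead of the R packages named in the abstract.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Many predictors, few informative ones -- a "small n, large p" style problem.
X, y = make_classification(n_samples=200, n_features=500, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))

# Permutation importance reflects each variable's contribution through both
# main effects and interactions, as the abstract describes.
imp = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
for j in np.argsort(imp.importances_mean)[::-1][:10]:
    print(f"feature {j}: {imp.importances_mean[j]:.4f}")
```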
Decision Stream: Cultivating Deep Decision Trees
Various modifications of decision trees have been used extensively in recent years due to their high efficiency and interpretability. Tree node splitting based on relevant feature selection is a key step of decision tree learning, and at the same time its major shortcoming: recursive node partitioning leads to a geometric reduction of the amount of data in the leaf nodes, which causes excessive model complexity and overfitting. In this paper, we present a novel architecture, the Decision Stream, aimed at overcoming this problem. Instead of building a tree structure during the learning process, we propose merging nodes from different branches based on their similarity, estimated with two-sample test statistics, which leads to a deep directed acyclic graph of decision rules that can consist of hundreds of levels. To evaluate the proposed solution, we test it on several common machine learning problems: credit scoring, Twitter sentiment analysis, aircraft flight control, MNIST and CIFAR image classification, and synthetic data classification and regression. Our experimental results show that the proposed approach significantly outperforms standard decision tree learning methods on both regression and classification tasks, yielding a prediction error decrease of up to 35%.
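A hedged sketch of the core merging idea, not the authors' implementation: leaves whose target distributions are statistically indistinguishable under a two-sample test are merged, so branches re-join into a directed acyclic graph instead of multiplying leaves. The Kolmogorov-Smirnov test and the significance threshold used here are assumptions; the paper speaks of two-sample test statistics more generally.

```python
# Illustrative sketch only: merge leaf nodes whose target samples cannot be
# distinguished by a two-sample test, in the spirit of the Decision Stream idea.
import numpy as np
from scipy.stats import ks_2samp

def merge_similar_leaves(leaves, alpha=0.05):
    """leaves: list of 1-D arrays of target values, one per leaf.
    Returns a list of merged groups (each a concatenated array)."""
    groups = [np.asarray(leaf, dtype=float) for leaf in leaves]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                # If we cannot reject "same distribution", merge the two leaves.
                if ks_2samp(groups[i], groups[j]).pvalue > alpha:
                    groups[i] = np.concatenate([groups[i], groups[j]])
                    del groups[j]
                    merged = True
                    break
            if merged:
                break
    return groups

leaves = [np.random.normal(0, 1, 40), np.random.normal(0.05, 1, 35),
          np.random.normal(3, 1, 50)]
print([len(g) for g in merge_similar_leaves(leaves)])  # two distinct groups remain
```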
An independently validated nomogram for isocitrate dehydrogenase-wild-type glioblastoma patient survival.
Background: In 2016, the World Health Organization reclassified the definition of glioblastoma (GBM), dividing these tumors into isocitrate dehydrogenase (IDH)-wild-type and IDH-mutant GBM, where the vast majority of GBMs are IDH-wild-type. Nomograms are useful tools for individualized estimation of survival. This study aimed to develop and independently validate a nomogram for IDH-wild-type patients with newly diagnosed GBM. Methods: Data were obtained from newly diagnosed GBM patients from the Ohio Brain Tumor Study (OBTS) and the University of California San Francisco (UCSF) for diagnosis years 2007-2017 with the following variables: age at diagnosis, sex, extent of resection, concurrent radiation/temozolomide (TMZ) status, Karnofsky Performance Status (KPS), O6-methylguanine-DNA methyltransferase (MGMT) methylation status, and IDH mutation status. Survival was assessed using Cox proportional hazards regression, random survival forests, and recursive partitioning analysis, with adjustment for known prognostic factors. The models were developed using the OBTS data and independently validated using the UCSF data. Models were internally validated using 10-fold cross-validation and externally validated by plotting calibration curves. Results: A final nomogram was validated for IDH-wild-type newly diagnosed GBM. Factors that increased the probability of survival included younger age at diagnosis, female sex, having gross total resection, having concurrent radiation/TMZ, having a high KPS, and having MGMT methylation. Conclusions: A nomogram that calculates individualized survival probabilities for IDH-wild-type patients with newly diagnosed GBM could be useful to physicians for counseling patients regarding treatment decisions and optimizing therapeutic approaches. Free software for implementing this nomogram is provided: https://gcioffi.shinyapps.io/Nomogram_For_IDH_Wildtype_GBM_H_Gittleman/
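The nomogram itself is distributed as the Shiny app linked above. As a rough illustration of the modelling step behind such a tool, the sketch below fits a Cox proportional hazards model with covariates named after the prognostic factors in the abstract; it uses the Python lifelines package and a made-up data frame, not the authors' code or data.

```python
# Illustrative only: Cox proportional hazards fit on synthetic data with
# covariates named after the prognostic factors in the abstract. This is not
# the authors' model or data; it only shows the general modelling step behind
# a survival nomogram.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.normal(62, 10, n),
    "female": rng.integers(0, 2, n),
    "gross_total_resection": rng.integers(0, 2, n),
    "concurrent_rt_tmz": rng.integers(0, 2, n),
    "kps": rng.choice([60, 70, 80, 90, 100], n),
    "mgmt_methylated": rng.integers(0, 2, n),
    "months": rng.exponential(15, n),   # synthetic survival time
    "event": rng.integers(0, 2, n),     # 1 = death observed (synthetic)
})

cph = CoxPHFitter()
cph.fit(df, duration_col="months", event_col="event")
cph.print_summary()

# Individualized survival probabilities at 12 and 24 months, the kind of
# quantity a nomogram reports for a single patient.
patient = df.drop(columns=["months", "event"]).iloc[[0]]
print(cph.predict_survival_function(patient, times=[12, 24]))
```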
Novel proposal for prediction of CO2 course and occupancy recognition in Intelligent Buildings within IoT
Many direct and indirect methods, processes, and sensors available on the market today are used to monitor the occupancy of selected Intelligent Building (IB) premises and the living activities of IB residents. By recognizing the occupancy of individual spaces, an IB can be optimally automated in conjunction with energy savings. This article proposes a novel method of indirect occupancy monitoring using CO2, temperature, and relative humidity measured by means of standard operating measurements with the KNX (Konnex, standard EN 50090, ISO/IEC 14543) technology to monitor laboratory room occupancy in an intelligent building within the Internet of Things (IoT). The article further describes the design and creation of a software (SW) tool that connects the KNX technology to the IBM Watson IoT platform in real time, transmitting the measured values over the Message Queuing Telemetry Transport (MQTT) protocol, storing them in a CouchDB database, and visualizing them. As part of the proposed occupancy-determination method, the course of CO2 concentration was predicted from the measured temperature and relative humidity values using linear regression, neural networks, and random tree models (in IBM SPSS Modeler) with an accuracy higher than 90%. To further increase the accuracy, additive noise in the predicted CO2 signal was suppressed with the least mean squares (LMS) algorithm in an adaptive filtering (AF) setup within the newly designed method. In selected experiments, the prediction accuracy with LMS adaptive filtration was better than 95%.
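A minimal numpy sketch of the adaptive-filtering step the abstract describes: an FIR filter whose weights are adapted so that the filtered noisy prediction tracks a reference signal. For numerical stability this sketch uses the normalized LMS variant; the filter length, step size, synthetic signals, and units are all assumptions, not the paper's configuration.

```python
# Illustrative sketch: normalized LMS adaptive filter cleaning a noisy
# predicted CO2 trace. The paper's exact LMS setup is not reproduced here.
import numpy as np

def nlms_filter(x, d, M=8, mu=0.5, eps=1e-6):
    """x: noisy input signal, d: desired/reference signal; returns filtered output."""
    w = np.zeros(M)                       # FIR filter weights
    y = np.zeros(len(x))
    for k in range(M, len(x)):
        window = x[k - M:k][::-1]         # most recent M input samples
        y[k] = w @ window                 # filter output
        e = d[k] - y[k]                   # estimation error
        w += mu * e * window / (window @ window + eps)   # normalized LMS update
    return y

# Synthetic example: a smooth CO2-like course (ppm, illustrative) plus additive noise.
t = np.linspace(0, 10, 2000)
clean = 600 + 200 * np.sin(0.5 * t)
noisy_prediction = clean + np.random.normal(0, 40, t.size)
filtered = nlms_filter(noisy_prediction, clean)

# Compare residual noise after the filter has converged (second half of the signal).
half = t.size // 2
print("noise RMS before:", np.std(noisy_prediction[half:] - clean[half:]).round(1))
print("noise RMS after: ", np.std(filtered[half:] - clean[half:]).round(1))
```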
Optimization in a Simulation Setting: Use of Function Approximation in Debt Strategy Analysis
The stochastic simulation model suggested by Bolder (2003) for the analysis of the federal government's debt-management strategy provides a wide variety of useful information. It does not, however, assist in determining an optimal debt-management strategy for the government in its current form. Including optimization in the debt-strategy model would be useful, since it could substantially broaden the range of policy questions that can be addressed. Finding such an optimal strategy is nonetheless complicated by two challenges. First, performing optimization with traditional techniques in a simulation setting is computationally intractable. Second, it is necessary to define precisely what one means by an "optimal" debt strategy. The authors detail a possible approach for addressing these two challenges. They address the first challenge by approximating the numerically computed objective function using a function-approximation technique. They consider the use of ordinary least squares, kernel regression, multivariate adaptive regression splines, and projection-pursuit regression as approximation algorithms. The second challenge is addressed by proposing a wide range of possible government objective functions and examining them in the context of an illustrative example. The authors' view is that the approach permits debt and fiscal managers to address a number of policy questions that could not be fully addressed with the current stochastic simulation engine.
Keywords: Debt management; Econometric and statistical methods; Fiscal policy; Financial markets
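A toy illustration of the surrogate-optimization pattern the abstract describes, not the Bank of Canada model: evaluate a noisy "simulation" on a grid of candidate strategies, approximate the objective with an ordinary least squares fit (one of the approximators the authors consider; the quadratic basis and the cost function are invented here), and optimize the cheap, smooth surrogate instead of the simulation itself.

```python
# Toy illustration (not the paper's model): optimize a surrogate fitted to
# noisy simulated objective values.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)

def simulate_cost(x, n_paths=200):
    """Stand-in for a stochastic debt-strategy simulation (purely illustrative)."""
    shocks = rng.normal(0, 1, n_paths)
    # Hypothetical trade-off: expected cost rises away from an interior issuance mix.
    return np.mean((x - 0.4) ** 2 + 0.05 * x * shocks + 0.02 * np.abs(shocks))

# 1. Evaluate the simulated objective on a coarse grid of candidate strategies.
grid = np.linspace(0.0, 1.0, 21)
costs = np.array([simulate_cost(x) for x in grid])

# 2. Approximate the objective with OLS on a quadratic basis (basis choice is ours).
basis = np.vstack([np.ones_like(grid), grid, grid ** 2]).T
beta, *_ = np.linalg.lstsq(basis, costs, rcond=None)
surrogate = lambda x: beta[0] + beta[1] * x + beta[2] * x ** 2

# 3. Optimize the smooth surrogate instead of the noisy, expensive simulation.
res = minimize_scalar(surrogate, bounds=(0.0, 1.0), method="bounded")
print("approximately optimal strategy parameter:", round(res.x, 3))
```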