An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests
Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Random forests in particular, which can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine, and bioinformatics within the past few years.
High-dimensional problems are common not only in genetics but also in some areas of psychological research, where only a few subjects can be measured due to time or cost constraints, yet a large amount of data is generated for each subject. Random forests have been shown to achieve high prediction accuracy in such applications, and they provide descriptive variable importance measures that reflect the impact of each variable through both main effects and interactions.
The aim of this work is to introduce the principles of the standard recursive partitioning methods as well as recent methodological improvements, to illustrate their usage for low- and high-dimensional data exploration, and also to point out limitations of the methods and potential pitfalls in their practical application.
Application of the methods is illustrated using freely available implementations in the R system for statistical computing.
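A minimal sketch of the kind of exploratory workflow the abstract describes, fitting a random forest to a "few subjects, many variables" problem and inspecting permutation-based variable importances. It uses scikit-learn rather than the R implementations the authors refer to, and the synthetic data set is purely illustrative.

```python
# Minimal sketch (not from the paper): random forest with permutation variable
# importance, using scikit-learn instead of the R packages named in the abstract.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Many predictors, few informative ones -- a "small n, large p" style problem.
X, y = make_classification(n_samples=200, n_features=500, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))

# Permutation importance reflects each variable's contribution through both
# main effects and interactions, as the abstract describes.
imp = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
for j in np.argsort(imp.importances_mean)[::-1][:10]:
    print(f"feature {j}: {imp.importances_mean[j]:.4f}")
```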
Decision Stream: Cultivating Deep Decision Trees
Various modifications of decision trees have been used extensively in recent years due to their high efficiency and interpretability. Tree node splitting based on relevant feature selection is a key step of decision tree learning, and at the same time its major shortcoming: recursive node partitioning leads to a geometric reduction of the amount of data in the leaf nodes, which causes excessive model complexity and overfitting. In this paper, we present a novel architecture, the Decision Stream, aimed at overcoming this problem. Instead of building a tree structure during the learning process, we propose merging nodes from different branches based on their similarity, estimated with two-sample test statistics, which leads to a deep directed acyclic graph of decision rules that can consist of hundreds of levels. To evaluate the proposed solution, we test it on several common machine learning problems: credit scoring, Twitter sentiment analysis, aircraft flight control, MNIST and CIFAR image classification, and synthetic data classification and regression. Our experimental results show that the proposed approach significantly outperforms standard decision tree learning methods on both regression and classification tasks, yielding a prediction error decrease of up to 35%.
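A hedged sketch of the core merging idea, not the authors' implementation: leaves whose target distributions are statistically indistinguishable under a two-sample test are merged, so branches re-join into a directed acyclic graph instead of multiplying leaves. The Kolmogorov-Smirnov test and the significance threshold used here are assumptions; the paper speaks of two-sample test statistics more generally.

```python
# Illustrative sketch only: merge leaf nodes whose target samples cannot be
# distinguished by a two-sample test, in the spirit of the Decision Stream idea.
import numpy as np
from scipy.stats import ks_2samp

def merge_similar_leaves(leaves, alpha=0.05):
    """leaves: list of 1-D arrays of target values, one per leaf.
    Returns a list of merged groups (each a concatenated array)."""
    groups = [np.asarray(leaf, dtype=float) for leaf in leaves]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                # If we cannot reject "same distribution", merge the two leaves.
                if ks_2samp(groups[i], groups[j]).pvalue > alpha:
                    groups[i] = np.concatenate([groups[i], groups[j]])
                    del groups[j]
                    merged = True
                    break
            if merged:
                break
    return groups

leaves = [np.random.normal(0, 1, 40), np.random.normal(0.05, 1, 35),
          np.random.normal(3, 1, 50)]
print([len(g) for g in merge_similar_leaves(leaves)])  # two distinct groups remain
```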
An independently validated nomogram for isocitrate dehydrogenase-wild-type glioblastoma patient survival.
Background: In 2016, the World Health Organization reclassified the definition of glioblastoma (GBM), dividing these tumors into isocitrate dehydrogenase (IDH)-wild-type and IDH-mutant GBM, where the vast majority of GBMs are IDH-wild-type. Nomograms are useful tools for individualized estimation of survival. This study aimed to develop and independently validate a nomogram for IDH-wild-type patients with newly diagnosed GBM. Methods: Data were obtained from newly diagnosed GBM patients from the Ohio Brain Tumor Study (OBTS) and the University of California San Francisco (UCSF) for diagnosis years 2007-2017 with the following variables: age at diagnosis, sex, extent of resection, concurrent radiation/temozolomide (TMZ) status, Karnofsky Performance Status (KPS), O6-methylguanine-DNA methyltransferase (MGMT) methylation status, and IDH mutation status. Survival was assessed using Cox proportional hazards regression, random survival forests, and recursive partitioning analysis, with adjustment for known prognostic factors. The models were developed using the OBTS data and independently validated using the UCSF data. Models were internally validated using 10-fold cross-validation and externally validated by plotting calibration curves. Results: A final nomogram was validated for IDH-wild-type newly diagnosed GBM. Factors that increased the probability of survival included younger age at diagnosis, female sex, having gross total resection, having concurrent radiation/TMZ, having a high KPS, and having MGMT methylation. Conclusions: A nomogram that calculates individualized survival probabilities for IDH-wild-type patients with newly diagnosed GBM could be useful to physicians for counseling patients regarding treatment decisions and optimizing therapeutic approaches. Free software for implementing this nomogram is provided: https://gcioffi.shinyapps.io/Nomogram_For_IDH_Wildtype_GBM_H_Gittleman/
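The nomogram itself is distributed as the Shiny app linked above. As a rough illustration of the modelling step behind such a tool, the sketch below fits a Cox proportional hazards model with covariates named after the prognostic factors in the abstract; it uses the Python lifelines package and a made-up data frame, not the authors' code or data.

```python
# Illustrative only: Cox proportional hazards fit on synthetic data with
# covariates named after the prognostic factors in the abstract. This is not
# the authors' model or data; it only shows the general modelling step behind
# a survival nomogram.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.normal(62, 10, n),
    "female": rng.integers(0, 2, n),
    "gross_total_resection": rng.integers(0, 2, n),
    "concurrent_rt_tmz": rng.integers(0, 2, n),
    "kps": rng.choice([60, 70, 80, 90, 100], n),
    "mgmt_methylated": rng.integers(0, 2, n),
    "months": rng.exponential(15, n),   # synthetic survival time
    "event": rng.integers(0, 2, n),     # 1 = death observed (synthetic)
})

cph = CoxPHFitter()
cph.fit(df, duration_col="months", event_col="event")
cph.print_summary()

# Individualized survival probabilities at 12 and 24 months, the kind of
# quantity a nomogram reports for a single patient.
patient = df.drop(columns=["months", "event"]).iloc[[0]]
print(cph.predict_survival_function(patient, times=[12, 24]))
```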
Novel proposal for prediction of CO2 course and occupancy recognition in Intelligent Buildings within IoT
Many direct and indirect methods, processes, and sensors available on the market today are used to monitor the occupancy of selected Intelligent Building (IB) premises and the living activities of IB residents. By recognizing the occupancy of individual spaces, an IB can be optimally automated in conjunction with energy savings. This article proposes a novel method of indirect occupancy monitoring using CO2, temperature, and relative humidity measured by means of standard operating measurements with the KNX (Konnex, standard EN 50090, ISO/IEC 14543) technology to monitor laboratory room occupancy in an intelligent building within the Internet of Things (IoT). The article further describes the design and creation of a software (SW) tool that connects the KNX technology to the IBM Watson IoT platform in real time, transmitting the measured values over the Message Queuing Telemetry Transport (MQTT) protocol, storing them in a CouchDB database, and visualizing them. As part of the proposed occupancy-determination method, the course of CO2 concentration was predicted from the measured temperature and relative humidity values using linear regression, neural networks, and random tree models (in IBM SPSS Modeler) with an accuracy higher than 90%. To further increase the accuracy, additive noise in the predicted CO2 signal was suppressed with the least mean squares (LMS) algorithm in an adaptive filtering (AF) setup within the newly designed method. In selected experiments, the prediction accuracy with LMS adaptive filtration was better than 95%.
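A minimal numpy sketch of the adaptive-filtering step the abstract describes: an FIR filter whose weights are adapted so that the filtered noisy prediction tracks a reference signal. For numerical stability this sketch uses the normalized LMS variant; the filter length, step size, synthetic signals, and units are all assumptions, not the paper's configuration.

```python
# Illustrative sketch: normalized LMS adaptive filter cleaning a noisy
# predicted CO2 trace. The paper's exact LMS setup is not reproduced here.
import numpy as np

def nlms_filter(x, d, M=8, mu=0.5, eps=1e-6):
    """x: noisy input signal, d: desired/reference signal; returns filtered output."""
    w = np.zeros(M)                       # FIR filter weights
    y = np.zeros(len(x))
    for k in range(M, len(x)):
        window = x[k - M:k][::-1]         # most recent M input samples
        y[k] = w @ window                 # filter output
        e = d[k] - y[k]                   # estimation error
        w += mu * e * window / (window @ window + eps)   # normalized LMS update
    return y

# Synthetic example: a smooth CO2-like course (ppm, illustrative) plus additive noise.
t = np.linspace(0, 10, 2000)
clean = 600 + 200 * np.sin(0.5 * t)
noisy_prediction = clean + np.random.normal(0, 40, t.size)
filtered = nlms_filter(noisy_prediction, clean)

# Compare residual noise after the filter has converged (second half of the signal).
half = t.size // 2
print("noise RMS before:", np.std(noisy_prediction[half:] - clean[half:]).round(1))
print("noise RMS after: ", np.std(filtered[half:] - clean[half:]).round(1))
```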
Optimization in a Simulation Setting: Use of Function Approximation in Debt Strategy Analysis
The stochastic simulation model suggested by Bolder (2003) for the analysis of the federal government's debt-management strategy provides a wide variety of useful information. It does not, however, assist in determining an optimal debt-management strategy for the government in its current form. Including optimization in the debt-strategy model would be useful, since it could substantially broaden the range of policy questions that can be addressed. Finding such an optimal strategy is nonetheless complicated by two challenges. First, performing optimization with traditional techniques in a simulation setting is computationally intractable. Second, it is necessary to define precisely what one means by an "optimal" debt strategy. The authors detail a possible approach for addressing these two challenges. They address the first challenge by approximating the numerically computed objective function using a function-approximation technique. They consider the use of ordinary least squares, kernel regression, multivariate adaptive regression splines, and projection-pursuit regression as approximation algorithms. The second challenge is addressed by proposing a wide range of possible government objective functions and examining them in the context of an illustrative example. The authors' view is that the approach permits debt and fiscal managers to address a number of policy questions that could not be fully addressed with the current stochastic simulation engine.
Keywords: Debt management; Econometric and statistical methods; Fiscal policy; Financial markets
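A toy illustration of the surrogate-optimization pattern the abstract describes, not the Bank of Canada model: evaluate a noisy "simulation" on a grid of candidate strategies, approximate the objective with an ordinary least squares fit (one of the approximators the authors consider; the quadratic basis and the cost function are invented here), and optimize the cheap, smooth surrogate instead of the simulation itself.

```python
# Toy illustration (not the paper's model): optimize a surrogate fitted to
# noisy simulated objective values.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)

def simulate_cost(x, n_paths=200):
    """Stand-in for a stochastic debt-strategy simulation (purely illustrative)."""
    shocks = rng.normal(0, 1, n_paths)
    # Hypothetical trade-off: expected cost rises away from an interior issuance mix.
    return np.mean((x - 0.4) ** 2 + 0.05 * x * shocks + 0.02 * np.abs(shocks))

# 1. Evaluate the simulated objective on a coarse grid of candidate strategies.
grid = np.linspace(0.0, 1.0, 21)
costs = np.array([simulate_cost(x) for x in grid])

# 2. Approximate the objective with OLS on a quadratic basis (basis choice is ours).
basis = np.vstack([np.ones_like(grid), grid, grid ** 2]).T
beta, *_ = np.linalg.lstsq(basis, costs, rcond=None)
surrogate = lambda x: beta[0] + beta[1] * x + beta[2] * x ** 2

# 3. Optimize the smooth surrogate instead of the noisy, expensive simulation.
res = minimize_scalar(surrogate, bounds=(0.0, 1.0), method="bounded")
print("approximately optimal strategy parameter:", round(res.x, 3))
```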