
    Decision trees in epidemiological research

    Background: In many studies, it is of interest to identify population subgroups that are relatively homogeneous with respect to an outcome. The nature of these subgroups can provide insight into effect mechanisms and suggest targets for tailored interventions. However, identifying relevant subgroups can be challenging with standard statistical methods. Main text: We review the literature on decision trees, a family of techniques for partitioning the population, on the basis of covariates, into distinct subgroups that share similar values of an outcome variable. We compare two decision tree methods, the popular Classification and Regression Tree (CART) technique and the newer Conditional Inference Tree (CTree) technique, assessing their performance in a simulation study and using data from the Box Lunch Study, a randomized controlled trial of a portion size intervention. Both CART and CTree identify homogeneous population subgroups and offer improved prediction accuracy relative to regression-based approaches when subgroups are truly present in the data. An important distinction between CART and CTree is that the latter uses a formal statistical hypothesis testing framework in building decision trees, which simplifies the process of identifying and interpreting the final tree model. We also introduce a novel way to visualize the subgroups defined by decision trees, which provides a more scientifically meaningful characterization of them. Conclusions: Decision trees are a useful tool for identifying homogeneous subgroups defined by combinations of individual characteristics. While all decision tree techniques generate subgroups, we advocate the use of the newer CTree technique due to its simplicity and ease of interpretation.
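    As a rough illustration of subgroup discovery with a CART-style tree, the sketch below fits a pruned regression tree to simulated data and summarizes the outcome within each terminal node. It uses scikit-learn rather than the software used in the paper, and the variables (age, baseline intake, treatment arm) are invented placeholders, not the Box Lunch Study data.

        # Minimal sketch of CART-style subgroup identification (simulated data,
        # not the Box Lunch Study; variable names are illustrative only).
        import numpy as np
        import pandas as pd
        from sklearn.tree import DecisionTreeRegressor, export_text

        rng = np.random.default_rng(0)
        n = 500
        X = pd.DataFrame({
            "age": rng.integers(20, 70, n),
            "baseline_intake": rng.normal(2000, 400, n),
            "treatment": rng.integers(0, 2, n),
        })
        # Simulated outcome with a genuine subgroup effect (treated subjects under 40).
        y = 0.5 * X["baseline_intake"] + 300 * (X["treatment"] * (X["age"] < 40)) + rng.normal(0, 100, n)

        # Cost-complexity pruning (ccp_alpha) stands in for CART's pruning step;
        # min_samples_leaf keeps the resulting subgroups reasonably large.
        tree = DecisionTreeRegressor(min_samples_leaf=50, ccp_alpha=5.0).fit(X, y)
        print(export_text(tree, feature_names=list(X.columns)))

        # Each terminal node is a subgroup; its size and mean outcome characterize it.
        subgroup = tree.apply(X)
        print(pd.DataFrame({"subgroup": subgroup, "outcome": y}).groupby("subgroup")["outcome"].agg(["count", "mean"]))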

    Developing a discrimination rule between breast cancer patients and controls using proteomics mass spectrometric data: A three-step approach

    To discriminate between breast cancer patients and controls, we used a three-step approach to obtain our decision rule. First, we ranked the mass/charge values using random forests, because this method generates importance indices that take possible interactions into account. We observed that the top-ranked variables consisted of highly correlated contiguous mass/charge values, which were grouped in the second step into new variables. Finally, these newly created variables were used as predictors to find a suitable discrimination rule. In this last step, we compared three different methods, namely Classification and Regression Trees (CART), logistic regression, and penalized logistic regression. Logistic regression and penalized logistic regression performed equally well, and both had a higher classification accuracy than CART. The model obtained with penalized logistic regression was chosen because we hypothesized that it would provide better classification accuracy in the validation set. The final model performed well on the training set, with a classification accuracy of 86.3% and a sensitivity and specificity of 86.8% and 85.7%, respectively.
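    The three steps described above can be sketched in Python with scikit-learn on synthetic data; the actual mass spectrometric data, the exact grouping rule, and the penalization used in the paper are not reproduced here, so every threshold below is an assumed placeholder.

        # Sketch of the three-step approach on synthetic data:
        # 1) rank variables with random-forest importances,
        # 2) merge highly correlated contiguous mass/charge bins into grouped variables,
        # 3) fit a penalized logistic regression on the grouped variables.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(1)
        n, p = 200, 300                                   # cases x mass/charge bins (illustrative sizes)
        X = rng.normal(size=(n, p))
        y = (X[:, 50:55].mean(axis=1) + 0.5 * rng.normal(size=n) > 0).astype(int)

        # Step 1: importance ranking with a random forest.
        rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
        top = np.sort(np.argsort(rf.feature_importances_)[::-1][:20])

        # Step 2: group top-ranked bins that are contiguous and highly correlated.
        groups, current = [], [top[0]]
        for idx in top[1:]:
            if idx == current[-1] + 1 and np.corrcoef(X[:, idx], X[:, current[-1]])[0, 1] > 0.8:
                current.append(idx)
            else:
                groups.append(current)
                current = [idx]
        groups.append(current)
        X_grouped = np.column_stack([X[:, g].mean(axis=1) for g in groups])

        # Step 3: penalized (here ridge) logistic regression on the grouped variables.
        plr = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
        print("cross-validated accuracy:", cross_val_score(plr, X_grouped, y, cv=5).mean().round(3))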

    Understanding Preferences for Income Redistribution

    Recent research suggests that income redistribution preferences vary across identity groups. We employ a new pattern recognition technology, tree regression analysis, to uncover what these groups are. Using data from the General Social Survey, we present a new stylized fact that preferences for governmental provision of income redistribution vary systematically with race, gender, and class background. We explore the extent to which existing theories of income redistribution can explain our results, but conclude that current approaches do not fully explain the findings.

    Hybrid model using logit and nonparametric methods for predicting micro-entity failure

    Following the calls from the literature on bankruptcy, a parsimonious hybrid bankruptcy model is developed in this paper by combining parametric and non-parametric approaches. To this end, the variables with the highest predictive power to detect bankruptcy are selected using logistic regression (LR). Subsequently, alternative non-parametric methods (Multilayer Perceptron, Rough Set, and Classification and Regression Trees) are applied, in turn, to firms classified as either “bankrupt” or “not bankrupt”. Our findings show that hybrid models, particularly those combining LR and Multilayer Perceptron, offer better accuracy and interpretability, and converge faster, than each method implemented in isolation. Moreover, the authors demonstrate that the introduction of non-financial and macroeconomic variables complements financial ratios for bankruptcy prediction.
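    A minimal sketch of the hybrid idea, on synthetic data rather than the original micro-entity sample, is given below: a penalized logit serves as the variable-selection stage (the paper only states that LR selects the most predictive variables, so the L1 penalty here is an assumption), and a Multilayer Perceptron is then trained on the selected variables only.

        # Hybrid sketch: logistic regression selects variables, an MLP is trained on them.
        # Synthetic placeholders stand in for the financial ratios used in the paper.
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.neural_network import MLPClassifier
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(2)
        n, p = 1000, 25                                   # firms x candidate ratios (illustrative)
        X = rng.normal(size=(n, p))
        y = (X[:, 0] - 0.8 * X[:, 3] + rng.normal(scale=0.5, size=n) > 0).astype(int)

        # Stage 1: L1-penalized logit; keep the ratios with non-zero coefficients.
        selector = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
        keep = np.flatnonzero(selector.coef_.ravel())
        print("selected ratios:", keep)

        # Stage 2: the non-parametric stage (a Multilayer Perceptron) on the reduced set.
        mlp = make_pipeline(StandardScaler(),
                            MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0))
        print("hybrid CV accuracy:", cross_val_score(mlp, X[:, keep], y, cv=5).mean().round(3))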

    Learning Multi-Tree Classification Models with Ant Colony Optimization

    Ant Colony Optimization (ACO) is a meta-heuristic for solving combinatorial optimization problems, inspired by the behaviour of biological ant colonies. One of the successful applications of ACO is learning classification models (classifiers). A classifier encodes the relationships between the input attribute values and the values of a class attribute in a given set of labelled cases, and it can be used to predict the class value of new unlabelled cases. Decision trees have been widely used as a type of classification model that represents comprehensible knowledge to the user. In this paper, we propose the use of ACO-based algorithms for learning an extended multi-tree classification model, which consists of multiple decision trees, one for each class value. Each class-based decision tree is responsible for discriminating between its class value and all other values available in the class domain. Our proposed algorithms are empirically evaluated against well-known decision tree induction algorithms, as well as the ACO-based Ant-Tree-Miner algorithm. The results show an overall improvement in predictive accuracy over 32 benchmark datasets. We also discuss how the new multi-tree models can provide the user with more understanding and interpretable knowledge in a given domain.
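    The structure of the multi-tree model (one decision tree per class value, each separating its class from all the others) can be illustrated with the short sketch below. The ACO-based search used by the paper is not reproduced; ordinary greedy tree induction from scikit-learn stands in for it, and the combination rule (the most confident per-class tree wins) is an assumption made for illustration.

        # Multi-tree illustration: one class-based tree per class value (one-vs-rest).
        # Greedy induction replaces the paper's ACO-based construction.
        import numpy as np
        from sklearn.datasets import load_iris
        from sklearn.tree import DecisionTreeClassifier

        X, y = load_iris(return_X_y=True)
        classes = np.unique(y)

        # Learn one tree per class value, discriminating that class from all others.
        trees = {c: DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, (y == c).astype(int))
                 for c in classes}

        def predict_multitree(X_new):
            # Each tree scores "how likely is this my class"; the most confident tree wins.
            scores = np.column_stack([trees[c].predict_proba(X_new)[:, 1] for c in classes])
            return classes[np.argmax(scores, axis=1)]

        print("training accuracy:", (predict_multitree(X) == y).mean())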

    Fitting Prediction Rule Ensembles with R Package pre

    Prediction rule ensembles (PREs) are sparse collections of rules, offering highly interpretable regression and classification models. This paper presents the R package pre, which derives PREs through the methodology of Friedman and Popescu (2008). The implementation and functionality of package pre are described and illustrated through an application to a dataset on the prediction of depression. Furthermore, the accuracy and sparsity of PREs are compared with those of single trees, random forests, and lasso regression on four benchmark datasets. Results indicate that pre derives ensembles with predictive accuracy comparable to that of random forests, while using a smaller number of variables for prediction.
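    pre itself is an R package; the Python sketch below only illustrates the general rule-ensemble recipe of Friedman and Popescu (2008) that it builds on: generate conjunctive rules as the leaves of shallow trees, encode each case by the rules it satisfies, then keep a sparse subset of rules with an L1-penalized model. It is a simplified stand-in, not the package's actual algorithm.

        # Simplified rule-ensemble sketch (not the pre package): tree leaves act as
        # conjunctive rules, and an L1-penalized logit selects a sparse subset of them.
        import numpy as np
        from sklearn.datasets import load_breast_cancer
        from sklearn.ensemble import GradientBoostingClassifier
        from sklearn.preprocessing import OneHotEncoder
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        X, y = load_breast_cancer(return_X_y=True)

        # Rule generation: each leaf of each shallow boosted tree corresponds to one rule
        # (the conjunction of split conditions on the path to that leaf).
        booster = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0).fit(X, y)
        leaves = booster.apply(X)[:, :, 0]                # (n_samples, n_trees) leaf indices
        rules = OneHotEncoder(handle_unknown="ignore").fit_transform(leaves)

        # Sparse linear stage: the L1 penalty keeps only a small number of rules,
        # which is what makes the final ensemble interpretable.
        # (Rules are generated from the full sample here purely for brevity.)
        lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
        print("rule-ensemble CV accuracy:", cross_val_score(lasso_logit, rules, y, cv=5).mean().round(3))
        print("rules retained:", int((np.abs(lasso_logit.fit(rules, y).coef_) > 0).sum()))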