
    Variable selection with Random Forests for missing data

    Variable selection has been suggested for Random Forests to improve their prediction performance and interpretability. However, its basic building block, the variable importance measure, cannot be computed in a straightforward way when data are missing. Therefore, an extensive simulation study was conducted to explore possible solutions, namely multiple imputation, complete case analysis and a newly suggested importance measure, for several missing-data generating processes. The ability of these procedures, in combination with two popular variable selection methods, to distinguish relevant from non-relevant variables was investigated. Findings and recommendations: complete case analysis should not be applied, as it led to inaccurate variable selection and to the models with the worst prediction accuracy. Multiple imputation is a good means to select the variables that would be relevant in fully observed data, and it produced the best prediction accuracy. By contrast, the new importance measure yields a selection of variables that reflects the actual data situation, i.e. one that takes the occurrence of missing values into account. Its prediction error was only negligibly worse than that of imputation.
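    The permutation importance this line of work relies on can be sketched in a few lines: impute the missing entries, fix a model, then measure how much predictive accuracy drops when one column is shuffled. The sketch below is a minimal, stdlib-only illustration, not the authors' implementation; the single mean imputation stands in for proper multiple imputation, which would repeat the procedure over several randomly imputed copies and average the resulting importances. All data and the stand-in model are made up for illustration.

    ```python
    import random

    def mean_impute(rows):
        """Replace None entries by the column mean (a crude stand-in for
        multiple imputation, which would draw imputed values at random
        several times instead of plugging in one mean)."""
        cols = list(zip(*rows))
        means = []
        for col in cols:
            observed = [v for v in col if v is not None]
            means.append(sum(observed) / len(observed))
        return [[v if v is not None else means[j] for j, v in enumerate(row)]
                for row in rows]

    def accuracy(model, X, y):
        return sum(model(x) == yi for x, yi in zip(X, y)) / len(y)

    def permutation_importance(model, X, y, col, seed=0):
        """Importance of one column = drop in accuracy after shuffling it."""
        base = accuracy(model, X, y)
        rng = random.Random(seed)
        shuffled = [row[:] for row in X]
        vals = [row[col] for row in shuffled]
        rng.shuffle(vals)
        for row, v in zip(shuffled, vals):
            row[col] = v
        return base - accuracy(model, shuffled, y)

    # Toy data: column 0 determines the label, column 1 is noise with missings.
    X = [[0, None], [0, 1], [1, 0], [1, None], [0, 0], [1, 1]]
    y = [0, 0, 1, 1, 0, 1]
    X_imp = mean_impute(X)
    model = lambda x: int(x[0] > 0.5)   # stand-in for a fitted forest
    print(permutation_importance(model, X_imp, y, col=0))
    print(permutation_importance(model, X_imp, y, col=1))  # 0.0: shuffling noise changes nothing
    ```

    Shuffling the noise column leaves the model's predictions unchanged, so its importance is exactly 0.0; the relevant column's importance is the (seed-dependent) accuracy drop.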

    Random Forest variable importance with missing data

    Random Forests are commonly applied for data prediction and interpretation. The latter purpose is supported by variable importance measures that rate the relevance of predictors. Yet existing measures cannot be computed when the data contain missing values. Possible solutions are given by imputation methods, complete case analysis and a newly suggested importance measure. However, it is unknown to what extent these approaches are able to provide a reliable estimate of a variable's relevance. An extensive simulation study was performed to investigate this property for a variety of missing-data generating processes. Findings and recommendations: complete case analysis should not be applied, as it inappropriately penalizes variables that are completely observed. The new importance measure is much better able to reflect decreased information exclusively for variables with missing values and should therefore be used to evaluate actual data situations. By contrast, multiple imputation allows for an estimation of the importances one would potentially observe in complete-data situations.
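    Why complete case analysis penalizes fully observed variables is easy to see: every row with a missing value anywhere is dropped, so even a variable that was never missing loses the corresponding share of its information. A tiny illustration on made-up data (not from the study):

    ```python
    def complete_cases(rows):
        """Keep only rows with no missing (None) entries: complete case analysis."""
        return [r for r in rows if all(v is not None for v in r)]

    # x1 (column 0) is fully observed; x2 (column 1) has 40% missing values.
    rows = [
        [1.0, 2.0], [2.0, None], [3.0, 1.5], [4.0, None], [5.0, 0.5],
    ]
    kept = complete_cases(rows)
    print(len(kept))  # 3 -- x1 loses 40% of its observations
                      # even though it was never missing itself
    ```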

    Analysis of missing data with random forests

    Random Forests are widely used for data prediction and interpretation purposes. They show many appealing characteristics, such as the ability to deal with high-dimensional data, complex interactions and correlations. Furthermore, missing values can easily be processed by the built-in procedure of surrogate splits. However, there is little knowledge about the properties of recursive partitioning in missing-data situations. Therefore, extensive simulation studies and empirical evaluations have been conducted to gain deeper insight. In addition, new methods have been developed to enhance methodology and solve current issues of data interpretation, prediction and variable selection.

    A variable's relevance in a Random Forest can be assessed by means of importance measures. Unfortunately, existing methods cannot be applied when the data contain missing values. Thus, one of the most appreciated properties of Random Forests – its ability to handle missing values – gets lost for the computation of such measures. This work presents a new approach that is designed to deal with missing values in an intuitive and straightforward way, yet retains widely appreciated qualities of existing methods. Results indicate that it meets sensible requirements and shows good variable ranking properties.

    Random Forests provide variable selection that is usually based on importance measures. An extensive review of the corresponding literature led to the development of a new approach that is based on a profound theoretical framework and meets important statistical properties. A comparison to eight other popular methods showed that it controls the test-wise and family-wise error rate, provides a higher power to distinguish relevant from non-relevant variables and leads to models located among the best-performing ones.

    Alternative ways to handle missing values are the application of imputation methods and complete case analysis. Yet it is unknown to what extent these approaches are able to provide sensible variable rankings and meaningful variable selections. Investigations showed that complete case analysis leads to inaccurate variable selection, as it may inappropriately penalize the importance of fully observed variables. By contrast, the new importance measure decreases for variables with missing values and therefore causes selections that accurately reflect the information given in actual data situations. Multiple imputation leads to an assessment of a variable's importance and to selection frequencies that would be expected for data that was completely observed. In several performance evaluations the best prediction accuracy emerged from multiple imputation, closely followed by the application of surrogate splits. Complete case analysis clearly performed worst.
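    The surrogate splits mentioned above work roughly as follows: at each node, candidate surrogate splits are ranked by how often they send observations to the same side as the primary split, and an observation whose primary splitting variable is missing is routed by the best-agreeing surrogate instead. Below is a simplified, stdlib-only sketch in the spirit of rpart-style surrogates; the column indices, thresholds and data are invented for illustration.

    ```python
    def agreement(primary, surrogate, rows):
        """Fraction of rows on which a candidate surrogate split sends
        observations to the same side as the primary split."""
        usable = [r for r in rows if r[primary[0]] is not None]
        same = sum((r[primary[0]] <= primary[1]) == (r[surrogate[0]] <= surrogate[1])
                   for r in usable)
        return same / len(usable)

    def route(row, primary, surrogates):
        """Send a row left/right by the primary split; if its value is
        missing, fall back to the best-agreeing surrogate."""
        for col, thr in [primary] + surrogates:
            if row[col] is not None:
                return "left" if row[col] <= thr else "right"
        return "left"  # last resort: a majority-direction default

    primary = (0, 2.5)                  # primary split: column 0 at 2.5
    candidates = [(1, 10.0), (2, 0.0)]  # candidate surrogate splits
    rows = [[1.0, 8.0, -1.0], [2.0, 9.5, 0.5], [3.0, 11.0, 0.5], [4.0, 12.0, 1.0]]
    surrogates = sorted(candidates, key=lambda s: -agreement(primary, s, rows))
    print(route([None, 9.0, 0.2], primary, surrogates))  # "left", via surrogate (1, 10.0)
    ```

    Because the surrogate on column 1 agrees with the primary split on every training row, it is ranked first and routes the observation with the missing primary value.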

    Party on!


    An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests

    Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Random forests in particular, which can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine and bioinformatics within the past few years. High-dimensional problems are common not only in genetics, but also in some areas of psychological research, where only few subjects can be measured due to time or cost constraints, yet a large amount of data is generated for each subject. Random forests have been shown to achieve a high prediction accuracy in such applications, and provide descriptive variable importance measures reflecting the impact of each variable in both main effects and interactions. The aim of this work is to introduce the principles of the standard recursive partitioning methods as well as recent methodological improvements, to illustrate their usage for low- and high-dimensional data exploration, but also to point out limitations of the methods and potential pitfalls in their practical application. Application of the methods is illustrated using freely available implementations in the R system for statistical computing.
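    The core step of classification and regression trees, which bagging and random forests build on, is an exhaustive search for the split that most reduces node impurity. The abstract points to implementations in R; the minimal Gini-based Python sketch below only illustrates the principle on made-up data and is not any package's actual algorithm.

    ```python
    def gini(labels):
        """Gini impurity of a set of class labels."""
        n = len(labels)
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    def best_split(X, y):
        """Exhaustively search for the (feature, threshold) pair that most
        reduces weighted Gini impurity: the core step of a CART-style tree."""
        best = (None, None, gini(y))
        for j in range(len(X[0])):
            for thr in sorted({row[j] for row in X}):
                left  = [yi for row, yi in zip(X, y) if row[j] <= thr]
                right = [yi for row, yi in zip(X, y) if row[j] > thr]
                if not left or not right:
                    continue
                score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
                if score < best[2]:
                    best = (j, thr, score)
        return best

    # Feature 0 separates the classes perfectly; feature 1 is uninformative.
    X = [[2.0, 1.0], [3.0, 9.0], [7.0, 2.0], [8.0, 8.0]]
    y = [0, 0, 1, 1]
    print(best_split(X, y))  # (0, 3.0, 0.0): split on feature 0, impurity drops to zero
    ```

    A full tree applies this search recursively to the left and right subsets until a stopping rule is met; a random forest grows many such trees on bootstrap samples, restricting each split search to a random subset of the features.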

    Using Random Forests to Describe Equity in Higher Education: A Critical Quantitative Analysis of Utah's Postsecondary Pipelines

    The following work examines the Random Forest (RF) algorithm as a tool for predicting student outcomes and interrogating the equity of postsecondary education pipelines. The RF model, created using longitudinal data of 41,303 students from Utah's 2008 high school graduation cohort, is compared to logistic and linear models, which are commonly used to predict college access and success. Substantively, this work finds High School GPA to be the best predictor of postsecondary GPA, whereas commonly used ACT and AP test scores are not nearly as important. Each model identified several demographic disparities in higher education access, most significantly the effects of individual-level economic disadvantage. District- and school-level factors such as the proportion of Low Income students and the proportion of Underrepresented Racial Minority (URM) students were important and negatively associated with postsecondary success. Methodologically, the RF model was able to capture non-linearity in the predictive power of school- and district-level variables, a key finding which was undetectable using linear models. The RF algorithm outperforms logistic models in the prediction of student enrollment, performs similarly to linear models in the prediction of postsecondary GPA, and outperforms both in its descriptions of non-linear variable relationships. RF provides novel interpretations of data, challenges conclusions from linear models, and has enormous potential to further the literature around equity in postsecondary pipelines.

    Identifying features predictive of faculty integrating computation into physics courses

    Computation is a central aspect of 21st century physics practice; it is used to model complicated systems, to simulate impossible experiments, and to analyze mountains of data. Physics departments and their faculty are increasingly recognizing the importance of teaching computation to their students. We recently completed a national survey of faculty in physics departments to understand the state of computational instruction and the factors that underlie that instruction. The data collected from the faculty responding to the survey included a variety of scales, binary questions, and numerical responses. We then used Random Forest, a supervised learning technique, to explore the factors that are most predictive of whether a faculty member decides to include computation in their physics courses. We find that experience using computation with students in their research (or the lack thereof) and various personal beliefs are most predictive of a faculty member having experience teaching computation. Interestingly, we find demographic and departmental factors to be less useful predictors in our model. The results of this study inform future efforts to promote greater integration of computation into the physics curriculum, and comment on the current state of computational instruction across the United States.