218 research outputs found

    On the overestimation of random forest’s out-of-bag error


    Resampling approaches in biometrical applications


    Characteristics of WhatsApp Interactions of Undergraduate Students

    The aim of this study is to analyze the characteristics of students' interactions when using WhatsApp, their perception of the tool, the kinds of academic activities they undertake with it, and the interaction patterns in these chat conversations. The study was carried out through a qualitative case study research design involving nineteen BA in Foreign Languages students at a public university on the northern coast of Colombia, with ethnographic data collection methods that included an interview, a questionnaire and document analysis of their WhatsApp chats. The analysis revealed that bringing mobile devices into the classroom appears to be a promising way to respond to how the educational system is evolving: the application helps both teachers and students facilitate the teaching and learning process even outside the classroom, and it turns out to be an innovative trend that leads to better outcomes for both. The analysis also revealed that the use of a WhatsApp chat group was a step toward combining technology with an innovative methodology aimed at enriching students' learning. Contextual factors were also analyzed, since they are an essential part of the analysis.

    Overview of Random Forest Methodology and Practical Guidance with Emphasis on Computational Biology and Bioinformatics

    The Random Forest (RF) algorithm by Leo Breiman has become a standard data analysis tool in bioinformatics. It has shown excellent performance in settings where the number of variables is much larger than the number of observations, can cope with complex interaction structures as well as highly correlated variables, and returns measures of variable importance. This paper synthesizes ten years of RF development with emphasis on applications to bioinformatics and computational biology. Special attention is given to practical aspects such as the selection of parameters, available RF implementations, and important pitfalls and biases of RF and its variable importance measures (VIMs). The paper surveys recent developments of the methodology relevant to bioinformatics, some representative examples of RF applications in this context, and possible directions for future research.
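
    The workflow the abstract describes can be illustrated with a minimal sketch (not from the paper): fitting a random forest in a p >> n setting and reading off its variable importance measures. This uses scikit-learn rather than the R implementations the paper surveys, and all data and parameter values here are synthetic and illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# p >> n: 50 observations, 500 candidate predictors, only 5 informative
X, y = make_classification(n_samples=50, n_features=500, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)

# Gini-based variable importance measure (VIM); the paper discusses
# the biases of this kind of measure
vim = rf.feature_importances_
top = np.argsort(vim)[::-1][:10]
print("OOB accuracy:", round(rf.oob_score_, 3))
print("Top-ranked predictors:", top)
```

    The out-of-bag (OOB) accuracy shown here is the internal error estimate whose potential overestimation is the subject of the first output in this listing.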

    Prediction Models for Time Discrete Competing Risks

    The classical approach to modeling discrete-time competing risks consists of fitting multinomial logit models whose parameters are estimated using maximum likelihood theory. Since the effects of covariates are specific to the target events, the resulting models contain a large number of parameters, even if there are only a few predictor variables. Because of the large number of parameters, classical maximum likelihood estimates tend to deteriorate or may not even exist. Regularization techniques can be used to overcome these problems. This article explores the use of two different regularization techniques, namely penalized likelihood estimation methods and random forests, for modeling discrete-time competing risks, using both extensive simulation studies and studies on real data. The simulation results as well as the application to three real-world data sets show that the novel approaches perform very well and distinctly outperform the classical (unpenalized) maximum likelihood approach.
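
    The penalized-likelihood idea from the abstract can be sketched as follows: a multinomial logit with an L2 (ridge) penalty that keeps the many event-specific coefficients estimable. This is a hedged illustration, not the authors' implementation; the data are synthetic, with three outcome categories standing in for "no event" and two competing event types.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.normal(size=(n, p))
# three outcome categories: 0 = no event, 1 and 2 = competing event types
logits = X @ rng.normal(size=(p, 3))
y = np.argmax(logits + rng.gumbel(size=(n, 3)), axis=1)

# C = 1/lambda: smaller C means stronger shrinkage of the
# event-specific coefficient vectors toward zero
model = LogisticRegression(penalty="l2", C=0.5, max_iter=1000)
model.fit(X, y)
# one row of coefficients per event category, one column per predictor
print("coefficient matrix shape:", model.coef_.shape)
```

    Note that the event-specific coefficient matrix grows with both the number of predictors and the number of target events, which is exactly why unpenalized maximum likelihood deteriorates in this setting.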


    Random Forests for Ordinal Response Data: Prediction and Variable Selection

    The random forest method is a commonly used tool for classification with high-dimensional data that is able to rank candidate predictors through its inbuilt variable importance measures (VIMs). It can be applied to various kinds of regression problems, including nominal, metric and survival response variables. While classification and regression problems using random forest methodology have been extensively investigated in the past, there is a lack of literature on handling ordinal regression problems, that is, problems in which the response categories have an inherent ordering. Breiman's classical random forest version ignores the ordering in the levels and fits standard classification trees; alternatively, if the response is treated as a metric variable, regression trees are used, which are, however, not appropriate for ordinal response data. Further compounding the difficulties, the currently existing VIMs for nominal or metric responses have not proven to be appropriate for ordinal responses. The random forest version of Hothorn et al. uses a permutation test framework that is applicable to problems where both predictors and response are measured on arbitrary scales, and it is therefore a promising tool for handling ordinal regression problems. However, for this random forest version there is likewise no specific VIM for ordinal response variables, and the appropriateness of the error-rate-based VIM computed by default in the case of ordinal responses has to date not been investigated in the literature. We performed simulation studies using random forests based on conditional inference trees to explore whether incorporating the ordering information yields any improvement in prediction performance or variable selection. We present two novel permutation VIMs that are reasonable alternatives to the currently implemented VIM, which was developed for nominal responses and makes no use of the ordering in the levels of an ordinal response variable.
    Results based on simulated and real data suggest that predictor rankings can be improved by using our new permutation VIMs, which explicitly use the ordering in the response levels, in combination with the ordinal regression trees suggested by Hothorn et al. With respect to prediction accuracy, in our studies the performance of ordinal regression trees was similar to, and in most settings even slightly better than, that of classification trees. An explanation for the better performance is that in ordinal regression trees there is a higher probability of selecting relevant variables for a split. The code implementing our studies and our novel permutation VIMs for the statistical software R is available at http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html
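
    The general permutation-VIM idea discussed above can be sketched in a few lines: permute one predictor at a time and measure the resulting drop in out-of-sample performance. This uses scikit-learn's generic permutation importance on synthetic data; it is not the authors' ordinal-specific VIMs, which are implemented in R at the URL given in the abstract.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# synthetic data: 8 predictors, of which only the first 3 are informative
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# permute each predictor 20 times and record the mean accuracy loss
result = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=0)
print(np.round(result.importances_mean, 3))
```

    The authors' contribution replaces the error rate used in this generic scheme with ordinal-aware performance measures, so that permuting a predictor is penalized more when predictions land far from the true category in the response ordering.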

    Categorical variables with many categories are preferentially selected in model selection procedures for multivariable regression models on bootstrap samples

    To perform model selection in the context of multivariable regression, automated variable selection procedures such as backward elimination are commonly employed. However, these procedures are known to be highly unstable. Their stability can be investigated using bootstrap-based procedures: the idea is to perform model selection on a large number of bootstrap samples successively and to examine the resulting models, for instance in terms of the inclusion of specific predictor variables. However, such bootstrap-based procedures are known from the literature to yield misleading results in some cases. In this paper we thoroughly investigate a particularly important facet of these problems. More precisely, we assess the behaviour of regression models (with automated variable selection based on the likelihood ratio test) fitted on bootstrap samples drawn with replacement and on subsamples drawn without replacement, with respect to the number and type of included predictor variables. Our study includes both extensive simulations and a real data example from the NHANES study. The results indicate that models derived from bootstrap samples include more predictor variables than models fitted on the original samples, and that categorical predictor variables with many categories are preferentially selected over categorical predictor variables with fewer categories and over metric predictor variables. We conclude that using bootstrap samples to select variables for multivariable regression models may lead to overly complex models with a preferential selection of categorical predictor variables with many categories. We suggest the use of subsamples instead of bootstrap samples to bypass these drawbacks.
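
    The bootstrap inclusion-frequency idea from the abstract can be sketched as a toy example: repeat variable selection on resampled data and count how often each predictor is included, once with bootstrap samples (with replacement) and once with subsamples (without replacement). For brevity, a simple nonzero-coefficient rule on a lasso fit stands in for the likelihood-ratio backward elimination studied in the paper; the resampling logic is the point here, and all data and tuning values are synthetic.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 150, 10
X = rng.normal(size=(n, p))
y = X[:, 0] * 2.0 + rng.normal(size=n)  # only the first predictor matters

def inclusion_frequencies(replacement, n_rep=100):
    """Fraction of resampled fits in which each predictor is selected."""
    counts = np.zeros(p)
    m = n if replacement else n // 2  # subsamples of size n/2, no replacement
    for _ in range(n_rep):
        idx = rng.choice(n, size=m, replace=replacement)
        coef = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
        counts += np.abs(coef) > 1e-8  # "selected" = nonzero coefficient
    return counts / n_rep

print("bootstrap :", inclusion_frequencies(True))
print("subsample :", inclusion_frequencies(False))
```

    Inspecting the two frequency vectors side by side is the kind of comparison the paper formalizes; its finding is that bootstrap samples tend to inflate inclusion frequencies, especially for categorical predictors with many categories.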