
    Overview of Random Forest Methodology and Practical Guidance with Emphasis on Computational Biology and Bioinformatics

    The Random Forest (RF) algorithm by Leo Breiman has become a standard data analysis tool in bioinformatics. It has shown excellent performance in settings where the number of variables is much larger than the number of observations, can cope with complex interaction structures as well as highly correlated variables, and returns measures of variable importance. This paper synthesizes ten years of RF development with emphasis on applications to bioinformatics and computational biology. Special attention is given to practical aspects such as the selection of parameters, available RF implementations, and important pitfalls and biases of RF and its variable importance measures (VIMs). The paper surveys recent developments of the methodology relevant to bioinformatics as well as some representative examples of RF applications in this context and possible directions for future research.

    Proportional Odds Models with High-dimensional Data Structure

    The proportional odds model (POM) is the most widely used model when the response has ordered categories. With a high-dimensional predictor structure, the common maximum likelihood approach typically fails when all predictors are included. A boosting technique, pomBoost, is proposed that fits the model by implicitly selecting the influential predictors. The approach distinguishes between metric and categorical predictors. For categorical predictors, where each predictor relates to a set of parameters, the objective is to select all the associated parameters simultaneously. In addition, the approach distinguishes between nominal and ordinal predictors. For ordinal predictors, the proposed technique uses their ordering by penalizing the difference between the parameters of adjacent categories. The technique also has a provision for mandatory predictors (if any), which must be part of the final sparse model. The performance of the proposed boosting algorithm is evaluated in a simulation study and in applications with respect to mean squared error and prediction error. Hit rates and false alarm rates are used to judge the performance of pomBoost in selecting the relevant predictors.
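As a rough illustration of the model and the ordinal penalty described in this abstract, the cumulative-logit form of the POM and a penalty on adjacent-category differences can be written as follows. The notation ($\gamma_{0r}$, $\gamma$, $\beta_j$, $m$) and the squared form of the penalty are assumptions made for this sketch; the abstract does not state the exact parameterization or penalty used by pomBoost.

```latex
% Proportional odds model for a response Y with ordered categories 1, ..., k
% (assumed notation: category-specific intercepts \gamma_{0r}, shared slope vector \gamma)
P(Y \le r \mid x) = \frac{\exp(\gamma_{0r} + x^{\top}\gamma)}{1 + \exp(\gamma_{0r} + x^{\top}\gamma)},
\qquad r = 1, \dots, k-1 .

% One possible penalty for an ordinal predictor with categories 1, ..., m
% and category parameters \beta_1, ..., \beta_m (squared differences are an assumption)
J(\beta) = \sum_{j=2}^{m} \left( \beta_j - \beta_{j-1} \right)^2 .
```

Under a penalty of this kind, neighbouring categories of an ordinal predictor are pulled toward similar effects, which is one way of using the ordering information the abstract refers to.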

    Creating and Testing Specialized Dictionaries for Text Analysis

    Practitioners in many domains (e.g., clinical psychologists, college instructors, researchers) collect written responses from clients. A well-developed method that has been applied to texts from sources like these is the computer application Linguistic Inquiry and Word Count (LIWC). LIWC uses the words in texts as cues to a person’s thought processes, emotional states, intentions, and motivations. In the present study, we adopt analytic principles from LIWC and develop and test an alternative method of text analysis using naïve Bayes methods. We further show how output from the naïve Bayes analysis can be used to mark up student work in order to provide immediate, constructive feedback to students and instructors.
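The naïve Bayes approach mentioned in this abstract can be sketched in a few lines. This is a generic multinomial naïve Bayes classifier with add-one smoothing, not the authors' implementation; the function names (`train_naive_bayes`, `classify`, `word_evidence`), the whitespace tokenizer, and the toy documents are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Count words per class; alpha is the add-alpha smoothing constant (assumed 1.0)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha}


def _log_likelihood(model, token, label):
    """Smoothed log P(token | label)."""
    counts = model["word_counts"][label]
    denom = sum(counts.values()) + model["alpha"] * len(model["vocab"])
    return math.log((counts[token] + model["alpha"]) / denom)


def classify(model, text):
    """Return the most probable label and the per-label log scores."""
    total = sum(model["class_counts"].values())
    scores = {}
    for label, n in model["class_counts"].items():
        score = math.log(n / total)  # log prior
        for token in text.lower().split():
            if token in model["vocab"]:  # ignore words never seen in training
                score += _log_likelihood(model, token, label)
        scores[label] = score
    return max(scores, key=scores.get), scores


def word_evidence(model, text, label_a, label_b):
    """Per-word log-likelihood ratio (label_a vs. label_b), usable for marking up a text."""
    return [(token, _log_likelihood(model, token, label_a)
             - _log_likelihood(model, token, label_b))
            for token in text.lower().split() if token in model["vocab"]]


# Toy training data, invented for this sketch.
docs = ["i feel happy and hopeful", "happy calm and grateful today",
        "i feel sad and tired", "tired anxious and sad again"]
labels = ["positive", "positive", "negative", "negative"]
model = train_naive_bayes(docs, labels)
```

The per-word ratios returned by `word_evidence` are one simple way such output could drive mark-up: words with a large positive or negative ratio are the ones that most influenced the classification.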

    Variable Selection for Nonparametric Gaussian Process Priors: Models and Computational Strategies

    This paper presents a unified treatment of Gaussian process models that extends to data from the exponential dispersion family and to survival data. Our specific interest is in the analysis of data sets with predictors that have an a priori unknown form of possibly nonlinear associations to the response. The modeling approach we describe incorporates Gaussian processes in a generalized linear model framework to obtain a class of nonparametric regression models where the covariance matrix depends on the predictors. We consider, in particular, continuous, categorical and count responses. We also look into models that account for survival outcomes. We explore alternative covariance formulations for the Gaussian process prior and demonstrate the flexibility of the construction. Next, we focus on the important problem of selecting variables from the set of possible predictors and describe a general framework that employs mixture priors. We compare alternative MCMC strategies for posterior inference and achieve a computationally efficient and practical approach. We demonstrate performance on simulated and benchmark data sets. Comment: Published at http://dx.doi.org/10.1214/11-STS354 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org).