Overview of Random Forest Methodology and Practical Guidance with Emphasis on Computational Biology and Bioinformatics
The Random Forest (RF) algorithm by Leo Breiman has become a standard data analysis tool in bioinformatics. It has shown excellent performance in settings where the number of variables is much larger than the number of observations, it copes with complex interaction structures as well as highly correlated variables, and it returns measures of variable importance. This paper synthesizes ten years of RF development with emphasis on applications to bioinformatics and computational biology. Special attention is given to practical aspects such as the selection of parameters, available RF implementations, and important pitfalls and biases of RF and its variable importance measures (VIMs). The paper surveys recent developments of the methodology relevant to bioinformatics, some representative examples of RF applications in this context, and possible directions for future research.
Proportional Odds Models with High-dimensional Data Structure
The proportional odds model (POM) is the most widely used model when the response has ordered categories. With a high-dimensional predictor structure, the common maximum likelihood approach typically fails when all predictors are included. A boosting technique, pomBoost, is proposed that fits the model by implicitly selecting the influential predictors. The approach distinguishes between metric and categorical predictors. In the case of categorical predictors, where each predictor relates to a set of parameters, the objective is to select all the associated parameters simultaneously. In addition, the approach distinguishes between nominal and ordinal predictors. In the case of ordinal predictors, the proposed technique uses the ordering of the categories by penalizing the difference between the parameters of adjacent categories. The technique also has a provision for mandatory predictors (if any), which must be part of the final sparse model. The performance of the proposed boosting algorithm is evaluated in a simulation study and in applications with respect to mean squared error and prediction error. Hit rates and false alarm rates are used to judge the performance of pomBoost in selecting the relevant predictors.
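As an illustration of the model class named in this abstract, the sketch below computes category probabilities under a proportional odds model and a penalty on differences between adjacent category parameters. It is not pomBoost. It assumes the common parametrisation P(Y <= r | x) = logistic(theta_r - x'beta) (sign conventions vary between texts) and a squared-difference form for the penalty, which the abstract does not specify. All function names and numeric values are hypothetical.

```python
import math


def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def pom_category_probs(thresholds, beta, x):
    """Category probabilities under a proportional odds model.

    Assumed parametrisation: P(Y <= r | x) = logistic(theta_r - x'beta),
    with increasing thresholds theta_1 < ... < theta_{k-1}.
    """
    eta = sum(b * xi for b, xi in zip(beta, x))
    # Cumulative probabilities; the last category always has P(Y <= k) = 1.
    cumulative = [logistic(t - eta) for t in thresholds] + [1.0]
    probs = [cumulative[0]]
    for r in range(1, len(cumulative)):
        probs.append(cumulative[r] - cumulative[r - 1])
    return probs


def adjacent_difference_penalty(params):
    """Sum of squared differences between neighbouring category parameters.

    The squared form is an assumption made for this sketch.
    """
    return sum((b - a) ** 2 for a, b in zip(params, params[1:]))


# Hypothetical values, chosen only to exercise the functions.
thresholds = [-1.0, 0.5, 2.0]   # three thresholds, so four ordered categories
beta = [0.8, -0.3]
probs = pom_category_probs(thresholds, beta, [1.0, 2.0])
penalty = adjacent_difference_penalty([0.0, 1.0, 3.0])
```

A penalty of this kind is small when neighbouring categories of an ordinal predictor have similar parameters, which is one way to make use of the ordering.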
Creating and Testing Specialized Dictionaries for Text Analysis
Practitioners in many domains (e.g., clinical psychologists, college instructors,
researchers) collect written responses from clients. A well-developed method that has been applied
to texts from sources like these is the computer application Linguistic Inquiry and Word Count
(LIWC). LIWC uses the words in texts as cues to a person’s thought processes, emotional states,
intentions, and motivations. In the present study, we adopt analytic principles from LIWC and
develop and test an alternative method of text analysis using naïve Bayes methods. We further
show how output from the naïve Bayes analysis can be used for mark up of student work in order
to provide immediate, constructive feedback to students and instructors.
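For readers unfamiliar with naïve Bayes text analysis, the following is a minimal sketch of a multinomial naïve Bayes classifier over word counts with add-one smoothing. It is not the authors' implementation. The function names, the whitespace tokenizer, the labels and the toy training texts are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def train_naive_bayes(labelled_texts):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_texts:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log posterior score."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data; real applications need far more text per class.
training = [
    ("i feel happy and hopeful", "positive"),
    ("this was a great and happy day", "positive"),
    ("i feel sad and worried", "negative"),
    ("this was a terrible and sad day", "negative"),
]
model = train_naive_bayes(training)
predicted = classify("a happy hopeful day", model)
```

The per-word log ratios between classes are what make such a model usable for markup: words that contribute strongly to one class can be highlighted in the text.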
Variable Selection for Nonparametric Gaussian Process Priors: Models and Computational Strategies
This paper presents a unified treatment of Gaussian process models that
extends to data from the exponential dispersion family and to survival data.
Our specific interest is in the analysis of data sets with predictors that have
an a priori unknown form of possibly nonlinear associations to the response.
The modeling approach we describe incorporates Gaussian processes in a
generalized linear model framework to obtain a class of nonparametric
regression models where the covariance matrix depends on the predictors. We
consider, in particular, continuous, categorical and count responses. We also
look into models that account for survival outcomes. We explore alternative
covariance formulations for the Gaussian process prior and demonstrate the
flexibility of the construction. Next, we focus on the important problem of
selecting variables from the set of possible predictors and describe a general
framework that employs mixture priors. We compare alternative MCMC strategies
for posterior inference and achieve a computationally efficient and practical
approach. We demonstrate performance on simulated and benchmark data sets.
Comment: Published in Statistical Science (http://www.imstat.org/sts/) at
http://dx.doi.org/10.1214/11-STS354 by the Institute of Mathematical
Statistics (http://www.imstat.org
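To make concrete the idea of a covariance matrix that depends on the predictors, here is a minimal sketch using a squared-exponential kernel with one non-negative weight per predictor. This is a generic construction, not necessarily the parametrisation used in the paper. The function name, the weights and the data are hypothetical. Under this assumed form, a weight of zero removes a predictor from the covariance entirely, which gives an intuition for how selection at the level of covariance parameters can work.

```python
import math


def covariance_matrix(X, weights):
    """Squared-exponential covariance with one weight per predictor.

    C[i][j] = exp(-sum_k weights[k] * (X[i][k] - X[j][k]) ** 2)
    A weight of 0 means predictor k has no influence on the covariance.
    """
    n = len(X)
    return [
        [
            math.exp(-sum(w * (a - b) ** 2
                          for w, a, b in zip(weights, X[i], X[j])))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical data: two observations, two predictors.
X = [[0.0, 5.0], [1.0, -3.0]]

# Second predictor switched off: only the first column matters.
C_selected = covariance_matrix(X, [1.0, 0.0])

# Both predictors active: the large gap in column two drives C[0][1] toward 0.
C_full = covariance_matrix(X, [1.0, 1.0])
```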