17,032 research outputs found
MissForest - nonparametric missing value imputation for mixed-type data
Modern data acquisition based on high-throughput technology is often facing
the problem of missing data. Algorithms commonly used in the analysis of such
large-scale data often depend on a complete set. Missing value imputation
offers a solution to this problem. However, the majority of available
imputation methods are restricted to one type of variable only: continuous or
categorical. For mixed-type data the different types are usually handled
separately. Therefore, these methods ignore possible relations between variable
types. We propose a nonparametric method which can cope with different types of
variables simultaneously. We compare several state of the art methods for the
imputation of missing values. We propose and evaluate an iterative imputation
method (missForest) based on a random forest. By averaging over many unpruned
classification or regression trees random forest intrinsically constitutes a
multiple imputation scheme. Using the built-in out-of-bag error estimates of
random forest we are able to estimate the imputation error without the need of
a test set. Evaluation is performed on multiple data sets coming from a diverse
selection of biological fields with artificially introduced missing values
ranging from 10% to 30%. We show that missForest can successfully handle
missing values, particularly in data sets including different types of
variables. In our comparative study missForest outperforms other methods of
imputation especially in data settings where complex interactions and nonlinear
relations are suspected. The out-of-bag imputation error estimates of
missForest prove to be adequate in all settings. Additionally, missForest
exhibits attractive computational efficiency and can cope with high-dimensional
data.Comment: Submitted to Oxford Journal's Bioinformatics on 3rd of May 201
On Weight Matrix and Free Energy Models for Sequence Motif Detection
The problem of motif detection can be formulated as the construction of a
discriminant function to separate sequences of a specific pattern from
background. In computational biology, motif detection is used to predict DNA
binding sites of a transcription factor (TF), mostly based on the weight matrix
(WM) model or the Gibbs free energy (FE) model. However, despite the wide
applications, theoretical analysis of these two models and their predictions is
still lacking. We derive asymptotic error rates of prediction procedures based
on these models under different data generation assumptions. This allows a
theoretical comparison between the WM-based and the FE-based predictions in
terms of asymptotic efficiency. Applications of the theoretical results are
demonstrated with empirical studies on ChIP-seq data and protein binding
microarray data. We find that, irrespective of underlying data generation
mechanisms, the FE approach shows higher or comparable predictive power
relative to the WM approach when the number of observed binding sites used for
constructing a discriminant decision is not too small.Comment: 23 pages, 1 figure and 4 table
An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests
Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Especially random forests, that can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine and bioinformatics within the past few years.
High dimensional problems are common not only in genetics, but also in some areas of psychological research, where only few subjects can be measured due to time or cost constraints, yet a large amount of data is generated for each subject. Random forests have been shown to achieve a high prediction accuracy in such applications, and provide descriptive variable importance measures reflecting the impact of each variable in both main effects and interactions.
The aim of this work is to introduce the principles of the standard recursive partitioning methods as well as recent methodological improvements, to illustrate their usage for low and high dimensional data exploration, but also to point out limitations of the methods and potential pitfalls in their practical application.
Application of the methods is illustrated using freely available implementations in the R system for statistical computing
Gaussian process models for periodicity detection
We consider the problem of detecting and quantifying the periodic component
of a function given noise-corrupted observations of a limited number of
input/output tuples. Our approach is based on Gaussian process regression which
provides a flexible non-parametric framework for modelling periodic data. We
introduce a novel decomposition of the covariance function as the sum of
periodic and aperiodic kernels. This decomposition allows for the creation of
sub-models which capture the periodic nature of the signal and its complement.
To quantify the periodicity of the signal, we derive a periodicity ratio which
reflects the uncertainty in the fitted sub-models. Although the method can be
applied to many kernels, we give a special emphasis to the Mat\'ern family,
from the expression of the reproducing kernel Hilbert space inner product to
the implementation of the associated periodic kernels in a Gaussian process
toolkit. The proposed method is illustrated by considering the detection of
periodically expressed genes in the arabidopsis genome.Comment: in PeerJ Computer Science, 201
- …