Massively-Parallel Feature Selection for Big Data
We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for
feature selection (FS) in Big Data settings (high dimensionality and/or sample
size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix
both in terms of rows (samples, training examples) as well as columns
(features). By employing the concepts of p-values of conditional independence
tests and meta-analysis techniques, PFBP manages to rely only on computations
local to a partition while minimizing communication costs. Then, it employs
powerful and safe (asymptotically sound) heuristics to make early, approximate
decisions, such as Early Dropping of features from consideration in subsequent
iterations, Early Stopping of consideration of features within the same
iteration, or Early Return of the winner in each iteration. PFBP provides
asymptotic guarantees of optimality for data distributions faithfully
representable by a causal network (Bayesian network or maximal ancestral
graph). Our empirical analysis confirms a super-linear speedup of the algorithm
with increasing sample size, linear scalability with respect to the number of
features and processing cores, while dominating other competitive algorithms in
its class.
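The idea of combining p-values computed locally on separate row partitions can be illustrated with Fisher's combined probability test, a standard meta-analysis technique. The abstract does not say which combination rule PFBP uses, so the choice of Fisher's method here is an assumption, and the function names `chi2_sf_even_df` and `fisher_combine` are hypothetical. A minimal sketch using only the Python standard library:

```python
import math


def chi2_sf_even_df(x, df):
    """Survival function P(X >= x) of a chi-square variable with even df.

    For df = 2k the survival function has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!, so no SciPy is needed.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    k = df // 2
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic -2 * sum(log p_i) follows a chi-square distribution with
    2 * len(p_values) degrees of freedom under the null hypothesis.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    # Clamp to avoid log(0) when a partition reports p == 0.
    stat = -2.0 * sum(math.log(max(p, 1e-300)) for p in p_values)
    return chi2_sf_even_df(stat, 2 * len(p_values))


# Hypothetical p-values for one feature, one per row partition.
local_p_values = [0.04, 0.10, 0.03]
combined = fisher_combine(local_p_values)
```

Each partition only needs to send one number per feature, which is the sense in which the computation stays local while communication stays small.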
Improving the Efficiency of Genomic Selection
We investigate two approaches to increase the efficiency of phenotypic
prediction from genome-wide markers, which is a key step for genomic selection
(GS) in plant and animal breeding. The first approach is feature selection
based on Markov blankets, which provide a theoretically-sound framework for
identifying non-informative markers. Fitting GS models using only the
informative markers results in simpler models, which may allow cost savings
from reduced genotyping. We show that this is accompanied by no loss, and
possibly a small gain, in predictive power for four GS models: partial least
squares (PLS), ridge regression, LASSO and elastic net. The second approach is
the choice of kinship coefficients for genomic best linear unbiased prediction
(GBLUP). We compare kinships based on different combinations of centring and
scaling of marker genotypes, and a newly proposed kinship measure that adjusts
for linkage disequilibrium (LD).
We illustrate the use of both approaches and examine their performances using
three real-world data sets from plant and animal genetics. We find that elastic
net with feature selection and GBLUP using LD-adjusted kinships performed
similarly well, and were the best-performing methods in our study.
Comment: 17 pages, 5 figures
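To make "centring and scaling of marker genotypes" concrete, the sketch below builds a kinship matrix as an averaged cross-product of standardised genotypes. This is a generic form assumed for illustration: it is not necessarily any of the exact combinations compared in the paper, and it does not implement the LD-adjusted measure. The function name `kinship_matrix` and the toy genotypes are hypothetical.

```python
import math


def kinship_matrix(genotypes, scale=True):
    """Kinship from marker genotypes (rows: individuals, columns: markers).

    Each marker column is centred on its mean. If scale is True it is also
    divided by its (population) standard deviation, and markers with zero
    variance are dropped because they cannot be scaled.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    if scale:
        sds = [
            math.sqrt(sum((row[j] - means[j]) ** 2 for row in genotypes) / n)
            for j in range(m)
        ]
    else:
        sds = [1.0] * m
    used = [j for j in range(m) if sds[j] > 0]
    if not used:
        raise ValueError("no polymorphic markers left")
    z = [[(row[j] - means[j]) / sds[j] for j in used] for row in genotypes]
    # K = Z Z^T / (number of markers used)
    return [
        [sum(a * b for a, b in zip(zi, zj)) / len(used) for zj in z]
        for zi in z
    ]


# Toy allele counts (0, 1 or 2) for three individuals and three markers.
toy = [[0, 1, 2], [1, 1, 0], [2, 1, 2]]
k_scaled = kinship_matrix(toy, scale=True)
k_centred = kinship_matrix(toy, scale=False)
```

Scaling gives every marker the same weight regardless of allele frequency, while centring alone lets high-variance markers dominate; that difference is the kind of choice the abstract refers to.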
Learning Bayesian Networks with the bnlearn R Package
bnlearn is an R package which includes several algorithms for learning the
structure of Bayesian networks with either discrete or continuous variables.
Both constraint-based and score-based algorithms are implemented, and can use
the functionality provided by the snow package to improve their performance via
parallel computing. Several network scores and conditional independence
algorithms are available for both the learning algorithms and independent use.
Advanced plotting options are provided by the Rgraphviz package.
Comment: 22 pages, 4 pictures
Digging into acceptor splice site prediction : an iterative feature selection approach
Feature selection techniques are often used to reduce data dimensionality, increase classification performance, and gain insight into the processes that generated the data. In this paper, we describe an iterative procedure of feature selection and feature construction steps, improving the classification of acceptor splice sites, an important subtask of gene prediction.
We show that acceptor prediction can benefit from feature selection, and describe how feature selection techniques can be used to gain new insights into the classification of acceptor sites. This is illustrated by the identification of a new, biologically motivated feature: the AG-scanning feature.
The results described in this paper contribute both to the domain of gene prediction and to research in feature selection techniques, describing a new wrapper-based feature weighting method that aids in knowledge discovery when dealing with complex datasets.
Learning Bayesian Networks with the bnlearn R Package
bnlearn is an R package (R Development Core Team 2010) which includes several algorithms for learning the structure of Bayesian networks with either discrete or continuous variables. Both constraint-based and score-based algorithms are implemented, and can use the functionality provided by the snow package (Tierney et al. 2008) to improve their performance via parallel computing. Several network scores and conditional independence algorithms are available for both the learning algorithms and independent use. Advanced plotting options are provided by the Rgraphviz package (Gentry et al. 2010).