
Massively-Parallel Feature Selection for Big Data

Author: Borboudakis Giorgos
Christophides Vassilis
Katsogridakis Pavlos
Pratikakis Polyvios
Tsamardinos Ioannis
Publication date: 23/08/2017

We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data settings (high dimensionality and/or sample size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix both in terms of rows (samples, training examples) and columns (features). By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP manages to rely only on computations local to a partition while minimizing communication costs. Then, it employs powerful and safe (asymptotically sound) heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Our empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size, linear scalability with respect to the number of features and processing cores, while dominating other competitive algorithms in its class.
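The abstract above mentions combining p-values computed locally on each partition through meta-analysis techniques. As a rough illustration only, the sketch below uses Fisher's combined probability test, one standard meta-analysis method; whether PFBP uses exactly this rule is an assumption, and the function name `fisher_combine` is hypothetical. It also assumes the per-partition p-values are independent.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis. For even degrees
    of freedom the survival function has the closed form
    exp(-X/2) * sum_{i<k} (X/2)^i / i!, so no external library is needed.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2

    # Accumulate sum_{i=0}^{k-1} (X/2)^i / i! term by term.
    term = 1.0
    total = 1.0
    for i in range(1, len(p_values)):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical p-values for one feature, one per row partition.
local_p_values = [0.04, 0.20, 0.07]
combined = fisher_combine(local_p_values)
```

Each partition only has to send one number per feature to be combined, which is the kind of low-communication scheme the abstract describes.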

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Hal-Diderot

Improving the Efficiency of Genomic Selection

Author: Balding David J.
Mackay Ian
Scutari Marco
Publication date: 01/01/2013

We investigate two approaches to increase the efficiency of phenotypic prediction from genome-wide markers, which is a key step for genomic selection (GS) in plant and animal breeding. The first approach is feature selection based on Markov blankets, which provide a theoretically-sound framework for identifying non-informative markers. Fitting GS models using only the informative markers results in simpler models, which may allow cost savings from reduced genotyping. We show that this is accompanied by no loss, and possibly a small gain, in predictive power for four GS models: partial least squares (PLS), ridge regression, LASSO and elastic net. The second approach is the choice of kinship coefficients for genomic best linear unbiased prediction (GBLUP). We compare kinships based on different combinations of centring and scaling of marker genotypes, and a newly proposed kinship measure that adjusts for linkage disequilibrium (LD). We illustrate the use of both approaches and examine their performances using three real-world data sets from plant and animal genetics. We find that elastic net with feature selection and GBLUP using LD-adjusted kinships performed similarly well, and were the best-performing methods in our study. Comment: 17 pages, 5 figures
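The kinship comparison described above depends on how marker genotypes are centred and scaled. The sketch below shows one common construction, K = Z Zᵀ / m, where Z holds the centred (and optionally scaled) genotypes. It is an assumption that this resembles the estimators compared in the paper. The function name `kinship` and the toy genotypes are hypothetical, and the LD adjustment is not shown.

```python
from statistics import fmean, pstdev


def kinship(genotypes, scale=True):
    """Return K = Z Z^T / m for an n x m matrix of allele counts.

    Each marker (column) is centred on its mean and, if `scale` is true,
    divided by its population standard deviation. Monomorphic markers
    (zero variance) are skipped because they carry no information.
    """
    n = len(genotypes)
    z_columns = []
    for column in zip(*genotypes):  # iterate over markers
        mean = fmean(column)
        sd = pstdev(column)
        if sd == 0:
            continue
        divisor = sd if scale else 1.0
        z_columns.append([(g - mean) / divisor for g in column])
    if not z_columns:
        raise ValueError("no polymorphic markers")
    m = len(z_columns)
    return [
        [sum(z[i] * z[j] for z in z_columns) / m for j in range(n)]
        for i in range(n)
    ]


# Toy data: 3 individuals x 3 markers, coded as 0/1/2 allele counts.
toy_genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1]]
K_scaled = kinship(toy_genotypes, scale=True)
K_centred = kinship(toy_genotypes, scale=False)
```

With scaling, every marker contributes equally to K regardless of allele frequency; with centring only, high-variance markers weigh more.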

arXiv.org e-Print Archive

Crossref

Oxford University Research Archive

University of Melbourne Institutional Repository

Learning Bayesian Networks with the bnlearn R Package

Author: Scutari Marco
Publication date: 01/01/2010

bnlearn is an R package which includes several algorithms for learning the structure of Bayesian networks with either discrete or continuous variables. Both constraint-based and score-based algorithms are implemented, and can use the functionality provided by the snow package to improve their performance via parallel computing. Several network scores and conditional independence algorithms are available for both the learning algorithms and independent use. Advanced plotting options are provided by the Rgraphviz package. Comment: 22 pages, 4 pictures
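Constraint-based structure learning, as mentioned above, relies on conditional independence tests. The following sketch computes the G² (log-likelihood ratio) statistic for discrete variables X and Y given Z. It is a generic illustration in Python, not bnlearn's API or implementation, and the function names are hypothetical.

```python
from collections import Counter
from math import log


def g2_statistic(x, y, z):
    """G^2 statistic for testing X independent of Y given Z (discrete data).

    G^2 = 2 * sum over observed cells of n_xyz * ln(n_xyz / e_xyz),
    where e_xyz = n_xz * n_yz / n_z is the expected count under
    conditional independence.
    """
    n_xyz = Counter(zip(x, y, z))
    n_xz = Counter(zip(x, z))
    n_yz = Counter(zip(y, z))
    n_z = Counter(z)
    g2 = 0.0
    for (xi, yi, zi), count in n_xyz.items():
        expected = n_xz[(xi, zi)] * n_yz[(yi, zi)] / n_z[zi]
        g2 += 2.0 * count * log(count / expected)
    return g2


def g2_degrees_of_freedom(x, y, z):
    """Nominal degrees of freedom: (|X| - 1) * (|Y| - 1) * |Z|."""
    return (len(set(x)) - 1) * (len(set(y)) - 1) * len(set(z))


# X and Y identical within a single stratum of Z: strong dependence.
dependent = g2_statistic([0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0])
# X and Y perfectly balanced: the statistic is zero.
independent = g2_statistic([0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 0, 0])
```

Under the null hypothesis the statistic is asymptotically chi-square distributed with the degrees of freedom returned by `g2_degrees_of_freedom`, so a p-value follows from any chi-square survival function.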

arXiv.org e-Print Archive

Crossref

Directory of Open Access Journals

UCL Discovery

Oxford University Research Archive

Journal of Statistical Software

Digging into acceptor splice site prediction : an iterative feature selection approach

Author: A.I. Blum
A.K. Jain
C. Mathé
D. Mladenić
E. Alpaydin
G.R. Harik
H. Mühlenbein
I. Guyon
J. Weston
M. Kudo
M. Pertea
P. Larrañaga
R. Kohavi
R.O. Duda
S. Degroeve
T. Joachims
X. Zhang
Y. Saeys
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2004

Feature selection techniques are often used to reduce data dimensionality, increase classification performance, and gain insight into the processes that generated the data. In this paper, we describe an iterative procedure of feature selection and feature construction steps, improving the classification of acceptor splice sites, an important subtask of gene prediction. We show that acceptor prediction can benefit from feature selection, and describe how feature selection techniques can be used to gain new insights in the classification of acceptor sites. This is illustrated by the identification of a new, biologically motivated feature: the AG-scanning feature. The results described in this paper contribute both to the domain of gene prediction, and to research in feature selection techniques, describing a new wrapper-based feature weighting method that aids in knowledge discovery when dealing with complex datasets.

Crossref

Ghent University Academic Bibliography

Learning Bayesian Networks with the bnlearn R Package

Author: Marco Scutari

bnlearn is an R package (R Development Core Team 2010) which includes several algorithms for learning the structure of Bayesian networks with either discrete or continuous variables. Both constraint-based and score-based algorithms are implemented, and can use the functionality provided by the snow package (Tierney et al. 2008) to improve their performance via parallel computing. Several network scores and conditional independence algorithms are available for both the learning algorithms and independent use. Advanced plotting options are provided by the Rgraphviz package (Gentry et al. 2010).

Research Papers in Economics