icet - A Python library for constructing and sampling alloy cluster expansions
Alloy cluster expansions (CEs) provide an accurate and computationally
efficient mapping of the potential energy surface of multi-component systems
that enables comprehensive sampling of the many-dimensional configuration
space. Here, we introduce icet, a flexible, extensible, and
computationally efficient software package for the construction and sampling of
CEs. icet is largely written in Python for easy integration in
comprehensive workflows, including first-principles calculations for the
generation of reference data and machine learning libraries for training and
validation. The package enables training using a variety of linear regression
algorithms with and without regularization, Bayesian regression, feature
selection, and cross-validation. It also provides complementary functionality
for structure enumeration and mapping as well as data management and analysis.
Potential applications are illustrated by two examples, including the
computation of the phase diagram of a prototypical metallic alloy and the
analysis of chemical ordering in an inorganic semiconductor.
Comment: 10 page
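As a rough illustration of the idea described in this abstract, the sketch below fits a toy cluster expansion for a binary alloy on a periodic one-dimensional chain using plain NumPy. It does not use the icet API. The helper names (`cluster_vector`, `fit_ridge`), the synthetic interaction parameters, and the simple hold-out split (standing in for cross-validation) are all assumptions made for illustration only.

```python
import numpy as np

def cluster_vector(sigma, max_neighbor=2):
    """Hypothetical helper: average cluster functions of a periodic 1D chain.

    Returns [1, <s_i>, <s_i s_{i+1}>, ..., <s_i s_{i+max_neighbor}>],
    where s_i = +1 or -1 encodes the species on site i.
    """
    sigma = np.asarray(sigma, dtype=float)
    feats = [1.0, sigma.mean()]
    for d in range(1, max_neighbor + 1):
        # Pair correlation at distance d, using periodic boundary conditions.
        feats.append(np.mean(sigma * np.roll(sigma, -d)))
    return np.array(feats)

def fit_ridge(X, y, alpha=1e-8):
    """Closed-form ridge regression: solve (X^T X + alpha I) w = X^T y."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_feat), X.T @ y)

# Synthetic reference data (assumed values, not taken from any real system).
rng = np.random.default_rng(0)
true_eci = np.array([-1.0, 0.05, 0.30, -0.10])
configs = rng.choice([-1, 1], size=(200, 16))
X = np.array([cluster_vector(s) for s in configs])
y = X @ true_eci + rng.normal(scale=1e-3, size=len(X))

# Train on the first 150 configurations and validate on the remaining 50.
train, test = slice(0, 150), slice(150, 200)
eci = fit_ridge(X[train], y[train])
rmse = np.sqrt(np.mean((X[test] @ eci - y[test]) ** 2))
print("fitted interactions:", np.round(eci, 3))
print("hold-out RMSE:", rmse)
```

In this toy setting the energies are exactly linear in the cluster functions, so a lightly regularized linear fit recovers the assumed interaction parameters. Real reference data would typically call for a larger cluster basis and a more careful choice of regularization.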
Stable Feature Selection for Biomarker Discovery
Feature selection techniques have long been the workhorse of biomarker
discovery applications. Surprisingly, the stability of feature selection with
respect to sampling variations was under-considered for a long time, and only
recently has this issue received growing attention. In this article, we review
existing stable feature selection methods for biomarker discovery using a
generic hierarchical framework. We have two objectives: (1) to provide an
overview of this new yet fast-growing topic as a convenient reference; and
(2) to categorize existing methods under an expandable framework for future
research and development.
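To make "stability with respect to sampling variations" concrete, the sketch below resamples a data set with the bootstrap, selects the top-k features on each resample, and reports the mean pairwise Jaccard index of the selected sets. The correlation-based selector, the Jaccard measure, and all function names are assumptions chosen for illustration. They are not taken from the reviewed article, which surveys many methods.

```python
import itertools
import numpy as np

def top_k_by_correlation(X, y, k):
    """Select the k features with the largest absolute Pearson correlation to y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    # Guard against constant columns (zero norm) to avoid division by zero.
    corr = np.abs(Xc.T @ yc) / np.where(denom == 0, 1.0, denom)
    return set(np.argsort(corr)[-k:].tolist())

def jaccard(a, b):
    """Jaccard index of two non-empty sets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)

def selection_stability(X, y, k, n_resamples=20, seed=0):
    """Mean pairwise Jaccard index of feature sets selected on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    subsets = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        subsets.append(top_k_by_correlation(X[idx], y[idx], k))
    pairs = itertools.combinations(subsets, 2)
    return float(np.mean([jaccard(a, b) for a, b in pairs]))

# Synthetic data: the first 5 of 50 features carry signal in y_signal,
# while y_noise is unrelated to every feature.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
y_signal = X[:, :5] @ np.ones(5) + 0.1 * rng.normal(size=200)
y_noise = rng.normal(size=200)

stab_signal = selection_stability(X, y_signal, k=5)
stab_noise = selection_stability(X, y_noise, k=5)
print("stability with signal:", stab_signal)
print("stability without signal:", stab_noise)
```

A value near 1 means nearly the same features are chosen on every resample, while a value near 0 means the selection is driven mostly by sampling noise.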
Analyzing genome-wide association studies with an FDR controlling modification of the Bayesian information criterion
The prevailing method of analyzing GWAS data is still to test each marker
individually, although from a statistical point of view it is quite obvious
that, in the case of complex traits, such single-marker tests are not ideal.
Recently, several model selection approaches for GWAS have been suggested, most
of them based on LASSO-type procedures. Here we discuss an alternative model
selection approach based on a modification of the Bayesian Information
Criterion (mBIC2), which was previously shown to have certain asymptotic
optimality properties in terms of minimizing the misclassification error.
We introduce heuristic search strategies that attempt to find the model
minimizing mBIC2 and that are efficient enough to allow the analysis of
GWAS data.
Our approach is implemented in a software package called MOSGWA. Its
performance in case-control GWAS is compared with the two algorithms HLASSO and
GWASelect, as well as with single-marker tests, in a simulation study based on
real SNP data from the POPRES sample. Our results show that MOSGWA performs
slightly better than HLASSO, whereas according to our simulations GWASelect
does not control the type I error when used to automatically determine the
number of important SNPs. We also reanalyze the GWAS data from the Wellcome
Trust Case-Control Consortium (WTCCC) and compare the findings of the different
procedures.