Markov blanket: efficient strategy for feature subset selection method for high dimensionality microarray cancer datasets
Currently, feature subset selection methods are very important, especially in application areas where datasets with tens or hundreds of thousands of variables (genes) are available. Feature subset selection methods help us select a small number of variables out of thousands of genes in microarray datasets for more accurate and balanced classification. Efficient gene selection reduces the computational load of the subsequent classification task and can yield a small gene subset without loss of classification performance. In classifying microarray data, the main objective of gene selection is to search for the genes that retain the maximum amount of relevant information about the class while minimizing classification error. In this paper, we explain the importance of feature subset selection methods in the machine learning and data mining fields. Microarray expression analysis was then used to check whether global biological differences underlie common pathological features in different types of cancer datasets, and to identify genes that might anticipate the clinical behavior of the disease. Gene expression studies produce large amounts of raw data that must be analyzed to obtain useful information for specific biological and medical applications, and feature subset selection is a key step in that analysis. One way of finding relevant (and removing redundant) genes is to use a Bayesian network based on the Markov blanket [1]. We present and compare the performance of different approaches to feature (gene) subset selection based on wrapper and Markov blanket models for five microarray cancer datasets. The first approach uses memetic algorithms (MAs) for feature selection. The second uses MRMR (Minimum Redundancy Maximum Relevance) for feature subset selection, hybridized with genetic search optimization, and then compares the Markov blanket model's performance with the most common classical classification algorithms on the selected feature set.

For the memetic algorithm, we present a comparison between two embedded approaches for feature subset selection: the wrapper-filter feature selection algorithm (WFFSA) and the Markov Blanket Embedded Genetic Algorithm (MBEGA). The memetic algorithm relies on genetic operators (crossover, mutation) and a dedicated local search procedure. For the comparisons, we use two evaluation techniques for training and testing: 10-fold cross-validation and 30-round bootstrapping. The results clearly show that MBEGA often outperforms WFFSA by yielding more significant differentiation among the different microarray cancer datasets.

In the second part of this paper, we focus mainly on MRMR feature subset selection and the Bayesian network based on the Markov blanket (MB) model, which are useful for building a good predictor and defying the curse of dimensionality to improve prediction performance. These methods cover a wide range of concerns: providing a better definition of the objective function, feature construction, feature ranking, efficient search methods, and feature validity assessment, as well as defining the relationships among attributes to make predictions.

We present performance measures for some common (classical) classification algorithms (naive Bayes, support vector machine [LibSVM], k-nearest neighbor, and an AdaBoostM ensemble) before and after applying the MRMR method, and we compare the performance of the Bayesian network classification algorithm based on the Markov blanket model with that of these common classifiers. The Bayesian network classifier based on the Markov blanket model achieves higher accuracy rates than the other classical classification algorithms on the cancer microarray datasets.

Bayesian networks clearly depend on relationships among attributes to make predictions. In a Bayesian network, the Markov blanket of a variable provides all the information necessary for predicting its value. In this paper, we recommend the Bayesian network based on the Markov blanket for learning and classification, as it is highly effective and efficient on feature subset selection measures.

Master of Science (MSc) in Computational Science
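To make the MRMR criterion above concrete, here is a minimal greedy sketch in Python. It assumes scikit-learn's mutual-information estimators as the relevance and redundancy scores; the plain forward search and k=50 are illustrative stand-ins for the thesis's genetic-search hybridization, not its exact procedure.

```python
# Greedy MRMR sketch: pick genes that are relevant to the class label but
# not redundant with genes already chosen. Relevance and redundancy are
# estimated with scikit-learn's mutual-information estimators (an assumption).
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, k=50):
    """Greedily pick k columns of X maximizing relevance minus mean redundancy."""
    relevance = mutual_info_classif(X, y)        # I(gene; class) per gene
    selected = [int(np.argmax(relevance))]       # seed with the most relevant gene
    candidates = set(range(X.shape[1])) - set(selected)
    while len(selected) < k and candidates:
        def score(j):
            # Redundancy: mean MI between candidate j and already-selected genes.
            redundancy = mutual_info_regression(X[:, selected], X[:, j]).mean()
            return relevance[j] - redundancy
        best = max(candidates, key=score)        # O(k * p) MI estimates; slow but clear
        selected.append(best)
        candidates.remove(best)
    return selected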
PLS dimension reduction for classification of microarray data
PLS dimension reduction is known to give good prediction accuracy in the context of classification with high-dimensional microarray data. In this paper, PLS is compared with some of the best state-of-the-art classification methods. In addition, a simple procedure to choose the number of components is suggested. The connection between PLS dimension reduction and gene selection is examined, and a property of the first PLS component for binary classification is proven. PLS can also be used as a visualization tool for high-dimensional data in the classification framework. The whole study is based on 9 real microarray cancer data sets.
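As a rough illustration of PLS dimension reduction for classification, the sketch below extracts PLS components and feeds them to a logistic regression, choosing the number of components by cross-validated accuracy. It uses scikit-learn's PLSRegression on synthetic placeholder data; the generic CV loop stands in for the paper's suggested component-selection procedure, which may differ.

```python
# PLS components as low-dimensional features for a downstream classifier.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 2000))   # placeholder for a microarray matrix
y = rng.integers(0, 2, size=60)       # placeholder binary class labels (0/1)

# Choose the number of PLS components by cross-validated accuracy.
for k in (1, 2, 3, 4, 5):
    # PLSRegression acts as the transformer (its transform() returns X scores).
    model = make_pipeline(PLSRegression(n_components=k), LogisticRegression())
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{k} components: CV accuracy {acc:.3f}")
```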
A graph-based representation of Gene Expression profiles in DNA microarrays
This paper proposes a new and very flexible data model, called the gene expression graph (GEG), for gene expression analysis and classification. Three features differentiate GEGs from other available microarray data representation structures: (i) the memory occupation of a GEG is independent of the number of samples used to build it; (ii) a GEG more clearly expresses relationships among expressed and non-expressed genes in both healthy- and diseased-tissue experiments; (iii) GEGs allow very efficient classifiers to be implemented easily. The paper also presents a simple classifier for sample-based classification to show the flexibility and user-friendliness of the proposed data structure.
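A minimal sketch of the sample-count-independent idea behind a GEG, under stated assumptions: genes are nodes, an edge links two genes expressed in the same sample, and edge weights accumulate counts so memory grows with the number of genes rather than samples. The paper's actual GEG definition may differ in its edge semantics and expression-call rule.

```python
# Toy gene-expression-graph-style structure: edge weights aggregate counts,
# so adding more samples never grows the structure beyond the gene pairs seen.
from collections import defaultdict

class GeneExpressionGraph:
    def __init__(self, threshold=0.0):
        self.threshold = threshold             # expression-call cutoff (assumed)
        self.edge_weight = defaultdict(int)    # (gene_a, gene_b) -> co-expression count

    def add_sample(self, expression):
        """expression: dict mapping gene name -> expression level for one sample."""
        expressed = sorted(g for g, v in expression.items() if v > self.threshold)
        for i, a in enumerate(expressed):
            for b in expressed[i + 1:]:
                self.edge_weight[(a, b)] += 1  # aggregate; the raw sample is discarded

# Hypothetical usage with made-up gene names and levels:
geg = GeneExpressionGraph(threshold=100.0)
geg.add_sample({"TP53": 250.0, "BRCA1": 30.0, "MYC": 180.0})
```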
Fully Bayesian Logistic Regression with Hyper-Lasso Priors for High-dimensional Feature Selection
High-dimensional feature selection arises in many areas of modern science.
For example, in genomic research we want to find the genes that can be used to
separate tissues of different classes (e.g. cancer and normal) from tens of
thousands of genes that are active (expressed) in certain tissue cells. To this
end, we wish to fit regression and classification models with a large number of
features (also called variables, predictors). In the past decade, penalized
likelihood methods for fitting regression models based on hyper-LASSO
penalization have received increasing attention in the literature. However,
fully Bayesian methods that use Markov chain Monte Carlo (MCMC) remain
underdeveloped in the literature. In this paper we introduce an MCMC
(fully Bayesian) method for learning severely multi-modal posteriors of
logistic regression models based on hyper-LASSO priors (non-convex penalties).
Our MCMC algorithm uses Hamiltonian Monte Carlo in a restricted Gibbs sampling
framework; we call our method Bayesian logistic regression with hyper-LASSO
(BLRHL) priors. We have used simulation studies and real data analysis to
demonstrate the superior performance of hyper-LASSO priors, and to investigate
the issues of choosing the heaviness and scale of hyper-LASSO priors.
Comment: 33 pages. arXiv admin note: substantial text overlap with arXiv:1308.469
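To illustrate the kind of posterior this abstract targets, here is a toy sampler for Bayesian logistic regression under a heavy-tailed (Cauchy) prior, a simple stand-in for a hyper-LASSO prior. This is a plain random-walk Metropolis sketch, not the authors' HMC-within-restricted-Gibbs BLRHL algorithm; the prior scale and step size are arbitrary assumptions.

```python
# Random-walk Metropolis for logistic regression with a heavy-tailed prior.
import numpy as np

def log_posterior(beta, X, y, scale=1.0):
    """Bernoulli log-likelihood plus independent Cauchy(0, scale) log-prior."""
    logits = X @ beta
    log_lik = np.sum(y * logits - np.logaddexp(0.0, logits))  # y in {0, 1}
    log_prior = -np.sum(np.log1p((beta / scale) ** 2))        # Cauchy, up to a constant
    return log_lik + log_prior

def metropolis(X, y, n_iter=5000, step=0.05, seed=0):
    """Sample regression coefficients; returns an (n_iter, p) array of draws."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    lp = log_posterior(beta, X, y)
    samples = []
    for _ in range(n_iter):
        proposal = beta + step * rng.standard_normal(beta.shape)
        lp_prop = log_posterior(proposal, X, y)
        if np.log(rng.random()) < lp_prop - lp:               # Metropolis accept step
            beta, lp = proposal, lp_prop
        samples.append(beta.copy())
    return np.array(samples)
```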
Correcting for selection bias via cross-validation in the classification of microarray data
There is increasing interest in the use of diagnostic rules based on
microarray data. These rules are formed by considering the expression levels of
thousands of genes in tissue samples taken on patients of known classification
with respect to a number of classes, representing, say, disease status or
treatment strategy. As the final versions of these rules are usually based on a
small subset of the available genes, there is a selection bias that has to be
corrected for in the estimation of the associated error rates. We consider the
problem using cross-validation. In particular, we present explicit formulae
that are useful in explaining the layers of validation that have to be
performed in order to avoid improperly cross-validated estimates.
Comment: Published at http://dx.doi.org/10.1214/193940307000000284 in the IMS Collections (http://www.imstat.org/publications/imscollections.htm) by the Institute of Mathematical Statistics (http://www.imstat.org)
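The practical upshot of the abstract above is that gene selection must be repeated inside every cross-validation fold. Below is a minimal scikit-learn sketch with placeholder data; the arbitrary SelectKBest filter stands in for whatever selection rule is actually used.

```python
# External cross-validation: the selection step lives inside the pipeline,
# so it is re-fit on each training fold and never sees the held-out samples.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 1000))   # placeholder expression matrix
y = rng.integers(0, 2, size=50)       # placeholder class labels (e.g. disease status)

pipe = make_pipeline(SelectKBest(f_classif, k=50), LinearSVC())
print("externally cross-validated accuracy:",
      cross_val_score(pipe, X, y, cv=10).mean())

# Biased alternative (do NOT do this): running SelectKBest once on all of
# (X, y) and then cross-validating only the classifier leaks information
# from the held-out folds into the gene selection step.
```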
Variable selection for BART: An application to gene regulation
We consider the task of discovering gene regulatory networks, which are
defined as sets of genes and the corresponding transcription factors which
regulate their expression levels. This can be viewed as a variable selection
problem, potentially with high dimensionality. Variable selection is especially
challenging in high-dimensional settings, where it is difficult to detect
subtle individual effects and interactions between predictors. Bayesian
Additive Regression Trees [BART, Ann. Appl. Stat. 4 (2010) 266-298] provides a
novel nonparametric alternative to parametric regression approaches, such as
the lasso or stepwise regression, especially when the number of relevant
predictors is sparse relative to the total number of available predictors and
the fundamental relationships are nonlinear. We develop a principled
permutation-based inferential approach for determining when the effect of a
selected predictor is likely to be real. Going further, we adapt the BART
procedure to incorporate informed prior information about variable importance.
We present simulations demonstrating that our method compares favorably to
existing parametric and nonparametric procedures in a variety of data settings.
To demonstrate the potential of our approach in a biological context, we apply
it to the task of inferring the gene regulatory network in yeast (Saccharomyces
cerevisiae). We find that our BART-based procedure is best able to recover the
subset of covariates with the largest signal compared to other variable
selection methods. The methods developed in this work are readily available in
the R package bartMachine.
Comment: Published at http://dx.doi.org/10.1214/14-AOAS755 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
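A rough sketch of the permutation-based selection idea: refit the model on permuted responses to build a null distribution of variable importances, then keep predictors whose real importance exceeds a null quantile. A random forest stands in here for BART (which the paper provides via the R package bartMachine), and the 95th-percentile cutoff is an assumed choice.

```python
# Permutation null for variable importance: permuting y breaks any X-y
# association, so importances on permuted data calibrate a selection threshold.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def permutation_select(X, y, n_perm=50, q=95, seed=0):
    """Return indices of predictors whose importance beats the permutation null."""
    rng = np.random.default_rng(seed)
    real = RandomForestRegressor(random_state=0).fit(X, y).feature_importances_
    null_max = np.empty(n_perm)
    for b in range(n_perm):
        y_perm = rng.permutation(y)   # break the X-y association
        imp = RandomForestRegressor(random_state=b).fit(X, y_perm).feature_importances_
        null_max[b] = imp.max()       # max importance across predictors per permutation
    return np.flatnonzero(real > np.percentile(null_max, q))
```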