Markov blanket: efficient strategy for feature subset selection method for high dimensionality microarray cancer datasets
Currently, feature subset selection methods are very important, especially in areas of application for which
datasets with tens or hundreds of thousands of variables (genes) are available. Feature subset selection
methods help us select a small number of variables out of thousands of genes in microarray datasets for a
more accurate and balanced classification. Efficient gene selection eases the computational burden of the subsequent classification
task and can yield a small gene subset without loss of classification performance. In classifying
microarray data, the main objective of gene selection is to find genes that retain the maximum
amount of relevant information about the class while minimizing classification errors. In this paper, we explain the
importance of feature subset selection methods in machine learning and data mining fields. Consequently,
the analysis of microarray expression was used to check whether global biological differences underlie
common pathological features in different types of cancer datasets and identify genes that might anticipate
the clinical behavior of this disease. Gene expression data contain large amounts of raw data that must be
analyzed with a feature subset selection model to extract useful information for specific biological and
medical applications. One way of finding relevant (and removing redundant) genes is to use a
Bayesian network based on the Markov blanket [1]. We present and compare the performance of the
different approaches to feature (genes) subset selection methods based on Wrapper and Markov Blanket
models for the five-microarray cancer datasets. The first way depends on the Memetic algorithms (MAs)
used for feature selection. The second way uses MRMR (Minimum Redundancy Maximum
Relevance) for feature subset selection, hybridized with genetic search optimization techniques, and afterwards
compares the Markov blanket model's performance with the most common classical classification
algorithms for the selected set of features. For the memetic algorithm, we compare two embedded approaches to feature subset
selection: the wrapper-filter feature selection algorithm (WFFSA) and the Markov Blanket
Embedded Genetic Algorithm (MBEGA). The memetic algorithm relies on genetic operators (crossover,
mutation) and a dedicated local search procedure. For comparison, we rely on two evaluation
techniques for training and testing data: 10-fold cross-validation and 30-iteration bootstrapping. The
results clearly show that MBEGA often outperforms WFFSA by yielding
more significant differentiation among the different microarray cancer datasets. In the second part of this paper, we focus mainly on MRMR feature subset selection methods and the
Bayesian network based on Markov blanket (MB) model that are useful for building a good predictor and
defying the curse of dimensionality to improve prediction performance. These methods cover a wide range
of concerns: providing a better definition of the objective function, feature construction, feature ranking,
efficient search methods, and feature validity assessment methods as well as defining the relationships
among attributes to make predictions. We present performance measures for some common (classical) learning classification algorithms (naive
Bayes, support vector machine [LibSVM], k-nearest neighbor, and an AdaBoostM ensemble) before and
after applying the MRMR method. We then compare the performance of the Bayesian network classification
algorithm based on the Markov blanket model with the performance of these common classification
algorithms. The classification algorithm based on the Bayesian network of the Markov blanket model
achieves higher accuracy rates than the other classical classification algorithms
on the cancer microarray datasets.
Bayesian networks clearly depend on relationships among attributes to make predictions. The Bayesian
network classification method based on the Markov blanket (MB) exploits the fact that a variable's Markov blanket provides all the
information necessary for predicting its value. In this paper, we recommend the Bayesian network based on the Markov blanket for learning and classification, as it is highly effective and efficient on
feature subset selection measures.
Master of Science (MSc) in Computational Science
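The MRMR criterion described above can be sketched in a few lines: at each greedy step, pick the candidate gene that maximizes its mutual information with the class label minus its average mutual information with the genes already selected. The following Python sketch is a generic illustration, not the authors' implementation; the discrete mutual-information estimator and the toy data are assumptions made for the example.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log( p(x,y) / (p(x) p(y)) ), with counts c, px, py over n samples
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

def mrmr_select(features, labels, k):
    """Greedy MRMR: repeatedly add the feature maximizing
    relevance(feature, class) - mean redundancy with already-selected features."""
    selected = []
    candidates = list(range(len(features)))
    relevance = [mutual_information(f, labels) for f in features]
    while candidates and len(selected) < k:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(mutual_information(features[j], features[s])
                             for s in selected) / len(selected)
            return relevance[j] - redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

On a toy dataset where gene 0 perfectly matches the class, `mrmr_select` picks it first; a redundant copy of gene 0 is then penalized by the redundancy term.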
Massively-Parallel Feature Selection for Big Data
We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for
feature selection (FS) in Big Data settings (high dimensionality and/or sample
size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix
both in terms of rows (samples, training examples) as well as columns
(features). By employing the concepts of p-values of conditional independence
tests and meta-analysis techniques, PFBP manages to rely only on computations
local to a partition while minimizing communication costs. Then, it employs
powerful and safe (asymptotically sound) heuristics to make early, approximate
decisions, such as Early Dropping of features from consideration in subsequent
iterations, Early Stopping of consideration of features within the same
iteration, or Early Return of the winner in each iteration. PFBP provides
asymptotic guarantees of optimality for data distributions faithfully
representable by a causal network (Bayesian network or maximal ancestral
graph). Our empirical analysis confirms a super-linear speedup of the algorithm
with increasing sample size, linear scalability with respect to the number of
features and processing cores, while dominating other competitive algorithms in
its class.
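PFBP's reliance on meta-analysis techniques to combine per-partition p-values can be illustrated with a standard combination rule. The sketch below uses Stouffer's method, one common meta-analysis technique; the paper's exact combination rule may differ, so treat this as a generic illustration of the idea, not PFBP itself.

```python
import math
from statistics import NormalDist

_phi = NormalDist()  # standard normal distribution

def stouffer_combine(p_values):
    """Combine independent one-sided p-values (e.g. from conditional
    independence tests run on separate data partitions) into one p-value
    using Stouffer's method: average the corresponding z-scores."""
    zs = [_phi.inv_cdf(1.0 - p) for p in p_values]
    z = sum(zs) / math.sqrt(len(zs))
    return 1.0 - _phi.cdf(z)
```

Three moderately significant local p-values (0.04, 0.03, 0.05) combine into a much stronger global p-value, which is what lets a distributed test reach a confident decision from partition-local computations.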
A hybrid algorithm for Bayesian network structure learning with application to multi-label learning
We present a novel hybrid algorithm for Bayesian network structure learning,
called H2PC. It first reconstructs the skeleton of a Bayesian network and then
performs a Bayesian-scoring greedy hill-climbing search to orient the edges.
The algorithm is based on divide-and-conquer constraint-based subroutines to
learn the local structure around a target variable. We conduct two series of
experimental comparisons of H2PC against Max-Min Hill-Climbing (MMHC), which is
currently the most powerful state-of-the-art algorithm for Bayesian network
structure learning. First, we use eight well-known Bayesian network benchmarks
with various data sizes to assess the quality of the learned structure returned
by the algorithms. Our extensive experiments show that H2PC outperforms MMHC in
terms of goodness of fit to new data and quality of the network structure with
respect to the true dependence structure of the data. Second, we investigate
H2PC's ability to solve the multi-label learning problem. We provide
theoretical results to characterize and identify graphically the so-called
minimal label powersets that appear as irreducible factors in the joint
distribution under the faithfulness condition. The multi-label learning problem
is then decomposed into a series of multi-class classification problems, where
each multi-class variable encodes a label powerset. H2PC is shown to compare
favorably to MMHC in terms of global classification accuracy over ten
multi-label data sets covering different application domains. Overall, our
experiments support the conclusions that local structural learning with H2PC in
the form of local neighborhood induction is a theoretically well-motivated and
empirically effective learning framework that is well suited to multi-label
learning. The source code (in R) of H2PC as well as all data sets used for the
empirical tests are publicly available.
Comment: arXiv admin note: text overlap with arXiv:1101.5184 by other authors
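The label powerset decomposition described in the abstract (each multi-class variable encodes a set of labels) is simple to sketch. The helper names below are hypothetical and this is a minimal illustration of the encoding only, not the H2PC code (which is in R).

```python
def to_label_powerset(Y):
    """Encode each row of a binary label matrix as one multi-class value,
    assigning a fresh class id to each distinct label combination."""
    classes = {}   # label tuple -> class id
    encoded = []
    for row in Y:
        key = tuple(row)
        if key not in classes:
            classes[key] = len(classes)
        encoded.append(classes[key])
    return encoded, classes

def from_label_powerset(encoded, classes):
    """Invert the encoding: map class ids back to label vectors."""
    inverse = {v: list(k) for k, v in classes.items()}
    return [inverse[c] for c in encoded]
```

The multi-label problem over the original labels then becomes a single multi-class problem over the powerset classes, and the encoding is lossless (the round trip recovers the original label matrix).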
The robust selection of predictive genes via a simple classifier
Identifying genes that direct the mechanism of a disease from expression data is extremely useful in understanding how that mechanism works.
This in turn may lead to better diagnoses and potentially can lead to a cure for that disease. This task becomes extremely challenging when the
data are characterised by only a small number of samples and a high number of dimensions, as it is often the case with gene expression data.
Motivated by this challenge, we present a general framework that focuses on simplicity and data perturbation. These are the keys for the robust
identification of the most predictive features in such data. Within this framework, we propose a simple selective naïve Bayes classifier discovered using a global search technique, and combine it with data perturbation to
increase its robustness to small sample sizes.
An extensive validation of the method was carried out using two applied datasets from the field of microarrays and a simulated dataset, all
confounded by small sample sizes and high dimensionality. The method has been shown capable of identifying genes previously confirmed or associated with prostate cancer and viral infections.
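A selective naïve Bayes classifier of the kind proposed here restricts the naive Bayes factorization to a chosen feature subset. The sketch below is a minimal discrete version with Laplace smoothing; the class name and the discrete/Laplace choices are assumptions made for illustration, and the paper's global search over subsets and its data-perturbation scheme are omitted.

```python
import math
from collections import Counter, defaultdict

class SelectiveNaiveBayes:
    """Naive Bayes restricted to a chosen feature subset, with Laplace
    smoothing; features and labels are assumed discrete."""

    def __init__(self, feature_subset):
        self.subset = feature_subset

    def fit(self, X, y):
        self.n = len(y)
        self.class_counts = Counter(y)
        self.counts = Counter()          # (class, feature index, value) -> count
        self.values = defaultdict(set)   # feature index -> observed values
        for row, c in zip(X, y):
            for j in self.subset:
                self.counts[(c, j, row[j])] += 1
                self.values[j].add(row[j])
        return self

    def predict(self, row):
        best, best_lp = None, -math.inf
        for c, cc in self.class_counts.items():
            lp = math.log(cc / self.n)   # log prior
            for j in self.subset:
                num = self.counts[(c, j, row[j])] + 1        # Laplace smoothing
                den = cc + len(self.values[j])
                lp += math.log(num / den)
            if lp > best_lp:
                best, best_lp = c, lp
        return best
```

Restricting the product to `feature_subset` is what makes the classifier "selective": features outside the subset simply do not contribute to the posterior.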
Predicting Pancreatic Cancer Using Support Vector Machine
This report presents an approach to predicting pancreatic cancer using the Support Vector Machine classification algorithm. The research objective of this project is to predict pancreatic cancer from genomic data alone, from clinical data alone, and from a combination of genomic and clinical data. We used real genomic data with 22,763 samples and 154 features per sample. We also created synthetic clinical data with 400 samples and 7 features per sample in order to assess accuracy on clinical data alone. To validate the hypothesis, we combined the synthetic clinical data with a subset of features from the real genomic data. In our results, we observed that prediction accuracy, precision, and recall with genomic data alone are 80.77%, 20%, and 4%. Prediction accuracy, precision, and recall with synthetic clinical data alone are 93.33%, 95%, and 30%, while prediction accuracy, precision, and recall for the combination of real genomic and synthetic clinical data are 90.83%, 10%, and 5%. The combination of real genomic and synthetic clinical data decreased the accuracy, since the genomic data are only weakly correlated. We therefore conclude that combining genomic and clinical data does not improve pancreatic cancer prediction accuracy. A dataset with more significant genomic features might help predict pancreatic cancer more accurately.
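The large gap between accuracy and recall reported above (e.g. 80.77% accuracy with only 4% recall on genomic data) is typical of imbalanced classification, and it is why all three metrics are reported together. As a reminder of how they are computed from the confusion counts, here is a minimal sketch (the function name is illustrative, not from the report):

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall for a binary prediction task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted positives, how many are real
    recall = tp / (tp + fn) if tp + fn else 0.0      # of real positives, how many were found
    return accuracy, precision, recall
```

When positives are rare, a classifier can score high accuracy while missing almost every positive case, which is exactly the pattern in the genomic-only results.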