A hybrid algorithm for Bayesian network structure learning with application to multi-label learning
We present a novel hybrid algorithm for Bayesian network structure learning,
called H2PC. It first reconstructs the skeleton of a Bayesian network and then
performs a Bayesian-scoring greedy hill-climbing search to orient the edges.
The algorithm is based on divide-and-conquer constraint-based subroutines to
learn the local structure around a target variable. We conduct two series of
experimental comparisons of H2PC against Max-Min Hill-Climbing (MMHC), which is
currently the most powerful state-of-the-art algorithm for Bayesian network
structure learning. First, we use eight well-known Bayesian network benchmarks
with various data sizes to assess the quality of the learned structure returned
by the algorithms. Our extensive experiments show that H2PC outperforms MMHC in
terms of goodness of fit to new data and quality of the network structure with
respect to the true dependence structure of the data. Second, we investigate
H2PC's ability to solve the multi-label learning problem. We provide
theoretical results to characterize and identify graphically the so-called
minimal label powersets that appear as irreducible factors in the joint
distribution under the faithfulness condition. The multi-label learning problem
is then decomposed into a series of multi-class classification problems, where
each multi-class variable encodes a label powerset. H2PC is shown to compare
favorably to MMHC in terms of global classification accuracy over ten
multi-label data sets covering different application domains. Overall, our
experiments support the conclusions that local structural learning with H2PC in
the form of local neighborhood induction is a theoretically well-motivated and
empirically effective learning framework that is well suited to multi-label
learning. The source code (in R) of H2PC as well as all data sets used for the
empirical tests are publicly available.
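The two-phase scheme this abstract describes — a constraint-based skeleton phase followed by a score-based hill-climbing phase that orients the edges — can be sketched in miniature. This is a drastic simplification, not H2PC itself: the skeleton here comes from a fixed pairwise mutual-information threshold rather than H2PC's divide-and-conquer local-structure subroutines, the score is a bare BIC, and all function names are ours for illustration.

```python
# Hypothetical two-phase hybrid structure-learning sketch (not H2PC's code):
# phase 1 builds an undirected skeleton, phase 2 greedily orients its edges.
import numpy as np
from itertools import combinations

def mutual_info(x, y):
    """Empirical mutual information between two discrete arrays."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(x == xv) * np.mean(y == yv)))
    return mi

def learn_skeleton(data, threshold=0.05):
    """Phase 1: keep an undirected edge wherever pairwise MI is high."""
    return {(a, b) for a, b in combinations(sorted(data), 2)
            if mutual_info(data[a], data[b]) > threshold}

def bic(data, dag):
    """BIC-style score: log-likelihood minus a complexity penalty."""
    n = len(next(iter(data.values())))
    score = 0.0
    for child, parents in dag.items():
        # group rows by the joint configuration of the child's parents
        keys = list(zip(*(data[p] for p in parents))) if parents else [()] * n
        for cfg in set(keys):
            rows = [i for i, k in enumerate(keys) if k == cfg]
            vals = data[child][rows]
            for v in np.unique(vals):
                c = int(np.sum(vals == v))
                score += c * np.log(c / len(rows))
            score -= 0.5 * np.log(n) * (len(np.unique(data[child])) - 1)
    return score

def orient(data, skeleton):
    """Phase 2: greedy hill climbing over orientations of skeleton edges."""
    dag = {v: [] for v in data}            # child -> list of parents
    improved = True
    while improved:
        improved = False
        for a, b in skeleton:
            for child, parent in ((a, b), (b, a)):
                if parent in dag[child] or child in dag[parent]:
                    continue               # edge already oriented
                trial = {k: list(v) for k, v in dag.items()}
                trial[child].append(parent)
                if bic(data, trial) > bic(data, dag):
                    dag, improved = trial, True
    return dag
```

A real hybrid learner would replace the MI threshold with conditional-independence tests around each target variable and enforce acyclicity during the search; the sketch only conveys the division of labor between the two phases.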
Fair Causal Feature Selection
Causal feature selection has recently received increasing attention in
machine learning. Existing causal feature selection algorithms select unique
causal features of a class variable as the optimal feature subset. However, a
class variable usually has multiple states, and it is unfair to select the same
causal features for different states of a class variable. To address this
problem, we employ the class-specific mutual information to evaluate the causal
information carried by each state of the class attribute, and theoretically
analyze the unique relationship between each state and the causal features.
Based on this, a Fair Causal Feature Selection algorithm (FairCFS) is proposed
to fairly identify the causal features for each state of the class variable.
Specifically, FairCFS performs pairwise comparisons of class-specific mutual
information values, considering their relative sizes from the
perspective of each state, and follows a divide-and-conquer framework to find
causal features. The correctness and application conditions of FairCFS are
theoretically proved, and extensive experiments are conducted to demonstrate
the efficiency and superiority of FairCFS compared to the state-of-the-art
approaches.
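The key quantity in this abstract is class-specific mutual information: how much a feature tells us about one particular state of the class variable, rather than about the variable as a whole. One common way to formalize it — which may differ in detail from the paper's definition — is the KL divergence between the feature's distribution conditioned on that state and its marginal distribution; the helper name below is ours.

```python
# Class-specific mutual information, sketched as KL( p(X|Y=state) || p(X) ).
# A hypothetical illustration of the quantity, not the FairCFS implementation.
import numpy as np

def class_specific_mi(x, y, state):
    """Information the discrete feature x carries about one state of class y."""
    mask = (y == state)
    kl = 0.0
    for v in np.unique(x):
        p_cond = np.mean(x[mask] == v)   # p(X = v | Y = state)
        p_marg = np.mean(x == v)         # p(X = v)
        if p_cond > 0:
            kl += p_cond * np.log(p_cond / p_marg)
    return kl
```

Averaging this quantity over all states, weighted by their probabilities, recovers the ordinary mutual information I(X; Y) — which is why selecting features per state can be fairer than selecting one set for the whole variable.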
Markov blanket: efficient strategy for feature subset selection method for high dimensionality microarray cancer datasets
Currently, feature subset selection methods are very important, especially in areas of application for which
datasets with tens or hundreds of thousands of variables (genes) are available. Feature subset selection
methods help us select a small number of variables out of thousands of genes in microarray datasets for a
more accurate and balanced classification. Efficient gene selection can ease the computational burden of the subsequent classification
task and can yield a subset of genes without loss of classification performance. In classifying
microarray data, the main objective of gene selection is to search for genes that retain the maximum
amount of relevant information about the class while minimizing classification errors. In this paper, we explain the
importance of feature subset selection methods in the machine learning and data mining fields. Consequently,
the analysis of microarray expression was used to check whether global biological differences underlie
common pathological features in different types of cancer datasets and identify genes that might anticipate
the clinical behavior of this disease. Gene expression data sets contain large amounts of raw data that need
to be analyzed to obtain useful information for specific biological and
medical applications, and feature subset selection models help with this. One way of finding relevant (and removing redundant) genes is by using the
Bayesian network based on the Markov blanket [1]. We present and compare the performance of the
different approaches to feature (genes) subset selection methods based on Wrapper and Markov Blanket
models for five microarray cancer datasets. The first approach uses memetic algorithms (MAs)
for feature selection. The second approach uses MRMR (Minimum Redundancy Maximum
Relevance) for feature subset selection, hybridized with genetic search optimization techniques, and afterwards
compares the Markov blanket model's performance with the most common classical classification
algorithms for the selected set of features.
For the memetic algorithm, we present a comparison between two embedded approaches for feature subset
selection: the wrapper filter for feature selection algorithm (WFFSA) and the Markov Blanket
Embedded Genetic Algorithm (MBEGA). The memetic algorithm depends on genetic operators (crossover,
mutation) and a dedicated local search procedure. For comparisons, we rely on two evaluation
techniques for training and testing, namely 10-fold cross-validation and bootstrapping with 30 resamples. The
results for the memetic algorithms clearly show that MBEGA often outperforms WFFSA by yielding
more significant differentiation among different microarray cancer datasets.
In the second part of this paper, we focus mainly on MRMR feature subset selection methods and the
Bayesian network based on Markov blanket (MB) model that are useful for building a good predictor and
defying the curse of dimensionality to improve prediction performance. These methods cover a wide range
of concerns: providing a better definition of the objective function, feature construction, feature ranking,
efficient search methods, and feature validity assessment methods as well as defining the relationships
among attributes to make predictions.
We present performance measures for some common (or classical) learning classification algorithms (Naive
Bayes, Support vector machine [LibSVM], K-nearest neighbor, and AdaBoostM ensembles) before and
after using the MRMR method. We compare the Bayesian network classification algorithm based on the
Markov Blanket model’s performance measure with the performance of these common classification
algorithms. The results show that the classification algorithm based on the Bayesian network
with the Markov blanket model achieves higher accuracy rates than the other classical classification algorithms
on the cancer microarray datasets.
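The MRMR criterion discussed above can be sketched compactly: greedily pick the feature with the highest relevance to the class minus its average redundancy with the features already chosen. This is a generic illustration under our own names, not the hybridized MRMR-plus-genetic-search pipeline of the paper.

```python
# Minimal mRMR (minimum redundancy, maximum relevance) sketch for discrete
# features; hypothetical helper names, not the paper's implementation.
import numpy as np

def mutual_info(a, b):
    """Empirical mutual information between two discrete arrays."""
    mi = 0.0
    for av in np.unique(a):
        for bv in np.unique(b):
            p_ab = np.mean((a == av) & (b == bv))
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (np.mean(a == av) * np.mean(b == bv)))
    return mi

def mrmr(features, labels, k):
    """Return the names of k features chosen by the mRMR criterion."""
    selected, remaining = [], dict(features)
    while len(selected) < k and remaining:
        def score(name):
            relevance = mutual_info(remaining[name], labels)
            if not selected:
                return relevance
            # penalize features that duplicate what is already selected
            redundancy = np.mean([mutual_info(remaining[name], features[s])
                                  for s in selected])
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.pop(best)
    return selected
```

The redundancy term is what separates mRMR from plain relevance ranking: a near-copy of an already-selected gene scores poorly even if it is individually informative.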
Bayesian networks clearly depend on relationships among attributes to make predictions. In the Bayesian
network classification method based on the Markov blanket (MB), the Markov blanket of a variable provides all the
information necessary for predicting its value. In this paper, we recommend the Bayesian network based on the Markov blanket for learning and classification, which is highly effective and efficient on
feature subset selection measures.
Master of Science (MSc) in Computational Science
Learning Patient-Specific Models From Clinical Data
A key purpose of building a model from clinical data is to predict the outcomes of future individual patients. This work introduces a Bayesian patient-specific predictive framework for constructing predictive models from data that are optimized to predict well for a particular patient case. The construction of such patient-specific models is influenced by the particular history, symptoms, laboratory results, and other features of the patient case at hand. This approach is in contrast to the commonly used population-wide models that are constructed to perform well on average on all future cases.
The new patient-specific method described in this research uses Bayesian network models, carries out Bayesian model averaging over a set of models to predict the outcome of interest for the patient case at hand, and employs a patient-specific heuristic to locate a set of suitable models to average over. Two versions of the method are developed that differ in the representation used for the conditional probability distributions in the Bayesian networks. One version uses a representation that captures only the so-called global structure among the variables of a Bayesian network, and the second captures additional local structure among the variables.
The patient-specific methods were experimentally evaluated on one synthetic dataset, 21 UCI datasets, and three medical datasets. Their performance was measured using five different performance measures and compared to that of several commonly used methods for constructing predictive models, including naïve Bayes, C4.5 decision trees, logistic regression, neural networks, k-Nearest Neighbor, and Lazy Bayesian Rules. Over all the datasets, both patient-specific methods performed better on average on all performance measures and against all the comparison algorithms.
The global structure method, which performs Bayesian model averaging in conjunction with the patient-specific search heuristic, had better performance than either model selection with the patient-specific heuristic or non-patient-specific Bayesian model averaging. However, the additional learning of local structure by the local structure method did not lead to significant improvements over the use of global structure alone; implementation limitations of the local structure method may have limited its performance.
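The Bayesian model averaging at the core of this framework has a simple computational shape: each candidate model's prediction for the patient case is weighted by that model's normalized posterior probability given the training data. A minimal numerically stable sketch, assuming the per-model log posterior scores are supplied by the caller (the function name and interface are illustrative, not the dissertation's actual API):

```python
# Bayesian model averaging over candidate models: weight each model's
# predicted outcome probability by its normalized posterior weight.
import numpy as np

def bma_predict(log_scores, predictions):
    """log_scores[i]: log posterior score of model i (up to a constant);
    predictions[i]: that model's predicted probability of the outcome."""
    log_scores = np.asarray(log_scores, dtype=float)
    w = np.exp(log_scores - log_scores.max())   # subtract max for stability
    w /= w.sum()                                # normalize to posterior weights
    return float(np.dot(w, predictions))
```

When one model's score dominates, the average collapses to model selection; when several models are comparably plausible, the average hedges between them, which is what makes BMA attractive for single-patient prediction.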
Medical data mining using Bayesian network and DNA sequence analysis.
Lee Kit Ying. Thesis (M.Phil.)--Chinese University of Hong Kong, 2004. Includes bibliographical references (leaves 115-117). Abstracts in English and Chinese.
Contents:
1 Introduction: Project Background; Problem Specifications; Contributions; Thesis Organization
2 Background: Medical Data Mining (General Information; Related Research; Characteristics and Difficulties Encountered); DNA Sequence Analysis; Hepatitis B Virus (Virus Characteristics; Important Findings on the Virus); Bayesian Network and its Classifiers (Formal Definition; Existing Learning Algorithms; Evolutionary Algorithms and Hybrid EP (HEP); Bayesian Network Classifiers; Learning Algorithms for BN Classifiers)
3 Bayesian Network Classifier for Clinical Data: Related Work; Proposed BN-augmented Naive Bayes Classifier (BAN) (Definition; Learning Algorithm with HEP; Modifications on HEP); Proposed General Bayesian Network with Markov Blanket (GBN) (Definition; Learning Algorithm with HEP); Findings on Bayesian Network Parameters Calculation (Situation and Errors; Proposed Solution); Performance Analysis on Proposed BN Classifier Learning Algorithms (Experimental Methodology; Benchmark Data; Clinical Data; Discussion); Summary
4 Classification in DNA Analysis: Related Work; Problem Definition; Proposed Methodology Architecture (Overall Design; Important Components); Clustering; Feature Selection Algorithms (Information Gain; Other Approaches); Classification Algorithms (Naive Bayes Classifier; Decision Tree; Neural Networks; Other Approaches); Important Points on Evaluation (Errors; Independent Test); Performance Analysis on Classification of DNA Data (Experimental Methodology; Using Naive-Bayes Classifier; Using Decision Tree; Using Neural Network; Discussion); Summary
5 Adaptive HEP for Learning Bayesian Network Structure: Background (Objective; Related Work - AEGA); Feasibility Study; Proposed A-HEP Algorithm (Structural Dissimilarity Comparison; Dynamic Population Size); Evaluation on Proposed Algorithm (Experimental Methodology; Comparison on Running Time; Comparison on Fitness of Final Network; Comparison on Similarity to the Original Network; Parameter Study); Applications on Medical Domain (Discussion; An Example); Summary
6 Conclusion: Summary; Future Work
Bibliography
ALGORITHMS FOR CONSTRAINT-BASED LEARNING OF BAYESIAN NETWORK STRUCTURES WITH LARGE NUMBERS OF VARIABLES
Bayesian networks (BNs) are highly practical and successful tools for modeling probabilistic knowledge. They can be constructed by an expert, learned from data, or by a combination of the two. A popular approach to learning the structure of a BN is the constraint-based search (CBS) approach, with the PC algorithm being a prominent example. In recent years, we have been experiencing a data deluge. We have access to more data, big and small, than ever before. The exponential nature of BN algorithms, however, hinders large-scale analysis. Developments in parallel and distributed computing have made the computational power required for large-scale data processing widely available, yielding opportunities for developing parallel and distributed algorithms for BN learning and inference.
In this dissertation, (1) I propose two MapReduce versions of the PC algorithm, aimed at solving an increasingly common case: data is not necessarily massive in the number of records, but more and more so in the number of variables. (2) When the number of data records is small, the PC algorithm experiences problems in independence testing. Empirically, I explore a contradiction in the literature on how to resolve the case of having insufficient data when testing the independence of two variables: declare independence or dependence. (3) When BNs learned from data become complex in terms of graph density, they may require more parameters than we can feasibly store. I propose and evaluate five approaches to pruning a BN structure to guarantee that it will be tractable for storage and inference. I follow this up by proposing three approaches to improving the classification accuracy of a BN by modifying its structure.
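The skeleton phase of the PC algorithm that this dissertation parallelizes can be sketched sequentially: start from the complete undirected graph and delete an edge X - Y as soon as some conditioning set S, drawn from X's current neighbours and grown one size at a time, renders X and Y conditionally independent. The conditional-independence test is abstracted as a caller-supplied callback here; in practice it would be a statistical test on data, and it is exactly the piece that misbehaves when records are scarce, as point (2) above discusses.

```python
# Sequential sketch of the PC algorithm's skeleton phase (illustrative, not
# the dissertation's MapReduce implementation). ci_test(x, y, s) answers
# whether x and y are independent given the set s.
from itertools import combinations

def pc_skeleton(variables, ci_test, max_cond=2):
    """Return the learned skeleton as a set of frozenset edges."""
    adj = {v: set(variables) - {v} for v in variables}   # complete graph
    for size in range(max_cond + 1):                     # grow |S| gradually
        for x in variables:
            for y in list(adj[x]):
                others = adj[x] - {y}                    # candidate conditioners
                for s in combinations(sorted(others), size):
                    if ci_test(x, y, set(s)):
                        adj[x].discard(y)                # sever the edge
                        adj[y].discard(x)
                        break
    return {frozenset((a, b)) for a in adj for b in adj[a]}
```

Because each edge's tests depend only on the current adjacency of its endpoints, the independence tests within one conditioning-set size are natural units to farm out to mappers, which is the intuition behind a MapReduce formulation.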