
    From 'tree' based Bayesian networks to mutual information classifiers : deriving a singly connected network classifier using an information theory based technique

    For reasoning under uncertainty the Bayesian network has become the representation of choice. However, except where models are considered 'simple', the tasks of construction and inference are provably NP-hard. For modelling larger 'real' world problems this computational complexity has been addressed by methods that approximate the model. The Naive Bayes classifier, which makes strong assumptions of independence among features, is a common approach, whilst the class of trees is another, less extreme example. In this thesis we propose the use of an information theory based technique as a mechanism for inference in Singly Connected Networks. We call this a Mutual Information Measure classifier, as it corresponds to the restricted class of trees built from mutual information. We show that the new approach provides both an efficient and localised method of classification, with performance accuracies comparable with the less restricted general Bayesian networks. To improve the performance of the classifier, we additionally investigate the possibility of expanding the class Markov blanket by use of a Wrapper approach, and further show that the performance can be improved by focusing on the class Markov blanket and that the improvement is not at the expense of increased complexity. Finally, the two methods are applied to the task of diagnosis in a 'real' world medical domain, Acute Abdominal Pain. As this is known to be both a difficult and challenging domain to classify, the objective was to investigate the optimality claims that some researchers have made for the Naive Bayes classifier in this domain. Despite some loss of representation capabilities we show that the Mutual Information Measure classifier can be effectively applied to the domain and also provides a recognisable qualitative structure without violating 'real' world assertions. In respect of its 'selective' variant we further show that the improvement achieves a predictive accuracy comparable to the Naive Bayes classifier and that the Naive Bayes classifier's 'overall' performance is largely due to the contribution of the majority group Non-Specific Abdominal Pain, a group of exclusion
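
As an illustration of what "trees built from mutual information" can look like, the sketch below computes empirical mutual information between discrete variables and links them with a maximum spanning tree (the classic Chow-Liu construction). It is not the thesis's classifier, whose details are not given in the abstract; the function names `mutual_information` and `max_spanning_tree` and the dictionary-of-columns input are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def max_spanning_tree(columns):
    """Kruskal's algorithm over pairwise MI weights; returns a list of edges.

    `columns` is a hypothetical input format: {variable name: list of values}.
    """
    names = list(columns)
    edges = sorted(
        ((mutual_information(columns[a], columns[b]), a, b)
         for a, b in combinations(names, 2)),
        key=lambda e: e[0],
        reverse=True,
    )
    parent = {v: v for v in names}

    def find(v):
        # Union-find root lookup with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for _, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:  # adding this edge does not create a cycle
            parent[ra] = rb
            tree.append((a, b))
    return tree
```

Because every pair of variables is scored, the construction needs a number of mutual information computations that is quadratic in the number of variables, and the resulting tree has exactly one fewer edge than there are variables.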

    Bayesian network learning and applications in Bioinformatics

    A Bayesian network (BN) is a compact graphic representation of the probabilistic relationships among a set of random variables. The advantages of the BN formalism include its rigorous mathematical basis, the characteristics of locality both in knowledge representation and during inference, and the innate way to deal with uncertainty. Over the past decades, BNs have gained increasing interest in many areas, including bioinformatics, which studies mathematical and computing approaches to understanding biological processes. In this thesis, I develop new methods for BN structure learning with applications to biological network reconstruction and assessment. The first application is to reconstruct the genetic regulatory network (GRN), where each gene is modeled as a node and an edge indicates a regulatory relationship between two genes. In this task, we are given time-series microarray gene expression measurements for tens of thousands of genes, which can be modeled as true gene expressions mixed with noise in data generation, variability of the underlying biological systems, etc. We develop a novel BN structure learning algorithm for reconstructing GRNs. The second application is to develop a BN method for protein-protein interaction (PPI) assessment. PPIs are the foundation of most biological mechanisms, and knowledge of PPIs provides one of the most valuable resources from which annotations of genes and proteins can be discovered. Experimentally, recently developed high-throughput technologies have been used to reveal protein interactions in many organisms. However, high-throughput interaction data often contain a large number of spurious interactions. In this thesis, I develop a novel in silico model for PPI assessment. Our model is based on a BN that integrates heterogeneous data sources from different organisms. The main contributions are: 1. A new concept to depict the dynamic dependence relationships among random variables, which widely exist in biological processes, such as the relationships among genes and genes' products in regulatory networks and signaling pathways. This concept leads to a novel algorithm for dynamic Bayesian network learning. We apply it to time-series microarray gene expression data, and discover some missing links in a well-known regulatory pathway. Supporting evidence for those new causal relationships between genes has been found in the literature. 2. Discovery and theoretical proof of an asymptotic property of the K2 algorithm (a well-known efficient BN structure learning approach). This property has been used to identify Markov blankets (MB) in a Bayesian network, and further recover the BN structure. This hybrid algorithm is evaluated on a benchmark regulatory pathway, and obtains better results than some state-of-the-art Bayesian learning approaches. 3. A Bayesian network based integrative method which incorporates heterogeneous data sources from different organisms to predict protein-protein interactions (PPI) in a target organism. The framework is employed in human PPI prediction and in assessment of high-throughput PPI data. Furthermore, our experiments reveal some interesting biological results. 4. We introduce the learning of a TAN (Tree Augmented Naïve Bayes) based network, which brings computational simplicity and robustness to high-throughput PPI assessment. The empirical results show that our method outperforms naïve Bayes and a manually constructed Bayesian network, and additionally demonstrate that sufficient information from model organisms can achieve high accuracy in PPI prediction

    Hybrid feature selection technique for intrusion detection system

    The problems of high dimensionality have made feature selection one of the most important criteria in determining the efficiency of intrusion detection systems. In this study we have selected a hybrid feature selection model that potentially combines the strengths of both the filter and the wrapper selection procedures. The hybrid solution is expected to effectively select the optimal set of features for detecting intrusion. The proposed hybrid model was implemented using correlation feature selection (CFS) together with three different search techniques: best-first, greedy stepwise and genetic algorithm. The wrapper-based subset evaluation uses a random forest (RF) classifier to evaluate each of the features that were first selected by the filter method. The reduced feature sets for both the KDD99 and DARPA 1999 datasets were tested using the RF algorithm with ten-fold cross-validation in a supervised environment. The experimental results show that the hybrid feature selection produced satisfactory outcomes
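
The two-stage idea of filtering first and wrapping second can be sketched as below. This is a deliberately simplified stand-in, not the study's pipeline: the filter ranks features by absolute Pearson correlation with the label instead of CFS, and the wrapper scores subsets with a leave-one-out 1-nearest-neighbour classifier instead of a random forest with ten-fold cross-validation. All function names (`pearson`, `filter_stage`, `loo_accuracy`, `wrapper_stage`) are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def filter_stage(X, y, keep):
    """Filter: keep the `keep` features most correlated with the label."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:keep]


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, row in enumerate(X):
        nearest = min(
            (k for k in range(len(X)) if k != i),
            key=lambda k: sum((row[j] - X[k][j]) ** 2 for j in features),
        )
        correct += y[nearest] == y[i]
    return correct / len(X)


def wrapper_stage(X, y, candidates):
    """Wrapper: greedy forward selection over the filtered candidates."""
    selected, best_acc = [], 0.0
    improved = True
    while improved:
        improved = False
        for j in candidates:
            if j in selected:
                continue
            acc = loo_accuracy(X, y, selected + [j])
            if acc > best_acc:  # remember the best single addition this round
                best_acc, best_j, improved = acc, j, True
        if improved:
            selected.append(best_j)
    return selected, best_acc
```

A call such as `wrapper_stage(X, y, filter_stage(X, y, keep=10))` runs the cheap filter over all features and spends the more expensive classifier evaluations only on the survivors, which is the motivation for combining the two procedures.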

    TPDA2 Algorithm for Learning BN Structure From Missing Value and Outliers in Data Mining

    The Three-Phase Dependency Analysis (TPDA) algorithm was proved to be the most efficient algorithm, requiring at most O(N^4) Conditional Independence (CI) tests. By integrating TPDA with a "node topological sort algorithm", it can be used to learn Bayesian Network (BN) structure from data with missing values (named the TPDA1 algorithm). Outliers can then be reduced by applying an "outlier detection & removal algorithm" as pre-processing for TPDA1. The proposed TPDA2 algorithm combines these ideas: outlier detection & removal, TPDA, and node topological sort
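
The abstract does not say how the node topological sort is carried out or how TPDA1 uses the resulting order. As a generic illustration only, the sketch below orders the nodes of a directed acyclic graph with Kahn's algorithm so that every parent precedes its children; the function name `topological_sort` and the edge-list input are assumptions for this example.

```python
from collections import deque


def topological_sort(nodes, edges):
    """Return the nodes in an order where each parent precedes its children.

    `edges` is a list of (parent, child) pairs. Raises ValueError on a cycle,
    since a cyclic graph is not a valid Bayesian network structure.
    """
    indegree = {v: 0 for v in nodes}
    children = {v: [] for v in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    # Start from nodes that have no parents
    queue = deque(v for v in nodes if indegree[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for c in children[v]:
            indegree[c] -= 1
            if indegree[c] == 0:  # all parents of c are already placed
                queue.append(c)

    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order
```

For a network with edges A→B, A→C, B→D and C→D, the function places A first and D last, which is the kind of ordering in which a node's parents are always available before the node itself is processed.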

    A Robust and Fast System for CTC Computer-Aided Detection of Colorectal Lesions

    We present a complete, end-to-end computer-aided detection (CAD) system for identifying lesions in the colon, imaged with computed tomography (CT). This system includes facilities for colon segmentation, candidate generation, feature analysis, and classification. The algorithms have been designed to offer robust performance to variation in image data and patient preparation. By utilizing efficient 2D and 3D processing, software optimizations, multi-threading, feature selection, and an optimized cascade classifier, the CAD system quickly determines a set of detection marks. The colon CAD system has been validated on the largest set of data to date, and demonstrates excellent performance, in terms of its high sensitivity, low false positive rate, and computational efficiency