265 research outputs found

    Digging into acceptor splice site prediction : an iterative feature selection approach

    Feature selection techniques are often used to reduce data dimensionality, increase classification performance, and gain insight into the processes that generated the data. In this paper, we describe an iterative procedure of feature selection and feature construction steps that improves the classification of acceptor splice sites, an important subtask of gene prediction. We show that acceptor prediction can benefit from feature selection, and describe how feature selection techniques can be used to gain new insights into the classification of acceptor sites. This is illustrated by the identification of a new, biologically motivated feature: the AG-scanning feature. The results described in this paper contribute both to the domain of gene prediction and to research in feature selection techniques, describing a new wrapper-based feature weighting method that aids in knowledge discovery when dealing with complex datasets.

    Markov blanket: efficient strategy for feature subset selection method for high dimensionality microarray cancer datasets

    Feature subset selection methods are now very important, especially in application areas where datasets with tens or hundreds of thousands of variables (genes) are available. These methods help select a small number of variables out of the thousands of genes in microarray datasets, giving a more accurate and balanced classification. Efficient gene selection eases the computational burden of the subsequent classification task and can yield a subset of genes without loss of classification performance. In classifying microarray data, the main objective of gene selection is to find genes that retain the maximum amount of relevant information about the class while minimizing classification errors. In this paper, we explain the importance of feature subset selection methods in the machine learning and data mining fields. The analysis of microarray expression data was used to check whether global biological differences underlie common pathological features in different types of cancer datasets and to identify genes that might anticipate the clinical behavior of this disease. Gene expression data contain large amounts of raw data that need analyzing to obtain useful information for specific biological and medical applications. One way of finding relevant (and removing redundant) genes is to use a Bayesian network based on the Markov blanket [1]. We present and compare the performance of different approaches to feature (gene) subset selection based on wrapper and Markov blanket models for five microarray cancer datasets. The first approach uses memetic algorithms (MAs) for feature selection.
The second approach uses MRMR (Minimum Redundancy Maximum Relevance) for feature subset selection, hybridized with genetic search optimization techniques, and then compares the performance of the Markov blanket model with that of the most common classical classification algorithms on the selected set of features. For the memetic algorithm, we present a comparison between two embedded approaches to feature subset selection: the wrapper-filter feature selection algorithm (WFFSA) and the Markov Blanket Embedded Genetic Algorithm (MBEGA). The memetic algorithm relies on genetic operators (crossover, mutation) and a dedicated local search procedure. For the comparisons, we rely on two evaluation techniques for training and testing data: 10-fold cross-validation and 30-bootstrapping. The results for the memetic algorithm show that MBEGA often outperforms WFFSA, yielding more significant differentiation among the different microarray cancer datasets. In the second part of this paper, we focus mainly on MRMR for feature subset selection and on the Bayesian network based on the Markov blanket (MB) model, both of which are useful for building a good predictor and for defying the curse of dimensionality to improve prediction performance. These methods cover a wide range of concerns: providing a better definition of the objective function, feature construction, feature ranking, efficient search methods, and feature validity assessment methods, as well as defining the relationships among attributes to make predictions. We present performance measures for some common (or classical) classification algorithms (Naive Bayes, support vector machine [LIBSVM], k-nearest neighbor, and AdaBoostM ensembling) before and after applying the MRMR method. We compare the performance of the Bayesian network classification algorithm based on the Markov blanket model with the performance of these common classification algorithms.
The performance measures show that the classification algorithm based on the Bayesian network of the Markov blanket model achieves higher accuracy rates than the other classical classification algorithms on the cancer microarray datasets. Bayesian networks depend on relationships among attributes to make predictions. The Markov blanket (MB) of a variable provides all the information necessary for predicting its value. In this paper, we recommend the Bayesian network based on the Markov blanket for learning and classification, as it is highly effective and efficient on feature subset selection measures. Master of Science (MSc) in Computational Science
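
To make the MRMR idea in the abstract above concrete, here is a minimal sketch. It is not the thesis's implementation: it assumes discrete feature values, swaps the genetic search mentioned in the abstract for a plain greedy forward search, and uses the common "relevance minus mean redundancy" score. The names `mutual_information`, `mrmr`, `genes`, and `labels` are hypothetical, and the data are a toy example.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def mrmr(features, labels, k):
    """Greedily pick up to k feature names by relevance minus mean redundancy."""
    relevance = {name: mutual_information(col, labels) for name, col in features.items()}
    picked = []

    def score(name):
        # Relevance to the class, penalized by average overlap with picked features.
        if not picked:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[p]) for p in picked
        ) / len(picked)
        return relevance[name] - redundancy

    while len(picked) < min(k, len(features)):
        candidates = [name for name in features if name not in picked]
        picked.append(max(candidates, key=score))
    return picked


# Toy data (assumption): g2 duplicates g1, while g3 carries independent information.
genes = {
    "g1": [0, 0, 0, 0, 1, 1, 1, 1],
    "g2": [0, 0, 0, 0, 1, 1, 1, 1],
    "g3": [0, 0, 1, 1, 0, 0, 1, 1],
}
labels = [0, 0, 1, 1, 2, 2, 3, 3]  # class = 2 * g1 + g3

chosen = mrmr(genes, labels, k=2)
print(chosen)
```

On this toy input the redundancy penalty keeps only one of the two duplicated genes and pairs it with `g3`, which is the behavior the "minimum redundancy" part of the name refers to.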

    Evaluation of the Performance of the Markov Blanket Bayesian Classifier Algorithm

    The Markov Blanket Bayesian Classifier is a recently proposed algorithm for construction of probabilistic classifiers. This paper presents an empirical comparison of the MBBC algorithm with three other Bayesian classifiers: Naive Bayes, Tree-Augmented Naive Bayes and a general Bayesian network. All of these are implemented using the K2 framework of Cooper and Herskovits. The classifiers are compared in terms of their performance (using simple accuracy measures and ROC curves) and speed, on a range of standard benchmark data sets. It is concluded that MBBC is competitive in terms of speed and accuracy with the other algorithms considered. Comment: 9 pages; Technical Report No. NUIG-IT-011002, Department of Information Technology, National University of Ireland, Galway (2002)
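
As a reminder of what a Markov blanket is in a Bayesian network, the sketch below computes it for one node of a known directed acyclic graph: the node's parents, its children, and the other parents of those children. This is only the graph-theoretic definition, not the MBBC algorithm, which learns the structure from data. The function name `markov_blanket` and the example graph are hypothetical.

```python
def markov_blanket(parents, node):
    """Markov blanket of `node` in a DAG given as {child: set_of_parents}."""
    children = {child for child, ps in parents.items() if node in ps}
    # Co-parents ("spouses") share at least one child with the node.
    spouses = {p for child in children for p in parents[child]}
    return (set(parents.get(node, ())) | children | spouses) - {node}


# Hypothetical network: g1 -> class -> g2 <- g3, g2 -> g4, and an isolated g5.
dag = {
    "g1": set(),
    "class": {"g1"},
    "g3": set(),
    "g2": {"class", "g3"},
    "g4": {"g2"},
    "g5": set(),
}

blanket = markov_blanket(dag, "class")
print(sorted(blanket))  # ['g1', 'g2', 'g3']
```

Here `g4` and `g5` fall outside the blanket, so under this structure they add nothing to the prediction of `class` once `g1`, `g2`, and `g3` are known.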

    Refining gene signatures: a Bayesian approach

    Background: In high density arrays, the identification of relevant genes for disease classification is complicated by not only the curse of dimensionality but also the highly correlated nature of the array data. In this paper, we are interested in the question of how many and which genes should be selected for a disease class prediction. Our work consists of a Bayesian supervised statistical learning approach to refine gene signatures with a regularization which penalizes for the correlation between the variables selected. Results: Our simulation results show that we can most often recover the correct subset of genes that predict the class as compared to other methods, even when accuracy and subset size remain the same. On real microarray datasets, we show that our approach can refine gene signatures to obtain either the same or better predictive performance than other existing methods with a smaller number of genes. Conclusions: Our novel Bayesian approach includes a prior which penalizes highly correlated features in model selection and is able to extract key genes in the highly correlated context of microarray data. The methodology in the paper is described in the context of microarray data, but can be applied to any array data (such as micro RNA, for example) as a first step towards predictive modeling of cancer pathways. A user-friendly software implementation of the method is available.