5 research outputs found

    Knowledge-based variable selection for learning rules from proteomic data

    Background: The incorporation of biological knowledge can enhance the analysis of biomedical data. We present a novel method that uses a proteomic knowledge base to enhance the performance of a rule-learning algorithm in identifying putative biomarkers of disease from high-dimensional proteomic mass spectral data. In particular, we use the Empirical Proteomics Ontology Knowledge Base (EPO-KB), which contains previously identified and validated proteomic biomarkers, to select m/z values in a proteomic dataset prior to analysis to increase performance. Results: We show that using EPO-KB as a pre-processing method, specifically selecting all biomarkers found only in the biofluid of the proteomic dataset, reduces the dimensionality by 95% and provides a statistically significantly greater increase in performance over no variable selection and random variable selection. Conclusion: Knowledge-based variable selection, even with a sparsely populated resource such as the EPO-KB, increases the overall performance of rule learning for disease classification from high-dimensional proteomic mass spectra.
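The selection step described in this abstract can be illustrated with a minimal sketch. It is not the EPO-KB interface: the function name `select_known_mz`, the relative `tolerance` parameter, and the idea of matching by a fixed relative window are all assumptions made for illustration. The sketch keeps only those dataset features whose m/z lies close to an m/z already recorded in a knowledge base for the same biofluid.

```python
from bisect import bisect_left


def select_known_mz(dataset_mz, known_mz, tolerance=0.003):
    """Return indices of dataset m/z values that match a known biomarker m/z.

    dataset_mz: m/z value of each feature (column) in the dataset.
    known_mz:   m/z values taken from a knowledge base (hypothetical input).
    tolerance:  relative matching window (assumed; 0.003 means 0.3%).
    """
    known = sorted(known_mz)
    selected = []
    for idx, mz in enumerate(dataset_mz):
        pos = bisect_left(known, mz)
        # Only the nearest known values on either side can fall in the window.
        neighbours = known[max(pos - 1, 0):pos + 1]
        if any(abs(mz - k) <= tolerance * k for k in neighbours):
            selected.append(idx)
    return selected


# Hypothetical values: two of the four features match a known biomarker.
kept = select_known_mz([1000.0, 2000.0, 3000.0, 4005.0], [2001.0, 4000.0])
```

The returned indices would then be used to subset the feature matrix before the rule learner is run.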

    Application of an efficient Bayesian discretization method to biomedical data

    Background: Several data mining methods require data that are discrete, and other methods often perform better with discrete data. We introduce an efficient Bayesian discretization (EBD) method for optimal discretization of variables that runs efficiently on high-dimensional biomedical datasets. The EBD method consists of two components, namely, a Bayesian score to evaluate discretizations and a dynamic programming search procedure to efficiently search the space of possible discretizations. We compared the performance of EBD to Fayyad and Irani's (FI) discretization method, which is commonly used for discretization. Results: On 24 biomedical datasets obtained from high-throughput transcriptomic and proteomic studies, the classification performances of the C4.5 classifier and the naïve Bayes classifier were statistically significantly better when the predictor variables were discretized using EBD over FI. EBD was statistically significantly more stable to the variability of the datasets than FI. However, EBD was less robust, though not statistically significantly so, than FI and produced slightly more complex discretizations than FI. Conclusions: On a range of biomedical datasets, a Bayesian discretization method (EBD) yielded better classification performance and stability but was less robust than the widely used FI discretization method. The EBD discretization method is easy to implement, permits the incorporation of prior knowledge and belief, and is sufficiently fast for application to high-dimensional data.
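The two components named in this abstract, a score for candidate discretizations and a dynamic programming search, can be sketched as follows. This is not the EBD score from the paper: the interval score here is a generic Dirichlet-multinomial marginal likelihood with a uniform prior, and `interval_penalty` is an assumed stand-in for a prior over the number of intervals. Only the shape of the search is the point.

```python
from math import lgamma


def interval_score(counts):
    """Log marginal likelihood of class counts under a uniform Dirichlet prior."""
    c, n = len(counts), sum(counts)
    return lgamma(c) - lgamma(n + c) + sum(lgamma(k + 1) for k in counts)


def best_discretization(values, labels, num_classes, interval_penalty=1.0):
    """Return the cut points that maximise the total (assumed) score.

    labels are integers in range(num_classes). Runs in O(n^2 * num_classes).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    xs = [values[i] for i in order]
    ys = [labels[i] for i in order]
    n = len(xs)

    # prefix[j][c] = number of class-c labels among the first j sorted points.
    prefix = [[0] * num_classes]
    for y in ys:
        row = prefix[-1][:]
        row[y] += 1
        prefix.append(row)

    neg_inf = float("-inf")
    best = [0.0] + [neg_inf] * n   # best[j]: best score for the first j points
    back = [0] * (n + 1)           # back[j]: start index of the last interval
    for j in range(1, n + 1):
        # A boundary cannot separate two identical values.
        if j < n and xs[j] == xs[j - 1]:
            continue
        for i in range(j):
            if best[i] == neg_inf:
                continue
            counts = [prefix[j][c] - prefix[i][c] for c in range(num_classes)]
            score = best[i] + interval_score(counts) - interval_penalty
            if score > best[j]:
                best[j], back[j] = score, i

    # Walk the back-pointers to recover cut points (midpoints between values).
    cuts, j = [], n
    while back[j] > 0:
        i = back[j]
        cuts.append((xs[i - 1] + xs[i]) / 2)
        j = i
    return sorted(cuts)


# Hypothetical data: the two classes separate cleanly between 3 and 10.
cuts = best_discretization([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1], 2)
```

Because the score of a discretization is a sum over its intervals, the best solution for the first `j` points can be built from the best solutions for shorter prefixes, which is what makes the dynamic programme exact for this kind of additive score.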

    Rule learning for disease-specific biomarker discovery from clinical proteomic mass spectra

    A major goal of clinical proteomics is the identification of protein biomarkers from mass spectral analyses of fairly easily obtainable samples, such as blood serum, urine or cerebrospinal fluid, from patient populations. It is hoped that such protein biomarkers can be utilized for early detection of disease and examined further for potential therapeutic use. In this paper, we present the process for successful discovery of biomarkers that are indicators of Amyotrophic Lateral Sclerosis, a chronic neurodegenerative disease of motor neurons, by applying rule learning to the analysis of proteomic mass spectra from cerebrospinal fluid samples. We have implemented a wrapper-based rule learning framework within which the massive number of features that accumulate from mass spectral analyses of clinical samples can be evaluated by repeated invocation of a rule learner. Our framework facilitates evidence gathering, as indicated in this case study, and can speed up disease-specific biomarker discovery from clinical proteomic mass spectra. © Springer-Verlag Berlin Heidelberg 2006

    Transfer rule learning for biomarker discovery and verification from related data sets

    Biomarkers are a critical tool for the detection, diagnosis, monitoring and prognosis of diseases, and for understanding disease mechanisms in order to create treatments. Unfortunately, finding reliable biomarkers is often hampered by a number of practical problems, including scarcity of samples, the high dimensionality of the data, and measurement error. An important opportunity to make the most of these scarce data is to combine information from multiple related data sets for more effective biomarker discovery. Because the costs of creating large data sets for every disease of interest are likely to remain prohibitive, methods for more effectively making use of related biomarker data sets continue to be important. This thesis develops TRL, a novel framework for integrative biomarker discovery from related but separate data sets, such as those generated for similar biomarker profiling studies. TRL alleviates the problem of data scarcity by providing a way to validate knowledge learned from one data set and simultaneously learn new knowledge on a related data set. Unlike other transfer learning approaches, TRL takes prior knowledge in the form of interpretable, modular classification rules, and uses them to seed learning on a new data set. We evaluated TRL on 13 pairs of real-world biomarker discovery data sets, and found that TRL improves accuracy twice as often as it degrades it. TRL consists of four alternative methods for transfer and three measures of the amount of information transferred. By experimenting with these methods, we investigate the kinds of information necessary to preserve for transfer learning from related data sets. We found it is important to keep track of the relationships between biomarker values and disease state, and to consider during learning how rules will interact in the final model. If the source and target data are drawn from the same distribution, we found that the performance improvement and amount of transfer increase with increasing size of the source compared to the target data.

    A Bayesian Rule Generation Framework for 'Omic' Biomedical Data Analysis

    High-dimensional biomedical 'omic' datasets are accumulating rapidly from studies aimed at early detection and better management of human disease. These datasets pose tremendous challenges for analysis due to their large number of variables that represent measurements of biochemical molecules, such as proteins and mRNA, from bodily fluids or tissues extracted from a rather small cohort of samples. Machine learning methods have been applied to modeling these datasets, including rule learning methods, which have been successful in generating models that are easily interpretable by scientists. Rule learning methods have typically relied on a frequentist measure of certainty within IF-THEN (propositional) rules. In this dissertation, a Bayesian Rule Generation Framework (BRGF) is developed and tested that can produce rules with probabilities, thereby enabling a mathematically rigorous representation of uncertainty in rule models. The BRGF includes a novel Bayesian Discretization method combined with one or more search strategies for building constrained Bayesian Networks from data and converting them into probabilistic rules. Both global and local structures are built using different Bayesian Network generation algorithms, and the rule models generated from the networks are tested on public and private 'omic' datasets. We show that using a specific type of structure (Bayesian decision graphs) in tandem with a specific type of search method (parallel greedy) allows us to achieve statistically significantly higher overall performance than current state-of-the-art rule learning methods. Using the BRGF not only yields a statistically significant average performance gain on 'omic' biomedical data, but also provides the ability to incorporate prior information in a mathematically rigorous fashion for modeling purposes.
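The contrast this abstract draws between a frequentist certainty measure and a rule that carries a probability can be shown with a small sketch. This is not the BRGF: it attaches a posterior mean to a single IF-THEN rule using a symmetric Dirichlet prior, and the function name `rule_posterior`, the `alpha` parameter, and the feature name `mz_4000` are hypothetical.

```python
def rule_posterior(records, condition, target_class, classes, alpha=1.0):
    """Posterior mean of P(target_class | condition holds).

    records:   iterable of (features, label) pairs.
    condition: function mapping features to True/False (the IF part).
    alpha:     symmetric Dirichlet pseudo-count per class (assumed prior).
    """
    matched = [label for features, label in records if condition(features)]
    hits = sum(1 for label in matched if label == target_class)
    return (hits + alpha) / (len(matched) + alpha * len(classes))


# Hypothetical samples with one feature each.
samples = [
    ({"mz_4000": 1.5}, "case"),
    ({"mz_4000": 1.2}, "case"),
    ({"mz_4000": 0.2}, "control"),
    ({"mz_4000": 1.1}, "control"),
]
# Rule: IF mz_4000 > 1.0 THEN case
p = rule_posterior(samples, lambda f: f["mz_4000"] > 1.0, "case", ["case", "control"])
```

With three matching samples, two of them cases, the raw frequency would be 2/3, while the smoothed estimate is pulled toward the prior; a rule that matches no samples falls back to the prior alone rather than being undefined.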