3,401 research outputs found

    Bayesian networks for omics data analysis

    This thesis focuses on two aspects of high-throughput technologies, i.e. data storage and data analysis, in particular in transcriptomics and metabolomics. Both technologies are part of a research field that is generally called ‘omics’ (or ‘-omics’, with a leading hyphen), which refers to genomics, transcriptomics, proteomics, or metabolomics. Although these techniques study different entities (genes, gene expression, proteins, or metabolites), they all use high-throughput technologies such as microarrays and mass spectrometry, and thus generate huge amounts of data. Experiments conducted using these technologies allow one to compare different states of a living cell at different levels, for example a healthy cell versus a cancer cell, or the effect of food on cell condition. The tools needed to apply omics technologies, in particular microarrays, are often manufactured by different vendors and require separate storage and analysis software for the data they generate. Moreover, experiments conducted using different technologies cannot be analyzed simultaneously to answer a biological question. Chapter 3 presents MADMAX, our software system which supports storage and analysis of data from multiple microarray platforms. It consists of a vendor-independent database which is tightly coupled with vendor-specific analysis tools. Upcoming technologies like metabolomics, proteomics and high-throughput sequencing can easily be incorporated in this system. Once the data are stored in this system, one wants to deduce a biologically relevant meaning from them, and here statistical and machine learning techniques play a key role. The aim of such analysis is to search for relationships between entities of interest, such as genes, metabolites or proteins. One of the major goals of these techniques is to search for causal relationships rather than mere correlations.
It is often emphasized in the literature that "correlation is not causation" because people tend to jump to conclusions by inferring causal relationships when they actually only see correlations. Statistical methods are often good at finding these correlations; linear regression and analysis of variance form the core of applied multivariate statistics. However, these techniques cannot find causal relationships, nor can they incorporate prior knowledge of the biological domain. Graphical models, a machine learning technique, do not suffer from these limitations. Graphical models, a combination of graph theory, statistics and information science, are one of the most exciting developments today in the field of machine learning applied to biological problems (see chapter 2 for a general introduction). This thesis deals with a special type of graphical model known as a probabilistic graphical model, belief network or Bayesian network. The advantage of Bayesian networks over classical statistical techniques is that they allow the incorporation of background knowledge from a biological domain, and that analysis of data is intuitive as it is represented in the form of graphs (nodes and edges). Standard statistical techniques are good at describing the data but are not able to find non-linear relations, whereas Bayesian networks allow prediction and the discovery of non-linear relations. Moreover, Bayesian networks allow hierarchical representation of data, which makes them particularly useful for representing biological data, since most biological processes are hierarchical by nature. Once we have such a causal graph, made either by a computer program or constructed manually, we can predict the effects of a certain entity by manipulating the state of other entities, or make backward inferences from effects to causes.
Of course, if the graph is big, doing the necessary calculations can be very difficult and CPU-expensive, and in such cases approximate methods are used. Chapter 4 demonstrates the use of Bayesian networks to determine the metabolic state of feeding and fasting mice in order to assess the effect of a high fat diet on gene expression. This chapter also shows how selection of genes based on key biological processes generates more informative results than standard statistical tests. In chapter 5 the use of Bayesian networks is shown on the combination of gene expression data and clinical parameters, to determine the effect of smoking on gene expression and which genes are responsible for the DNA damage and the rise in plasma cotinine levels in the blood of a smoking population. This study was conducted at Maastricht University where 22 twin smokers were profiled. Chapter 6 presents the reconstruction of a key metabolic pathway which plays an important role in ripening of tomatoes, thus showing the versatility of the use of Bayesian networks in metabolomics data analysis. The general trend in research shows a flood of data emerging from sequencing and metabolomics experiments. This means that to perform data mining on these data one requires intelligent techniques that are computationally feasible and able to take the knowledge of experts into account to generate relevant results. Graphical models fit this paradigm well and we expect them to play a key role in mining the data generated from omics experiments.
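The forward prediction and backward inference described in this abstract can be illustrated with a minimal sketch. It is not taken from the thesis: the three-node chain (`diet` → `gene` → `marker`), the variable names, and every probability below are made-up assumptions chosen only to show exact inference by enumeration on a small discrete Bayesian network.

```python
from itertools import product

# Hypothetical conditional probability tables (illustrative numbers only).
P_DIET = {True: 0.5, False: 0.5}              # P(diet)
P_GENE_GIVEN_DIET = {True: 0.8, False: 0.2}   # P(gene=True | diet)
P_MARKER_GIVEN_GENE = {True: 0.9, False: 0.1} # P(marker=True | gene)

NAMES = ("diet", "gene", "marker")


def joint(diet, gene, marker):
    """Joint probability from the chain factorisation diet -> gene -> marker."""
    p = P_DIET[diet]
    p_gene = P_GENE_GIVEN_DIET[diet]
    p *= p_gene if gene else 1.0 - p_gene
    p_marker = P_MARKER_GIVEN_GENE[gene]
    p *= p_marker if marker else 1.0 - p_marker
    return p


def query(target, evidence):
    """P(target=True | evidence), summing the joint over all consistent worlds."""
    numerator = 0.0
    denominator = 0.0
    for values in product([True, False], repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue  # world contradicts the evidence
        p = joint(**world)
        denominator += p
        if world[target]:
            numerator += p
    return numerator / denominator


# Forward: from cause to effect.
forward = query("marker", {"diet": True})
# Backward: from observed effect to probable cause.
backward = query("diet", {"marker": True})
```

Enumeration visits every joint state, so its cost doubles with each added binary node; this is why, as the abstract notes, approximate methods are used for large graphs. Note also that the forward query here conditions on a root node, where observing and manipulating coincide; for non-root nodes an intervention would need separate treatment.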

    A hybrid algorithm for Bayesian network structure learning with application to multi-label learning

    We present a novel hybrid algorithm for Bayesian network structure learning, called H2PC. It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to orient the edges. The algorithm is based on divide-and-conquer constraint-based subroutines to learn the local structure around a target variable. We conduct two series of experimental comparisons of H2PC against Max-Min Hill-Climbing (MMHC), which is currently the most powerful state-of-the-art algorithm for Bayesian network structure learning. First, we use eight well-known Bayesian network benchmarks with various data sizes to assess the quality of the learned structure returned by the algorithms. Our extensive experiments show that H2PC outperforms MMHC in terms of goodness of fit to new data and quality of the network structure with respect to the true dependence structure of the data. Second, we investigate H2PC's ability to solve the multi-label learning problem. We provide theoretical results to characterize and identify graphically the so-called minimal label powersets that appear as irreducible factors in the joint distribution under the faithfulness condition. The multi-label learning problem is then decomposed into a series of multi-class classification problems, where each multi-class variable encodes a label powerset. H2PC is shown to compare favorably to MMHC in terms of global classification accuracy over ten multi-label data sets covering different application domains. Overall, our experiments support the conclusions that local structural learning with H2PC in the form of local neighborhood induction is a theoretically well-motivated and empirically effective learning framework that is well suited to multi-label learning. The source code (in R) of H2PC as well as all data sets used for the empirical tests are publicly available. Comment: arXiv admin note: text overlap with arXiv:1101.5184 by other author
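The decomposition step mentioned in this abstract, where each multi-class variable encodes a label powerset, can be sketched as follows. This is only an illustration of the encoding, not the H2PC code: the function name `encode_label_powersets` is hypothetical, and the grouping of labels is assumed to be given (in the paper it is identified from the learned graph).

```python
def encode_label_powersets(label_rows, groups):
    """Turn binary label rows into one multi-class column per label group.

    label_rows: list of rows, each a list of 0/1 label indicators.
    groups:     list of tuples of column indices; assumed to partition the labels.
    Returns (columns, codebooks): columns[g][i] is the class id of sample i
    for group g, and codebooks[g] maps each observed label combination to its id.
    """
    columns = []
    codebooks = []
    for group in groups:
        codebook = {}
        column = []
        for row in label_rows:
            combination = tuple(row[j] for j in group)
            # Assign class ids in order of first appearance.
            if combination not in codebook:
                codebook[combination] = len(codebook)
            column.append(codebook[combination])
        columns.append(column)
        codebooks.append(codebook)
    return columns, codebooks


# Toy data: three labels, with labels 0 and 1 assumed to form one group.
toy_labels = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [1, 0, 1]]
toy_groups = [(0, 1), (2,)]
toy_columns, toy_codebooks = encode_label_powersets(toy_labels, toy_groups)
```

Each resulting column can then be handed to an ordinary multi-class classifier. Only label combinations that occur in the data receive a class id, which keeps the number of classes per group bounded by the number of samples.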

    Boosting Local Causal Discovery in High-Dimensional Expression Data

    We study the performance of Local Causal Discovery (LCD), a simple and efficient constraint-based method for causal discovery, in predicting causal effects in large-scale gene expression data. We construct practical estimators specific to the high-dimensional regime. Inspired by the ICP algorithm, we use an optional preselection method and two different statistical tests. Empirically, the resulting LCD estimator is seen to closely approach the accuracy of ICP, the state-of-the-art method, while it is algorithmically simpler and computationally more efficient. Comment: Accepted at BIBM / CABB 201
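For readers unfamiliar with LCD, its core rule can be sketched on a single triple of variables. Given a context variable C that is assumed not to be caused by the system variables, LCD concludes that X causes Y when C and X are dependent, X and Y are dependent, and C is independent of Y given X. The sketch below is an assumption-laden illustration, not the estimator from the paper: it uses a Fisher z test on (partial) correlations, a single threshold `alpha`, and hypothetical helper names and simulated data.

```python
import math

import numpy as np


def fisher_z_pvalue(r, n, n_cond=0):
    """Two-sided p-value for a (partial) correlation r via Fisher's z transform."""
    r = float(np.clip(r, -0.999999, 0.999999))
    stat = math.sqrt(n - n_cond - 3) * abs(math.atanh(r))
    return math.erfc(stat / math.sqrt(2.0))


def corr(a, b):
    """Pearson correlation of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


def partial_corr(a, b, given):
    """Correlation of a and b after linearly adjusting for one variable."""
    r_ab, r_ag, r_bg = corr(a, b), corr(a, given), corr(b, given)
    return (r_ab - r_ag * r_bg) / math.sqrt((1 - r_ag**2) * (1 - r_bg**2))


def lcd_triple(c, x, y, alpha=0.01):
    """Return True if the LCD pattern (C dep X, X dep Y, C indep Y | X) holds."""
    n = len(c)
    dep_cx = fisher_z_pvalue(corr(c, x), n) < alpha
    dep_xy = fisher_z_pvalue(corr(x, y), n) < alpha
    indep_cy_given_x = fisher_z_pvalue(partial_corr(c, y, x), n, n_cond=1) >= alpha
    return dep_cx and dep_xy and indep_cy_given_x


def simulate_chain(seed, n=500):
    """Hypothetical data from the chain C -> X -> Y with a binary context C."""
    rng = np.random.default_rng(seed)
    c = rng.integers(0, 2, size=n).astype(float)
    x = 2.0 * c + rng.normal(size=n)
    y = 1.5 * x + rng.normal(size=n)
    return c, x, y


c, x, y = simulate_chain(seed=0)
x_causes_y = lcd_triple(c, x, y)  # pattern expected to hold for most samples
y_causes_x = lcd_triple(c, y, x)  # reversed roles: conditioning on Y does not screen off X
```

Because the independence check is a statistical test, the correct direction is occasionally rejected at rate roughly `alpha`. The tests are also linear, so non-linear dependence would need a different test.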

    PPARα siRNA–Treated Expression Profiles Uncover the Causal Sufficiency Network for Compound-Induced Liver Hypertrophy

    Uncovering pathways underlying drug-induced toxicity is a fundamental objective in the field of toxicogenomics. Developing mechanism-based toxicity biomarkers requires the identification of such novel pathways and the order of their sufficiency in causing a phenotypic response. Genome-wide RNA interference (RNAi) phenotypic screening has emerged as an effective tool in unveiling the genes essential for specific cellular functions and biological activities. However, eliciting the relative contribution of and sufficiency relationships among the genes identified remains challenging. In the rodent, the most widely used animal model in preclinical studies, it is unrealistic to exhaustively examine all potential interactions by RNAi screening. Application of existing computational approaches to infer regulatory networks with biological outcomes in the rodent is limited by the requirements for a large number of targeted permutations. Therefore, we developed a two-step relay method that requires only one targeted perturbation for genome-wide de novo pathway discovery. Using expression profiles in response to small interfering RNAs (siRNAs) against the gene for peroxisome proliferator-activated receptor α (Ppara), our method unveiled the potential causal sufficiency order network for liver hypertrophy in the rodent. The validity of the inferred 16 causal transcripts or 15 known genes for PPARα-induced liver hypertrophy is supported by their ability to predict non-PPARα–induced liver hypertrophy with 84% sensitivity and 76% specificity. Simulation shows that the probability of achieving such predictive accuracy without the inferred causal relationship is exceedingly small (p < 0.005). Five of the most sufficient causal genes have been previously disrupted in mouse models; the resulting phenotypic changes in the liver support the inferred causal roles in liver hypertrophy. 
Our results demonstrate the feasibility of defining pathways mediating drug-induced toxicity from siRNA-treated expression profiles. When combined with phenotypic evaluation, our approach should help to unleash the full potential of siRNAs in systematically unveiling the molecular mechanism of biological events.

    Learning Large-Scale Bayesian Networks with the sparsebn Package

    Learning graphical models from data is an important problem with wide applications, ranging from genomics to the social sciences. Nowadays datasets often have upwards of thousands (sometimes tens or hundreds of thousands) of variables and far fewer samples. To meet this challenge, we have developed a new R package called sparsebn for learning the structure of large, sparse graphical models with a focus on Bayesian networks. While there are many existing software packages for this task, this package focuses on the unique setting of learning large networks from high-dimensional data, possibly with interventions. As such, the methods provided place a premium on scalability and consistency in a high-dimensional setting. Furthermore, in the presence of interventions, the methods implemented here achieve the goal of learning a causal network from data. Additionally, the sparsebn package is fully compatible with existing software packages for network analysis. Comment: To appear in the Journal of Statistical Software, 39 pages, 7 figure