1,766 research outputs found
Bayesian network learning and applications in Bioinformatics
Abstract: A Bayesian network (BN) is a compact graphical representation of the probabilistic relationships among a set of random variables. The advantages of the BN formalism include its rigorous mathematical basis, its locality both in knowledge representation and during inference, and its natural handling of uncertainty. Over the past decades, BNs have gained increasing interest in many areas, including bioinformatics, which studies mathematical and computational approaches to understanding biological processes. In this thesis, I develop new methods for BN structure learning with applications to biological network reconstruction and assessment. The first application is to reconstruct the genetic regulatory network (GRN), where each gene is modeled as a node and an edge indicates a regulatory relationship between two genes. In this task, we are given time-series microarray gene expression measurements for tens of thousands of genes, which can be modeled as true gene expression mixed with noise from data generation, variability of the underlying biological systems, and so on. We develop a novel BN structure learning algorithm for reconstructing GRNs. The second application is to develop a BN method for protein-protein interaction (PPI) assessment. PPIs are the foundation of most biological mechanisms, and knowledge of PPIs provides one of the most valuable resources from which annotations of genes and proteins can be discovered. Experimentally, recently developed high-throughput technologies have been used to reveal protein interactions in many organisms. However, high-throughput interaction data often contain a large number of spurious interactions. In this thesis, I develop a novel in silico model for PPI assessment. Our model is based on a BN that integrates heterogeneous data sources from different organisms. The main contributions are:
1. A new concept to depict the dynamic dependence relationships among random variables, which are widespread in biological processes, such as the relationships among genes and gene products in regulatory networks and signaling pathways. This concept leads to a novel algorithm for dynamic Bayesian network learning. We apply it to time-series microarray gene expression data and discover some missing links in a well-known regulatory pathway. These new causal relationships between genes are supported by evidence in the literature.
2. Discovery and theoretical proof of an asymptotic property of the K2 algorithm (a well-known, efficient BN structure learning approach). This property is used to identify Markov blankets (MB) in a Bayesian network and then to recover the BN structure. This hybrid algorithm is evaluated on a benchmark regulatory pathway and obtains better results than some state-of-the-art Bayesian learning approaches.
3. A Bayesian-network-based integrative method that incorporates heterogeneous data sources from different organisms to predict protein-protein interactions (PPI) in a target organism. The framework is employed in human PPI prediction and in the assessment of high-throughput PPI data. Furthermore, our experiments reveal some interesting biological results.
4. We introduce the learning of a TAN (Tree Augmented Naïve Bayes) based network, which brings computational simplicity and robustness to high-throughput PPI assessment. The empirical results show that our method outperforms naïve Bayes and a manually constructed Bayesian network, and additionally demonstrate that sufficient information from model organisms can achieve high accuracy in PPI prediction.
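For readers unfamiliar with the K2 algorithm mentioned in contribution 2, the following is a minimal sketch of the standard K2 procedure (greedy parent selection under a fixed variable ordering, scored with the Cooper-Herskovits metric). It is not the hybrid Markov-blanket algorithm of the thesis, which the abstract does not specify. The function names, the `max_parents` limit, and the toy data are assumptions made for illustration.

```python
from collections import Counter
from math import lgamma


def k2_log_score(data, child, parents, arity):
    """Log Cooper-Herskovits (K2) score of `child` given a list of `parents`.

    data  : list of rows, each a tuple of discrete values 0..arity[v]-1
    arity : number of states of each variable, indexed by column
    """
    r = arity[child]
    counts = Counter()  # N_ijk: (parent configuration, child value) -> count
    totals = Counter()  # N_ij : parent configuration -> count
    for row in data:
        cfg = tuple(row[p] for p in parents)
        counts[(cfg, row[child])] += 1
        totals[cfg] += 1
    score = 0.0
    # Parent configurations that never occur contribute 0 and are skipped.
    for cfg, n_ij in totals.items():
        score += lgamma(r) - lgamma(n_ij + r)  # log[(r-1)! / (N_ij + r - 1)!]
        for k in range(r):
            score += lgamma(counts[(cfg, k)] + 1)  # log(N_ijk!)
    return score


def k2_search(data, order, arity, max_parents=2):
    """Greedy K2: each node may only take parents that precede it in `order`."""
    parents = {v: [] for v in order}
    for i, v in enumerate(order):
        best = k2_log_score(data, v, parents[v], arity)
        candidates = set(order[:i])
        while candidates and len(parents[v]) < max_parents:
            top_score, top = max(
                (k2_log_score(data, v, parents[v] + [c], arity), c)
                for c in sorted(candidates)
            )
            if top_score <= best:  # no candidate improves the score
                break
            parents[v].append(top)
            candidates.remove(top)
            best = top_score
    return parents


# Toy data (hypothetical): column 1 copies column 0, column 2 is independent.
toy_data = [(i % 2, i % 2, (i // 2) % 2) for i in range(40)]
toy_arity = [2, 2, 2]
learned = k2_search(toy_data, order=[0, 1, 2], arity=toy_arity)
```

On the toy data, the search should attach variable 0 as the parent of variable 1 and leave variable 2 without parents, because adding a parent to an independent variable lowers the score.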
Building trajectories through clinical data to model disease progression
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. Clinical trials are typically conducted over a population within a defined time period in order to illuminate certain characteristics of a health issue or disease process. These cross-sectional studies provide a snapshot of these disease processes over a large number of people but do not allow us to model the temporal nature of disease, which is essential for modeling detailed prognostic predictions. Longitudinal studies, on the other hand, are used to explore how these processes develop over time in a number of people but can be expensive and time-consuming, and many studies only cover a relatively small window within the disease process. This thesis describes the application of intelligent data analysis techniques for extracting information from time series generated by different diseases. The aim of this thesis is to identify intermediate stages in a disease process and sub-categories of the disease exhibiting subtly different symptoms. It explores the use of a bootstrap technique that fits trajectories through the data, generating "pseudo time-series". It addresses issues including how clinical variables interact as a disease progresses along the trajectories in the data, and how to automatically identify different disease states along these trajectories, as well as the transitions between them. The thesis documents how reliable time-series models can be created from large amounts of historical cross-sectional data, and how a novel relabelling/latent variable approach has enabled the exploration of the temporal nature of disease progression. The proposed algorithms are tested extensively on simulated data and on three real clinical datasets. Finally, a study is carried out to explore whether we can "calibrate" pseudo time-series models with real longitudinal data in order to improve them. Plausible directions for future research are discussed at the end of the thesis.
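The idea of fitting trajectories through cross-sectional data can be illustrated with a rough sketch. The abstract does not describe the actual procedure, so everything here is an assumption: each pseudo time-series starts from a randomly chosen point labelled "healthy", the remaining points are drawn with replacement (the bootstrap step), and they are ordered by a simple greedy nearest-neighbour walk. The function names and labels are hypothetical.

```python
import math
import random


def order_trajectory(points, start, members):
    """Greedily chain sampled points, always stepping to the nearest remaining one."""
    remaining = list(members)
    path = [start]
    while remaining:
        last = points[path[-1]]
        nxt = min(remaining, key=lambda i: math.dist(last, points[i]))
        remaining.remove(nxt)  # removes one occurrence; duplicates stay in play
        path.append(nxt)
    return path


def bootstrap_trajectories(points, labels, n_series, sample_size, seed=0):
    """Build `n_series` pseudo time-series, each a list of indices into `points`."""
    rng = random.Random(seed)
    healthy = [i for i, lab in enumerate(labels) if lab == "healthy"]
    series = []
    for _ in range(n_series):
        start = rng.choice(healthy)  # assumed starting state
        # Bootstrap step: resample the other members with replacement.
        members = [rng.randrange(len(points)) for _ in range(sample_size - 1)]
        series.append(order_trajectory(points, start, members))
    return series


# Toy one-dimensional "clinical variable" (hypothetical values and labels).
toy_points = [[0.0], [3.0], [1.0], [2.0]]
toy_labels = ["healthy", "disease", "healthy", "disease"]
toy_series = bootstrap_trajectories(toy_points, toy_labels, n_series=5, sample_size=4)
```

Each resulting index sequence can then be treated as an ordered series for fitting a temporal model. A real implementation would need a more principled ordering and a defined end state.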
Microarray Data Mining and Gene Regulatory Network Analysis
The novel molecular biological technology, microarray, makes it feasible to obtain quantitative measurements of the expression of thousands of genes present in a biological sample simultaneously. Genome-wide expression data generated from this technology promise to uncover implicit, previously unknown biological knowledge. In this study, several problems in microarray data mining were investigated, including feature (gene) selection, classifier gene identification, generation of reference genetic interaction networks for non-model organisms, and gene regulatory network (GRN) reconstruction using time-series gene expression data. Most existing computational models employed to infer gene regulatory networks suffer from either low accuracy or high computational complexity. To overcome such limitations, the following strategies were proposed to integrate bioinformatics data mining techniques with existing GRN inference algorithms, which enables the discovery of novel biological knowledge. An integrated statistical and machine learning (ISML) pipeline was developed for feature selection and classifier gene identification to address the curse of dimensionality as well as the huge search space. Using the selected classifier genes as seeds, a scale-up technique is applied to search through major databases of genetic interaction networks, metabolic pathways, etc.
By curating relevant genes and BLASTing genomic sequences of non-model organisms against well-studied genetic model organisms, a reference gene regulatory network for less-studied organisms was built and used both as prior knowledge and for model validation in GRN reconstruction. Networks of gene interactions were inferred using a Dynamic Bayesian Network (DBN) approach and were analyzed to elucidate the dynamics caused by perturbations. Our proposed pipelines were applied to investigate molecular mechanisms for chemical-induced reversible neurotoxicity.
Biological Applications of Knowledge Graph Embedding Models
Complex biological systems are traditionally modelled as graphs of interconnected biological entities. These graphs, i.e. biological knowledge graphs, are then processed using graph exploratory approaches to perform different types of analytical and predictive tasks. Despite the high predictive accuracy of these approaches, they have limited scalability due to their dependency on time-consuming path exploratory procedures. In recent years, owing to the rapid advances of computational technologies, new approaches for modelling graphs and mining them with high accuracy and scalability have emerged. These approaches, i.e. knowledge graph embedding (KGE) models, operate by learning low-rank vector representations of graph nodes and edges that preserve the graph's inherent structure. These approaches were used to analyse knowledge graphs from different domains where they showed superior performance and accuracy compared to previous graph exploratory approaches. In this work, we study this class of models in the context of biological knowledge graphs and their different applications. We then show how KGE models can be a natural fit for representing complex biological knowledge modelled as graphs. We also discuss their predictive and analytical capabilities in different biology applications. In this regard, we present two example case studies that demonstrate the capabilities of KGE models: prediction of drug-target interactions and polypharmacy side effects. Finally, we analyse different practical considerations for KGEs, and we discuss possible opportunities and challenges related to adopting them for modelling biological systems. The work presented in this paper was supported by the CLARIFY project that has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 875160, and by Insight research centre supported by the Science Foundation Ireland (SFI) grant (12/RC/2289_2). Peer-reviewed. 2021-02-1
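To make "vector representations of graph nodes and edges" concrete, here is a minimal sketch of how one well-known KGE model, TransE, scores a (head, relation, tail) triple: a triple is plausible when the head vector plus the relation vector lands close to the tail vector. The abstract does not say which KGE models the paper uses, so the choice of TransE is an assumption, and the embedding values and entity names below are hypothetical and hand-picked rather than learned.

```python
import math


def transe_score(head, relation, tail):
    """TransE plausibility: negative L2 distance between head + relation and tail."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def rank_tails(head, relation, candidates):
    """Rank candidate tail entities from most to least plausible."""
    return sorted(
        candidates,
        key=lambda name: transe_score(head, relation, candidates[name]),
        reverse=True,
    )


# Hypothetical 2-D embeddings; in practice these are learned from the graph.
drug = [0.1, 0.2]
targets_relation = [1.0, 0.0]
proteins = {
    "protein_a": [1.1, 0.2],   # close to drug + targets_relation
    "protein_b": [-1.0, 0.5],  # far from drug + targets_relation
}
ranking = rank_tails(drug, targets_relation, proteins)
```

Ranking candidate tails in this way is how an embedding model can propose likely drug-target links without walking paths through the graph.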