211 research outputs found

    Two Novel Methods for Clustering Short Time-Course Gene Expression Profiles

    Because genes with similar expression patterns are very likely to share the same biological function, cluster analysis has become an important tool for understanding and predicting gene functions from gene expression profiles. In many situations, each gene expression profile contains only a few data points. Directly applying traditional clustering algorithms to such short gene expression profiles does not yield satisfactory results, so clustering algorithms designed for short gene expression profiles are needed. In this thesis, two novel methods are developed for clustering short gene expression profiles. The first method, called the network-based clustering method, addresses the limitations of short gene expression profiles by generating a gene co-expression network using conditional mutual information (CMI), which measures the non-linear relationship between two genes and accounts for indirect gene relationships in the presence of other genes. The network-based clustering method consists of two steps. First, a gene co-expression network is constructed from short gene expression profiles using a path consistency algorithm (PCA) based on the CMI between genes. Then, gene functional modules are identified in terms of cluster cohesiveness. The network-based clustering method is evaluated on 10 large-scale Arabidopsis thaliana short time-course gene expression profile datasets using gene ontology (GO) enrichment analysis, and compared with an existing method called Clustering with Overlapping Neighbourhood Expansion (ClusterONE). Gene functional modules identified by the network-based clustering method for the 10 datasets return target GO p-values as low as 10^-24, whereas the original ClusterONE yields insignificant results. To cluster gene expression profiles more specifically, a second clustering method, the protein-protein interaction (PPI) integrated clustering method, is developed. 
It is designed for clustering short gene expression profiles by integrating gene expression profile patterns and curated PPI data. The method consists of the following three steps: (1) generate a number of predefined profile patterns according to the number of data points in the profiles and assign each gene to the predefined profile to which its expression profile is most similar; (2) integrate curated PPI data to refine the initial clustering result from (1); (3) combine similar clusters from (2) to gradually reduce the number of clusters using a hierarchical clustering method. The PPI-integrated clustering method is evaluated on 10 large-scale A. thaliana datasets using GO enrichment analysis and by comparison with an existing method called Short Time-series Expression Miner (STEM). Target gene functional clusters identified by the PPI-integrated clustering method for the 10 datasets return GO p-values as low as 10^-62, whereas STEM returns GO p-values as low as 10^-38. In addition to the method development, the clusters obtained by the two proposed methods are further analyzed to identify cross-talk genes under five stress conditions in root and shoot tissues. A list of potential abiotic stress-tolerant genes is found.
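Step (1) of the PPI-integrated method, assigning each gene to its most similar predefined profile pattern, can be illustrated with a minimal sketch. The abstract does not say how the patterns are generated or how similarity is measured, so the code below assumes patterns that start at 0 and change by -1, 0, or +1 at each time point, and it assumes Pearson correlation as the similarity measure. The function names (`make_patterns`, `pearson`, `assign_genes`) are hypothetical and not taken from the thesis.

```python
from itertools import product
from math import sqrt


def make_patterns(n_points):
    """Enumerate candidate patterns (assumed scheme): start at 0,
    then move by -1, 0, or +1 at each subsequent time point."""
    patterns = []
    for steps in product((-1, 0, 1), repeat=n_points - 1):
        profile = [0]
        for step in steps:
            profile.append(profile[-1] + step)
        patterns.append(tuple(profile))
    return patterns


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def assign_genes(profiles, patterns):
    """Map each gene name to the pattern with the highest correlation."""
    return {
        gene: max(patterns, key=lambda p: pearson(values, p))
        for gene, values in profiles.items()
    }
```

For example, calling `assign_genes({"geneA": [0.0, 1.0, 2.0]}, make_patterns(3))` assigns the hypothetical `geneA` to the steadily increasing pattern. Steps (2) and (3), the PPI refinement and the hierarchical merging, are not covered by this sketch.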

    Mixed membership stochastic blockmodels

    Observations consisting of measurements on relationships for pairs of objects arise in many settings, such as protein interaction and gene regulatory networks, collections of author-recipient email, and social networks. Analyzing such data with probabilistic models can be delicate because the simple exchangeability assumptions underlying many boilerplate models no longer hold. In this paper, we describe a latent variable model of such data called the mixed membership stochastic blockmodel. This model extends blockmodels for relational data to ones which capture mixed membership latent relational structure, thus providing an object-specific low-dimensional representation. We develop a general variational inference algorithm for fast approximate posterior inference. We explore applications to social and protein interaction networks. Comment: 46 pages, 14 figures, 3 tables
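The generative process behind a mixed membership stochastic blockmodel can be sketched as follows, based on the standard formulation of the model: each node draws a membership vector from a Dirichlet distribution, each ordered pair of nodes draws a sender and a receiver block indicator from those vectors, and the edge is a Bernoulli draw whose probability comes from a block interaction matrix. This sketch only simulates data. It does not implement the variational inference algorithm mentioned in the abstract, and the function name `sample_mmsb` and its parameters are illustrative assumptions.

```python
import random


def sample_mmsb(n_nodes, alpha, block_matrix, seed=0):
    """Simulate a directed graph from a mixed membership stochastic blockmodel.

    alpha        -- Dirichlet concentration parameters, one per block
    block_matrix -- block_matrix[g][h] is the edge probability from block g to h
    """
    rng = random.Random(seed)
    n_blocks = len(alpha)

    # Dirichlet draw per node, obtained by normalizing Gamma variates.
    memberships = []
    for _ in range(n_nodes):
        gammas = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(gammas)
        memberships.append([g / total for g in gammas])

    adjacency = [[0] * n_nodes for _ in range(n_nodes)]
    for p in range(n_nodes):
        for q in range(n_nodes):
            if p == q:
                continue  # no self-loops in this sketch
            # Pair-specific block indicators for sender p and receiver q.
            z_send = rng.choices(range(n_blocks), weights=memberships[p])[0]
            z_recv = rng.choices(range(n_blocks), weights=memberships[q])[0]
            if rng.random() < block_matrix[z_send][z_recv]:
                adjacency[p][q] = 1
    return memberships, adjacency
```

Because the block indicators are redrawn for every pair, a node can act as a member of different blocks in different interactions, which is the "mixed membership" aspect the abstract refers to.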

    Development and application of software and algorithms for network approaches to proteomics data analysis

    The cells making up all living organisms integrate external and internal signals to carry out the functions of life. Dysregulation of signaling can lead to a variety of grave diseases, including cancer [Slamon et al., 1987]. In order to understand signal transduction, one has to identify and characterize the main constituents of cellular signaling cascades. Proteins are involved in most cellular processes and form the major class of biomolecules responsible for signal transduction. Post-translational modifications (PTMs) of proteins can modulate their enzymatic activity and their protein-protein interactions (PPIs), which in turn can ultimately lead to changes in protein expression. Classical biochemistry has approached the study of proteins, PTMs and interactions from a reductionist view. The abundance, stability and localization of proteins were studied one protein at a time, following the one gene-one protein-one function paradigm [Beadle and Tatum, 1941]. Pathways were considered to be linear, with signals transmitted from a gene to proteins, eventually resulting in a specific phenotype. Establishing the crucial link between genotype and phenotype remains challenging despite great advances in omics technologies, such as liquid chromatography (LC)-mass spectrometry (MS), that allow for the system-wide interrogation of proteins. Systems and network biology [Barabási and Oltvai, 2004, Bensimon et al., 2012, Jørgensen and Locard-Paulet, 2012, Choudhary and Mann, 2010] aims to transform modern biology by utilizing omics technologies to understand and uncover the various complex networks that govern the cell. The first detected large-scale biological networks have been found to be highly structured and non-random [Albert and Barabási, 2002]. Furthermore, these are assembled from functional and topological modules. The smallest topological modules are formed by the direct physical interactions within protein-protein and protein-RNA complexes. 
These molecular machines are able to perform a diverse array of cellular functions, such as transcription and degradation [Alberts, 1998]. Members of functional modules are not required to have a direct physical interaction. Instead, such modules also include proteins with temporal co-regulation throughout the cell cycle [Olsen et al., 2010], or following the circadian day-night rhythm [Robles et al., 2014]. The signaling pathways that make up the cellular network [Jordan et al., 2000] are assembled from a hierarchy of these smaller modules [Barabási and Oltvai, 2004]. The regulation of these modules through dynamic rewiring enables the cell to respond to internal and external stimuli. The main challenge in network biology is to develop techniques to probe the topology of various biological networks, to identify topological and functional modules, and to understand their assembly and dynamic rewiring. LC-MS has become a powerful experimental platform that addresses all these challenges directly [Bensimon et al., 2012], and has long been used to study a wide range of biomolecules that participate in the cellular network. The field of proteomics in particular, which is concerned with the identification and characterization of the proteins in the cell, has been revolutionized by recent technological advances in MS. Proteomics experiments are used not only to quantify peptides and proteins, but also to uncover the edges of the cellular network, by screening for physical PPIs in a global [Hein et al., 2015] or condition-specific manner [Kloet et al., 2016]. Crucial for the interpretation of the large-scale data generated by MS experiments is the development of software tools that aid researchers in translating raw measurements into biological insights. The MaxQuant and Perseus platforms were designed for this exact purpose. 
The aim of this thesis was to develop software tools for the analysis of MS-based proteomics data with a focus on network biology, and to apply the developed tools to study cellular signaling. The first step was the extension of the Perseus software with network data structures and activities. The new network module allows for the side-by-side analysis of matrices and networks inside an interactive workflow and is described in article 1. We subsequently applied the newly developed software to study the circadian phosphoproteome of cortical synapses (see article 2). In parallel, we aimed to improve the analysis of large datasets by adapting the previously Windows-only MaxQuant software to the Linux operating system, which is more prevalent in high-performance computing environments (see article 3).

    System biology modeling : the insights for computational drug discovery

    Indiana University-Purdue University Indianapolis (IUPUI). Traditional treatment strategy development for diseases involves the identification of target proteins related to disease states, and the interference of these proteins with drug molecules. Computational drug discovery and virtual screening from thousands of chemical compounds have accelerated this process. The thesis presents a comprehensive framework of computational drug discovery using system biology approaches. The thesis mainly consists of two parts: disease biomarker identification and disease treatment discoveries. The first part of the thesis focuses on research in biomarker identification for human diseases in the post-genomic era, with an emphasis on system biology approaches such as the use of protein interaction networks. There are two major types of biomarkers: a diagnostic biomarker is expected to detect a given type of disease in an individual with both high sensitivity and specificity; a predictive biomarker serves to predict drug response before treatment is started. Both are essential before we even start seeking any treatment for the patients. In this part, we first studied how the coverage of the disease genes, the protein interaction quality, and gene ranking strategies can affect the identification of disease genes. Second, we addressed the challenge of constructing a central database to collect system-level data such as protein interactions and pathways. Finally, we built case studies for biomarker identification, using diabetes as an example. The second part of the thesis mainly addresses how to find treatments after disease identification. It specifically focuses on computational drug repositioning due to its low cost, few translational issues and other benefits. First, we described how to implement literature mining approaches to build the disease-protein-drug connectivity map and demonstrated its superior performance compared to other existing applications. 
Second, we presented a valuable drug-protein directionality database, which fills the gap left by the lack of alternatives to the experimental CMAP in the computational drug discovery field. We also extended the correlation-based ranking algorithms by including the underlying topology among proteins. Finally, we demonstrated how to study drug repositioning beyond the genomic level and from one dimension to two dimensions, with clinical side effects as prediction features.

    Structure-oriented prediction in complex networks

    Complex systems are extremely hard to predict due to their highly nonlinear interactions and rich emergent properties. Thanks to the rapid development of network science, our understanding of the structure of real complex systems and the dynamics on them has deepened remarkably, which in turn has stimulated the growth of effective prediction approaches for these systems. In this article, we aim to review different network-related prediction problems, summarize and classify relevant prediction methods, analyze their advantages and disadvantages, and point out the forefront as well as the critical challenges of the field.

    Genetic mapping of host loci determining gut microbiota in hybrid mice

    All animals and plants are colonized by microorganisms, and different host species harbor different microbial populations. These microbial communities form long-term relationships with their hosts. Understanding the genomic basis underlying these relationships provides insight into the possible coevolution between hosts and their microbiota. This thesis aims to contribute to a deeper understanding of the forces shaping the microbiome.

    Large-Scale and Pan-Cancer Multi-omic Analyses with Machine Learning

    Multi-omic data analysis has been foundational in many fields of molecular biology, including cancer research. Investigation of the relationship between different omic data types reveals patterns that cannot be found in a single data type alone. With recent technological advancements in mass spectrometry (MS), MS-based proteomics has enabled the quantification of thousands of proteins in hundreds of cell lines and human tissue samples. This thesis presents several machine learning-based methods that facilitate the integrative analysis of multi-omic data. First, we reviewed five existing multi-omic data integration methods and performed a benchmarking analysis using a large-scale multi-omic cancer cell line dataset. We evaluated the performance of these machine learning methods for drug response prediction and cancer type classification. Our results provide recommendations to researchers regarding optimal machine learning method selection for their applications. Second, we generated a pan-cancer proteomic map of 949 cancer cell lines across 40 cancer types and developed a machine learning method, DeeProM, to analyse the multi-omic information of these lines. This pan-cancer proteomic map (ProCan-DepMapSanger) is now publicly available and represents a major resource for the scientific community, for biomarker discovery and for the study of fundamental aspects of protein regulation. Third, we focused on publicly available multi-omic datasets of both cancer cell lines and human tissue samples and developed a Transformer-based deep learning method, DeePathNet, which integrates human knowledge with machine intelligence. We applied DeePathNet to three evaluation tasks, namely drug response prediction, cancer type classification and breast cancer subtype classification. Taken together, our analyses and methods allowed more accurate cancer diagnosis and prognosis.