
    Probabilistic analysis of the human transcriptome with side information

    Understanding the functional organization of genetic information is a major challenge in modern biology. Following the initial publication of the human genome sequence in 2001, advances in high-throughput measurement technologies and efficient sharing of research material through community databases have opened up new perspectives on the study of living organisms and the structure of life. In this thesis, novel computational strategies have been developed to investigate a key functional layer of genetic information, the human transcriptome, which regulates the function of living cells through protein synthesis. The key contributions of the thesis are general exploratory tools for high-throughput data analysis that have provided new insights into cell-biological networks, cancer mechanisms and other aspects of genome function. A central challenge in functional genomics is that high-dimensional genomic observations are associated with high levels of complex and largely unknown sources of variation. By combining statistical evidence across multiple measurement sources and the wealth of background information in genomic data repositories, it has been possible to resolve some of the uncertainties associated with individual observations and to identify functional mechanisms that could not be detected from individual measurement sources. Statistical learning and probabilistic models provide a natural framework for such modeling tasks. Open-source implementations of the key methodological contributions have been released to facilitate further adoption of the developed methods by the research community.
    Comment: Doctoral thesis. 103 pages, 11 figures.

    Information Theory in Molecular Evolution: From Models to Structures and Dynamics

    This Special Issue collects novel contributions from scientists in the interdisciplinary field of biomolecular evolution. The works listed here take information-theoretic concepts as their core but are tightly integrated with the study of molecular processes. Applications include the analysis of phylogenetic signals to elucidate biomolecular structure and function, the study and quantification of structural dynamics and allostery, as well as models of molecular interaction specificity inspired by evolutionary cues.

    Network-based analysis of gene expression data

    The methods of molecular biology for the quantitative measurement of gene expression have undergone rapid development in the past two decades. High-throughput assays with microarray and RNA-seq technology now enable whole-genome studies in which several thousands of genes can be measured at a time. However, this has also imposed serious challenges on data storage and analysis, which are the subject of the young but rapidly developing field of computational biology. Explaining observations made on such a large scale requires suitably and accordingly scaled models of gene regulation. Detailed models, as available for single genes, need to be extended and assembled into larger networks of regulatory interactions between genes and gene products. Incorporation of such networks into methods for data analysis is crucial to identify the molecular mechanisms that drive the observed expression. As methods for this purpose emerge in parallel to each other and without a known standard of truth, results need to be critically checked in a competitive setup and in the context of the available rich literature corpus. This work is centered on and contributes to the following subjects, each of which represents an important and distinct research topic in the field of computational biology: (i) construction of realistic gene regulatory network models; (ii) detection of subnetworks that are significantly altered in the data under investigation; and (iii) systematic biological interpretation of detected subnetworks. For the construction of regulatory networks, I review existing methods with a focus on curation and inference approaches. I first describe how literature curation can be used to construct a regulatory network for a specific process, using the well-studied diauxic shift in yeast as an example. In particular, I address the question of how a detailed understanding, as available for the regulation of single genes, can be scaled up to the level of larger systems.
    I subsequently inspect methods for large-scale network inference, showing that they are significantly skewed towards master regulators. A recalibration strategy is introduced and applied, yielding an improved genome-wide regulatory network for yeast. To detect significantly altered subnetworks, I introduce GGEA as a method for network-based enrichment analysis. The key idea is to score regulatory interactions within functional gene sets for consistency with the observed expression. Compared to other recently published methods, GGEA yields results that consistently and coherently align expression changes with known regulation types and that are thus easier to explain. I also suggest and discuss several significant enhancements to the original method that improve its applicability, outcomes and runtime. For the systematic detection and interpretation of subnetworks, I have developed the EnrichmentBrowser software package. It implements several state-of-the-art methods besides GGEA, and allows results to be combined and explored across methods. As part of the Bioconductor repository, the package provides unified access to the different methods and thus greatly simplifies their usage for biologists. Extensions to this framework that support the automation of biological interpretation routines are also presented. In conclusion, this work contributes substantially to the research field of network-based analysis of gene expression data with respect to regulatory network construction, subnetwork detection, and their biological interpretation. This also includes recent developments as well as areas of ongoing research, which are discussed in the context of current and future questions arising from the new generation of genomic data.
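    The core consistency idea behind the enrichment approach described above can be illustrated with a toy sketch in Python. This is not the published GGEA algorithm; the function, data, and scoring rule below are simplified assumptions chosen only to show what "scoring regulatory interactions for consistency with observed expression" can mean: an activating edge is consistent when regulator and target change in the same direction, an inhibiting edge when they change in opposite directions.

```python
def edge_consistency(fold_changes, edges):
    """Toy edge-consistency score (illustrative, not the GGEA method).

    fold_changes: dict mapping gene -> log2 fold change.
    edges: list of (regulator, target, sign) triples, with sign +1
    for activation and -1 for inhibition.
    Returns the mean per-edge consistency in [-1, 1].
    """
    scores = []
    for reg, tgt, sign in edges:
        fc_r, fc_t = fold_changes[reg], fold_changes[tgt]
        # The product sign * fc_r * fc_t is positive exactly when the
        # observed direction of change matches the regulation type.
        agree = sign * fc_r * fc_t
        scores.append(1.0 if agree > 0 else -1.0 if agree < 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical example: A activates B and inhibits C; A and B go up, C goes down.
fc = {"A": 1.2, "B": 0.8, "C": -0.5}
net = [("A", "B", +1), ("A", "C", -1)]
print(edge_consistency(fc, net))  # both edges consistent -> 1.0
```

    In the actual method, such per-edge scores are aggregated over all interactions within a functional gene set and assessed for significance; the sketch only captures the directional-consistency core.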

    Distance-based analysis of dynamical systems and time series by optimal transport

    The concept of distance is a fundamental notion that forms a basis for orientation in space. It is related to the scientific measurement process: quantitative measurements result in numerical values, and these can be immediately translated into distances. Vice versa, a set of mutual distances defines an abstract Euclidean space. Each system is thereby represented as a point, whose Euclidean distances approximate the original distances as closely as possible. If the original distance measures interesting properties, these reappear as patterns in this space. This idea is applied to complex systems: the act of breathing, the structure and activity of the brain, and dynamical systems and time series in general. In all these situations, optimal transport distances are used; these measure how much work is needed to transform one probability distribution into another. The reconstructed Euclidean space then permits the application of multivariate statistical methods. In particular, canonical discriminant analysis makes it possible to distinguish between distinct classes of systems, e.g., between healthy and diseased lungs. This offers new diagnostic perspectives in the assessment of lung and brain diseases, and also a new approach to numerical bifurcation analysis and to quantifying synchronization in dynamical systems.
    LEI Universiteit Leiden; NWO Computational Life Sciences, grant no. 635.100.006; Analyse en stochastie
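    The two-step pipeline sketched in this abstract, pairwise optimal transport distances followed by a Euclidean embedding of the distance matrix, can be illustrated with a minimal Python sketch, assuming SciPy is available. `wasserstein_distance` computes the 1-D optimal transport distance between empirical distributions, and classical multidimensional scaling recovers Euclidean coordinates from the distance matrix by double centering; the toy "systems" below are made-up Gaussian samples.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def classical_mds(d, k=2):
    """Embed points in R^k from a symmetric distance matrix d
    via double centering and eigendecomposition (classical MDS)."""
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j           # centered Gram matrix
    w, v = np.linalg.eigh(b)              # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]         # take the top-k eigenpairs
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0))

# Three toy "systems", each summarized by samples of some observable:
# the first two are nearly identical, the third is clearly different.
rng = np.random.default_rng(0)
samples = [rng.normal(0.0, 1.0, 500),
           rng.normal(0.1, 1.0, 500),
           rng.normal(3.0, 1.0, 500)]

n = len(samples)
d = np.zeros((n, n))
for a in range(n):
    for b_ in range(n):
        d[a, b_] = wasserstein_distance(samples[a], samples[b_])

coords = classical_mds(d, k=2)
# In the embedding, systems 0 and 1 land close together, far from system 2.
```

    On the embedded coordinates, standard multivariate methods such as discriminant analysis can then be applied, which is the role canonical discriminant analysis plays in the thesis.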

    Fantastic Sources Of Tumor Heterogeneity And How To Characterize Them

    Cancer constantly evolves to evade the host immune system and resist different treatments. As a consequence, we see a wide range of inter- and intra-tumor heterogeneity. In this PhD thesis, we present a collection of computational methods that characterize this heterogeneity from diverse perspectives. First, we developed computational frameworks for predicting functional re-wiring events in cancer and imputing the functional effects of protein-protein interactions given genome-wide transcriptomics and genetic perturbation data. Second, we developed a computational framework to characterize intra-tumor genetic heterogeneity in melanoma from bulk sequencing data and to study its effects on the host immune response and patient survival independently of the overall mutation burden. Third, we analyzed publicly available genome-wide copy number, expression and methylation data of distinct cancer types and their normal tissues of origin to systematically uncover factors driving the acquisition of cancer type-specific chromosomal aneuploidies. Lastly, we developed a new computational tool, CODEFACS (COnfident Deconvolution For All Cell Subsets), to dissect the cellular heterogeneity of each patient’s tumor microenvironment (TME) from bulk RNA sequencing data, and LIRICS (LIgand Receptor Interactions between Cell Subsets), a supporting statistical framework to discover clinically relevant cellular immune crosstalk. Taken together, the methods presented in this thesis offer a way to study tumor heterogeneity in large patient cohorts using widely available bulk sequencing data and to obtain new insights into tumor progression.
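    The generic idea behind deconvolving bulk RNA-seq into cell-type contributions can be sketched in a few lines of Python. This is not the CODEFACS method, whose algorithm is not described in this abstract; it is the standard linear mixing model (bulk profile ≈ signature matrix × cell-type fractions) solved with non-negative least squares, and all numbers below are invented for illustration.

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical signature matrix S: expression of 4 marker genes (rows)
# in 3 cell types (columns).
S = np.array([[10.0, 1.0, 0.5],
              [ 0.5, 8.0, 1.0],
              [ 1.0, 0.5, 9.0],
              [ 2.0, 2.0, 2.0]])

# Simulate a bulk sample as a noiseless mixture of the three cell types.
true_frac = np.array([0.6, 0.3, 0.1])
bulk = S @ true_frac

# Recover the fractions with non-negative least squares, then normalize.
frac, _ = nnls(S, bulk)
frac /= frac.sum()
print(np.round(frac, 2))  # recovers [0.6, 0.3, 0.1] in this noiseless toy case
```

    Real deconvolution tools must additionally handle noise, unknown or incomplete signatures, and per-gene confidence, which is where methods like CODEFACS go beyond this sketch.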

    Development of Integrated Machine Learning and Data Science Approaches for the Prediction of Cancer Mutation and Autonomous Drug Discovery of Anti-Cancer Therapeutic Agents

    Few technological ideas have captivated the minds of biochemical researchers to the degree that machine learning (ML) and artificial intelligence (AI) have. Over the last few years, advances in the ML field have driven the design of new computational systems that improve with experience and are able to model increasingly complex chemical and biological phenomena. In this dissertation, we capitalize on these achievements and use machine learning to study drug receptor sites and design drugs to target these sites. First, we analyze the significance of various single nucleotide variations and assess their rate of contribution to cancer. Following that, we use a portfolio of machine learning and data science approaches to design new protein kinase inhibitors. We show that these techniques exhibit strong promise in aiding cancer research and drug discovery.

    Two-component regulation: modelling, predicting & identifying protein-protein interactions & assessing signalling networks of bacteria

    Two-component signalling systems (TCSs) are found in most prokaryotic genomes. They typically comprise two proteins, a histidine (or sensor) kinase (HK) and an associated response regulator (RR), containing transmitter and receiver domains respectively, which interact to achieve transfer of a phosphoryl group from a histidine residue (of the transmitter domain in the HK) to an aspartate residue (of the partner RR’s receiver domain). An automated analysis pipeline using the NCBI’s RPS-BLAST tool was developed to identify and classify all TCS genes from completed prokaryotic genomes using the PFAM and CDD protein domain databases. A large proportion of TCS genes were found to be simple hybrid kinases (HYs) containing both a transmitter domain and a receiver domain within a single protein, presumably the result of the fusion or combination of separate HK and RR genes. This propensity to consolidate functionality into a single protein was found to be limited in the presence of either a transmembrane sensory/input domain or a DNA binding domain – two spatially separated functions. While HK and RR genes are usually found together in the genome, in some species a large proportion of TCS domains are found as part of complex hybrid kinases (genes containing multiple TCS domains), in isolated or orphaned genes, or in complex gene clusters. In such organisms the lack of paired HK and RR genes makes it difficult to define genome-encoded signalling networks. Identifying paired transmitter and receiver domains from a pan-genomic survey of prokaryotes gives a database of amino acid sequences for thousands of interacting protein-protein complexes. Covariation between columns of multiple sequence alignments (MSAs) identifies particular pairs of residues representing interactions within the docked complex.
Using numerical scores, these amino acid pairs were successfully used as explanatory variables in a generalised linear model (GLM) to predict the probabilities of interaction between transmitter and receiver domains.
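    One common numerical score for covariation between a pair of MSA columns is mutual information; the abstract does not say which score the thesis uses, so MI stands in here purely as an illustration. The sketch below, in plain Python, scores two alignment columns taken from paired sequences; the example columns are hypothetical.

```python
from collections import Counter
from math import log2

def column_mi(col_a, col_b):
    """Mutual information (in bits) between two alignment columns,
    given as equal-length strings of residues from paired sequences.
    High MI means the residues covary, i.e. knowing the residue in one
    column constrains the residue in the other."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in pab.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((pa[a] / n) * (pb[b] / n)))
    return mi

# Perfect covariation: L always pairs with D, K always with E -> 1 bit.
print(column_mi("LLKK", "DDEE"))  # 1.0
# No covariation: every residue combination occurs equally often -> 0 bits.
print(column_mi("LLKK", "DEDE"))  # 0.0
```

    Scores like this, computed for many candidate column pairs across the paired transmitter/receiver alignments, are the kind of explanatory variables that a GLM (e.g. logistic regression on interaction vs. non-interaction) could then combine into an interaction probability.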