1,182 research outputs found

    Understanding cellular differentiation by modelling of single-cell gene expression data

    Over the course of the last decade single-cell RNA sequencing (scRNA-seq) has revolutionized the study of cellular heterogeneity, as one experiment routinely covers the expression of thousands of genes in tens or hundreds of thousands of cells. By quantifying differences between single-cell transcriptomes it is possible to reconstruct the process that gives rise to different cell fates from a progenitor population and gain access to trajectories of gene expression over developmental time. Tree reconstruction algorithms must deal with high levels of noise, the high dimensionality of gene expression space, and strong non-linear dependencies between genes. In this thesis we address three aspects of working with scRNA-seq data: (1) lineage tree reconstruction, where we propose MERLoT, a novel trajectory inference method, (2) method comparison, where we propose PROSSTT, a novel algorithm that simulates scRNA-seq count data of complex differentiation trajectories, and (3) noise modelling, where we propose a novel probabilistic description of count data, a statistically motivated local averaging strategy, and an adaptation of the cross-validation approach for the evaluation of gene expression imputation strategies. While statistical modelling of the data was our primary motivation, due to time constraints we did not manage to fully realize our plans for it. Increasingly complex processes like whole-organism development are being studied by single-cell transcriptomics, producing large amounts of data. Methods for trajectory inference must therefore efficiently reconstruct a priori unknown lineage trees with many cell fates. We propose MERLoT, a method that reconstructs trees in sub-quadratic time by utilizing a local averaging strategy and therefore scales well to large datasets. MERLoT compares favorably to the state of the art, both on real data and on a large synthetic benchmark. The absence of data with known complex underlying topologies makes it challenging to quantitatively compare tree reconstruction methods to each other. PROSSTT is a novel algorithm that simulates count data from complex differentiation processes, facilitating comparisons between algorithms. We created the largest synthetic dataset to date, and the first to contain simulations with up to 12 cell fates. Additionally, PROSSTT can learn simulation parameters from reconstructed lineage trees and produce cells with expression profiles similar to the real data. Quantifying similarity between single-cell transcriptomes is crucial for clustering scRNA-seq profiles into cell types or inferring developmental trajectories, and appropriate statistical modelling of the data should improve such similarity calculations. We propose a Gaussian mixture of negative binomial distributions where gene expression variance depends on the square of the average expression. The model hyperparameters can be learned via the hybrid Monte Carlo algorithm, and a good initialization of average expression and variance parameters can be obtained by trajectory inference. A way to limit noise in the data is to apply local averaging, using the nearest neighbours of each cell to recover expression of non-captured mRNA. Our proposal, nearest neighbour smoothing with optimal bias-variance trade-off, optimizes the k-nearest neighbours approach by reducing the contribution of inappropriate neighbours. We also propose a way to assess the quality of gene expression imputation. After reconstructing a trajectory with imputed data, each cell can be projected to the trajectory using non-overlapping subsets of genes. The robustness of these assignments over multiple partitions of the genes is a novel estimator of imputation performance. Finally, I was involved in the planning and initial stages of a mouse ovary cell atlas as a collaboration
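
    The local averaging idea mentioned in this abstract can be illustrated with a minimal sketch of plain k-nearest-neighbour smoothing. This is not the thesis's method: the optimal bias-variance weighting that down-weights inappropriate neighbours is not reproduced here. The function name `knn_smooth`, the log-transform used for distances, and the default `k` are all assumptions made for illustration.

```python
import numpy as np

def knn_smooth(counts, k=3):
    """Average each cell's expression profile with its k nearest neighbours.

    counts: (n_cells, n_genes) array of expression counts.
    Returns an array of the same shape holding the smoothed values.
    Hypothetical helper for illustration only: uniform weights, no
    bias-variance optimization.
    """
    counts = np.asarray(counts, dtype=float)
    # Log-transform for the distance computation only, so that a few
    # highly expressed genes do not dominate the neighbour search.
    logged = np.log1p(counts)
    # Pairwise squared Euclidean distances between cells.
    sq_norms = (logged ** 2).sum(axis=1)
    dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * logged @ logged.T
    # Force every cell to rank itself first, regardless of rounding error.
    np.fill_diagonal(dists, -np.inf)
    # Each row: the cell itself followed by its k closest neighbours.
    neighbours = np.argsort(dists, axis=1)[:, : k + 1]
    # Uniform average over the cell and its neighbours, per gene.
    return counts[neighbours].mean(axis=1)
```

    Averaging raw counts while searching for neighbours in log space is one possible design choice among several; a real pipeline would likely also normalize for sequencing depth first.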

    Reconstructing evolving signalling networks by hidden Markov nested effects models

    Inferring time-varying networks is important to understand the development and evolution of interactions over time. However, the vast majority of currently used models assume direct measurements of node states, which are often difficult to obtain, especially in fields like cell biology, where perturbation experiments often only provide indirect information about network structure. Here we propose hidden Markov nested effects models (HM-NEMs) to model the evolving network by a Markov chain on a state space of signalling networks, which are derived from nested effects models (NEMs) of indirect perturbation data. To infer the hidden network evolution and unknown parameter, a Gibbs sampler is developed, in which sampling network structure is facilitated by a novel structural Metropolis–Hastings algorithm. We demonstrate the potential of HM-NEMs by simulations on synthetic time-series perturbation data. We also show the applicability of HM-NEMs in two real biological case studies, in one capturing dynamic crosstalk during the progression of neutrophil polarisation, and in the other inferring an evolving network underlying early differentiation of mouse embryonic stem cells. This is the final published manuscript, originally published by The Annals of Applied Statistics here: http://projecteuclid.org/euclid.aoas/1396966294

    Bayesian nonparametric clusterings in relational and high-dimensional settings with applications in bioinformatics.

    Recent advances in high-throughput methodologies offer researchers the ability to understand complex systems via high-dimensional and multi-relational data. One example is the realm of molecular biology, where disparate data (such as gene sequence, gene expression, and interaction information) are available for various snapshots of biological systems. This type of high-dimensional and multi-relational data allows for unprecedentedly detailed analysis, but also presents challenges in accounting for all the variability. High-dimensional data often has a multitude of underlying relationships, each represented by a separate clustering structure, where the number of structures is typically unknown a priori. To address the challenges faced by traditional clustering methods on high-dimensional and multi-relational data, we developed three feature selection and cross-clustering methods: 1) infinite relational model with feature selection (FIRM), which incorporates the rich information of multi-relational data; 2) Bayesian Hierarchical Cross-Clustering (BHCC), a deterministic approximation to Cross Dirichlet Process mixture (CDPM) and to cross-clustering; and 3) randomized approximation (RBHCC), based on a truncated hierarchy. An extension of BHCC, Bayesian Congruence Measuring (BCM), is proposed to measure incongruence between genes and to identify sets of congruent loci with identical evolutionary histories. We adapt our BHCC algorithm to the inference of BCM, where the intended structure of each view (congruent loci) represents consistent evolutionary processes. We consider an application of FIRM to categorizing mRNA and microRNA. The model uses latent structures to encode the expression pattern and the gene ontology annotations. We also apply FIRM to recover the categories of ligands and proteins, and to predict unknown drug-target interactions, where the latent categorization structure encodes drug-target interaction, chemical compound similarity, and amino acid sequence similarity. BHCC and RBHCC are shown to have improved predictive performance (both in terms of cluster membership and missing value prediction) compared to traditional clustering methods. Our results suggest that these novel approaches to integrating multi-relational information have a promising future in the biological sciences, where incorporating data related to varying features is often regarded as a daunting task.

    MicroRNA-mediated regulatory circuits: outlook and perspectives

    MicroRNAs have been found to be necessary for regulating genes implicated in almost all signaling pathways, and consequently their dysfunction influences many diseases, including cancer. Understanding of the complexity of the microRNA-mediated regulatory network has grown in terms of size, connectivity and dynamics with the development of computational and, more recently, experimental high-throughput approaches for microRNA target identification. Recent studies on recurrent microRNA-mediated circuits in regulatory networks, also known as network motifs, have substantially contributed to addressing this complexity, and therefore to helping understand the ways by which microRNAs achieve their regulatory role. This review summarizes the state of the art and the perspectives of research efforts on microRNA-mediated regulatory motifs. We discuss the topological properties characterizing different types of circuits, and the regulatory features theoretically enabled by such properties, with a special emphasis on examples of circuits typifying their biological significance in experimentally validated contexts. Finally, we consider possible future developments, in particular regarding microRNA-mediated circuits involving long non-coding RNAs and epigenetic regulators.

    Robust Algorithms for Detecting Hidden Structure in Biological Data

    Biological data, such as molecular abundance measurements and protein sequences, harbor complex hidden structure that reflects the underlying biological mechanisms. For example, high-throughput abundance measurements provide a snapshot of the global state of a living cell, while homologous protein sequences encode the residue-level logic of the proteins' function and provide a snapshot of the evolutionary trajectory of the protein family. In this work I describe algorithmic approaches and analysis software I developed for uncovering hidden structure in both kinds of data. Clustering is an unsupervised machine learning technique commonly used to map the structure of data collected in high-throughput experiments, such as quantification of gene expression by DNA microarrays or short-read sequencing. Clustering algorithms always yield a partitioning of the data, but relying on a single partitioning solution can lead to spurious conclusions. In particular, noise in the data can cause objects to fall into the same cluster by chance rather than due to meaningful association. In the first part of this thesis I demonstrate approaches to clustering data robustly in the presence of noise and apply robust clustering to analyze the transcriptional response to injury in a neuron cell. In the second part of this thesis I describe identifying hidden specificity-determining residues (SDPs) from alignments of protein sequences descended through gene duplication from a common ancestor (paralogs) and apply the approach to identify numerous putative SDPs in bacterial transcription factors in the LacI family. Finally, I describe and demonstrate a new algorithm for reconstructing the history of duplications by which paralogs descended from their common ancestor. This algorithm addresses the complexity of such reconstruction due to indeterminate or erroneous homology assignments made by sequence alignment algorithms and to the vast prevalence of divergence through speciation over divergence through gene duplication in protein evolution.
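
    The point about single partitions being unreliable under noise can be made concrete with a small sketch. The abstract does not say which robust clustering approach the thesis uses, so what follows is a generic co-association (consensus) scheme swapped in for illustration: cluster many noise-perturbed copies of the data and record how often each pair of objects lands in the same cluster. The names `kmeans_labels` and `coassociation`, the choice of k-means, and all default parameters are assumptions.

```python
import numpy as np

def kmeans_labels(data, k, rng, n_iter=20):
    """Plain Lloyd's k-means; returns one cluster label per row."""
    # Start from k distinct data points chosen at random.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its closest center.
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    return labels

def coassociation(data, k, n_runs=50, noise_sd=0.1, seed=0):
    """Fraction of runs in which each pair of objects shares a cluster.

    Hypothetical illustration: each run clusters a copy of the data
    with added Gaussian noise, so chance co-clustering shows up as
    values well below 1.
    """
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(data)
    together = np.zeros((n, n))
    for _ in range(n_runs):
        noisy = data + rng.normal(scale=noise_sd, size=data.shape)
        labels = kmeans_labels(noisy, k, rng)
        # Add 1 for every pair that received the same label in this run.
        together += labels[:, None] == labels[None, :]
    return together / n_runs
```

    Pairs with a co-association near 1 cluster together regardless of the perturbation, while intermediate values flag associations that a single partition would overstate.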

    SeqNet: An R Package for Generating Gene-Gene Networks and Simulating RNA-Seq Data

    Gene expression data provide an abundant resource for inferring connections in gene regulatory networks. While methodologies developed for this task have shown success, a challenge remains in comparing the performance among methods. Gold-standard datasets are scarce and limited in use, and while tools for simulating expression data are available, they are not designed to resemble the data obtained from RNA-seq experiments. SeqNet is an R package that provides tools for generating a rich variety of gene network structures and simulating RNA-seq data from them. This produces in silico RNA-seq data for benchmarking and assessing gene network inference methods. The package is available from the Comprehensive R Archive Network at https://CRAN.R-project.org/package=SeqNet and on GitHub at https://github.com/tgrimes/SeqNet

    Exploiting natural and induced genetic variation to study hematopoiesis

    PUZZLING WITH DNA Blood cell formation can be studied by making use of natural genetic variation across mouse strains. There are, for example, two mouse strains that differ not only in fur color, but also in average life span and, more specifically, in the number of blood-forming stem cells in their bone marrow. The cause of these differences can be found in the DNA of these mice. This DNA differs slightly between the two mouse strains, making some genes in one strain slightly more or less active than the same genes in the other strain. The aim of part I of this thesis was to study the influence of genetic variation on gene expression and how this might explain the specific characteristics of the mouse strains. One of the findings in this study was that the influence of genetic variation on gene expression is strongly cell-type-dependent. Additionally, blood cell formation can be studied by introducing genetic variation into the system. In part II of this thesis genetic variation was introduced into mouse blood-forming stem cells by letting random DNA sequences or “barcodes” integrate into the DNA of these cells. Thereby, these cells were provided with a unique and identifiable label that was heritable from mother to daughter cell. The DNA sequence of the label can be read using a sequencing technique. In this manner the fate of blood-forming stem cells and their progeny could be tracked following transplantation in mice. This technique is very promising for monitoring blood cell formation in future clinical gene therapy studies in humans.