79 research outputs found

    Computational methods for large-scale single-cell RNA-seq and multimodal data

    Get PDF
    Emerging single-cell genomics technologies such as single-cell RNA-seq (scRNA-seq) and single-cell ATAC-seq provide new opportunities for discovering previously unknown cell types, facilitating the study of biological processes such as tumor progression, and delineating differences in molecular mechanisms between species. Due to the high dimensionality of the data produced by these technologies, computation and mathematics have been the cornerstone of decoding meaningful information from them. Computational models are challenged by the exponential growth of the data, driven by the continuing decrease in sequencing costs and by large-scale genomic projects such as the Human Cell Atlas. In addition, recent single-cell technologies can measure multiple modalities, such as the transcriptome, proteome, and epigenome, in the same cell, which calls for new computational methods that can cope with multiple layers of data. To address these challenges, the main goal of this thesis was to develop computational methods and mathematical models for analyzing large-scale scRNA-seq and multimodal omics data. In particular, I focused on fundamental single-cell analyses such as clustering and visualization.

    The most common task in scRNA-seq data analysis is the identification of cell types. Numerous methods have been proposed for this problem, with a current focus on methods for the analysis of large-scale scRNA-seq data. I developed Specter, a computational method that utilizes recent algorithmic advances in fast spectral clustering and ensemble learning. Specter achieves a substantial improvement in accuracy over existing methods and identifies rare cell types with high sensitivity; it processes a dataset comprising 2 million cells in just 26 minutes. Moreover, the analysis of CITE-seq data, which simultaneously provides gene expression and protein levels, showed that Specter is able to incorporate multimodal omics measurements to resolve subtle transcriptomic differences between subpopulations of cells.

    Specter handles big data effectively for clustering analysis; the question is how to cope with big data in other downstream analyses such as trajectory inference and data integration. The simplest scheme is to shrink the data by selecting a subset of cells (the sketch) that best represents the full data set. I therefore developed an algorithm called Sphetcher that uses a thresholding technique to efficiently pick representative cells that evenly cover the transcriptomic space occupied by the original data set (a minimal flavor of this idea is sketched in the code below). I showed that the sketch computed by Sphetcher constitutes a more accurate representation of the original transcriptomic landscape than those of existing methods, leading to a more balanced composition of cell types and a larger fraction of rare cell types in the sketch. Sphetcher thus bridges the gap between the scalability of computational methods and the volume of the data. Moreover, I demonstrated that Sphetcher can incorporate prior information (e.g., cell labels) to inform the inference of the trajectory of human skeletal muscle myoblast differentiation.

    Biological processes such as development, differentiation, and the cell cycle can be monitored by performing single-cell sequencing at different time points, each corresponding to a snapshot of the process. A class of computational methods called trajectory inference aims to reconstruct the developmental trajectories from these snapshots.
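    As an aside, the geometric idea behind such sketching can be pictured with a greedy farthest-first (k-center) selection, shown below in Python. This is only a hypothetical, minimal sketch of the covering objective; Sphetcher's actual thresholding algorithm is more involved.

        import numpy as np

        def farthest_first_sketch(X, k, seed=0):
            """Greedy k-center selection: repeatedly add the cell farthest
            from the current sketch, so that the selected cells evenly cover
            the space (the classic 2-approximation to the k-center
            objective)."""
            rng = np.random.default_rng(seed)
            chosen = [int(rng.integers(X.shape[0]))]
            dist = np.linalg.norm(X - X[chosen[0]], axis=1)  # distance to sketch
            for _ in range(k - 1):
                nxt = int(np.argmax(dist))                   # farthest cell
                chosen.append(nxt)
                dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
            return np.array(chosen)

        # Usage: pick 100 representative cells from a PCA-reduced matrix.
        X = np.random.rand(10000, 50)   # toy data: 10,000 cells x 50 PCs
        sketch = farthest_first_sketch(X, k=100)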
    Trajectory inference (TI) methods such as Monocle can computationally infer a pseudotime variable that serves as a proxy for developmental time. In order to compare two trajectories inferred by TI methods, we need to align their pseudotime. Current methods for aligning trajectories are based on the concept of dynamic time warping, which is limited to simple linear trajectories. Since complex trajectories are common in developmental processes, I adopted arboreal matchings to compare and align complex trajectories with multiple branch points diverting cells into alternative fates. Arboreal matchings were originally proposed in the context of phylogenetic trees, and I theoretically linked them to dynamic time warping. A suite of exact and heuristic algorithms for aligning complex trajectories is implemented in the software Trajan. When aligning single-cell trajectories describing human muscle differentiation and myogenic reprogramming, Trajan automatically identifies the core paths from which we are able to reproduce recently reported barriers to reprogramming. In a perturbation experiment, I showed that Trajan correctly maps identical cells in a global view of trajectories, as opposed to a pairwise application of dynamic time warping (a minimal DTW baseline is sketched after this abstract for reference).

    Visualization using dimensionality reduction techniques such as t-SNE and UMAP is a fundamental step in the analysis of high-dimensional data and has played a pivotal role in discovering dynamic trends in single-cell genomics data. I developed j-SNE and j-UMAP as generalizations of these techniques to the joint visualization of multimodal omics data, e.g., CITE-seq data. The approach automatically learns the relative importance of each modality in order to obtain a concise representation of the data. Compared with conventional approaches, j-SNE and j-UMAP produce unified embeddings that better agree with known cell types and that harmonize RNA and protein velocity landscapes.
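    For reference, the linear baseline that arboreal matchings generalize is classic dynamic time warping, which can be stated in a few lines of Python. This is a minimal textbook sketch, not Trajan's algorithm.

        import numpy as np

        def dtw_align(a, b):
            """Dynamic time warping between two pseudotime-ordered 1-D
            profiles; returns the alignment cost and the warping path."""
            n, m = len(a), len(b)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = abs(a[i - 1] - b[j - 1])
                    D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
            # Backtrack from (n, m) to recover the warping path.
            path, i, j = [(n - 1, m - 1)], n, m
            while i > 1 or j > 1:
                if i == 1:
                    j -= 1
                elif j == 1:
                    i -= 1
                else:
                    step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
                    if step == 0:
                        i, j = i - 1, j - 1
                    elif step == 1:
                        i -= 1
                    else:
                        j -= 1
                path.append((i - 1, j - 1))
            return D[n, m], path[::-1]

        # Usage: align two profiles sampled at different densities.
        cost, path = dtw_align(np.sin(np.linspace(0, 3, 40)),
                               np.sin(np.linspace(0, 3, 55)))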

    MultiMAP: dimensionality reduction and integration of multimodal data.

    Get PDF
    Multimodal data is rapidly growing in many fields of science and engineering, including single-cell biology. We introduce MultiMAP, a novel algorithm for dimensionality reduction and integration. MultiMAP can integrate any number of datasets, leverages features not present in all datasets, is not restricted to a linear mapping, allows the user to specify the influence of each dataset, and is extremely scalable to large datasets. We apply MultiMAP to single-cell transcriptomics, chromatin accessibility, methylation, and spatial data and show that it outperforms current approaches. On a new thymus dataset, we use MultiMAP to integrate cells along a temporal trajectory. This enables quantitative comparison of transcription factor expression and binding site accessibility over the course of T cell differentiation, revealing patterns of expression versus binding site opening kinetics.
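    To make the integration setting concrete, the naive baseline that MultiMAP improves on is to restrict all modalities to their shared features and embed the stacked cells with vanilla UMAP, as in the hypothetical Python sketch below (all shapes and names are made up; this is not MultiMAP's algorithm, which also exploits features present in only one dataset).

        import numpy as np
        import umap  # umap-learn; MultiMAP builds on the UMAP framework

        # Toy data: two modalities with a (hypothetical) block of shared features.
        rna = np.random.rand(500, 2000)    # 500 cells x 2000 genes
        atac = np.random.rand(400, 2000)   # 400 cells x gene-activity scores
        n_shared = 1200                    # assume the first 1200 features align

        # Naive shared-feature integration: stack cells, then embed jointly.
        X = np.vstack([rna[:, :n_shared], atac[:, :n_shared]])
        embedding = umap.UMAP(n_neighbors=30, min_dist=0.3).fit_transform(X)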

    Resolving Biological Trajectories in Single-cell Data using Feature Selection and Multi-modal Integration

    Get PDF
    Single-cell technologies can readily measure the expression of thousands of molecular features from individual cells undergoing dynamic biological processes, such as cellular differentiation, immune response, and disease progression. While computational trajectory inference methods and RNA velocity approaches have been developed to study how subtle changes in gene or protein expression impact cell fate decision-making, the characteristic features that drive continuous biological processes remain difficult to detect due to the inherent biological or technical challenges associated with single-cell data. Here, we developed two data representation-based approaches for improving inference of cellular dynamics. First, we present DELVE, an unsupervised feature selection method for identifying a representative subset of dynamically-expressed molecular features that resolve cellular trajectories in noisy data. In contrast to previous work, DELVE uses a bottom-up approach to mitigate the effect of unwanted sources of variation confounding inference and models cell states from dynamic feature modules that constitute core regulatory complexes. Using simulations, single-cell RNA sequencing data, and iterative immunofluorescence imaging data in the context of the cell cycle and cellular differentiation, we demonstrate that DELVE selects genes or proteins that more accurately characterize cell populations and improve the recovery of cell type transitions. Next, we present the first task-oriented benchmarking study that investigates integration of temporal gene expression modalities for dynamic cell state prediction. We benchmark ten multi-modal integration approaches on ten datasets spanning different biological contexts, sequencing technologies, and species. This study illustrates how temporal gene expression modalities can be optimally combined to improve inference of cellular trajectories and more accurately predict sample-associated perturbation and disease phenotypes. Lastly, we illustrate an application of these approaches and perform an integrative analysis of gene expression and RNA velocity data to study the crosstalk between signaling pathways that govern the mesendoderm fate decision during directed definitive endoderm differentiation. Results of this study suggest that lineage-specific, temporally expressed genes within the primitive streak may serve as a potential target for increasing definitive endoderm efficiency. Collectively, this work uses scalable data-driven approaches to effectively manage the inherent biological or technical challenges associated with single-cell data in order to improve inference of cellular dynamics.
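    As a rough illustration of what selecting "dynamically-expressed" features means, one crude stand-in (not DELVE's bottom-up module construction) is to rank genes by how strongly their expression tracks a pseudotime ordering:

        import numpy as np
        from scipy.stats import spearmanr

        def rank_dynamic_features(X, pseudotime, n_select=50):
            """Score each gene by |Spearman rho| between expression and
            pseudotime, keeping the top n_select. DELVE instead builds
            dynamic feature modules bottom-up to be robust to noise and
            confounding variation; this is only the naive version."""
            scores = np.empty(X.shape[1])
            for g in range(X.shape[1]):
                rho, _ = spearmanr(X[:, g], pseudotime)
                scores[g] = abs(rho)
            return np.argsort(np.nan_to_num(scores))[::-1][:n_select]

        # Usage on toy data: 1,000 cells x 200 genes with a known ordering.
        top_genes = rank_dynamic_features(np.random.rand(1000, 200),
                                          np.linspace(0, 1, 1000))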

    Applied Randomized Algorithms for Efficient Genomic Analysis

    Get PDF
    The scope and scale of biological data continue to grow at an exponential clip, driven by advances in genetic sequencing, annotation, and the widespread adoption of surveillance efforts. For instance, the Sequence Read Archive (SRA) now contains more than 25 petabases of public data, while RefSeq, a collection of reference genomes, recently surpassed 100,000 complete genomes. In the process, biological data has outgrown the practical reach of many traditional algorithmic approaches in both time and space. Motivated by this extreme scale, this thesis details efficient methods for clustering and summarizing large collections of sequence data. While our primary area of interest is biological sequences, these approaches largely apply to sequence collections of any type, including natural language, software source code, and graph-structured data. We applied recent advances in randomized algorithms to practical problems. We used MinHash and HyperLogLog, both examples of Locality-Sensitive Hashing, as well as coresets, which are approximate representations for finite sum problems, to build methods capable of scaling to billions of items. Ultimately, these are all derived from variations on sampling. We combined these advances with hardware-based optimizations and incorporated them into free and open-source software libraries (sketch, frp, libsimdsampling) and practical software tools built on these libraries (Dashing, Minicore, Dashing 2), empowering users to interact practically with colossal datasets on commodity hardware.
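    To ground the sampling idea, here is a minimal MinHash sketch in Python for estimating the Jaccard similarity of two k-mer sets, the principle behind tools like Dashing, though the real implementations are far more engineered.

        import hashlib

        def minhash_signature(items, num_hashes=128):
            """For each salted hash function, keep the minimum hash value
            over the set; the fraction of matching slots between two
            signatures estimates the Jaccard similarity of the sets."""
            def h(salt, item):
                data = f"{salt}:{item}".encode()
                return int.from_bytes(
                    hashlib.blake2b(data, digest_size=8).digest(), "big")
            return [min(h(s, it) for it in items) for s in range(num_hashes)]

        def estimate_jaccard(sig_a, sig_b):
            return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

        # Usage: compare two sequences via their 5-mer sets.
        def kmers(seq, k=5):
            return {seq[i:i + k] for i in range(len(seq) - k + 1)}

        a = minhash_signature(kmers("ACGTACGTAGCTAGCATCGGATC"))
        b = minhash_signature(kmers("ACGTACGTAGCTAGGATCGGATC"))
        print(estimate_jaccard(a, b))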

    Mapping the proteome with data-driven methods: A cycle of measurement, modeling, hypothesis generation, and engineering

    Get PDF
    The living cell exhibits emergence of complex behavior, and its modeling requires a systemic, integrative approach if we are to thoroughly understand and harness it. The work in this thesis has had the narrower aim of quantitatively characterizing and mapping the proteome using data-driven methods, as proteins perform most functional and structural roles within the cell. Covered are the different parts of the cycle, from improving quantification methods, to deriving protein features relying on their primary structure, predicting the protein content solely from sequence data, and, finally, developing theoretical protein engineering tools, leading back to experiment.

    High-throughput mass spectrometry platforms provide detailed snapshots of a cell's protein content, which can be mined towards understanding how the phenotype arises from genotype and the interplay between the various properties of the constituent proteins. However, these large and dense data present an increased analysis challenge, and current methods capture only a small fraction of the signal. The first part of my work has involved tackling these issues with the implementation of a GPU-accelerated and distributed signal decomposition pipeline, making factorization of large proteomics scans feasible and efficient. The pipeline yields individual analyte signals spanning the majority of the acquired signal, enabling high-precision quantification and further analytical tasks.

    Having such detailed snapshots of the proteome enables a multitude of undertakings. One application has been to use a deep neural network model to learn the amino acid sequence determinants of temperature adaptation, in the form of reusable deep model features. More generally, systemic quantities may be predicted from the information encoded in sequence by evolutionary pressure. Two studies taking inspiration from natural language processing have sought to learn the grammars behind the languages of expression, in one case predicting mRNA levels from DNA sequence, and in the other protein abundance from amino acid sequence. These two models helped build a quantitative understanding of the central dogma and, furthermore, in combination yielded an improved predictor of protein amount. Finally, a mathematical framework relying on the embedded space of a deep model has been constructed to assist guided mutation of proteins towards optimizing their abundance.
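    The decomposition step can be pictured with an off-the-shelf non-negative matrix factorization on a toy intensity matrix, as in the Python sketch below. The thesis pipeline is a GPU-accelerated, distributed factorization, so this is only a conceptual stand-in and all shapes are made up.

        import numpy as np
        from sklearn.decomposition import NMF

        # Toy stand-in for a mass-spectrometry scan: rows are retention-time
        # points, columns are m/z bins, entries are signal intensities.
        scan = np.random.rand(2000, 400)

        # Factor the scan into additive parts: each component pairs an
        # elution profile (column of W) with a spectral signature (row of H).
        model = NMF(n_components=25, init="nndsvda", max_iter=300)
        W = model.fit_transform(scan)   # time x components
        H = model.components_           # components x m/z
        # scan is approximated by W @ H, one rank-1 signal per analyte.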

    Regionalized choroid plexus-cerebrospinal fluid factors and effect of DNA Ligase IV deficiency in the developing mammalian brain

    Full text link
    Fundamental to mammalian brain development is the integration of cell-intrinsic and extrinsic signals that direct the proliferation and differentiation of neural stem cells. Precise expression of transcription factors together with other intracellular components instructs progenitor cell fate, whereas interaction with extracellular signaling factors refines this process. We have elucidated the composition of the cerebrospinal fluid, which is the source of multiple extrinsic cues during brain development. The choroid plexus, a highly vascularized tissue located in each ventricle of the brain, actively secretes cerebrospinal fluid. By RNA sequencing, we obtained transcriptome data on the choroid plexi from the lateral and fourth ventricles of the mouse brain and discovered that they include transcripts unique to each tissue. Transcription factor expression in the macaque and human choroid plexi suggests that the positional identities of these tissues are conserved in the primate brain. Based on the transcriptional results, we defined the choroid plexus secretome, a prediction of secreted factors from the choroid plexus. By quantitative mass spectrometry, we detected proteins secreted by each choroid plexus, and comparison of these proteomic results with transcriptional profiling suggests that choroid plexus transcriptomes contribute to the availability of regionalized cerebrospinal fluid factors during development.

    In the second part of my dissertation research, I studied the role of DNA repair mechanisms in regulating neural stem cells. These studies focused on DNA Ligase IV, an essential component of DNA double-stranded break repair, during cerebral cortical development. Deficiency of Ligase IV activity caused by a missense mutation leads to Ligase IV syndrome, in which a key clinical feature is microcephaly. Using the Lig4 R278H mouse mutant, we found increased cell death in the developing cortex, contributing to reduced cortical thickness and cellularity in the anterior cerebral cortex. These results indicate that DNA Ligase IV is essential for proper cortical development. Together, these findings illustrate the complexity of the regulatory mechanisms that guide brain development, requiring the integration of mechanisms from within and outside the cell. We have investigated two such mechanisms, extrinsic cues from regionalized cerebrospinal fluid and DNA Ligase IV. These results should provide greater insight into mechanisms of normal brain development and neuropathological states.

    Doctor of Philosophy

    Get PDF
    This dissertation establishes a new visualization design process model devised to guide visualization designers in building more effective and useful visualization systems and tools. The novelty of this framework includes its flexibility for iteration, its actionability in guiding visualization designers with concrete steps, its concise yet methodical definitions, and its connections to other visualization design models commonly used in the field of data visualization. In summary, the design activity framework breaks down the visualization design process into a series of four design activities: understand, ideate, make, and deploy. For each activity, the framework prescribes a descriptive motivation, a list of design methods, and expected visualization artifacts. To elucidate the framework, two case studies for visualization design illustrate these concepts, methods, and artifacts in real-world projects in the field of cybersecurity. For example, these projects employ user-centered design methods, such as personas and data sketches, which emphasize our teams' motivations and visualization artifacts with respect to the design activity framework. These case studies also serve as examples for novice visualization designers, and we hypothesized that the framework could serve as a pedagogical tool for teaching and guiding novices through their own design process to create a visualization tool. To externally evaluate the efficacy of this framework, we created worksheets for each design activity, outlining a series of concrete, tangible steps for novices. In order to validate the design worksheets, we conducted 13 student observations over the course of two months, received 32 online survey responses, and performed a qualitative analysis of 11 in-depth interviews. Students found the worksheets both useful and effective for framing the visualization design process. Next, by applying the design activity framework to technique-driven and evaluation-based research projects, we brainstormed possible extensions to the design model. Lastly, we examined implications of the design activity framework and present future work in this space. The visualization community is challenged to consider how to more effectively describe, capture, and communicate the complex, iterative nature of data visualization design throughout the research, design, development, and deployment of visualization systems and tools.