1,703 research outputs found

    Computational methods for large-scale single-cell RNA-seq and multimodal data

    Get PDF
    Emerging single-cell genomics technologies such as single-cell RNA-seq (scRNA-seq) and single-cell ATAC-seq provide new opportunities for discovering previously unknown cell types, facilitating the study of biological processes such as tumor progression, and delineating differences in molecular mechanisms between species. Due to the high dimensionality of the data produced by these technologies, computation and mathematics have been the cornerstone of decoding meaningful information from the data. Computational models have been challenged by the exponential growth of the data, driven by the continuing decrease in sequencing costs and the growth of large-scale genomic projects such as the Human Cell Atlas. In addition, recent single-cell technologies have enabled us to measure multiple modalities, such as the transcriptome, proteome, and epigenome, in the same cell. This requires new computational methods that can cope with multiple layers of the data. To address these challenges, the main goal of this thesis was to develop computational methods and mathematical models for analyzing large-scale scRNA-seq and multimodal omics data. In particular, I have focused on fundamental single-cell analyses such as clustering and visualization. The most common task in scRNA-seq data analysis is the identification of cell types. Numerous methods have been proposed for this problem, with a current focus on methods for the analysis of large-scale scRNA-seq data. I developed Specter, a computational method that utilizes recent algorithmic advances in fast spectral clustering and ensemble learning. Specter achieves a substantial improvement in accuracy over existing methods and identifies rare cell types with high sensitivity. Specter allows us to process a dataset comprising 2 million cells in just 26 minutes. Moreover, the analysis of CITE-seq data, which simultaneously provides gene expression and protein levels, showed that Specter is able to incorporate multimodal omics measurements to resolve subtle transcriptomic differences between subpopulations of cells. Specter thus handles big data effectively for clustering analysis; the question is how to cope with big data in other downstream analyses such as trajectory inference and data integration. The simplest scheme is to shrink the data by selecting a subset of cells (the sketch) that best represents the full data set. I therefore developed an algorithm called Sphetcher that makes use of a thresholding technique to efficiently pick representative cells that evenly cover the transcriptomic space occupied by the original data set. I showed that the sketch computed by Sphetcher constitutes a more accurate representation of the original transcriptomic landscape than those of existing methods, leading to a more balanced composition of cell types and a larger fraction of rare cell types in the sketch. Sphetcher bridges the gap between the scalability of computational methods and the volume of the data. Moreover, I demonstrated that Sphetcher can incorporate prior information (e.g., cell labels) to inform the inference of the trajectory of human skeletal muscle myoblast differentiation. Biological processes such as development, differentiation, and the cell cycle can be monitored by performing single-cell sequencing at different time points, each corresponding to a snapshot of the process. A class of computational methods called trajectory inference aims to reconstruct the developmental trajectories from these snapshots.
Trajectory inference (TI) methods such as Monocle can computationally infer a pseudotime variable that serves as a proxy for developmental time. In order to compare two trajectories inferred by TI methods, we need to align the pseudotime between them. Current methods for aligning trajectories are based on the concept of dynamic time warping, which is limited to simple linear trajectories. Since complex trajectories are common in developmental processes, I adopted arboreal matchings to compare and align complex trajectories with multiple branch points diverting cells into alternative fates. Arboreal matchings were originally proposed in the context of phylogenetic trees, and I theoretically linked them to dynamic time warping. A suite of exact and heuristic algorithms for aligning complex trajectories was implemented in the software Trajan. When aligning single-cell trajectories describing human muscle differentiation and myogenic reprogramming, Trajan automatically identifies the core paths from which we are able to reproduce recently reported barriers to reprogramming. In a perturbation experiment, I showed that Trajan correctly maps identical cells in a global view of trajectories, as opposed to a pairwise application of dynamic time warping. Visualization using dimensionality reduction techniques such as t-SNE and UMAP is a fundamental step in the analysis of high-dimensional data. Visualization has played a pivotal role in discovering dynamic trends in single-cell genomics data. I developed j-SNE and j-UMAP as their generalizations for the joint visualization of multimodal omics data, e.g., CITE-seq data. The approach automatically learns the relative importance of each modality in order to obtain a concise representation of the data. Compared with conventional approaches, I demonstrated that j-SNE and j-UMAP produce unified embeddings that better agree with known cell types and that harmonize RNA and protein velocity landscapes.
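
    The abstract does not spell out Sphetcher's thresholding algorithm, so the following is only a minimal sketch of the general sketching idea it describes: greedily picking cells that spread evenly over a low-dimensional (e.g., PCA) representation of the data. The function name farthest_point_sketch and all sizes are hypothetical, and greedy farthest-point sampling is a stand-in, not the method implemented in Sphetcher.

```python
import numpy as np

def farthest_point_sketch(X, sketch_size, seed=0):
    """Pick `sketch_size` rows of X (cells x PCs) that spread out evenly
    over the space, by greedy farthest-point sampling."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    chosen = [int(rng.integers(n))]                  # start from a random cell
    dist = np.linalg.norm(X - X[chosen[0]], axis=1)  # distance to the sketch so far
    for _ in range(sketch_size - 1):
        nxt = int(np.argmax(dist))                   # cell farthest from the sketch
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(chosen)

# toy usage: 10,000 "cells" in a 50-dimensional PCA space
X = np.random.randn(10_000, 50)
idx = farthest_point_sketch(X, sketch_size=500)
sketch = X[idx]
```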
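
    The trajectory-alignment part of the abstract contrasts arboreal matchings with dynamic time warping (DTW). As a point of reference, here is a minimal, self-contained implementation of classic pairwise DTW on two pseudotime-ordered 1-D profiles; it is the baseline the abstract argues is limited to simple linear trajectories, not Trajan's arboreal matching. The toy profiles at the end are hypothetical.

```python
import numpy as np

def dtw_align(a, b):
    """Classic dynamic time warping between two pseudotime-ordered 1-D
    expression profiles a and b; returns the alignment cost and the path."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # backtrack to recover which points were matched
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]

# toy usage: two noisy samplings of the same expression trend
t = np.linspace(0, 1, 100)
cost, path = dtw_align(np.sin(2 * t), np.sin(2 * t[::2]))
```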

    Mining topological structure in graphs through forest representations

    Get PDF
    We consider the problem of inferring simplified topological substructures—which we term backbones—in metric and non-metric graphs. Intuitively, these are subgraphs with ‘few’ nodes, multifurcations, and cycles, that model the topology of the original graph well. We present a multistep procedure for inferring these backbones. First, we encode local (geometric) information of each vertex in the original graph by means of the boundary coefficient (BC) to identify ‘core’ nodes in the graph. Next, we construct a forest representation of the graph, termed an f-pine, that connects every node of the graph to a local ‘core’ node. The final backbone is then inferred from the f-pine through CLOF (Constrained Leaves Optimal subForest), a novel graph optimization problem we introduce in this paper. On a theoretical level, we show that CLOF is NP-hard for general graphs. However, we prove that CLOF can be efficiently solved for forest graphs, a surprising fact given that CLOF induces a nontrivial monotone submodular set function maximization problem on tree graphs. This result is the basis of our method for mining backbones in graphs through forest representation. We qualitatively and quantitatively confirm the applicability, effectiveness, and scalability of our method for discovering backbones in a variety of graph-structured data, such as social networks, earthquake locations scattered across the Earth, and high-dimensional cell trajectory data.
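
    CLOF itself is defined in the paper and is not reproduced here; the snippet below is only a rough illustration, under stated assumptions, of the general pipeline of summarizing a weighted graph by a tree-like backbone: compute a spanning forest, then prune uninformative leaf branches. The function crude_backbone, its pruning rule, and the toy graph are hypothetical simplifications, not the f-pine/CLOF procedure.

```python
import networkx as nx

def crude_backbone(G, keep_leaves=3):
    """Very rough stand-in for a backbone: take a minimum spanning tree of a
    weighted graph, then greedily prune leaves with the cheapest incident
    edges until only `keep_leaves` leaves remain."""
    T = nx.minimum_spanning_tree(G, weight="weight")
    while True:
        leaves = [v for v in T if T.degree(v) == 1]
        if len(leaves) <= keep_leaves or T.number_of_nodes() <= 2:
            break
        # drop the leaf whose incident edge is cheapest (least informative branch)
        cheapest = min(leaves, key=lambda v: next(iter(T.edges(v, data="weight")))[2])
        T.remove_node(cheapest)
    return T

# toy usage: the largest component of a random geometric graph with Euclidean edge weights
G = nx.random_geometric_graph(200, 0.15, seed=1)
G = G.subgraph(max(nx.connected_components(G), key=len)).copy()
pos = nx.get_node_attributes(G, "pos")
for u, v in G.edges:
    G[u][v]["weight"] = sum((pos[u][k] - pos[v][k]) ** 2 for k in range(2)) ** 0.5
backbone = crude_backbone(G, keep_leaves=4)
```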

    From Caenorhabditis elegans to the Human Connectome: A Specific Modular Organisation Increases Metabolic, Functional, and Developmental Efficiency

    Full text link
    The connectome, the entire connectivity of a neural system represented as a network, spans scales ranging from synaptic connections between individual neurons to fibre tract connections between brain regions. Although the modularity these networks commonly show has been extensively studied, it is unclear whether their connection specificity can be fully explained by modularity alone. To answer this question, we study two networks: the neuronal network of C. elegans and the fibre tract network of the human brain obtained through diffusion spectrum imaging (DSI). We compare them to benchmark networks with varying modularities, generated by link swapping to reach desired modularity values while being otherwise maximally random. We find several network properties that are specific to the neural networks and cannot be fully explained by modularity alone. First, the clustering coefficient and the characteristic path length of the C. elegans and human connectomes are both higher than those of benchmark networks with similar modularity. A high clustering coefficient indicates efficient local information distribution, while a high characteristic path length suggests reduced global integration. Second, the total wiring length is smaller than that of alternative configurations with similar modularity. This is due to a lower dispersion of connections: each neuron in the C. elegans connectome, or each region of interest (ROI) in the human connectome, reaches fewer ganglia or cortical areas, respectively. Third, both neural networks show lower algorithmic entropy than the alternative arrangements, which implies that fewer rules are needed to encode the organisation of these neural systems.
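
    The quantities compared in the abstract (clustering coefficient, characteristic path length, link-swapped benchmarks) can be illustrated with standard graph tools. The sketch below uses networkx on a hypothetical small-world graph and plain degree-preserving double-edge swaps; unlike the paper's benchmarks, the swaps here are not constrained to hit a target modularity.

```python
import networkx as nx

# toy "connectome": a small-world graph standing in for a neural network
G = nx.connected_watts_strogatz_graph(100, 6, 0.1, seed=0)

# degree-preserving benchmark obtained by link swapping (no modularity target here)
R = G.copy()
nx.double_edge_swap(R, nswap=500, max_tries=10_000, seed=0)

for name, net in [("observed", G), ("swapped benchmark", R)]:
    cc = nx.average_clustering(net)                 # local clustering
    # characteristic path length needs a connected graph; fall back to the
    # largest connected component in case swapping disconnected it
    comp = net.subgraph(max(nx.connected_components(net), key=len))
    cpl = nx.average_shortest_path_length(comp)
    print(f"{name}: clustering={cc:.3f}, characteristic path length={cpl:.3f}")
```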

    Petuum: A New Platform for Distributed Machine Learning on Big Data

    Full text link
    What is a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial-scale problems, using Big Models (up to 100s of billions of parameters) on Big Data (up to terabytes or petabytes)? Modern parallelization strategies employ fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or even specialized graph-based execution that relies on graph representations of ML programs. The variety of approaches tends to pull systems and algorithms design in different directions, and it remains difficult to find a universal platform applicable to a wide range of ML programs at scale. We propose a general-purpose framework that systematically addresses data- and model-parallel challenges in large-scale ML by observing that many ML programs are fundamentally optimization-centric and admit error-tolerant, iterative-convergent algorithmic solutions. This presents unique opportunities for an integrative system design, such as bounded-error network synchronization and dynamic scheduling based on ML program structure. We demonstrate the efficacy of these system designs versus well-known implementations of modern ML algorithms, allowing ML programs to run in much less time and at considerably larger model sizes, even on modestly-sized compute clusters. Comment: 15 pages, 10 figures, final version in KDD 2015 under the same title.
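
    Petuum's actual programming interface is not shown in the abstract, so the following is only a toy, single-process simulation of the bounded-staleness idea behind "bounded-error network synchronization": workers apply gradient updates computed from parameter views that may lag the freshest state by at most a fixed number of clock ticks. All names and constants are hypothetical and unrelated to Petuum's API.

```python
import numpy as np

def ssp_sgd(grads_fn, dim, n_workers=4, staleness=2, iters=50, lr=0.1):
    """Toy single-process simulation of stale synchronous parallel (SSP) SGD:
    each worker computes gradients against a parameter copy that may lag the
    freshest parameters by at most `staleness` clock ticks."""
    params = np.zeros(dim)
    clocks = [0] * n_workers
    views = [params.copy() for _ in range(n_workers)]    # possibly stale reads
    for t in range(iters):
        for w in range(n_workers):
            # refresh the worker's view only when it would exceed the staleness bound
            if t - clocks[w] >= staleness:
                views[w] = params.copy()
                clocks[w] = t
            params -= lr * grads_fn(views[w])             # apply the (stale) update
    return params

# toy usage: minimize ||x - 1||^2 with noisy gradients
rng = np.random.default_rng(0)
grad = lambda x: 2 * (x - 1.0) + 0.01 * rng.standard_normal(x.shape)
print(ssp_sgd(grad, dim=5))
```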

    Biclustering on expression data: A review

    Get PDF
    Biclustering has become a popular technique for the study of gene expression data, especially for discovering functionally related gene sets under different subsets of experimental conditions. Most biclustering approaches use a measure or cost function that determines the quality of biclusters. In such cases, the development of both a suitable heuristic and a good measure for guiding the search is essential for discovering interesting biclusters in an expression matrix. Nevertheless, not all existing biclustering approaches base their search on evaluation measures for biclusters. There exists a diverse set of biclustering tools that follow different strategies and algorithmic concepts to guide the search towards meaningful results. In this paper we present an extensive survey of biclustering approaches, classifying them into two categories according to whether or not they use evaluation metrics within the search method: biclustering algorithms based on evaluation measures and non-metric-based biclustering algorithms. In both cases, they have been classified according to the type of meta-heuristics on which they are based. Ministerio de Economía y Competitividad TIN2011-2895
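
    As one concrete example of the evaluation measures by which the survey classifies biclustering algorithms, the mean squared residue (MSR) of Cheng and Church scores the coherence of a bicluster; a sketch is given below, with a hypothetical toy matrix.

```python
import numpy as np

def mean_squared_residue(expr, rows, cols):
    """Cheng & Church mean squared residue (MSR) of a bicluster: the lower the
    score, the more coherent the selected rows (genes) and columns (conditions)."""
    B = expr[np.ix_(rows, cols)]
    row_means = B.mean(axis=1, keepdims=True)   # a_iJ
    col_means = B.mean(axis=0, keepdims=True)   # a_Ij
    overall = B.mean()                          # a_IJ
    residue = B - row_means - col_means + overall
    return float((residue ** 2).mean())

# toy usage: a perfectly additive bicluster has MSR exactly 0
expr = np.add.outer(np.arange(5.0), np.arange(4.0))  # rows + cols pattern
print(mean_squared_residue(expr, rows=[0, 2, 4], cols=[1, 3]))  # -> 0.0
```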

    Preparation and characterization of magnetite (Fe3O4) nanoparticles By Sol-Gel method

    Get PDF
    Magnetite (Fe3O4) nanoparticles were successfully synthesized and annealed under vacuum at different temperatures. The Fe3O4 nanoparticles, prepared via a sol-gel assisted method and annealed at 200-400 °C, were characterized by Fourier Transform Infrared Spectroscopy (FTIR), X-ray Diffraction (XRD), Field Emission Scanning Electron Microscopy (FESEM), and Atomic Force Microscopy (AFM). The XRD results indicate the presence of Fe3O4 nanoparticles, and the mean particle size calculated from the Scherrer formula lies in the range of 2-25 nm. The FESEM results show that the morphologies of the particles annealed at 400 °C are more spherical and partially agglomerated, while the EDS results indicate the presence of Fe3O4 by showing the Fe-O group of elements. AFM was used to analyze the 3D topography and roughness of the sample; the Fe3O4 nanoparticles have a minimum diameter of 79.04 nm, which is in agreement with the FESEM results. According to some of the literature, the synthesis of Fe3O4 nanoparticles using FeCl3 and FeCl2 has in many cases not been achieved, but this work was able to obtain Fe3O4 nanoparticles based on the characterization results.
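
    The crystallite size quoted above comes from the Scherrer equation, D = Kλ/(β·cosθ). The snippet below shows the calculation for a hypothetical peak; the shape factor K = 0.9, the Cu Kα wavelength, and the example peak position and FWHM are assumptions, not values reported in this work.

```python
import math

def scherrer_size(two_theta_deg, fwhm_deg, wavelength_nm=0.15406, k=0.9):
    """Estimate crystallite size (nm) from an XRD peak via the Scherrer
    equation D = K*lambda / (beta * cos(theta)); beta is the peak FWHM."""
    theta = math.radians(two_theta_deg / 2.0)
    beta = math.radians(fwhm_deg)            # FWHM converted to radians
    return k * wavelength_nm / (beta * math.cos(theta))

# toy usage: a hypothetical Fe3O4 (311) reflection near 2*theta = 35.5 degrees
print(f"{scherrer_size(two_theta_deg=35.5, fwhm_deg=0.8):.1f} nm")
```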

    Machine Learning Based Applications for Data Visualization, Modeling, Control, and Optimization for Chemical and Biological Systems

    Get PDF
    This dissertation covers Yan Ma’s Ph.D. research on applications of machine learning in manufacturing and biological systems. The work mainly focuses on reaction modeling, optimization, and control using deep learning-based approaches, in particular deep reinforcement learning (DRL). Yan Ma’s research also involves data mining in bioinformatics: large-scale RNA-seq data are analyzed using dimensionality reduction with Principal Component Analysis (PCA) and the non-linear methods t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP), followed by clustering analysis using k-Means and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). The report focuses on three case studies of DRL-based optimization and control: a polymerization reaction controlled with deep reinforcement learning, a bioreactor optimization, and a fed-batch reaction optimization for a reactor at Dow Inc. In the first study, a data-driven controller based on DRL is developed for a fed-batch polymerization reaction with multiple continuous manipulated variables. The second case study is the modeling and optimization of a bioreactor. In this study, a data-driven reaction model is developed using an Artificial Neural Network (ANN) to simulate the growth curve and bio-product accumulation of the cyanobacterium Plectonema. A DRL control agent that optimizes the daily nutrient input is then applied to maximize the yield of the valuable bio-product C-phycocyanin; in experimental validation, the C-phycocyanin yield is increased by 52.1% compared to a control group with the same total nutrient content. The third case study employs the data-driven control scheme to optimize a reactor at Dow Inc., where a DRL-based optimization framework is established for the Multi-Input, Multi-Output (MIMO) reaction system with reaction surrogate modeling. Overall, Yan Ma’s research shows promising directions for employing emerging data-driven and deep learning methods in manufacturing and biological systems. It demonstrates that DRL is an efficient algorithm across three different reaction systems with both stochastic and deterministic policies, and that data-driven models are well suited to reaction simulation owing to the non-linear nature and fast computational speed of neural network models.
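
    The DRL agents and the Dow reactor model are not described in enough detail in the abstract to reproduce here; the sketch below instead optimizes a daily nutrient-dosing schedule against a hypothetical toy bioreactor model using the cross-entropy method, a simple derivative-free stand-in for the DRL policy optimization described above. The simulator, reward terms, and all constants are invented for illustration.

```python
import numpy as np

def toy_yield(doses):
    """Very rough stand-in for a bioreactor simulator: product yield as a
    function of a 10-day nutrient dosing schedule (overfeeding is penalized)."""
    biomass, product = 1.0, 0.0
    for d in doses:
        growth = d / (0.5 + d)                       # saturating nutrient uptake
        biomass *= 1.0 + 0.3 * growth
        product += 0.1 * biomass - 0.05 * d ** 2     # cost of overfeeding
    return product

def cem_optimize(horizon=10, iters=40, pop=64, elite=8, seed=0):
    """Cross-entropy method over dosing schedules, used here as a simple
    derivative-free stand-in for DRL-based policy optimization."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.full(horizon, 1.0), np.full(horizon, 0.5)
    for _ in range(iters):
        cand = np.clip(rng.normal(mu, sigma, size=(pop, horizon)), 0.0, 3.0)
        scores = np.array([toy_yield(c) for c in cand])
        best = cand[np.argsort(scores)[-elite:]]     # keep the elite schedules
        mu, sigma = best.mean(axis=0), best.std(axis=0) + 1e-3
    return mu, toy_yield(mu)

schedule, yield_est = cem_optimize()
print(np.round(schedule, 2), round(float(yield_est), 3))
```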