    Clustering Single-cell RNA-sequencing Data based on Matching Clusters Structures

    Single-cell sequencing technology can generate RNA-sequencing data at the single-cell level, and one important task in single-cell RNA-sequencing data analysis is to identify cell types without supervised information. Clustering is an unsupervised approach that can yield new biological insights, especially when exploring the biological functions of specific cell types. However, it is challenging for traditional clustering methods to obtain high-quality cell type recognition results. In this research, we propose a novel Clustering method based on Matching Clusters Structures (MCSC) for identifying cell types in single-cell RNA-sequencing data. Firstly, MCSC runs the same K-means algorithm twice; because the initial centroids are randomly selected, the two runs produce two different groups of clustering results. Then, for one group, MCSC uses shared nearest neighbour information to calculate a label transition matrix, which gives the label transition probability between any two initial clusters. Each initial cluster may be reassigned if the merged result after label transition satisfies a consensus function that maximizes the structural matching degree between the two groups of clustering results. In essence, MCSC may be interpreted as a label training process. We evaluate the proposed MCSC on five commonly used datasets and compare it with several classical and state-of-the-art algorithms. The experimental results show that MCSC outperforms the other algorithms.
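
The first step described above (two K-means runs that differ only in their random initial centroids) and the idea of scoring how well two groupings match can be sketched as follows. This is a minimal illustration, not the authors' MCSC: the function names `kmeans` and `matching_degree`, the toy points, and the purity-style agreement score are all assumptions, and the shared-nearest-neighbour label transition matrix is not implemented.

```python
import random
from collections import Counter

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed, iters=50):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels

def matching_degree(labels_a, labels_b):
    """Hypothetical agreement score: for each cluster in A, count its largest
    overlap with any cluster in B, then divide the total by the number of points."""
    overlap = Counter(zip(labels_a, labels_b))
    best = {}
    for (a, _b), count in overlap.items():
        best[a] = max(best.get(a, 0), count)
    return sum(best.values()) / len(labels_a)

# Toy data (invented): two small groups of 2-D points.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
run_a = kmeans(points, k=2, seed=0)
run_b = kmeans(points, k=2, seed=1)
print(matching_degree(run_a, run_b))
```

The score is invariant to label permutation, which matters here because two K-means runs can number the same clusters differently.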

    Statistical metamodeling of dynamic network loading

    Dynamic traffic assignment (DTA) models rely on a network performance module known as dynamic network loading (DNL), which expresses flow propagation, flow conservation, and travel delay at a network level. The DNL defines the so-called network delay operator, which maps a set of path departure rates to a set of path travel times (or costs). It is widely known that the delay operator is not available in closed form, and has undesirable properties that severely complicate DTA analysis and computation, such as discontinuity, non-differentiability, non-monotonicity, and computational inefficiency. This paper proposes a fresh take on this important and difficult issue, by providing a class of surrogate DNL models based on a statistical learning method known as Kriging. We present a metamodeling framework that systematically approximates DNL models and is flexible in the sense of allowing the modeler to make trade-offs among model granularity, complexity, and accuracy. It is shown that such surrogate DNL models yield highly accurate approximations (with errors below 8%) and superior computational efficiency (9 to 455 times faster than conventional DNL procedures such as those based on the link transmission model). Moreover, these approximate DNL models admit closed-form and analytical delay operators, which are Lipschitz continuous and infinitely differentiable, with closed-form Jacobians. We provide in-depth discussions on the implications of these properties for DTA research and model applications.
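
To illustrate why a Kriging surrogate has a closed-form, smooth predictor with an analytical derivative, here is a one-dimensional sketch. It assumes a zero-mean model with a Gaussian (RBF) kernel; the sample "departure rate" and "travel time" values, the kernel choice, the length scale, and all function names are invented for illustration and are not taken from the paper.

```python
import math

def rbf(a, b, length=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * length ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit(xs, ys, length=1.0, nugget=1e-10):
    """Return weights w = (K + nugget * I)^-1 y for the sampled points."""
    K = [[rbf(a, b, length) + (nugget if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    return solve(K, ys)

def predict(x, xs, weights, length=1.0):
    """Closed-form surrogate: a weighted sum of kernels centred on the samples."""
    return sum(w * rbf(x, xi, length) for w, xi in zip(weights, xs))

def gradient(x, xs, weights, length=1.0):
    """Analytical derivative of the surrogate with respect to x."""
    return sum(w * rbf(x, xi, length) * (-(x - xi) / length ** 2)
               for w, xi in zip(weights, xs))

# Invented samples standing in for expensive DNL runs.
xs = [0.0, 1.0, 2.0, 3.0]   # hypothetical departure rates
ys = [1.0, 1.5, 2.7, 4.0]   # hypothetical travel times
weights = fit(xs, ys)
print(predict(1.5, xs, weights), gradient(1.5, xs, weights))
```

Once the weights are fitted, each prediction costs only a handful of kernel evaluations, and the derivative comes from differentiating the kernel rather than from finite differences.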

    A Theoretical Analysis of Why Hybrid Ensembles Work

    Inspired by the group decision-making process, ensembles or combinations of classifiers have been found favorable in a wide variety of application domains. Some researchers propose mixing two different types of classification algorithms to create a hybrid ensemble. Why does such an ensemble work? The question remains open. Following the concept of diversity, which is one of the fundamental elements of the success of ensembles, we conduct a theoretical analysis of why hybrid ensembles work, linking the use of different algorithms to gains in accuracy. We also conduct experiments on the classification performance of hybrid ensembles of classifiers created by decision tree and naïve Bayes classification algorithms, each of which is a top data mining algorithm and often used to create non-hybrid ensembles. Through this paper, we therefore provide a complement to the theoretical foundation of creating and using hybrid ensembles.
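
The link between diversity and accuracy can be made concrete with a standard idealized calculation. The sketch below computes the exact accuracy of a majority vote under the assumption that the member classifiers err independently. That assumption and the function name `majority_vote_accuracy` are illustrative choices, and the calculation is not the theoretical analysis of the paper.

```python
from itertools import product

def majority_vote_accuracy(accuracies):
    """Probability that a strict majority of independent classifiers is correct.

    accuracies: per-classifier probability of being right on a given example.
    Independence of errors is an idealizing assumption made for this sketch.
    """
    n = len(accuracies)
    total = 0.0
    # Enumerate every pattern of correct/incorrect members.
    for outcome in product([True, False], repeat=n):
        prob = 1.0
        for correct, p in zip(outcome, accuracies):
            prob *= p if correct else 1.0 - p
        if sum(outcome) > n / 2:
            total += prob
    return total

print(majority_vote_accuracy([0.7, 0.7, 0.7]))
```

With three independent members at 70% accuracy the vote is right with probability 3(0.7)^2 - 2(0.7)^3 = 0.784, whereas three members that always make the same errors would stay at 0.7. Mixing algorithm families is one way to push an ensemble toward the less correlated case.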

    Diffeomorphic Transformations for Time Series Analysis: An Efficient Approach to Nonlinear Warping

    The proliferation and ubiquity of temporal data across many disciplines has sparked interest in similarity, classification and clustering methods specifically designed to handle time series data. A core issue when dealing with time series is determining their pairwise similarity, i.e., the degree to which a given time series resembles another. Traditional distance measures such as the Euclidean distance are not well-suited due to the time-dependent nature of the data. Elastic metrics such as dynamic time warping (DTW) offer a promising approach, but are limited by their computational complexity, non-differentiability and sensitivity to noise and outliers. This thesis proposes novel elastic alignment methods that use parametric and diffeomorphic warping transformations as a means of overcoming the shortcomings of DTW-based metrics. The proposed method is differentiable and invertible, well-suited for deep learning architectures, robust to noise and outliers, computationally efficient, and expressive and flexible enough to capture complex patterns. Furthermore, a closed-form solution was developed for the gradient of these diffeomorphic transformations, which allows an efficient search in the parameter space, leading to better solutions at convergence. Leveraging the benefits of these closed-form diffeomorphic transformations, this thesis proposes a suite of advancements that include: (a) an enhanced temporal transformer network for time series alignment and averaging, (b) a deep-learning based time series classification model to simultaneously align and classify signals with high accuracy, (c) an incremental time series clustering algorithm that is warping-invariant, scalable and can operate under limited computational and time resources, and finally, (d) a normalizing flow model that enhances the flexibility of affine transformations in coupling and autoregressive layers.
    Comment: PhD Thesis, defended at the University of Navarra on July 17, 2023. 277 pages, 8 chapters, 1 appendix.
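
For context on the baseline this thesis sets out to improve, the textbook dynamic-programming form of DTW can be sketched as below. This is the classical algorithm with an absolute-difference local cost, not the diffeomorphic method proposed in the thesis; the function names and the two toy series are made up for illustration.

```python
def dtw_distance(a, b):
    """Classic DTW between two numeric sequences with |x - y| as local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def pointwise_distance(a, b):
    """Rigid comparison: sum of |a[i] - b[i]| with no warping allowed."""
    return sum(abs(x - y) for x, y in zip(a, b))

# The same bump, shifted by one time step.
series_a = [0, 0, 1, 2, 1, 0]
series_b = [0, 1, 2, 1, 0, 0]
print(dtw_distance(series_a, series_b), pointwise_distance(series_a, series_b))
```

The double loop makes the quadratic cost in the sequence lengths visible, and the `min` over discrete predecessors is the source of the non-differentiability mentioned in the abstract.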

    Identifying cell types with single cell sequencing data

    Single-cell RNA sequencing (scRNA-seq) techniques, which examine the genetic information of individual cells, provide unparalleled resolution for probing cellular heterogeneity. In contrast, traditional (bulk) RNA sequencing technologies measure the average RNA expression level of a large number of input cells, which is insufficient for studying heterogeneous systems. Hence, scRNA-seq technologies make it possible to tackle many previously inaccessible problems, such as rare cell type identification, cancer evolution and cell lineage relationship inference. Cell population identification is fundamental to the analysis of scRNA-seq data. Generally, the workflow of scRNA-seq analysis includes data processing, dropout imputation, feature selection, dimensionality reduction, similarity matrix construction and unsupervised clustering. Many single-cell clustering algorithms rely on similarity matrices of cells, but many existing studies have not achieved the expected results. Analyzing scRNA-seq data sets poses unique challenges, including a significant level of biological and technical noise, so similarity matrix construction still deserves further study. In my study, I present a new method, named Learning Sparse Similarity Matrices (LSSM), to construct cell-cell similarity matrices, and then apply several clustering methods to identify cell populations from scRNA-seq data. Firstly, based on sparse subspace theory, the relationship between a cell and the other cells of the same cell type is expressed as a linear combination. Secondly, I construct a convex optimization objective function to find the similarity matrix, which consists of the corresponding coefficients of the linear combinations mentioned above. Thirdly, I design an algorithm that combines column-wise learning with a greedy algorithm to solve the objective function. As a result, the large optimization problem on the similarity matrix can be decomposed into a series of smaller optimization problems, one per column of the similarity matrix, and the sparsity of the whole matrix is ensured by the sparsity of each column. Fourthly, to pick the best clustering method for identifying cell populations from the LSSM similarity matrix, I apply several clustering methods separately to the similarity matrices computed by LSSM on eight scRNA-seq data sets. The clustering results show that my method performs best when combined with spectral clustering (Laplacian eigenmaps + k-means clustering). In addition, compared with five state-of-the-art methods, my method outperforms most competing methods on eight data sets. Finally, I combine LSSM with t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize the data points of scRNA-seq data in two-dimensional space. The results show that most data points from the same cell type lie close together, while those from different cell clusters are separated.
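
The column-wise idea above (express each cell as a sparse linear combination of the other cells, one column at a time) can be illustrated with a generic greedy routine. The sketch uses plain matching pursuit as a stand-in; it is not the LSSM objective or solver, and the names `sparse_column` and `similarity_matrix`, the `max_atoms` limit, the symmetrization rule, and the three toy "cells" are all assumptions made for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def sparse_column(cells, target, max_atoms=2, tol=1e-12):
    """Greedy (matching-pursuit) coefficients expressing cells[target]
    as a combination of at most max_atoms selection steps over other cells."""
    residual = list(cells[target])
    coeffs = [0.0] * len(cells)
    for _ in range(max_atoms):
        best, best_score = None, tol
        for j, atom in enumerate(cells):
            if j == target:
                continue  # a cell may not represent itself
            norm2 = dot(atom, atom)
            if norm2 == 0.0:
                continue
            score = dot(residual, atom) ** 2 / norm2  # residual energy removed
            if score > best_score:
                best, best_score = j, score
        if best is None:
            break  # nothing reduces the residual any further
        step = dot(residual, cells[best]) / dot(cells[best], cells[best])
        coeffs[best] += step
        residual = [r - step * a for r, a in zip(residual, cells[best])]
    return coeffs

def similarity_matrix(cells, max_atoms=2):
    """Solve one small problem per column, then symmetrize as |C| + |C|^T."""
    n = len(cells)
    cols = [sparse_column(cells, i, max_atoms) for i in range(n)]
    return [[abs(cols[j][i]) + abs(cols[i][j]) for j in range(n)] for i in range(n)]

# Toy expression profiles (invented): cells 0 and 1 are collinear, cell 2 is unrelated.
cells = [(2.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
print(similarity_matrix(cells))
```

Each column is solved independently, so the work splits into small problems, and capping the number of selected cells per column keeps the whole matrix sparse.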

    Deep Learning in Single-Cell Analysis

    Single-cell technologies are revolutionizing the entire field of biology. The large volumes of data generated by single-cell technologies are high-dimensional, sparse, heterogeneous, and have complicated dependency structures, making analyses using conventional machine learning approaches challenging and impractical. In tackling these challenges, deep learning often demonstrates superior performance compared to traditional machine learning methods. In this work, we give a comprehensive survey on deep learning in single-cell analysis. We first introduce background on single-cell technologies and their development, as well as fundamental concepts of deep learning including the most popular deep architectures. We present an overview of the single-cell analytic pipeline pursued in research applications while noting divergences due to data sources or specific applications. We then review seven popular tasks spanning different stages of the single-cell analysis pipeline, including multimodal integration, imputation, clustering, spatial domain identification, cell-type deconvolution, cell segmentation, and cell-type annotation. Under each task, we describe the most recent developments in classical and deep learning methods and discuss their advantages and disadvantages. Deep learning tools and benchmark datasets are also summarized for each task. Finally, we discuss the future directions and the most recent challenges. This survey will serve as a reference for biologists and computer scientists, encouraging collaborations.
    Comment: 77 pages, 11 figures, 15 tables, deep learning, single-cell analysis