47,207 research outputs found

    Graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells

    Get PDF
    Single-cell RNA-seq allows quantification of biological heterogeneity across both discrete cell types and continuous cell differentiation transitions. We present approximate graph abstraction (AGA), an algorithm that reconciles the computational analysis strategies of clustering and trajectory inference by explaining cell-to-cell variation both in terms of discrete and continuous latent variables (https://github.com/theislab/graph_abstraction). This enables to generate cellular maps of differentiation manifolds with complex topologies - efficiently and robustly across different datasets. Approximate graph abstraction quantifies the connectivity of partitions of a neighborhood graph of single cells, thereby generating a much simpler abstracted graph whose nodes label the partitions. Together with a random walk-based distance measure, this generates a topology preserving map of single cells - a partial coordinatization of data useful for exploring and explaining its variation. We use the abstracted graph to assess which subsets of data are better explained by discrete clusters than by a continuous variable, to trace gene expression changes along aggregated single-cell paths through data and to infer abstracted trees that best explain the global topology of data. We demonstrate the power of the method by reconstructing differentiation processes with high numbers of branchings from single-cell gene expression datasets and by identifying biological trajectories from single-cell imaging data using a deep-learning based distance metric. Along with the method, we introduce measures for the connectivity of graph partitions, generalize random-walk based distance measures to disconnected graphs and introduce a path-based measure for topological similarity between graphs. Graph abstraction is computationally efficient and provides speedups of at least 30 times when compared to algorithms for the inference of lineage trees

    Developments in the theory of randomized shortest paths with a comparison of graph node distances

    Get PDF
    There have lately been several suggestions for parametrized distances on a graph that generalize the shortest path distance and the commute time or resistance distance. The need for developing such distances has risen from the observation that the above-mentioned common distances in many situations fail to take into account the global structure of the graph. In this article, we develop the theory of one family of graph node distances, known as the randomized shortest path dissimilarity, which has its foundation in statistical physics. We show that the randomized shortest path dissimilarity can be easily computed in closed form for all pairs of nodes of a graph. Moreover, we come up with a new definition of a distance measure that we call the free energy distance. The free energy distance can be seen as an upgrade of the randomized shortest path dissimilarity as it defines a metric, in addition to which it satisfies the graph-geodetic property. The derivation and computation of the free energy distance are also straightforward. We then make a comparison between a set of generalized distances that interpolate between the shortest path distance and the commute time, or resistance distance. This comparison focuses on the applicability of the distances in graph node clustering and classification. The comparison, in general, shows that the parametrized distances perform well in the tasks. In particular, we see that the results obtained with the free energy distance are among the best in all the experiments.Comment: 30 pages, 4 figures, 3 table

    Node-weighted measures for complex networks with spatially embedded, sampled, or differently sized nodes

    Full text link
    When network and graph theory are used in the study of complex systems, a typically finite set of nodes of the network under consideration is frequently either explicitly or implicitly considered representative of a much larger finite or infinite region or set of objects of interest. The selection procedure, e.g., formation of a subset or some kind of discretization or aggregation, typically results in individual nodes of the studied network representing quite differently sized parts of the domain of interest. This heterogeneity may induce substantial bias and artifacts in derived network statistics. To avoid this bias, we propose an axiomatic scheme based on the idea of node splitting invariance to derive consistently weighted variants of various commonly used statistical network measures. The practical relevance and applicability of our approach is demonstrated for a number of example networks from different fields of research, and is shown to be of fundamental importance in particular in the study of spatially embedded functional networks derived from time series as studied in, e.g., neuroscience and climatology.Comment: 21 pages, 13 figure

    Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck operators

    Full text link
    This paper presents a diffusion based probabilistic interpretation of spectral clustering and dimensionality reduction algorithms that use the eigenvectors of the normalized graph Laplacian. Given the pairwise adjacency matrix of all points, we define a diffusion distance between any two data points and show that the low dimensional representation of the data by the first few eigenvectors of the corresponding Markov matrix is optimal under a certain mean squared error criterion. Furthermore, assuming that data points are random samples from a density p(\x) = e^{-U(\x)} we identify these eigenvectors as discrete approximations of eigenfunctions of a Fokker-Planck operator in a potential 2U(\x) with reflecting boundary conditions. Finally, applying known results regarding the eigenvalues and eigenfunctions of the continuous Fokker-Planck operator, we provide a mathematical justification for the success of spectral clustering and dimensional reduction algorithms based on these first few eigenvectors. This analysis elucidates, in terms of the characteristics of diffusion processes, many empirical findings regarding spectral clustering algorithms.Comment: submitted to NIPS 200

    Growing Attributed Networks through Local Processes

    Full text link
    This paper proposes an attributed network growth model. Despite the knowledge that individuals use limited resources to form connections to similar others, we lack an understanding of how local and resource-constrained mechanisms explain the emergence of rich structural properties found in real-world networks. We make three contributions. First, we propose a parsimonious and accurate model of attributed network growth that jointly explains the emergence of in-degree distributions, local clustering, clustering-degree relationship and attribute mixing patterns. Second, our model is based on biased random walks and uses local processes to form edges without recourse to global network information. Third, we account for multiple sociological phenomena: bounded rationality, structural constraints, triadic closure, attribute homophily, and preferential attachment. Our experiments indicate that the proposed Attributed Random Walk (ARW) model accurately preserves network structure and attribute mixing patterns of six real-world networks; it improves upon the performance of eight state-of-the-art models by a statistically significant margin of 2.5-10x.Comment: 11 pages, 13 figure

    Search Result Clustering via Randomized Partitioning of Query-Induced Subgraphs

    Full text link
    In this paper, we present an approach to search result clustering, using partitioning of underlying link graph. We define the notion of "query-induced subgraph" and formulate the problem of search result clustering as a problem of efficient partitioning of given subgraph into topic-related clusters. Also, we propose a novel algorithm for approximative partitioning of such graph, which results in cluster quality comparable to the one obtained by deterministic algorithms, while operating in more efficient computation time, suitable for practical implementations. Finally, we present a practical clustering search engine developed as a part of this research and use it to get results about real-world performance of proposed concepts.Comment: 16th Telecommunications Forum TELFOR 200

    An Agent-Based Algorithm exploiting Multiple Local Dissimilarities for Clusters Mining and Knowledge Discovery

    Full text link
    We propose a multi-agent algorithm able to automatically discover relevant regularities in a given dataset, determining at the same time the set of configurations of the adopted parametric dissimilarity measure yielding compact and separated clusters. Each agent operates independently by performing a Markovian random walk on a suitable weighted graph representation of the input dataset. Such a weighted graph representation is induced by the specific parameter configuration of the dissimilarity measure adopted by the agent, which searches and takes decisions autonomously for one cluster at a time. Results show that the algorithm is able to discover parameter configurations that yield a consistent and interpretable collection of clusters. Moreover, we demonstrate that our algorithm shows comparable performances with other similar state-of-the-art algorithms when facing specific clustering problems

    Do logarithmic proximity measures outperform plain ones in graph clustering?

    Full text link
    We consider a number of graph kernels and proximity measures including commute time kernel, regularized Laplacian kernel, heat kernel, exponential diffusion kernel (also called "communicability"), etc., and the corresponding distances as applied to clustering nodes in random graphs and several well-known datasets. The model of generating random graphs involves edge probabilities for the pairs of nodes that belong to the same class or different predefined classes of nodes. It turns out that in most cases, logarithmic measures (i.e., measures resulting after taking logarithm of the proximities) perform better while distinguishing underlying classes than the "plain" measures. A comparison in terms of reject curves of inter-class and intra-class distances confirms this conclusion. A similar conclusion can be made for several well-known datasets. A possible origin of this effect is that most kernels have a multiplicative nature, while the nature of distances used in cluster algorithms is an additive one (cf. the triangle inequality). The logarithmic transformation is a tool to transform the first nature to the second one. Moreover, some distances corresponding to the logarithmic measures possess a meaningful cutpoint additivity property. In our experiments, the leader is usually the logarithmic Communicability measure. However, we indicate some more complicated cases in which other measures, typically, Communicability and plain Walk, can be the winners.Comment: 11 pages, 5 tables, 9 figures. Accepted for publication in the Proceedings of 6th International Conference on Network Analysis, May 26-28, 2016, Nizhny Novgorod, Russi

    Absorbing random-walk centrality: Theory and algorithms

    Full text link
    We study a new notion of graph centrality based on absorbing random walks. Given a graph G=(V,E)G=(V,E) and a set of query nodes QVQ\subseteq V, we aim to identify the kk most central nodes in GG with respect to QQ. Specifically, we consider central nodes to be absorbing for random walks that start at the query nodes QQ. The goal is to find the set of kk central nodes that minimizes the expected length of a random walk until absorption. The proposed measure, which we call kk absorbing random-walk centrality, favors diverse sets, as it is beneficial to place the kk absorbing nodes in different parts of the graph so as to "intercept" random walks that start from different query nodes. Although similar problem definitions have been considered in the literature, e.g., in information-retrieval settings where the goal is to diversify web-search results, in this paper we study the problem formally and prove some of its properties. We show that the problem is NP-hard, while the objective function is monotone and supermodular, implying that a greedy algorithm provides solutions with an approximation guarantee. On the other hand, the greedy algorithm involves expensive matrix operations that make it prohibitive to employ on large datasets. To confront this challenge, we develop more efficient algorithms based on spectral clustering and on personalized PageRank.Comment: 11 pages, 11 figures, short paper to appear at ICDM 201
    corecore