
    Graph Summarization

    The continuous and rapid growth of highly interconnected datasets, which are both voluminous and complex, calls for the development of adequate processing and analytical techniques. One method for condensing and simplifying such datasets is graph summarization. It denotes a family of application-specific algorithms designed to transform graphs into more compact representations while preserving structural patterns, query answers, or specific property distributions. As this problem is common to several areas studying graph topologies, different approaches, such as clustering, compression, sampling, or influence detection, have been proposed, primarily based on statistical and optimization methods. The focus of our chapter is to pinpoint the main graph summarization methods, with particular attention to the most recent approaches and novel research trends on this topic not yet covered by previous surveys.
    Comment: To appear in the Encyclopedia of Big Data Technologies
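The idea of collapsing a graph into a more compact representation while preserving structure can be illustrated with a minimal sketch: given a grouping of nodes (e.g., produced by a community detection step), merge each group into a supernode and record how many original edges run within and between groups. The function name `summarize` and the group labels are illustrative, not taken from the chapter.

```python
from collections import defaultdict

def summarize(edges, group_of):
    """Collapse each node group into a supernode; superedge weights
    count the original edges within or between groups."""
    summary = defaultdict(int)
    for u, v in edges:
        a, b = sorted((group_of[u], group_of[v]))
        summary[(a, b)] += 1
    return dict(summary)

# toy graph: two dense triangles joined by a single bridge edge
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
group = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(summarize(edges, group))
# {('A', 'A'): 3, ('B', 'B'): 3, ('A', 'B'): 1}
```

The six-node input is reduced to two supernodes and three weighted superedges, from which degree-like statistics of the original graph can still be recovered approximately.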

    Clustering-based Algorithms for Big Data Computations

    In the age of big data, the amount of information that applications need to process often exceeds the computational capabilities of single machines. To cope with this deluge of data, new computational models have been defined. The MapReduce model allows the development of distributed algorithms targeted at large clusters, where each machine can only store a small fraction of the data. In the streaming model, a single processor processes an incoming stream of data on the fly, using only limited memory. The specific characteristics of these models, combined with the necessity of processing very large datasets, rule out in many cases the adoption of known algorithmic strategies, prompting the development of new ones. In this context, clustering, the process of grouping together elements according to some proximity measure, is a valuable tool, which can be used to build succinct summaries of the input data. In this thesis we develop novel algorithms for some fundamental problems, where clustering is either a key ingredient to cope with very large instances or is itself the ultimate target.

    First, we consider the problem of approximating the diameter of an undirected graph, a fundamental metric in graph analytics, for which the known exact algorithms are too costly to use for very large inputs. We develop a MapReduce algorithm for this problem which, for the important class of graphs of bounded doubling dimension, features a polylogarithmic approximation guarantee, uses linear memory, and executes in a number of parallel rounds that can be made sublinear in the input graph's diameter. To the best of our knowledge, ours is the first parallel algorithm with these guarantees. Our algorithm leverages a novel clustering primitive to extract a concise summary of the input graph on which to compute the diameter approximation. We complement our theoretical analysis with an extensive experimental evaluation, finding that our algorithm features an approximation quality significantly better than the theoretical upper bound and high scalability.

    Next, we consider the problem of clustering uncertain graphs, that is, graphs where each edge has a probability of existence, specified as part of the input. These graphs, whose applications range from biology to privacy in social networks, have an exponential number of possible deterministic realizations, which imposes a big-data perspective. We develop the first algorithms for clustering uncertain graphs with provable approximation guarantees, which aim at maximizing the probability that nodes are connected to the centers of their assigned clusters. A preliminary suite of experiments provides evidence that the quality of the clusterings returned by our algorithms compares very favorably with previous approaches with no theoretical guarantees.

    Finally, we deal with the problem of diversity maximization, which is a fundamental primitive in big data analytics: given a set of points in a metric space, we are asked to provide a small subset maximizing some notion of diversity. We provide efficient streaming and MapReduce algorithms with approximation guarantees that can be made arbitrarily close to those of the best sequential algorithms available. The algorithms crucially rely on the use of a k-center clustering primitive to extract a succinct summary of the data, and their analysis is expressed in terms of the doubling dimension of the input point set. Moreover, unlike previously known algorithms, ours feature an interesting tradeoff between approximation quality and memory requirements. Our theoretical findings are supported by the first experimental analysis of diversity maximization algorithms in streaming and MapReduce, which highlights the tradeoffs of our algorithms on both real-world and synthetic datasets. Moreover, our algorithms exhibit good scalability, and a significantly better performance than the approaches proposed in previous works.
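The k-center clustering primitive invoked above is classically realized by the greedy farthest-point heuristic (Gonzalez's 2-approximation), which is the standard building block for extracting a succinct summary of a point set; the sketch below is a generic sequential version, not the thesis's distributed implementation.

```python
def gonzalez_k_center(points, k, dist):
    """Greedy 2-approximate k-center: repeatedly pick the point
    farthest from the current set of centers."""
    centers = [points[0]]
    # d[j] = distance from points[j] to its nearest chosen center
    d = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=d.__getitem__)
        centers.append(points[i])
        d = [min(d[j], dist(points[j], centers[-1])) for j in range(len(points))]
    return centers

pts = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 0)]
euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
print(gonzalez_k_center(pts, 3, euclid))
# [(0, 0), (10, 0), (5, 5)]
```

The returned centers form a summary whose covering radius is at most twice the optimum, which is what makes such summaries usable as proxies for the full dataset in diameter estimation and diversity maximization.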

    Scalable Distributed Approximation of Internal Measures for Clustering Evaluation

    The most widely used internal measure for clustering evaluation is the silhouette coefficient, whose naive computation requires a quadratic number of distance calculations, which is clearly unfeasible for massive datasets. Surprisingly, there are no known general methods to efficiently approximate the silhouette coefficient of a clustering with rigorously provable high accuracy. In this paper, we present the first scalable algorithm to compute such a rigorous approximation for the evaluation of clusterings based on any metric distance. Our algorithm hinges on a Probability Proportional to Size (PPS) sampling scheme, and, for any fixed $\varepsilon, \delta \in (0,1)$, it approximates the silhouette coefficient within an additive error $O(\varepsilon)$ with probability $1-\delta$, using a very small number of distance calculations. We also prove that the algorithm can be adapted to obtain rigorous approximations of other internal measures of clustering quality, such as cohesion and separation. Importantly, we provide a distributed implementation of the algorithm using the MapReduce model, which runs in constant rounds and requires only sublinear local space at each worker, making our estimation approach applicable to big data scenarios. We perform an extensive experimental evaluation of our silhouette approximation algorithm, comparing its performance to a number of baseline heuristics on real and synthetic datasets. The experiments provide evidence that, unlike other heuristics, our estimation strategy not only provides tight theoretical guarantees but is also able to return highly accurate estimations while running in a fraction of the time required by the exact computation, and that its distributed implementation is highly scalable, thus enabling the computation of internal measures for very large datasets for which the exact computation is prohibitive.
    Comment: 16 pages, 4 tables, 1 figure
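The core idea of replacing the quadratic distance computation with per-cluster samples can be sketched as follows. This toy version uses uniform sampling rather than the paper's PPS scheme (and, unlike the exact silhouette, does not exclude a point from its own cluster's sample), so it carries no accuracy guarantee; all names are illustrative.

```python
import random

def approx_silhouette(points, labels, dist, sample_size, seed=0):
    """Estimate the mean silhouette by averaging distances to a small
    sample drawn from each cluster, instead of to every point.
    Simplification: uniform sampling, not the paper's PPS scheme."""
    rng = random.Random(seed)
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    samples = {l: rng.sample(ps, min(sample_size, len(ps)))
               for l, ps in clusters.items()}
    total = 0.0
    for p, l in zip(points, labels):
        # a: estimated mean intra-cluster distance
        a = sum(dist(p, q) for q in samples[l]) / len(samples[l])
        # b: estimated mean distance to the nearest other cluster
        b = min(sum(dist(p, q) for q in s) / len(s)
                for m, s in samples.items() if m != l)
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(points)
```

On two well-separated clusters the estimate is close to 1, as expected; the paper's contribution is precisely to make such an estimate rigorous via PPS sampling.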

    PRESTO: Simple and Scalable Sampling Techniques for the Rigorous Approximation of Temporal Motif Counts

    The identification and counting of small graph patterns, called network motifs, is a fundamental primitive in the analysis of networks, with applications in various domains, from social networks to neuroscience. Several techniques have been designed to count the occurrences of motifs in static networks, with recent work focusing on the computational challenges posed by large networks. Modern networked datasets contain rich information, such as the times at which the events modeled by the network's edges happened, which can provide useful insights into the process modeled by the network. The analysis of motifs in temporal networks, called temporal motifs, is becoming an important component in the analysis of modern networked datasets. Several methods have recently been designed to count the number of instances of temporal motifs in temporal networks, which is even more challenging than its counterpart for static networks. Such methods are either exact, and not applicable to large networks, or approximate, but provide only weak guarantees on the estimates they produce and do not scale to very large networks. In this work we present an efficient and scalable algorithm to obtain rigorous approximations of the count of temporal motifs. Our algorithm is based on a simple but effective sampling approach, which renders our algorithm practical for very large datasets. Our extensive experimental evaluation shows that our algorithm provides estimates of temporal motif counts which are more accurate than the state-of-the-art sampling algorithms, with significantly lower running time than exact approaches, enabling the study of temporal motifs of sizes larger than those considered in previous works on billion-edge networks.
    Comment: 19 pages, 5 figures, to appear in SDM 202
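The flavor of sampling-based motif counting can be conveyed with a toy stand-in for PRESTO's scheme: for a fixed 2-edge temporal motif (u->v at time t1 followed by v->w within a window delta), sample each candidate first edge independently with probability p, count completions exactly, and rescale by 1/p to get an unbiased estimate. PRESTO itself samples temporal windows rather than edges; this simplification and all names here are illustrative.

```python
import random

def estimate_2edge_motifs(temporal_edges, delta, p, seed=0):
    """Unbiased estimate of the number of 2-edge temporal motifs
    u->v (t1), v->w (t2) with 0 < t2 - t1 <= delta, obtained by
    sampling candidate first edges with probability p and rescaling.
    With p = 1 this is the exact count."""
    rng = random.Random(seed)
    by_src = {}
    for u, v, t in temporal_edges:
        by_src.setdefault(u, []).append((v, t))
    count = 0
    for u, v, t1 in temporal_edges:
        if rng.random() >= p:   # skip unsampled first edges
            continue
        for w, t2 in by_src.get(v, ()):
            if 0 < t2 - t1 <= delta:
                count += 1
    return count / p
```

With p well below 1 the work drops proportionally, and concentration arguments (as in the paper, for its own estimator) bound the deviation of the rescaled count from the true value.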

    Fast Estimation of One-to-All s-t Reliability

    s-t reliability is one of the key measures for evaluating the connection probability between two nodes in an uncertain graph. Since computing s-t reliability is #P-complete, approximation methods have been proposed. However, running analytical tasks such as top-k search and clustering on uncertain graphs requires repeatedly computing s-t reliability from one node to all other nodes. Even with previously proposed approximate methods, the computational cost of this step is enormous, making it difficult to analyze large uncertain graphs efficiently. In this work, we propose a fast method for estimating one-to-all s-t reliability. The proposed method integrates breadth-first search with graph-sampling-based s-t reliability estimation, thereby estimating one-to-all s-t reliability with high accuracy using few computations.
    Presented at the 12th Forum on Data Engineering and Information Management (DEIM2020), held online, March 2-4, 2020.
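The baseline that such methods accelerate is plain Monte Carlo estimation: sample deterministic realizations of the uncertain graph edge by edge, run a breadth-first search from the source in each, and average. A minimal sketch (function and variable names are illustrative, not from the paper):

```python
import random
from collections import deque

def one_to_all_reliability(n, prob_edges, s, samples, seed=0):
    """Monte Carlo estimate of s-t reliability from s to every node:
    sample realizations of the uncertain graph and BFS from s in each.
    prob_edges: list of (u, v, existence_probability) undirected edges."""
    rng = random.Random(seed)
    hits = [0] * n
    for _ in range(samples):
        adj = [[] for _ in range(n)]
        for u, v, p in prob_edges:
            if rng.random() < p:      # edge materializes in this sample
                adj[u].append(v)
                adj[v].append(u)
        seen = [False] * n
        seen[s] = True
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    q.append(v)
        for v in range(n):
            hits[v] += seen[v]
    return [h / samples for h in hits]
```

Each sample costs one BFS, so the total cost is (samples x graph size); the paper's contribution is to cut the number of such computations while preserving accuracy.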

    Shortest paths and centrality in uncertain networks

    Computing the shortest path between a pair of nodes is a fundamental graph primitive, which has critical applications in vehicle routing, finding functional pathways in biological networks, and survivable network design, among many others. In this work, we study shortest-path queries over uncertain networks, i.e., graphs where every edge is associated with a probability of existence. We show that, for a given path, it is #P-hard to compute the probability of it being the shortest path, and we also derive other interesting properties highlighting the complexity of computing the Most Probable Shortest Paths (MPSPs). We thus devise sampling-based efficient algorithms, with end-to-end accuracy guarantees, to compute the MPSP. As a concrete application, we show how to compute a novel concept of betweenness centrality in an uncertain graph using MPSPs. Our thorough experimental results and rich real-world case studies on sensor networks and brain networks validate the effectiveness, efficiency, scalability, and usefulness of our solution.
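The natural sampling-based approach alluded to above can be sketched as follows: draw deterministic realizations of the uncertain graph, run Dijkstra in each, and tally which concrete s-t path comes out shortest most often. This is only the basic Monte Carlo idea, not the paper's algorithm with its end-to-end guarantees, and all names are illustrative.

```python
import heapq
import random
from collections import Counter

def mpsp_candidate(n, prob_edges, s, t, samples, seed=0):
    """Sample realizations of an uncertain weighted graph, run Dijkstra
    in each, and return (path, frequency) for the s-t path that was
    shortest most often -- a Monte Carlo sketch of MPSP estimation.
    prob_edges: list of (u, v, weight, existence_probability)."""
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(samples):
        adj = [[] for _ in range(n)]
        for u, v, w, p in prob_edges:
            if rng.random() < p:
                adj[u].append((v, w))
                adj[v].append((u, w))
        dist = [float("inf")] * n
        prev = [None] * n
        dist[s] = 0
        pq = [(0, s)]
        while pq:
            d, u = heapq.heappop(pq)
            if d > dist[u]:
                continue            # stale heap entry
            for v, w in adj[u]:
                if d + w < dist[v]:
                    dist[v] = d + w
                    prev[v] = u
                    heapq.heappush(pq, (d + w, v))
        if dist[t] < float("inf"):
            path, u = [], t
            while u is not None:    # reconstruct s -> t path
                path.append(u)
                u = prev[u]
            tally[tuple(reversed(path))] += 1
    return tally.most_common(1)[0] if tally else None
```

The frequency of the winning path is a naive estimate of its probability of being the shortest; the hardness result cited above explains why turning this into a rigorous, efficient estimator is nontrivial.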