223 research outputs found

    Efficient Triangle Counting in Large Graphs via Degree-based Vertex Partitioning

    The number of triangles is a computationally expensive graph statistic which is frequently used in complex network analysis (e.g., transitivity ratio), in various random graph models (e.g., exponential random graph model) and in important real-world applications such as spam detection, uncovering the hidden thematic structure of the Web, and link recommendation. Counting triangles in graphs with millions and billions of edges requires algorithms which run fast, use a small amount of space, provide accurate estimates of the number of triangles and preferably are parallelizable. In this paper we present an efficient triangle counting algorithm which can be adapted to the semistreaming model. The key idea of our algorithm is to combine the sampling algorithm of Tsourakakis et al. with the partitioning of the set of vertices into a high-degree and a low-degree subset, as in the work of Alon, Yuster and Zwick, treating each set appropriately. We obtain a running time $O\left(m + \frac{m^{3/2} \Delta \log{n}}{t \epsilon^2}\right)$ and an $\epsilon$ approximation (multiplicative error), where $n$ is the number of vertices, $m$ the number of edges and $\Delta$ the maximum number of triangles in which an edge is contained. Furthermore, we show how this algorithm can be adapted to the semistreaming model with space usage $O\left(m^{1/2}\log{n} + \frac{m^{3/2} \Delta \log{n}}{t \epsilon^2}\right)$ and a constant number of passes (three) over the graph stream. We apply our methods to various networks with several millions of edges and we obtain excellent results. Finally, we propose a random-projection-based method for triangle counting and provide a sufficient condition to obtain an estimate with low variance.
    Comment: 1) 12 pages 2) To appear in the 7th Workshop on Algorithms and Models for the Web Graph (WAW 2010)
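The sampling idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes the sampling step is uniform edge sampling: keep each edge independently with probability `p`, count triangles in the sparsified graph, and rescale by `1/p^3` (each triangle survives with probability `p^3`). The function names are hypothetical, and the degree-based vertex partitioning described in the abstract is not shown.

```python
import random
from itertools import combinations


def count_triangles(edges):
    """Exact triangle count by intersecting neighbor sets along each edge."""
    adj = {}
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    # Each triangle is seen once from each of its three edges.
    per_edge = sum(len(adj[u] & adj[v]) for u in adj for v in adj[u] if u < v)
    return per_edge // 3


def estimate_triangles(edges, p, rng):
    """Assumed sampling scheme: keep each edge with probability p, rescale by 1/p^3."""
    kept = [e for e in edges if rng.random() < p]
    return count_triangles(kept) / p**3


# Complete graph on 4 vertices: every 3-subset of vertices is a triangle.
k4 = list(combinations(range(4), 2))
print(count_triangles(k4))  # 4
```

With `p = 1.0` every edge is kept and the estimate equals the exact count; smaller values of `p` trade accuracy for less work on the sparsified graph.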

    Massively Parallel Approximate Distance Sketches

    Data structures that allow efficient distance estimation (distance oracles, distance sketches, etc.) have been extensively studied, particularly in centralized models and classical distributed models such as CONGEST. We initiate their study in newer (and arguably more realistic) models of distributed computation: the Congested Clique model and the Massively Parallel Computation (MPC) model. We provide efficient constructions in both of these models, but our core results are for MPC. In MPC we give two main results: an algorithm that constructs stretch/space optimal distance sketches but takes a (small) polynomial number of rounds, and an algorithm that constructs distance sketches with worse stretch but takes only polylogarithmic rounds. Along the way, we show that other useful combinatorial structures can also be computed in MPC. In particular, one key component we use to construct distance sketches is an MPC construction of the hopsets of [Elkin and Neiman, 2016]. This result has additional applications such as the first polylogarithmic time algorithm for constant approximate single-source shortest paths for weighted graphs in the low memory MPC setting.

    On Conceptually Simple Algorithms for Variants of Online Bipartite Matching

    We present a series of results regarding conceptually simple algorithms for bipartite matching in various online and related models. We first consider a deterministic adversarial model. The best approximation ratio possible for a one-pass deterministic online algorithm is $1/2$, which is achieved by any greedy algorithm. Dürr et al. recently presented a $2$-pass algorithm called Category-Advice that achieves approximation ratio $3/5$. We extend their algorithm to multiple passes. We prove the exact approximation ratio for the $k$-pass Category-Advice algorithm for all $k \ge 1$, and show that the approximation ratio converges to the inverse of the golden ratio $2/(1+\sqrt{5}) \approx 0.618$ as $k$ goes to infinity. The convergence is extremely fast: the $5$-pass Category-Advice algorithm is already within $0.01\%$ of the inverse of the golden ratio. We then consider a natural greedy algorithm in the online stochastic IID model: MinDegree. This algorithm is an online version of a well-known and extensively studied offline algorithm MinGreedy. We show that MinDegree cannot achieve an approximation ratio better than $1-1/e$, which is guaranteed by any consistent greedy algorithm in the known IID model. Finally, following the work in Besser and Poloczek, we depart from an adversarial or stochastic ordering and investigate a natural randomized algorithm (MinRanking) in the priority model. Although the priority model allows the algorithm to choose the input ordering in a general but well defined way, this natural algorithm cannot obtain the approximation of the Ranking algorithm in the ROM model.
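The one-pass greedy baseline mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the Category-Advice, MinDegree, or MinRanking algorithms; the function name and the input format (online vertices arriving with their lists of offline neighbors) are assumptions made for the example.

```python
def greedy_online_matching(arrivals):
    """Match each arriving online vertex to its first still-free offline neighbor.

    arrivals: iterable of (online_vertex, offline_neighbors) in arrival order.
    Returns a dict mapping each matched offline vertex to its online partner.
    """
    matched_offline = {}
    for online_vertex, neighbors in arrivals:
        for offline_vertex in neighbors:
            if offline_vertex not in matched_offline:
                matched_offline[offline_vertex] = online_vertex
                break  # the decision is irrevocable; move to the next arrival
    return matched_offline


# A small instance on which greedy reaches only half of the optimum:
# the optimum matches 1-"b" and 2-"a", but greedy gives "a" to vertex 1.
arrivals = [(1, ["a", "b"]), (2, ["a"])]
print(greedy_online_matching(arrivals))  # {'a': 1}
```

The instance shows the $1/2$ ratio being attained: greedy returns one matched edge while a maximum matching has two.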

    Tight Bounds on the Round Complexity of the Distributed Maximum Coverage Problem

    We study the maximum $k$-set coverage problem in the following distributed setting. A collection of sets $S_1,\ldots,S_m$ over a universe $[n]$ is partitioned across $p$ machines and the goal is to find $k$ sets whose union covers the largest number of elements. The computation proceeds in synchronous rounds. In each round, all machines simultaneously send a message to a central coordinator who then communicates back to all machines a summary to guide the computation for the next round. At the end, the coordinator outputs the answer. The main measures of efficiency in this setting are the approximation ratio of the returned solution, the communication cost of each machine, and the number of rounds of computation. Our main result is an asymptotically tight bound on the tradeoff between these measures for the distributed maximum coverage problem. We first show that any $r$-round protocol for this problem either incurs a communication cost of $k \cdot m^{\Omega(1/r)}$ or only achieves an approximation factor of $k^{\Omega(1/r)}$. This implies that any protocol that simultaneously achieves good approximation ratio ($O(1)$ approximation) and good communication cost ($\widetilde{O}(n)$ communication per machine) essentially requires a logarithmic (in $k$) number of rounds. We complement our lower bound result by showing that there exists an $r$-round protocol that achieves an $\frac{e}{e-1}$-approximation (essentially best possible) with a communication cost of $k \cdot m^{O(1/r)}$ as well as an $r$-round protocol that achieves a $k^{O(1/r)}$-approximation with only $\widetilde{O}(n)$ communication per machine (essentially best possible). We further use our results in this distributed setting to obtain new bounds for the maximum coverage problem in two other main models of computation for massive datasets, namely, the dynamic streaming model and the MapReduce model.
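For orientation, the $\frac{e}{e-1}$ approximation factor is the guarantee of the classical centralized greedy algorithm for maximum coverage. The sketch below shows that single-machine greedy only; it is not the distributed protocol from the abstract, and the function name and input format are assumptions.

```python
def greedy_max_coverage(sets, k):
    """Pick up to k sets, each time taking the one that covers the most new elements."""
    covered = set()
    chosen = []
    for _ in range(min(k, len(sets))):
        candidates = [i for i in range(len(sets)) if i not in chosen]
        best = max(candidates, key=lambda i: len(sets[i] - covered))
        if not sets[best] - covered:
            break  # no remaining set adds anything new
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered


sets = [{1, 2, 3, 4}, {1, 2, 5}, {3, 4, 6}, {5, 6}]
chosen, covered = greedy_max_coverage(sets, k=2)
print(chosen, sorted(covered))  # [0, 3] [1, 2, 3, 4, 5, 6]
```

Each greedy step needs the current set of covered elements, which is the kind of global state that the coordinator's per-round summary has to convey in the distributed setting.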

    Massively Parallel Algorithms for Distance Approximation and Spanners

    Over the past decade, there has been increasing interest in distributed/parallel algorithms for processing large-scale graphs. By now, we have quite fast algorithms (usually sublogarithmic-time and often $poly(\log\log n)$-time, or even faster) for a number of fundamental graph problems in the massively parallel computation (MPC) model. This model is a widely adopted theoretical abstraction of MapReduce-style settings, where a number of machines communicate in an all-to-all manner to process large-scale data. Contributing to this line of work on MPC graph algorithms, we present $poly(\log k) \in poly(\log\log n)$ round MPC algorithms for computing $O(k^{1+o(1)})$-spanners in the strongly sublinear regime of local memory. To the best of our knowledge, these are the first sublogarithmic-time MPC algorithms for spanner construction. As primary applications of our spanners, we get two important implications, as follows:
    - For the MPC setting, we get an $O(\log^2\log n)$-round algorithm for $O(\log^{1+o(1)} n)$ approximation of all pairs shortest paths (APSP) in the near-linear regime of local memory. To the best of our knowledge, this is the first sublogarithmic-time MPC algorithm for distance approximations.
    - Our result above also extends to the Congested Clique model of distributed computing, with the same round complexity and approximation guarantee. This gives the first sublogarithmic algorithm for approximating APSP in weighted graphs in the Congested Clique model.

    Engineering Aggregation Operators for Relational In-Memory Database Systems

    In this thesis we study the design and implementation of aggregation operators in the context of relational in-memory database systems. In particular, we identify and address the following challenges: cache-efficiency, CPU-friendliness, parallelism within and across processors, robust handling of skewed data, adaptive processing, processing with constrained memory, and integration with modern database architectures. Our resulting algorithm outperforms the state of the art by up to 3.7x.

    Preconditioned Data Sparsification for Big Data with Applications to PCA and K-means

    We analyze a compression scheme for large data sets that randomly keeps a small percentage of the components of each data sample. The benefit is that the output is a sparse matrix and therefore subsequent processing, such as PCA or K-means, is significantly faster, especially in a distributed-data setting. Furthermore, the sampling is single-pass and applicable to streaming data. The sampling mechanism is a variant of previous methods proposed in the literature combined with a randomized preconditioning to smooth the data. We provide guarantees for PCA in terms of the covariance matrix, and guarantees for K-means in terms of the error in the center estimators at a given step. We present numerical evidence to show both that our bounds are nearly tight and that our algorithms provide a real benefit when applied to standard test data sets, as well as providing certain benefits over related sampling approaches.
    Comment: 28 pages, 10 figures
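A minimal sketch of the component-sampling step follows. It assumes each sample keeps exactly `m` of its `d` components, chosen uniformly at random, and that kept values are rescaled by `d/m` so each component is preserved in expectation; both choices are assumptions for illustration, and the randomized preconditioning described in the abstract is not shown. The function name is hypothetical.

```python
import random


def sparsify(sample, m, rng):
    """Keep m uniformly chosen components of one data sample.

    Returns a sparse representation {index: rescaled value}. The d/m
    rescaling is an assumption that makes each component unbiased.
    """
    d = len(sample)
    keep = rng.sample(range(d), m)  # m distinct indices, single pass over the sample
    scale = d / m
    return {i: sample[i] * scale for i in sorted(keep)}


rng = random.Random(0)
x = [1.0, 2.0, 3.0, 4.0]
sparse_x = sparsify(x, m=2, rng=rng)
print(len(sparse_x))  # 2
```

Because each sample is processed independently and in one pass, the same function could be applied to samples as they arrive in a stream.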

    Correlation clustering in data streams

    In this paper, we address the problem of correlation clustering in the dynamic data stream model. The stream consists of updates to the edge weights of a graph on n nodes and the goal is to find a node-partition such that the end-points of negative-weight edges are typically in different clusters whereas the end-points of positive-weight edges are typically in the same cluster. We present polynomial-time, O(n·polylog n)-space approximation algorithms for natural problems that arise. We first develop data structures based on linear sketches that allow the “quality” of a given node-partition to be measured. We then combine these data structures with convex programming and sampling techniques to solve the relevant approximation problem. However, the standard LP and SDP formulations are not obviously solvable in O(n·polylog n) space. Our work presents space-efficient algorithms for the convex programming required, as well as approaches to reduce the adaptivity of the sampling. Note that the improved space and running-time bounds achieved by streaming algorithms are also useful for offline settings such as MapReduce models.
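One way to make the "quality" of a node-partition concrete is the total weight of disagreements: positive-weight edges cut by the partition plus negative-weight edges kept inside a cluster. The sketch below computes that quantity exactly from the full edge list. Treating this as the objective is an assumption, the names are hypothetical, and the linear-sketch data structures from the abstract are not shown.

```python
def disagreement_cost(weights, cluster_of):
    """Sum |w| over edges whose sign disagrees with the partition.

    weights: dict mapping (u, v) to a signed edge weight.
    cluster_of: dict mapping each node to its cluster label.
    """
    cost = 0.0
    for (u, v), w in weights.items():
        same_cluster = cluster_of[u] == cluster_of[v]
        if (w > 0 and not same_cluster) or (w < 0 and same_cluster):
            cost += abs(w)
    return cost


weights = {("a", "b"): 1.0, ("b", "c"): -2.0, ("a", "c"): 1.0}
split = {"a": 0, "b": 0, "c": 1}    # cuts the positive edge (a, c)
merged = {"a": 0, "b": 0, "c": 0}   # keeps the negative edge (b, c) inside a cluster
print(disagreement_cost(weights, split), disagreement_cost(weights, merged))  # 1.0 2.0
```

On this instance no partition has zero cost, since the three edges form an inconsistent triangle.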