19,139 research outputs found

    Streaming, Local, and Multi­Level (Hyper)Graph Decomposition

    Get PDF
    (Hyper)Graph decomposition is a family of problems that aim to break down large (hyper)graphs into smaller sub(hyper)graphs for easier analysis. The importance of this lies in its ability to enable efficient computation on large and complex (hyper)graphs, such as social networks, chemical compounds, and computer networks. This dissertation explores several types of (hyper)graph decomposition problems, including graph partitioning, hypergraph partitioning, local graph clustering, process mapping, and signed graph clustering. Our main focus is on streaming algorithms, local algorithms and multilevel algorithms. In terms of streaming algorithms, we make contributions with highly efficient and effective algorithms for (hyper)graph partitioning and process mapping. In terms of local algorithms, we propose sub-linear algorithms which are effective in detecting high-quality local communities around a given seed node in a graph based on the distribution of a given motif. In terms of multilevel algorithms, we engineer high-quality multilevel algorithms for process mapping and signed graph clustering. We provide a thorough discussion of each algorithm along with experimental results demonstrating their superiority over existing state-of-the-art techniques. The results show that the proposed algorithms achieve improved performance and better solutions in various metrics, making them highly promising for practical applications. Overall, this dissertation showcases the effectiveness of advanced combinatorial algorithmic techniques in solving challenging (hyper)graph decomposition problems

    Analyzing Massive Graphs in the Semi-streaming Model

    Get PDF
    Massive graphs arise in a many scenarios, for example, traffic data analysis in large networks, large scale scientific experiments, and clustering of large data sets. The semi-streaming model was proposed for processing massive graphs. In the semi-streaming model, we have a random accessible memory which is near-linear in the number of vertices. The input graph (or equivalently, edges in the graph) is presented as a sequential list of edges (insertion-only model) or edge insertions and deletions (dynamic model). The list is read-only but we may make multiple passes over the list. There has been a few results in the insertion-only model such as computing distance spanners and approximating the maximum matching. In this thesis, we present some algorithms and techniques for (i) solving more complex problems in the semi-streaming model, (for example, problems in the dynamic model) and (ii) having better solutions for the problems which have been studied (for example, the maximum matching problem). In course of both of these, we develop new techniques with broad applications and explore the rich trade-offs between the complexity of models (insertion-only streams vs. dynamic streams), the number of passes, space, accuracy, and running time. 1. We initiate the study of dynamic graph streams. We start with basic problems such as the connectivity problem and computing the minimum spanning tree. These problems are trivial in the insertion-only model. However, they require non-trivial (and multiple passes for computing the exact minimum spanning tree) algorithms in the dynamic model. 2. Second, we present a graph sparsification algorithm in the semi-streaming model. A graph sparsification is a sparse graph that approximately preserves all the cut values of a graph. Such a graph acts as an oracle for solving cut-related problems, for example, the minimum cut problem and the multicut problem. Our algorithm produce a graph sparsification with high probability in one pass. 3. Third, we use the primal-dual algorithms to develop the semi-streaming algorithms. The primal-dual algorithms have been widely accepted as a framework for solving linear programs and semidefinite programs faster. In contrast, we apply the method for reducing space and number of passes in addition to reducing the running time. We also present some examples that arise in applications and show how to apply the techniques: the multicut problem, the correlation clustering problem, and the maximum matching problem. As a consequence, we also develop near-linear time algorithms for the bb-matching problems which were not known before

    Improved Algorithms for Time Decay Streams

    Get PDF
    In the time-decay model for data streams, elements of an underlying data set arrive sequentially with the recently arrived elements being more important. A common approach for handling large data sets is to maintain a coreset, a succinct summary of the processed data that allows approximate recovery of a predetermined query. We provide a general framework that takes any offline-coreset and gives a time-decay coreset for polynomial time decay functions. We also consider the exponential time decay model for k-median clustering, where we provide a constant factor approximation algorithm that utilizes the online facility location algorithm. Our algorithm stores O(k log(h Delta)+h) points where h is the half-life of the decay function and Delta is the aspect ratio of the dataset. Our techniques extend to k-means clustering and M-estimators as well

    Network Sampling: From Static to Streaming Graphs

    Full text link
    Network sampling is integral to the analysis of social, information, and biological networks. Since many real-world networks are massive in size, continuously evolving, and/or distributed in nature, the network structure is often sampled in order to facilitate study. For these reasons, a more thorough and complete understanding of network sampling is critical to support the field of network science. In this paper, we outline a framework for the general problem of network sampling, by highlighting the different objectives, population and units of interest, and classes of network sampling methods. In addition, we propose a spectrum of computational models for network sampling methods, ranging from the traditionally studied model based on the assumption of a static domain to a more challenging model that is appropriate for streaming domains. We design a family of sampling methods based on the concept of graph induction that generalize across the full spectrum of computational models (from static to streaming) while efficiently preserving many of the topological properties of the input graphs. Furthermore, we demonstrate how traditional static sampling algorithms can be modified for graph streams for each of the three main classes of sampling methods: node, edge, and topology-based sampling. Our experimental results indicate that our proposed family of sampling methods more accurately preserves the underlying properties of the graph for both static and streaming graphs. Finally, we study the impact of network sampling algorithms on the parameter estimation and performance evaluation of relational classification algorithms

    Approximation and Streaming Algorithms for Projective Clustering via Random Projections

    Full text link
    Let PP be a set of nn points in Rd\mathbb{R}^d. In the projective clustering problem, given k,qk, q and norm ρ[1,]\rho \in [1,\infty], we have to compute a set F\mathcal{F} of kk qq-dimensional flats such that (pPd(p,F)ρ)1/ρ(\sum_{p\in P}d(p, \mathcal{F})^\rho)^{1/\rho} is minimized; here d(p,F)d(p, \mathcal{F}) represents the (Euclidean) distance of pp to the closest flat in F\mathcal{F}. We let fkq(P,ρ)f_k^q(P,\rho) denote the minimal value and interpret fkq(P,)f_k^q(P,\infty) to be maxrPd(r,F)\max_{r\in P}d(r, \mathcal{F}). When ρ=1,2\rho=1,2 and \infty and q=0q=0, the problem corresponds to the kk-median, kk-mean and the kk-center clustering problems respectively. For every 0<ϵ<10 < \epsilon < 1, SPS\subset P and ρ1\rho \ge 1, we show that the orthogonal projection of PP onto a randomly chosen flat of dimension O(((q+1)2log(1/ϵ)/ϵ3)logn)O(((q+1)^2\log(1/\epsilon)/\epsilon^3) \log n) will ϵ\epsilon-approximate f1q(S,ρ)f_1^q(S,\rho). This result combines the concepts of geometric coresets and subspace embeddings based on the Johnson-Lindenstrauss Lemma. As a consequence, an orthogonal projection of PP to an O(((q+1)2log((q+1)/ϵ)/ϵ3)logn)O(((q+1)^2 \log ((q+1)/\epsilon)/\epsilon^3) \log n) dimensional randomly chosen subspace ϵ\epsilon-approximates projective clusterings for every kk and ρ\rho simultaneously. Note that the dimension of this subspace is independent of the number of clusters~kk. Using this dimension reduction result, we obtain new approximation and streaming algorithms for projective clustering problems. For example, given a stream of nn points, we show how to compute an ϵ\epsilon-approximate projective clustering for every kk and ρ\rho simultaneously using only O((n+d)((q+1)2log((q+1)/ϵ))/ϵ3logn)O((n+d)((q+1)^2\log ((q+1)/\epsilon))/\epsilon^3 \log n) space. Compared to standard streaming algorithms with Ω(kd)\Omega(kd) space requirement, our approach is a significant improvement when the number of input points and their dimensions are of the same order of magnitude.Comment: Canadian Conference on Computational Geometry (CCCG 2015

    Coresets-Methods and History: A Theoreticians Design Pattern for Approximation and Streaming Algorithms

    Get PDF
    We present a technical survey on the state of the art approaches in data reduction and the coreset framework. These include geometric decompositions, gradient methods, random sampling, sketching and random projections. We further outline their importance for the design of streaming algorithms and give a brief overview on lower bounding techniques
    corecore