
    On Graph Stream Clustering with Side Information

    Graph clustering has become an important problem due to emerging applications involving the web, social networks and bioinformatics. Recently, many such applications generate data in the form of streams. Clustering massive, dynamic graph streams is challenging because of the complex structure of graphs and the computational difficulty of processing continuous data. Meanwhile, a large volume of side information of various types is associated with graphs. Examples include the properties of users in social network activities, the meta attributes associated with web click graph streams and the location information in mobile communication networks. Such attributes contain extremely useful information and have the potential to improve the clustering process, but are neglected by most recent graph stream mining techniques. In this paper, we define a unified distance measure on both link structures and side attributes for clustering. In addition, we propose a novel optimization framework, DMO, which dynamically optimizes the distance metric and adapts it to newly received stream data. We further introduce a carefully designed statistic, SGS(C), which consumes constant storage space as the stream progresses. We demonstrate that the statistics maintained are sufficient for both the clustering process and the distance optimization, and that they scale to massive graphs with side attributes. We present experimental results showing the advantages of the approach over the baselines in graph stream clustering with both links and side information. Comment: Full version of SIAM SDM 2013 paper

    Graph Sample and Hold: A Framework for Big-Graph Analytics

    Sampling is a standard approach in big-graph analytics; the goal is to efficiently estimate graph properties by consulting a sample of the whole population. A perfect sample is assumed to mirror every property of the whole population. Unfortunately, such a perfect sample is hard to collect in complex populations such as graphs (e.g. web graphs, social networks, etc.), where an underlying network connects the units of the population. Therefore, a good sample will be representative in the sense that graph properties of interest can be estimated with a known degree of accuracy. While previous work focused particularly on sampling schemes used to estimate certain graph properties (e.g. triangle count), much less is known for the case when we need to estimate various graph properties with the same sampling scheme. In this paper, we propose a generic stream sampling framework for big-graph analytics, called Graph Sample and Hold (gSH). To begin, the proposed framework samples from massive graphs sequentially in a single pass, one edge at a time, while maintaining a small state. We then show how to produce unbiased estimators for various graph properties from the sample. Given that the graph analysis algorithms will run on a sample instead of the whole population, the runtime complexity of these algorithms is kept under control. Moreover, given that the estimators of graph properties are unbiased, the approximation error is kept under control. Finally, we show the performance of the proposed framework (gSH) on various types of graphs, such as social graphs, among others.
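
To make the idea of single-pass edge sampling with unbiased estimators concrete, here is a minimal sketch. It is not the gSH framework itself: it assumes plain uniform sampling, where every edge is kept independently with the same probability `p`, and it assumes a simple graph whose edges each appear once in the stream. The function names are hypothetical. Under those assumptions, scaling the sampled edge count by `1/p` and the sampled triangle count by `1/p**3` gives unbiased (Horvitz-Thompson style) estimates.

```python
import random


def sample_edges(edge_stream, p, seed=None):
    """Single pass over the stream: keep each edge independently with probability p."""
    rng = random.Random(seed)
    return [edge for edge in edge_stream if rng.random() < p]


def estimate_edge_count(sample, p):
    """Each edge survives with probability p, so scale the sample size by 1/p."""
    return len(sample) / p


def estimate_triangle_count(sample, p):
    """A triangle survives only if all three edges do (probability p**3)."""
    adjacency = {}
    for u, v in sample:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    # Every sampled triangle is counted once per edge, hence the division by 3.
    closed = sum(len(adjacency[u] & adjacency[v]) for u, v in sample)
    return (closed / 3) / p ** 3
```

With `p = 1.0` every edge is kept and both estimates reduce to the exact counts, which is a convenient sanity check; smaller values of `p` trade accuracy for a smaller state.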

    Analyzing Massive Graphs in the Semi-streaming Model

    Massive graphs arise in many scenarios, for example, traffic data analysis in large networks, large-scale scientific experiments, and clustering of large data sets. The semi-streaming model was proposed for processing massive graphs. In the semi-streaming model, we have a random-access memory whose size is near-linear in the number of vertices. The input graph is presented as a sequential list of edges (insertion-only model) or of edge insertions and deletions (dynamic model). The list is read-only, but we may make multiple passes over it. There have been a few results in the insertion-only model, such as computing distance spanners and approximating the maximum matching. In this thesis, we present algorithms and techniques for (i) solving more complex problems in the semi-streaming model (for example, problems in the dynamic model) and (ii) obtaining better solutions for problems that have already been studied (for example, the maximum matching problem). In the course of both, we develop new techniques with broad applications and explore the rich trade-offs between the complexity of models (insertion-only streams vs. dynamic streams), the number of passes, space, accuracy, and running time.
    1. We initiate the study of dynamic graph streams. We start with basic problems such as the connectivity problem and computing the minimum spanning tree. These problems are trivial in the insertion-only model. However, in the dynamic model they require non-trivial algorithms (and multiple passes for computing the exact minimum spanning tree).
    2. Second, we present a graph sparsification algorithm in the semi-streaming model. A graph sparsification is a sparse graph that approximately preserves all the cut values of a graph. Such a graph acts as an oracle for solving cut-related problems, for example, the minimum cut problem and the multicut problem. Our algorithm produces a graph sparsification with high probability in one pass.
    3. Third, we use primal-dual algorithms to develop semi-streaming algorithms. Primal-dual algorithms have been widely accepted as a framework for solving linear programs and semidefinite programs faster. In contrast, we apply the method to reduce space and the number of passes in addition to the running time. We also present some examples that arise in applications and show how to apply the techniques: the multicut problem, the correlation clustering problem, and the maximum matching problem. As a consequence, we also develop near-linear time algorithms for the b-matching problems, which were not known before.
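
The abstract notes that connectivity is trivial in the insertion-only model. A minimal sketch of why, assuming vertices are labelled `0..n-1` and using a hypothetical function name: a union-find structure needs only one integer per vertex, so it fits the near-linear memory budget, and a single pass over the edge list suffices. This does not cover the dynamic model, where edge deletions rule out this simple approach.

```python
def count_components(num_vertices, edge_stream):
    """One pass over an insertion-only edge stream using O(n) memory."""
    parent = list(range(num_vertices))  # one entry per vertex, never per edge

    def find(x):
        # Follow parent pointers to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = num_vertices
    for u, v in edge_stream:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # merge the two components
            components -= 1
    return components
```

The stream can be any iterable, including a generator that reads edges from disk, because each edge is inspected once and then discarded.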

    Network Sampling: From Static to Streaming Graphs

    Network sampling is integral to the analysis of social, information, and biological networks. Since many real-world networks are massive in size, continuously evolving, and/or distributed in nature, the network structure is often sampled in order to facilitate study. For these reasons, a more thorough and complete understanding of network sampling is critical to support the field of network science. In this paper, we outline a framework for the general problem of network sampling, by highlighting the different objectives, population and units of interest, and classes of network sampling methods. In addition, we propose a spectrum of computational models for network sampling methods, ranging from the traditionally studied model based on the assumption of a static domain to a more challenging model that is appropriate for streaming domains. We design a family of sampling methods based on the concept of graph induction that generalize across the full spectrum of computational models (from static to streaming) while efficiently preserving many of the topological properties of the input graphs. Furthermore, we demonstrate how traditional static sampling algorithms can be modified for graph streams for each of the three main classes of sampling methods: node, edge, and topology-based sampling. Our experimental results indicate that our proposed family of sampling methods more accurately preserves the underlying properties of the graph for both static and streaming graphs. Finally, we study the impact of network sampling algorithms on the parameter estimation and performance evaluation of relational classification algorithms
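
As an illustration of adapting a static edge-sampling method to a stream and then applying graph induction, here is a minimal sketch. It is an assumption-laden simplification and not the family of methods proposed in the paper: it uses standard reservoir sampling to keep `k` edges from a stream of unknown length, then a second pass to add every edge whose endpoints were both sampled. The function names are hypothetical.

```python
import random


def reservoir_sample_edges(edge_stream, k, seed=None):
    """Keep a uniform random sample of k edges from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, edge in enumerate(edge_stream):
        if i < k:
            reservoir.append(edge)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = edge
    return reservoir


def induce_edges(edge_stream, nodes):
    """Second pass: keep every edge whose two endpoints are both in `nodes`."""
    return [(u, v) for u, v in edge_stream if u in nodes and v in nodes]
```

A possible usage: collect the endpoints of the reservoir with `nodes = {n for e in sample for n in e}` and call `induce_edges` on a second pass of the stream. The induced subgraph always contains the sampled edges and may recover additional edges among the sampled nodes.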

    Growing Story Forest Online from Massive Breaking News

    We describe our experience of implementing a news content organization system at Tencent that discovers events from vast streams of breaking news and evolves news story structures in an online fashion. Our real-world system has distinct requirements in contrast to previous studies on topic detection and tracking (TDT) and event timeline or graph generation, in that we 1) need to accurately and quickly extract distinguishable events from massive streams of long text documents that cover diverse topics and contain highly redundant information, and 2) must develop the structures of event stories in an online manner, without repeatedly restructuring previously formed stories, in order to guarantee a consistent user viewing experience. To address these challenges, we propose Story Forest, a set of online schemes that automatically clusters streaming documents into events, while connecting related events in growing trees to tell evolving stories. We conducted an extensive evaluation on 60 GB of real-world Chinese news data, including detailed pilot user experience studies, although our ideas are not language-dependent and can easily be extended to other languages. The results demonstrate the superior capability of Story Forest to accurately identify events and organize news text into a logical structure that is appealing to human readers, compared to multiple existing algorithm frameworks. Comment: Accepted by CIKM 2017, 9 pages