
    Graph Sample and Hold: A Framework for Big-Graph Analytics

    Sampling is a standard approach in big-graph analytics; the goal is to efficiently estimate graph properties by consulting a sample of the whole population. A perfect sample is assumed to mirror every property of the whole population. Unfortunately, such a perfect sample is hard to collect in complex populations such as graphs (e.g., web graphs, social networks), where an underlying network connects the units of the population. Therefore, a good sample will be representative in the sense that graph properties of interest can be estimated with a known degree of accuracy. While previous work focused particularly on sampling schemes used to estimate certain graph properties (e.g., triangle count), much less is known for the case when we need to estimate various graph properties with the same sampling scheme. In this paper, we propose a generic stream sampling framework for big-graph analytics, called Graph Sample and Hold (gSH). The proposed framework samples from massive graphs sequentially in a single pass, one edge at a time, while maintaining a small state. We then show how to produce unbiased estimators for various graph properties from the sample. Given that the graph analysis algorithms run on a sample instead of the whole population, the runtime complexity of these algorithms is kept under control. Moreover, given that the estimators of graph properties are unbiased, the approximation error is kept under control. Finally, we show the performance of the proposed framework (gSH) on various types of graphs, including social graphs.
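
    The core sample-and-hold loop can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact algorithm: it assumes a fixed sampling probability p for edges touching no sampled node and a fixed holding probability q otherwise, and it records inverse inclusion probabilities so that Horvitz-Thompson style unbiased estimates (here, of the edge count) can be formed from the sample; all names are ours.

```python
import random


def gsh_edge_sample(edge_stream, p=0.1, q=0.5, seed=0):
    """One-pass sample-and-hold over an edge stream (illustrative sketch).

    An arriving edge is kept with probability q if it touches a node already
    present in the sample ("hold"), and with probability p otherwise
    ("sample"). The inverse inclusion probability is stored with each kept
    edge so Horvitz-Thompson style estimators can reweight the sample.
    """
    rng = random.Random(seed)
    sampled = {}     # (u, v) -> 1 / inclusion probability
    touched = set()  # nodes appearing in the sample so far
    for u, v in edge_stream:
        prob = q if (u in touched or v in touched) else p
        if rng.random() < prob:
            sampled[(u, v)] = 1.0 / prob
            touched.update((u, v))
    return sampled


def estimate_edge_count(sampled):
    """Unbiased estimate of |E|: sum of inverse inclusion probabilities."""
    return sum(sampled.values())


# Toy usage on a tiny stream.
stream = [(1, 2), (2, 3), (3, 1), (1, 4), (4, 5), (5, 6)]
sample = gsh_edge_sample(stream, p=0.5, q=0.9)
print(len(sample), estimate_edge_count(sample))
```

    Estimators for richer properties (e.g., triangle counts) follow the same reweighting pattern over sampled substructures.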

    Scaling Up Network Analysis and Mining: Statistical Sampling, Estimation, and Pattern Discovery

    Network analysis and graph mining play a prominent role in providing insights and studying phenomena across various domains, including social, behavioral, biological, transportation, communication, and financial domains. Across all these domains, networks arise as a natural and rich representation for data. Studying these real-world networks is crucial for solving numerous problems that lead to high-impact applications, for example identifying the behavior and interests of users in online social networks (e.g., viral marketing), monitoring and detecting virus outbreaks in human contact networks, predicting protein functions in biological networks, and detecting anomalous behavior in computer networks. A key characteristic of these networks is that their complex structure is massive and continuously evolving over time, which makes it challenging and computationally intensive to analyze, query, and model these networks in their entirety. In this dissertation, we propose sampling as well as fast, efficient, and scalable methods for network analysis and mining in both static and streaming graphs.

    Sample-Based Estimation of Node Similarity in Streaming Bipartite Graphs

    This thesis focuses on estimating node similarity in streaming bipartite graphs. As an important model in many data mining applications, the bipartite graph represents the relationships between two sets of non-interconnected nodes, e.g., customers and the products/services they buy, users and the events/groups they get involved in, or individuals and the diseases they are subject to. In most of these cases, data is naturally streaming over time. Node similarity here refers mainly to neighborhood-based similarity, i.e., the Common Neighbors (CN) measure. We analyze the distributional properties of CN in terms of the CN score, its dense ranks (in which objects of equal weight receive the same rank and ranks are consecutive), and its fraction in the full projection graph, also called the similarity graph. We find that, in real-world datasets, the pairs of nodes with a large CN value constitute only a relatively small fraction. This property gives real-world streaming bipartite graphs an opportunity for space saving through weighted sampling, which preferentially selects heavily weighted edges. Therefore, in this thesis, we propose a new one-pass scheme for sampling the projection graphs of a streaming bipartite graph in fixed storage and providing unbiased estimates of the CN similarity weights.
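
    For reference, the quantity being estimated is the edge weight of the projection (similarity) graph. The sketch below computes CN scores exactly and in memory; it is only an illustrative baseline for what the proposed one-pass weighted-sampling scheme approximates under fixed storage, and the function and variable names are ours.

```python
from collections import defaultdict
from itertools import combinations


def common_neighbor_counts(bipartite_edges):
    """Exact Common Neighbors (CN) scores for a bipartite graph.

    bipartite_edges: iterable of (user, item) pairs. Returns a dict mapping
    each unordered user pair to the number of items they share, i.e. the edge
    weights of the projection (similarity) graph.
    """
    users_per_item = defaultdict(set)
    for user, item in set(bipartite_edges):  # deduplicate repeated edges
        users_per_item[item].add(user)
    cn = defaultdict(int)
    for users in users_per_item.values():
        # Every pair of users sharing this item gains one common neighbor.
        for u, v in combinations(sorted(users), 2):
            cn[(u, v)] += 1
    return dict(cn)


# Toy usage: customers and the products they buy.
edges = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"), ("c", "y")]
print(common_neighbor_counts(edges))
# CN scores: ('a', 'b'): 2, ('a', 'c'): 1, ('b', 'c'): 1
```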

    Computing Graph Descriptors on Edge Streams

    Feature extraction is an essential task in graph analytics. These feature vectors, called graph descriptors, are used in downstream vector-space-based graph analysis models. This idea has proved fruitful in the past, with spectral-based graph descriptors providing state-of-the-art classification accuracy. However, known algorithms to compute meaningful descriptors do not scale to large graphs since: (1) they require storing the entire graph in memory, and (2) the end-user has no control over the algorithm's runtime. In this paper, we present streaming algorithms to approximately compute three different graph descriptors capturing the essential structure of graphs. Operating on edge streams allows us to avoid storing the entire graph in memory, and controlling the sample size enables us to keep the runtime of our algorithms within desired bounds. We demonstrate the efficacy of the proposed descriptors by analyzing the approximation error and classification accuracy. Our scalable algorithms compute descriptors of graphs with millions of edges within minutes. Moreover, these descriptors yield predictive accuracy comparable to the state-of-the-art methods but can be computed using only 25% as much memory. Comment: Extension of work accepted to PAKDD 202
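
    The memory and runtime control described above can be illustrated with a plain uniform reservoir sample over the edge stream. The paper's descriptors are more sophisticated, so the degree-histogram "descriptor" below is only a stand-in, under our own naming, to show the single-pass, fixed-size pattern.

```python
import random
from collections import Counter


def reservoir_edge_sample(edge_stream, k, seed=0):
    """Uniform reservoir sample of k edges from a stream (single pass).

    Bounding the sample size k bounds both memory and the runtime of any
    descriptor later computed on the sample.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, edge in enumerate(edge_stream):
        if i < k:
            reservoir.append(edge)
        else:
            j = rng.randint(0, i)  # classic Algorithm R replacement step
            if j < k:
                reservoir[j] = edge
    return reservoir


def degree_histogram(edges):
    """A toy descriptor: histogram of node degrees in the sampled subgraph."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return Counter(deg.values())


# Toy usage on a synthetic stream of 100,000 edges, sampled down to 500.
stream = ((i, (i * 37 + 1) % 5000) for i in range(100_000))
sample = reservoir_edge_sample(stream, k=500)
print(degree_histogram(sample))
```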

    Scalable Methods and Algorithms for Very Large Graphs Based on Sampling

    Analyzing real-life networks is a computationally intensive task due to the sheer size of networks. Direct analysis is even impossible when the network data is not entirely accessible. For instance, user networks in Twitter and Facebook are not available for third parties to explore their properties directly. Thus, sampling-based algorithms are indispensable. This dissertation addresses the confidence interval (CI) and bias problems in real-world network analysis. It uses estimations of the number of triangles (hereafter ∆) and the clustering coefficient (hereafter C) as a case study. The metric ∆ of a graph is an important measurement for understanding the graph. It is also directly related to C, which is one of the most important indicators for social networks. The methods proposed in this dissertation can be utilized in other sampling problems. First, we proposed two new methods to estimate ∆ based on random edge sampling in both streaming and non-streaming models. These methods outperformed the state-of-the-art methods consistently and could be better by orders of magnitude when the graph is very large. More importantly, we proved the improvement ratio analytically and verified our result extensively in real-world networks. The analytical results were achieved by simplifying the variances of the estimators based on the assumption that the graph is very large. We believe that such a big-data assumption can lead to interesting results not only in triangle estimation but also in other sampling problems. Next, we studied the estimation of C in both streaming and non-streaming sampling models. Despite numerous algorithms proposed in this area, the bias and variance of the estimators remain an open problem. We quantified the bias using Taylor expansion and found that the bias can be determined by the structure of the sampled data. Based on this understanding of the bias, we gave new estimators that correct the bias. The results were derived analytically and verified in 56 real networks of different sizes and structures. The experiments reveal that the bias ranges widely from data to data. The relative bias can be as high as 4% in the non-streaming model and 2% in the streaming model, or it can be negative. We also derived the variances of the estimators, and estimators for the variances. Our simplified estimators can be used in practice to control the accuracy level of estimations.
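
    The basic inverse-probability idea behind estimating ∆ from random edge sampling can be sketched as follows; the dissertation's estimators and their variance analysis are more refined, and the names and toy example here are ours. If each edge is kept independently with probability p, a triangle survives with probability p^3, so dividing the sampled triangle count by p^3 gives an unbiased estimate.

```python
import random
from collections import defaultdict
from itertools import combinations


def sample_edges(edges, p, seed=0):
    """Keep each edge independently with probability p."""
    rng = random.Random(seed)
    return [e for e in edges if rng.random() < p]


def count_triangles(edges):
    """Exact triangle count of a (small) simple undirected edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    total = 0
    for u, v in edges:
        total += len(adj[u] & adj[v])  # each triangle counted once per edge
    return total // 3


def estimate_triangles(edges, p, seed=0):
    """Unbiased estimate of the full graph's triangle count: a triangle
    survives independent edge sampling with probability p**3, so scaling
    the sampled count by 1 / p**3 removes the bias."""
    return count_triangles(sample_edges(edges, p, seed)) / p ** 3


# Toy usage on a 6-clique, whose true triangle count is C(6, 3) = 20.
clique_edges = list(combinations(range(6), 2))
print(count_triangles(clique_edges), estimate_triangles(clique_edges, p=0.8))
```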

    Sampling methods and estimation of triangle count distributions in large networks

    This paper investigates the distributions of triangle counts per vertex and per edge as a means for network description, analysis, model building, and other tasks. The main interest is in estimating these distributions through sampling, especially for large networks. A novel sampling method tailored for the estimation analysis is proposed, with three sampling designs motivated by several network access scenarios. An estimation method based on inversion and an asymptotic method are developed to recover the entire distribution. A single method to estimate the distribution using multiple samples is also considered. Algorithms are presented to sample the network under the various access scenarios. Finally, the estimation methods are evaluated on synthetic and real-world networks in a data study.
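
    As a point of reference, the target quantities are the per-vertex and per-edge triangle count distributions. The sketch below computes them exactly on a small in-memory graph, which is what the sampling designs and the inversion/asymptotic estimators aim to recover without full network access; the function names are illustrative.

```python
from collections import Counter, defaultdict


def triangle_count_distributions(edges):
    """Exact distributions of triangle counts per vertex and per edge.

    Returns two Counters mapping a triangle count t to the number of
    vertices (resp. edges) participating in exactly t triangles.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    per_vertex = {node: 0 for node in adj}
    per_edge = {}
    for u, v in edges:
        shared = len(adj[u] & adj[v])  # triangles containing edge (u, v)
        per_edge[(u, v)] = shared
        per_vertex[u] += shared
        per_vertex[v] += shared
    # Each triangle at a vertex was counted once per incident edge (twice).
    for node in per_vertex:
        per_vertex[node] //= 2
    return Counter(per_vertex.values()), Counter(per_edge.values())


# Toy usage: a triangle with one pendant vertex attached.
edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
print(triangle_count_distributions(edges))
# (Counter({1: 3, 0: 1}), Counter({1: 3, 0: 1}))
```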
    • ā€¦
    corecore