Graph Sample and Hold: A Framework for Big-Graph Analytics
Sampling is a standard approach in big-graph analytics; the goal is to
efficiently estimate the graph properties by consulting a sample of the whole
population. A perfect sample is assumed to mirror every property of the whole
population. Unfortunately, such a perfect sample is hard to collect in complex
populations such as graphs (e.g., web graphs, social networks, etc.), where an
underlying network connects the units of the population. Therefore, a good
sample will be representative in the sense that graph properties of interest
can be estimated with a known degree of accuracy. While previous work focused
particularly on sampling schemes used to estimate certain graph properties
(e.g. triangle count), much less is known for the case when we need to estimate
various graph properties with the same sampling scheme. In this paper, we
propose a generic stream sampling framework for big-graph analytics, called
Graph Sample and Hold (gSH). To begin, the proposed framework samples from
massive graphs sequentially in a single pass, one edge at a time, while
maintaining a small state. We then show how to produce unbiased estimators for
various graph properties from the sample. Given that the graph analysis
algorithms will run on a sample instead of the whole population, the runtime
complexity of these algorithms is kept under control. Moreover, given that the
estimators of graph properties are unbiased, the approximation error is kept
under control. Finally, we show the performance of the proposed framework (gSH)
on various types of graphs, such as social graphs, among others.
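The sample-and-hold idea described above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the paper's full gSH framework: an arriving edge is kept with probability q if it touches an already-held node and p otherwise, and the recorded inclusion probabilities give a Horvitz-Thompson-style unbiased edge-count estimate.

```python
import random

def gsh_sample(edge_stream, p=0.1, q=0.5, seed=0):
    """One-pass sample-and-hold sketch: keep an arriving edge with
    probability q if it touches an already-held node, else p.
    Returns held edges mapped to their inclusion probabilities."""
    rng = random.Random(seed)
    held_nodes = set()
    sample = {}  # edge -> probability with which it was sampled
    for u, v in edge_stream:
        prob = q if (u in held_nodes or v in held_nodes) else p
        if rng.random() < prob:
            sample[(u, v)] = prob
            held_nodes.update((u, v))
    return sample

def estimate_edge_count(sample):
    """Horvitz-Thompson estimate: weight each held edge by 1/Pr[sampled]."""
    return sum(1.0 / prob for prob in sample.values())
```

With p = q = 1 the sample is the whole stream and the estimate is exact; smaller p and q trade accuracy for state size.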
Scaling Up Network Analysis and Mining: Statistical Sampling, Estimation, and Pattern Discovery
Network analysis and graph mining play a prominent role in providing insights and studying phenomena across various domains, including social, behavioral, biological, transportation, communication, and financial domains. Across all these domains, networks arise as a natural and rich representation for data. Studying these real-world networks is crucial for solving numerous problems that lead to high-impact applications. For example, identifying the behavior and interests of users in online social networks (e.g., viral marketing), monitoring and detecting virus outbreaks in human contact networks, predicting protein functions in biological networks, and detecting anomalous behavior in computer networks. A key characteristic of these networks is that their complex structure is massive and continuously evolving over time, which makes it challenging and computationally intensive to analyze, query, and model these networks in their entirety. In this dissertation, we propose sampling as well as fast, efficient, and scalable methods for network analysis and mining in both static and streaming graphs
Sample-Based Estimation of Node Similarity in Streaming Bipartite Graphs
My thesis focuses on estimating node similarity in streaming bipartite
graphs. As an important model in many data-mining applications, the bipartite
graph represents the relationships between two sets of non-interconnected nodes, e.g., customers
and the products/services they buy, users and the events/groups they get involved
in, individuals and the diseases they are subject to, etc. In most of these cases, data
naturally streams over time.
Node similarity in my thesis refers mainly to neighborhood-based similarity,
i.e., the Common Neighbors (CN) measure. We analyze the distributional properties of CN
in terms of the CN score; its dense ranks, in which objects of equal weight receive the same
rank and ranks are consecutive; and its fraction in the full projection graph, also
called the similarity graph. We find that, in real-world datasets, the pairs of nodes with large
CN values constitute only a relatively small fraction. This property gives real-world
streaming bipartite graphs an opportunity for space saving through weighted sampling,
which can preferentially select edges with high weights.
Therefore, in this thesis, we propose a new one-pass scheme for sampling the projection
graphs of a streaming bipartite graph in fixed storage and providing unbiased estimates of
the CN similarity weights
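The unbiasedness argument behind such estimators can be illustrated with plain uniform edge sampling (a simplified sketch, not the thesis's weighted scheme): a shared item survives only if both of its edges are kept, which happens with probability p², so dividing the sampled CN count by p² restores an unbiased estimate.

```python
import random

def sample_bipartite_edges(edge_stream, p=0.5, seed=0):
    """Keep each (user, item) edge independently with probability p."""
    rng = random.Random(seed)
    return [e for e in edge_stream if rng.random() < p]

def estimate_cn(sampled_edges, u, v, p=0.5):
    """Unbiased Common-Neighbors estimate for left-nodes u and v:
    a common item appears iff both of its edges survived (prob. p*p),
    so the sampled intersection size is scaled by 1/p^2."""
    items_of = lambda x: {item for (node, item) in sampled_edges if node == x}
    return len(items_of(u) & items_of(v)) / (p * p)
```

At p = 1 this recovers the exact CN score; the thesis's weighted sampling would replace the uniform coin flip with a weight-dependent one and adjust the scaling accordingly.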
Computing Graph Descriptors on Edge Streams
Feature extraction is an essential task in graph analytics. These feature
vectors, called graph descriptors, are used in downstream vector-space-based
graph analysis models. This idea has proved fruitful in the past, with
spectral-based graph descriptors providing state-of-the-art classification
accuracy. However, known algorithms to compute meaningful descriptors do not
scale to large graphs since: (1) they require storing the entire graph in
memory, and (2) the end-user has no control over the algorithm's runtime. In
this paper, we present streaming algorithms to approximately compute three
different graph descriptors capturing the essential structure of graphs.
Operating on edge streams allows us to avoid storing the entire graph in
memory, and controlling the sample size enables us to keep the runtime of our
algorithms within desired bounds. We demonstrate the efficacy of the proposed
descriptors by analyzing the approximation error and classification accuracy.
Our scalable algorithms compute descriptors of graphs with millions of edges
within minutes. Moreover, these descriptors yield predictive accuracy
comparable to the state-of-the-art methods but can be computed using only 25%
as much memory.
Comment: Extension of work accepted to PAKDD 202
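The runtime control mentioned above rests on bounding the number of edges the algorithm retains. A standard tool for this (a generic sketch, not the paper's specific descriptor algorithms) is reservoir sampling, which keeps a uniform sample of k edges from a stream of unknown length in O(k) memory:

```python
import random

def edge_reservoir(edge_stream, k, seed=0):
    """Classic reservoir sampling over an edge stream: after seeing i
    edges, every edge so far is in the reservoir with probability k/i."""
    rng = random.Random(seed)
    reservoir = []
    for i, edge in enumerate(edge_stream):
        if i < k:
            reservoir.append(edge)       # fill the reservoir first
        else:
            j = rng.randrange(i + 1)     # uniform index in [0, i]
            if j < k:
                reservoir[j] = edge      # replace a random slot
    return reservoir
```

Fixing k fixes both the memory footprint and the per-edge work, which is the kind of user-controlled runtime bound the abstract refers to.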
Scalable Methods and Algorithms for Very Large Graphs Based on Sampling
Analyzing real-life networks is a computationally intensive task due to the sheer size of networks. Direct analysis is even impossible when the network data is not entirely accessible. For instance, user networks in Twitter and Facebook are not available for third parties to explore their properties directly. Thus, sampling-based algorithms are indispensable. This dissertation addresses the confidence interval (CI) and bias problems in real-world network analysis. It uses estimations of the number of triangles (hereafter Δ) and the clustering coefficient (hereafter C) as a case study. The metric Δ in a graph is an important measurement for understanding the graph. It is also directly related to C, which is one of the most important indicators for social networks. The methods proposed in this dissertation can be utilized in other sampling problems. First, we proposed two new methods to estimate Δ based on random edge sampling in both streaming and non-streaming models. These methods outperformed the state-of-the-art methods consistently and could be better by orders of magnitude when the graph is very large. More importantly, we proved the improvement ratio analytically and verified our result extensively in real-world networks. The analytical results were achieved by simplifying the variances of the estimators based on the assumption that the graph is very large. We believe that such a big-data assumption can lead to interesting results not only in triangle estimation but also in other sampling problems. Next, we studied the estimation of C in both streaming and non-streaming sampling models. Despite numerous algorithms proposed in this area, the bias and variance of the estimators remain an open problem. We quantified the bias using Taylor expansion and found that the bias can be determined by the structure of the sampled data. Based on this understanding of the bias, we gave new estimators that correct for it.
The results were derived analytically and verified in 56 real networks ranging over different sizes and structures. The experiments reveal that the bias ranges widely from data to data. The relative bias can be as high as 4% in the non-streaming model and 2% in the streaming model, or it can be negative. We also derived the variances of the estimators, and estimators for those variances. Our simplified estimators can be used in practice to control the accuracy level of estimations
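The baseline that such dissertations improve on can be sketched as the classic independent-edge-sampling triangle estimator (a simplified illustration, not the dissertation's improved methods): each triangle survives only if all three of its edges are sampled, with probability p³, so the sampled count is scaled by 1/p³.

```python
import random
from itertools import combinations

def estimate_triangles(edges, p=0.5, seed=0):
    """Sample each edge independently with probability p, count the
    surviving triangles, and scale by 1/p^3. Triangle enumeration here
    is brute force over node triples, for illustration only."""
    rng = random.Random(seed)
    sampled = {frozenset(e) for e in edges if rng.random() < p}
    nodes = set().union(*sampled) if sampled else set()
    count = sum(
        1 for a, b, c in combinations(sorted(nodes), 3)
        if {frozenset((a, b)), frozenset((b, c)), frozenset((a, c))} <= sampled
    )
    return count / p**3
```

The estimator is unbiased, but its variance grows as p shrinks; tightening that variance (and the resulting confidence intervals) is precisely the kind of improvement the dissertation analyzes.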
Sampling methods and estimation of triangle count distributions in large networks
This paper investigates the distributions of triangle counts per vertex and edge, as a means for network
description, analysis, model building, and other tasks. The main interest is in estimating these distributions
through sampling, especially for large networks. A novel sampling method tailored for the estimation analysis is proposed, with three sampling designs motivated by several network access scenarios. An estimation
method based on inversion and an asymptotic method are developed to recover the entire distribution.
A single method to estimate the distribution using multiple samples is also considered. Algorithms are
presented to sample the network under the various access scenarios. Finally, the estimation methods on
synthetic and real-world networks are evaluated in a data study.
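The target quantity here, the per-vertex triangle count whose distribution is to be estimated, has a straightforward exact reference computation (a sketch assuming comparable node labels, useful as ground truth when evaluating sampling estimators):

```python
from collections import Counter, defaultdict

def triangle_count_per_vertex(edges):
    """Exact per-vertex triangle counts: each triangle {u, v, w} adds 1
    to the count of all three of its vertices. Node labels must be
    mutually comparable (the u < v < w ordering avoids double counting)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    counts = Counter()
    for u in adj:
        for v in adj[u]:
            if v > u:
                for w in adj[u] & adj[v]:  # common neighbors close a triangle
                    if w > v:
                        for x in (u, v, w):
                            counts[x] += 1
    return counts
```

The distribution the paper estimates is then the histogram of these counts over vertices (and analogously over edges).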
Data Stream Algorithms for Large Graphs and High Dimensional Data
In contrast to the traditional random access memory computational model where the entire input is available in the working memory, the data stream model only provides sequential access to the input. The data stream model is a natural framework to handle large and dynamic data. In this model, we focus on designing algorithms that use sublinear memory and a small number of passes over the stream. Other desirable properties include fast update time, query time, and post processing time.
In this dissertation, we consider different problems in graph theory, combinatorial optimization, and high dimensional data processing.
The first part of this dissertation focuses on algorithms for graph theory and combinatorial optimization. We present new results for the problems of finding the densest subgraph, counting the number of triangles, finding max cut with bounded components, and finding the maximum set coverage.
The second part of this dissertation considers problems in high dimensional data streams. In this setting, each stream item consists of multiple coordinates corresponding to different attributes. We consider the problem of testing or learning about the relationships among the attributes, and the problem of finding heavy hitters in subsets of attributes
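Heavy-hitter detection in a stream, one of the problems named above, has a standard sublinear-memory solution worth sketching (a generic textbook algorithm, not necessarily the dissertation's own): the Misra-Gries summary finds every item occurring more than n/k times in a stream of length n using only k-1 counters.

```python
def misra_gries(stream, k):
    """Misra-Gries summary with k-1 counters: every item whose true
    frequency exceeds len(stream)/k is guaranteed to survive in the
    returned dict (counts are underestimates, never overestimates)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement all, dropping any that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

A second pass over the stream (or the high-dimensional analogue of restricting to attribute subsets) can then verify the exact counts of the surviving candidates.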