28,682 research outputs found
Graph Sample and Hold: A Framework for Big-Graph Analytics
Sampling is a standard approach in big-graph analytics; the goal is to
efficiently estimate the graph properties by consulting a sample of the whole
population. A perfect sample is assumed to mirror every property of the whole
population. Unfortunately, such a perfect sample is hard to collect in complex
populations such as graphs (e.g. web graphs, social networks etc), where an
underlying network connects the units of the population. Therefore, a good
sample will be representative in the sense that graph properties of interest
can be estimated with a known degree of accuracy. While previous work focused
particularly on sampling schemes used to estimate certain graph properties
(e.g. triangle count), much less is known for the case when we need to estimate
various graph properties with the same sampling scheme. In this paper, we
propose a generic stream sampling framework for big-graph analytics, called
Graph Sample and Hold (gSH). To begin, the proposed framework samples from
massive graphs sequentially in a single pass, one edge at a time, while
maintaining a small state. We then show how to produce unbiased estimators for
various graph properties from the sample. Given that the graph analysis
algorithms will run on a sample instead of the whole population, the runtime
complexity of these algorithm is kept under control. Moreover, given that the
estimators of graph properties are unbiased, the approximation error is kept
under control. Finally, we show the performance of the proposed framework (gSH)
on various types of graphs, such as social graphs, among others
Network Sampling: From Static to Streaming Graphs
Network sampling is integral to the analysis of social, information, and
biological networks. Since many real-world networks are massive in size,
continuously evolving, and/or distributed in nature, the network structure is
often sampled in order to facilitate study. For these reasons, a more thorough
and complete understanding of network sampling is critical to support the field
of network science. In this paper, we outline a framework for the general
problem of network sampling, by highlighting the different objectives,
population and units of interest, and classes of network sampling methods. In
addition, we propose a spectrum of computational models for network sampling
methods, ranging from the traditionally studied model based on the assumption
of a static domain to a more challenging model that is appropriate for
streaming domains. We design a family of sampling methods based on the concept
of graph induction that generalize across the full spectrum of computational
models (from static to streaming) while efficiently preserving many of the
topological properties of the input graphs. Furthermore, we demonstrate how
traditional static sampling algorithms can be modified for graph streams for
each of the three main classes of sampling methods: node, edge, and
topology-based sampling. Our experimental results indicate that our proposed
family of sampling methods more accurately preserves the underlying properties
of the graph for both static and streaming graphs. Finally, we study the impact
of network sampling algorithms on the parameter estimation and performance
evaluation of relational classification algorithms
Data Sketches for Disaggregated Subset Sum and Frequent Item Estimation
We introduce and study a new data sketch for processing massive datasets. It
addresses two common problems: 1) computing a sum given arbitrary filter
conditions and 2) identifying the frequent items or heavy hitters in a data
set. For the former, the sketch provides unbiased estimates with state of the
art accuracy. It handles the challenging scenario when the data is
disaggregated so that computing the per unit metric of interest requires an
expensive aggregation. For example, the metric of interest may be total clicks
per user while the raw data is a click stream with multiple rows per user. Thus
the sketch is suitable for use in a wide range of applications including
computing historical click through rates for ad prediction, reporting user
metrics from event streams, and measuring network traffic for IP flows.
We prove and empirically show the sketch has good properties for both the
disaggregated subset sum estimation and frequent item problems. On i.i.d. data,
it not only picks out the frequent items but gives strongly consistent
estimates for the proportion of each frequent item. The resulting sketch
asymptotically draws a probability proportional to size sample that is optimal
for estimating sums over the data. For non i.i.d. data, we show that it
typically does much better than random sampling for the frequent item problem
and never does worse. For subset sum estimation, we show that even for
pathological sequences, the variance is close to that of an optimal sampling
design. Empirically, despite the disadvantage of operating on disaggregated
data, our method matches or bests priority sampling, a state of the art method
for pre-aggregated data and performs orders of magnitude better on skewed data
compared to uniform sampling. We propose extensions to the sketch that allow it
to be used in combining multiple data sets, in distributed systems, and for
time decayed aggregation
A Selectivity based approach to Continuous Pattern Detection in Streaming Graphs
Cyber security is one of the most significant technical challenges in current
times. Detecting adversarial activities, prevention of theft of intellectual
properties and customer data is a high priority for corporations and government
agencies around the world. Cyber defenders need to analyze massive-scale,
high-resolution network flows to identify, categorize, and mitigate attacks
involving networks spanning institutional and national boundaries. Many of the
cyber attacks can be described as subgraph patterns, with prominent examples
being insider infiltrations (path queries), denial of service (parallel paths)
and malicious spreads (tree queries). This motivates us to explore subgraph
matching on streaming graphs in a continuous setting. The novelty of our work
lies in using the subgraph distributional statistics collected from the
streaming graph to determine the query processing strategy. We introduce a
"Lazy Search" algorithm where the search strategy is decided on a
vertex-to-vertex basis depending on the likelihood of a match in the vertex
neighborhood. We also propose a metric named "Relative Selectivity" that is
used to select between different query processing strategies. Our experiments
performed on real online news, network traffic stream and a synthetic social
network benchmark demonstrate 10-100x speedups over selectivity agnostic
approaches.Comment: in 18th International Conference on Extending Database Technology
(EDBT) (2015
- …