On Counting Triangles through Edge Sampling in Large Dynamic Graphs
Traditional frameworks for dynamic graphs have relied on processing only the
stream of edges added into or deleted from an evolving graph, but not any
additional related information such as the degrees or neighbor lists of nodes
incident to the edges. In this paper, we propose a new edge sampling framework
for big-graph analytics in dynamic graphs which enhances the traditional model
by enabling the use of additional related information. To demonstrate the
advantages of this framework, we present a new sampling algorithm, called Edge
Sample and Discard (ESD). It generates an unbiased estimate of the total number
of triangles, which can be continuously updated in response to both edge
additions and deletions. We provide a comparative analysis of the performance
of ESD against two current state-of-the-art algorithms in terms of accuracy and
complexity. The results of the experiments performed on real graphs show that,
with the help of the neighborhood information of the sampled edges, the
accuracy achieved by our algorithm is substantially better. We also
characterize the impact of properties of the graph on the performance of our
algorithm by testing on several Barabási-Albert graphs.
Comment: A short version of this article appeared in Proceedings of the 2017
IEEE/ACM International Conference on Advances in Social Networks Analysis and
Mining (ASONAM 2017).
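The role of neighborhood information in such an estimator can be illustrated with a small sketch. This is not the paper's ESD algorithm; it is a simplified sampler written under stated assumptions: the graph is simple (no self-loops or duplicate edges), deletions only remove existing edges, and the `adj` dictionary stands in for whatever service supplies neighbor lists. The function name `estimate_triangles` and the `('+', u, v)` / `('-', u, v)` event format are hypothetical.

```python
import random
from collections import defaultdict


def estimate_triangles(stream, p, seed=0):
    """Estimate the triangle count of a dynamic graph by edge sampling.

    stream: iterable of (op, u, v), op is '+' (insert) or '-' (delete).
    p: probability with which each edge event is sampled (0 < p <= 1).
    """
    rng = random.Random(seed)
    adj = defaultdict(set)  # assumption: stands in for neighbor-list lookups
    estimate = 0.0
    for op, u, v in stream:
        if op == '-':
            # Remove the edge first, so the common neighbors below are
            # exactly the triangles destroyed by this deletion.
            adj[u].discard(v)
            adj[v].discard(u)
        if rng.random() < p:
            # Triangles created or destroyed by this edge event.
            common = len(adj[u] & adj[v])
            # Scaling by 1/p keeps the running estimate unbiased.
            estimate += (common if op == '+' else -common) / p
        if op == '+':
            adj[u].add(v)
            adj[v].add(u)
    return estimate
```

Each triangle is created by exactly one insertion (its last edge) and destroyed by exactly one deletion, so counting common neighbors at a sampled event and dividing by p gives an unbiased running total. With p = 1 the estimate is exact.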
Variance-Optimal Offline and Streaming Stratified Random Sampling
Stratified random sampling (SRS) is a fundamental sampling technique that
provides accurate estimates for aggregate queries using a small size sample,
and has been used widely for approximate query processing. A key question in
SRS is how to partition a target sample size among different strata. While
Neyman allocation provides a solution that minimizes the variance of an
estimate using this sample, it works under the assumption that each stratum is
abundant, i.e., has a large number of data points to choose from. This
assumption may not hold in general: one or more strata may be bounded, and may
not contain a large number of data points, even though the total data size may
be large.
We first present VOILA, an offline method for allocating sample sizes to
strata in a variance-optimal manner, even for the case when one or more strata
may be bounded. We next consider SRS on streaming data that are continuously
arriving. We show a lower bound: any streaming algorithm for SRS must have,
in the worst case, a variance that is a factor of Ω(r) away from the
optimal, where r is the number of strata. We present S-VOILA, a practical
streaming algorithm for SRS that is locally variance-optimal in its allocation
of sample sizes to different strata. Results from our experiments on real and
synthetic data show that VOILA can have significantly (1.4 to 50.0 times)
smaller variance than Neyman allocation. The streaming algorithm S-VOILA
results in a variance that is typically close to that of VOILA, which is given
the entire input beforehand.
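To make the bounded-stratum problem concrete, the following sketch computes the classical Neyman allocation, in which stratum h receives a share of the sample proportional to N_h * sigma_h (stratum size times standard deviation). It does not implement VOILA; the function name `neyman_allocation` and the example numbers are illustrative assumptions.

```python
def neyman_allocation(sizes, stds, n):
    """Classical Neyman allocation of a total sample size n.

    sizes: number of data points N_h in each stratum.
    stds:  standard deviation sigma_h of each stratum.
    Returns fractional sample sizes proportional to N_h * sigma_h.
    """
    weights = [N * s for N, s in zip(sizes, stds)]
    total = sum(weights)
    if total == 0:
        raise ValueError("all strata have zero size or zero variance")
    return [n * w / total for w in weights]


# Hypothetical example: a large low-variance stratum and a tiny
# high-variance one. The tiny stratum holds only 10 points.
sizes = [1000, 10]
stds = [1.0, 50.0]
allocation = neyman_allocation(sizes, stds, 100)
overfull = [a > N for a, N in zip(allocation, sizes)]
```

Here the second stratum is assigned roughly a third of the 100 samples although it contains only 10 data points, so `overfull` flags it. This is the situation in which the abundance assumption behind Neyman allocation fails.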
Stratified Random Sampling from Streaming and Stored Data
Stratified random sampling (SRS) is a widely used sampling technique for approximate query processing. We consider SRS on continuously arriving data streams, and make the following contributions. We present a lower bound showing that any streaming algorithm for SRS must have, in the worst case, a variance that is a factor of Ω(r) away from the optimal, where r is the number of strata. We present S-VOILA, a streaming algorithm for SRS that is locally variance-optimal. Results from experiments on real and synthetic data show that S-VOILA results in a variance that is typically close to that of an optimal offline algorithm, which is given the entire input beforehand. We also present a variance-optimal offline algorithm, VOILA, for stratified random sampling. VOILA is a strict generalization of the well-known Neyman allocation, which is optimal only under the assumption that each stratum is abundant, i.e., has a large number of data points to choose from. Experiments show that VOILA can have significantly smaller variance (1.4x to 50x) than Neyman allocation on real-world data.
Maintaining bounded-size sample synopses of evolving datasets
Perhaps the most flexible synopsis of a database is a uniform random sample of the data; such samples are widely used to speed up processing of analytic queries and data-mining tasks, enhance query optimization, and facilitate information integration. The ability to bound the maximum size of a sample can be very convenient from a system-design point of view, because the task of memory management is simplified, especially when many samples are maintained simultaneously. In this paper, we study methods for incrementally maintaining a bounded-size uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions and deletions. For “stable” datasets whose size remains roughly constant over time, we provide a novel sampling scheme, called “random pairing” (RP), that maintains a bounded-size uniform sample by using newly inserted data items to compensate for previous deletions. The RP algorithm is the first extension of the 45-year-old reservoir sampling algorithm to handle deletions; RP reduces to the “passive” algorithm of Babcock et al. when the insertions and deletions correspond to a moving window over a data stream. Experiments show that, when dataset-size fluctuations over time are not too extreme, RP is the algorithm of choice with respect to speed and sample-size stability. For “growing” datasets, we consider algorithms for periodically resizing a bounded-size random sample upwards. We prove that any such algorithm cannot avoid accessing the base data, and provide a novel resizing algorithm that minimizes the time needed to increase the sample size. We also show how to merge uniform samples from disjoint datasets to obtain a uniform sample of the union of the datasets; the merged sample can be incrementally maintained. Our new RPMerge algorithm extends the HRMerge algorithm of Brown and Haas to effectively deal with deletions, thereby facilitating efficient parallel sampling
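The idea of using new insertions to compensate for earlier deletions can be sketched as follows. This is a minimal illustration of the random-pairing scheme as commonly described, with two counters for uncompensated deletions; it assumes distinct items and deletions of existing items only, and the class and attribute names are hypothetical rather than taken from the paper.

```python
import random


class RandomPairingSample:
    """Bounded-size uniform sample under insertions and deletions (sketch)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.rng = random.Random(seed)
        self.sample = []  # current sample, at most `capacity` items
        self.size = 0     # current size of the underlying dataset
        self.c1 = 0       # uncompensated deletions that hit the sample
        self.c2 = 0       # uncompensated deletions that missed the sample

    def insert(self, item):
        self.size += 1
        pending = self.c1 + self.c2
        if pending == 0:
            # No deletions to compensate: ordinary reservoir sampling step.
            if len(self.sample) < self.capacity:
                self.sample.append(item)
            elif self.rng.random() < self.capacity / self.size:
                self.sample[self.rng.randrange(self.capacity)] = item
        elif self.rng.random() < self.c1 / pending:
            # Pair the new item with a deletion that had hit the sample.
            self.sample.append(item)
            self.c1 -= 1
        else:
            # Pair the new item with a deletion that had missed the sample.
            self.c2 -= 1

    def delete(self, item):
        self.size -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c1 += 1
        else:
            self.c2 += 1
```

When no deletions are pending, the structure behaves exactly like a reservoir sample, which matches the abstract's description of RP as an extension of reservoir sampling.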
Distributed Data Streaming Algorithms for Network Anomaly Detection
Network attacks and anomalies such as DDoS attacks, service outages, and email spamming happen every day, causing problems for users such as financial loss, inconvenience due to service unavailability, and leakage of personal information. Many methods have been studied and developed to tackle these attacks; among them, data streaming algorithms are powerful and flexible schemes with many applications in network attack detection and identification. Data streaming algorithms usually use limited space to store aggregated information and report certain properties of the traffic in short, constant time.
There are several challenges in designing data streaming algorithms. First, network traffic is usually distributed and monitored at different locations, and it is often desirable to aggregate the distributed monitoring information to detect attacks that might be low-profile at any single location; data streaming algorithms therefore have to support merging without loss of information. Second, network traffic is usually high-speed and large-volume, so data streaming algorithms have to process data quickly while saving space and time. Third, detection alone is sometimes not enough and identifying the targets matters more, in which case data streaming algorithms have to be concise and reversible.
In this dissertation, we study three different types of data streaming algorithms: hot item identification, distinct element counting, and superspreader identification. We propose new algorithms to solve these problems and evaluate them with both theoretical analysis and experiments to show their effectiveness and their improvements over previous methods.
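The requirement that distributed summaries merge without loss of information can be illustrated with a distinct-element counter. The sketch below uses linear counting over a bitmap, a standard technique chosen here for illustration and not one of the dissertation's algorithms; the class name `LinearCounter`, the bitmap size, and the use of SHA-256 as the hash are assumptions.

```python
import hashlib
import math


class LinearCounter:
    """Mergeable distinct-element counter based on a bitmap (sketch)."""

    def __init__(self, m=4096):
        self.m = m     # number of bits in the bitmap
        self.bits = 0  # Python int used as a bitmap

    def add(self, item):
        # Deterministic hash, so every monitor maps an item to the same bit.
        digest = hashlib.sha256(str(item).encode()).digest()
        self.bits |= 1 << (int.from_bytes(digest[:8], "big") % self.m)

    def merge(self, other):
        # Bitwise OR yields exactly the bitmap of the combined traffic.
        merged = LinearCounter(self.m)
        merged.bits = self.bits | other.bits
        return merged

    def estimate(self):
        zeros = self.m - bin(self.bits).count("1")
        if zeros == 0:
            return float("inf")  # bitmap saturated; m is too small
        return -self.m * math.log(zeros / self.m)
```

Two monitors that each see part of the traffic can build their own `LinearCounter` and combine them with `merge`; the result is identical to the bitmap a single monitor would have built from all of the traffic, so nothing is lost in aggregation.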