Efficient Algorithms to Compute Hierarchical Summaries from Big Data Streams
Many data stream applications involve hierarchical data, such as time, geographic locations, product information, clickstreams, server logs, and IP addresses. A hierarchical summary of such voluminous data offers multiple advantages, including compactness, quick understanding, and abstraction. The goal of this thesis is to design algorithmic approaches for summarizing hierarchical data streams.
First, this thesis provides a theoretical analysis of the benchmark hierarchical heavy hitters algorithms and uncovers their shortcomings, such as high theoretical memory and update costs and a coverage problem. To address these shortcomings, this thesis proposes efficient algorithms that offer deterministic estimation accuracy using O(η/ε) worst-case memory and O(η) worst-case time per item, where ε ∈ [0,1] is a user-defined parameter and η is a small constant derived from the data. The proposed hierarchical heavy hitters algorithms are shown to improve significantly on existing algorithms, both theoretically and empirically.
Next, this thesis introduces a new concept called hierarchically correlated heavy hitters, which differs from existing hierarchical summarization techniques. The thesis provides a formal definition of the proposed concept and compares it with existing hierarchical summarization approaches, both at the definition level and empirically. It also proposes an efficient hierarchy-aware algorithm for computing hierarchically correlated heavy hitters. The proposed algorithm offers deterministic estimation accuracy using O(η / (ε_p · ε_s)) worst-case memory and O(η) worst-case time per item, where η is as defined previously, and ε_p ∈ [0,1] and ε_s ∈ [0,1] are further user-defined parameters.
Finally, the thesis proposes a special hierarchical data structure and algorithm to summarize spatiotemporal data. It can be used to extract interesting and useful patterns from high-speed spatiotemporal data streams at multiple spatial and temporal granularities. Theoretical and empirical analyses are provided, which show that the proposed data structure is very efficient in terms of data storage and query response. It updates a single item in O(1) time and responds to a point query in O(1) time. Importantly, the memory requirement of the proposed data structure is independent of the size of the data and depends only on the user-supplied parameters ψ⃗ and φ⃗.
In summary, this thesis provides a general framework consisting of a set of algorithms and data structures to compute hierarchical summaries of big data streams. All of the proposed algorithms exploit a lattice structure built from the hierarchical attributes of the data to compute different hierarchical summaries, which can be used to address various data analytic issues in many emerging applications.
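To make the notion of a hierarchical heavy hitter concrete, the following is a minimal sketch of an exact, offline computation over a single hierarchy. It is not one of the thesis's streaming algorithms: it stores every distinct item and so has none of the memory guarantees quoted above. It assumes the commonly used discounted-count definition (a prefix is reported when its count, excluding traffic already attributed to reported descendants, reaches a fraction phi of the stream) and assumes all items are tuples of the same depth. The names `exact_hhh` and `phi` are illustrative only.

```python
from collections import Counter


def exact_hhh(items, phi):
    """Exact hierarchical heavy hitters for equal-depth path tuples.

    items: list of tuples ordered root-to-leaf, e.g. ("us", "ny").
    phi:   reporting threshold as a fraction of the stream length.
    Returns {prefix: discounted_count} for every reported prefix.
    """
    threshold = phi * len(items)
    residual = Counter(items)  # counts at the deepest level
    hhh = {}
    while residual:
        parents = Counter()
        for prefix, count in residual.items():
            if count >= threshold:
                # Heavy enough: report it and do not propagate upward.
                hhh[prefix] = count
            elif prefix:
                # Too light: roll the count up to the parent prefix.
                parents[prefix[:-1]] += count
        residual = parents  # move one level up the hierarchy
    return hhh


if __name__ == "__main__":
    stream = ([("us", "ny")] * 5 + [("us", "ca")] * 2
              + [("us", "tx")] * 2 + [("de", "be")])
    print(exact_hhh(stream, 0.25))
```

In this example neither `("us", "ca")` nor `("us", "tx")` is heavy on its own, but their combined count makes the parent `("us",)` heavy, which is the kind of pattern a flat heavy-hitter summary would miss.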
Optimal Elephant Flow Detection
Monitoring the traffic volumes of elephant flows, including the total byte
count per flow, is a fundamental capability for online network measurements. We
present an asymptotically optimal algorithm for solving this problem in terms
of both space and time complexity. This improves on previous approaches, which
can only count the number of packets in constant time. We evaluate our work on
real packet traces, demonstrating up to a 2.5x speedup compared to the best
alternative. Comment: Accepted to IEEE INFOCOM 201
Tiresias: Online Anomaly Detection for Hierarchical Operational Network Data
Operational network data, that is, management data such as customer care call
logs and equipment system logs, is a very important source of information for
network operators to detect problems in their networks. Unfortunately, there is a lack of
efficient tools to automatically track and detect anomalous events on
operational data, causing ISP operators to rely on manual inspection of this
data. While anomaly detection has been widely studied in the context of network
data, operational data presents several new challenges, including the
volatility and sparseness of data, and the need to perform fast detection
(complicating application of schemes that require offline processing or
large/stable data sets to converge).
To address these challenges, we propose Tiresias, an automated approach to
locating anomalous events on hierarchical operational data. Tiresias leverages
the hierarchical structure of operational data to identify high-impact
aggregates (e.g., locations in the network, failure modes) likely to be
associated with anomalous events. To accommodate different kinds of operational
network data, Tiresias consists of an online detection algorithm with low time
and space complexity, while preserving high detection accuracy. We present
results from two case studies using operational data collected at a large
commercial IP network operated by a Tier-1 ISP: customer care call logs and
set-top box crash logs. By comparing with a reference set verified by the ISP's
operational group, we validate that Tiresias can achieve >94% accuracy in
locating anomalies. Tiresias also discovered several previously unknown
anomalies in the ISP's customer care cases, demonstrating its effectiveness.
Fast and Accurate Mining of Correlated Heavy Hitters
The problem of mining Correlated Heavy Hitters (CHH) from a two-dimensional
data stream has been introduced recently, and a deterministic algorithm based
on the use of the Misra--Gries algorithm has been proposed by Lahiri et al. to
solve it. In this paper we present a new counter-based algorithm for tracking
CHHs, formally prove its error bounds and correctness, and show through
extensive experimental results that our algorithm outperforms the Misra--Gries
based algorithm in both accuracy and speed while requiring
asymptotically much less space.
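For readers unfamiliar with the Misra--Gries summary mentioned above, here is a minimal sketch of the classic one-dimensional version. It is neither the two-dimensional CHH algorithm of Lahiri et al. nor the new algorithm this abstract proposes; it only illustrates the building block. The function name `misra_gries` and the parameter `k` are illustrative choices. With at most k - 1 counters, every estimate undercounts the true frequency by at most n/k, where n is the stream length.

```python
def misra_gries(stream, k):
    """Classic Misra-Gries frequent-items summary using at most k - 1 counters.

    Any item occurring more than n/k times in a stream of length n is
    guaranteed to remain in the returned dictionary; its stored value is a
    lower bound on its true count, off by at most n/k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    data = ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]
    print(misra_gries(data, 3))
```

The returned counts are only candidates: a second pass over the data (or an additional error analysis) is needed to confirm which surviving items are truly frequent.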