27,988 research outputs found
A Selectivity based approach to Continuous Pattern Detection in Streaming Graphs
Cyber security is one of the most significant technical challenges in current
times. Detecting adversarial activities, prevention of theft of intellectual
properties and customer data is a high priority for corporations and government
agencies around the world. Cyber defenders need to analyze massive-scale,
high-resolution network flows to identify, categorize, and mitigate attacks
involving networks spanning institutional and national boundaries. Many of the
cyber attacks can be described as subgraph patterns, with prominent examples
being insider infiltrations (path queries), denial of service (parallel paths)
and malicious spreads (tree queries). This motivates us to explore subgraph
matching on streaming graphs in a continuous setting. The novelty of our work
lies in using the subgraph distributional statistics collected from the
streaming graph to determine the query processing strategy. We introduce a
"Lazy Search" algorithm where the search strategy is decided on a
vertex-to-vertex basis depending on the likelihood of a match in the vertex
neighborhood. We also propose a metric named "Relative Selectivity" that is
used to select between different query processing strategies. Our experiments
performed on real online news, network traffic stream and a synthetic social
network benchmark demonstrate 10-100x speedups over selectivity agnostic
approaches.Comment: in 18th International Conference on Extending Database Technology
(EDBT) (2015
On the estimation of a fixed effects model with selective non-response
Economics;Statistical Methods;econometrics
Statistical structures for internet-scale data management
Efficient query processing in traditional database management systems relies on statistics on base data. For centralized systems, there is a rich body of research results on such statistics, from simple aggregates to more elaborate synopses such as sketches and histograms. For Internet-scale distributed systems, on the other hand, statistics management still poses major challenges. With the work in this paper we aim to endow peer-to-peer data management over structured overlays with the power associated with such statistical information, with emphasis on meeting the scalability challenge. To this end, we first contribute efficient, accurate, and decentralized algorithms that can compute key aggregates such as Count, CountDistinct, Sum, and Average. We show how to construct several types of histograms, such as simple Equi-Width, Average-Shifted Equi-Width, and Equi-Depth histograms. We present a full-fledged open-source implementation of these tools for distributed statistical synopses, and report on a comprehensive experimental performance evaluation, evaluating our contributions in terms of efficiency, accuracy, and scalability
Scalable Audience Reach Estimation in Real-time Online Advertising
Online advertising has been introduced as one of the most efficient methods
of advertising throughout the recent years. Yet, advertisers are concerned
about the efficiency of their online advertising campaigns and consequently,
would like to restrict their ad impressions to certain websites and/or certain
groups of audience. These restrictions, known as targeting criteria, limit the
reachability for better performance. This trade-off between reachability and
performance illustrates a need for a forecasting system that can quickly
predict/estimate (with good accuracy) this trade-off. Designing such a system
is challenging due to (a) the huge amount of data to process, and, (b) the need
for fast and accurate estimates. In this paper, we propose a distributed fault
tolerant system that can generate such estimates fast with good accuracy. The
main idea is to keep a small representative sample in memory across multiple
machines and formulate the forecasting problem as queries against the sample.
The key challenge is to find the best strata across the past data, perform
multivariate stratified sampling while ensuring fuzzy fall-back to cover the
small minorities. Our results show a significant improvement over the uniform
and simple stratified sampling strategies which are currently widely used in
the industry
- …