7,570 research outputs found
Sequential anomaly detection in the presence of noise and limited feedback
This paper describes a methodology for detecting anomalies from sequentially
observed and potentially noisy data. The proposed approach consists of two main
elements: (1) {\em filtering}, or assigning a belief or likelihood to each
successive measurement based upon our ability to predict it from previous noisy
observations, and (2) {\em hedging}, or flagging potential anomalies by
comparing the current belief against a time-varying and data-adaptive
threshold. The threshold is adjusted based on the available feedback from an
end user. Our algorithms, which combine universal prediction with recent work
on online convex programming, do not require computing posterior distributions
given all current observations and involve simple primal-dual parameter
updates. At the heart of the proposed approach lie exponential-family models
which can be used in a wide variety of contexts and applications, and which
yield methods that achieve sublinear per-round regret against both static and
slowly varying product distributions with marginals drawn from the same
exponential family. Moreover, the regret against static distributions coincides
with the minimax value of the corresponding online strongly convex game. We
also prove bounds on the number of mistakes made during the hedging step
relative to the best offline choice of the threshold with access to all
estimated beliefs and feedback signals. We validate the theory on synthetic
data drawn from a time-varying distribution over binary vectors of high
dimensionality, as well as on the Enron email dataset.Comment: 19 pages, 12 pdf figures; final version to be published in IEEE
Transactions on Information Theor
Explain3D: Explaining Disagreements in Disjoint Datasets
Data plays an important role in applications, analytic processes, and many
aspects of human activity. As data grows in size and complexity, we are met
with an imperative need for tools that promote understanding and explanations
over data-related operations. Data management research on explanations has
focused on the assumption that data resides in a single dataset, under one
common schema. But the reality of today's data is that it is frequently
un-integrated, coming from different sources with different schemas. When
different datasets provide different answers to semantically similar questions,
understanding the reasons for the discrepancies is challenging and cannot be
handled by the existing single-dataset solutions.
In this paper, we propose Explain3D, a framework for explaining the
disagreements across disjoint datasets (3D). Explain3D focuses on identifying
the reasons for the differences in the results of two semantically similar
queries operating on two datasets with potentially different schemas. Our
framework leverages the queries to perform a semantic mapping across the
relevant parts of their provenance; discrepancies in this mapping point to
causes of the queries' differences. Exploiting the queries gives Explain3D an
edge over traditional schema matching and record linkage techniques, which are
query-agnostic. Our work makes the following contributions: (1) We formalize
the problem of deriving optimal explanations for the differences of the results
of semantically similar queries over disjoint datasets. (2) We design a 3-stage
framework for solving the optimal explanation problem. (3) We develop a
smart-partitioning optimizer that improves the efficiency of the framework by
orders of magnitude. (4)~We experiment with real-world and synthetic data to
demonstrate that Explain3D can derive precise explanations efficiently
- …