Online Data Cleaning
Data-centric applications have never been more ubiquitous in our lives, e.g., search engines, route navigation and social media. This has ushered in a new age where digital data is at the core of many decisions we make, whether as individuals, e.g., looking for the most scenic route when planning a road trip, or as professionals, e.g., analysing customers’ transactions to predict the best time to restock different products. However, the surge in data generation has also created massive amounts of dirty data, i.e., inaccurate or redundant data. Using dirty data to inform business decisions comes with dire consequences: an IBM report, for instance, estimates that dirty data costs the U.S. $3.1 trillion a year.
Dirty data is the product of many factors, including data entry errors and the integration of several data sources. Data integration is especially prone to producing dirty data: while individual sources may be free of redundancy, they often carry redundant data across one another. Furthermore, different data sources may obey different business rules (sometimes not even known), which makes it challenging to reconcile the integrated data. Even if the data is clean at the time of integration, subsequent updates can compromise its quality over time.
There is a wide spectrum of errors that can be found in data, e.g., duplicate records, missing values and obsolete data. To address these problems, several data cleaning efforts have been proposed, e.g., record linkage to identify duplicate records, data fusion to merge duplicate data items into a single representation, and the enforcement of integrity constraints on the data. However, most existing efforts make two key assumptions: (1) data cleaning is done in one shot; and (2) the data is available in its entirety. These two assumptions do not hold in our age, where data is highly volatile and integrated from several sources. This calls for a paradigm shift in approaching data cleaning: it has to be made iterative, where data comes in chunks rather than all at once. Consequently, cleaning should not be repeated from scratch whenever the data changes, but should instead be done only for the data items affected by the updates. Moreover, the repair should be computed efficiently to support applications where cleaning is performed online (e.g., query-time data cleaning). In this dissertation, we present several proposals to realize this paradigm for two major types of data errors: duplicates and integrity constraint violations.
We first present a framework that supports online record linkage and fusion over Web databases. Our system processes queries posted to Web databases. Query results are deduplicated, fused and then stored in a cache for future reference. The cache is updated iteratively with new query results. This effort makes it possible to perform record linkage and fusion not only efficiently but also effectively, i.e., the cache contains data items seen in previous queries, which are jointly cleaned with incoming query results.
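The cache-based linkage-and-fusion loop described above can be sketched in a few lines. This is an illustrative toy, not the dissertation's actual system: the normalization key, the `title`/`author` attributes and the prefer-non-empty survivorship rule are all assumptions made up for the example.

```python
def normalize(title):
    """Hypothetical blocking key: lowercase alphanumerics of the title."""
    return "".join(ch for ch in title.lower() if ch.isalnum())

def fuse(a, b):
    """Toy survivorship rule: per attribute, keep the first non-empty value."""
    return {k: a.get(k) or b.get(k) for k in set(a) | set(b)}

cache = {}  # normalized key -> fused record, grown across queries

def clean_results(results):
    """Deduplicate incoming query results against the cache, fusing matches,
    so earlier query results are jointly cleaned with new ones."""
    for rec in results:
        key = normalize(rec["title"])
        cache[key] = fuse(cache[key], rec) if key in cache else rec
    return list(cache.values())
```

A second query whose results overlap with the cache then enriches the cached records instead of creating duplicates.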
To address integrity constraint violations, we propose a novel way to approach Functional Dependency repairs, develop a new class of repairs, and then demonstrate that it is superior to existing efforts in both runtime and accuracy. We then show how our framework can be easily tuned to work iteratively to support online applications. We implement a proof-of-concept query answering system to demonstrate the iterative capability of our system.
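As background for the Functional Dependency repairs mentioned above, here is a minimal sketch of how FD violations can be detected; the repair algorithm itself is beyond an abstract-sized example, and the `zip -> city` dependency and relation schema are hypothetical.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return the lhs-values that violate the FD lhs -> rhs, i.e. those
    mapped to more than one distinct rhs-value, with the conflicting values.

    rows: list of dicts; lhs, rhs: attribute names (hypothetical schema).
    """
    groups = defaultdict(set)
    for row in rows:
        groups[row[lhs]].add(row[rhs])
    return {key: vals for key, vals in groups.items() if len(vals) > 1}

# Example: zip -> city should hold, but two city spellings share zip 10001.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "94105", "city": "San Francisco"},
]
conflicts = fd_violations(rows, "zip", "city")
```

A repair would then resolve each conflicting group, e.g. by picking or merging one of the city values per zip code.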
Cleaning Denial Constraint Violations through Relaxation
Data cleaning is a time-consuming process that depends on the data analysis
that users perform. Existing solutions treat data cleaning as a separate
offline process that takes place before analysis begins. Applying data cleaning
before analysis assumes a priori knowledge of the inconsistencies and the query
workload, thereby requiring effort on understanding and cleaning the data that
is unnecessary for the analysis. We propose an approach that performs
probabilistic repair of denial constraint violations on-demand, driven by the
exploratory analysis that users perform. We introduce Daisy, a system that
seamlessly integrates data cleaning into the analysis by relaxing query
results. Daisy executes analytical query-workloads over dirty data by weaving
cleaning operators into the query plan. Our evaluation shows that Daisy adapts
to the workload and outperforms traditional offline cleaning on both synthetic
and real-world workloads.
Comment: To appear in SIGMOD 2020 proceedings
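For readers unfamiliar with denial constraints, the following sketches violation detection for one classic constraint: no two tuples t1, t2 may have t1.salary > t2.salary and t1.tax < t2.tax. The salary/tax schema is a textbook example, not taken from Daisy, and Daisy's probabilistic, query-driven repair goes far beyond this brute-force check.

```python
from itertools import combinations

def dc_violations(rows):
    """Return the tuple pairs violating the denial constraint
    "not exists t1, t2: t1.salary > t2.salary and t1.tax < t2.tax".

    The product test covers both orderings of the pair: it is negative
    exactly when salary and tax disagree in direction.
    """
    return [
        (t1, t2)
        for t1, t2 in combinations(rows, 2)
        if (t1["salary"] - t2["salary"]) * (t1["tax"] - t2["tax"]) < 0
    ]

rows = [
    {"salary": 100, "tax": 30},
    {"salary": 50, "tax": 40},   # earns less but taxed more: violation
    {"salary": 20, "tax": 5},
]
violations = dc_violations(rows)
```

An on-demand cleaner in the spirit of Daisy would only run such checks over the tuples a query actually touches, rather than over the whole relation up front.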
Secure Diagnostics And Forensics With Network Provenance
In large-scale networks, many things can go wrong: routers can be misconfigured, programs can be buggy, and computers can be compromised by an attacker. As a result, there is a constant need to perform network diagnostics and forensics. In this dissertation, we leverage the concept of provenance to build better support for diagnostic and forensic tasks. At a high level, provenance tracks causality between network states and events, and produces a detailed explanation of any event of interest, which makes it a good starting point for investigating network problems.
However, in order to use provenance for network diagnostics and forensics, several challenges need to be addressed. First, existing provenance systems cannot provide security properties on high-speed network traffic, because the cryptographic operations would cause enormous overhead when the data rates are high. To address this challenge, we design secure packet provenance, a system that comes with a novel lightweight security protocol, to maintain secure provenance with low overhead. Second, in large-scale distributed systems, the provenance of a network event can be quite complex, so it is still challenging to identify the problem root cause from the complex provenance. To address this challenge, we design differential provenance, which can identify a symptom event’s root cause by reasoning about the differences between its provenance and the provenance of a similar “reference” event. Third, provenance can only explain why a current network state came into existence; by itself, it does not reason about changes to the network state to fix a problem. To provide operators with more diagnostic support, we design causal networks, a generalization of network provenance, to reason about network repairs that can avoid undesirable side effects in the network. Causal networks can encode multiple diagnostic goals in the same data structure, and, therefore, generate repairs that satisfy multiple constraints simultaneously. We have applied these techniques to Software-Defined Networks, Hadoop MapReduce, as well as the Internet’s data plane. Our evaluation with real-world traffic traces and network topologies shows that our systems can run with reasonable overhead, and that they can accurately identify root causes of practical problems and generate repairs without causing collateral damage.
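A toy rendering of the differential-provenance idea described above: compare the provenance tree of a symptom event with that of a similar reference event and report the first point of divergence as a candidate root cause. The tree encoding (`rule` labels, `children` lists) is an assumption for illustration only, and it glosses over graph-shaped provenance and trees with differing child counts.

```python
def diff_provenance(symptom, reference):
    """Walk two provenance trees in lockstep and return the first vertex
    where they diverge, as a candidate root cause.  A toy version of the
    differential-provenance idea; real provenance is a DAG, not a tree."""
    if symptom["rule"] != reference["rule"]:
        return symptom  # derivations diverge here
    for s_child, r_child in zip(symptom.get("children", []),
                                reference.get("children", [])):
        found = diff_provenance(s_child, r_child)
        if found is not None:
            return found
    return None  # no divergence found along the compared branches

# Hypothetical example: a broken route whose derivation used configA,
# while a similar working route was derived from configB.
symptom = {"rule": "route", "children": [{"rule": "configA", "children": []}]}
reference = {"rule": "route", "children": [{"rule": "configB", "children": []}]}
root_cause = diff_provenance(symptom, reference)
```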
Semiring Provenance for Büchi Games: Strategy Analysis with Absorptive Polynomials
This paper presents a case study for the application of semiring semantics
for fixed-point formulae to the analysis of strategies in Büchi games.
Semiring semantics generalizes the classical Boolean semantics by permitting
multiple truth values from certain semirings. Evaluating the fixed-point
formula that defines the winning region in a given game in an appropriate
semiring of polynomials provides not only the Boolean information on who wins,
but also tells us how they win and which strategies they might use. This is
well-understood for reachability games, where the winning region is definable
as a least fixed point. The case of Büchi games is of special interest, not
only due to their practical importance, but also because it is the simplest
case where the fixed-point definition involves a genuine alternation of a
greatest and a least fixed point.
We show that, in a precise sense, semiring semantics provides information
about all absorption-dominant strategies, i.e., strategies that win with
minimal effort, and we discuss how these relate to positional and the more general
persistent strategies. This information enables further applications such as
game synthesis or determining minimal modifications to the game needed to
change its outcome. Lastly, we discuss limitations of our approach and present
questions that cannot be immediately answered by semiring semantics.
Comment: Full version of a paper submitted to GandALF 202
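The core idea of evaluating a fixed-point formula in different semirings can be illustrated on the simpler reachability case the abstract mentions (winning region as a least fixed point). The sketch below is a naive fixed-point iteration, not the paper's construction, and the absorptive-polynomial semiring is not modeled; instantiating the same least fixed point in the Boolean and in the tropical (min, +) semiring shows how a richer semiring refines the plain Boolean answer.

```python
import math

def lfp_reach(nodes, edges, targets, plus, times, zero, one, weight):
    """Least fixed-point iteration for reachability, parameterized by a
    semiring (plus, times, zero, one).  A toy analogue of evaluating the
    winning-region formula over semiring values instead of Booleans
    (assumes the iteration converges, e.g. on acyclic graphs)."""
    val = {v: (one if v in targets else zero) for v in nodes}
    changed = True
    while changed:
        changed = False
        for v in nodes:
            if v in targets:
                continue
            new = zero
            for a, b in edges:
                if a == v:
                    new = plus(new, times(weight(a, b), val[b]))
            if new != val[v]:
                val[v], changed = new, True
    return val

nodes, edges, targets = [1, 2, 3], [(1, 2), (2, 3)], {3}

# Boolean semiring: "can the target be reached at all?"
reach = lfp_reach(nodes, edges, targets,
                  lambda a, b: a or b, lambda a, b: a and b,
                  False, True, lambda a, b: True)

# Tropical (min, +) semiring: "how cheaply can it be reached?"
dist = lfp_reach(nodes, edges, targets,
                 min, lambda a, b: a + b,
                 math.inf, 0, lambda a, b: 1)
```

Swapping in other semirings (e.g. polynomial semirings, as the paper does) is what turns the same fixed-point formula from a yes/no answer into an account of *how* a player wins.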