
    Online Data Cleaning

    Data-centric applications have never been more ubiquitous in our lives, e.g., search engines, route navigation and social media. This has ushered in an age where digital data is at the core of many decisions we make as individuals, e.g., looking for the most scenic route to plan a road trip, or as professionals, e.g., analysing customers’ transactions to predict the best time to restock different products. However, the surge in data generation has also created massive amounts of dirty data, i.e., inaccurate or redundant data. Using dirty data to inform business decisions comes with dire consequences: an IBM report, for instance, estimates that dirty data costs the U.S. $3.1 trillion a year. Dirty data is the product of many factors, including data entry errors and the integration of several data sources. Integrating multiple sources is especially prone to producing dirty data. For instance, while individual sources may not contain redundant data, they often carry redundant data across each other. Furthermore, different data sources may obey different business rules (sometimes not even known), which makes it challenging to reconcile the integrated data. Even if the data is clean at the time of integration, subsequent updates compromise its quality over time. There is a wide spectrum of errors that can be found in the data, e.g., duplicate records, missing values, obsolete data, etc. To address these problems, several data cleaning efforts have been proposed, e.g., record linkage to identify duplicate records, data fusion to merge duplicate data items into a single representation, and enforcing integrity constraints on the data. However, most existing efforts make two key assumptions: (1) data cleaning is done in one shot; and (2) the data is available in its entirety. These two assumptions do not hold in an age where data is highly volatile and integrated from several sources. This calls for a paradigm shift in approaching data cleaning: it has to become iterative, since data arrives in chunks rather than all at once. Consequently, cleaning should not be repeated from scratch whenever the data changes, but instead should be done only for the data items affected by the updates. Moreover, the repair should be computed efficiently to support applications where cleaning is performed online (e.g., query-time data cleaning).

    In this dissertation, we present several proposals to realize this paradigm for two major types of data errors: duplicates and integrity constraint violations. We first present a framework that supports online record linkage and fusion over Web databases. Our system processes queries posted to Web databases; query results are deduplicated, fused and then stored in a cache for future reference, and the cache is updated iteratively with new query results. This makes record linkage and fusion not only efficient but also effective, i.e., the cache contains data items seen in previous queries, which are jointly cleaned with incoming query results. To address integrity constraint violations, we propose a novel way to approach Functional Dependency repairs, develop a new class of repairs, and then demonstrate that it is superior to existing efforts in both runtime and accuracy. We then show how our framework can be easily tuned to work iteratively to support online applications. We implement a proof-of-concept query answering system to demonstrate the iterative capability of our system.
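    As a rough illustration of the caching-based linkage loop described in this abstract, the sketch below deduplicates fresh query results against a cache of previously cleaned records and fuses any matches. It is a minimal approximation written for this listing, not the dissertation's system: the record schema, the string-similarity matcher, and the "prefer the newer non-null value" fusion rule are all assumptions.

        # Hypothetical sketch of online record linkage and fusion against a cache.
        from difflib import SequenceMatcher

        def similar(a: str, b: str, threshold: float = 0.9) -> bool:
            """Crude string similarity standing in for a real record matcher."""
            return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

        def fuse(cached: dict, incoming: dict) -> dict:
            """Fuse two duplicates: prefer non-null values from the incoming record."""
            return {k: incoming.get(k) or cached.get(k)
                    for k in cached.keys() | incoming.keys()}

        def clean_query_results(cache: list, results: list) -> list:
            """Deduplicate new query results against the cache, updating it in place."""
            for record in results:
                match = next((c for c in cache if similar(c["title"], record["title"])), None)
                if match is None:
                    cache.append(record)               # unseen entity: cache it
                else:
                    match.update(fuse(match, record))  # duplicate: fuse into the cached copy
            return cache

        # Toy usage: the second query result is a duplicate of the cached record.
        cache = [{"title": "Road Trip Planner", "rating": 4.2}]
        clean_query_results(cache, [{"title": "road trip planner", "vendor": "Acme"}])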

    Cleaning Denial Constraint Violations through Relaxation

    Data cleaning is a time-consuming process that depends on the data analysis that users perform. Existing solutions treat data cleaning as a separate offline process that takes place before analysis begins. Applying data cleaning before analysis assumes a priori knowledge of the inconsistencies and the query workload, thereby spending effort on understanding and cleaning data that is unnecessary for the analysis. We propose an approach that performs probabilistic repair of denial constraint violations on demand, driven by the exploratory analysis that users perform. We introduce Daisy, a system that seamlessly integrates data cleaning into the analysis by relaxing query results. Daisy executes analytical query workloads over dirty data by weaving cleaning operators into the query plan. Our evaluation shows that Daisy adapts to the workload and outperforms traditional offline cleaning on both synthetic and real-world workloads.
    Comment: To appear in SIGMOD 2020 proceedings
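    To make the idea of query-driven cleaning concrete, the sketch below checks a single denial constraint only over the tuples touched by the current query, rather than over the whole table. The employee schema and the example constraint ("no tuple may have a higher salary yet a lower tax than another") are illustrative assumptions and not Daisy's actual interface.

        # Hypothetical sketch of on-demand denial constraint (DC) checking.
        from itertools import combinations

        # DC: there must be no pair (t1, t2) with t1.salary > t2.salary and t1.tax < t2.tax.
        def violates(t1: dict, t2: dict) -> bool:
            return t1["salary"] > t2["salary"] and t1["tax"] < t2["tax"]

        def dc_violations(tuples):
            """Return every pair in the query's working set that violates the DC."""
            return [(a, b) for a, b in combinations(tuples, 2)
                    if violates(a, b) or violates(b, a)]

        query_result = [
            {"name": "ann", "salary": 90_000, "tax": 20_000},
            {"name": "bob", "salary": 60_000, "tax": 25_000},
        ]
        print(dc_violations(query_result))  # ann earns more but pays less tax -> one violation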

    Secure Diagnostics And Forensics With Network Provenance

    In large-scale networks, many things can go wrong: routers can be misconfigured, programs can be buggy, and computers can be compromised by an attacker. As a result, there is a constant need to perform network diagnostics and forensics. In this dissertation, we leverage the concept of provenance to build better support for diagnostic and forensic tasks. At a high level, provenance tracks causality between network states and events, and produces a detailed explanation of any event of interest, which makes it a good starting point for investigating network problems. However, in order to use provenance for network diagnostics and forensics, several challenges need to be addressed. First, existing provenance systems cannot provide security properties on high-speed network traffic, because the cryptographic operations would cause enormous overhead when the data rates are high. To address this challenge, we design secure packet provenance, a system with a novel lightweight security protocol that maintains secure provenance at low overhead. Second, in large-scale distributed systems, the provenance of a network event can be quite complex, so it remains challenging to identify the root cause of a problem from its provenance. To address this challenge, we design differential provenance, which can identify a symptom event’s root cause by reasoning about the differences between its provenance and the provenance of a similar “reference” event. Third, provenance can only explain why a current network state came into existence; by itself, it does not reason about changes to the network state that would fix a problem. To provide operators with more diagnostic support, we design causal networks, a generalization of network provenance, to reason about network repairs that avoid undesirable side effects. Causal networks can encode multiple diagnostic goals in the same data structure and can therefore generate repairs that satisfy multiple constraints simultaneously. We have applied these techniques to Software-Defined Networks, Hadoop MapReduce, as well as the Internet’s data plane. Our evaluation with real-world traffic traces and network topologies shows that our systems run with reasonable overhead, that they accurately identify root causes of practical problems, and that they generate repairs without causing collateral damage.
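    The differential provenance idea in this abstract can be approximated in a few lines: model each event's provenance as a set of causal (cause, effect) edges and report the edges that appear only in the symptom's provenance as root-cause candidates. The event names below are invented for the example and do not come from the dissertation.

        # Hypothetical sketch of differential provenance.
        def provenance_diff(symptom_edges: set, reference_edges: set) -> set:
            """Causal edges present in the faulty execution but absent from the good one."""
            return symptom_edges - reference_edges

        # Provenance of a correctly forwarded packet (reference) vs. a dropped one (symptom).
        reference = {("pkt_in", "match_rule_10"), ("match_rule_10", "fwd_port_2")}
        symptom   = {("pkt_in", "match_rule_99"), ("match_rule_99", "drop")}

        print(provenance_diff(symptom, reference))
        # -> {('pkt_in', 'match_rule_99'), ('match_rule_99', 'drop')}: points at the bad rule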

    Semiring Provenance for Büchi Games: Strategy Analysis with Absorptive Polynomials

    This paper presents a case study for the application of semiring semantics for fixed-point formulae to the analysis of strategies in Büchi games. Semiring semantics generalizes the classical Boolean semantics by permitting multiple truth values from certain semirings. Evaluating the fixed-point formula that defines the winning region of a given game in an appropriate semiring of polynomials provides not only the Boolean information on who wins, but also tells us how they win and which strategies they might use. This is well understood for reachability games, where the winning region is definable as a least fixed point. The case of Büchi games is of special interest, not only due to their practical importance, but also because it is the simplest case where the fixed-point definition involves a genuine alternation of a greatest and a least fixed point. We show that, in a precise sense, semiring semantics provide information about all absorption-dominant strategies (strategies that win with minimal effort), and we discuss how these relate to positional and the more general persistent strategies. This information enables further applications such as game synthesis or determining minimal modifications to the game needed to change its outcome. Lastly, we discuss limitations of our approach and present questions that cannot be immediately answered by semiring semantics.
    Comment: Full version of a paper submitted to GandALF 202
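    For readers unfamiliar with the alternation mentioned in this abstract, one classical fixed-point characterization of the Büchi winning region (not necessarily the exact formula analyzed in the paper) is shown below, where F is the set of accepting positions and CPre_0(S) is the set of positions from which Player 0 can force the play into S in one step:

        % Classical nu/mu characterization of the Büchi winning region
        % (notation assumed for illustration; the paper's own formulation may differ).
        \[
          \mathrm{Win}_0 \;=\; \nu Y.\ \mu X.\ \bigl( (F \cap \mathrm{CPre}_0(Y)) \cup \mathrm{CPre}_0(X) \bigr)
        \]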