3 research outputs found

    INDEPENDENT DE-DUPLICATION IN DATA CLEANING

    Many organizations collect large amounts of data to support their business and decision-making processes. The data originate from a variety of sources that may have inherent data-quality problems. These problems become more pronounced when heterogeneous data sources are integrated (for example, in data warehouses). A major problem that arises from integrating different databases is the existence of duplicates. The challenge of de-duplication is identifying “equivalent” records within the database. Most published research in de-duplication proposes techniques that rely heavily on domain knowledge. A few others propose solutions that are partially domain-independent. This paper identifies two levels of domain-independence in de-duplication: domain-independence at the attribute level and domain-independence at the record level. The paper then proposes a positional algorithm that achieves domain-independent de-duplication at the attribute level, and a technique for field weighting by data profiling, which, when used with the positional algorithm, achieves domain-independence at the record level. Experiments show that the proposed techniques achieve more accurate de-duplication than the existing algorithms.

    Generalizing spreadsheet computation for evolving spreadsheets at scale

    Spreadsheets are one of the most ubiquitous ad-hoc data analysis and manipulation tools. Their strength over traditional relational database management systems lies in their ability to allow users to manipulate data interactively through an intuitive interface. However, the capabilities of current spreadsheet systems to handle datasets that evolve over time are limited in several dimensions: (a) limited power: it is difficult to perform relational-style queries, which are often needed for large data analysis, while keeping the convenience of formula-like automatic recalculation; (b) limited introspection: the ability to reason about the source of changes between versions at a higher level is often unsupported; (c) limited interactivity: the computation in spreadsheets at scale can make the system unresponsive, rendering the strength of spreadsheets moot; and (d) limited structure utilization: the computation in spreadsheets often fails to utilize the semi-structured nature of real-world spreadsheets. The dissertation discusses developments that overcome these hurdles. First, we discuss an extension to spreadsheet formulae that allows for relational-style queries in a manner that is consistent with typical formula computation engines. Second, we develop the theory of "diffing", representing data updates in a concise manner. Third, we introduce Asynchronous Formula Computation, a technique that improves spreadsheet interactivity when dealing with formula computation, while guaranteeing consistency of the results. Finally, we improve formula computation by utilizing structures of real-world spreadsheets and building a more concise representation.

    Scalable Spreadsheets for Interactive Data Analysis

    Interactive responses and natural, intuitive controls are important for data analysis. We are building ABC, a scalable spreadsheet for data analysis that combines exploration, grouping, and aggregation. We focus on interactive responses in place of long delays, and intuitive, direct-manipulation operations in place of complex queries. ABC allows analysts to interactively explore the data at varying granularities of detail using a spreadsheet. Hypotheses that arise during the exploration can be checked by dynamically computing aggregates on interactively chosen groups of data items. In this paper we describe our vision for ABC. We give examples that illustrate the need for interactivity in query processing and query formulation, the advantages of dynamic group formulation, and the usefulness of exploration in discovering hypotheses about data that can then be verified by aggregation. We briefly discuss the systems issues involved in building ABC, and mention our progress so far.