741 research outputs found

    Indeterministic Handling of Uncertain Decisions in Duplicate Detection

    Get PDF
    In current research, duplicate detection is usually considered as a deterministic approach in which tuples are either declared as duplicates or not. However, most often it is not completely clear whether two tuples represent the same real-world entity or not. In deterministic approaches, however, this uncertainty is ignored, which in turn can lead to false decisions. In this paper, we present an indeterministic approach for handling uncertain decisions in a duplicate detection process by using a probabilistic target schema. Thus, instead of deciding between multiple possible worlds, all these worlds can be modeled in the resulting data. This approach minimizes the negative impacts of false decisions. Furthermore, the duplicate detection process becomes almost fully automatic and human effort can be reduced to a large extent. Unfortunately, a full-indeterministic approach is by definition too expensive (in time as well as in storage) and hence impractical. For that reason, we additionally introduce several semi-indeterministic methods for heuristically reducing the set of indeterministic handled decisions in a meaningful way

    ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution

    Full text link
    Entity resolution (ER), an important and common data cleaning problem, is about detecting data duplicate representations for the same external entities, and merging them into single representations. Relatively recently, declarative rules called "matching dependencies" (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this work we show the process and the benefits of integrating four components of ER: (a) Building a classifier for duplicate/non-duplicate record pairs built using machine learning (ML) techniques; (b) Use of MDs for supporting the blocking phase of ML; (c) Record merging on the basis of the classifier results; and (d) The use of the declarative language "LogiQL" -an extended form of Datalog supported by the "LogicBlox" platform- for all activities related to data processing, and the specification and enforcement of MDs.Comment: Final journal version, with some minor technical corrections. Extended version of arXiv:1508.0601

    Entity reconciliation in big data sources: A systematic mapping study

    Get PDF
    The entity reconciliation (ER) problem aroused much interest as a research topic in today’s Big Dataera, full of big and open heterogeneous data sources. This problem poses when relevant information ona topic needs to be obtained using methods based on: (i) identifying records that represent the samereal world entity, and (ii) identifying those records that are similar but do not correspond to the samereal-world entity. ER is an operational intelligence process, whereby organizations can unify differentand heterogeneous data sources in order to relate possible matches of non-obvious entities. Besides, thecomplexity that the heterogeneity of data sources involves, the large number of records and differencesamong languages, for instance, must be added. This paper describes a Systematic Mapping Study (SMS) ofjournal articles, conferences and workshops published from 2010 to 2017 to solve the problem describedbefore, first trying to understand the state-of-the-art, and then identifying any gaps in current research.Eleven digital libraries were analyzed following a systematic, semiautomatic and rigorous process thathas resulted in 61 primary studies. They represent a great variety of intelligent proposals that aim tosolve ER. The conclusion obtained is that most of the research is based on the operational phase asopposed to the design phase, and most studies have been tested on real-world data sources, where a lotof them are heterogeneous, but just a few apply to industry. There is a clear trend in research techniquesbased on clustering/blocking and graphs, although the level of automation of the proposals is hardly evermentioned in the research work.Ministerio de Economía y Competitividad TIN2013-46928-C3-3-RMinisterio de Economía y Competitividad TIN2016-76956-C3-2-RMinisterio de Economía y Competitividad TIN2015-71938-RED

    A Suit of Record Normalization Methods, From Naive Ones, Globally Mine a Group of Duplicate Records

    Get PDF
    The promise of Big Data pivots after tending to a few big data integration challenges, for example, record linkage at scale, continuous data combination, and incorporating Deep Web. Although much work has been directed on these issues, there is restricted work on making a uniform, standard record from a gathering of records comparing to a similar genuine element. We allude to this errand as record normalization. Such a record portrayal, instituted normalized record, is significant for both front-end and back-end applications. In this paper, we formalize the record normalization issue, present top to bottom examination of normalization granularity levels (e.g., record, field, and worth segment) and of normalization structures (e.g., common versus complete). We propose an exhaustive structure for registering the normalized record. The proposed system incorporates a suit of record normalization techniques, from guileless ones, which utilize just the data accumulated from records themselves, to complex methodologies, which all around mine a gathering of copy records before choosing an incentive for a quality of a normalized record

    Modeling Global Syntactic Variation in English Using Dialect Classification

    Get PDF
    This paper evaluates global-scale dialect identification for 14 national varieties of English as a means for studying syntactic variation. The paper makes three main contributions: (i) introducing data-driven language mapping as a method for selecting the inventory of national varieties to include in the task; (ii) producing a large and dynamic set of syntactic features using grammar induction rather than focusing on a few hand-selected features such as function words; and (iii) comparing models across both web corpora and social media corpora in order to measure the robustness of syntactic variation across registers
    corecore