38,731 research outputs found

    ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution

    Entity resolution (ER), an important and common data-cleaning problem, is about detecting duplicate representations of the same external entities and merging them into single representations. Relatively recently, declarative rules called matching dependencies (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this work we show the process and the benefits of integrating three components of ER: (a) classifiers for duplicate/non-duplicate record pairs built using machine learning (ML) techniques; (b) MDs for supporting both the blocking phase of ML and the merge itself; and (c) the use of the declarative language LogiQL, an extended form of Datalog supported by the LogicBlox platform, for data processing and for the specification and enforcement of MDs.
    Comment: To appear in Proc. SUM, 201
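    The MD-plus-classifier pipeline described above can be pictured in a few lines. The following is a minimal sketch, not the ERBlox implementation: a similarity predicate plays the role of an MD's similarity condition for blocking, and a name-similarity test stands in for the trained ML classifier. All records, attribute names, and thresholds are invented for illustration.

```python
# Minimal sketch, not the ERBlox implementation. A similarity predicate plays
# the role of an MD's similarity condition; blocking admits only pairs whose
# blocking attribute satisfies it, and a name-similarity test stands in for
# the trained ML classifier. Records and thresholds are invented.
from difflib import SequenceMatcher
from itertools import combinations

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Similarity condition of the kind an MD would declare."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

records = [
    {"id": 1, "name": "Jon Smith", "city": "Ottawa"},
    {"id": 2, "name": "John Smith", "city": "Ottawa"},
    {"id": 3, "name": "Jane Doe", "city": "Toronto"},
]

# Blocking phase: compare only pairs whose city values are similar,
# shrinking the quadratic space of candidate pairs.
candidates = [(r, s) for r, s in combinations(records, 2)
              if similar(r["city"], s["city"])]

# Classification stand-in: a real pipeline applies a trained classifier to
# feature vectors of each candidate pair.
duplicates = [(r["id"], s["id"]) for r, s in candidates
              if similar(r["name"], s["name"])]
print(duplicates)  # [(1, 2)]
```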

    Incremental Entity Resolution from Linked Documents

    In many government applications we often find that information about entities, such as persons, is available in disparate data sources such as passports, driving licences, bank accounts, and income tax records. Similar scenarios are commonplace in large enterprises having multiple customer, supplier, or partner databases. Each data source maintains different aspects of an entity, and resolving entities based on these attributes is a well-studied problem. However, in many cases documents in one source reference those in others; e.g., a person may provide his driving-licence number while applying for a passport, or vice versa. These links define relationships between documents of the same entity (as opposed to inter-entity relationships, which are also often used for resolution). In this paper we describe an algorithm to cluster documents that are highly likely to belong to the same entity by exploiting inter-document references in addition to attribute similarity. Our technique uses a combination of iterative graph traversal, locality-sensitive hashing, iterative match-merge, and graph clustering to discover unique entities in a document corpus. A unique feature of our technique is that new sets of documents can be added incrementally, requiring re-resolution of only a small subset of a previously resolved entity-document collection. We present performance and quality results on two datasets: a real-world database of companies and a large synthetically generated 'population' database. We also demonstrate the benefit of using inter-document references for clustering, in the form of enhanced recall of documents for resolution.
    Comment: 15 pages, 8 figures, patented work
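    As a rough illustration of resolving via inter-document references, the sketch below uses a union-find structure in place of the paper's combination of graph traversal, LSH, and match-merge; the exact-value index is only a crude stand-in for locality-sensitive hashing. Document IDs and attributes are invented.

```python
# Rough illustration only, not the paper's algorithm. A union-find clusters
# documents connected by explicit inter-document references or by shared
# attribute values; the exact-value index below is a crude stand-in for
# locality-sensitive hashing. IDs and attributes are invented.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

uf = UnionFind()
seen_value = {}  # attribute value -> first document seen with it

def add_document(doc_id, attrs, refs):
    """Incrementally resolve a new document: only documents it references
    or shares an attribute value with are touched."""
    for ref in refs:
        uf.union(doc_id, ref)  # explicit link, e.g. a licence number given on a passport form
    for value in attrs:
        if value in seen_value:
            uf.union(doc_id, seen_value[value])
        else:
            seen_value[value] = doc_id

add_document("passport-7", {"Alice Roy", "1980-01-02"}, refs=[])
add_document("licence-3", {"A. Roy", "1980-01-02"}, refs=["passport-7"])
print(uf.find("licence-3") == uf.find("passport-7"))  # True: same entity cluster
```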

    BeRTo: An Efficient Spark-Based Tool for Linking Business Registries in Big Data Environments

    Linking entities from different datasets is a crucial task for the success of modern businesses. However, aligning entities becomes challenging when common identifiers are missing, so the process must rely on string-based attributes such as names or addresses, which harms matching precision. At the same time, powerful general-purpose record linkage tools require users to clean and pre-process the initial data, introducing a bottleneck in the data integration activity and a burden on users. Furthermore, scalability has become a relevant issue in modern big data environments, where large volumes of data flow in daily from external sources. This work presents a novel record linkage tool, BeRTo, that addresses the problem of linking a specific type of data source, i.e., business registries, containing information about companies and corporations. While its domain specificity limits its usability in other contexts, it reaches a new frontier in both precision and scalability, as it is built on Spark. Integrating the pre-processing and cleaning steps into the same tool creates a user-friendly end-to-end pipeline that requires users only to input the raw data and set their preferred configuration, allowing them to focus on recall or precision.
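    A minimal PySpark-flavoured sketch of the kind of pipeline described, not BeRTo's code: normalise names, block on a phonetic key, and keep pairs within a small edit distance. The column names, sample rows, and threshold are assumptions made for the example.

```python
# Hedged sketch of a Spark-based registry-linkage pipeline (not BeRTo itself).
# Column names, sample rows, and the distance threshold are invented.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("registry-linkage-sketch").getOrCreate()

reg_a = spark.createDataFrame(
    [("A1", "Acme Holdings Ltd"), ("A2", "Borealis Foods")],
    ["id_a", "company_name"],
)
reg_b = spark.createDataFrame(
    [("B1", "ACME Holdings Limited"), ("B2", "Borealys Foods")],
    ["id_b", "company_name"],
)

def norm(col):
    # Light normalisation stands in for the tool's integrated pre-processing.
    return F.lower(F.trim(col))

# Blocking: a phonetic key limits the join to plausible candidate pairs.
a = reg_a.withColumn("key", F.soundex(norm(reg_a["company_name"])))
b = reg_b.withColumn("key", F.soundex(norm(reg_b["company_name"])))

# Matching: keep candidate pairs whose names are within a small edit distance.
pairs = (
    a.alias("x")
    .join(b.alias("y"), on="key")
    .withColumn("dist", F.levenshtein(norm(F.col("x.company_name")),
                                      norm(F.col("y.company_name"))))
    .filter(F.col("dist") <= 5)
    .select("x.id_a", "y.id_b", "dist")
)
pairs.show()
spark.stop()
```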

    A SEGMENTATION ANALYSIS OF U.S. GROCERY STORE SHOPPERS

    Cluster analysis was used to conduct a segmentation analysis of U.S. supermarket shoppers. This study is based on the responses of a sample of 1,000 shoppers to the Food Marketing Institute's 2000 consumer trends survey, concerning the importance of 21 store characteristics in selecting their primary grocery store. Stores must satisfy the attributes important to all consumers in order to be successful. In order of importance, the top four characteristics are a clean/neat store, high-quality produce, high-quality meats, and courteous, friendly employees. The three key supermarket shopper segments identified are time-pressed convenience seekers, sophisticates, and middle Americans. To cater to a particular consumer niche, a store must better fulfill the store preferences of that segment. Time-pressed convenience seekers, 36.70 percent of the sample, put a premium on features such as childcare, gas pumps, and online shopping. They are likely to be younger and urban, with lower or moderate incomes, and have the greatest number of children six years old or younger. Quality and services are important to the sophisticates, 28.40 percent of the sample. This group is middle-aged and better educated, with higher-than-average incomes. Middle Americans, 34.90 percent, are attracted by pricing/value factors such as frequent-shopper programs, sales, and private-label brands. They want stores that are active in the community. Demographically they are in the middle, with the highest proportion of high school graduates.
    Consumer/Household Economics, Food Consumption/Nutrition/Food Safety, Marketing
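    The abstract does not specify the clustering procedure beyond "cluster analysis"; a generic k-means sketch over synthetic importance ratings shows the general approach. All numbers, attribute groupings, and the choice of k-means here are assumptions, not the survey's actual data or method.

```python
# Generic k-means segmentation sketch; the study's actual method and data
# are not reproduced here. Ratings and attribute columns are synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Rows: shoppers; columns: importance ratings (1-5) of illustrative attributes
# [cleanliness, produce quality, convenience features, price/value programs]
ratings = rng.integers(1, 6, size=(1000, 4)).astype(float)

# Standardise so no attribute dominates the distance metric, then cluster
# into three segments, mirroring the three segments described above.
X = StandardScaler().fit_transform(ratings)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Profile each segment by its mean rating per attribute.
for k in range(3):
    print(k, ratings[labels == k].mean(axis=0).round(2))
```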

    ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution

    Entity resolution (ER), an important and common data-cleaning problem, is about detecting duplicate representations of the same external entities and merging them into single representations. Relatively recently, declarative rules called "matching dependencies" (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this work we show the process and the benefits of integrating four components of ER: (a) building a classifier for duplicate/non-duplicate record pairs using machine learning (ML) techniques; (b) the use of MDs to support the blocking phase of ML; (c) record merging on the basis of the classifier results; and (d) the use of the declarative language "LogiQL", an extended form of Datalog supported by the "LogicBlox" platform, for all activities related to data processing and for the specification and enforcement of MDs.
    Comment: Final journal version, with some minor technical corrections. Extended version of arXiv:1508.0601
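    Complementing the blocking sketch after the first ERBlox entry above, the toy fragment below illustrates the merge step (component (c)) in plain Python rather than LogiQL: once a pair is classified as a duplicate, an MD-style rule merges attribute values, and rules are applied to a fixpoint in the spirit of MD enforcement. The merge policy and records are invented, not the paper's.

```python
# Illustrative merge step, not LogiQL. Classified duplicate pairs trigger an
# MD-style rule that merges attribute values; rules are re-applied until a
# fixpoint, mimicking MD enforcement. The merge policy is an assumption.
def merge_value(a: str, b: str) -> str:
    """Toy merge policy: prefer the longer (more informative) value."""
    return a if len(a) >= len(b) else b

records = {
    1: {"name": "J. Smith", "address": "12 Elm St"},
    2: {"name": "John Smith", "address": "12 Elm Street, Ottawa"},
}
duplicate_pairs = [(1, 2)]  # output of the classification step

changed = True
while changed:  # enforce merges until no rule application changes anything
    changed = False
    for i, j in duplicate_pairs:
        for attr in records[i]:
            merged = merge_value(records[i][attr], records[j][attr])
            if records[i][attr] != merged or records[j][attr] != merged:
                records[i][attr] = records[j][attr] = merged
                changed = True

print(records[1])  # both records now carry the merged attribute values
```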

    Anticipation and Risk – From the inverse problem to reverse computation

    Risk assessment is relevant only if it has predictive relevance. In this sense, the anticipatory perspective has yet to contribute to more adequate predictions. For purely physics-based phenomena, predictions are as good as the science describing those phenomena. For the dynamics of the living, the physics of the matter making up the living is only a partial description of its change over time. The space of possibilities is the missing component, complementary to physics and its associated predictions based on probabilistic methods. The inverse modeling problem and, moreover, the reverse computation model guide anticipation-based predictive methodologies. An experimental setting for the quantification of anticipation is advanced, and structural measurement is suggested as a possible mathematics for anticipation-based risk assessment.