12 research outputs found

    Enabling Quality Control for Entity Resolution: A Human and Machine Cooperation Framework

    Even though many machine algorithms have been proposed for entity resolution, it remains very challenging to find a solution with quality guarantees. In this paper, we propose a novel HUman and Machine cOoperation (HUMO) framework for entity resolution (ER), which divides an ER workload between the machine and the human. HUMO enables a mechanism for quality control that can flexibly enforce both precision and recall levels. We introduce the optimization problem of HUMO, minimizing human cost given a quality requirement, and then present three optimization approaches: a conservative baseline one purely based on the monotonicity assumption of precision, a more aggressive one based on sampling, and a hybrid one that can take advantage of the strengths of both previous approaches. Finally, we demonstrate by extensive experiments on real and synthetic datasets that HUMO can achieve high-quality results with reasonable return on investment (ROI) in terms of human cost, and it performs considerably better than the state-of-the-art alternatives in quality control.
    Comment: 12 pages, 11 figures. Camera-ready version of the paper submitted to ICDE 2018, In Proceedings of the 34th IEEE International Conference on Data Engineering (ICDE 2018)
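
    As a rough illustration of the workload division this abstract describes, the sketch below splits candidate pairs into three zones by a machine similarity score: pairs the machine labels as non-matches, pairs the machine labels as matches, and an uncertain middle zone sent to human reviewers. The function name `split_workload`, the `low` and `high` bounds, and the use of a single similarity score are assumptions made for illustration; they are not taken from the paper, and the optimization that picks the bounds to minimize human cost under a quality requirement is not shown.

```python
from typing import Dict, List, Tuple

# A candidate pair: (pair identifier, machine similarity score in [0, 1]).
Pair = Tuple[str, float]


def split_workload(pairs: List[Pair], low: float, high: float) -> Dict[str, List[str]]:
    """Partition candidate pairs into machine-labeled and human-labeled zones.

    `low` and `high` are hypothetical bounds. Pairs scoring below `low` are
    labeled non-matches by the machine, pairs scoring above `high` are labeled
    matches by the machine, and everything in between goes to a human.
    """
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("expected 0 <= low <= high <= 1")

    zones: Dict[str, List[str]] = {
        "machine_unmatch": [],
        "human": [],
        "machine_match": [],
    }
    for pair_id, score in pairs:
        if score < low:
            zones["machine_unmatch"].append(pair_id)
        elif score > high:
            zones["machine_match"].append(pair_id)
        else:
            zones["human"].append(pair_id)
    return zones


if __name__ == "__main__":
    candidates = [("p1", 0.05), ("p2", 0.45), ("p3", 0.62), ("p4", 0.97)]
    print(split_workload(candidates, low=0.3, high=0.8))
```

    Widening the gap between the two bounds sends more pairs to the human, which raises cost; narrowing it leaves more decisions to the machine.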

    Multi-Source Spatial Entity Linkage

    Besides the traditional cartographic data sources, spatial information can also be derived from location-based sources. However, even though different location-based sources refer to the same physical world, each one has only partial coverage of the spatial entities, describes them with different attributes, and sometimes provides contradicting information. Hence, we introduce the spatial entity linkage problem, which finds the pairs of spatial entities that belong to the same physical spatial entity. Our proposed solution (QuadSky) starts with a time-efficient spatial blocking technique (QuadFlex), compares pairwise the spatial entities in the same block, ranks the pairs using Pareto optimality with the SkyRank algorithm, and finally classifies the pairs with our novel SkyEx-* family of algorithms, which yield 0.85 precision and 0.85 recall for a manually labeled dataset of 1,500 pairs and 0.87 precision and 0.6 recall for a semi-manually labeled dataset of 777,452 pairs. Moreover, we provide a theoretical guarantee and formalize the SkyEx-FES algorithm, which explores only 27% of the skylines without any loss in F-measure. Furthermore, our fully unsupervised algorithm SkyEx-D approximates the optimal result with an F-measure loss of just 0.01. Finally, QuadSky provides the best trade-off between precision and recall and the best F-measure compared to the existing baselines and clustering techniques, and approximates the results of supervised learning solutions.
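
    To make the idea of ranking pairs by Pareto optimality concrete, the sketch below peels a set of similarity vectors into successive skylines: the first layer holds the pairs that no other pair dominates, the second holds the pairs that become non-dominated once the first layer is removed, and so on. This is a generic skyline-peeling sketch, not the SkyRank or SkyEx-* algorithms from the paper; the names `dominates` and `skyline_layers` and the two-attribute similarity vectors are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

# A similarity vector for one candidate pair, e.g. (name similarity, address similarity).
Vector = Tuple[float, ...]


def dominates(a: Vector, b: Vector) -> bool:
    """Return True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline_layers(points: Sequence[Vector]) -> List[List[Vector]]:
    """Peel `points` into successive Pareto-optimal layers (higher values are better)."""
    remaining = list(points)
    layers: List[List[Vector]] = []
    while remaining:
        # The current skyline: points not dominated by any remaining point.
        layer = [p for p in remaining if not any(dominates(q, p) for q in remaining)]
        layers.append(layer)
        remaining = [p for p in remaining if p not in layer]
    return layers


if __name__ == "__main__":
    vectors = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.1, 0.1)]
    for rank, layer in enumerate(skyline_layers(vectors), start=1):
        print(rank, layer)
```

    A classifier built on such a ranking would then choose a cut-off layer, treating pairs in earlier layers as matches; how that cut-off is chosen is the part the paper's algorithms address and is not modeled here.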

    Entity reconciliation in big data sources: A systematic mapping study

    The entity reconciliation (ER) problem has aroused much interest as a research topic in today’s Big Data era, full of big and open heterogeneous data sources. This problem arises when relevant information on a topic needs to be obtained using methods based on: (i) identifying records that represent the same real-world entity, and (ii) identifying those records that are similar but do not correspond to the same real-world entity. ER is an operational intelligence process, whereby organizations can unify different and heterogeneous data sources in order to relate possible matches of non-obvious entities. Beyond the complexity that the heterogeneity of data sources involves, the large number of records and the differences among languages, for instance, must also be considered. This paper describes a Systematic Mapping Study (SMS) of journal articles, conferences and workshops published from 2010 to 2017 that address this problem, first trying to understand the state of the art, and then identifying any gaps in current research. Eleven digital libraries were analyzed following a systematic, semiautomatic and rigorous process that resulted in 61 primary studies. They represent a great variety of intelligent proposals that aim to solve ER. The conclusion obtained is that most of the research is based on the operational phase as opposed to the design phase, and most studies have been tested on real-world data sources, many of which are heterogeneous, but just a few apply to industry. There is a clear trend in research techniques based on clustering/blocking and graphs, although the level of automation of the proposals is hardly ever mentioned in the research work.
    Ministerio de Economía y Competitividad TIN2013-46928-C3-3-R; Ministerio de Economía y Competitividad TIN2016-76956-C3-2-R; Ministerio de Economía y Competitividad TIN2015-71938-RED

    Multilingual Transformers for Product Matching -- Experiments and a New Benchmark in Polish

    Product matching corresponds to the task of matching identical products across different data sources. It typically employs available product features which, apart from being multimodal, i.e., comprised of various data types, might be non-homogeneous and incomplete. The paper shows that pre-trained, multilingual Transformer models, after fine-tuning, are suitable for solving the product matching problem using textual features in both English and Polish. We tested multilingual mBERT and XLM-RoBERTa models in English on Web Data Commons - training dataset and gold standard for large-scale product matching. The obtained results show that these models perform similarly to the latest solutions tested on this set, and in some cases, the results were even better. Additionally, we prepared a new dataset entirely in Polish, based on offers in selected categories obtained from several online stores for research purposes. It is the first open dataset for product matching tasks in Polish, which allows comparing the effectiveness of the pre-trained models. Thus, we also showed the baseline results obtained by the fine-tuned mBERT and XLM-RoBERTa models on the Polish datasets.
    Comment: 11 pages, 5 figures

    Author classification using transfer learning and predicting stars in co-author networks

    © 2020 John Wiley & Sons Ltd. The vast amount of data is a key challenge in identifying new scholars who are likely to become stars in the upcoming period. The enormous amount of unstructured data generated every year is infeasible for traditional learning to handle; consequently, a high-quality preprocessing technique is needed to improve the performance of traditional learning. We have pursued a novel approach, the Authors Classification algorithm using Transfer Learning (ACTL), which learns a new task in the target domain by mining external knowledge from the source domain. Comprehensive experimental outcomes on real-world networks showed that ACTL, Node-based Influence Predicting Stars, Corresponding Authors Mutual Influence based on Predicting Stars, and Specific Topic Domain-based Predicting Stars improved node classification accuracy as well as the prediction of rising stars compared with contemporary baseline methods.

    A model-driven engineering approach for the uniquely identity reconciliation of heterogeneous data sources.

    The objectives to be achieved with this Doctoral Thesis are:
    1. Perform a study of the state of the art of the different existing solutions for the entity reconciliation of heterogeneous data sources, checking whether they are being used in real environments.
    2. Define and develop a framework for designing entity reconciliation models in a systematic way for the requirements, analysis and testing phases of a software methodology. For this purpose, this objective has been divided into three sub-objectives:
       a. Define a set of activities, represented as a process that can be added to any software development methodology, to carry out the activities related to entity reconciliation in the requirements, analysis and testing phases of any software development life cycle.
       b. Define a metamodel that allows us to represent an abstract view of our model-based approach.
       c. Define a set of derivation mechanisms that establish the basis for automating the testing of the solutions where the framework proposed in this doctoral thesis has been used. Considering that the process will be applied in the early stages of development, it is possible to say that this proposal applies Early Testing.
    3. Provide a support tool for the framework. The support tool will allow a software engineer to define the analysis model of an entity reconciliation problem between different and heterogeneous data sources. The tool will be represented as a Domain Specific Language (DSL).
    4. Evaluate the results obtained from applying the proposal in a real-world case study.