Search CORE

175,756 research outputs found

End-to-End Entity Resolution for Big Data: A Survey

Author: Christophides Vassilis
Efthymiou Vasilis
Palpanas Themis
Papadakis George
Stefanidis Kostas
Publication venue
Publication date: 01/02/1988
Field of study

One of the most important tasks for improving data quality and the reliability of data analytics results is Entity Resolution (ER). ER aims to identify different descriptions that refer to the same real-world entity, and remains a challenging problem. While previous works have studied specific aspects of ER (and mostly in traditional settings), in this survey, we provide for the first time an end-to-end view of modern ER workflows, and of the novel aspects of entity indexing and matching methods in order to cope with more than one of the Big Data characteristics simultaneously. We present the basic concepts, processing steps and execution strategies that have been proposed by different communities, i.e., database, semantic Web and machine learning, in order to cope with the loose structuredness, extreme diversity, high speed and large scale of entity descriptions used by real-world applications. Finally, we provide a synthetic discussion of the existing approaches, and conclude with a detailed presentation of open research directions

arXiv.org e-Print Archive

University of Richmond

Entity Resolution in Big Data

Author: Pham Duc Minh
Vu Thanh Long
Publication venue: Digital WPI
Publication date: 16/12/2015
Field of study

Today, with the rapid development of technology, human entered a new era of Information Technology. Data is being transfer from paper to digital second by second. Therefore, the demand of data storage is increasing quickly. Human need a new technique to handle Big Data, that’s why Hadoop was born. However, the conflicts and duplicates of data is still happen in many cases. In this report, we will illustrate a new technique for entity resolution in big data which uses Hadoop\u27s Map-Reduce framework

DigitalCommons@WPI

Parallel meta-blocking for scaling entity resolution over big heterogeneous data

Author: Efthymiou Vasilis
Palpanas Themis
Papadakis George
Papastefanatos George
Stefanidis Kostas
Publication venue: 'Elsevier BV'
Publication date: 18/11/2019
Field of study

TamPub Julkaisuarkisto - TamPub Institutional Repository

Trepo - Institutional Repository of Tampere University

Benchmarking Blocking Algorithms for Web Entities

Author: Christophides Vassilis
Efthymiou Vasilis
Stefanidis Kostas
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 14/11/2019
Field of study

An increasing number of entities are described by interlinked data rather than documents on the Web. Entity Resolution (ER) aims to identify descriptions of the same real-world entity within one or across knowledge bases in the Web of data. To reduce the required number of pairwise comparisons among descriptions, ER methods typically perform a pre-processing step, called \emph{blocking}, which places similar entity descriptions into blocks and thus only compare descriptions within the same block. We experimentally evaluate several blocking methods proposed for the Web of data using real datasets, whose characteristics significantly impact their effectiveness and efficiency. The proposed experimental evaluation framework allows us to better understand the characteristics of the missed matching entity descriptions and contrast them with ground truth obtained from different kinds of relatedness links.Comment: accepted at IEEE Transactions on Big Data journa

arXiv.org e-Print Archive

TamPub Julkaisuarkisto - TamPub Institutional Repository

Trepo - Institutional Repository of Tampere University

MinoanER: Schema-Agnostic, Non-Iterative, Massively Parallel Resolution of Web Entities

Author: Christophides Vassilis
Efthymiou Vasilis
Papadakis George
Stefanidis Kostas
Publication venue
Publication date: 15/05/2019
Field of study

Entity Resolution (ER) aims to identify different descriptions in various Knowledge Bases (KBs) that refer to the same entity. ER is challenged by the Variety, Volume and Veracity of entity descriptions published in the Web of Data. To address them, we propose the MinoanER framework that simultaneously fulfills full automation, support of highly heterogeneous entities, and massive parallelization of the ER process. MinoanER leverages a token-based similarity of entities to define a new metric that derives the similarity of neighboring entities from the most important relations, as they are indicated only by statistics. A composite blocking method is employed to capture different sources of matching evidence from the content, neighbors, or names of entities. The search space of candidate pairs for comparison is compactly abstracted by a novel disjunctive blocking graph and processed by a non-iterative, massively parallel matching algorithm that consists of four generic, schema-agnostic matching rules that are quite robust with respect to their internal configuration. We demonstrate that the effectiveness of MinoanER is comparable to existing ER tools over real KBs exhibiting low Variety, but it outperforms them significantly when matching KBs with high Variety.Comment: Presented at EDBT 2001

arXiv.org e-Print Archive

TamPub Julkaisuarkisto - TamPub Institutional Repository

Trepo - Institutional Repository of Tampere University