14 research outputs found
Recommended from our members
Dynamic sorted neighborhood indexing for real-time entity resolution
Real-time Entity Resolution (ER) is the process of matching query records in subsecond time with records in a database that represent the same real-world entity. Indexing techniques are generally used to efficiently extract a set of candidate records from the database that are similar to a query record, and that are to be compared with the query record in more detail. The sorted neighborhood indexing method, which sorts a database and compares records within a sliding window, has been successfully used for ER of large static databases. However, because it is based on static sorted arrays and is designed for batch ER that resolves all records in a database rather than resolving those relating to a single query record, this technique is not suitable for real-time ER on dynamic databases that are constantly updated. We propose a tree-based technique that facilitates dynamic indexing based on the sorted neighborhood method, which can be used for real-time ER, and investigate both static and adaptive window approaches. We propose an approach to reduce query matching times by precalculating the similarities between attribute values stored in neighboring tree nodes. We also propose a multitree solution where different sorting keys are used to reduce the effects of errors and variations in attribute values on matching quality by building several distinct index trees. We experimentally evaluate our proposed techniques on large real datasets, as well as on synthetic data with different data quality characteristics. Our results show that as the index grows, no appreciable increase occurs in both record insertion and query times, and that using multiple trees gives noticeable improvements on matching quality with only a small increase in query time. Compared to earlier indexing techniques for real-time ER, our approach achieves significantly reduced indexing and query matching times while maintaining high matching accuracy
Entity reconciliation in big data sources: A systematic mapping study
The entity reconciliation (ER) problem aroused much interest as a research topic in today’s Big Dataera, full of big and open heterogeneous data sources. This problem poses when relevant information ona topic needs to be obtained using methods based on: (i) identifying records that represent the samereal world entity, and (ii) identifying those records that are similar but do not correspond to the samereal-world entity. ER is an operational intelligence process, whereby organizations can unify differentand heterogeneous data sources in order to relate possible matches of non-obvious entities. Besides, thecomplexity that the heterogeneity of data sources involves, the large number of records and differencesamong languages, for instance, must be added. This paper describes a Systematic Mapping Study (SMS) ofjournal articles, conferences and workshops published from 2010 to 2017 to solve the problem describedbefore, first trying to understand the state-of-the-art, and then identifying any gaps in current research.Eleven digital libraries were analyzed following a systematic, semiautomatic and rigorous process thathas resulted in 61 primary studies. They represent a great variety of intelligent proposals that aim tosolve ER. The conclusion obtained is that most of the research is based on the operational phase asopposed to the design phase, and most studies have been tested on real-world data sources, where a lotof them are heterogeneous, but just a few apply to industry. There is a clear trend in research techniquesbased on clustering/blocking and graphs, although the level of automation of the proposals is hardly evermentioned in the research work.Ministerio de Economía y Competitividad TIN2013-46928-C3-3-RMinisterio de Economía y Competitividad TIN2016-76956-C3-2-RMinisterio de Economía y Competitividad TIN2015-71938-RED
Dynamic Sorted Neighborhood Indexing for Real-Time Entity Resolution
Real-time entity resolution is the process of matching query records in sub-second time with records in a database that represent the same real-world entity. Indexing techniques are used to efficiently extract a set of candidate records from the databas
End-to-End Entity Resolution for Big Data: A Survey
One of the most important tasks for improving data quality and the
reliability of data analytics results is Entity Resolution (ER). ER aims to
identify different descriptions that refer to the same real-world entity, and
remains a challenging problem. While previous works have studied specific
aspects of ER (and mostly in traditional settings), in this survey, we provide
for the first time an end-to-end view of modern ER workflows, and of the novel
aspects of entity indexing and matching methods in order to cope with more than
one of the Big Data characteristics simultaneously. We present the basic
concepts, processing steps and execution strategies that have been proposed by
different communities, i.e., database, semantic Web and machine learning, in
order to cope with the loose structuredness, extreme diversity, high speed and
large scale of entity descriptions used by real-world applications. Finally, we
provide a synthetic discussion of the existing approaches, and conclude with a
detailed presentation of open research directions
Incremental Entity Blocking over Heterogeneous Streaming Data
Web systems have become a valuable source of semi-structured and streaming data. In this sense, Entity Resolution (ER) has become a key solution for integrating multiple data sources or identifying similarities between data items, namely entities. To avoid the quadratic costs of the ER task and improve efficiency, blocking techniques are usually applied. Beyond the traditional challenges faced by ER and, consequently, by the blocking techniques, there are also challenges related to streaming data, incremental processing, and noisy data. To address them, we propose a schema-agnostic blocking technique capable of handling noisy and streaming data incrementally through a distributed computational infrastructure. To the best of our knowledge, there is a lack of blocking techniques that address these challenges simultaneously. This work proposes two strategies (attribute selection and top-n neighborhood entities) to minimize resource consumption and improve blocking efficiency. Moreover, this work presents a noise-tolerant algorithm, which minimizes the impact of noisy data (e.g., typos and misspellings) on blocking effectiveness. In our experimental evaluation, we use real-world pairs of data sources, including a case study that involves data from Twitter and Google News. The proposed technique achieves better results regarding effectiveness and efficiency compared to the state-of-the-art technique (metablocking). More precisely, the application of the two strategies over the proposed technique alone improves efficiency by 56%, on average.publishedVersionPeer reviewe
A model-driven engineering approach for the uniquely identity reconciliation of heterogeneous data sources.
The objectives to be achieved with this Doctoral Thesis are:
1. Perform a study of the state of the art of the different existing solutions for the entity reconciliation of heterogeneous data sources, checking if they are being used in real environments.
2. Define and develop a Framework for designing the entity reconciliation models by a systematic way for the requirement, analysis and testing phases of a software methodology. For this purpose, this objective has been divided in three sub objectives:
a. Define a set of activities, represented as a process which can be added to any software development methodology to carry out the activities related to the entity reconciliation in the requirement, analysis and testing phase of any software development life cycle.
b. Define a metamodel that allows us to represent an abstract view of our model-based approach.
c. Define a set of derivation mechanisms that allow to stablish the base for automate the testing of the solutions where the framework proposed in this doctoral thesis has been used. Considering that the process will be applied in the early stages of the development, it is possible to say that this proposal applies Early Testing.
3. Provide a support tool for the framework. The support tool will allow to a software engineer to define the analysis model of an entity reconciliation problem between different and heterogeneous data sources. The tool will be represented as a Domain Specific Language (DSL).
4. Evaluate the results obtained of the application of the proposal in a real-world case study