8 research outputs found
An Extension of NDT to Model Entity Reconciliation Problems
Within the development of software systems, web application development may be among the most widespread at present, due to the many advantages web applications provide: they are multiplatform, fast to access, and do not require extremely powerful hardware, among others. Because so many web applications are being developed, the volume of information generated daily is enormous. Managing all this information gives rise to the entity reconciliation problem: identifying objects that refer to the same real-world entity. This paper proposes a solution to this problem from a web perspective. To this end, the NDT methodology has been taken as a reference and extended with new activities, artefacts and documents to cover this problem.
Ministerio de Economía y Competitividad TIN2013-46928-C3-3-R; Ministerio de Economía y Competitividad TIN2016-76956-C3-2-R; Ministerio de Economía y Competitividad TIN2015-71938-RED
MaRIA: a process to model entity reconciliation problems
Within the development of software systems, web application development may be among the most widespread at present, due to the many advantages web applications provide: they are multiplatform, fast to access, and do not require extremely powerful hardware, among others. Because so many web applications are being developed, the volume of information generated daily is enormous. Managing all this information gives rise to the entity reconciliation (ER) problem: identifying objects that refer to the same real-world entity. This paper proposes a solution to this problem from a web perspective based on the Model-Driven Engineering paradigm. To this end, the Navigational Development Techniques (NDT) methodology, which provides a formal and complete set of processes supporting software lifecycle management, has been taken as a reference and extended with new activities, artefacts and documents to cover ER. All these elements are defined by a process named Model-Driven Entity ReconcilIAtion (MaRIA), which can be integrated into any software development methodology and allows the ER problem to be defined from the early stages of development. In addition, this proposal has been validated in a real-world case study, helping companies reduce costs when a software product that must solve an ER problem has to be developed.
Ministerio de Economía y Competitividad TIN2013-46928-C3-3-R; Ministerio de Economía y Competitividad TIN2016-76956-C3-2-R (POLOLAS); Ministerio de Economía y Competitividad TIN2015-71938-RED
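The core ER problem these abstracts describe can be illustrated with a minimal sketch: deciding whether records from heterogeneous sources refer to the same real-world entity by normalizing values and comparing them. The records, fields, and similarity threshold below are hypothetical illustrations, not part of MaRIA or NDT.

```python
# Minimal illustrative sketch of entity reconciliation: match records that
# refer to the same real-world entity despite formatting differences.
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    # Lowercase, drop punctuation, and collapse whitespace so trivially
    # different formats compare equal.
    kept = "".join(c for c in name.lower() if c.isalnum() or c.isspace())
    return " ".join(kept.split())

def same_entity(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    # Reconcile two records if their normalized names are similar enough.
    ratio = SequenceMatcher(None, normalize(rec_a["name"]),
                            normalize(rec_b["name"])).ratio()
    return ratio >= threshold

source_1 = {"name": "Acme Corp."}
source_2 = {"name": "ACME Corp"}
source_3 = {"name": "Initech Ltd."}

print(same_entity(source_1, source_2))  # True: same entity, different formats
print(same_entity(source_1, source_3))  # False: distinct entities
```

Real ER systems replace this single string comparison with blocking, multi-attribute matching, and clustering; the sketch only shows the decision the process must make.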
Unsupervised String Transformation Learning for Entity Consolidation
Data integration has been a long-standing challenge in data management with
many applications. A key step in data integration is entity consolidation. It
takes a collection of clusters of duplicate records as input and produces a
single "golden record" for each cluster, which contains the canonical value for
each attribute. Truth discovery and data fusion methods, as well as Master Data
Management (MDM) systems, can be used for entity consolidation. However, to
achieve better results, the variant values (i.e., values that are logically the
same with different formats) in the clusters need to be consolidated before
applying these methods.
For this purpose, we propose a data-driven method to standardize the variant
values based on two observations: (1) the variant values usually can be
transformed to the same representation (e.g., "Mary Lee" and "Lee, Mary") and
(2) the same transformation often appears repeatedly across different clusters
(e.g., transposing the first and last names). Our approach first uses an
unsupervised method to generate groups of value pairs that can be transformed
in the same way (i.e., they share a transformation). Then the groups are
presented to a human for verification and the approved ones are used to
standardize the data. In a real-world dataset with 17,497 records, our method
achieved 75% recall and 99.5% precision in standardizing variant values by
asking a human 100 yes/no questions, which completely outperformed a
state-of-the-art data wrangling tool.
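The observation this abstract builds on can be sketched in a few lines: variant-value pairs from different clusters often share one transformation, so they can be grouped and shown to a human together. The single "Last, First" transposition rule and the sample pairs below are hypothetical illustrations, not the paper's actual learning algorithm.

```python
# Toy sketch: group value pairs that share the same transformation
# (here, one hypothetical rule: "Last, First" -> "First Last").
import re
from collections import defaultdict

TRANSPOSE = re.compile(r"^(\w+),\s*(\w+)$")  # matches "Last, First"

def transformation(src: str, dst: str):
    # Return a label for the transformation mapping src to dst, or None.
    m = TRANSPOSE.match(src)
    if m and dst == f"{m.group(2)} {m.group(1)}":
        return "transpose_last_first"
    return None

# Candidate variant-value pairs drawn from different (hypothetical) clusters.
pairs = [("Lee, Mary", "Mary Lee"),
         ("Smith, John", "John Smith"),
         ("IBM", "I.B.M.")]

groups = defaultdict(list)
for src, dst in pairs:
    label = transformation(src, dst)
    if label:
        groups[label].append((src, dst))

# Both name pairs share the transposition; the "IBM" pair does not match it,
# so a human would verify the whole transposition group with one question.
print(groups["transpose_last_first"])
```

Grouping by shared transformation is what makes the human-in-the-loop step cheap: one yes/no answer approves or rejects every pair in a group at once.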
Entity reconciliation in big data sources: A systematic mapping study
The entity reconciliation (ER) problem has aroused much interest as a research topic in today's Big Data era, full of big and open heterogeneous data sources. This problem arises when relevant information on a topic needs to be obtained using methods based on: (i) identifying records that represent the same real-world entity, and (ii) identifying those records that are similar but do not correspond to the same real-world entity. ER is an operational intelligence process, whereby organizations can unify different and heterogeneous data sources in order to relate possible matches of non-obvious entities. Besides the complexity that the heterogeneity of data sources involves, the large number of records and differences among languages, for instance, must be added. This paper describes a Systematic Mapping Study (SMS) of journal articles, conferences and workshops published from 2010 to 2017 addressing this problem, first trying to understand the state of the art, and then identifying any gaps in current research. Eleven digital libraries were analyzed following a systematic, semiautomatic and rigorous process that resulted in 61 primary studies. They represent a great variety of intelligent proposals that aim to solve ER. The conclusion obtained is that most of the research focuses on the operational phase as opposed to the design phase, and most studies have been tested on real-world data sources, many of them heterogeneous, but just a few apply to industry. There is a clear trend in research techniques based on clustering/blocking and graphs, although the level of automation of the proposals is hardly ever mentioned in the research work.
Ministerio de Economía y Competitividad TIN2013-46928-C3-3-R; Ministerio de Economía y Competitividad TIN2016-76956-C3-2-R; Ministerio de Economía y Competitividad TIN2015-71938-RED
A model-driven engineering approach for the uniquely identity reconciliation of heterogeneous data sources.
The objectives to be achieved with this Doctoral Thesis are:
1. Perform a study of the state of the art of the different existing solutions for the entity reconciliation of heterogeneous data sources, checking whether they are being used in real environments.
2. Define and develop a framework for designing entity reconciliation models in a systematic way for the requirements, analysis and testing phases of a software methodology. For this purpose, this objective has been divided into three sub-objectives:
a. Define a set of activities, represented as a process, which can be added to any software development methodology to carry out the activities related to entity reconciliation in the requirements, analysis and testing phases of any software development life cycle.
b. Define a metamodel that allows us to represent an abstract view of our model-based approach.
c. Define a set of derivation mechanisms that allow us to establish the basis for automating the testing of the solutions where the framework proposed in this doctoral thesis has been used. Considering that the process is applied in the early stages of development, this proposal can be said to apply Early Testing.
3. Provide a support tool for the framework. The support tool will allow a software engineer to define the analysis model of an entity reconciliation problem between different and heterogeneous data sources. The tool will take the form of a Domain Specific Language (DSL).
4. Evaluate the results obtained from applying the proposal in a real-world case study.