Enabling Quality Control for Entity Resolution: A Human and Machine Cooperation Framework
Even though many machine algorithms have been proposed for entity resolution,
it remains very challenging to find a solution with quality guarantees. In this
paper, we propose a novel HUman and Machine cOoperation (HUMO) framework for
entity resolution (ER), which divides an ER workload between the machine and
the human. HUMO enables a mechanism for quality control that can flexibly
enforce both precision and recall levels. We introduce the optimization problem
of HUMO, minimizing human cost given a quality requirement, and then present
three optimization approaches: a conservative baseline one purely based on the
monotonicity assumption of precision, a more aggressive one based on sampling
and a hybrid one that can take advantage of the strengths of both previous
approaches. Finally, we demonstrate by extensive experiments on real and
synthetic datasets that HUMO can achieve high-quality results with reasonable
return on investment (ROI) in terms of human cost, and it performs considerably
better than the state-of-the-art alternatives in quality control.
Comment: 12 pages, 11 figures. Camera-ready version of the paper submitted to
ICDE 2018, in Proceedings of the 34th IEEE International Conference on Data
Engineering (ICDE 2018).
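The machine/human division described above can be sketched as a three-way partition of candidate pairs by match probability. This is only an illustration under assumed names and thresholds; HUMO's actual optimization chooses the boundaries so that the required precision and recall levels are enforced under the monotonicity assumption.

```python
def partition_workload(scored_pairs, p_accept=0.9, p_reject=0.1):
    """Split candidate pairs into machine-accept, machine-reject,
    and human-review buckets based on match probability.

    `p_accept` and `p_reject` are illustrative thresholds, not the
    boundaries HUMO would compute for a given quality requirement.
    """
    accept, reject, human = [], [], []
    for pair, prob in scored_pairs:
        if prob >= p_accept:
            accept.append(pair)      # machine labels as a match
        elif prob <= p_reject:
            reject.append(pair)      # machine labels as a non-match
        else:
            human.append(pair)       # routed to human inspection
    return accept, reject, human
```

Narrowing the gray zone between the two thresholds trades human cost against the achievable quality guarantee, which is exactly the cost-quality tension the paper's optimization problem formalizes.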
Multi-Source Spatial Entity Linkage
Besides the traditional cartographic data sources, spatial information can
also be derived from location-based sources. However, even though different
location-based sources refer to the same physical world, each one covers the
spatial entities only partially, describes them with different attributes, and
sometimes provides contradicting information. Hence, we introduce the spatial
entity linkage problem, which is to determine which pairs of spatial entities
refer to the same physical spatial entity. Our proposed
solution (QuadSky) starts with a time-efficient spatial blocking technique
(QuadFlex), compares pairwise the spatial entities in the same block, ranks the
pairs using Pareto optimality with the SkyRank algorithm, and finally,
classifies the pairs with our novel SkyEx-* family of algorithms that yield
0.85 precision and 0.85 recall for a manually labeled dataset of 1,500 pairs
and 0.87 precision and 0.6 recall for a semi-manually labeled dataset of
777,452 pairs. Moreover, we provide a theoretical guarantee and formalize the
SkyEx-FES algorithm that explores only 27% of the skylines without any loss in
F-measure. Furthermore, our fully unsupervised algorithm SkyEx-D approximates
the optimal result with an F-measure loss of just 0.01. Finally, QuadSky
provides the best trade-off between precision and recall, and the best
F-measure compared to the existing baselines and clustering techniques, and
approximates the results of supervised learning solutions.
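The Pareto-optimality ranking used by SkyRank rests on the standard notions of dominance and skyline. A minimal sketch of these notions (not the paper's implementation; features are assumed to be "larger is better") is:

```python
def dominates(a, b):
    """Point a dominates point b if a is at least as good in every
    feature and strictly better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and \
           any(x > y for x, y in zip(a, b))

def skyline(points):
    """Return the Pareto-optimal (non-dominated) points."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Iterating this construction (peel off the skyline, recompute on the rest) yields the layered skylines that the SkyEx-* family of algorithms explores.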
Entity reconciliation in big data sources: A systematic mapping study
The entity reconciliation (ER) problem has aroused much interest as a research topic in today's Big Data era, full of big and open heterogeneous data sources. The problem arises when relevant information on a topic needs to be obtained using methods based on: (i) identifying records that represent the same real-world entity, and (ii) identifying those records that are similar but do not correspond to the same real-world entity. ER is an operational intelligence process, whereby organizations can unify different and heterogeneous data sources in order to relate possible matches of non-obvious entities. Besides the complexity that the heterogeneity of data sources involves, the large number of records and the differences among languages, for instance, must also be considered. This paper describes a Systematic Mapping Study (SMS) of journal articles, conferences and workshops published from 2010 to 2017 that address the problem described before, first trying to understand the state of the art, and then identifying any gaps in current research. Eleven digital libraries were analyzed following a systematic, semi-automatic and rigorous process that resulted in 61 primary studies. They represent a great variety of intelligent proposals that aim to solve ER. The conclusion obtained is that most of the research focuses on the operational phase as opposed to the design phase, and most studies have been tested on real-world data sources, many of which are heterogeneous, but just a few apply to industry. There is a clear trend in research towards techniques based on clustering/blocking and graphs, although the level of automation of the proposals is hardly ever mentioned in the research work.
Ministerio de Economía y Competitividad TIN2013-46928-C3-3-R
Ministerio de Economía y Competitividad TIN2016-76956-C3-2-R
Ministerio de Economía y Competitividad TIN2015-71938-RED
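The clustering/blocking trend identified in this mapping study can be illustrated with a minimal standard-blocking sketch. The `key` function here is a hypothetical blocking key (e.g. a name prefix), not a technique from any particular surveyed study; the point is that candidate comparisons are restricted to records sharing the same key.

```python
from collections import defaultdict

def block_records(records, key):
    """Group records by a blocking key and emit candidate pairs only
    within each block, avoiding the quadratic all-pairs comparison."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    return [(a, b)
            for group in blocks.values()
            for i, a in enumerate(group)
            for b in group[i + 1:]]
```

With n records split into roughly even blocks, the number of comparisons drops from O(n^2) to the sum of the squared block sizes, which is what makes blocking attractive for large, heterogeneous sources.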
Multilingual Transformers for Product Matching -- Experiments and a New Benchmark in Polish
Product matching corresponds to the task of matching identical products
across different data sources. It typically employs available product features
which, apart from being multimodal, i.e., comprised of various data types,
might be non-homogeneous and incomplete. The paper shows that pre-trained,
multilingual Transformer models, after fine-tuning, are suitable for solving
the product matching problem using textual features both in English and Polish
languages. We tested multilingual mBERT and XLM-RoBERTa models in English on
the Web Data Commons training dataset and gold standard for large-scale product
matching. The obtained results show that these models perform similarly to the
latest solutions tested on this set, and in some cases, the results were even
better.
Additionally, for research purposes we prepared a new dataset entirely in
Polish, based on offers in selected categories obtained from several online
stores. It is the first open dataset for product matching tasks in Polish,
which allows comparing the effectiveness of the pre-trained models. Thus, we
also report the baseline results obtained by the fine-tuned mBERT and
XLM-RoBERTa models on the Polish dataset.
Comment: 11 pages, 5 figures
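Pair-classification fine-tuning of this kind typically serializes each offer's available textual attributes into a text pair fed to the Transformer. A minimal sketch of that preprocessing step (the attribute names are illustrative assumptions, not the benchmark's actual schema) is:

```python
def to_pair_example(offer_a, offer_b, label):
    """Turn two product offers (dicts of possibly-missing textual
    attributes) into a text-pair example for a Transformer
    pair classifier. Attribute names are illustrative."""
    def serialize(offer):
        # Concatenate available textual attributes, skipping missing
        # or empty ones -- offers are often incomplete.
        return " ".join(str(offer[k])
                        for k in ("title", "brand", "description")
                        if offer.get(k))
    return {"text_a": serialize(offer_a),
            "text_b": serialize(offer_b),
            "label": int(label)}
```

The resulting `text_a`/`text_b` pairs can then be tokenized jointly and used to fine-tune a sequence-pair classification head on mBERT or XLM-RoBERTa.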
Author classification using transfer learning and predicting stars in co-author networks
© 2020 John Wiley & Sons Ltd. The vast amount of data is a key challenge in mining new scholars who are likely to become stars in the upcoming period. The enormous amount of unstructured data raised every year is infeasible for traditional learning; consequently, we need a high-quality preprocessing technique to improve the performance of traditional learning. We have pursued a novel approach, an Authors classification algorithm using Transfer Learning (ACTL), to learn a new task on the target area by mining external knowledge from the source domain. Comprehensive experimental outcomes on real-world networks showed that ACTL, Node-based Influence Predicting Stars, Corresponding Authors Mutual Influence based on Predicting Stars, and Specific Topic Domain-based Predicting Stars enhanced node classification accuracy as well as the prediction of rising stars compared with contemporary baseline methods.
A model-driven engineering approach for the uniquely identity reconciliation of heterogeneous data sources.
The objectives to be achieved with this Doctoral Thesis are:
1. Perform a study of the state of the art of the different existing solutions for the entity reconciliation of heterogeneous data sources, checking if they are being used in real environments.
2. Define and develop a Framework for designing the entity reconciliation models in a systematic way for the requirement, analysis and testing phases of a software methodology. For this purpose, this objective has been divided into three sub-objectives:
a. Define a set of activities, represented as a process, which can be added to any software development methodology to carry out the activities related to entity reconciliation in the requirement, analysis and testing phases of any software development life cycle.
b. Define a metamodel that allows us to represent an abstract view of our model-based approach.
c. Define a set of derivation mechanisms that allow establishing the basis for automating the testing of the solutions where the framework proposed in this doctoral thesis has been used. Considering that the process will be applied in the early stages of development, it is possible to say that this proposal applies Early Testing.
3. Provide a support tool for the framework. The support tool will allow a software engineer to define the analysis model of an entity reconciliation problem between different and heterogeneous data sources. The tool will be represented as a Domain Specific Language (DSL).
4. Evaluate the results obtained from the application of the proposal in a real-world case study.