A survey on Data Extraction and Data Duplication Detection
Text mining, also known as intelligent text analysis, is an important research area. The high dimensionality of data makes it difficult to focus on the most relevant information. Feature extraction is one of the key data-reduction techniques for discovering the most important features. Processing massive amounts of data stored in unstructured form is a challenging task, and several pre-processing methods and algorithms are needed to extract useful features from it. When dealing with collections of text documents, it is also important to filter out duplicate data; once duplicates are deleted, the removed information should be replaced from the remaining sources. This paper reviews the literature on duplicate detection and data fusion (removing and replacing duplicates). The survey covers existing text mining techniques for extracting relevant features, detecting duplicates, and replacing duplicate data to deliver fine-grained knowledge to the user.
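As a rough illustration of the remove-and-replace idea described above (the Jaccard measure, the threshold, and the keep-the-longest fusion rule are my own illustrative choices, not the survey's):

```python
# Illustrative sketch (not the survey's method): near-duplicate detection by
# Jaccard token overlap, followed by a simple fusion step that replaces each
# duplicate cluster with its most complete member.

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two text records."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def dedupe_and_fuse(records, threshold=0.8):
    """Greedily cluster records whose similarity exceeds the threshold,
    then keep the longest (most informative) member of each cluster."""
    fused = []
    for rec in records:
        for i, kept in enumerate(fused):
            if jaccard(rec, kept) >= threshold:
                # Fusion: retain the more complete of the two representations.
                if len(rec) > len(kept):
                    fused[i] = rec
                break
        else:
            fused.append(rec)
    return fused

records = [
    "John Smith 42 Baker Street London",
    "john smith 42 baker street london UK",
    "Jane Doe 7 Elm Road Leeds",
]
print(dedupe_and_fuse(records))
```

Real systems would use blocking and stronger similarity measures, but the detect/remove/replace pipeline has this overall shape.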
Use of graph theory measures to identify errors in record linkage
Ensuring high linkage quality is important in many record linkage applications. Current methods for ensuring quality are manual and resource-intensive. This paper seeks to determine the effectiveness of graph theory techniques in identifying record linkage errors. A range of graph theory techniques was applied to two linked datasets with known truth sets. The ability of graph theory techniques to identify groups containing errors was compared with a widely used threshold-setting technique. This methodology shows promise; however, further investigation of graph theory techniques is required. The development of more efficient and effective methods of improving linkage quality will result in higher-quality datasets that can be delivered to researchers in shorter timeframes.
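One way the graph-theoretic idea might look in practice is sketched below; the abstract does not specify which measures were used, so the edge-density criterion and its threshold here are illustrative assumptions on my part:

```python
# Hedged sketch: treat linked record pairs as edges of a graph. A correctly
# linked group should be close to a clique, so connected components with low
# edge density are flagged as likely containing linkage errors.
from collections import defaultdict

def components(edges):
    """Connected components of an undirected graph given as edge pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps, adj

def flag_suspect_groups(edges, min_density=0.8):
    """Flag groups whose edge density falls below min_density."""
    comps, adj = components(edges)
    suspects = []
    for comp in comps:
        n = len(comp)
        if n < 3:
            continue  # a linked pair is trivially complete
        e = sum(len(adj[u] & comp) for u in comp) // 2
        density = 2 * e / (n * (n - 1))
        if density < min_density:
            suspects.append((sorted(comp), round(density, 2)))
    return suspects

# A chain a-b-c-d (density 0.5) looks suspicious; a triangle x-y-z does not.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y"), ("y", "z"), ("x", "z")]
print(flag_suspect_groups(edges))
```

Chain-shaped groups arise when a transitive closure links records through one erroneous match, which is exactly what a density measure is good at exposing.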
Efficient Duplicate Detection Using Progressive Algorithms
Duplicate detection is the process of identifying multiple representations of the same real-world entities. Today, duplicate detection methods need to process ever larger datasets in ever shorter time, making it increasingly difficult to maintain the quality of a dataset. This work presents two novel, progressive duplicate detection algorithms that significantly increase the ability to find duplicates when the execution time is limited: they maximize the gain of the overall process within the time available by reporting most results much earlier than traditional approaches. Comprehensive experiments show that the progressive algorithms can double the efficiency over time of traditional duplicate detection and significantly outperform related work.
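The early-reporting behaviour can be sketched with a progressive variant of the classic sorted-neighborhood method; this is a minimal illustration under my own assumptions, not the authors' implementation:

```python
# Sketch of progressive sorted-neighborhood comparison ordering: records are
# sorted by a key, then candidate pairs are emitted at rank distance 1 first,
# then 2, and so on, so the most promising comparisons (and hence most
# duplicates) come out earliest, even if time runs out before all pairs.

def progressive_pairs(records, key=lambda r: r, max_distance=3):
    """Yield candidate pairs in order of increasing rank distance."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    for dist in range(1, max_distance + 1):
        for pos in range(len(order) - dist):
            yield records[order[pos]], records[order[pos + dist]]

names = ["ann", "anne", "bob", "bobby", "carl"]
# Distance-1 neighbours (the likeliest duplicates) are emitted first:
for a, b in progressive_pairs(names, max_distance=2):
    print(a, b)
```

Because the generator is lazy, a caller can stop consuming pairs at any deadline and still have seen the highest-value comparisons, which is the essence of the progressive approach.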
LEAPME: learning-based property matching with embeddings
Data integration tasks such as the creation and extension of knowledge graphs involve the
fusion of heterogeneous entities from many sources. Matching and fusing such
entities also requires matching and combining their properties (attributes). However, previous schema matching
approaches mostly focus on two sources only and often rely on simple similarity measurements.
They thus face problems in challenging use cases such as the integration of heterogeneous
product entities from many sources.
We therefore present a new machine learning-based property matching approach called
LEAPME (LEArning-based Property Matching with Embeddings) that utilizes numerous features
of both property names and instance values. The approach makes heavy use of word
embeddings to better utilize the domain-specific semantics of both property names and instance
values. The use of supervised machine learning helps exploit the predictive power of word
embeddings.
Our comparative evaluation against five baselines for several multi-source datasets with
real-world data shows the high effectiveness of LEAPME. We also show that our approach is
even effective when training data from another domain (transfer learning) is used.
Funding: Ministerio de Economía y Competitividad TIN2016-75394-R; Ministerio de Ciencia e Innovación PID2019-105471RB-I00; Junta de Andalucía P18-RT-106
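A toy sketch of the idea, with invented embedding vectors and a hand-set linear scorer standing in for the pretrained word embeddings and trained classifier that LEAPME actually uses:

```python
# Illustrative only: the embeddings and weights below are invented for
# demonstration; LEAPME uses pretrained embeddings and supervised learning.
import math

# Toy "embeddings" for property names.
toy_embedding = {
    "price": [0.9, 0.1, 0.0], "cost": [0.85, 0.2, 0.05],
    "colour": [0.0, 0.9, 0.3], "weight": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def match_score(prop_a, prop_b, weight=1.0, bias=-0.9):
    """Linear model over an embedding-similarity feature, standing in for
    the trained classifier; True means the properties are predicted to match."""
    feature = cosine(toy_embedding[prop_a], toy_embedding[prop_b])
    return weight * feature + bias > 0

print(match_score("price", "cost"))    # semantically close property names
print(match_score("price", "colour"))  # unrelated property names
```

The point of using embeddings as features is visible even here: "price" and "cost" share no characters, so string similarity would fail, but their embedding vectors are close.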
Robust Group Linkage
We study the problem of group linkage: linking records that refer to entities
in the same group. Applications for group linkage include finding businesses in
the same chain, finding conference attendees from the same affiliation, finding
players from the same team, etc. Group linkage faces challenges not present for
traditional record linkage. First, although different members in the same group
can share similar global values of an attribute, they represent different
entities and so can also have distinct local values for the same or different
attributes, requiring a high tolerance for value diversity. Second, groups can
be huge (with tens of thousands of records), requiring high scalability even
after using good blocking strategies.
We present a two-stage algorithm: the first stage identifies cores containing
records that are very likely to belong to the same group, while being robust to
possible erroneous values; the second stage collects strong evidence from the
cores and leverages it for merging more records into the same group, while
being tolerant to differences in local values of an attribute. Experimental
results show the high effectiveness and efficiency of our algorithm on various
real-world data sets.
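The two-stage structure can be sketched as follows; the use of exact agreement on two attributes to form cores and of a shared phone number as "strong evidence" is an invented simplification, not the paper's actual similarity machinery:

```python
# Minimal two-stage group-linkage sketch (my own simplification):
# stage 1 forms "cores" from records that agree on several attributes at
# once; stage 2 attaches remaining records that share the core's strong
# evidence even when other local values (name, city) differ.

records = [
    {"name": "Home Depot #12", "phone": "555-0001", "city": "Austin"},
    {"name": "Home Depot #12", "phone": "555-0001", "city": "Austin"},  # core pair
    {"name": "Home Depot #77", "phone": "555-0001", "city": "Dallas"},  # local values differ
    {"name": "Ace Hardware",   "phone": "555-0999", "city": "Austin"},
]

def group_linkage(records):
    # Stage 1: cores = at least two records identical on phone AND name.
    cores = {}
    for r in records:
        cores.setdefault((r["phone"], r["name"]), []).append(r)
    cores = {k: v for k, v in cores.items() if len(v) >= 2}
    # Stage 2: merge remaining records sharing a core's phone (the strong
    # evidence), tolerating different names and cities (local values).
    groups = []
    for (phone, _), members in cores.items():
        group = list(members)
        for r in records:
            if r not in group and r["phone"] == phone:
                group.append(r)
        groups.append(group)
    return groups

groups = group_linkage(records)
print(len(groups), len(groups[0]))
```

The robustness described in the abstract comes from this asymmetry: cores demand near-certain agreement, while the second stage is deliberately tolerant, so one erroneous value cannot seed a bad group but value diversity within a real group is still absorbed.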