A survey on Data Extraction and Data Duplication Detection
Text mining, also known as intelligent text analysis, is an important research area. The high dimensionality of data makes it difficult to focus on the most relevant information. Feature extraction is one of the key data-reduction techniques for discovering the most important features. Processing massive amounts of data stored in unstructured form is a challenging task, and several pre-processing methods and algorithms are needed to extract useful features from it. When dealing with collections of text documents, it is also important to filter out duplicate data; once duplicates are deleted, the removed information should be replaced from the remaining sources. This paper reviews the literature on duplicate detection and data fusion (removing and replacing duplicates). The survey covers existing text mining techniques for extracting relevant features, detecting duplicates, and replacing duplicate data to deliver fine-grained knowledge to the user.
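As a rough illustration of the remove-and-replace idea described above (the Jaccard measure, the threshold, and the keep-the-longest fusion rule are my own illustrative choices, not the survey's):

```python
# Illustrative sketch (not the survey's method): near-duplicate detection by
# Jaccard token overlap, followed by a simple fusion step that replaces each
# duplicate cluster with its most complete member.

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two text records."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def dedupe_and_fuse(records, threshold=0.8):
    """Greedily cluster records whose similarity exceeds the threshold,
    then keep the longest (most informative) member of each cluster."""
    fused = []
    for rec in records:
        for i, kept in enumerate(fused):
            if jaccard(rec, kept) >= threshold:
                # Fusion: retain the more complete of the two representations.
                if len(rec) > len(kept):
                    fused[i] = rec
                break
        else:
            fused.append(rec)
    return fused

records = [
    "John Smith 42 Baker Street London",
    "john smith 42 baker street london UK",
    "Jane Doe 7 Elm Road Leeds",
]
print(dedupe_and_fuse(records))
```

Real systems would use blocking and stronger similarity measures, but the detect/remove/replace pipeline has this overall shape.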
Use of graph theory measures to identify errors in record linkage
Ensuring high linkage quality is important in many record linkage applications. Current methods for ensuring quality are manual and resource-intensive. This paper seeks to determine the effectiveness of graph theory techniques in identifying record linkage errors. A range of graph theory techniques was applied to two linked datasets with known truth sets. The ability of graph theory techniques to identify groups containing errors was compared with a widely used threshold-setting technique. This methodology shows promise; however, further investigation of graph theory techniques is required. The development of more efficient and effective methods of improving linkage quality will result in higher-quality datasets that can be delivered to researchers in shorter timeframes.
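One way the graph-theoretic idea might look in practice is sketched below; the abstract does not specify which measures were used, so the edge-density criterion and its threshold here are illustrative assumptions on my part:

```python
# Hedged sketch: treat linked record pairs as edges of a graph. A correctly
# linked group should be close to a clique, so connected components with low
# edge density are flagged as likely containing linkage errors.
from collections import defaultdict

def components(edges):
    """Connected components of an undirected graph given as edge pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps, adj

def flag_suspect_groups(edges, min_density=0.8):
    """Flag groups whose edge density falls below min_density."""
    comps, adj = components(edges)
    suspects = []
    for comp in comps:
        n = len(comp)
        if n < 3:
            continue  # a linked pair is trivially complete
        e = sum(len(adj[u] & comp) for u in comp) // 2
        density = 2 * e / (n * (n - 1))
        if density < min_density:
            suspects.append((sorted(comp), round(density, 2)))
    return suspects

# A chain a-b-c-d (density 0.5) looks suspicious; a triangle x-y-z does not.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y"), ("y", "z"), ("x", "z")]
print(flag_suspect_groups(edges))
```

Chain-shaped groups arise when a transitive closure links records through one erroneous match, which is exactly what a density measure is good at exposing.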
Efficient Duplicate Detection Using Progressive Algorithms
Duplicate detection is the process of identifying multiple representations of the same real-world entities. Today, duplicate detection methods need to process ever larger datasets in ever shorter time, making it increasingly difficult to maintain the quality of a dataset. This work presents two novel, progressive duplicate detection algorithms that significantly increase the ability to find duplicates when the execution time is limited: they maximize the gain of the overall process within the time available by reporting most results much earlier than traditional approaches. Comprehensive experiments show that the progressive algorithms can double the efficiency over time of traditional duplicate detection and significantly outperform related work.
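The early-reporting behaviour can be sketched with a progressive variant of the classic sorted-neighborhood method; this is a minimal illustration under my own assumptions, not the authors' implementation:

```python
# Sketch of progressive sorted-neighborhood comparison ordering: records are
# sorted by a key, then candidate pairs are emitted at rank distance 1 first,
# then 2, and so on, so the most promising comparisons (and hence most
# duplicates) come out earliest, even if time runs out before all pairs.

def progressive_pairs(records, key=lambda r: r, max_distance=3):
    """Yield candidate pairs in order of increasing rank distance."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    for dist in range(1, max_distance + 1):
        for pos in range(len(order) - dist):
            yield records[order[pos]], records[order[pos + dist]]

names = ["ann", "anne", "bob", "bobby", "carl"]
# Distance-1 neighbours (the likeliest duplicates) are emitted first:
for a, b in progressive_pairs(names, max_distance=2):
    print(a, b)
```

Because the generator is lazy, a caller can stop consuming pairs at any deadline and still have seen the highest-value comparisons, which is the essence of the progressive approach.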
LEAPME: learning-based property matching with embeddings
Data integration tasks such as the creation and extension of knowledge graphs involve the
fusion of heterogeneous entities from many sources. Matching and fusing such
entities also requires matching and combining their properties (attributes). However, previous schema matching
approaches mostly focus on two sources only and often rely on simple similarity measurements.
They thus face problems in challenging use cases such as the integration of heterogeneous
product entities from many sources.
We therefore present a new machine learning-based property matching approach called
LEAPME (LEArning-based Property Matching with Embeddings) that utilizes numerous features
of both property names and instance values. The approach makes heavy use of word
embeddings to better utilize the domain-specific semantics of both property names and instance
values. The use of supervised machine learning helps exploit the predictive power of word
embeddings.
Our comparative evaluation against five baselines for several multi-source datasets with
real-world data shows the high effectiveness of LEAPME. We also show that our approach is
even effective when training data from another domain (transfer learning) is used.
Funding: Ministerio de Economía y Competitividad TIN2016-75394-R; Ministerio de Ciencia e Innovación PID2019-105471RB-I00; Junta de Andalucía P18-RT-106
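A toy sketch of the idea, with invented embedding vectors and a hand-set linear scorer standing in for the pretrained word embeddings and trained classifier that LEAPME actually uses:

```python
# Illustrative only: the embeddings and weights below are invented for
# demonstration; LEAPME uses pretrained embeddings and supervised learning.
import math

# Toy "embeddings" for property names.
toy_embedding = {
    "price": [0.9, 0.1, 0.0], "cost": [0.85, 0.2, 0.05],
    "colour": [0.0, 0.9, 0.3], "weight": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def match_score(prop_a, prop_b, weight=1.0, bias=-0.9):
    """Linear model over an embedding-similarity feature, standing in for
    the trained classifier; True means the properties are predicted to match."""
    feature = cosine(toy_embedding[prop_a], toy_embedding[prop_b])
    return weight * feature + bias > 0

print(match_score("price", "cost"))    # semantically close property names
print(match_score("price", "colour"))  # unrelated property names
```

The point of using embeddings as features is visible even here: "price" and "cost" share no characters, so string similarity would fail, but their embedding vectors are close.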
Robust Group Linkage
We study the problem of group linkage: linking records that refer to entities
in the same group. Applications for group linkage include finding businesses in
the same chain, finding conference attendees from the same affiliation, finding
players from the same team, etc. Group linkage faces challenges not present for
traditional record linkage. First, although different members in the same group
can share similar global values of an attribute, they represent different
entities and so can also have distinct local values for the same or different
attributes, requiring a high tolerance for value diversity. Second, groups can
be huge (with tens of thousands of records), requiring high scalability even
after using good blocking strategies.
We present a two-stage algorithm: the first stage identifies cores containing
records that are very likely to belong to the same group, while being robust to
possible erroneous values; the second stage collects strong evidence from the
cores and leverages it for merging more records into the same group, while
being tolerant to differences in local values of an attribute. Experimental
results show the high effectiveness and efficiency of our algorithm on various
real-world data sets.
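The two-stage structure can be sketched as follows; the use of exact agreement on two attributes to form cores and of a shared phone number as "strong evidence" is an invented simplification, not the paper's actual similarity machinery:

```python
# Minimal two-stage group-linkage sketch (my own simplification):
# stage 1 forms "cores" from records that agree on several attributes at
# once; stage 2 attaches remaining records that share the core's strong
# evidence even when other local values (name, city) differ.

records = [
    {"name": "Home Depot #12", "phone": "555-0001", "city": "Austin"},
    {"name": "Home Depot #12", "phone": "555-0001", "city": "Austin"},  # core pair
    {"name": "Home Depot #77", "phone": "555-0001", "city": "Dallas"},  # local values differ
    {"name": "Ace Hardware",   "phone": "555-0999", "city": "Austin"},
]

def group_linkage(records):
    # Stage 1: cores = at least two records identical on phone AND name.
    cores = {}
    for r in records:
        cores.setdefault((r["phone"], r["name"]), []).append(r)
    cores = {k: v for k, v in cores.items() if len(v) >= 2}
    # Stage 2: merge remaining records sharing a core's phone (the strong
    # evidence), tolerating different names and cities (local values).
    groups = []
    for (phone, _), members in cores.items():
        group = list(members)
        for r in records:
            if r not in group and r["phone"] == phone:
                group.append(r)
        groups.append(group)
    return groups

groups = group_linkage(records)
print(len(groups), len(groups[0]))
```

The robustness described in the abstract comes from this asymmetry: cores demand near-certain agreement, while the second stage is deliberately tolerant, so one erroneous value cannot seed a bad group but value diversity within a real group is still absorbed.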