Blocking techniques for efficient entity resolution over large, highly heterogeneous information spaces
[no abstract]
MinoanER: Schema-Agnostic, Non-Iterative, Massively Parallel Resolution of Web Entities
Entity Resolution (ER) aims to identify different descriptions in various
Knowledge Bases (KBs) that refer to the same entity. ER is challenged by the
Variety, Volume and Veracity of entity descriptions published in the Web of
Data. To address them, we propose the MinoanER framework that simultaneously
fulfills full automation, support of highly heterogeneous entities, and massive
parallelization of the ER process. MinoanER leverages a token-based similarity
of entities to define a new metric that derives the similarity of neighboring
entities from the most important relations, as they are indicated only by
statistics. A composite blocking method is employed to capture different
sources of matching evidence from the content, neighbors, or names of entities.
The search space of candidate pairs for comparison is compactly abstracted by a
novel disjunctive blocking graph and processed by a non-iterative, massively
parallel matching algorithm that consists of four generic, schema-agnostic
matching rules that are quite robust with respect to their internal
configuration. We demonstrate that the effectiveness of MinoanER is comparable
to existing ER tools over real KBs exhibiting low Variety, but it outperforms
them significantly when matching KBs with high Variety.
Comment: Presented at EDBT 2001
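The token-based, schema-agnostic idea in the abstract above can be illustrated with a minimal sketch of generic token blocking. This is not MinoanER's composite blocking method or its disjunctive blocking graph; the function names (`tokenize`, `token_blocks`, `candidate_pairs`) and the toy KBs are hypothetical, and the shared-token count is only a crude stand-in for the similarity the paper defines.

```python
import re
from collections import defaultdict
from itertools import product


def tokenize(description):
    """Collect lowercase alphanumeric tokens from all attribute values.

    Attribute names are ignored, which is what makes the sketch schema-agnostic.
    """
    tokens = set()
    for value in description.values():
        tokens.update(re.findall(r"[a-z0-9]+", str(value).lower()))
    return tokens


def token_blocks(kb):
    """Map each token to the set of entity ids whose values contain it."""
    blocks = defaultdict(set)
    for entity_id, description in kb.items():
        for token in tokenize(description):
            blocks[token].add(entity_id)
    return blocks


def candidate_pairs(kb1, kb2):
    """Return cross-KB pairs that share a token, with the number of shared tokens."""
    blocks1, blocks2 = token_blocks(kb1), token_blocks(kb2)
    shared = defaultdict(int)
    for token in blocks1.keys() & blocks2.keys():
        for pair in product(blocks1[token], blocks2[token]):
            shared[pair] += 1
    return dict(shared)


# Toy KBs with different attribute names (assumed data, for illustration only).
kb1 = {"a1": {"name": "Heraklion", "country": "Greece"},
       "a2": {"name": "Chania", "country": "Greece"}}
kb2 = {"b1": {"label": "Heraklion, Crete", "located_in": "Greece"},
       "b2": {"title": "Paris", "state": "France"}}

pairs = candidate_pairs(kb1, kb2)
# pairs == {("a1", "b1"): 2, ("a2", "b1"): 1}
```

Only pairs that co-occur in at least one block are ever compared, so `b2` is never paired with anything here. A frequent token such as "greece" still produces weak candidates like `("a2", "b1")`, which is why a later matching step is needed.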
A survey on Data Extraction and Data Duplication Detection
Text mining, also known as Intelligent Text Analysis, is an important research area. The high dimensionality of the data makes it very difficult to focus on the most relevant information. Feature extraction is one of the important data reduction techniques for discovering the most important features. Processing massive amounts of data stored in an unstructured form is a challenging task, and several pre-processing methods and algorithms are needed to extract useful features from huge amounts of data. When dealing with collections of text documents, it is also very important to filter out duplicate data. Once duplicates are deleted, it is recommended to replace the removed duplicates. This paper reviews the literature on duplicate detection and data fusion (removing and replacing duplicates). The survey presents existing text mining techniques for extracting relevant features, detecting duplicates, and replacing duplicate data in order to deliver fine-grained knowledge to the user.
End-to-End Entity Resolution for Big Data: A Survey
One of the most important tasks for improving data quality and the
reliability of data analytics results is Entity Resolution (ER). ER aims to
identify different descriptions that refer to the same real-world entity, and
remains a challenging problem. While previous works have studied specific
aspects of ER (and mostly in traditional settings), in this survey, we provide
for the first time an end-to-end view of modern ER workflows, and of the novel
aspects of entity indexing and matching methods in order to cope with more than
one of the Big Data characteristics simultaneously. We present the basic
concepts, processing steps and execution strategies that have been proposed by
different communities, i.e., database, semantic Web and machine learning, in
order to cope with the loose structuredness, extreme diversity, high speed and
large scale of entity descriptions used by real-world applications. Finally, we
provide a synthetic discussion of the existing approaches, and conclude with a
detailed presentation of open research directions.
Named Entity Resolution in Personal Knowledge Graphs
Entity Resolution (ER) is the problem of determining when two entities refer
to the same underlying entity. The problem has been studied for over 50 years,
and most recently, has taken on new importance in an era of large,
heterogeneous 'knowledge graphs' published on the Web and used widely in
domains as wide ranging as social media, e-commerce and search. This chapter
will discuss the specific problem of named ER in the context of personal
knowledge graphs (PKGs). We begin with a formal definition of the problem, and
the components necessary for doing high-quality and efficient ER. We also
discuss some challenges that are expected to arise for Web-scale data. Next, we
provide a brief literature review, with a special focus on how existing
techniques can potentially apply to PKGs. We conclude the chapter by covering
some applications, as well as promising directions for future research.
Comment: To appear as a book chapter by the same name in an upcoming (Oct.
2023) book `Personal Knowledge Graphs (PKGs): Methodology, tools and
applications' edited by Tiwari et al.
Semantic-aware blocking for entity resolution
In this paper, we propose a semantic-aware blocking framework for entity resolution (ER). The proposed framework is built using locality-sensitive hashing (LSH) techniques and efficiently unifies both textual and semantic features in an ER blocking process. In order to understand how similarity metrics may affect the effectiveness of ER blocking, we study the robustness of similarity metrics and their properties in terms of LSH families. Then, we present how the semantic similarity of records can be captured, measured, and integrated with LSH techniques over multiple similarity spaces. In doing so, the proposed framework can support efficient similarity searches on records in both textual and semantic similarity spaces, yielding ER blocking with improved quality. We have evaluated the proposed framework over two real-world data sets, and compared it with the state-of-the-art blocking techniques. Our experimental study shows that the combination of semantic similarity and textual similarity can considerably improve the quality of blocking. Furthermore, due to the probabilistic nature of LSH, this semantic-aware blocking framework enables us to build fast and reliable blocking for performing entity resolution tasks in a large-scale data environment.
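To make the LSH-based blocking in the abstract above more concrete, here is a minimal sketch of MinHash with banding over character shingles. It covers only the textual side; the semantic similarity space and the way the paper combines the two are not reproduced. All names (`shingles`, `minhash_signature`, `lsh_blocks`) and parameter values (`num_hashes=32`, `bands=8`) are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of character k-grams of the whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(items, num_hashes=32):
    """Compute a MinHash signature: for each seeded hash, keep the minimum over all items."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(),
                "big")
            for item in items))
    return signature


def lsh_blocks(records, num_hashes=32, bands=8):
    """Group record ids whose signatures agree on every row of at least one band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for record_id, text in records.items():
        signature = minhash_signature(shingles(text), num_hashes)
        for band in range(bands):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].add(record_id)
    # Only buckets with more than one record yield candidate pairs.
    return [ids for ids in buckets.values() if len(ids) > 1]


# Toy records (assumed data); r1 and r2 are identical after normalization.
records = {"r1": "Entity Resolution",
           "r2": "entity  resolution",
           "r3": "locality-sensitive hashing"}
blocks = lsh_blocks(records)
```

With `b` bands of `r` rows each, two records whose shingle sets have Jaccard similarity `s` share at least one bucket with probability `1 - (1 - s^r)^b`. This is the probabilistic behavior the abstract refers to: near-duplicates collide almost surely, while dissimilar records rarely do.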