Collective Entity Resolution In Relational Data
Many databases contain imprecise references to real-world entities. For example, a social-network database records names of people. But different people can go by the same name and there may be
different observed names referring to the same person. The goal of entity resolution is to determine the mapping from database references to discovered real-world entities.
Traditional entity resolution approaches consider approximate matches between attributes of individual references, but this does not always work well. In many domains, such as social networks and academic circles, the underlying entities exhibit strong ties to each other, and as a result, their references often co-occur in the data. In this dissertation, I focus on the use of such co-occurrence relationships for jointly resolving entities. I refer to this problem as `collective entity resolution'. First, I propose a relational clustering algorithm for iteratively discovering entities by clustering references, taking into account the clusters of co-occurring references. Next, I propose a probabilistic generative model for collective resolution that finds hidden group structures among the entities and uses the latent groups as evidence for entity resolution. One of my contributions is an efficient unsupervised inference algorithm for this model, based on Gibbs sampling, that discovers the most likely number of entities. Both of these approaches improve performance over attribute-only baselines on multiple real-world and synthetic datasets. I also perform a theoretical analysis of how the structural properties of the data affect collective entity resolution and verify the predicted trends experimentally. In addition, I motivate the problem of query-time entity resolution. I propose an adaptive algorithm that uses collective resolution for answering queries by recursively exploring and resolving related references. This enables resolution at query time while preserving the performance benefits of collective resolution. Finally, as an application of entity resolution in the domain of natural language processing, I study the sense disambiguation problem and propose models for collective sense disambiguation using multiple languages that outperform other unsupervised approaches.
ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution
Entity resolution (ER), an important and common data cleaning problem, is
about detecting duplicate data representations for the same external entities,
and merging them into single representations. Relatively recently, declarative
rules called "matching dependencies" (MDs) have been proposed for specifying
similarity conditions under which attribute values in database records are
merged. In this work we show the process and the benefits of integrating four
components of ER: (a) Building a classifier for duplicate/non-duplicate record
pairs using machine learning (ML) techniques; (b) Use of MDs for
supporting the blocking phase of ML; (c) Record merging on the basis of the
classifier results; and (d) The use of the declarative language "LogiQL" (an
extended form of Datalog supported by the "LogicBlox" platform) for all
activities related to data processing, and the specification and enforcement of
MDs.
Comment: Final journal version, with some minor technical corrections.
Extended version of arXiv:1508.0601
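The blocking phase mentioned in the abstract above can be illustrated with a minimal Python sketch. It is not ERBlox's method: ERBlox specifies blocking declaratively through MDs in LogiQL, whereas this sketch assumes a simple hypothetical blocking key (the lowercased last token of a `name` field). The record layout and the names `block_records`, `candidate_pairs`, and `surname_key` are all assumptions made for illustration. Only pairs that share a block would be passed on to a duplicate/non-duplicate classifier.

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, key_fn):
    """Group record ids by a blocking key (hypothetical stand-in for MD-based blocking)."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key_fn(rec)].append(rec_id)
    return blocks


def candidate_pairs(blocks):
    """Yield record-id pairs that share a block; only these reach the classifier."""
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


def surname_key(rec):
    """Assumed blocking key: lowercased last token of the name field."""
    return rec["name"].split()[-1].lower()


# Toy records, invented for illustration only.
records = {
    1: {"name": "J. Smith"},
    2: {"name": "John Smith"},
    3: {"name": "A. Lee"},
}

pairs = list(candidate_pairs(block_records(records, surname_key)))
print(pairs)
```

With three records, exhaustive comparison would examine three pairs. Blocking on the assumed key leaves only the pair whose members share a surname token.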
ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution
Entity resolution (ER), an important and common data cleaning problem, is
about detecting duplicate data representations for the same external entities,
and merging them into single representations. Relatively recently, declarative
rules called matching dependencies (MDs) have been proposed for specifying
similarity conditions under which attribute values in database records are
merged. In this work we show the process and the benefits of integrating three
components of ER: (a) Classifiers for duplicate/non-duplicate record pairs
built using machine learning (ML) techniques; (b) MDs for supporting both the
blocking phase of ML and the merge itself; and (c) The use of the declarative
language LogiQL (an extended form of Datalog supported by the LogicBlox
platform) for data processing, and the specification and enforcement of MDs.
Comment: To appear in Proc. SUM, 201
Entity Resolution In Graphs
The goal of entity resolution is to reconcile data references
corresponding to the same real world entity. Here we introduce the
problem of entity resolution in graphs, where the nodes are the
references in the data and the hyper-edges represent the relations
that are observed to hold between the references. The goal then is to
reconstruct a `cleaned' entity graph that captures the relations among
the true underlying entities from the reference graph. This is an
important first step in any graph mining process; mining an unresolved
graph will be inefficient and result in inaccurate conclusions. We
also motivate collective entity resolution in graphs where references
sharing hyper-edges are resolved jointly, as opposed to independent
pair-wise resolution of the references. We illustrate the problem of
graph-based entity resolution in bibliographic datasets. We discuss
several interesting issues such as multiple entity types, local and
global resolution and different kinds of graph-based evidence. We
formulate the graph-based entity resolution problem as an unsupervised
clustering task, where each cluster represents references that map to
the same entity, and the similarity measure between two clusters
incorporates the similarity of the references' attributes and, more
interestingly, the similarity between their relations. We explore two
different measures of relational similarity. One approach, which we
call `edge detail similarity', explicitly compares the individual
edges that each cluster participates in, but is expensive to
compute. A less computationally intensive alternative is measuring
`neighborhood similarity', which only compares the multi-set of
neighboring clusters for each cluster. We perform an extensive
empirical evaluation of the two relational similarity measures for
author resolution using co-author relations in two real bibliographic
datasets. We show that both similarity measures improve performance
over unsupervised algorithms that consider only reference
attributes. We also describe an efficient implementation and show that
these algorithms scale gracefully with increasing size of the data.
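As a rough illustration of the neighborhood similarity idea described in this abstract, the following Python sketch compares the multisets of neighboring clusters for two clusters. The abstract does not give the exact formula, so the multiset Jaccard coefficient used here is an assumption, as are the linear combination with attribute similarity, the weight `alpha`, and the function names.

```python
from collections import Counter


def neighborhood_similarity(neighbors_a, neighbors_b):
    """Multiset Jaccard coefficient over neighboring cluster labels (assumed measure)."""
    a, b = Counter(neighbors_a), Counter(neighbors_b)
    intersection = sum((a & b).values())  # per-label minimum counts
    union = sum((a | b).values())         # per-label maximum counts
    return intersection / union if union else 0.0


def combined_similarity(attr_sim, rel_sim, alpha=0.5):
    """Assumed linear mix of attribute and relational similarity; alpha is hypothetical."""
    return (1 - alpha) * attr_sim + alpha * rel_sim


# Neighboring cluster labels for two author clusters (toy data, invented for illustration).
c1_neighbors = ["c_smith", "c_jones", "c_jones"]
c2_neighbors = ["c_jones", "c_jones", "c_lee"]

rel = neighborhood_similarity(c1_neighbors, c2_neighbors)
print(rel)
print(combined_similarity(attr_sim=0.8, rel_sim=rel))
```

In an agglomerative clustering loop, a score of this kind would be recomputed after each merge, since merging two clusters changes the neighbor multisets of the clusters linked to them.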
MinoanER: Schema-Agnostic, Non-Iterative, Massively Parallel Resolution of Web Entities
Entity Resolution (ER) aims to identify different descriptions in various
Knowledge Bases (KBs) that refer to the same entity. ER is challenged by the
Variety, Volume and Veracity of entity descriptions published in the Web of
Data. To address them, we propose the MinoanER framework that simultaneously
fulfills full automation, support of highly heterogeneous entities, and massive
parallelization of the ER process. MinoanER leverages a token-based similarity
of entities to define a new metric that derives the similarity of neighboring
entities from the most important relations, as they are indicated only by
statistics. A composite blocking method is employed to capture different
sources of matching evidence from the content, neighbors, or names of entities.
The search space of candidate pairs for comparison is compactly abstracted by a
novel disjunctive blocking graph and processed by a non-iterative, massively
parallel matching algorithm that consists of four generic, schema-agnostic
matching rules that are quite robust with respect to their internal
configuration. We demonstrate that the effectiveness of MinoanER is comparable
to existing ER tools over real KBs exhibiting low Variety, but it outperforms
them significantly when matching KBs with high Variety.
Comment: Presented at EDBT 2001