11 research outputs found
An Adaptive Approach for Interlinking Georeferenced Data
The resources published on the Web of Data are often described by spatial references such as coordinates. Common data linking approaches are mainly based on the hypothesis that spatially close resources are more likely to represent the same thing. However, this assumption holds only when the spatial references being compared were produced with the same positional accuracy, and when they actually represent the same spatial characteristic of the resources, captured unambiguously. Otherwise, spatial distance-based matching algorithms may produce erroneous links. In this article, we first propose to formalize and acquire knowledge about the spatial references, namely their positional accuracy, their geometric modeling, their level of detail, and the vagueness of the spatial entities they represent. We then propose an interlinking approach that dynamically adapts the way spatial references are compared based on this knowledge.
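As an illustration of the accuracy-adaptive idea described above, the following Python sketch widens the matching tolerance with the combined positional accuracy of the two sources. The decision rule and the factor k are hypothetical, not taken from the article:

```python
import math

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371000 * math.asin(math.sqrt(h))

def adaptive_match(p1, acc1, p2, acc2, k=2.0):
    """Hypothetical rule: accept a candidate link only if the distance is small
    relative to the combined positional accuracy (in metres) of both sources."""
    tolerance = k * math.hypot(acc1, acc2)
    return haversine_m(p1, p2) <= tolerance

# A ~100 m offset is plausible between two coarse sources (accuracy ~100 m each),
# but the same offset is suspicious between two precise sources (~10 m each).
```

The point of the sketch is that the same geometric distance leads to opposite matching decisions depending on the declared accuracy of the compared references.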
Web-Scale Blocking, Iterative and Progressive Entity Resolution
Entity resolution aims to identify descriptions of the same entity within or across knowledge bases. In this work, we provide a comprehensive and cohesive overview of the key research results in the area of entity resolution. We are interested in frameworks addressing the new challenges in entity resolution posed by the Web of Data, in which real-world entities are described by interlinked data rather than documents. Since such descriptions are usually partial, overlapping, and sometimes evolving, entity resolution emerges as a central problem both for increasing dataset linking and for searching the Web of Data for entities and their relations. We focus on Web-scale blocking, iterative, and progressive solutions for entity resolution. Specifically, to reduce the required number of comparisons, blocking places similar descriptions into blocks and executes comparisons to identify matches only between descriptions within the same block. To minimize the number of missed matches, an iterative entity resolution process can exploit any intermediate results of blocking and matching, discovering new candidate description pairs for resolution. Finally, we overview works on progressive entity resolution, which attempt to discover as many matches as possible given a limited computing budget, by estimating the matching likelihood of yet-unresolved descriptions based on the matches found so far.
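The blocking step described above can be sketched in a few lines of Python; the first-letter blocking key is a toy choice for illustration, not one proposed by the survey:

```python
from collections import defaultdict
from itertools import combinations

def block_by_key(records, key):
    """Group records into blocks by a blocking-key function."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    return blocks

def candidate_pairs(blocks):
    """Generate comparison candidates only within each block."""
    for block in blocks.values():
        yield from combinations(block, 2)

people = [
    {"name": "Alice Smith"}, {"name": "Alicia Smith"},
    {"name": "Bob Jones"}, {"name": "Alice Smyth"},
]
blocks = block_by_key(people, key=lambda r: r["name"][0].lower())
pairs = list(candidate_pairs(blocks))
# three records share block 'a', so 3 candidate pairs instead of 6 overall
```

Even this toy key halves the number of comparisons; real blocking keys (token-based, sorted-neighbourhood, etc.) scale this effect to Web-sized collections.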
Geo-L: Topological Link Discovery for Geospatial Linked Data Made Easy
Geospatial linked data is an emerging domain, with growing interest in research and industry. There is an increasing number of publicly available geospatial linked data resources, which can also be interlinked and easily integrated with private and industrial linked data on the Web. The present paper introduces Geo-L, a system for the discovery of RDF spatial links based on topological relations. Experiments show that the proposed system improves on state-of-the-art spatial linking processes in terms of mapping time and accuracy, as well as resource-retrieval efficiency and robustness.
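Geo-L's exact topological tests are beyond a short sketch, but the bounding-box pre-filter that spatial linking systems commonly apply before the expensive geometry comparison can be illustrated as follows (the names and data are invented):

```python
def bbox_intersects(a, b):
    """Axis-aligned bounding boxes as (minx, miny, maxx, maxy) tuples."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def candidate_links(sources, targets):
    """Pairs whose bounding boxes overlap; only these need the expensive
    exact topological test (intersects, within, touches, ...)."""
    return [(s, t) for s, sb in sources.items() for t, tb in targets.items()
            if bbox_intersects(sb, tb)]

parks = {"parkA": (0, 0, 2, 2)}
rivers = {"river1": (1, 1, 5, 1.5), "river2": (10, 10, 12, 11)}
# parkA's box overlaps river1's box but not river2's
```

The pre-filter never discards a truly intersecting pair, since two geometries can only intersect if their bounding boxes do.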
Efficient multidimensional blocking for link discovery without losing recall
Over the last three years, an increasing number of data providers have started to publish structured data on the Web according to the Linked Data principles. The resulting Web of Data currently consists of over 28 billion RDF triples. As the Web of Data grows, there is an increasing need for link discovery tools which scale to very large datasets. In record linkage, many partitioning methods have been proposed which substantially reduce the number of required entity comparisons. Unfortunately, most of these methods either lead to a decrease in recall or only work in metric spaces. We propose a novel blocking method called MultiBlock which uses a multidimensional index in which similar objects are located near each other. In each dimension the entities are indexed by a different property, increasing the efficiency of the index significantly. In addition, MultiBlock guarantees that no false dismissals can occur. Our approach works on complex link specifications which aggregate several different similarity measures. MultiBlock has been implemented as part of the Silk Link Discovery Framework. The evaluation shows a speedup factor of several hundred for large datasets compared to a full evaluation, without losing recall.
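The following is a much-simplified, grid-based sketch of the multidimensional indexing idea, not the actual MultiBlock implementation in Silk; each property contributes one grid coordinate, and enumerating neighbouring cells is what avoids false dismissals at cell boundaries:

```python
from collections import defaultdict
from itertools import product

def grid_key(entity, dims):
    """One grid coordinate per indexed property; dims is a list of
    (property, bin_size) pairs (both chosen for illustration)."""
    return tuple(int(entity[prop] // size) for prop, size in dims)

def build_index(entities, dims):
    index = defaultdict(list)
    for e in entities:
        index[grid_key(e, dims)].append(e)
    return index

def candidates(entity, index, dims):
    """Entities in the same or an adjacent cell in every dimension."""
    key = grid_key(entity, dims)
    out = []
    for delta in product((-1, 0, 1), repeat=len(key)):
        out.extend(index.get(tuple(k + d for k, d in zip(key, delta)), []))
    return out

books = [
    {"id": "a", "year": 2000, "pages": 100},
    {"id": "b", "year": 2001, "pages": 120},
    {"id": "c", "year": 1950, "pages": 500},
]
dims = [("year", 5), ("pages", 50)]
index = build_index(books, dims)
near = candidates(books[0], index, dims)
# "a" and "b" fall into the same cell; "c" is distant in both dimensions
```

Because an entity must be close in every indexed dimension to be a match, each extra dimension further prunes the candidate set.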
STEM: stacked threshold-based entity matching for knowledge base generation
One of the major issues encountered in the generation of knowledge bases is the integration of data coming from a collection of heterogeneous data sources. A key task when integrating data instances is entity matching. Entity matching is based on the definition of a similarity measure among entities and on the classification of an entity pair as a match if the similarity exceeds a certain threshold. This parameter introduces a trade-off between the precision and the recall of the algorithm, as higher values of the threshold lead to higher precision and lower recall, while lower values lead to higher recall and lower precision. In this paper, we propose a stacking approach for threshold-based classifiers. It runs several instances of classifiers corresponding to different thresholds and uses their predictions as a feature vector for a supervised learner. We show that this approach is able to break the trade-off between the precision and recall of the algorithm, increasing both at the same time and enhancing the overall performance. We also show that this hybrid approach performs better, and is less dependent on the amount of available training data, than a supervised learning approach that directly uses the properties' similarity values. In order to test the generality of the claim, we have run experimental tests using two different threshold-based classifiers on two different data sets. Finally, we present a concrete use case describing the implementation of the proposed approach in the generation of the 3cixty Nice knowledge base.
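The stacking construction described in the abstract can be sketched as follows; the thresholds and similarity values are invented, and in the full approach the resulting feature vectors feed a trained supervised meta-learner rather than being inspected directly:

```python
def threshold_features(similarity, thresholds):
    """One binary prediction per threshold classifier; the vector of
    predictions becomes the feature vector for the meta-learner."""
    return [1 if similarity >= t else 0 for t in thresholds]

thresholds = [0.5, 0.7, 0.9]          # several classifier instances
pairs_sim = {("e1", "e2"): 0.82, ("e1", "e3"): 0.55}
features = {p: threshold_features(s, thresholds) for p, s in pairs_sim.items()}
# ("e1", "e2") -> [1, 1, 0]; ("e1", "e3") -> [1, 0, 0]
```

The meta-learner can weight the individual threshold classifiers differently per region of the similarity space, which is how the precision/recall trade-off of any single fixed threshold is broken.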
Populating a Linked Data Entity Name System
Resource Description Framework (RDF) is a graph-based data model used to publish data as a Web of Linked Data. RDF is an emergent foundation for large-scale data integration, the problem of providing a unified view over multiple data sources. An Entity Name System (ENS) is a thesaurus for entities, and is a crucial component in a data integration architecture. Populating a Linked Data ENS is equivalent to solving an Artificial Intelligence problem called instance matching, which concerns identifying pairs of entities referring to the same underlying entity. This dissertation presents an instance matcher with four properties, namely automation, heterogeneity, scalability, and domain independence. Automation is addressed by employing inexpensive but well-performing heuristics to automatically generate a training set, which is employed by other machine learning algorithms in the pipeline. Data-driven alignment algorithms are adapted to deal with structural heterogeneity in RDF graphs. Domain independence is established by actively avoiding prior assumptions about input domains, and through evaluations on ten RDF test cases. The full system is scaled by implementing it on cloud infrastructure using MapReduce algorithms.
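The automatic training-set generation mentioned above can be illustrated with a minimal heuristic-labelling sketch; the cut-off values are invented, and the dissertation's actual heuristics are richer than a single similarity score:

```python
def auto_label(pairs_sim, pos_cut=0.9, neg_cut=0.2):
    """Cheap heuristic labelling: confidently similar pairs become positive
    training examples, confidently dissimilar pairs become negatives, and
    the ambiguous middle is discarded (hypothetical cut-offs)."""
    train = []
    for pair, sim in pairs_sim.items():
        if sim >= pos_cut:
            train.append((pair, 1))
        elif sim <= neg_cut:
            train.append((pair, 0))
    return train

scores = {("a", "b"): 0.95, ("a", "c"): 0.10, ("a", "d"): 0.50}
training_set = auto_label(scores)
# ("a", "d") is too ambiguous to be used as a training example
```

Discarding the ambiguous middle trades training-set size for label quality, which is what lets a downstream supervised matcher run without any human-labelled data.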
Learning Expressive Linkage Rules for Entity Matching using Genetic Programming
A central problem in data integration and data cleansing is to identify pairs of entities in data sets that describe the same real-world object. Many existing methods for matching entities rely on explicit linkage rules, which specify how two entities are compared for equivalence. Unfortunately, writing accurate linkage rules by hand is a non-trivial problem that requires detailed knowledge of the involved data sets. Another important issue is the efficient execution of linkage rules.
In this thesis, we propose a set of novel methods that cover the complete entity matching workflow, from the generation of linkage rules using genetic programming algorithms to their efficient execution on distributed systems.
First, we propose a supervised learning algorithm that is capable of generating linkage rules from a gold standard consisting of a set of entity pairs that have been labeled as duplicates or non-duplicates. We show that the introduced algorithm outperforms previously proposed entity matching approaches, including the state-of-the-art genetic programming approach by de Carvalho et al., and is capable of learning linkage rules that achieve accuracy similar to that of the human-written rule for the same problem.
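For illustration, a hand-written linkage rule of the kind GenLink learns automatically might look like the following Python sketch; the property names, the max-aggregation, and the threshold are all invented, and learned rules are operator trees over many such comparisons:

```python
def token_jaccard(a, b):
    """Jaccard similarity over lower-cased word tokens."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def rule(x, y):
    """Illustrative linkage rule: max-aggregate two property comparisons,
    then apply a threshold."""
    score = max(token_jaccard(x["label"], y["label"]),
                token_jaccard(x["name"], y["name"]))
    return score >= 0.6

x = {"label": "Alan Turing", "name": "A. Turing"}
y = {"label": "Alan M Turing", "name": "Alan Turing"}
# the label comparison alone clears the threshold, so x and y are linked
```

Genetic programming searches exactly this space of comparison operators, aggregations, and thresholds, which is why the learned rules remain human-readable.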
In order to also cover use cases for which no gold standard is available, we propose a complementary active learning algorithm that generates a gold standard interactively by asking the user to confirm or decline the equivalence of a small number of entity pairs. In the experimental evaluation, labeling at most 50 link candidates was necessary in order to match the performance achieved by the supervised GenLink algorithm on the entire gold standard.
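A common query-selection strategy for such active learning loops, sketched here under the simplifying assumption of a single similarity score per candidate pair (not GenLink's actual selection criterion), is to ask the user about the pair closest to the decision boundary:

```python
def most_informative(scores, threshold=0.5):
    """Pick the unlabeled pair whose similarity is closest to the decision
    boundary; its label is the most likely to change the learned rule."""
    return min(scores, key=lambda pair: abs(scores[pair] - threshold))

scores = {("a", "b"): 0.97, ("a", "c"): 0.52, ("b", "c"): 0.05}
query = most_informative(scores)
# ("a", "c") sits nearest the boundary, so the user is asked about it first
```

Confident pairs far from the boundary contribute little information, which is why a few dozen well-chosen questions can substitute for a full gold standard.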
Finally, we propose an efficient execution workflow that can be run on a cluster of multiple machines. The execution workflow employs a novel multidimensional indexing method that allows the efficient execution of learned linkage rules by significantly reducing the number of required comparisons.