Learning Expressive Linkage Rules for Entity Matching using Genetic Programming
A central problem in data integration and data cleansing is to identify
pairs of entities in data sets that describe the same real-world object.
Many existing methods for matching entities rely on explicit linkage rules,
which specify how two entities are compared for equivalence. Unfortunately,
writing accurate linkage rules by hand is a non-trivial problem that
requires detailed knowledge of the involved data sets. Another important
issue is the efficient execution of linkage rules.
In this thesis, we propose a set of novel methods that cover the complete
entity matching workflow from the generation of linkage rules using genetic
programming algorithms to their efficient execution on distributed systems.
First, we propose a supervised learning algorithm that is capable of
generating linkage rules from a gold standard consisting of a set of entity
pairs that have been labeled as duplicates or non-duplicates. We show that
the introduced algorithm outperforms previously proposed entity matching
approaches including the state-of-the-art genetic programming approach by
de Carvalho et al. and is capable of learning linkage rules that achieve a
similar accuracy to the human-written rule for the same problem.
In order to also cover use cases for which no gold standard is available,
we propose a complementary active learning algorithm that generates a gold
standard interactively by asking the user to confirm or decline the
equivalence of a small number of entity pairs. In the experimental
evaluation, labeling at most 50 link candidates was necessary in order to
match the performance that is achieved by the supervised GenLink algorithm
on the entire gold standard.
Finally, we propose an efficient execution workflow that can be run on
a cluster of multiple machines. The execution workflow employs a novel
multidimensional indexing method that allows the efficient execution of
learned linkage rules by reducing the number of required comparisons
significantly.
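To make the notion of a linkage rule concrete, the following is a minimal sketch of how a rule could be applied to entity pairs and scored against a labeled gold standard. The rule format (`RULE`), the helper names, the sample entities, and the use of `difflib` string similarity are all assumptions made for illustration. They are not the rule representation or the fitness measure used in the thesis, and the F-measure is only one common way to score such a rule.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (stand-in for a real measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical linkage rule: a weighted average of attribute similarities
# that must reach a threshold for two entities to count as the same object.
RULE = {"comparisons": [("name", 0.7), ("city", 0.3)], "threshold": 0.8}


def matches(rule: dict, e1: dict, e2: dict) -> bool:
    """Return True if the rule classifies the pair as a duplicate."""
    score = sum(w * similarity(e1[attr], e2[attr]) for attr, w in rule["comparisons"])
    return score >= rule["threshold"]


def f_measure(rule: dict, gold: list) -> float:
    """F1 of the rule on a gold standard of (entity, entity, is_duplicate) triples."""
    tp = fp = fn = 0
    for e1, e2, is_duplicate in gold:
        predicted = matches(rule, e1, e2)
        if predicted and is_duplicate:
            tp += 1
        elif predicted and not is_duplicate:
            fp += 1
        elif not predicted and is_duplicate:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy gold standard: one duplicate pair and one non-duplicate pair.
GOLD = [
    ({"name": "Hotel Adlon", "city": "Berlin"},
     {"name": "hotel adlon", "city": "berlin"}, True),
    ({"name": "Alpha", "city": "Rome"},
     {"name": "Zzzzz", "city": "Oslo"}, False),
]

print(f_measure(RULE, GOLD))
```

A learning algorithm would search over rules of this kind (which attributes to compare, which weights and thresholds to use) and keep those that score best on the gold standard.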
Rule Extraction by Genetic Programming with Clustered Terminal Symbols
When Genetic Programming (GP) is applied to rule extraction from databases, the attributes of the data are often used as the terminal symbols. However, for a database with a large number of attributes, the search space becomes vast because the terminal set grows, and search performance declines as a result. To improve search performance, we propose new methods for dealing with large-scale terminal sets. In these methods, the terminal symbols are clustered based on the similarities of the attributes. At the beginning of the search, the number of terminal symbols is reduced so that a rough and rapid search is performed. In the later stage of the search, the original attributes are used as terminal symbols and a local search is performed. Compared with conventional GP, the proposed methods evolved faster and extracted more accurate classification rules.
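The two-stage idea of a coarse terminal set followed by the full terminal set can be sketched as follows. The abstract does not say which similarity measure or clustering algorithm is used, so the Pearson correlation, the greedy grouping, the threshold of 0.9, and the sample columns are all assumptions chosen for illustration.

```python
import math


def pearson(x: list, y: list) -> float:
    """Pearson correlation of two equally long numeric columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cluster_attributes(columns: dict, threshold: float = 0.9) -> list:
    """Greedy clustering: an attribute joins the first cluster whose
    representative (its first member) it correlates with strongly."""
    clusters = []
    for name, values in columns.items():
        for cluster in clusters:
            representative = columns[cluster[0]]
            if abs(pearson(values, representative)) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Hypothetical data set: two near-redundant attributes and one unrelated one.
columns = {
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "age": [30, 22, 41, 35],
}

clusters = cluster_attributes(columns)
coarse_terminals = [cluster[0] for cluster in clusters]  # early, rough search
fine_terminals = list(columns)                           # later, local search

print(clusters)
print(coarse_terminals, fine_terminals)
```

A GP run would draw its terminal symbols from `coarse_terminals` during the early generations and switch to `fine_terminals` later on. When to switch is a design choice that this sketch does not address.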
LODE: Linking Digital Humanities Content to the Web of Data
Numerous digital humanities projects maintain their data collections in the
form of text, images, and metadata. While data may be stored in many formats,
from plain text to XML to relational databases, the use of the resource
description framework (RDF) as a standardized representation has gained
considerable traction during the last five years. Almost every digital
humanities meeting has at least one session concerned with the topic of digital
humanities, RDF, and linked data. While most existing work in linked data has
focused on improving algorithms for entity matching, the aim of the
LinkedHumanities project is to build digital humanities tools that work "out of
the box," enabling their use by humanities scholars, computer scientists,
librarians, and information scientists alike. With this paper, we report on the
Linked Open Data Enhancer (LODE) framework developed as part of the
LinkedHumanities project. LODE helps non-technical users enrich a
local RDF repository with high-quality data from the Linked Open Data cloud.
LODE links and enhances the local RDF repository without compromising the
quality of the data. In particular, LODE supports the user in the enhancement
and linking process by providing intuitive user-interfaces and by suggesting
high-quality linking candidates using tailored matching algorithms. We hope
that the LODE framework will be useful to digital humanities scholars,
complementing other digital humanities tools.
Discrete and fuzzy dynamical genetic programming in the XCSF learning classifier system
A number of representation schemes have been presented for use within
learning classifier systems, ranging from binary encodings to neural networks.
This paper presents results from an investigation into using discrete and fuzzy
dynamical system representations within the XCSF learning classifier system. In
particular, asynchronous random Boolean networks are used to represent the
traditional condition-action production system rules in the discrete case and
asynchronous fuzzy logic networks in the continuous-valued case. It is shown to
be possible to use self-adaptive, open-ended evolution to design an ensemble of
such dynamical systems within XCSF to solve a number of well-known test
problems.
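For readers unfamiliar with asynchronous random Boolean networks, the sketch below shows only the basic update mechanism: each node reads a fixed set of input nodes through a random truth table, and a single randomly chosen node is updated per step. The class name `AsyncRBN` and its parameters are hypothetical. How the paper maps such networks onto classifier conditions and actions within XCSF is not described in the abstract and is not modeled here.

```python
import random


class AsyncRBN:
    """Minimal asynchronous random Boolean network (illustrative only)."""

    def __init__(self, n_nodes: int, k_inputs: int, rng: random.Random):
        self.rng = rng
        # Random initial state, wiring, and truth tables.
        self.state = [rng.randint(0, 1) for _ in range(n_nodes)]
        self.inputs = [
            [rng.randrange(n_nodes) for _ in range(k_inputs)]
            for _ in range(n_nodes)
        ]
        self.tables = [
            [rng.randint(0, 1) for _ in range(2 ** k_inputs)]
            for _ in range(n_nodes)
        ]

    def step(self) -> None:
        """Update one randomly chosen node from the current state of its inputs."""
        node = self.rng.randrange(len(self.state))
        index = 0
        for source in self.inputs[node]:
            index = (index << 1) | self.state[source]
        self.state[node] = self.tables[node][index]


net = AsyncRBN(n_nodes=8, k_inputs=2, rng=random.Random(0))
for _ in range(100):
    net.step()
print(net.state)
```

Because only one node changes per step, the trajectory depends on the update order, which is what distinguishes this asynchronous scheme from the synchronous one, where all nodes are updated at once.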
Next steps in implementing Kaput's research programme
We explore some key constructs and research themes initiated by Jim Kaput, and attempt to illuminate them further with reference to our own research. These 'design principles' focus on the evolution of digital representations since the early nineties, and we attempt to take forward our collective understanding of the cognitive and cultural affordances they offer. There are two main organising ideas for the paper. The first centres around Kaput's notion of outsourcing of processing power, and explores the implications of this for mathematical learning. We argue that a key component for design is to create visible, transparent views of outsourcing, a transparency without which there may be as many pitfalls as opportunities for mathematical learning. The second organising idea is that of communication, a key notion for Kaput, and the importance of designing for communication in ways that recognise the mutual influence of tools for communication and for mathematical expression.
On Feeding Business Systems with Linked Resources from the Web of Data
Business systems that are fed with data from the Web of Data require transparent interoperability. The Linked Data principles establish that different resources that represent the same real-world entities must be linked for this purpose. Link rules are paramount to transparent interoperability since they produce the links between resources. State-of-the-art link rules are learnt by genetic programming and build on comparing the values of the attributes of the resources. Unfortunately, this approach falls short in cases in which resources have similar values for their attributes but represent different real-world entities. In this paper, we present a proposal that leverages a genetic programming algorithm that learns link rules and an ad-hoc filtering technique that boosts them to decide whether the links that they produce must be selected or not. Our analysis of the literature reveals that our approach is novel, and our experimental analysis confirms that it helps improve the F1 score by increasing precision without a significant penalty on recall.
Ministerio de Economía y Competitividad TIN2013-40848-R; Ministerio de Economía y Competitividad TIN2016- 75394-