Instance selection for simplified decision trees through the generation and selection of instance candidate subsets
Decision trees are a useful tool to help in the extraction of information from databases, but all too often this ability is clouded by the complexity of the tree structure resulting from the decision tree algorithm. Methods such as tree pruning, attribute selection, and most recently, instance selection, currently exist to simplify the decision tree structure. We present an alternative instance selection procedure for simplifying decision trees that improves upon previous methods by increasing the quality of the space to be traversed for finding an acceptably simplified decision tree through the identification and grouping of important instances. Experimental results from this procedure are then presented and compared to decision trees with no prior simplification effort applied. We show that in some cases we are indeed able to identify important groups of instances, and subsequently are able to generate a high-quality solution space for finding simplified decision trees.
Instance selection for model-based classifiers
Aspects of a classifier's training dataset can often make building a helpful and high-accuracy classifier difficult. Instance selection addresses some of the issues in a dataset by selecting a subset of the data in such a way that learning from the reduced dataset leads to a better classifier. This work introduces an integer programming formulation of instance selection that relies on column generation techniques to obtain a good solution to the problem. Experimental results show that instance selection improves the usefulness of some classifiers by optimizing the training data so that the training dataset has easier-to-learn boundaries between class values. Also included in this paper are two case studies from the Surveillance, Epidemiology, and End Results (SEER) database that further confirm the benefit of instance selection. Overall, results indicate that performing instance selection for a classifier is a competitive classification approach. However, it should be noted that instance selection might overfit classifiers that have already achieved a good fit to the dataset.
Learning Expressive Linkage Rules for Entity Matching using Genetic Programming
A central problem in data integration and data cleansing is to identify
pairs of entities in data sets that describe the same real-world object.
Many existing methods for matching entities rely on explicit linkage rules,
which specify how two entities are compared for equivalence. Unfortunately,
writing accurate linkage rules by hand is a non-trivial problem that
requires detailed knowledge of the involved data sets. Another important
issue is the efficient execution of linkage rules.
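
To make the notion of an explicit linkage rule concrete, the following is a minimal sketch of a hand-written rule, not the representation used in the thesis. The property names (`name`, `city`), the weights, the threshold, and the use of `difflib` as the similarity measure are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (difflib ratio, illustrative choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def linkage_rule(e1: dict, e2: dict, threshold: float = 0.8) -> bool:
    """Hypothetical rule: weighted average of two property similarities."""
    score = (0.7 * similarity(e1["name"], e2["name"])
             + 0.3 * similarity(e1["city"], e2["city"]))
    return score >= threshold


# Made-up example entities, not taken from any data set in the thesis.
a = {"name": "Berlin Institute of Technology", "city": "Berlin"}
b = {"name": "Berlin Inst. of Technology", "city": "berlin"}
c = {"name": "University of Mannheim", "city": "Mannheim"}

same_ab = linkage_rule(a, b)  # near-duplicate names, same city
same_ac = linkage_rule(a, c)  # clearly different entities
```

Even in this small sketch, the choice of properties, weights, and threshold depends on knowledge of the data, which is the manual effort that learned linkage rules aim to remove.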
In this thesis, we propose a set of novel methods that cover the complete
entity matching workflow from the generation of linkage rules using genetic
programming algorithms to their efficient execution on distributed systems.
First, we propose a supervised learning algorithm that is capable of
generating linkage rules from a gold standard consisting of a set of entity
pairs that have been labeled as duplicates or non-duplicates. We show that
the introduced algorithm outperforms previously proposed entity matching
approaches including the state-of-the-art genetic programming approach by
de Carvalho et al. and is capable of learning linkage rules that achieve
accuracy similar to that of the human-written rule for the same problem.
In order to also cover use cases for which no gold standard is available,
we propose a complementary active learning algorithm that generates a gold
standard interactively by asking the user to confirm or decline the
equivalence of a small number of entity pairs. In the experimental
evaluation, labeling at most 50 link candidates was necessary in order to
match the performance that is achieved by the supervised GenLink algorithm
on the entire gold standard.
Finally, we propose an efficient execution workflow that can be run on a
cluster of multiple machines. The execution workflow employs a novel
multidimensional indexing method that allows the efficient execution of
learned linkage rules by reducing the number of required comparisons
significantly.
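
The general idea of reducing the number of comparisons through indexing can be illustrated with simple single-key blocking. This is a swapped-in, much simpler technique than the multidimensional indexing method proposed in the thesis, which is not reproduced here. The function name `candidate_pairs`, the blocking key, and the sample entities are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(entities, key):
    """Group entities by a blocking key and return only within-block index pairs."""
    blocks = defaultdict(list)
    for idx, entity in enumerate(entities):
        blocks[key(entity)].append(idx)
    pairs = []
    for members in blocks.values():
        # Entities in different blocks are never compared.
        pairs.extend(combinations(members, 2))
    return pairs


# Made-up sample data for illustration only.
entities = [
    {"name": "Anna Schmidt"},
    {"name": "Anna Schmitt"},
    {"name": "Bernd Maier"},
    {"name": "Bernd Meier"},
]

# Assumed blocking key: lowercase first letter of the name.
pairs = candidate_pairs(entities, key=lambda e: e["name"][0].lower())
all_pairs = len(entities) * (len(entities) - 1) // 2
```

With this key, only pairs that share a block are passed on to the linkage rule, so `len(pairs)` is smaller than `all_pairs`. A single crude key like this can miss true matches that fall into different blocks, which is one motivation for more careful indexing schemes.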