6 research outputs found

    Instance selection for simplified decision trees through the generation and selection of instance candidate subsets

    Decision trees are a useful tool for extracting information from databases, but all too often this ability is clouded by the complexity of the tree structure that the decision tree algorithm produces. Methods such as tree pruning, attribute selection, and, most recently, instance selection currently exist to simplify the decision tree structure. We present an alternative instance selection procedure for simplifying decision trees. It improves upon previous methods by identifying and grouping important instances, which raises the quality of the space that must be traversed to find an acceptably simplified decision tree. Experimental results from this procedure are then presented and compared to decision trees with no prior simplification effort applied. We show that in some cases we are indeed able to identify important groups of instances, and are subsequently able to generate a high-quality solution space for finding simplified decision trees.

    Instance selection for model-based classifiers

    Aspects of a classifier's training dataset can often make building a helpful, high-accuracy classifier difficult. Instance selection addresses some of the issues in a dataset by selecting a subset of the data in such a way that learning from the reduced dataset leads to a better classifier. This work introduces an integer programming formulation of instance selection that relies on column generation techniques to obtain a good solution to the problem. Experimental results show that instance selection improves the usefulness of some classifiers by optimizing the training data so that the training dataset has easier-to-learn boundaries between class values. Also included in this paper are two case studies from the Surveillance, Epidemiology, and End Results (SEER) database that further confirm the benefit of instance selection. Overall, results indicate that performing instance selection for a classifier is a competitive classification approach. However, it should be noted that instance selection might overfit classifiers that have already achieved a good fit to the dataset.
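
    As a rough illustration of the general idea of instance selection (choosing a training subset so that the resulting classifier performs better), the sketch below uses a simple greedy wrapper around a 1-nearest-neighbour classifier. This is an assumption made for illustration only: it is not the integer programming or column generation formulation described in the abstract, and the names `predict_1nn`, `accuracy`, and `greedy_instance_selection` are hypothetical.

```python
from typing import List, Sequence, Tuple

# A labelled instance: (feature vector, class label).
Point = Tuple[Sequence[float], int]


def predict_1nn(train: List[Point], x: Sequence[float]) -> int:
    """Return the label of the training instance closest to x (squared Euclidean distance)."""
    nearest = min(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
    return nearest[1]


def accuracy(train: List[Point], valid: List[Point]) -> float:
    """Fraction of validation instances that a 1-NN classifier built on `train` labels correctly."""
    correct = sum(predict_1nn(train, x) == y for x, y in valid)
    return correct / len(valid)


def greedy_instance_selection(train: List[Point], valid: List[Point]) -> List[Point]:
    """Greedy stand-in for instance selection (NOT the paper's method):
    repeatedly drop one training instance whose removal strictly improves
    validation accuracy, until no single removal helps."""
    selected = list(train)
    best = accuracy(selected, valid)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for i in range(len(selected)):
            candidate = selected[:i] + selected[i + 1:]
            score = accuracy(candidate, valid)
            if score > best:
                selected, best = candidate, score
                improved = True
                break  # restart the scan on the reduced training set
    return selected


# Toy data: the instance at 0.5 carries a noisy label and blurs the class boundary.
train = [((0.0,), 0), ((1.0,), 0), ((2.0,), 1), ((3.0,), 1), ((0.5,), 1)]
valid = [((0.4,), 0), ((0.6,), 0), ((2.5,), 1), ((0.1,), 0)]
reduced = greedy_instance_selection(train, valid)
```

    Because the subset is chosen against a validation set, a wrapper like this can overfit to that set, which echoes the abstract's caveat about classifiers that already fit the data well.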

    Learning Expressive Linkage Rules for Entity Matching using Genetic Programming

    A central problem in data integration and data cleansing is to identify pairs of entities in data sets that describe the same real-world object. Many existing methods for matching entities rely on explicit linkage rules, which specify how two entities are compared for equivalence. Unfortunately, writing accurate linkage rules by hand is a non-trivial problem that requires detailed knowledge of the involved data sets. Another important issue is the efficient execution of linkage rules. In this thesis, we propose a set of novel methods that cover the complete entity matching workflow, from the generation of linkage rules using genetic programming algorithms to their efficient execution on distributed systems. First, we propose a supervised learning algorithm that is capable of generating linkage rules from a gold standard consisting of a set of entity pairs that have been labeled as duplicates or non-duplicates. We show that the introduced algorithm outperforms previously proposed entity matching approaches, including the state-of-the-art genetic programming approach by de Carvalho et al., and is capable of learning linkage rules that achieve an accuracy similar to that of the human-written rule for the same problem. In order to also cover use cases for which no gold standard is available, we propose a complementary active learning algorithm that generates a gold standard interactively by asking the user to confirm or decline the equivalence of a small number of entity pairs. In the experimental evaluation, labeling at most 50 link candidates was necessary in order to match the performance that is achieved by the supervised GenLink algorithm on the entire gold standard. Finally, we propose an efficient execution workflow that can be run on a cluster of multiple machines. The execution workflow employs a novel multidimensional indexing method that allows the efficient execution of learned linkage rules by significantly reducing the number of required comparisons.
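
    To make the notion of a linkage rule concrete, here is a minimal hand-written sketch of a rule that compares two entities for equivalence. Everything in it is an assumption for illustration: the field names `name` and `city`, the 0.8 threshold, the use of `difflib.SequenceMatcher` as the similarity measure, and the minimum aggregation are hypothetical choices, not the rule representation that GenLink learns.

```python
from difflib import SequenceMatcher
from typing import Dict


def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings after lowercasing and trimming."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def linkage_rule(e1: Dict[str, str], e2: Dict[str, str], threshold: float = 0.8) -> bool:
    """Hypothetical linkage rule: two entities match when BOTH their names
    and their cities are sufficiently similar (minimum aggregation)."""
    name_sim = string_similarity(e1["name"], e2["name"])
    city_sim = string_similarity(e1["city"], e2["city"])
    return min(name_sim, city_sim) >= threshold


# Illustrative entities (made-up records, not taken from any data set in the thesis).
a = {"name": "Berlin Hauptbahnhof", "city": "Berlin"}
b = {"name": "berlin hauptbahnhof ", "city": "berlin"}
c = {"name": "Paris Gare du Nord", "city": "Paris"}

same = linkage_rule(a, b)       # identical after normalisation
different = linkage_rule(a, c)  # names and cities both differ
```

    Writing such rules by hand means choosing the fields, similarity measures, thresholds, and aggregation manually, which is the effort the learning algorithms in the abstract aim to remove.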