5,043 research outputs found

    A Clustering-Based Algorithm for Data Reduction

    Finding an efficient data reduction method for large-scale problems is an imperative task. In this paper, we propose a similarity-based self-constructing fuzzy clustering algorithm to sample instances for the classification task. Instances that are similar to each other are grouped into the same cluster. When all the instances have been fed in, a number of clusters have been formed automatically. The statistical mean of each cluster is then taken to represent all the instances it covers. This approach has two advantages. First, it is faster and uses less storage memory. Second, the number of new representative instances need not be specified in advance by the user. Experiments on real-world datasets show that our method runs faster and obtains better reduction rates than other methods.
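
    A minimal sketch of the general idea, not the paper's exact self-constructing fuzzy procedure, is given below: instances are fed in one at a time, each joins the most similar existing cluster of its own class when the similarity clears a threshold, and otherwise starts a new cluster, so the number of clusters emerges automatically and each cluster mean becomes a representative instance. The per-class grouping, the Gaussian similarity, and the parameters rho and sigma are illustrative assumptions, not values from the paper.

    import numpy as np

    def reduce_by_clustering(X, y, rho=0.5, sigma=1.0):
        """Sketch of clustering-based instance reduction (not the paper's algorithm)."""
        centers, labels, counts = [], [], []
        for xi, yi in zip(X, y):
            # Gaussian similarity of xi to every existing center of the same class
            best, best_sim = None, 0.0
            for j, (c, cl) in enumerate(zip(centers, labels)):
                if cl != yi:
                    continue
                sim = np.exp(-np.sum((xi - c) ** 2) / (2 * sigma ** 2))
                if sim > best_sim:
                    best, best_sim = j, sim
            if best is not None and best_sim >= rho:
                # similar enough: fold the instance into the cluster mean incrementally
                counts[best] += 1
                centers[best] += (xi - centers[best]) / counts[best]
            else:
                # otherwise a new cluster is created automatically
                centers.append(xi.astype(float).copy())
                labels.append(yi)
                counts.append(1)
        # each cluster mean stands in for all the instances it covers
        return np.array(centers), np.array(labels)

    A 1-NN classifier trained on the returned centers and labels then replaces one trained on the full data set, which is where the storage and speed savings come from.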

    MRPR: a MapReduce solution for prototype reduction in big data classification

    In the era of big data, analyzing and extracting knowledge from large-scale data sets is a very interesting and challenging task. The application of standard data mining tools to such data sets is not straightforward. Hence, a new class of scalable mining methods that embraces the huge storage and processing capacity of cloud platforms is required. In this work, we propose a novel distributed partitioning methodology for prototype reduction techniques in nearest neighbor classification. These methods aim at representing original training data sets as a reduced number of instances. Their main purposes are to speed up the classification process and to reduce the storage requirements and sensitivity to noise of the nearest neighbor rule. However, standard prototype reduction methods cannot cope with very large data sets. To overcome this limitation, we develop a MapReduce-based framework to distribute the functioning of these algorithms across a cluster of computing elements, proposing several algorithmic strategies to integrate multiple partial solutions (reduced sets of prototypes) into a single one. The proposed model enables prototype reduction algorithms to be applied to big data classification problems without significant accuracy loss. We test the speed-up capabilities of our model with data sets of up to 5.7 million instances. The results show that this model is a suitable tool to enhance the performance of the nearest neighbor classifier with big data.
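
    The single-machine sketch below illustrates only the partition-and-merge idea described above, not the authors' MRPR framework: the training set is split into chunks, a reduction routine runs independently on each chunk (the map phase), and the partial prototype sets are fused into one (the reduce phase). The per-chunk reducer (one mean per class) and the concatenation-based fusion are deliberately naive placeholders for the paper's algorithmic strategies.

    import numpy as np
    from multiprocessing import Pool

    def reduce_chunk(chunk):
        # placeholder reducer: keep one mean per class present in the chunk
        X, y = chunk
        protos, labels = [], []
        for c in np.unique(y):
            protos.append(X[y == c].mean(axis=0))
            labels.append(c)
        return np.array(protos), np.array(labels)

    def mapreduce_prototype_reduction(X, y, n_partitions=4):
        # split the training set into random, roughly equal partitions
        idx = np.array_split(np.random.permutation(len(X)), n_partitions)
        chunks = [(X[i], y[i]) for i in idx]
        # "map" phase: reduce every partition independently
        # (guard with `if __name__ == "__main__":` on platforms that spawn processes)
        with Pool(n_partitions) as pool:
            partials = pool.map(reduce_chunk, chunks)
        # "reduce" phase: naive fusion by concatenating the partial prototype sets
        P = np.vstack([p for p, _ in partials])
        L = np.concatenate([l for _, l in partials])
        return P, L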

    Investigation on prototype learning.

    Keung Chi-Kin. Thesis (M.Phil.) -- Chinese University of Hong Kong, 2000. Includes bibliographical references (leaves 128-135). Abstracts in English and Chinese.

    Contents:
    Chapter 1  Introduction
        1.1  Classification
        1.2  Instance-Based Learning
            1.2.1  Three Basic Components
            1.2.2  Advantages
            1.2.3  Disadvantages
        1.3  Thesis Contributions
        1.4  Thesis Organization
    Chapter 2  Background
        2.1  Improving Instance-Based Learning
            2.1.1  Scaling-up Nearest Neighbor Searching
            2.1.2  Data Reduction
        2.2  Prototype Learning
            2.2.1  Objectives
            2.2.2  Two Types of Prototype Learning
        2.3  Instance-Filtering Methods
            2.3.1  Retaining Border Instances
            2.3.2  Removing Border Instances
            2.3.3  Retaining Center Instances
            2.3.4  Advantages
            2.3.5  Disadvantages
        2.4  Instance-Abstraction Methods
            2.4.1  Advantages
            2.4.2  Disadvantages
        2.5  Other Methods
        2.6  Summary
    Chapter 3  Integration of Filtering and Abstraction
        3.1  Incremental Integration
            3.1.1  Motivation
            3.1.2  The Integration Method
            3.1.3  Issues
        3.2  Concept Integration
            3.2.1  Motivation
            3.2.2  The Integration Method
            3.2.3  Issues
        3.3  Difference between Integration Methods and Composite Classifiers
    Chapter 4  The PGF Framework
        4.1  The PGF1 Algorithm
            4.1.1  Instance-Filtering Component
            4.1.2  Instance-Abstraction Component
        4.2  The PGF2 Algorithm
        4.3  Empirical Analysis
            4.3.1  Experimental Setup
            4.3.2  Results of PGF Algorithms
            4.3.3  Analysis of PGF1
            4.3.4  Analysis of PGF2
            4.3.5  Overall Behavior of PGF
            4.3.6  Comparisons with Other Approaches
        4.4  Time Complexity
            4.4.1  Filtering Components
            4.4.2  Abstraction Component
            4.4.3  PGF Algorithms
        4.5  Summary
    Chapter 5  Integrated Concept Prototype Learner
        5.1  Motivation
        5.2  Abstraction Component
            5.2.1  Issues for Abstraction
            5.2.2  Investigation on Typicality
            5.2.3  Typicality in Abstraction
            5.2.4  The TPA Algorithm
            5.2.5  Analysis of TPA
        5.3  Filtering Component
            5.3.1  Investigation on Associate
            5.3.2  The RT2 Algorithm
            5.3.3  Analysis of RT2
        5.4  Concept Integration
            5.4.1  The ICPL Algorithm
            5.4.2  Analysis of ICPL
        5.5  Empirical Analysis
            5.5.1  Experimental Setup
            5.5.2  Results of ICPL Algorithm
            5.5.3  Comparisons with Pure Abstraction and Pure Filtering
            5.5.4  Comparisons with Other Approaches
        5.6  Time Complexity
        5.7  Summary
    Chapter 6  Conclusions and Future Work
        6.1  Conclusions
        6.2  Future Work
    Bibliography
    Appendix A  Detailed Information for Tested Data Sets
    Appendix B  Detailed Experimental Results for PGF

    Techniques for data pattern selection and abstraction

    This thesis concerns the problem of prototype reduction in instance-based learning. In order to deal with problems such as storage requirements, sensitivity to noise and computational complexity, various algorithms have been presented that condense the number of stored prototypes while maintaining competent classification accuracy. Instance selection, which recovers a smaller subset of the original training set, is the most widely used technique for instance reduction, but prototype abstraction, which generates new prototypes to replace the initial ones, has also gained considerable interest recently. The major contribution of this work is the proposal of four novel frameworks for prototype reduction: the Class Boundary Preserving algorithm (CBP), a hybrid method that uses both selection and generation of prototypes; Instance Seriation for Prototype Abstraction (ISPA), an abstraction algorithm; and two selective techniques, Spectral Instance Reduction (SIR) and Direct Weight Optimization (DWO). CBP is a multi-stage method based on a simple heuristic that is very effective in identifying samples close to class borders. A noise filter removes harmful instances, while the heuristic determines the geometrical distribution of patterns around every instance; together with the concepts of nearest enemy pairs and mean shift clustering, the algorithm decides on the final set of retained prototypes. DWO is a selection model whose output set of prototypes is decided by a set of binary weights. These weights are computed according to an objective function composed of the ratio between the nearest friend and nearest enemy of every sample; to obtain good-quality results, DWO is optimized using a genetic algorithm. ISPA is an abstraction technique that employs the concept of data seriation to organize instances in an arrangement that favours merging between them; as a result, a new set of prototypes is created. The SIR algorithm derives a set of border discriminating features (BDFs) that depict the local distribution of friends and enemies of all samples; these are then used, along with spectral graph theory, to partition the training set into border and internal instances. Results show that CBP, SIR and DWO, the three major algorithms presented in this thesis, are competent and efficient in terms of at least one of the two basic objectives, classification accuracy and condensation ratio, and the comparison against other successful condensation algorithms illustrates the competitiveness of the proposed models.
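
    As a hedged illustration of the quantities the DWO objective is described as using, the sketch below computes, for every sample, the distance to its nearest friend (closest same-class neighbour) and its nearest enemy (closest other-class neighbour); their ratio gives a rough border score. The genetic-algorithm optimization of the binary selection weights is not reproduced, and the scoring function is an assumption rather than the thesis's exact formulation.

    import numpy as np

    def friend_enemy_distances(X, y):
        # pairwise Euclidean distances between all samples
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(D, np.inf)                      # a sample is not its own neighbour
        same = y[:, None] == y[None, :]
        friend = np.where(same, D, np.inf).min(axis=1)   # nearest same-class distance
        enemy = np.where(~same, D, np.inf).min(axis=1)   # nearest other-class distance
        return friend, enemy

    def border_scores(X, y):
        friend, enemy = friend_enemy_distances(X, y)
        # a ratio near or above 1 suggests the sample lies close to a class border
        return friend / enemy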

    Forgetting Exceptions is Harmful in Language Learning

    We show that in language learning, contrary to received wisdom, keeping exceptional training instances in memory can be beneficial for generalization accuracy. We investigate this phenomenon empirically on a selection of benchmark natural language processing tasks: grapheme-to-phoneme conversion, part-of-speech tagging, prepositional-phrase attachment, and base noun phrase chunking. In a first series of experiments we combine memory-based learning with training set editing techniques, in which instances are edited based on their typicality and class prediction strength. Results show that editing exceptional instances (those with low typicality or low class prediction strength) tends to harm generalization accuracy. In a second series of experiments we compare memory-based learning and decision-tree learning methods on the same selection of tasks, and find that decision-tree learning often performs worse than memory-based learning. Moreover, the decrease in performance can be linked to the degree of abstraction from exceptions (i.e., pruning or eagerness). We provide explanations for both results in terms of the properties of the natural language processing tasks and the learning algorithms. Pre-print version of an article to appear in Machine Learning 11:1-3, Special Issue on Natural Language Learning.
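
    The following is a minimal sketch of the editing operation evaluated in the first series of experiments, assuming that a simple neighbourhood-agreement score can stand in for the paper's typicality and class-prediction-strength criteria: instances whose k nearest neighbours mostly disagree with their own class are treated as exceptional and removed before a memory-based (k-NN) learner is trained. Since the paper's finding is that this kind of editing tends to hurt generalization, the code shows the operation under study rather than a recommended practice; the score, k, and threshold are illustrative assumptions.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

    def edit_exceptions(X, y, k=5, threshold=0.5):
        # X, y are numpy arrays; score each instance by how many of its
        # k nearest neighbours share its class
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        _, idx = nn.kneighbors(X)               # the first neighbour is the point itself
        agreement = (y[idx[:, 1:]] == y[:, None]).mean(axis=1)
        keep = agreement >= threshold           # "exceptional" instances fall below
        return X[keep], y[keep]

    # usage: compare a memory-based learner trained on the full vs. the edited memory
    # X_ed, y_ed = edit_exceptions(X_train, y_train)
    # KNeighborsClassifier(n_neighbors=1).fit(X_ed, y_ed)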