A Clustering-Based Algorithm for Data Reduction
Finding an efficient data reduction method for large-scale problems is an imperative task. In this paper, we propose a similarity-based self-constructing fuzzy clustering algorithm to sample instances for the classification task. Instances that are similar to each other are grouped into the same cluster. Once all the instances have been fed in, a number of clusters have formed automatically. The statistical mean of each cluster is then taken to represent all the instances the cluster covers. This approach has two advantages. First, it is faster and uses less storage memory. Second, the number of representative instances need not be specified in advance by the user. Experiments on real-world datasets show that our method runs faster and obtains a better reduction rate than other methods.
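As a rough illustration of the idea (not the paper's fuzzy algorithm — the distance measure, threshold, and function names here are assumptions), a single pass over the instances can build clusters incrementally, assigning each instance to the nearest cluster of its class within a similarity threshold and keeping each cluster's mean as a representative:

```python
import math

def reduce_by_clustering(points, labels, threshold=1.0):
    """Cluster-mean instance reduction (simplified sketch): feed
    instances in one by one; an instance joins the nearest same-class
    cluster whose mean lies within `threshold`, otherwise it starts a
    new cluster. Each cluster mean then stands in for all the
    instances it covers."""
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    reps = []  # list of (mean, class, count) per cluster
    for x, cls in zip(points, labels):
        candidates = [(dist(x, m), i)
                      for i, (m, c, _) in enumerate(reps) if c == cls]
        if candidates:
            d, i = min(candidates)
            if d <= threshold:
                m, c, n = reps[i]
                # online update of the cluster mean
                reps[i] = ([mi + (xi - mi) / (n + 1)
                            for mi, xi in zip(m, x)], c, n + 1)
                continue
        reps.append((list(x), cls, 1))
    return [(m, c) for m, c, _ in reps]
```

Note that, as in the abstract's second advantage, the number of representatives is never specified by the user: it emerges from the threshold and the data.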
Incremental learning of independent, overlapping, and graded concept descriptions with an instance-based process framework
Supervised learning algorithms make several simplifying assumptions concerning the characteristics of the concept descriptions to be learned. For example, concepts are often assumed to be (1) defined with respect to the same set of relevant attributes, (2) disjoint in instance space, and (3) to have uniform instance distributions. While these assumptions constrain the learning task, they unfortunately limit an algorithm's applicability. We believe that supervised learning algorithms should learn attribute relevancies independently for each concept, allow instances to be members of any subset of concepts, and represent graded concept descriptions. This paper introduces a process framework for instance-based learning algorithms that exploit only specific instance and performance feedback information to guide their concept learning processes. We also introduce Bloom, a specific instantiation of this framework. Bloom is a supervised, incremental, instance-based learning algorithm that learns relative attribute relevancies independently for each concept, allows instances to be members of any subset of concepts, and represents graded concept memberships. We describe empirical evidence to support our claims that Bloom can learn independent, overlapping, and graded concept descriptions.
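A minimal sketch of the kind of representation argued for above: per-concept attribute weights and graded, possibly overlapping memberships. The similarity function, weight values, and names below are illustrative assumptions, not Bloom's actual update rules:

```python
import math

def graded_memberships(stored, weights, query):
    """Graded concept memberships via per-concept weighted similarity
    (a sketch in the spirit of Bloom, not the algorithm itself): each
    concept carries its own attribute weights, instances may belong
    to any subset of concepts, and membership is the similarity to
    the nearest stored member rather than a hard yes/no."""
    def sim(w, a, b):
        # weighted squared distance mapped into (0, 1]
        return math.exp(-sum(wi * (u - v) ** 2
                             for wi, u, v in zip(w, a, b)))

    result = {}
    for concept, members in stored.items():
        w = weights[concept]  # attribute relevancies for this concept only
        result[concept] = max(sim(w, x, query) for x in members)
    return result
```

Because memberships are computed per concept and need not sum to one, a query can score highly for several concepts at once — the overlapping-concepts property the paper emphasizes.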
MRPR: a MapReduce solution for prototype reduction in big data classification
In the era of big data, analyzing and extracting knowledge from large-scale data sets is an interesting and challenging task. Applying standard data mining tools to such data sets is not straightforward. Hence, a new class of scalable mining methods that embrace the huge storage and processing capacity of cloud platforms is required. In this work, we propose a novel distributed partitioning methodology for prototype reduction techniques in nearest neighbor classification. These methods aim to represent original training data sets with a reduced number of instances. Their main purposes are to speed up the classification process and to reduce the storage requirements and noise sensitivity of the nearest neighbor rule. However, standard prototype reduction methods cannot cope with very large data sets. To overcome this limitation, we develop a MapReduce-based framework that distributes the functioning of these algorithms across a cluster of computing elements, and we propose several algorithmic strategies to integrate multiple partial solutions (reduced sets of prototypes) into a single one. The proposed model enables prototype reduction algorithms to be applied to big data classification problems without significant accuracy loss. We test the speedup capabilities of our model with data sets of up to 5.7 million instances. The results show that this model is a suitable tool for enhancing the performance of the nearest neighbor classifier with big data.
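The partition-then-merge structure can be sketched as follows. The per-partition reducer here (one mean per class) is a deliberately trivial stand-in for a real prototype reduction method, and the merge shown is plain concatenation — the simplest of the joining strategies one could use; all names are illustrative:

```python
from collections import defaultdict

def per_class_means(chunk):
    """Toy prototype reducer for one partition: one mean per class.
    Any real prototype reduction method could be plugged in here."""
    sums = defaultdict(lambda: None)
    counts = defaultdict(int)
    for x, y in chunk:
        sums[y] = list(x) if sums[y] is None else [a + b for a, b in zip(sums[y], x)]
        counts[y] += 1
    return [([s / counts[y] for s in sums[y]], y) for y in sums]

def mapreduce_reduction(data, n_partitions=4, reducer=per_class_means):
    """Split the training set into partitions (map phase), run the
    prototype reducer on each, then join the partial prototype sets
    (reduce phase) into a single condensed training set."""
    chunks = [data[i::n_partitions] for i in range(n_partitions)]
    prototypes = []
    for chunk in chunks:  # in a real MapReduce job these run in parallel
        prototypes.extend(reducer(chunk))
    return prototypes
```

The resulting prototype list can be handed directly to a nearest neighbor classifier in place of the full training set.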
Investigation on prototype learning.
Keung Chi-Kin. Thesis (M.Phil.)--Chinese University of Hong Kong, 2000. Includes bibliographical references (leaves 128-135). Abstracts in English and Chinese.
Chapter 1: Introduction
 1.1 Classification
 1.2 Instance-Based Learning
  1.2.1 Three Basic Components
  1.2.2 Advantages
  1.2.3 Disadvantages
 1.3 Thesis Contributions
 1.4 Thesis Organization
Chapter 2: Background
 2.1 Improving Instance-Based Learning
  2.1.1 Scaling-up Nearest Neighbor Searching
  2.1.2 Data Reduction
 2.2 Prototype Learning
  2.2.1 Objectives
  2.2.2 Two Types of Prototype Learning
 2.3 Instance-Filtering Methods
  2.3.1 Retaining Border Instances
  2.3.2 Removing Border Instances
  2.3.3 Retaining Center Instances
  2.3.4 Advantages
  2.3.5 Disadvantages
 2.4 Instance-Abstraction Methods
  2.4.1 Advantages
  2.4.2 Disadvantages
 2.5 Other Methods
 2.6 Summary
Chapter 3: Integration of Filtering and Abstraction
 3.1 Incremental Integration
  3.1.1 Motivation
  3.1.2 The Integration Method
  3.1.3 Issues
 3.2 Concept Integration
  3.2.1 Motivation
  3.2.2 The Integration Method
  3.2.3 Issues
 3.3 Difference between Integration Methods and Composite Classifiers
Chapter 4: The PGF Framework
 4.1 The PGF1 Algorithm
  4.1.1 Instance-Filtering Component
  4.1.2 Instance-Abstraction Component
 4.2 The PGF2 Algorithm
 4.3 Empirical Analysis
  4.3.1 Experimental Setup
  4.3.2 Results of PGF Algorithms
  4.3.3 Analysis of PGF1
  4.3.4 Analysis of PGF2
  4.3.5 Overall Behavior of PGF
  4.3.6 Comparisons with Other Approaches
 4.4 Time Complexity
  4.4.1 Filtering Components
  4.4.2 Abstraction Component
  4.4.3 PGF Algorithms
 4.5 Summary
Chapter 5: Integrated Concept Prototype Learner
 5.1 Motivation
 5.2 Abstraction Component
  5.2.1 Issues for Abstraction
  5.2.2 Investigation on Typicality
  5.2.3 Typicality in Abstraction
  5.2.4 The TPA Algorithm
  5.2.5 Analysis of TPA
 5.3 Filtering Component
  5.3.1 Investigation on Associate
  5.3.2 The RT2 Algorithm
  5.3.3 Analysis of RT2
 5.4 Concept Integration
  5.4.1 The ICPL Algorithm
  5.4.2 Analysis of ICPL
 5.5 Empirical Analysis
  5.5.1 Experimental Setup
  5.5.2 Results of ICPL Algorithm
  5.5.3 Comparisons with Pure Abstraction and Pure Filtering
  5.5.4 Comparisons with Other Approaches
 5.6 Time Complexity
 5.7 Summary
Chapter 6: Conclusions and Future Work
 6.1 Conclusions
 6.2 Future Work
Bibliography
Appendix A: Detailed Information for Tested Data Sets
Appendix B: Detailed Experimental Results for PGF
A study of instance-based algorithms for supervised learning tasks : mathematical, empirical, and psychological evaluations
This dissertation introduces a framework for specifying instance-based algorithms that can solve supervised learning tasks. These algorithms input a sequence of instances and yield a partial concept description, which is represented by a set of stored instances and associated information. This description can be used to predict values for subsequently presented instances. The thesis of this framework is that extensional concept descriptions and lazy generalization strategies can support efficient supervised learning behavior. The instance-based learning framework consists of three components. The pre-processor component transforms an instance into a more palatable form for the performance component, which computes the instance's similarity with a set of stored instances and yields a prediction for its target value(s). The similarity and prediction functions thereby impose generalizations on the stored instances to inductively derive predictions. The learning component assesses the accuracy of these predictions and updates partial concept descriptions to improve their predictive accuracy. This framework is evaluated in four ways. First, its generality is evaluated by mathematically determining the classes of symbolic concepts and numeric functions that can be closely approximated by IB_1, a simple algorithm specified by this framework. Second, the framework is empirically evaluated for its ability to specify algorithms that improve IB_1's learning efficiency. Significant efficiency improvements are obtained by instance-based algorithms that, respectively, reduce storage requirements, tolerate noisy data, and learn domain-specific similarity functions. Alternative component definitions for these algorithms are empirically analyzed in a set of five high-level parameter studies. Third, the framework is evaluated for its ability to specify psychologically plausible process models for categorization tasks. Results from subject experiments indicate a positive correlation between a model's ability to utilize attribute correlation information and its ability to explain psychological phenomena. Finally, the framework is evaluated for its ability to explain and relate a dozen prominent instance-based learning systems. The survey shows that this framework requires only slight modifications to fit these highly diverse systems. Relationships with edited nearest neighbor algorithms, case-based reasoners, and artificial neural networks are also described.
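In its simplest instantiation, the performance component described above reduces to a 1-nearest-neighbor predictor. A minimal sketch, assuming Euclidean similarity and function names that are illustrative rather than taken from the dissertation:

```python
import math

def ib1_train(instances):
    """Learning component in its laziest form: store every instance
    (each as a pair of attribute tuple and target value)."""
    return list(instances)

def ib1_predict(stored, query):
    """Performance component: similarity is negative Euclidean
    distance; the prediction is the target value of the most similar
    stored instance."""
    def sim(a, b):
        return -math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    attrs, target = max(stored, key=lambda inst: sim(inst[0], query))
    return target
```

The framework's other components — pre-processing, noise tolerance, learned similarity weights — would replace or wrap these two functions without changing this basic store-and-compare shape.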
Techniques for data pattern selection and abstraction
This thesis concerns the problem of prototype reduction in instance-based learning. To deal with problems such as storage requirements, sensitivity to noise, and computational complexity, various algorithms have been presented that condense the number of stored prototypes while maintaining competent classification accuracy. Instance selection, which selects a smaller subset of the original training set, is the most widely used technique for instance reduction, but prototype abstraction, which generates new prototypes to replace the initial ones, has also gained a lot of interest recently. The major contribution of this work is the proposal of four novel frameworks for performing prototype reduction: the Class Boundary Preserving algorithm (CBP), a hybrid method that uses both selection and generation of prototypes; Instance Seriation for Prototype Abstraction (ISPA), an abstraction algorithm; and two selective techniques, Spectral Instance Reduction (SIR) and Direct Weight Optimization (DWO). CBP is a multi-stage method based on a simple heuristic that is very effective in identifying samples close to class borders. Harmful instances are removed by a noise filter, while the heuristic determines the geometrical distribution of patterns around every instance; together with the concepts of nearest enemy pairs and mean shift clustering, the algorithm decides on the final set of retained prototypes. SIR computes a set of border discriminating features (BDFs) that depict the local distribution of friends and enemies of every sample; these are then used, along with spectral graph theory, to partition the training set into border and internal instances. DWO is a selection model whose output set of prototypes is decided by a set of binary weights. These weights are computed according to an objective function based on the ratio between the nearest friend and nearest enemy of every sample; to obtain good-quality results, DWO is optimized using a genetic algorithm. ISPA is an abstraction technique that employs the concept of data seriation to organize instances in an arrangement that favours merging between them, creating a new set of prototypes. Results show that CBP, SIR and DWO, the three major algorithms presented in this thesis, are competent and efficient in terms of at least one of the two basic objectives, classification accuracy and condensation ratio. The comparison against other successful condensation algorithms illustrates the competitiveness of the proposed models.
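The nearest-friend/nearest-enemy ratio underlying DWO's objective is easy to state on its own. A sketch that only computes the per-sample ratio (the thesis's full objective and its genetic-algorithm optimization are not reproduced here, and the function name is an assumption):

```python
import math

def friend_enemy_ratio(instances):
    """For each (attributes, label) pair, return the distance to its
    nearest friend (same class) divided by the distance to its
    nearest enemy (different class). Ratios near 0 suggest an
    interior instance; ratios near 1 suggest a border instance."""
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    ratios = []
    for i, (x, y) in enumerate(instances):
        friends = [dist(x, x2) for j, (x2, y2) in enumerate(instances)
                   if j != i and y2 == y]
        enemies = [dist(x, x2) for (x2, y2) in instances if y2 != y]
        ratios.append(min(friends) / min(enemies))
    return ratios
```

A selection scheme could then favour keeping samples whose ratio marks them as border instances, which is the intuition the SIR features formalize as well.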
Forgetting Exceptions is Harmful in Language Learning
We show that in language learning, contrary to received wisdom, keeping
exceptional training instances in memory can be beneficial for generalization
accuracy. We investigate this phenomenon empirically on a selection of
benchmark natural language processing tasks: grapheme-to-phoneme conversion,
part-of-speech tagging, prepositional-phrase attachment, and base noun phrase
chunking. In a first series of experiments we combine memory-based learning
with training set editing techniques, in which instances are edited based on
their typicality and class prediction strength. Results show that editing
exceptional instances (with low typicality or low class prediction strength)
tends to harm generalization accuracy. In a second series of experiments we
compare memory-based learning and decision-tree learning methods on the same
selection of tasks, and find that decision-tree learning often performs worse
than memory-based learning. Moreover, the decrease in performance can be linked
to the degree of abstraction from exceptions (i.e., pruning or eagerness). We
provide explanations for both results in terms of the properties of the natural
language processing tasks and the learning algorithms.
Comment: 31 pages, 7 figures, 10 tables. Pre-print version of an article to appear in Machine Learning 11:1-3, Special Issue on Natural Language Learning.
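The editing criterion studied in the first series of experiments can be approximated with a simple proxy: score each instance by how often its nearest neighbours share its class, then drop the lowest-scoring (most exceptional) ones. The k value, scoring rule, and names below are assumptions, not the paper's exact typicality or class-prediction-strength measures:

```python
import math

def edit_low_strength(instances, keep_fraction=0.9, k=3):
    """Edit a memory-based learner's instance base: an instance's
    strength is the fraction of its k nearest neighbours sharing its
    class, and the lowest-strength (most exceptional) instances are
    removed -- exactly the kind of editing the paper shows can hurt
    generalization accuracy on language tasks."""
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    scored = []
    for i, (x, y) in enumerate(instances):
        others = [(dist(x, x2), y2)
                  for j, (x2, y2) in enumerate(instances) if j != i]
        others.sort(key=lambda t: t[0])
        strength = sum(1 for _, y2 in others[:k] if y2 == y) / k
        scored.append((strength, i))
    scored.sort(reverse=True)  # strongest (most typical) first
    keep = {i for _, i in scored[:int(len(instances) * keep_fraction)]}
    return [inst for i, inst in enumerate(instances) if i in keep]
```

On a toy set with one exceptional instance inside the opposite class's region, that exception is the first to be edited away — precisely the memory the paper argues is worth keeping.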