    A dissimilarity-based approach for Classification

    The Nearest Neighbor classifier has been shown to be a powerful tool for multiclass classification. In this note we explore both the theoretical properties and the empirical behavior of a variant of this method, in which the Nearest Neighbor rule is applied after selecting a set of so-called prototypes, whose cardinality is fixed in advance, by minimizing the empirical misclassification cost. In this way we alleviate two serious drawbacks of the Nearest Neighbor method: high storage requirements and time-consuming queries. The problem is shown to be NP-hard. Mixed Integer Programming (MIP) formulations are given, theoretically compared, and solved by a standard MIP solver for problem instances of small size. Large problem instances are solved by a metaheuristic that yields good classification rules in reasonable time.
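    The selection step described above can be sketched in a few lines. The snippet below is a minimal random-restart heuristic, not the authors' MIP formulation or metaheuristic: it draws candidate prototype sets of the fixed cardinality k and keeps the one with the lowest empirical 1-NN misclassification rate. The function names, the Euclidean distance, and the random-search strategy are all our own assumptions.

```python
import numpy as np

def nn_error(protos_X, protos_y, X, y):
    """Empirical misclassification rate of the 1-NN rule over a prototype set."""
    # Pairwise squared Euclidean distances: rows = training points, cols = prototypes.
    d = ((X[:, None, :] - protos_X[None, :, :]) ** 2).sum(axis=2)
    pred = protos_y[d.argmin(axis=1)]
    return (pred != y).mean()

def select_prototypes(X, y, k, n_trials=200, seed=None):
    """Pick k prototypes minimizing empirical 1-NN error (random-restart heuristic)."""
    rng = np.random.default_rng(seed)
    best_idx, best_err = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(len(X), size=k, replace=False)
        err = nn_error(X[idx], y[idx], X, y)
        if err < best_err:
            best_idx, best_err = idx, err
    return best_idx, best_err
```

    Classifying a new point then reduces to a 1-NN query against the k selected prototypes rather than the full training set, which is the storage and query-time saving the abstract refers to.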

    A Latent Source Model for Nonparametric Time Series Classification

    For classifying time series, a nearest-neighbor approach is widely used in practice, with performance often competitive with or better than more elaborate methods such as neural networks, decision trees, and support vector machines. We develop theoretical justification for the effectiveness of nearest-neighbor-like classification of time series. Our guiding hypothesis is that in many applications, such as forecasting which topics will become trends on Twitter, there are not actually that many prototypical time series relative to the number of time series we have access to; e.g., topics become trends on Twitter in only a few distinct manners, whereas we can collect massive amounts of Twitter data. To operationalize this hypothesis, we propose a latent source model for time series, which naturally leads to a "weighted majority voting" classification rule that can be approximated by a nearest-neighbor classifier. We establish nonasymptotic performance guarantees for both weighted majority voting and nearest-neighbor classification under our model, accounting for how much of the time series we observe and for the model complexity. Experimental results on synthetic data show weighted majority voting achieving the same misclassification rate as nearest-neighbor classification while observing less of the time series. We then use weighted majority voting to forecast which news topics on Twitter become trends, and are able to detect such "trending topics" in advance of Twitter 79% of the time, with a mean early advantage of 1 hour and 26 minutes, a true positive rate of 95%, and a false positive rate of 4%. Comment: Advances in Neural Information Processing Systems (NIPS 2013).
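    The weighted majority voting rule itself is simple to state. Below is a minimal sketch, assuming squared Euclidean distance over the observed prefix (the paper also allows for time shifts, which we omit here): each labeled training series votes for its label with weight exp(-gamma * distance), and the label with the largest total vote wins. As gamma grows, the vote concentrates on the closest training series, recovering the nearest-neighbor approximation the abstract mentions. All names and parameter choices are ours.

```python
import numpy as np

def weighted_majority_vote(s, train_X, train_y, gamma=1.0):
    """Classify a partially observed series s by weighted majority voting.

    s       : observed prefix of the series, shape (T,)
    train_X : fully observed training series, shape (n, T_full), T_full >= T
    train_y : labels, shape (n,)
    """
    T = len(s)
    # Squared Euclidean distance over the observed prefix only.
    d = ((train_X[:, :T] - s[None, :]) ** 2).sum(axis=1)
    w = np.exp(-gamma * d)  # each training series votes with this weight
    labels = np.unique(train_y)
    scores = np.array([w[train_y == c].sum() for c in labels])
    return labels[scores.argmax()]
```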

    Comparison of Nearest Neighbor (ibk), Regression by Discretization and Isotonic Regression Classification Algorithms for Precipitation Classes Prediction

    Selection of a classifier for use in prediction is a challenge. To select the best classifier, comparisons can be made on various aspects of the classifiers. The key objective of this paper was to compare the performance of the nearest neighbor (IBk), regression by discretization, and isotonic regression classifiers for predicting predefined precipitation classes over Voi, Kenya. We sought to train, test, and evaluate the performance of these three classification algorithms in predicting precipitation classes. A dataset of daily minimum/maximum temperatures and precipitation for the Voi station, covering 1979 to 2008, was obtained from the Kenya Meteorological Department. The knowledge discovery and data mining method was applied, and a preprocessing module was designed to produce training and testing sets for use with the classifiers. The Isotonic Regression, K-nearest neighbours (IBk), and RegressionByDiscretization classifiers were used for training and testing on the data sets. The error of the predicted values, the root relative squared error, and the time taken to train/build each classifier model were computed. Each classifier predicted output classes 12 months in advance. Classifier performances were compared in terms of these measures, and the predicted output classes were compared to the actual precipitation classes for each year. The study revealed that the nearest neighbor classifier is suitable for training on rainfall data for precipitation class prediction.
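    As an illustration of the evaluation protocol, the sketch below uses scikit-learn stand-ins for two of the three Weka classifiers named above (IBk ≈ KNeighborsRegressor, Isotonic Regression ≈ IsotonicRegression; Weka's RegressionByDiscretization has no direct scikit-learn counterpart and is omitted), together with the root relative squared error and build-time measurements the paper reports. The synthetic single-predictor data and all parameter choices are our own assumptions, not the Voi dataset.

```python
import time
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.neighbors import KNeighborsRegressor

def rrse(y_true, y_pred):
    """Root relative squared error: error relative to predicting the mean."""
    return np.sqrt(((y_pred - y_true) ** 2).sum()
                   / ((y_true.mean() - y_true) ** 2).sum())

def evaluate(model, X_tr, y_tr, X_te, y_te):
    """Fit a model, timing the build, and return (RRSE, build time)."""
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)
    return rrse(y_te, model.predict(X_te)), time.perf_counter() - t0

# Synthetic stand-in for the weather data; one predictor, since
# isotonic regression is univariate.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 1))
y = 10 * X[:, 0] + rng.normal(scale=1.0, size=300)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

for name, model in [("ibk (k-NN)", KNeighborsRegressor(n_neighbors=5)),
                    ("isotonic", IsotonicRegression(out_of_bounds="clip"))]:
    err, t = evaluate(model, X_tr, y_tr, X_te, y_te)
    print(f"{name}: RRSE={err:.3f}, build time={t:.4f}s")
```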

    A Clustering-Based Algorithm for Data Reduction

    Finding an efficient data reduction method for large-scale problems is an imperative task. In this paper, we propose a similarity-based self-constructing fuzzy clustering algorithm to sample instances for the classification task. Instances that are similar to each other are grouped into the same cluster. When all the instances have been fed in, a number of clusters have formed automatically. The statistical mean of each cluster is then taken to represent all the instances covered by that cluster. This approach has two advantages. One is that it is faster and uses less storage memory. The other is that the number of representative instances need not be specified in advance by the user. Experiments on real-world datasets show that our method runs faster and obtains a better reduction rate than other methods.
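    A minimal one-pass sketch of the reduction idea follows, assuming a single Gaussian similarity of width sigma and a crisp assignment threshold rho (the authors' algorithm is fuzzy and self-constructing, adapting per-dimension deviations; those details are simplified away here). Instances of each class are fed in one at a time, joining the most similar existing cluster or founding a new one, and the cluster means become the reduced training set.

```python
import numpy as np

def reduce_by_clustering(X, y, rho=0.5, sigma=1.0):
    """One-pass similarity-based clustering for data reduction (simplified sketch)."""
    reps_X, reps_y = [], []
    for label in np.unique(y):
        means, counts = [], []
        for x in X[y == label].astype(float):
            # Gaussian similarity of x to each current cluster mean.
            sims = [np.exp(-np.sum((x - m) ** 2) / (2 * sigma ** 2))
                    for m in means]
            if not sims or max(sims) < rho:
                means.append(x.copy())   # no cluster is similar enough: found one
                counts.append(1)
            else:
                j = int(np.argmax(sims))
                counts[j] += 1           # incremental running-mean update
                means[j] += (x - means[j]) / counts[j]
        reps_X.extend(means)
        reps_y.extend([label] * len(means))
    return np.array(reps_X), np.array(reps_y)
```

    The reduced set returned by reduce_by_clustering can then be handed to any instance-based classifier, e.g., a 1-NN rule, in place of the full training data.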