
    An adaptive nearest neighbor rule for classification

    We introduce a variant of the k-nearest neighbor classifier in which k is chosen adaptively for each query, rather than supplied as a parameter. The choice of k depends on properties of each neighborhood, and therefore may vary significantly between different points. (For example, the algorithm will use larger k for predicting the labels of points in noisy regions.) We provide theory and experiments that demonstrate that the algorithm performs comparably to, and sometimes better than, k-NN with an optimal choice of k. In particular, we derive bounds on the convergence rates of our classifier that depend on a local quantity we call the 'advantage', which is significantly weaker than the Lipschitz conditions used in previous convergence rate proofs. These generalization bounds hinge on a variant of the seminal Uniform Convergence Theorem due to Vapnik and Chervonenkis; this variant concerns conditional probabilities and may be of independent interest.
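
    The idea of choosing k per query can be illustrated with a minimal sketch. The stopping rule below (grow k until the label majority exceeds a margin of c / sqrt(k)) is an assumption made for illustration, as are the function name `adaptive_knn_predict`, the constant `c`, and the use of labels in {-1, +1}; the abstract does not state the rule the authors use.

```python
import math

def adaptive_knn_predict(train, query, c=1.0):
    """train: list of (point, label) pairs, label in {-1, +1}, point a tuple of floats.

    Returns (prediction, k_used); prediction 0 means "abstain".
    """
    # Order the training points by Euclidean distance to the query.
    ordered = sorted(train, key=lambda pl: math.dist(pl[0], query))
    total = 0
    for k, (_, label) in enumerate(ordered, start=1):
        total += label
        # Assumed stopping rule (hypothetical): accept the first k at which
        # the average label is further from zero than c / sqrt(k).
        if abs(total) / k > c / math.sqrt(k):
            return (1 if total > 0 else -1), k
    # No neighborhood was confident enough.
    return 0, len(ordered)

# Clean region: the two nearest neighbors already agree.
clean = [((0.0,), 1), ((0.1,), 1), ((0.2,), 1), ((5.0,), -1)]
# Noisy region: conflicting labels near the query force a larger k.
noisy = [((1.0,), 1), ((2.0,), -1), ((3.0,), 1), ((4.0,), 1), ((5.0,), 1)]
print(adaptive_knn_predict(clean, (0.05,)))
print(adaptive_knn_predict(noisy, (0.0,)))
```

    Under this assumed rule, a query in a region with mixed labels stops at a larger k than a query in a clean region, which matches the behavior the abstract describes for noisy regions.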

    MICE Implementation to Handle Missing Values in Rain Potential Prediction Using Support Vector Machine Algorithm

    Support Vector Machine (SVM) is a machine learning algorithm used for classification. SVM has several advantages, such as the ability to handle high-dimensional data, effectiveness on nonlinear data through kernel functions, and resistance to overfitting through soft margins. However, SVM has weaknesses, especially when the data contain missing values, so any use of SVM must take the chosen missing-value strategy into account. Missing values are a serious problem in data mining because they cause loss of efficiency, complications in data handling and analysis, and bias due to differences between missing data and complete data. To overcome these problems, this research focuses on understanding the characteristics of missing values and handling them using the Multiple Imputation by Chained Equations (MICE) technique. In this study, we used secondary data containing missing values from the Meteorological, Climatological, and Geophysical Agency (BMKG) related to predictions of potential rain, especially in DKI Jakarta. We identify the types or patterns of missing values, explore the relationship between missing values and other variables, apply the MICE method to handle the missing values, and use the Support Vector Machine algorithm for classification, in order to produce a more reliable and accurate prediction model for rain potential. The results show that imputation with MICE gives better results than other techniques (such as Complete Case Analysis; mean, median, and mode imputation; and K-Nearest Neighbor), namely an accuracy of 89% on the testing data when applying the Support Vector Machine algorithm for classification.
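
    The chained-equations idea can be sketched as follows. This is a deliberately simplified illustration, not the authors' pipeline: it handles only two numeric columns, uses ordinary least squares as the per-column model, and produces a single deterministic completed dataset, whereas MICE proper draws imputations from a predictive distribution and repeats the process to obtain several completed datasets. The names `fit_line` and `chained_impute` are hypothetical, each column is assumed to have at least one observed value, and the SVM classification step is omitted.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return my, 0.0  # constant predictor: fall back to the mean of y
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def chained_impute(rows, n_iter=10):
    """rows: list of [x0, x1] with None marking a missing value."""
    missing = [[v is None for v in row] for row in rows]
    filled = [list(row) for row in rows]
    # Step 1: initialize missing entries with the observed column mean.
    for j in (0, 1):
        observed = [row[j] for row in rows if row[j] is not None]
        mean = sum(observed) / len(observed)
        for row in filled:
            if row[j] is None:
                row[j] = mean
    # Step 2: cycle through the columns, re-predicting each column's
    # missing entries from the current values of the other column.
    for _ in range(n_iter):
        for j in (0, 1):
            other = 1 - j
            train = [r for r, m in zip(filled, missing) if not m[j]]
            a, b = fit_line([r[other] for r in train], [r[j] for r in train])
            for r, m in zip(filled, missing):
                if m[j]:
                    r[j] = a + b * r[other]
    return filled

# Toy data in which the second column is twice the first.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, None]]
print(chained_impute(data))
```

    Observed entries are never overwritten; only the cells that were missing in the input are re-estimated on each pass.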

    Most Ligand-Based Classification Benchmarks Reward Memorization Rather than Generalization

    Undetected overfitting can occur when there are significant redundancies between training and validation data. We describe AVE, a new measure of training-validation redundancy for ligand-based classification problems that accounts for the similarity amongst inactive molecules as well as active. We investigated seven widely used benchmarks for virtual screening and classification, and show that the amount of AVE bias strongly correlates with the performance of ligand-based predictive methods irrespective of the predicted property, chemical fingerprint, similarity measure, or previously applied unbiasing techniques. Therefore, it may be that the previously reported performance of most ligand-based methods can be explained by overfitting to benchmarks rather than good prospective accuracy.

    Efficient Classification for Metric Data

    Recent advances in large-margin classification of data residing in general metric spaces (rather than Hilbert spaces) enable classification under various natural metrics, such as string edit and earthmover distance. A general framework developed for this purpose by von Luxburg and Bousquet [JMLR, 2004] left open the questions of computational efficiency and of providing direct bounds on generalization error. We design a new algorithm for classification in general metric spaces, whose runtime and accuracy depend on the doubling dimension of the data points, and can thus achieve superior classification performance in many common scenarios. The algorithmic core of our approach is an approximate (rather than exact) solution to the classical problems of Lipschitz extension and of Nearest Neighbor Search. The algorithm's generalization performance is guaranteed via the fat-shattering dimension of Lipschitz classifiers, and we present experimental evidence of its superiority to some common kernel methods. As a by-product, we offer a new perspective on the nearest neighbor classifier, which yields significantly sharper risk asymptotics than the classic analysis of Cover and Hart [IEEE Trans. Info. Theory, 1967].
    Comment: This is the full version of an extended abstract that appeared in Proceedings of the 23rd COLT, 201

    Theoretical analysis of cross-validation for estimating the risk of the k-Nearest Neighbor classifier

    The present work aims at deriving theoretical guarantees on the behavior of some cross-validation procedures applied to the k-nearest neighbors (kNN) rule in the context of binary classification. Here we focus on the leave-p-out cross-validation (LpO) used to assess the performance of the kNN classifier. Remarkably, this LpO estimator can be efficiently computed in this context using closed-form formulas derived by \cite{CelisseMaryHuard11}. We describe a general strategy to derive moment and exponential concentration inequalities for the LpO estimator applied to the kNN classifier. Such results are obtained first by exploiting the connection between the LpO estimator and U-statistics, and second by making intensive use of the generalized Efron-Stein inequality applied to the L1O estimator. One other important contribution is made by deriving new quantifications of the discrepancy between the LpO estimator and the classification error/risk of the kNN classifier. The optimality of these bounds is discussed by means of several lower bounds as well as simulation experiments.
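
    For orientation, the leave-p-out estimator for a kNN classifier can be written down directly from its definition, as in the brute-force sketch below. This enumerates every held-out subset of size p and is therefore only practical for tiny datasets; it is not the closed-form computation the abstract refers to. The function names, the 0/1 label encoding, and the tie-breaking toward label 0 are illustrative assumptions.

```python
from itertools import combinations
import math

def knn_predict(train, query, k):
    """Majority vote among the k nearest training points (labels 0 or 1)."""
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = sum(label for _, label in nearest)
    # Assumed tie-breaking: an exact tie is resolved in favor of label 0.
    return 1 if 2 * votes > k else 0

def leave_p_out_error(data, k, p):
    """Average held-out error rate over all subsets of size p."""
    n = len(data)
    errors = []
    for held_out in combinations(range(n), p):
        held = set(held_out)
        train = [data[i] for i in range(n) if i not in held]
        mistakes = sum(
            knn_predict(train, data[i][0], k) != data[i][1] for i in held_out
        )
        errors.append(mistakes / p)
    return sum(errors) / len(errors)

# Two well-separated clusters on the real line.
data = [((0.0,), 0), ((0.1,), 0), ((0.2,), 0),
        ((5.0,), 1), ((5.1,), 1), ((5.2,), 1)]
print(leave_p_out_error(data, k=1, p=2))
```

    The number of subsets grows as "n choose p", which is why closed-form expressions for this estimator are valuable in practice.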

    A Graph-Based Semi-Supervised k Nearest-Neighbor Method for Nonlinear Manifold Distributed Data Classification

    k Nearest Neighbors (kNN) is one of the most widely used supervised learning algorithms to classify Gaussian distributed data, but it does not achieve good results when it is applied to nonlinear manifold distributed data, especially when only a very limited number of labeled samples is available. In this paper, we propose a new graph-based kNN algorithm which can effectively handle both Gaussian distributed data and nonlinear manifold distributed data. To achieve this goal, we first propose a constrained Tired Random Walk (TRW) by constructing an R-level nearest-neighbor strengthened tree over the graph, and then compute a TRW matrix for similarity measurement purposes. After this, the nearest neighbors are identified according to the TRW matrix and the class label of a query point is determined by the sum of all the TRW weights of its nearest neighbors. To deal with online situations, we also propose a new algorithm to handle sequential samples based on a local neighborhood reconstruction. Comparison experiments are conducted on both synthetic data sets and real-world data sets to demonstrate the validity of the proposed new kNN algorithm and its improvements over other versions of kNN algorithms. Given the widespread appearance of manifold structures in real-world problems and the popularity of the traditional kNN algorithm, the proposed manifold version of kNN shows promising potential for classifying manifold-distributed data.
    Comment: 32 pages, 12 figures, 7 tables

    Towards learning free naive bayes nearest neighbor-based domain adaptation

    As of today, object categorization algorithms are not able to achieve the level of robustness and generality necessary to work reliably in the real world. Even the most powerful convolutional neural network we can train fails to perform satisfactorily when trained and tested on data from different databases. This issue, known as domain adaptation and/or dataset bias in the literature, is due to a distribution mismatch between data collections. Methods addressing it range from max-margin classifiers to learning how to modify the features and obtain a more robust representation. Recent work showed that by casting the problem into the image-to-class recognition framework, the domain adaptation problem is significantly alleviated [23]. Here we follow this approach, and show how a very simple, learning-free Naive Bayes Nearest Neighbor (NBNN)-based domain adaptation algorithm can significantly alleviate the distribution mismatch among source and target data, especially when the number of classes and the number of sources grow. Experiments on standard benchmarks used in the literature show that our approach (a) is competitive with the current state of the art on small-scale problems, and (b) achieves the current state of the art as the number of classes and sources grows, with minimal computational requirements. © Springer International Publishing Switzerland 2015
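
    The image-to-class framework mentioned above can be illustrated with a plain NBNN decision rule: each local descriptor of the query image is matched to its nearest descriptor in a class-wide pool, and the class with the smallest summed squared distance wins. The sketch below shows only this basic rule, not the domain adaptation algorithm of the paper; the function name `nbnn_classify`, the tuple-based descriptors, and the toy class names are assumptions for illustration.

```python
import math

def nbnn_classify(query_descriptors, class_descriptors):
    """class_descriptors: dict mapping class name -> list of descriptor tuples.

    Each class pool is assumed to be non-empty.
    """
    best_class, best_cost = None, math.inf
    for name, pool in class_descriptors.items():
        # Image-to-class distance: sum, over the query descriptors, of the
        # squared distance to the nearest descriptor in this class pool.
        cost = sum(
            min(math.dist(d, ref) ** 2 for ref in pool)
            for d in query_descriptors
        )
        if cost < best_cost:
            best_class, best_cost = name, cost
    return best_class

# Toy descriptor pools for two hypothetical classes.
pools = {
    "cat": [(0.0, 0.0), (0.1, 0.0)],
    "dog": [(5.0, 5.0), (5.0, 5.1)],
}
print(nbnn_classify([(0.05, 0.0), (0.0, 0.1)], pools))
```

    No parameters are fitted here, which is the sense in which such a rule is learning-free: classification requires only nearest-neighbor searches against the stored descriptors.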