33,548 research outputs found

    An adaptive nearest neighbor rule for classification

    We introduce a variant of the k-nearest neighbor classifier in which k is chosen adaptively for each query, rather than supplied as a parameter. The choice of k depends on properties of each neighborhood, and therefore may significantly vary between different points. (For example, the algorithm will use larger k for predicting the labels of points in noisy regions.) We provide theory and experiments that demonstrate that the algorithm performs comparably to, and sometimes better than, k-NN with an optimal choice of k. In particular, we derive bounds on the convergence rates of our classifier that depend on a local quantity we call the 'advantage', which is significantly weaker than the Lipschitz conditions used in previous convergence rate proofs. These generalization bounds hinge on a variant of the seminal Uniform Convergence Theorem due to Vapnik and Chervonenkis; this variant concerns conditional probabilities and may be of independent interest.
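
    A minimal sketch of the per-query idea (not the authors' exact stopping rule): grow k at each query point until one label holds a confident majority. The threshold c / sqrt(k), the constant c, and the fallback k_max below are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adaptive_knn_predict(X_train, y_train, X_query, k_max=50, c=0.5):
    """Choose k separately for each query: stop growing the neighborhood
    once the majority label's margin over 1/2 exceeds c / sqrt(k)."""
    y_train = np.asarray(y_train)
    k_max = min(k_max, len(X_train))
    nn = NearestNeighbors(n_neighbors=k_max).fit(X_train)
    _, idx = nn.kneighbors(X_query)          # neighbors sorted by distance
    preds = []
    for neighbors in idx:
        labels = y_train[neighbors]
        chosen = None
        for k in range(1, k_max + 1):
            values, counts = np.unique(labels[:k], return_counts=True)
            top = counts.argmax()
            if counts[top] / k - 0.5 > c / np.sqrt(k):  # confident majority
                chosen = values[top]
                break
        if chosen is None:  # never confident: fall back to the full k_max vote
            values, counts = np.unique(labels, return_counts=True)
            chosen = values[counts.argmax()]
        preds.append(chosen)
    return np.array(preds)
```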

    MICE Implementation to Handle Missing Values in Rain Potential Prediction Using Support Vector Machine Algorithm

    Support Vector Machine (SVM) is a machine learning algorithm used for classification. SVM has several advantages, such as the ability to handle high-dimensional data, effectiveness on nonlinear data through kernel functions, and resistance to overfitting through soft margins. However, SVM has weaknesses, especially when handling missing values, so the choice of missing-values strategy must be considered when using it. Missing values are a serious problem in data mining because they cause loss of efficiency, complications in data handling and analysis, and bias arising from differences between missing and complete data. To overcome these problems, this research focuses on understanding the characteristics of missing values and handling them using the Multiple Imputation by Chained Equations (MICE) technique. In this study, we use secondary data containing missing values from the Meteorological, Climatological, and Geophysical Agency (BMKG) related to the prediction of rain potential, especially in DKI Jakarta. We identify the types and patterns of missing values, explore the relationship between missing values and other variables, apply the MICE method to handle missing values, and use the Support Vector Machine algorithm for classification, in order to produce a more reliable and accurate prediction model for rain potential. The results show that MICE imputation gives better results than other techniques (such as complete case analysis; mean, median, and mode imputation; and k-nearest neighbor imputation), reaching an accuracy of 89% on testing data when the Support Vector Machine algorithm is applied for classification.
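
    A hedged sketch of the impute-then-classify pipeline described above, using scikit-learn's IterativeImputer as a MICE-style chained-equations imputer in front of an RBF-kernel SVM. The synthetic array stands in for the BMKG weather features, which are not reproduced here.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic placeholder for weather features (e.g. temperature, humidity)
# with missing entries; y encodes rain / no rain.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.15] = np.nan          # inject ~15% missing values
y = (rng.random(200) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = make_pipeline(
    IterativeImputer(max_iter=10, random_state=0),  # chained-equations imputation
    StandardScaler(),
    SVC(kernel="rbf", C=1.0),
)
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```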

    Most Ligand-Based Classification Benchmarks Reward Memorization Rather than Generalization

    Undetected overfitting can occur when there are significant redundancies between training and validation data. We describe AVE, a new measure of training-validation redundancy for ligand-based classification problems that accounts for the similarity among inactive molecules as well as among active ones. We investigate seven widely used benchmarks for virtual screening and classification, and show that the amount of AVE bias strongly correlates with the performance of ligand-based predictive methods irrespective of the predicted property, chemical fingerprint, similarity measure, or previously applied unbiasing techniques. Therefore, it may be that the previously reported performance of most ligand-based methods can be explained by overfitting to benchmarks rather than good prospective accuracy.
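
    The exact AVE statistic is defined in the paper; as a rough illustration of what a training/validation redundancy measure looks like, the sketch below scores each validation point by how much closer it sits to same-class training data than to other-class training data. This is a simplified stand-in, not the published AVE formula, and the 0/1 label coding is an assumption.

```python
import numpy as np
from scipy.spatial.distance import cdist

def redundancy_score(X_train, y_train, X_val, y_val):
    """Simplified train/validation redundancy score (AVE-inspired sketch).

    Assumes labels are coded 1 = active, 0 = inactive. For each validation
    point we compare the distance to its nearest same-class training point
    against the distance to its nearest other-class training point. Large
    positive scores mean validation points sit much closer to same-class
    training data, i.e. the split rewards memorization.
    """
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    X_val, y_val = np.asarray(X_val), np.asarray(y_val)
    score = 0.0
    for cls in (0, 1):
        val = X_val[y_val == cls]
        same = X_train[y_train == cls]
        other = X_train[y_train != cls]
        d_same = cdist(val, same).min(axis=1)    # NN distance, same class
        d_other = cdist(val, other).min(axis=1)  # NN distance, other class
        score += np.mean(d_other - d_same)
    return score
```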

    Efficient Classification for Metric Data

    Recent advances in large-margin classification of data residing in general metric spaces (rather than Hilbert spaces) enable classification under various natural metrics, such as string edit and earth mover's distance. A general framework developed for this purpose by von Luxburg and Bousquet [JMLR, 2004] left open the questions of computational efficiency and of providing direct bounds on generalization error. We design a new algorithm for classification in general metric spaces, whose runtime and accuracy depend on the doubling dimension of the data points, and which can thus achieve superior classification performance in many common scenarios. The algorithmic core of our approach is an approximate (rather than exact) solution to the classical problems of Lipschitz extension and of nearest neighbor search. The algorithm's generalization performance is guaranteed via the fat-shattering dimension of Lipschitz classifiers, and we present experimental evidence of its superiority to some common kernel methods. As a by-product, we offer a new perspective on the nearest neighbor classifier, which yields significantly sharper risk asymptotics than the classic analysis of Cover and Hart [IEEE Trans. Info. Theory, 1967]. Comment: This is the full version of an extended abstract that appeared in Proceedings of the 23rd COLT, 2010.
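
    For concreteness, here is a small sketch of the plain nearest-neighbor rule under a general metric (string edit distance). The paper's actual contributions, approximate Lipschitz extension and doubling-dimension-dependent approximate nearest neighbor search, are not shown; the tiny training set is a made-up example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def nn_classify(train, query, metric=edit_distance):
    """Brute-force 1-NN rule in an arbitrary metric space.

    train: list of (point, label) pairs; query: a point of the same kind.
    """
    return min(train, key=lambda pair: metric(pair[0], query))[1]

train = [("spam", 1), ("scam", 1), ("mail", 0), ("nail", 0)]
print(nn_classify(train, "snail"))   # -> 0, the nearest string is "nail"
```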

    Theoretical analysis of cross-validation for estimating the risk of the k-Nearest Neighbor classifier

    The present work aims at deriving theoretical guarantees on the behavior of some cross-validation procedures applied to the k-nearest neighbors (kNN) rule in the context of binary classification. Here we focus on the leave-p-out cross-validation (LpO) used to assess the performance of the kNN classifier. Remarkably, this LpO estimator can be efficiently computed in this context using closed-form formulas derived by Celisse and Mary-Huard (2011). We describe a general strategy to derive moment and exponential concentration inequalities for the LpO estimator applied to the kNN classifier. Such results are obtained first by exploiting the connection between the LpO estimator and U-statistics, and second by making intensive use of the generalized Efron-Stein inequality applied to the L1O estimator. Another important contribution is a new quantification of the discrepancy between the LpO estimator and the classification error/risk of the kNN classifier. The optimality of these bounds is discussed by means of several lower bounds as well as simulation experiments.
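
    The closed-form LpO formulas themselves are in the cited work; as a concrete reference point, the sketch below computes the p = 1 (leave-one-out) estimate of the kNN risk by brute force, which is the quantity those formulas evaluate efficiently.

```python
import numpy as np
from scipy.spatial.distance import cdist

def loo_knn_risk(X, y, k=5):
    """Brute-force leave-one-out estimate of the kNN classification risk.

    X: (n, d) array of points; y: non-negative integer class labels.
    Each point is classified by a majority vote over its k nearest
    neighbors among the remaining n - 1 points.
    """
    X, y = np.asarray(X), np.asarray(y)
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)          # a point may not vote for itself
    errors = 0
    for i in range(len(X)):
        neighbors = np.argsort(D[i])[:k]
        vote = np.bincount(y[neighbors]).argmax()
        errors += int(vote != y[i])
    return errors / len(X)
```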

    A Graph-Based Semi-Supervised k Nearest-Neighbor Method for Nonlinear Manifold Distributed Data Classification

    k Nearest Neighbors (kNN) is one of the most widely used supervised learning algorithms for classifying Gaussian distributed data, but it does not achieve good results when applied to nonlinear manifold distributed data, especially when only a very limited amount of labeled samples is available. In this paper, we propose a new graph-based kNN algorithm which can effectively handle both Gaussian distributed data and nonlinear manifold distributed data. To achieve this goal, we first propose a constrained Tired Random Walk (TRW) by constructing an R-level nearest-neighbor strengthened tree over the graph, and then compute a TRW matrix for similarity measurement purposes. After this, the nearest neighbors are identified according to the TRW matrix, and the class label of a query point is determined by the sum of all the TRW weights of its nearest neighbors. To deal with online situations, we also propose a new algorithm to handle sequential samples based on local neighborhood reconstruction. Comparison experiments are conducted on both synthetic and real-world data sets to demonstrate the validity of the proposed kNN algorithm and its improvements over other versions of the kNN algorithm. Given the widespread appearance of manifold structures in real-world problems and the popularity of the traditional kNN algorithm, the proposed manifold version of kNN shows promising potential for classifying manifold-distributed data. Comment: 32 pages, 12 figures, 7 tables.
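
    A hedged sketch of the accumulated random-walk similarity and the voting rule described above, assuming the standard closed form (I - alpha * P)^{-1} for the walk series; the constrained R-level tree strengthening and the online extension from the paper are omitted, and the alpha value is illustrative.

```python
import numpy as np

def tired_random_walk_matrix(W, alpha=0.9):
    """Accumulated random-walk similarities on a weighted graph.

    Sketch using the closed form sum_{t>=0} (alpha * P)^t = (I - alpha * P)^{-1},
    where P is the row-normalized transition matrix of the affinity matrix W
    and 0 < alpha < 1 damps longer walks.
    """
    P = W / W.sum(axis=1, keepdims=True)          # transition probabilities
    n = W.shape[0]
    return np.linalg.inv(np.eye(n) - alpha * P)   # geometric series of walks

def classify_by_trw(trw_row, labels, neighbor_idx):
    """Label a query node by summing TRW weights of its neighbors per class."""
    labels, neighbor_idx = np.asarray(labels), np.asarray(neighbor_idx)
    classes = np.unique(labels[neighbor_idx])
    scores = {c: trw_row[neighbor_idx[labels[neighbor_idx] == c]].sum()
              for c in classes}
    return max(scores, key=scores.get)
```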

    Towards Learning-Free Naive Bayes Nearest Neighbor-Based Domain Adaptation

    As of today, object categorization algorithms are not able to achieve the level of robustness and generality necessary to work reliably in the real world. Even the most powerful convolutional neural network we can train fails to perform satisfactorily when trained and tested on data from different databases. This issue, known as domain adaptation and/or dataset bias in the literature, is due to a distribution mismatch between data collections. Methods addressing it range from max-margin classifiers to learning how to modify the features to obtain a more robust representation. Recent work showed that by casting the problem into the image-to-class recognition framework, the domain adaptation problem is significantly alleviated [23]. Here we follow this approach, and show how a very simple, learning-free Naive Bayes Nearest Neighbor (NBNN)-based domain adaptation algorithm can significantly alleviate the distribution mismatch between source and target data, especially as the number of classes and the number of sources grow. Experiments on standard benchmarks used in the literature show that our approach (a) is competitive with the current state of the art on small-scale problems, and (b) achieves the current state of the art as the number of classes and sources grows, with minimal computational requirements. © Springer International Publishing Switzerland 2015.
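
    For reference, a minimal sketch of the image-to-class NBNN decision rule the method builds on; the learning-free domain-adaptation extension itself is not reproduced here, and the descriptor arrays and class names are placeholders.

```python
import numpy as np
from scipy.spatial.distance import cdist

def nbnn_classify(image_descriptors, class_descriptors):
    """Naive Bayes Nearest Neighbor (image-to-class) decision rule.

    image_descriptors: (n, d) local descriptors of the query image.
    class_descriptors: dict mapping class name -> (m_c, d) array of all
    training descriptors of that class. The query is assigned to the class
    minimizing the summed squared distance from each of its descriptors to
    the nearest descriptor of that class.
    """
    scores = {}
    for cls, descs in class_descriptors.items():
        d2 = cdist(image_descriptors, descs, metric="sqeuclidean")
        scores[cls] = d2.min(axis=1).sum()    # image-to-class distance
    return min(scores, key=scores.get)

# Tiny synthetic usage example with two placeholder classes.
rng = np.random.default_rng(0)
classes = {"dog": rng.normal(0.0, 1.0, (50, 8)),
           "cat": rng.normal(2.0, 1.0, (50, 8))}
query = rng.normal(2.0, 1.0, (20, 8))
print(nbnn_classify(query, classes))          # expected: "cat"
```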