Secure k-ish Nearest Neighbors Classifier
In machine learning, classifiers are used to predict the class of a given query
based on an existing (classified) database. Given a database S of n
d-dimensional points and a d-dimensional query q, the k-nearest neighbors (kNN)
classifier assigns q the majority class of its k nearest neighbors in S.
In the secure version of kNN, S and q are owned by two different parties that
do not want to share their data. Unfortunately, all known solutions for secure
kNN either require large communication complexity between the parties or are
very inefficient to run.
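For reference, the plaintext (non-secure) kNN rule described above can be
sketched in a few lines of Python; the names here are illustrative and this is
not the authors' secure protocol:

```python
import numpy as np
from collections import Counter

def knn_classify(S, labels, q, k):
    """Plain (non-secure) kNN: assign q the majority class of its
    k nearest neighbors in the database S."""
    dists = np.linalg.norm(S - q, axis=1)  # distance from q to each of the n points
    nearest = np.argsort(dists)[:k]        # indices of the k nearest neighbors
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]
```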
In this work we present a classifier based on kNN that can be implemented
efficiently with homomorphic encryption (HE). The efficiency of our classifier
comes from a relaxation we make on kNN, in which we allow it to consider κ
nearest neighbors, for κ ≈ k, with some probability. We therefore call our
classifier k-ish Nearest Neighbors (k-ish NN).
The success probability of our solution depends on the distribution of the
distances from q to S and increases as its statistical distance to a Gaussian
decreases.
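As a plaintext illustration of this condition, one way to quantify how close
the distance distribution is to a Gaussian is an empirical total variation
distance over histogram bins. This is only a sketch for intuition, not the
paper's actual analysis:

```python
import numpy as np
from scipy.stats import norm

def tv_distance_to_gaussian(dists, bins=50):
    """Estimate the total variation distance between the empirical
    distribution of distances and a Gaussian with matching moments."""
    mu, sigma = dists.mean(), dists.std()
    hist, edges = np.histogram(dists, bins=bins)
    p = hist / hist.sum()                              # empirical bin masses
    q = np.diff(norm.cdf(edges, loc=mu, scale=sigma))  # Gaussian bin masses
    return 0.5 * np.abs(p - q).sum()
```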
To implement our classifier we introduce the concept of a doubly-blinded
coin-toss, in which the success probability as well as the output of the toss
are encrypted. We use this coin-toss to efficiently approximate the average
and variance of the distances from q to S. We believe these two techniques may
be of independent interest.
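A plaintext stand-in can convey the statistical idea: toss a coin whose bias
encodes a normalized value and count heads to estimate moments. Under HE, both
the bias and the outcome would be ciphertexts; the sketch below is a clear-text
analogue with assumed helper names, not the encrypted protocol itself:

```python
import random

def blinded_coin_toss(p):
    """Plaintext stand-in for a doubly-blinded coin-toss: with HE, both
    the bias p and the returned bit would be encrypted; here both are clear."""
    return 1 if random.random() < p else 0

def approx_mean_var(values, scale):
    """Approximate the mean and variance of `values` (assumed in [0, scale])
    by counting coin tosses whose biases are the normalized values."""
    n = len(values)
    heads = sum(blinded_coin_toss(v / scale) for v in values)            # E ~ n*mean/scale
    heads_sq = sum(blinded_coin_toss((v / scale) ** 2) for v in values)  # E ~ n*E[v^2]/scale^2
    mean = scale * heads / n
    second_moment = scale ** 2 * heads_sq / n
    return mean, max(second_moment - mean ** 2, 0.0)
```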
When implemented with HE, the k-ish NN has a circuit depth that is independent
of n, therefore making it scalable. We also implemented our classifier in an
open-source library based on HElib and tested it on a breast tumor database.
The accuracy of our classifier (F_1 score) was 98% and classification took
less than 3 hours, compared to an estimated time of weeks in current HE
implementations.
Fast k-means based on KNN Graph
In the era of big data, k-means clustering has been widely adopted as a basic
processing tool in various contexts. However, its computational cost can be
prohibitively high when the data size and the cluster number are large. It is
well known that the processing bottleneck of k-means lies in the operation of
seeking the closest centroid in each iteration. In this paper, a novel solution
to the scalability issue of k-means is presented. In the proposal, k-means is
supported by an approximate k-nearest-neighbor graph. In each k-means
iteration, a data sample is only compared to the clusters that its nearest
neighbors reside in. Since the number of nearest neighbors considered is much
smaller than k, the processing cost of this step becomes minor and independent
of k. The processing bottleneck is therefore overcome. Most interestingly, the
k-nearest-neighbor graph is constructed by iteratively calling the fast
k-means itself. Compared with existing fast k-means variants, the proposed
algorithm achieves a speed-up of hundreds to thousands of times while
maintaining high clustering quality. When tested on 10 million 512-dimensional
data points, it takes only 5.2 hours to produce 1 million clusters. In
contrast, it would take traditional k-means 3 years to fulfill the same scale
of clustering.
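The core iteration can be sketched in plaintext as follows. The helper
`knn_graph` (mapping each sample to its approximate nearest-neighbor indices)
is an assumed input; this illustrates the idea of restricting centroid
comparisons, not the authors' implementation:

```python
import numpy as np

def graph_kmeans_step(X, assign, centroids, knn_graph):
    """One k-means assignment step in which each sample is compared only to
    the centroids of clusters its (approximate) nearest neighbors belong to,
    instead of to all k centroids."""
    for i, x in enumerate(X):
        cand = list({assign[j] for j in knn_graph[i]} | {assign[i]})  # candidate clusters
        d = [np.linalg.norm(x - centroids[c]) for c in cand]
        assign[i] = cand[int(np.argmin(d))]
    return assign
```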
Imputing missing genotypes with weighted k nearest neighbors
Motivation: Missing values are a common problem in genetic association studies
concerned with single nucleotide polymorphisms (SNPs). Since most statistical
methods cannot handle missing values, these values have to be removed prior to
the actual analysis. Considering only complete observations, however, often
leads to an immense loss of information. Therefore, procedures are needed that
can replace such missing values. In this article, we propose a method based on
weighted k nearest neighbors that can be employed for imputing such missing
genotypes.
Results: In a comparison to other imputation approaches, our procedure, called
KNNcatImpute, shows the lowest rates of falsely imputed genotypes when applied
to the SNP data from the GENICA study, a study dedicated to the identification
of genetic and gene-environment interactions associated with sporadic breast
cancer. Moreover, in contrast to other imputation methods that take all
variables into account when replacing missing values of a particular variable,
KNNcatImpute is not restricted to association studies comprising several tens
to a few hundred SNPs, but can also be applied to data from whole-genome
studies, as an application to a subset of the HapMap data shows.
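In the spirit of weighted kNN imputation (though not KNNcatImpute's exact
weighting scheme, which the abstract does not specify), a minimal sketch might
look like this, assuming genotypes coded 0/1/2 with -1 marking missing values:

```python
import numpy as np
from collections import defaultdict

def weighted_knn_impute(G, i, j, k=5):
    """Impute the missing genotype G[i, j] (coded 0/1/2, missing = -1) by a
    distance-weighted majority vote over the k nearest rows with G[:, j]
    observed. Distance here is a simple mismatch rate over the SNPs observed
    for sample i (donor missing values count as mismatches in this sketch)."""
    mask = (G[i] != -1)  # SNPs observed for sample i
    donors = [r for r in range(len(G)) if r != i and G[r, j] != -1]
    dists = [(np.mean(G[r][mask] != G[i][mask]), r) for r in donors]
    votes = defaultdict(float)
    for d, r in sorted(dists)[:k]:
        votes[G[r, j]] += 1.0 / (d + 1e-9)  # closer donors weigh more
    return max(votes, key=votes.get)
```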
A Graph-Based Semi-Supervised k Nearest-Neighbor Method for Nonlinear Manifold Distributed Data Classification
k Nearest Neighbors (kNN) is one of the most widely used supervised learning
algorithms for classifying Gaussian distributed data, but it does not achieve
good results when applied to nonlinear manifold distributed data, especially
when only a very limited amount of labeled samples is available. In this
paper, we propose a new graph-based kNN algorithm which can effectively handle
both Gaussian distributed data and nonlinear manifold distributed data.
To achieve this goal, we first propose a constrained Tired Random Walk (TRW)
by constructing an R-level nearest-neighbor strengthened tree over the graph,
and then compute a TRW matrix for similarity measurement purposes. After this,
the nearest neighbors are identified according to the TRW matrix and the class
label of a query point is determined by the sum of all the TRW weights of its
nearest neighbors. To deal with online situations, we also propose a new
algorithm to handle sequential samples based on local neighborhood
reconstruction. Comparison experiments are conducted on both synthetic data
sets and real-world data sets to demonstrate the validity of the proposed new
kNN algorithm and its improvements over other versions of the kNN algorithm.
Given the widespread appearance of manifold structures in real-world problems
and the popularity of the traditional kNN algorithm, the proposed manifold
version of kNN shows promising potential for classifying manifold-distributed
data.
Comment: 32 pages, 12 figures, 7 tables
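For intuition, a tired random walk similarity admits a simple closed form when
the walker proceeds with probability α at each step: summing (αP)^t over all
t ≥ 0 gives (I − αP)^(-1). The sketch below uses this simplified form and a
plain kNN vote over TRW weights; the constrained, tree-strengthened
construction in the paper is more involved, and all names here are
illustrative:

```python
import numpy as np

def trw_matrix(W, alpha=0.9):
    """Tired-random-walk similarity: sum_{t>=0} (alpha*P)^t = inv(I - alpha*P),
    where P is the row-normalized affinity matrix W."""
    P = W / W.sum(axis=1, keepdims=True)
    return np.linalg.inv(np.eye(len(W)) - alpha * P)

def trw_knn_classify(T, q_idx, labels, labeled_idx, k=10):
    """Assign the query the class whose labeled k nearest neighbors
    (ranked by TRW similarity) have the largest total TRW weight."""
    sims = T[q_idx, labeled_idx]
    top = np.argsort(-sims)[:k]
    scores = {}
    for t in top:
        c = labels[labeled_idx[t]]
        scores[c] = scores.get(c, 0.0) + sims[t]
    return max(scores, key=scores.get)
```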
