Dynamic Feature Scaling for K-Nearest Neighbor Algorithm
The K-Nearest Neighbors algorithm is a lazy learning algorithm that approximates predictions with the help of similar existing vectors in the training dataset. Predictions made by the K-Nearest Neighbors algorithm are based on averaging the target values of the spatial neighbors. Neighbors in the feature space are selected with the help of distance metrics such as Euclidean distance, Minkowski distance, and Mahalanobis distance. A majority of these metrics, Euclidean distance among them, are scale variant, meaning that the results can vary for different ranges of values used for the features. Standard techniques for normalizing the scales are feature scaling methods such as Z-score normalization and Min-Max scaling. These scaling methods uniformly assign equal weight to every feature, which might result in a non-ideal situation. This paper proposes a novel method to assign weights to individual features with the help of out-of-bag errors obtained from constructing multiple decision tree models.
Comment: Presented at the International Conference on Mathematical Computer Engineering 201
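A minimal sketch of the idea in Python (scikit-learn assumed). Here the forest's impurity-based feature_importances_ stand in for the paper's out-of-bag-error-derived weights; the point is only that per-feature weights learned from trees can rescale the space a KNN model searches in:

```python
# Hypothetical sketch: tree-derived feature weights for KNN.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit a forest; its importances stand in for OOB-error-derived weights.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_tr, y_tr)
w = forest.feature_importances_  # non-negative, sums to 1

# Rescale each feature by its weight so Euclidean distances in KNN
# emphasize the informative features.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr * w, y_tr)
print("weighted-KNN accuracy:", knn.score(X_te * w, y_te))
```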
KNN Ensembles for Tweedie Regression: The Power of Multiscale Neighborhoods
Very few K-nearest-neighbor (KNN) ensembles exist, despite the efficacy of
this approach in regression, classification, and outlier detection. Those that
do exist focus on bagging features, rather than varying k or bagging
observations; it is unknown whether varying k or bagging observations can
improve prediction. Given recent studies from topological data analysis,
varying k may function like multiscale topological methods, providing stability
and better prediction, as well as increased ensemble diversity.
This paper explores 7 KNN ensemble algorithms combining bagged features,
bagged observations, and varied k to understand how each of these contribute to
model fit. Specifically, these algorithms are tested on Tweedie regression
problems through simulations and 6 real datasets; results are compared to
state-of-the-art machine learning models including extreme learning machines,
random forest, boosted regression, and Morse-Smale regression.
Results on simulations suggest gains from varying k above and beyond bagging
features or samples, as well as the robustness of KNN ensembles to the curse of
dimensionality. KNN regression ensembles perform favorably against
state-of-the-art algorithms and dramatically improve performance over KNN
regression. Further, real dataset results suggest varying k is a good strategy
in general (particularly for difficult Tweedie regression problems) and that
KNN regression ensembles often outperform state-of-the-art methods.
These results for k-varying ensembles echo recent theoretical results in
topological data analysis, where multidimensional filter functions and
multiscale coverings provide stability and performance gains over
single-dimensional filters and single-scale coverings. This opens up the possibility of leveraging multiscale neighborhoods and multiple measures of local geometry in ensemble methods.
Comment: 17 pages, 11 figures
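A rough sketch of a k-varying, observation-bagged KNN regression ensemble (Python with scikit-learn; the k grid, bag count, and synthetic skewed target are illustrative choices, not the paper's exact configurations):

```python
# Illustrative k-varying, observation-bagged KNN regression ensemble.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = np.exp(X[:, 0]) + rng.gamma(shape=2.0, size=500)  # skewed, Tweedie-like target

def knn_ensemble_predict(X_tr, y_tr, X_te, ks=(3, 5, 9, 17, 33), n_bags=20):
    """Average KNN predictions over bootstrap samples and several scales k."""
    preds = []
    for _ in range(n_bags):
        idx = rng.integers(0, len(X_tr), size=len(X_tr))  # bagged observations
        for k in ks:  # multiscale neighborhoods
            model = KNeighborsRegressor(n_neighbors=k).fit(X_tr[idx], y_tr[idx])
            preds.append(model.predict(X_te))
    return np.mean(preds, axis=0)

y_hat = knn_ensemble_predict(X[:400], y[:400], X[400:])
```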
Relief-Based Feature Selection: Introduction and Review
Feature selection plays a critical role in biomedical data mining, driven by
increasing feature dimensionality in target problems and growing interest in
advanced but computationally expensive methodologies able to model complex
associations. Specifically, there is a need for feature selection methods that
are computationally efficient, yet sensitive to complex patterns of
association, e.g. interactions, so that informative features are not mistakenly
eliminated prior to downstream modeling. This paper focuses on Relief-based
algorithms (RBAs), a unique family of filter-style feature selection algorithms
that have gained appeal by striking an effective balance between these
objectives while flexibly adapting to various data characteristics, e.g.
classification vs. regression. First, this work broadly examines types of
feature selection and defines RBAs within that context. Next, we introduce the
original Relief algorithm and associated concepts, emphasizing the intuition
behind how it works, how feature weights generated by the algorithm can be
interpreted, and why it is sensitive to feature interactions without evaluating
combinations of features. Lastly, we include an expansive review of RBA
methodological research beyond Relief and its popular descendant, ReliefF. In
particular, we characterize branches of RBA research, and provide comparative
summaries of RBA algorithms including contributions, strategies, functionality,
time complexity, adaptation to key data characteristics, and software
availability.
Comment: Submitted revisions for publication based on reviews by the Journal of Biomedical Informatics
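For intuition, a minimal sketch of the original Relief update for a binary problem with numeric features scaled to [0, 1] (ReliefF and later RBAs refine this with k neighbors, multi-class handling, and missing data):

```python
# Minimal Relief sketch: binary classes, numeric features in [0, 1].
import numpy as np

def relief(X, y, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)
        dists = np.abs(X - X[i]).sum(axis=1)  # Manhattan distance to all points
        dists[i] = np.inf                     # exclude the instance itself
        hit = np.where(y == y[i], dists, np.inf).argmin()   # nearest same-class
        miss = np.where(y != y[i], dists, np.inf).argmin()  # nearest other-class
        # Penalize features that differ from the hit, reward features that
        # differ from the miss; no feature combinations are ever evaluated.
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / n_iter
    return w  # higher weight = more relevant feature
```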
Photometric Redshift Estimation for Quasars by Integration of KNN and SVM
The massive photometric data collected from multiple large-scale sky surveys
offer significant opportunities for measuring distances of celestial objects by
photometric redshifts. However, catastrophic failure has long been an unsolved problem in current photometric redshift estimation approaches (such as k-nearest-neighbor). In this paper, we propose a novel two-stage approach that integrates the k-nearest-neighbor (KNN) and support vector machine (SVM) methods. In the first stage, we apply the KNN algorithm to photometric data and estimate the corresponding photometric redshifts z_phot. By analysis, we find two dense regions with catastrophic failure, one at lower z_phot and the other at higher z_phot. In the second stage, we map the photometric input pattern of the points falling into these two ranges from the original attribute space into a high-dimensional feature space by a Gaussian kernel function in the SVM. In the high-dimensional feature space, many outlier points that result from catastrophic failure under the simple Euclidean distance computation of KNN can be identified by a classification hyperplane of the SVM and further corrected. Experimental results based on SDSS (Sloan Digital Sky Survey) quasar data show that the two-stage fusion approach can significantly mitigate catastrophic failure and improve the estimation accuracy of photometric redshifts of quasars. The percentages in different |Δz| ranges and the rms (root mean square) error obtained by the integrated method are , , and 0.192, respectively, compared to the results by KNN (, , and 0.204).
Comment: 14 pages, 7 figures, 1 table, accepted by Research in Astronomy and Astrophysics
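A schematic of the two-stage idea on synthetic stand-in data (scikit-learn assumed; the |Δz| > 0.3 failure threshold and the synthetic photometry are placeholders, not the paper's settings):

```python
# Schematic two-stage pipeline: KNN estimate, then SVM failure flagging.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVC

rng = np.random.default_rng(0)
mags = rng.normal(size=(2000, 5))  # stand-in photometric colors
z_true = np.abs(mags[:, 0] + 0.1 * rng.normal(size=2000))  # stand-in redshifts
tr, te = slice(0, 1500), slice(1500, None)

# Stage 1: KNN photometric redshift estimates.
knn = KNeighborsRegressor(n_neighbors=17).fit(mags[tr], z_true[tr])
z_phot = knn.predict(mags[te])

# Stage 2: learn to recognize catastrophic failures on the training fold
# (placeholder definition: |Delta z| > 0.3 on cross-validated estimates).
z_cv = cross_val_predict(KNeighborsRegressor(n_neighbors=17),
                         mags[tr], z_true[tr], cv=5)
is_failure = (np.abs(z_cv - z_true[tr]) > 0.3).astype(int)
svm = SVC(kernel="rbf", gamma="scale").fit(mags[tr], is_failure)
flagged = svm.predict(mags[te]) == 1  # candidates for further correction
```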
Geometrical Complexity of Classification Problems
Despite encouraging recent progress in ensemble approaches, classification
methods seem to have reached a plateau in development. Further advances depend
on a better understanding of geometrical and topological characteristics of
point sets in high-dimensional spaces, the preservation of such characteristics
under feature transformations and sampling processes, and their interaction
with geometrical models used in classifiers. We discuss an attempt to measure
such properties from data sets and relate them to classifier accuracies.
Comment: Proceedings of the 7th Course on Ensemble Methods for Learning Machines at the International School on Neural Nets "E.R. Caianiello"
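As one concrete example of such a measure, Fisher's discriminant ratio per feature (whose maximum is the classic F1 data complexity measure) can be computed directly from a labeled point set; the abstract itself does not commit to specific measures, so this is purely illustrative:

```python
# Illustrative geometrical measure: per-feature Fisher discriminant ratio.
import numpy as np

def fisher_ratio(X, y):
    """(mu0 - mu1)^2 / (var0 + var1) per feature, for binary labels 0/1."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12
    return num / den  # max over features is the classic F1 complexity measure
```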
Compact Hash Codes for Efficient Visual Descriptors Retrieval in Large Scale Databases
In this paper we present an efficient method for visual descriptors retrieval
based on compact hash codes computed using a multiple k-means assignment. The
method has been applied to the problem of approximate nearest neighbor (ANN)
search of local and global visual content descriptors, and it has been tested
on different datasets: three large scale public datasets of up to one billion
descriptors (BIGANN) and, supported by recent progress in convolutional neural
networks (CNNs), also on the CIFAR-10 and MNIST datasets. Experimental results
show that, despite its simplicity, the proposed method obtains a very high
performance that makes it superior to more complex state-of-the-art methods.
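A simplified sketch of the multiple k-means assignment idea (scikit-learn assumed; the codebook size and the number of assignments m are illustrative): each descriptor sets one bit per assigned centroid, and the resulting binary codes can be compared with Hamming distance:

```python
# Sketch: binary codes from multiple k-means assignments.
import numpy as np
from sklearn.cluster import KMeans

def multi_kmeans_codes(X_train, X, n_bits=64, m=3):
    """One bit per centroid; set the bits of each point's m nearest centroids."""
    km = KMeans(n_clusters=n_bits, n_init=4, random_state=0).fit(X_train)
    d = ((X[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(-1)
    nearest = np.argsort(d, axis=1)[:, :m]         # m nearest centroids
    codes = np.zeros((len(X), n_bits), dtype=np.uint8)
    np.put_along_axis(codes, nearest, 1, axis=1)   # set those bits
    return codes  # compare codes with Hamming distance for ANN search
```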
Fast kNN mode seeking clustering applied to active learning
A significantly faster algorithm is presented for the original kNN mode
seeking procedure. It has the advantages over the well-known mean shift
algorithm that it is feasible in high-dimensional vector spaces and results in
unique, well-defined modes. Moreover, without any additional computational effort it may yield a multi-scale hierarchy of clusterings. The time complexity is just O(n^1.5). The resulting computing times range from seconds for 10^4 objects
to minutes for 10^5 objects and to less than an hour for 10^6 objects. The
space complexity is just O(n). The procedure is well suited for finding large
sets of small clusters and is thereby a candidate to analyze thousands of
clusters in millions of objects.
The kNN mode seeking procedure can be used for active learning by assigning
the clusters to the class of the modal objects of the clusters. Its feasibility
is shown by some examples with up to 1.5 million handwritten digits. The
obtained classification results based on the clusterings are compared with
those obtained by the nearest neighbor rule and the support vector classifier
based on the same labeled objects for training. It can be concluded that using
the clustering structure for classification can be significantly better than
using the trained classifiers. A drawback of using the clustering for
classification, however, is that no classifier is obtained that may be used for
out-of-sample objects.
Comment: 23 pages, 12 figures
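A compact sketch of the kNN mode seeking procedure (Python with scikit-learn; this naive version includes neither the paper's speedups nor the multi-scale hierarchy):

```python
# Naive kNN mode seeking sketch (no indexing speedups, single scale k).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_mode_seeking(X, k=10):
    dist, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
    density = 1.0 / (dist[:, -1] + 1e-12)  # inverse distance to k-th neighbor
    # Each point links to the densest point in its neighborhood
    # (itself included, since each point is its own nearest neighbor).
    link = idx[np.arange(len(X)), density[idx].argmax(axis=1)]
    modes = link.copy()
    while True:  # follow the pointers until every chain reaches a fixed point
        nxt = link[modes]
        if np.array_equal(nxt, modes):
            return modes  # points sharing a mode index form one cluster
        modes = nxt
```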
Kernelized Hashcode Representations for Relation Extraction
Kernel methods have produced state-of-the-art results for a number of NLP
tasks such as relation extraction, but suffer from poor scalability due to the
high cost of computing kernel similarities between natural language structures.
A recently proposed technique, kernelized locality-sensitive hashing (KLSH),
can significantly reduce the computational cost, but is only applicable to
classifiers operating on kNN graphs. Here we propose to use random subspaces of
KLSH codes for efficiently constructing an explicit representation of NLP
structures suitable for general classification methods. Further, we propose an
approach for optimizing the KLSH model for classification problems by
maximizing an approximation of mutual information between the KLSH codes
(feature vectors) and the class labels. We evaluate the proposed approach on
biomedical relation extraction datasets, and observe significant and robust
improvements in accuracy w.r.t. state-of-the-art classifiers, along with
drastic (orders-of-magnitude) speedup compared to conventional kernel methods.
Comment: To appear in the proceedings of conference, AAAI-1
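A simplified sketch of how KLSH-style bits can be generated (scikit-learn's rbf_kernel as a stand-in kernel; the centering and bit construction are cruder than in the actual method), after which random subsets of bits would serve as explicit feature vectors for any classifier:

```python
# Simplified KLSH-style code generation with an RBF stand-in kernel.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def klsh_codes(X, landmarks, n_bits=128, seed=0):
    rng = np.random.default_rng(seed)
    K = rbf_kernel(X, landmarks)           # kernel similarities, shape (n, L)
    K = K - K.mean(axis=0, keepdims=True)  # crude centering in kernel space
    W = rng.normal(size=(landmarks.shape[0], n_bits))  # random RKHS directions
    return (K @ W > 0).astype(np.uint8)    # one sign bit per direction

# Random subspaces of these bits (random column subsets of the code matrix)
# can then be fed to a standard classifier.
```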
Vector Quantization by Minimizing Kullback-Leibler Divergence
This paper proposes a new method for vector quantization that minimizes the Kullback-Leibler divergence between the class label distributions over the quantization input, i.e., the original vectors, and over the output, i.e., the quantization subsets of the vector set. In this way, the vector quantization output keeps as much of the class label information as possible. An objective function is constructed, and an iterative algorithm is developed to minimize it. The new method is evaluated on a bag-of-features based image classification problem.
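A sketch of what such an objective can look like for hard assignments (the paper's exact formulation and iterative minimizer may differ): with one-hot per-vector label distributions, the KL term for each vector reduces to minus the log-probability its cell assigns to that vector's label:

```python
# Sketch of a KL-based quantization objective for hard assignments.
import numpy as np

def kl_quantization_objective(y, assign, n_classes, n_cells, eps=1e-12):
    # Per-cell class label distributions Q[j, c].
    Q = np.zeros((n_cells, n_classes))
    for j in range(n_cells):
        labels = y[assign == j]
        if len(labels):
            Q[j] = np.bincount(labels, minlength=n_classes) / len(labels)
    # With one-hot per-vector distributions, KL(P_i || Q_a(i)) = -log Q[a(i), y_i];
    # an iterative algorithm would reassign vectors to reduce this sum.
    return -np.log(Q[assign, y] + eps).sum()
```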
Multipartite Pooling for Deep Convolutional Neural Networks
We propose a novel pooling strategy that learns how to adaptively rank deep
convolutional features for selecting more informative representations. To this
end, we exploit discriminative analysis to project the features onto a space
spanned by the number of classes in the dataset under study. This maps the
notion of labels in the feature space into instances in the projected space. We
employ these projected distances as a measure to rank the existing features
with respect to their specific discriminant power for each individual class. We
then apply multipartite ranking to score the separability of the instances and
aggregate one-versus-all scores to compute an overall distinction score for
each feature. For pooling, we pick the features with the highest scores in a pooling window instead of using maximum, average, or stochastic (random) assignments. Our experiments on various benchmarks confirm that the proposed multipartite pooling strategy consistently improves the performance of deep convolutional networks through better generalization of the trained models to test-time data.
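A toy sketch of score-based pooling over 2x2 windows (NumPy; the per-activation scores are placeholders for the paper's multipartite ranking scores):

```python
# Toy score-based pooling: keep the value with the highest score per window.
import numpy as np

def score_pool2x2(fmap, scores):
    """fmap, scores: (H, W) arrays with even H, W; returns (H//2, W//2)."""
    H, W = fmap.shape
    v = fmap.reshape(H // 2, 2, W // 2, 2).transpose(0, 2, 1, 3).reshape(-1, 4)
    s = scores.reshape(H // 2, 2, W // 2, 2).transpose(0, 2, 1, 3).reshape(-1, 4)
    picked = v[np.arange(len(v)), s.argmax(axis=1)]  # top-scoring value per window
    return picked.reshape(H // 2, W // 2)
```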