A Training Sample Sequence Planning Method for Pattern Recognition Problems
In solving pattern recognition problems, many classification methods, such as the nearest-neighbor (NN) rule, need to determine prototypes from a training set. To improve the performance of these classifiers in finding an efficient set of prototypes, this paper introduces a training sample sequence planning method. In particular, by estimating the relative nearness of the training samples to the decision boundary, the approach proposed here incrementally increases the number of prototypes until the desired classification accuracy has been reached. This approach has been tested with an NN classification method and a neural network training approach. Studies based on both artificial and real data demonstrate that higher classification accuracy can be achieved with fewer prototypes.
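The incremental scheme described above can be sketched in plain NumPy. This is our illustrative reading, not the paper's exact algorithm: nearness to the decision boundary is approximated by each sample's distance to its closest opposite-class sample, and boundary-near samples are added as 1-NN prototypes until a target validation accuracy is reached. The function names and the target-accuracy parameter are our assumptions.

```python
import numpy as np

def nn_accuracy(P, Py, X_val, y_val):
    """Classify each validation point by its nearest prototype."""
    d = np.linalg.norm(X_val[:, None, :] - P[None, :, :], axis=2)
    pred = Py[d.argmin(axis=1)]
    return (pred == y_val).mean()

def select_prototypes(X, y, X_val, y_val, target_acc=0.95):
    """Greedy prototype-selection sketch: rank training samples by an
    estimate of their nearness to the decision boundary (distance to the
    nearest sample of another class) and add them as 1-NN prototypes
    until the desired validation accuracy is reached."""
    # Nearness proxy: distance to the closest opposite-class sample.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    other = y[None, :] != y[:, None]
    boundary_dist = np.where(other, dists, np.inf).min(axis=1)
    order = np.argsort(boundary_dist)  # boundary-near samples first

    proto_idx = []
    for i in order:
        proto_idx.append(i)
        if nn_accuracy(X[proto_idx], y[proto_idx], X_val, y_val) >= target_acc:
            break
    return proto_idx
```

On separable data this typically reaches the target accuracy with far fewer prototypes than the full training set, which is the effect the paper reports.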
A Distributed and Approximated Nearest Neighbors Algorithm for an Efficient Large Scale Mean Shift Clustering
In this paper we target the class of modal clustering methods, where clusters are defined in terms of the local modes of the probability density function which generates the data. The most well-known modal clustering method is k-means. Mean Shift clustering generalizes k-means by computing arbitrarily shaped clusters, defined as the basins of attraction of the local modes created by the density gradient ascent paths. Despite its potential, Mean Shift is a computationally expensive method for unsupervised learning. We therefore introduce two contributions aiming to provide clustering algorithms with linear time complexity, as opposed to the quadratic time complexity of exact Mean Shift clustering. First, we propose a scalable procedure to approximate the density gradient ascent. Second, we present a scalable cluster labeling technique. Both propositions are based on Locality Sensitive Hashing (LSH) to approximate nearest neighbors. These two techniques may be used for moderate-sized datasets. Furthermore, we show that using our proposed approximations of the density gradient ascent as a pre-processing step in other clustering methods can also improve dedicated classification metrics. For the latter, a distributed implementation written for the Spark/Scala ecosystem is proposed. For all the considered clustering methods, we present experimental results illustrating their labeling accuracy and their potential to solve concrete problems.
Comment: Algorithms are available at https://github.com/Clustering4Ever/Clustering4Eve
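As a point of reference, the exact quadratic-time procedure that the paper accelerates can be sketched as follows. This is a minimal flat-kernel Mean Shift in plain NumPy with a brute-force neighbor search standing in for the LSH-approximated nearest neighbors; bandwidth and merge tolerance are illustrative parameters.

```python
import numpy as np

def mean_shift(X, bandwidth=1.0, n_iter=30, merge_tol=0.5):
    """Quadratic-time Mean Shift sketch: every point ascends the density
    gradient by repeatedly moving to the mean of the data points within
    `bandwidth` of it (flat kernel), then points whose modes coincide
    are given the same label (basins of attraction)."""
    modes = X.copy()
    for _ in range(n_iter):
        for i in range(len(modes)):
            near = X[np.linalg.norm(X - modes[i], axis=1) <= bandwidth]
            if len(near):
                modes[i] = near.mean(axis=0)  # gradient ascent step
    # Cluster labeling: merge modes that converged to the same location.
    labels = -np.ones(len(X), dtype=int)
    centers = []
    for i, m in enumerate(modes):
        for c, ctr in enumerate(centers):
            if np.linalg.norm(m - ctr) <= merge_tol:
                labels[i] = c
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels, np.array(centers)
```

The per-point neighbor query inside the loop is what costs O(n) per point (hence O(n^2) overall); replacing it with an LSH lookup is what yields the paper's linear-time approximation.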
Optimal Recovery of Local Truth
Probability mass curves the data space with horizons. Let f be a multivariate probability density function with continuous second-order partial derivatives. Consider the problem of estimating the true value of f(z) > 0 at a single point z from n independent observations. It is shown that the fastest possible estimators (such as the k-nearest-neighbor and kernel estimators) have minimum asymptotic mean square errors when the space of observations is thought of as conformally curved. The optimal metric is shown to be generated by the Hessian of f in the regions where the Hessian is definite. Thus, the peaks and valleys of f are surrounded by singular horizons where the Hessian changes signature from Riemannian to pseudo-Riemannian. Adaptive estimators based on the optimal variable metric show considerable theoretical and practical improvements over traditional methods. The formulas simplify dramatically when the dimension of the data space is 4. The similarities with General Relativity are striking but possibly illusory at this point. However, these results suggest that nonparametric density estimation may have something new to say about current physical theory.
Comment: To appear in Proceedings of Maximum Entropy and Bayesian Methods 1999. Check also: http://omega.albany.edu:8008
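For context, the flat-metric baseline that the paper's variable-metric estimators improve on is the standard k-nearest-neighbor density estimate, f_hat(z) = k / (n * V_d * r_k^d), where r_k is the distance from z to its k-th nearest observation and V_d is the volume of the unit d-ball. The sketch below implements only this Euclidean baseline; the paper's refinement replaces the Euclidean distance with a metric generated by the Hessian of f.

```python
import numpy as np
from math import gamma, pi

def knn_density(z, X, k=50):
    """Standard k-NN density estimate at a single query point z:
    f_hat(z) = k / (n * V_d * r_k^d), using the Euclidean metric.
    X is an (n, d) array of observations."""
    n, d = X.shape
    # Distance to the k-th nearest observation.
    r_k = np.sort(np.linalg.norm(X - z, axis=1))[k - 1]
    # Volume of the unit ball in d dimensions.
    v_d = pi ** (d / 2) / gamma(d / 2 + 1)
    return k / (n * v_d * r_k ** d)
```

On samples drawn uniformly from the unit square this returns values near 1 at interior points, as expected for the uniform density.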
Vehicle positioning in urban environments using particle filtering-based global positioning system, odometry, and map data fusion
This article presents a new method for land vehicle navigation using the global positioning system (GPS), a dead reckoning (DR) sensor, and digital road map information, particularly in urban environments where GPS failures can occur. The odometer sensors and map measurements can be used to provide continuous navigation and to correct the vehicle location in the presence of GPS masking. To solve this estimation problem for vehicle navigation, we propose to use particle filtering for GPS/odometer/map integration. The particle filter is a method based on Bayesian estimation and the Monte Carlo method, which handles non-linear models and is not limited to Gaussian statistics. When the GPS sensor cannot provide a location because too few satellites are in view, the filter fuses the limited GPS pseudo-range data to enhance the vehicle positioning. The developed filter is then tested in a transportation network scenario in the presence of GPS failures, which shows the advantages of the proposed approach for vehicle location compared to the extended Kalman filter.
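The predict/weight/resample cycle at the heart of such a filter can be sketched in a few lines. This is a generic 1-D illustration of the particle-filtering principle the article applies, not the article's GPS/odometer/map filter: the state is a scalar position, the odometry increment drives the prediction, and a GPS reading (when available) reweights the particles. All noise levels are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, odom_delta, gps_obs=None,
                         odom_noise=0.5, gps_noise=2.0):
    """One cycle of a minimal 1-D particle filter.
    Predict: propagate each particle with the noisy dead-reckoning
    increment. Update: when a GPS observation is available, weight the
    particles by its Gaussian likelihood and resample; during GPS
    masking the particles coast on odometry alone."""
    particles = particles + odom_delta + rng.normal(0, odom_noise, len(particles))
    if gps_obs is not None:
        weights = weights * np.exp(-0.5 * ((gps_obs - particles) / gps_noise) ** 2)
        weights /= weights.sum()
        idx = rng.choice(len(particles), len(particles), p=weights)
        particles = particles[idx]
        weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights
```

Because the update is a likelihood weighting rather than a linearization, non-Gaussian and multimodal position beliefs survive GPS outages, which is the advantage over the extended Kalman filter noted above.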
Minimum local distance density estimation
We present a local density estimator based on first-order statistics. To estimate the density at a point, x, the original sample is divided into subsets and the average minimum sample distance to x over all such subsets is used to define the density estimate at x. The tuning parameter is thus the number of subsets instead of the typical bandwidth of kernel or histogram-based density estimators. The proposed method is similar to nearest-neighbor density estimators but it provides smoother estimates. We derive the asymptotic distribution of this minimum sample distance statistic to study globally optimal values for the number and size of the subsets. Simulations are used to illustrate and compare the convergence properties of the estimator. The results show that the method provides good estimates of a wide variety of densities without changes of the tuning parameter, and that it offers competitive convergence performance.
United States. Department of Energy. Applied Mathematical Sciences Program (Awards DE-FG02-08ER2585 and DE-SC0009297)
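A plausible 1-D reading of the construction above can be sketched as follows. The normalization uses the fact that, for s points drawn from a density f, the expected minimum distance to an interior point x is roughly 1 / (2 * s * f(x)); that constant is our illustrative assumption, not the paper's exact calibration. Note that the tuning parameter is the number of subsets, not a bandwidth.

```python
import numpy as np

def mld_density(x, sample, n_subsets=50, seed=0):
    """Minimum-local-distance density sketch (1-D): shuffle the sample
    into `n_subsets` subsets, average the minimum distance from x to
    each subset, and invert via E[min dist] ~= 1 / (2 * s * f(x))."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(sample), n_subsets)
    mean_min = np.mean([np.abs(p - x).min() for p in parts])
    s = len(sample) // n_subsets  # subset size
    return 1.0 / (2.0 * s * mean_min)
```

Averaging the minimum distance over many subsets is what smooths the estimate relative to a single nearest-neighbor distance, mirroring the comparison to nearest-neighbor estimators made above.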
A Systematic Review of Learning-based Notion Change Acceptance Strategies for Incremental Mining
Data generated by contemporary communication environments is dynamic in content, unlike the static data environments of the past. High-speed streams carry huge volumes of digital data with rapid context changes, whereas static data is mostly stationary. Strategies built for static data run into several inapplicable assumptions when used to extract, classify, and explore relevant information from enormous, fast-varying streams. Learning strategies for static data rely on observable, established notion changes to explore the data; in high-speed data streams, by contrast, no fixed rules or drift strategies exist beforehand, so classification mechanisms must develop their own learning schemes for notion changes and Notion Change Acceptance: changing the existing notion, substituting it, or creating new notions, each evaluated during classification against the previous, existing, and newly arriving notions. Research in this field has devised numerous data stream mining strategies for determining, predicting, and establishing notion changes, and for accurately predicting the next occurrence of a Notion Change. In this context of feasible and relevant knowledge discovery, this paper surveys, with a nomenclature, contemporary benchmark models in data stream mining for adapting to Notion Change.
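The acceptance loop described above (monitor the current notion, then change, substitute, or create when it no longer fits the stream) can be illustrated with a toy drift monitor. This is our generic sketch of the principle, not any specific model from the survey; the window size and error threshold are illustrative.

```python
from collections import deque

class NotionChangeMonitor:
    """Toy notion-change detector: track a classifier's recent errors in
    a sliding window and signal a notion change (retrain, substitute, or
    create a new notion) when the windowed error rate exceeds a
    threshold."""
    def __init__(self, window=100, threshold=0.3):
        self.errors = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, predicted, actual):
        """Record one prediction; return True when the window is full
        and its error rate indicates the current notion has drifted."""
        self.errors.append(predicted != actual)
        full = len(self.errors) == self.errors.maxlen
        rate = sum(self.errors) / len(self.errors)
        return full and rate > self.threshold
```

A stable stream keeps the signal off; a sustained shift in the label-generating notion pushes the windowed error past the threshold and triggers acceptance of a new notion.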