Improved Heterogeneous Distance Functions
Instance-based learning techniques typically handle continuous and linear
input values well, but often do not handle nominal input attributes
appropriately. The Value Difference Metric (VDM) was designed to find
reasonable distance values between nominal attribute values, but it largely
ignores continuous attributes, requiring discretization to map continuous
values into nominal values. This paper proposes three new heterogeneous
distance functions, called the Heterogeneous Value Difference Metric (HVDM),
the Interpolated Value Difference Metric (IVDM), and the Windowed Value
Difference Metric (WVDM). These new distance functions are designed to handle
applications with nominal attributes, continuous attributes, or both. In
experiments on 48 applications the new distance metrics achieve higher
classification accuracy on average than three previous distance functions on
those datasets that have both nominal and continuous attributes. Comment: See http://www.jair.org/ for an online appendix and other files accompanying this article.
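To make the idea concrete, here is a rough sketch of an HVDM-style heterogeneous distance (an illustration based on the abstract's description, not the authors' exact implementation): continuous attributes are normalized by four standard deviations, nominal attributes use the normalized-VDM difference between conditional class distributions, and the per-attribute differences are combined Euclidean-style. All names below are hypothetical.

```python
import math
from collections import defaultdict

def fit_hvdm(X, y, nominal, n_classes):
    """Precompute per-attribute statistics for an HVDM-style metric.
    X: list of instances, y: labels in 0..n_classes-1,
    nominal: set of indices of nominal attributes."""
    stats = []
    for a in range(len(X[0])):
        col = [row[a] for row in X]
        if a in nominal:
            counts = defaultdict(lambda: [0] * n_classes)  # value -> per-class counts
            for v, c in zip(col, y):
                counts[v][c] += 1
            # P(class | value) for each observed nominal value
            stats.append({v: [n / sum(cs) for n in cs] for v, cs in counts.items()})
        else:
            mean = sum(col) / len(col)
            sd = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
            stats.append(4 * sd if sd > 0 else 1.0)  # continuous normalizer: 4 sigma
    return stats

def hvdm(x, z, stats, nominal):
    """HVDM-style distance: sqrt of the summed squared per-attribute differences."""
    total = 0.0
    for a, s in enumerate(stats):
        if a in nominal:
            px, pz = s.get(x[a]), s.get(z[a])
            if px is None or pz is None:
                d = 1.0  # unseen nominal value: treat as maximally different
            else:
                # normalized VDM: distance between conditional class distributions
                d = math.sqrt(sum((p - q) ** 2 for p, q in zip(px, pz)))
        else:
            d = abs(x[a] - z[a]) / s
        total += d * d
    return math.sqrt(total)
```

The key property is that nominal values are compared by how similarly they predict the class, while continuous values are scaled so both attribute types contribute on a comparable range.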
A Survey of Classification Methods
Classification may refer to categorization, the process in which ideas and objects are recognized, differentiated, and understood. There are many types of classification, and researchers face the problem of choosing a suitable method that gives good performance on their classification problems. In this paper, we present the basic classification techniques, covering several major kinds of classification method: neural networks, decision trees, Bayesian networks, support vector machines, and the k-nearest neighbor classifier. The goal of this survey is to provide a comprehensive review of these different classification techniques.
Machine learning approximation techniques using dual trees
This master thesis explores a dual-tree framework as applied to a particular class of machine learning problems that are collectively referred to as generalized n-body problems. It builds a new algorithm on top of this framework and improves the existing Boosted OGE classifier.
High-dimensional indexing methods utilizing clustering and dimensionality reduction
The emergence of novel database applications has resulted in the prevalence of a new paradigm for similarity search. These applications include multimedia databases, medical imaging databases, time series databases, DNA and protein sequence databases, and many others. Features of data objects are extracted and transformed into high-dimensional data points. Searching for objects becomes a search on points in the high-dimensional feature space. The dissimilarity between two objects is determined by the distance between two feature vectors. Similarity search is usually implemented as nearest neighbor search in feature vector spaces. The cost of processing k-nearest neighbor (k-NN) queries via a sequential scan increases as the number of objects and the number of features increase. A variety of multi-dimensional index structures have been proposed to improve the efficiency of k-NN query processing; these work well in low-dimensional space but lose their efficiency in high-dimensional space due to the curse of dimensionality. This inefficiency is dealt with in this study by Clustering and Singular Value Decomposition - CSVD with indexing, the Persistent Main Memory - PMM index, and the Stepwise Dimensionality Increasing - SDI-tree index.
CSVD is an approximate nearest neighbor search method. The performance of CSVD with indexing is studied, and the approximation to the distance in the original space is investigated. For a given Normalized Mean Square Error - NMSE, the higher the degree of clustering, the higher the recall. However, more clusters require more disk page accesses. An appropriate number of clusters can be chosen to achieve higher recall while keeping query processing cost relatively low.
The Clustering and Indexing using Persistent Main Memory - CIPMM framework is motivated by the following considerations: (a) a significant fraction of index pages are accessed randomly, incurring a high positioning time for each access; (b) disk transfer rate is improving 40% annually, while the improvement in positioning time is only 8%; (c) query processing incurs less CPU time for main memory resident than for disk resident indices. CIPMM aims at reducing the elapsed time for query processing by utilizing sequential, rather than random, disk accesses. A specific instance of the CIPMM framework, CIPOP, indexing using the Persistent Ordered Partition - OP-tree, is elaborated and compared with clustering and indexing using the SR-tree, CISR. The results show that CIPOP outperforms CISR, and the higher the dimensionality, the higher the performance gains.
The SDI-tree index is motivated by two observations: fanouts decrease as dimensionality increases, and shorter vectors reduce cache misses. The index is built from feature vectors transformed via principal component analysis, resulting in a structure with fewer dimensions at higher levels and an increasing number of dimensions from one level to the next. Dimensions are retained in nonincreasing order of their variance according to a parameter p, which specifies the incremental fraction of variance at each level of the index. Experiments on three datasets have shown that SDI-trees with carefully tuned parameters incur fewer disk accesses than SR-trees and VAMSR-trees, and also incur less CPU time than VA-Files.
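A minimal sketch of the idea shared by CSVD and the SDI-tree - project vectors onto their leading principal components and search in the reduced space - might look as follows (a toy illustration under assumed names, not the dissertation's actual index structures):

```python
import numpy as np

def pca_fit(X, k):
    """Fit a k-dimensional PCA projection via SVD of the centered data."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]              # mean and the top-k principal directions

def approx_knn(X, mu, comps, q, k=1):
    """Approximate k-NN: compare query and database in the reduced space."""
    Xr = (X - mu) @ comps.T        # project database (in practice precomputed)
    qr = (q - mu) @ comps.T        # project query
    d = np.linalg.norm(Xr - qr, axis=1)
    return np.argsort(d)[:k]       # indices of the k closest reduced vectors
```

Because an orthogonal projection can only shrink Euclidean distances, recall depends on how much variance the retained components capture; that is the trade-off the NMSE parameter controls in CSVD.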
A study of distance-based machine learning algorithms
Distance-based algorithms are machine learning algorithms that classify queries by computing distances between these queries and a number of internally stored exemplars. Exemplars that are closest to the query have the largest influence on the classification assigned to the query. Two specific distance-based algorithms, the nearest neighbor algorithm and the nearest-hyperrectangle algorithm, are studied in detail.
It is shown that the k-nearest neighbor algorithm (kNN) outperforms the first-nearest neighbor algorithm only under certain conditions. Data sets must contain moderate amounts of noise. Training examples from the different classes must belong to clusters that allow an increase in the value of k without reaching into clusters of other classes. Methods for choosing the value of k for kNN are investigated. It is shown that one-fold cross-validation on a restricted number of values for k suffices for best performance. It is also shown that for best performance the votes of the k-nearest neighbors of a query should be weighted in inverse proportion to their distances from the query.
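The inverse-distance voting scheme described above can be sketched as follows (a minimal illustration of the general technique, not the dissertation's implementation):

```python
import math
from collections import defaultdict

def weighted_knn(train, query, k):
    """Classify `query` from `train` = [(vector, label), ...] using the k
    nearest neighbors, with votes weighted in inverse proportion to distance."""
    neighbors = sorted(train, key=lambda t: math.dist(t[0], query))[:k]
    votes = defaultdict(float)
    for vec, label in neighbors:
        votes[label] += 1.0 / (math.dist(vec, query) + 1e-12)  # avoid div by zero
    return max(votes, key=votes.get)
```

With this weighting, a single very close neighbor can outvote several distant ones, which is what makes the scheme more robust than unweighted majority voting near class boundaries.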
Principal component analysis is shown to reduce the number of relevant dimensions substantially in several domains. Two methods for learning feature weights for a weighted Euclidean distance metric are proposed. These methods improve the performance of kNN and NN in a variety of domains.
The nearest-hyperrectangle algorithm (NGE) is found to give predictions that are substantially inferior to those given by kNN in a variety of domains. Experiments performed to understand this inferior performance led to the discovery of several improvements to NGE. Foremost of these is BNGE, a batch algorithm that avoids construction of overlapping hyperrectangles from different classes. Although it is generally superior to NGE, BNGE is still significantly inferior to kNN in a variety of domains. Hence, a hybrid algorithm (KBNGE), which uses BNGE in parts of the input space that can be represented by a single hyperrectangle and kNN otherwise, is introduced.
The primary contributions of this dissertation are (a) several improvements to existing distance-based algorithms, (b) several new distance-based algorithms, and (c) an experimentally supported understanding of the conditions under which various distance-based algorithms are likely to give good performance.
Batch learning of disjoint feature intervals
Ankara : Department of Computer Engineering and Information Science and the Institute of Engineering and Science of Bilkent University, 1996. Thesis (Master's) -- Bilkent University, 1996. Includes bibliographical references, leaves 98-104.
This thesis presents several learning algorithms for multi-concept descriptions
in the form of disjoint feature intervals, called Feature Interval Learning algorithms
(FIL). These algorithms are batch supervised inductive learning algorithms,
and use feature projections of the training instances for the representation
of the classification knowledge induced. These projections can be generalized
into disjoint feature intervals. Therefore, the concept description learned
is a set of disjoint intervals separately for each feature. The classification of
an unseen instance is based on the weighted majority voting among the local
predictions of features. In order to handle noisy instances, several extensions
are developed by placing weights to intervals rather than features. Empirical
evaluation of the FIL algorithms is presented and compared with some other
similar classification algorithms. Although the FIL algorithms achieve comparable
accuracies to those of the other algorithms, their average running times are
much lower.
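A toy sketch of classifying with disjoint feature intervals may help fix the idea (a hypothetical representation for continuous features, not the FIL algorithms as specified in the thesis):

```python
from collections import Counter

def fil_predict(intervals, query):
    """intervals[a] = list of disjoint (lo, hi, label) intervals for feature a.
    Each feature votes for the label of the interval containing the query's
    value on that feature; the majority label wins."""
    votes = Counter()
    for a, value in enumerate(query):
        for lo, hi, label in intervals[a]:
            if lo <= value <= hi:
                votes[label] += 1   # one local prediction per feature
                break
    return votes.most_common(1)[0][0] if votes else None
```

The thesis's extensions would attach weights to the intervals rather than counting each feature's vote equally, which is how noisy intervals are de-emphasized.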
This thesis also presents a new adaptation of the well-known k-NN classification algorithm to the feature projections approach, called k-NNFP for k-Nearest Neighbor on Feature Projections, based on a majority voting on individual classifications made by the projections of the training set on each feature, and compares it with the k-NN algorithm on some real-world and artificial datasets.
Akkuş, Aynur. M.S.
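The per-feature voting behind k-NNFP can be sketched roughly as follows (a simplified illustration for continuous features, not the thesis's algorithm verbatim):

```python
from collections import defaultdict

def knnfp_predict(X, y, query, k):
    """Classify by summing, over each feature, the votes of the k training
    instances whose projection on that feature lies closest to the query's."""
    votes = defaultdict(int)
    for a in range(len(query)):
        nearest = sorted(range(len(X)), key=lambda i: abs(X[i][a] - query[a]))[:k]
        for i in nearest:
            votes[y[i]] += 1        # one vote per neighbor per feature
    return max(votes, key=votes.get)
```

Unlike standard k-NN, no multi-dimensional distance is ever computed: each feature projection votes independently, which keeps the method cheap and tolerant of irrelevant features that merely add uniform noise to the vote.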