20 research outputs found

    Study and Observation of the Variation of Accuracies of KNN, SVM, LMNN, ENN Algorithms on Eleven Different Datasets from UCI Machine Learning Repository

    Machine learning enables computers to learn from data without being explicitly programmed [1, 2]. Machine learning can be classified into supervised and unsupervised learning. In supervised learning, computers learn a function that maps an input to an output based on example input-output pairs [3]. Among the most efficient and widely used supervised learning algorithms are K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Large Margin Nearest Neighbor (LMNN), and Extended Nearest Neighbor (ENN). The main contribution of this paper is to implement these learning algorithms on eleven different datasets from the UCI machine learning repository and observe how the accuracy of each algorithm varies across the datasets. Analyzing the accuracies gives a first indication of the relationship between the machine learning algorithms and the dimensionality of the data. All the algorithms are implemented in Matlab. Based on the observed accuracies, KNN, SVM, LMNN, and ENN can then be compared with respect to their performance on each dataset. Comment: To be published in the 4th IEEE International Conference on Electrical Engineering and Information & Communication Technology (iCEEiCT 2018).
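    A minimal sketch of the kind of per-dataset accuracy comparison described above, written in Python with scikit-learn rather than the authors' Matlab code; the datasets, classifiers, and hyperparameters below are illustrative stand-ins, not the eleven UCI datasets used in the paper.

```python
# Sketch: compare KNN and SVM test accuracy across several datasets and
# report each dataset's dimensionality alongside the scores.
from sklearn.datasets import load_iris, load_wine, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

datasets = {"iris": load_iris(), "wine": load_wine(), "breast_cancer": load_breast_cancer()}
models = {"KNN": KNeighborsClassifier(n_neighbors=5), "SVM": SVC(kernel="rbf")}

for name, data in datasets.items():
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, test_size=0.3, random_state=0)
    for model_name, model in models.items():
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        print(f"{name:>14s}  {model_name}: {acc:.3f}  (dim={data.data.shape[1]})")
```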

    An Efficient Dual Approach to Distance Metric Learning

    Distance metric learning is of fundamental interest in machine learning because the distance metric employed can significantly affect the performance of many learning methods. Quadratic Mahalanobis metric learning is a popular approach to the problem, but typically requires solving a semidefinite programming (SDP) problem, which is computationally expensive. Standard interior-point SDP solvers typically have a complexity of O(D^6.5) (with D the dimension of the input data), and can thus only practically solve problems with fewer than a few thousand variables. Since the number of variables is D(D+1)/2, this implies a practical limit of around a few hundred dimensions. The complexity of the popular quadratic Mahalanobis metric learning approach thus limits the size of problem to which metric learning can be applied. Here we propose a significantly more efficient approach to the metric learning problem based on the Lagrange dual formulation of the problem. The proposed formulation is much simpler to implement, and therefore allows much larger Mahalanobis metric learning problems to be solved. The time complexity of the proposed method is O(D^3), which is significantly lower than that of the SDP approach. Experiments on a variety of datasets demonstrate that the proposed method achieves an accuracy comparable to the state of the art, but is applicable to significantly larger problems. We also show that the proposed method can be applied to solve more general Frobenius-norm regularized SDP problems approximately.
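    A hedged sketch of the O(D^3) primitive that underlies many eigendecomposition-based metric learning solvers: projecting a symmetric matrix onto the positive semidefinite cone. This is not the paper's full Lagrange-dual algorithm, only an illustration of why a per-step cost of O(D^3) is attainable; NumPy is assumed.

```python
# Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues.
# The eigendecomposition is the O(D^3) step.
import numpy as np

def project_psd(A):
    """Return the nearest (Frobenius-norm) PSD matrix to the symmetric part of A."""
    A_sym = (A + A.T) / 2.0                   # symmetrize to guard against round-off
    eigvals, eigvecs = np.linalg.eigh(A_sym)  # O(D^3) eigendecomposition
    eigvals_clipped = np.clip(eigvals, 0.0, None)
    return (eigvecs * eigvals_clipped) @ eigvecs.T

# Example: a random indefinite symmetric matrix becomes a valid Mahalanobis matrix.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
M_psd = project_psd(M + M.T)
print(np.linalg.eigvalsh(M_psd).min() >= -1e-10)  # True: all eigenvalues non-negative
```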

    Using Asymmetric Classification Cost Matrices in Predicting Diabetes

    Often there is a need to introduce classification costs into a classifier for predicting disease. This is determined by the type of disease, its associated classification cost matrix, and/or the target population on which the classifier will be used. Diabetes has higher costs associated with false negatives than with true positives, as the disease can progress very rapidly when left untreated. There are two ways to skew a classifier towards a given classification cost matrix: (1) by changing the classification probability threshold P* based on the classification cost matrix, or (2) by rebalancing the training set to introduce more negative cases. Using a diabetes data set, this paper compares the two methods. The results indicate comparable values of predictive accuracy and expected classification costs for either method. However, P* works better when the p-value is less than 0.2. Hence, for diabetes classification matrices, the P* method is recommended.
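    A small sketch of method (1), shifting the classification threshold P* according to a cost matrix instead of using the default 0.5; the costs, dataset, and classifier below are illustrative assumptions, not those of the paper.

```python
# Cost-sensitive thresholding: derive P* from the misclassification costs and
# apply it to predicted probabilities from any classifier exposing predict_proba.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cost_fn = 5.0   # cost of a false negative (missed disease) -- illustrative value
cost_fp = 1.0   # cost of a false positive -- illustrative value
# Bayes-optimal threshold for a binary problem with costs only on the two error types:
p_star = cost_fp / (cost_fp + cost_fn)

X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset; class 1 treated as "positive"
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)

proba_pos = clf.predict_proba(X_te)[:, 1]
y_pred = (proba_pos >= p_star).astype(int)   # lower threshold skews toward predicting positive
print(f"P* = {p_star:.2f}, positives predicted: {y_pred.sum()} of {len(y_pred)}")
```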

    Exploiting diversity for optimizing margin distribution in ensemble learning

    Margin distribution is acknowledged as an important factor for improving the generalization performance of classifiers. In this paper, we propose a novel ensemble learning algorithm named Double Rotation Margin Forest (DRMF), which aims to improve the margin distribution of the combined system over the training set. We utilise random rotation to produce diverse base classifiers, and optimize the margin distribution to exploit this diversity in producing an optimal ensemble. We demonstrate that diverse base classifiers are beneficial in deriving large-margin ensembles, and that our proposed technique therefore leads to good generalization performance. We examine our method on an extensive set of benchmark classification tasks. The experimental results confirm that DRMF outperforms classical ensemble algorithms such as Bagging, AdaBoostM1, and Rotation Forest. The success of DRMF is explained from the viewpoints of margin distribution and diversity.
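    A brief sketch of the ensemble margin statistic that such methods aim to improve, computed here for a plain bagged-tree ensemble standing in for the paper's randomly rotated base classifiers; scikit-learn and NumPy are assumed, and the dataset is an arbitrary stand-in.

```python
# Voting margin of each training point: (votes for the true class minus the
# most votes received by any other class) / number of base classifiers.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
ensemble = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0).fit(X, y)

votes = np.stack([est.predict(X) for est in ensemble.estimators_])  # (n_estimators, n_samples)
n_classes = len(np.unique(y))
margins = np.empty(len(y))
for i in range(len(y)):
    counts = np.bincount(votes[:, i].astype(int), minlength=n_classes)
    true_votes = counts[y[i]]
    other_votes = np.max(np.delete(counts, y[i]))
    margins[i] = (true_votes - other_votes) / len(ensemble.estimators_)

print(f"mean margin: {margins.mean():.3f}, fraction with margin > 0: {(margins > 0).mean():.3f}")
```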

    Composite Kernel Optimization in Semi-Supervised Metric Learning

    Machine-learning solutions to classification, clustering and matching problems critically depend on the adopted metric, which in the past was selected heuristically. In the last decade, it has been demonstrated that an appropriate metric can be learnt from data, resulting in superior performance as compared with traditional metrics. This has recently stimulated considerable interest in the topic of metric learning, especially using kernel functions, which map data to feature spaces with enhanced class separability and implicitly define a new metric in the original feature space. The formulation of the metric learning problem depends on the supervisory information available for the task. In this paper, we focus on semi-supervised kernel-based distance metric learning, where the training data set is unlabelled with the exception of a small subset of pairs of points labelled as belonging to the same class (cluster) or different classes (clusters). The proposed method involves creating a pool of kernel functions. The corresponding kernel matrices are first clustered to remove redundancy in representation. A composite kernel constructed from the kernel clustering result is then expanded into an orthogonal set of basis functions. The mixing parameters of this expansion are then optimised using point similarity and dissimilarity information conveyed by the labels. The proposed method is evaluated on synthetic and real data sets. The results show the merit of using similarity and dissimilarity information jointly as compared to using just the similarity information, and the superiority of the proposed method over all the recently introduced metric learning approaches.
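    A simplified sketch of the composite-kernel idea, assuming scikit-learn: a nonnegative mixture of base RBF kernels is scored by how well it separates must-link from cannot-link pairs. The orthogonal basis expansion and the optimisation used in the paper are replaced here by a coarse grid search, and the pairwise labels are synthetic.

```python
# Build a pool of base kernels, then pick mixing weights that make
# same-class (must-link) pairs more similar than different-class (cannot-link) pairs.
import itertools
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics.pairwise import rbf_kernel

X, y = load_iris(return_X_y=True)
base_kernels = [rbf_kernel(X, gamma=g) for g in (0.01, 0.1, 1.0)]  # pool of kernel matrices

rng = np.random.default_rng(0)
pairs = rng.integers(0, len(X), size=(200, 2))
must_link = pairs[y[pairs[:, 0]] == y[pairs[:, 1]]]      # labelled "same class"
cannot_link = pairs[y[pairs[:, 0]] != y[pairs[:, 1]]]    # labelled "different class"

def score(weights):
    K = sum(w * Kb for w, Kb in zip(weights, base_kernels))
    sim = K[must_link[:, 0], must_link[:, 1]].mean()
    dis = K[cannot_link[:, 0], cannot_link[:, 1]].mean()
    return sim - dis  # larger = better separation of similar vs. dissimilar pairs

# Coarse search over simplex-constrained mixing weights (a stand-in for the
# paper's optimisation of the expansion coefficients).
candidates = [w for w in itertools.product(np.linspace(0, 1, 6), repeat=3)
              if abs(sum(w) - 1.0) < 1e-9]
best = max(candidates, key=score)
print("best mixing weights:", best, "score:", round(score(best), 4))
```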