Local feature weighting in nearest prototype classification
The distance metric is the cornerstone of nearest neighbor (NN)-based methods and, therefore, of nearest prototype (NP) algorithms, because they classify according to the similarity of the data. When the data is characterized by a set of features that may contribute to the classification task to different degrees, feature weighting or selection is required, sometimes in a local sense. However, local weighting is typically restricted to NN approaches. In this paper, we introduce local feature weighting (LFW) in NP classification. LFW gives each prototype its own weight vector, in contrast to the typical global weighting methods found in the NP literature, where all prototypes share the same one. Giving each prototype its own weight vector has a novel effect on the borders of the Voronoi regions generated: they become nonlinear. We have integrated LFW with a previously developed evolutionary nearest prototype classifier (ENPC). Experiments on both artificial and real data sets demonstrate that the resulting algorithm, which we call LFW in nearest prototype classification (LFW-NPC), avoids overfitting the training data in domains where features may contribute differently to the classification task in different areas of the feature space. This generalization capability is also reflected in automatically obtaining an accurate and reduced set of prototypes.
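The difference between global and per-prototype weighting can be illustrated with a minimal sketch. It assumes a weighted squared Euclidean distance and hand-picked prototypes and weights; the function name `predict_lfw` and all values are hypothetical, and the evolutionary procedure (ENPC) that the abstract uses to obtain prototypes and weights is not reproduced here.

```python
import numpy as np

def predict_lfw(X, prototypes, weights, labels):
    """Label each row of X with the class of its nearest prototype, where
    prototype j measures distance with its own weight vector weights[j]."""
    X = np.asarray(X, dtype=float)
    P = np.asarray(prototypes, dtype=float)
    W = np.asarray(weights, dtype=float)
    # diff[i, j, :] = X[i] - P[j]
    diff = X[:, None, :] - P[None, :, :]
    # d[i, j] = sum_k W[j, k] * (X[i, k] - P[j, k]) ** 2
    d = np.einsum("ijk,jk->ij", diff ** 2, W)
    return np.asarray(labels)[np.argmin(d, axis=1)]

# Two illustrative prototypes (assumed values, not taken from the paper).
prototypes = np.array([[0.0, 0.0], [2.0, 0.0]])
labels = np.array([0, 1])

# Local weighting: prototype 0 ignores the second feature.
local_w = np.array([[1.0, 0.0], [1.0, 1.0]])
# Global weighting: every prototype shares the same weight vector.
global_w = np.ones((2, 2))

x = np.array([[1.2, 3.0]])
pred_local = predict_lfw(x, prototypes, local_w, labels)
pred_global = predict_lfw(x, prototypes, global_w, labels)
print(pred_local, pred_global)
```

With a shared weight vector the quadratic terms of the two distances cancel, so the boundary between two prototypes is a hyperplane. With different weight vectors they do not cancel, which is why the borders become nonlinear.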
A Survey on Metric Learning for Feature Vectors and Structured Data
The need for appropriate ways to measure the distance or similarity between
data is ubiquitous in machine learning, pattern recognition and data mining,
but handcrafting such good metrics for specific problems is generally
difficult. This has led to the emergence of metric learning, which aims at
automatically learning a metric from data and has attracted a lot of interest
in machine learning and related fields for the past ten years. This survey
paper proposes a systematic review of the metric learning literature,
highlighting the pros and cons of each approach. We pay particular attention to
Mahalanobis distance metric learning, a well-studied and successful framework,
but additionally present a wide range of methods that have recently emerged as
powerful alternatives, including nonlinear metric learning, similarity learning
and local metric learning. Recent trends and extensions, such as
semi-supervised metric learning, metric learning for histogram data and the
derivation of generalization guarantees, are also covered. Finally, this survey
addresses metric learning for structured data, in particular edit distance
learning, and attempts to give an overview of the remaining challenges in
metric learning for the years to come.
Comment: Technical report, 59 pages. Changes in v2: fixed typos and improved
presentation. Changes in v3: fixed typos. Changes in v4: fixed typos and new
method
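For readers unfamiliar with the Mahalanobis framework mentioned in this abstract, the following sketch shows the standard parameterization: a positive semidefinite matrix M defines the distance sqrt((x - y)^T M (x - y)), and writing M = L^T L makes it equal to the Euclidean distance after the linear map L. The matrix `L` below is an arbitrary illustrative choice; no learning algorithm from the survey is implemented.

```python
import numpy as np

def mahalanobis(x, y, M):
    """Distance sqrt((x - y)^T M (x - y)) for a positive semidefinite M."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ M @ d))

# Illustrative linear map (assumed, not learned from data).
L = np.array([[2.0, 0.0],
              [0.0, 0.5]])
M = L.T @ L  # positive semidefinite by construction

x = np.array([1.0, 2.0])
y = np.array([0.0, 0.0])

d_m = mahalanobis(x, y, M)
# Same value obtained as a plain Euclidean distance in the transformed space.
d_e = float(np.linalg.norm(L @ x - L @ y))
print(d_m, d_e)
```

Metric learning methods in this family differ mainly in how they choose M (or L) from supervision such as similar and dissimilar pairs.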
Scalable Feature Selection Using ReliefF Aided by Locality-Sensitive Hashing
Funded for open-access publication: Universidade da Coruña/CISUG
[Abstract] Feature selection algorithms, such as ReliefF, are very important for processing high-dimensionality data sets. However, widespread use of such popular and effective algorithms is limited by their computational cost. We describe an adaptation of the ReliefF algorithm that simplifies its costliest step by approximating the nearest neighbor graph using locality-sensitive hashing (LSH). The resulting ReliefF-LSH algorithm can process data sets that are too large for the original ReliefF, a capability further enhanced by distributed implementation in Apache Spark. Furthermore, ReliefF-LSH obtains better results and is more generally applicable than currently available alternatives to the original ReliefF, as it can handle regression and multiclass data sets. The fact that it does not require any additional hyperparameters with respect to ReliefF also avoids costly tuning. A set of experiments demonstrates the validity of this new approach and confirms its good scalability.
This study has been supported in part by the Spanish Ministerio de Economía y Competitividad (projects PID2019-109238GB-C2 and TIN 2015-65069-C2-1-R and 2-R), partially funded by FEDER funds of the EU and by the Xunta de Galicia (projects ED431C 2018/34 and Centro Singular de Investigación de Galicia, accreditation 2016-2019). The authors wish to thank the Fundación Pública Galega Centro Tecnolóxico de Supercomputación de Galicia (CESGA) for the use of their computing resources.
Xunta de Galicia; ED431C 2018/3
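To make the costly step concrete, here is a minimal sketch of the basic Relief weight update for two classes with exact nearest-neighbor search. It is not ReliefF-LSH: there is no LSH, no Spark, no k-neighbor averaging and no multiclass or regression handling. The function name `relief` and the toy data are assumptions for illustration, and each class is assumed to have at least two samples.

```python
import numpy as np

def relief(X, y):
    """Basic two-class Relief: reward features that separate an instance from
    its nearest miss and penalize features that separate it from its nearest hit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    # Scale every feature to [0, 1] so per-feature differences are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span
    w = np.zeros(d)
    for i in range(n):
        # Exact search over all instances: the O(n^2) step that an
        # approximate neighbor graph would replace.
        dist = np.abs(Xs - Xs[i]).sum(axis=1)
        dist[i] = np.inf  # never pick the instance itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))
        miss = np.argmin(np.where(~same, dist, np.inf))
        w += np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])
    return w / n

# Toy data: only the first feature determines the class.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])
w = relief(X, y)
print(w)
```

The neighbor search inside the loop dominates the cost on large data sets, which is the part the abstract describes approximating with locality-sensitive hashing.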
Learning Heterogeneous Similarity Measures for Hybrid-Recommendations in Meta-Mining
The notion of meta-mining has appeared recently and extends the traditional
meta-learning in two ways. First, it does not learn meta-models that provide
support only for the learning algorithm selection task but ones that support
the whole data-mining process. In addition, it abandons the so-called black-box
approach to algorithm description followed in meta-learning. Now, in addition to
the datasets, algorithms and workflows also have descriptors. For the
latter two these descriptions are semantic, describing properties of the
algorithms. With the availability of descriptors both for datasets and data
mining workflows the traditional modelling techniques followed in
meta-learning, typically based on classification and regression algorithms, are
no longer appropriate. Instead we are faced with a problem the nature of which
is much more similar to the problems that appear in recommendation systems. The
most important meta-mining requirements are that suggestions should use only
dataset and workflow descriptors and that the cold-start problem be handled,
e.g. by providing workflow suggestions for new datasets.
In this paper we take a different view on the meta-mining modelling problem
and treat it as a recommender problem. In order to account for the meta-mining
specificities we derive a novel metric-based-learning recommender approach. Our
method learns two homogeneous metrics, one in the dataset and one in the
workflow space, and a heterogeneous one in the dataset-workflow space. All
learned metrics reflect similarities established from the dataset-workflow
preference matrix. We demonstrate our method on meta-mining over biological
(microarray datasets) problems. The application of our method is not limited to
the meta-mining problem; its formulation is general enough that it can be
applied to problems with similar requirements.
Robocrystallographer: Automated crystal structure text descriptions and analysis
Our ability to describe crystal structure features is of crucial importance when attempting to understand structure-property relationships in the solid state. In this paper, the authors introduce robocrystallographer, an open-source toolkit for analyzing crystal structures. This package combines new and existing open-source analysis tools to provide structural information, including the local coordination and polyhedral type, polyhedral connectivity, octahedral tilt angles, component-dimensionality, and molecule-within-crystal and fuzzy prototype identification. Using this information, robocrystallographer can generate text-based descriptions of crystal structures that resemble descriptions written by human crystallographers. The authors use robocrystallographer to investigate the dimensionalities of all compounds in the Materials Project database and highlight its potential in machine learning studies
Efficient Feature Subset Selection Algorithm for High Dimensional Data
Feature selection addresses the dimensionality problem by removing irrelevant and redundant features. Existing feature selection algorithms take a long time to obtain a feature subset for high-dimensional data. This paper proposes a feature selection algorithm based on information gain measures for high-dimensional data, termed IFSA (Information gain based Feature Selection Algorithm), to produce an optimal feature subset in efficient time and improve the computational performance of learning algorithms. IFSA works in two steps: first, it applies a filter to the dataset; second, it produces a small feature subset using the information gain measure. Extensive experiments are carried out to compare the proposed algorithm with other methods with respect to two different classifiers (Naive Bayes and IBK) on microarray and text data sets. The results demonstrate that IFSA not only produces the most select feature subset in efficient time but also improves classifier performance.
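As a rough illustration of ranking by information gain, the sketch below computes IG(Y; X_j) = H(Y) - H(Y | X_j) for discrete features and keeps the top k. This is only the generic information gain measure, not IFSA itself: the abstract does not specify the initial filter step, so it is omitted, and the helper names and toy data are hypothetical.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of a discrete label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(x, y):
    """H(y) - H(y | x) for a discrete feature x."""
    x = np.asarray(x)
    y = np.asarray(y)
    conditional = 0.0
    for value in np.unique(x):
        mask = x == value
        # Weight each branch by the fraction of samples that fall into it.
        conditional += mask.mean() * entropy(y[mask])
    return entropy(y) - conditional

def top_k_features(X, y, k):
    """Indices of the k features with the highest information gain."""
    X = np.asarray(X)
    gains = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(-gains, kind="stable")
    return order[:k], gains

# Toy data: feature 0 equals the label, feature 1 is uninformative.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])
selected, gains = top_k_features(X, y, k=1)
print(selected, gains)
```

Continuous features, such as microarray expression levels, would need to be discretized before this measure applies.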
Optimized jk-nearest neighbor based online signature verification and evaluation of the main parameters
In this paper, we propose an enhanced jk-nearest neighbor (jk-NN) classifier for online signature verification. After studying the algorithm's main parameters, we use four separate databases to present and evaluate each algorithm parameter. The results show that the proposed method can increase the verification accuracy by 0.73-10% compared to a traditional one-class k-NN classifier. The algorithm has achieved reasonable accuracy for different databases: a 3.93% error rate when using the SVC2004 database, 2.6% for the MCYT-100 database, 1.75% for the SigComp'11 database, and 6% for the SigComp'15 database. The proposed algorithm uses specifically chosen parameters and a procedure to pick the optimal value for K using only the signer's reference signatures, to build a practical verification system for real-life scenarios where only these signatures are available. By applying the proposed algorithm, the average error achieved was 8% for SVC2004, 3.26% for MCYT-100, 13% for SigComp'15, and 2.22% for SigComp'11.
Case-based Reasoning Method for Real-time Expert Diagnostics Systems
The method of case-based reasoning for solving problems of real-time diagnostics and
forecasting in intelligent decision support systems (IDSS) is considered. Special attention is paid to the case library
structure for real-time IDSS (RT IDSS) and to an algorithm of the k-nearest neighbors type. This work was supported by
RFBR