Applicability of semi-supervised learning assumptions for gene ontology terms prediction
Gene Ontology (GO) is one of the most important resources in bioinformatics, aiming to provide a unified framework for the biological annotation of genes and proteins across all species. Predicting GO terms is an essential task in bioinformatics, but the number of available labelled proteins is often insufficient for training reliable machine learning classifiers. Semi-supervised learning methods arise as a powerful solution that exploits the information contained in unlabelled data to improve the estimations of traditional supervised approaches. However, semi-supervised learning methods must make strong assumptions about the nature of the training data, and the performance of the predictor is therefore highly dependent on these assumptions. This paper presents an analysis of the applicability of semi-supervised learning assumptions to the specific task of GO term prediction, aimed at providing criteria for choosing the most suitable tools for specific GO terms. The results show that semi-supervised approaches significantly outperform traditional supervised methods and that the highest performance is reached when applying the cluster assumption. Moreover, it is experimentally demonstrated that the cluster and manifold assumptions are complementary to each other, and an analysis of which GO terms are more likely to be correctly predicted under each assumption is provided.
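The cluster assumption the abstract highlights can be illustrated with a small self-training sketch. This is not the paper's method: the synthetic two-cluster data, the single seed label per cluster, and the greedy nearest-neighbour rule are all invented here purely to show how labels spread when points in the same cluster are assumed to share a class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated clusters; under the cluster assumption, points in the
# same cluster tend to share a label. Only one point per cluster is labelled.
X = np.vstack([rng.normal(-3.0, 0.5, (50, 2)), rng.normal(3.0, 0.5, (50, 2))])
y = np.full(100, -1)    # -1 marks unlabelled points
y[0], y[50] = 0, 1      # a single labelled example per cluster

# Greedy self-training: repeatedly give the unlabelled point closest to any
# labelled point the class of that neighbour, so labels grow cluster by cluster.
while (y == -1).any():
    lab = np.flatnonzero(y != -1)
    unl = np.flatnonzero(y == -1)
    d = np.linalg.norm(X[unl][:, None, :] - X[lab][None, :, :], axis=2)
    i, j = np.unravel_index(d.argmin(), d.shape)
    y[unl[i]] = y[lab[j]]

print((y[:50] == 0).all() and (y[50:] == 1).all())  # True: labels never cross clusters
```

Because the clusters are far apart relative to their spread, the nearest labelled neighbour is always in the same cluster, so the two seed labels expand without contaminating each other; when the clusters overlap, the assumption (and this sketch) breaks down.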
Adaptive Ensemble of Classifiers with Regularization for Imbalanced Data Classification
The dynamic ensemble selection of classifiers is an effective approach for
label-imbalanced data classification. However, such a technique is
prone to overfitting, owing to the lack of regularization methods and the
dependence of the aforementioned technique on local geometry. In this study,
focusing on binary imbalanced data classification, a novel dynamic ensemble
method, namely the adaptive ensemble of classifiers with regularization (AER),
is proposed to overcome these limitations. The method solves the overfitting
problem through implicit regularization. Specifically, it leverages the
properties of stochastic gradient descent to obtain the solution with the
minimum norm, thereby achieving regularization; furthermore, it interpolates
the ensemble weights by exploiting the global geometry of data to further
prevent overfitting. According to our theoretical proofs, the seemingly
complicated AER paradigm, in addition to its regularization capabilities, can
actually reduce the asymptotic time and memory complexities of several other
algorithms. We evaluate the proposed AER method on seven benchmark imbalanced
datasets from the UCI machine learning repository and one artificially
generated GMM-based dataset with five variations. The results show that the
proposed algorithm outperforms the major existing algorithms based on multiple
metrics in most cases, and two hypothesis tests (McNemar's and Wilcoxon tests)
verify the statistical significance further. In addition, the proposed method
has other desirable properties, such as particular advantages in dealing with
highly imbalanced data, and it pioneers research on regularization for
dynamic ensemble methods.
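The minimum-norm property of stochastic gradient descent that the AER abstract leverages can be checked in a few lines. This is an illustrative sketch on synthetic data, not the AER algorithm: plain gradient descent on an underdetermined least-squares problem, started from zero, converges to the interpolating solution of minimum Euclidean norm, which is the implicit regularization being exploited.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 20))    # underdetermined: infinitely many exact solutions
b = rng.normal(size=5)

# Gradient descent on ||Aw - b||^2 from w = 0. Every update A.T @ (...) lies
# in the row space of A, so the iterates converge to the interpolating
# solution of minimum norm -- regularization without an explicit penalty.
w = np.zeros(20)
lr = 0.01
for _ in range(20000):
    w -= lr * A.T @ (A @ w - b)

w_min = np.linalg.pinv(A) @ b   # closed-form minimum-norm solution
print(np.allclose(w, w_min, atol=1e-8))
```

The comparison holds because `np.linalg.pinv` returns exactly the minimum-norm least-squares solution; starting from a nonzero initial point outside the row space would instead converge to the solution closest to that starting point.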
A Survey on Metric Learning for Feature Vectors and Structured Data
The need for appropriate ways to measure the distance or similarity between
data is ubiquitous in machine learning, pattern recognition and data mining,
but handcrafting such good metrics for specific problems is generally
difficult. This has led to the emergence of metric learning, which aims at
automatically learning a metric from data, and has attracted much interest
in machine learning and related fields over the past ten years. This survey
paper proposes a systematic review of the metric learning literature,
highlighting the pros and cons of each approach. We pay particular attention to
Mahalanobis distance metric learning, a well-studied and successful framework,
but additionally present a wide range of methods that have recently emerged as
powerful alternatives, including nonlinear metric learning, similarity learning
and local metric learning. Recent trends and extensions, such as
semi-supervised metric learning, metric learning for histogram data and the
derivation of generalization guarantees, are also covered. Finally, this survey
addresses metric learning for structured data, in particular edit distance
learning, and attempts to give an overview of the remaining challenges in
metric learning for the years to come.
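Mahalanobis distance metric learning, the survey's central framework, parametrizes a distance by a positive semi-definite matrix M. A tiny sketch (with a hand-picked matrix rather than a learned one) shows the definition and the equivalent "linear map" view obtained by factorizing M = L^T L:

```python
import numpy as np

# Mahalanobis distance: d_M(x, y) = sqrt((x - y)^T M (x - y)), with M PSD.
def mahalanobis(x, y, M):
    d = x - y
    return float(np.sqrt(d @ M @ d))

L = np.array([[2.0, 0.0],
              [0.0, 1.0]])   # an illustrative linear map (not learned)
M = L.T @ L                  # PSD by construction

x = np.array([1.0, 0.0])
y = np.array([0.0, 0.0])
print(mahalanobis(x, y, M))          # 2.0: the first axis is stretched by L
print(np.linalg.norm(L @ (x - y)))   # 2.0: same value via the linear-map view
```

Metric learning methods in the Mahalanobis family optimize M (or L) from labelled pairs or triplets so that same-class points end up close and different-class points far apart under this distance.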