An Empirical Study and Analysis of Generalized Zero-Shot Learning for Object Recognition in the Wild
Zero-shot learning (ZSL) methods have been studied in the unrealistic setting
where test data are assumed to come from unseen classes only. In this paper, we
advocate studying the problem of generalized zero-shot learning (GZSL) where
the test data's class memberships are unconstrained. We show empirically that
naively using the classifiers constructed by ZSL approaches does not perform
well in the generalized setting. Motivated by this, we propose a simple but
effective calibration method that can be used to balance two conflicting
forces: recognizing data from seen classes versus those from unseen ones. We
develop a performance metric to characterize such a trade-off and examine the
utility of this metric in evaluating various ZSL approaches. Our analysis
further shows that there is a large gap between the performance of existing
approaches and an upper bound established via idealized semantic embeddings,
suggesting that improving class semantic embeddings is vital to GZSL. Comment: ECCV2016 camera-ready
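The calibration idea in this abstract can be illustrated with a small sketch. The abstract does not say what form the calibration takes, so the rule below (subtracting a constant `gamma` from the scores of seen classes before taking the maximum) is an assumption, and the function name, class labels, and scores are hypothetical.

```python
def calibrated_predict(scores, seen_classes, gamma):
    """Pick the top-scoring label after lowering seen-class scores by gamma.

    Assumed calibration rule, not taken from the abstract: a larger gamma
    favors unseen classes, a smaller (or negative) gamma favors seen ones.
    """
    adjusted = {
        label: (score - gamma if label in seen_classes else score)
        for label, score in scores.items()
    }
    return max(adjusted, key=adjusted.get)


# Hypothetical classifier scores for one test image.
scores = {"horse": 0.62, "zebra": 0.55}
seen = {"horse"}  # classes that had training data

print(calibrated_predict(scores, seen, gamma=0.0))
print(calibrated_predict(scores, seen, gamma=0.1))
```

Sweeping `gamma` over a range of values traces out the trade-off between accuracy on seen classes and accuracy on unseen ones.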
Building Gene Expression Profile Classifiers with a Simple and Efficient Rejection Option in R
Background: The collection of gene expression profiles from DNA microarrays and their analysis with pattern recognition algorithms is a powerful technology applied to several biological problems. Common pattern recognition systems classify samples by assigning them to a set of known classes. However, in a clinical diagnostics setup, novel and unknown classes (new pathologies) may appear, and one must be able to reject those samples that do not fit the trained model. The problem of implementing a rejection option in a multi-class classifier has not been widely addressed in the statistical literature. Gene expression profiles represent a critical case study since they suffer from the curse of dimensionality, which undermines the reliability of both traditional rejection models and more recent approaches such as one-class classifiers. Results: This paper presents a set of empirical decision rules that can be used to implement a rejection option in a set of multi-class classifiers widely used for the analysis of gene expression profiles. In particular, we focus on the classifiers implemented in the R Language and Environment for Statistical Computing (R for short in the remainder of this paper). The main contribution of the proposed rules is their simplicity, which enables easy integration with available data analysis environments. Since tuning the parameters involved in defining a rejection model is often a complex and delicate task, in this paper we exploit an evolutionary strategy to automate this process. This allows the final user to maximize the rejection accuracy with minimum manual intervention. Conclusions: This paper shows how simple decision rules can support the use of complex machine learning algorithms in real experimental setups.
The proposed approach is almost completely automated and therefore a good candidate for being integrated in data analysis flows in labs where the machine learning expertise required to tune traditional classifiers might not be available.
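A rejection option of the kind described above might look like the following sketch. The abstract does not list its decision rules, so the rule used here (reject when the highest class posterior falls below a threshold) is an assumption, and the function name, class names, and threshold value are hypothetical. The paper works in R; Python is used here only for illustration.

```python
def classify_with_reject(posteriors, threshold=0.7):
    """Return the most probable class, or None to signal rejection.

    Assumed rule: a sample is rejected when no class posterior reaches
    `threshold`. In practice the threshold would have to be tuned.
    """
    label = max(posteriors, key=posteriors.get)
    return label if posteriors[label] >= threshold else None


# Hypothetical posteriors from a three-class classifier.
confident = {"tumor_A": 0.85, "tumor_B": 0.10, "healthy": 0.05}
ambiguous = {"tumor_A": 0.40, "tumor_B": 0.35, "healthy": 0.25}

print(classify_with_reject(confident))
print(classify_with_reject(ambiguous))
```

Returning `None` keeps the rejected case distinct from every valid class label, so downstream code cannot mistake it for a prediction.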
On Machine-Learned Classification of Variable Stars with Sparse and Noisy Time-Series Data
With the coming data deluge from synoptic surveys, there is a growing need
for frameworks that can quickly and automatically produce calibrated
classification probabilities for newly-observed variables based on a small
number of time-series measurements. In this paper, we introduce a methodology
for variable-star classification, drawing from modern machine-learning
techniques. We describe how to homogenize the information gleaned from light
curves by selection and computation of real-numbered metrics ("features"),
detail methods to robustly estimate periodic light-curve features, introduce
tree-ensemble methods for accurate variable star classification, and show how
to rigorously evaluate the classification results using cross validation. On a
25-class data set of 1542 well-studied variable stars, we achieve a 22.8%
overall classification error using the random forest classifier; this
represents a 24% improvement over the best previous classifier on these data.
This methodology is effective for identifying samples of specific science
classes: for pulsational variables used in Milky Way tomography we obtain a
discovery efficiency of 98.2% and for eclipsing systems we find an efficiency
of 99.1%, both at 95% purity. We show that the random forest (RF) classifier is
superior to other machine-learned methods in terms of accuracy, speed, and
relative immunity to features with no useful class information; the RF
classifier can also be used to estimate the importance of each feature in
classification. Additionally, we present the first astronomical use of
hierarchical classification methods to incorporate a known class taxonomy in
the classifier, which further reduces the catastrophic error rate to 7.8%.
Excluding low-amplitude sources, our overall error rate improves to 14%, with a
catastrophic error rate of 3.5%. Comment: 23 pages, 9 figures
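The efficiency and purity figures quoted above can be made concrete with a short sketch. It assumes the usual definitions (efficiency as the fraction of true members of a class that are recovered, purity as the fraction of selected objects that truly belong to it); the abstract does not define them, and the labels below are invented for illustration.

```python
def efficiency_and_purity(true_labels, predicted_labels, target):
    """Compute (efficiency, purity) for one target class.

    Assumed definitions: efficiency = TP / (TP + FN),
    purity = TP / (TP + FP).
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    efficiency = tp / (tp + fn) if (tp + fn) else 0.0
    purity = tp / (tp + fp) if (tp + fp) else 0.0
    return efficiency, purity


# Hypothetical labels: "RRL" for a pulsating class, "EB" for eclipsing systems.
truth = ["RRL", "RRL", "EB", "RRL", "EB"]
predicted = ["RRL", "EB", "EB", "RRL", "RRL"]

efficiency, purity = efficiency_and_purity(truth, predicted, "RRL")
print(f"efficiency={efficiency:.3f}, purity={purity:.3f}")
```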
Dissimilarity-based Ensembles for Multiple Instance Learning
In multiple instance learning, objects are sets (bags) of feature vectors
(instances) rather than individual feature vectors. In this paper we address
the problem of how these bags can best be represented. Two standard approaches
are to use (dis)similarities between bags and prototype bags, or between bags
and prototype instances. The first approach results in a relatively
low-dimensional representation determined by the number of training bags, while
the second approach results in a relatively high-dimensional representation,
determined by the total number of instances in the training set. In this paper
a third, intermediate approach is proposed, which links the two approaches and
combines their strengths. Our classifier is inspired by a random subspace
ensemble, and considers subspaces of the dissimilarity space, defined by
subsets of instances, as prototypes. We provide guidelines for using such an
ensemble, and show state-of-the-art performances on a range of multiple
instance learning problems. Comment: Submitted to IEEE Transactions on Neural Networks and Learning
Systems, Special Issue on Learning in Non-(geo)metric Spaces
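The first representation described above, in which each bag is described by its dissimilarities to prototype bags, can be sketched as follows. The specific bag dissimilarity used here (the mean distance from each instance to its nearest instance in the prototype bag) is an assumption chosen for illustration, and the function names and data are hypothetical.

```python
import math


def bag_dissimilarity(bag, prototype):
    """Mean, over instances in `bag`, of the Euclidean distance to the
    nearest instance in `prototype` (an assumed, asymmetric measure)."""
    return sum(min(math.dist(x, p) for p in prototype) for x in bag) / len(bag)


def dissimilarity_representation(bag, prototypes):
    """Map a bag to a vector with one dissimilarity per prototype bag."""
    return [bag_dissimilarity(bag, prototype) for prototype in prototypes]


# Hypothetical bags of 2-D instances.
bag = [(0.0, 0.0), (1.0, 0.0)]
prototypes = [[(0.0, 0.0)], [(3.0, 4.0)]]

print(dissimilarity_representation(bag, prototypes))
```

The resulting vector has one entry per prototype bag, which is why this representation stays relatively low-dimensional; using prototype instances instead would give one entry per training instance.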
Supervised Classification: Quite a Brief Overview
The original problem of supervised classification considers the task of
automatically assigning objects to their respective classes on the basis of
numerical measurements derived from these objects. Classifiers are the tools
that implement the actual functional mapping from these measurements---also
called features or inputs---to the so-called class label---or output. The
fields of pattern recognition and machine learning study ways of constructing
such classifiers. The main idea behind supervised methods is that of learning
from examples: given a number of example input-output relations, to what extent
can the general mapping be learned that takes any new and unseen feature vector
to its correct class? This chapter provides a basic introduction to the
underlying ideas of how to come to a supervised classification problem. In
addition, it provides an overview of some specific classification techniques,
delves into the issues of object representation and classifier evaluation, and
(very) briefly covers some variations on the basic supervised classification
task that may also be of interest to the practitioner.
Multi-test Decision Tree and its Application to Microarray Data Classification
Objective:
A desirable property of tools used to investigate biological data is that
they yield easy-to-understand models and predictive decisions.
Decision trees are particularly promising in this regard due to their comprehensible nature, which resembles the hierarchical process of human decision making. However, existing algorithms for learning decision trees have a tendency to underfit gene expression data. The main aim of this work is to improve the performance and stability of decision trees with only a small increase in their complexity.
Methods:
We propose a multi-test decision tree (MTDT); our main contribution is the application of several univariate tests in each non-terminal node of the decision tree. We also search for alternative, lower-ranked features in order to obtain more stable and reliable predictions.
Results:
Experimental validation was performed on several real-life gene expression datasets. Comparison results with eight classifiers show that MTDT has a statistically significantly higher accuracy than popular decision tree classifiers, and it was highly competitive with ensemble learning algorithms. The proposed solution managed to outperform its baseline algorithm on datasets by an average percent. A study performed on one of the datasets showed that the discovered genes used in the MTDT classification model are supported by biological evidence in the literature.
Conclusion:
This paper introduces a new type of decision tree which is more suitable for solving biological problems.
MTDTs are relatively easy to analyze and much more powerful in modeling high-dimensional microarray data than their popular counterparts.
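A node that applies several univariate tests, as described in the Methods, might route a sample as in the sketch below. The abstract does not say how the tests are combined, so the majority vote used here is an assumption, and the function name, feature indices, and thresholds are hypothetical.

```python
def multi_test_branch(sample, tests):
    """Route a sample at one non-terminal node.

    `tests` is a list of (feature_index, threshold) pairs. Each test votes
    "left" when the feature value is at or below its threshold; the branch
    is chosen by majority vote (an assumed combination rule).
    """
    votes_left = sum(1 for feature, threshold in tests if sample[feature] <= threshold)
    return "left" if votes_left * 2 > len(tests) else "right"


# Hypothetical expression values for three genes and three tests on them.
sample = [2.0, 7.5, 0.3]
tests = [(0, 3.0), (1, 5.0), (2, 0.5)]

print(multi_test_branch(sample, tests))
```

With several tests voting, a single noisy gene measurement is less likely to send a sample down the wrong branch than it would be in a single-test node.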