9,625 research outputs found
Temporally-aware algorithms for the classification of anuran sounds
Several authors have shown that the sounds of anurans can be used as an indicator of
climate change. Hence, the recording, storage and further processing of a huge
number of anuran sounds, distributed over time and space, are required in order to
obtain this indicator. Furthermore, it is desirable to have algorithms and tools for
the automatic classification of the different classes of sounds. In this paper, six
classification methods are proposed, all based on the data-mining domain, which
strive to take advantage of the temporal character of the sounds. The definition and
comparison of these classification methods is undertaken using several approaches.
The main conclusions of this paper are that: (i) the sliding window method attained
the best results in the experiments presented, and even outperformed the hidden
Markov models usually employed in similar applications; (ii) noteworthy overall
classification performance has been obtained, which is an especially striking result
considering that the sounds analysed were affected by a highly noisy background;
(iii) the instance selection for the determination of the sounds in the training dataset
offers better results than cross-validation techniques; and (iv) the temporally-aware
classifiers have revealed that they can obtain better performance than their nontemporally-aware
counterparts.Consejería de Innovación, Ciencia y Empresa (Junta de Andalucía, Spain): excellence eSAPIENS number TIC 570
Dissimilarity-based Ensembles for Multiple Instance Learning
In multiple instance learning, objects are sets (bags) of feature vectors
(instances) rather than individual feature vectors. In this paper we address
the problem of how these bags can best be represented. Two standard approaches
are to use (dis)similarities between bags and prototype bags, or between bags
and prototype instances. The first approach results in a relatively
low-dimensional representation determined by the number of training bags, while
the second approach results in a relatively high-dimensional representation,
determined by the total number of instances in the training set. In this paper
a third, intermediate approach is proposed, which links the two approaches and
combines their strengths. Our classifier is inspired by a random subspace
ensemble, and considers subspaces of the dissimilarity space, defined by
subsets of instances, as prototypes. We provide guidelines for using such an
ensemble, and show state-of-the-art performances on a range of multiple
instance learning problems.Comment: Submitted to IEEE Transactions on Neural Networks and Learning
Systems, Special Issue on Learning in Non-(geo)metric Space
Classification of protein interaction sentences via gaussian processes
The increase in the availability of protein interaction studies in textual format coupled with the demand for easier access to the key results has lead to a need for text mining solutions. In the text processing pipeline, classification is a key step for extraction of small sections of relevant text. Consequently, for the task of locating protein-protein interaction sentences, we examine the use of a classifier which has rarely been applied to text, the Gaussian processes (GPs). GPs are a non-parametric probabilistic analogue to the more popular support vector machines (SVMs). We find that GPs outperform the SVM and na\"ive Bayes classifiers on binary sentence data, whilst showing equivalent performance on abstract and multiclass sentence corpora. In addition, the lack of the margin parameter, which requires costly tuning, along with the principled multiclass extensions enabled by the probabilistic framework make GPs an appealing alternative worth of further adoption
Effective classifiers for detecting objects
Several state-of-the-art machine learning classifiers are compared for the purposes of object detection in complex images, using global image features derived from the Ohta color space and Local Binary Patterns. Image complexity in this sense refers to the degree to which the target objects are occluded and/or non-dominant (i.e. not in the foreground) in the image, and also the degree to which the images are cluttered with non-target objects. The results indicate that a voting ensemble of Support Vector Machines, Random Forests, and Boosted Decision Trees provide the best performance with AUC values of up to 0.92 and Equal Error Rate accuracies of up to 85.7% in stratified 10-fold cross validation experiments on the GRAZ02 complex image dataset
Predicting diabetes-related hospitalizations based on electronic health records
OBJECTIVE: To derive a predictive model to identify patients likely to be hospitalized during the following year due to complications attributed to Type II diabetes. METHODS: A variety of supervised machine learning classification methods were tested and a new method that discovers hidden patient clusters in the positive class (hospitalized) was developed while, at the same time, sparse linear support vector machine classifiers were derived to separate positive samples from the negative ones (non-hospitalized). The convergence of the new method was established and theoretical guarantees were proved on how the classifiers it produces generalize to a test set not seen during training. RESULTS: The methods were tested on a large set of patients from the Boston Medical Center - the largest safety net hospital in New England. It is found that our new joint clustering/classification method achieves an accuracy of 89% (measured in terms of area under the ROC Curve) and yields informative clusters which can help interpret the classification results, thus increasing the trust of physicians to the algorithmic output and providing some guidance towards preventive measures. While it is possible to increase accuracy to 92% with other methods, this comes with increased computational cost and lack of interpretability. The analysis shows that even a modest probability of preventive actions being effective (more than 19%) suffices to generate significant hospital care savings. CONCLUSIONS: Predictive models are proposed that can help avert hospitalizations, improve health outcomes and drastically reduce hospital expenditures. The scope for savings is significant as it has been estimated that in the USA alone, about $5.8 billion are spent each year on diabetes-related hospitalizations that could be prevented.Accepted manuscrip
Prediction of protein-protein interactions using one-class classification methods and integrating diverse data
This research addresses the problem of prediction of protein-protein interactions (PPI)
when integrating diverse kinds of biological information. This task has been commonly
viewed as a binary classification problem (whether any two proteins do or do not interact)
and several different machine learning techniques have been employed to solve this
task. However the nature of the data creates two major problems which can affect results.
These are firstly imbalanced class problems due to the number of positive examples (pairs
of proteins which really interact) being much smaller than the number of negative ones.
Secondly the selection of negative examples can be based on some unreliable assumptions
which could introduce some bias in the classification results.
Here we propose the use of one-class classification (OCC) methods to deal with the task of
prediction of PPI. OCC methods utilise examples of just one class to generate a predictive
model which consequently is independent of the kind of negative examples selected; additionally
these approaches are known to cope with imbalanced class problems. We have
designed and carried out a performance evaluation study of several OCC methods for this
task, and have found that the Parzen density estimation approach outperforms the rest. We
also undertook a comparative performance evaluation between the Parzen OCC method
and several conventional learning techniques, considering different scenarios, for example
varying the number of negative examples used for training purposes. We found that the
Parzen OCC method in general performs competitively with traditional approaches and in
many situations outperforms them. Finally we evaluated the ability of the Parzen OCC
approach to predict new potential PPI targets, and validated these results by searching for
biological evidence in the literature
Threshold Choice Methods: the Missing Link
Many performance metrics have been introduced for the evaluation of
classification performance, with different origins and niches of application:
accuracy, macro-accuracy, area under the ROC curve, the ROC convex hull, the
absolute error, and the Brier score (with its decomposition into refinement and
calibration). One way of understanding the relation among some of these metrics
is the use of variable operating conditions (either in the form of
misclassification costs or class proportions). Thus, a metric may correspond to
some expected loss over a range of operating conditions. One dimension for the
analysis has been precisely the distribution we take for this range of
operating conditions, leading to some important connections in the area of
proper scoring rules. However, we show that there is another dimension which
has not received attention in the analysis of performance metrics. This new
dimension is given by the decision rule, which is typically implemented as a
threshold choice method when using scoring models. In this paper, we explore
many old and new threshold choice methods: fixed, score-uniform, score-driven,
rate-driven and optimal, among others. By calculating the loss of these methods
for a uniform range of operating conditions we get the 0-1 loss, the absolute
error, the Brier score (mean squared error), the AUC and the refinement loss
respectively. This provides a comprehensive view of performance metrics as well
as a systematic approach to loss minimisation, namely: take a model, apply
several threshold choice methods consistent with the information which is (and
will be) available about the operating condition, and compare their expected
losses. In order to assist in this procedure we also derive several connections
between the aforementioned performance metrics, and we highlight the role of
calibration in choosing the threshold choice method
- …