Time series transductive classification on imbalanced data sets: an experimental study
Graph-based semi-supervised learning (SSL) algorithms perform well in a variety of domains, such as digit recognition and text classification, when the data lie on a low-dimensional manifold. Surprisingly, these methods have not been applied effectively to time series classification tasks. In this paper, we provide a comprehensive empirical comparison of state-of-the-art graph-based SSL algorithms with respect to graph construction and parameter selection. Specifically, we focus on the problem of time series transductive classification on imbalanced data sets. Through a comprehensive analysis using recently proposed empirical evaluation models, we confirm some of the hypotheses raised in previous work and show that others may not hold in the time series domain. Based on our results, we suggest using the Gaussian Fields and Harmonic Functions algorithm with the mutual k-nearest neighbors graph weighted by the RBF kernel, setting k = 20, for general tasks of time series transductive classification on imbalanced data sets.
São Paulo Research Foundation (FAPESP) (grants 2011/17698-5 and 2012/50714-7)
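The recommended configuration can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `mutual_knn_rbf_graph` and `gfhf_predict`, the Euclidean distance between series, and the `sigma` bandwidth are assumptions made for illustration, and the least-squares solve is a convenience for graphs in which some unlabeled nodes have no path to a labeled node.

```python
import numpy as np

def mutual_knn_rbf_graph(X, k=20, sigma=1.0):
    """Mutual k-NN graph with RBF weights (Euclidean distance is an assumption)."""
    n = X.shape[0]
    k = min(k, n - 1)  # a node cannot have more than n - 1 neighbors
    # Pairwise squared Euclidean distances between the series (rows of X).
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    d2_noself = d2.copy()
    np.fill_diagonal(d2_noself, np.inf)  # exclude self from neighbor lists
    nn = np.argsort(d2_noself, axis=1)[:, :k]
    is_nn = np.zeros((n, n), dtype=bool)
    is_nn[np.arange(n)[:, None], nn] = True
    # Keep an edge only when each node is among the other's k nearest neighbors.
    mutual = is_nn & is_nn.T
    return np.where(mutual, np.exp(-d2 / (2.0 * sigma ** 2)), 0.0)

def gfhf_predict(W, labeled_idx, labels, n_classes):
    """Harmonic solution: solve (D_uu - W_uu) F_u = W_ul Y_l, then take the argmax."""
    n = W.shape[0]
    labeled_idx = np.asarray(labeled_idx)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    Y_l = np.eye(n_classes)[np.asarray(labels)]  # one-hot labels
    L = np.diag(W.sum(axis=1)) - W               # combinatorial graph Laplacian
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    # Least squares tolerates a singular L_uu; nodes with no path to a labeled
    # node get all-zero scores and therefore fall back to class 0.
    F_u = np.linalg.lstsq(L_uu, W_ul @ Y_l, rcond=None)[0]
    pred = np.empty(n, dtype=int)
    pred[labeled_idx] = labels
    pred[unlabeled_idx] = F_u.argmax(axis=1)
    return pred
```

A call such as `gfhf_predict(mutual_knn_rbf_graph(X, k=20), labeled_idx, labels, n_classes)` labels every series in `X` at once, which is the transductive setting the abstract describes.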
Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values
This work is motivated by the needs of predictive analytics on healthcare
data as represented by Electronic Medical Records. Such data is invariably
problematic: noisy, with missing entries, with imbalance in classes of
interest, leading to serious bias in predictive modeling. Since standard data
mining methods often produce poor performance measures, we argue for
development of specialized techniques of data-preprocessing and classification.
In this paper, we propose a new method to simultaneously classify large
datasets and reduce the effects of missing values. It is based on a multilevel
framework of the cost-sensitive SVM and the expectation-maximization imputation
method for missing values, which relies on iterated regression analyses. We
compare classification results of multilevel SVM-based algorithms on public
benchmark datasets with imbalanced classes and missing values as well as real
data in health applications, and show that our multilevel SVM-based method
produces fast, more accurate, and more robust classification results.
Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625
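Two ingredients named in this abstract, imputation by iterated regression and cost-sensitive class weighting, can be sketched in NumPy as follows. This is a simplified, deterministic stand-in rather than the paper's multilevel method or a full EM procedure; the function names `iterative_regression_impute` and `balanced_class_weights` and the inverse-frequency weighting rule are assumptions chosen for illustration.

```python
import numpy as np

def iterative_regression_impute(X, n_iter=10):
    """Fill NaNs by repeatedly regressing each column on all the others."""
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: the mean of the observed values in each column.
    X_filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any() or rows.all():
                continue  # nothing to impute, or nothing to fit on
            others = np.delete(X_filled, j, axis=1)
            A = np.column_stack([np.ones(len(X_filled)), others])
            # Fit on rows where column j was observed, predict the missing rows.
            coef = np.linalg.lstsq(A[~rows], X_filled[~rows, j], rcond=None)[0]
            X_filled[rows, j] = A[rows] @ coef
    return X_filled

def balanced_class_weights(y):
    """Misclassification costs inversely proportional to class frequency."""
    classes, counts = np.unique(y, return_counts=True)
    return {c.item(): len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
```

The resulting weights could be passed to any classifier that accepts per-class costs, so that errors on the minority class are penalized more heavily than errors on the majority class.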
Multi-class protein fold classification using a new ensemble machine learning approach.
Protein structure classification represents an important process in understanding the associations
between sequence and structure as well as possible functional and evolutionary relationships.
Recent structural genomics initiatives and other high-throughput experiments have populated the
biological databases at a rapid pace. The volume of structural data has made traditional methods
such as manual inspection of protein structures infeasible. Machine learning has been
widely applied in bioinformatics with considerable success. This work
proposes a novel ensemble machine learning method that improves the coverage of the classifiers
on multi-class imbalanced sample sets by integrating knowledge induced from different base
classifiers, and we illustrate this idea in classifying multi-class SCOP protein fold data. We have
compared our approach with PART and show that our method improves the sensitivity of the
classifier in protein fold classification. Furthermore, we have extended this method to learning over
multiple data types, preserving the independence of their corresponding data sources, and show
that our new approach performs at least as well as the traditional technique over a single joined
data source. These experimental results are encouraging, and can be applied to other bioinformatics
problems similarly characterised by multi-class imbalanced data sets held in multiple data
sources.