21,586 research outputs found
Missing Value Imputation With Unsupervised Backpropagation
Many data mining and data analysis techniques operate on dense matrices or
complete tables of data. Real-world data sets, however, often contain unknown
values. Even many classification algorithms that are designed to operate with
missing values still exhibit deteriorated accuracy. One approach to handling
missing values is to fill in (impute) the missing values. In this paper, we
present a technique for unsupervised learning called Unsupervised
Backpropagation (UBP), which trains a multi-layer perceptron to fit to the
manifold sampled by a set of observed point-vectors. We evaluate UBP with the
task of imputing missing values in datasets, and show that UBP is able to
predict missing values with significantly lower sum-squared error than other
collaborative filtering and imputation techniques. We also demonstrate with 24
datasets and 9 supervised learning algorithms that classification accuracy is
usually higher when randomly-withheld values are imputed using UBP, rather than
with other methods
Gene Expression based Survival Prediction for Cancer Patients: A Topic Modeling Approach
Cancer is one of the leading cause of death, worldwide. Many believe that
genomic data will enable us to better predict the survival time of these
patients, which will lead to better, more personalized treatment options and
patient care. As standard survival prediction models have a hard time coping
with the high-dimensionality of such gene expression (GE) data, many projects
use some dimensionality reduction techniques to overcome this hurdle. We
introduce a novel methodology, inspired by topic modeling from the natural
language domain, to derive expressive features from the high-dimensional GE
data. There, a document is represented as a mixture over a relatively small
number of topics, where each topic corresponds to a distribution over the
words; here, to accommodate the heterogeneity of a patient's cancer, we
represent each patient (~document) as a mixture over cancer-topics, where each
cancer-topic is a mixture over GE values (~words). This required some
extensions to the standard LDA model eg: to accommodate the "real-valued"
expression values - leading to our novel "discretized" Latent Dirichlet
Allocation (dLDA) procedure. We initially focus on the METABRIC dataset, which
describes breast cancer patients using the r=49,576 GE values, from
microarrays. Our results show that our approach provides survival estimates
that are more accurate than standard models, in terms of the standard
Concordance measure. We then validate this approach by running it on the
Pan-kidney (KIPAN) dataset, over r=15,529 GE values - here using the mRNAseq
modality - and find that it again achieves excellent results. In both cases, we
also show that the resulting model is calibrated, using the recent
"D-calibrated" measure. These successes, in two different cancer types and
expression modalities, demonstrates the generality, and the effectiveness, of
this approach
Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values
This work is motivated by the needs of predictive analytics on healthcare
data as represented by Electronic Medical Records. Such data is invariably
problematic: noisy, with missing entries, with imbalance in classes of
interests, leading to serious bias in predictive modeling. Since standard data
mining methods often produce poor performance measures, we argue for
development of specialized techniques of data-preprocessing and classification.
In this paper, we propose a new method to simultaneously classify large
datasets and reduce the effects of missing values. It is based on a multilevel
framework of the cost-sensitive SVM and the expected maximization imputation
method for missing values, which relies on iterated regression analyses. We
compare classification results of multilevel SVM-based algorithms on public
benchmark datasets with imbalanced classes and missing values as well as real
data in health applications, and show that our multilevel SVM-based method
produces fast, and more accurate and robust classification results.Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625
Neural network ensembles: Evaluation of aggregation algorithms
Ensembles of artificial neural networks show improved generalization
capabilities that outperform those of single networks. However, for aggregation
to be effective, the individual networks must be as accurate and diverse as
possible. An important problem is, then, how to tune the aggregate members in
order to have an optimal compromise between these two conflicting conditions.
We present here an extensive evaluation of several algorithms for ensemble
construction, including new proposals and comparing them with standard methods
in the literature. We also discuss a potential problem with sequential
aggregation algorithms: the non-frequent but damaging selection through their
heuristics of particularly bad ensemble members. We introduce modified
algorithms that cope with this problem by allowing individual weighting of
aggregate members. Our algorithms and their weighted modifications are
favorably tested against other methods in the literature, producing a sensible
improvement in performance on most of the standard statistical databases used
as benchmarks.Comment: 35 pages, 2 figures, In press AI Journa
- …