One-Class Classification: Taxonomy of Study and Review of Techniques
One-class classification (OCC) algorithms aim to build classification models
when the negative class is either absent, poorly sampled or not well defined.
This unique situation constrains the learning of efficient classifiers, because
the class boundary must be defined using only knowledge of the positive class. The OCC
problem has been considered and applied under many research themes, such as
outlier/novelty detection and concept learning. In this paper we present a
unified view of the general problem of OCC by presenting a taxonomy of study
for OCC problems, which is based on the availability of training data, the
algorithms used and the application domains. We further delve into each
of the categories of the proposed taxonomy and present a comprehensive
literature review of the OCC algorithms, techniques and methodologies with a
focus on their significance, limitations and applications. We conclude our
paper by discussing some open research problems in the field of OCC and present
our vision for future research.
Comment: 24 pages + 11 pages of references, 8 figures
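To make the idea of learning a class boundary from positive examples alone concrete, here is a minimal sketch. It is not a method taken from the review: it assumes a deliberately simple model (a centroid plus a distance threshold), and the names `fit_one_class`, `is_member` and the `quantile` parameter are hypothetical.

```python
import math

def fit_one_class(positives, quantile=0.95):
    """Fit a centroid-and-radius boundary from positive examples only.

    Assumption: the positive class is roughly compact, so a sphere around
    the centroid is a usable boundary. No negative examples are needed.
    """
    dim = len(positives[0])
    centroid = [sum(p[d] for p in positives) / len(positives) for d in range(dim)]
    # Distances of the training points to the centroid, sorted ascending.
    dists = sorted(math.dist(p, centroid) for p in positives)
    # Radius is the distance at the chosen quantile of the training set.
    idx = min(len(dists) - 1, int(quantile * len(dists)))
    return centroid, dists[idx]

def is_member(model, point):
    """Return True if the point falls inside the learned boundary."""
    centroid, radius = model
    return math.dist(point, centroid) <= radius

model = fit_one_class([(0, 0), (1, 0), (0, 1), (1, 1)])
print(is_member(model, (0.5, 0.5)))  # close to the positives
print(is_member(model, (5, 5)))      # far from the positives
```

Lowering `quantile` tightens the boundary, which treats more of the training points as outliers.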
Missing Value Imputation With Unsupervised Backpropagation
Many data mining and data analysis techniques operate on dense matrices or
complete tables of data. Real-world data sets, however, often contain unknown
values. Even many classification algorithms that are designed to operate with
missing values still exhibit deteriorated accuracy. One approach to handling
missing values is to fill in (impute) the missing values. In this paper, we
present a technique for unsupervised learning called Unsupervised
Backpropagation (UBP), which trains a multi-layer perceptron to fit to the
manifold sampled by a set of observed point-vectors. We evaluate UBP with the
task of imputing missing values in datasets, and show that UBP is able to
predict missing values with significantly lower sum-squared error than other
collaborative filtering and imputation techniques. We also demonstrate with 24
datasets and 9 supervised learning algorithms that classification accuracy is
usually higher when randomly-withheld values are imputed using UBP, rather than
with other methods.
Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values
This work is motivated by the needs of predictive analytics on healthcare
data as represented by Electronic Medical Records. Such data is invariably
problematic: it is noisy, has missing entries and is imbalanced across the classes
of interest, leading to serious bias in predictive modeling. Since standard data
mining methods often produce poor performance measures, we argue for
development of specialized techniques of data-preprocessing and classification.
In this paper, we propose a new method to simultaneously classify large
datasets and reduce the effects of missing values. It is based on a multilevel
framework of the cost-sensitive SVM and the expectation-maximization imputation
method for missing values, which relies on iterated regression analyses. We
compare classification results of multilevel SVM-based algorithms on public
benchmark datasets with imbalanced classes and missing values as well as real
data in health applications, and show that our multilevel SVM-based method
produces fast, more accurate and more robust classification results.
Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625
A Stochastic Method for Estimating Imputation Accuracy
This thesis describes a novel imputation evaluation method and shows how this method can
be used to estimate the accuracy of the imputed values generated by any imputation
technique. This is achieved by using an iterative stochastic procedure to repeatedly measure
how accurately a set of randomly deleted values are “put back” by the imputation process.
The proposed approach builds on the ideas underpinning uncertainty estimation methods, but
differs from them in that it estimates the accuracy of the imputed values, rather than
estimating the uncertainty inherent within those values. In addition, a procedure for
comparing the accuracy of the imputed values in different data segments has been built into
the proposed method, but uncertainty estimation methods do not include such procedures.
This proposed method is implemented as a software application. This application is used to
estimate the accuracy of the imputed values generated by the expectation-maximisation (EM)
and nearest neighbour (NN) imputation algorithms. These algorithms are implemented
alongside the method, with particular attention being paid to the use of implementation
techniques which decrease algorithm execution times, so as to support the computationally
intensive nature of the method. A novel NN imputation algorithm is developed and the
experimental evaluation of this algorithm shows that it can be used to decrease the execution
time of the NN imputation process for both simulated and real datasets. The execution time of
the new NN algorithm was found to steadily decrease as the proportion of missing values in
the dataset was increased.
The method is experimentally evaluated and the results show that the proposed approach
produces reliable and valid estimates of imputation accuracy when it is used to compare the
accuracy of the imputed values generated by the EM and NN imputation algorithms. Finally,
a case study is presented which shows how the method has been applied in practice, including
a detailed description of the experiments that were performed in order to find the most
accurate methods of imputing the missing values in the case study dataset. A comprehensive
set of experimental results is given, the associated imputation accuracy statistics are analysed
and the feasibility of imputing the missing case study data is assessed.
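The core loop described in this abstract (randomly delete known values, impute them, and measure how accurately they are "put back") can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's implementation: it uses a single numeric column, mean imputation as a stand-in for the EM and NN algorithms, and RMSE as the accuracy measure. The names `mean_impute` and `estimate_imputation_rmse` and all parameters are hypothetical.

```python
import random
import statistics

def mean_impute(column):
    """Stand-in imputer: replace None entries with the mean of observed entries."""
    observed = [v for v in column if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in column]

def estimate_imputation_rmse(column, impute, n_rounds=200, delete_frac=0.2, seed=0):
    """Repeatedly hide known values, impute them, and summarise the error.

    Returns the mean and standard deviation of the per-round RMSE between
    the imputed values and the true values that were hidden.
    """
    rng = random.Random(seed)
    known = [i for i, v in enumerate(column) if v is not None]
    # Hide a fraction of the known values, but always keep at least one visible.
    n_delete = min(len(known) - 1, max(1, int(delete_frac * len(known))))
    scores = []
    for _ in range(n_rounds):
        hidden = rng.sample(known, n_delete)
        masked = list(column)
        for i in hidden:
            masked[i] = None
        filled = impute(masked)
        sq_err = [(filled[i] - column[i]) ** 2 for i in hidden]
        scores.append((sum(sq_err) / len(sq_err)) ** 0.5)
    return statistics.fmean(scores), statistics.stdev(scores)

data = [1.0, 2.0, None, 4.0, 5.0, 6.0, None, 8.0]
mean_rmse, spread = estimate_imputation_rmse(data, mean_impute, n_rounds=50)
print(f"estimated RMSE: {mean_rmse:.3f} (sd {spread:.3f})")
```

Any imputer with the same signature can be passed as `impute`, so two techniques can be compared on the same data by running the estimator once for each.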
A Review of Hot Deck Imputation for Survey Non-response
Hot deck imputation is a method for handling missing data in which each missing value is replaced with an observed response from a "similar" unit. Although it is widely used in practice, its theory is not as well developed as that of other imputation methods. We found that there is no consensus on the best way to apply the hot deck and obtain inferences from the completed data set. Here we review the different forms of the hot deck and the existing research on its statistical properties. We describe applications of the hot deck currently in use, including the U.S. Census Bureau's hot deck for the Current Population Survey (CPS). We also provide numerous examples of variations of the hot deck applied to the third National Health and Nutrition Examination Survey (NHANES III). Some possible areas for future research are highlighted.
Peer Reviewed
http://deepblue.lib.umich.edu/bitstream/2027.42/78729/1/j.1751-5823.2010.00103.x.pd
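As a minimal sketch of the basic idea, the following fills each missing value with an observed value from a randomly chosen donor in the same adjustment cell. It assumes that "similar" means sharing the same values on a set of categorical variables, which is only one of the forms the review covers. The function `hot_deck_impute`, its parameters and the sample records are hypothetical and are not taken from the CPS or NHANES III procedures.

```python
import random

def hot_deck_impute(records, target, cell_keys, seed=0):
    """Random hot deck within adjustment cells.

    records   : list of dicts, with None marking a missing value
    target    : name of the variable to impute
    cell_keys : variables that define which units count as "similar"
    """
    rng = random.Random(seed)
    # Group the observed target values (the donors) by adjustment cell.
    donors = {}
    for r in records:
        if r[target] is not None:
            cell = tuple(r[k] for k in cell_keys)
            donors.setdefault(cell, []).append(r[target])
    completed = []
    for r in records:
        filled = dict(r)  # copy, so the input records are not modified
        if filled[target] is None:
            pool = donors.get(tuple(r[k] for k in cell_keys))
            if pool:  # leave the value missing if the cell has no donor
                filled[target] = rng.choice(pool)
        completed.append(filled)
    return completed

records = [
    {"sex": "F", "age_group": "30-39", "income": 52000},
    {"sex": "F", "age_group": "30-39", "income": None},
    {"sex": "M", "age_group": "40-49", "income": 61000},
    {"sex": "M", "age_group": "40-49", "income": None},
    {"sex": "M", "age_group": "50-59", "income": None},
]
for row in hot_deck_impute(records, "income", ["sex", "age_group"]):
    print(row)
```

The last record stays missing because its cell contains no donor; in practice, cells are often collapsed or a nearest-neighbour donor is used in that case.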