1,224 research outputs found

    One-Class Classification: Taxonomy of Study and Review of Techniques

    One-class classification (OCC) algorithms aim to build classification models when the negative class is either absent, poorly sampled or not well defined. This unique situation constrains the learning of efficient classifiers, because the class boundary must be defined using only knowledge of the positive class. The OCC problem has been considered and applied under many research themes, such as outlier/novelty detection and concept learning. In this paper we present a unified view of the general problem of OCC by presenting a taxonomy of study for OCC problems, which is based on the availability of training data, the algorithms used and the application domains. We further delve into each of the categories of the proposed taxonomy and present a comprehensive literature review of the OCC algorithms, techniques and methodologies with a focus on their significance, limitations and applications. We conclude our paper by discussing some open research problems in the field of OCC and present our vision for future research. Comment: 24 pages + 11 pages of references, 8 figures.

    Missing Value Imputation With Unsupervised Backpropagation

    Many data mining and data analysis techniques operate on dense matrices or complete tables of data. Real-world data sets, however, often contain unknown values. Even many classification algorithms that are designed to operate with missing values still exhibit deteriorated accuracy. One approach to handling missing values is to fill in (impute) the missing values. In this paper, we present a technique for unsupervised learning called Unsupervised Backpropagation (UBP), which trains a multi-layer perceptron to fit the manifold sampled by a set of observed point-vectors. We evaluate UBP on the task of imputing missing values in datasets, and show that UBP is able to predict missing values with significantly lower sum-squared error than other collaborative filtering and imputation techniques. We also demonstrate with 24 datasets and 9 supervised learning algorithms that classification accuracy is usually higher when randomly withheld values are imputed using UBP rather than with other methods.

    Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values

    This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data is invariably problematic: it is noisy, has missing entries, and has imbalanced classes of interest, all of which lead to serious bias in predictive modeling. Since standard data mining methods often perform poorly on such data, we argue for the development of specialized techniques for data preprocessing and classification. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. It is based on a multilevel framework combining the cost-sensitive SVM with the expectation-maximization imputation method for missing values, which relies on iterated regression analyses. We compare classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values, as well as on real data from health applications, and show that our multilevel SVM-based method produces fast, more accurate and more robust classification results. Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625

    A Stochastic Method for Estimating Imputation Accuracy

    This thesis describes a novel imputation evaluation method and shows how this method can be used to estimate the accuracy of the imputed values generated by any imputation technique. This is achieved by using an iterative stochastic procedure to repeatedly measure how accurately a set of randomly deleted values are “put back” by the imputation process. The proposed approach builds on the ideas underpinning uncertainty estimation methods, but differs from them in that it estimates the accuracy of the imputed values, rather than estimating the uncertainty inherent within those values. In addition, a procedure for comparing the accuracy of the imputed values in different data segments has been built into the proposed method, but uncertainty estimation methods do not include such procedures. This proposed method is implemented as a software application. This application is used to estimate the accuracy of the imputed values generated by the expectation-maximisation (EM) and nearest neighbour (NN) imputation algorithms. These algorithms are implemented alongside the method, with particular attention being paid to the use of implementation techniques which decrease algorithm execution times, so as to support the computationally intensive nature of the method. A novel NN imputation algorithm is developed and the experimental evaluation of this algorithm shows that it can be used to decrease the execution time of the NN imputation process for both simulated and real datasets. The execution time of the new NN algorithm was found to steadily decrease as the proportion of missing values in the dataset was increased. The method is experimentally evaluated and the results show that the proposed approach produces reliable and valid estimates of imputation accuracy when it is used to compare the accuracy of the imputed values generated by the EM and NN imputation algorithms. 
    Finally, a case study is presented which shows how the method has been applied in practice, including a detailed description of the experiments that were performed in order to find the most accurate methods of imputing the missing values in the case study dataset. A comprehensive set of experimental results is given, the associated imputation accuracy statistics are analysed and the feasibility of imputing the missing case study data is assessed.
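
The evaluation idea in this abstract (repeatedly delete known values at random, impute them, and measure how accurately they are "put back") can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the function names, the `frac` and `trials` parameters, the use of RMSE as the accuracy measure, and the column-mean imputer standing in for the EM and NN algorithms are all hypothetical.

```python
import random
from statistics import mean


def mean_impute(rows):
    """Stand-in imputer (assumption): fill each None with its column mean."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        # Fall back to 0.0 if a column has no observed values at all.
        col_means.append(mean(observed) if observed else 0.0)
    return [[col_means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]


def estimate_imputation_rmse(rows, impute, frac=0.1, trials=50, seed=0):
    """Average RMSE over repeated random deletions of known values."""
    rng = random.Random(seed)
    # Positions whose true value is known and can therefore be scored.
    known = [(i, j) for i, r in enumerate(rows)
             for j, v in enumerate(r) if v is not None]
    k = max(1, int(frac * len(known)))
    scores = []
    for _ in range(trials):
        held_out = rng.sample(known, k)
        masked = [list(r) for r in rows]  # copy so the input is untouched
        for i, j in held_out:
            masked[i][j] = None
        filled = impute(masked)
        sq_err = [(filled[i][j] - rows[i][j]) ** 2 for i, j in held_out]
        scores.append(mean(sq_err) ** 0.5)
    return mean(scores)
```

Any imputer that takes a list of rows with `None` for missing cells and returns a completed copy could be passed as `impute`, which is what would allow two imputation methods to be compared on the same dataset.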

    A Review of Hot Deck Imputation for Survey Non-response

    Hot deck imputation is a method for handling missing data in which each missing value is replaced with an observed response from a "similar" unit. Although it is widely used in practice, its theory is not as well developed as that of other imputation methods. We have found that there is no consensus on the best way to apply the hot deck and obtain inferences from the completed data set. Here we review the different forms of the hot deck and the existing research on its statistical properties. We describe applications of the hot deck currently in use, including the U.S. Census Bureau's hot deck for the Current Population Survey (CPS). We also provide numerous examples of variations of the hot deck applied to the third National Health and Nutrition Examination Survey (NHANES III). Some potential areas for future research are highlighted. Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/78729/1/j.1751-5823.2010.00103.x.pd
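
As a rough illustration of the basic idea (replace each missing value with an observed value from a "similar" unit), a random hot deck within adjustment cells might look like the sketch below. It is one simple variant chosen for illustration, not the Census Bureau or NHANES III procedure; the function name `hot_deck_impute`, the record layout and the field names in the usage note are assumptions.

```python
import random
from collections import defaultdict


def hot_deck_impute(records, target, cell_keys, seed=0):
    """Random hot deck: draw a donor value from the recipient's own cell.

    records   -- list of dicts; a missing target value is stored as None
    target    -- name of the field to impute
    cell_keys -- fields that define which units count as "similar"
    """
    rng = random.Random(seed)
    # Group the observed target values by adjustment cell.
    donors = defaultdict(list)
    for rec in records:
        if rec[target] is not None:
            donors[tuple(rec[k] for k in cell_keys)].append(rec[target])
    completed = []
    for rec in records:
        new = dict(rec)  # copy so the input records are left unchanged
        if new[target] is None:
            pool = donors.get(tuple(new[k] for k in cell_keys))
            if pool:  # leave the value missing if the cell has no donor
                new[target] = rng.choice(pool)
        completed.append(new)
    return completed
```

For example, calling `hot_deck_impute(records, "income", ["region"])` on records with hypothetical `region` and `income` fields fills each missing income with the income of a randomly chosen respondent from the same region. Cells without any donor are left missing in this sketch; a practical application would need a rule for them, such as collapsing cells.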