
    VIGAN: Missing View Imputation with Generative Adversarial Networks

    In an era when big data are becoming the norm, the concern is less with the quantity of data than with its quality and completeness. In many disciplines, data are collected from heterogeneous sources, resulting in multi-view or multi-modal datasets. The missing data problem has been challenging to address in multi-view data analysis. In particular, when certain samples lack an entire view of data, the missing view problem arises. Classic multiple imputation or matrix completion methods are hardly effective here, because the specific view contains no information on which to base imputation for such samples. The commonly used simple method of removing samples with a missing view can dramatically reduce sample size, thus diminishing the statistical power of a subsequent analysis. In this paper, we propose a novel approach for view imputation via generative adversarial networks (GANs), which we name VIGAN. This approach first treats each view as a separate domain and identifies domain-to-domain mappings via a GAN using randomly sampled data from each view, and then employs a multi-modal denoising autoencoder (DAE) to reconstruct the missing view from the GAN outputs based on paired data across the views. Then, by optimizing the GAN and DAE jointly, our model integrates the knowledge of domain mappings and view correspondences to effectively recover the missing view. Empirical results on benchmark datasets validate the VIGAN approach by comparing it against the state of the art. The evaluation of VIGAN in a genetic study of substance use disorders further demonstrates the effectiveness and usability of this approach in life science. Comment: 10 pages, 8 figures, conferenc

    Multiple Imputation Ensembles (MIE) for dealing with missing data

    Missing data is a significant issue in many real-world datasets, yet there are no robust methods for dealing with it appropriately. In this paper, we propose a robust approach to dealing with missing data in classification problems: Multiple Imputation Ensembles (MIE). Our method integrates two approaches, multiple imputation and ensemble methods, and compares two types of ensembles: bagging and stacking. We also propose a robust experimental set-up using 20 benchmark datasets from the UCI machine learning repository. For each dataset, we introduce increasing amounts of data Missing Completely at Random. First, we use a number of single and multiple imputation methods to recover the missing values, and then ensemble a number of different classifiers built on the imputed data. We assess the quality of the imputation using dissimilarity measures. We also evaluate the MIE performance by comparing classification accuracy on the complete and imputed data. Furthermore, we use the accuracy of simple imputation as a benchmark for comparison. We find that our proposed approach combining multiple imputation with ensemble techniques outperforms the others, particularly as the amount of missing data increases.
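The idea of pairing multiple imputation with an ensemble can be illustrated with a minimal sketch. It is not the MIE method from the paper. It assumes a random draw from the observed values of each column as a stand-in for a proper multiple imputation method, a nearest-centroid classifier as a stand-in for the base learners, and a plain majority vote in place of bagging or stacking. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def impute_once(rows, rng):
    """One stochastic imputation: fill each None with a randomly chosen
    observed value from the same column (illustrative stand-in only)."""
    n_cols = len(rows[0])
    observed = [[r[j] for r in rows if r[j] is not None] for j in range(n_cols)]
    return [[rng.choice(observed[j]) if v is None else v
             for j, v in enumerate(r)] for r in rows]

def fit_centroids(rows, labels):
    """Nearest-centroid classifier: store the per-class mean of each feature."""
    sums, counts = {}, Counter(labels)
    for r, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(r))
        for j, v in enumerate(r):
            acc[j] += v
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Return the class whose centroid is closest to x (squared Euclidean)."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def ensemble_predict(rows, labels, x, n_imputations=5, seed=0):
    """Train one model per imputed copy of the data and take a majority vote."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_imputations):
        model = fit_centroids(impute_once(rows, rng), labels)
        votes[predict(model, x)] += 1
    return votes.most_common(1)[0][0]

# Toy data with missing entries (None); values are made up for illustration.
rows = [[1.0, 1.2], [0.9, None], [None, 1.0],
        [5.0, 5.1], [5.2, None], [None, 4.9]]
labels = ["a", "a", "a", "b", "b", "b"]
print(ensemble_predict(rows, labels, [1.0, 1.0]))
print(ensemble_predict(rows, labels, [5.0, 5.0]))
```

Each imputed copy yields a slightly different model, so the vote averages over the uncertainty introduced by the missing values instead of committing to a single filled-in dataset.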

    Effect of Imputation Methods in the Classifier Performance

    Missing values in a dataset present an important problem for almost any traditional and modern statistical method, since most of these methods were developed under the assumption that the dataset was complete. However, in the real world, no complete datasets are available, and the issue of missing data is frequently encountered in veterinary field studies as in other fields. While the imputation of missing data is important in veterinary field studies, where data mining is only beginning to be implemented, another important issue is how the data should be imputed. This is because in many studies, observations with missing values in any variable are either removed or completed by traditional methods. In recent years, although alternative approaches that avoid removing observations with missing values have become widely available, they are rarely used. The aim of this study is to examine the mean, median, nearest neighbors, MICE and missForest methods for imputing simulated missing data, randomly removed at varying frequencies (5 to 25%, in steps of 5%) from the original veterinary dataset. The most accurate methods were then selected to impute the original dataset, in order to observe their influence on classifier performance and to determine the optimal imputation method for the original dataset.
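The simulation design described here (remove values at random at 5 to 25%, impute, compare against the known truth) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: it uses synthetic Gaussian data in place of the veterinary dataset, covers only mean and median imputation, and scores accuracy by RMSE on the removed entries. The function names and the choice of RMSE are assumptions.

```python
import random
import statistics

def mask_mcar(values, rate, rng):
    """Return a copy of values with round(rate * n) entries set to None,
    chosen uniformly at random (missing completely at random)."""
    n_missing = round(rate * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]

def impute(values, strategy):
    """Fill None entries with the mean or median of the observed entries."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

def rmse_on_missing(truth, masked, imputed):
    """Root mean squared error, computed only where values were removed."""
    errs = [(t - p) ** 2 for t, m, p in zip(truth, masked, imputed) if m is None]
    return (sum(errs) / len(errs)) ** 0.5

rng = random.Random(0)
# Synthetic stand-in for one numeric variable of the original dataset.
truth = [rng.gauss(50.0, 10.0) for _ in range(200)]

results = {}
for rate in (0.05, 0.10, 0.15, 0.20, 0.25):
    masked = mask_mcar(truth, rate, rng)
    for strategy in ("mean", "median"):
        imputed = impute(masked, strategy)
        results[(rate, strategy)] = rmse_on_missing(truth, masked, imputed)

for (rate, strategy), err in sorted(results.items()):
    print(f"{rate:.0%} missing, {strategy}: RMSE = {err:.2f}")
```

Because the removed values are known, each method can be scored directly, which is what makes the simulated-missingness design useful for choosing an imputation method before applying it to genuinely incomplete data.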