408 research outputs found

    Time series cluster kernels to exploit informative missingness and incomplete label information

    The time series cluster kernel (TCK) provides a powerful tool for analysing multivariate time series subject to missing data. TCK is designed using an ensemble learning approach in which Bayesian mixture models form the base models. Because of the Bayesian approach, TCK can naturally deal with missing values without resorting to imputation, and the ensemble strategy ensures robustness to hyperparameters, making it particularly well suited for unsupervised learning. However, TCK assumes that data are missing at random and that the underlying missingness mechanism is ignorable, i.e. uninformative, an assumption that does not hold in many real-world applications, such as medicine. To overcome this limitation, we present a kernel capable of exploiting the potentially rich information in the missing values and patterns, as well as the information from the observed data. In our approach, we create a representation of the missing patterns, which is incorporated into mixed mode mixture models in such a way that the information provided by the missing patterns is effectively exploited. Moreover, we also propose a semi-supervised kernel, capable of taking advantage of incomplete label information to learn more accurate similarities. Experiments on benchmark data, as well as a real-world case study of patients described by longitudinal electronic health record data who potentially suffer from hospital-acquired infections, demonstrate the effectiveness of the proposed method.
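    As an illustration of the idea of representing missing patterns alongside observed values, the sketch below builds the kind of binary missingness mask that a mixed mode mixture model could consume together with the continuous measurements. This is a minimal sketch under assumed conventions, not the paper's implementation; all names are illustrative.

        # Minimal sketch: encode the missing pattern of a multivariate time
        # series (MTS) as a binary mask, so a mixed mode mixture model can
        # exploit it alongside the observed (continuous) values.
        import numpy as np

        def missingness_representation(X):
            """X: (T, V) array of an MTS, with NaN marking missing entries.
            Returns the zero-filled values and the binary pattern R,
            where R[t, v] = 1 if X[t, v] was observed."""
            R = (~np.isnan(X)).astype(float)   # 1 = observed, 0 = missing
            X_obs = np.nan_to_num(X, nan=0.0)  # continuous part of the mixed mode data
            return X_obs, R

        # Example: a 4-step, 2-variable series with informative gaps
        X = np.array([[1.2, np.nan],
                      [0.9, 3.1],
                      [np.nan, 2.8],
                      [1.1, np.nan]])
        X_obs, R = missingness_representation(X)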

    A Kernel to Exploit Informative Missingness in Multivariate Time Series from EHRs

    A large fraction of electronic health records (EHRs) consists of clinical measurements collected over time, such as lab tests and vital signs, which provide important information about a patient's health status. These sequences of clinical measurements are naturally represented as time series, characterized by multiple variables and large amounts of missing data, which complicate the analysis. In this work, we propose a novel kernel capable of exploiting both the information from the observed values and the information hidden in the missing patterns of multivariate time series (MTS) originating e.g. from EHRs. The kernel, called TCK_IM, is designed using an ensemble learning strategy in which the base models are novel mixed mode Bayesian mixture models that can effectively exploit informative missingness without resorting to imputation methods. Moreover, the ensemble approach ensures robustness to hyperparameters, and therefore TCK_IM is particularly well suited when labels are scarce - a known challenge in medical applications. Experiments on three real-world clinical datasets demonstrate the effectiveness of the proposed kernel.
    Comment: 2020 International Workshop on Health Intelligence, AAAI-20. arXiv admin note: text overlap with arXiv:1907.0525
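    The ensemble construction can be illustrated with a short sketch: fit many base clusterings under randomized settings, and let the kernel entry K[i, j] be the fraction of base models that assign series i and j to the same cluster. KMeans here is only a stand-in for the paper's mixed mode Bayesian mixture models, and all names are illustrative.

        # Sketch of a TCK-style ensemble kernel: co-assignment frequencies
        # across randomized base clusterings yield a positive semi-definite
        # similarity that is robust to any single hyperparameter choice.
        import numpy as np
        from sklearn.cluster import KMeans

        def ensemble_kernel(Z, n_models=30, seed=0):
            """Z: (N, D) features, e.g. flattened values plus missingness mask."""
            rng = np.random.default_rng(seed)
            N = Z.shape[0]
            K = np.zeros((N, N))
            for _ in range(n_models):
                k = int(rng.integers(2, 6))  # randomized number of clusters
                labels = KMeans(n_clusters=k, n_init=5,
                                random_state=int(rng.integers(10**6))).fit_predict(Z)
                K += labels[:, None] == labels[None, :]  # co-assignment indicator
            return K / n_models  # similarity in [0, 1]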

    Deeply-Learned Generalized Linear Models with Missing Data

    Deep Learning (DL) methods have dramatically increased in popularity in recent years, with significant growth in their application to supervised learning problems in the biomedical sciences. However, the greater prevalence and complexity of missing data in modern biomedical datasets present significant challenges for DL methods. Here, we provide a formal treatment of missing data in the context of deeply learned generalized linear models, a supervised DL architecture for regression and classification problems. We propose a new architecture, dlglm, that is one of the first able to flexibly account for both ignorable and non-ignorable patterns of missingness in input features and response at training time. We demonstrate through statistical simulation that our method outperforms existing approaches for supervised learning tasks in the presence of missing not at random (MNAR) missingness. We conclude with a case study on the Bank Marketing dataset from the UCI Machine Learning Repository, in which we predict whether clients subscribed to a product based on phone survey data.
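    One minimal way to make a deep GLM aware of missingness is to feed the network both the zero-filled features and their missingness mask, so the missing pattern itself can inform the linear predictor. The sketch below shows this pattern for binary classification; it is a simplification for illustration, not the paper's variational architecture, and all names are assumptions.

        # Sketch: mask-aware deep GLM. The network receives the features
        # concatenated with the missingness mask and outputs a probability
        # through a logistic link.
        import torch
        import torch.nn as nn

        class MaskAwareDeepGLM(nn.Module):
            def __init__(self, d_in, d_hidden=64):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(2 * d_in, d_hidden),  # features + mask
                    nn.ReLU(),
                    nn.Linear(d_hidden, 1),         # linear predictor eta
                )

            def forward(self, x, mask):
                x = torch.nan_to_num(x, nan=0.0)    # zero-fill missing entries
                eta = self.net(torch.cat([x, mask], dim=-1))
                return torch.sigmoid(eta)           # logistic link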

    Multiple Imputation Using Influential Exponential Tilting in Case of Non-Ignorable Missing Data

    Modern research strategies rely predominantly on three steps: data collection, data analysis, and inference. If the data are not collected as designed, researchers may face the challenge of incomplete data, especially when the missingness is non-ignorable. These situations affect the subsequent steps of evaluation and make them difficult to perform. Inference with incomplete data is a challenging task in data analysis and clinical trials when the missing data are related to the condition under study. Moreover, results obtained from incomplete data are prone to bias. Parameter estimation with non-ignorable missing data is even more challenging, as is extracting useful information from such data. This dissertation proposes a method based on an influential tilting resampling approach to address non-ignorable missing data in statistical inference. This robust approach is motivated by the importance resampling approach used by Samawi et al. (1998) for power estimation, and it is also inspired by the exponential tilting method for non-ignorable missing data proposed by Kim & Yu (2011). One basis of the proposed approach is the assumption that the non-respondents' model corresponds to an exponential tilting of the respondents' model, where the tilted model's specified function is the influence function of the parameter of interest. The other basis is the use of importance resampling techniques to draw inference about the model parameters. Extensive simulation studies were conducted to investigate the performance of the proposed methods. We provide theoretical justification, as well as an application to real data.
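    The tilting idea admits a compact illustration: under the Kim & Yu (2011) model, the non-respondents' density is proportional to exp(gamma * y) times the respondents' density, so pseudo-draws from the non-respondents' model can be obtained by importance resampling of the observed responses with tilted weights. The sketch below assumes the tilting parameter gamma is known, which in practice it is not; it is illustrative only.

        # Sketch of exponential tilting with importance resampling:
        # f0(y) is proportional to exp(gamma * y) * f1(y), where f1 is the
        # respondents' density and f0 the non-respondents' density.
        import numpy as np

        def tilted_resample(y_obs, gamma, size, seed=0):
            """Resample observed responses with exponential tilting weights."""
            rng = np.random.default_rng(seed)
            w = np.exp(gamma * y_obs)  # unnormalized tilting weights
            w /= w.sum()               # self-normalized importance weights
            return rng.choice(y_obs, size=size, replace=True, p=w)

        y_obs = np.random.default_rng(1).normal(size=500)    # respondents' sample
        y_mis = tilted_resample(y_obs, gamma=0.5, size=200)  # pseudo non-respondents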

    Clustering of Bulk RNA-Seq Data and Missing Data Methods in Deep Learning

    Clustering is a form of unsupervised learning that aims to uncover latent groups within data based on similarity across a set of features. A common application in biomedical research is delineating novel cancer subtypes from patient gene expression data, given a set of informative genes. However, it is typically unknown a priori which genes are informative in discriminating between clusters, and what the optimal number of clusters is. In addition, few methods exist for unsupervised clustering of bulk RNA-seq samples, and no method exists that can do so while simultaneously adjusting for between-sample global normalization factors, accounting for potential confounding variables, and selecting cluster-discriminatory genes. In Chapter 2, we present FSCseq (Feature Selection and Clustering of RNA-seq): a model-based clustering algorithm that utilizes a finite mixture of regression (FMR) model and employs a quadratic penalty method with a SCAD penalty. The maximization is done by a penalized EM algorithm, allowing us to include normalization factors and confounders in our modeling framework. Given the fitted model, our framework allows for subtype prediction in new patients via posterior probabilities of cluster membership.

    The field of deep learning has also boomed in popularity in recent years, fueled initially by its performance in the classification and manipulation of image data and, more recently, in areas of public health, medicine, and biology. However, missing data are very common in these latter areas and involve more complicated mechanisms of missingness than in image data. While a rich statistical literature exists on the characterization and treatment of missing data in traditional statistical models, it is unclear how such methods extend to deep learning methods. In Chapter 3, we present NIMIWAE (Non-Ignorably Missing Importance Weighted AutoEncoder), an unsupervised learning algorithm that provides a formal treatment of missing data in the context of Importance Weighted Autoencoders (IWAEs), an unsupervised Bayesian deep learning architecture, in order to perform single and multiple imputation of missing data. We review existing methods that handle missingness up to missing at random (MAR) and propose methods to handle the more difficult missing not at random (MNAR) scenario. We show that this extension is critical to the performance of data imputation, as well as downstream coefficient estimation. We use simulation examples to illustrate the impact of missingness on such tasks and compare the performance of several proposed methods for handling missing data. We applied our proposed methods to a large electronic health record dataset and illustrate their utility through a qualitative look at the downstream fitted models after imputation.

    Finally, in Chapter 4, we present dlglm (deeply-learned generalized linear model), a supervised learning algorithm that extends the missing data methods from Chapter 3 directly to supervised learning tasks such as classification and regression. We show that dlglm can be trained in the presence of missing data in both the predictors and the response, under the MCAR, MAR, and MNAR missing data settings. We also demonstrate that the trained dlglm model can directly predict the response on partially observed samples in the prediction or test set, drawing from the learned variational posterior distribution of the missing values conditional on the observed values during model training. We utilize statistical simulation and real-world datasets to show the impact of our method in increasing the accuracy of coefficient estimation and prediction under different mechanisms of missingness.
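    As a rough illustration of the importance-weighted imputation idea behind IWAE-style models, the sketch below draws several latent samples from an encoder, weights the decoded reconstructions by self-normalized importance weights, and fills the missing entries with the weighted average. The encoder and decoder are assumed pretrained networks with the stated interfaces; this is a hypothetical simplification, not NIMIWAE itself.

        # Sketch: single imputation via importance weighting in an IWAE-style
        # model with a standard normal prior and Gaussian observation model.
        import torch

        def iw_impute(x, mask, encoder, decoder, K=50):
            """x: (D,) zero-filled input; mask: (D,) with 1 = observed."""
            mu, logvar = encoder(x * mask)               # q(z | x_obs)
            std = torch.exp(0.5 * logvar)
            z = mu + std * torch.randn(K, mu.shape[-1])  # K latent draws
            x_hat = decoder(z)                           # (K, D) reconstructions
            # log importance weights (additive constants cancel in the softmax)
            log_p_x = (-0.5 * (x - x_hat) ** 2 * mask).sum(-1)  # p(x_obs | z)
            log_p_z = (-0.5 * z ** 2).sum(-1)                   # prior p(z)
            log_q_z = (-0.5 * ((z - mu) / std) ** 2 - torch.log(std)).sum(-1)
            w = torch.softmax(log_p_x + log_p_z - log_q_z, dim=0)
            x_imp = (w[:, None] * x_hat).sum(0)          # weighted imputation
            return torch.where(mask.bool(), x, x_imp)    # keep observed values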