5,173 research outputs found

Model-based Clustering with Missing Not At Random Data

Author: Biernacki Christophe
Boyer Claire
Celeux Gilles
Josse Julie
Laporte Fabien
Marbac Matthieu
Sportisse Aude
Publication date: 18/05/2022

Traditional ways of handling missing values are not designed for clustering, and they rarely apply to the general case, frequent in practice, of Missing Not At Random (MNAR) values. This paper proposes to embed MNAR data directly within model-based clustering algorithms. We introduce a mixture model for different types of data (continuous, count, categorical and mixed) to jointly model the data distribution and the MNAR mechanism. Eight different MNAR models are proposed, which may depend on the underlying (unknown) classes and/or the values of the missing variables themselves. We prove the identifiability of the parameters of both the data distribution and the mechanism, whatever the type of data and the mechanism, and propose an EM or Stochastic EM algorithm to estimate them. The code is available at https://github.com/AudeSportisse/Clustering-MNAR. We also prove that MNAR models for which the missingness depends on the class membership have the nice property that statistical inference can be carried out on the data matrix concatenated with the mask by considering a MAR mechanism instead. Finally, we perform empirical evaluations of the proposed sub-models on synthetic data and illustrate the relevance of our method on a medical register, the TraumaBase® dataset.

arXiv.org e-Print Archive

Deep Generative Imputation Model for Missing Not At Random Data

Author: Chen Jialei
Wang Pengyang
Xu Yuanbo
Yang Yongjian
Publication date: 16/08/2023

Data analysis often suffers from the Missing Not At Random (MNAR) problem, where the cause of the missingness is not fully observed. Compared to the naive Missing Completely At Random (MCAR) problem, it is closer to realistic scenarios but also more complex and challenging. Existing statistical methods model the MNAR mechanism through different decompositions of the joint distribution of the complete data and the missing mask. However, we find empirically that directly incorporating these statistical methods into deep generative models is sub-optimal. Specifically, doing so neglects the confidence of the reconstructed mask during the MNAR imputation process, which leads to insufficient information extraction and less-guaranteed imputation quality. In this paper, we revisit the MNAR problem from a novel perspective in which the complete data and the missing mask are two modalities of incomplete data on an equal footing. Along this line, we put forward a generative-model-specific joint probability decomposition method, the conjunction model, to represent the distributions of the two modalities in parallel and extract sufficient information from both the complete data and the missing mask. Taking a step further, we develop a deep generative imputation model, namely GNR, to process the real-world missing mechanism in the latent space and concurrently impute the incomplete data and reconstruct the missing mask. The experimental results show that GNR surpasses state-of-the-art MNAR baselines by significant margins (average RMSE improvements of 9.9% to 18.8%) and always gives better mask reconstruction accuracy, which makes the imputation more principled.
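For orientation, the "decompositions of the joint distribution" mentioned in this abstract are usually taken to be the two textbook factorizations below. The notation is an assumption made for this note, not the paper's: x is the complete data, split into observed and missing parts, and m is the binary missing mask.

```latex
% Selection-model and pattern-mixture factorizations of the joint density
\begin{align}
  p(x, m) &= p(x)\, p(m \mid x)   && \text{(selection model)} \\
          &= p(m)\, p(x \mid m)   && \text{(pattern-mixture model)}
\end{align}
% Standard missingness mechanisms, with x = (x_{\mathrm{obs}}, x_{\mathrm{mis}}):
\begin{align}
  \text{MCAR:} \quad & p(m \mid x) = p(m) \\
  \text{MAR:}  \quad & p(m \mid x) = p(m \mid x_{\mathrm{obs}}) \\
  \text{MNAR:} \quad & p(m \mid x) \text{ depends on } x_{\mathrm{mis}}
\end{align}
```

The conjunction model proposed in the paper is a different decomposition; its exact form is not stated in the abstract and is not reproduced here.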

arXiv.org e-Print Archive

Unbiased Recommender Learning from Missing-Not-At-Random Implicit Feedback

Author: Nakata Kazuhide
Nishino Yuta
Saito Yuta
Yaginuma Suguru
中田和秀
齋藤優太
Publication date: 05/01/2020

Recommender systems widely use implicit feedback such as click data because of its general availability. Although the presence of clicks signals the users' preference to some extent, the lack of such clicks does not necessarily indicate a negative response from the users, as it is possible that the users were not exposed to the items (positive-unlabeled problem). This leads to a difficulty in predicting the users' preferences from implicit feedback. Previous studies addressed the positive-unlabeled problem by uniformly upweighting the loss for the positive feedback data or estimating the confidence of each data point having relevance information via the EM algorithm. However, these methods failed to address the missing-not-at-random problem in which popular or frequently recommended items are more likely to be clicked than other items even if a user does not have a considerable interest in them. To overcome these limitations, we first define an ideal loss function to be optimized to realize recommendations that maximize the relevance and propose an unbiased estimator for the ideal loss. Subsequently, we analyze the variance of the proposed unbiased estimator and further propose a clipped estimator that includes the unbiased estimator as a special case. We demonstrate that the clipped estimator is expected to improve the performance of the recommender system, by considering the bias-variance trade-off. We conduct semi-synthetic and real-world experiments and demonstrate that the proposed method largely outperforms the baselines. In particular, the proposed method works better for rare items that are less frequently observed in the training data. The findings indicate that the proposed method can better achieve the objective of recommending items with the highest relevance. Comment: accepted at WSDM'2
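As a rough illustration of how clipping can trade bias for variance in a propensity-weighted loss, here is a minimal sketch. It is not taken from the paper: the function name `clipped_ipw_loss`, the `clip` parameter, the log-loss form and the assumption that exposure propensities are already known are all hypothetical choices made for this note.

```python
import numpy as np


def clipped_ipw_loss(clicks, propensity, pred, clip=0.1):
    """Propensity-weighted log loss with propensities clipped from below.

    clicks     : 0/1 array of observed clicks
    propensity : assumed-known exposure propensities in (0, 1]
    pred       : predicted relevance probabilities in (0, 1)
    clip       : lower bound applied to the propensities (hypothetical knob)
    """
    theta = np.maximum(propensity, clip)   # clipping limits the largest weights
    weight = clicks / theta                # inverse-propensity weight per sample
    eps = 1e-12                            # guards against log(0)
    pos_loss = -np.log(pred + eps)         # loss if the item is relevant
    neg_loss = -np.log(1.0 - pred + eps)   # loss if the item is not relevant
    return np.mean(weight * pos_loss + (1.0 - weight) * neg_loss)


clicks = np.array([1.0, 0.0, 1.0, 0.0])
propensity = np.array([0.5, 0.2, 0.05, 0.8])
pred = np.array([0.7, 0.3, 0.6, 0.1])

unclipped = clipped_ipw_loss(clicks, propensity, pred, clip=0.0)
clipped = clipped_ipw_loss(clicks, propensity, pred, clip=0.1)
```

With `clip=0.0` the weights are the plain inverse propensities, and with `clip=1.0` every weight equals the click indicator, so the loss reduces to ordinary log loss on clicks. Values in between limit the influence of rarely exposed items at the cost of some bias.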

arXiv.org e-Print Archive

Institutional Repositories DataBase (IRDB)

Imputation in missing not at random SNPs data using EM algorithm

Author: Alavi Majd Hamid
Alipour Heidari Mahmood
Azam Kamal
Hajizadeh Ebrahim
Zali Mohammad Reza
Publication venue: School of Allied Medical Sciences, Shahid Beheshti University of Medical Sciences
Publication date: 02/10/2011

The relation between single nucleotide polymorphisms (SNPs) and certain diseases has attracted the attention of many researchers, and missing SNPs are quite common in genetic association studies. This article investigates the relation between the SNPs in DNMT1 on human chromosome 19 and colorectal cancer, and aims to present an imputation method for SNPs that are missing not at random. In this case-control study, 100 patients with colorectal cancer attending the Research Institute for Gastroenterology and Liver Disease of Shahid Beheshti University of Medical Sciences formed the case group, and 100 other patients attending the same institute formed the control group. Genetic testing was used to identify the genotype of the 6 SNPs of DNMT1 on chromosome 19 for all patients under investigation. The data were analyzed using logistic regression; a fraction of the data was then deleted, both at random and not at random, the missing values were imputed with the EM algorithm, and the change in the logistic regression coefficients before and after imputation was compared. The results indicated that, for SNPs missing both at random and not at random, the logistic regression coefficients estimated after EM imputation agreed more closely with those obtained from the complete data than the coefficients obtained by discarding the missing values.

Journals Portal, Shahid Beheshti University of Medical Sciences

Mediation analysis with the mediator and outcome missing not at random

Author: Ding Peng
Ghosh Debashis
Yang Fan
Zuo Shuozhi
Publication date: 11/12/2022

Mediation analysis is widely used for investigating direct and indirect causal pathways through which an effect arises. However, many mediation analysis studies are challenged by missingness in the mediator and outcome. In general, when the mediator and outcome are missing not at random, the direct and indirect effects are not identifiable without further assumptions. In this work, we study the identifiability of the direct and indirect effects under some interpretable missing not at random mechanisms. We evaluate the performance of statistical inference under those assumptions through simulation studies and illustrate the proposed methods via the National Job Corps Study.

arXiv.org e-Print Archive

Asymmetric Tri-training for Debiasing Missing-Not-At-Random Explicit Feedback

Author: Farajtabar Mehrdad
Ganin Yaroslav
Hernández-Lobato José Miguel
Hoeffding Wassily
Jiang Nan
Kingma Diederik P
Liang Dawen
Saito Kuniaki
Schnabel Tobias
Swaminathan Adith
Wang Xiaojie
Publication date: 02/06/2020

In most real-world recommender systems, the observed rating data are subject to selection bias, and the data are thus missing-not-at-random. Developing a method to facilitate the learning of a recommender with biased feedback is one of the most challenging problems, as it is widely known that naive approaches under selection bias often lead to suboptimal results. A well-established solution for the problem is using propensity scoring techniques. The propensity score is the probability of each data point being observed, and unbiased performance estimation is possible by weighting each data point by the inverse of its propensity. However, the performance of the propensity-based unbiased estimation approach is often affected by the choice of the propensity estimation model or the high variance problem. To overcome these limitations, we propose a model-agnostic meta-learning method inspired by the asymmetric tri-training framework for unsupervised domain adaptation. The proposed method utilizes two predictors to generate data with reliable pseudo-ratings and another predictor to make the final predictions. In a theoretical analysis, a propensity-independent upper bound of the true performance metric is derived, and it is demonstrated that the proposed method can minimize this bound. We conduct comprehensive experiments using public real-world datasets. The results suggest that the previous propensity-based methods are largely affected by the choice of propensity models and the variance problem caused by the inverse propensity weighting. Moreover, we show that the proposed meta-learning method is robust to these issues and can facilitate the development of effective recommendations from biased explicit feedback. Comment: 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20)
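The inverse propensity weighting described in this abstract can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's method or data: the per-entry losses are synthetic, the propensities are known by construction (in practice they would have to be estimated), and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical setup: one prediction loss per user-item pair, and an
# observation propensity that is higher when the loss is small
# (for example, well-predicted popular items are rated more often).
loss = rng.uniform(0.0, 1.0, size=n)
propensity = 0.9 - 0.8 * loss            # lies in [0.1, 0.9]
observed = rng.random(n) < propensity    # missing-not-at-random mask

# Target quantity: the average loss over ALL pairs, observed or not.
true_loss = loss.mean()

# Naive estimate: average over observed pairs only (biased under MNAR).
naive_estimate = loss[observed].mean()

# IPS estimate: weight each observed loss by 1 / propensity, divide by n.
ips_estimate = np.sum(loss[observed] / propensity[observed]) / n

print(f"true={true_loss:.3f} naive={naive_estimate:.3f} ips={ips_estimate:.3f}")
```

In this setup the naive average is pulled toward the low-loss entries that are observed more often, while the weighted estimate stays close to the full-data average. The smallest propensities produce the largest weights, which is the source of the variance problem the abstract refers to.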

arXiv.org e-Print Archive

Crossref

not-MIWAE: Deep Generative Modelling with Missing not at Random Data

Author: Frellsen Jes
Ipsen Niels Bruun
Mattei Pierre-Alexandre
Publication date: 01/01/2021

When a missing process depends on the missing values themselves, it needs to be explicitly modelled and taken into account while doing likelihood-based inference. We present an approach for building and fitting deep latent variable models (DLVMs) in cases where the missing process is dependent on the missing data. Specifically, a deep neural network enables us to flexibly model the conditional distribution of the missingness pattern given the data. This allows for incorporating prior information about the type of missingness (e.g. self-censoring) into the model. Our inference technique, based on importance-weighted variational inference, involves maximising a lower bound of the joint likelihood. Stochastic gradients of the bound are obtained by using the reparameterisation trick both in latent space and data space. We show on various kinds of data sets and missingness patterns that explicitly modelling the missing process can be invaluable. Comment: Camera-ready version for ICLR 202
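To make "self-censoring" concrete, the sketch below simulates a variable whose chance of being missing grows with its own value and shows the effect on a naive summary. The logistic missingness function, its slope of 3 and the variable names are assumptions chosen for illustration; none of this is the not-MIWAE model itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Complete data: a standard normal variable with true mean 0.
x = rng.normal(loc=0.0, scale=1.0, size=n)

# Self-censoring (assumed logistic form): the larger the value,
# the more likely it is to be missing.
p_missing = 1.0 / (1.0 + np.exp(-3.0 * x))
is_observed = rng.random(n) >= p_missing

full_mean = x.mean()                  # close to 0
observed_mean = x[is_observed].mean() # biased downward by the censoring
observed_fraction = is_observed.mean()

print(f"full={full_mean:.3f} observed={observed_mean:.3f} "
      f"fraction observed={observed_fraction:.2f}")
```

Because the mask depends on the very values it hides, ignoring the missing process leaves the observed mean well below the true mean, which is why such a mechanism has to be modelled explicitly.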

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Online Research Database In Technology

Identification and Estimation of Causal Effects with Confounders Missing Not at Random

Author: Fu Bo
Sun Jian
Publication date: 30/10/2023

Making causal inferences from observational studies can be challenging when confounders are missing not at random. In such cases, identifying causal effects is often not guaranteed. Motivated by a real example, we consider a treatment-independent missingness assumption under which we establish the identification of causal effects when confounders are missing not at random. We propose a weighted estimating equation (WEE) approach for estimating model parameters and introduce three estimators for the average causal effect, based on regression, propensity score weighting, and doubly robust estimation. We evaluate the performance of these estimators through simulations, and provide a real data analysis to illustrate our proposed method. Comment: arXiv admin note: substantial text overlap with arXiv:2211.1501

arXiv.org e-Print Archive

Copula selection models for non-Gaussian responses that are missing not at random

Author: Camarena Brenes J.
Gomes M.
Marra G.
Radice R.
Publication venue: Wiley
Publication date: 10/02/2019

Missing not at random (MNAR) data poses key challenges for statistical inference because the model of interest is typically not identifiable without imposing further (e.g., distributional) assumptions. Sample selection models have been routinely used for handling MNAR by jointly modelling the outcome and selection variables assuming that these follow a bivariate normal distribution. Recent studies have advocated parametric selection model approaches, for example estimated by multiple imputation and maximum likelihood, that are more robust to departures from the normality assumption. However, the proposed methods have been mostly restricted to a specific joint distribution (e.g., bivariate t-distribution). This paper discusses a flexible copula-based selection approach (which accommodates a wide range of non-Gaussian outcome distributions and offers great flexibility in the choice of functional form specifications for both the outcome and selection equations) and proposes a flexible imputation procedure that generates plausible imputed values from the copula selection model. A simulation study characterises the relative performance of the copula model compared with the most commonly used selection models for estimating average treatment effects with MNAR data. We illustrate the methods in the REFLUX study, which evaluates the causal effect of laparoscopic surgery compared to usual medical management on long-term quality of life in patients with reflux disease. We provide software code for implementing the proposed copula framework using the R package GJRM.
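The bivariate normal sample selection model that this abstract takes as its starting point can be written as follows, together with the generic way a copula relaxes it. The symbols are assumptions introduced for this note (outcome y, latent selection variable s*, covariates x and z, copula C with dependence parameter theta) and need not match the paper's notation.

```latex
% Classical sample selection model with bivariate normal errors
\begin{align}
  y_i^{*} &= x_i^{\top}\beta + \varepsilon_{1i}, \qquad
  s_i^{*} = z_i^{\top}\gamma + \varepsilon_{2i}, \\
  y_i &\text{ is observed if and only if } s_i^{*} > 0, \\
  \begin{pmatrix}\varepsilon_{1i}\\ \varepsilon_{2i}\end{pmatrix}
  &\sim \mathcal{N}\!\left(
    \begin{pmatrix}0\\0\end{pmatrix},
    \begin{pmatrix}\sigma^{2} & \rho\sigma\\ \rho\sigma & 1\end{pmatrix}
  \right).
\end{align}
% Copula generalization: arbitrary margins F_1, F_2 linked by a copula C
\begin{equation}
  F(y, s^{*}) = C\bigl(F_{1}(y),\, F_{2}(s^{*});\, \theta\bigr).
\end{equation}
```

When the correlation rho is zero, selection is unrelated to the outcome given the covariates; a nonzero rho is what makes the outcome missing not at random. Replacing the bivariate normal with a copula keeps this dependence structure while allowing non-Gaussian outcome margins.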

City Research Online