Model-based Clustering with Missing Not At Random Data
Traditional ways for handling missing values are not designed for the
clustering purpose and they rarely apply to the general case, though frequent
in practice, of Missing Not At Random (MNAR) values. This paper proposes to
embed MNAR data directly within model-based clustering algorithms. We introduce
a mixture model for different types of data (continuous, count, categorical and
mixed) to jointly model the data distribution and the MNAR mechanism. Eight
different MNAR models are proposed, which may depend on the underlying
(unknown) classes and/or the values of the missing variables themselves. We
prove the identifiability of the parameters of both the data distribution and
the mechanism, whatever the type of data and the mechanism, and propose an EM
or Stochastic EM algorithm to estimate them. The code is available on
https://github.com/AudeSportisse/Clustering-MNAR. We also prove
that MNAR models for which the missingness depends on the class membership have
the nice property that the statistical inference can be carried out on the data
matrix concatenated with the mask by considering a MAR mechanism instead.
Finally, we perform empirical evaluations for the proposed sub-models on
synthetic data and we illustrate the relevance of our method on a medical
register, the TraumaBase® dataset.
Deep Generative Imputation Model for Missing Not At Random Data
Data analysis usually suffers from the Missing Not At Random (MNAR) problem,
where the cause of the missingness is not fully observed. Compared with the
simpler Missing Completely At Random (MCAR) setting, MNAR is closer to
realistic scenarios but more complex and challenging. Existing statistical
methods model the MNAR mechanism through different decompositions of the joint
distribution of the complete data and the missing mask. But we empirically find
that directly incorporating these statistical methods into deep generative
models is sub-optimal. Specifically, it would neglect the confidence of the
reconstructed mask during the MNAR imputation process, which leads to
insufficient information extraction and less-guaranteed imputation quality. In
this paper, we revisit the MNAR problem from a novel perspective that the
complete data and missing mask are two modalities of incomplete data on an
equal footing. Along this line, we put forward a generative-model-specific
joint probability decomposition method, the conjunction model, to represent the
distributions of two modalities in parallel and extract sufficient information
from both complete data and missing mask. Taking a step further, we exploit a
deep generative imputation model, namely GNR, to process the real-world missing
mechanism in the latent space and concurrently impute the incomplete data and
reconstruct the missing mask. The experimental results show that our GNR
surpasses state-of-the-art MNAR baselines by significant margins (average RMSE
improvements of 9.9% to 18.8%) and always gives better mask reconstruction
accuracy, which makes the imputation more principled.
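For orientation, the "different decompositions" of the joint distribution of the complete data and the missing mask mentioned in this abstract are usually understood to be the selection-model and pattern-mixture factorizations. The notation below ($x$ for the complete data, $m$ for the mask, $x_{\mathrm{obs}}$ and $x_{\mathrm{mis}}$ for the observed and missing parts) is assumed for illustration and is not taken from the abstract; the paper's own conjunction model is not reproduced here.

```latex
\begin{align*}
p(x, m) &= p(x)\, p(m \mid x) && \text{(selection model)} \\
p(x, m) &= p(m)\, p(x \mid m) && \text{(pattern-mixture model)}
\end{align*}
% MCAR: p(m | x) = p(m)
% MAR:  p(m | x) = p(m | x_obs)
% MNAR: p(m | x) still depends on x_mis after conditioning on x_obs
```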
Unbiased Recommender Learning from Missing-Not-At-Random Implicit Feedback
Recommender systems widely use implicit feedback such as click data because
of its general availability. Although the presence of clicks signals the users'
preference to some extent, the lack of such clicks does not necessarily
indicate a negative response from the users, as it is possible that the users
were not exposed to the items (positive-unlabeled problem). This leads to a
difficulty in predicting the users' preferences from implicit feedback.
Previous studies addressed the positive-unlabeled problem by uniformly
upweighting the loss for the positive feedback data or estimating the
confidence of each data point having relevance information via the EM algorithm.
However, these methods failed to address the missing-not-at-random problem in
which popular or frequently recommended items are more likely to be clicked
than other items even if a user does not have a considerable interest in them.
To overcome these limitations, we first define an ideal loss function to be
optimized to realize recommendations that maximize the relevance and propose an
unbiased estimator for the ideal loss. Subsequently, we analyze the variance of
the proposed unbiased estimator and further propose a clipped estimator that
includes the unbiased estimator as a special case. We demonstrate that the
clipped estimator is expected to improve the performance of the recommender
system, by considering the bias-variance trade-off. We conduct semi-synthetic
and real-world experiments and demonstrate that the proposed method largely
outperforms the baselines. In particular, the proposed method works better for
rare items that are less frequently observed in the training data. The findings
indicate that the proposed method can better achieve the objective of
recommending items with the highest relevance.
Comment: accepted at WSDM'2
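The idea of a clipped propensity-weighted loss can be illustrated with a minimal sketch. The abstract does not give the exact form of the estimator, so everything here is an assumption made for illustration: the log loss, the function name `clipped_loss`, the exposure propensities, and the choice to clip the propensity from below.

```python
import math

def clipped_loss(clicks, scores, propensities, clip=0.0):
    """Propensity-weighted log loss with propensities clipped from below.

    clicks:       0/1 click indicators
    scores:       predicted relevance probabilities in (0, 1)
    propensities: assumed exposure probabilities in (0, 1]
    clip:         lower bound on the propensity (hypothetical parameter);
                  any clip <= min(propensities) leaves the weights unchanged
    """
    total = 0.0
    for y, s, theta in zip(clicks, scores, propensities):
        w = y / max(theta, clip)  # clipped inverse-propensity weight
        total += w * -math.log(s) + (1.0 - w) * -math.log(1.0 - s)
    return total / len(clicks)

# Toy data, invented for this example.
clicks = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4]
propensities = [0.5, 0.5, 0.25, 0.8]

unclipped = clipped_loss(clicks, scores, propensities, clip=0.0)
mild = clipped_loss(clicks, scores, propensities, clip=0.2)   # below every propensity
naive = clipped_loss(clicks, scores, propensities, clip=1.0)  # weights collapse to raw clicks
print(unclipped, mild, naive)
```

In this sketch a clip value no larger than the smallest propensity reproduces the unclipped estimator, while a clip value of 1 treats clicks as plain labels. Values in between trade some bias for lower variance.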
Imputation in missing not at random SNPs data using EM algorithm
The relation between single nucleotide polymorphisms (SNPs) and certain diseases has attracted the attention of many researchers, and missing SNPs are quite common in genetic association studies. This article therefore investigates the relation between SNPs in DNMT1 on human chromosome 19 and colorectal cancer, and aims to present an imputation method for SNPs that are missing not at random. In this case-control study, 100 patients with colorectal cancer attending the Research Institute for Gastroenterology and Liver Disease of Shahid Beheshti University of Medical Sciences formed the case group, and 100 other patients attending the same institute formed the control group. A genetic test was applied to identify the genotype of the 6 SNPs of DNMT1 on chromosome 19 for all patients under investigation. The data were analyzed using logistic regression; a fraction of the data was then removed, both at random and not at random, the missing values were imputed with the EM algorithm, and the logistic regression coefficients before and after imputation were compared. The results indicate that, for SNPs missing both at random and not at random, the logistic regression coefficients estimated after EM imputation correspond more closely to those obtained from the complete data than the coefficients obtained by deleting the missing values.
Mediation analysis with the mediator and outcome missing not at random
Mediation analysis is widely used for investigating direct and indirect
causal pathways through which an effect arises. However, mediation
analysis studies are often challenged by missingness in the mediator and
outcome. In general, when the mediator and outcome are missing not at random,
the direct and indirect effects are not identifiable without further
assumptions. In this work, we study the identifiability of the direct and
indirect effects under some interpretable missing not at random mechanisms. We
evaluate the performance of statistical inference under those assumptions
through simulation studies and illustrate the proposed methods via the National
Job Corps Study.
Asymmetric Tri-training for Debiasing Missing-Not-At-Random Explicit Feedback
In most real-world recommender systems, the observed rating data are subject
to selection bias, and the data are thus missing-not-at-random. Developing a
method to facilitate the learning of a recommender with biased feedback is one
of the most challenging problems, as it is widely known that naive approaches
under selection bias often lead to suboptimal results. A well-established
solution to the problem is to use propensity scoring techniques. The propensity
score is the probability of each data point being observed, and unbiased performance
estimation is possible by weighting each data point by the inverse of its propensity.
However, the performance of the propensity-based unbiased estimation approach
is often affected by the choice of the propensity estimation model or the high
variance problem. To overcome these limitations, we propose a model-agnostic
meta-learning method inspired by the asymmetric tri-training framework for
unsupervised domain adaptation. The proposed method utilizes two predictors to
generate data with reliable pseudo-ratings and another predictor to make the
final predictions. In a theoretical analysis, a propensity-independent upper
bound of the true performance metric is derived, and it is demonstrated that
the proposed method can minimize this bound. We conduct comprehensive
experiments using public real-world datasets. The results suggest that the
previous propensity-based methods are largely affected by the choice of
propensity models and the variance problem caused by the inverse propensity
weighting. Moreover, we show that the proposed meta-learning method is robust
to these issues and can help develop effective recommendations from
biased explicit feedback.
Comment: 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20
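The inverse propensity weighting described in this abstract can be sketched on simulated data. This is a minimal illustration, not the paper's method: the rating distribution, the propensity model (higher ratings are more likely to be observed), and the function names are assumptions, and the true propensities are taken as known, which is rarely the case in practice.

```python
import random

def naive_mean(values, observed):
    """Average over observed entries only (biased when data are MNAR)."""
    obs = [v for v, o in zip(values, observed) if o]
    return sum(obs) / len(obs)

def ips_mean(values, observed, propensities):
    """Inverse-propensity-scored estimate of the mean over ALL entries."""
    weighted = sum(v / p for v, o, p in zip(values, observed, propensities) if o)
    return weighted / len(values)

def simulate(n=200_000, seed=0):
    """Hypothetical setup: ratings in 1..5, observed with probability 0.1 * rating."""
    rng = random.Random(seed)
    values = [rng.randint(1, 5) for _ in range(n)]
    propensities = [0.1 * v for v in values]
    observed = [rng.random() < p for p in propensities]
    return values, observed, propensities

values, observed, propensities = simulate()
true_mean = sum(values) / len(values)
print(true_mean, naive_mean(values, observed), ips_mean(values, observed, propensities))
```

Under this setup the naive average over observed ratings overstates the true mean, because high ratings are over-represented among the observed entries, while the weighted estimate stays close to it. The sketch does not address the sensitivity to misspecified propensities and the variance of the weights that the abstract highlights.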
not-MIWAE: Deep Generative Modelling with Missing not at Random Data
When a missing process depends on the missing values themselves, it needs to
be explicitly modelled and taken into account while doing likelihood-based
inference. We present an approach for building and fitting deep latent variable
models (DLVMs) in cases where the missing process is dependent on the missing
data. Specifically, a deep neural network enables us to flexibly model the
conditional distribution of the missingness pattern given the data. This allows
for incorporating prior information about the type of missingness (e.g.
self-censoring) into the model. Our inference technique, based on
importance-weighted variational inference, involves maximising a lower bound of
the joint likelihood. Stochastic gradients of the bound are obtained by using
the reparameterisation trick both in latent space and data space. We show on
various kinds of data sets and missingness patterns that explicitly modelling
the missing process can be invaluable.
Comment: Camera-ready version for ICLR 202
Identification and Estimation of Causal Effects with Confounders Missing Not at Random
Making causal inferences from observational studies can be challenging when
confounders are missing not at random. In such cases, identifying causal
effects is often not guaranteed. Motivated by a real example, we consider a
treatment-independent missingness assumption under which we establish the
identification of causal effects when confounders are missing not at random. We
propose a weighted estimating equation (WEE) approach for estimating model
parameters and introduce three estimators for the average causal effect, based
on regression, propensity score weighting, and doubly robust estimation. We
evaluate the performance of these estimators through simulations, and provide a
real data analysis to illustrate our proposed method.
Comment: arXiv admin note: substantial text overlap with arXiv:2211.1501
Copula selection models for non-Gaussian responses that are missing not at random
Missing not at random (MNAR) data poses key challenges for statistical inference because the model of interest is typically not identifiable without imposing further (e.g., distributional) assumptions. Sample selection models have been routinely used for handling MNAR by jointly modelling the outcome and selection variables assuming that these follow a bivariate normal distribution. Recent studies have advocated parametric selection model approaches, for example estimated by multiple imputation and maximum likelihood, that are more robust to departures from the normality assumption. However, the proposed methods have been mostly restricted to a specific joint distribution (e.g., bivariate t-distribution). This paper discusses a flexible copula-based selection approach (which accommodates a wide range of non-Gaussian outcome distributions and offers great flexibility in the choice of functional form specifications for both the outcome and selection equations) and proposes a flexible imputation procedure that generates plausible imputed values from the copula selection model. A simulation study characterises the relative performance of the copula model compared with the most commonly used selection models for estimating average treatment effects with MNAR data. We illustrate the methods in the REFLUX study, which evaluates the causal effect of laparoscopic surgery compared to usual medical management on long-term quality of life in patients with reflux disease. We provide software code for implementing the proposed copula framework using the R package GJRM