Multiple Imputation Using Gaussian Copulas
Missing observations are pervasive throughout empirical research, especially
in the social sciences. Despite multiple approaches to dealing adequately with
missing data, many scholars still fail to address this vital issue. In this
paper, we present a simple-to-use method for generating multiple imputations
using a Gaussian copula. The Gaussian copula for multiple imputation (Hoff,
2007) allows scholars to attain estimation results that have good coverage and
small bias. The use of copulas to model the dependence among variables will
enable researchers to construct valid joint distributions of the data, even
without knowledge of the actual underlying marginal distributions. Multiple
imputations are then generated by drawing observations from the resulting
posterior joint distribution and replacing the missing values. Using simulated
and observational data from published social science research, we compare
imputation via Gaussian copulas with two other widely used imputation methods:
MICE and Amelia II. Our results suggest that the Gaussian copula approach has a
slightly smaller bias, higher coverage rates, and narrower confidence intervals
compared to the other methods. This is especially true when the variables with
missing data are not normally distributed. These results, combined with
theoretical guarantees and ease of use, suggest that the approach examined
provides an attractive alternative for applied researchers undertaking multiple
imputations.
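The imputation step described in this abstract can be illustrated with a minimal two-variable sketch. It is not the Bayesian sampler of Hoff (2007): it assumes a fully observed variable `x`, a partially observed variable `y` (missing entries marked `None`), rank-based normal scores in place of the unknown marginals, and a plug-in correlation estimate. The function names `normal_scores`, `correlation`, and `copula_impute` are hypothetical.

```python
import random
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for scores and back-transformation


def normal_scores(values):
    """Map values to normal scores through their ranks (empirical CDF)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = _STD.inv_cdf(rank / (n + 1))
    return scores


def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5


def copula_impute(x, y, n_imputations=5, seed=0):
    """Return several completed copies of y (assumption: x has no missing values)."""
    rng = random.Random(seed)
    zx = normal_scores(x)
    obs = [i for i, v in enumerate(y) if v is not None]
    y_sorted = sorted(y[i] for i in obs)
    zy_obs = normal_scores([y[i] for i in obs])
    # Latent correlation estimated from the rows where y is observed.
    rho = correlation([zx[i] for i in obs], zy_obs)
    sd = max(0.0, 1.0 - rho ** 2) ** 0.5
    completed = []
    for _ in range(n_imputations):
        filled = list(y)
        for i, v in enumerate(y):
            if v is None:
                # Draw the latent score from the conditional normal given x.
                z = rng.gauss(rho * zx[i], sd)
                # Back-transform through the empirical quantiles of observed y.
                k = min(int(_STD.cdf(z) * len(y_sorted)), len(y_sorted) - 1)
                filled[i] = y_sorted[k]
        completed.append(filled)
    return completed
```

Each completed copy would then be analysed separately and the results pooled, as in any multiple-imputation workflow. Because imputed values are drawn from the empirical quantiles of the observed `y`, no parametric form for the marginal distribution is needed.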
Load curve data cleansing and imputation via sparsity and low rank
The smart grid vision is to build an intelligent power network with an
unprecedented level of situational awareness and controllability over its
services and infrastructure. This paper advocates statistical inference methods
to robustify power monitoring tasks against the outlier effects owing to faulty
readings and malicious attacks, as well as against missing data due to privacy
concerns and communication errors. In this context, a novel load cleansing and
imputation scheme is developed leveraging the low intrinsic-dimensionality of
spatiotemporal load profiles and the sparse nature of "bad data." A robust
estimator based on principal components pursuit (PCP) is adopted, which effects
a twofold sparsity-promoting regularization through an $\ell_1$-norm of the
outliers, and the nuclear norm of the nominal load profiles. Upon recasting the
non-separable nuclear norm into a form amenable to decentralized optimization,
a distributed (D-) PCP algorithm is developed to carry out the imputation and
cleansing tasks using networked devices comprising the so-termed advanced
metering infrastructure. If D-PCP converges and a qualification inequality is
satisfied, the novel distributed estimator provably attains the performance of
its centralized PCP counterpart, which has access to all networkwide data.
Computer simulations and tests with real load curve data corroborate the
convergence and effectiveness of the novel D-PCP algorithm.
Comment: 8 figures, submitted to IEEE Transactions on Smart Grid - Special
issue on "Optimization methods and algorithms applied to smart grid
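For orientation, one common way to write a PCP-type estimator with missing entries is sketched below. The notation is an assumption rather than the paper's own: $Y$ is the matrix of load readings, $\mathcal{P}_{\Omega}$ keeps only the observed entries, $X$ is the low-rank nominal load matrix, $O$ is the sparse outlier matrix, and $\lambda_{*}$, $\lambda_{1}$ are tuning parameters. The second line is the standard factorized characterization of the nuclear norm, valid when the factors $P$ and $Q$ have at least $\operatorname{rank}(X)$ columns; because the squared Frobenius norms split across rows, a form of this kind is what makes decentralized optimization feasible.

```latex
\min_{X,\,O}\ \tfrac{1}{2}\,\bigl\| \mathcal{P}_{\Omega}(Y - X - O) \bigr\|_F^2
  + \lambda_{*}\,\|X\|_{*} + \lambda_{1}\,\|O\|_{1}

\|X\|_{*} \;=\; \min_{P,\,Q:\ X = P Q^{\top}}\ \tfrac{1}{2}\bigl( \|P\|_F^2 + \|Q\|_F^2 \bigr)
```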
Causal Inference and Matrix Completion with Correlated Incomplete Data
Missing data problems are frequently encountered in biomedical research, social sciences, and environmental studies. When data are missing completely at random, a complete-case analysis may be the easiest approach. However, when data are not missing completely at random, ignoring the missing values will result in biased estimators. Much work has been done on handling missing data in the last two decades, including likelihood-based methods, imputation methods, and Bayesian approaches. The so-called matrix completion algorithm is one of the imputation approaches that has been widely discussed in the missing data literature. However, in a longitudinal setting, limited effort has been devoted to using covariate information to recover the outcome matrix via matrix completion when the response is subject to missingness.
In Chapter 1, the basic definitions and concepts of different types of correlated data are introduced, along with matrix completion algorithms and semiparametric approaches for handling missingness in correlated data analysis. The definitions of robust estimation and of interference in causal inference are also presented in this chapter.
In Chapter 2, we consider the prediction of missing responses in a longitudinal dataset via matrix completion. We propose a fixed effects longitudinal low-rank model which incorporates both subject-specific and time-specific covariates. The missingness mechanism is allowed to be missing at random, and the inverse probability weighting approach is utilized to debias the traditional quadratic loss in the matrix completion literature. To solve the optimization problem, a two-step optimization algorithm is proposed which provides good statistical properties for the estimation of the fixed effects and the low-rank term. In the theoretical investigation, the non-asymptotic error bounds on the fixed effects and the low-rank term are presented. We illustrate the finite sample performance of the proposed algorithm via simulation studies and apply our method to both a COVID-19 dataset and a PM2.5 emissions dataset.
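As a rough illustration of how inverse probability weighting can debias a quadratic matrix-completion loss, a generic objective might look like the sketch below. Every symbol and the additive form are assumptions made for illustration, not the thesis's actual model: $y_{it}$ is the response of subject $i$ at time $t$, $r_{it}$ indicates whether it is observed, $\hat\pi_{it}$ is an estimated observation probability, $x_i$ and $z_t$ are subject-specific and time-specific covariates, and $L$ is the low-rank term.

```latex
\min_{\beta,\,\gamma,\,L}\ \sum_{i=1}^{n}\sum_{t=1}^{T}
  \frac{r_{it}}{\hat\pi_{it}}
  \bigl( y_{it} - x_i^{\top}\beta - z_t^{\top}\gamma - L_{it} \bigr)^2
  + \lambda\,\|L\|_{*}
```

Dividing each observed squared residual by its observation probability gives the weighted loss the same expectation as the full-data loss under missingness at random, which is the sense in which the weighting removes the bias.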
In Chapter 3, we consider the partial interference setting, that is, the whole population can be partitioned into clusters where the outcome of each unit depends on the intervention on other units within the same cluster, but not on the units in different clusters. We also assume that the confounders are subject to nonignorable missingness. We propose three distinct consistent estimators for the direct, indirect, total, and overall effect of the intervention on the outcome, and derive the asymptotic results accordingly. A comprehensive simulation study is also carried out to investigate the finite sample properties of the proposed estimators. We illustrate the proposed methods by analyzing the data collected from an Acid Rain Program, which was launched to reduce air pollution in the USA by encouraging the installation of scrubbers on power plants, where the records of some operating characteristics of the power generating facilities are subject to missingness.
In Chapter 4, we focus on the estimation of network causal effects. Under the setting of nonignorable missing confounders, we develop a multiply robust estimation procedure that gains extra protection against model misspecification. Compared with doubly robust estimators proposed in Chapter 3, the proposed multiply robust estimators are consistent if either one pair of the propensity score of treatment and missingness mechanism, or the joint model of confounders and the outcome, is correctly specified. The finite sample performance of the proposed methods under different missingness rates and cluster sizes is investigated, and we further illustrate the proposed methods with the same real data used in Chapter 3.
We conclude this thesis and discuss future work in Chapter 5. Specifically, in Section 5.1, we summarize the contributions of the chapters in this thesis. In Section 5.2, we discuss the extension of Chapter 2 where the construction of confidence intervals for the low-rank term and the estimated fixed effects is investigated. Finally, in Section 5.3, we briefly discuss the potential extensions of Chapters 3 and 4 to a more general setting.
Structured Matrix Completion with Applications to Genomic Data Integration
Matrix completion has attracted significant recent attention in many fields
including statistics, applied mathematics and electrical engineering. Current
literature on matrix completion focuses primarily on independent sampling
models under which the individual observed entries are sampled independently.
Motivated by applications in genomic data integration, we propose a new
framework of structured matrix completion (SMC) to treat structured missingness
by design. Specifically, our proposed method aims at efficient matrix recovery
when a subset of the rows and columns of an approximately low-rank matrix are
observed. We provide theoretical justification for the proposed SMC method and
derive a lower bound for the estimation errors, which together establish the
optimal rate of recovery over certain classes of approximately low-rank
matrices. Simulation studies show that the method performs well in finite
samples under a variety of configurations. The method is applied to integrate
several ovarian cancer genomic studies with different extents of genomic
measurements, which enables us to construct more accurate prediction rules for
ovarian cancer survival.
Comment: Accepted for publication in Journal of the American Statistical
Association
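The row-and-column observation pattern described in this abstract can be made concrete with the idealized, exactly low-rank case. The block notation is an assumption for illustration: $A_{11}$, $A_{12}$, and $A_{21}$ are the observed blocks and $A_{22}$ is the unobserved one. When $A_{11}$ already carries the full rank of $A$, the missing block is determined by the observed ones through the Moore-Penrose pseudoinverse. The SMC method in the paper targets approximately low-rank matrices, so it should be expected to be more elaborate than this identity.

```latex
A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},
\qquad
\operatorname{rank}(A) = \operatorname{rank}(A_{11}) = r
\ \Longrightarrow\
A_{22} = A_{21}\, A_{11}^{\dagger}\, A_{12}
```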