Distributed Estimation and Inference for the Analysis of Big Biomedical Data
This thesis focuses on developing and implementing new statistical methods to address some of the current difficulties encountered in the analysis of high-dimensional correlated biomedical data. Following the divide-and-conquer paradigm, I develop a theoretically sound and computationally tractable class of distributed statistical methods that are made accessible to practitioners through R statistical software.
This thesis aims to establish a class of distributed statistical methods for regression analyses with very large outcome variables arising in many biomedical fields, such as in metabolomic or imaging research. The general distributed procedure divides data into blocks that are analyzed on a parallelized computational platform and combines these separate results via Hansen’s (1982) generalized method of moments. These new methods provide distributed and efficient statistical inference in many different regression settings. Computational efficiency is achieved by leveraging recent developments in large scale computing, such as the MapReduce paradigm on the Hadoop platform.
In the first project presented in Chapter III, I develop a divide-and-conquer procedure implemented in a parallelized computational scheme for statistical estimation and inference of regression parameters with high-dimensional correlated responses. This project is motivated by an electroencephalography study whose goal is to determine the effect of iron deficiency on infant auditory recognition memory. The proposed method (published as Hector and Song (2020a)), the Distributed and Integrated Method of Moments (DIMM), divides responses into subvectors to be analyzed in parallel using pairwise composite likelihood, and combines results using an optimal one-step meta-estimator.
In the second project presented in Chapter IV, I develop an extended theoretical framework of distributed estimation and inference to incorporate a broad range of classical statistical models and biomedical data types. To reduce computational burden and meet data privacy demands, I propose to divide data by outcomes and subjects, leading to a doubly divide-and-conquer paradigm. I also address parameter heterogeneity explicitly for added flexibility. I establish a new theoretical framework for the analysis of a broad class of big data problems to facilitate valid statistical inference for biomedical researchers. Possible applications include genomic data, metabolomic data, longitudinal and spatial data, and many more.
In the third project presented in Chapter V, I propose a distributed quadratic inference function framework to jointly estimate regression parameters from multiple potentially heterogeneous data sources with correlated vector outcomes. This project is motivated by the analysis of the association between smoking and metabolites in a large cohort study. The primary goal of this joint integrative analysis is to estimate covariate effects on all outcomes through a marginal regression model in a statistically and computationally efficient way. To overcome computational and modeling challenges arising from the high-dimensional likelihood of the correlated vector outcomes, I propose to analyze each data source using Qu et al.’s quadratic inference functions, and then to jointly reestimate parameters from each data source by accounting for correlation between data sources.
PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/163220/1/ehector_1.pd
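The divide-and-conquer idea in the abstract above can be illustrated with a much simpler stand-in. The sketch below is not DIMM and does not use the generalized method of moments: it assumes a one-parameter regression without intercept, a common noise variance across blocks, and a split by subjects, and it combines block estimates by information weighting. The function names and the toy data are hypothetical.

```python
# Illustrative divide-and-conquer estimator for y = beta * x + noise (no intercept).
# Assumption: this is a generic sketch, not the thesis's method. Blocks are combined
# by information weighting rather than by a generalized method of moments.

def block_summary(xs, ys):
    """Return (estimate, information) for one block; information = sum of x^2."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / sxx, sxx

def combine(summaries):
    """Information-weighted average of the block estimates."""
    total_info = sum(info for _, info in summaries)
    return sum(est * info for est, info in summaries) / total_info

# Hypothetical toy data, already split into three blocks of subjects.
blocks = [
    ([1.0, 2.0, 3.0], [2.1, 3.9, 6.2]),
    ([0.5, 1.5, 2.5], [1.2, 2.8, 5.1]),
    ([4.0, 5.0], [7.8, 10.3]),
]

# "Map" step: each block is summarized independently (could run in parallel).
summaries = [block_summary(xs, ys) for xs, ys in blocks]
# "Reduce" step: only the small summaries are combined.
beta_distributed = combine(summaries)

# Full-data least squares, for comparison.
all_x = [x for xs, _ in blocks for x in xs]
all_y = [y for _, ys in blocks for y in ys]
beta_full, _ = block_summary(all_x, all_y)
```

In this special case the combined estimate coincides with the full-data least-squares estimate up to rounding, which is the kind of efficiency a well-chosen combination rule aims for. With correlated, high-dimensional outcomes the weighting is considerably more involved.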
First-order Newton-type Estimator for Distributed Estimation and Inference
This paper studies distributed estimation and inference for a general
statistical problem with a convex loss that could be non-differentiable. For
the purpose of efficient computation, we restrict ourselves to stochastic
first-order optimization, which enjoys low per-iteration complexity. To
motivate the proposed method, we first investigate the theoretical properties
of a straightforward Divide-and-Conquer Stochastic Gradient Descent (DC-SGD)
approach. Our theory shows that there is a restriction on the number of
machines and this restriction becomes more stringent when the dimension is
large. To overcome this limitation, this paper proposes a new multi-round
distributed estimation procedure that approximates the Newton step only using
stochastic subgradient. The key component in our method is the proposal of a
computationally efficient estimator of $\Sigma^{-1} w$, where $\Sigma$ is the
population Hessian matrix and $w$ is any given vector. Instead of estimating
$\Sigma$ (or $\Sigma^{-1}$), which usually requires the second-order
differentiability of the loss, the proposed First-Order Newton-type Estimator
(FONE) directly estimates the vector of interest $\Sigma^{-1} w$ as a whole and
is applicable to non-differentiable losses. Our estimator also facilitates the
inference for the empirical risk minimizer. It turns out that the key term in
the limiting covariance has the form of $\Sigma^{-1} w$, which can be estimated
by FONE.
Comment: 60 pages
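To give some intuition for computing a Newton-type direction (the inverse Hessian applied to a given vector) from first-order information only, here is a deterministic sketch. It is not the paper's stochastic algorithm: it assumes a smooth quadratic loss, for which a gradient difference equals the Hessian applied to the displacement, and it uses a plain fixed-point iteration. The matrix `H`, the vectors, the step size, and the function names are all hypothetical.

```python
# Fixed-point iteration for z = H^{-1} w that only ever calls a gradient function.
# Assumption: quadratic loss 0.5 * theta' H theta - b' theta, so that
# gradient(theta0 + z) - gradient(theta0) = H z exactly. This is a deterministic
# analogue for illustration, not the stochastic subgradient method of the paper.

H = [[2.0, 0.5], [0.5, 1.0]]  # hypothetical Hessian (never inverted below)
b = [1.0, -1.0]

def gradient(theta):
    """Gradient of the assumed quadratic loss at theta."""
    return [sum(H[i][j] * theta[j] for j in range(2)) - b[i] for i in range(2)]

def newton_direction(grad, theta0, w, step=0.5, iters=200):
    """Iterate z <- z - step * (grad(theta0 + z) - grad(theta0) - w)."""
    g0 = grad(theta0)
    z = [0.0] * len(w)
    for _ in range(iters):
        g = grad([t + zi for t, zi in zip(theta0, z)])
        # (g - g0) plays the role of H z; the bracket is the residual of H z = w.
        z = [zi - step * (gi - g0i - wi)
             for zi, gi, g0i, wi in zip(z, g, g0, w)]
    return z

w = [1.0, 1.0]
theta0 = [0.3, -0.2]
z = newton_direction(gradient, theta0, w)
```

The step size must be small enough relative to the largest Hessian eigenvalue for the iteration to contract; 0.5 suffices for this particular matrix.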
Communication-Efficient Distributed Estimation and Inference for Cox's Model
Motivated by multi-center biomedical studies that cannot share individual
data due to privacy and ownership concerns, we develop communication-efficient
iterative distributed algorithms for estimation and inference in the
high-dimensional sparse Cox proportional hazards model. We demonstrate that our
estimator, even with a relatively small number of iterations, achieves the same
convergence rate as the ideal full-sample estimator under very mild conditions.
To construct confidence intervals for linear combinations of high-dimensional
hazard regression coefficients, we introduce a novel debiased method, establish
central limit theorems, and provide consistent variance estimators that yield
asymptotically valid distributed confidence intervals. In addition, we provide
valid and powerful distributed hypothesis tests for any coordinate element
based on a decorrelated score test. We allow time-dependent covariates as well
as censored survival times. Extensive numerical experiments on both simulated
and real data lend further support to our theory and demonstrate that our
communication-efficient distributed estimators, confidence intervals, and
hypothesis tests improve upon alternative methods.
Distributed Linear Regression with Compositional Covariates
With the availability of extraordinarily huge data sets, solving the problems
of distributed statistical methodology and computing for such data sets has
become increasingly crucial in the big data area. In this paper, we focus on
the distributed sparse penalized linear log-contrast model in massive
compositional data. In particular, two distributed optimization techniques
under centralized and decentralized topologies are proposed for solving the two
different constrained convex optimization problems. Both proposed
algorithms are based on the frameworks of Alternating Direction Method of
Multipliers (ADMM) and Coordinate Descent Method of Multipliers (CDMM, Lin et
al., 2014, Biometrika). It is worth emphasizing that, in the decentralized
topology, we introduce a distributed coordinate-wise descent algorithm based on
Group ADMM (GADMM, Elgabli et al., 2020, Journal of Machine Learning Research)
for obtaining a communication-efficient regularized estimation.
Correspondingly, the convergence theories of the proposed algorithms are
rigorously established under some regularity conditions. Numerical experiments
on both synthetic and real data are conducted to evaluate our proposed
algorithms.
Comment: 35 pages, 2 figures
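As a rough illustration of ADMM under a centralized topology, the sketch below runs consensus ADMM on a deliberately simplified problem: unpenalized least squares with a single coefficient and no log-contrast or zero-sum constraint. It is not the algorithm proposed in the paper. The function name, the penalty parameter `rho`, and the toy data are assumptions made for the example.

```python
# Consensus ADMM (scaled form) for a one-coefficient least-squares problem whose
# data are spread over several machines. Assumption: a simplified stand-in with no
# sparsity penalty and no compositional constraint, for illustration only.

def consensus_admm(blocks, rho=10.0, iters=500):
    # Each machine only needs its own sufficient statistics.
    stats = [(sum(x * x for x in xs), sum(x * y for x, y in zip(xs, ys)))
             for xs, ys in blocks]
    k = len(stats)
    z = 0.0          # consensus value held by the central node
    u = [0.0] * k    # scaled dual variables, one per machine
    for _ in range(iters):
        # Local step: minimize 0.5 * sum (y - x*beta)^2 + (rho/2) * (beta - z + u)^2.
        local = [(sxy + rho * (z - ui)) / (sxx + rho)
                 for (sxx, sxy), ui in zip(stats, u)]
        # Central step: average the local solutions plus duals.
        z = sum(bi + ui for bi, ui in zip(local, u)) / k
        # Dual step: accumulate each machine's disagreement with the consensus.
        u = [ui + bi - z for ui, bi in zip(u, local)]
    return z

# Hypothetical toy data held by three machines.
blocks = [
    ([1.0, 2.0, 3.0], [2.1, 3.9, 6.2]),
    ([0.5, 1.5, 2.5], [1.2, 2.8, 5.1]),
    ([4.0, 5.0], [7.8, 10.3]),
]
beta_admm = consensus_admm(blocks)

# Pooled least-squares solution, for comparison.
pooled = (sum(x * y for xs, ys in blocks for x, y in zip(xs, ys))
          / sum(x * x for xs, _ in blocks for x in xs))
```

Each round communicates only one number per machine in each direction, which is the sense in which such schemes are communication-efficient.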
One-step estimator paths for concave regularization
The statistics literature of the past 15 years has established many favorable
properties for sparse diminishing-bias regularization: techniques which can
roughly be understood as providing estimation under penalty functions spanning
the range of concavity between $L_0$ and $L_1$ norms. However, lasso
$L_1$-regularized estimation remains the standard tool for industrial `Big
Data' applications because of its minimal computational cost and the presence
of easy-to-apply rules for penalty selection. In response, this article
proposes a simple new algorithm framework that requires no more computation
than a lasso path: the path of one-step estimators (POSE) does penalized
regression estimation on a grid of decreasing penalties, but adapts
coefficient-specific weights to decrease as a function of the coefficient
estimated in the previous path step. This provides sparse diminishing-bias
regularization at no extra cost over the fastest lasso algorithms. Moreover,
our `gamma lasso' implementation of POSE is accompanied by a reliable heuristic
for the fit degrees of freedom, so that standard information criteria can be
applied in penalty selection. We also provide novel results on the distance
between weighted-$L_1$ and $L_0$ penalized predictors; this allows us to build
intuition about POSE and other diminishing-bias regularization schemes. The
methods and results are illustrated in extensive simulations and in application
of logistic regression to evaluating the performance of hockey players.
Comment: Data and code are in the gamlr package for R. Supplemental appendix
is at https://github.com/TaddyLab/pose/raw/master/paper/supplemental.pd
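The path-of-one-step-estimators idea described above, in which coefficient-specific weights shrink as a function of the previous estimate, can be sketched under strong simplifying assumptions. The code below assumes an orthonormal design, so that each weighted-lasso step reduces to soft-thresholding, and it assumes a weight of the form 1 / (1 + gamma * |previous coefficient|). It does not use the gamlr package, and the function names and inputs are hypothetical.

```python
# Sketch of a one-step estimator path. Assumptions: orthonormal design (so each
# penalized fit is a soft-threshold of the least-squares coefficient z_j) and
# weights 1 / (1 + gamma * |previous beta_j|). Illustration only, not gamlr.

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def pose_path(z, lambdas, gamma):
    """Fit along a decreasing penalty grid, reweighting from the previous step."""
    beta = [0.0] * len(z)
    path = []
    for lam in lambdas:
        weights = [1.0 / (1.0 + gamma * abs(b)) for b in beta]
        beta = [soft_threshold(zj, lam * wj) for zj, wj in zip(z, weights)]
        path.append(beta)
    return path

z = [3.0, 0.5]             # hypothetical least-squares coefficients
lambdas = [4.0, 2.0, 1.0]  # decreasing penalty grid

adaptive = pose_path(z, lambdas, gamma=1.0)
plain_lasso = pose_path(z, lambdas, gamma=0.0)  # gamma = 0 keeps all weights at 1
```

At the final penalty the adaptive path shrinks the large coefficient less than the plain lasso does, while the small coefficient stays at zero in both. This is the diminishing-bias behavior the abstract refers to, obtained at the cost of one soft-threshold per coefficient per step.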
Uncovering latent structure in valued graphs: A variational approach
As more and more network-structured data sets are available, the statistical
analysis of valued graphs has become commonplace. Looking for a latent
structure is one of the many strategies used to better understand the behavior
of a network. Several methods already exist for the binary case. We present a
model-based strategy to uncover groups of nodes in valued graphs. This
framework can be used for a wide span of parametric random graph models and
allows covariates to be included. Variational tools allow us to achieve approximate
maximum likelihood estimation of the parameters of these models. We provide a
simulation study showing that our estimation method performs well over a broad
range of situations. We apply this method to analyze host--parasite interaction
networks in forest ecosystems.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS361 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org