620 research outputs found

Distributed Estimation and Inference for the Analysis of Big Biomedical Data

Author: Hector Emily
Publication date: 01/01/2020

This thesis focuses on developing and implementing new statistical methods to address some of the current difficulties encountered in the analysis of high-dimensional correlated biomedical data. Following the divide-and-conquer paradigm, I develop a theoretically sound and computationally tractable class of distributed statistical methods that are made accessible to practitioners through R statistical software. This thesis aims to establish a class of distributed statistical methods for regression analyses with very large outcome variables arising in many biomedical fields, such as in metabolomic or imaging research. The general distributed procedure divides data into blocks that are analyzed on a parallelized computational platform and combines these separate results via Hansen’s (1982) generalized method of moments. These new methods provide distributed and efficient statistical inference in many different regression settings. Computational efficiency is achieved by leveraging recent developments in large scale computing, such as the MapReduce paradigm on the Hadoop platform. In the first project presented in Chapter III, I develop a divide-and-conquer procedure implemented in a parallelized computational scheme for statistical estimation and inference of regression parameters with high-dimensional correlated responses. This project is motivated by an electroencephalography study whose goal is to determine the effect of iron deficiency on infant auditory recognition memory. The proposed method (published as Hector and Song (2020a)), the Distributed and Integrated Method of Moments (DIMM), divides responses into subvectors to be analyzed in parallel using pairwise composite likelihood, and combines results using an optimal one-step meta-estimator. In the second project presented in Chapter IV, I develop an extended theoretical framework of distributed estimation and inference to incorporate a broad range of classical statistical models and biomedical data types. 
To reduce the computational burden and meet data privacy demands, I propose to divide data by outcomes and subjects, leading to a doubly divide-and-conquer paradigm. I also address parameter heterogeneity explicitly for added flexibility. I establish a new theoretical framework for the analysis of a broad class of big data problems to facilitate valid statistical inference for biomedical researchers. Possible applications include genomic data, metabolomic data, longitudinal and spatial data, and many more. In the third project presented in Chapter V, I propose a distributed quadratic inference function framework to jointly estimate regression parameters from multiple potentially heterogeneous data sources with correlated vector outcomes. This project is motivated by the analysis of the association between smoking and metabolites in a large cohort study. The primary goal of this joint integrative analysis is to estimate covariate effects on all outcomes through a marginal regression model in a statistically and computationally efficient way. To overcome computational and modeling challenges arising from the high-dimensional likelihood of the correlated vector outcomes, I propose to analyze each data source using Qu et al.’s quadratic inference functions, and then to jointly re-estimate parameters from each data source by accounting for correlation between data sources.

PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/163220/1/ehector_1.pd

Deep Blue Documents at the University of Michigan
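
The divide-and-conquer idea in the abstract above (analyze blocks separately, then combine the block results) can be illustrated with a much simpler stand-in. The sketch below is not DIMM and does not use Hansen's generalized method of moments: it assumes a plain linear model, splits subjects (not outcomes) into blocks, fits ordinary least squares on each block, and combines the block estimates by precision weighting. All function names and simulation settings are hypothetical.

```python
import numpy as np

def block_ols(X, y):
    """OLS estimate and its estimated covariance for one data block."""
    XtX = X.T @ X
    beta = np.linalg.solve(XtX, X.T @ y)
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    return beta, sigma2 * np.linalg.inv(XtX)

def combine(estimates):
    """Precision-weighted average of block estimates (assumed combination rule)."""
    precisions = [np.linalg.inv(cov) for _, cov in estimates]
    total = sum(precisions)
    weighted = sum(P @ b for P, (b, _) in zip(precisions, estimates))
    # Combined estimate and its approximate covariance
    return np.linalg.solve(total, weighted), np.linalg.inv(total)

# Simulated data (assumption: independent subjects, homoscedastic noise)
rng = np.random.default_rng(0)
n, p, n_blocks = 6000, 3, 6
beta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(size=n)

# Each block could be fitted on a separate machine; here they run in sequence
blocks = zip(np.array_split(X, n_blocks), np.array_split(y, n_blocks))
estimates = [block_ols(Xb, yb) for Xb, yb in blocks]
beta_dc, cov_dc = combine(estimates)
beta_full, _ = block_ols(X, y)
```

In this toy setting the combined estimate `beta_dc` should sit very close to the full-data estimate `beta_full`, which is the property the distributed methods aim to preserve in far harder settings with correlated outcomes.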

First-order Newton-type Estimator for Distributed Estimation and Inference

Authors: Chen Xi, Liu Weidong, Zhang Yichen
Publication date: 04/02/2021

This paper studies distributed estimation and inference for a general statistical problem with a convex loss that could be non-differentiable. For the purpose of efficient computation, we restrict ourselves to stochastic first-order optimization, which enjoys low per-iteration complexity. To motivate the proposed method, we first investigate the theoretical properties of a straightforward Divide-and-Conquer Stochastic Gradient Descent (DC-SGD) approach. Our theory shows that there is a restriction on the number of machines and this restriction becomes more stringent when the dimension $p$ is large. To overcome this limitation, this paper proposes a new multi-round distributed estimation procedure that approximates the Newton step only using stochastic subgradient. The key component in our method is the proposal of a computationally efficient estimator of $\Sigma^{-1} w$, where $\Sigma$ is the population Hessian matrix and $w$ is any given vector. Instead of estimating $\Sigma$ (or $\Sigma^{-1}$) that usually requires the second-order differentiability of the loss, the proposed First-Order Newton-type Estimator (FONE) directly estimates the vector of interest $\Sigma^{-1} w$ as a whole and is applicable to non-differentiable losses. Our estimator also facilitates the inference for the empirical risk minimizer. It turns out that the key term in the limiting covariance has the form of $\Sigma^{-1} w$, which can be estimated by FONE.
Comment: 60 pages

arXiv.org e-Print Archive
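
To make concrete how $\Sigma^{-1} w$ can be obtained from gradients alone, without forming $\Sigma$ or its inverse, here is a minimal sketch. It is not the FONE algorithm from the abstract above: it assumes a smooth squared-error loss and uses full (not stochastic) gradients, so that the gradient difference $g(\theta_0 + z) - g(\theta_0)$ equals $\Sigma z$ exactly and a simple fixed-point iteration solves $\Sigma z = w$. The function names, step size, and iteration count are hypothetical choices.

```python
import numpy as np

def grad(theta, X, y):
    """Gradient of the average loss 0.5 * mean((y - X @ theta) ** 2)."""
    return X.T @ (X @ theta - y) / X.shape[0]

def inv_hessian_times_w(theta0, w, X, y, step=0.5, n_iter=500):
    """Approximate H^{-1} w using only gradient evaluations.

    For a quadratic loss, grad(theta0 + z) - grad(theta0) == H @ z, so the
    update below is a Richardson iteration for the linear system H z = w.
    """
    z = np.zeros_like(w)
    g0 = grad(theta0, X, y)
    for _ in range(n_iter):
        z = z - step * (grad(theta0 + z, X, y) - g0 - w)
    return z

# Simulated data (assumption: well-conditioned design, Hessian close to identity)
rng = np.random.default_rng(1)
n, p = 2000, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + rng.normal(size=n)

theta0 = np.zeros(p)           # any reference point works for a quadratic loss
w = np.array([1.0, 2.0, 3.0, 4.0])
z_hat = inv_hessian_times_w(theta0, w, X, y)

# Direct solve, shown only to check the first-order approximation
H = X.T @ X / n
z_exact = np.linalg.solve(H, w)
```

The iteration converges here because the step size is below $2 / \lambda_{\max}(H)$; the paper's setting (mini-batch subgradients, non-differentiable losses, multiple machines) needs the more careful construction its authors describe.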

Communication-Efficient Distributed Estimation and Inference for Cox's Model

Authors: Bayle Pierre, Fan Jianqing, Lou Zhipeng
Publication date: 28/03/2023

Motivated by multi-center biomedical studies that cannot share individual data due to privacy and ownership concerns, we develop communication-efficient iterative distributed algorithms for estimation and inference in the high-dimensional sparse Cox proportional hazards model. We demonstrate that our estimator, even with a relatively small number of iterations, achieves the same convergence rate as the ideal full-sample estimator under very mild conditions. To construct confidence intervals for linear combinations of high-dimensional hazard regression coefficients, we introduce a novel debiased method, establish central limit theorems, and provide consistent variance estimators that yield asymptotically valid distributed confidence intervals. In addition, we provide valid and powerful distributed hypothesis tests for any coordinate element based on a decorrelated score test. We allow time-dependent covariates as well as censored survival times. Extensive numerical experiments on both simulated and real data lend further support to our theory and demonstrate that our communication-efficient distributed estimators, confidence intervals, and hypothesis tests improve upon alternative methods.

arXiv.org e-Print Archive

Distributed Linear Regression with Compositional Covariates

Authors: Chao Yue, Huang Lei, Ma Xuejun
Publication date: 21/10/2023

With the availability of extraordinarily large data sets, solving the problems of distributed statistical methodology and computing for such data sets has become increasingly crucial in the big data area. In this paper, we focus on the distributed sparse penalized linear log-contrast model in massive compositional data. In particular, two distributed optimization techniques under centralized and decentralized topologies are proposed for solving the two different constrained convex optimization problems. Both proposed algorithms are based on the frameworks of the Alternating Direction Method of Multipliers (ADMM) and the Coordinate Descent Method of Multipliers (CDMM, Lin et al., 2014, Biometrika). It is worth emphasizing that, in the decentralized topology, we introduce a distributed coordinate-wise descent algorithm based on Group ADMM (GADMM, Elgabli et al., 2020, Journal of Machine Learning Research) for obtaining a communication-efficient regularized estimation. Correspondingly, the convergence theories of the proposed algorithms are rigorously established under some regularity conditions. Numerical experiments on both synthetic and real data are conducted to evaluate our proposed algorithms.
Comment: 35 pages, 2 figures

arXiv.org e-Print Archive

Generalized score matching for non-negative data

Authors: Drton Mathias, Shojaie Ali, Yu Shiqing
Publication date: 01/01/2019

Copenhagen University Research Information System

One-step estimator paths for concave regularization

Author: Taddy Matt
Publication date: 01/05/2016

The statistics literature of the past 15 years has established many favorable properties for sparse diminishing-bias regularization: techniques which can roughly be understood as providing estimation under penalty functions spanning the range of concavity between $L_0$ and $L_1$ norms. However, lasso $L_1$-regularized estimation remains the standard tool for industrial 'Big Data' applications because of its minimal computational cost and the presence of easy-to-apply rules for penalty selection. In response, this article proposes a simple new algorithm framework that requires no more computation than a lasso path: the path of one-step estimators (POSE) does $L_1$ penalized regression estimation on a grid of decreasing penalties, but adapts coefficient-specific weights to decrease as a function of the coefficient estimated in the previous path step. This provides sparse diminishing-bias regularization at no extra cost over the fastest lasso algorithms. Moreover, our 'gamma lasso' implementation of POSE is accompanied by a reliable heuristic for the fit degrees of freedom, so that standard information criteria can be applied in penalty selection. We also provide novel results on the distance between weighted-$L_1$ and $L_0$ penalized predictors; this allows us to build intuition about POSE and other diminishing-bias regularization schemes. The methods and results are illustrated in extensive simulations and in application of logistic regression to evaluating the performance of hockey players.
Comment: Data and code are in the gamlr package for R. Supplemental appendix is at https://github.com/TaddyLab/pose/raw/master/paper/supplemental.pd

arXiv.org e-Print Archive

FigShare
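
The path-of-one-step-estimators idea in the abstract above (a weighted $L_1$ fit on a decreasing penalty grid, with each coefficient's weight shrinking as its previous estimate grows) can be sketched for least squares as follows. This is an illustration, not the gamlr implementation: the weight rule $1 / (1 + \gamma |\hat\beta_j|)$, the coordinate-descent solver, the grid, and all names are assumptions made for the example.

```python
import numpy as np

def soft_threshold(a, t):
    """Soft-thresholding operator used by lasso coordinate descent."""
    return np.sign(a) * max(abs(a) - t, 0.0)

def weighted_lasso_cd(X, y, lam, weights, beta, n_sweeps=200, tol=1e-8):
    """Minimize (1/2n)||y - X b||^2 + lam * sum_j weights[j] * |b_j|."""
    n, p = X.shape
    col_sq = (X ** 2).sum(axis=0) / n
    beta = beta.copy()
    resid = y - X @ beta
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            old = beta[j]
            rho = X[:, j] @ resid / n + col_sq[j] * old
            new = soft_threshold(rho, lam * weights[j]) / col_sq[j]
            if new != old:
                resid -= X[:, j] * (new - old)   # keep residual in sync
                beta[j] = new
                max_change = max(max_change, abs(new - old))
        if max_change < tol:
            break
    return beta

def pose_path(X, y, gamma=1.0, n_lambda=20, ratio=0.01):
    """Weighted-L1 path; weights depend on the previous path step's estimate."""
    n, p = X.shape
    lam_max = np.max(np.abs(X.T @ y)) / n
    lambdas = lam_max * np.logspace(0, np.log10(ratio), n_lambda)
    beta = np.zeros(p)
    path = []
    for lam in lambdas:
        # Assumed weight rule: larger previous coefficients get smaller penalties
        weights = 1.0 / (1.0 + gamma * np.abs(beta))
        beta = weighted_lasso_cd(X, y, lam, weights, beta)  # warm start
        path.append(beta.copy())
    return lambdas, np.array(path)

# Simulated sparse regression (assumption: no intercept, unit noise variance)
rng = np.random.default_rng(2)
n, p = 200, 10
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(size=n)

lambdas, path = pose_path(X, y)
```

Setting `gamma=0` keeps every weight at 1 and reduces the loop to an ordinary lasso path, which is one way to see that the adaptive weights add almost no computation per path step.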

Uncovering latent structure in valued graphs: A variational approach

Authors: Mariadassou Mahendra, Robin Stéphane, Vacher Corinne
Publication venue: Institute of Mathematical Statistics
Publication date: 01/01/2010

As more and more network-structured data sets become available, the statistical analysis of valued graphs has become commonplace. Looking for a latent structure is one of the many strategies used to better understand the behavior of a network. Several methods already exist for the binary case. We present a model-based strategy to uncover groups of nodes in valued graphs. This framework can be used for a wide span of parametric random graph models and allows covariates to be included. Variational tools allow us to achieve approximate maximum likelihood estimation of the parameters of these models. We provide a simulation study showing that our estimation method performs well over a broad range of situations. We apply this method to analyze host-parasite interaction networks in forest ecosystems.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS361 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org

arXiv.org e-Print Archive