8 research outputs found
Communication-Avoiding Optimization Methods for Distributed Massive-Scale Sparse Inverse Covariance Estimation
Across a variety of scientific disciplines, sparse inverse covariance
estimation is a popular tool for capturing the underlying dependency
relationships in multivariate data. Unfortunately, most estimators are not
scalable enough to handle the sizes of modern high-dimensional data sets (often
on the order of terabytes), and assume Gaussian samples. To address these
deficiencies, we introduce HP-CONCORD, a highly scalable optimization method
for estimating a sparse inverse covariance matrix based on a regularized
pseudolikelihood framework, without assuming Gaussianity. Our parallel proximal
gradient method uses a novel communication-avoiding linear algebra algorithm
and runs across a multi-node cluster with up to 1k nodes (24k cores), achieving
parallel scalability on problems with up to ~819 billion parameters (1.28
million dimensions); even on a single node, HP-CONCORD demonstrates
scalability, outperforming a state-of-the-art method. We also use HP-CONCORD to
estimate the underlying dependency structure of the brain from fMRI data, and
use the result to identify functional regions automatically. The results show
good agreement with a clustering from the neuroscience literature.
Comment: Main paper: 15 pages, appendix: 24 pages
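As a rough illustration of the proximal gradient idea mentioned in the abstract, the following single-machine sketch applies it to a CONCORD-style pseudolikelihood objective, assumed here to be -sum_i log(omega_ii) + (1/2) tr(Omega S Omega) + lam * sum_{i != j} |omega_ij|. The function names, the fixed step size, and the pure-Python matrix code are illustrative assumptions; this is not the paper's communication-avoiding distributed algorithm.

```python
def soft_threshold(z, t):
    """Proximal operator of t * |z|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def matmul(a, b):
    """Dense product of two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def concord_prox_grad(S, lam, step=0.5, iters=200):
    """Proximal gradient iterations for the assumed CONCORD-style objective.

    S is a symmetric sample covariance matrix. The fixed step size is an
    assumption for this toy example; a practical solver would use a line
    search that also keeps the diagonal of Omega positive.
    """
    p = len(S)
    omega = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    for _ in range(iters):
        s_om = matmul(S, omega)
        om_s = matmul(omega, S)
        new = [[0.0] * p for _ in range(p)]
        for i in range(p):
            for j in range(p):
                # Gradient of the smooth part: (S Omega + Omega S) / 2 - diag(1 / omega_ii)
                grad = 0.5 * (s_om[i][j] + om_s[i][j])
                if i == j:
                    grad -= 1.0 / omega[i][i]
                    new[i][j] = omega[i][j] - step * grad  # diagonal is not penalized
                else:
                    new[i][j] = soft_threshold(omega[i][j] - step * grad, step * lam)
        omega = new
    return omega


S = [[1.0, 0.5], [0.5, 1.0]]
sparse = concord_prox_grad(S, lam=1.0)   # heavy penalty: off-diagonals are zeroed
dense = concord_prox_grad(S, lam=0.1)    # light penalty: off-diagonals survive
```

A larger penalty `lam` drives more off-diagonal entries of the estimate to exactly zero, which is what makes the estimated dependency graph sparse.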
Communication-avoiding optimization methods for distributed massive-scale sparse inverse covariance estimation
Across a variety of scientific disciplines, sparse inverse covariance estimation is a popular tool for capturing the underlying dependency relationships in multivariate data. Unfortunately, most estimators are not scalable enough to handle the sizes of modern high-dimensional data sets (often on the order of terabytes), and assume Gaussian samples. To address these deficiencies, we introduce HP-CONCORD, a highly scalable optimization method for estimating a sparse inverse covariance matrix based on a regularized pseudolikelihood framework, without assuming Gaussianity. Our parallel proximal gradient method uses a novel communication-avoiding linear algebra algorithm and runs across a multi-node cluster with up to 1k nodes (24k cores), achieving parallel scalability on problems with up to ~819 billion parameters (1.28 million dimensions); even on a single node, HP-CONCORD demonstrates scalability, outperforming a state-of-the-art method. We also use HP-CONCORD to estimate the underlying dependency structure of the brain from fMRI data, and use the result to identify functional regions automatically. The results show good agreement with a clustering from the neuroscience literature.
High-Performance Statistical Computing in the Computing Environments of the 2020s
Technological advances in the past decade, hardware and software alike, have
made access to high-performance computing (HPC) easier than ever. We review
these advances from a statistical computing perspective. Cloud computing makes
access to supercomputers affordable. Deep learning software libraries make
programming statistical algorithms easy and enable users to write code once and
run it anywhere -- from a laptop to a workstation with multiple graphics
processing units (GPUs) or a supercomputer in a cloud. Highlighting how these
developments benefit statisticians, we review recent optimization algorithms
that are useful for high-dimensional models and can harness the power of HPC.
Code snippets are provided to demonstrate the ease of programming. We also
provide an easy-to-use distributed matrix data structure suitable for HPC.
Employing this data structure, we illustrate various statistical applications
including large-scale positron emission tomography and ℓ1-regularized Cox
regression. Our examples easily scale up to an 8-GPU workstation and a
720-CPU-core cluster in a cloud. As a case in point, we analyze the onset of
type-2 diabetes from the UK Biobank with 200,000 subjects and about 500,000
single nucleotide polymorphisms using the HPC ℓ1-regularized Cox
regression. Fitting this half-million-variate model takes less than 45 minutes
and reconfirms known associations. To our knowledge, this is the first
demonstration of the feasibility of penalized regression of survival outcomes
at this scale.
Comment: Accepted for publication in Statistical Science
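To make the ℓ1-regularized Cox regression mentioned above more concrete, here is a minimal single-machine sketch that fits it by proximal gradient descent. It assumes no tied event times, a fixed step size, and a tiny in-memory data set; the function names and the toy data are hypothetical, and this is not the paper's distributed implementation.

```python
import math


def soft_threshold(z, t):
    """Proximal operator of t * |z|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def cox_gradient(beta, X, times, events):
    """Gradient of the negative log partial likelihood, divided by n.

    Assumes no tied event times. events[i] is 1 for an observed event
    and 0 for a censored observation.
    """
    n, p = len(X), len(beta)
    w = [math.exp(sum(x[k] * beta[k] for k in range(p))) for x in X]
    grad = [0.0] * p
    for i in range(n):
        if not events[i]:
            continue
        # Risk set: subjects still under observation at time times[i]
        risk = [j for j in range(n) if times[j] >= times[i]]
        denom = sum(w[j] for j in risk)
        for k in range(p):
            weighted_mean = sum(w[j] * X[j][k] for j in risk) / denom
            grad[k] -= (X[i][k] - weighted_mean) / n
    return grad


def l1_cox(X, times, events, lam, step=0.5, iters=500):
    """Proximal gradient iterations: gradient step, then soft-thresholding."""
    beta = [0.0] * len(X[0])
    for _ in range(iters):
        g = cox_gradient(beta, X, times, events)
        beta = [soft_threshold(b - step * gk, step * lam)
                for b, gk in zip(beta, g)]
    return beta


# Hypothetical toy data: one binary covariate, four subjects, all events observed.
X = [[1.0], [0.0], [1.0], [0.0]]
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 1, 1]

beta_sparse = l1_cox(X, times, events, lam=1.0)   # penalty large enough to zero the coefficient
beta_small = l1_cox(X, times, events, lam=0.01)   # light penalty keeps the coefficient
```

Each iteration needs only sums over risk sets and an elementwise shrinkage, with no matrix inversion, which is the kind of structure that lends itself to parallel and distributed execution.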
Easily Parallelizable Statistical Computing Methods and Their Application in Modern High-Performance Computing Environments
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Natural Sciences, Department of Statistics, 2020. 8. Joong-Ho Won.
Technological advances in the past decade, hardware and software alike, have made access to high-performance computing (HPC) easier than ever. In this dissertation, easily-parallelizable, inversion-free, and variable-separated algorithms and their implementation in statistical computing are discussed. The first part considers statistical estimation problems under structured sparsity posed as minimization of a sum of two or three convex functions, one of which is a composition of non-smooth and linear functions. Examples include graph-guided sparse fused lasso and overlapping group lasso. Two classes of inversion-free primal-dual algorithms are considered and unified from a perspective of monotone operator theory. From this unification, a continuum of preconditioned forward-backward operator splitting algorithms amenable to parallel and distributed computing is proposed. The unification is further exploited to introduce a continuum of accelerated algorithms on which the theoretically optimal asymptotic rate of convergence is obtained. For the second part, easy-to-use distributed matrix data structures in PyTorch and Julia are presented. They enable users to write code once and run it anywhere from a laptop to a workstation with multiple graphics processing units (GPUs) or a supercomputer in a cloud. With these data structures, various parallelizable statistical applications, including nonnegative matrix factorization, positron emission tomography, multidimensional scaling, and ℓ1-regularized Cox regression, are demonstrated. The examples scale up to an 8-GPU workstation and a 720-CPU-core cluster in a cloud. As a case in point, the onset of type-2 diabetes from the UK Biobank with 400,000 subjects and about 500,000 single nucleotide polymorphisms is analyzed using the HPC ℓ1-regularized Cox regression. Fitting a half-million-variate model took about 50 minutes, reconfirming known associations.
To my knowledge, this is the first demonstration of the feasibility of a joint genome-wide association analysis of survival outcomes at this scale.
Chapter 1 Prologue
1.1 Introduction
1.2 Accessible High-Performance Computing Systems
1.2.1 Preliminaries
1.2.2 Multiple CPU nodes: clusters, supercomputers, and clouds
1.2.3 Multi-GPU node
1.3 Highly Parallelizable Algorithms
1.3.1 MM algorithms
1.3.2 Proximal gradient descent
1.3.3 Proximal distance algorithm
1.3.4 Primal-dual methods
Chapter 2 Easily Parallelizable and Distributable Class of Algorithms for Structured Sparsity, with Optimal Acceleration
2.1 Introduction
2.2 Unification of Algorithms LV and CV (g ≡ 0)
2.2.1 Relation between Algorithms LV and CV
2.2.2 Unified algorithm class
2.2.3 Convergence analysis
2.3 Optimal acceleration
2.3.1 Algorithms
2.3.2 Convergence analysis
2.4 Stochastic optimal acceleration
2.4.1 Algorithm
2.4.2 Convergence analysis
2.5 Numerical experiments
2.5.1 Model problems
2.5.2 Convergence behavior
2.5.3 Scalability
2.6 Discussion
Chapter 3 Towards Unified Programming for High-Performance Statistical Computing Environments
3.1 Introduction
3.2 Related Software
3.2.1 Message-passing interface and distributed array interfaces
3.2.2 Unified array interfaces for CPU and GPU
3.3 Easy-to-use Software Libraries for HPC
3.3.1 Deep learning libraries and HPC
3.3.2 Case study: PyTorch versus TensorFlow
3.3.3 A brief introduction to PyTorch
3.3.4 A brief introduction to Julia
3.3.5 Methods and multiple dispatch
3.3.6 Multidimensional arrays
3.3.7 Matrix multiplication
3.3.8 Dot syntax for vectorization
3.4 Distributed matrix data structure
3.4.1 Distributed matrices in PyTorch: distmat
3.4.2 Distributed arrays in Julia: MPIArray
3.5 Examples
3.5.1 Nonnegative matrix factorization
3.5.2 Positron emission tomography
3.5.3 Multidimensional scaling
3.5.4 L1-regularized Cox regression
3.5.5 Genome-wide survival analysis of the UK Biobank dataset
3.6 Discussion
Chapter 4 Conclusion
Appendix A Monotone Operator Theory
Appendix B Proofs for Chapter II
B.1 Preconditioned forward-backward splitting
B.2 Optimal acceleration
B.3 Optimal stochastic acceleration
Appendix C AWS EC2 and ParallelCluster
C.1 Overview
C.2 Glossary
C.3 Prerequisites
C.4 Installation
C.5 Configuration
C.6 Creating, accessing, and destroying the cluster
C.7 Installation of libraries
C.8 Running a job
C.9 Miscellaneous
Appendix D Code for memory-efficient L1-regularized Cox proportional hazards model
Appendix E Details of SNPs selected in L1-regularized Cox regression
Bibliography
Abstract in Korean
Doctor