Simulating High-Dimensional Multivariate Data using the bigsimr R Package

Abstract

Accurate data simulation is critical when employing Monte Carlo techniques and evaluating statistical methodology. In this era of big data, measurements are often correlated and high dimensional, as in data obtained from high-throughput biomedical experiments. Because of the computational complexity and the lack of user-friendly software for simulating these massive multivariate constructions, researchers resort to simulation designs that posit independence or perform arbitrary data transformations. To close this gap, we developed the Bigsimr Julia package with R and Python interfaces. This paper focuses on the R interface. These packages enable high-dimensional random vector simulation with arbitrary marginal distributions and dependency specified via a Pearson, Spearman, or Kendall correlation matrix. bigsimr contains high-performance features, including multi-core and graphical-processing-unit-accelerated algorithms to estimate correlation and compute the nearest correlation matrix. Monte Carlo studies quantify the accuracy and scalability of our approach, up to d = 10,000. We describe example workflows and apply them to a high-dimensional data set: RNA-sequencing data obtained from breast cancer tumor samples.

Comment: 22 pages, 10 figures, https://cran.r-project.org/web/packages/bigsimr/index.htm
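To make the idea of simulating correlated variables with arbitrary marginals concrete, the following is a minimal two-dimensional sketch using a Gaussian copula: draw correlated standard normals, map them to uniforms with the normal CDF, then apply each marginal's inverse CDF. It is an illustration only and does not use or reproduce the bigsimr API. The function names, the choice of exponential and uniform marginals, and the Gaussian-copula construction itself are assumptions made for this example.

```python
import math
import random


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def simulate_pair(rho, rate, rng):
    """Draw one (exponential, uniform) pair linked by a Gaussian copula.

    rho is the correlation of the latent normals (hypothetical parameter
    name), rate is the exponential rate of the first marginal.
    """
    # Correlated standard normals (2x2 Cholesky factor written out by hand).
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    # Probability integral transform to uniforms; clamp to avoid log(0).
    u1 = min(normal_cdf(z1), 1.0 - 1e-16)
    u2 = normal_cdf(z2)
    # Inverse CDFs of the target marginals.
    x1 = -math.log1p(-u1) / rate  # Exponential(rate)
    x2 = u2                       # Uniform(0, 1)
    return x1, x2


def ranks(values):
    """Zero-based ranks (assumes no ties, as with continuous data)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = rank
    return out


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


rng = random.Random(2024)
sample = [simulate_pair(0.7, 2.0, rng) for _ in range(5000)]
xs = [p[0] for p in sample]
ys = [p[1] for p in sample]
print("estimated Spearman correlation:", round(spearman(xs, ys), 3))
```

Because the marginal transforms are monotone, the rank-based (Spearman) correlation of the output is determined by the latent normal correlation alone, which is why rank correlations are a convenient way to specify dependency across non-normal marginals. In this sketch the latent correlation is set directly; matching a target Pearson correlation on the output scale would require an additional adjustment step that is not shown here.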
