Simulating High-Dimensional Multivariate Data using the bigsimr R Package
Accurate data simulation is critical when employing Monte Carlo techniques to evaluate statistical methodology. In this era of big data, measurements are often correlated and high dimensional, as in data obtained from high-throughput biomedical experiments. Due to the computational complexity and the lack of user-friendly software available to simulate these massive multivariate constructions, researchers often resort to simulation designs that posit independence or perform arbitrary data transformations. To close this gap, we developed the Bigsimr Julia package with R and Python interfaces; this paper focuses on the R interface. These packages enable simulation of high-dimensional random vectors with arbitrary marginal distributions and dependency specified via a Pearson, Spearman, or Kendall correlation matrix. bigsimr contains high-performance features, including multi-core and graphics-processing-unit-accelerated algorithms to estimate correlation and compute the nearest correlation matrix. Monte Carlo studies quantify the accuracy and scalability of our approach. We describe example workflows and apply bigsimr to a high-dimensional data set: RNA-sequencing data obtained from breast cancer tumor samples. (22 pages, 10 figures; https://cran.r-project.org/web/packages/bigsimr/index.htm)
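For concreteness, the following is a minimal sketch of the simulation workflow in R, based on the JuliaCall-backed interface described in the package documentation (bigsimr_setup, distributions_setup, rvec, cor_nearPD); these function names are recalled rather than verified and may differ across package versions, and the target correlation matrix and margins are illustrative assumptions, not values from the paper:

# Minimal bigsimr sketch (assumed API; verify names against the
# package vignette, as they may differ across versions).
library(bigsimr)

# bigsimr wraps the Bigsimr Julia package; these calls load the
# Julia modules for simulation and for marginal distributions.
bs   <- bigsimr_setup()
dist <- distributions_setup()

# Illustrative target Pearson correlation matrix (3 dimensions).
target_corr <- matrix(c( 1.0,  0.5, -0.3,
                         0.5,  1.0,  0.4,
                        -0.3,  0.4,  1.0), nrow = 3)

# Illustrative, arbitrary marginal distributions.
margins <- c(dist$Exponential(3.2),
             dist$NegativeBinomial(20, 0.2),
             dist$LogNormal(3, 1))

# Project to the nearest correlation matrix in case the target is
# not a valid (positive definite) correlation matrix.
adjusted_corr <- bs$cor_nearPD(target_corr)

# Simulate 10,000 correlated random vectors with the given margins,
# then check how well the target dependency is reproduced.
x <- bs$rvec(10000, adjusted_corr, margins)
bs$cor(x, bs$Pearson)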
On a Calculus-based Statistics Course for Life Science Students
The choice of pedagogy in statistics should take advantage of students' quantitative capabilities and scientific background. In this article, we propose a model for a statistics course that assumes student competency in calculus and a broadening knowledge of biology. We illustrate our methods and practices through examples from the curriculum.
Efficient Parallel Statistical Model Checking of Biochemical Networks
We consider the problem of verifying stochastic models of biochemical networks against behavioral properties expressed in temporal logic. Exact probabilistic verification approaches, such as CSL/PCTL model checking, are undermined by a huge computational demand that rules them out for most real case studies. Less demanding approaches, such as statistical model checking, estimate the likelihood that a property is satisfied by sampling executions of the stochastic model. We propose a methodology for efficiently estimating the likelihood that an LTL property P holds for a stochastic model of a biochemical network. As with other statistical verification techniques, the methodology we propose uses a stochastic simulation algorithm to generate execution samples; however, three key aspects improve its efficiency. First, sample generation is driven by on-the-fly verification of P, which results in optimal overall simulation time. Second, the confidence interval for the probability that P holds is estimated with an efficient variant of the Wilson method, which ensures faster convergence. Third, the whole methodology is designed in a parallel fashion, and a prototype software tool has been implemented that performs the sampling/verification process in parallel on an HPC architecture.
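To make these ingredients concrete, below is a generic R sketch of the sampling/estimation loop: batched parallel simulation combined with a standard Wilson score interval and early stopping once the desired precision is reached. It is an illustration under stated assumptions, not the paper's tool: the authors' efficient Wilson variant and on-the-fly LTL monitor are not reproduced, and simulate_trace and holds are hypothetical placeholders for a stochastic simulator (e.g., Gillespie's SSA) and a property checker.

# Generic statistical-model-checking sketch (not the paper's tool).
library(parallel)

# Standard Wilson score interval for k successes in n Bernoulli
# trials at confidence level 1 - alpha.
wilson_interval <- function(k, n, alpha = 0.05) {
  z      <- qnorm(1 - alpha / 2)
  p_hat  <- k / n
  denom  <- 1 + z^2 / n
  center <- (p_hat + z^2 / (2 * n)) / denom
  half   <- (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2))
  c(lower = center - half, upper = center + half)
}

# Draw batches of samples in parallel and stop as soon as the
# confidence interval is narrower than `width`, so no more samples
# are generated than the requested precision demands.
smc_estimate <- function(simulate_trace, holds, width = 0.01,
                         batch = 1000, cores = 4) {
  k <- 0; n <- 0
  repeat {
    # holds(trace) is a hypothetical monitor returning TRUE iff the
    # LTL property P is satisfied on the sampled execution.
    outcomes <- mclapply(seq_len(batch),
                         function(i) holds(simulate_trace()),
                         mc.cores = cores)
    k  <- k + sum(unlist(outcomes))
    n  <- n + batch
    ci <- wilson_interval(k, n)
    if (ci["upper"] - ci["lower"] < width)
      return(list(estimate = k / n, ci = ci, samples = n))
  }
}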