Optimization of a parallel permutation testing function for the SPRINT R package
The statistical language R and its Bioconductor package are favoured by many biostatisticians for processing microarray data. The amount of data produced by some analyses has reached the limits of many common bioinformatics computing infrastructures. High Performance Computing systems offer a solution to this issue. The Simple Parallel R Interface (SPRINT) is a package that provides biostatisticians with easy access to High Performance Computing systems and allows the addition of parallelized functions to R. Previous work has established that the SPRINT implementation of an R permutation testing function has close to optimal scaling on up to 512 processors on a supercomputer. Access to supercomputers, however, is not always possible, and so the work presented here compares the performance of the SPRINT implementation on a supercomputer with benchmarks on a range of platforms, including cloud resources and a common desktop machine with multiprocessing capabilities.
Exploiting Parallel R in the Cloud with SPRINT
BACKGROUND: Advances in DNA microarray devices and next-generation massively parallel DNA sequencing platforms have led to an exponential growth in data availability, but the arising opportunities require adequate computing resources. High Performance Computing (HPC) in the Cloud offers an affordable way of meeting this need. OBJECTIVES: Bioconductor, a popular tool for high-throughput genomic data analysis, is distributed as add-on modules for the R statistical programming language, but R has no native capabilities for exploiting multi-processor architectures. SPRINT is an R package that enables easy access to HPC for genomics researchers. This paper investigates setting up and running SPRINT-enabled genomic analyses on Amazon’s Elastic Compute Cloud (EC2), the advantages of submitting applications to EC2 from different parts of the world, and whether resource underutilization can improve application performance. METHODS: The SPRINT parallel implementations of correlation, permutation testing, partitioning around medoids and the multi-purpose papply have been benchmarked on data sets of various sizes on Amazon EC2. Jobs have been submitted from both the UK and Thailand to investigate monetary differences. RESULTS: It is possible to obtain good, scalable performance, but the level of improvement depends on the nature of the algorithm. Resource underutilization can further improve the time to result. The end user’s location affects costs due to factors such as local taxation. CONCLUSIONS: Although not designed to satisfy HPC requirements, Amazon EC2 and cloud computing in general provide an interesting alternative and open new possibilities for smaller organisations with limited funds.
Bond-Order Time Series Analysis for Detecting Reaction Events in Ab Initio Molecular Dynamics Simulations.
Ab initio molecular dynamics is able to predict novel reaction mechanisms by directly observing the individual reaction events that occur in simulation trajectories. In this article, we describe an approach for detecting reaction events from simulation trajectories using a physically motivated model based on time series analysis of ab initio bond orders. We found that applying a threshold to the bond order was insufficient for accurate detection, whereas peak finding on the first time derivative resulted in significantly improved accuracy. The model is trained on a reference set of reaction events representing the ideal result given unlimited computing resources. Our study includes two model systems: a heptanylium carbocation that undergoes hydride shifts and an unsaturated iron carbonyl cluster that features CO ligand migration and bridging behavior. The results indicate a high level of promise for this analysis approach to be used in mechanistic analysis of reactive AIMD simulations more generally.
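The idea of peak finding on the first time derivative of a bond order can be illustrated with a minimal sketch. This is not the authors' trained model: the function names, the finite-difference scheme, the `min_height` cutoff and the synthetic sigmoid trace are all assumptions made for illustration.

```python
import math

def first_derivative(series, dt=1.0):
    """Central finite difference, one-sided at the two ends."""
    n = len(series)
    if n < 2:
        return [0.0] * n
    d = [0.0] * n
    d[0] = (series[1] - series[0]) / dt
    d[-1] = (series[-1] - series[-2]) / dt
    for i in range(1, n - 1):
        d[i] = (series[i + 1] - series[i - 1]) / (2.0 * dt)
    return d

def detect_events(bond_order, dt=1.0, min_height=0.05):
    """Return frame indices where |d(bond order)/dt| has a local maximum
    of at least min_height (a hypothetical, hand-picked cutoff)."""
    mag = [abs(v) for v in first_derivative(bond_order, dt)]
    events = []
    for i in range(1, len(mag) - 1):
        if mag[i] >= min_height and mag[i] > mag[i - 1] and mag[i] >= mag[i + 1]:
            events.append(i)
    return events

# Synthetic trace (assumption): a bond order falling smoothly from ~1 to ~0,
# centred on frame 50, standing in for a bond that breaks.
trace = [1.0 / (1.0 + math.exp((t - 50) / 3.0)) for t in range(100)]
events = detect_events(trace)
```

A plain threshold would only report that the bond order crossed some value, whereas the derivative peak localises the frame at which the bond order changes fastest.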
Parallel Optimisation of Bootstrapping in R
Bootstrapping is a popular and computationally demanding resampling method
used for measuring the accuracy of sample estimates and assisting with
statistical inference. R is a freely available language and environment for
statistical computing popular with biostatisticians for genomic data analyses.
A survey of such R users highlighted its implementation of bootstrapping as a
prime candidate for parallelization to overcome computational bottlenecks. The
Simple Parallel R Interface (SPRINT) is a package that allows R users to
exploit high performance computing in multi-core desktops and supercomputers
without expert knowledge of such systems. This paper describes the
parallelization of bootstrapping for inclusion in the SPRINT R package.
Depending on the complexity of the bootstrap statistic and the number of
resamples, this implementation achieves close to optimal speed-up on up to 16 nodes
of a supercomputer and a speed-up close to 100 on 512 nodes. This performance in a
multi-node setting compares favourably with an existing parallelization option
in the native R implementation of bootstrapping.
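Bootstrapping parallelizes naturally because each resample is independent. The sketch below is a generic illustration in Python, not the SPRINT implementation: `bootstrap_chunk`, `bootstrap`, the chunking scheme, the per-chunk seeding and the sample data are all hypothetical. The `map_fn` argument defaults to the built-in serial `map`; a process pool's `map` could be passed instead to spread the chunks over workers.

```python
import random
import statistics

def bootstrap_chunk(args):
    """Compute `count` bootstrap replicates of `stat` with a private RNG."""
    data, stat, count, seed = args
    rng = random.Random(seed)
    n = len(data)
    return [stat([data[rng.randrange(n)] for _ in range(n)])
            for _ in range(count)]

def bootstrap(data, stat, n_resamples, n_chunks=4, seed=0, map_fn=map):
    """Split the resamples into independent chunks and gather the replicates."""
    base, extra = divmod(n_resamples, n_chunks)
    counts = [base + (1 if i < extra else 0) for i in range(n_chunks)]
    tasks = [(data, stat, c, seed + i) for i, c in enumerate(counts)]
    replicates = []
    for chunk in map_fn(bootstrap_chunk, tasks):
        replicates.extend(chunk)
    return replicates

# Illustrative data (assumption): estimate the standard error of the mean.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
reps = bootstrap(data, statistics.mean, n_resamples=1000)
std_error = statistics.stdev(reps)
```

Giving each chunk its own seeded generator keeps the result reproducible regardless of how the chunks are scheduled.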
MO-ParamILS: A Multi-objective Automatic Algorithm Configuration Framework
Automated algorithm configuration procedures play an increasingly important role in the development and application of algorithms for a wide range of computationally challenging problems. Until very recently, these configuration procedures were limited to optimising a single performance objective, such as the running time or solution quality achieved by the algorithm being configured. However, in many applications there is more than one performance objective of interest. This gives rise to the multi-objective automatic algorithm configuration problem, which involves finding a Pareto set of configurations of a given target algorithm that characterises trade-offs between multiple performance objectives. In this work, we introduce MO-ParamILS, a multi-objective extension of the state-of-the-art single-objective algorithm configuration framework ParamILS, and demonstrate that it produces good results on several challenging bi-objective algorithm configuration scenarios compared to a baseline obtained from using a state-of-the-art single-objective algorithm configurator.
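The notion of a Pareto set of configurations can be made concrete with a small sketch. It shows only the dominance filter for minimised objectives, not the MO-ParamILS search itself; the configuration names and objective values are invented for illustration.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly
    better in at least one (all objectives are minimised)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_set(configs):
    """configs maps a configuration name to its tuple of objective values.
    Returns the names of the non-dominated configurations."""
    return [name for name, score in configs.items()
            if not any(dominates(other, score)
                       for other_name, other in configs.items()
                       if other_name != name)]

# Hypothetical bi-objective scores: (running time, solution cost).
scores = {"A": (1.0, 9.0), "B": (2.0, 4.0), "C": (3.0, 5.0), "D": (5.0, 1.0)}
front = pareto_set(scores)
```

Here configuration C is dropped because B is better in both objectives, while A, B and D each represent a different trade-off.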
Scalable Data Parallel Algorithms for Texture Synthesis and Compression using Gibbs Random Fields
This paper introduces scalable data parallel algorithms for image
processing. Focusing on Gibbs and Markov Random Field model
representation for textures, we present parallel algorithms for
texture synthesis, compression, and maximum likelihood parameter
estimation, currently implemented on Thinking Machines CM-2 and CM-5.
Use of fine-grained, data parallel processing techniques yields
real-time algorithms for texture synthesis and compression that are
substantially faster than the previously known sequential
implementations. Although current implementations are on Connection
Machines, the methodology presented here enables machine independent
scalable algorithms for a number of problems in image processing and
analysis.
(Also cross-referenced as UMIACS-TR-93-80.)