Massively parallel approximate Gaussian process regression
We explore how the big-three computing paradigms -- symmetric multi-processor (SMP), graphical processing units (GPUs), and cluster computing -- can together be brought to bear on large-data Gaussian process (GP) regression problems via a careful implementation of a newly developed local approximation scheme. Our methodological contribution focuses primarily on GPU computation, as this requires the most care and also provides the largest performance boost. However, in our empirical work we study the relative merits of all three paradigms to determine how best to combine them. The paper concludes with two case studies. One is a real-data fluid-dynamics computer experiment which benefits from the local nature of our approximation; the second is a synthetic-data example designed to find the largest design for which (accurate) GP emulation can be performed on a commensurate predictive set in under an hour.
Comment: 24 pages, 6 figures, 1 table
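A minimal sketch of the local-approximation idea, assuming a plain nearest-neighbour selection rule and a squared-exponential kernel (the paper's actual scheme chooses local designs sequentially by a criterion and offloads the linear algebra to GPUs; all names and parameters below are illustrative):

# Minimal sketch of local approximate GP prediction: for each predictive
# location, fit an exact GP on only its n nearest training points.
# Kernel choice, nugget, and neighbour count are illustrative assumptions.
import numpy as np

def gauss_kernel(A, B, lengthscale=1.0):
    # Squared-exponential covariance between row vectors in A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

def local_gp_predict(X, y, Xstar, n_neighbors=50, nugget=1e-6):
    preds = np.empty(len(Xstar))
    for i, xs in enumerate(Xstar):
        # Choose the n nearest training inputs to this predictive location.
        idx = np.argsort(((X - xs) ** 2).sum(1))[:n_neighbors]
        Xl, yl = X[idx], y[idx]
        K = gauss_kernel(Xl, Xl) + nugget * np.eye(len(idx))
        k = gauss_kernel(xs[None, :], Xl)[0]
        # Exact GP predictive mean on the local design only: k^T K^{-1} y.
        preds[i] = k @ np.linalg.solve(K, yl)
    return preds

# Toy usage: noisy 1-D sinusoid, predictions at 10 new locations.
X = np.random.rand(2000, 1)
y = np.sin(6 * X[:, 0]) + 0.01 * np.random.randn(2000)
print(local_gp_predict(X, y, np.random.rand(10, 1)))

Each local fit costs O(n^3) for a small n rather than O(N^3) on the full data, and the loop over predictive locations is embarrassingly parallel, which is what the SMP, GPU, and cluster paradigms each exploit.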
State of the Art in Parallel Computing with R
R is a mature open-source programming language for statistical computing and graphics. Many areas of statistical research are experiencing rapid growth in the size of data sets. Methodological advances drive increased use of simulations. A common approach is to use parallel computing. This paper presents an overview of techniques for parallel computing with R on computer clusters, on multi-core systems, and in grid computing. It reviews sixteen different packages, comparing them on their state of development, the parallel technology used, as well as on usability, acceptance, and performance. Two packages (snow, Rmpi) stand out as particularly suited to general use on computer clusters. Packages for grid computing are still in development, with only one package currently available to the end user. For multi-core systems five different packages exist, but a number of issues pose challenges to early adopters. The paper concludes with ideas for further developments in high performance computing with R. Example code is available in the appendix.
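The cluster packages highlighted above (snow, Rmpi) are organized around a parallel apply over worker processes; a rough Python analogue of that pattern, with the simulation function and replication count purely illustrative:

# Rough analogue of the parallel-apply pattern used by packages such as
# snow (parLapply) and Rmpi, sketched with Python's multiprocessing pool.
from multiprocessing import Pool
import random

def one_replication(seed):
    # Stand-in for a CPU-bound statistical simulation.
    rng = random.Random(seed)
    return sum(rng.gauss(0, 1) for _ in range(10_000)) / 10_000

if __name__ == "__main__":
    with Pool(processes=4) as pool:          # one worker per core
        results = pool.map(one_replication, range(100))
    print(sum(results) / len(results))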
Parallel Computers and Complex Systems
We present an overview of the state of the art and future trends in high performance parallel and distributed computing, and discuss techniques for using such computers in the simulation of complex problems in computational science. The use of high performance parallel computers can help improve our understanding of complex systems, and the converse is also true --- we can apply techniques used for the study of complex systems to improve our understanding of parallel computing. We consider parallel computing as the mapping of one complex system --- typically a model of the world --- into another complex system --- the parallel computer. We study static, dynamic, spatial and temporal properties of both the complex systems and the map between them. The result is a better understanding of which computer architectures are good for which problems, and of software structure, automatic partitioning of data, and the performance of parallel machines.
A Multi-Threaded Fast Convolver for Dynamically Parallel Image Filtering
2D convolution is a staple of digital image processing. The advent of large-format imagers makes it possible to literally "pave" with silicon the focal plane of an optical sensor, which results in very large images that can require a significant amount of computation to process. Filtering of large images via 2D convolutions is often complicated by a variety of effects (e.g., non-uniformities found in wide field of view instruments). This paper describes a fast (FFT-based) method for convolving images, which is also well suited to very large images. A parallel version of the method is implemented using a multi-threaded approach, which allows more efficient load balancing and a simpler software architecture. The method has been implemented within a high-level interpreted language (IDL), while also exploiting open-standards vector libraries (VSIPL) and open-standards parallel directives (OpenMP). The parallel approach and software architecture are generally applicable to a variety of algorithms and have the advantage of enabling users to obtain the convenience of an easy operating environment while also delivering high performance using fully portable code.
Comment: 25 pages including color figures. Submitted to the Journal of Parallel and Distributed Computing
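A minimal sketch of the FFT-based convolution idea, written here with NumPy rather than the paper's IDL/VSIPL/OpenMP stack; image and kernel sizes are illustrative:

# Minimal sketch of FFT-based 2D convolution: multiply in the frequency
# domain instead of sliding the kernel spatially. For an N x N image and a
# K x K kernel this costs O(N^2 log N) rather than O(N^2 K^2).
import numpy as np

def fft_convolve2d(image, kernel):
    # Zero-pad to the full linear-convolution size to avoid wrap-around.
    sh = (image.shape[0] + kernel.shape[0] - 1,
          image.shape[1] + kernel.shape[1] - 1)
    F = np.fft.rfft2(image, sh) * np.fft.rfft2(kernel, sh)
    return np.fft.irfft2(F, sh)

image = np.random.rand(512, 512)
kernel = np.ones((9, 9)) / 81.0            # simple box blur
blurred = fft_convolve2d(image, kernel)

In a multi-threaded setting, as in the paper, a very large image can be split into tiles that are transformed and filtered by separate threads, which is where the load-balancing benefit comes from.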
High-Performance Computing and Four-Dimensional Data Assimilation: The Impact on Future and Current Problems
This is the final technical report for the project entitled: "High-Performance Computing and Four-Dimensional Data Assimilation: The Impact on Future and Current Problems", funded at NPAC by the DAO at NASA/GSFC. First, the motivation for the project is given in the introductory section, followed by the executive summary of major accomplishments and the list of project-related publications. Detailed analysis and description of research results are given in subsequent chapters and in the Appendix.
Parallelization of dynamic programming recurrences in computational biology
The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years the DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays (FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are between 15x and 130x faster than a modern dual-core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 3-GHz Intel Core 2 Duo processors. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors. Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms.
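For reference, the Nussinov recurrence that such accelerators parallelise can be written as a straightforward O(n^3) dynamic program; the sketch below is an illustrative software version, and the hardware designs exploit the fact that all cells on one anti-diagonal are mutually independent:

# Straightforward O(n^3) Nussinov dynamic program: N[i][j] is the maximum
# number of non-crossing base pairs in subsequence s[i..j].
def nussinov(s, min_loop=1):
    n = len(s)
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    N = [[0] * n for _ in range(n)]
    for span in range(1, n):                  # anti-diagonals, short to long
        for i in range(n - span):
            j = i + span
            best = max(N[i + 1][j], N[i][j - 1])
            if j - i > min_loop and (s[i], s[j]) in pairs:
                best = max(best, N[i + 1][j - 1] + 1)
            for k in range(i + 1, j):          # bifurcation term
                best = max(best, N[i][k] + N[k + 1][j])
            N[i][j] = best
    return N[0][n - 1]

print(nussinov("GGGAAAUCC"))   # maximum pairing count for a toy sequence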
A new parallelisation technique for heterogeneous CPUs
Parallelization has moved in recent years into mainstream compilers, and the demand for parallelizing tools that can do a better job of automatic parallelization is higher than ever. During the last decade considerable attention has been focused on developing programming tools that support both explicit and implicit parallelism to keep up with the power of the new multiple-core technology. Yet success in developing automatic parallelising compilers has been limited, mainly due to the complexity of the analysis required to exploit available parallelism and to manage other parallelisation measures such as data partitioning, alignment and synchronization.

This dissertation investigates developing a programming tool that automatically parallelises large data structures on a heterogeneous architecture, and whether a high-level programming language compiler can use this tool to exploit implicit parallelism and make use of the performance potential of modern multicore technology. The work involved the development of a fully automatic parallelisation tool, called VSM, that completely hides the underlying details of general-purpose heterogeneous architectures. The VSM implementation provides direct and simple access for users to parallelise array operations on the Cell's accelerators without the need for any annotations or process directives. This work also involved extending the Glasgow Vector Pascal compiler to work with the VSM implementation as a single compiler system. The resulting compiler system, called VP-Cell, takes a single source code and parallelises array expressions automatically.

Several experiments were conducted using Vector Pascal benchmarks to show the validity of the VSM approach. The VP-Cell system achieved significant runtime performance on one accelerator compared to the master processor's performance, and near-linear speedups for code runs across the Cell's accelerators. Though VSM was mainly designed for developing parallelising compilers, it also showed considerable performance when running C code on the Cell's accelerators.
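The general pattern VSM realizes on the Cell (a whole-array expression evaluated in parallel with no annotations in the source) can be sketched in software terms as splitting the operand arrays into chunks and evaluating the expression on separate workers; the expression, chunk count, and worker pool below are illustrative and not VSM's actual mechanism:

# Sketch of the chunked array-expression pattern: split the operands,
# evaluate the same expression on each chunk in parallel, reassemble.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def eval_chunk(chunk):
    a, b = chunk
    return a * 2.0 + np.sqrt(b)          # the array expression being offloaded

if __name__ == "__main__":
    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)
    chunks = list(zip(np.array_split(a, 4), np.array_split(b, 4)))
    with ProcessPoolExecutor(max_workers=4) as pool:
        parts = list(pool.map(eval_chunk, chunks))
    result = np.concatenate(parts)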
Numerical solutions of differential equations on FPGA-enhanced computers
Conventionally, to speed up scientific or engineering (S&E) computation programs on general-purpose computers, one may elect to use faster CPUs, more memory, systems with more efficient (though complicated) architectures, better software compilers, or even coding in assembly languages. With the emergence of Field Programmable Gate Array (FPGA) based Reconfigurable Computing (RC) technology, numerical scientists and engineers now have another option: using FPGA devices as core components to address their computational problems. The hardware-programmable, low-cost, but powerful "FPGA-enhanced computer" has now become an attractive approach for many S&E applications.

A new computer architecture model for FPGA-enhanced computer systems and its detailed hardware implementation are proposed for accelerating the solution of computationally demanding and data-intensive numerical PDE problems. New FPGA-optimized algorithms/methods for rapid execution of representative numerical methods such as Finite Difference Methods (FDM) and Finite Element Methods (FEM) are designed, analyzed, and implemented on it. Linear wave equations drawn from seismic data processing applications are adopted as the target PDE problems to demonstrate the effectiveness of this new computer model. Their sustained computational performance is compared with pure software programs operating on commodity CPU-based general-purpose computers. Quantitative analysis is performed across a hierarchical set of aspects: customized/extraordinary computer arithmetic or function units; a compact but flexible system architecture and memory hierarchy; and hardware-optimized numerical algorithms or methods that may be inappropriate for conventional general-purpose computers. The preferable property of in-system hardware reconfigurability of the new system is emphasized, with the aim of effectively accelerating the execution of complex multi-stage numerical applications. Methodologies for accelerating the target PDE problems, as well as other numerical PDE problems such as heat equations and Laplace equations, using programmable hardware resources are summarized, which implies the broad applicability of the proposed FPGA-enhanced computers.
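As an illustration of the kind of regular stencil kernel such FPGA designs accelerate, here is an explicit finite-difference update for the 1D linear wave equation u_tt = c^2 u_xx, with grid size, time step, and initial condition chosen purely for demonstration:

# Explicit finite-difference update for the 1D wave equation u_tt = c^2 u_xx,
# the kind of regular stencil an FPGA pipeline can stream through.
import numpy as np

c, dx, dt = 1.0, 0.01, 0.005               # CFL number c*dt/dx = 0.5 < 1
nx, nt = 201, 400
r2 = (c * dt / dx) ** 2

x = np.linspace(0.0, 2.0, nx)
u_prev = np.exp(-100.0 * (x - 1.0) ** 2)    # Gaussian pulse at the previous step
u_curr = u_prev.copy()                       # zero initial velocity

for _ in range(nt):
    u_next = np.zeros_like(u_curr)           # fixed (zero) boundary values
    # Second-order central differences in space and time (interior points).
    u_next[1:-1] = (2.0 * u_curr[1:-1] - u_prev[1:-1]
                    + r2 * (u_curr[2:] - 2.0 * u_curr[1:-1] + u_curr[:-2]))
    u_prev, u_curr = u_curr, u_next

print(float(u_curr.max()))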