9,987 research outputs found
Fast matrix multiplication techniques based on the Adleman-Lipton model
On distributed memory electronic computers, the implementation and
association of fast parallel matrix multiplication algorithms has yielded
astounding results and insights. In this discourse, we use the tools of
molecular biology to demonstrate the theoretical encoding of Strassen's fast
matrix multiplication algorithm with DNA based on an -moduli set in the
residue number system, thereby demonstrating the viability of computational
mathematics with DNA. As a result, a general scalable implementation of this
model in the DNA computing paradigm is presented and can be generalized to the
application of \emph{all} fast matrix multiplication algorithms on a DNA
computer. We also discuss the practical capabilities and issues of this
scalable implementation. Fast methods of matrix computations with DNA are
important because they also allow for the efficient implementation of other
algorithms (i.e. inversion, computing determinants, and graph theory) with DNA.Comment: To appear in the International Journal of Computer Engineering
Research. Minor changes made to make the preprint as similar as possible to
the published versio
Scalable Task-Based Algorithm for Multiplication of Block-Rank-Sparse Matrices
A task-based formulation of Scalable Universal Matrix Multiplication
Algorithm (SUMMA), a popular algorithm for matrix multiplication (MM), is
applied to the multiplication of hierarchy-free, rank-structured matrices that
appear in the domain of quantum chemistry (QC). The novel features of our
formulation are: (1) concurrent scheduling of multiple SUMMA iterations, and
(2) fine-grained task-based composition. These features make it tolerant of the
load imbalance due to the irregular matrix structure and eliminate all
artifactual sources of global synchronization.Scalability of iterative
computation of square-root inverse of block-rank-sparse QC matrices is
demonstrated; for full-rank (dense) matrices the performance of our SUMMA
formulation usually exceeds that of the state-of-the-art dense MM
implementations (ScaLAPACK and Cyclops Tensor Framework).Comment: 8 pages, 6 figures, accepted to IA3 2015. arXiv admin note: text
overlap with arXiv:1504.0504
Algorithmic Based Fault Tolerance Applied to High Performance Computing
We present a new approach to fault tolerance for High Performance Computing
system. Our approach is based on a careful adaptation of the Algorithmic Based
Fault Tolerance technique (Huang and Abraham, 1984) to the need of parallel
distributed computation. We obtain a strongly scalable mechanism for fault
tolerance. We can also detect and correct errors (bit-flip) on the fly of a
computation. To assess the viability of our approach, we have developed a fault
tolerant matrix-matrix multiplication subroutine and we propose some models to
predict its running time. Our parallel fault-tolerant matrix-matrix
multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov)
and returns a correct result while one process failure has happened. This
represents 65% of the machine peak efficiency and less than 12% overhead with
respect to the fastest failure-free implementation. We predict (and have
observed) that, as we increase the processor count, the overhead of the fault
tolerance drops significantly
Better than $1/Mflops sustained: a scalable PC-based parallel computer for lattice QCD
We study the feasibility of a PC-based parallel computer for medium to large
scale lattice QCD simulations. The E\"otv\"os Univ., Inst. Theor. Phys. cluster
consists of 137 Intel P4-1.7GHz nodes with 512 MB RDRAM. The 32-bit, single
precision sustained performance for dynamical QCD without communication is 1510
Mflops/node with Wilson and 970 Mflops/node with staggered fermions. This gives
a total performance of 208 Gflops for Wilson and 133 Gflops for staggered QCD,
respectively (for 64-bit applications the performance is approximately halved).
The novel feature of our system is its communication architecture. In order to
have a scalable, cost-effective machine we use Gigabit Ethernet cards for
nearest-neighbor communications in a two-dimensional mesh. This type of
communication is cost effective (only 30% of the hardware costs is spent on the
communication). According to our benchmark measurements this type of
communication results in around 40% communication time fraction for lattices
upto 48^3\cdot96 in full QCD simulations. The price/sustained-performance ratio
for full QCD is better than 1.5/Mflops for
staggered) quarks for practically any lattice size, which can fit in our
parallel computer. The communication software is freely available upon request
for non-profit organizations.Comment: 14 pages, 3 figures, final version to appear in Comp.Phys.Com
Recommended from our members
Preparing sparse solvers for exascale computing.
Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current success and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'
- …