Efficient implementation of interval matrix multiplication
The straightforward implementation of the interval matrix product suffers from poor efficiency, far from the performance of highly optimized floating-point implementations. In this paper, we show how to reduce interval matrix multiplication to 9 floating-point matrix products, for performance reasons, without sacrificing the quality of the result. We show that, compared to the straightforward implementation, the overestimation factor is at most 1.18.
Parallel Implementation of Interval Matrix Multiplication
Two main and not necessarily compatible objectives when implementing the product of two dense matrices with interval coefficients are accuracy and efficiency. In this work, we focus on an implementation on multicore architectures. One direction successfully explored to reduce execution time is the representation of intervals by their midpoints and radii rather than the classical representation by endpoints. Computing with the midpoint-radius representation enables the use of optimized floating-point BLAS, so the interval product inherits the performance of the BLAS routines. Several variants of interval matrix multiplication have been proposed that correspond to various trade-offs between accuracy and efficiency, including some efficient ones proposed by Rump in 2012. However, in order to guarantee that the computed result encloses the exact one, these efficient algorithms rely on an assumption about the order of execution of floating-point operations which is not satisfied by most implementations of BLAS. In this paper, an algorithm for the interval matrix product is proposed that satisfies this assumption. Furthermore, several optimizations are proposed, and the implementation on a multicore architecture compares reasonably well with a non-guaranteed implementation based on MKL, the optimized BLAS of Intel: the overhead is most of the time less than 2 and never exceeds 3. This implementation also exhibits good scalability.
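
The midpoint-radius idea in the abstract above can be illustrated with a small sketch. It is not the algorithm of the paper: it uses the classical enclosure |ab - a_mid b_mid| <= |a_mid| b_rad + a_rad (|b_mid| + b_rad), the function names are hypothetical, and it assumes exact arithmetic (ints or `fractions.Fraction` entries), so it ignores the floating-point rounding that a guaranteed implementation has to control.

```python
def matmul(X, Y):
    """Plain dense product of two matrices stored as lists of rows."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[p] * Y[p][j] for p in range(inner)) for j in range(cols)]
            for row in X]


def mat_abs(X):
    """Entrywise absolute value."""
    return [[abs(x) for x in row] for row in X]


def mat_add(X, Y):
    """Entrywise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def interval_matmul_midrad(A_mid, A_rad, B_mid, B_rad):
    """Enclose the product of interval matrices <A_mid, A_rad> * <B_mid, B_rad>.

    Only ordinary (point) matrix products are used, which is what lets a
    real implementation delegate the heavy work to an optimized BLAS.
    Assumption: exact arithmetic, so no directed rounding is performed.
    """
    C_mid = matmul(A_mid, B_mid)
    C_rad = mat_add(
        matmul(mat_abs(A_mid), B_rad),
        matmul(A_rad, mat_add(mat_abs(B_mid), B_rad)),
    )
    return C_mid, C_rad
```

For example, the 1-by-1 intervals [1, 3] (midpoint 2, radius 1) and [-1, 1] (midpoint 0, radius 1) give midpoint 0 and radius 3, that is, the enclosure [-3, 3].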
Hierarchical Parallel Matrix Multiplication on Large-Scale Distributed Memory Platforms
Matrix multiplication is a very important computation kernel both in its own
right as a building block of many scientific applications and as a popular
representative for other scientific applications. Cannon's algorithm, which
dates back to 1969, was the first efficient algorithm for parallel matrix
multiplication providing theoretically optimal communication cost. However,
this algorithm requires a square number of processors. In the mid-1990s, the
SUMMA algorithm was introduced. SUMMA overcomes this shortcoming of Cannon's
algorithm, as it can also be used on a non-square number of processors. Since
then, the number of processors in HPC platforms has increased by two orders of
magnitude, making the contribution of communication to the overall execution
time more significant. Therefore, the state-of-the-art parallel matrix
multiplication algorithms should be revisited to reduce the communication cost
further. This
paper introduces a new parallel matrix multiplication algorithm, Hierarchical
SUMMA (HSUMMA), which is a redesign of SUMMA. Our algorithm reduces the
communication cost of SUMMA by introducing a two-level virtual hierarchy into
the two-dimensional arrangement of processors. Experiments on an IBM BlueGene-P
demonstrate a reduction in communication cost of up to 2.08 times on 2048 cores
and up to 5.89 times on 16384 cores.
Comment: 9 pages
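
The structure of plain SUMMA (not the hierarchical variant proposed in the paper) can be sketched as a serial simulation. Everything here is an assumption for illustration: the function name `summa`, the `panel` parameter, and the fact that "broadcasts" are modelled by slicing shared lists rather than by real message passing. The point is only to show that the process grid may be non-square.

```python
def summa(A, B, grid_rows, grid_cols, panel):
    """Serially simulate SUMMA on a grid_rows x grid_cols process grid.

    Process (r, c) owns one block of C. In each step, a panel of `panel`
    columns of A is shared along every grid row and a panel of `panel`
    rows of B along every grid column; each process then applies a local
    update to its own block of C.
    """
    n, k_dim, m = len(A), len(B), len(B[0])
    if n % grid_rows or m % grid_cols or k_dim % panel:
        raise ValueError("dimensions must be divisible by grid and panel sizes")
    rb, cb = n // grid_rows, m // grid_cols  # block sizes of C per process
    C = [[0] * m for _ in range(n)]
    for k0 in range(0, k_dim, panel):
        for r in range(grid_rows):
            # Stand-in for the broadcast of A's panel along grid row r.
            a_panel = [A[i][k0:k0 + panel] for i in range(r * rb, (r + 1) * rb)]
            for c in range(grid_cols):
                # Stand-in for the broadcast of B's panel along grid column c.
                b_panel = [B[k][c * cb:(c + 1) * cb]
                           for k in range(k0, k0 + panel)]
                # Local update on process (r, c).
                for i in range(rb):
                    for j in range(cb):
                        C[r * rb + i][c * cb + j] += sum(
                            a_panel[i][p] * b_panel[p][j] for p in range(panel)
                        )
    return C
```

A call such as `summa(A, B, 2, 3, 2)` runs on a 2-by-3 grid, a shape that Cannon's algorithm cannot use directly.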
High-level synthesis optimization for blocked floating-point matrix multiplication
In the last decade, floating-point matrix multiplication on FPGAs has been studied extensively, and efficient architectures as well as detailed performance models have been developed. By design, these IP cores have a fixed footprint, which does not necessarily optimize the use of all available resources. Moreover, the low-level architectures are not easily amenable to parameterized synthesis. In this paper, high-level synthesis is used to fine-tune the configuration parameters in order to achieve the highest performance with maximal resource utilization. An exploration strategy is presented to optimize the use of critical resources (DSPs, memory) for any given FPGA. To account for the limited memory size on the FPGA, a block-oriented matrix multiplication is organized such that the block summation is done on the CPU while the block multiplication occurs simultaneously on the logic fabric. The communication overhead between the CPU and the FPGA is minimized by streaming the blocks in a Gray code ordering scheme, which maximizes data reuse for consecutive block matrix product calculations. Using high-level synthesis optimization, the programmable logic operates at 93% of the theoretical peak performance, and the combined CPU-FPGA design achieves 76% of the available hardware processing speed for the floating-point multiplication of 2K by 2K matrices.
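
The abstract does not spell out its Gray code ordering, so the following is only one plausible reading, stated as an assumption: visit block index triples (i, j, k) in a reflected order in which consecutive triples differ in a single index. Under that assumption, one of the blocks A[i][k], B[k][j] or C[i][j] is always shared with the previous step and need not be streamed again. The function name `reflected_order` is hypothetical.

```python
def reflected_order(dims):
    """Yield index tuples over range(dims[0]) x range(dims[1]) x ...

    The order is a reflected (Gray-code-like) enumeration: consecutive
    tuples differ in exactly one coordinate, and by exactly one.
    """
    if not dims:
        yield ()
        return
    first, rest = dims[0], dims[1:]
    for i in range(first):
        tail = list(reflected_order(rest))
        if i % 2 == 1:
            # Walk the inner indices backwards on odd passes so that the
            # step from one value of i to the next changes only i.
            tail.reverse()
        for t in tail:
            yield (i,) + t
```

With `dims = (2, 2)` the order is (0, 0), (0, 1), (1, 1), (1, 0), whereas a plain nested loop would jump from (0, 1) to (1, 0) and change both indices at once.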
Parallel structurally-symmetric sparse matrix-vector products on multi-core processors
We consider the problem of developing an efficient multi-threaded
implementation of the matrix-vector multiplication algorithm for sparse
matrices with structural symmetry. Matrices are stored using the compressed
sparse row-column format (CSRC), designed for profiting from the symmetric
non-zero pattern observed in global finite element matrices. Unlike classical
compressed storage formats, performing the sparse matrix-vector product using
the CSRC requires thread-safe access to the destination vector. To avoid race
conditions, we have implemented two partitioning strategies. In the first one,
each thread allocates an array for storing its contributions, which are later
combined in an accumulation step. We analyze how to perform this accumulation
in four different ways. The second strategy employs a coloring algorithm for
grouping rows that can be concurrently processed by threads. Our results
indicate that, although incurring an increase in the working set size, the
former approach leads to the best performance improvements for most matrices.
Comment: 17 pages, 17 figures, reviewed related work section, fixed typos
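
The first partitioning strategy (private per-thread arrays combined in an accumulation step) can be sketched as follows. This is a serial stand-in, not the authors' implementation: each pass of the outer loop plays the role of one thread, and the array names `ad` (diagonal), `al` (strict lower part, stored by rows), `au` (matching upper entries), `ia` and `ja` (CSR-style row pointers and column indices) are assumed for illustration.

```python
def csrc_matvec_private_buffers(ad, al, au, ia, ja, x, num_threads=2):
    """Compute y = A x for a structurally symmetric matrix in a CSRC-like layout.

    For each stored lower entry a_ij (j < i), au holds the mirrored upper
    entry a_ji. Handling row i therefore also writes to y[j], which is the
    source of the race condition; here every simulated thread writes into
    its own buffer instead.
    """
    n = len(ad)
    buffers = [[0.0] * n for _ in range(num_threads)]
    chunk = (n + num_threads - 1) // num_threads  # rows per simulated thread
    for t in range(num_threads):
        buf = buffers[t]
        for i in range(t * chunk, min(n, (t + 1) * chunk)):
            acc = ad[i] * x[i]
            for k in range(ia[i], ia[i + 1]):
                j = ja[k]
                acc += al[k] * x[j]     # lower entry a_ij contributes to y[i]
                buf[j] += au[k] * x[i]  # upper entry a_ji contributes to y[j]
            buf[i] += acc
    # Accumulation step: combine the private buffers into the result.
    return [sum(buf[i] for buf in buffers) for i in range(n)]
```

The private buffers are what enlarge the working set, which is the trade-off the abstract mentions.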
Towards Lattice Quantum Chromodynamics on FPGA devices
In this paper we describe a single-node, double precision Field Programmable
Gate Array (FPGA) implementation of the Conjugate Gradient algorithm in the
context of Lattice Quantum Chromodynamics. As a benchmark of our proposal we
invert numerically the Dirac-Wilson operator on a 4-dimensional grid on three
Xilinx hardware solutions: Zynq Ultrascale+ evaluation board, the Alveo U250
accelerator and the largest device available on the market, the VU13P device.
In our implementation we separate software/hardware parts in such a way that
the entire multiplication by the Dirac operator is performed in hardware, and
the rest of the algorithm runs on the host. We find that the FPGA
implementation can offer a performance comparable with that obtained using
current CPUs or Intel's many-core Xeon Phi accelerators. A possible multi-node
FPGA-based system is discussed, and we argue that power-efficient High
Performance Computing (HPC) systems can be implemented using FPGA devices only.
Comment: 17 pages, 4 figures
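
The software/hardware split described above can be pictured with a generic Conjugate Gradient loop in which the operator application is the only call that would be offloaded. This is a textbook sketch under stated assumptions, not the paper's code: `apply_operator` is a hypothetical callable standing in for the hardware kernel, vectors are plain Python lists of reals, and the operator is assumed symmetric positive definite.

```python
def dot(u, v):
    """Euclidean inner product of two real vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(apply_operator, b, tol=1e-10, max_iter=1000):
    """Solve A x = b, with A available only through apply_operator(v) = A v.

    Everything except apply_operator is cheap vector arithmetic, which is
    the part that would stay on the host in a split design.
    """
    x = [0.0] * len(b)
    r = list(b)   # residual b - A x for the initial guess x = 0
    p = list(r)   # search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 <= tol:
            break
        Ap = apply_operator(p)  # the expensive, offloadable step
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

Passing the operator as a callable keeps the solver independent of where the multiplication actually runs.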