9,384 research outputs found
Recommended from our members
Two-dimensional DCT/IDCT architecture
A fully parallel architecture for the computation of a two-dimensional (2-D) discrete cosine transform (DCT), based on row-column decomposition is presented. It uses the same one dimensional (1-D) DCT unit for the row and column computations and (N2+N) registers to perform the transposition. It possesses features of regularity and modularity, and is thus well suited for VLSI implementation. It can be used for the computation of either the forward or the inverse 2-D DCT. Each 1-D DCT unit uses N fully parallel vector inner product (VIP) units. The design of the VIP units is based on a systematic design methodology using radix-2â arithmetic, which allows partitioning of the elements of each vector into small groups. Array multipliers without the final adder are used to produce the different partial product terms. This allows a more efficient use of 4:2 compressors for the accumulation of the products in the intermediate stages and reduces the number of accumulators from N to one. Using this procedure, the 2-D DCT architecture requires less than N2 multipliers (in terms of area occupied) and only 2N adders. It can compute a N x N-point DCT at a rate of one complete transform per N cycles after an appropriate initial delay
CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication
Sparse matrix-vector multiplication (SpMV) is a fundamental building block
for numerous applications. In this paper, we propose CSR5 (Compressed Sparse
Row 5), a new storage format, which offers high-throughput SpMV on various
platforms including CPUs, GPUs and Xeon Phi. First, the CSR5 format is
insensitive to the sparsity structure of the input matrix. Thus the single
format can support an SpMV algorithm that is efficient both for regular
matrices and for irregular matrices. Furthermore, we show that the overhead of
the format conversion from the CSR to the CSR5 can be as low as the cost of a
few SpMV operations. We compare the CSR5-based SpMV algorithm with 11
state-of-the-art formats and algorithms on four mainstream processors using 14
regular and 10 irregular matrices as a benchmark suite. For the 14 regular
matrices in the suite, we achieve comparable or better performance over the
previous work. For the 10 irregular matrices, the CSR5 obtains average
performance improvement of 17.6\%, 28.5\%, 173.0\% and 293.3\% (up to 213.3\%,
153.6\%, 405.1\% and 943.3\%) over the best existing work on dual-socket Intel
CPUs, an nVidia GPU, an AMD GPU and an Intel Xeon Phi, respectively. For
real-world applications such as a solver with only tens of iterations, the CSR5
format can be more practical because of its low-overhead for format conversion.
The source code of this work is downloadable at
https://github.com/bhSPARSE/Benchmark_SpMV_using_CSR5Comment: 12 pages, 10 figures, In Proceedings of the 29th ACM International
Conference on Supercomputing (ICS '15
Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors
Sparse matrix-vector multiplication (SpMV) is a central building block for
scientific software and graph applications. Recently, heterogeneous processors
composed of different types of cores attracted much attention because of their
flexible core configuration and high energy efficiency. In this paper, we
propose a compressed sparse row (CSR) format based SpMV algorithm utilizing
both types of cores in a CPU-GPU heterogeneous processor. We first
speculatively execute segmented sum operations on the GPU part of a
heterogeneous processor and generate a possibly incorrect results. Then the CPU
part of the same chip is triggered to re-arrange the predicted partial sums for
a correct resulting vector. On three heterogeneous processors from Intel, AMD
and nVidia, using 20 sparse matrices as a benchmark suite, the experimental
results show that our method obtains significant performance improvement over
the best existing CSR-based SpMV algorithms. The source code of this work is
downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSRComment: 22 pages, 8 figures, Published at Parallel Computing (PARCO
A GPU-based hyperbolic SVD algorithm
A one-sided Jacobi hyperbolic singular value decomposition (HSVD) algorithm,
using a massively parallel graphics processing unit (GPU), is developed. The
algorithm also serves as the final stage of solving a symmetric indefinite
eigenvalue problem. Numerical testing demonstrates the gains in speed and
accuracy over sequential and MPI-parallelized variants of similar Jacobi-type
HSVD algorithms. Finally, possibilities of hybrid CPU--GPU parallelism are
discussed.Comment: Accepted for publication in BIT Numerical Mathematic
- âŚ