On the efficiency of reductions in µ-SIMD media extensions
Many important multimedia applications contain a significant fraction of reduction operations. Although multimedia applications are generally characterized by high amounts of Data Level Parallelism, reductions and accumulations are difficult to parallelize and show a poor tolerance to increases in instruction latency. This is especially significant for µ-SIMD extensions such as MMX or AltiVec. To overcome the problem of reductions in µ-SIMD ISAs, designers tend to include more and more complex instructions able to deal with the most common forms of reductions in multimedia. As the number of processor pipeline stages grows, the number of cycles needed to execute these multimedia instructions increases with every processor generation, severely compromising performance. The paper presents an in-depth discussion of how reductions/accumulations are performed in current µ-SIMD architectures and evaluates the performance trade-offs for near-future highly aggressive superscalar processors with three different styles of µ-SIMD extensions. We compare an MMX-like alternative to an MDMX-like extension that has packed accumulators to attack the reduction problem, and we also compare it to MOM, a matrix register ISA. We show that while packed accumulators present several advantages, they introduce artificial recurrences that severely degrade performance for processors with a high number of registers and long-latency operations. On the other hand, the paper demonstrates that longer SIMD media extensions such as MOM can take great advantage of accumulators by exploiting the associative parallelism implicit in reductions.
Peer Reviewed. Postprint (published version)
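The recurrence problem described in this abstract can be illustrated with a small scalar sketch. It is not taken from the paper: the function names and the `lanes` parameter are hypothetical, and plain Python integers stand in for packed SIMD registers. The first function carries one accumulator through every step, so each addition waits on the previous one. The second keeps several independent partial sums and combines them at the end, which is valid only because the addition is associative.

```python
def reduce_single_accumulator(values):
    """Sum with one accumulator: every add depends on the previous add."""
    acc = 0
    for v in values:
        acc += v  # loop-carried dependence on acc
    return acc


def reduce_partial_accumulators(values, lanes=4):
    """Sum with `lanes` independent partial sums (hypothetical parameter).

    Adds that go to different partial sums do not depend on each other,
    so a wide machine could overlap them. Relies on associativity of +.
    """
    partial = [0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v  # dependence chain is per lane only
    total = 0
    for p in partial:  # final combine of the partial sums
        total += p
    return total
```

For integer data both functions return the same value; with floating-point inputs the reordering could change rounding, which is one reason this transformation is not always applied automatically.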
A Comparative Study of Scheduling Techniques for Multimedia Applications on SIMD Pipelines
Parallel architectures are essential in order to take advantage of the
parallelism inherent in streaming applications. One particular branch of these
employs hardware SIMD pipelines. In this paper, we analyse several scheduling
techniques, namely ad hoc overlapped execution, modulo scheduling and modulo
scheduling with unrolling, all of which aim to efficiently utilize the special
architecture design. Our investigation focuses on improving throughput while
analysing other metrics that are important for streaming applications, such as
register pressure, buffer sizes and code size. Through experiments conducted on
several media benchmarks, we present and discuss trade-offs involved when
selecting any one of these scheduling techniques.
Comment: Presented at DATE Friday Workshop on Heterogeneous Architectures and
Design Methods for Embedded Image Systems (HIS 2015) (arXiv:1502.07241)
Architecture-Aware Optimization on a 1600-core Graphics Processor
The graphics processing unit (GPU) continues to
make significant strides as an accelerator in commodity cluster
computing for high-performance computing (HPC). For example,
three of the top five fastest supercomputers in the world, as
ranked by the TOP500, employ GPUs as accelerators. Despite this
increasing interest in GPUs, however, optimizing the performance
of a GPU-accelerated compute node requires deep technical
knowledge of the underlying architecture. Although significant
literature exists on how to optimize GPU performance on the
more mature NVIDIA CUDA architecture, the converse is true
for OpenCL on the AMD GPU.
Consequently, we present and evaluate architecture-aware optimizations
for the AMD GPU. The most prominent optimizations
include (i) explicit use of registers, (ii) use of vector types, (iii)
removal of branches, and (iv) use of image memory for global data.
We demonstrate the efficacy of our AMD GPU optimizations by
applying each optimization in isolation as well as in concert to
a large-scale, molecular modeling application called GEM. Via
these AMD-specific GPU optimizations, the AMD Radeon HD
5870 GPU delivers 65% better performance than with the well-known
NVIDIA-specific optimizations.
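Of the optimizations listed in this abstract, branch removal is the easiest to show in isolation. The sketch below is only an illustration of the general idea, not code from GEM or from the paper: the function names are hypothetical, and Python scalars stand in for per-work-item GPU values. The branch is replaced by an arithmetic select, so every element follows the same instruction path.

```python
def clamp_with_branch(x, limit):
    """Clamp x to at most `limit` using a data-dependent branch."""
    if x > limit:
        return limit
    return x


def clamp_branchless(x, limit):
    """Same result without a branch: pick a value by arithmetic select."""
    over = int(x > limit)  # 1 when x exceeds the limit, otherwise 0
    return over * limit + (1 - over) * x
```

On SIMD-style hardware, where neighbouring work items execute in lockstep, the second form avoids divergent control flow at the cost of a few extra arithmetic operations. Whether that trade pays off depends on the device and the kernel.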
Instruction Set Architectures for Quantum Processing Units
Progress in quantum computing hardware raises questions about how these
devices can be controlled, programmed, and integrated with existing
computational workflows. We briefly describe several prominent quantum
computational models, their associated quantum processing units (QPUs), and the
adoption of these devices as accelerators within high-performance computing
systems. Emphasizing the interface to the QPU, we analyze instruction set
architectures based on reduced and complex instruction sets, i.e., RISC and
CISC architectures. We clarify the role of conventional constraints on memory
addressing and instruction widths within the quantum computing context.
Finally, we examine existing quantum computing platforms, including the D-Wave
2000Q and IBM Quantum Experience, within the context of future ISA development
and HPC needs.
Comment: To be published in the proceedings in the International Super
Computing Conference 2017 publicatio