Data locality and parallelism optimization using a constraint-based approach
Embedded applications are becoming increasingly complex and processing ever-increasing datasets. In
the context of data-intensive embedded applications, there have been two complementary approaches to
enhancing application behavior, namely, data locality optimizations and improving loop-level parallelism.
Data locality needs to be enhanced to maximize the number of data accesses satisfied from the higher
levels of the memory hierarchy. On the other hand, compiler-based code parallelization schemes require
a fresh look for chip multiprocessors as interprocessor communication is much cheaper than off-chip
memory accesses. Therefore, a compiler needs to minimize the number of off-chip memory accesses. This
can be achieved by considering multiple loop nests simultaneously. Although compilers address these two
problems, there is an inherent difficulty in optimizing both data locality and parallelism simultaneously.
Therefore, an integrated approach that combines these two can generate much better results than each
individual approach. Based on these observations, this paper proposes a constraint network (CN)-based
formulation for data locality optimization and code parallelization. The paper also presents experimental
evidence, demonstrating the success of the proposed approach, and compares our results with those
obtained through previously proposed approaches. The experiments from our implementation indicate
that the proposed approach is very effective in enhancing data locality and parallelization.
© 2010 Elsevier Inc. All rights reserved
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS.
Near-optimal loop tiling by means of cache miss equations and genetic algorithms
The effectiveness of the memory hierarchy is critical for the performance of current processors. The performance of the memory hierarchy can be improved by means of program transformations such as loop tiling, a code transformation aimed at reducing capacity misses. This paper presents a novel systematic approach to perform near-optimal loop tiling based on an accurate data locality analysis (cache miss equations) and a powerful technique to search the solution space that is based on a genetic algorithm. The results show that this approach can remove practically all capacity misses for all considered benchmarks. The reduction of replacement misses results in a decrease of the miss ratio that can be as significant as a factor of 7 for the matrix multiply kernel.
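As a rough illustration of the loop tiling transformation this abstract refers to, the sketch below tiles a square matrix multiply. It is not the paper's method: the tile size is a plain parameter chosen by the caller, whereas the paper selects tile sizes using cache miss equations and a genetic algorithm. The function names `matmul_naive` and `matmul_tiled` and the `tile` parameter are assumptions made for this example.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: c = a * b for n x n matrices (lists of lists)."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile):
    """Tiled matrix multiply (illustrative sketch, hypothetical names).

    The three outer loops walk over tile origins; the three inner loops
    work inside one tile, so each block of a, b, and c is reused while it
    is still likely to be cached. min() handles a partial tile at the edge
    when n is not a multiple of tile.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        aik = a[i][k]  # reused across the whole j tile
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c
```

Tiling only reorders the iterations, so both functions compute the same product. In a compiled language the tiled version typically incurs fewer capacity misses once the matrices exceed the cache; in pure Python the interpreter overhead dominates, so this sketch shows the loop structure rather than a measurable speedup.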