1 research outputs found
Compilation Techniques for High-Performance Embedded Systems with Multiple Processors
Institute for Computing Systems ArchitectureDespite the progress made in developing more advanced compilers for embedded systems,
programming of embedded high-performance computing systems based on Digital
Signal Processors (DSPs) is still a highly skilled manual task. This is true for
single-processor systems, and even more for embedded systems based on multiple
DSPs. Compilers often fail to optimise existing DSP codes written in C due to the
employed programming style. Parallelisation is hampered by the complex multiple address
space memory architecture, which can be found in most commercial multi-DSP
configurations.
This thesis develops an integrated optimisation and parallelisation strategy that can
deal with low-level C codes and produces optimised parallel code for a homogeneous
multi-DSP architecture with distributed physical memory and multiple logical address
spaces. In a first step, low-level programming idioms are identified and recovered. This
enables the application of high-level code and data transformations well-known in the
field of scientific computing. Iterative feedback-driven search for “good” transformation
sequences is being investigated. A novel approach to parallelisation based on a
unified data and loop transformation framework is presented and evaluated. Performance
optimisation is achieved through exploitation of data locality on the one hand,
and utilisation of DSP-specific architectural features such as Direct Memory Access
(DMA) transfers on the other hand.
The proposed methodology is evaluated against two benchmark suites (DSPstone
& UTDSP) and four different high-performance DSPs, one of which is part of a commercial
four processor multi-DSP board also used for evaluation. Experiments confirm
the effectiveness of the program recovery techniques as enablers of high-level transformations
and automatic parallelisation. Source-to-source transformations of DSP
codes yield an average speedup of 2.21 across four different DSP architectures. The
parallelisation scheme is – in conjunction with a set of locality optimisations – able to
produce linear and even super-linear speedups on a number of relevant DSP kernels
and applications