Transformation Techniques to Facilitate Hardware Acceleration of Workloads

Abstract

There are many hardware and software techniques to reduce the execution time of computation-heavy programs. The software techniques do not require extra hardware, so they can be a cheaper and easier solution. However, these techniques have mostly been tested on general-purpose computing systems, and far less on embedded systems such as Systems on Chip (SoC). In this work, I evaluated how much a specific task can be accelerated using three loop-optimization techniques that target memory accesses: Decoupled Access-Execute (DAE), Blocking, and Polyhedral. The test case chosen was the classic square matrix multiplication algorithm, selected for its high computational workload arising from its three nested loops. The size selected for the square matrices was 32x32 elements. I compared the execution time obtained with each method against the time obtained on a hardware accelerator. Based on the experiments, not all the techniques yield a speedup. Two of the three methods actually increased the execution time relative to the classic algorithm (the non-accelerated baseline). DAE was the only technique that offered a speedup, while the Blocking and Polyhedral algorithms increased the execution time. The Blocking optimization was inefficient because of the small matrix size: for such a short computation, the time saved is insufficient to amortize the cost of the two extra for-loops the implementation introduces, let alone to produce a speedup. The Polyhedral algorithm, although it reduced the number of for-loops, did not reduce the number of iterations, and with more code lines in each iteration it increased the execution time. Moreover, although DAE improved the execution time, it could not match the speedup of the accelerator device. I assessed how critical it is to keep the number of code lines inside the innermost loop to a minimum, as the Blocking technique on small arrays shows, and demonstrated how expensive it is to add new loops and code lines, even when they are intended to decrease the execution time.

Similar Works