
    Loop splitting for efficient pipelining in high-level synthesis

    Loop pipelining is widely adopted as a key optimization method in high-level synthesis (HLS). However, when complex memory dependencies appear in a loop, commercial HLS tools are still not able to maximize pipeline performance. In this paper, we leverage parametric polyhedral analysis to reason about memory dependence patterns that are uncertain (i.e., parameterised by an undetermined variable) and/or nonuniform (i.e., varying between loop iterations). We develop an automated source-to-source code transformation to split the loop into pieces, which are then synthesised by Vivado HLS as the hardware generation back-end. Our technique allows generated loops to run with a minimal initiation interval, automatically inserting statically determined parametric pipeline breaks at those iterations violating dependencies. Our experiments on seven representative benchmarks show that, compared to default loop pipelining, our parametric loop splitting improves pipeline performance by 4.3× in terms of clock cycles per iteration. The optimized pipelines consume 2.0× as many LUTs, 1.8× as many registers, and 1.1× as many DSP blocks. Hence the area-time product is improved by nearly a factor of 2.

    Polyhedral-based dynamic loop pipelining for high-level synthesis

    Loop pipelining is one of the most important optimization methods in high-level synthesis (HLS) for increasing loop parallelism. There has been considerable work on improving loop pipelining, which mainly focuses on optimizing static operation scheduling and parallel memory accesses. Nonetheless, when loops contain complex memory dependencies, current techniques cannot generate high-performance pipelines. In this paper, we extend the capability of loop pipelining in HLS to handle loops with uncertain dependencies (i.e., parameterized by an undetermined variable) and/or nonuniform dependencies (i.e., varying between loop iterations). Our optimization allows a pipeline to be statically scheduled without the aforementioned memory dependencies, but an associated controller will change the execution speed of loop iterations at runtime. This allows the augmented pipeline to process each loop iteration as fast as possible without violating memory dependencies. We use a parametric polyhedral analysis to generate the control logic for when to safely run all loop iterations in the pipeline and when to break the pipeline execution to resolve memory conflicts. Our techniques have been prototyped in an automated source-to-source code transformation framework, with Xilinx Vivado HLS, a leading HLS tool, as the RTL generation backend. Over a suite of benchmarks, experiments show that our optimization can implement optimized pipelines at almost the same clock speed as without our transformations, running approximately 3.7-10× faster, with a reasonable resource overhead.

    The Potential of Synergistic Static, Dynamic and Speculative Loop Nest Optimizations for Automatic Parallelization

    Research in automatic parallelization of loop-centric programs started with static analysis, then broadened its arsenal to include dynamic inspection-execution and speculative execution, the best results involving hybrid static-dynamic schemes. Beyond the detection of parallelism in a sequential program, scalable parallelization on many-core processors involves hard and interesting parallelism adaptation and mapping challenges. These challenges include tailoring data locality to the memory hierarchy, structuring independent tasks hierarchically to exploit multiple levels of parallelism, tuning the synchronization grain, balancing the execution load, decoupling the execution into thread-level pipelines, and leveraging heterogeneous hardware with specialized accelerators. The polyhedral framework makes it possible to model, construct and apply very complex loop nest transformations that address most of these parallelism adaptation and mapping challenges. But apart from hardware-specific, back-end-oriented transformations (if-conversion, trace scheduling, value prediction), loop nest optimization has essentially ignored dynamic and speculative techniques. Research in polyhedral compilation recently reached a significant milestone towards the support of dynamic, data-dependent control flow. This opens a large avenue for blending dynamic analyses and speculative techniques with advanced loop nest optimizations. Selecting real-world examples from SPEC benchmarks and numerical kernels, we make a case for the design of synergistic static, dynamic and speculative loop transformation techniques. We also sketch the embedding of dynamic information, including speculative assumptions, in the heart of affine transformation search spaces.

    Data Flow ORB-SLAM for Real-time Performance on Embedded GPU Boards

    The use of embedded boards on robots, including unmanned aerial and ground vehicles, is increasing thanks to the availability of GPU-equipped low-cost embedded boards in the market. Porting algorithms originally designed for desktop CPUs to those boards is not straightforward due to hardware limitations. In this paper, we present how we modified and customized the open-source SLAM algorithm ORB-SLAM2 to run in real time on the NVIDIA Jetson TX2. We adopted a data flow paradigm to process the images, obtaining an efficient CPU/GPU load distribution that results in a processing speed of about 30 frames per second. Quantitative experimental results on four different sequences of the KITTI dataset demonstrate the effectiveness of the proposed approach. The source code of our data flow ORB-SLAM2 algorithm is publicly available on GitHub.

    Performance optimizations for LTE User-Plane L2 software

    Modern mobile communication networks are nowadays expected to compete with wired connections in both latency and speed. This places considerable pressure on the mobile communication protocols, which are very complex and whose implementation depends heavily on software. The performance of the software directly affects the capacity of the network, which in turn affects the throughput and latency experienced by the network's users and the number of users the network can support. This thesis concentrates on identifying software components of the LTE User-Plane radio interface protocols that can be improved, and on exploring solutions for better performance. The study relies on system component tests and the performance profiler tool perf, which makes it possible to track the effects of software optimizations with accuracy ranging from function level to whole-system level. In addition to perf, performance counters provided by the processor are observed manually; they verify why specific optimizations affect the performance. Slow memory accesses, i.e. cache misses, are identified as the most constraining factor in the software's performance. Many good practices are also found during the optimization work, such as arranging code with the common path first. Surprisingly, separating rarely executed code from hotspots also has a positive impact on performance, in addition to shrinking the active binary. The optimization work results in the whole software's load decreasing from 60% to 50%, and in some individual functions load decreases of over 70% are achieved.

    Compiler Discovered Dynamic Scheduling of Irregular Code in High-Level Synthesis

    Dynamically scheduled high-level synthesis (HLS) achieves higher throughput than static HLS for codes with unpredictable memory accesses and control flow. However, excessive dataflow scheduling results in circuits that use more resources and have a slower critical path, even when only a part of the circuit exhibits dynamic behavior. Recent work has shown that marking parts of a dataflow circuit for static scheduling can save resources and improve performance (hybrid scheduling), but the dynamic part of the circuit still bottlenecks the critical path. We propose instead to selectively introduce dynamic scheduling into static HLS. This paper presents an algorithm for identifying code regions amenable to dynamic scheduling and shows a methodology for introducing dynamically scheduled basic blocks, loops, and memory operations into static HLS. Our algorithm is informed by modulo scheduling and can be integrated into any modulo-scheduled HLS tool. On a set of ten benchmarks, we show that our approach achieves on average up to a 3.7× and 3× speedup against dynamic and hybrid scheduling, respectively, with an area overhead of 1.3× and frequency degradation of 0.74× when compared to static HLS. To appear in the 33rd International Conference on Field-Programmable Logic and Applications (FPL 2023).

    Porting Rodinia Applications to OmpSs@FPGA

    FPGA computing is a low-power alternative to the widely used multi-core CPU and GPU computing systems. However, because FPGA devices have a completely different architecture, they are quite complex to compare to other forms of computing. The Rodinia Benchmark Suite consists of a number of applications that can be used to benchmark heterogeneous computing systems. The suite currently provides versions of the applications for multi-core CPU and GPU computing (using the OpenMP, Cuda and OpenCL libraries). The objective of this project is to port some of the applications from the Rodinia Benchmark Suite to OmpSs@FPGA, a heterogeneous FPGA computing environment based on Xilinx FPGA devices. A portion of these applications will also be optimized using both OmpSs features and Xilinx tools (Vivado HLS). While the original intention was to port and test the applications on a physical FPGA device, the lack of access to the hardware during the initial porting phase encouraged the development of a simulated FPGA environment. This implied modifying the runtime to communicate with a software block running as an executable instead of trying to access the real hardware. Even though this added a significant workload to the project that was not planned at first, it ended up making the porting of the applications much faster than with the real hardware. Ultimately, the expected number of hours from the initial planning matched the hours it took to both develop the simulated environment and port the applications. A total of 7 applications were ported to the OmpSs@FPGA environment, 6 of which were optimized to a certain extent.
Furthermore, each of the accumulated optimization stages for every optimized application was analyzed and explained using Paraver traces. After that, a sustainability report was made in order to evaluate the environmental, economic and social impact of the project. The final conclusion is that the original objective of the project has been fulfilled and thus the project has been completed successfully.