44 research outputs found

    Exploiting Hardware Abstraction for Parallel Programming Framework: Platform and Multitasking

    Get PDF
    With the help of the parallelism provided by the fine-grained architecture, hardware accelerators on Field Programmable Gate Arrays (FPGAs) can significantly improve the performance of many applications. However, designers are required to have excellent hardware programming skills and unique optimization techniques to explore the potential of FPGA resources fully. Intermediate frameworks above hardware circuits are proposed to improve either performance or productivity by leveraging parallel programming models beyond the multi-core era. In this work, we propose the PolyPC (Polymorphic Parallel Computing) framework, which targets enhancing productivity without losing performance. It helps designers develop parallelized applications and implement them on FPGAs. The PolyPC framework implements a custom hardware platform, on which programs written in an OpenCL-like programming model can launch. Additionally, the PolyPC framework extends vendor-provided tools to provide a complete development environment including intermediate software framework, and automatic system builders. Designers\u27 programs can be either synthesized as hardware processing elements (PEs) or compiled to executable files running on software PEs. Benefiting from nontrivial features of re-loadable PEs, and independent group-level schedulers, the multitasking is enabled for both software and hardware PEs to improve the efficiency of utilizing hardware resources. The PolyPC framework is evaluated regarding performance, area efficiency, and multitasking. The results show a maximum 66 times speedup over a dual-core ARM processor and 1043 times speedup over a high-performance MicroBlaze with 125 times of area efficiency. It delivers a significant improvement in response time to high-priority tasks with the priority-aware scheduling. Overheads of multitasking are evaluated to analyze trade-offs. With the help of the design flow, the OpenCL application programs are converted into executables through the front-end source-to-source transformation and back-end synthesis/compilation to run on PEs, and the framework is generated from users\u27 specifications

    LEGaTO: first steps towards energy-efficient toolset for heterogeneous computing

    Get PDF
    LEGaTO is a three-year EU H2020 project which started in December 2017. The LEGaTO project will leverage task-based programming models to provide a software ecosystem for Made-in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC.Peer ReviewedPostprint (author's final draft

    The Case for Polymorphic Registers in Dataflow Computing

    Get PDF

    High-level synthesis of fine-grained weakly consistent C concurrency

    Get PDF
    High-level synthesis (HLS) is the process of automatically compiling high-level programs into a netlist (collection of gates). Given an input program, HLS tools exploit its inherent parallelism and pipelining opportunities to generate efficient customised hardware. C-based programs are the most popular input for HLS tools, but these tools historically only synthesise sequential C programs. As the appeal for software concurrency rises, HLS tools are beginning to synthesise concurrent C programs, such as C/C++ pthreads and OpenCL. Although supporting software concurrency leads to better hardware parallelism, shared memory synchronisation is typically serialised to ensure correct memory behaviour, via locks. Locks are safety resources that ensure exclusive access of shared memory, eliminating data races and providing synchronisation guarantees for programmers.  As an alternative to lock-based synchronisation, the C memory model also defines the possibility of lock-free synchronisation via fine-grained atomic operations (`atomics'). However, most HLS tools either do not support atomics at all or implement atomics using locks. Instead, we treat the synthesis of atomics as a scheduling problem. We show that we can augment the intra-thread memory constraints during memory scheduling of concurrent programs to support atomics. On average, hardware generated by our method is 7.5x faster than the state-of-the-art, for our set of experiments. Our method of synthesising atomics enables several unique possibilities. Chiefly, we are capable of supporting weakly consistent (`weak') atomics, which necessitate fewer ordering constraints compared to sequentially consistent (SC) atomics. However, implementing weak atomics is complex and error-prone and hence we formally verify our methods via automated model checking to ensure our generated hardware is correct. Furthermore, since the C memory model defines memory behaviour globally, we can globally analyse the entire program to generate its memory constraints. Additionally, we can also support loop pipelining by extending our methods to generate inter-iteration memory constraints. On average, weak atomics, global analysis and loop pipelining improve performance by 1.6x, 3.4x and 1.4x respectively, for our set of experiments. Finally, we present a case study of a real-world example via an HLS-based Google PageRank algorithm, whose performance improves by 4.4x via lock-free streaming and work-stealing.Open Acces

    Template-based hardware-software codesign for high-performance embedded numerical accelerators

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013.Cataloged from PDF version of thesis.Includes bibliographical references (pages 129-132).Sophisticated algorithms for control, state estimation and equalization have tremendous potential to improve performance and create new capabilities in embedded and mobile systems. Traditional implementation approaches are not well suited for porting these algorithmic solutions into practical implementations within embedded system constraints. Most of the technical challenges arise from design approach that manipulates only one level in the design stack, thus being forced to conform to constraints imposed by other levels without question. In tightly constrained environments, like embedded and mobile systems, such approaches have a hard time efficiently delivering and delivering efficiency. In this work we offer a solution that cuts through all the design stack layers. We build flexible structures at the hardware, software and algorithm level, and approach the solution through design space exploration. To do this efficiently we use a template-based hardware-software development flow. The main incentive for template use is, as in software development, to relax the generality vs. efficiency/performance type tradeoffs that appear in solutions striving to achieve run-time flexibility. As a form of static polymorphism, templates typically incur very little performance overhead once the design is instantiated, thus offering the possibility to defer many design decisions until later stages when more is known about the overall system design. However, simply including templates into design flow is not sufficient to result in benefits greater than some level of code reuse. In our work we propose using templates as flexible interfaces between various levels in the design stack. As such, template parameters become the common language that designers at different levels of design hierarchy can use to succinctly express their assumptions and ideas. Thus, it is of great benefit if template parameters map directly and intuitively into models at every level. To showcase the approach we implement a numerical accelerator for embedded Model Predictive Control (MPC) algorithm. While most of this work and design flow are quite general, their full power is realized in search for good solutions to a specific problem. This is best understood in direct comparison with recent works on embedded and high-speed MPC implementations. The controllers we generate outperform published works by a handsome margin in both speed and power consumption, while taking very little time to generate.by Ranko Radovin Sredojević.Ph.D
    corecore