88 research outputs found
Timing verification of dynamically reconfigurable logic for Xilinx Virtex FPGA series
This paper reports on a method for extending existing VHDL design and verification software available for the Xilinx Virtex series of FPGAs. It allows the designer to apply standard hardware design and verification tools to the design of dynamically reconfigurable logic (DRL). The technique involves the conversion of a dynamic design into multiple static designs, suitable for input to standard synthesis and APR tools. For timing and functional verification after APR, the sections of the design can then be recombined into a single dynamic system. The technique has been automated by extending an existing DRL design tool named DCSTech, which is part of the Dynamic Circuit Switching (DCS) CAD framework. The principles behind the tools are generic and should be readily extensible to other architectures and CAD toolsets. Implementation of the dynamic system involves the production of partial configuration bitstreams to load sections of circuitry. The process of creating such bitstreams, the final stage of our design flow, is summarized
Just In Time Assembly (JITA) - A Run Time Interpretation Approach for Achieving Productivity of Creating Custom Accelerators in FPGAs
The reconfigurable computing community has yet to be successful in allowing programmers to access FPGAs through traditional software development flows. Existing barriers that prevent programmers from using FPGAs include: 1) knowledge of hardware programming models, 2) the need to work within the vendor specific CAD tools and hardware synthesis. This thesis presents a series of published papers that explore different aspects of a new approach being developed to remove the barriers and enable programmers to compile accelerators on next generation reconfigurable manycore architectures. The approach is entitled Just In Time Assembly (JITA) of hardware accelerators. The approach has been defined to allow hardware accelerators to be built and run through software compilation and run time interpretation outside of CAD tools and without requiring each new accelerator to be synthesized. The approach advocates the use of libraries of pre-synthesized components that can be referenced through symbolic links in a similar fashion to dynamically linked software libraries. Synthesis still must occur but is moved out of the application programmers software flow and into the initial coding process that occurs when programming patterns that define a Domain Specific Language (DSL) are first coded. Programmers see no difference between creating software or hardware functionality when using the DSL. A new run time interpreter is introduced to assemble the individual pre-synthesized hardware accelerators that comprise the accelerator functionality within a configurable tile array of partially reconfigurable slots at run time. Quantitative results are presented that compares utilization, performance, and productivity of the approach to what would be achieved by full custom accelerators created through traditional CAD flows using hardware programming models and passing through synthesis
A Hybrid Partially Reconfigurable Overlay Supporting Just-In-Time Assembly of Custom Accelerators on FPGAs
The state of the art in design and development flows for FPGAs are not sufficiently mature to allow programmers to implement their applications through traditional software development flows. The stipulation of synthesis as well as the requirement of background knowledge on the FPGAs\u27 low-level physical hardware structure are major challenges that prevent programmers from using FPGAs. The reconfigurable computing community is seeking solutions to raise the level of design abstraction at which programmers must operate, and move the synthesis process out of the programmers\u27 path through the use of overlays. A recent approach, Just-In-Time Assembly (JITA), was proposed that enables hardware accelerators to be assembled at runtime, all from within a traditional software compilation flow. The JITA approach presents a promising path to constructing hardware designs on FPGAs using pre-synthesized parallel programming patterns, but suffers from two major limitations. First, all variant programming patterns must be pre-synthesized. Second, conditional operations are not supported.
In this thesis, I present a new reconfigurable overlay, URUK, that overcomes the two limitations imposed by the JITA approach. Similar to the original JITA approach, the proposed URUK overlay allows hardware accelerators to be constructed on FPGAs through software compilation flows. To this basic capability, URUK adds additional support to enable the assembly of presynthesized fine-grained computational operators to be assembled within the FPGA.
This thesis provides analysis of URUK from three different perspectives; utilization, performance, and productivity. The analysis includes comparisons against High-Level Synthesis (HLS) and the state of the art approach to creating static overlays. The tradeoffs conclude that URUK can achieve approximately equivalent performance for algebra operations compared to HLS custom accelerators, which are designed with simple experience on FPGAs. Further, URUK shows a high degree of flexibility for runtime placement and routing of the primitive operations. The analysis shows how this flexibility can be leveraged to reduce communication overhead among tiles, compared to traditional static overlays. The results also show URUK can enable software programmers without any hardware skills to create hardware accelerators at productivity levels consistent with software development and compilation
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS
FPGAs for the Masses: Affordable Hardware Synthesis from Domain-Specific Languages
Field Programmable Gate Arrays (FPGAs) have the unique ability to be configured into application-specific architectures that are well suited to specific computing problems. This enables them to achieve performances and energy efficiencies that outclass other processor-based architectures, such as Chip Multiprocessors (CMPs), Graphic Processing Units (GPUs) and Digital Signal Processors (DSPs). Despite this, FPGAs are yet to gain widespread adoption, especially among application and software developers, because of their laborious application development process that requires hardware design expertise. In some application areas, domain-specific hardware synthesis tools alleviate this problem by using a Domain-Specific Language (DSL) to hide the low-level hardware details and also improve productivity of the developer. Additionally, these tools leverage domain knowledge to perform optimizations and produce high-quality hardware designs. While this approach holds great promise, the significant effort and cost of developing such domain-specific tools make it unaffordable in many application areas. In this thesis, we develop techniques to reduce the effort and cost of developing domain-specific hardware synthesis tools. To demonstrate our approach, we develop a toolchain to generate complete hardware systems from high-level functional specifications written in a DSL. Firstly, our approach uses language embedding and type-directed staging to develop a DSL and compiler in a cost-effective manner. To further reduce effort, we develop this compiler by composing reusable optimization modules, and integrate it with existing hardware synthesis tools. However, most synthesis tools require users to have hardware design knowledge to produce high-quality results. Therefore, secondly, to facilitate people without hardware design skills to develop domain-specific tools, we develop a methodology to generate high-quality hardware designs from well known computational patterns, such as map, zipWith, reduce and foreach; computational patterns are algorithmic methods that capture the nature of computation and communication and can be easily understood and used without expert knowledge. In our approach, we decompose the DSL specifications into constituent computational patterns and exploit the properties of these patterns, such as degree of parallelism, interdependence between operations and data-access characteristics, to generate high-quality hardware modules to implement them, and compose them into a complete system design. Lastly, we extended our methodology to automatically parallelize computations across multiple hardware modules to benefit from the spatial parallelism of the FPGA as well as overcome performance problems caused by non-sequential data access patterns and long access latency to external memory. To achieve this, we utilize the data-access properties of the computational patterns to automatically identify synchronization requirements and generate such multi-module designs from the same high-level functional specifications. Driven by power and performance constraints, today the world is turning to reconfigurable technology (i.e., FPGAs) to meet the computational needs of tomorrow. In this light, this work addresses the cardinal problem of making tomorrow's computing infrastructure programmable to application developers
Rapid Prototyping and Functional Verification of Power Efficient AI Processor on FPGA
Prototyping a design on a Field Programmable Gate Array (FPGA) involves different stages such as developing a design, performing synthesis, handling placement and routing and finally generating the programming bit file for the FPGA. After successful completion of the above stages, it is important to functionally verify the design. This thesis addresses the challenges involved in rapid prototyping and functional verification of a low power AI processor provided by the industry partner. This research also addresses the methodology used in generating programming bit file and testing the design. Traditional method of testing a design using RTL level testbench utilises more time and relies on functioning of other components associated with the design. This thesis incorporated a new technique of testing the design using software programs focusing on verification of the functionality of a particular module without depending on the other. This methodology reduced the time for functionality verification for part of the design from approximately 1 month to about 2 weeks. Finally, using the methodology mentioned above, the design was synthesized for two FPGA kits, along with analysing the power consumption of the design. The results show the low power nature of the design as it does not use any external memory resulting in faster Arithmetic Logic Unit (ALU) operations thereby saving time to access the data
- …