Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS.
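To give a flavour of the kind of transformation the abstract describes, the sketch below models one way of removing a loop-carried dependency so that a loop can be pipelined: splitting a single accumulator into several independent partial sums. This is an illustrative assumption, not a transformation taken from the abstract. Real HLS code would be written in C/C++ or OpenCL, so Python is used here only as a behavioural model, and the function names and the `lanes` parameter are hypothetical.

```python
def accumulate_naive(values):
    """Single accumulator: each addition depends on the previous one."""
    total = 0
    for v in values:
        total += v
    return total


def accumulate_interleaved(values, lanes=4):
    """Round-robin over `lanes` partial sums (hypothetical parameter).

    Consecutive iterations update different partial sums, so in hardware
    a new addition could start before the previous one has finished.
    """
    partial = [0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    # Final reduction of the partial sums into one result.
    total = 0
    for p in partial:
        total += p
    return total
```

For integer inputs both functions return the same value. For floating-point inputs the reordered additions may round differently, which is the usual trade-off of this kind of rewrite.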
An application of formal semantics to student modelling : an investigation in the domain of teaching Prolog
This thesis reports on research exploring the use of formal semantics for student modelling in intelligent tutoring systems. The domain chosen was that of tutoring programming languages, and within that domain Prolog was selected as the target language. The problem considered is how to analyse students' errors at a level which allows diagnosis to be more flexible and meaningful than is possible with the 'mal-rules' and 'bug catalogue' approach of existing systems. The ideas put forward by Robin Milner [1980] in his Calculus of Communicating Systems (CCS) form the basis of the formalism which is proposed as a solution to this problem. Based on the findings of an empirical investigation, novices' misconceptions of control flow in Prolog were defined as a suitable area in which to explore the application of this solution. A selection of Prolog programs used in that investigation was formally described in terms of CCS. These formal descriptions were used by a production rule system to generate a number of the incomplete or faulty models of Prolog execution which were identified in the first empirical study. In a second empirical study, a machine-analysis tool, designed to be part of a diagnostic tutoring module, used these models to diagnose students' misconceptions of Prolog control flow. This initial application of CCS to student modelling showed that the models of Prolog execution generated by the system could be used successfully to detect students' misunderstandings. Results from the research reported here indicate that the use of formal semantics to model programming languages has a useful contribution to make to the task of student modelling.
Xar-Trek: Run-Time Execution Migration among FPGAs and Heterogeneous-ISA CPUs
Datacenter servers are increasingly heterogeneous: from x86 host CPUs, to ARM
or RISC-V CPUs in NICs/SSDs, to FPGAs. Previous works have demonstrated that
migrating application execution at run-time across heterogeneous-ISA CPUs can
yield significant performance and energy gains, with relatively little
programmer effort. However, FPGAs have often been overlooked in that context:
hardware acceleration using FPGAs involves statically implementing select
application functions, which prohibits dynamic and transparent migration. We
present Xar-Trek, a new compiler and run-time software framework that overcomes
this limitation. Xar-Trek compiles an application for several CPU ISAs and
select application functions for acceleration on an FPGA, allowing execution
migration between heterogeneous-ISA CPUs and FPGAs at run-time. Xar-Trek's
run-time monitors server workloads and migrates application functions to an
FPGA or to heterogeneous-ISA CPUs based on a scheduling policy. We develop a
heuristic policy that uses application workload profiles to make scheduling
decisions. Our evaluations conducted on a system with x86-64 server CPUs, ARM64
server CPUs, and an Alveo accelerator card reveal performance gains between 1%
and 88% over no-migration baselines.
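As a rough illustration of how a profile-driven scheduling heuristic of this general kind could look, the sketch below keeps a function on the host CPU while the host is lightly loaded and otherwise moves it to whichever other target the profile predicts to be fastest. This is not Xar-Trek's actual policy: the `Profile` fields, the `load_threshold` parameter, and the decision rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical per-function workload profile (seconds per call)."""
    x86_time: float
    arm_time: float
    fpga_time: float


def choose_target(profile, x86_load, load_threshold, fpga_busy):
    """Pick an execution target for one function call.

    Assumed rule: stay on x86 while its load is at or below the
    threshold; otherwise pick the fastest available alternative.
    """
    if x86_load <= load_threshold:
        return "x86"
    candidates = {"arm": profile.arm_time}
    if not fpga_busy:
        candidates["fpga"] = profile.fpga_time
    # Return the candidate with the smallest profiled execution time.
    return min(candidates, key=candidates.get)
```

A real policy would also need to account for migration cost and FPGA reconfiguration time, which this sketch leaves out.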
The Developmental Stages of the Acquisition of Arabic By Adult English-speaking Learners: Processability Theory and the Formulaic Language
The aim of this study is to examine the developmental stages of the acquisition of Arabic as a foreign language by adult English-speaking learners. Processability Theory (PT; Pienemann, 1998, 2005) is adopted to investigate in detail whether acquisition follows the hierarchy predicted by PT. The study targeted agreement within seven grammatical structures. The structures belong to three procedural levels of the hierarchy (stages three to five).
Six adult learners participated in this study. They were tested via different tasks to elicit data that would either support the predictions of the PT hierarchy or disconfirm them. Two participants produced subject-verb agreement (stage 4) at a higher rate than N-Adj / N-N agreement (stage 3). Before the prediction of the PT hierarchy was rejected, the two participants took a second test to make sure that the language they produced was processed and not retrieved as a formula. They were introduced to a set of new vocabulary and asked to tell a story based on three picture stories. Having learned the unfamiliar vocabulary in isolation, the two participants applied grammatical relations to combine the words. Data in test 2 showed a decrease in the acquisition rate of S-V agreement, thereby confirming the predictions of PT.
A Modular Approach to Adaptive Reactive Streaming Systems
The latest generations of FPGA devices offer large resource counts that provide the headroom to implement large-scale and complex systems. However, there are increasing challenges for the designer, not just because of pure size and complexity, but also in effectively harnessing the flexibility and programmability of the FPGA. A central issue is the need to integrate modules from diverse sources to promote modular design and reuse. Further, the capability to perform dynamic partial reconfiguration (DPR) of FPGA devices means that implemented systems can be made reconfigurable, allowing components to be changed during operation. However, use of DPR typically requires low-level planning of the system implementation, adding to the design challenge. This dissertation presents ReShape: a high-level approach for designing systems by interconnecting modules, which gives a ‘plug and play’ look and feel to the designer, is supported by tools that carry out implementation and verification functions, and is carried through to support system reconfiguration during operation. The emphasis is on the inter-module connections and abstracting the communication patterns that are typical between modules – for example, the streaming of data that is common in many FPGA-based systems, or the reading and writing of data to and from memory modules. ShapeUp is also presented as the static precursor to ReShape. In both, the details of wiring and signaling are hidden from view, via metadata associated with individual modules. ReShape allows system reconfiguration at the module level, by supporting type checking of replacement modules and by managing the overall system implementation, via metadata associated with its FPGA floorplan. The methodology and tools have been implemented in a prototype for a broad domain-specific setting – networking systems – and have been validated on real telecommunications design projects.
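The idea of type checking a replacement module against per-module metadata can be sketched as follows. This is a minimal illustration and not ReShape's actual metadata format or checking rules: the `Port` and `Module` classes, their fields, and the rule that a replacement must expose exactly the same ports are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    """Hypothetical port description carried in module metadata."""
    name: str
    direction: str  # "in" or "out"
    kind: str       # e.g. "stream" or "memory"
    width: int      # data width in bits


@dataclass(frozen=True)
class Module:
    name: str
    ports: frozenset  # frozenset of Port


def can_replace(current, candidate):
    """Assumed rule: the candidate must expose exactly the same ports."""
    return current.ports == candidate.ports
```

With this rule, a replacement with a different port width or a missing port is rejected before any reconfiguration is attempted. A looser check (for example, allowing extra unused ports) would be an equally plausible design.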