    Transformations of High-Level Synthesis Codes for High-Performance Computing

    Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large-scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increasing data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolving interface contention or increasing parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers to tap into the performance potential offered by specialized hardware architectures using HLS.
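
    To make the flavour of these transformations concrete, the following is a minimal C++ sketch (not taken from the paper) of two of the catalogued classes, pipelining and on-chip data reuse; the pragma syntax assumes the AMD/Xilinx Vitis HLS toolchain, and the 3-point stencil kernel and all names are invented for illustration.

        // Hypothetical 3-point stencil in HLS-style C++ (Vitis HLS pragma
        // syntax assumed; not a kernel from the paper). A sliding-window
        // buffer turns three memory reads per output into one streamed read,
        // and PIPELINE asks the tool for one result per clock cycle.
        void stencil3(const float *in, float *out, int n) {
            float window[3] = {0.0f, 0.0f, 0.0f};
        #pragma HLS ARRAY_PARTITION variable=window complete
            for (int i = 0; i < n; ++i) {
        #pragma HLS PIPELINE II=1
                // Shift the on-chip reuse buffer and append the new element.
                window[0] = window[1];
                window[1] = window[2];
                window[2] = in[i];
                if (i >= 2)  // window now holds in[i-2], in[i-1], in[i]
                    out[i - 1] = 0.25f * window[0] + 0.5f * window[1]
                               + 0.25f * window[2];
            }
        }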

    On Extracting Coarse-Grained Function Parallelism from C Programs

    To efficiently utilize emerging heterogeneous multi-core architectures, it is essential to exploit the inherent coarse-grained parallelism in applications. In addition to data parallelism, applications like telecommunication, multimedia, and gaming can also benefit from the exploitation of coarse-grained function parallelism. To exploit coarse-grained function parallelism, the common wisdom is to rely on programmers to explicitly express the coarse-grained data-flow between coarse-grained functions using data-flow or streaming languages. This research explores another approach: relying on the compiler to extract coarse-grained data-flow from imperative programs. We believe imperative languages and the von Neumann programming model will remain the dominant programming languages and programming model in the future. This dissertation discusses the design and implementation of a memory data-flow analysis system which extracts coarse-grained data-flow from C programs. The memory data-flow analysis system partitions a C program into a hierarchy of program regions. It then traverses the program region hierarchy from the bottom up, summarizing the exposed memory access patterns for each program region while deriving conservative producer-consumer relations between program regions. An ensuing top-down traversal of the program region hierarchy refines the producer-consumer relations by pruning spurious relations. We built an inlining-based prototype of the memory data-flow analysis system on top of the IMPACT compiler infrastructure and applied it to analyze the memory data-flow of several MediaBench programs. The experimental results showed that while the prototype performed reasonably well for the tested programs, the inlining-based implementation may not be efficient for larger programs, and there is still room to improve the effectiveness of the memory data-flow analysis. We performed a root-cause analysis of the inaccuracies in the memory data-flow analysis results, which provided insights into how to improve the system in the future.
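
    The core producer-consumer derivation can be illustrated with a toy C++ sketch (invented here, not the IMPACT-based prototype): each region carries a conservative read/write summary, and an edge is drawn from an earlier region to a later one whenever the write set of the first overlaps the read set of the second.

        // Toy illustration of deriving producer-consumer relations from
        // per-region memory summaries. All region and variable names are
        // invented; summaries here are plain sets of symbolic locations.
        #include <iostream>
        #include <set>
        #include <string>
        #include <vector>

        struct RegionSummary {
            std::string name;
            std::set<std::string> reads, writes;  // summarized access patterns
        };

        // Conservative test: producer may feed consumer if anything it
        // writes appears in the consumer's read set.
        bool producesFor(const RegionSummary& prod, const RegionSummary& cons) {
            for (const auto& w : prod.writes)
                if (cons.reads.count(w)) return true;
            return false;
        }

        int main() {
            std::vector<RegionSummary> regions = {
                {"decode", {"bitstream"}, {"coeffs"}},
                {"idct",   {"coeffs"},    {"pixels"}},
                {"render", {"pixels"},    {"frame"}},
            };
            // Bottom-up summaries assumed already computed; relate each
            // region to the regions that follow it in program order.
            for (size_t i = 0; i < regions.size(); ++i)
                for (size_t j = i + 1; j < regions.size(); ++j)
                    if (producesFor(regions[i], regions[j]))
                        std::cout << regions[i].name << " -> "
                                  << regions[j].name << "\n";
        }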

    Accelerating legacy applications with spatial computing devices

    Heterogeneous computing is the major driving factor in designing new energy-efficient high-performance computing systems. Despite the broad adoption of GPUs and other specialized architectures, interest in spatial architectures like field-programmable gate arrays (FPGAs) has grown. While their combination of high performance, low power consumption, and high adaptability constitutes an advantage, these devices still suffer from a weak software ecosystem, which forces application developers to use tools requiring deep knowledge of the underlying system, often leaving legacy code (e.g., Fortran applications) unsupported. Recognizing this, we describe a methodology for porting legacy Fortran code to modern FPGA architectures, with the goal of preserving performance/power ratios. Framed as an experience report, the paper considers an industrial computational fluid dynamics application to demonstrate that our methodology produces synthesizable OpenCL codes targeting Intel Arria 10 and Stratix 10 devices. Although the performance gain does not go far beyond that of the original CPU code (we obtained relative speedups of 0.59x and 0.63x, respectively, for a single optimized main kernel, and of 2.56x on the Stratix 10 only, by replicating the optimized main kernel four times), our results are encouraging enough to chart a path for further investigation. This paper also reports the major difficulties encountered in porting Fortran code to FPGA architectures.
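
    The kernel-replication step might look like the OpenCL sketch below; the num_compute_units attribute is the replication mechanism of the Intel FPGA SDK for OpenCL, which the paper targets, while the stencil body and all names are invented for illustration.

        // Hypothetical Intel FPGA OpenCL kernel illustrating replication:
        // num_compute_units(4) asks the offline compiler to instantiate four
        // copies of the compute unit, among which work-groups are
        // distributed. The kernel body itself is a stand-in.
        __attribute__((num_compute_units(4)))
        __kernel void main_kernel(__global const float *restrict in,
                                  __global float *restrict out,
                                  const int n)
        {
            int i = get_global_id(0);
            if (i > 0 && i < n - 1)
                out[i] = 0.25f * in[i - 1] + 0.5f * in[i]
                       + 0.25f * in[i + 1];
        }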

    EuroEXA - D2.6: Final ported application software

    This document describes the porting of the EuroEXA application software to the single-CRDB testbed and discusses the experience gained from the porting and optimization activities, which should be actively taken into account in future redesign and optimization. It accompanies the ported application software, found in the EuroEXA private repository (https://github.com/euroexa). In particular, this document describes the status of the software for each of the EuroEXA applications, sketches the redesign and optimization strategy for each application, and discusses the issues and difficulties faced during the porting activities and the lessons learned. A few preliminary evaluation results are presented; the full evaluation will be discussed in deliverable 2.8.

    An FPGA implementation of an investigative many-core processor, Fynbos: in support of a Fortran autoparallelising software pipeline

    In light of the power, memory, ILP, and utilisation walls facing the computing industry, this work examines the hypothetical many-core approach to finding greater compute performance and efficiency. In order to achieve greater efficiency in an environment in which Moore’s law continues but TDP has been capped, a means of deriving performance from dark and dim silicon is needed. The many-core hypothesis is one approach to exploiting these available transistors efficiently. As understood in this work, it involves trading hardware control complexity for hundreds to thousands of simple parallel processing elements, operating at a clock speed sufficiently low as to allow the efficiency gains of near-threshold-voltage operation. Performance is therefore dependent on exploiting a degree of fine-grained parallelism currently found only in GPGPUs, but in a manner that is less restrictive in its range of application domains. While removing the complex control hardware of traditional CPUs provides space for more arithmetic hardware, a basic level of control is still required. For a number of reasons, this work chooses to replace this control largely with static scheduling. This pushes the burden of control primarily to the software, and specifically the compiler, rather than to the programmer or to an application-specific means of control simplification. An existing legacy toolchain capable of autoparallelising sequential Fortran code to the degree of parallelism necessary for many-core exists; this work implements a many-core architecture to match it. By prototyping the design on an FPGA, it is possible to examine the real-world performance of the compiler-architecture system to a greater degree than simulation alone would allow. Comparing theoretical peak performance and real performance in a case-study application, the system is found to be more efficient than any other reviewed, but also to significantly underperform relative to current competing architectures. This failing is attributed to taking the need for simple hardware too far, and to an inability to implement tactics mitigating the costs of static scheduling, due to the lack of support for such tactics in the compiler.

    Compilation techniques for irregular problems on parallel machines

    Massively parallel computers have ushered in the era of teraflop computing. Even though large and powerful machines are being built, they are used by only a fraction of the computing community. The fundamental reason for this situation is that parallel machines are difficult to program; the development of compilers that automatically parallelize programs will greatly increase the use of these machines. A large class of scientific problems can be categorized as irregular computations, in which the data access patterns are known only at runtime, creating significant difficulties for a parallelizing compiler that must generate efficient parallel code. Some compilers with very limited abilities to parallelize simple irregular computations exist, but the methods used by these compilers fail for non-trivial application codes. This research presents compiler transformation techniques that can be used to effectively parallelize an important class of irregular programs. A central aim of these techniques is to generate code that aggressively prefetches data, with program slicing methods used as part of the code generation process. In this approach, a program written in a data-parallel language, such as HPF, is transformed so that it can be executed on a distributed-memory machine. An efficient compiler runtime support system has been developed that performs data movement and software caching.
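
    The prefetching strategy follows the general inspector-executor pattern; the self-contained C++ toy below (invented here; the actual runtime targets HPF on distributed-memory machines and also schedules communication) shows an inspector turning runtime-known indices into one bulk gather ahead of a regular executor loop.

        // Toy inspector-executor sketch for an irregular access pattern.
        // All data and names are invented for illustration.
        #include <cstdio>
        #include <vector>

        int main() {
            std::vector<double> x = {10, 20, 30, 40};
            std::vector<int> idx  = {3, 0, 2, 1};  // known only at runtime
            // Inspector: scan the index array once to discover which
            // elements are needed, then fetch them in one aggregated gather
            // (on a distributed machine this becomes bulk communication).
            std::vector<double> gathered(idx.size());
            for (size_t i = 0; i < idx.size(); ++i)
                gathered[i] = x[idx[i]];
            // Executor: compute over the prefetched contiguous copies, with
            // no irregular accesses left in the compute loop.
            double sum = 0.0;
            for (size_t i = 0; i < gathered.size(); ++i)
                sum += gathered[i];
            std::printf("sum = %g\n", sum);
        }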

    Polyhedral+Dataflow Graphs

    This research presents an intermediate compiler representation that is designed for optimization and emphasizes the temporary storage requirements and execution schedule of a given computation to guide optimization decisions. The representation is expressed as a dataflow graph that describes computational statements and data mappings within the polyhedral compilation model. The targeted applications include both the regular and irregular scientific domains. The intermediate representation can be integrated into existing compiler infrastructures. A specification language, implemented as a domain-specific language in C++, describes the graph components and the transformations that can be applied. The visual representation allows users to reason about optimizations, and graph variants can be translated into source code or other representations. The language, intermediate representation, and associated transformations have been applied to improve the performance of differential equation solvers, sparse matrix operations, tensor decomposition, and structured multigrid methods.
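
    As a rough illustration of the representation (not the paper's actual specification language), the C++ sketch below models statements and the data mappings between them as a small graph whose edges name the temporary storage that a transformation could reschedule or shrink; all identifiers are invented.

        // Miniature statement/dataflow graph: nodes are statements with an
        // iteration space, edges carry the array that flows between them.
        #include <iostream>
        #include <string>
        #include <vector>

        struct Statement { std::string name, iterSpace, body; };
        struct DataEdge  { int from, to; std::string array; };

        struct DataflowGraph {
            std::vector<Statement> stmts;
            std::vector<DataEdge>  edges;
            int addStmt(Statement s) {
                stmts.push_back(s);
                return (int)stmts.size() - 1;
            }
            void connect(int a, int b, std::string arr) {
                edges.push_back({a, b, arr});
            }
            void dump() const {
                for (const auto& e : edges)
                    std::cout << stmts[e.from].name << " --" << e.array
                              << "--> " << stmts[e.to].name << "\n";
            }
        };

        int main() {
            DataflowGraph g;
            int s0 = g.addStmt({"interp", "{[i] : 0 <= i < N}",
                                "t[i] = f(u[i])"});
            int s1 = g.addStmt({"smooth", "{[i] : 1 <= i < N-1}",
                                "v[i] = g(t[i-1], t[i+1])"});
            // Temporary t is the storage a fusion pass could shrink.
            g.connect(s0, s1, "t");
            g.dump();
        }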

    Compilation Techniques for High-Performance Embedded Systems with Multiple Processors

    Despite the progress made in developing more advanced compilers for embedded systems, programming of embedded high-performance computing systems based on Digital Signal Processors (DSPs) is still a highly skilled manual task. This is true for single-processor systems, and even more so for embedded systems based on multiple DSPs. Compilers often fail to optimise existing DSP codes written in C due to the employed programming style. Parallelisation is hampered by the complex multiple-address-space memory architecture found in most commercial multi-DSP configurations. This thesis develops an integrated optimisation and parallelisation strategy that can deal with low-level C codes and produces optimised parallel code for a homogeneous multi-DSP architecture with distributed physical memory and multiple logical address spaces. In a first step, low-level programming idioms are identified and recovered. This enables the application of high-level code and data transformations well known in the field of scientific computing. Iterative feedback-driven search for “good” transformation sequences is investigated. A novel approach to parallelisation based on a unified data and loop transformation framework is presented and evaluated. Performance optimisation is achieved through exploitation of data locality on the one hand, and utilisation of DSP-specific architectural features such as Direct Memory Access (DMA) transfers on the other. The proposed methodology is evaluated against two benchmark suites (DSPstone and UTDSP) and four different high-performance DSPs, one of which is part of a commercial four-processor multi-DSP board also used for evaluation. Experiments confirm the effectiveness of the program recovery techniques as enablers of high-level transformations and automatic parallelisation. Source-to-source transformations of DSP codes yield an average speedup of 2.21 across the four DSP architectures, and the parallelisation scheme is, in conjunction with a set of locality optimisations, able to produce linear and even super-linear speedups on a number of relevant DSP kernels and applications.
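
    The idiom-recovery step can be illustrated with a classic before/after pair in C (a stand-in dot-product kernel, not an example from the thesis): pointer-walking DSP code is rewritten into explicit affine array subscripts, which is what enables the subsequent high-level loop and data transformations.

        // Before recovery: a common DSP programming idiom in which pointer
        // arithmetic hides the access pattern from the compiler.
        float dot_pointer(const float *a, const float *b, int n) {
            float acc = 0.0f;
            while (n-- > 0)
                acc += *a++ * *b++;
            return acc;
        }

        // After recovery: explicit loop bounds and affine subscripts, so
        // high-level loop/data transformations and parallelisation apply.
        float dot_recovered(const float *a, const float *b, int n) {
            float acc = 0.0f;
            for (int i = 0; i < n; ++i)
                acc += a[i] * b[i];
            return acc;
        }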