150 research outputs found

    Transformation of functional programs for identification of parallel skeletons

    Hardware is becoming increasingly parallel. Thus, it is essential to identify and exploit the inherent parallelism in a given program to effectively utilise the available computing power. However, parallel programming is tedious and error-prone when done by hand, and very difficult for a compiler to do automatically to the desired level. One possible approach to parallel programming is to use transformation techniques to automatically identify and explicitly specify parallel computations in a given program using parallelisable algorithmic skeletons. Existing methods for the systematic derivation of parallel programs or for parallel skeleton identification allow automation, but they place constraints on the programs to which they are applicable, require manual derivation of operators with specific properties for parallel execution, or allow the use of inefficient intermediate data structures in the parallel programs. In this thesis, we present a program transformation method that addresses these issues and has the following attributes: (1) it reduces the number of inefficient data structures used in the parallel program; (2) it transforms a program into a form better suited to identifying parallel skeletons; (3) it automatically identifies skeletons that can be efficiently executed using their parallel implementations. Our transformation method places no restrictions on the program to be parallelised, and it automatically verifies the skeleton operator properties required for parallel execution. To evaluate the performance of our transformation method, we use a set of benchmark programs. The parallel version of each program produced by our method is compared with other versions of the program, including parallel versions derived by hand, allowing us to evaluate the strengths and weaknesses of the proposed method. The results demonstrate improvements in the efficiency of the parallel programs produced in some examples, and they also highlight the role of certain intermediate data structures required for parallelisation in others.
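
    To make the skeleton idea concrete, the following is a minimal Haskell sketch, not the thesis's transformation system: an explicitly recursive function is rewritten into a composition of map and reduce skeletons, where the associativity of the reduction operator is what licenses parallel execution. The parMap combinator from the parallel package stands in for a parallel map skeleton.

        import Control.Parallel.Strategies (parMap, rdeepseq)

        -- Original, explicitly recursive definition: the parallelism is implicit.
        sumSq :: [Int] -> Int
        sumSq []     = 0
        sumSq (x:xs) = x * x + sumSq xs

        -- Skeleton form: a map followed by a reduce. The fold here is the
        -- sequential specification; because (+) is associative with unit 0,
        -- a parallel reduction may safely replace it.
        sumSqSkel :: [Int] -> Int
        sumSqSkel = foldr (+) 0 . parMap rdeepseq (\x -> x * x)

    Verifying operator properties such as associativity automatically is precisely what removes the manual derivation step that earlier methods required.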

    Pattern discovery for parallelism in functional languages

    No longer the preserve of specialist hardware, parallel devices are now ubiquitous. Pattern-based approaches to parallelism, such as algorithmic skeletons, simplify traditional low-level approaches by presenting composable high-level patterns of parallelism to the programmer. This allows optimal parallel configurations to be derived automatically, and it facilitates the use of different parallel architectures. Moreover, parallel patterns can be swapped in for sequential recursion schemes, simplifying their introduction. Unfortunately, there is no guarantee that recursion schemes are present in all functional programs. Automatic pattern discovery techniques can be used to discover recursion schemes. Current approaches are limited both in the range of analysable functions and in the range of discoverable patterns. In this thesis, we present an approach based on program slicing techniques that facilitates the analysis of a wider range of explicitly recursive functions. We then present an approach using anti-unification that expands the range of discoverable patterns. In particular, this approach is user-extensible: patterns developed by the programmer can be discovered without significant effort. We present prototype implementations of both approaches and evaluate them on a range of examples, including five parallel benchmarks and functions from the Haskell Prelude. We achieve maximum speedups of 32.93x on our 28-core hyperthreaded experimental machine for our parallel benchmarks, demonstrating that our approaches can discover patterns that produce good parallel speedups. Together, the approaches presented in this thesis enable the discovery of more loci of potential parallelism in pure functional programs than is currently possible. This leads to more possibilities for parallelism, and so more opportunities to take advantage of the potential performance gains that heterogeneous parallel systems present.
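
    As an illustration of pattern discovery, here is a minimal Haskell sketch (an assumption about the general idea, not the thesis's slicing or anti-unification machinery): two explicitly recursive functions differ only in their base case and combining operator, so their anti-unifier is the foldr recursion scheme, for which a parallel pattern can then be substituted.

        -- Two concrete recursive functions with the same shape.
        sumList :: [Int] -> Int
        sumList []     = 0
        sumList (x:xs) = x + sumList xs

        anyTrue :: [Bool] -> Bool
        anyTrue []     = False
        anyTrue (x:xs) = x || anyTrue xs

        -- Generalising the differing parts yields the foldr scheme:
        --   foldr f z []     = z
        --   foldr f z (x:xs) = f x (foldr f z xs)
        sumList' :: [Int] -> Int
        sumList' = foldr (+) 0

        anyTrue' :: [Bool] -> Bool
        anyTrue' = foldr (||) False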

    Work-stealing prefix scan: Addressing load imbalance in large-scale image registration

    Parallelism patterns (e.g., map or reduce) have proven to be effective tools for parallelizing high-performance applications. In this article, we study the recursive registration of a series of electron microscopy images, a time-consuming and imbalanced computation necessary for nano-scale microscopy analysis. We show that by translating the image registration into a specific instance of the prefix scan, we can convert this seemingly sequential problem into a parallel computation that scales to over a thousand cores. We analyze a variety of scan algorithms that behave similarly for common low-compute operators and propose a novel work-stealing procedure for a hierarchical prefix scan. Our evaluation shows that by identifying a suitable and well-optimized prefix scan algorithm, we reduce the time-to-solution on a series of 4,096 images spanning ten seconds of microscopy acquisition from over 10 hours to less than 3 minutes (using 1024 Intel Haswell cores), enabling the derivation of material properties at the nanoscale for long microscopy image series.
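
    The core trick can be sketched in a few lines of Haskell (a simplified assumption based on the abstract, with 2D translations standing in for full registration transforms): because transform composition is associative, the sequential recurrence "alignment of image i equals alignment of image i-1 composed with the pairwise shift" is exactly a prefix scan, so any parallel scan algorithm applies.

        -- A 2D translation as a stand-in for a full registration transform.
        data Shift = Shift !Double !Double deriving Show

        -- Composition is associative with identity Shift 0 0; this property
        -- is what turns the recurrence into a parallel prefix scan.
        instance Semigroup Shift where
          Shift dx1 dy1 <> Shift dx2 dy2 = Shift (dx1 + dx2) (dy1 + dy2)

        instance Monoid Shift where
          mempty = Shift 0 0

        -- Sequential specification; a hierarchical or work-stealing tree scan
        -- computes the same prefixes in parallel.
        cumulativeAlignments :: [Shift] -> [Shift]
        cumulativeAlignments = scanl1 (<>)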

    Efficient Kalman Filtering and Smoothing

    The Kalman filter and Kalman smoother are important components in modern multi-target tracking systems. Their applications are vast and include guidance, navigation, and control of vehicles. Moreover, when the motion model is uncertain, the Multiple-Model approach can be combined with the filtering and smoothing method. However, with a large retrodiction window size, many motion models, and a large number of targets, this process can become very computationally intensive and thus time-consuming. Real-time processing is very often needed in tracking, so this computational bottleneck becomes a problem. This is the motivation behind this thesis: to reduce the computational complexity of multi-target, multi-window, and multi-model applications. This thesis presents several approaches that tackle this multi-dimensional problem in terms of complexity while maintaining satisfactory precision. A natural first step is to leverage modern multi-core architectures. However, to parallelise such a process, the algorithms have to be reformulated to fit parallel processors. To parallelise the multi-target and multi-window scenario, this thesis introduces nested parallelism and a prefix-sum algorithm, realised on the Intel Knights Landing (KNL) processor using the OpenMP memory model. For the case of limited parallel resources, this thesis also develops an alternative, the Fast Kalman smoother (FRTS), to lower the computational complexity caused by the multi-window problem. Specifically, the smoother algorithm is reformulated so that it is computationally independent of the window size in the fixed-lag configuration. Although the underlying mathematics is the same as in the conventional approach, FRTS introduces a numerical stability issue that can make the smoother unstable. Therefore, this thesis uses the condition number to monitor the rate of deterioration and corrects the numerical error once a pre-set threshold is breached. In addition to the large number of targets and the retrodiction window size mentioned earlier, the number of models running simultaneously makes the problem even more challenging from the perspective of real-time performance. Since such algorithms are the fundamental backbone of a large number of multi-frame tracking algorithms, it is beneficial to have a multi-model algorithm whose cost is independent of the number of models used. Consequently, this thesis extends the FRTS concept to a fixed-lag Multiple-Model smoothing method to achieve this goal. The proposed algorithms are compared and tested through an extensive and exhaustive set of evaluations against the literature, and their relative merits are discussed. These evaluations show that the contributions pave the way to substantial performance gains for multi-dimensional tracking algorithms over conventional approaches.
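
    A minimal Haskell sketch of the nested-parallelism idea follows (an assumption about the structure, not the thesis's OpenMP/KNL code, and with a scalar standing in for the full mean-and-covariance state): targets are independent, so the outer loop is a parallel map, while the per-target time recursion is written as a scan so that prefix-sum techniques can be applied to it.

        import Control.Parallel.Strategies (parMap, rdeepseq)

        type Meas = Double
        type Est  = Double

        -- Hypothetical per-step filter update; a real Kalman step updates a
        -- mean vector and covariance matrix.
        kfStep :: Est -> Meas -> Est
        kfStep x z = 0.8 * x + 0.2 * z

        -- One target: scan over its measurement window (a prefix computation).
        filterTrack :: Est -> [Meas] -> [Est]
        filterTrack = scanl kfStep

        -- All targets at once: the outer parallel map over independent tracks.
        filterAll :: [(Est, [Meas])] -> [[Est]]
        filterAll = parMap rdeepseq (uncurry filterTrack)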

    High performance numerical modeling of ultra-short laser pulse propagation based on multithreaded parallel hardware

    The focus of this study is the development of parallelised versions of severely sequential and iterative numerical algorithms on a multi-threaded parallel platform such as a graphics processing unit (GPU). This requires the design and development of a platform-specific numerical solution that can benefit from the parallel capabilities of the chosen platform. A GPU was chosen as the parallel platform for the design and development of a numerical solution for a specific physical model in non-linear optics. This problem appears in describing ultra-short pulse propagation in bulk transparent media, which has recently been the subject of several theoretical and numerical studies. The mathematical model describing this phenomenon is challenging and complex, and its numerical modeling is limited on current workstations. Numerical modeling of this problem requires the parallelisation of essentially serial algorithms and the elimination of numerical bottlenecks. The main challenge to overcome is the parallelisation of the globally non-local mathematical model. This thesis presents a numerical solution that eliminates the bottleneck associated with the non-local nature of the mathematical model. The accuracy and performance of the parallel code are verified by back-to-back testing against a comparable serial version.
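
    The bottleneck can be illustrated with a small Haskell sketch (a generic assumption about non-local models, not the thesis's GPU implementation): a non-local term couples every grid point to the whole field, so each output point is an independent reduction over the field, which is exactly the shape that data-parallel hardware executes well.

        import Control.Parallel.Strategies (parMap, rdeepseq)

        -- Hypothetical non-local term: each output point is a weighted sum
        -- over the entire field. The gather is O(n^2) in total but
        -- embarrassingly parallel across output points.
        nonLocal :: (Int -> Int -> Double) -> [Double] -> [Double]
        nonLocal kernel field = parMap rdeepseq gather [0 .. length field - 1]
          where
            gather i = sum [kernel i j * fj | (j, fj) <- zip [0 ..] field]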

    From constraint programming to heterogeneous parallelism

    The scaling limitations of multi-core processor development have led to a diversification of the processor cores used within individual computers. Heterogeneous computing, involving the cooperation of several structurally different processor cores, has become widespread. Central processing unit (CPU) cores are most frequently complemented with graphics processors (GPUs), which despite their name are suitable for many highly parallel computations besides computer graphics. Furthermore, deep learning accelerators are rapidly gaining relevance. Many applications could profit from heterogeneous computing but are held back by the surrounding software ecosystems. Heterogeneous systems are a challenge for compilers in particular, which usually target only the increasingly marginalised homogeneous CPU cores. Therefore, heterogeneous acceleration is primarily accessible via libraries and domain-specific languages (DSLs), requiring application rewrites and resulting in vendor lock-in. This thesis presents a compiler method for automatically targeting heterogeneous hardware from existing sequential C/C++ source code. A new constraint programming method enables the declarative specification and automatic detection of computational idioms within compiler intermediate representation code. Examples of computational idioms are stencils, reductions, and linear algebra. Computational idioms denote algorithmic structures that commonly occur in performance-critical loops; consequently, well-designed accelerator DSLs and libraries support them with their programming models and function interfaces. Detecting computational idioms in the middle end enables compilers to incorporate DSL and library backends for code generation. These backends leverage domain knowledge for the efficient utilisation of heterogeneous hardware. The constraint programming methodology is first derived on an abstract model and then implemented as an extension to LLVM. Two constraint programming languages are designed to target this implementation: the Compiler Analysis Description Language (CAnDL) and the extended Idiom Detection Language (IDL). These languages are evaluated on a range of different compiler problems, culminating in a complete heterogeneous acceleration pipeline integrated with the Clang C/C++ compiler. This pipeline was evaluated on the established benchmark collections NPB and Parboil. The approach was applicable to 10 of the benchmark programs, resulting in significant speedups from 1.26× on “histo” to 275× on “sgemm” when starting from sequential baseline versions. In summary, this thesis shows that the automatic recognition of computational idioms during compilation enables the heterogeneous acceleration of sequential C/C++ programs. Moreover, the declarative specification of computational idioms is expressed in novel declarative programming languages, and constraint programming on Static Single Assignment intermediate code is demonstrated to be a suitable method for their automatic detection.
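
    The flavour of constraint-based idiom detection can be conveyed with a toy Haskell sketch (illustrative only; CAnDL and IDL are the thesis's actual languages and operate over LLVM IR, not this miniature): an idiom is a set of constraints over instructions, and detection searches for an assignment of instructions that satisfies all of them.

        data Op = Phi | Add | Store deriving (Eq, Show)
        data Instr = Instr { iOp :: Op, iUses :: [Int] } deriving Show

        -- Toy "reduction" idiom: a phi node (the accumulator) feeds an add,
        -- whose result is stored back. Three constraints over the IR.
        isReduction :: [Instr] -> Bool
        isReduction ir = or
          [ True
          | (pIdx, p) <- idx, iOp p == Phi
          , (aIdx, a) <- idx, iOp a == Add,   pIdx `elem` iUses a
          , (_,    s) <- idx, iOp s == Store, aIdx `elem` iUses s
          ]
          where idx = zip [0 ..] ir

    For example, isReduction [Instr Phi [], Instr Add [0], Instr Store [1]] holds, matching the accumulator pattern of a summation loop.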

    A multi-level functional IR with rewrites for higher-level synthesis of accelerators

    Specialised accelerators deliver orders of magnitude higher energy-efficiency than general-purpose processors. Field Programmable Gate Arrays (FPGAs) have become the substrate of choice, because the ever-changing nature of modern workloads, such as machine learning, demands reconfigurability. However, they are notoriously hard to program directly using Hardware Description Languages (HDLs). Traditional High-Level Synthesis (HLS) tools improve productivity but come with their own problems: they often produce sub-optimal designs, and programmers are still required to write hardware-specific code, so development cycles remain long. This thesis proposes Shir, a higher-level synthesis approach for high-performance accelerator design with a hardware-agnostic programming entry point, a multi-level Intermediate Representation (IR), a compiler, and rewrite rules for optimisation. First, a novel multi-level functional IR structure for accelerator design is described. The IRs operate on different levels of abstraction, cleanly separating different hardware concerns. They enable the expression of different forms of parallelism and standard memory features, such as asynchronous off-chip memories or synchronous on-chip buffers, as well as arbitration of such shared resources. Exposing these features at the IR level is essential for achieving high performance. Next, mechanical lowering procedures are introduced to automatically compile a program specification through Shir’s functional IRs until low-level HDL code for FPGA synthesis is emitted. Each lowering step gradually adds implementation detail. Finally, this thesis presents rewrite rules for automatic optimisations around parallelisation, buffering, and data reshaping. Reshaping operations pose a particular challenge to functional approaches: they introduce overheads that compromise performance or even prevent the generation of synthesisable hardware designs altogether. This fundamental issue is solved by the application of rewrite rules. The viability of this approach is demonstrated by running matrix multiplication and 2D convolution on an Intel Arria 10 FPGA. A limited design space exploration is conducted, confirming the ability of the IR to exploit various hardware features. Using rewrite rules for optimisation, it is possible to generate high-performance designs that are competitive with highly tuned OpenCL implementations and that outperform hardware-agnostic OpenCL code. The performance impact of the optimisations is further evaluated, showing that they are essential to achieving high performance and in many cases also necessary to produce hardware that fits the resource constraints.
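
    A miniature Haskell sketch of the rewrite-rule idea follows (an assumption about the general mechanism; Shir's actual IRs and rule language are richer): a split-join rule reshapes a map into chunks, exposing the chunk-level parallelism that a later lowering step can assign to parallel hardware units.

        data Expr
          = Map Expr           -- apply a function to every element
          | Compose Expr Expr  -- function composition
          | Split Int          -- reshape: chunk a stream into groups of n
          | Join               -- reshape: flatten the chunks back
          | Prim String        -- opaque primitive function
          deriving (Eq, Show)

        -- Rewrite rule:  map f  ==>  join . map (map f) . split n
        splitJoinRule :: Int -> Expr -> Maybe Expr
        splitJoinRule n (Map f) =
          Just (Compose Join (Compose (Map (Map f)) (Split n)))
        splitJoinRule _ _ = Nothing

    Because both sides compute the same function, such rules can be applied freely during design space exploration; only the hardware structure they lower to differs.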

    Analysis and transformation of legacy code

    Hardware evolves faster than software. While a hardware system might need replacement every one to five years, the average lifespan of a software system is a decade, with some instances living for several decades. Inevitably, code outlives the platform it was developed for and may become legacy: development of the software stops, but maintenance has to continue to keep up with the evolving ecosystem. No new features are added, but the software is still used to fulfil its original purpose. Even where it is still functional (which discourages its replacement), legacy code is inefficient, costly to maintain, and a security risk. This thesis proposes methods to leverage the expertise put into the development of legacy code and to extend its useful lifespan, rather than throwing it away. A novel methodology is proposed for automatically exploiting platform-specific optimisations when retargeting a program to another platform. The key idea is to leverage the optimisation information embedded in vector processing intrinsic functions. The performance of the resulting code is shown to be close to that of manually retargeted programs, but with the human labour removed. Building on top of that, the question of discovering optimisation information when there are no hints in the form of intrinsics or annotations is investigated. This thesis postulates that such information can be extracted by profiling the data flow during executions of the program. A context-aware data dependence profiling system is described, detailing previously overlooked aspects of related research. The system is shown to be essential in surpassing the information that can be inferred statically, in particular about loop iterators. Loop iterators are the controlling part of a loop. This thesis describes and evaluates a system for extracting the loop iterators in a program. It is found to significantly outperform previously known techniques and further increases the amount of information about the structure of a program that is available to a compiler. Combining this system with data dependence profiling improves its results even further. Loop iterator recognition enables other code modernising techniques, such as source code rejuvenation and commutativity analysis. The former increases the use of idiomatic code and as a result improves the maintainability of the program. The latter can drive parallelisation and thus dramatically improve runtime performance.
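
    The dependence-profiling idea can be sketched briefly in Haskell (a simplified assumption, not the thesis's profiler, which is additionally context-aware): replaying a trace of memory accesses and pairing each read with the most recent write to the same address yields the flow dependences that static analysis alone may miss.

        import qualified Data.Map.Strict as M

        data Access = Read | Write deriving (Eq, Show)

        -- A trace entry: (instruction id, access kind, address).
        type Trace = [(Int, Access, Int)]

        -- Read-after-write dependences, reported as (writer, reader) pairs.
        flowDeps :: Trace -> [(Int, Int)]
        flowDeps = go M.empty
          where
            go _ [] = []
            go lastW ((i, Write, a) : rest) = go (M.insert a i lastW) rest
            go lastW ((i, Read,  a) : rest) =
              case M.lookup a lastW of
                Just w  -> (w, i) : go lastW rest
                Nothing -> go lastW rest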