
    Custom optimization algorithms for efficient hardware implementation

    The focus is on real-time optimal decision making with application in advanced control systems. These computationally intensive schemes, which involve the repeated solution of (convex) optimization problems within a sampling interval, require more efficient computational methods than currently available for extending their application to highly dynamical systems and setups with resource-constrained embedded computing platforms. A range of techniques are proposed to exploit synergies between digital hardware, numerical analysis and algorithm design. These techniques build on top of parameterisable hardware code generation tools that generate VHDL code describing custom computing architectures for interior-point methods and a range of first-order constrained optimization methods. Since memory limitations are often important in embedded implementations, we develop a custom storage scheme for KKT matrices arising in interior-point methods for control, which reduces memory requirements significantly and prevents I/O bandwidth limitations from affecting the performance of our implementations. To take advantage of the trend towards parallel computing architectures and to exploit the special characteristics of our custom architectures, we propose several high-level parallel optimal control schemes that can reduce computation time. A novel optimization formulation is devised for reducing the computational effort in solving certain problems independent of the computing platform used. In order to solve optimization problems in fixed-point arithmetic, which is significantly more resource-efficient than floating-point, tailored linear algebra algorithms are developed for solving the linear systems that form the computational bottleneck in many optimization methods. These methods come with guarantees for reliable operation. We also provide a finite-precision error analysis for fixed-point implementations of first-order methods that can be used to minimize the use of resources while meeting accuracy specifications. The suggested techniques are demonstrated on several practical examples, including a hardware-in-the-loop setup for optimization-based control of a large airliner.
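
    As a loose illustration of the kind of fixed-point first-order method the finite-precision analysis above targets, the following sketch runs a projected gradient iteration for a box-constrained quadratic program with all arithmetic emulated in signed fixed point. The 16-bit fractional word length, the truncating multiply and the example problem are assumptions for illustration only; this is not the thesis's tailored linear algebra.

        # Minimal sketch: projected gradient for min 0.5*x'Hx + f'x, lo <= x <= hi,
        # with arithmetic emulated in signed fixed point (16 fractional bits assumed).
        import numpy as np

        FRAC_BITS = 16
        SCALE = 1 << FRAC_BITS

        def to_fix(x):
            return np.round(np.asarray(x, dtype=float) * SCALE).astype(np.int64)

        def fix_mul(a, b):
            return (a * b) >> FRAC_BITS      # truncation models one source of rounding error

        def projected_gradient_fixed(H, f, lo, hi, x0, step, iters=50):
            Hq, fq = to_fix(H), to_fix(f)
            loq, hiq = to_fix(lo), to_fix(hi)
            xq, sq = to_fix(x0), to_fix(step)
            for _ in range(iters):
                grad = np.array([fix_mul(Hq[i], xq).sum() for i in range(len(xq))]) + fq
                xq = np.clip(xq - fix_mul(sq, grad), loq, hiq)   # gradient step + box projection
            return xq / SCALE

        H = np.array([[2.0, 0.0], [0.0, 1.0]])
        f = np.array([-1.0, -1.0])
        print(projected_gradient_fixed(H, f, [-1, -1], [1, 1], [0.0, 0.0], step=0.4))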

    Tools for efficient Deep Learning

    In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools to address the challenges of designing DNNs that are efficient in time, resources and power consumption. We first present Aegis and SPGC to address the challenges in improving the memory efficiency of DL training and inference. Aegis makes mixed precision training (MPT) more stable through layer-wise gradient scaling. Empirical experiments show that Aegis can improve MPT accuracy by up to 4%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm. Common DNNs pruned by SPGC achieve up to 1% higher accuracy than prior work. This thesis also addresses the gap between DNN descriptions and executables, with Polygeist for software and POLSCA for hardware. Many novel techniques, e.g. statement splitting and memory partitioning, are explored and used to extend polyhedral optimisation. Polygeist speeds up sequential and parallel software execution by 2.53 and 9.47 times, respectively, on Polybench/C. POLSCA achieves a 1.5 times speedup over hardware designs generated directly from high-level synthesis on Polybench/C. Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators with streaming architectures and advanced pipelining techniques to address the challenges posed by heterogeneous convolutions and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon improves resource/power efficiency by 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets. All these tools are open source, and some have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.
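
    The channel-permutation view of GConv pruning can be illustrated with a toy sketch: a dense convolution's input channels are permuted into groups, each group keeping the channels with the largest L1 weight magnitude for its output channels. The greedy assignment below is an assumption made for illustration, not SPGC's polynomial-time heuristic.

        # Illustrative only: permute input channels of a dense conv weight tensor into
        # groups for group convolution, keeping per group the channels with the
        # largest L1 magnitude. SPGC's actual heuristic is more sophisticated.
        import numpy as np

        def group_by_l1(weight, groups):
            """weight: (out_ch, in_ch, kh, kw) dense convolution weights."""
            out_ch, in_ch = weight.shape[:2]
            assert out_ch % groups == 0 and in_ch % groups == 0
            oc_per_g, ic_per_g = out_ch // groups, in_ch // groups
            imp = np.abs(weight).sum(axis=(2, 3))                    # (out_ch, in_ch) importance
            imp = imp.reshape(groups, oc_per_g, in_ch).sum(axis=1)   # (groups, in_ch)
            perm, used = [], set()
            for g in range(groups):                                  # greedy channel assignment
                order = [c for c in np.argsort(-imp[g]) if c not in used][:ic_per_g]
                used.update(order)
                perm.extend(order)
            return np.array(perm)    # input-channel permutation to apply before GConv

        W = np.random.randn(8, 8, 3, 3)
        print(group_by_l1(W, groups=2))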

    Improving low latency applications for reconfigurable devices

    This thesis seeks to improve low latency application performance via architectural improvements in reconfigurable devices. This is achieved by improving resource utilisation and access, and by exploiting the different environments within which reconfigurable devices are deployed. Our first contribution leverages devices deployed at the network level to enable the low latency processing of financial market data feeds. Financial exchanges transmit messages via two identical data feeds to reduce the chance of message loss. We present an approach to arbitrate these redundant feeds at the network level using a Field-Programmable Gate Array (FPGA). With support for any messaging protocol, we evaluate our design using the NASDAQ TotalView-ITCH, OPRA, and ARCA data feed protocols, and provide two simultaneous outputs: one prioritising low latency, and one prioritising high reliability with three dynamically configurable windowing methods. Our second contribution is a new ring-based architecture for low latency, parallel access to FPGA memory. Traditional FPGA memory is formed by grouping block memories (BRAMs) together and accessing them as a single device. Our architecture accesses these BRAMs independently and in parallel. Targeting memory-based computing, which stores pre-computed function results in memory, we benefit low latency applications that rely on highly complex functions, iterative computation, or many parallel accesses to a shared resource. We assess square root, power, trigonometric, and hyperbolic functions within the FPGA, and provide a tool to convert Python functions to our new architecture. Our third contribution extends the ring-based architecture to support any FPGA processing element. We unify E heterogeneous processing elements within compute pools, with each element implementing the same function, and the pool serving D parallel function calls. Our implementation-agnostic approach supports processing elements with different latencies, implementations, and pipeline lengths, as well as non-deterministic latencies. Compute pools evenly balance access to processing elements across the entire application, and are evaluated by implementing eight different neural network activation functions within an FPGA.
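
    A small software model of the memory-based computing idea above: a function is tabulated once and the table is striped across several independently addressable banks, so parallel lookups that land in different banks can be served concurrently. The bank count, table size and round-to-nearest indexing are illustrative assumptions rather than the thesis's RTL.

        # Software model (assumed) of a function lookup table striped across
        # independently addressable memory banks, as in the ring-based BRAM
        # architecture described above.
        import math

        class BankedTable:
            def __init__(self, fn, lo, hi, entries, banks):
                step = (hi - lo) / (entries - 1)
                table = [fn(lo + i * step) for i in range(entries)]
                self.lo, self.step, self.banks = lo, step, banks
                self.mem = [table[b::banks] for b in range(banks)]   # stripe entry i into bank i % banks

            def lookup(self, x):
                i = round((x - self.lo) / self.step)
                bank, offset = i % self.banks, i // self.banks
                return self.mem[bank][offset]        # one bank access per call

        sqrt_lut = BankedTable(math.sqrt, 0.0, 4.0, entries=1024, banks=8)
        print(sqrt_lut.lookup(2.0), math.sqrt(2.0))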

    Computational Methods and Graphical Processing Units for Real-time Control of Tomographic Adaptive Optics on Extremely Large Telescopes.

    Ground-based optical telescopes suffer from limited imaging resolution as a result of the effects of atmospheric turbulence on the incoming light. Adaptive optics technology has so far been very successful in correcting these effects, providing nearly diffraction-limited images. Extremely Large Telescopes will require more complex adaptive optics configurations that introduce the need for new mathematical models and optimal solvers. In addition, the amount of data to be processed in real time is also greatly increased, making the use of conventional computational methods and hardware inefficient, which motivates the study of advanced computational algorithms and implementations on parallel processors. Graphical Processing Units (GPUs) are massively parallel processors that have so far demonstrated substantial speed increases compared to CPUs and other devices, and they have high potential to meet the real-time constraints of adaptive optics systems. This thesis focuses on the study and evaluation of existing computational algorithms with respect to computational performance, and on their implementation on GPUs. Two basic methods, one direct and one iterative, are implemented and tested, and the results presented provide an evaluation of the basic concept upon which other algorithms are based and demonstrate the benefits of using GPUs for adaptive optics.
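
    The two basic methods mentioned above, one direct and one iterative, can be sketched for a simplified reconstruction step reduced to solving A x = b with a symmetric positive-definite A. numpy stands in here for the GPU kernels used in the actual implementation, and the test matrix is an assumption for illustration.

        # Sketch: direct (Cholesky) solve versus matrix-free conjugate-gradient solve
        # of an SPD system, the two basic solver families compared in the thesis.
        import numpy as np

        def solve_direct(A, b):
            L = np.linalg.cholesky(A)          # one-off factorisation
            y = np.linalg.solve(L, b)
            return np.linalg.solve(L.T, y)

        def solve_cg(A, b, tol=1e-8, max_iter=500):
            x = np.zeros_like(b)
            r = b - A @ x
            p, rs = r.copy(), r @ r
            for _ in range(max_iter):          # only needs matrix-vector products
                Ap = A @ p
                alpha = rs / (p @ Ap)
                x += alpha * p
                r -= alpha * Ap
                rs_new = r @ r
                if np.sqrt(rs_new) < tol:
                    break
                p = r + (rs_new / rs) * p
                rs = rs_new
            return x

        M = np.random.randn(200, 200)
        A = M @ M.T + 200 * np.eye(200)        # SPD test system
        b = np.random.randn(200)
        print(np.allclose(solve_direct(A, b), solve_cg(A, b), atol=1e-6))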

    Hardware / Software System for Portable and Low-Cost Genome Assembly

    “The enjoyment of the highest attainable standard of health is one of the fundamental rights of every human being without distinction of race, religion, political belief, economic or social condition” [56]. Genomics (the study of the entire DNA) provides such a standard of health for people with rare diseases and helps control the spread of pandemics. Still, millions of human beings are unable to access genomics because of its cost and lack of portability. In genomics, DNA sequencers digitise DNA information, and computers analyse the digitised information. Desktop and thumb-sized DNA sequencers that digitise DNA data rapidly already exist, but the computations necessary for analysing this data are inevitably performed on high-performance computers (HPCs) and cloud computers. These computations not only require powerful computers but also necessitate high-speed networks, since the data generated run to hundreds of gigabytes. Relying on HPCs and high-speed networks denies the benefits of genomics to the many people who live in remote areas and in poorer nations. A low-cost and portable genomics computation platform would enable personalised treatment based on an individual’s DNA and help identify the source of fast-spreading epidemics in remote areas and in areas without HPC or network infrastructure. But developing a low-cost and portable genome-analysing computing platform is a challenging task. This thesis develops novel computer architecture solutions to assemble the whole human DNA and the COVID-19 virus RNA on a low-cost and portable platform. The first phase of the solution describes a ring-pipelined processor architecture for a key genome assembly algorithm. The human genome is partitioned to fit into the small memory footprint of embedded processors. These techniques allow an entire human genome to be assembled using highly portable and low-cost embedded processor cores, which can be housed within a single chip. Each processor occupies only 0.08 mm² and consumes just 37.5 mW, with 2 GB of memory, a 32-bit instruction width, and a 1 GHz clock. The second phase of the solution describes how application-specific instruction-set processors can be sped up to execute a key genome assembly algorithm. A fully automated design system is presented, which improves the performance of large applications (such as the genome assembly algorithm) and generates application-specific instructions for a commercial processor design tool (Xtensa). The tool enhances the base processor used in the ring-pipelined architecture, so that the alignment algorithms execute 2.1 times faster with only 11% additional hardware, and the energy-delay product is reduced by 7.3× compared to the base processor. This tool is the only one of its type that can handle such large applications. The third phase of the solution designs a portable low-cost genome assembly computer (PGA). PGA enhances the ring-pipelined architecture with the customised processor from phase two and with improved inter-processor communication. The results show that the COVID-19 virus RNA can be assembled in under 10 minutes and the whole human genome can be assembled in 11 days on a portable platform (an HPC takes around two days) for 30× coverage. PGA has an area footprint of just 5.68 mm² in a 28 nm technology node and is far smaller than a high-performance computer processor chip. The PGA consumes only 4 W of power, which is lower than the power requirement of a high-performance processor chip, and, when produced in volume, its manufacturing cost would also be much lower than that of a high-performance system. The developed solution can be powered by the USB port of a laptop. This thesis is the first of its type to show the design of a single-chip solution able to process a complex genomic problem, and it contributes to attaining one of the fundamental rights of every human being wherever they may live.
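
    A toy illustration of the partitioning idea in the first phase: reads are bucketed by a short prefix so that each bucket fits within the memory budget of one embedded core and can be assembled independently. The prefix key, the per-read size estimate and the 2 GB budget are illustrative assumptions, not the thesis's partitioning scheme.

        # Hedged sketch: bucket reads by k-mer prefix so each bucket fits the small
        # memory footprint of one embedded core; a real scheme would split
        # oversized buckets further instead of asserting.
        from collections import defaultdict

        def partition_reads(reads, k=4, mem_budget_bytes=2 * 1024 ** 3):
            buckets = defaultdict(list)
            for read in reads:
                buckets[read[:k]].append(read)         # prefix acts as the partition key
            for prefix, group in buckets.items():
                size = sum(len(r) for r in group)      # rough per-bucket memory estimate
                assert size <= mem_budget_bytes, f"bucket {prefix} exceeds core memory"
            return buckets

        reads = ["ACGTTGCA", "ACGTAAAA", "TTGACGTA", "GGGTACGT"]
        for prefix, group in partition_reads(reads).items():
            print(prefix, group)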

    Precision analysis for hardware acceleration of numerical algorithms

    The precision used in an algorithm affects the error and performance of individual computations, the memory usage, and the potential parallelism for a fixed hardware budget. However, when migrating an algorithm onto hardware, the potential improvements that can be obtained by tuning the precision throughout an algorithm to meet a range or error specification are often overlooked; the major reason is that it is hard to choose a number system which can guarantee that any such specification will be met. Instead, the problem is mitigated by opting to use IEEE standard double precision arithmetic so as to be ‘no worse’ than a software implementation. However, flexibility in the number representation is one of the key factors that can be exploited on reconfigurable hardware such as FPGAs, and hence ignoring this potential significantly limits the achievable performance. In order to optimise hardware performance reliably, we require a method that can tractably calculate tight bounds for the error or range of any variable within an algorithm, but currently only a handful of methods to calculate such bounds exist, and these sacrifice either tightness or tractability, whilst simulation-based methods cannot guarantee their error estimates. This thesis presents a new method to calculate these bounds, taking into account both input ranges and finite precision effects, which we show to be, in general, tighter than existing methods; this in turn can be used to tune the hardware to the algorithm specifications. We demonstrate the use of this method to optimise hardware for various algorithms that accelerate the solution of a system of linear equations, which forms the basis of many problems in engineering and science, and show that significant performance gains can be obtained by using this new approach in conjunction with more traditional hardware optimisations.
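
    To make the idea of bound calculation concrete, the sketch below propagates a (range, accumulated worst-case error) pair through a small expression, charging one ulp of truncation error per fixed-point operation. This naive interval-style propagation is an assumption for illustration; the thesis's method produces tighter bounds than such an approach.

        # Naive sketch of range and finite-precision error bound propagation,
        # assuming at most one ulp of truncation error per operation.
        class Bounded:
            def __init__(self, lo, hi, err=0.0):
                self.lo, self.hi, self.err = lo, hi, err

            def __add__(self, other, ulp=2**-16):
                return Bounded(self.lo + other.lo, self.hi + other.hi,
                               self.err + other.err + ulp)

            def __mul__(self, other, ulp=2**-16):
                prods = [self.lo * other.lo, self.lo * other.hi,
                         self.hi * other.lo, self.hi * other.hi]
                amp_other = max(abs(other.lo), abs(other.hi))
                amp_self = max(abs(self.lo), abs(self.hi))
                # first-order worst-case error bound for the product
                return Bounded(min(prods), max(prods),
                               self.err * amp_other + other.err * amp_self + ulp)

        x = Bounded(-1.0, 1.0)    # input range known a priori
        y = Bounded(0.5, 2.0)
        z = x * y + y             # propagate range and worst-case error together
        print((z.lo, z.hi), z.err)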

    An FPGA implementation of an investigative many-core processor, Fynbos: in support of a Fortran autoparallelising software pipeline

    In light of the power, memory, ILP, and utilisation walls facing the computing industry, this work examines the hypothetical many-core approach to finding greater compute performance and efficiency. In order to achieve greater efficiency in an environment in which Moore’s law continues but TDP has been capped, a means of deriving performance from dark and dim silicon is needed. The many-core hypothesis is one approach to exploiting these available transistors efficiently. As understood in this work, it involves trading hardware control complexity for hundreds to thousands of parallel simple processing elements, and operating at a clock speed sufficiently low as to allow the efficiency gains of near-threshold-voltage operation. Performance is therefore dependent on exploiting a degree of fine-grained parallelism currently found only in GPGPUs, but in a manner that is less restrictive in its range of application domains. While removing the complex control hardware of traditional CPUs provides space for more arithmetic hardware, a basic level of control is still required. For a number of reasons this work chooses to replace this control largely with static scheduling, which pushes the burden of control primarily onto the software, and specifically the compiler, rather than onto the programmer or an application-specific means of control simplification. A legacy tool chain capable of autoparallelising sequential Fortran code to the degree of parallelism necessary for many-core already exists, and this work implements a many-core architecture to match it. By prototyping the design on an FPGA, it is possible to examine the real-world performance of the compiler-architecture system to a greater degree than simulation alone would allow. Comparing theoretical peak performance and real performance in a case study application, the system is found to be more efficient than any other reviewed, but also to significantly underperform relative to current competing architectures. This failing is attributed to taking the need for simple hardware too far, and to an inability to implement mitigating static-scheduling tactics due to a lack of support for such in the compiler.
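
    The role of the compiler in the static-scheduling approach described above can be illustrated with a toy list scheduler that assigns the operations of a small dataflow graph to a fixed number of processing elements at compile time. The greedy heuristic and the example graph are assumptions for illustration; the Fynbos toolchain is not reproduced here.

        # Toy static list scheduling: assign dataflow operations to processing
        # elements at compile time, the kind of decision that replaces dynamic
        # control hardware in the many-core approach described above.
        def list_schedule(ops, deps, num_pes):
            """ops: {name: latency}; deps: {name: [predecessors]}."""
            finish, pe_free, schedule = {}, [0] * num_pes, []
            remaining = dict(ops)
            while remaining:
                # candidates whose predecessors have all been scheduled
                ready = [o for o in remaining if all(p in finish for p in deps.get(o, []))]
                op = min(ready, key=lambda o: max([finish[p] for p in deps.get(o, [])], default=0))
                pe = min(range(num_pes), key=lambda i: pe_free[i])
                start = max([finish[p] for p in deps.get(op, [])] + [pe_free[pe]])
                finish[op] = start + remaining.pop(op)
                pe_free[pe] = finish[op]
                schedule.append((op, pe, start))
            return schedule

        ops = {"a": 2, "b": 3, "c": 1, "d": 2}
        deps = {"c": ["a", "b"], "d": ["c"]}
        print(list_schedule(ops, deps, num_pes=2))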

    Application acceleration: an investigation of automatic porting methods for application accelerators

    Future HPC systems will contain both large collections of multi-core processors and specialist many-core co-processors. These specialised many-core co-processors are typically classified as Application Accelerators: devices such as GPUs, Cell processors, FPGAs and custom application-specific integrated circuits (ASICs). These devices present new challenges to overcome, including their programming difficulty, their diversity and the lack of a common programming approach between them, and the issue of selecting the most appropriate device for an application. This thesis attempts to tackle these problems by examining the suitability of automatic porting methods. In the course of this research, relevant software, both academic and commercial, has been analysed to determine how it attempts to solve the problems relating to the use of application acceleration devices. A new approach is then constructed: an Automatic Self-Modifying Application Porting system that is able not only to port code to an acceleration device but also, using performance data, to predict the appropriate device for the code being ported. Additionally, this system is able to use the performance data it gathers to modify its own decision-making model and improve its future predictions. Once the system has been developed, a series of applications are trialled and their performance, both in terms of execution time and the accuracy of the system’s predictions, is analysed. This analysis shows that, although the system is not able to flawlessly predict the correct device for an unseen application, it achieves an accuracy of over 80% and, just as importantly, the code it produces is within 15% of that produced by an experienced human programmer. The analysis also shows that while automatically ported code performs favourably in nearly all cases when compared to a single-core CPU, it outperforms a quad-core CPU in only three out of seven application case studies. From these results, it is also shown that the system is able to use this performance data to build a decision model allowing users to determine whether an automatically ported version of their application will provide a performance improvement compared to both CPU types considered. The availability of such a system may prove valuable in allowing a diverse range of users to utilise the performance supplied by many-core devices within next-generation HPC systems.
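
    The decision model described above can be sketched, under assumptions, as a simple nearest-neighbour predictor over application features that records new performance measurements and uses them for later predictions. The actual model, feature set and device list used by the system are not specified here; everything below is illustrative.

        # Simplified, assumed sketch of a self-updating device-selection model:
        # recommend a target device from application features and refine the
        # recommendation as new performance measurements are recorded.
        import math

        class DevicePredictor:
            def __init__(self):
                self.history = []          # (feature_vector, best_device, speedup)

            def record(self, features, best_device, speedup):
                self.history.append((features, best_device, speedup))

            def predict(self, features):
                if not self.history:
                    return "cpu"           # safe default with no data
                _, device, speedup = min(
                    ((math.dist(features, f), d, s) for f, d, s in self.history),
                    key=lambda t: t[0])
                return device if speedup > 1.0 else "cpu"

        p = DevicePredictor()
        p.record((0.9, 0.1), "gpu", 6.2)   # highly parallel, little branching
        p.record((0.2, 0.8), "cpu", 0.7)   # branch-heavy code stayed on the CPU
        print(p.predict((0.8, 0.2)))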