4 research outputs found

    Automated Scratchpad Mapping and Allocation for Embedded Processors

    Embedded system-on-chip processors such as the Texas Instruments C66 DSP and the IBM Cell provide the programmer with a software-controlled on-chip memory to supplement a traditional but simple two-level cache. By decomposing data sets and their corresponding workload into small subsets that fit within this on-chip memory, the processor can potentially achieve equivalent or better performance, power efficiency, and area efficiency than with its sophisticated cache. However, program-controlled on-chip memory requires a shift in the responsibility for management and allocation from the hardware to the programmer. Specifically, this requires the explicit mapping of program arrays to specific types of on-chip memory structure and the addition of supporting code that allocates and manages the on-chip memory. Previous work in tiling focuses on automated loop transformations but is hardware-agnostic and does not incorporate a performance model of the underlying memory design. In this work we will explore the relationship between mapping and allocation of tiles for stencil loops and linear algebra kernels on the Texas Instruments Keystone II DSP platform.
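The idea of decomposing a stencil loop into tiles that fit a software-managed on-chip memory can be illustrated with a minimal sketch. It is not the method of the work above: the scratchpad is modelled as an ordinary Python list, the explicit slice copy stands in for a DMA transfer, and the names `stencil_reference`, `stencil_tiled`, and `scratchpad_elems` are hypothetical.

```python
def stencil_reference(data):
    """Untiled 3-point average; the two boundary elements are copied unchanged."""
    out = list(data)
    for i in range(1, len(data) - 1):
        out[i] = (data[i - 1] + data[i] + data[i + 1]) / 3.0
    return out


def stencil_tiled(data, scratchpad_elems):
    """Same stencil, processed in tiles that fit a buffer of scratchpad_elems.

    Each tile is copied together with a one-element halo on either side, so
    the capacity must cover at least one interior point plus two halo points.
    """
    halo = 1
    tile = scratchpad_elems - 2 * halo  # interior points computed per tile
    if tile < 1:
        raise ValueError("scratchpad too small for one point plus its halo")

    n = len(data)
    out = list(data)
    for start in range(1, n - 1, tile):
        stop = min(start + tile, n - 1)
        # Explicit copy-in: the part a hardware cache would otherwise do implicitly.
        local = data[start - halo:stop + halo]
        assert len(local) <= scratchpad_elems
        # Compute only from the local buffer, then write results back.
        for j in range(halo, len(local) - halo):
            out[start + j - halo] = (local[j - 1] + local[j] + local[j + 1]) / 3.0
    return out
```

Both functions perform the same arithmetic in the same order, so their results should match exactly for any valid capacity; only the data movement differs, which is the part a platform-specific performance model would need to account for.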

    Advanced control systems for fast orbit feedback of synchrotron electron beams

    Diamond Light Source is the UK’s national synchrotron facility that produces synchrotron radiation for research. At source points of synchrotron radiation, the electron beam stability relative to the beam size is critical for the optimal performance of synchrotrons. The current requirement at Diamond is that variations in the beam position should not exceed 10% of the beam size for frequencies up to 140Hz. This is guaranteed by the fast orbit feedback that actuates hundreds of corrector magnets at a sampling rate of 10kHz to reduce beam vibrations down to sub-micron levels. For the next-generation upgrade, Diamond-II, the beam stability requirements will be raised to 3% up to 1kHz. Consequently, the sampling rate will be increased to 100kHz and an additional array of fast correctors will be introduced, which precludes the use of the existing controller. This thesis develops two different control approaches to accommodate the additional array of fast correctors at Diamond-II: internal model control based on the generalised singular value decomposition (GSVD) and model predictive control (MPC). In contrast to existing controllers, the proposed approaches treat the control problem as a whole and consider both arrays simultaneously. To achieve the sampling rate of 100kHz, this thesis proposes to reduce the computational complexity of the controllers in several ways, such as by exploiting symmetries of the magnetic lattice. To validate the controllers for Diamond-II, a real-time control system is implemented on high-performance hardware and integrated in the existing synchrotron. As a first-of-its-kind application to electron beam stabilisation in synchrotrons, this thesis presents real-world results from both MPC and GSVD-based controllers, demonstrating that the proposed approaches meet theoretical expectations with respect to performance and robustness in practice. 
The results from this thesis, and in particular the novel GSVD-based method, were successfully adopted for the Diamond-II upgrade. This may enable the use of more advanced control systems in similar large-scale and high-speed applications in the future.

    A mixed-signal computer architecture and its application to power system problems

    Radical changes are taking place in the landscape of modern power systems. This massive shift in the way the system is designed and operated has been termed the advent of the "smart grid". One of its implications is a strong market pull for faster power system analysis computing. This work is concerned in particular with transient simulation, which is one of the most demanding power system analyses. This refers to the imitation of the operation of the real-world system over time, for time scales that cover the majority of slow electromechanical transient phenomena. The general mathematical formulation of the simulation problem includes a set of non-linear differential algebraic equations (DAEs). The algebraic part of this set includes heavy linear algebra computations, which are related to the admittance matrix of the topology. These computations are a critical factor in the overall performance of a transient simulator. This work proposes the use of analog electronic computing as a means of exceeding the performance barriers of conventional digital computers for the linear algebra operations. Analog computing is integrated into the framework of a power system transient simulator, yielding significant computational performance benefits to the latter. Two hybrid analog-digital computers are presented. The first prototype has been implemented using reconfigurable hardware. At its core, analog computing is used for linear algebra operations, while pipelined digital resources on a field programmable gate array (FPGA) handle all remaining computations. The properties of the analog hardware are thoroughly examined, with special attention to accuracy and timing. The application of the platform to the transient analysis of power system dynamics showed a speedup of two orders of magnitude against conventional software solutions.
The second prototype is proposed as a future conceptual architecture that would overcome the limitations of the already implemented hardware, while retaining its virtues. The design space of this future architecture has been thoroughly explored, with the help of a software emulator. For one possible suggested implementation, speedups of four orders of magnitude against software solvers have been observed for the linear algebra operations.

    Low-power System-on-Chip Processors for Energy Efficient High Performance Computing: The Texas Instruments Keystone II

    The High Performance Computing (HPC) community recognizes energy consumption as a major problem. Extensive research is underway to identify means to increase energy efficiency of HPC systems, including consideration of alternative building blocks for future systems. This thesis considers one such system, the Texas Instruments Keystone II, a heterogeneous Low-Power System-on-Chip (LPSoC) processor that combines a quad-core ARM CPU with an octa-core Digital Signal Processor (DSP). It was first released in 2012. Four issues are considered: i) maximizing the Keystone II ARM CPU performance; ii) implementation and extension of the OpenMP programming model for the Keystone II; iii) simultaneous use of ARM and DSP cores across multiple Keystone SoCs; and iv) an energy model for applications running on LPSoCs like the Keystone II and heterogeneous systems in general. Maximizing the performance of the ARM CPU on the Keystone II system is fundamental to adoption of this system by the HPC community and of the ARM architecture more broadly. Key to achieving good performance is exploitation of the ARM vector instructions. This thesis presents the first detailed comparison of the use of ARM compiler intrinsic functions with automatic compiler vectorization across four generations of ARM processors. Comparisons are also made with x86-based platforms and the use of equivalent Intel vector instructions. Implementation of the OpenMP programming model on the Keystone II system presents both challenges and opportunities. The challenges arise because the OpenMP model was originally developed for a homogeneous programming environment with a common instruction set architecture, and in 2012 work had only just begun to consider how OpenMP might work with accelerators. The opportunities arise because shared memory is accessible to all processing elements on the LPSoC, offering performance advantages over what typically exists with attached accelerators.
This thesis presents an analysis of a prototype version of OpenMP implemented as a bare-metal runtime on the DSP of a Keystone I system. An implementation for the Keystone II that maps OpenMP 4.0 accelerator directives to OpenCL runtime library operations is presented and evaluated. Exploitation of some of the underlying hardware features of the Keystone II is also discussed. Simultaneous use of the ARM and DSP cores across multiple Keystone II boards is fundamental to the creation of commercially viable HPC offerings based on Keystone technology. The nCore BrownDwarf and HPE Moonshot systems are two such offerings. This thesis presents a proof-of-concept implementation of matrix multiplication (GEMM) for the BrownDwarf system. The BrownDwarf utilizes both Keystone II and Keystone I SoCs through a point-to-point interconnect called Hyperlink. Details of how a novel message passing communication framework across Hyperlink was implemented to support this complex environment are provided. An energy model that predicts energy usage as a function of the fraction of a particular computation performed on each of the available compute devices offers the opportunity to make runtime decisions on how best to minimize energy usage. This thesis presents a basic energy usage model that considers the rate of execution on each device and its active and idle power usage. Using this model, it is shown that only under certain conditions does there exist an energy-optimal work partition that uses multiple compute devices. To validate the model, a high-resolution energy measurement environment is developed and used to gather energy measurements for a matrix multiplication benchmark running on a variety of systems. The results presented support the model.
Drawing on the four issues noted above and other developments that have occurred since the Keystone II system was first announced, the thesis concludes by making comments regarding the future of LPSoCs as building blocks for HPC systems.
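The abstract does not give the energy model's equations, so the following is only a sketch of one plausible two-device form built from the quantities it names: execution rate, active power, and idle power per device. The assumptions are that both devices run concurrently, that a device draws idle power while waiting for the other to finish, and that all function and parameter names are hypothetical.

```python
def energy(fraction_a, work, rate_a, rate_b, active_a, idle_a, active_b, idle_b):
    """Total energy when device A does fraction_a of the work and B the rest.

    Rates are in work units per second, powers in watts (assumed units).
    """
    busy_a = fraction_a * work / rate_a
    busy_b = (1.0 - fraction_a) * work / rate_b
    makespan = max(busy_a, busy_b)  # both devices stay powered until the end
    energy_a = active_a * busy_a + idle_a * (makespan - busy_a)
    energy_b = active_b * busy_b + idle_b * (makespan - busy_b)
    return energy_a + energy_b


def best_fraction(work, rate_a, rate_b, active_a, idle_a, active_b, idle_b, steps=1000):
    """Grid search over the share of work given to device A."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda f: energy(f, work, rate_a, rate_b, active_a, idle_a, active_b, idle_b),
    )
```

Under these assumptions the sketch reproduces the qualitative claim: with zero idle power and one device that is both faster and no more power-hungry, the search assigns all work to that device, whereas two identical devices with high idle power favour an even split because shortening the makespan reduces idle energy.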