119 research outputs found

    Combining dynamic and static scheduling in high-level synthesis

    Field Programmable Gate Arrays (FPGAs) are starting to become mainstream devices for custom computing, particularly deployed in data centres. However, using these FPGA devices requires familiarity with digital design at a low abstraction level. In order to enable software engineers without a hardware background to design custom hardware, high-level synthesis (HLS) tools automatically transform a high-level program, for example in C/C++, into a low-level hardware description. A central task in HLS is scheduling: the allocation of operations to clock cycles. The classic approach to scheduling is static, in which each operation is mapped to a clock cycle at compile time, but recent years have seen the emergence of dynamic scheduling, in which an operation’s clock cycle is only determined at run-time. Both approaches have their merits: static scheduling can lead to simpler circuitry and more resource sharing, while dynamic scheduling can lead to faster hardware when the computation has a non-trivial control flow. This thesis proposes a scheduling approach that combines the best of both worlds. My idea is to use existing program analysis techniques in software designs, such as probabilistic analysis and formal verification, to optimize the HLS hardware. First, this thesis proposes a tool named DASS that uses a heuristic-based approach to identify the code regions in the input program that are amenable to static scheduling and synthesises them into statically scheduled components, also known as static islands, leaving the top-level hardware dynamically scheduled. Second, this thesis addresses a problem of this approach: that the analysis of static islands and their dynamically scheduled surroundings are separate, where one treats the other as black boxes. We apply static analysis including dependence analysis between static islands and their dynamically scheduled surroundings to optimize the offsets of static islands for high performance. 
We also apply probabilistic analysis to estimate the performance of the dynamically scheduled part and use this information to optimize the static islands for high area efficiency. Finally, this thesis addresses the problem of conservatism in using sequential control flow designs which can limit the throughput of the hardware. We show this challenge can be solved by formally proving that certain control flows can be safely parallelised for high performance. This thesis demonstrates how to use automated formal verification to find out-of-order loop pipelining solutions and multi-threading solutions from a sequential program. Open Access
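
As an illustration of what static scheduling means in the abstract above (each operation is mapped to a clock cycle at compile time), here is a minimal as-soon-as-possible (ASAP) scheduling sketch. It is not taken from DASS or any HLS tool; the function name `asap_schedule`, the operation names, and the latencies are all hypothetical, and resource constraints are ignored.

```python
def asap_schedule(latency, deps):
    """Assign each operation the earliest start cycle allowed by its dependencies.

    latency: dict mapping operation name -> latency in clock cycles (assumed values)
    deps:    dict mapping operation name -> list of operations it depends on
    The dependency graph is assumed to be acyclic.
    """
    start = {}

    def visit(op):
        # An operation can start once every predecessor has finished.
        if op not in start:
            start[op] = max(
                (visit(pred) + latency[pred] for pred in deps.get(op, [])),
                default=0,
            )
        return start[op]

    for op in latency:
        visit(op)
    return start


# Hypothetical data-flow graph for: result = (a[i] * b[i]) + c
latency = {"load_a": 2, "load_b": 2, "mul": 3, "add": 1}
deps = {"mul": ["load_a", "load_b"], "add": ["mul"]}
schedule = asap_schedule(latency, deps)
```

Every start cycle here is fixed before the hardware runs. A dynamically scheduled circuit would instead decide at run time, for example when a load's latency varies.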

    Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?

    The inference of large language models (LLMs) requires immense computation and memory resources. To curtail these costs, quantisation has emerged as a promising solution, but existing LLM quantisation mainly focuses on 8-bit. In this work, we explore the statistical and learning properties of the LLM layer and attribute the bottleneck of LLM quantisation to numerical scaling offsets. To address this, we adapt block quantisations for LLMs, a family of methods that share scaling factors across packed numbers. Block quantisations efficiently reduce the numerical scaling offsets solely from an arithmetic perspective, without additional treatments in the computational path. Our nearly-lossless quantised 6-bit LLMs achieve a 19× higher arithmetic density and 5× higher memory density than the float32 baseline, surpassing the prior art 8-bit quantisation by 2.5× in arithmetic density and 1.2× in memory density, without requiring any data calibration or re-training. We also share our insights into sub-8-bit LLM quantisation, including the mismatch between activation and weight distributions, optimal fine-tuning strategies, and a lower quantisation granularity inherent in the statistical properties of LLMs. The latter two tricks enable nearly-lossless 4-bit LLMs on downstream tasks. Our code is open-sourced. Comment: Accepted by EMNLP202
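
The idea of sharing scaling factors across packed numbers can be sketched as follows. This is a generic illustration under stated assumptions, not the paper's implementation: the function names, the block size of 4, and the use of one real-valued scale per block (rather than, say, a shared exponent) are all hypothetical choices.

```python
def block_quantise(values, block_size=4, bits=6):
    """Quantise floats in blocks, each block sharing one scaling factor.

    Returns a list of (scale, codes) pairs, one per block. The codes are
    signed integers representable in `bits` bits.
    """
    qmax = 2 ** (bits - 1) - 1  # largest signed code, e.g. 31 for 6 bits
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        peak = max(abs(v) for v in block)
        # The scale is chosen per block, so small-valued blocks keep precision.
        scale = peak / qmax if peak > 0 else 1.0
        codes = [max(-qmax, min(qmax, round(v / scale))) for v in block]
        blocks.append((scale, codes))
    return blocks


def block_dequantise(blocks):
    """Reconstruct approximate floats from (scale, codes) pairs."""
    return [scale * code for scale, codes in blocks for code in codes]


# Two blocks with very different magnitudes (made-up example values).
values = [0.1, -0.2, 0.05, 0.4, 100.0, -50.0, 25.0, 3.0]
blocks = block_quantise(values)
restored = block_dequantise(blocks)
```

Because each block carries its own scale, the small values in the first block are not swamped by the large values in the second. A single scale over the whole tensor would swamp them.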

    GSA to HDL: Towards principled generation of dynamically scheduled circuits

    High-level synthesis (HLS) refers to the automatic translation of a software program written in a high-level language into a hardware design. Modern HLS tools have moved away from the traditional approach of static (compile time) scheduling of operations to generating dynamic circuits that schedule operations at run time. Such circuits trade off area utilisation for increased dynamism and throughput. However, existing lowering flows in dynamically scheduled HLS tools rely on conservative assumptions about their input program due to both the intermediate representations (IR) utilised as well as the lack of formal specifications on the translation into hardware. These assumptions cause suboptimal hardware performance. In this work, we lift these assumptions by proposing a new and efficient abstraction for hardware mapping, namely h-GSA, an extension of the Gated Single Static Assignment (GSA) IR. Using this abstraction, we propose a lowering flow that transforms GSA into h-GSA and maps h-GSA into dynamically scheduled hardware circuits. We compare the schedules generated by our approach to those by the state-of-the-art dynamic-scheduling HLS tool, Dynamatic, and illustrate the potential performance improvement from hardware mapping using the proposed abstraction. Comment: Presented at the 19th International Summer School on Advanced Computer Architecture and Compilation for High-performance Embedded Systems (ACACES 2023)

    SEER: Super-Optimization Explorer for HLS using E-graph Rewriting with MLIR

    High-level synthesis (HLS) is a process that automatically translates a software program in a high-level language into a low-level hardware description. However, the hardware designs produced by HLS tools still suffer from a significant performance gap compared to manual implementations. This is because the input HLS programs must still be written using hardware design principles. Existing techniques either leave the program source unchanged or perform a fixed sequence of source transformation passes, potentially missing opportunities to find the optimal design. We propose a super-optimization approach for HLS that automatically rewrites an arbitrary software program into efficient HLS code that can be used to generate an optimized hardware design. We developed a toolflow named SEER, based on the e-graph data structure, to efficiently explore equivalent implementations of a program at scale. SEER provides an extensible framework, orchestrating existing software compiler passes and hardware synthesis optimizers. Our work is the first attempt to exploit e-graph rewriting for large software compiler frameworks, such as MLIR. Across a set of open-source benchmarks, we show that SEER achieves up to 38x the performance within 1.4x the area of the original program. Via an Intel-provided case study, SEER demonstrates the potential to outperform manually optimized designs produced by hardware experts

    Fast Prototyping Next-Generation Accelerators for New ML Models using MASE: ML Accelerator System Exploration

    Machine learning (ML) accelerators have been studied and used extensively to compute ML models with high performance and low power. However, designing such accelerators normally takes a long time and requires significant effort. Unfortunately, the pace of development of ML software models is much faster than the accelerator design cycle, leading to frequent and drastic modifications in the model architecture, thus rendering many accelerators obsolete. Existing design tools and frameworks can provide quick accelerator prototyping, but only for a limited range of models that can fit into a single hardware device, such as an FPGA. Furthermore, with the emergence of large language models, such as GPT-3, there is an increased need for hardware prototyping of these large models within a many-accelerator system to ensure the hardware can scale with the ever-growing model sizes. In this paper, we propose an efficient and scalable approach for exploring accelerator systems to compute large ML models. We developed a tool named MASE that can directly map large ML models onto an efficient streaming accelerator system. Over a set of ML models, we show that MASE can achieve better energy efficiency than GPUs when computing inference for recent transformer models. Our tool will be open-sourced upon publication.

    Non-Hermitian topological whispering gallery

    In 1878, Lord Rayleigh observed the highly celebrated phenomenon of sound waves that creep around the curved gallery of St Paul's Cathedral in London [1,2]. These whispering-gallery waves scatter efficiently with little diffraction around an enclosure and have since found applications in ultrasonic fatigue and crack testing, and in the optical sensing of nanoparticles or molecules using silica microscale toroids. Recently, intense research efforts have focused on exploring non-Hermitian systems with cleverly matched gain and loss, facilitating unidirectional invisibility and exotic characteristics of exceptional points [3,4]. Likewise, the surge in physics using topological insulators comprising non-trivial symmetry-protected phases has laid the groundwork in reshaping highly unconventional avenues for robust and reflection-free guiding and steering of both sound and light [5,6]. Here we construct a topological gallery insulator using sonic crystals made of thermoplastic rods that are decorated with carbon nanotube films, which act as a sonic gain medium by virtue of electro-thermoacoustic coupling. By engineering specific non-Hermiticity textures to the activated rods, we are able to break the chiral symmetry of the whispering-gallery modes, which enables the out-coupling of topological "audio lasing" modes with the desired handedness. We foresee that these findings will stimulate progress in non-destructive testing and acoustic sensing. This work was supported by the National Basic Research Program of China (2017YFA0303702), NSFC (12074183, 11922407, 11904035, 11834008, 11874215 and 12104226) and the Fundamental Research Funds for the Central Universities (020414380181). Z.Z. acknowledges the support from the China National Postdoctoral Program for Innovative Talents (BX20200165) and the China Postdoctoral Science Foundation (2020M681541). L.Z. acknowledges support from the CONEX-Plus programme funded by Universidad Carlos III de Madrid and the European Union's Horizon 2020 research and innovation programme under Marie Sklodowska-Curie grant agreement 801538. J.C. acknowledges support from the European Research Council (ERC) through the Starting Grant 714577 PHONOMETA and from the MINECO through a Ramón y Cajal grant (grant number RYC-2015-17156).

    Research on Temperature Field and Stress Field of Prefabricate Block Electric Furnace Roof

    This paper establishes CAD/CAE models of a high-aluminum brick furnace cover and a precast furnace cover (cast in three, eight, or twelve blocks), based on the real 30 t electric furnace roof of a steel factory, and simulates the temperature and stress fields of the firebrick roof and the prefabricated block roof with ANSYS. The calculation results indicate that the contact stress between the furnace cover and the precast blocks affects the performance of the furnace cover, and that the furnace cover assembled from three cast precast blocks has lower stress levels and a longer service life, providing a quantitative reference for selecting a casting scheme.

    Influence Factors on Stress Distribution of Electric Furnace Roof

    The electric furnace roof is an important device in electric steelmaking; its heat preservation performance and life-span have a direct impact on the economic benefits of iron and steel enterprises. This paper investigates the relationship between the stress level of the electric furnace roof and its material parameters. The results indicate that the two tend to change in the same direction.

    China’s 10-year progress in DC gas-insulated equipment: From basic research to industry perspective

    The construction of China's future energy structure under the 2050 carbon-neutral vision requires compact direct current (DC) gas-insulated equipment as important nodes and solutions to support long-distance, large-capacity electric power transmission and distribution. This paper reviews China's 10-year progress in DC gas-insulated equipment. Important progress in basic research and from the industry perspective is presented, and related scientific issues and technical bottlenecks are discussed. The progress in DC gas-insulated equipment worldwide (Europe, Japan, America) is also reported briefly.