119 research outputs found
Combining dynamic and static scheduling in high-level synthesis
Field Programmable Gate Arrays (FPGAs) are starting to become mainstream devices for custom computing, particularly deployed in data centres. However, using these FPGA devices requires familiarity with digital design at a low abstraction level. In order to enable software engineers without a hardware background to design custom hardware, high-level synthesis (HLS) tools automatically transform a high-level program, for example in C/C++, into a low-level hardware description.
A central task in HLS is scheduling: the allocation of operations to clock cycles. The classic approach to scheduling is static, in which each operation is mapped to a clock cycle at compile time, but recent years have seen the emergence of dynamic scheduling, in which an operation’s clock cycle is only determined at run-time. Both approaches have their merits: static scheduling can lead to simpler circuitry and more resource sharing, while dynamic scheduling can lead to faster hardware when the computation has a non-trivial control flow.
This thesis proposes a scheduling approach that combines the best of both worlds. My idea is to use existing program analysis techniques from software design, such as probabilistic analysis and formal verification, to optimize the HLS hardware. First, this thesis proposes a tool named DASS that uses a heuristic-based approach to identify the code regions in the input program that are amenable to static scheduling and synthesises them into statically scheduled components, also known as static islands, leaving the top-level hardware dynamically scheduled. Second, this thesis addresses a problem of this approach: the analyses of static islands and of their dynamically scheduled surroundings are separate, with each treating the other as a black box. We apply static analysis, including dependence analysis between static islands and their dynamically scheduled surroundings, to optimize the offsets of static islands for high performance. We also apply probabilistic analysis to estimate the performance of the dynamically scheduled part and use this information to optimize the static islands for high area efficiency. Finally, this thesis addresses the problem of conservatism in sequential control-flow designs, which can limit the throughput of the hardware. We show that this challenge can be solved by formally proving that certain control flows can be safely parallelised for high performance. This thesis demonstrates how to use automated formal verification to find out-of-order loop pipelining solutions and multi-threading solutions from a sequential program. Open Access
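As an illustration of what "static scheduling" means in the abstract above, the following is a minimal sketch of as-soon-as-possible (ASAP) scheduling, a textbook technique that fixes each operation's start cycle at compile time. It is not the DASS algorithm; the operation names, latencies and the `asap_schedule` helper are hypothetical, and the dependence graph is assumed to be acyclic.

```python
def asap_schedule(latency, deps):
    """Assign each operation the earliest start cycle its dependencies allow.

    latency: dict mapping operation name -> latency in clock cycles
    deps:    dict mapping operation name -> list of operations it depends on
    Assumes the dependence graph is acyclic.
    """
    start = {}

    def visit(op):
        # An operation may start once every predecessor has finished.
        if op not in start:
            start[op] = max(
                (visit(pred) + latency[pred] for pred in deps.get(op, [])),
                default=0,
            )
        return start[op]

    for op in latency:
        visit(op)
    return start


# Hypothetical data-flow graph: two loads feed a multiply, then an add.
latency = {"load_a": 2, "load_b": 2, "mul": 3, "add": 1}
deps = {"mul": ["load_a", "load_b"], "add": ["mul", "load_a"]}
schedule = asap_schedule(latency, deps)
```

Every start cycle here is known before the hardware runs. A dynamically scheduled circuit would instead let each operation fire whenever its inputs happen to arrive.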
Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?
The inference of large language models (LLMs) requires immense computation
and memory resources. To curtail these costs, quantisation has emerged as a
promising solution, but existing LLM quantisation mainly focuses on 8-bit. In
this work, we explore the statistical and learning properties of the LLM layer
and attribute the bottleneck of LLM quantisation to numerical scaling offsets.
To address this, we adapt block quantisations for LLMs, a family of methods
that share scaling factors across packed numbers. Block quantisations
efficiently reduce the numerical scaling offsets solely from an arithmetic
perspective, without additional treatments in the computational path. Our
nearly-lossless quantised 6-bit LLMs achieve a higher arithmetic
density and memory density than the float32 baseline, surpassing the
prior art 8-bit quantisation in both arithmetic density and
memory density, without requiring any data calibration or
re-training. We also share our insights into sub-8-bit LLM quantisation,
including the mismatch between activation and weight distributions, optimal
fine-tuning strategies, and a lower quantisation granularity inherent in the
statistical properties of LLMs. The latter two tricks enable nearly-lossless
4-bit LLMs on downstream tasks. Our code is open-sourced. Comment: Accepted by EMNLP202
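The idea of sharing one scaling factor across a block of packed numbers can be sketched as below. This is a simplified block-scaled integer scheme written for illustration only, not the specific formats evaluated in the paper; the function names, block size and bit width are assumptions.

```python
def quantise_block(block, bits):
    """Quantise one block of floats to signed integers sharing a single scale."""
    qmax = 2 ** (bits - 1) - 1                 # largest representable magnitude
    peak = max(abs(x) for x in block)
    scale = peak / qmax if peak > 0 else 1.0   # one scale for the whole block
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return scale, codes


def block_quantise(values, block_size, bits):
    """Quantise then dequantise `values` block by block, returning the floats."""
    out = []
    for i in range(0, len(values), block_size):
        scale, codes = quantise_block(values[i:i + block_size], bits)
        out.extend(scale * c for c in codes)
    return out


# Hypothetical data: three small values followed by three large ones.
values = [0.01, -0.02, 0.03, 8.0, -4.0, 2.0]
per_block = block_quantise(values, block_size=3, bits=6)   # one scale per 3 values
per_tensor = block_quantise(values, block_size=6, bits=6)  # one scale for everything
```

With a single scale for all six values, the three small entries are rounded to zero, whereas the per-block scales keep them close to their original values. This illustrates how a mismatch in numerical scale can hurt low-bit quantisation, under the assumptions of this toy example.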
GSA to HDL: Towards principled generation of dynamically scheduled circuits
High-level synthesis (HLS) refers to the automatic translation of a software
program written in a high-level language into a hardware design. Modern HLS
tools have moved away from the traditional approach of static (compile time)
scheduling of operations to generating dynamic circuits that schedule
operations at run time. Such circuits trade off area utilisation for increased
dynamism and throughput. However, existing lowering flows in dynamically
scheduled HLS tools rely on conservative assumptions on their input program due
to both the intermediate representations (IR) utilised as well as the lack of
formal specifications on the translation into hardware. These assumptions cause
suboptimal hardware performance. In this work, we lift these assumptions by
proposing a new and efficient abstraction for hardware mapping, namely h-GSA,
an extension of the Gated Single Static Assignment (GSA) IR. Using this
abstraction, we propose a lowering flow that transforms GSA into h-GSA and maps
h-GSA into dynamically scheduled hardware circuits. We compare the schedules
generated by our approach to those by the state-of-the-art dynamic-scheduling
HLS tool, Dynamatic, and illustrate the potential performance improvement from
hardware mapping using the proposed abstraction. Comment: Presented at the 19th International Summer School on Advanced
Computer Architecture and Compilation for High-performance Embedded Systems
(ACACES 2023).
SEER: Super-Optimization Explorer for HLS using E-graph Rewriting with MLIR
High-level synthesis (HLS) is a process that automatically translates a
software program in a high-level language into a low-level hardware
description. However, the hardware designs produced by HLS tools still suffer
from a significant performance gap compared to manual implementations. This is
because the input HLS programs must still be written using hardware design
principles.
Existing techniques either leave the program source unchanged or perform a
fixed sequence of source transformation passes, potentially missing
opportunities to find the optimal design. We propose a super-optimization
approach for HLS that automatically rewrites an arbitrary software program into
efficient HLS code that can be used to generate an optimized hardware design.
We developed a toolflow named SEER, based on the e-graph data structure, to
efficiently explore equivalent implementations of a program at scale. SEER
provides an extensible framework, orchestrating existing software compiler
passes and hardware synthesis optimizers.
Our work is the first attempt to exploit e-graph rewriting for large software
compiler frameworks, such as MLIR. Across a set of open-source benchmarks, we
show that SEER achieves up to 38x the performance within 1.4x the area of the
original program. Via an Intel-provided case study, SEER demonstrates the
potential to outperform manually optimized designs produced by hardware
experts.
Fast Prototyping Next-Generation Accelerators for New ML Models using MASE: ML Accelerator System Exploration
Machine learning (ML) accelerators have been studied and used extensively to
compute ML models with high performance and low power. However, designing such
accelerators normally takes a long time and requires significant effort.
Unfortunately, the pace of development of ML software models is much faster
than the accelerator design cycle, leading to frequent and drastic
modifications in the model architecture, thus rendering many accelerators
obsolete. Existing design tools and frameworks can provide quick accelerator
prototyping, but only for a limited range of models that can fit into a single
hardware device, such as an FPGA. Furthermore, with the emergence of large
language models, such as GPT-3, there is an increased need for hardware
prototyping of these large models within a many-accelerator system to ensure
the hardware can scale with the ever-growing model sizes. In this paper, we
propose an efficient and scalable approach for exploring accelerator systems to
compute large ML models. We developed a tool named MASE that can directly map
large ML models onto an efficient streaming accelerator system. Over a set of
ML models, we show that MASE can achieve better energy efficiency than GPUs when
computing inference for recent transformer models. Our tool will be open-sourced
upon publication.
Non-Hermitian topological whispering gallery
In 1878, Lord Rayleigh observed the highly celebrated phenomenon of sound waves that creep around the curved gallery of St Paul's Cathedral in London [1,2]. These whispering-gallery waves scatter efficiently with little diffraction around an enclosure and have since found applications in ultrasonic fatigue and crack testing, and in the optical sensing of nanoparticles or molecules using silica microscale toroids. Recently, intense research efforts have focused on exploring non-Hermitian systems with cleverly matched gain and loss, facilitating unidirectional invisibility and exotic characteristics of exceptional points [3,4]. Likewise, the surge in physics using topological insulators comprising non-trivial symmetry-protected phases has laid the groundwork in reshaping highly unconventional avenues for robust and reflection-free guiding and steering of both sound and light [5,6]. Here we construct a topological gallery insulator using sonic crystals made of thermoplastic rods that are decorated with carbon nanotube films, which act as a sonic gain medium by virtue of electro-thermoacoustic coupling. By engineering specific non-Hermiticity textures to the activated rods, we are able to break the chiral symmetry of the whispering-gallery modes, which enables the out-coupling of topological "audio lasing" modes with the desired handedness. We foresee that these findings will stimulate progress in non-destructive testing and acoustic sensing. This work was supported by the National Basic Research Program of China (2017YFA0303702), NSFC (12074183, 11922407, 11904035, 11834008, 11874215 and 12104226) and the Fundamental Research Funds for the Central Universities (020414380181). Z.Z. acknowledges the support from the China National Postdoctoral Program for Innovative Talents (BX20200165) and the China Postdoctoral Science Foundation (2020M681541). L.Z. acknowledges support from the CONEX-Plus programme funded by Universidad Carlos III de Madrid and the European Union's Horizon 2020 research and innovation programme under Marie Sklodowska-Curie grant agreement 801538. J.C. acknowledges support from the European Research Council (ERC) through the Starting Grant 714577 PHONOMETA and from the MINECO through a Ramón y Cajal grant (grant number RYC-2015-17156).
Research on Temperature Field and Stress Field of Prefabricate Block Electric Furnace Roof
This paper establishes CAD/CAE models of a high-aluminum brick furnace cover and of precast furnace covers (cast in three, eight and twelve blocks), based on the real 30 t electric furnace roof of a steel factory, and simulates the temperature and stress fields of the firebrick roof and the prefabricated block roofs with ANSYS. The calculation results indicate that the contact stress between the furnace cover and the precast blocks affects the performance of the furnace cover, and that the furnace cover assembled from three cast precast blocks has lower stress levels and a longer service life, providing a quantitative reference for the selection of a casting scheme.
Influence Factors on Stress Distribution of Electric Furnace Roof
The electric furnace roof is an important device in electric steelmaking, and its heat preservation performance and life-span have a direct impact on the economic benefits of iron and steel enterprises. This paper investigates the relationship between the stress level of the electric furnace roof and its material parameters. The results indicate that they tend to change in the same direction.
China’s 10-year progress in DC gas-insulated equipment: From basic research to industry perspective
The construction of China's future energy structure under the 2050 carbon-neutral vision requires compact direct current (DC) gas-insulated equipment as important nodes and solutions to support long-distance, large-capacity electric power transmission and distribution. This paper reviews China's 10-year progress in DC gas-insulated equipment. Important progress in basic research and from an industry perspective is presented, and related scientific issues and technical bottlenecks are discussed. The progress in DC gas-insulated equipment worldwide (Europe, Japan, America) is also reported briefly.