ACiS: smart switches with application-level acceleration
Network performance has contributed fundamentally to the growth of supercomputing over the past decades. In parallel, High Performance Computing (HPC) peak performance has depended, first, on ever faster/denser CPUs, and then on increasing density alone. As operating frequency, and now feature size, have levelled off, two new approaches are becoming central to achieving higher net performance: configurability and integration. Configurability enables hardware to map to the application, as well as vice versa. Integration enables system components that have generally been single-function (e.g., a network to transport data) to have additional functionality, e.g., also to operate on that data. More generally, integration enables compute-everywhere: not just in CPU and accelerator, but also in network and, more specifically, the communication switches.
In this thesis, we propose four novel methods of enhancing HPC performance through Advanced Computing in the Switch (ACiS). More specifically, we propose various flexible and application-aware accelerators that can be embedded into or attached to existing communication switches to improve the performance and scalability of HPC and Machine Learning (ML) applications. We follow a modular design discipline through introducing composable plugins to successively add ACiS capabilities.
In the first work, we propose an inline accelerator to communication switches for user-definable collective operations. MPI collective operations can often be performance killers in HPC applications; we seek to solve this bottleneck by offloading them to reconfigurable hardware within the switch itself. We also introduce a novel mechanism that enables the hardware to support MPI communicators of arbitrary shape and that is scalable to very large systems.
In the second work, we propose a look-aside accelerator for communication switches that is capable of processing packets at line-rate. Functions requiring loops and states are addressed in this method. The proposed in-switch accelerator is based on RISC-V compatible Coarse-Grained Reconfigurable Arrays (CGRAs).
To facilitate usability, we have developed a framework to compile user-provided C/C++ codes to appropriate back-end instructions for configuring the accelerator.
In the third work, we extend ACiS to support fused collectives and the combining of collectives with map operations. We observe that there is an opportunity to fuse communication (collectives) with computation. Since the computation can vary for different applications, ACiS support should be programmable in this method.
In the fourth work, we propose that switches with ACiS support can control and manage the execution of applications, i.e., that the switch be an active device with decision-making capabilities. Switches have a central view of the network; they can collect telemetry information and monitor application behavior and then use this information for control, decision-making, and coordination of nodes.
We evaluate the feasibility of ACiS through extensive RTL-based simulation as well as deployment in an open-access cloud infrastructure. Using this simulation framework, when considering a Graph Convolutional Network (GCN) application as a case study, an average speedup of 3.4x across five real-world datasets is achieved on 24 nodes compared to a CPU cluster without ACiS capabilities.
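The idea of offloading a collective to the switch and fusing it with a map step can be illustrated with a small software model. This is only a sketch under stated assumptions: the function name `switch_allreduce`, its parameters, and the element-wise reduction are hypothetical and are not taken from the thesis, which implements the mechanism in reconfigurable hardware.

```python
from functools import reduce
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def switch_allreduce(
    contributions: Sequence[Vector],
    op: Callable[[float, float], float],
    post_map: Optional[Callable[[float], float]] = None,
) -> List[Vector]:
    """Toy model of an in-switch all-reduce (hypothetical, not the ACiS design).

    The "switch" receives one vector per node, combines them element-wise
    with a user-supplied operator, optionally applies a map to the combined
    result (the fused step), and returns one copy of the result per node.
    """
    if not contributions:
        return []
    width = len(contributions[0])
    if any(len(v) != width for v in contributions):
        raise ValueError("all contributions must have the same length")
    # Reduce column by column: each column holds one element from every node.
    combined = [reduce(op, column) for column in zip(*contributions)]
    if post_map is not None:
        # Fused map: applied once in the switch instead of once per node.
        combined = [post_map(x) for x in combined]
    # Every participating node receives its own copy of the result.
    return [list(combined) for _ in contributions]
```

In this model, a sum reduction followed by a ReLU-like map would be expressed as `switch_allreduce(vectors, lambda a, b: a + b, post_map=lambda x: max(x, 0.0))`.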
FireFly v2: Advancing Hardware Support for High-Performance Spiking Neural Network with a Spatiotemporal FPGA Accelerator
Spiking Neural Networks (SNNs) are expected to be a promising alternative to
Artificial Neural Networks (ANNs) due to their strong biological
interpretability and high energy efficiency. Specialized SNN hardware offers
clear advantages over general-purpose devices in terms of power and
performance. However, there's still room to advance hardware support for
state-of-the-art (SOTA) SNN algorithms and improve computation and memory
efficiency. As a further step in supporting high-performance SNNs on
specialized hardware, we introduce FireFly v2, an FPGA SNN accelerator that can
address the issue of non-spike operation in current SOTA SNN algorithms, which
presents an obstacle in the end-to-end deployment onto existing SNN hardware.
To more effectively align with the SNN characteristics, we design a
spatiotemporal dataflow that allows four dimensions of parallelism and
eliminates the need for membrane potential storage, enabling on-the-fly spike
processing and spike generation. To further improve hardware acceleration
performance, we develop a high-performance spike computing engine as a backend
based on a systolic array operating at 500-600MHz. To the best of our
knowledge, FireFly v2 achieves the highest clock frequency among all FPGA-based
implementations. Furthermore, it stands as the first SNN accelerator capable of
supporting non-spike operations, which are commonly used in advanced SNN
algorithms. FireFly v2 has doubled the throughput and DSP efficiency when
compared to our previous version of FireFly and it exhibits 1.33 times the DSP
efficiency and 1.42 times the power efficiency compared to the current most
advanced FPGA accelerators.
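One way to picture how a dataflow can avoid storing membrane potentials is to process all timesteps of a neuron back to back, so the potential lives only in a local accumulator. The sketch below assumes a leaky integrate-and-fire neuron with a hard reset; the neuron model, the function name `lif_spikes_on_the_fly`, and the `threshold` and `leak` parameters are assumptions for illustration, not details of FireFly v2.

```python
from typing import List, Sequence


def lif_spikes_on_the_fly(
    input_currents: Sequence[Sequence[float]],
    threshold: float = 1.0,
    leak: float = 0.5,
) -> List[List[int]]:
    """Generate spike trains with a timestep-innermost loop (illustrative only).

    input_currents[n][t] is the synaptic input to neuron n at timestep t.
    Returns spikes[n][t], each 0 or 1.
    """
    spikes: List[List[int]] = []
    for per_neuron in input_currents:
        v = 0.0  # membrane potential: a local value, never written to a buffer
        train: List[int] = []
        for current in per_neuron:
            v = leak * v + current  # leaky integration
            if v >= threshold:
                train.append(1)
                v = 0.0  # hard reset after firing (assumed reset rule)
            else:
                train.append(0)
        spikes.append(train)
    return spikes
```

Because the time loop is innermost, each spike is emitted as soon as its input is consumed, and no per-neuron state has to survive between passes over the layer.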
Exploring space situational awareness using neuromorphic event-based cameras
The orbits around Earth are a limited natural resource and one that hosts a vast range of vital space-based systems that support international systems used by commercial industries, civil organisations, and national defence. The availability of this space resource is rapidly depleting due to the ever-growing presence of space debris and rampant overcrowding, especially in the limited and highly desirable slots in geosynchronous orbit. The field of Space Situational Awareness encompasses tasks aimed at mitigating these hazards to on-orbit systems through the monitoring of satellite traffic. Essential to this task is the collection of accurate and timely observation data. This thesis explores the use of a novel sensor paradigm to optically collect and process sensor data to enhance and improve space situational awareness tasks. Solving this issue is critical to ensure that we can continue to utilise the space environment in a sustainable way. However, these tasks pose significant engineering challenges that involve the detection and characterisation of faint, highly distant, and high-speed targets. Recent advances in neuromorphic engineering have led to the availability of high-quality neuromorphic event-based cameras that provide a promising alternative to the conventional cameras used in space imaging. These cameras offer the potential to improve the capabilities of existing space tracking systems and have been shown to detect and track satellites or ‘Resident Space Objects’ at low data rates, high temporal resolutions, and in conditions typically unsuitable for conventional optical cameras. This thesis presents a thorough exploration of neuromorphic event-based cameras for space situational awareness tasks and establishes a rigorous foundation for event-based space imaging.
The work conducted in this project demonstrates how to enable event-based space imaging systems that serve the goals of space situational awareness by providing accurate and timely information on the space domain. By developing and implementing event-based processing techniques, the asynchronous operation, high temporal resolution, and dynamic range of these novel sensors are leveraged to provide low latency target acquisition and rapid reaction to challenging satellite tracking scenarios. The algorithms and experiments developed in this thesis successfully study the properties and trade-offs of event-based space imaging and provide comparisons with traditional observing methods and conventional frame-based sensors. The outcomes of this thesis demonstrate the viability of event-based cameras for use in tracking and space imaging tasks and therefore contribute to the growing efforts of the international space situational awareness community and the development of event-based technology in astronomy and space science applications.
Experimental survey of FPGA-based monolithic switches and a novel queue balancer
This paper studies small to medium-sized monolithic switches for FPGA implementation and presents a novel switch design that achieves high algorithmic performance and FPGA implementation efficiency. Crossbar switches based on virtual output queues (VOQs) and variations have been rather popular for implementing switches on FPGAs, with applications in network switches, memory interconnects, network-on-chip (NoC) routers, etc. The implementation efficiency of crossbar-based switches is well-documented on ASICs, though we show that their disadvantages can outweigh their advantages on FPGAs. One of the most important challenges in such input-queued switches is the requirement for iterative scheduling algorithms. In contrast to ASICs, this is more harmful on FPGAs, as the reduced operating frequency and narrower packets cannot “hide” multiple iterations of scheduling that are required to achieve a modest scheduling performance. Our proposed design uses an output-queued switch internally for simplifying scheduling, and a queue balancing technique to avoid queue fragmentation and reduce the need for memory-sharing VOQs. Its implementation approaches the scheduling performance of a state-of-the-art FPGA-based switch, while requiring considerably fewer resources.
Low Power Memory/Memristor Devices and Systems
This reprint focusses on achieving low-power computation using memristive devices. The topic was designed as a convenient reference point: it contains a mix of techniques starting from the fundamental manufacturing of memristive devices all the way to applications such as physically unclonable functions, and also covers perspectives on, e.g., in-memory computing, which is inextricably linked with emerging memory devices such as memristors. Finally, the reprint contains a few articles representing how other communities (from typical CMOS design to photonics) are fighting on their own fronts in the quest towards low-power computation, as a comparison with the memristor literature. We hope that readers will enjoy discovering the articles within.
Towards Deep Learning with Competing Generalisation Objectives
The unreasonable effectiveness of Deep Learning continues to deliver unprecedented Artificial Intelligence capabilities to billions of people. Growing datasets and technological advances keep extending the reach of expressive model architectures trained through efficient optimisations. Thus, deep learning approaches continue to provide increasingly proficient subroutines for, among others, computer vision and natural interaction through speech and text. Due to their scalable learning and inference priors, higher performance is often gained cost-effectively through largely automatic training. As a result, new and improved capabilities empower more people while the costs of access drop.
The arising opportunities and challenges have profoundly influenced research. Quality attributes of scalable software became central desiderata of deep learning paradigms, including reusability, efficiency, robustness and safety. Ongoing research into continual, meta- and robust learning aims to maximise such scalability metrics in addition to multiple generalisation criteria, despite possible conflicts. A significant challenge is to satisfy competing criteria automatically and cost-effectively.
In this thesis, we introduce a unifying perspective on learning with competing generalisation objectives and make three additional contributions. When autonomous learning through multi-criteria optimisation is impractical, it is reasonable to ask whether knowledge of appropriate trade-offs could make it simultaneously effective and efficient. Informed by explicit trade-offs of interest to particular applications, we developed and evaluated bespoke model architecture priors. We introduced a novel architecture for sim-to-real transfer of robotic control policies by learning progressively to generalise anew. Competing desiderata of continual learning were balanced through disjoint capacity and hierarchical reuse of previously learnt representations. A new state-of-the-art meta-learning approach is then proposed. We showed that meta-trained hypernetworks efficiently store and flexibly reuse knowledge for new generalisation criteria through few-shot gradient-based optimisation. Finally, we characterised empirical trade-offs between the many desiderata of adversarial robustness and demonstrated a novel defensive capability of implicit neural networks to hinder many attacks simultaneously.
A multi-level functional IR with rewrites for higher-level synthesis of accelerators
Specialised accelerators deliver orders of magnitude higher energy-efficiency than
general-purpose processors. Field Programmable Gate Arrays (FPGAs) have become
the substrate of choice, because the ever-changing nature of modern workloads, such
as machine learning, demands reconfigurability. However, they are notoriously hard
to program directly using Hardware Description Languages (HDLs). Traditional High-Level Synthesis (HLS) tools improve productivity, but come with their own problems.
They often produce sub-optimal designs and programmers are still required to write
hardware-specific code, thus development cycles remain long.
This thesis proposes Shir, a higher-level synthesis approach for high-performance
accelerator design with a hardware-agnostic programming entry point, a multi-level
Intermediate Representation (IR), a compiler and rewrite rules for optimisation.
First, a novel, multi-level functional IR structure for accelerator design is described.
The IRs operate on different levels of abstraction, cleanly separating different hardware
concerns. They enable the expression of different forms of parallelism and standard
memory features, such as asynchronous off-chip memories or synchronous on-chip
buffers, as well as arbitration of such shared resources. Exposing these features at the
IR level is essential for achieving high performance.
Next, mechanical lowering procedures are introduced to automatically compile
a program specification through Shir’s functional IRs until low-level HDL code for
FPGA synthesis is emitted. Each lowering step gradually adds implementation details.
Finally, this thesis presents rewrite rules for automatic optimisations around parallelisation, buffering and data reshaping. Reshaping operations pose a challenge to
functional approaches in particular. They introduce overheads that compromise performance or even prevent the generation of synthesisable hardware designs altogether.
This fundamental issue is solved by the application of rewrite rules.
The viability of this approach is demonstrated by running matrix multiplication
and 2D convolution on an Intel Arria 10 FPGA. A limited design space exploration is
conducted, confirming the ability of the IR to exploit various hardware features. Using
rewrite rules for optimisation, it is possible to generate high-performance designs
that are competitive with highly tuned OpenCL implementations and that outperform
hardware-agnostic OpenCL code. The performance impact of the optimisations is
further evaluated showing that they are essential to achieving high performance, and
in many cases also necessary to produce hardware that fits the resource constraints.
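The role of rewrite rules in removing reshaping overhead can be illustrated on a toy functional IR. The sketch below is not Shir's IR or rule set: the node types `Input`, `Split`, `Join`, and `Map`, and the single rule `Join(Split(n, e)) -> e`, are hypothetical and chosen only to show how a reshape pair that would otherwise cost hardware can be eliminated before code generation.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Input:
    name: str  # a named input stream or array


@dataclass(frozen=True)
class Split:
    chunk: int  # split a 1D array into chunks of this size
    arg: "Expr"


@dataclass(frozen=True)
class Join:
    arg: "Expr"  # flatten one level of nesting


@dataclass(frozen=True)
class Map:
    fn: str  # name of the function applied to each element
    arg: "Expr"


Expr = Union[Input, Split, Join, Map]


def rewrite(expr: Expr) -> Expr:
    """Bottom-up pass: simplify children first, then apply the rule here."""
    if isinstance(expr, Input):
        return expr
    if isinstance(expr, Split):
        return Split(expr.chunk, rewrite(expr.arg))
    if isinstance(expr, Map):
        return Map(expr.fn, rewrite(expr.arg))
    # Remaining case: Join.
    inner = rewrite(expr.arg)
    if isinstance(inner, Split):
        # Rule: Join(Split(n, e)) -> e. Splitting and then flattening is a no-op.
        return inner.arg
    return Join(inner)
```

For example, `rewrite(Map("f", Join(Split(4, Input("x")))))` yields `Map("f", Input("x"))`, so no reshaping logic would need to be generated for that expression.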
- …