129 research outputs found
FireFly: A High-Throughput and Reconfigurable Hardware Accelerator for Spiking Neural Networks
Spiking neural networks (SNNs) have been widely used due to their strong
biological interpretability and high energy efficiency. With the introduction
of the backpropagation algorithm and surrogate gradient, the structure of
spiking neural networks has become more complex, and the performance gap with
artificial neural networks has gradually decreased. However, most SNN hardware
implementations for field-programmable gate arrays (FPGAs) cannot meet
arithmetic or memory efficiency requirements, which significantly restricts the
development of SNNs. They do not delve into the arithmetic operations between
the binary spikes and synaptic weights or assume unlimited on-chip RAM
resources by using overly expensive devices on small tasks. To improve
arithmetic efficiency, we analyze the neural dynamics of spiking neurons,
generalize the SNN arithmetic operation to the multiplex-accumulate operation,
and propose a high-performance implementation of such operation by utilizing
the DSP48E2 hard block in Xilinx Ultrascale FPGAs. To improve memory
efficiency, we design a memory system to enable efficient synaptic weights and
membrane voltage memory access with reasonable on-chip RAM consumption.
Combining the above two improvements, we propose an FPGA accelerator that can
process spikes generated by the firing neuron on-the-fly (FireFly). FireFly is
implemented on several FPGA edge devices with limited resources but still
guarantees a peak performance of 5.53TSOP/s at 300MHz. As a lightweight
accelerator, FireFly achieves the highest computational density efficiency
compared with existing research using large FPGA devices
ACiS: smart switches with application-level acceleration
Network performance has contributed fundamentally to the growth of supercomputing over the past decades. In parallel, High Performance Computing (HPC) peak performance has depended, first, on ever faster/denser CPUs, and then, just on increasing density alone. As operating frequency, and now feature size, have levelled off, two new approaches are becoming central to achieving higher net performance: configurability and integration. Configurability enables hardware to map to the application, as well as vice versa. Integration enables system components that have generally been single function-e.g., a network to transport data—to have additional functionality, e.g., also to operate on that data. More generally, integration enables compute-everywhere: not just in CPU and accelerator, but also in network and, more specifically, the communication switches.
In this thesis, we propose four novel methods of enhancing HPC performance through Advanced Computing in the Switch (ACiS). More specifically, we propose various flexible and application-aware accelerators that can be embedded into or attached to existing communication switches to improve the performance and scalability of HPC and Machine Learning (ML) applications. We follow a modular design discipline through introducing composable plugins to successively add ACiS capabilities.
In the first work, we propose an inline accelerator to communication switches for user-definable collective operations. MPI collective operations can often be performance killers in HPC applications; we seek to solve this bottleneck by offloading them to reconfigurable hardware within the switch itself. We also introduce a novel mechanism that enables the hardware to support MPI communicators of arbitrary shape and that is scalable to very large systems.
In the second work, we propose a look-aside accelerator for communication switches that is capable of processing packets at line-rate. Functions requiring loops and states are addressed in this method. The proposed in-switch accelerator is based on a RISC-V compatible Coarse Grained Reconfigurable Arrays (CGRAs).
To facilitate usability, we have developed a framework to compile user-provided C/C++ codes to appropriate back-end instructions for configuring the accelerator.
In the third work, we extend ACiS to support fused collectives and the combining of collectives with map operations. We observe that there is an opportunity of fusing communication (collectives) with computation. Since the computation can vary for different applications, ACiS support should be programmable in this method.
In the fourth work, we propose that switches with ACiS support can control and manage the execution of applications, i.e., that the switch be an active device with decision-making capabilities. Switches have a central view of the network; they can collect telemetry information and monitor application behavior and then use this information for control, decision-making, and coordination of nodes.
We evaluate the feasibility of ACiS through extensive RTL-based simulation as well as deployment in an open-access cloud infrastructure. Using this simulation framework, when considering a Graph Convolutional Network (GCN) application as a case study, a speedup of on average 3.4x across five real-world datasets is achieved on 24 nodes compared to a CPU cluster without ACiS capabilities
Towards trustworthy computing on untrustworthy hardware
Historically, hardware was thought to be inherently secure and trusted due to its
obscurity and the isolated nature of its design and manufacturing. In the last two
decades, however, hardware trust and security have emerged as pressing issues.
Modern day hardware is surrounded by threats manifested mainly in undesired
modifications by untrusted parties in its supply chain, unauthorized and pirated
selling, injected faults, and system and microarchitectural level attacks. These threats,
if realized, are expected to push hardware to abnormal and unexpected behaviour
causing real-life damage and significantly undermining our trust in the electronic and
computing systems we use in our daily lives and in safety critical applications. A
large number of detective and preventive countermeasures have been proposed in
literature. It is a fact, however, that our knowledge of potential consequences to
real-life threats to hardware trust is lacking given the limited number of real-life
reports and the plethora of ways in which hardware trust could be undermined. With
this in mind, run-time monitoring of hardware combined with active mitigation of
attacks, referred to as trustworthy computing on untrustworthy hardware, is proposed
as the last line of defence. This last line of defence allows us to face the issue of live
hardware mistrust rather than turning a blind eye to it or being helpless once it occurs.
This thesis proposes three different frameworks towards trustworthy computing
on untrustworthy hardware. The presented frameworks are adaptable to different
applications, independent of the design of the monitored elements, based on
autonomous security elements, and are computationally lightweight. The first
framework is concerned with explicit violations and breaches of trust at run-time,
with an untrustworthy on-chip communication interconnect presented as a potential
offender. The framework is based on the guiding principles of component guarding,
data tagging, and event verification. The second framework targets hardware elements
with inherently variable and unpredictable operational latency and proposes a
machine-learning based characterization of these latencies to infer undesired latency
extensions or denial of service attacks. The framework is implemented on a DDR3
DRAM after showing its vulnerability to obscured latency extension attacks. The
third framework studies the possibility of the deployment of untrustworthy hardware
elements in the analog front end, and the consequent integrity issues that might arise
at the analog-digital boundary of system on chips. The framework uses machine
learning methods and the unique temporal and arithmetic features of signals at this
boundary to monitor their integrity and assess their trust level
Optical Networks and Interconnects
The rapid evolution of communication technologies such as 5G and beyond, rely
on optical networks to support the challenging and ambitious requirements that
include both capacity and reliability. This chapter begins by giving an
overview of the evolution of optical access networks, focusing on Passive
Optical Networks (PONs). The development of the different PON standards and
requirements aiming at longer reach, higher client count and delivered
bandwidth are presented. PON virtualization is also introduced as the
flexibility enabler. Triggered by the increase of bandwidth supported by access
and aggregation network segments, core networks have also evolved, as presented
in the second part of the chapter. Scaling the physical infrastructure requires
high investment and hence, operators are considering alternatives to optimize
the use of the existing capacity. This chapter introduces different planning
problems such as Routing and Spectrum Assignment problems, placement problems
for regenerators and wavelength converters, and how to offer resilience to
different failures. An overview of control and management is also provided.
Moreover, motivated by the increasing importance of data storage and data
processing, this chapter also addresses different aspects of optical data
center interconnects. Data centers have become critical infrastructure to
operate any service. They are also forced to take advantage of optical
technology in order to keep up with the growing capacity demand and power
consumption. This chapter gives an overview of different optical data center
network architectures as well as some expected directions to improve the
resource utilization and increase the network capacity
Improving low latency applications for reconfigurable devices
This thesis seeks to improve low latency application performance via architectural improvements in reconfigurable devices. This is achieved by improving resource utilisation and access, and by exploiting the different environments within which reconfigurable devices are deployed.
Our first contribution leverages devices deployed at the network level to enable the low latency processing of financial market data feeds. Financial exchanges transmit messages via two identical data feeds to reduce the chance of message loss. We present an approach to arbitrate these redundant feeds at the network level using a Field-Programmable Gate Array (FPGA). With support for any messaging protocol, we evaluate our design using the NASDAQ TotalView-ITCH, OPRA, and ARCA data feed protocols, and provide two simultaneous outputs: one prioritising low latency, and one prioritising high reliability with three dynamically configurable windowing methods.
Our second contribution is a new ring-based architecture for low latency, parallel access to FPGA memory. Traditional FPGA memory is formed by grouping block memories (BRAMs) together and accessing them as a single device. Our architecture accesses these BRAMs independently and in parallel. Targeting memory-based computing, which stores pre-computed function results in memory, we benefit low latency applications that rely on: highly-complex functions; iterative computation; or many parallel accesses to a shared resource. We assess square root, power, trigonometric, and hyperbolic functions within the FPGA, and provide a tool to convert Python functions to our new architecture.
Our third contribution extends the ring-based architecture to support any FPGA processing element. We unify E heterogeneous processing elements within compute pools, with each element implementing the same function, and the pool serving D parallel function calls. Our implementation-agnostic approach supports processing elements with different latencies, implementations, and pipeline lengths, as well as non-deterministic latencies. Compute pools evenly balance access to processing elements across the entire application, and are evaluated by implementing eight different neural network activation functions within an FPGA.Open Acces
Recommended from our members
Conformable transistors for bioelectronics
The diversity of network disruptions that occur in patients with neuropsychiatric disorders creates a strong demand for personalized medicine. Such approaches often take the form of implantable bioelectronic devices that are capable of monitoring pathophysiological activity for identifying biomarkers to allow for local and responsive delivery of intervention. They are also required to transmit this data outside of the body for evaluation of the treatment’s efficacy.
However, the ability to perform these demanding electronic functions in the complex physiological environment with minimum disruption to the biological tissue remains a big challenge. An optimal fully implantable bioelectronic device would require each component from the front-end to the data transmission to be conformable and biocompatible. For this reason, organic material-based conformable electronics are ideal candidates for components of bioelectronic circuits due to their inherent flexibility, and soft nature.
In this work, first an organic mixed-conducting particulate composite material (MCP) able to form functional electronic components and non-invasively acquire high–spatiotemporal resolution electrophysiological signals by directly interfacing human skin is presented. Secondly, we introduce organic electrochemical internal ion-gated transistors (IGTs) as a high-density, high-amplification sensing component as well as a low leakage, high-speed processing unit.
Finally, a novel wireless, battery-free strategy for electrophysiological signal acquisition, processing, and transmission that employs IGTs and an ionic communication circuit (IC) is introduced. We show that the wirelessly-powered IGTs are able to acquire and modulate neurophysiological data in-vivo and transmit them transdermally, eliminating the need for any hard Si-based electronics in the implant
Toward Fault-Tolerant Applications on Reconfigurable Systems-on-Chip
L'abstract è presente nell'allegato / the abstract is in the attachmen
Configurable data center switch architectures
In this thesis, we explore alternative architectures for implementing con_gurable Data Center Switches along with the advantages that can be provided by such switches. Our first contribution centers around determining switch architectures that can be implemented on Field Programmable Gate Array (FPGA) to provide configurable switching protocols. In the process, we identify a gap in the availability of frameworks to realistically evaluate the performance of switch architectures in data centers and contribute a simulation framework that relies on realistic data center traffic patterns. Our framework is then used to evaluate the performance of currently existing as well as newly proposed FPGA-amenable switch designs. Through collaborative work with Meng and Papaphilippou, we establish that only small-medium range switches can be implemented on today's FPGAs. Our second contribution is a novel switch architecture that integrates a custom in-network hardware accelerator with a generic switch to accelerate Deep Neural Network training applications in data centers. Our proposed accelerator architecture is prototyped on an FPGA, and a scalability study is conducted to demonstrate the trade-offs of an FPGA implementation when compared to an ASIC implementation. In addition to the hardware prototype, we contribute a light weight load-balancing and congestion control protocol that leverages the unique communication patterns of ML data-parallel jobs to enable fair sharing of network resources across different jobs. Our large-scale simulations demonstrate the ability of our novel switch architecture and light weight congestion control protocol to both accelerate the training time of machine learning jobs by up to 1.34x and benefit other latency-sensitive applications by reducing their 99%-tile completion time by up to 4.5x. As for our final contribution, we identify the main requirements of in-network applications and propose a Network-on-Chip (NoC)-based architecture for supporting a heterogeneous set of applications. Observing the lack of tools to support such research, we provide a tool that can be used to evaluate NoC-based switch architectures.Open Acces
Security of Electrical, Optical and Wireless On-Chip Interconnects: A Survey
The advancement of manufacturing technologies has enabled the integration of
more intellectual property (IP) cores on the same system-on-chip (SoC).
Scalable and high throughput on-chip communication architecture has become a
vital component in today's SoCs. Diverse technologies such as electrical,
wireless, optical, and hybrid are available for on-chip communication with
different architectures supporting them. Security of the on-chip communication
is crucial because exploiting any vulnerability would be a goldmine for an
attacker. In this survey, we provide a comprehensive review of threat models,
attacks, and countermeasures over diverse on-chip communication technologies as
well as sophisticated architectures.Comment: 41 pages, 24 figures, 4 table
- …