297 research outputs found
Scalable reconfigurable computing leveraging latency-insensitive channels
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013.Cataloged from PDF version of thesis.Includes bibliographical references (p. 190-197).Traditionally, FPGAs have been confined to the limited role of small, low-volume ASIC replacements and as circuit emulators. However, continued Moore's law scaling has given FPGAs new life as accelerators for applications that map well to fine-grained parallel substrates. Examples of such applications include processor modelling, compression, and digital signal processing. Although FPGAs continue to increase in size, some interesting designs still fail to fit in to a single FPGA. Many tools exist that partition RTL descriptions across FPGAs. Unfortunately, existing tools have low performance due to the inefficiency of maintaining the cycle-by-cycle behavior of RTL among discrete FPGAs. These tools are unsuitable for use in FPGA program acceleration, as the purpose of an accelerator is to make applications run faster. This thesis presents latency-insensitive channels, a language-level mechanism by which programmers express points in their their design at which the cycle-by-cycle behavior of the design may be modified by the compiler. By decoupling the timing of portions of the RTL from the high-level function of the program, designs may be mapped to multiple FPGAs without suffering the performance degradation observed in existing tools. This thesis demonstrates, using a diverse set of large designs, that FPGA programs described in terms of latency-insensitive channels obtain significant gains in design feasibility, compilation time, and run-time when mapped to multiple FPGAs.by Kermin Elliott Fleming, Jr.Ph.D
Agile SoC Development with Open ESP
ESP is an open-source research platform for heterogeneous SoC design. The
platform combines a modular tile-based architecture with a variety of
application-oriented flows for the design and optimization of accelerators. The
ESP architecture is highly scalable and strikes a balance between regularity
and specialization. The companion methodology raises the level of abstraction
to system-level design and enables an automated flow from software and hardware
development to full-system prototyping on FPGA. For application developers, ESP
offers domain-specific automated solutions to synthesize new accelerators for
their software and to map complex workloads onto the SoC architecture. For
hardware engineers, ESP offers automated solutions to integrate their
accelerator designs into the complete SoC. Conceived as a heterogeneous
integration platform and tested through years of teaching at Columbia
University, ESP supports the open-source hardware community by providing a
flexible platform for agile SoC development.Comment: Invited Paper at the 2020 International Conference On Computer Aided
Design (ICCAD) - Special Session on Opensource Tools and Platforms for Agile
Development of Specialized Architecture
Efficient Intra-Rack Resource Disaggregation for HPC Using Co-Packaged DWDM Photonics
The diversity of workload requirements and increasing hardware heterogeneity
in emerging high performance computing (HPC) systems motivate resource
disaggregation. Resource disaggregation allows compute and memory resources to
be allocated individually as required to each workload. However, it is unclear
how to efficiently realize this capability and cost-effectively meet the
stringent bandwidth and latency requirements of HPC applications. To that end,
we describe how modern photonics can be co-designed with modern HPC racks to
implement flexible intra-rack resource disaggregation and fully meet the bit
error rate (BER) and high escape bandwidth of all chip types in modern HPC
racks. Our photonic-based disaggregated rack provides an average application
speedup of 11% (46% maximum) for 25 CPU and 61% for 24 GPU benchmarks compared
to a similar system that instead uses modern electronic switches for
disaggregation. Using observed resource usage from a production system, we
estimate that an iso-performance intra-rack disaggregated HPC system using
photonics would require 4x fewer memory modules and 2x fewer NICs than a
non-disaggregated baseline.Comment: 15 pages, 12 figures, 4 tables. Published in IEEE Cluster 202
ACCL+: an FPGA-Based Collective Engine for Distributed Applications
FPGAs are increasingly prevalent in cloud deployments, serving as Smart NICs
or network-attached accelerators. Despite their potential, developing
distributed FPGA-accelerated applications remains cumbersome due to the lack of
appropriate infrastructure and communication abstractions. To facilitate the
development of distributed applications with FPGAs, in this paper we propose
ACCL+, an open-source versatile FPGA-based collective communication library.
Portable across different platforms and supporting UDP, TCP, as well as RDMA,
ACCL+ empowers FPGA applications to initiate direct FPGA-to-FPGA collective
communication. Additionally, it can serve as a collective offload engine for
CPU applications, freeing the CPU from networking tasks. It is user-extensible,
allowing new collectives to be implemented and deployed without having to
re-synthesize the FPGA circuit. We evaluated ACCL+ on an FPGA cluster with 100
Gb/s networking, comparing its performance against software MPI over RDMA. The
results demonstrate ACCL+'s significant advantages for FPGA-based distributed
applications and highly competitive performance for CPU applications. We
showcase ACCL+'s dual role with two use cases: seamlessly integrating as a
collective offload engine to distribute CPU-based vector-matrix multiplication,
and serving as a crucial and efficient component in designing fully FPGA-based
distributed deep-learning recommendation inference
Recommended from our members
Silicon Photonic Subsystems for Inter-Chip Optical Networks
The continuous growth of electronic compute and memory nodes in terms of the number of I/O pins, bandwidth, and areal throughput poses major integration and packaging challenges associated with offloading multi-Tbit/s data rates within the few pJ/bit targets. While integrated photonics are already deployed in long and short distances such as inter and intra data centers communications, the promising characteristics of the silicon photonic platform set it as the future technology for optical interconnects in ultra short inter-chip distances. The high index contrast between the waveguide and the cladding together with strong thermo-optic and carrier effects in silicon allows developing a wide range of micro-scale and low power optical devices compatible with the CMOS fabrication processes. Furthermore, the availability of photonic foundries and new electrical and optical co-packaging techniques further pushes this platform for the next steps of commercial deployment.
The work in this dissertation presents the current trends in high-performance memory and processor nodes and gives motivation for disaggregated and reconfigurable inter-chip network enabled with the silicon photonic layer. A dense WDM transceiver and broadband switch architectures are discussed to support a bi-directional network of ten hybrid-memory cubes (HMC) interconnected to ten processor nodes with an overall aggregated bandwidth of 9.6Tbit/s. Latency and energy consumption are key performance parameters in a processor to primary memory nodes connectivity. The transceiver design is based on energy-efficient micro-ring resonators, and the broadband switch is constructed with 2x2 Mach-Zehnder elements for nano-second reconfiguration. Each transceiver is based on hundreds of micro-rings to convert the native HMC electrical protocol to the optical domain and the switch is based on tens of hundreds of 2x2 elements to achieve non-blocking all-to-all connectivity.
The next chapters focus on developing methods for controlling and monitoring such complex and highly integrated silicon photonic subsystems. The thermo-optic effect is characterized and we show experimentally that the phase of the optical carrier can be reliably controlled with pulse-width modulation (PWM) signal, ultimately relaxing the need for hundreds of digital to analog converters (DACs). We further show that doped waveguide heaters can be utilized as \textit{in-line} optical power monitors by measuring photo-conductance current, which is an alternative for the conventional tapping and integration of photo-diodes.
The next part concerned with a common cascaded micro-ring resonator in a WDM transceiver design. We develop on an FPGA control algorithm that abstracts the physical layer and takes user-defined inputs to set the resonances to the desired wavelength in a unicast and multicast transmission modes. The associated sensitivities of these silicon ring resonators are presented and addressed with three closed-loop solutions. We first show a closed-loop operation based on tapping the error signal from the drop port of the micro-ring. The second solution presents a resonance wavelength locking with a single digital I/O for control and feedback signals. Lastly, we leverage the photo-conductance effect and demonstrate the locking procedure using only the doped heater for both control and feedback purposes.
To achieve the inter-chip reconfigurability we discuss recent advances of high-port-count SiP broadband switches for reconfigurable inter-chip networks. To ensure optimal operation in terms of low insertion loss, low cross-talk and high signal integrity per routing path, hundreds of 2x2 Mach-Zehnder elements need to be biased precisely for the cross and bar states. We address this challenge with a tapless and a design agnostic calibration approach based on the photo-conductance effect. The automated algorithm returns a look-up table for all for each 2x2 element and the associated calibrated biases. Each routing scenario is then tested for insertion loss, crosstalk and bit-error rate of 25Gbit/s 4-level pulse amplitude modulation signals. The last part utilizes the Mach-Zehnder interferometers in WDM transceiver applications. We demonstrate a polarization insensitive four-channel WDM receiver with 40Gbit/s per channel and a transmitter design generating 8-level pulse amplitude modulation signals at 30Gbit/s
A High-Frequency Load-Store Queue with Speculative Allocations for High-Level Synthesis
Dynamically scheduled high-level synthesis (HLS) enables the use of
load-store queues (LSQs) which can disambiguate data hazards at circuit
runtime, increasing throughput in codes with unpredictable memory accesses.
However, the increased throughput comes at the price of lower clock frequency
and higher resource usage compared to statically scheduled circuits without
LSQs. The lower frequency often nullifies any throughput improvements over
static scheduling, while the resource usage becomes prohibitively expensive
with large queue sizes. This paper presents a method for achieving dynamically
scheduled memory operations in HLS without significant clock period and
resource usage increase. We present a novel LSQ based on shift-registers
enabled by the opportunity to specialize queue sizes to a target code in HLS.
We show a method to speculatively allocate addresses to our LSQ, significantly
increasing pipeline parallelism in codes that could not benefit from an LSQ
before. In stark contrast to traditional load value speculation, we do not
require pipeline replays and have no overhead on misspeculation. On a set of
benchmarks with data hazards, our approach achieves an average speedup of
11 against static HLS and 5 against dynamic HLS that uses a
state of the art LSQ from previous work. Our LSQ also uses several times fewer
resources, scaling to queues with hundreds of entries, and supports both
on-chip and off-chip memory.Comment: To appear in the International Conference on Field Programmable
Technology (FPT'23), Yokohama, Japan, 11-14 December 202
Optimized Surface Code Communication in Superconducting Quantum Computers
Quantum computing (QC) is at the cusp of a revolution. Machines with 100
quantum bits (qubits) are anticipated to be operational by 2020
[googlemachine,gambetta2015building], and several-hundred-qubit machines are
around the corner. Machines of this scale have the capacity to demonstrate
quantum supremacy, the tipping point where QC is faster than the fastest
classical alternative for a particular problem. Because error correction
techniques will be central to QC and will be the most expensive component of
quantum computation, choosing the lowest-overhead error correction scheme is
critical to overall QC success. This paper evaluates two established quantum
error correction codes---planar and double-defect surface codes---using a set
of compilation, scheduling and network simulation tools. In considering
scalable methods for optimizing both codes, we do so in the context of a full
microarchitectural and compiler analysis. Contrary to previous predictions, we
find that the simpler planar codes are sometimes more favorable for
implementation on superconducting quantum computers, especially under
conditions of high communication congestion.Comment: 14 pages, 9 figures, The 50th Annual IEEE/ACM International Symposium
on Microarchitectur
- …