91 research outputs found
Parallel and Distributed Computing
The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. Particularly, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peertopeer networks, largescale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing
TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings
In response to innovations in machine learning (ML) models, production
workloads changed radically and rapidly. TPU v4 is the fifth Google domain
specific architecture (DSA) and its third supercomputer for such ML models.
Optical circuit switches (OCSes) dynamically reconfigure its interconnect
topology to improve scale, availability, utilization, modularity, deployment,
security, power, and performance; users can pick a twisted 3D torus topology if
desired. Much cheaper, lower power, and faster than Infiniband, OCSes and
underlying optical components are <5% of system cost and <3% of system power.
Each TPU v4 includes SparseCores, dataflow processors that accelerate models
that rely on embeddings by 5x-7x yet use only 5% of die area and power.
Deployed since 2020, TPU v4 outperforms TPU v3 by 2.1x and improves
performance/Watt by 2.7x. The TPU v4 supercomputer is 4x larger at 4096 chips
and thus ~10x faster overall, which along with OCS flexibility helps large
language models. For similar sized systems, it is ~4.3x-4.5x faster than the
Graphcore IPU Bow and is 1.2x-1.7x faster and uses 1.3x-1.9x less power than
the Nvidia A100. TPU v4s inside the energy-optimized warehouse scale computers
of Google Cloud use ~3x less energy and produce ~20x less CO2e than
contemporary DSAs in a typical on-premise data center.Comment: 15 pages; 16 figures; to be published at ISCA 2023 (the International
Symposium on Computer Architecture
Recommended from our members
High Performance Silicon Photonic Interconnected Systems
Advances in data-driven applications, particularly artificial intelligence and deep learning, are driving the explosive growth of computation and communication in today’s data centers and high-performance computing (HPC) systems. Increasingly, system performance is not constrained by the compute speed at individual nodes, but by the data movement between them. This calls for innovative architectures, smart connectivity, and extreme bandwidth densities in interconnect designs. Silicon photonics technology leverages mature complementary metal-oxide-semiconductor (CMOS) manufacturing infrastructure and is promising for low cost, high-bandwidth, and reconfigurable interconnects. Flexible and high-performance photonic switched architectures are capable of improving the system performance. The work in this dissertation explores various photonic interconnected systems and the associated optical switching functionalities, hardware platforms, and novel architectures. It demonstrates the capabilities of silicon photonics to enable efficient deep learning training.
We first present field programmable gate array (FPGA) based open-loop and closed-loop control for optical spectral-and-spatial switching of silicon photonic cascaded micro-ring resonator (MRR) switches. Our control achieves wavelength locking at the user-defined resonance of the MRR for optical unicast, multicast, and multiwavelength-select functionalities. Digital-to-analog converters (DACs) and analog-to-digital converters (ADCs) are necessary for the control of the switch. We experimentally demonstrate the optical switching functionalities using an FPGA-based switch controller through both traditional multi-bit DAC/ADC and novel single-wired DAC/ADC circuits. For system-level integration, interfaces to the switch controller in a network control plane are developed. The successful control and the switching functionalitiesachieved are essential for system-level architectural innovations as presented in the following sections.
Next, this thesis presents two novel photonic switched architectures using the MRR-based switches. First, a photonic switched memory system architecture was designed to address memory challenges in deep learning. The reconfigurable photonic interconnects provide scalable solutions and enable efficient use of disaggregated memory resources for deep learning training. An experimental testbed was built with a processing system and two remote memory nodes using silicon photonic switch fabrics and system performance improvements were demonstrated. The collective results and existing high-bandwidth optical I/Os show the potential of integrating the photonic switched memory to state-of-the-art processing systems. Second, the scaling trends of deep learning models and distributed training workloads are challenging network capacities in today’s data centers and HPCs. A system architecture that leverages SiP switch-enabled server regrouping is proposed to tackle the challenges and accelerate distributed deep learning training. An experimental testbed with a SiP switch-enabled reconfigurable fat tree topology was built to evaluate the network performance of distributed ring all-reduce and parameter server workloads. We also present system-scale simulations. Server regrouping and bandwidth steering were performed on a large-scale tapered fat tree with 1024 compute nodes to show the benefits of using photonic switched architectures in systems at scale.
Finally, this dissertation explores high-bandwidth photonic interconnect designs for disaggregated systems. We first introduce and discuss two disaggregated architectures leveraging extreme high bandwidth interconnects with optically interconnected computing resources. We present the concept of rack-scale graphics processing unit (GPU) disaggregation with optical circuit switches and electrical aggregator switches. The architecture can leverage the flexibility of high bandwidth optical switches to increase hardware utilization and reduce application runtimes. A testbed was built to demonstrate resource disaggregation and defragmentation. In addition, we also present an extreme high-bandwidth optical interconnect accelerated low-latency communication architecture for deep learning training. The disaggregated architecture utilizes comb laser sources and MRR-based cross-bar switching fabrics to enable an all-to-all high bandwidth communication with a constant latency cost for distributed deep learning training. We discuss emerging technologies in the silicon photonics platform, including light source, transceivers, and switch architectures, to accommodate extreme high bandwidth requirements in HPC and data center environments. A prototype hardware innovation - Optical Network Interface Cards (comprised of FPGA, photonic integrated circuits (PIC), electronic integrated circuits (EIC), interposer, and high-speed printed circuit board (PCB)) is presented to show the path toward fast lanes for expedited execution at 10 terabits.
Taken together, the work in this dissertation demonstrates the capabilities of high-bandwidth silicon photonic interconnects and innovative architectural designs to accelerate deep learning training in optically connected data center and HPC systems
Recommended from our members
Silicon Photonic Subsystems for Inter-Chip Optical Networks
The continuous growth of electronic compute and memory nodes in terms of the number of I/O pins, bandwidth, and areal throughput poses major integration and packaging challenges associated with offloading multi-Tbit/s data rates within the few pJ/bit targets. While integrated photonics are already deployed in long and short distances such as inter and intra data centers communications, the promising characteristics of the silicon photonic platform set it as the future technology for optical interconnects in ultra short inter-chip distances. The high index contrast between the waveguide and the cladding together with strong thermo-optic and carrier effects in silicon allows developing a wide range of micro-scale and low power optical devices compatible with the CMOS fabrication processes. Furthermore, the availability of photonic foundries and new electrical and optical co-packaging techniques further pushes this platform for the next steps of commercial deployment.
The work in this dissertation presents the current trends in high-performance memory and processor nodes and gives motivation for disaggregated and reconfigurable inter-chip network enabled with the silicon photonic layer. A dense WDM transceiver and broadband switch architectures are discussed to support a bi-directional network of ten hybrid-memory cubes (HMC) interconnected to ten processor nodes with an overall aggregated bandwidth of 9.6Tbit/s. Latency and energy consumption are key performance parameters in a processor to primary memory nodes connectivity. The transceiver design is based on energy-efficient micro-ring resonators, and the broadband switch is constructed with 2x2 Mach-Zehnder elements for nano-second reconfiguration. Each transceiver is based on hundreds of micro-rings to convert the native HMC electrical protocol to the optical domain and the switch is based on tens of hundreds of 2x2 elements to achieve non-blocking all-to-all connectivity.
The next chapters focus on developing methods for controlling and monitoring such complex and highly integrated silicon photonic subsystems. The thermo-optic effect is characterized and we show experimentally that the phase of the optical carrier can be reliably controlled with pulse-width modulation (PWM) signal, ultimately relaxing the need for hundreds of digital to analog converters (DACs). We further show that doped waveguide heaters can be utilized as \textit{in-line} optical power monitors by measuring photo-conductance current, which is an alternative for the conventional tapping and integration of photo-diodes.
The next part concerned with a common cascaded micro-ring resonator in a WDM transceiver design. We develop on an FPGA control algorithm that abstracts the physical layer and takes user-defined inputs to set the resonances to the desired wavelength in a unicast and multicast transmission modes. The associated sensitivities of these silicon ring resonators are presented and addressed with three closed-loop solutions. We first show a closed-loop operation based on tapping the error signal from the drop port of the micro-ring. The second solution presents a resonance wavelength locking with a single digital I/O for control and feedback signals. Lastly, we leverage the photo-conductance effect and demonstrate the locking procedure using only the doped heater for both control and feedback purposes.
To achieve the inter-chip reconfigurability we discuss recent advances of high-port-count SiP broadband switches for reconfigurable inter-chip networks. To ensure optimal operation in terms of low insertion loss, low cross-talk and high signal integrity per routing path, hundreds of 2x2 Mach-Zehnder elements need to be biased precisely for the cross and bar states. We address this challenge with a tapless and a design agnostic calibration approach based on the photo-conductance effect. The automated algorithm returns a look-up table for all for each 2x2 element and the associated calibrated biases. Each routing scenario is then tested for insertion loss, crosstalk and bit-error rate of 25Gbit/s 4-level pulse amplitude modulation signals. The last part utilizes the Mach-Zehnder interferometers in WDM transceiver applications. We demonstrate a polarization insensitive four-channel WDM receiver with 40Gbit/s per channel and a transmitter design generating 8-level pulse amplitude modulation signals at 30Gbit/s
Recommended from our members
Photonic Interconnection Networks for Applications in Heterogeneous Utility Computing Systems
Growing demands in heterogeneous utility computing systems in future cloud and high performance computing systems are driving the development of processor-hardware accelerator interconnects with greater performance, flexibility, and dynamism. Recent innovations in the field of utility computing have led to an emergence in the use of heterogeneous compute elements. By leveraging the computing advantages of hardware accelerators alongside typical general purpose processors, performance efficiency can be maximized. The network linking these compute nodes is increasingly becoming the bottleneck in these architectures, limiting the hardware accelerators to be restricted to localized computing.
A high-bandwidth, agile interconnect is an imperative enabler for hardware accelerator delocalization in heterogeneous utility computing. A redesign of these systems' interconnect and architecture will be essential to establishing high-bandwidth, low-latency, efficient, and dynamic heterogeneous systems that can meet the challenges of next-generation utility computing.
By leveraging an optics-based approach, this dissertation presents the design and implementation of optically-connected hardware accelerators (OCHA) that exploit the distance-independent energy dissipation and bandwidth density of photonic transceivers, in combination with the flexibility, efficiency and data parallelization offered by optical networks. By replacing the electronic buses with an optical interconnection network, architectures that delocalize hardware accelerators can be created that are otherwise infeasible.
With delocalized optically-connected hardware accelerator nodes accessible by processors at run time, the system can alleviate the network latency issues plague current heterogeneous systems. Accelerators that would otherwise sit idle, waiting for it's master CPU to feed it data, can instead operate at high utilization rates, leading to dramatic improvements in overall system performance.
This work presents a prototype optically-connect hardware accelerator module and custom optical-network-aware, dynamic hardware accelerator allocator that communicate transparently and optically across an optical interconnection network. The hardware accelerators and processor are optimized to enable hardware acceleration across an optical network using fast packet-switching. The versatility of the optical network enables additional performance benefits including optical multicasting to exploit the data parallelism found in many accelerated data sets. The integration of hardware acceleration, heterogeneous computing, and optics constitutes a critical step for both computing and optics.
The massive data parallelism, application dependent-location and function, as well as network latency, and bandwidth limitations facing networks today complement well with the strength of optical communications-based systems. Moreover, ongoing efforts focusing on development of low-cost optical components and subsystems that are suitable for computing environment may benefit from the high-volume heterogeneous computing market. This work, therefore, takes the first steps in merging the areas of hardware acceleration and optics by developing architectures, protocols, and systems to interface with the two technologies and demonstrating areas of potential benefits and areas for future work. Next-generation heterogeneous utility computing systems will indubitably benefit from the use of efficient, flexible and high-performance optically connect hardware acceleration
Testing facilities for end-to-end test of vertical applications enabled by 5G networks:Eindhoven 5G Brainport Testbed
The key-performance indicators (KPIs) that will be delivered by 5G networks such as extremely low-latency, high capacity, robustness and highly flexible network are key enablers for applications such as autonomous driving, cooperative robotics, transport and processing of large volumes of video and images, to name but a few. This paper presents the ongoing build up and deployment of the Eindhoven based 5G-Brainport testbed towards an open environment for validation and test of end-to-end applications benefitting from the 5G KPIs
Brain fame:From FPGA to heterogeneous acceleration of brain simulations
Among the various methods in neuroscience for understanding brain function, in-silico simulations have been gaining popularity. Advances in neuroscience and engineering led to the creation of mathematical models of networks that do not simply mimic biological behaviour in an abstract fashion but emulate its in significant detail, even to the level of its biophysical properties. Such an example is the Spiking Neural Network (SNN) that can model a variety of additional behavioural features, like encoding data and adapting according to a spike train`s amplitude, frequency and general precise pattern of arrival of spiking events on a neuron. As a result, SNNs have higher explanatory power than their predecessors, thus brain simulations based on SNNs become an attractive topic to explore. In-silico simulations of SNNs can have beneficial results not only for neuroscience research but breakthroughs can also potentially benefit medical, computing and A.I. research. SNNs, though, computationally depending workloads that traditional computing might not be able to cover. Thus, the use of High Performance Computing (HPC) platforms in this application domain becomes desirable. This dissertation explores the topic of HPC-based in-silico brain simulations. Initially, the effort focuses on custom hardware accelerators, due to their potential in providing real-time performance alongside support for large-scale non-real-time experiments and specifically Field Programmable Gate Arrays (FPGAs). The nature of FPGA-based accelerators provides specific benefits against other similar paradigms like Application Specific Integrated Circuit (ASIC) designs.Firstly, we explore the general characteristics of typical SNNs model types to identify their computational requirements in relation to their explanatory strength. We also identify major design characteristics in model development that can directly affect its performance and behaviour when ported to an HPC platform. Subsequently, a detailed literature review is made on FPGA-based SNN implementations. The HPC porting effort begins with the implementation of an extended-Hodgkin-Huxley model of the Inferior-olivary nucleus featuring advanced connectivity. The model is quite demanding and complex enough to act as a realistic benchmark for HPC implementations, while also being scientifically relevant in its own right. FPGA development shows promising performance results not only when doing custom designs but also using High-level synthesis (HLS) toolflows that significantly reduce development time. FPGAs have proven suitable for small-scale embedded-HPC uses as well. The various efforts, though, reveal a very specific weakness of FPGA development that has less to do with the silicon itself and more with its programming environment. The FPGA tools are very inaccessible to non-experts, thus any acceleration effort would require the engineer (and the FPGA development time) to be in the critical path of the research process. An important question to be answered is how the FPGA platform would compare to other popular software-based HPC solutions such as GPU- and CPU-based platforms. A detailed comparison of the best FPGA implementation with GPU and manycore-CPU ports of the same benchmark is conducted. The comparison and evaluation shows that, when it comes to real-time performance, FPGAs have a clear advantage. But for non-real-time, large scale simulations, there is no single platform that can optimally support the complete range of experiments that could be conducted with the inferior olive model. The comparison makes a clear case for BrainFrame, a platform that supports heterogeneous HPC substrates. This dissertation, thus, concludes with the proposal of the BrainFrame system. The proof-of-concept design supports standard and extended Hodgkin-Huxley models, , such as the original inferior-olive model. The system integrates a GPU-, CPU- and FPGA-based HPC back-end while also using a standard neuroscientific language front-end (PyNN) that can score best-in-class performance, alleviate some of the development hurdles and make it far more user-friendly for the typical model developer. Additionally, the multi-node potential of the platform is being explored. BrainFrame provides both a powerful heterogeneous platform for acceleration and also a front-end familiar to the neuroscientist
In-memory photonic dot-product engine with electrically programmable weight banks
Electronically reprogrammable photonic circuits based on phase-change chalcogenides present an avenue to resolve the von-Neumann bottleneck; however, implementation of such hybrid photonic–electronic processing has not achieved computational success. Here, we achieve this milestone by demonstrating an in-memory photonic–electronic dot-product engine, one that decouples electronic programming of phase-change materials (PCMs) and photonic computation. Specifically, we develop non-volatile electronically reprogrammable PCM memory cells with a record-high 4-bit weight encoding, the lowest energy consumption per unit modulation depth (1.7 nJ/dB) for Erase operation (crystallization), and a high switching contrast (158.5%) using non-resonant silicon-on-insulator waveguide microheater devices. This enables us to perform parallel multiplications for image processing with a superior contrast-to-noise ratio (≥87.36) that leads to an enhanced computing accuracy (standard deviation σ ≤ 0.007). An in-memory hybrid computing system is developed in hardware for convolutional processing for recognizing images from the MNIST database with inferencing accuracies of 86% and 87%
Reconfigurable Antennas
In this new book, we present a collection of the advanced developments in reconfigurable antennas and metasurfaces. It begins with a review of reconfigurability technologies, and proceeds to the presentation of a series of reconfigurable antennas, UWB MIMO antennas and reconfigurable arrays. Then, reconfigurable metasurfaces are introduced and the latest advances are presented and discussed
On-chip phonon-magnon reservoir for neuromorphic computing
Reservoir computing is a concept involving mapping signals onto a high-dimensional phase space of a dynamical system called “reservoir” for subsequent recognition by an artificial neural network. We implement this concept in a nanodevice consisting of a sandwich of a semiconductor phonon waveguide and a patterned ferromagnetic layer. A pulsed write-laser encodes input signals into propagating phonon wavepackets, interacting with ferromagnetic magnons. The second laser reads the output signal reflecting a phase-sensitive mix of phonon and magnon modes, whose content is highly sensitive to the write- and read-laser positions. The reservoir efficiently separates the visual shapes drawn by the write-laser beam on the nanodevice surface in an area with a size comparable to a single pixel of a modern digital camera. Our finding suggests the phonon-magnon interaction as a promising hardware basis for realizing on-chip reservoir computing in future neuromorphic architectures
- …