
    HTA: A Scalable High-Throughput Accelerator for Irregular HPC Workloads

    We propose a new architecture called HTA for high-throughput irregular HPC applications with little data reuse. HTA reduces contention within the memory system with the help of a partitioned memory controller that is amenable to 2.5D implementation using silicon photonics. In terms of scalability, HTA supports 4× more compute units than state-of-the-art GPU systems. Our simulation-based evaluation on a representative set of HPC benchmarks shows that the proposed design reduces queuing latency by 10% to 30% and reduces the variability in memory access latency by 10% to 60%. Our results show that HTA reduces the L1 miss penalty by 2.3× to 5× compared to GPUs. When compared to a multi-GPU system with the same number of compute units, our simulation results show that HTA can provide up to 2× speedup.
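
    To make the contention argument concrete, here is a minimal sketch (not from the paper) that uses a textbook M/M/1 queuing model to illustrate why splitting memory traffic across independent controller partitions reduces queuing latency; the arrival and service rates below are assumed values, not HTA's.

```python
# Minimal sketch (assumed rates, not HTA's): an M/M/1 queuing-delay comparison
# illustrating why partitioning a memory controller can cut queuing latency for
# irregular, low-reuse traffic.

def mm1_wait(arrival_rate, service_rate):
    """Mean time spent waiting in queue for an M/M/1 server (seconds)."""
    rho = arrival_rate / service_rate          # utilization
    assert rho < 1.0, "queue is unstable"
    return rho / (service_rate * (1.0 - rho))  # W_q = rho / (mu * (1 - rho))

total_arrivals = 9.0e8    # requests/s hitting the memory system (assumed)
service_rate   = 1.0e9    # requests/s one controller partition can serve (assumed)

monolithic = mm1_wait(total_arrivals, service_rate)
for k in (2, 4, 8):       # number of independent controller partitions
    partitioned = mm1_wait(total_arrivals / k, service_rate)
    print(f"{k} partitions: queuing wait {partitioned*1e9:.2f} ns "
          f"vs {monolithic*1e9:.2f} ns monolithic")
```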

    LLM: Realizing Low-Latency Memory by Exploiting Embedded Silicon Photonics for Irregular Workloads

    As emerging workloads exhibit irregular memory access patterns with poor data reuse and locality, they would benefit from a DRAM that achieves low latency without sacrificing bandwidth and energy efficiency. We propose LLM (Low Latency Memory), a co-design of the DRAM microarchitecture, the memory controller, and the LLC/DRAM interconnect that leverages embedded silicon photonics in a 2.5D/3D integrated system-on-chip. LLM relies on Wavelength Division Multiplexing (WDM)-based photonic interconnects to reduce contention throughout the memory subsystem. LLM also increases bank-level parallelism, eliminates bus conflicts by using dedicated optical data paths, and reduces the access energy per bit with shorter global bitlines and smaller row buffers. We evaluate the design space of LLM for a variety of synthetic benchmarks and representative graph workloads on a full-system simulator (gem5). LLM exhibits low memory access latency for traffic with both regular and irregular access patterns. For irregular traffic, LLM achieves high bandwidth utilization (over 80% of peak throughput, compared to 20% for HBM2.0). For real workloads, LLM achieves 3× and 1.8× lower execution time compared to HBM2.0 and a state-of-the-art memory system with high memory-level parallelism, respectively. This study also demonstrates that by reducing queuing on the data path, LLM achieves on average 3.4× lower memory latency variation compared to HBM2.0.
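
    As a rough illustration of the bus-conflict elimination described above, the following sketch (assumptions only, not the paper's gem5 model) compares transfer completion times on a single shared data bus against dedicated per-bank-group WDM channels.

```python
# Minimal sketch (toy trace and timing, not LLM's simulator): with a dedicated
# wavelength per bank group, concurrent bank transfers no longer serialize on
# one shared bus, which is the bus-conflict elimination the abstract describes.

BURST_NS = 4.0                         # time to move one cache-line burst (assumed)
requests = [0, 3, 1, 2, 0, 1, 3, 2]    # bank-group id of each ready transfer (toy trace)

# Shared electrical bus: every transfer queues behind all earlier ones.
shared_finish = [BURST_NS * (i + 1) for i in range(len(requests))]

# WDM: each bank group owns a wavelength, so a transfer only queues behind
# earlier transfers from the same group.
per_group_count = {}
wdm_finish = []
for g in requests:
    per_group_count[g] = per_group_count.get(g, 0) + 1
    wdm_finish.append(BURST_NS * per_group_count[g])

print("shared-bus avg completion   :", sum(shared_finish) / len(shared_finish), "ns")
print("per-wavelength avg completion:", sum(wdm_finish) / len(wdm_finish), "ns")
```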

    Towards Cache-Coherent Chiplet-Based Architectures with Wireless Interconnects

    Cache-coherent chiplet-based architectures have gained significant attention due to their potential for scalability and improved performance in modern computing systems. However, the interconnects in such architectures often pose challenges in maintaining cache coherence across chiplets, leading to increased latency and energy consumption. This thesis explores the feasibility and advantages of integrating wireless interconnects into cache-coherent chiplet-based architectures. Through extensive simulations of 16- and 64-core systems partitioned into 4- and 8-chiplet configurations with multiple inter-chiplet latencies, we collect and analyze traffic data. By studying the inter-chiplet traffic of different chiplet-based configurations in terms of its spatial distribution, temporal distribution, and time variance, we find that chiplet scaling degrades performance. We further evaluate the impact of hybrid wired and wireless interconnects and assess the potential performance benefits they offer. The findings from this research contribute to the design and optimization of cache-coherent chiplet-based architectures, shedding light on the practicality and advantages of utilizing wireless interconnects in future computing systems.
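
    The kind of spatial and temporal traffic analysis described above can be sketched as follows; the trace here is synthetic and the metrics (per-link totals, coefficient of variation across time windows) are illustrative stand-ins for the thesis's measurements.

```python
# Minimal sketch (hypothetical trace format): summarizing inter-chiplet traffic
# spatially (per chiplet pair) and temporally (variation across time windows).
# A real study would feed in traces from the full-system simulations instead.

import numpy as np

rng = np.random.default_rng(0)
n_chiplets, n_windows = 4, 100
# flits exchanged between each ordered chiplet pair in each time window (synthetic)
traffic = rng.poisson(lam=50, size=(n_windows, n_chiplets, n_chiplets))
for i in range(n_chiplets):
    traffic[:, i, i] = 0                      # drop self traffic

spatial = traffic.sum(axis=0)                 # total flits per chiplet pair
temporal = traffic.sum(axis=(1, 2))           # total flits per time window

print("hottest link (src, dst):", np.unravel_index(spatial.argmax(), spatial.shape))
print("temporal coefficient of variation:",
      temporal.std() / temporal.mean())       # burstiness across windows
```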

    Cross-layer design of thermally-aware 2.5D systems

    Over the past decade, CMOS technology scaling has slowed down. To sustain the historic performance improvement predicted by Moore's Law, in the mid-2000s the computing industry moved to using manycore systems and exploiting parallelism. The on-chip power densities of manycore systems, however, continued to increase after the breakdown of Dennard scaling. This leads to the 'dark silicon' problem, whereby not all cores can operate at the highest frequency or be turned on simultaneously due to thermal constraints. As a result, we have not been able to take full advantage of the parallelism in manycore systems. One of the 'More than Moore' approaches being explored to address this problem is the integration of diverse functional components onto a substrate using 2.5D integration technology. 2.5D integration provides opportunities to exploit chiplet placement flexibility to address the dark silicon problem and mitigate the thermal stress of today's high-performance systems. These opportunities can be leveraged to improve the overall performance of manycore heterogeneous computing systems. Broadly, this thesis aims at designing thermally-aware 2.5D systems. More specifically, to address the dark silicon problem of manycore systems, we first propose a single-layer thermally-aware chiplet organization methodology for homogeneous 2.5D systems. The key idea is to strategically insert spacing between the chiplets of a 2.5D manycore system to lower the operating temperature, and thus reclaim dark silicon by allowing more active cores and/or a higher operating frequency under a temperature threshold. We investigate the manufacturing cost and thermal behavior of 2.5D systems, then formulate and solve an optimization problem that jointly maximizes performance and minimizes manufacturing cost. We then enhance our methodology with a cross-layer co-optimization approach, jointly maximizing performance and minimizing manufacturing cost and operating temperature across the logical, physical, and circuit layers. We propose a novel gas-station link design that enables pipelining in passive interposers. We then extend our thermally-aware optimization methodology to the network routing and chiplet placement of heterogeneous 2.5D systems, which consist of central processing unit (CPU) chiplets, graphics processing unit (GPU) chiplets, accelerator chiplets, and/or memory stacks. We jointly minimize the total wirelength and the system temperature. Our enhanced methodology increases the thermal design power budget and thereby improves the thermally-constrained performance of the system.
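
    A toy version of the joint performance/cost/temperature trade-off is sketched below; the thermal and cost models are illustrative placeholders rather than the thesis's calibrated models, but they show how inserting spacing buys thermal headroom at the price of interposer area.

```python
# Minimal sketch (toy models only): pick a chiplet spacing that maximizes
# performance while penalizing interposer cost under a temperature cap.
# All constants and formulas here are illustrative assumptions.

T_AMBIENT, T_MAX = 45.0, 85.0     # deg C (assumed)
ALPHA = 30.0                      # deg C of heating at zero spacing (assumed)
CHIPLET_MM, N_PER_SIDE = 10.0, 4  # 4x4 homogeneous chiplet grid (assumed)

def peak_temp(spacing_mm):
    # crude model: more spacing -> better lateral heat spreading -> lower peak temp
    return T_AMBIENT + ALPHA / (1.0 + spacing_mm)

def interposer_cost(spacing_mm):
    side = N_PER_SIDE * CHIPLET_MM + (N_PER_SIDE - 1) * spacing_mm
    return (side * side) / 1000.0             # cost grows with interposer area

def perf(spacing_mm):
    headroom = T_MAX - peak_temp(spacing_mm)  # thermal headroom -> higher frequency
    return max(headroom, 0.0)

# brute-force search over candidate spacings, weighting cost against performance
best = max((s / 10.0 for s in range(0, 101)),
           key=lambda s: perf(s) - 2.0 * interposer_cost(s))
print(f"chosen spacing: {best:.1f} mm, peak temp {peak_temp(best):.1f} C, "
      f"cost {interposer_cost(best):.2f}")
```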

    Multiplexed control of spin quantum memories in a photonic circuit

    A central goal in many quantum information processing applications is a network of quantum memories that can be entangled with each other while being individually controlled and measured with high fidelity. This goal has motivated the development of programmable photonic integrated circuits (PICs) with integrated spin quantum memories using diamond color center spin-photon interfaces. However, this approach introduces a challenge in the microwave control of individual spins within closely packed registers. Here, we present a quantum-memory-integrated photonics platform capable of (i) the integration of multiple diamond color center spins into a cryogenically compatible, high-speed programmable PIC platform; (ii) selective manipulation of individual spin qubits addressed via tunable magnetic field gradients; and (iii) simultaneous control of multiple qubits using numerically optimized microwave pulse shaping. The combination of localized optical control, enabled by the PIC platform, together with selective spin manipulation opens the path to scalable quantum networks on intra-chip and inter-chip platforms.
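
    The selective addressing enabled by the field gradient can be illustrated with the textbook two-level Rabi formula; this is a sketch with assumed drive and detuning values, not the paper's numerically optimized pulses.

```python
# Minimal sketch (textbook Rabi physics, assumed parameters): why a magnetic
# field gradient enables selective spin addressing. Two spins separated by the
# gradient acquire different Larmor frequencies; a drive resonant with spin A
# flips it while the detuned spin B barely responds.

import math

OMEGA = 2 * math.pi * 1.0e6      # Rabi frequency of the microwave drive, rad/s (assumed)
DELTA_B = 2 * math.pi * 10.0e6   # gradient-induced detuning of spin B, rad/s (assumed)

def flip_probability(detuning, t):
    """Excited-state population of a two-level spin after driving for time t."""
    omega_eff = math.sqrt(OMEGA**2 + detuning**2)
    return (OMEGA**2 / omega_eff**2) * math.sin(omega_eff * t / 2.0) ** 2

t_pi = math.pi / OMEGA           # pi-pulse duration for the resonant spin
print("spin A (resonant) flip prob:", flip_probability(0.0, t_pi))
print("spin B (detuned)  flip prob:", flip_probability(DELTA_B, t_pi))
```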

    Ball lens embedded through-package via to enable backside coupling between silicon photonics interposer and board-level interconnects

    Development of an efficient and densely integrated optical coupling interface for silicon-photonics-based board-level optical interconnects is one of the key challenges in the domain of 2.5D/3D electro-optic integration. Enabling high-speed on-chip electro-optic conversion and efficient optical transmission across package/board-level short-reach interconnections can help overcome the limitations of conventional electrical I/O in terms of bandwidth density and power consumption in a high-performance computing environment. In this context, we have demonstrated a novel optical coupling interface to integrate silicon photonics with board-level optical interconnects. We show that by integrating a ball lens in a via drilled in an organic package substrate, the optical beam diffracted from a downward-directionality grating on a photonics chip can be coupled to a board-level polymer multimode waveguide with good alignment tolerance. A key experimental result was a 1-dB chip-to-package lateral alignment tolerance of 14 µm for coupling into a polymer waveguide with a 20 × 25 µm² cross-section. An in-depth analysis of the loss distribution across the several interfaces was performed, and a -3.4 dB coupling efficiency was measured for the optical interface comprising the output grating, ball lens, and polymer waveguide. Furthermore, it is shown that an efficiency better than -2 dB can be achieved by tweaking a few parameters of the coupling interface. The fabrication of the optical interfaces and the related measurements are reported and verified against simulation results.
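
    For readers unused to dB bookkeeping, the sketch below converts the quoted -3.4 dB coupling efficiency into linear power and shows how per-interface losses add in dB; the individual loss split is hypothetical, only the -3.4 dB total comes from the abstract.

```python
# Minimal sketch: loss-budget arithmetic for a coupling interface. The
# per-interface numbers are illustrative placeholders; only the -3.4 dB
# total is quoted in the abstract.

def db_to_linear(db):
    """Convert a power ratio in dB to a linear fraction."""
    return 10.0 ** (db / 10.0)

total_db = -3.4                              # grating + ball lens + polymer waveguide
print(f"-3.4 dB corresponds to {db_to_linear(total_db):.0%} of optical power coupled")

# Individual contributions add in dB (multiply in linear power). Hypothetical split:
budget = {"grating directionality": -1.5, "ball-lens interface": -1.0, "waveguide entry": -0.9}
print("budget total:", sum(budget.values()), "dB ->",
      f"{db_to_linear(sum(budget.values())):.0%}")
```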

    CiFHER: A Chiplet-Based FHE Accelerator with a Resizable Structure

    Fully homomorphic encryption (FHE) is in the spotlight as a definitive solution for privacy, but the high computational overhead of FHE poses a challenge to its practical adoption. Although prior studies have attempted to design ASIC accelerators to mitigate the overhead, their designs require excessive amounts of chip resources (e.g., area) to contain and process the massive data involved in FHE operations. We propose CiFHER, a chiplet-based FHE accelerator with a resizable structure, to tackle this challenge with a cost-effective multi-chip module (MCM) design. First, we devise a flexible chiplet core architecture whose configuration can be adjusted to conform to the global organization of chiplets and the design constraints. The distinctive feature of our core is a recomposable functional unit providing varying computational throughput for the number-theoretic transform (NTT), the most dominant function in FHE. Then, we establish generalized data mapping methodologies to minimize the network overhead when organizing the chips into the MCM package in a tiled manner, which becomes a significant bottleneck due to the technology constraints of MCMs. We also analyze the effectiveness of various algorithms, including a novel limb duplication algorithm, on the MCM architecture. A detailed evaluation shows that a CiFHER package composed of 4 to 64 compact chiplets provides performance comparable to state-of-the-art monolithic ASIC FHE accelerators with significantly lower package-wide power consumption, while reducing the area of a single core to as little as 4.28 mm².
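
    Since the NTT is singled out as the dominant FHE kernel, the following sketch implements a toy radix-2 NTT and its inverse; the parameters are deliberately tiny for clarity, whereas CiFHER's recomposable unit targets much larger, hardware-mapped transforms.

```python
# Minimal sketch (not CiFHER's hardware datapath): a radix-2 number-theoretic
# transform (NTT) and its inverse over a toy ring. Parameters: n = 8, modulus
# q = 17, primitive 8th root of unity w = 9. Real FHE parameters are far larger.

Q, N, W = 17, 8, 9

def ntt(a, root):
    """Cooley-Tukey NTT of the list a modulo Q (recursive, O(n log n))."""
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], root * root % Q)
    odd  = ntt(a[1::2], root * root % Q)
    out, w = [0] * n, 1
    for k in range(n // 2):
        t = w * odd[k] % Q
        out[k] = (even[k] + t) % Q            # butterfly: E[k] + w^k * O[k]
        out[k + n // 2] = (even[k] - t) % Q   # butterfly: E[k] - w^k * O[k]
        w = w * root % Q
    return out

def intt(a, root):
    """Inverse NTT: forward transform with root^-1, then scale by n^-1 mod Q."""
    n_inv = pow(len(a), -1, Q)
    return [x * n_inv % Q for x in ntt(a, pow(root, -1, Q))]

coeffs = [3, 1, 4, 1, 5, 9, 2, 6]
assert intt(ntt(coeffs, W), W) == coeffs      # round trip recovers the polynomial
print("NTT(coeffs) =", ntt(coeffs, W))
```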