27 research outputs found

    Characterization of Neural Network Backpropagation on Chiplet-based GPU Architectures

    Advances in parallel computing architectures (e.g., Graphics Processing Units (GPUs)) have had great success in helping meet the performance and energy-efficiency demands of many high-performance computing (HPC) applications. DRAM bandwidth is generally a critical performance bottleneck for many such applications. With the advances in memory technology, the DRAM bandwidth bottleneck is shifting towards other parts of the system hierarchy (e.g., interconnects). We identify neural network backpropagation as one application where the interconnect network is one of the biggest performance bottlenecks. We show that the interconnect bottleneck for backpropagation can be significantly alleviated if computing cores and caching units are carefully tiled (an architecture commonly known as "chiplet") and organized on the interconnect fabric. To simulate a chiplet design, we augment an existing, well-documented GPU simulator, GPGPU-Sim. Our modifications add an additional level of cache between on-chip L1s and an interconnect network-on-chip. This additional layer of cache reduces demand on the interconnect by localizing memory traffic to individual chiplets. We show that under a fixed core budget with additional cache, a chiplet architecture can increase Instructions Per Cycle (IPC) for important CUDA kernels by up to 20% during the training phase.

    The AMD Rome Memory Barrier

    With the rapid growth of AMD as a competitor in the CPU industry, it is imperative that high-performance and architectural engineers analyze new AMD CPUs. By understanding new and unfamiliar architectures, engineers are able to adapt their algorithms to fully utilize new hardware. Furthermore, engineers are able to anticipate the limitations of an architecture and determine when an alternate platform is desirable for a particular workload. This paper presents results which show that the AMD "Rome" architecture performance suffers once an application's memory bandwidth exceeds 37.5 GiB/s for integer-heavy applications, or 100 GiB/s for floating-point-heavy workloads. Strong positive correlations between memory bandwidth and CPI are presented, as well as strong positive correlations between increased memory load and time-to-completion of benchmarks from the SPEC CPU2017 benchmark suites. Comment: Very, very early draft for IEEE SoutheastCon 2017, 9 pages (need to get down to 8), 6 figures, 7 tables

    Towards Cache-Coherent Chiplet-Based Architectures with Wireless Interconnects

    Cache-coherent chiplet-based architectures have gained significant attention due to their potential for scalability and improved performance in modern computing systems. However, the interconnects in such architectures often pose challenges in maintaining cache coherence across chiplets, leading to increased latency and energy consumption. This thesis focuses on exploring the feasibility and advantages of integrating wireless interconnects into cache-coherent chiplet-based architectures. Through extensive simulations of 16- and 64-core systems segmented into 4- and 8-chiplet systems with multiple inter-chiplet latencies, we debug and obtain traffic data. By studying the inter-chiplet traffic for different chiplet-based configurations and analyzing it in terms of spatial, temporal and time variance, we derive that chiplet scaling degrades performance. Further, we formulate the impact of hybrid wired and wireless interconnects and assess the potential performance benefits they offer. The findings from this research will contribute to the design and optimization of cache-coherent chiplet-based architectures, shedding light on the practicality and advantages of utilizing wireless interconnects in future computing systems.

    Design, Extraction, and Optimization Tool Flows and Methodologies for Homogeneous and Heterogeneous Multi-Chip 2.5D Systems

    Chip and packaging industries are making significant progress in 2.5D design as a result of increasing popularity of their application. In advanced high-density 2.5D packages, package redistribution layers become similar to chip Back-End-of-Line routing layers, and the gap between them scales down with pin density improvement. Chiplet-package interactions become significant and severely affect system performance and reliability. Moreover, 2.5D integration offers opportunities to apply novel design techniques. The traditional die-by-die design approach neither carefully considers these interactions nor fully exploits the cross-boundary design opportunities. This thesis presents chiplet-package cross-boundary design, extraction, analysis, and optimization tool flows and methodologies for high-density 2.5D packaging technologies. A holistic flow is presented that can capture all parasitics from chiplets and the package and improve system performance through iterative optimizations. Several design techniques are demonstrated for agile development and quick turn-around time. To validate the flow in silicon, a chip was taped out and studied in TSMC 65nm technology. As the holistic flow cannot handle heterogeneous technologies, in-context flows are presented. Three different flavors of the in-context flow are presented, which offer trade-offs between scalability and accuracy in heterogeneous 2.5D system designs. Inductance is an inseparable part of a package design. A holistic flow is presented that takes package inductance into account in timing analysis and optimization steps. Custom CAD tools are developed to make these flows compatible with the industry standard tools and the foundry model. To prove the effectiveness of the flows, several design cases of an ARM Cortex-M0 are implemented for comparative study.

    ToSHI - Towards Secure Heterogeneous Integration: Security Risks, Threat Assessment, and Assurance

    The semiconductor industry is entering a new age in which device scaling and cost reduction will no longer follow the decades-long pattern. Packing more transistors on a monolithic IC at each node becomes more difficult and expensive. Companies in the semiconductor industry are increasingly seeking technological solutions to close the gap and enhance cost-performance while providing more functionality through integration. Putting all of the operations on a single chip (known as a system on a chip, or SoC) presents several issues, including increased prices and greater design complexity. Heterogeneous integration (HI), which uses advanced packaging technology to merge components that might be designed and manufactured independently using the best process technology, is an attractive alternative. However, although the industry is motivated to move towards HI, many design and security challenges must be addressed. This paper presents a three-tier security approach for secure heterogeneous integration by investigating supply chain security risks, threats, and vulnerabilities at the chiplet, interposer, and system-in-package levels. Furthermore, various possible trust validation methods and attack mitigations are proposed for every level of heterogeneous integration. Finally, we share our vision as a roadmap toward developing security solutions for secure heterogeneous integration.

    Design Challenges of Intra- and Inter- Chiplet Interconnection

    In a chiplet-based many-core system, intra- and inter-chiplet interconnection is key to system performance and power consumption. Intra- and inter-chiplet interconnection networks face several challenges: 1) Fast and accurate simulation is necessary to analyze the performance metrics. 2) An efficient network architecture for inter- and intra-chiplet communication is necessary, including topology, PHY design, and deadlock-free routing algorithms. 3) Deep-learning-based AI systems are demanding more computation power, which calls for efficient and low-power chiplet-based systems. This paper proposes network designs to address these challenges and provides future research directions.

    RapidChiplet: A Toolchain for Rapid Design Space Exploration of Chiplet Architectures

    Chiplet architectures are a promising paradigm to overcome the scaling challenges of monolithic chips. Chiplets offer heterogeneity, modularity, and cost-effectiveness. The design space of chiplet architectures is huge, as there are many degrees of freedom, such as the number, size, and placement of chiplets, the topology of the inter-chiplet interconnect, and many more. Existing tools for cost and performance prediction are often too slow to explore this design space. We present RapidChiplet, a fast, open-source toolchain to predict latency and throughput of the inter-chiplet interconnect, as well as a chip's manufacturing cost and thermal stability.

    ECO-CHIP: Estimation of Carbon Footprint of Chiplet-based Architectures for Sustainable VLSI

    Decades of progress in energy-efficient and low-power design have successfully reduced the operational carbon footprint in the semiconductor industry. However, this has led to an increase in embodied emissions, encompassing carbon emissions arising from design, manufacturing, packaging, and other infrastructural activities. While existing research has developed tools to analyze embodied carbon at the computer architecture level for traditional monolithic systems, these tools do not apply to near-mainstream heterogeneous integration (HI) technologies. HI systems offer significant potential for sustainable computing by minimizing carbon emissions through two key strategies: "reducing" computation by reusing pre-designed chiplet IP blocks and adopting hierarchical approaches to system design. The reuse of chiplets across multiple designs, even spanning multiple generations of integrated circuits (ICs), can substantially reduce embodied carbon emissions throughout the operational lifespan. This paper introduces a carbon analysis tool specifically designed to assess the potential of HI systems in facilitating greener VLSI system design and manufacturing approaches. The tool takes into account scaling, chiplet and packaging yields, design complexity, and even carbon overheads associated with advanced packaging techniques employed in heterogeneous systems. Experimental results demonstrate that HI can achieve a reduction of embodied carbon emissions of up to 70% compared to traditional large monolithic systems. These findings suggest that HI can pave the way for sustainable computing practices, contributing to a more environmentally conscious semiconductor industry. Comment: Under review at HPCA2