4 research outputs found

    Network-on-chip design for a chiplet-based waferscale processor

    Motivated by the breakdown of Moore’s law and Dennard scaling, as well as increasingly large parallel workloads such as machine learning and big-data analysis, processors continue to grow in area and incorporate more computational cores. This growth requires innovation in manufacturing processes to build larger systems, and architectural changes to keep performance scaling acceptably. One significant architectural change is the shift from bus- and crossbar-based processor interconnects to networks-on-chip (NoCs). This thesis details the design of an NoC that enables a shared-memory architecture in a chiplet-based wafer-scale processor, with architectural support for up to 14,336 cores.
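
    The abstract does not describe the routing scheme, so the following is a minimal sketch of dimension-ordered (XY) routing on a 2D-mesh NoC, a common baseline for chiplet and wafer-scale interconnects; the mesh coordinates and hop counting are illustrative assumptions, not the design detailed in the thesis.

        # Minimal sketch: dimension-ordered (XY) routing on a 2D-mesh NoC.
        # Coordinates and hop counting are illustrative assumptions only.
        def xy_route(src, dst):
            """Return the routers visited from src to dst, traversing the
            X dimension first and then Y (deadlock-free on a mesh)."""
            x, y = src
            dx, dy = dst
            path = [(x, y)]
            while x != dx:                  # move along X first
                x += 1 if dx > x else -1
                path.append((x, y))
            while y != dy:                  # then move along Y
                y += 1 if dy > y else -1
                path.append((x, y))
            return path

        hops = xy_route((0, 0), (3, 2))
        print(hops)           # [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2)]
        print(len(hops) - 1)  # 5 hops (the Manhattan distance)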

    HTA: A Scalable High-Throughput Accelerator for Irregular HPC Workloads

    We propose a new architecture, HTA, for high-throughput irregular HPC applications with little data reuse. HTA reduces contention within the memory system with the help of a partitioned memory controller that is amenable to 2.5D implementation using silicon photonics. In terms of scalability, HTA supports 4× as many compute units as state-of-the-art GPU systems. Our simulation-based evaluation on a representative set of HPC benchmarks shows that the proposed design reduces queuing latency by 10% to 30% and reduces the variability in memory access latency by 10% to 60%. Our results show that HTA reduces the L1 miss penalty by 2.3× to 5× relative to GPUs. When compared to a multi-GPU system with the same number of compute units, our simulation results show that HTA can provide up to 2× speedup.
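
    The abstract does not specify how the partitioned memory controller distributes traffic, so the sketch below is only a toy illustration of one plausible mechanism: interleaving requests across independent controller partitions by cache-line address, so that queuing is spread out rather than concentrated at a single controller. The partition count, line size, and function names are assumptions for illustration, not HTA's actual scheme.

        # Toy illustration (not HTA's scheme): spread memory requests across
        # independent controller partitions by cache-line address interleaving.
        from collections import deque

        NUM_PARTITIONS = 8          # assumed number of controller partitions
        LINE_BYTES = 64             # assumed cache-line size

        queues = [deque() for _ in range(NUM_PARTITIONS)]

        def enqueue_request(addr, is_write=False):
            """Map a physical address to a partition queue by interleaving."""
            partition = (addr // LINE_BYTES) % NUM_PARTITIONS
            queues[partition].append((addr, is_write))
            return partition

        # A burst of strided reads lands evenly across the partitions.
        for i in range(1024):
            enqueue_request(i * LINE_BYTES)
        print([len(q) for q in queues])   # [128, 128, 128, 128, 128, 128, 128, 128]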

    Massive Data-Centric Parallelism in the Chiplet Era

    Traditionally, massively parallel applications are executed on distributed systems, where computing nodes are distant enough that parallelization schemes must minimize communication and synchronization to achieve scalability. Mapping communication-intensive workloads to distributed systems requires complicated problem partitioning and dataset pre-processing. With the current AI-driven trend of thousands of interconnected processors per chip, there is an opportunity to rethink these communication-bottlenecked workloads. The bottleneck often arises from data-structure traversals, which cause irregular memory accesses and poor cache locality. Recent work has introduced task-based parallelization schemes to accelerate graph traversal and other sparse workloads: data-structure traversals are split into tasks and pipelined across processing units (PUs). Dalorex demonstrated the highest scalability (up to thousands of PUs on a single chip) by keeping the entire dataset on chip, scattered across PUs, and executing each task at the PU where its data is local. However, it also raised the questions of how to scale to larger datasets when all the memory is on chip, and at what cost. To address these challenges, we propose a scalable architecture composed of a grid of Data-Centric Reconfigurable Array (DCRA) chiplets. Package-time reconfiguration enables chip products that optimize for different target metrics, such as time-to-solution, energy, or cost, while software reconfiguration avoids network saturation when scaling to millions of PUs across many chip packages. We evaluate six applications and four datasets, with several configurations and memory technologies, to provide a detailed analysis of the performance, power, and cost of data-local execution at scale. Our parallelization of Breadth-First Search on RMAT-26 across a million PUs reaches 3323 GTEPS.
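
    As a back-of-the-envelope check on the reported throughput, GTEPS (giga-traversed edges per second) can be converted into a traversal time. The arithmetic below assumes a Graph500-style RMAT-26 graph with 2^26 vertices and an edge factor of 16 (about 1.07 billion edges); the paper's exact edge count may differ.

        # Assumption: Graph500-style RMAT-26, 2**26 vertices, edge factor 16.
        vertices = 2 ** 26
        edge_factor = 16
        edges = vertices * edge_factor            # ~1.07e9 edges
        gteps = 3323                              # reported BFS throughput
        traversal_time_s = edges / (gteps * 1e9)  # seconds per full BFS
        print(f"{edges:.2e} edges, BFS time ~ {traversal_time_s * 1e6:.0f} us")
        # -> 1.07e+09 edges, BFS time ~ 323 us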

    Hardware and Software Optimizations for Accelerating Deep Neural Networks: Survey of Current Trends, Challenges, and the Road Ahead

    Machine Learning (ML) is becoming ubiquitous in everyday life. Deep Learning (DL) is already present in many applications, ranging from computer vision for medicine to autonomous driving of modern cars, as well as in sectors such as security, healthcare, and finance. However, to achieve impressive performance, these algorithms employ very deep networks that require significant computational power, both at training and at inference time. A single inference of a DL model may require billions of multiply-and-accumulate (MAC) operations, making DL extremely compute- and energy-hungry. In a scenario where several sophisticated algorithms must be executed with limited energy and low latency, cost-effective hardware platforms capable of energy-efficient DL execution are needed. This paper first introduces the key properties of two brain-inspired models, the Deep Neural Network (DNN) and the Spiking Neural Network (SNN), and then analyzes techniques to produce efficient and high-performance designs. It summarizes and compares work on the four leading platforms for executing these algorithms, namely CPUs, GPUs, FPGAs, and ASICs, describing the main state-of-the-art solutions and giving particular prominence to the last two, since they offer greater design flexibility and the potential for high energy efficiency, especially for inference. In addition to hardware solutions, the paper discusses some of the important security issues that DNN and SNN models face during execution, and offers a comprehensive section on benchmarking, explaining how to assess the quality of different networks and of the hardware systems designed for them.
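
    To make the "billions of multiply-and-accumulate operations" figure concrete, the sketch below counts MACs for a single convolutional layer using the standard formula MACs = H_out * W_out * C_out * C_in * K * K. The layer dimensions are illustrative (chosen to resemble a ResNet-style 3x3 layer) and are not taken from the survey.

        # MAC count for one convolution layer (illustrative dimensions only):
        # MACs = H_out * W_out * C_out * C_in * K * K
        def conv_macs(h_out, w_out, c_out, c_in, k):
            return h_out * w_out * c_out * c_in * k * k

        # A ResNet-style 3x3 layer: 56x56 output, 64 input and 64 output channels.
        layer_macs = conv_macs(56, 56, 64, 64, 3)
        print(f"{layer_macs / 1e6:.0f} million MACs in one layer")  # ~116
        # Dozens of such layers per forward pass quickly push a single
        # inference into the billions of MACs.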