10,543 research outputs found

    Characterization of Neural Network Backpropagation on Chiplet-based GPU Architectures

    Get PDF
    Advances in parallel computing architectures (e.g., Graphics Processing Units (GPUs)) have had great success in helping meet the performance and energy-efficiency demands of many high-performance computing (HPC) applications. DRAM bandwidth is generally a critical performance bottleneck for many of such applications. With the advances in memory technology, the DRAM bandwidth bottleneck is shifting towards other parts of the system hierarchy (e.g., interconnects). We identify neural network backpropagation as one application where the interconnect network is one of the biggest performance bottlenecks. We show that the interconnect bottleneck for backpropagation can be significantly alleviated if computing cores and caching units are carefully tiled (an architecture commonly known as ``chiplet ) and organized on the interconnect fabric. To simulate a chiplet design, we augment an existing, well-documented GPU simulator, GPGPU-Sim. Our modifications add an additional level of cache between on-chip L1s and an interconnect network-on-chip. This additional layer of cache reduces demand on the interconnect by localizing memory traffic to individual chiplets. We show that under a fixed core budget with additional cache, a chiplet architecture can increase Instruction Per Cycle (IPC) counts for important CUDA kernels by up to 20% during the training phase

    The Octopus switch

    Get PDF
    This chapter1 discusses the interconnection architecture of the Mobile Digital Companion. The approach to build a low-power handheld multimedia computer presented here is to have autonomous, reconfigurable modules such as network, video and audio devices, interconnected by a switch rather than by a bus, and to offload as much as work as possible from the CPU to programmable modules placed in the data streams. Thus, communication between components is not broadcast over a bus but delivered exactly where it is needed, work is carried out where the data passes through, bypassing the memory. The amount of buffering is minimised, and if it is required at all, it is placed right on the data path, where it is needed. A reconfigurable internal communication network switch called Octopus exploits locality of reference and eliminates wasteful data copies. The switch is implemented as a simplified ATM switch and provides Quality of Service guarantees and enough bandwidth for multimedia applications. We have built a testbed of the architecture, of which we will present performance and energy consumption characteristics

    Nature-Inspired Interconnects for Self-Assembled Large-Scale Network-on-Chip Designs

    Get PDF
    Future nano-scale electronics built up from an Avogadro number of components needs efficient, highly scalable, and robust means of communication in order to be competitive with traditional silicon approaches. In recent years, the Networks-on-Chip (NoC) paradigm emerged as a promising solution to interconnect challenges in silicon-based electronics. Current NoC architectures are either highly regular or fully customized, both of which represent implausible assumptions for emerging bottom-up self-assembled molecular electronics that are generally assumed to have a high degree of irregularity and imperfection. Here, we pragmatically and experimentally investigate important design trade-offs and properties of an irregular, abstract, yet physically plausible 3D small-world interconnect fabric that is inspired by modern network-on-chip paradigms. We vary the framework's key parameters, such as the connectivity, the number of switch nodes, the distribution of long- versus short-range connections, and measure the network's relevant communication characteristics. We further explore the robustness against link failures and the ability and efficiency to solve a simple toy problem, the synchronization task. The results confirm that (1) computation in irregular assemblies is a promising and disruptive computing paradigm for self-assembled nano-scale electronics and (2) that 3D small-world interconnect fabrics with a power-law decaying distribution of shortcut lengths are physically plausible and have major advantages over local 2D and 3D regular topologies

    Interconnect Fabric Reconfigurability for Network on Chip

    Get PDF
    Microprocessor architectures are evolving at a pace greater than ever before. To meet the industry’s stringent power, performance and cost demands there is a rising trend towards building heterogeneous processors with both CPU cores and off-chip components on the same chip. This is known as a System on Chip. These systems show promising solutions including chip interconnects consisting of Network on Chips (NoCs). These NoCs are composed of routers that control traffic, and channels used to connect different components of the chip itself together. Depending on the processor core's type, specifications, and technology used, the NoC fabrics may consume anywhere ranging from 28% to 40% of the total system power. To reduce this significant power consumption, various solutions were proposed targeting CMOS technology. In this work we focus on NoC topology improvements and reconfigurability using novel VeSFET technology. The work deploys tools used to simulate full systems, such as GPGPUSIM, to evaluate the possible performance/power gains of a hybrid CMOS-VeSFET system. This hybrid system includes CMOS core and memory layers, while the NoC layer is made up of VeSFET transistors. This allows for shorter wire lengths between routers and cores, as well as it permits for extra area to include network reconfigurability features. The necessary modifications to build this hybrid system are area changes due to VeSFET additional layer, routing length changes, pipelining changes, and VeSFET technology parameter additions. The tools modifications necessary to include this system are described in further details in this thesis. The gathered data indicates great promise for the hybrid reconfigurable CMOS-VeSFET system over the conventional non-reconfigurable CMOS system. It is demonstrated that the hybrid VeSFET system has both a power decrease of approximately 57.0% and a performance increase of approximately 50.2%

    Multiport VNA Measurements

    Get PDF
    This article presents some of the most recent multiport VNA measurement methodologies used to characterize these highspeed digital networks for signal integrity. There will be a discussion of the trends and measurement challenges of high-speed digital systems, followed by a presentation of the multiport VNA measurement system details, calibration, and measurement techniques, as well as some examples of interconnect device measurements. The intent here is to present some general concepts and trends for multiport VNA measurements as applied to computer system board-level interconnect structures, and not to promote any particular brand or produc

    A Micro Power Hardware Fabric for Embedded Computing

    Get PDF
    Field Programmable Gate Arrays (FPGAs) mitigate many of the problemsencountered with the development of ASICs by offering flexibility, faster time-to-market, and amortized NRE costs, among other benefits. While FPGAs are increasingly being used for complex computational applications such as signal and image processing, networking, and cryptology, they are far from ideal for these tasks due to relatively high power consumption and silicon usage overheads compared to direct ASIC implementation. A reconfigurable device that exhibits ASIC-like power characteristics and FPGA-like costs and tool support is desirable to fill this void. In this research, a parameterized, reconfigurable fabric model named as domain specific fabric (DSF) is developed that exhibits ASIC-like power characteristics for Digital Signal Processing (DSP) style applications. Using this model, the impact of varying different design parameters on power and performance has been studied. Different optimization techniques like local search and simulated annealing are used to determine the appropriate interconnect for a specific set of applications. A design space exploration tool has been developed to automate and generate a tailored architectural instance of the fabric.The fabric has been synthesized on 160 nm cell-based ASIC fabrication process from OKI and 130 nm from IBM. A detailed power-performance analysis has been completed using signal and image processing benchmarks from the MediaBench benchmark suite and elsewhere with comparisons to other hardware and software implementations. The optimized fabric implemented using the 130 nm process yields energy within 3X of a direct ASIC implementation, 330X better than a Virtex-II Pro FPGA and 2016X better than an Intel XScale processor
    • …
    corecore