21,327 research outputs found

    The effect of interconnect design on the performance of large L2 caches

    Get PDF
    Journal ArticleThe ever increasing sizes of on-chip caches and the growing domination of wire delay have changed the traditional design approach of the memory hierarchy. Many recent proposals advocate splitting the cache into a large number of banks and employ an on-chip network to allow fast access to nearby banks (referred to as Non-Uniform Cache Architectures (NUCA)). While these proposals focus on optimizing logical policies (placement, searching, and movement) associated with a cache design, initial design choices do not include the complexity of the network. With wire delay being the major performance limiting factor in modern processors, components designed without including wire parameters and network overhead will be sub-optimal with respect to both delay and power. The primary contributions of this work are: 1. An extension of the current version of CACTI to include network overhead and find the optimal design point for large on-chip caches. 2. An evaluation of novel techniques at the microarchitecture level that exploit special wires in the L2 cache network to improve performance

    Heracles: Fully Synthesizable Parameterized MIPS-Based Multicore System

    Get PDF
    Heracles is an open-source complete multicore system written in Verilog. It is fully parameterized and can be reconfigured and synthesized into different topologies and sizes. Each processing node has a 7-stage pipeline, fully bypassed, microprocessor running the MIPS-III ISA, a 4-stage input-buffer, virtual-channel router, and a local variable-size shared memory. Our design is highly modular with clear interfaces between the core, the memory hierarchy, and the on-chip network. In the baseline design, the microprocessor is attached to two caches, one instruction cache and one data cache, which are oblivious to the global memory organization. The memory system in Heracles can be configured as one single global shared memory (SM), or distributed shared memory (DSM), or any combination thereof. Each core is connected to the rest of the network of processors by a parameterized, realistic, wormhole router. We show different topology configurations of the system, and their synthesis results on the Xilinx Virtex-5 LX330T FPGA board. We also provide a small MIPS cross-compiler toolchain to assist in developing software for Heracles

    Low Power Processor Architectures and Contemporary Techniques for Power Optimization – A Review

    Get PDF
    The technological evolution has increased the number of transistors for a given die area significantly and increased the switching speed from few MHz to GHz range. Such inversely proportional decline in size and boost in performance consequently demands shrinking of supply voltage and effective power dissipation in chips with millions of transistors. This has triggered substantial amount of research in power reduction techniques into almost every aspect of the chip and particularly the processor cores contained in the chip. This paper presents an overview of techniques for achieving the power efficiency mainly at the processor core level but also visits related domains such as buses and memories. There are various processor parameters and features such as supply voltage, clock frequency, cache and pipelining which can be optimized to reduce the power consumption of the processor. This paper discusses various ways in which these parameters can be optimized. Also, emerging power efficient processor architectures are overviewed and research activities are discussed which should help reader identify how these factors in a processor contribute to power consumption. Some of these concepts have been already established whereas others are still active research areas. © 2009 ACADEMY PUBLISHER

    MGSim - Simulation tools for multi-core processor architectures

    Get PDF
    MGSim is an open source discrete event simulator for on-chip hardware components, developed at the University of Amsterdam. It is intended to be a research and teaching vehicle to study the fine-grained hardware/software interactions on many-core and hardware multithreaded processors. It includes support for core models with different instruction sets, a configurable multi-core interconnect, multiple configurable cache and memory models, a dedicated I/O subsystem, and comprehensive monitoring and interaction facilities. The default model configuration shipped with MGSim implements Microgrids, a many-core architecture with hardware concurrency management. MGSim is furthermore written mostly in C++ and uses object classes to represent chip components. It is optimized for architecture models that can be described as process networks.Comment: 33 pages, 22 figures, 4 listings, 2 table

    A Many-Core Overlay for High-Performance Embedded Computing on FPGAs

    Get PDF
    In this work, we propose a configurable many-core overlay for high-performance embedded computing. The size of internal memory, supported operations and number of ports can be configured independently for each core of the overlay. The overlay was evaluated with matrix multiplication, LU decomposition and Fast-Fourier Transform (FFT) on a ZYNQ-7020 FPGA platform. The results show that using a system-level many-core overlay avoids complex hardware design and still provides good performance results.Comment: Presented at First International Workshop on FPGAs for Software Programmers (FSP 2014) (arXiv:1408.4423

    Studies on the Impact of Cache Configuration on Multicore Processor

    Get PDF
    The demand for a powerful memory subsystem is increasing with increase in the number of cores in a multicore processor. The technology adapted to meet the above demands are: increasing the cache size, increasing the number of levels of caches and bymeans of a powerful interconnection network. Caches feeds the processing element at a faster rate. They also provide high bandwidth local memory to work with. In this research, an attempt has beenmade to analyze the impact of cache size on performance of multicore processors by varying L1 and L2 cache size on the multicore processor with internal network (MPIN), also referenced from NIAGRA architecture. As the number of cores increases, traditional on-chip interconnect like bus and crossbar proves to be less efficient as well as suffers from poor scalability. In order to overcome the scalability and efficiency issues in these conventional interconnects, ring based design has been proposed. The effect of interconnect on the performance of multicore processors has been analyzed and a novel scalable on-chip interconnection mechanism (INoC) for multicore processors has been proposed. The benchmark results are presented using a full system simulator. Results shows that, using the proposed INoC,execution time can be significantly reduced, compared with MPIN.Cache size and set-associativity are the features on which the cache performance is dependent. If the cache size is doubled, then the cache performance can increase but at the cost of high hardware, larger area and more power consumption. Moreover, considering the small form-factor of themobile processors, increase in cache size affects the device size and battery running time. Re-organization and reanalysis of cache onfiguration ofmobile processors are required for achieving better cache performance, lower power consumption and chip area. With identical cache size, performance gained can be obtained from a novel cache mechanism. For simulation, we used SPLASH2 benchmark suite
    corecore