18 research outputs found

    High Performance On-Chip Interconnects Design for Future Many-Core Architectures

    Get PDF
    Switch-based Network-on-Chip (NoC) is a widely accepted inter-core communication infrastructure for Chip Multiprocessors (CMPs). With the continued advance of CMOS technology, the number of cores on a single chip keeps increasing at a rapid pace. It is highly expected that many-core architectures with more than hundreds of processor cores will be commercialized in the near future. In such a large scale CMP system, NoC overheads are more dominant than computation power in determining overall system performance. Also, for modern computational workloads requiring abundant thread level parallelism (TLP), NoC design for highly-parallel, many-core accelerators such as General Purpose Graphics Processing Units (GPGPUs) is of primary importance in harnessing the potential of massive thread- and data-level parallelism. In these contexts, it is critical that NoC provides both low latency and high bandwidth within limited on-chip power and area budgets. In this dissertation, we explore various design issues inherent in future many-core architectures, CMPs and GPGPUs, to achieve both high performance and power efficiency. First, we deal with issues in using a promising next generation memory technology, Spin-Transfer Torque Magnetic RAM (STT-MRAM), for NoC input buffers in CMPs. Using a high density and low leakage memory offers more buffer capacities with the same die footprint, thus helping increase network throughput in NoC routers. However, its long latency and high power consumption in write operations still need to be addressed. Thus, we propose a hybrid design of input buffers using both SRAM and STT-MRAM to hide the long write latency efficiently. Considering that simple data migration in the hybrid buffer consumes more dynamic power compared to SRAM, we provide a lazy migration scheme that reduces the dynamic power consumption of the hybrid buffer. Second, we propose the first NoC router design that uses only STT-MRAM, providing much larger buffer space with less power consumption, while preserving data integrity. To hide the multicycle writes, we employ a multibank STT-MRAM buffer, a virtual channel with multiple banks where every incoming flit is seamlessly pipelined to each bank alternately. Our STT-MRAM design has aggressively reduced the retention time, resulting in a significant reduction in the latency and power overheads of write operations. To ensure data integrity against inadvertent bit flips from the thermal fluctuation during the given retention time, we propose a cost-efficient dynamic buffer refresh scheme combined with Error Correcting Codes (ECC) to detect and correct data corruption. Third, we present schemes for bandwidth-efficient on-chip interconnects in GPGPUs. GPGPUs place a heavy demand on the on-chip interconnect between the many cores and a few memory controllers (MCs). Thus, traffic is highly asymmetric, impacting on-chip resource utilization and system performance. Here, we analyze the communication demands of typical GPGPU applications, and propose efficient NoC designs to meet those demands

    MOCAST 2021

    Get PDF
    The 10th International Conference on Modern Circuit and System Technologies on Electronics and Communications (MOCAST 2021) will take place in Thessaloniki, Greece, from July 5th to July 7th, 2021. The MOCAST technical program includes all aspects of circuit and system technologies, from modeling to design, verification, implementation, and application. This Special Issue presents extended versions of top-ranking papers in the conference. The topics of MOCAST include:Analog/RF and mixed signal circuits;Digital circuits and systems design;Nonlinear circuits and systems;Device and circuit modeling;High-performance embedded systems;Systems and applications;Sensors and systems;Machine learning and AI applications;Communication; Network systems;Power management;Imagers, MEMS, medical, and displays;Radiation front ends (nuclear and space application);Education in circuits, systems, and communications

    Application Centric Networks-On-Chip Design Solutions for Future Multicore Systems

    Get PDF
    With advances in technology, future multicore systems scaled to 100s and 1000s of cores/accelerators are being touted as an effective solution for extracting huge performance gains using parallel programming paradigms. However with the failure of Dennard Scaling all the components on the chip cannot be run simultaneously without breaking the power and thermal constraints leading to strict chip power envelops. The scaling up of the number of on chip components has also brought upon Networks-On-Chip (NoC) based interconnect designs like 2D mesh. The contribution of NoC to the total on chip power and overall performance has been increasing steadily and hence high performance power-efficient NoC designs are becoming crucial. Future multicore paradigms can be broadly classified, based on the applications they are tailored to, into traditional Chip Multi processor(CMP) based application based systems, characterized by low core and NoC utilization, and emerging big data application based systems, characterized by large amounts of data movement necessitating high throughput requirements. To this order, we propose NoC design solutions for power-savings in future CMPs tailored to traditional applications and higher effective throughput gains in multicore systems tailored to bandwidth intensive applications. First, we propose Fly-over, a light-weight distributed mechanism for power-gating routers attached to switched off cores to reduce NoC power consumption in low load CMP environment. Secondly, we plan on utilizing a promising next generation memory technology, Spin-Transfer Torque Magnetic RAM(STT-MRAM), to achieve enhanced NoC performance to satisfy the high throughput demands in emerging bandwidth intensive applications, while reducing the power consumption simultaneously. Thirdly, we present a hardware data approximation framework for NoCs, APPROX-NoC, with an online data error control mechanism, which can leverage the approximate computing paradigm in the emerging data intensive big data applications to attain higher performance per watt

    Scaling High-Performance Interconnect Architectures to Many-Core Systems.

    Full text link
    The ever-increasing demand for performance scaling has made multi-core (2-8 cores) chips prevalent in today’s computing systems and foreshadows the shift toward many-core (10s- 100s of cores) chips in the near future. Although the potential performance gains from many-core systems remain appealing, the widespread adoption of these systems hinges on their ability to scale performance while simultaneously satisfying Quality-of-Service (QoS) and energy-efficiency constraints. This work makes the case that the interconnect for these many-core systems has a significant impact on the aforementioned scalability issues. The impact of interconnects on many-core systems is illustrated by observing that the degree of the interconnect has a signicant effect on system scalability and demonstrating that the architecture of high-radix, many-core systems are feasible, energy-efficient, and high-performance. The feasibility of high-radix crossbars for many-core systems is first shown through a new circuit-level building block called the Swizzle-Switch which can operate at frequencies up to 1.5GHz for 128-bit, radix-64 crossbars. This work then shows how a many-core system called the Swizzle-Switch Network (SSN) can use the Swizzle-Switch as the central building block for a flat crossbar interconnect. The SSN is shown to be advantageous to traditional Network-on-Chip (NoC) for systems up to 64 cores. The SSN performance by 21% relative to a Mesh while also providing a 25% energy savings over the Mesh. The Swizzle-Switch is also leveraged as a building block for high-radix NoC topologies that can support many-core architectures. The Swizzle-Switch-based Flattened Butterfly topology is demonstrated to provide a 15% speedup and 10% energy savings over the Mesh. Finally, the impact that 3D stacking technology has on many-core scalability is evaluated for bus and crossbar interconnects. A 3D-optimized Swizzle-Switch Network is able to leverage frequency gains to achieve a 15-28% speedup over a 2D-Swizzle-Switch Network when using memory- intensive benchmarks. Additionally, a bus-based 64-core architecture is shown to provide an average speedup of 49× over a baseline uniprocessor system when using 3D technology.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/93980/1/ksewell_1.pd

    Non-minimal adaptive routing for efficient interconnection networks

    Get PDF
    RESUMEN: La red de interconexión es un concepto clave de los sistemas de computación paralelos. El primer aspecto que define una red de interconexión es su topología. Habitualmente, las redes escalables y eficientes en términos de coste y consumo energético tienen bajo diámetro y se basan en topologías que encaran el límite de Moore y en las que no hay diversidad de caminos mínimos. Una vez definida la topología, quedando implícitamente definidos los límites de rendimiento de la red, es necesario diseñar un algoritmo de enrutamiento que se acerque lo máximo posible a esos límites y debido a la ausencia de caminos mínimos, este además debe explotar los caminos no mínimos cuando el tráfico es adverso. Estos algoritmos de enrutamiento habitualmente seleccionan entre rutas mínimas y no mínimas en base a las condiciones de la red. Las rutas no mínimas habitualmente se basan en el algoritmo de balanceo de carga propuesto por Valiant, esto implica que doblan la longitud de las rutas mínimas y por lo tanto, la latencia soportada por los paquetes se incrementa. En cuanto a la tecnología, desde su introducción en entornos HPC a principios de los años 2000, Ethernet ha sido usado en un porcentaje representativo de los sistemas. Esta tesis introduce una implementación realista y competitiva de una red escalable y sin pérdidas basada en dispositivos de red Ethernet commodity, considerando topologías de bajo diámetro y bajo consumo energético y logrando un ahorro energético de hasta un 54%. Además, propone un enrutamiento sobre la citada arquitectura, en adelante QCN-Switch, el cual selecciona entre rutas mínimas y no mínimas basado en notificaciones de congestión explícitas. Una vez implementada la decisión de enrutar siguiendo rutas no mínimas, se introduce un enrutamiento adaptativo en fuente capaz de adaptar el número de saltos en las rutas no mínimas. Este enrutamiento, en adelante ACOR, es agnóstico de la topología y mejora la latencia en hasta un 28%. Finalmente, se introduce un enrutamiento dependiente de la topología, en adelante LIAN, que optimiza el número de saltos de las rutas no mínimas basado en las condiciones de la red. Los resultados de su evaluación muestran que obtiene una latencia cuasi óptima y mejora el rendimiento de algoritmos de enrutamiento actuales reduciendo la latencia en hasta un 30% y obteniendo un rendimiento estable y equitativo.ABSTRACT: Interconnection network is a key concept of any parallel computing system. The first aspect to define an interconnection network is its topology. Typically, power and cost-efficient scalable networks with low diameter rely on topologies that approach the Moore bound in which there is no minimal path diversity. Once the topology is defined, the performance bounds of the network are determined consequently, so a suitable routing algorithm should be designed to accomplish as much as possible of those limits and, due to the lack of minimal path diversity, it must exploit non-minimal paths when the traffic pattern is adversarial. These routing algorithms usually select between minimal and non-minimal paths based on the network conditions, where the non-minimal paths are built according to Valiant load-balancing algorithm. This implies that these paths double the length of minimal ones and then the latency supported by packets increases. Regarding the technology, from its introduction in HPC systems in the early 2000s, Ethernet has been used in a significant fraction of the systems. This dissertation introduces a realistic and competitive implementation of a scalable lossless Ethernet network for HPC environments considering low-diameter and low-power topologies. This allows for up to 54% power savings. Furthermore, it proposes a routing upon the cited architecture, hereon QCN-Switch, which selects between minimal and non-minimal paths per packet based on explicit congestion notifications instead of credits. Once the miss-routing decision is implemented, it introduces two mechanisms regarding the selection of the intermediate switch to develop a source adaptive routing algorithm capable of adapting the number of hops in the non-minimal paths. This routing, hereon ACOR, is topology-agnostic and improves average latency in all cases up to 28%. Finally, a topology-dependent routing, hereon LIAN, is introduced to optimize the number of hops in the non-minimal paths based on the network live conditions. Evaluations show that LIAN obtains almost-optimal latency and outperforms state-of-the-art adaptive routing algorithms, reducing latency by up to 30.0% and providing stable throughput and fairness.This work has been supported by the Spanish Ministry of Education, Culture and Sports under grant FPU14/02253, the Spanish Ministry of Economy, Industry and Competitiveness under contracts TIN2010-21291-C02-02, TIN2013-46957-C2-2-P, and TIN2013-46957-C2-2-P (AEI/FEDER, UE), the Spanish Research Agency under contract PID2019-105660RBC22/AEI/10.13039/501100011033, the European Union under agreements FP7-ICT-2011- 7-288777 (Mont-Blanc 1) and FP7-ICT-2013-10-610402 (Mont-Blanc 2), the University of Cantabria under project PAR.30.P072.64004, and by the European HiPEAC Network of Excellence through an internship grant supported by the European Union’s Horizon 2020 research and innovation program under grant agreement No. H2020-ICT-2015-687689

    DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

    Full text link
    Data movement between the CPU and main memory is a first-order obstacle against improving performance, scalability, and energy efficiency in modern systems. Computer systems employ a range of techniques to reduce overheads tied to data movement, spanning from traditional mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging techniques such as Near-Data Processing (NDP), where some computation is moved close to memory. Our goal is to methodically identify potential sources of data movement over a broad set of applications and to comprehensively compare traditional compute-centric data movement mitigation techniques to more memory-centric techniques, thereby developing a rigorous understanding of the best techniques to mitigate each source of data movement. With this goal in mind, we perform the first large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory. We develop the first systematic methodology to classify applications based on the sources contributing to data movement bottlenecks. From our large-scale characterization of 77K functions across 345 applications, we select 144 functions to form the first open-source benchmark suite (DAMOV) for main memory data movement studies. We select a diverse range of functions that (1) represent different types of data movement bottlenecks, and (2) come from a wide range of application domains. Using NDP as a case study, we identify new insights about the different data movement bottlenecks and use these insights to determine the most suitable data movement mitigation mechanism for a particular application. We open-source DAMOV and the complete source code for our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.Comment: Our open source software is available at https://github.com/CMU-SAFARI/DAMO

    Design of Optical Interconnect Transceiver Circuits and Network-on-chip Architectures for Inter- and Intra-chip Communication

    Get PDF
    The rapid expansion in data communication due to the increased multimedia applications and cloud computing services necessitates improvements in optical transceiver circuitry power efficiency as these systems scale well past 10 Gb/s. In order to meet these requirements, a 26 GHz transimpedance amplifier (TIA) is presented in a 0.25-µm SiGe BiCMOS technology. It employs a transformer-based regulated cascode (RGC) input stage which provides passive negative-feedback gain that enhances the effective transconductance of the TIA’s input common-base transistor; reducing the input resistance and pro- viding considerable bandwidth extension without significant noise degradation or power consumption. The TIA achieves a 53 dBΩ single-ended transimpedance gain with a 26√ GHz bandwidth and 21.3 pA/H z average input-referred noise current spectral density. Total chip power including output buffering is 28.2 mW from a 2.5 V supply, with the core TIA consuming 8.2 mW, and the chip area including pads is 960 µm × 780 µm. With the advance of photonic devices, optical interconnects becomes a promising technology to replace the conventional electrical channels for the high-bandwidth and power efficient inter/intra-chip interconnect. Second, a silicon photonic transceiver is presented for a silicon ring resonator-based optical interconnect architecture in a 1V standard 65nm CMOS technology. The transmitter circuits incorporate high-swing drivers with non-linear pre-emphasis and automatic bias-based tuning for resonance wavelength stabilization. An optical forwarded-clock adaptive inverter-based transimpedance amplifier (TIA) receiver trades-off power for varying link budgets by employing an on-die eye monitor and scaling the TIA supply for the required sensitivity. At 5 GB/s operation, the ring modulator un- der 4Vpp driver achieves 12.7dB extinction ratio with 4.04mW power consumption, while a 0.28nm tuning range is obtained at 6.8µW/GHz efficiency with the bias-based tuning scheme implemented with the 2Vpp transmitter. When tested with a wire-bonded 150f- F p-i-n photodetector, the receiver achieves -12.7dBm sensitivity at a BER=10−15 and consumes 2.2mW at 8 GB/s. Third, a novel Nano-Photonic Network-on-Chip (NoC) architecture, called LumiNoC, is proposed for high performance and power-efficient interconnects for the chip-multi- processors (CMPs). A 64-node LumiNoC under synthetic traffic enjoys 50% less latency at low loads versus other reported photonic NoCs, and ∼25% less latency versus the electrical 2D mesh NoCs on realistic workloads. Under the same ideal throughput, LumiNoC achieves laser power reduction of 78%, and overall power reduction of 44% versus competing designs

    Low-Power Embedded Design Solutions and Low-Latency On-Chip Interconnect Architecture for System-On-Chip Design

    Get PDF
    This dissertation presents three design solutions to support several key system-on-chip (SoC) issues to achieve low-power and high performance. These are: 1) joint source and channel decoding (JSCD) schemes for low-power SoCs used in portable multimedia systems, 2) efficient on-chip interconnect architecture for massive multimedia data streaming on multiprocessor SoCs (MPSoCs), and 3) data processing architecture for low-power SoCs in distributed sensor network (DSS) systems and its implementation. The first part includes a low-power embedded low density parity check code (LDPC) - H.264 joint decoding architecture to lower the baseband energy consumption of a channel decoder using joint source decoding and dynamic voltage and frequency scaling (DVFS). A low-power multiple-input multiple-output (MIMO) and H.264 video joint detector/decoder design that minimizes energy for portable, wireless embedded systems is also designed. In the second part, a link-level quality of service (QoS) scheme using unequal error protection (UEP) for low-power network-on-chip (NoC) and low latency on-chip network designs for MPSoCs is proposed. This part contains WaveSync, a low-latency focused network-on-chip architecture for globally-asynchronous locally-synchronous (GALS) designs and a simultaneous dual-path routing (SDPR) scheme utilizing path diversity present in typical mesh topology network-on-chips. SDPR is akin to having a higher link width but without the significant hardware overhead associated with simple bus width scaling. The last part shows data processing unit designs for embedded SoCs. We propose a data processing and control logic design for a new radiation detection sensor system generating data at or above Peta-bits-per-second level. Implementation results show that the intended clock rate is achieved within the power target of less than 200mW. We also present a digital signal processing (DSP) accelerator supporting configurable MAC, FFT, FIR, and 3-D cross product operations for embedded SoCs. It consumes 12.35mW along with 0.167mm2 area at 333MHz
    corecore