430 research outputs found

    Static and Dynamic Scheduling for Effective Use of Multicore Systems

    Get PDF
    Multicore systems have increasingly gained importance in high performance computers. Compared to the traditional microarchitectures, multicore architectures have a simpler design, higher performance-to-area ratio, and improved power efficiency. Although the multicore architecture has various advantages, traditional parallel programming techniques do not apply to the new architecture efficiently. This dissertation addresses how to determine optimized thread schedules to improve data reuse on shared-memory multicore systems and how to seek a scalable solution to designing parallel software on both shared-memory and distributed-memory multicore systems. We propose an analytical cache model to predict the number of cache misses on the time-sharing L2 cache on a multicore processor. The model provides an insight into the impact of cache sharing and cache contention between threads. Inspired by the model, we build the framework of affinity based thread scheduling to determine optimized thread schedules to improve data reuse on all the levels in a complex memory hierarchy. The affinity based thread scheduling framework includes a model to estimate the cost of a thread schedule, which consists of three submodels: an affinity graph submodel, a memory hierarchy submodel, and a cost submodel. Based on the model, we design a hierarchical graph partitioning algorithm to determine near-optimal solutions. We have also extended the algorithm to support threads with data dependences. The algorithms are implemented and incorporated into a feedback directed optimization prototype system. The prototype system builds upon a binary instrumentation tool and can improve program performance greatly on shared-memory multicore architectures. We also study the dynamic data-availability driven scheduling approach to designing new parallel software on distributed-memory multicore architectures. We have implemented a decentralized dynamic runtime system. The design of the runtime system is focused on the scalability metric. At any time only a small portion of a task graph exists in memory. We propose an algorithm to solve data dependences without process cooperation in a distributed manner. Our experimental results demonstrate the scalability and practicality of the approach for both shared-memory and distributed-memory multicore systems. Finally, we present a scalable nonblocking topology-aware multicast scheme for distributed DAG scheduling applications

    Architecting a One-to-many Traffic-Aware and Secure Millimeter-Wave Wireless Network-in-Package Interconnect for Multichip Systems

    Get PDF
    With the aggressive scaling of device geometries, the yield of complex Multi Core Single Chip(MCSC) systems with many cores will decrease due to the higher probability of manufacturing defects especially, in dies with a large area. Disintegration of large System-on-Chips(SoCs) into smaller chips called chiplets has shown to improve the yield and cost of complex systems. Therefore, platform-based computing modules such as embedded systems and micro-servers have already adopted Multi Core Multi Chip (MCMC) architectures overMCSC architectures. Due to the scaling of memory intensive parallel applications in such systems, data is more likely to be shared among various cores residing in different chips resulting in a significant increase in chip-to-chip traffic, especially one-to-many traffic. This one-to-many traffic is originated mainly to maintain cache-coherence between many cores residing in multiple chips. Besides, one-to-many traffics are also exploited by many parallel programming models, system-level synchronization mechanisms, and control signals. How-ever, state-of-the-art Network-on-Chip (NoC)-based wired interconnection architectures do not provide enough support as they handle such one-to-many traffic as multiple unicast trafficusing a multi-hop MCMC communication fabric. As a result, even a small portion of such one-to-many traffic can significantly reduce system performance as traditional NoC-basedinterconnect cannot mask the high latency and energy consumption caused by chip-to-chipwired I/Os. Moreover, with the increase in memory intensive applications and scaling of MCMC systems, traditional NoC-based wired interconnects fail to provide a scalable inter-connection solution required to support the increased cache-coherence and synchronization generated one-to-many traffic in future MCMC-based High-Performance Computing (HPC) nodes. Therefore, these computation and memory intensive MCMC systems need an energy-efficient, low latency, and scalable one-to-many (broadcast/multicast) traffic-aware interconnection infrastructure to ensure high-performance. Research in recent years has shown that Wireless Network-in-Package (WiNiP) architectures with CMOS compatible Millimeter-Wave (mm-wave) transceivers can provide a scalable, low latency, and energy-efficient interconnect solution for on and off-chip communication. In this dissertation, a one-to-many traffic-aware WiNiP interconnection architecture with a starvation-free hybrid Medium Access Control (MAC), an asymmetric topology, and a novel flow control has been proposed. The different components of the proposed architecture are individually one-to-many traffic-aware and as a system, they collaborate with each other to provide required support for one-to-many traffic communication in a MCMC environment. It has been shown that such interconnection architecture can reduce energy consumption and average packet latency by 46.96% and 47.08% respectively for MCMC systems. Despite providing performance enhancements, wireless channel, being an unguided medium, is vulnerable to various security attacks such as jamming induced Denial-of-Service (DoS), eavesdropping, and spoofing. Further, to minimize the time-to-market and design costs, modern SoCs often use Third Party IPs (3PIPs) from untrusted organizations. An adversary either at the foundry or at the 3PIP design house can introduce a malicious circuitry, to jeopardize an SoC. Such malicious circuitry is known as a Hardware Trojan (HT). An HTplanted in the WiNiP from a vulnerable design or manufacturing process can compromise a Wireless Interface (WI) to enable illegitimate transmission through the infected WI resulting in a potential DoS attack for other WIs in the MCMC system. Moreover, HTs can be used for various other malicious purposes, including battery exhaustion, functionality subversion, and information leakage. This information when leaked to a malicious external attackercan reveals important information regarding the application suites running on the system, thereby compromising the user profile. To address persistent jamming-based DoS attack in WiNiP, in this dissertation, a secure WiNiP interconnection architecture for MCMC systems has been proposed that re-uses the one-to-many traffic-aware MAC and existing Design for Testability (DFT) hardware along with Machine Learning (ML) approach. Furthermore, a novel Simulated Annealing (SA)-based routing obfuscation mechanism was also proposed toprotect against an HT-assisted novel traffic analysis attack. Simulation results show that,the ML classifiers can achieve an accuracy of 99.87% for DoS attack detection while SA-basedrouting obfuscation could reduce application detection accuracy to only 15% for HT-assistedtraffic analysis attack and hence, secure the WiNiP fabric from age-old and emerging attacks

    Switch-based packing technique to reduce traffic and latency in token coherence

    Full text link
    Token Coherence is a cache coherence protocol able to simultaneously capture the best attributes of traditional protocols: low latency and scalability. However it may lose these desired features when (1) several nodes contend for the same memory block and (2) nodes write highly-shared blocks. The first situation leads to the issue of simultaneous broadcast requests which threaten the protocol scalability. The second situation results in a burst of token responses directed to the writer, which turn it into a bottleneck and increase the latency. To address these problems, we propose a switch-based packing technique able to encapsulate several messages (while in transit) into just one. Its application to the simultaneous broadcasts significantly reduces their bandwidth requirements (up to 45%). Its application to token responses lowers their transmission latency (by 70%). Thus, the packing technique decreases both the latency and coherence traffic, thereby improving system performance (about 15% of reduction in runtime). © 2011 Elsevier Inc. All rights reserved.This work was partially supported by the Spanish MEC and MICINN, as well as European Commission FEDER funds, under Grants CSD2006-00046 and TIN2009-14475-C04-01.Cuesta Sáez, BA.; Robles Martínez, A.; Duato Marín, JF. (2012). Switch-based packing technique to reduce traffic and latency in token coherence. Journal of Parallel and Distributed Computing. 72(3):409-423. https://doi.org/10.1016/j.jpdc.2011.11.010S40942372

    OrthoNoC: a broadcast-oriented dual-plane wireless network-on-chip architecture

    Get PDF
    © 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other worksOn-chip communication remains as a key research issue at the gates of the manycore era. In response to this, novel interconnect technologies have opened the door to new Network-on-Chip (NoC) solutions towards greater scalability and architectural flexibility. Particularly, wireless on-chip communication has garnered considerable attention due to its inherent broadcast capabilities, low latency, and system-level simplicity. This work presents ORTHONOC, a wired-wireless architecture that differs from existing proposals in that both network planes are decoupled and driven by traffic steering policies enforced at the network interfaces. With these and other design decisions, ORTHONOC seeks to emphasize the ordered broadcast advantage offered by the wireless technology. The performance and cost of ORTHONOC are first explored using synthetic traffic, showing substantial improvements with respect to other wired-wireless designs with a similar number of antennas. Then, the applicability of ORTHONOC in the multiprocessor scenario is demonstrated through the evaluation of a simple architecture that implements fast synchronization via ordered broadcast transmissions. Simulations reveal significant execution time speedups and communication energy savings for 64-threaded benchmarks, proving that the value of ORTHONOC goes beyond simply improving the performance of the on-chip interconnect.Peer ReviewedPostprint (author's final draft

    MorphIC: A 65-nm 738k-Synapse/mm2^2 Quad-Core Binary-Weight Digital Neuromorphic Processor with Stochastic Spike-Driven Online Learning

    Full text link
    Recent trends in the field of neural network accelerators investigate weight quantization as a means to increase the resource- and power-efficiency of hardware devices. As full on-chip weight storage is necessary to avoid the high energy cost of off-chip memory accesses, memory reduction requirements for weight storage pushed toward the use of binary weights, which were demonstrated to have a limited accuracy reduction on many applications when quantization-aware training techniques are used. In parallel, spiking neural network (SNN) architectures are explored to further reduce power when processing sparse event-based data streams, while on-chip spike-based online learning appears as a key feature for applications constrained in power and resources during the training phase. However, designing power- and area-efficient spiking neural networks still requires the development of specific techniques in order to leverage on-chip online learning on binary weights without compromising the synapse density. In this work, we demonstrate MorphIC, a quad-core binary-weight digital neuromorphic processor embedding a stochastic version of the spike-driven synaptic plasticity (S-SDSP) learning rule and a hierarchical routing fabric for large-scale chip interconnection. The MorphIC SNN processor embeds a total of 2k leaky integrate-and-fire (LIF) neurons and more than two million plastic synapses for an active silicon area of 2.86mm2^2 in 65nm CMOS, achieving a high density of 738k synapses/mm2^2. MorphIC demonstrates an order-of-magnitude improvement in the area-accuracy tradeoff on the MNIST classification task compared to previously-proposed SNNs, while having no penalty in the energy-accuracy tradeoff.Comment: This document is the paper as accepted for publication in the IEEE Transactions on Biomedical Circuits and Systems journal (2019), the fully-edited paper is available at https://ieeexplore.ieee.org/document/876400

    Many-core and heterogeneous architectures: programming models and compilation toolchains

    Get PDF
    1noL'abstract è presente nell'allegato / the abstract is in the attachmentopen677. INGEGNERIA INFORMATInopartially_openembargoed_20211002Barchi, Francesc

    Scalability of broadcast performance in wireless network-on-chip

    Get PDF
    Networks-on-Chip (NoCs) are currently the paradigm of choice to interconnect the cores of a chip multiprocessor. However, conventional NoCs may not suffice to fulfill the on-chip communication requirements of processors with hundreds or thousands of cores. The main reason is that the performance of such networks drops as the number of cores grows, especially in the presence of multicast and broadcast traffic. This not only limits the scalability of current multiprocessor architectures, but also sets a performance wall that prevents the development of architectures that generate moderate-to-high levels of multicast. In this paper, a Wireless Network-on-Chip (WNoC) where all cores share a single broadband channel is presented. Such design is conceived to provide low latency and ordered delivery for multicast/broadcast traffic, in an attempt to complement a wireline NoC that will transport the rest of communication flows. To assess the feasibility of this approach, the network performance of WNoC is analyzed as a function of the system size and the channel capacity, and then compared to that of wireline NoCs with embedded multicast support. Based on this evaluation, preliminary results on the potential performance of the proposed hybrid scheme are provided, together with guidelines for the design of MAC protocols for WNoC.Peer ReviewedPostprint (published version

    Architectural Support for Efficient Communication in Future Microprocessors

    Get PDF
    Traditionally, the microprocessor design has focused on the computational aspects of the problem at hand. However, as the number of components on a single chip continues to increase, the design of communication architecture has become a crucial and dominating factor in defining performance models of the overall system. On-chip networks, also known as Networks-on-Chip (NoC), emerged recently as a promising architecture to coordinate chip-wide communication. Although there are numerous interconnection network studies in an inter-chip environment, an intra-chip network design poses a number of substantial challenges to this well-established interconnection network field. This research investigates designs and applications of on-chip interconnection network in next-generation microprocessors for optimizing performance, power consumption, and area cost. First, we present domain-specific NoC designs targeted to large-scale and wire-delay dominated L2 cache systems. The domain-specifically designed interconnect shows 38% performance improvement and uses only 12% of the mesh-based interconnect. Then, we present a methodology of communication characterization in parallel programs and application of characterization results to long-channel reconfiguration. Reconfigured long channels suited to communication patterns enhance the latency of the mesh network by 16% and 14% in 16-core and 64-core systems, respectively. Finally, we discuss an adaptive data compression technique that builds a network-wide frequent value pattern map and reduces the packet size. In two examined multi-core systems, cache traffic has 69% compressibility and shows high value sharing among flows. Compression-enabled NoC improves the latency by up to 63% and saves energy consumption by up to 12%

    General hardware multicasting for fine-grained message-passing architectures

    Get PDF
    Manycore architectures are increasingly favouring message-passing or partitioned global address spaces (PGAS) over cache coherency for reasons of power efficiency and scalability. However, in the absence of cache coherency, there can be a lack of hardware support for one-to-many communication patterns, which are prevalent in someapplication domains. To address this, we present new hardware primitives for multicast communication in rack-scale manycore systems. These primitives guarantee delivery to both colocated and distributed destinations, and can capture large unstructured communication patterns precisely. As a result, reliable multicast transfers among any number of software tasks, connected in any topology, can be fully offloaded to hardware. We implement the new primitives in a research platform consisting of 50K RISC-V threads distributed over 48 FPGAs, and demonstrate significant performance benefits on a range of applications expressed using a high-level vertex-centric programming model

    General hardware multicasting for fine-grained message-passing architectures

    Get PDF
    Manycore architectures are increasingly favouring message-passing or partitioned global address spaces (PGAS) over cache coherency for reasons of power efficiency and scalability. However, in the absence of cache coherency, there can be a lack of hardware support for one-to-many communication patterns, which are prevalent in some application domains. To address this, we present new hardware primitives for multicast communication in rack-scale manycore systems. These primitives guarantee delivery to both colocated and distributed destinations, and can capture large unstructured communication patterns precisely. As a result, reliable multicast transfers among any number of software tasks, connected in any topology, can be fully offloaded to hardware. We implement the new primitives in a research platform consisting of 50K RISC-V threads distributed over 48 FPGAs, and demonstrate significant performance benefits on a range of applications expressed using a high-level vertex-centric programming model
    • …
    corecore