
    Scalability of broadcast performance in wireless network-on-chip

    Networks-on-Chip (NoCs) are currently the paradigm of choice to interconnect the cores of a chip multiprocessor. However, conventional NoCs may not suffice to fulfill the on-chip communication requirements of processors with hundreds or thousands of cores. The main reason is that the performance of such networks drops as the number of cores grows, especially in the presence of multicast and broadcast traffic. This not only limits the scalability of current multiprocessor architectures, but also sets a performance wall that prevents the development of architectures that generate moderate-to-high levels of multicast. In this paper, a Wireless Network-on-Chip (WNoC) where all cores share a single broadband channel is presented. Such a design is conceived to provide low latency and ordered delivery for multicast/broadcast traffic, in an attempt to complement a wireline NoC that will transport the rest of the communication flows. To assess the feasibility of this approach, the network performance of WNoC is analyzed as a function of the system size and the channel capacity, and then compared to that of wireline NoCs with embedded multicast support. Based on this evaluation, preliminary results on the potential performance of the proposed hybrid scheme are provided, together with guidelines for the design of MAC protocols for WNoC.
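
    A minimal back-of-envelope sketch (in Python) of the hybrid idea described above: multi-destination flows are steered onto a single shared wireless channel, whose broadcast latency does not grow with the core count, while unicast flows stay on the wireline mesh. The flit size, channel capacity and per-hop latency are illustrative assumptions, not values or models taken from the paper.

```python
from dataclasses import dataclass

# Toy model of the hybrid interconnect idea: broadcast/multicast traffic uses a
# single shared wireless channel (one serialized transmission reaches every
# core), while unicast traffic stays on the wireline mesh NoC. All parameters
# below are illustrative assumptions.

@dataclass
class HybridNoCModel:
    cores: int                 # number of cores (assumed square mesh)
    wireless_bps: float        # shared wireless channel capacity in bits/s
    mesh_hop_ns: float = 1.0   # assumed per-hop wireline latency in ns
    flit_bits: int = 128       # assumed flit size

    def wireless_broadcast_latency_ns(self) -> float:
        # One serialized transmission reaches all cores, so (ignoring MAC
        # arbitration) latency does not depend on the core count.
        return self.flit_bits / self.wireless_bps * 1e9

    def mesh_broadcast_latency_ns(self) -> float:
        # A tree-based multicast on a mesh still has to cross the network
        # diameter, which grows with sqrt(cores).
        side = int(self.cores ** 0.5)
        return 2 * (side - 1) * self.mesh_hop_ns

    def route(self, destinations: int) -> str:
        # Hybrid policy: multi-destination flows go wireless, unicast stays wired.
        return "wireless" if destinations > 1 else "wireline"

for n in (64, 256, 1024):
    m = HybridNoCModel(cores=n, wireless_bps=10e9)   # assume a 10 Gb/s channel
    print(n, m.wireless_broadcast_latency_ns(), m.mesh_broadcast_latency_ns())
```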

    Probabilistically time-analyzable complex processors in hard real-time systems

    Critical Real-Time Embedded Systems (CRTES) feature performance-demanding functionality. High-performance hardware and complex software can provide such functionality, but the use of aggressive technology challenges time-predictability. Our work focuses on the investigation and development of (1) hardware mechanisms to control inter-task interference in shared time-randomized caches and (2) manycore network-on-chip designs meeting the requirements of Probabilistic Timing Analysis (PTA).
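
    As a rough illustration of why time-randomization helps PTA (a sketch under simplified assumptions, not the interference-control mechanisms investigated in the work above), the Python fragment below models a cache with random replacement: repeating the same access trace yields an execution-time distribution that can be characterized probabilistically rather than a single, hard-to-bound number.

```python
import random

# Minimal sketch of a time-randomized cache: random replacement turns hit/miss
# behaviour into a random variable that can be characterised by sampling, the
# property PTA relies on. Latencies and geometry are illustrative assumptions.

class RandomReplacementCache:
    def __init__(self, sets: int, ways: int, seed=None):
        self.sets, self.ways = sets, ways
        self.lines = [[] for _ in range(sets)]
        self.rng = random.Random(seed)

    def access(self, addr: int) -> int:
        """Return an assumed latency: 1 cycle on a hit, 100 cycles on a miss."""
        s = self.lines[addr % self.sets]
        tag = addr // self.sets
        if tag in s:
            return 1
        if len(s) >= self.ways:
            s.pop(self.rng.randrange(len(s)))   # random victim selection
        s.append(tag)
        return 100

# Sampling the same trace over many randomized runs gives an execution-time
# distribution instead of a single number.
trace = [i % 48 for i in range(10_000)]
samples = []
for run in range(200):
    cache = RandomReplacementCache(sets=16, ways=2, seed=run)
    samples.append(sum(cache.access(a) for a in trace))
print(min(samples), max(samples))
```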

    Fast behavioural RTL simulation of 10B transistor SoC designs with Metro-MPI

    Chips with tens of billions of transistors have become today's norm. These designs are straining our electronic design automation tools throughout the design process, requiring ever more computational resources. In many tools, parallelisation has improved both latency and throughput for the designer's benefit. However, tools largely remain restricted to a single machine and, in the case of RTL simulation, we believe that this leaves much potential performance on the table. We introduce Metro-MPI to improve RTL simulation for modern 10 billion transistor-scale chips. Metro-MPI exploits the natural boundaries present in chip designs to partition RTL simulations and leverages High Performance Computing (HPC) techniques to extract parallelism. For chip designs that scale in size by exploiting latency-insensitive interfaces like networks-on-chip and AXI, Metro-MPI offers a new paradigm for RTL simulation scalability. Our implementation of Metro-MPI in OpenPiton+Ariane delivers 2.7 MIPS of RTL simulation throughput for the first time on a design with more than 10 billion transistors and 1,024 Linux-capable cores, opening new avenues for distributed RTL simulation of emerging system-on-chip designs. Compared to sequential and multithreaded RTL simulations of smaller designs, Metro-MPI achieves speedups of up to 135.98× and 9.29×. Similarly, for a representative regression run, Metro-MPI reduces energy consumption by up to 2.53× and 2.91×. This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (contract PID2019-107255GB-C21), by the Generalitat de Catalunya (contract 2017-SGR-1328), by the European Union within the framework of the ERDF of Catalonia 2014-2020 under the DRAC project [001-P-001723], and by the Arm-BSC Center of Excellence. G. Lopez-Paradís has been supported by the Generalitat de Catalunya through a FI fellowship 2021FI-B00994 and GSoC 2021, and M. Moreto by a Ramon y Cajal fellowship no. RYC-2016-21104. A. Armejach is a Serra Hunter Fellow.
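
    The partitioning idea can be sketched in a few lines of MPI code. The fragment below (Python with mpi4py, purely for illustration; Metro-MPI itself partitions the Verilator-generated model of OpenPiton+Ariane) shows a ring of simulated partitions that advance in lockstep and exchange only the signals crossing a latency-insensitive boundary once per simulated cycle.

```python
from mpi4py import MPI  # assumed dependency; run e.g. `mpirun -n 2 python sketch.py`

# Sketch of the partitioning approach: each MPI rank simulates one partition of
# the design (e.g. a tile behind a latency-insensitive NoC/AXI interface), and
# ranks exchange only the flits that cross that boundary. This is an
# illustration of the idea, not the Metro-MPI implementation.

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
peer = (rank + 1) % size            # toy ring of partitions

def simulate_cycle(partition_id: int, cycle: int, incoming):
    # Placeholder for evaluating one partition's RTL model for one cycle;
    # returns the flit (if any) it drives onto the boundary interface.
    return {"src": partition_id, "cycle": cycle, "payload": incoming}

flit_in = None
for cycle in range(4):
    flit_out = simulate_cycle(rank, cycle, flit_in)
    # Latency-insensitive boundary: exchange boundary signals with the peer
    # partition; nothing else needs to be synchronised.
    flit_in = comm.sendrecv(flit_out, dest=peer, source=peer)

print(f"rank {rank}: last flit received = {flit_in}")
```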

    Introducing a Data Sliding Mechanism for Cooperative Caching in Manycore Architectures

    In this paper, we propose a new cooperative caching method that improves the cache miss rate for manycore microarchitectures. The work is motivated by limitations of recent adaptive cooperative caching proposals. Elastic Cooperative Caching (ECC) is a dynamic memory partitioning mechanism that allows sharing cache across cooperative nodes according to the application behavior; however, its performance is mainly limited by the cache eviction rate when the neighborhood is highly stressed. Another system, the adaptive Set-Granular Cooperative Caching (ASCC), is based on finer set-based mechanisms for better adaptability, but heavy localized cache loads are still not efficiently managed. In this context, we propose a cooperative caching strategy that consists in sliding data through close neighbors. When a cache receives a request to store a neighbor's private block, it spills its least recently used private data to a close neighbor; thus, solicited saturated nodes slide local blocks to their respective neighbors so that free cache space is always available. We also propose a new Priority-based Data Replacement policy to decide efficiently which blocks should be spilled, and a new mechanism, called Best Neighbor selector, to choose the host destination. A first analytic performance evaluation shows that the proposed cache management policies cut the average global communication rate in half. As frequent accesses are concentrated in the neighboring zones, this efficiently improves on-chip traffic. Finally, our evaluation shows that the cache miss rate is improved: each tile keeps its most frequently accessed data 1-hop away instead of evicting it off-chip. The proposed techniques notably reduce the cache miss rate when the cooperative zone is highly solicited, as shown in the performed experiments.
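
    A toy sketch of the data-sliding mechanism in Python, assuming LRU order as a stand-in for the Priority-based Data Replacement policy and lowest occupancy as a stand-in for the Best Neighbor selector (both are simplifications for illustration, not the proposed policies):

```python
from collections import OrderedDict

# Toy sketch of data sliding: when a saturated tile is asked to host a
# neighbour's evicted block, it makes room by spilling its own least recently
# used private block to one of its neighbours, so blocks "slide" outwards
# instead of being evicted off-chip.

class Tile:
    def __init__(self, name: str, capacity: int):
        self.name, self.capacity = name, capacity
        self.blocks = OrderedDict()   # block -> owner, kept in LRU order
        self.neighbors = []           # neighbouring Tile objects

    def occupancy(self) -> int:
        return len(self.blocks)

    def host(self, block: str, owner: str):
        """Accept a block from a neighbour, sliding our own LRU block if full."""
        if self.occupancy() >= self.capacity and self.neighbors:
            victim, victim_owner = self.blocks.popitem(last=False)   # LRU victim
            best = min(self.neighbors, key=Tile.occupancy)           # "best neighbour"
            best.host(victim, victim_owner)
        self.blocks[block] = owner

# Three tiles in a row: t0 spills into t1, which slides data onward into t2.
t0, t1, t2 = Tile("t0", 2), Tile("t1", 2), Tile("t2", 2)
t1.neighbors = [t2]
for i in range(4):
    t1.host(f"blk{i}", owner="t0")
print(t1.occupancy(), t2.occupancy())   # t1 stays at capacity, t2 hosts the slid blocks
```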

    Exploring heterogeneous computing with advanced path tracing algorithms

    The CG research community has a renewed interest in rendering algorithms based on path space integration, mainly due to new approaches to discover, generate and exploit relevant light paths while keeping the numerical integrator unbiased or, at the very least, consistent. Simultaneously, the current trend towards massive parallelism and heterogeneous environments, based on a mix of conventional computing units with accelerators, is playing a major role both in HPC and embedded platforms. To efficiently use the available resources in these and future systems, algorithms and software packages are being revisited and reevaluated to assess their suitability for these environments. This paper assesses the performance and scalability of three different path-based algorithms running on homogeneous servers (dual multicore Xeons) and heterogeneous systems (those multicores plus manycore Xeon and NVIDIA Kepler GPU devices). These algorithms include path tracing (PT), its bidirectional counterpart (BPT) and the more recent Vertex Connection and Merging (VCM). Experimental results with two conventional scenes (one mainly diffuse, the other exhibiting specular-diffuse-specular paths) show that all algorithms scale well across the different platforms, with the actual scalability depending on whether shared data structures are accessed or not (PT vs. BPT vs. VCM). This work was supported by COMPETE: POCI-01-0145FEDER-007043 and FCT (Fundação para a Ciência e Tecnologia) within Project Scope (UID/CEC/00319/2013), by the Cooperation Program with the University of Texas at Austin, and co-funded by the North Portugal Regional Operational Programme (ON.2 - O Novo Norte), under the National Strategic Reference Framework, through the European Regional Development Fund.
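
    For readers unfamiliar with the PT baseline, the sketch below shows the core of an unbiased path-space estimator in a toy enclosure scene where the exact answer is known; the constants and the scene are illustrative assumptions only. Each path is independent, which is one reason PT scales well, whereas BPT and especially VCM also build and access shared light sub-path and vertex structures, the factor the abstract points to as determining scalability.

```python
import random

# Minimal sketch of an unbiased path-space estimator in the spirit of PT (BPT
# and VCM additionally build light sub-paths and merge vertices, not shown).
# Toy scene: a closed enclosure whose every surface emits radiance EMIT and
# reflects a fraction ALBEDO of incoming light, so the exact answer is
# EMIT / (1 - ALBEDO). All constants are illustrative assumptions.

EMIT, ALBEDO = 1.0, 0.6
RR_PROB = 0.8          # Russian-roulette continuation probability

def trace_path(rng: random.Random) -> float:
    radiance, throughput = 0.0, 1.0
    while True:
        radiance += throughput * EMIT        # hit an emitting surface
        if rng.random() >= RR_PROB:          # unbiased termination
            return radiance
        throughput *= ALBEDO / RR_PROB       # account for the bounce + roulette

def render(samples: int, seed: int = 0) -> float:
    rng = random.Random(seed)
    return sum(trace_path(rng) for _ in range(samples)) / samples

print(render(200_000), "vs exact", EMIT / (1 - ALBEDO))
```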