114 research outputs found

    Improving performance guarantees in wormhole mesh NoC designs

    Get PDF
    Wormhole-based mesh Networks-on-Chip (wNoC) are deployed in high-performance many-core processors due to their physical scalability and low-cost. Delivering tight and time composable Worst-Case Execution Time (WCET) estimates for applications as needed in safety-critical real-time embedded systems is challenged by wNoCs due to their distributed nature. We propose a bandwidth control mechanism for wNoCs that enables the computation of tight time-composable WCET estimates with low average performance degradation and high scalability. Our evaluation with the EEMBC automotive suite and an industrial real-time parallel avionics application confirms so.The research leading to these results is funded by the European Union Seventh Framework Programme under grant agreement no. 287519 (parMERASA) and by the Ministry of Science and Technology of Spain under contract TIN2012-34557. Milos Panic is funded by the Spanish Ministry of Education under the FPU grant FPU12/05966. Carles Hernández is jointly funded by the Spanish Ministry of Economy and Competitiveness and FEDER funds through grant TIN2014-60404-JIN. Jaume Abella is partially supported by the Ministry of Economy and Competitiveness under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717.Peer ReviewedPostprint (author's final draft

    Hardware/Software Co-design for Multicore Architectures

    Get PDF
    Siirretty Doriast

    High-Performance Communication Primitives and Data Structures on Message-Passing Manycores:Broadcast and Map

    Get PDF
    The constant increase in single core frequency reached a plateau during recent years since the produced heat inside the chip cannot be cooled down by existing technologies anymore. An alternative to harvest more computational power per die is to fabricate more number of cores into a single chip. Therefore manycore chips with more than thousand cores are expected by the end of the decade. These environments provide a high level of parallel processing power while their energy consumption is considerably lower than their multi-chip counterparts. Although shared-memory programming is the classical paradigm to program these environments, there are numerous claims that taking into account the full life cycle of software, the message-passing programming model have numerous advantages. The direct architectural consequence of applying a message-passing programming model is to support message passing between the processing entities directly in the hardware. Therefore manycore architectures with hardware support for message passing are becoming more and more visible. These platforms can be seen in two ways: (i) as a High Performance Computing (HPC) cluster programmed by highly trained scientists using Message Passing Interface (MPI) libraries; or (ii) as a mainstream computing platform requiring a global operating system to abstract away the architectural complexities from the ordinary programmer. In the first view, performance of communication primitives is an important bottleneck for MPI applications. In the second view, kernel data structures have been shown to be a limiting factor. In this thesis (i) we overview existing state-of-the-art techniques to circumvent the mentioned bottlenecks; and (ii) we study high-performance broadcast communication primitive and map data structure on modern manycore architectures, with message-passing support in hardware, in two different chapters respectively. In one chapter, we study how to make use of the hardware features to implement an efficient broadcast primitive. We consider the Intel Single-chip Cloud Computer (SCC) as our target platform which offers the ability to move data between on-chip Message Passing Buffers (MPB) using Remote Memory Access (RMA). We propose OC-Bcast (On-Chip Broadcast), a pipelined k-ary tree algorithm tailored to exploit the parallelism provided by on-chip RMA. Experimental results show that OC-Bcast attains considerably better performance in terms of latency and throughput compared to state-of-the-art solutions. This performance improvement highlights the benefits of exploiting hardware features of the target platform: Our broadcast algorithm takes direct advantage of RMA, unlike the other broadcast solutions which are based on a higher-level send/receive interface. In the other chapter, we study the implementation of high-throughput concurrent maps in message-passing manycores. Partitioning and replication are the two approaches to achieve high throughput in a message-passing system. This chapter presents and compares different strongly-consistent map algorithms based on partitioning and replication. To assess the performance of these algorithms independently of architecture-specific features, we propose a communication model of message-passing manycores to express the throughput of each algorithm. The model is validated through experiments on a 36-core TILE-Gx8036 processor. Evaluations show that replication outperforms partitioning only in a narrow domain

    Towards Optimal Application Mapping for Energy-Efficient Many-Core Platforms

    Get PDF
    Siirretty Doriast

    Investigation on AUTOSAR-Compliant Solutions for Many-Core Architectures

    Get PDF
    As of today, AUTOSAR is the de facto standard in the automotive industry, providing a common software architec- ture and development process for automotive applications. While this standard is originally written for singlecore operated Elec- tronic Control Units (ECU), new guidelines and recommendations have been added recently to provide support for multicore archi- tectures. This update came as a response to the steady increase of the number and complexity of the software functions embedded in modern vehicles, which call for the computing power of multicore execution environments. In this paper, we enumerate and analyze the design options and the challenges of porting AUTOSAR-based automotive applications onto multicore platforms. In particular, we investigate those options when considering the emerging many- core architectures that provide a more scalable environment than the traditional multicore systems. Such platforms are suitable to enable massive parallel execution, and their design is more suitable for partitioning and isolating the software components.Euromicro Conference on Digital System Design (DSD 2015), Funchal, Portugal

    Modeling high-performance wormhole NoCs for critical real-time embedded systems

    Get PDF
    Manycore chips are a promising computing platform to cope with the increasing performance needs of critical real-time embedded systems (CRTES). However, manycores adoption by CRTES industry requires understanding task's timing behavior when their requests use manycore's network-on-chip (NoC) to access hardware shared resources. This paper analyzes the contention in wormhole-based NoC (wNoC) designs - widely implemented in the high-performance domain - for which we introduce a new metric: worst-contention delay (WCD) that captures wNoC impact on worst-case execution time (WCET) in a tighter manner than the existing metric, worst-case traversal time (WCTT). Moreover, we provide an analytical model of the WCD that requests can suffer in a wNoC and we validate it against wNoC designs resembling those in the Tilera-Gx36 and the Intel-SCC 48-core processors. Building on top of our WCD analytical model, we analyze the impact on WCD that different design parameters such as the number of virtual channels, and we make a set of recommendations on what wNoC setups to use in the context of CRTES.Peer ReviewedPostprint (author's final draft

    Scalable and Distributed Resource Management for Many-Core Systems

    Get PDF
    Many-core systems provide researchers with important new challenges, including the handling of very dynamic and hardly predictable computational loads. The large number of applications and cores causes scalability issues for centrally acting heuristics, which always must retain a global view of the entire system. Resource management itself can become a bottleneck which limits the achievable performance of the system. The focus of this work is to achieve scalability of resource management

    NoC-based Architectures for Real-Time Applications : Performance Analysis and Design Space Exploration

    Get PDF
    Monoprocessor architectures have reached their limits in regard to the computing power they offer vs the needs of modern systems. Although multicore architectures partially mitigate this limitation and are commonly used nowadays, they usually rely on intrinsically non-scalable buses to interconnect the cores. The manycore paradigm was proposed to tackle the scalability issue of bus-based multicore processors. It can scale up to hundreds of processing elements (PEs) on a single chip, by organizing them into computing tiles (holding one or several PEs). Intercore communication is usually done using a Network-on-Chip (NoC) that consists of interconnected onchip routers allowing communication between tiles. However, manycore architectures raise numerous challenges, particularly for real-time applications. First, NoC-based communication tends to generate complex blocking patterns when congestion occurs, which complicates the analysis, since computing accurate worst-case delays becomes difficult. Second, running many applications on large Systems-on-Chip such as manycore architectures makes system design particularly crucial and complex. On one hand, it complicates Design Space Exploration, as it multiplies the implementation alternatives that will guarantee the desired functionalities. On the other hand, once a hardware architecture is chosen, mapping the tasks of all applications on the platform is a hard problem, and finding an optimal solution in a reasonable amount of time is not always possible. Therefore, our first contributions address the need for computing tight worst-case delay bounds in wormhole NoCs. We first propose a buffer-aware worst-case timing analysis (BATA) to derive upper bounds on the worst-case end-to-end delays of constant-bit rate data flows transmitted over a NoC on a manycore architecture. We then extend BATA to cover a wider range of traffic types, including bursty traffic flows, and heterogeneous architectures. The introduced method is called G-BATA for Graph-based BATA. In addition to covering a wider range of assumptions, G-BATA improves the computation time; thus increases the scalability of the method. In a second part, we develop a method addressing design and mapping for applications with real-time constraints on manycore platforms. It combines model-based engineering tools (TTool) and simulation with our analytical verification technique (G-BATA) and tools (WoPANets) to provide an efficient design space exploration framework. Finally, we validate our contributions on (a) a serie of experiments on a physical platform and (b) two case studies taken from the real world: an autonomous vehicle control application, and a 5G signal decoder applicatio

    On-Chip Optical Interconnection Networks for Multi/Manycore Architectures

    Get PDF
    The rapid development of multi/manycore technologies offers the opportunity for highly parallel architectures implemented on a single chip. While the first, low-parallelism multicore products have been based on simple interconnection structures (single bus, very simple crossbar), the emerging highly parallel architectures will require complex, limited-degree interconnection networks. This thesis studies this trend according to the general theory of interconnection structures for parallel machines, and investigates some solutions in terms of performance, cost, fault-tolerance, and run-time support to shared-memory and/or message passing programming mechanisms
    • …
    corecore