43 research outputs found

    Software-based and regionally-oriented traffic management in Networks-on-Chip

    Since the introduction of chip-multiprocessor systems, the number of integrated cores has been growing steadily, and workload applications have been adapted to exploit the increasing parallelism. This has made efficient on-chip communication significantly more important, and the interconnect infrastructure has to keep pace with these new requirements. The work at hand makes significant contributions to the state of the art of the latest generation of such solutions, called Networks-on-Chip, improving the performance, reliability, and flexible management of these on-chip infrastructures.

    Automatic synthesis and optimization of chip multiprocessors

    Microprocessor technology has experienced enormous growth over the last decades. Rapid downscaling of CMOS technology has led to higher operating frequencies and performance densities, while facing the fundamental issue of power dissipation. Chip Multiprocessors (CMPs) have become the latest paradigm for improving the power-performance efficiency of computing systems by exploiting the parallelism inherent in applications. Industrial and prototype implementations have already demonstrated the benefits achieved by CMPs with hundreds of cores. CMP architects are challenged to take many complex design decisions, for example:
    - What should be the ratio between the core and cache areas on a chip?
    - Which core architectures to select?
    - How many cache levels should the memory subsystem have?
    - Which interconnect topologies provide efficient on-chip communication?
    These and many other aspects create a complex multidimensional space for architectural exploration. Design automation tools become essential to make this exploration feasible under hard time-to-market constraints, and the exploration methods have to be efficient and scalable enough to handle future generations of on-chip architectures with hundreds or thousands of cores. Furthermore, once a CMP has been fabricated, the need for efficient deployment of the many-core processor arises. Intelligent techniques for task mapping and scheduling onto CMPs are necessary to fully exploit the benefits brought by many-core technology. These techniques have to consider the peculiarities of modern architectures, such as the availability of enhanced power-saving techniques and the presence of complex memory hierarchies.
    This thesis has several objectives. The first objective is to elaborate methods for efficient analytical modeling and architectural design space exploration of CMPs. Efficiency is achieved by using analytical models instead of simulation and by replacing exhaustive exploration with an intelligent search strategy; additionally, these methods incorporate high-level models for physical planning. The related contributions are described in Chapters 3, 4 and 5 of the document. The second objective is to propose a scalable algorithm for task mapping onto general-purpose CMPs with power management techniques, for efficient deployment of many-core systems; this contribution is explained in Chapter 6. Finally, the third objective is to address the issues of on-chip interconnect design and exploration by developing a model for simultaneous topology customization and deadlock-free routing in Networks-on-Chip. The developed methodology can be applied to various classes of on-chip systems, ranging from general-purpose chip multiprocessors to application-specific solutions; Chapter 7 describes the proposed model. The presented methods have been thoroughly tested experimentally, and the results are described in this dissertation. At the end of the document, several possible directions for future research are proposed.
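    As a hedged illustration of the exploration approach summarized above, the sketch below pairs a placeholder analytical cost model with simulated annealing as one possible intelligent search strategy; the design parameters, cost function, and annealing schedule are assumptions made for this example, not the models actually developed in the thesis.

```python
# Illustrative sketch only: analytical-model-driven design space exploration,
# with simulated annealing standing in for the "intelligent search strategy".
# All parameters and coefficients below are assumed for demonstration.
import math
import random

DESIGN_SPACE = {
    "cores":        [16, 32, 64, 128, 256],
    "cache_levels": [2, 3],
    "cache_ratio":  [0.2, 0.3, 0.4, 0.5],   # fraction of die area spent on caches
    "topology":     ["mesh", "torus", "ring"],
}

def analytical_cost(p):
    """Placeholder analytical model returning an abstract energy-delay cost.
    A real model would estimate performance, power, and area analytically."""
    perf = p["cores"] ** 0.7 * (1.0 + 0.4 * p["cache_ratio"]) \
           * (1.1 if p["topology"] != "ring" else 1.0)
    power = p["cores"] * (1.0 - 0.3 * p["cache_ratio"]) * p["cache_levels"]
    return power / perf

def neighbour(p):
    """Mutate one randomly chosen design parameter."""
    q = dict(p)
    key = random.choice(list(DESIGN_SPACE))
    q[key] = random.choice(DESIGN_SPACE[key])
    return q

def explore(iterations=5000, t0=1.0, cooling=0.999):
    """Simulated-annealing walk over the design space using the analytical cost."""
    current = {k: random.choice(v) for k, v in DESIGN_SPACE.items()}
    best, temp = current, t0
    for _ in range(iterations):
        cand = neighbour(current)
        delta = analytical_cost(cand) - analytical_cost(current)
        if delta < 0 or random.random() < math.exp(-delta / temp):
            current = cand
        if analytical_cost(current) < analytical_cost(best):
            best = current
        temp *= cooling
    return best

if __name__ == "__main__":
    print("best configuration found:", explore())
```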

    Real-Time Application Mapping for Many-Cores Using a Limited Migrative Model

    Many-core platforms are an emerging technology in the real-time embedded domain. These devices offer various options for power savings and cost reductions and contribute to overall system flexibility; however, issues such as unpredictability, scalability, and analysis pessimism are serious challenges to their integration into this area. The focus of this work is on many-core platforms using a limited migrative model (LMM). LMM is an approach based on the fundamental concepts of the multi-kernel paradigm, which is a promising step towards scalable and predictable many-cores. In this work, we formulate the problem of real-time application mapping on a many-core platform using LMM and propose a three-stage method to solve it. An extended version of the existing analysis is used to ensure that derived mappings (i) guarantee the fulfilment of timing constraints posed on the worst-case communication delays of individual applications, and (ii) provide an environment to perform load balancing for, e.g., energy/thermal management, fault tolerance, and/or performance reasons.
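    A minimal sketch of the kind of feasibility check such a mapping stage might perform is given below, assuming a toy delay model in which worst-case communication delay grows with how widely an application's tasks are spread across clusters; the data layout, constants, and helper names are hypothetical and are not the analysis used in this work.

```python
# Illustrative sketch only: accept a candidate task-to-cluster mapping when every
# application's (toy) worst-case communication delay meets its deadline, and
# report per-cluster load for balancing. All models and constants are assumed.
from collections import defaultdict

def worst_case_delay(app, mapping, base_hop_delay=5):
    """Crude placeholder: delay grows with the number of clusters an
    application's tasks are spread over and with its message count."""
    clusters = {mapping[t] for t in app["tasks"]}
    return base_hop_delay * len(clusters) * app["messages"]

def mapping_is_feasible(apps, mapping):
    """True if every application's worst-case delay stays within its deadline."""
    return all(worst_case_delay(a, mapping) <= a["deadline"] for a in apps)

def cluster_loads(apps, mapping):
    """Utilisation accumulated per cluster, split evenly across an app's tasks."""
    load = defaultdict(float)
    for a in apps:
        for t in a["tasks"]:
            load[mapping[t]] += a["utilisation"] / len(a["tasks"])
    return dict(load)

if __name__ == "__main__":
    apps = [
        {"tasks": ["a0", "a1"], "messages": 4, "deadline": 60, "utilisation": 2.0},
        {"tasks": ["b0"],       "messages": 2, "deadline": 30, "utilisation": 1.0},
    ]
    mapping = {"a0": 0, "a1": 1, "b0": 0}
    print("feasible:", mapping_is_feasible(apps, mapping))
    print("loads:", cluster_loads(apps, mapping))
```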

    PaSE: Parallel Speedup Estimation Framework for Network-on-Chip Based Multi-core Systems

    The massive integration of cores in multi-core systems has enabled chip designers to build systems that meet the power-performance demands of applications. However, the full-system simulations traditionally used to evaluate the speedup of these systems are computationally expensive and time consuming. On the other hand, analytical speedup models such as Amdahl's law are powerful and fast ways to calculate the achievable speedup of these systems. However, Amdahl's law disregards the communication among the cores, which plays a vital role in determining the achievable speedup of multi-core systems. To bridge this gap, in this work we present PaSE, a parallel speedup estimation framework for multi-core systems that accounts for the latency of the Network-on-Chip (NoC). To accurately capture the NoC latency, we propose a queuing-theory-based analytical model. Using the proposed PaSE framework, we conduct a speedup analysis for a multi-core system with real application-based traffic, namely matrix multiplication: the multiplication of two 32x32 matrices is considered in a NoC-based multi-core system for three cases, an ideal-core case, an integer case (matrix elements are integer numbers), and a denormal case (matrix elements are denormal numbers, NaNs, or infinities). From this analysis, we show how the system size, the NoC architecture, and the computation-to-communication (C-to-C) ratio affect the achievable speedup. In summary, instead of simulation-based performance estimation, the PaSE framework can be used as a design guideline, e.g., to determine the optimal multi-core system size for a given application, and can thus reduce the design time and effort of NoC-based multi-core systems.
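    The sketch below illustrates the general idea of augmenting Amdahl's-law-style speedup estimation with a queueing-based communication term; the M/M/1 approximation, constants, and function names are generic textbook assumptions, not the actual PaSE model.

```python
# Illustrative sketch only: Amdahl's law extended with a communication-time term
# derived from an M/M/1-like queueing approximation of NoC latency. The formulas
# and parameter values are generic assumptions, not the PaSE framework itself.

def noc_latency(hops, service_time, utilisation):
    """Average per-message latency: per-hop traversal plus M/M/1 queueing delay.
    Requires utilisation < 1 for the queue to be stable."""
    if utilisation >= 1.0:
        return float("inf")
    waiting = service_time * utilisation / (1.0 - utilisation)
    return hops * (service_time + waiting)

def speedup(parallel_fraction, cores, comp_time, messages_per_core,
            hops, service_time, utilisation):
    """Amdahl's law with an added communication-time term in the denominator."""
    serial = (1.0 - parallel_fraction) * comp_time
    parallel = parallel_fraction * comp_time / cores
    comm = messages_per_core * noc_latency(hops, service_time, utilisation)
    return comp_time / (serial + parallel + comm)

if __name__ == "__main__":
    # assumed workload: 95% parallel, average hop count ~ sqrt(cores) for a mesh
    for n in (4, 16, 64, 256):
        print(n, "cores ->",
              round(speedup(0.95, n, 1e6, 2048, n ** 0.5, 4, 0.6), 1))
```

    The communication term grows with system size through the average hop count, which is how such a model can expose an optimal system size beyond which adding cores no longer pays off.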

    Doctor of Philosophy

    Portable electronic devices will be limited to the available energy of existing battery chemistries for the foreseeable future. However, the systems-on-chip (SoCs) used in these devices are under demand to offer more functionality and increased battery life. A difficult problem in SoC design is providing energy-efficient communication between its components while maintaining the required performance. This dissertation introduces a novel energy-efficient network-on-chip (NoC) communication architecture. A NoC is used within complex SoCs due to its superior performance, energy usage, modularity, and scalability over traditional bus and point-to-point methods of connecting SoC components. This is the first academic research that combines asynchronous NoC circuits, a focus on energy-efficient design, and a software framework to customize a NoC for a particular SoC. Its key contribution is demonstrating that a simple, asynchronous NoC concept is a good match for low-power devices and is a fruitful area for additional investigation. The proposed NoC is energy-efficient in several ways: simple switch and arbitration logic, low port radix, latch-based router buffering, a topology with the minimum number of 3-port routers, and the asynchronous advantages of zero dynamic power consumption while idle and the lack of a clock tree. The tool framework developed for this work uses novel methods to optimize the topology and router floorplan based on simulated annealing and force-directed movement. It studies link pipelining techniques that yield improved throughput in an energy-efficient manner. A simulator is automatically generated for each customized NoC, and its traffic generators use a self-similar message distribution, as opposed to Poisson, to better match application behavior. Compared to a conventional synchronous NoC, this design is superior, achieving comparable message latency with half the energy.
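    As a rough illustration of force-directed placement in this spirit, the sketch below pulls each router toward the cores it serves and pushes routers apart from one another; the force constants, iteration count, and data layout are assumptions for this example and do not reproduce the tool's actual algorithm.

```python
# Illustrative sketch only: force-directed placement of NoC routers. Routers are
# attracted to the cores attached to them and repelled from other routers; the
# constants and termination condition are assumed for demonstration.

def force_directed(routers, cores, links, iters=200, k_attract=0.05, k_repel=50.0):
    """routers/cores: dicts name -> [x, y]; links: (router, core) pairs."""
    for _ in range(iters):
        forces = {r: [0.0, 0.0] for r in routers}
        for r, c in links:                      # attraction toward attached cores
            dx = cores[c][0] - routers[r][0]
            dy = cores[c][1] - routers[r][1]
            forces[r][0] += k_attract * dx
            forces[r][1] += k_attract * dy
        for r in routers:                       # repulsion between routers
            for s in routers:
                if r == s:
                    continue
                dx = routers[r][0] - routers[s][0]
                dy = routers[r][1] - routers[s][1]
                d2 = dx * dx + dy * dy + 1e-6
                forces[r][0] += k_repel * dx / d2
                forces[r][1] += k_repel * dy / d2
        for r in routers:                       # move routers along the net force
            routers[r][0] += forces[r][0]
            routers[r][1] += forces[r][1]
    return routers

if __name__ == "__main__":
    routers = {"r0": [1.0, 1.0], "r1": [1.5, 1.0]}
    cores = {"c0": [0.0, 0.0], "c1": [2.0, 0.0], "c2": [1.0, 2.0]}
    links = [("r0", "c0"), ("r0", "c2"), ("r1", "c1"), ("r1", "c2")]
    print(force_directed(routers, cores, links))
```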

    Least Upper Delay Bound for VBR Flows in Networks-on-Chip with Virtual Channels

    Real-time applications such as multimedia and gaming require stringent performance guarantees, usually enforced by a tight upper bound on the maximum end-to-end delay. For FIFO-multiplexed on-chip packet-switched networks, we consider worst-case delay bounds for Variable Bit-Rate (VBR) flows under aggregate scheduling, which schedules multiple flows as a single aggregate flow. VBR flows are characterized by a maximum transfer size, peak rate, burstiness, and average sustainable rate. Based on network calculus, we present and prove theorems to derive per-flow end-to-end Equivalent Service Curves (ESC), which are in turn used to compute Least Upper Delay Bounds (LUDBs) for individual flows. In a realistic case study, we find that the end-to-end delay bound is up to 46.9% more accurate than a bound that does not consider the traffic's peak behavior, and results show similar improvements for synthetic traffic patterns. The proposed methodology is implemented in C++ and has low run-time complexity, enabling quick evaluation of large and complex SoCs.
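    For context, the sketch below computes the classic single-node network-calculus delay bound for a VBR (dual token-bucket) flow crossing a rate-latency server, i.e. the horizontal deviation between arrival and service curves; it is the textbook bound, not the per-flow ESC analysis derived in this work, and all numeric parameters are assumed.

```python
# Illustrative sketch only: basic network-calculus delay bound for a VBR flow,
# computed numerically as the horizontal deviation between a dual token-bucket
# arrival curve and a rate-latency service curve. Assumes R >= rho (stability).

def arrival(t, L, p, sigma, rho):
    """VBR arrival curve: min(L + p*t, sigma + rho*t)."""
    return min(L + p * t, sigma + rho * t)

def delay_bound(L, p, sigma, rho, R, T, horizon=100.0, step=0.01):
    """Worst-case delay: sup over t of the time the rate-latency service curve
    beta(t) = R*max(0, t - T) needs to catch up with arrival(t)."""
    worst, t = 0.0, 0.0
    while t < horizon:
        target = arrival(t, L, p, sigma, rho)
        catch_up = T + target / R       # earliest s with beta(s) >= target
        worst = max(worst, catch_up - t)
        t += step
    return worst

if __name__ == "__main__":
    # assumed values: 4-flit max transfer, peak 1 flit/cycle, burstiness 16 flits,
    # sustainable 0.2 flit/cycle; server guarantees 0.5 flit/cycle after 6 cycles
    print("delay bound (cycles):",
          round(delay_bound(L=4, p=1.0, sigma=16, rho=0.2, R=0.5, T=6), 2))
```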

    Comparing energy and latency of asynchronous and synchronous NoCs for embedded SoCs

    Power consumption of on-chip interconnects is a primary concern for many embedded system-on-chip (SoC) applications. In this paper, we compare the energy and performance characteristics of asynchronous (clockless) and synchronous network-on-chip implementations, optimized for a number of SoC designs. We adapted the COSI-2.0 framework with ORION 2.0 router and wire models for synchronous network generation. Our own tool, ANetGen, specifies the asynchronous network by determining the topology with simulated annealing and the router locations with force-directed placement. It uses energy and delay models from our 65 nm bundled-data router design. SystemC simulations varied traffic burstiness using the self-similar b-model. Results show that the asynchronous network provided lower median and maximum message latency, especially under bursty traffic, and used far less router energy with a slight overhead for the inter-router wires.
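    A minimal sketch of b-model traffic generation is shown below: a traffic volume is recursively split unevenly across the two halves of the time interval, producing bursty, self-similar per-slot message counts; the bias parameter and volumes are assumptions for this example, not the settings used in the paper's SystemC simulations.

```python
# Illustrative sketch only: self-similar traffic generation with the b-model.
# A fraction b of the traffic volume goes to one (randomly chosen) half of the
# interval and 1-b to the other, recursively; b = 0.5 would give uniform traffic,
# larger b gives burstier traces. Parameter values below are assumed.
import random

def b_model(volume, slots, b=0.7):
    """Distribute `volume` messages over `slots` time slots (slots: power of two)."""
    if slots == 1:
        return [volume]
    left = volume * (b if random.random() < 0.5 else 1.0 - b)
    return b_model(left, slots // 2, b) + b_model(volume - left, slots // 2, b)

if __name__ == "__main__":
    trace = b_model(volume=10_000, slots=16, b=0.7)
    print([round(v) for v in trace])   # bursty per-slot message counts
```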