1,081 research outputs found

    Cycle-accurate evaluation of reconfigurable photonic networks-on-chip

    Get PDF
    There is little doubt that the most important limiting factors of the performance of next-generation Chip Multiprocessors (CMPs) will be the power efficiency and the available communication speed between cores. Photonic Networks-on-Chip (NoCs) have been suggested as a viable route to relieve the off- and on-chip interconnection bottleneck. Low-loss integrated optical waveguides can transport very high-speed data signals over longer distances as compared to on-chip electrical signaling. In addition, with the development of silicon microrings, photonic switches can be integrated to route signals in a data-transparent way. Although several photonic NoC proposals exist, their use is often limited to the communication of large data messages due to a relatively long set-up time of the photonic channels. In this work, we evaluate a reconfigurable photonic NoC in which the topology is adapted automatically (on a microsecond scale) to the evolving traffic situation by use of silicon microrings. To evaluate this system's performance, the proposed architecture has been implemented in a detailed full-system cycle-accurate simulator which is capable of generating realistic workloads and traffic patterns. In addition, a model was developed to estimate the power consumption of the full interconnection network which was compared with other photonic and electrical NoC solutions. We find that our proposed network architecture significantly lowers the average memory access latency (35% reduction) while only generating a modest increase in power consumption (20%), compared to a conventional concentrated mesh electrical signaling approach. When comparing our solution to high-speed circuit-switched photonic NoCs, long photonic channel set-up times can be tolerated which makes our approach directly applicable to current shared-memory CMPs

    Using an FPGA for Fast Bit Accurate SoC Simulation

    Get PDF
    In this paper we describe a sequential simulation method to simulate large parallel homo- and heterogeneous systems on a single FPGA. The method is applicable for parallel systems were lengthy cycle and bit accurate simulations are required. It is particularly designed for systems that do not fit completely on the simulation platform (i.e. FPGA). As a case study, we use a Network-on-Chip (NoC) that is simulated in SystemC and on the described FPGA simulator. This enables us to observe the NoC behavior under a large variety of traffic patterns. Compared with the SystemC simulation we achieved a factor 80-300 of speed improvement, without compromising the cycle and bit level accuracy

    Fast, Accurate and Detailed NoC Simulations

    Get PDF
    Network-on-Chip (NoC) architectures have a wide variety of parameters that can be adapted to the designer's requirements. Fast exploration of this parameter space is only possible at a high-level and several methods have been proposed. Cycle and bit accurate simulation is necessary when the actual router's RTL description needs to be evaluated and verified. However, extensive simulation of the NoC architecture with cycle and bit accuracy is prohibitively time consuming. In this paper we describe a simulation method to simulate large parallel homogeneous and heterogeneous network-on-chips on a single FPGA. The method is especially suitable for parallel systems where lengthy cycle and bit accurate simulations are required. As a case study, we use a NoC that was modelled and simulated in SystemC. We simulate the same NoC on the described FPGA simulator. This enables us to observe the NoC behavior under a large variety of traffic patterns. Compared with the SystemC simulation we achieved a speed-up of 80-300, without compromising the cycle and bit level accuracy

    An Energy and Performance Exploration of Network-on-Chip Architectures

    Get PDF
    In this paper, we explore the designs of a circuit-switched router, a wormhole router, a quality-of-service (QoS) supporting virtual channel router and a speculative virtual channel router and accurately evaluate the energy-performance tradeoffs they offer. Power results from the designs placed and routed in a 90-nm CMOS process show that all the architectures dissipate significant idle state power. The additional energy required to route a packet through the router is then shown to be dominated by the data path. This leads to the key result that, if this trend continues, the use of more elaborate control can be justified and will not be immediately limited by the energy budget. A performance analysis also shows that dynamic resource allocation leads to the lowest network latencies, while static allocation may be used to meet QoS goals. Combining the power and performance figures then allows an energy-latency product to be calculated to judge the efficiency of each of the networks. The speculative virtual channel router was shown to have a very similar efficiency to the wormhole router, while providing a better performance, supporting its use for general purpose designs. Finally, area metrics are also presented to allow a comparison of implementation costs

    Exploiting Properties of CMP Cache Traffic in Designing Hybrid Packet/Circuit Switched NoCs

    Get PDF
    Chip multiprocessors with few to tens of processing cores are already commercially available. Increased scaling of technology is making it feasible to integrate even more cores on a single chip. Providing the cores with fast access to data is vital to overall system performance. When a core requires access to a piece of data, the core's private cache memory is searched first. If a miss occurs, the data is looked up in the next level(s) of the memory hierarchy, where often one or more levels of cache are shared between two or more cores. Communication between the cores and the slices of the on-chip shared cache is carried through the network-on-chip(NoC). Interestingly, the cache and NoC mutually affect the operation of each other; communication over the NoC affects the access latency of cache data, while the cache organization generates the coherence and data messages, thus affecting the communication patterns and latency over the NoC. This thesis considers hybrid packet/circuit switched NoCs, i.e., packet switched NoCs enhanced with the ability to configure circuits. The communication and performance benefit that come from using circuits is predicated on amortizing the time cost incurred for configuring the circuits. To address this challenge, NoC designs are proposed that take advantage of properties of the cache traffic, namely temporal locality and predictability, to amortize or hide the circuit configuration time cost. First, a coarse-grained circuit configuration policy is proposed that exploits the temporal locality in the cache traffic to periodically configure circuits for the heavily communicating nodes. This allows the design of a locality-aware cache that promotes temporal communication locality through data placement, while designing suitable data replacement and migration policies. Next, a fine-grained configuration policy, called Déjà Vu switching, is proposed for leveraging predictability of data messages by initiating a circuit configuration as soon as a cache hit is detected and before the data becomes available. Its benefit is demonstrated for saving interconnect energy in multi-plane NoCs. Finally, a more proactive configuration policy is proposed for fast caches, where circuit reservations are initiated by request messages, which can greatly improve communication latency and system performance

    Exploration within the Network-on-Chip Paradigm

    Get PDF
    A general purpose processor used to consist of a single processing core, which performed and controlled all tasks on the chip. Its functionality and maximum clock frequency grew steadily over the years. Due to the continuous increase of the number of transistors available on-chip and the operational clock frequency, it became impossible to reach every function within the chip in a single clock cycle. Furthermore, centralized control becomes hard with the increase in functionality. This lead to the split of the processing into a set of independent processing cores integrated into a single chip.\ud These multi-core architectures will rely on a well designed on-chip communication architecture. Global wires and bus-based systems need to be replaced to overcome the problem of wiring and the single point of arbitration. This is introduced as the Network-on-Chip (NoC) paradigm. Most of the communication architectures classified as a NoC are a network of routers on-chip, but the paradigm embodies a broader scope. The paradigm enables the sharing of on-chip wiring resources for multiple communication streams to reduce the total wiring required. Furthermore, it enables concurrent communication of concurrently handled data packets. The latter is in contrast to the central arbitration and single communication channel in bus-based systems.\ud In this thesis we explore the paradigm by implementation and characterization of multiple NoC router architectures. The scope of the communication architecture is the embedding in a heterogeneous multi-core System-on-Chip (SoC) for streaming applications. Six streaming applications, which are used in mobile devices, are analysed. Their common communication characteristics and specific bandwidth requirements are presented. One of the major constraints of these applications is the requirement of Quality of Service (QoS) for the interprocess communication.\ud \ud Based on application analysis we propose a circuit switched router architecture as opposed to a more flexible packet switched router architecture. The reason for this architecture is the observation that communication patterns in the applications are static. The circuit switched network is integrated in an ARM based heterogeneous reconfigurable multi-core SoC realized in a 0.13 μm CMOS technology.\ud \ud Besides this architecture, an existing packet switched router architecture, that also offers QoS, is improved and compared with the circuit switched router. Next to the exploration of those two router designs, two other packet switched routers, designed at the University of Cambridge, are included in the in-depth comparison. The four routers are placed and routed in 90 nm CMOS technology. The required buffering dominates the resource usage of all packet switched routers, which is significantly reduced in a circuit switched architecture. However, the latter pays a penalty by a larger required crossbar and reduced flexibility.\ud \ud The four routers are also compared for their latency performance and energy consumption. For latency the packet switched networks are simulated with popular synthetic traffic scenarios. The circuit switched router has a deterministic latency, due to the congestion free routes.\ud \ud The latency analysis shows the higher network utilization for NoCs using virtual channel flow control over wormhole flow control. Furthermore, the allocation mechanisms used in the improved packet switched router, cause a higher latency for randomly distributed packets compared to the router with speculation logic that is tailored for this type of traffic. Despite its higher latency for random traffic, the packet switched network is able to give end-to-end latency guarantees for specific connections, due to deterministic arbitration, as is shown in this thesis.\ud For the power analysis we compared the four routers using various traffic scenarios. One of the first observations is a high power consumption in idle mode, where no data is transported. The clock-tree and the connected synchronous elements consume the majority of the power. A minor part is the static power, which is directly related to the router's required chip area. Automatic insertion of fine-grain clock gating tremendously reduces this idle dynamic power consumption. With clock-gating, both the static and dynamic component have an equal share in the idle power at a clock frequency of 200 MHz.\ud \ud The increase in dynamic power consumption is directly related to the number of packets that are transported over the network and the amount of bit flips, i.e. activity, in the payload. Transportation of random payload, i.e. 25% activity, requires almost a factor three more in comparison with a payload of constant values, i.e. all bits inactive. Random activity is observed in the analysed streaming applications for most of the intermediate data. The buffer size has no influence on the packet's dynamic energy consumption, due to the fine-grain clock gating, which makes the packet switched routers as energy efficient as the circuit switched router. Most of the difference in energy consumption between the routers, is caused by the different crossbar dimensions and the extra bits in a packet which are required for routing and allocation. The larger crossbar is required for the circuit switched router to add flexibility, and for the improved packet switched router to enable QoS. A marginal increase in energy consumption is caused by the network congestion.\ud \ud During the design of the heterogeneous SoC architecture as well as the evaluation of the packet switched routers, we were hampered by the prohibitive simulation times of the architecture's bit and cycle accurate models. Motivated by simulation speed-ups of an FPGA in a Hardware-in-the-Loop (HIL) simulation, we developed a framework to simulate large many-core architectures on a single FPGA. Instead of the instantiation of the whole architecture in parallel in the FPGA, the individual cores are evaluated sequentially. Each core is modified such that the core's internal state and combinational functionality are separated.\ud \ud As all cores in a homogeneous many-core architecture are identical, we can construct a single hyper core, that embodies all combinational functionality of a single core. The state of the whole architecture, stored in the FPGA's memory blocks, is updated sequentially by offering a core's old state to the hyper core and store its new state. Using the sequential simulation approach in an FPGA, we are able to simulate two to three orders of magnitude faster compared to cycle and bit-accurate simulations in software.\u

    Implementation and Evaluation of an NoC Architecture for FPGAs

    Get PDF
    The Networks-on-Chip (NoC) approach for designing Systems-on-Chip (SoC) is currently emerging as an advanced concept for overcoming the scalability and efficiency problems of traditional bus-based systems. A great deal of theoretical research has been done in this area that provides good insight and shows promising results. There is a great need for research in hardware implementation of NoC-based systems to determine the feasibility of implementing various topologies and protocols, and also to accurately determine what design tradeoffs are involved in NoC implementation. This thesis addresses the challenges of implementing an NoC-based system on FPGAs for running real benchmark applications. The NoC used a mesh topology and circuit-switched communication protocol. An experimental framework was developed that allowed implementation of NoC-based system from a high level specification, using the Celoxica Handel-C hardware description language. Two test applications: charged couple device (CCD) and JPEG were developed in Handel-C to be used as our benchmark applications. Both benchmarks are computational expensive and require large quantities of data transfer that will test the NoC system. Implementation results show that the NoC-based system gives superior area utilization and speed performance compared to the bus-based system, running the same benchmarks
    corecore