309 research outputs found

    Cycle-accurate evaluation of reconfigurable photonic networks-on-chip

    Get PDF
    There is little doubt that the most important limiting factors of the performance of next-generation Chip Multiprocessors (CMPs) will be the power efficiency and the available communication speed between cores. Photonic Networks-on-Chip (NoCs) have been suggested as a viable route to relieve the off- and on-chip interconnection bottleneck. Low-loss integrated optical waveguides can transport very high-speed data signals over longer distances as compared to on-chip electrical signaling. In addition, with the development of silicon microrings, photonic switches can be integrated to route signals in a data-transparent way. Although several photonic NoC proposals exist, their use is often limited to the communication of large data messages due to a relatively long set-up time of the photonic channels. In this work, we evaluate a reconfigurable photonic NoC in which the topology is adapted automatically (on a microsecond scale) to the evolving traffic situation by use of silicon microrings. To evaluate this system's performance, the proposed architecture has been implemented in a detailed full-system cycle-accurate simulator which is capable of generating realistic workloads and traffic patterns. In addition, a model was developed to estimate the power consumption of the full interconnection network which was compared with other photonic and electrical NoC solutions. We find that our proposed network architecture significantly lowers the average memory access latency (35% reduction) while only generating a modest increase in power consumption (20%), compared to a conventional concentrated mesh electrical signaling approach. When comparing our solution to high-speed circuit-switched photonic NoCs, long photonic channel set-up times can be tolerated which makes our approach directly applicable to current shared-memory CMPs

    Clock Synchronisation Assisted Clock and Data Recovery for Sub-Nanosecond Data Centre Optical Switching

    Get PDF
    In current `Cloud' data centres, switching of data between servers is performed using deep hierarchies of interconnected electronic packet switches. Demand for network bandwidth from emerging data centre workloads, combined with the slowing of silicon transistor scaling, is leading to a widening gap between data centre traffic demand and electronically-switched data centre network capacity. All-optical switches could offer a future-proof alternative, with potentially under a third of the power consumption and cost of electronically-switched networks. However, the effective bandwidth of optical switches depends on their overall switching time. This is dominated by the clock and data recovery (CDR) locking time, which takes hundreds of nanoseconds in commercial receivers. Current data centre traffic is dominated by small packets that transmit in tens of nanoseconds, leading to low effective bandwidth, as a high proportion of receiver time is spent performing CDR locking instead of receiving data, removing the benefits of optical switching. High-performance optical switching requires sub-nanosecond CDR locking time to overcome this limitation. This thesis proposes, models, and demonstrates clock synchronisation assisted CDR, which can achieve this. This approach uses clock synchronisation to simplify the complexity of CDR versus previous asynchronous approaches. An analytical model of the technique is first derived that establishes its potential viability. Following this, two approaches to clock synchronisation assisted CDR are investigated: 1. Clock phase caching, which uses clock phase storage and regular updates in a 2km intra-building scale data centre network interconnected by single-mode optical fibre. 2. Single calibration clock synchronisation assisted CDR}, which leverages the 20 times lower thermal sensitivity of hollow core optical fibre versus single-mode fibre to synchronise a 100m cluster scale data centre network, with a single initial phase calibration step. Using a real-time FPGA-based optical switch testbed, sub-nanosecond CDR locking time was demonstrated for both approaches

    Synchronous subnanosecond clock and data recovery for optically switched data centres using clock phase caching

    Get PDF
    The rapid growth in the amount of data being transferred within data centres, combined with the slowdown in Moore’s Law, creates challenges for the future scalability of electronically switched data-centre networks. Optical switches could offer a future-proof alternative, and photonic integration platforms have been demonstrated with nanosecond-scale optical switching times. End-to-end switching time is, however, currently limited by the clock and data recovery time, which typically takes microseconds, removing the benefits of nanosecond optical switching. Here we show that a clock phase caching technique can provide clock and data recovery times of under 625 ps (16 symbols at 25.6 Gb s−1). Our approach uses the measurement and storage of clock phase values in a synchronized network to simplify clock and data recovery versus conventional asynchronous approaches. We demonstrate the capabilities of our technique using a real-time prototype with commercial transceivers and validate its resilience against temperature variation and clock jitter

    Towards zero latency photonic switching in shared memory networks

    Get PDF
    Photonic networks-on-chip based on silicon photonics have been proposed to reduce latency and power consumption in future chip multi-core processors (CMP). However, high performance CMPs use a shared memory model which generates large numbers of short messages, creating high arbitration latency overhead for photonic switching networks. In this paper we explore techniques which intelligently use information from the memory hierarchy to predict communication in order to setup photonic circuits with reduced or eliminated arbitration latency. Firstly, we present a switch scheduling algorithm which arbitrates on a per memory transaction basis and holds open photonic circuits to exploit temporal locality. We show that this can reduce the average arbitration latency overhead by 60% and eliminate arbitration latency altogether for a signi cant proportion of memory transactions. We then show how this technique can be applied to multiple-socket shared memory systems with low latency and energy consumption penalties. Finally, we present ideas and initial results to demonstrate that cache miss prediction could be used to set up photonic circuits for more complex memory transactions and main memory accesses

    Low-power reconfigurable network architecture for on-chip photonic interconnects

    Get PDF
    Photonic Networks-On-Chip have emerged as a viable solution for interconnecting multicore computer architectures in a power-efficient manner. Current architectures focus on large messages, however, which are not compatible with the coherence traffic found on chip multiprocessor networks. In this paper, we introduce a reconfigurable optical interconnect in which the topology is adapted automatically to the evolving traffic situation. This allows a large fraction of the (short) coherence messages to use the optical links, making our technique a better match for CMP networks when compared to existing solutions. We also evaluate the performance and power efficiency of our architecture using an assumed physical implementation based on ultra-low power optical switching devices and under realistic traffic load conditions

    Equalizer State Caching for Fast Data Recovery in Optically-Switched Data Center Networks

    Get PDF
    Optical switching offers the potential to significantly scale the capacity of data center networks (DCN) with a simultaneous reduction in switching time and power consumption. Previous research has shown that end-to-end switching time, which is the sum of the switch configuration time and the clock and data recovery (CDR) locking time, should be kept within a few nanoseconds for high network throughput. This challenge of low switching time has motivated research into fast optical switches, ultra-fast clock and amplitude recovery techniques. Concurrently, the data rate between server-to-server and server-to-switch interconnect is increasing drastically from the current 100 Gb/s (4×25 Gb/s) to 400 Gb/s and beyond, motivating the use of high order formats such as 50-GBaud four-level pulse-amplitude modulation (PAM-4) for signalling. Since PAM-4 is more sensitive to noise and distortion, digital equalizers are generally needed to compensate for impairments such as transceiver frequency rolloff, dispersion and optical filtering, adding additional time for equalizer adaptation and power consumption that are undesired for fast optical switching systems. Here we propose and investigate an equalizer state caching technique that reduces equalizer adaptation time and computation power consumption for fast optical switching systems, underpinning optically-switched DCNs using high baud rate and impairment-sensitive formats. Through a proof-of-concept experiment, we study the performance of the proposed equalizer state caching scheme in a three-node optical switching system using 56 GBaud PAM-4. Our experimental results show that the proposed scheme can tolerate up to 0.8-nm (100-GHz) instantaneous wavelength change with an adaptation delay of only 0.36 ns. Practical considerations such as clock phase misalignment, temperature-induced wavelength drift, and equalizer precision are also studied

    Bringing communication networks on a chip: test and verification implications

    Full text link

    Tightly-Coupled and Fault-Tolerant Communication in Parallel Systems

    Full text link
    The demand for processing power is increasing steadily. In the past, single processor architectures clearly dominated the markets. As instruction level parallelism is limited in most applications, significant performance can only be achieved in the future by exploiting parallelism at the higher levels of thread or process parallelism. As a consequence, modern “processors” incorporate multiple processor cores that form a single shared memory multiprocessor. In such systems, high performance devices like network interface controllers are connected to processors and memory like every other input/output device over a hierarchy of peripheral interconnects. Thus, one target must be to couple coprocessors physically closer to main memory and to the processors of a computing node. This removes the overhead of today’s peripheral interconnect structures. Such a step is the direct connection of HyperTransport (HT) devices to Opteron processors, which is presented in this thesis. Also, this work analyzes how communication from a device to processors can be optimized on the protocol level. As today’s computing nodes are shared memory systems, the cache coherence protocol is the central protocol for data exchange between processors and devices. Consequently, the analysis extends to classes of devices that are cache coherence protocol aware. Also, the concept of a transfer cache is proposed in this thesis, which reduces latency significantly even for non-coherent devices. The trend to the exploitation of process and thread level parallelism leads to a steady increase of system sizes. Networks that are used in such large systems are very susceptible to both hard and transient faults. Most transient fault rates are constant per bit that is stored or transmitted. With increasing system sizes and higher clock frequencies, the number of faults in time increases drastically. In the end, the error rate may rise at a level where high level error recovery becomes too costly if lower layers do not perform error correction that is transparent to the layers above. The second part of this thesis describes a direct interconnection network that provides a reliable transport service even without the use of end-to-end protocols. Also, a novel hardware based solution for intermediate routing is developed in this thesis, which allows an efficient, deadlock free routing around faulty links

    The AURORA Gigabit Testbed

    Get PDF
    AURORA is one of five U.S. networking testbeds charged with exploring applications of, and technologies necessary for, networks operating at gigabit per second or higher bandwidths. The emphasis of the AURORA testbed, distinct from the other four testbeds, BLANCA, CASA, NECTAR, and VISTANET, is research into the supporting technologies for gigabit networking. Like the other testbeds, AURORA itself is an experiment in collaboration, where government initiative (in the form of the Corporation for National Research Initiatives, which is funded by DARPA and the National Science Foundation) has spurred interaction among pre-existing centers of excellence in industry, academia, and government. AURORA has been charged with research into networking technologies that will underpin future high-speed networks. This paper provides an overview of the goals and methodologies employed in AURORA, and points to some preliminary results from our first year of research, ranging from analytic results to experimental prototype hardware. This paper enunciates our targets, which include new software architectures, network abstractions, and hardware technologies, as well as applications for our work

    Automated application-specific optimisation of interconnects in multi-core systems

    Get PDF
    In embedded computer systems there are often tasks, implemented as stand-alone devices, that are both application-specific and compute intensive. A recurring problem in this area is to design these application-specific embedded systems as close to the power and efficiency envelope as possible. Work has been done on optimizing singlecore systems and memory organisation, but current methods for achieving system design goals are proving limited as the system capabilities and system size increase in the multi- and many-core era. To address this problem, this thesis investigates machine learning approaches to managing the design space presented in the interconnect design of embedded multi-core systems. The design space presented is large due to the system scale and level of interconnectivity, and also feature inter-dependant parameters, further complicating analysis. The results presented in this thesis demonstrate that machine learning approaches, particularly wkNN and random forest, work well in handling the complexity of the design space. The benefits of this approach are in automation, saving time and effort in the system design phase as well as energy and execution time in the finished system
    corecore