65 research outputs found
Accelerating Fully Connected Neural Network on Optical Network-on-Chip (ONoC)
Fully Connected Neural Network (FCNN) is a class of Artificial Neural
Networks widely used in computer science and engineering, whereas the training
process can take a long time with large datasets in existing many-core systems.
Optical Network-on-Chip (ONoC), an emerging chip-scale optical interconnection
technology, has great potential to accelerate the training of FCNN with low
transmission delay, low power consumption, and high throughput. However,
existing methods based on Electrical Network-on-Chip (ENoC) cannot fit in ONoC
because of the unique properties of ONoC. In this paper, we propose a
fine-grained parallel computing model for accelerating FCNN training on ONoC
and derive the optimal number of cores for each execution stage with the
objective of minimizing the total amount of time to complete one epoch of FCNN
training. To allocate the optimal number of cores for each execution stage, we
present three mapping strategies and compare their advantages and disadvantages
in terms of hotspot level, memory requirement, and state transitions.
Simulation results show that the average prediction error for the optimal
number of cores in NN benchmarks is within 2.3%. We further carry out extensive
simulations which demonstrate that FCNN training time can be reduced by 22.28%
and 4.91% on average using our proposed scheme, compared with traditional
parallel computing methods that either allocate a fixed number of cores or
allocate as many cores as possible, respectively. Compared with ENoC,
simulation results show that under batch sizes of 64 and 128, on average ONoC
can achieve 21.02% and 12.95% on reducing training time with 47.85% and 39.27%
on saving energy, respectively.Comment: 14 pages, 10 figures. This paper is under the second review of IEEE
Transactions of Computer
Pulsar: Design and Simulation Methodology for Dynamic Bandwidth Allocation in Photonic Network-on-Chip Architectures in Heterogeneous Multicore Systems
As the computing industry moved toward faster and more energy-efficient solutions, multicore computers proved to be dependable. Soon after, the Network-on-Chip (NoC) paradigm made headway as an effective method of connecting multiple cores on a single chip. These on-chip networks have been used to relay communication between homogeneous and heterogeneous sets of cores and core clusters. However, the variation in bandwidth requirements of heterogeneous systems is often neglected. Therefore, at a given moment, bandwidth may be in excess at one node while it is insufficient at another leading to lower performance and higher energy costs. This work proposes and examines dynamic schemes for the allocation of photonic channels in a Photonic Network-on-Chip (PNoC) as an alternative to their static-provision counterparts and proposes a method of simulating and selecting the characteristics of a dynamic system at the time of design as to achieve maximum system performance in a Photonic Network-on-Chip for a given application type
Min/max time limits and energy penalty of communication scheduling in ring-based ONoC
International audienceRecent advances in the photonics devices integration bring ONoC as a bridge future for communication media in the MPSoC domain. As ONoC can support Wavelength Division Multiplexing (WDM) technique, communications between cores can be improved through allocation of one or several wavelengths for each communication. However, WDM introduces wavelength crosstalk, requiring to increase the laser power to provide accurate communication between cores. Thus, for the designer, exploring this design space (execution time vs power consumption) is not an easy task due to a large number of wavelength allocation combinations. The contribution presented in this paper proposes to evaluate the two extreme bounds of this design space considering the different communication scenario. To address this problem, we model the wavelength allocation by two different objective functions to compute the bounds in terms of execution times. Furthermore, from an accurate model of crosstalk between the wavelengths, we compute the energy penalty for each communication scenario. The results presented in this paper highlight the execution time and energy consumption tradeoff, and the opportunity for communication optimisation thanks to an efficient use of WDM technique
Recommended from our members
Next Generation Silicon Photonic Transceiver: From Device Innovation to System Analysis
Silicon photonics is recognized as a disruptive technology that has the potential to reshape many application areas, for example, data center communication, telecommunications, high-performance computing, and sensing. The key capability that silicon photonics offers is to leverage CMOS-style design, fabrication, and test infrastructure to build compact, energy-efficient, and high-performance integrated photonic systems-on- chip at low cost. As the need to squeeze more data into a given bandwidth and a given footprint increases, silicon photonics becomes more and more promising. This work develops and demonstrates novel devices, methodologies, and architectures to resolve the challenges facing the next-generation silicon photonic transceivers. The first part of this thesis focuses on the topology optimization of passive silicon photonic devices. Specifically, a novel device optimization methodology - particle swarm optimization in conjunction with 3D finite-difference time-domain (FDTD), has been proposed and proven to be an effective way to design a wide range of passive silicon photonic devices. We demonstrate a polarization rotator and a 90â—¦ optical hybrid for polarization-diversity and phase-diversity communications - two important schemes to increase the communication capacity by increasing the spectral efficiency. The second part of this thesis focuses on the design and characterization of the next- generation silicon photonic transceivers. We demonstrate a polarization-insensitive WDM receiver with an aggregate data rate of 160 Gb/s. This receiver adopts a novel architecture which effectively reduces the polarization-dependent loss. In addition, we demonstrate a III-V/silicon hybrid external cavity laser with a tuning range larger than 60 nm in the C-band on a silicon-on-insulator platform. A III-V semiconductor gain chip is hybridized into the silicon chip by edge-coupling to the silicon chip. The demonstrated packaging method requires only passive alignment and is thus suitable for high-volume production. We also demonstrate all silicon-photonics-based transmission of 34 Gbaud (272 Gb/s) dual-polarization 16-QAM using our integrated laser and silicon photonic coherent transceiver. The results show no additional penalty compared to commercially available narrow linewidth tunable lasers. The last part of this thesis focuses on the chip-scale optical interconnect and presents two different types of reconfigurable memory interconnects for multi-core many-memory computing systems. These reconfigurable interconnects can effectively alleviate the memory access issues, such as non-uniform memory access, and Network-on-Chip (NoC) hot-spots that plague the many-memory computing systems by dynamically directing the available memory bandwidth to the required memory interface
High-Performance and Wavelength-Reused Optical Network on Chip (ONoC) Architectures and Communication Schemes for Manycore Processor
Optical Network on Chip (ONoC) is an emerging chip-scale optical interconnection technology to realize the high-performance and power-efficient inter-core communication for many-core processors. By utilizing the silicon photonic interconnects to transmit data packets with optical signals, it can achieve ultra low communication delay, high bandwidth capacity, and low power dissipation. With the benefits of Wavelength Division Multiplexing (WDM), multiple optical signals can simultaneously be transmitted in the same optical interconnect through different wavelengths. Thus, the WDM-based ONoC is becoming a hot research topic recently. However, the maximal number of available wavelengths is restricted for the reliable and power-efficient optical communication in ONoC. Hence, with a limited number of wavelengths, the design of high-performance and power-efficient ONoC architecture is an important and challenging problem.
In this thesis, the design methodology of wavelength-reused ONoC architecture is explored. With the wavelength reuse scheme in optical routing paths, high-performance and power-efficient communication is realized for many-core processors only using a small number of available wavelengths. Three wavelength-reused ONoC architectures and communication schemes are proposed to fulfil different communication requirements, i.e., network scalability, multicast communication, and dark silicon.
Firstly, WRH-ONoC, a wavelength-reused hierarchical Optical Network on Chip architecture, is proposed to achieve high network scalability, namely obtaining low communication delay and high throughput capacity for hundreds of thousands of cores by reusing the limited number of available wavelengths with the modest hardware cost and energy overhead. WRH-ONoC combines the advantages of non-blocking communication in each lambda-router and wavelength reuse in all lambda-routers through the hierarchical networking. Both theoretical analysis and simulation results indicate that WRH-ONoC can achieve prominent improvement on the communication performance and scalability (e.g., 46.0% of reduction on the zero-load packet delay and 72.7% of improvement on the network throughput for 400 cores with small hardware cost and energy overhead) in comparison with existing schemes.
Secondly, DWRMR, a dynamical wavelength-reused multicast scheme based on the optical multicast ring, is proposed for widely existing multicast communications in many-core processors. In DWRMR, an optical multicast ring is dynamically constructed for each multicast group and the multicast packets are transmitted in a single-send-multi-receive manner requiring only one wavelength. All the cores in the same multicast group can reuse the established multicast ring through an optical token arbitration scheme for the interactive multicast communications, thereby avoiding the frequent construction of multicast routing paths dedicatedly for each core. Simulation results indicate that DWRMR can reduce more than 50% of end-to-end packet delay with slight hardware cost, or require only half number of wavelengths to achieve the same performance compared with existing schemes.
Thirdly, Dark-ONoC, a dynamically configurable ONoC architecture, is proposed for the many-core processor with dark silicon. Dark silicon is an inevitable phenomenon that only a small number of cores can be activated simultaneously while the other cores must stay in dark state (power-gated) due to the restricted power budget. Dark-ONoC periodically allocates non-blocking optical routing paths only between the active cores with as less wavelengths as possible. Thus, it can obtain high-performance communication and low power consumption at the same time. Extensive simulations are conducted with the dark silicon patterns from both synthetic distribution and real data traces. The simulation results indicate that the number of wavelengths is reduced by around 15% and the overall power consumption is reduced by 23.4% compared to existing schemes.
Finally, this thesis concludes several important principles on the design of wavelength-reused ONoC architecture, and summarizes some perspective issues for the future research
Resource and thermal management in 3D-stacked multi-/many-core systems
Continuous semiconductor technology scaling and the rapid increase in computational needs have stimulated the emergence of multi-/many-core processors. While up to hundreds of cores can be placed on a single chip, the performance capacity of the cores cannot be fully exploited due to high latencies of interconnects and memory, high power consumption, and low manufacturing yield in traditional (2D) chips. 3D stacking is an emerging technology that aims to overcome these limitations of 2D designs by stacking processor dies over each other and using through-silicon-vias (TSVs) for on-chip communication, and thus, provides a large amount of on-chip resources and shortens communication latency. These benefits, however, are limited by challenges in high power densities and temperatures.
3D stacking also enables integrating heterogeneous technologies into a single chip. One example of heterogeneous integration is building many-core systems with silicon-photonic network-on-chip (PNoC), which reduces on-chip communication latency significantly and provides higher bandwidth compared to electrical links. However, silicon-photonic links are vulnerable to on-chip thermal and process variations. These variations can be countered by actively tuning the temperatures of optical devices through micro-heaters, but at the cost of substantial power overhead.
This thesis claims that unearthing the energy efficiency potential of 3D-stacked systems requires intelligent and application-aware resource management. Specifically, the thesis improves energy efficiency of 3D-stacked systems via three major components of computing systems: cache, memory, and on-chip communication. We analyze characteristics of workloads in computation, memory usage, and communication, and present techniques that leverage these characteristics for energy-efficient computing.
This thesis introduces 3D cache resource pooling, a cache design that allows for flexible heterogeneity in cache configuration across a 3D-stacked system and improves cache utilization and system energy efficiency. We also demonstrate the impact of resource pooling on a real prototype 3D system with scratchpad memory.
At the main memory level, we claim that utilizing heterogeneous memory modules and memory object level management significantly helps with energy efficiency. This thesis proposes a memory management scheme at a finer granularity: memory object level, and a page allocation policy to leverage the heterogeneity of available memory modules and cater to the diverse memory requirements of workloads.
On the on-chip communication side, we introduce an approach to limit the power overhead of PNoC in (3D) many-core systems through cross-layer thermal management. Our proposed thermally-aware workload allocation policies coupled with an adaptive thermal tuning policy minimize the required thermal tuning power for PNoC, and in this way, help broader integration of PNoC. The thesis also introduces techniques in placement and floorplanning of optical devices to reduce optical loss and, thus, laser source power consumption.2018-03-09T00:00:00
Recommended from our members
Photonic Interconnects Beyond High Bandwidth
The extraordinary growth of parallelism in high-performance computing requires efficient data communication for scaling compute performance. High-performance computing systems have been using photonic links for communication of large bandwidth-distance product during the last decade. Photonic interconnection networks, however, should not be a wire-for-wire replacement based on conventional electrical counterparts. Features of photonics beyond high bandwidth, such as transparent bandwidth steering, can implement important functionalities needed by applications. In another aspect, application characteristics can be exploited to design better photonic interconnects. Therefore, this thesis explores codesign opportunities at the intersection between photonic interconnect architectures and high-performance computing applications. The key accomplishments of this thesis, ranging from system level to node level, are as follows.
Chapter 2 presents a system-level architecture that leverages photonic switching to enable a reconfigurable interconnect. The architecture, called Flexfly, reconfigures the inter-group level of the widely-used Dragonfly topology using information about the application’s communication pattern. It can steal additional direct bandwidth for communication-intensive group pairs. Simulations with applications such as GTC, Nekbone and LULESH show up to 1.8x speedup over Dragonfly paired with UGAL routing, along with halved hop count and latency for cross-group messages. To demonstrate the effectiveness of our approach, we built a 32-node Flexfly prototype using a silicon photonic switch connecting four groups and demonstrated 820 ns interconnect reconfiguration time. This is the first demonstration of silicon photonic switching and bandwidth steering in a high-performance computing cluster.
Chapter 3 extends photonic switching to the node level and presents a reconfigurable silicon photonic memory interconnect for many-core architectures. The interconnect targets at important memory access issues, such as network-on-chip hot-spots and non-uniform memory access. Integrated with the processor through 2.5D/3D stacking, a fast-tunable silicon photonic memory tunnel can transparently direct traffic from any off-chip memory to any on-chip interface – thus alleviating the hot-spot and non-uniform access effects. We demonstrated the operation of our proposed architecture using a tunable laser, a 4-port silicon photonic switch (four wavelength-routed memory channels) and a 4x4 mesh network-on-chip synthesized by FPGA. The emulated system achieves a 15-ns channel switching time. Simulations based on a 12-core 4-memory model show that for such switching speeds the interconnect system can realize a 2x speedup for the STREAM benchmark in the hot-spot scenario and a reduction of execution time for data-intensive applications such as 3D stencil and K-means clustering by 23% and 17%, respectively.
Chapters 4 explores application-level characteristics that can be exploited to hide photonic path setup delays. In view of the frequent reuse of optical circuits by many applications, we proposed a circuit-cached scheme that amortizes the setup overhead by maximizing circuit reuses. In order to improve circuit “hit” rates, we developed a reuse-distance based replacement policy called “Farthest Next Use”. We further investigated the tradeoffs between the realized hit rate and energy consumption. Finally, we experimentally demonstrated the feasibility of the proposed concept using silicon photonic devices in an FPGA-controlled network testbed.
Chapter 5 proceeds to develop an application-guided circuit-prefetch scheme. By learning temporal locality and communication patterns from upper-layer applications, the scheme not only caches a set of circuits for reuses, but also proactively prefetches circuits based on predictions. We applied this technique to communication patterns from a spectrum of science and engineering applications. The results show that setup delays via circuit misses are significantly reduced, showing how the proposed technique can improve circuit switching in photonic interconnects
Recommended from our members
Variation-Aware Modeling and Design of Nanophotonic Interconnects
Optical interconnects have started to replace electrical interconnects in the communications between racks and circuit boards with potential benefits in bandwidth, delay, power efficiency, and crosstalk. Silicon photonics has emerged to be a highly promising enabling technology for the short-reach nanophotonic interconnects because it offers favorable CMOS compatibility and high integration level. The fast-growing complexity of photonic integrated circuit (PIC) and close electro-optical integration call for computer-aided design (CAD) for integrated photonics, and electronic-photonic design automation (EPDA) including accurate behavior models and efficient simulation methodologies for integrated electro-optical systems. Also, the nanophotonic devices are highly sensitive to fabrication process variation and thermal variation effects, which requires proper modeling, optimization, and management schemes. To address these problems, this thesis is dedicated to the following two tasks: (1) compact modeling and circuit-level simulation of nanophotonic interconnects, and (2) power-efficient management of the variation effects in nanophotonic interconnects.The first part of the thesis develops compact models for key components in nanophotonic interconnects including silicon microring modulators, diode lasers, electro-absorption modulators (EAM), photodetectors, etc. These compact models are developed based on their electrical and optical properties, and are then extensively validated by measurement data. The model parameters are extracted from common electrical and optical tests. Implemented in Verilog-A, the models are used in SPICE simulations of optical links, whose results again agree well with measurement data. The compact model library and the simulation methodology enable electro-optical co-simulations and optical device design explorations in the circuit-level.In the second part of the thesis, we propose modeling methods and power-efficient management schemes for the process and thermal variations in optical interconnects. The proposed adaptive tuning technique performs on-chip self-tests and adaptively allocates just enough power for link operations. The technique saves significant amount of power compared to worst-case based conservative designs, and scales well w.r.t. variations and network size. We also design power-efficient pairing algorithms for microring-based optical interconnects. Our algorithms optimally mix-and-match microring-based devices to minimize the power consumption for tuning. The algorithms are tested on both measured and synthetic data sets, demonstrating promising results of power reduction and scalability for handling a large number of devices. Lastly, we decompose and analyze wafer-scale spatial patterns of process variations in microring modulators. We further investigate the correlations between the spatial patterns and fabrication process steps, which is valuable for understanding process variation sources and improving fabrication processes for uniformity
- …