78 research outputs found

    Design of a High-Density, Low-Power Transceiver for Next-Generation HBM (차세대 HBM 용 고집적, 저전력 송수신기 설계)

    Doctoral thesis (Ph.D.), Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, August 2020. 정덕균 (Deog-Kyoon Jeong).
    This thesis presents design techniques for a high-density, power-efficient transceiver for next-generation high-bandwidth memory (HBM). Unlike other memory interfaces, HBM uses a 3D-stacked package built with through-silicon vias (TSVs) and a silicon interposer, so an HBM transceiver must solve the problems that the 3D-stacked package and TSVs introduce. First, a data (DQ) receiver for HBM is proposed with a self-tracking loop that tracks the phase skew between DQ and the data strobe (DQS) caused by voltage or thermal drift. The self-tracking loop achieves low power and small area by using an analog-assisted baud-rate phase detector. The proposed pulse-to-charge (PC) phase detector (PD) converts the phase skew to a voltage difference and detects the skew from that difference. An offset calibration scheme that compensates for PD mismatch is also proposed; it operates without any additional sensing circuits by taking advantage of HBM write training. Fabricated in 65 nm CMOS, the DQ receiver achieves a power efficiency of 370 fJ/b at 4.8 Gb/s and occupies 0.0056 mm². Experimental results show that the DQ receiver operates without performance degradation under a ±10% supply variation. In a second prototype IC, a high-density transceiver for HBM with a feed-forward-equalizer (FFE)-combined crosstalk (XT) cancellation scheme is presented. To compensate for XT, the transmitter pre-distorts the amplitude of the FFE output according to the XT. Because the proposed XT cancellation (XTC) scheme reuses the FFE already implemented to equalize channel loss, the additional circuitry for XTC is minimized. The XTC scheme allows the channel pitch to be reduced significantly, enabling high channel density. Moreover, the 3D-staggered channel structure removes the ground layer between vertically adjacent channels, which further reduces the cross-sectional area of the channel per lane. The test chip, which includes 6 data lanes, is fabricated in 65 nm CMOS technology. The 6-mm channels are implemented on chip to emulate the silicon interposer between the HBM and the processor. The XTC scheme is verified by simultaneously transmitting 4-Gb/s data over the 6 consecutive channels with 0.5-µm pitch, and it reduces XT-induced jitter by up to 78%. Measurements show that the transceiver achieves a throughput of 8 Gb/s/µm. The transceiver occupies 0.05 mm² for 6 lanes and consumes 36.6 mW at 6 × 4 Gb/s.
    Contents: Chapter 1 Introduction (1.1 Motivation; 1.2 Thesis Organization). Chapter 2 Background on High-Bandwidth Memory (2.1 Overview; 2.2 Transceiver Architecture; 2.3 Read/Write Operation: 2.3.1 Read Operation, 2.3.2 Write Operation). Chapter 3 Background on Coupled Wires (3.1 Generalized Model; 3.2 Effect of Crosstalk). Chapter 4 DQ Receiver with Baud-Rate Self-Tracking Loop (4.1 Overview; 4.2 Features of DQ Receiver for HBM; 4.3 Proposed Pulse-to-Charge Phase Detector: 4.3.1 Operation of Pulse-to-Charge Phase Detector, 4.3.2 Offset Calibration, 4.3.3 Operation Sequence; 4.4 Circuit Implementation; 4.5 Measurement Result). Chapter 5 High-Density Transceiver for HBM with 3D-Staggered Channel and Crosstalk Cancellation Scheme (5.1 Overview; 5.2 Proposed 3D-Staggered Channel: 5.2.1 Implementation of 3D-Staggered Channel, 5.2.2 Channel Characteristics and Modeling; 5.3 Proposed Feed-Forward-Equalizer-Combined Crosstalk Cancellation Scheme; 5.4 Circuit Implementation: 5.4.1 Overall Architecture, 5.4.2 Transmitter with FFE-Combined XTC, 5.4.3 Receiver; 5.5 Measurement Result). Chapter 6 Conclusion. Bibliography. Abstract in Korean.
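    The headline figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a reader's sanity check, not anything taken from the thesis: the function names are hypothetical, and the assumption that throughput density equals the per-lane data rate divided by the channel pitch is an interpretation that happens to reproduce the quoted 8 Gb/s/µm.

```python
# Figures quoted in the abstract.
RX_ENERGY_PER_BIT_J = 370e-15  # 370 fJ/b (DQ receiver)
RX_DATA_RATE_BPS = 4.8e9       # 4.8 Gb/s
TRX_POWER_W = 36.6e-3          # 36.6 mW for all lanes
LANES = 6
LANE_RATE_GBPS = 4.0           # 4 Gb/s per lane
CHANNEL_PITCH_UM = 0.5         # 0.5 um channel pitch


def link_power_w(energy_per_bit_j, data_rate_bps):
    """Power implied by an energy-per-bit figure at a given data rate."""
    return energy_per_bit_j * data_rate_bps


def energy_per_bit_j(power_w, aggregate_rate_bps):
    """Energy per bit implied by total power and aggregate data rate."""
    return power_w / aggregate_rate_bps


def throughput_density_gbps_per_um(lane_rate_gbps, pitch_um):
    """Assumed definition: per-lane data rate divided by channel pitch."""
    return lane_rate_gbps / pitch_um


rx_power_mw = link_power_w(RX_ENERGY_PER_BIT_J, RX_DATA_RATE_BPS) * 1e3
trx_pj_per_bit = energy_per_bit_j(TRX_POWER_W, LANES * LANE_RATE_GBPS * 1e9) * 1e12
density = throughput_density_gbps_per_um(LANE_RATE_GBPS, CHANNEL_PITCH_UM)

print(f"DQ receiver power: {rx_power_mw:.3f} mW")
print(f"Transceiver energy per bit: {trx_pj_per_bit:.3f} pJ/b")
print(f"Throughput density: {density:.1f} Gb/s/um")
```

    Under these assumptions the receiver figure corresponds to about 1.776 mW, the six-lane transceiver to about 1.525 pJ/b, and the density to 8 Gb/s/µm.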

    Experimental Evaluation and Comparison of Time-Multiplexed Multi-FPGA Routing Architectures

    Emulating large, complex designs requires multi-FPGA systems (MFS). However, inter-FPGA communication is constrained by limited interconnect capacity, because each FPGA has a limited number of input/output (I/O) pins. Serializing parallel signals onto a single trace effectively addresses the I/O pin limit. Besides the multiplexing scheme and multiplexing ratio (the number of inter-FPGA signals per trace), the choice of MFS routing architecture also affects the critical path latency. The routing architecture of an MFS is the interconnection pattern of FPGAs, fixed wires, and/or programmable interconnect chips. The performance of existing MFS routing architectures is also limited by the choice of off-chip interface. In this dissertation we propose novel 2D and 3D latency-optimized, time-multiplexed MFS routing architectures. We use a rigorous experimental approach and real sequential benchmark circuits to evaluate and compare the proposed and existing MFS routing architectures. This research provides new insight into the benefits of an off-chip optical interface and three-dimensional MFS routing architectures. Vertical stacking results in shorter off-chip links, which improves the overall system frequency and has the additional advantage of a smaller footprint. The proposed 3D architectures employ serialized interconnect between intra-plane and inter-plane FPGAs to address the pin limitation problem. Additionally, all off-chip links are replaced by optical fibers, which improved latency and resulted in a faster MFS. Results indicate that exploiting the third dimension provides latency and area improvements compared to 2D MFS. We also propose latency-optimized planar 2D MFS architectures in which electrical interconnections are replaced by an optical interface in the same spatial distribution.
    Performance evaluation and comparison show that the proposed architectures reduce critical path delay and improve system frequency compared to conventional MFS. We also experimentally evaluated and compared the system performance of three inter-FPGA communication schemes (logic multiplexing, SERDES, and MGT) in conjunction with two routing architectures (Completely Connected Graph (CCG) and TORUS). Experimental results show that SERDES attained a higher maximum frequency than the other two schemes. However, for very high multiplexing ratios, the performance of SERDES and MGT became comparable.
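    The trade-off behind the multiplexing ratio can be illustrated with a small sketch. It is not the dissertation's model: the function names and example numbers are hypothetical, and the clock bound is an idealization that ignores serializer and link latency.

```python
def traces_needed(inter_fpga_signals, mux_ratio):
    """Physical traces required when mux_ratio signals share one trace."""
    if inter_fpga_signals < 0 or mux_ratio < 1:
        raise ValueError("need a non-negative signal count and a ratio >= 1")
    # Ceiling division without floating point.
    return -(-inter_fpga_signals // mux_ratio)


def ideal_system_clock_mhz(link_clock_mhz, mux_ratio):
    """Idealized upper bound: every signal gets one slot per mux_ratio link cycles."""
    return link_clock_mhz / mux_ratio


# Hypothetical example: 1000 cut signals between two FPGAs, 800 MHz link clock.
for ratio in (1, 4, 8, 16):
    print(ratio, traces_needed(1000, ratio), ideal_system_clock_mhz(800.0, ratio))
```

    A higher ratio saves pins but lowers the best-case system clock, which is why the abstract treats the multiplexing ratio and the routing architecture together as contributors to critical path latency.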

    Source-synchronous I/O Links using Adaptive Interface Training for High Bandwidth Applications

    Mobility is key to global business, which requires people to be always connected to a central server. With the exponential increase in smartphones, tablets, and laptops, mobile traffic will soon reach the range of exabytes per month by 2018. Applications such as video streaming, video on demand, online gaming, and social media will further increase the traffic load. Future application scenarios, such as Smart Cities, Industry 4.0, and Machine-to-Machine (M2M) communications, bring in the concepts of the Internet of Things (IoT), which requires high-speed, low-power communication infrastructure. Scientific applications, such as space exploration and oil exploration, also require computing speeds in the exaFLOPS range by 2018, which means TB/s bandwidth at each memory node. To achieve such bandwidth, the input/output (I/O) link speed between two devices needs to be increased to GB/s. High-speed data can be transferred between devices serially, using complex Clock-Data-Recovery (CDR) I/O links, or in parallel, using simple source-synchronous I/O links. Although CDR is more efficient than the source-synchronous method for a single I/O link, achieving TB/s bandwidth from a single device requires additional I/O links, and the source-synchronous method is then more advantageous in area and power because additional I/O links do not require extra hardware resources. At high speed, several non-idealities (supply noise, crosstalk, inter-symbol interference (ISI), etc.) create unwanted skew among parallel source-synchronous I/O links. To solve these problems, adaptive training is used in the time domain to synchronize parallel source-synchronous I/O links irrespective of these non-idealities. In this thesis, two novel adaptive training architectures for source-synchronous I/O links are discussed that require significantly less silicon area and power than state-of-the-art architectures.
    The first novel adaptive architecture is based on the unit delay concept and synchronizes two parallel clocks by adjusting the phase of one clock in only one direction. The second novel adaptive architecture consists of a Phase Interpolator (PI)-based Phase Locked Loop (PLL), which can adjust the phase in both directions and achieves faster synchronization at the expense of added complexity. As the number of parallel I/O links increases, clock skew generated by an improper clock tree also affects the timing margin. An incorrect duty cycle further reduces the timing margin, mainly in Double Data Rate (DDR) systems, which are generally used to increase the bandwidth of a high-speed communication system. To solve the clock skew and duty cycle problems, a novel clock tree buffering algorithm and a novel duty cycle corrector are described, which further reduce the power consumption of a source-synchronous system.
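    The cost of adjusting phase in only one direction can be seen in a toy model. The sketch below is an assumption-laden illustration, not the thesis's circuit: the function name, the picosecond values, and the stopping rule are all hypothetical. It adds fixed unit delays to one clock until its edge wraps around to within a tolerance of the reference edge.

```python
def unit_delay_steps(skew_ps, period_ps, unit_delay_ps, tolerance_ps):
    """Count unit delays needed to align a clock when delay can only be added.

    skew_ps is how far the adjustable clock currently lags the reference.
    Alignment means the circular phase distance is within tolerance_ps.
    """
    max_steps = -(-period_ps // unit_delay_ps)  # one full period is enough
    for steps in range(max_steps + 1):
        phase = (skew_ps + steps * unit_delay_ps) % period_ps
        distance = min(phase, period_ps - phase)
        if distance <= tolerance_ps:
            return steps
    raise RuntimeError("tolerance is too tight for this unit delay")


# Hypothetical 1 GHz clock (1000 ps period) with a 10 ps unit delay.
print(unit_delay_steps(300, 1000, 10, 5))  # a 300 ps lag must wrap around
print(unit_delay_steps(700, 1000, 10, 5))
```

    In this model a small lag is the expensive case, because the edge has to travel almost a full period. That is consistent with the abstract's remark that the bidirectional PI-based PLL synchronizes faster at the price of more complexity.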

    Hybrid NRZ/Multi-Tone Signaling for High-Speed Low-Power Wireline Transceivers

    Over the past few decades, incessant growth of Internet traffic and High-Performance Computing (HPC) has led to tremendous demand for data bandwidth. Digital communication technologies combined with advanced integrated circuit scaling trends have enabled the semiconductor and microelectronics industry to dramatically scale the bandwidth of high-loss interfaces such as Ethernet, backplane, and Digital Subscriber Line (DSL). The key to achieving higher bandwidth is to employ equalization techniques to compensate for channel impairments such as Inter-Symbol Interference (ISI), crosstalk, and environmental noise. Therefore, today's advanced input/outputs (I/Os) have been equipped with sophisticated equalization techniques to push beyond the uncompensated bandwidth of the system. To this end, process scaling has continually increased data processing capability and improved I/O performance over the last 15 years. However, since channel bandwidth has not scaled at the same pace, the required signal processing and equalization circuitry has become more and more complicated. As a result, the energy efficiency improvements are largely offset by the energy needed to compensate for channel impairments. In this design paradigm, it becomes necessary to rethink design strategies so that they not only satisfy the bandwidth requirement but also improve power-performance. It is well known in communication theory that coding and signaling schemes have the potential to provide superior performance over band-limited channels. However, the choice of the optimum data communication algorithm should account for the circuit-level power-performance trade-offs. In this thesis we have investigated the application of new algorithms and signaling schemes in wireline communications, especially for communication between microprocessors, memories, and peripherals.
    A new hybrid NRZ/Multi-Tone (NRZ/MT) signaling method has been developed during the course of this research. The system-level and circuit-level analysis, design, and implementation of the proposed signaling method have been performed within this work, and the silicon measurement results have demonstrated the efficiency and robustness of the proposed signaling methodology for wireline interfaces. In the first part of this work, a 7.5 Gb/s hybrid NRZ/MT transceiver (TRX) for multi-drop bus (MDB) memory interfaces is designed and fabricated in 40 nm CMOS technology. By reducing the complexity of the equalization circuitry on the receiver (RX) side, the proposed architecture achieves 1 pJ/bit link efficiency for an MDB channel with 45 dB loss at 2.5 GHz. The measurement results of the first prototype confirm that an NRZ/MT serial data TRX can offer an energy-efficient solution for MDB memory interfaces. Motivated by the results of the first prototype, in the second phase of this research we have exploited the properties of multi-tone signaling, especially orthogonality among different sub-bands, to reduce the effect of crosstalk in high-density wireline interconnects. A four-channel transceiver has been implemented in a standard 40 nm CMOS technology to demonstrate the performance of NRZ/MT signaling in the presence of high channel loss and strong crosstalk noise. The proposed system achieves 1 pJ/bit power efficiency while communicating over an MDB memory channel at a 36 Gb/s aggregate data rate.
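    The orthogonality among sub-bands that the abstract relies on can be demonstrated numerically. The sketch below is a generic illustration under stated assumptions, not the thesis's signaling scheme: it uses sine tones with an integer number of cycles per symbol, and the function name and sample count are hypothetical. The abstract does not give the actual sub-band frequencies.

```python
import math


def tone_inner_product(cycles_a, cycles_b, samples=1000):
    """Average product of two sine tones sampled over one symbol period.

    cycles_a and cycles_b are the (integer) number of cycles per symbol.
    """
    total = 0.0
    for n in range(samples):
        t = n / samples  # normalized time within the symbol
        total += math.sin(2 * math.pi * cycles_a * t) * math.sin(2 * math.pi * cycles_b * t)
    return total / samples


print(tone_inner_product(1, 2))  # different tones: essentially zero
print(tone_inner_product(2, 2))  # same tone: one half
```

    Because the correlation between distinct tones vanishes over a symbol period, a receiver that correlates against its own tone ideally sees no contribution from a neighbor using a different tone. That is the property the abstract exploits against crosstalk.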

    Distributing Network Traffic for Parallel Processing Using a Programmable Logic Device (Verkkoliikenteen hajauttaminen rinnakkaisprosessoitavaksi ohjelmoitavan piirin avulla)

    The expanding diversity and volume of Internet traffic requires increasingly high-performance devices to protect networks against malicious activity. The computational load of these devices can be divided over multiple processing nodes operating in parallel to reduce the load on any single node. However, this requires a dedicated controller that can distribute traffic to and from the nodes at wire speed. This thesis concentrates on the system topologies and implementation aspects of the controller. A field-programmable gate array (FPGA) device, based on a reconfigurable logic array, is used for the implementation because of its integrated-circuit-like performance and high-grain programmability. Two hardware implementations were developed: a straightforward design for 1-gigabit Ethernet and a modular, highly parameterizable design for 10-gigabit Ethernet. The designs were verified by simulations and synthesizable testbenches. The designs were synthesized on different FPGA devices while varying parameters to analyze the achieved performance. High-end FPGA devices, such as the Altera Stratix family, met the target processing speed of 10-gigabit Ethernet. The measurements show that the controller's latency is comparable to that of a typical switch. The results confirm that reconfigurable hardware is a suitable platform for low-level network processing where performance is prioritized over other features. The designed architecture is versatile and adaptable to applications with similar requirements.
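    The abstract does not say how the controller decides which node receives a given packet. A common approach, swapped in here purely for illustration, is to hash the flow identifier so that all packets of one connection reach the same node. The function name, the key format, and the use of CRC-32 are assumptions of this sketch and are not taken from the thesis.

```python
import zlib


def pick_node(src_ip, dst_ip, src_port, dst_port, protocol, node_count):
    """Map a flow to a processing node with a direction-independent hash."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Sort the two endpoints so both directions of a flow yield the same key.
    low, high = sorted(((src_ip, src_port), (dst_ip, dst_port)))
    key = f"{low[0]}:{low[1]}|{high[0]}:{high[1]}|{protocol}".encode()
    return zlib.crc32(key) % node_count


# Hypothetical flow between two hosts, distributed over 4 nodes.
forward = pick_node("10.0.0.1", "10.0.0.2", 40000, 443, "tcp", 4)
reverse = pick_node("10.0.0.2", "10.0.0.1", 443, 40000, "tcp", 4)
print(forward, reverse)
```

    Keeping both directions of a connection on one node matters for stateful inspection, which is one reason a hash over sorted endpoints is preferred in this sketch over simple round-robin.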

    A Single-Ended Parallel Transceiver With Four-Bit Four-Wire Four-Level Balanced Coding for the Point-to-Point DRAM Interface

    A four-bit four-wire four-level (4B4W4L) single-ended parallel transceiver for the point-to-point DRAM interface achieves a peak reduction of ~10 dB in electromagnetic interference (EMI) H-field power compared to a conventional 4-bit parallel binary transceiver with the same transmitter (TX) output driver power and the same receiver (RX) input voltage margin. A four-level balanced coding is used in this work to minimize simultaneous switching noise at the TX, to enable differential sensing without a reference voltage at the RX, to maintain a pin efficiency of 100%, and to reduce EMI by setting the sum of the currents through the four wires to zero. A capacitive pre-emphasis scheme modified for four-level signaling is also used at the TX to compensate for inter-symbol interference. The transmitted four-level signals are recovered at the RX by six differential comparators with offset compensation and a decoder. The proposed transceiver chip, fabricated in a 65 nm CMOS process, consumes 2.39 pJ/bit with a 1.2 V supply and a 2-inch FR4 channel at 8 Gb/s.
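    A short enumeration shows why a zero-sum four-level code on four wires has room for four bits. The sketch below is a counting argument only: the normalized levels (-3, -1, +1, +3) are hypothetical, and the paper's actual code table is not given in the abstract.

```python
from itertools import combinations, permutations, product

LEVELS = (-3, -1, 1, 3)  # hypothetical normalized current levels

# Every assignment of levels to four wires whose currents sum to zero.
balanced_words = [w for w in product(LEVELS, repeat=4) if sum(w) == 0]

# Subset in which each wire carries a different level.
distinct_words = list(permutations(LEVELS))

# One differential comparator per pair of wires.
wire_pairs = list(combinations(range(4), 2))


def comparator_signs(word):
    """Outputs of the pairwise comparators for one codeword."""
    return tuple(word[i] > word[j] for i, j in wire_pairs)


unique_patterns = {comparator_signs(w) for w in distinct_words}

print(len(balanced_words), len(distinct_words), len(wire_pairs), len(unique_patterns))
```

    Four wires give six pairs, which is consistent with the six differential comparators mentioned in the abstract. With these assumed levels there are 44 zero-sum words, 24 of which use four different levels and are distinguishable from pairwise comparisons alone. Either set exceeds the 16 codewords that four bits require.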

    High performance communication on reconfigurable clusters

    High Performance Computing (HPC) has matured to where it is an essential third pillar, along with theory and experiment, in most domains of science and engineering. Communication latency is a key factor that is limiting the performance of HPC, but can be addressed by integrating communication into accelerators. This integration allows accelerators to communicate with each other without CPU interactions, and even bypassing the network stack. Field Programmable Gate Arrays (FPGAs) are the accelerators that currently best integrate communication with computation. The large number of Multi-gigabit Transceivers (MGTs) on most high-end FPGAs can provide high-bandwidth and low-latency inter-FPGA connections. Additionally, the reconfigurable FPGA fabric enables tight coupling between computation kernel and network interface. Our thesis is that an application-aware communication infrastructure for a multi-FPGA system makes substantial progress in solving the HPC communication bottleneck. This dissertation aims to provide an application-aware solution for communication infrastructure for FPGA-centric clusters. Specifically, our solution demonstrates application-awareness across multiple levels in the network stack, including low-level link protocols, router microarchitectures, routing algorithms, and applications. We start by investigating the low-level link protocol and the impact of its latency variance on performance. Our results demonstrate that, although some link jitter is always present, we can still assume near-synchronous communication on an FPGA-cluster. This provides the necessary condition for statically-scheduled routing. We then propose two novel router microarchitectures for two different kinds of workloads: a wormhole Virtual Channel (VC)-based router for workloads with dynamic communication, and a statically-scheduled Virtual Output Queueing (VOQ)-based router for workloads with static communication. 
    For the first (VC-based) router, we propose a framework that generates application-aware router configurations. Our results show that, by adding application-awareness into router configuration, the network performance of FPGA clusters can be substantially improved. For the second (VOQ-based) router, we propose a novel offline collective routing algorithm. This shows a significant advantage over a state-of-the-art collective routing algorithm. We apply our communication infrastructure to a critical strong-scaling HPC kernel, the 3D FFT. The experimental results demonstrate that the performance of our design is faster than that on CPUs and GPUs by at least one order of magnitude (achieving strong scaling for the target applications). Surprisingly, the FPGA cluster performance is similar to that of an ASIC-cluster. We also implement the 3D FFT on another multi-FPGA platform: the Microsoft Catapult II cloud. Its performance is also comparable or superior to CPU and GPU HPC clusters. The second application we investigate is Molecular Dynamics Simulation (MD). We model MD on both FPGA clouds and clusters. We find that combining processing and general communication in the same device leads to extremely promising performance and the prospect of MD simulations well into the µs/day range with a commodity cloud.

    Electronic Photonic Integrated Circuits and Control Systems

    Photonic systems can operate at frequencies several orders of magnitude higher than electronics, whereas electronics offers extremely high density and easily built memories. Integrated photonic-electronic systems promise to combine the advantages of both, leading to gains in accuracy, reconfigurability, and energy efficiency. This work concerns hybrid and monolithic electronic-photonic system design. First, a high-resolution voltage supply for controlling a thermo-optic photonic chip for time-bin entanglement is described; its electronic controller scales to more power channels and supports daisy-chaining of devices. Second, a system identification technique with embedded feedback control is proposed for wavelength stabilization and control modeling in silicon nitride photonic integrated circuits. Using this system, the wavelength of a thermo-optic device can be stabilized in a dynamic environment. Third, the generation of more deterministic photon sources through temporal multiplexing, using field-programmable gate arrays (FPGAs) as the controller for the photonic device, is demonstrated for the first time. The result shows an enhancement of the single-photon output probability without introducing additional multi-photon noise. Fourth, multiple-input multiple-output (MIMO) control of silicon nitride thermo-optic photonic circuits incorporating Mach-Zehnder interferometers (MZIs) is demonstrated for the first time using a dual proportional-integral reference tracking technique. The system exhibits improved control accuracy by reducing wavelength peak drift due to internal and external disturbances. Finally, a monolithically integrated complementary metal-oxide-semiconductor (CMOS) nanophotonic segmented transmitter is characterized. With the segmented design, the monolithic Mach-Zehnder modulator (MZM) shows low link sensitivity and low insertion loss with driver flexibility.
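    Proportional-integral reference tracking, which the abstract uses for wavelength stabilization, can be sketched on a toy plant. Everything below is an assumption made for illustration: a single loop instead of the dual MIMO controller, a first-order discrete-time plant standing in for the thermo-optic response, and hypothetical gains and names.

```python
def simulate_pi_tracking(setpoint, disturbance, steps=500,
                         kp=0.5, ki=0.2, plant_rate=0.1):
    """Run a discrete PI loop on a first-order plant and return the final output.

    The plant relaxes toward (drive + disturbance) at plant_rate per step,
    a stand-in for a heater-tuned resonance with a constant thermal offset.
    """
    output = 0.0
    integral = 0.0
    for _ in range(steps):
        error = setpoint - output
        integral += error                      # accumulated error
        drive = kp * error + ki * integral     # PI control law
        output += plant_rate * (drive + disturbance - output)
    return output


final = simulate_pi_tracking(setpoint=1.0, disturbance=0.3)
print(f"final output: {final:.6f}")
```

    The integral term is what removes the steady-state offset caused by the constant disturbance. A proportional-only loop on the same plant would settle away from the setpoint.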