5 research outputs found
Recommended from our members
Analysis and design on low-power multi-Gb/s serial links
High speed serial links are critical components for addressing the growing demand for I/O bandwidth in next-generation computing applications, such as many-core systems, backplane and optical data communications. Due to continued process scaling and circuit innovations, today's CMOS serial link transceivers can achieve tens of Gb/s per pin. However, most of their reported power efficiency improves much slower than the rise of data rate. Therefore, aggregate I/O power is increasing and will exceed the power budget if the trend for more off-chip bandwidth is sustained.
In this work, a system level statistical analysis of serial links is first described, and compares the link performance of Non-Return-to-Zero (2-PAM) with higher-order modulation (duobinary) signaling schemes. This method enables fast and accurate BER distribution simulation of serial link transceivers that include channel and circuit imperfections, such as finite pulse rise/fall time, duty cycle variation, and both receiver and transmitter forwarded-clock jitter.
Second, in order to address link power efficiency, two test chips have been implemented. The first one describes a quad-lane, 6.4-7.2 Gb/s serial link receiver prototype using a forwarded clock architecture. A novel phase deskew scheme using injection-locked ring oscillators (ILRO) is proposed that achieves greater than one UI of phase shift for multiple clock phases, eliminating phase rotation and interpolation required in conventional architectures. Each receiver, optimized for power efficiency, consists of a low-power linear equalizer, four offset-cancelled quantizers for 1:4 demultiplexing, and an injection-locked ring oscillator coupled to a low-voltage swing, global clock distribution. Measurement results show a 6.4-7.2Gb/s data rate with BER < 10⁻¹² across 14 cm of PCB, and an 8Gb/s data rate through 4cm of PCB. Designed in a 1.2V, 90nm CMOS process, the ILRO achieves a wide tuning range from 1.6-2.6GHz. The total area of each receiver is 0.0174mm², resulting in a measured power efficiency of 0.6mW/Gb/s.
Improving upon the first test chip, a second test chip for 8Gb/s forwarded clock serial link receivers exploits a low-power super-harmonic injection-locked ring oscillator for symmetric multi-phase local clock generation and deskewing. Further power reduction is achieved by designing most of the receiver circuits in the near-threshold region (0.6V supply), with the exception of only the global clock buffer, test buffers and synthesized digital test circuits at nominal 1V supply. At the architectural level, a 1:10 direct demultiplexing rate is chosen to achieve low supply operation by exploiting high-parallelism. Fabricated in 65nm CMOS technology, two receiver prototypes are integrated in this test chip, one without and the other with front-end boot-strapped S/Hs. Including the amortized power of global clock distribution, the proposed serial link receivers consume 1.3mW and 2mW respectively at 8Gb/s input data rate, achieving a power efficiency of 0.163mW/Gb/s and 0.25mW/Gb/s. Measurement results show both receivers achieve BER < 10⁻¹² across a 20-cm FR4 PCB channel
Design Techniques for Energy Efficient Multi-GB/S Serial I/O Transceivers
Total I/O bandwidth demand is growing in high-performance systems due to the emergence of many-core microprocessors and in mobile devices to support the next generation of multi-media features. High-speed serial I/O energy efficiency must improve in order to enable continued scaling of these parallel computing platforms in applications ranging from data centers to smart mobile devices.
The first work, a low-power forwarded-clock I/O transceiver architecture is presented that employs a high degree of output/input multiplexing, supply-voltage scaling with data rate, and low-voltage circuit techniques to enable low-power operation. The transmitter utilizes a 4:1 output multiplexing voltage-mode driver along with 4-phase clocking that is efficiently generated from a passive poly-phase filter. The output driver voltage swing is accurately controlled from 100-200 mV_(ppd) using a low-voltage pseudo-differential regulator that employs a partial negative-resistance load for improved low frequency gain. 1:8 input de-multiplexing is performed at the receiver equalizer output with 8 parallel input samplers clocked from an 8-phase injection-locked oscillator that provides more than 1UI de-skew range.
Low-power high-speed serial I/O transmitters which include equalization to compensate for channel frequency dependent loss are required to meet the aggressive link energy efficiency targets of future systems. The second work presents a low power serial link transmitter design that utilizes an output stage which combines a voltage-mode driver, which offers low static-power dissipation, and current-mode equalization, which offers low complexity and dynamic-power dissipation. The utilization of current-mode equalization decouples the equalization settings and termination impedance, allowing for a significant reduction in pre-driver complexity relative to segmented voltage-mode drivers. Proper transmitter series termination is set with an impedance control loop which adjusts the on-resistance of the output transistors in the driver voltage-mode portion. Further reductions in dynamic power dissipation are achieved through scaling the serializer and local clock distribution supply with data rate.
Finally, it presents that a scalable quarter-rate transmitter employs an analog-controlled impedance-modulated 2-tap voltage-mode equalizer and achieves fast power-state transitioning with a replica-biased regulator and ILO clock generation. Capacitively-driven 2 mm global clock distribution and automatic phase calibration allows for aggressive supply scaling
Toward realizing power scalable and energy proportional high-speed wireline links
Growing computational demand and proliferation of cloud computing has placed high-speed
serial links at the center stage. Due to saturating energy efficiency improvements over the
last five years, increasing the data throughput comes at the cost of power consumption. Conventionally, serial link power can be reduced by optimizing individual building blocks such as
output drivers, receiver, or clock generation and distribution. However, this approach yields
very limited efficiency improvement. This dissertation takes an alternative approach toward
reducing the serial link power. Instead of optimizing the power of individual building blocks,
power of the entire serial link is reduced by exploiting serial link usage by the applications.
It has been demonstrated that serial links in servers are underutilized. On average, they
are used only 15% of the time, i.e. these links are idle for approximately 85% of the time.
Conventional links consume power during idle periods to maintain synchronization between
the transmitter and the receiver. However, by powering-off the link when idle and powering
it back when needed, power consumption of the serial link can be scaled proportionally to
its utilization. This approach of rapid power state transitioning is known as the rapid-on/off
approach. For the rapid-on/off to be effective, ideally the power-on time, off-state power,
and power state transition energy must all be close to zero. However, in practice, it is very
difficult to achieve these ideal conditions. Work presented in this dissertation addresses these
challenges.
When this research work was started (2011-12), there were only a couple of research papers
available in the area of rapid-on/off links. Systematic study or design of a rapid power state
transitioning in serial links was not available in the literature. Since rapid-on/off with
nanoseconds granularity is not a standard in any wireline communication, even the popular
test equipment does not support testing any such feature, neither any formal measurement methodology was available. All these circumstances made the beginning difficult. However,
these challenges provided a unique opportunity to explore new architectural techniques and
identify trade-offs. The key contributions of this dissertation are as follows.
The first and foremost contribution is understanding the underlying limitations of saturating energy efficiency improvements in serial links and why there is a compelling need to
find alternative ways to reduce the serial link power.
The second contribution is to identify potential power saving techniques and evaluate the
challenges they pose and the opportunities they present.
The third contribution is the design of a 5Gb/s transmitter with a rapid-on/off feature.
The transmitter achieves rapid-on/off capability in voltage mode output driver by using
a fast-digital regulator, and in the clock multiplier by accurate frequency pre-setting and
periodic reference insertion. To ease timing requirements, an improved edge replacement
logic circuit for the clock multiplier is proposed. Mathematical modeling of power-on time
as a function of various circuit parameters is also discussed. The proposed transmitter
demonstrates energy proportional operation over wide variations of link utilization, and is,
therefore, suitable for energy efficient links. Fabricated in 90nm CMOS technology, the
voltage mode driver, and the clock multiplier achieve power-on-time of only 2ns and 10ns,
respectively. This dissertation highlights key trade-off in the clock multiplier architecture,
to achieve fast power-on-lock capability at the cost of jitter performance.
The fourth contribution is the design of a 7GHz rapid-on/off LC-PLL based clock multi-
plier. The phase locked loop (PLL) based multiplier was developed to overcome the limita-
tions of the MDLL based approach. Proposed temperature compensated LC-PLL achieves
power-on-lock in 1ns.
The fifth and biggest contribution of this dissertation is the design of a 7Gb/s embedded
clock transceiver, which achieves rapid-on/off capability in LC-PLL, current-mode transmit-
ter and receiver. It was the first reported design of a complete transceiver, with an embedded
clock architecture, having rapid-on/off capability. Background phase calibration technique in
PLL and CDR phase calibration logic in the receiver enable instantaneous lock on power-on.
The proposed transceiver demonstrates power scalability with a wide range of link utiliza-
tion and, therefore, helps in improving overall system efficiency. Fabricated in 65nm CMOS technology, the 7Gb/s transceiver achieves power-on-lock in less than 20ns. The transceiver
achieves power scaling by 44x (63.7mW-to-1.43mW) and energy efficiency degradation by
only 2.2x (9.1pJ/bit-to-20.5pJ/bit), when the effective data rate (link utilization) changes
by 100x (7Gb/s-to-70Mb/s).
The sixth and final contribution is the design of a temperature sensor to compensate
the frequency drifts due to temperature variations, during long power-off periods, in the
fast power-on-lock LC-PLL. The proposed self-referenced VCO-based temperature sensor
is designed with all digital logic gates and achieves low supply sensitivity. This sensor is
suitable for integration in processor and DRAM environments. The proposed sensor works
on the principle of directly converting temperature information to frequency and finally
to digital bits. A novel sensing technique is proposed in which temperature information
is acquired by creating a threshold voltage difference between the transistors used in the
oscillators. Reduced supply sensitivity is achieved by employing junction capacitance, and
the overhead of voltage regulators and an external ideal reference frequency is avoided. The
effect of VCO phase noise on the sensor resolution is mathematically evaluated. Fabricated
in the 65nm CMOS process, the prototype can operate with a supply ranging from 0.85V
to 1.1V, and it achieves a supply sensitivity of 0.034oC/mV and an inaccuracy of ±0.9oC
and ±2.3oC from 0-100oC after 2-point calibration, with and without static nonlinearity
correction, respectively. It achieves a resolution of 0.3oC, resolution FoM of 0.3(nJ/conv)res2 ,
and measurement (conversion) time of 6.5μs
Energy-efficient systems for information transfer and processing
Machine learning (ML) systems are finding excellent utility in tackling the data deluge of the big data era thanks to the exponential increase in computing power. Current ML systems adopt either centralized cloud computing or distributed edge computing. In both, the challenge of energy efficiency has been drawing increased attention. In cloud computing, data transfer due to inter-chip, inter-board, inter-shelf and inter-rack communications (I/O interface) within data centers is one of the dominant energy costs. This will intensify with the growing demand for increased I/O bandwidth of high-performance computing in data centers. On the other hand, in edge computing, energy efficiency is the primary design challenge, as mobile devices have limited energy, computation and storage resources. This challenge is being exacerbated by the need to embed ML algorithms such as convolutional neural networks (CNNs) for enabling local on-device inference capabilities. In this dissertation, we investigate techniques to address these challenges.
To address the energy efficiency challenge in data centers, this dissertation focuses on reducing the energy consumption of the I/O interface. Specifically, in the emerging analog-to-digital converter (ADC)-based multi-Gb/s serial link receivers, the power dissipation is dominated by the ADC. ADCs in serial links employ signal-to-noise-and-distortion-ratio (SNDR) and effective-number-of-bits (ENOB) as performance metrics because these are the standard for generic ADC design. This dissertation presents the use of information-based metrics such as bit-error-rate (BER) to design a BER-optimal ADC (BOA) for serial links. First, theoretical analysis is developed to show when the benefits of BOA over a conventional uniform ADC (CUA) in a serial link receiver are substantial. Second, a \unit[4]{GS/s}, 4-\mbox{\textrm{bit}} on-chip ADC in a \unit[90]{nm} CMOS process is designed and integrated into a 4 Gb/s serial link receiver to verify the aforementioned analysis. Specifically, measured results demonstrate that a 3-\mathrm{bit} BOA receiver outperforms a 4-\mathrm{bit} CUA receiver at a BER <10^{-12} and provides \unit[50]{\%} power savings in the ADC. In the process, it is demonstrated conclusively that BER as opposed to ENOB is a better metric when designing ADCs for serial links.
For the problem of resource-constrained computing at the edge, this dissertation tackles the issue of energy-efficient implementation of ML algorithms, particularly CNNs which have recently gained considerable interest due to their record-breaking performance in many recognition tasks. However, their implementation complexity hinders their deployment on power-constrained embedded platforms. This dissertation develops two techniques for energy-efficient CNN design.
The first technique is a predictive CNN (PredictiveNet), which makes use of high sparsity in well-trained CNNs to bypass a large fraction of power-dominant convolutions at runtime without modifying the CNN structure. Analysis supported by simulations is provided to justify PredictiveNet's effectiveness. When applied to both the MNIST and CIFAR-10 datasets, simulation results show that PredictiveNet achieves 7.2\times and 4.4\times reduction in the computational and representational costs, respectively, compared with a conventional CNN. It is further shown that PredictiveNet enables computational and representational cost reductions of 2.5\times and 1.7\times, respectively, compared to a state-of-the-art CNN, while incurring only 0.02 classification accuracy loss.
The second technique is a variation-tolerant architecture for CNN capable of operating in near threshold voltage (NTV) regime for aggressive energy efficiency. It is well-known that NTV computing can achieve up to 10\times energy savings but is sensitive to process, temperature, and voltage (PVT) variations which can lead to timing errors. To leverage the great potential of NTV for energy efficiency, this dissertation develops a new statistical error compensation (SEC) technique referred to as rank decomposed SEC (RD-SEC). RD-SEC makes use of inherent redundancy in CNNs to handle timing errors due to NTV computing. When evaluated in CNNs for both the MNIST and CIFAR-10 datasets, simulation results in \unit[45]{nm} CMOS show that RD-SEC enables robust CNNs operating in the NTV regime. Specifically, the proposed RD-SEC can achieve up to 11\times improvement in variation tolerance and enable up to 113\times reduction in the standard deviation of classification accuracy while incurring marginal degradation in the median classification accuracy
Memory Systems and Interconnects for Scale-Out Servers
The information revolution of the last decade has been fueled by the digitization of almost all human activities through a wide range of Internet services. The backbone of this information age are scale-out datacenters that need to collect, store, and process massive amounts of data. These datacenters distribute vast datasets across a large number of servers, typically into memory-resident shards so as to maintain strict quality-of-service guarantees. While data is driving the skyrocketing demands for scale-out servers, processor and memory manufacturers have reached fundamental efficiency limits, no longer able to increase server energy efficiency at a sufficient pace. As a result, energy has emerged as the main obstacle to the scalability of information technology (IT) with huge economic implications. Delivering sustainable IT calls for a paradigm shift in computer system design. As memory has taken a central role in IT infrastructure, memory-centric architectures are required to fully utilize the IT's costly memory investment. In response, processor architects are resorting to manycore architectures to leverage the abundant request-level parallelism found in data-centric applications. Manycore processors fully utilize available memory resources, thereby increasing IT efficiency by almost an order of magnitude. Because manycore server chips execute a large number of concurrent requests, they exhibit high incidence of accesses to the last-level-cache for fetching instructions (due to large instruction footprints), and off-chip memory (due to lack of temporal reuse in on-chip caches) for accessing dataset objects. As a result, on-chip interconnects and the memory system are emerging as major performance and energy-efficiency bottlenecks in servers. This thesis seeks to architect on-chip interconnects and memory systems that are tuned for the requirements of memory-centric scale-out servers. By studying a wide range of data-centric applications, we uncover application phenomena common in data-centric applications, and examine their implications on on-chip network and off-chip memory traffic. Finally, we propose specialized on-chip interconnects and memory systems that leverage common traffic characteristics, thereby improving server throughput and energy efficiency