Performance and power optimization in VLSI physical design
As VLSI technology enters the nanoscale regime, a great deal of effort has
been made to reduce interconnect delay. Among the available approaches, buffer insertion stands out
as an effective technique for timing optimization. A dramatic rise in on-chip buffer
density has been witnessed. For example, in two recent IBM ASIC designs, 25% of the gates
are buffers.
In this thesis, three buffer insertion algorithms are presented for
performance and power optimization. The second chapter focuses on improving circuit performance under inductance effects. The new algorithm works within
the dynamic programming framework and runs in provably linear time for multiple
buffer types thanks to two novel techniques: restrictive cost bucketing and efficient delay
update. The experimental results demonstrate that our linear-time algorithm consistently outperforms all known RLC buffering algorithms in terms of both solution
quality and runtime. That is, the new algorithm uses fewer buffers, runs in less
time, and produces a buffered tree with better timing.
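To illustrate the dynamic programming framework mentioned above, here is a minimal sketch of the classic van Ginneken-style candidate propagation on a two-pin wire. It is not the thesis's linear-time RLC algorithm: it assumes the simpler Elmore (RC) delay model, a single buffer type, and maximizes slack only. The names `Candidate`, `prune`, `best_buffering` and all parameter values are hypothetical; units are chosen so that ohm times pF gives ps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    cap: float    # capacitance seen looking downstream (pF)
    rat: float    # required arrival time at this point (ps)
    buffers: int  # buffers inserted downstream of this point


def prune(cands):
    """Drop dominated candidates (higher cap without a better rat)."""
    kept, best_rat = [], float("-inf")
    for c in sorted(cands, key=lambda c: (c.cap, -c.rat)):
        if c.rat > best_rat:
            kept.append(c)
            best_rat = c.rat
    return kept


def best_buffering(n_segments, r_seg, c_seg, sink_cap, sink_rat,
                   r_drv, buf_cap, buf_res, buf_delay, allow_buffers=True):
    """Return (slack_at_driver, buffer_count) for a two-pin wire."""
    cands = [Candidate(sink_cap, sink_rat, 0)]
    for i in range(n_segments):
        # Walk one wire segment toward the driver (Elmore delay, pi-model).
        cands = [Candidate(c.cap + c_seg,
                           c.rat - r_seg * (c_seg / 2 + c.cap),
                           c.buffers) for c in cands]
        # Optionally place a buffer at the upstream end of this segment
        # (but not on top of the driver itself).
        if allow_buffers and i < n_segments - 1:
            cands += [Candidate(buf_cap,
                                c.rat - buf_delay - buf_res * c.cap,
                                c.buffers + 1) for c in cands]
        cands = prune(cands)
    # Account for the driver resistance and pick the best candidate.
    best = max(cands, key=lambda c: c.rat - r_drv * c.cap)
    return best.rat - r_drv * best.cap, best.buffers


# Hypothetical 10-segment wire: 100 ohm and 0.1 pF per segment.
slack, n_buf = best_buffering(10, 100.0, 0.1, 0.01, 1000.0,
                              200.0, 0.01, 200.0, 20.0)
print(f"best slack {slack:.1f} ps using {n_buf} buffer(s)")
```

Pruning is what keeps the candidate list small: a candidate with larger downstream capacitance and no better required arrival time can never win upstream, so it is discarded at every step.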
The third chapter presents a method to guarantee high-fidelity signal transmission on a global bus. It proposes a new redundant via insertion technique to reduce
via variation and signal distortion in twisted differential lines. In addition, a new
buffer insertion technique is proposed to synchronize the transmitted signals, thus
further improving the effectiveness of the twisted differential line. Experimental results demonstrate that a 6 GHz signal can be transmitted with high fidelity using the new
approaches. In contrast, only a 100 MHz signal can be reliably transmitted using a
single-ended bus with power/ground shielding. Compared to the conventional twisted differential line structure, our new techniques reduce the magnitude of noise by 45%
in our simulations.
The fourth chapter proposes a buffer insertion and gate sizing algorithm for
circuits with a million or more gates. The algorithm takes a combinational circuit as input instead of
individual nets and greatly reduces the buffer and gate cost of the entire circuit.
The algorithm has two main features: 1) a circuit partition technique based on the
criticality of the primary inputs, which makes the algorithm scalable, and
2) a linear programming formulation of the non-linear delay versus cost tradeoff, which
casts simultaneous buffer insertion and gate sizing as a linear programming
problem. Experimental results on ISCAS85 circuits show that even without the circuit
partition technique, the new algorithm achieves a 17X speedup compared with a path-based
algorithm. At the same time, the new algorithm saves 16.0% buffer cost, 4.9%
gate cost, and 5.8% total cost, and results in less circuit delay
Empirical timing analysis of CPUs and delay fault tolerant design using partial redundancy
The operating clock frequency is determined by the longest signal propagation
delay, setup/hold time, and timing margin. These are becoming less predictable with
the increasing design complexity and process miniaturization. The difficult challenge
is then to ensure that a device operating at its clock frequency is error-free with
quantifiable assurance. Effort at device-level engineering will not suffice for these
circuits exhibiting wide process variation and heightened sensitivities to operating
condition stress. Logic-level redress of this issue is a necessity and we propose a
design-level remedy for this timing-uncertainty problem.
The aim of the design and analysis approaches presented in this dissertation is to
provide a framework, SABRE, wherein an increased operating clock frequency can be
achieved. The approach is a combination of analytical modeling, experimental analysis,
hardware/time-redundancy design, exception handling and recovery techniques.
Our proposed design replicates only a necessary part of the original circuit to avoid
the high hardware overhead of triple-modular-redundancy (TMR). The timing-critical
combinational circuit is path-wise partitioned into two sections. The combinational
circuits associated with long paths are laid out without any intrusion except for the
fan-out connections from the first section of the circuit to a replicated second section
of the combinational circuit. Thus only the second section of the circuit is replicated.
The signals fanning out from the first section are latched, so their paths are far shorter than the paths spanning the entire combinational circuit. The replicated circuit is timed
at a subsequent clock cycle to provide relaxed timing paths. This ensures that the
likelihood of mistiming due to stress or process variation is eliminated. During the
subsequent clock cycle, the outputs of the two logically identical, yet time-interleaved,
circuits are compared to detect faults. When a fault is detected, the retry signal
is triggered and a dynamic frequency step-down takes place before a pipe flush
and retry are issued. The significant timing overhead associated with the retry is offset
by the rarity of the timing violation events. Simulation results on ISCAS benchmark
circuits show that a 10% clock frequency gain is possible with 10 to 20% hardware
overhead for the replicated timing-critical circuit
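The claim that rare retries offset their own cost can be made concrete with a simple throughput model. The sketch below is an assumption for illustration, not the dissertation's analysis: it charges a fixed number of penalty cycles per detected timing error and ignores how long the stepped-down frequency is held. The function names and the 50-cycle penalty are hypothetical; the 10% frequency gain is the figure quoted above.

```python
def effective_speedup(freq_gain, error_rate, retry_penalty_cycles):
    """Throughput relative to the baseline clock.

    freq_gain: fractional clock increase (0.10 means 10% faster).
    error_rate: probability that a given cycle triggers a retry.
    retry_penalty_cycles: extra cycles lost per retry (flush + re-execute).
    """
    cycles_per_useful_cycle = 1.0 + error_rate * retry_penalty_cycles
    return (1.0 + freq_gain) / cycles_per_useful_cycle


def break_even_error_rate(freq_gain, retry_penalty_cycles):
    """Error rate at which overclocking stops paying off in this model."""
    return freq_gain / retry_penalty_cycles


# Hypothetical scenario: 10% faster clock, 50 lost cycles per retry.
for rate in (0.0, 1e-5, 1e-4, 1e-3):
    print(rate, effective_speedup(0.10, rate, 50))
print("break-even error rate:", break_even_error_rate(0.10, 50))
```

Under these assumptions, the frequency gain survives as long as the per-cycle error rate stays well below the break-even value, which is why the rarity of violations matters more than the size of a single retry penalty.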
Microelectronic design of pulse discriminator circuits for the LHCb detector (Disseny microelectrònic de circuits discriminadors de polsos pel detector LHCb)
The aim of this thesis is to present a solution for implementing the front-end system of the Scintillator Pad Detector (SPD) of the calorimeter system of the LHCb experiment, which will start in 2008 at the Large Hadron Collider (LHC) at CERN. The requirements of this specific system are discussed and an integrated solution is presented, at both the system and circuit levels. We also report some methodological achievements. First, a method is proposed to study the PSRR (and any other transfer function) in fully differential circuits, taking into account the effect of parameter mismatch. Concerning noise analysis, a method to study time-variant circuits in the frequency domain is presented and justified. This opens the possibility of studying the effect of 1/f noise in time-variant circuits. In addition, it is shown that the architecture developed for this system is a general solution for front ends in high-luminosity experiments that must operate with no dead time and must be robust against ballistic deficit
Statistical static timing analysis considering process variations and crosstalk
Increasing relative semiconductor process variations are making the prediction of
realistic worst-case integrated circuit delay or sign-off yield more difficult. As process
geometries shrink, intra-die variations have become dominant and it is imperative to
model them to obtain accurate timing analysis results. In addition, intra-die process
variations are spatially correlated due to pattern dependencies in the manufacturing
process. Any statistical static timing analysis (SSTA) tool is incomplete without a model
for signal crosstalk, as critical path delays can increase or decrease depending on the
switching of capacitively coupled nets. The coupled signal timing in turn depends on the
process variations. This work describes an SSTA tool that models signal crosstalk and
spatial correlation in intra-die process variations, along with gradients and inter-die
variations
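To show why spatial correlation matters for path delay spread, the following sketch computes the mean and standard deviation of a path delay as a sum of correlated Gaussian gate delays. It is an illustrative assumption, not the tool described above: it uses a hypothetical exponential distance-based correlation model, omits crosstalk, gradients and inter-die terms, and the name `path_delay_stats` and all numbers are made up for the example.

```python
import math


def path_delay_stats(gates, corr_length):
    """Mean and sigma of a path delay with spatially correlated gate delays.

    gates: list of (mean_delay, sigma, x, y) tuples, one per gate on the path.
    corr_length: distance over which correlation decays by 1/e (must be > 0).
    """
    mean = sum(g[0] for g in gates)
    var = 0.0
    for _, sigma_i, xi, yi in gates:
        for _, sigma_j, xj, yj in gates:
            # Assumed correlation model: rho = exp(-distance / corr_length).
            rho = math.exp(-math.hypot(xi - xj, yi - yj) / corr_length)
            var += rho * sigma_i * sigma_j
    return mean, math.sqrt(var)


# Four hypothetical gates, 100 ps mean and 5 ps sigma each, 1 unit apart.
path = [(100.0, 5.0, float(i), 0.0) for i in range(4)]
for length in (1e-9, 1.0, 1e9):
    print(length, path_delay_stats(path, length))
```

With negligible correlation the sigmas add in quadrature, while with near-perfect correlation they add linearly, so ignoring spatial correlation underestimates the spread of the path delay.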
Improving the tolerance of stochastic LDPC decoders to overclocking-induced timing errors: a tutorial and design example
Channel codes such as Low-Density Parity-Check (LDPC) codes may be employed in wireless communication schemes for correcting transmission errors. This tolerance to channel-induced transmission errors allows the communication schemes to achieve higher transmission throughputs, at the cost of requiring additional processing for performing LDPC decoding. However, this LDPC decoding operation is associated with a potentially inadequate processing throughput, which may constrain the attainable transmission throughput. In order to increase the processing throughput, the clock period may be reduced, albeit this is at the cost of potentially introducing timing errors. Previous research efforts have considered a paucity of solutions for mitigating the occurrence of timing errors in channel decoders, by employing additional circuitry for detecting and correcting these overclocking-induced timing errors. Against this background, in this paper we demonstrate that stochastic LDPC decoders (LDPC-SDs) are capable of exploiting their inherent error correction capability for correcting not only transmission errors, but also timing errors, even without the requirement for additional circuitry. Motivated by this, we provide the first comprehensive tutorial on LDPC-SDs. We also propose a novel design flow for timing-error-tolerant LDPC decoders. We use this to develop a timing error model for LDPC-SDs and investigate how their overall error correction performance is affected by overclocking. Drawing upon our findings, we propose a modified LDPC-SD, having an improved timing error tolerance. In a particular practical scenario, this modification eliminates the approximately 1 dB performance degradation that is suffered by an overclocked LDPC-SD without our modification, enabling the processing throughput to be increased by up to 69.4%, which is achieved without compromising the error correction capability or processing energy consumption of the LDPC-SD
Learning to Decode the Surface Code with a Recurrent, Transformer-Based Neural Network
Quantum error-correction is a prerequisite for reliable quantum computation.
Towards this goal, we present a recurrent, transformer-based neural network
which learns to decode the surface code, the leading quantum error-correction
code. Our decoder outperforms state-of-the-art algorithmic decoders on
real-world data from Google's Sycamore quantum processor for distance 3 and 5
surface codes. On distances up to 11, the decoder maintains its advantage on
simulated data with realistic noise including cross-talk, leakage, and analog
readout signals, and sustains its accuracy far beyond the 25 cycles it was
trained on. Our work illustrates the ability of machine learning to go beyond
human-designed algorithms by learning from data directly, highlighting machine
learning as a strong contender for decoding in quantum computers
Special Topics in Information Technology
This open access book presents thirteen outstanding doctoral dissertations in Information Technology from the Department of Electronics, Information and Bioengineering, Politecnico di Milano, Italy. Information Technology has always been highly interdisciplinary, as many aspects have to be considered in IT systems. The doctoral studies program in IT at Politecnico di Milano emphasizes this interdisciplinary nature, which is becoming more and more important in recent technological advances, in collaborative projects, and in the education of young researchers. Accordingly, the focus of advanced research is on pursuing a rigorous approach to specific research topics starting from a broad background in various areas of Information Technology, especially Computer Science and Engineering, Electronics, Systems and Control, and Telecommunications. Each year, more than 50 PhDs graduate from the program. This book gathers the outcomes of the thirteen best theses defended in 2020-21 and selected for the IT PhD Award. Each of the authors provides a chapter summarizing his/her findings, including an introduction, description of methods, main achievements and future work on the topic. Hence, the book provides a cutting-edge overview of the latest research trends in Information Technology at Politecnico di Milano, presented in an easy-to-read format that will also appeal to non-specialists