387 research outputs found

    A novel deep submicron bulk planar sizing strategy for low energy subthreshold standard cell libraries

    Engineering and Physical Sciences Research Council (EPSRC) and Arm Ltd for providing funding in the form of grants and studentships. This work investigates bulk planar deep submicron semiconductor physics in an attempt to improve standard cell libraries aimed at operation in the subthreshold regime and in Ultra Wide Dynamic Voltage Scaling schemes. The current state of research in the field is examined, with particular emphasis on how subthreshold physical effects degrade robustness, variability and performance. How prevalent these physical effects are in a commercial 65nm library is then investigated by extensive modeling of a BSIM4.5 compact model. Three distinct sizing strategies emerge; cells of each strategy are laid out, and post-layout parasitically extracted models are simulated to determine the advantages and disadvantages of each. Full custom ring oscillators are designed and manufactured. Measured results reveal a close correlation with the simulated results, with frequency improvements of up to 2.75X/2.43X observed for RVT/LVT devices respectively. The experiment provides the first silicon evidence of the improvement capability of the Inverse Narrow Width Effect over a wide supply voltage range, as well as a mechanism of additional temperature stability in the subthreshold regime. A novel sizing strategy is proposed and pursued to determine whether it is able to produce a superior complex circuit design using a commercial digital synthesis flow. Two 128 bit AES cores are synthesized from the novel sizing strategy and compared against a third AES core synthesized from a state-of-the-art subthreshold standard cell library used by ARM. Results show improvements in energy-per-cycle of up to 27.3% and frequency improvements of up to 10.25X. The novel subthreshold sizing strategy proves superior over a temperature range of 0 °C to 85 °C with a nominal (20 °C) improvement in energy-per-cycle of 24% and frequency improvement of 8.65X. 
A comparison to prior art is then performed. Valid cases are presented where the proposed sizing strategy would be a candidate to produce superior subthreshold circuits.

    Timing Closure in Chip Design

    Achieving timing closure is a major challenge to the physical design of a computer chip. Its task is to find a physical realization fulfilling the speed specifications. In this thesis, we propose new algorithms for the key tasks of performance optimization, namely repeater tree construction; circuit sizing; clock skew scheduling; threshold voltage optimization and plane assignment. Furthermore, a new program flow for timing closure is developed that integrates these algorithms with placement and clocktree construction. For repeater tree construction a new algorithm for computing topologies, which are later filled with repeaters, is presented. To this end, we propose a new delay model for topologies that not only accounts for the path lengths, as existing approaches do, but also for the number of bifurcations on a path, which introduce extra capacitance and thereby delay. In the extreme cases of pure power optimization and pure delay optimization the optimum topologies regarding our delay model are minimum Steiner trees and alphabetic code trees with the shortest possible path lengths. We presented a new, extremely fast algorithm that scales seamlessly between the two opposite objectives. For special cases, we prove the optimality of our algorithm. The efficiency and effectiveness in practice is demonstrated by comprehensive experimental results. The task of circuit sizing is to assign millions of small elementary logic circuits to elements from a discrete set of logically equivalent, predefined physical layouts such that power consumption is minimized and all signal paths are sufficiently fast. In this thesis we develop a fast heuristic approach for global circuit sizing, followed by a local search into a local optimum. Our algorithms use, in contrast to existing approaches, the available discrete layout choices and accurate delay models with slew propagation. 
The global approach iteratively assigns slew targets to all source pins of the chip and chooses a discrete layout of minimum size preserving the slew targets. In comprehensive experiments on real instances, we demonstrate that the worst path delay is within 7% of its lower bound on average after a few iterations. The subsequent local search reduces this gap to 2% on average. Combining global and local sizing we are able to size more than 5.7 million circuits within 3 hours. For the clock skew scheduling problem we develop the first algorithm with a strongly polynomial running time for the cycle time minimization in the presence of different cycle times and multi-cycle paths. In practice, an iterative local search method is much more efficient. We prove that this iterative method maximizes the worst slack, even when restricting the feasible schedule to certain time intervals. Furthermore, we enhance the iterative local approach to determine a lexicographically optimum slack distribution. The clock skew scheduling problem is then generalized to allow for simultaneous data path optimization. In fact, this is a time-cost tradeoff problem. We developed the first combinatorial algorithm for computing time-cost tradeoff curves in graphs that may contain cycles. Starting from the lowest-cost solution, the algorithm iteratively computes a descent direction by a minimum cost flow computation. The maximum feasible step length is then determined by a minimum ratio cycle computation. This approach can be used in chip design for several optimization tasks, e.g. threshold voltage optimization or plane assignment. Finally, the optimization routines are combined into a timing closure flow. Here, the global placement is alternated with global performance optimization. Netweights are used to penalize the length of critical nets during placement. After the global phase, the performance is improved further by applying more comprehensive optimization routines on the most critical paths. 
In the end, the clock schedule is optimized and clocktrees are inserted. Computational results of the design flow are obtained on real-world computer chips.
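The cycle time minimization mentioned in the abstract above can be illustrated with a much simpler textbook variant. The sketch below is not the thesis's strongly polynomial algorithm: it assumes a single cycle time, only setup constraints of the form s[u] + delay <= s[v] + period, and no multi-cycle paths. It tests feasibility of a period with Bellman-Ford negative-cycle detection and then bisects on the period. The function names and the three-register example are hypothetical.

```python
def feasible(num_regs, paths, period):
    """True if clock latencies s exist with s[u] + delay <= s[v] + period for all paths."""
    # Each constraint s[u] - s[v] <= period - delay is an edge v -> u of that weight.
    # The system is feasible exactly when this graph has no negative cycle.
    dist = [0.0] * num_regs  # as if a virtual source reached every register at cost 0
    for _ in range(num_regs):
        changed = False
        for u, v, delay in paths:
            w = period - delay
            if dist[v] + w < dist[u] - 1e-12:
                dist[u] = dist[v] + w
                changed = True
        if not changed:
            return True  # converged, so no negative cycle
    return False  # still relaxing after num_regs passes: negative cycle


def min_cycle_time(num_regs, paths, tol=1e-6):
    """Bisect on the period; with zero skew the longest path delay is always feasible."""
    lo, hi = 0.0, max(delay for _, _, delay in paths)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(num_regs, paths, mid):
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical register graph: (from_register, to_register, path_delay)
example_paths = [(0, 1, 4.0), (1, 2, 2.0), (2, 0, 6.0)]
best_period = min_cycle_time(3, example_paths)
```

In this toy instance the zero-skew period is bounded by the slowest path (6.0), while skew scheduling brings it down to the mean delay around the cycle.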

    Standard cell library design for sub-threshold operation


    Clock Tree and Flip-flop Co-optimization for Reducing Power Consumption and Power/Ground Noise of Integrated Circuits and Systems

    Doctoral dissertation, Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, August 2017. 김태환. For very-large-scale integration (VLSI) circuits, the activation of all flip-flops that are used to store data is synchronized by clock signals delivered through clock networks. Because clock signals switch at very high frequency, the dynamic power consumed on clock networks accounts for a considerable portion of the total power consumption of the circuits. In addition, the largest share of the power consumption in the clock networks comes from the flip-flops and the buffers that drive the flip-flops at the clock network boundary. Furthermore, the requirement of simultaneously activating all flip-flops in synchronous circuits induces a high peak power/ground noise (i.e., voltage drop) at the clock boundary. In this regard, this thesis addresses two new problems: the problem of reducing the clock power consumption at the clock network boundary, and the problem of reducing the peak current at the clock network boundary. Unlike prior work, which considered the optimization of flip-flops and clock buffers separately, our approach takes into account the co-optimization of flip-flops and clock buffers. Precisely, we propose four different types of hardware component that can implement a set of flip-flops and their driving buffer as a single unit. The key idea for the derivation of the four types of clock boundary component is that one of the inverters in the driving buffer and one of the inverters in each flip-flop can be combined and removed without changing the functionality of the flip-flops. Consequently, we have more freedom to select (i.e., allocate) clock boundary components that can reduce the power consumption or peak current under timing constraints. We have implemented our approach of clock boundary optimization under a bounded clock skew constraint and tested it with ISCAS 89 benchmark circuits. 
The experimental results confirm that our approach is able to reduce the clock power consumption by 7.9% to 10.2% and power/ground noise by 27.7% to 30.9% on average.

    Algorithms for Circuit Sizing in VLSI Design

    One of the key problems in the physical design of computer chips, also known as integrated circuits, consists of choosing a physical layout for the logic gates and memory circuits (registers) on the chip. The layouts have a high influence on the power consumption and area of the chip and the delay of signal paths. A discrete set of predefined layouts for each logic function and register type with different physical properties is given by a library. One of the most influential characteristics of a circuit defined by the layout is its size. In this thesis we present new algorithms for the problem of choosing sizes for the circuits and its continuous relaxation, and evaluate these in theory and practice. A popular approach is based on Lagrangian relaxation and projected subgradient methods. We show that seemingly heuristic modifications that have been proposed for this approach can be theoretically justified by applying the well-known multiplicative weights algorithm. Subsequently, we propose a new model for the sizing problem as a min-max resource sharing problem. In our context, power consumption and signal delays are represented by resources that are distributed to customers. Under certain assumptions we obtain a polynomial time approximation for the continuous relaxation of the sizing problem that improves over the Lagrangian relaxation based approach. The new resource sharing algorithm has been implemented as part of the BonnTools software package which is developed at the Research Institute for Discrete Mathematics at the University of Bonn in cooperation with IBM. Our experiments on the ISPD 2013 benchmarks and state-of-the-art microprocessor designs provided by IBM illustrate that the new algorithm exhibits more stable convergence behavior compared to a Lagrangian relaxation based algorithm. Additionally, better timing and reduced power consumption were achieved on almost all instances. 
A subproblem of the new algorithm consists of finding sizes minimizing a weighted sum of power consumption and signal delays. We describe a method that approximates the continuous relaxation of this problem in polynomial time under certain assumptions. For the discrete problem we provide a fully polynomial approximation scheme under certain assumptions on the topology of the chip. Finally, we present a new algorithm for timing-driven optimization of registers. Their sizes and locations on a chip are usually determined during the clock network design phase, and remain mostly unchanged afterwards although the timing criticalities on which they were based can change. Our algorithm permutes register positions and sizes within so-called clusters without impairing the clock network such that it can be applied late in a design flow. Under mild assumptions, our algorithm finds an optimal solution which maximizes the worst cluster slack. It is implemented as part of the BonnTools and improves timing of registers on state-of-the-art microprocessor designs by up to 7.8% of design cycle time.
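The weighted-sum subproblem described above can be made concrete on a toy instance. The sketch below is an illustration under strong assumptions and is not the thesis's approximation scheme: the circuit is a single chain of gates, each gate's delay is modeled as load divided by size, the load of a gate is the size of its successor (the last gate drives a fixed output load), and power is taken as proportional to total size. Under that hypothetical model, a dynamic program over the discrete sizes finds the exact minimum, which a brute-force search confirms.

```python
from itertools import product


def objective(chosen, out_load, power_weight):
    """Weighted sum: power_weight * total size + sum of toy gate delays (load / size)."""
    loads = list(chosen[1:]) + [out_load]
    return sum(power_weight * x + load / x for x, load in zip(chosen, loads))


def size_chain(num_gates, sizes, out_load, power_weight):
    """Dynamic program from the last gate back to the first; returns (cost, sizes)."""
    # cost[s]: best objective of the remaining chain if the current gate uses sizes[s]
    cost = [power_weight * x + out_load / x for x in sizes]
    choice = []  # choice[k][s]: best successor size index for a gate of size sizes[s]
    for _ in range(num_gates - 1):
        new_cost, pick = [], []
        for x in sizes:
            # This gate's load is the input capacitance (here: size) of its successor.
            best = min(range(len(sizes)), key=lambda t: sizes[t] / x + cost[t])
            new_cost.append(power_weight * x + sizes[best] / x + cost[best])
            pick.append(best)
        cost = new_cost
        choice.append(pick)
    s = min(range(len(sizes)), key=lambda t: cost[t])
    total = cost[s]
    chosen = [sizes[s]]
    for pick in reversed(choice):  # walk forward from the first gate to the last
        s = pick[s]
        chosen.append(sizes[s])
    return total, chosen


def brute_force(num_gates, sizes, out_load, power_weight):
    """Reference answer by enumerating every size assignment (exponential)."""
    return min(objective(c, out_load, power_weight)
               for c in product(sizes, repeat=num_gates))
```

Raising `power_weight` pushes the chosen sizes toward the smallest library entries, while lowering it favors larger, faster gates; that trade-off is what the weighting in the subproblem controls.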

    Mesh-Based Clock Network Design Methodology

    Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2015. 김태환. The clock distribution network in a synchronous digital circuit delivers a clock signal to every storage element (i.e., clock sink) in the circuit. However, since continued technology scaling increases PVT (process-voltage-temperature) variation, the resulting increase in clock skew variation is highly likely to cause performance degradation or system failure at run time. Recently, to mitigate the clock skew variation, many researchers have taken a profound interest in the clock mesh network. However, though the structure of the clock mesh network is excellent in tolerating timing variation, it demands significantly high power consumption due to the use of excessive mesh wire and buffer resources. Thus, optimizing the resources required in mesh clock synthesis while maintaining the variation tolerance is crucially important. The three major tasks that greatly affect the cost of the resulting clock mesh are (1) mesh segment allocation, (2) mesh buffer allocation and sizing, and (3) clock sink binding to mesh segments. Previous clock mesh optimization approaches solve the three tasks sequentially, one at a time, to manage the run time complexity of the tasks at the expense of the quality of results. However, since the three tasks are tightly inter-related, simultaneously optimizing all three tasks is essential, if the run time permits, to synthesize an economical clock mesh network. In this dissertation, we propose an approach which is able to tackle the problem in an integrated fashion by combining the three tasks into an iterative framework of incremental updates and solving them simultaneously to find a globally optimal allocation of mesh resources while taking into account the clock skew tolerance constraints. 
The core parts of this dissertation are a precise analysis of the relation among the resource optimization tasks and the establishment of a mechanism for effective and efficient integration of the tasks. In particular, to handle the run time problem, we propose a set of speed-up techniques, i.e., RC circuit modeling that eliminates redundant matrix multiplications, a sliding window scheme, and fast estimation of buffer sizing effects, which are fitted into our context of fast clock skew estimation in mesh resource optimization, as well as early decision policies. In summary, this dissertation presents an efficient design methodology for clock mesh synthesis that integrates the three tasks and reduces runtime complexity.

    Clock Distribution Network Optimization by Sequential Quadratic Programing

    Clock mesh is widely used in microprocessor designs for achieving low clock skew and high process variation tolerance. Clock mesh optimization is a very difficult problem to solve because it has a highly connected structure and requires accurate delay models which are computationally expensive. Existing methods on clock network optimization are either restricted to clock trees, which are easy to separate into smaller problems, or are naive heuristics based on crude delay models. A clock mesh sizing algorithm, which aims to minimize total mesh wire area subject to clock skew constraints, has been proposed in this research work. This algorithm is a systematic solution search through rigorous Sequential Quadratic Programming (SQP). The SQP is guided by an efficient adjoint sensitivity analysis which has near-SPICE-level accuracy and faster-than-SPICE speed (SPICE: Simulation Program with Integrated Circuit Emphasis). Experimental results on various benchmark circuits indicate that this algorithm leads to substantial wire area reduction while maintaining low clock skew in the clock mesh. The reduction in mesh area achieved is about 33%.

    CAD methodologies for low power and reliable 3D ICs

    The main objective of this dissertation is to explore and develop computer-aided-design (CAD) methodologies and optimization techniques for reliability, timing performance, and power consumption of through-silicon-via(TSV)-based and monolithic 3D IC designs. The 3D IC technology is a promising answer to the device scaling and interconnect problems that industry faces today. Yet, since multiple dies are stacked vertically in 3D ICs, new problems arise such as thermal, power delivery, and so on. New physical design methodologies and optimization techniques should be developed to address the problems and exploit the design freedom in 3D ICs. Towards the objective, this dissertation includes four research projects. The first project is on the co-optimization of traditional design metrics and reliability metrics for 3D ICs. It is well known that heat removal and power delivery are two major reliability concerns in 3D ICs. To alleviate thermal problem, two possible solutions have been proposed: thermal-through-silicon-vias (T-TSVs) and micro-fluidic-channel (MFC) based cooling. For power delivery, a complex power distribution network is required to deliver currents reliably to all parts of the 3D IC while suppressing the power supply noise to an acceptable level. However, these thermal and power networks pose major challenges in signal routability and congestion. In this project, a co-optimization methodology for signal, power, and thermal interconnects in 3D ICs is presented. The goal of the proposed approach is to improve signal, thermal, and power noise metrics and to provide fast and accurate design space explorations for early design stages. The second project is a study on 3D IC partition. For a 3D IC, the target circuit needs to be partitioned into multiple parts then mapped onto the dies. The partition style impacts design quality such as footprint, wirelength, timing, and so on. 
In this project, the design methodologies of 3D ICs with different partition styles are demonstrated. For the LEON3 multi-core microprocessor, three partitioning styles are compared: core-level, block-level, and gate-level. The design methodologies for such partitioning styles and their implications on the physical layout are discussed. Then, to perform timing optimizations for 3D ICs, two timing constraint generation methods are demonstrated that lead to different design quality. The third project is on the buffer insertion for timing optimization of 3D ICs. For high performance 3D ICs, it is crucial to perform thorough timing optimizations. Among timing optimization techniques, buffer insertion is known to be the most effective way. The TSVs have a large parasitic capacitance that increases the signal slew and the delay on the downstream. In this project, a slew-aware buffer insertion algorithm is developed that handles full 3D nets and considers TSV parasitics and slew effects on delay. Compared with the well-known van Ginneken algorithm and a commercial tool, the proposed algorithm finds buffering solutions with lower delay values and acceptable runtime overhead. The last project is on the ultra-high-density logic designs for monolithic 3D ICs. The nano-scale 3D interconnects available in monolithic 3D IC technology enable ultra-high-density device integration at the individual transistor-level. The benefits and challenges of monolithic 3D integration technology for logic designs are investigated. First, a 3D standard cell library for transistor-level monolithic 3D ICs is built and their timing and power behavior are characterized. Then, various interconnect options for monolithic 3D ICs that improve design quality are explored. Next, timing-closed, full-chip GDSII layouts are built and iso-performance power comparisons with 2D IC designs are performed. 
Important design metrics such as area, wirelength, timing, and power consumption are compared among transistor-level monolithic 3D, gate-level monolithic 3D, TSV-based 3D, and traditional 2D designs. PhD. Committee Chair: Lim, Sung Kyu; Committee Member: Bakir, Muhannad; Committee Member: Kim, Hyesoon; Committee Member: Lee, Hsien-Hsin; Committee Member: Mukhopadhyay, Saiba
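The statement above that TSV parasitic capacitance increases downstream delay can be illustrated with the Elmore delay of an RC ladder, the delay model used in the classic van Ginneken formulation. The sketch below is an assumption-laden illustration and does not reproduce the dissertation's slew-aware model: the helper name and all resistance and capacitance values are hypothetical, and slew is ignored entirely.

```python
def elmore_delay(stages):
    """Elmore delay of an RC ladder.

    stages: list of (resistance_ohm, capacitance_farad) ordered from driver to sink.
    Each capacitance is charged through all resistance upstream of it.
    """
    delay = 0.0
    upstream_r = 0.0
    for r, c in stages:
        upstream_r += r
        delay += upstream_r * c
    return delay


# Hypothetical values: a 1 kOhm driver with 10 fF of near-end wire capacitance,
# then 100 Ohm of wire ending in a 5 fF sink.
path_without_tsv = [(1000.0, 10e-15), (100.0, 5e-15)]

# Same path with a TSV in between, modeled as a tiny resistance and a large capacitance.
path_with_tsv = [(1000.0, 10e-15), (0.05, 50e-15), (100.0, 5e-15)]

delay_2d = elmore_delay(path_without_tsv)
delay_3d = elmore_delay(path_with_tsv)
```

Because the TSV capacitance is charged through the full driver resistance, it dominates the delay in this toy path, which is why a buffer placed near the TSV can pay off.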

    Analysis and Design Methodologies for Switched-Capacitor Filter Circuits in Advanced CMOS Technologies

    Analog filters are an extremely important block in several electronic systems, such as RF transceivers, data acquisition channels, or sigma-delta modulators. They allow the suppression of unwanted frequency bands in a signal, improving the system’s performance. These blocks are typically implemented using active RC filters, gm-C filters, or switched-capacitor (SC) filters. In modern deep-submicron CMOS technologies, the transistor’s intrinsic gain is small and has a large variability, making the design of moderate and high-gain amplifiers, used in the implementation of filter blocks, extremely difficult. To avoid this difficulty, in the case of SC filters, the opamp can be replaced with a voltage buffer or a low-gain amplifier (< 2), simplifying the amplifier’s design and making it easier to achieve higher bandwidths for the same power. However, due to the loss of the virtual ground node, the circuit becomes sensitive to the effects of parasitic capacitances, which need to be compensated during the design process. This thesis addresses the task of optimizing SC filters (mainly focused on implementations using low-gain amplifiers), helping designers with the complex task of designing high performance SC filters in advanced CMOS technologies. An efficient optimization methodology is introduced, based on hybrid cost functions (equation-based/simulation-based) and using genetic algorithms. The optimization software starts by using equations in the cost function to estimate the filter’s frequency response, reducing computation time when compared with the electrical simulation of the circuit’s impulse response. Using equations, the frequency response can be quickly computed (< 1 s), allowing the use of larger populations in the genetic algorithm (GA) to cover the entire design space. 
Once the specifications are met, the population size is reduced and the equation-based design is fine-tuned using the more computationally intensive, but more accurate, simulation-based cost function, allowing to accurately compensate the parasitic capacitances, which are harder to estimate using equations. With this hybrid approach, it is possible to obtain the final optimized design within a reasonable amount of computation time. Two methods are described for the estimation of the filter’s frequency response. The first method is hierarchical in nature where, in the first step, the frequency response is optimized using the circuit’s ideal transfer function. The following steps are used to optimize circuits, at transistor level, to replace the ideal blocks (amplifier and switches) used in the first step, while compensating the effects of the circuit’s parasitic capacitances in the ideal design. The second method uses a novel efficient numerical methodology to obtain the frequency response of SC filters, based on the circuit’s first-order differential equations. The methodology uses a non-hierarchical approach, where the non-ideal effects of the transistors (in the amplifier and in the switches) are taken into consideration, allowing the accurate computation of the frequency response, even in the case of incomplete settling in the SC branches. Several design and optimization examples are given to demonstrate the performance of the proposed methods. The prototypes of a second order programmable bandpass SC filter and a 50 Hz notch SC filter have been designed in UMC 130 nm CMOS technology and optimized using the proposed optimization software with a supply voltage of 0.9 V. The bandpass SC filter has a total power consumption of 249 uW. The filter’s central frequency can be tuned between 3.9 kHz and 7.1 kHz, the gain between -6.4 dB and 12.6 dB, and the quality factor between 0.9 and 6.9. 
Depending on the bit configuration, the circuit’s THD is between -54.7 dB and -61.7 dB. The 50 Hz notch SC filter has a total power consumption of 273 uW. The transient simulation of the circuit’s extracted view (C+CC) shows an attenuation of 52.3 dB of the 50 Hz interference and that the desired 5 kHz signal has a THD of -92.3 dB.
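An equation-based cost function of the kind described above can be sketched for an ideal second-order bandpass response. This is only an illustration under stated assumptions: it uses the textbook continuous-time biquad magnitude rather than the thesis's SC circuit equations, and the function names, the squared-error cost, and the chosen operating point (5 kHz center frequency, Q of 2, 6 dB gain, all inside the tuning ranges quoted above) are hypothetical.

```python
import math


def bandpass_gain_db(f, f0, q, gain_db):
    """Magnitude in dB of an ideal biquad bandpass with peak gain gain_db at f0."""
    g = 10 ** (gain_db / 20)
    # |H| = g * (f0*f/q) / sqrt((f0^2 - f^2)^2 + (f0*f/q)^2); the 2*pi factors cancel.
    num = g * (f0 * f / q)
    den = math.hypot(f0 ** 2 - f ** 2, f0 * f / q)
    return 20 * math.log10(num / den)


def spec_cost(params, targets):
    """Sum of squared dB errors between the ideal response and target points.

    params: (f0, q, gain_db); targets: list of (frequency_hz, desired_gain_db).
    """
    f0, q, gain_db = params
    return sum((bandpass_gain_db(f, f0, q, gain_db) - t) ** 2 for f, t in targets)


# Hypothetical operating point within the reported tuning ranges.
F0, Q, GAIN_DB = 5000.0, 2.0, 6.0

# Upper -3 dB frequency of the ideal bandpass, from |f0^2 - f^2| = f0*f/Q.
f_upper = F0 * (1 / (2 * Q) + math.sqrt(1 + 1 / (4 * Q ** 2)))

# Targets generated from the ideal response itself, so the true parameters cost ~0.
targets = [(f, bandpass_gain_db(f, F0, Q, GAIN_DB)) for f in (2000.0, 5000.0, 9000.0)]
```

A closed-form cost like this is cheap to evaluate, which is what lets a genetic algorithm afford large populations before switching to slower circuit simulation for fine-tuning.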