54 research outputs found

    Fast Repeater Tree Construction

    Repeaters are used during physical design of chips to improve the electrical and timing properties of interconnections. They are added along Steiner trees that connect root gates to sinks, creating repeater trees. Their construction became a crucial part of chip design. We present a new algorithm to solve the repeater tree construction problem. We first present an extensive version of the Repeater Tree Problem. Our problem formulation encapsulates most of the constraints that have been studied so far. We also consider several aspects for the first time, for example, slew dependent required arrival times at repeater tree sinks. The employed technology, the properties of available repeaters and metal wires, the shape of the chip, the temperature, the voltages, and many other factors highly influence the results of repeater tree construction. To take all this into account, we extensively preprocess the environment to extract parameters for our algorithms. We first present an algorithm for Steiner tree creation and prove that our algorithm is able to create timing-efficient as well as cost-efficient trees. Our algorithm is based on a delay model that accurately describes the timing that one can achieve after repeater insertion upfront. Next, we deal with the problem of adding repeaters to a given Steiner tree. The predominantly used algorithms to solve this problem use dynamic programming. However, they have several drawbacks. Firstly, potential repeater positions along the Steiner tree have to be chosen upfront. Secondly, the algorithms strictly follow the given Steiner tree and miss optimization opportunities. Finally, dynamic programming causes high running times. We present our new buffer insertion algorithm, Fast Buffering, that overcomes these limitations. It is able to produce results with similar quality to a dynamic programming approach but a much better running time. 
We also present improvements to the dynamic programming approach that allow us to improve the quality further at the expense of a high running time. We have implemented our algorithms as part of the BonnTools physical design optimization suite developed at the Research Institute for Discrete Mathematics in cooperation with IBM. Our implementation deals with all tedious details of a grown real-world chip optimization environment. We have created extensive experimental results on challenging real-world test cases provided by our cooperation partner. Our algorithm can solve about 5.7 million instances per hour.
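
To illustrate why a delay model can anticipate the effect of repeater insertion before any repeater is placed, the following sketch uses the textbook Elmore delay of a uniformly buffered wire. It is not the delay model of the thesis, which the abstract does not specify. The per-unit wire resistance `r` and capacitance `c`, the repeater output resistance `r_b`, input capacitance `c_b`, intrinsic delay `t_b`, and all function names are assumptions chosen for illustration.

```python
import math


def segment_delay(length, r, c, r_b, c_b, t_b):
    """Elmore delay of one repeater driving a wire segment into the next repeater.

    All parameters are hypothetical illustration values, not taken from the thesis.
    """
    driver = r_b * (c * length + c_b)            # repeater resistance charges wire and load
    wire = r * length * (c * length / 2 + c_b)   # distributed wire RC plus the load
    return t_b + driver + wire


def buffered_delay(total_length, k, r, c, r_b, c_b, t_b):
    """Delay of a wire cut into k equal segments, each driven by one repeater."""
    return k * segment_delay(total_length / k, r, c, r_b, c_b, t_b)


def optimal_segment_length(r, c, r_b, c_b, t_b):
    """Segment length that minimizes delay per unit length (continuous relaxation)."""
    return math.sqrt(2 * (t_b + r_b * c_b) / (r * c))


def delay_per_unit_length(r, c, r_b, c_b, t_b):
    """Delay per unit length at the optimal spacing, so total delay grows linearly."""
    return r_b * c + r * c_b + math.sqrt(2 * r * c * (t_b + r_b * c_b))
```

With every parameter set to 1, a wire of length 8 has delay 50 without intermediate repeaters and delay 32 when cut into four segments of the optimal length 2, which is 4 per unit length. Under this simplified model the achievable delay is linear in the wire length, so it can be estimated from the geometry alone.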

    Timing-Constrained Global Routing with Buffered Steiner Trees

    This dissertation deals with the combination of two key problems that arise in the physical design of computer chips: global routing and buffering. The task of buffering is the insertion of buffers and inverters into the chip's netlist to reduce signal delays and to improve electrical properties of the chip. Insertion of buffers and inverters goes along with the construction of Steiner trees that connect logical sources with possibly many logical sinks and have buffers and inverters as parts of these connections. Classical global routing focuses on packing Steiner trees within the limited routing space. Buffering and global routing have been solved separately in the past. In this thesis we overcome the limitations of the classical approaches by considering the buffering problem as a global, multi-objective problem. We study its theoretical aspects and propose algorithms, which we implement in the tool BonnRouteBuffer for timing-constrained global routing with buffered Steiner trees. At its core, we propose a new theoretically founded framework to model timing constraints inherently within global routing. As the most important sub-task, we have to compute a buffered Steiner tree for a single net minimizing the sum of prices for delays, routing congestion, placement congestion, power consumption, and net length. For this sub-task we present a fully polynomial time approximation scheme to compute an almost-cheapest Steiner tree with a given routing topology and prove that an exact algorithm cannot exist unless P=NP. For topology computation we present a bicriteria approximation algorithm that bounds both the geometric length and the worst slack of the topology. To improve the practical results we present many heuristic modifications as well as speed-up and post-optimization techniques for buffered Steiner trees. We conduct experiments on challenging real-world test cases provided by our cooperation partner IBM to demonstrate the quality of our tool. 
Our new algorithm could produce better solutions with respect to both timing and routability. After post-processing with gate sizing and Vt-assignment, we can even reduce the power consumption on most instances. Overall, our results show that our tool BonnRouteBuffer for timing-constrained global routing is superior to industrial state-of-the-art tools

    Timing Closure in Chip Design

    Achieving timing closure is a major challenge in the physical design of a computer chip. The task is to find a physical realization that fulfills the speed specifications. In this thesis, we propose new algorithms for the key tasks of performance optimization, namely repeater tree construction, circuit sizing, clock skew scheduling, threshold voltage optimization, and plane assignment. Furthermore, a new program flow for timing closure is developed that integrates these algorithms with placement and clock tree construction. For repeater tree construction a new algorithm for computing topologies, which are later filled with repeaters, is presented. To this end, we propose a new delay model for topologies that not only accounts for the path lengths, as existing approaches do, but also for the number of bifurcations on a path, which introduce extra capacitance and thereby delay. In the extreme cases of pure power optimization and pure delay optimization the optimum topologies regarding our delay model are minimum Steiner trees and alphabetic code trees with the shortest possible path lengths, respectively. We present a new, extremely fast algorithm that scales seamlessly between the two opposite objectives. For special cases, we prove the optimality of our algorithm. The efficiency and effectiveness in practice are demonstrated by comprehensive experimental results. The task of circuit sizing is to assign millions of small elementary logic circuits to elements from a discrete set of logically equivalent, predefined physical layouts such that power consumption is minimized and all signal paths are sufficiently fast. In this thesis we develop a fast heuristic approach for global circuit sizing, followed by a local search into a local optimum. Our algorithms use, in contrast to existing approaches, the available discrete layout choices and accurate delay models with slew propagation. 
The global approach iteratively assigns slew targets to all source pins of the chip and chooses a discrete layout of minimum size preserving the slew targets. In comprehensive experiments on real instances, we demonstrate that the worst path delay is within 7% of its lower bound on average after a few iterations. The subsequent local search reduces this gap to 2% on average. Combining global and local sizing we are able to size more than 5.7 million circuits within 3 hours. For the clock skew scheduling problem we develop the first algorithm with a strongly polynomial running time for the cycle time minimization in the presence of different cycle times and multi-cycle paths. In practice, an iterative local search method is much more efficient. We prove that this iterative method maximizes the worst slack, even when restricting the feasible schedule to certain time intervals. Furthermore, we enhance the iterative local approach to determine a lexicographically optimum slack distribution. The clock skew scheduling problem is then generalized to allow for simultaneous data path optimization. In fact, this is a time-cost tradeoff problem. We developed the first combinatorial algorithm for computing time-cost tradeoff curves in graphs that may contain cycles. Starting from the lowest-cost solution, the algorithm iteratively computes a descent direction by a minimum cost flow computation. The maximum feasible step length is then determined by a minimum ratio cycle computation. This approach can be used in chip design for several optimization tasks, e.g. threshold voltage optimization or plane assignment. Finally, the optimization routines are combined into a timing closure flow. Here, the global placement is alternated with global performance optimization. Netweights are used to penalize the length of critical nets during placement. After the global phase, the performance is improved further by applying more comprehensive optimization routines on the most critical paths. 
In the end, the clock schedule is optimized and clock trees are inserted. Computational results of the design flow are obtained on real-world computer chips.

    Facility Location and Clock Tree Synthesis

    The construction of clock trees and repeater trees is a major challenge in chip design. Such trees distribute an electrical clock signal from a source to a set of sinks on a chip. On recent designs there can be millions of repeater trees with only a few to several hundred sinks each, and several clock trees with up to several hundred thousand sinks. In repeater trees the signal has to arrive at each sink no later than an individual required arrival time, while in clock trees it has to arrive at each sink within an individual required arrival time window. In this thesis, we present new theory and algorithms for the construction of clock trees and repeater trees and for an essential sub-problem, the Sink Clustering Problem. We also describe our clock tree construction tool BonnClock, which has been used by IBM Microelectronics for the design of hundreds of highly complex chips. First, we introduce the Sink Clustering Problem, the main sub-problem of clock tree design. Given a metric space $(V, c)$, a finite set $D$ of terminals with positions $p(v) \in V$ and demands $d(v) \in \mathbb{R}_{\ge 0}$ for all $v \in D$, a facility opening cost $f \in \mathbb{R}_{>0}$ and a load limit $u \in \mathbb{R}_{>0}$, the task is to find a partition $D = D_1 \cup \dots \cup D_k$ of $D$ and, for all $1 \le i \le k$, a Steiner tree $S_i$ for $\{p(v) \mid v \in D_i\}$. Each cluster $(D_i, S_i)$, $1 \le i \le k$, has to keep the load limit, that is, $\sum_{e \in E(S_i)} c(e) + \sum_{s \in D_i} d(s) \le u$. The goal is to minimize the weighted sum of the length of all Steiner trees plus the number of clusters, i.e. to minimize $\sum_{i=1}^{k} \sum_{e \in E(S_i)} c(e) + kf$. We present the first constant-factor approximation algorithm for the Sink Clustering Problem. It is based on decomposing a minimum spanning tree on the sinks and has an approximation guarantee of $1 + 2\alpha$, where $\alpha$ is the Steiner ratio of the underlying metric. Moreover, we introduce two variants of the algorithm that rely on decomposing an approximate minimum Steiner tree and an approximate minimum traveling salesman tour. 
These algorithms have approximation guarantees of $3\beta$ and $3\gamma$, respectively, where $\beta$ and $\gamma$ are the approximation guarantees of the Steiner tree and TSP approximation algorithms, respectively. We also propose two post-optimization algorithms that can further improve an existing clustering. We analyze the structure of the Sink Clustering Problem and exhibit its connections to matroid theory. In particular, we use the property of matroids that for any two bases $B_1$, $B_2$ there is a bijection $p : B_1 \to B_2$ so that $(B_1 \setminus \{b\}) \cup \{p(b)\}$ is again a basis for each $b \in B_1$. We replace each Steiner tree of an optimum solution by a minimum spanning tree and connect all trees to a new artificial vertex $s$ and get a tree $S$. In a modified metric the total length of $S$ is a good lower bound for the cost of an optimum solution. Due to the matroid property we can compare a minimum spanning tree $T$ on $D \cup \{s\}$ with $S$; the length of any edge of $T$ is bounded by the length of an edge of $S$. We introduce the concept of $K$-dominated functions that helps us to increase the 'cost' of certain edges of $T$ while still having the property that the total length of all edges of $T$ ending in a vertex of $K \subseteq D$ is bounded by the total length of all edges of $S$ ending in a vertex of $K$. Applying this procedure to the sets of a laminar family on $D$ yields an improved lower bound. The bound can be further improved by combining it with a lower bound for the length of a minimum Steiner tree on $D$. For this bound we prove the following lemma: For any family of trees $\mathcal{T} = \{T_1, \dots, T_k\}$ with $V(T_i) \subset D$, $1 \le i \le k$, with the property that for any subset $\mathcal{T}' \subseteq \mathcal{T}$ the trees in $\mathcal{T}'$ cover at least $|\mathcal{T}'| + 1$ vertices, there exists an edge $e_i \in E(T_i)$ for $i = 1, \dots, k$ such that these edges $E = \{e_i \mid 1 \le i \le k\}$ form a forest, i.e. the set does not contain an edge twice and it does not contain a circuit. 
Our experimental results on real-world instances from clock tree design show that the cost of the solutions computed by our algorithms is on average only 10% over the best lower bound. Moreover, we compare our algorithm to another clustering algorithm used in industry. The results show that the total cost of our solutions is 10% less than the cost of the solutions computed by the competing tool. Clock trees have to satisfy several timing constraints. More precisely, the signal has to reach each sink within an individual required arrival time window. Sinks can only be clustered together if their required arrival time windows have a point of time in common. Typically, all required arrival time windows are the same. In this case we have the Sink Clustering Problem defined above. However, there are clock trees where the sinks have different required arrival time windows. This motivates a generalization of the Sink Clustering Problem where each sink additionally has an individual time window. As a further constraint, the time windows of the sinks of a cluster must have at least one point of time in common. We study the Sink Clustering Problem with Time Windows and present a polynomial-time O(log s)-approximation algorithm for this problem, where s is the size of a minimum clique partition in the interval graph induced by the time windows. Our algorithm is based on a divide-and-conquer approach and uses the approximation algorithms for the Sink Clustering Problem on subsets of the instance. We show that the approximation guarantee of the algorithm is tight. For the practical construction of clock trees we present our algorithm BonnClock. BonnClock builds a clock tree combining a bottom-up clustering and a top-down partitioning strategy. In the bottom-up phase BonnClock uses the Sink Clustering Algorithm in order to determine the drivers of unconnected sinks or inverters. 
The 'global' topology of the tree is determined by the top-down partitioning considering big blockages and timing restrictions. BonnClock uses a dynamic program in order to determine the sizes of the inverters that are inserted. All components of the algorithm are discussed in detail. As part of this thesis, we have also implemented this algorithm. BonnClock has become the standard tool to construct clock trees within IBM. We show experimental results with comparisons to another industrial clock tree construction tool and to lower bounds for the power consumption. It turns out that, mainly due to the Sink Clustering Algorithm, our power consumption is much smaller than with the other tool and only one third over the lower bound. Finally, we consider the repeater tree construction problem. In contrast to clock trees, each sink has a latest required arrival time instead of a time window. We describe a simple algorithm to build such trees where we insert the sinks one by one into an existing tree. Depending on the optimization goal we show a variant of the algorithm computing trees of almost optimal length or trees with guaranteed best possible performance. Moreover, we analyze the topology of trees with best or almost best performance more closely. Such trees are equivalent to minimax and almost minimax trees: Let $a_1, \dots, a_n \in \mathbb{N}_{\ge 0}$ be a set of numbers. The weight of a tree with $n$ leaves is the maximum over all leaves $i$ of the depth of leaf $i$ plus $a_i$. For a non-negative integral constant $c$ the goal is to build a binary tree with weight at most the optimum weight plus $c$. This problem can be solved optimally by a greedy algorithm. However, we are interested in the online version of this problem where we have to insert the leaf $i$ with weight $a_i$ into the tree without knowing $n$ and the following weights $a_j$, $j > i$. We give necessary and sufficient conditions for an online algorithm to compute trees of weight at most the optimum weight plus $c$. 
Moreover, we show how these conditions can be verified efficiently. We obtain an online algorithm that computes an optimum tree in $O(n \log n)$ time. Finally, we study a further mathematical model of repeater trees that considers that additional delay caused by a bifurcation of a tree can be distributed partially to the two branches. For $c \in \mathbb{R}_{>0}$ and a set $L \subseteq \{(l_1, l_2) \in \mathbb{R}_{\ge 0}^2 \mid l_1 + l_2 = c\}$ of two-element sets of non-negative real numbers we consider rooted binary trees with the property that the two edges emanating from every non-leaf are assigned lengths $l_1$ and $l_2$ with $\{l_1, l_2\} \in L$. We study the asymptotic growth of the maximum number of leaves of bounded depths in such trees and the existence of such trees with leaves at individually specified maximum depths. Our results yield better lower bounds for repeater trees.
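
The minimax tree objective described in this abstract (the maximum over all leaves of depth plus leaf weight) can be made concrete with a short sketch of the offline case. The abstract does not name the greedy algorithm it refers to. The code below assumes Huffman-style merging, a well-known greedy for this objective that repeatedly replaces the two smallest values a and b by max(a, b) + 1. The function name is hypothetical, and the sketch does not cover the online version studied in the thesis.

```python
import heapq


def minimax_tree_weight(weights):
    """Weight max_i(depth_i + a_i) of the binary tree built by Huffman-style merging.

    Assumed greedy, not necessarily the one used in the thesis.
    """
    if not weights:
        raise ValueError("at least one leaf is required")
    heap = list(weights)
    heapq.heapify(heap)
    while len(heap) > 1:
        a = heapq.heappop(heap)  # smallest remaining subtree weight
        b = heapq.heappop(heap)  # second smallest
        # A new internal node above both subtrees pushes every leaf one level down.
        heapq.heappush(heap, max(a, b) + 1)
    return heap[0]
```

For example, `minimax_tree_weight([3, 0, 0])` returns 4: the two leaves of weight 0 are merged into a subtree of weight 1, which is then merged with the leaf of weight 3.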

    Driving the Network-on-Chip Revolution to Remove the Interconnect Bottleneck in Nanoscale Multi-Processor Systems-on-Chip

    The sustained demand for faster, more powerful chips has been met by the availability of chip manufacturing processes allowing for the integration of increasing numbers of computation units onto a single die. The resulting outcome, especially in the embedded domain, has often been called System-on-Chip (SoC) or Multi-Processor System-on-Chip (MP-SoC). MP-SoC design brings to the foreground a large number of challenges, one of the most prominent of which is the design of the chip interconnection. With a number of on-chip blocks presently ranging in the tens, and quickly approaching the hundreds, the novel issue of how to best provide on-chip communication resources is clearly felt. Networks-on-Chips (NoCs) are the most comprehensive and scalable answer to this design concern. By bringing large-scale networking concepts to the on-chip domain, they guarantee a structured answer to present and future communication requirements. The point-to-point connection and packet switching paradigms they involve are also of great help in minimizing wiring overhead and physical routing issues. However, as with any technology of recent inception, NoC design is still an evolving discipline. Several main areas of interest require deep investigation for NoCs to become viable solutions:
    • The design of the NoC architecture needs to strike the best tradeoff among performance, features and the tight area and power constraints of the on-chip domain.
    • Simulation and verification infrastructure must be put in place to explore, validate and optimize the NoC performance.
    • NoCs offer a huge design space, thanks to their extreme customizability in terms of topology and architectural parameters. Design tools are needed to prune this space and pick the best solutions.
    • Even more so given their global, distributed nature, it is essential to evaluate the physical implementation of NoCs to assess their suitability for next-generation designs and their area and power costs.
This dissertation performs a design space exploration of network-on-chip architectures, in order to point out the trade-offs associated with the design of each individual network building block and with the design of the network topology overall. The design space exploration is preceded by a comparative analysis of state-of-the-art interconnect fabrics with each other and with early network-on-chip prototypes. The ultimate objective is to point out the key advantages that NoC realizations provide with respect to state-of-the-art communication infrastructures and to point out the challenges that lie ahead in order to make this new interconnect technology a reality. Among the latter, technology-related challenges are emerging that call for dedicated design techniques at all levels of the design hierarchy. These include, in particular, leakage power dissipation and the containment of process variations and of their effects. The achievement of the above objectives was enabled by means of a NoC simulation environment for cycle-accurate modelling and simulation and by means of a back-end facility for the study of NoC physical implementation effects. Overall, all the results provided by this work have been validated on actual silicon layout.

    Broadening the Scope of Multi-Objective Optimizations in Physical Synthesis of Integrated Circuits.

    In modern VLSI design, physical synthesis tools are primarily responsible for satisfying chip-performance constraints by invoking a broad range of circuit optimizations, such as buffer insertion, logic restructuring, gate sizing and relocation. This process is known as timing closure. Our research seeks more powerful and efficient optimizations to improve the state of the art in modern chip design. In particular, we integrate timing-driven relocation, retiming, logic cloning, buffer insertion and gate sizing in novel ways to create powerful circuit transformations that help satisfy setup-time constraints. State-of-the-art physical synthesis optimizations are typically applied at two scales: i) global algorithms that affect the entire netlist and ii) local transformations that focus on a handful of gates or interconnections. The scale of modern chip designs dictates that only near-linear-time optimization algorithms can be applied at the global scope — typically limited to wirelength-driven placement and legalization. Localized transformations can rely on more time-consuming optimizations with accurate delay models. Few techniques bridge the gap between fully-global and localized optimizations. This dissertation broadens the scope of physical synthesis optimization to include accurate transformations operating between the global and local scales. In particular, we integrate groups of related transformations to break circular dependencies and increase the number of circuit elements that can be jointly optimized to escape local minima. Integrated transformations in this dissertation are developed by identifying and removing obstacles to successful optimizations. Integration is achieved through mapping multiple operations to rigorous mathematical optimization problems that can be solved simultaneously. 
We achieve computational scalability in our techniques by leveraging analytical delay models and focusing optimization efforts on carefully selected regions of the chip. In this regard, we make extensive use of a linear interconnect-delay model that accounts for the impact of subsequent repeater insertion. Our integrated transformations are evaluated on high-performance circuits with over 100,000 gates. Integrated optimization techniques described in this dissertation ensure a graceful timing-closure process and impact nearly every aspect of a typical physical synthesis flow. They have been validated in EDA tools used at IBM for physical synthesis of high-performance CPU and ASIC designs, where they significantly improved chip performance.
    Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/78744/1/iamyou_1.pd

    Power Management for Deep Submicron Microprocessors

    As VLSI technology scales, the enhanced performance of smaller transistors comes at the expense of increased power consumption. In addition to the dynamic power consumed by the circuits there is a tremendous increase in the leakage power consumption which is further exacerbated by the increasing operating temperatures. The total power consumption of modern processors is distributed between the processor core, memory and interconnects. In this research two novel power management techniques are presented targeting the functional units and the global interconnects. First, since most leakage control schemes for processor functional units are based on circuit level techniques, such schemes inherently lack information about the operational profile of higher-level components of the system. This is a barrier to the pivotal task of predicting standby time. Without this prediction, it is extremely difficult to assess the value of any leakage control scheme. Consequently, a methodology that can predict the standby time is highly beneficial in bridging the gap between the information available at the application level and the circuit implementations. In this work, a novel Dynamic Sleep Signal Generator (DSSG) is presented. It utilizes the usage traces extracted from cycle accurate simulations of benchmark programs to predict the long standby periods associated with the various functional units. The DSSG bases its decisions on the current and previous standby state of the functional units to accurately predict the length of the next standby period. The DSSG presents an alternative to Static Sleep Signal Generation (SSSG) based on static counters that trigger the generation of the sleep signal when the functional units idle for a prespecified number of cycles. The test results of the DSSG are obtained by the use of a modified RISC superscalar processor, implemented by SimpleScalar, the most widely accepted open source vehicle for architectural analysis. 
In addition, the results are further verified by a Simultaneous Multithreading simulator implemented by SMTSIM. Leakage saving results show an increase of up to 146% in leakage savings using the DSSG versus the SSSG, with an accuracy of 60-80% for predicting long standby periods. Second, chip designers, in their effort to achieve timing closure, have focused on achieving the lowest possible interconnect delay through buffer insertion and routing techniques. This approach, though, taxes the power budget of modern ICs, especially those intended for wireless applications. Also, in order to achieve more functionality, die sizes are constantly increasing. This trend is leading to an increase in the average global interconnect length which, in turn, requires more buffers to achieve timing closure. Unconstrained buffering is bound to adversely affect the overall chip performance if the power consumption is added as a major performance metric. In fact, the number of global interconnect buffers is expected to reach hundreds of thousands to achieve an appropriate timing closure. To mitigate the impact of the power consumed by the interconnect buffers, a power-efficient multi-pin routing technique is proposed in this research. The problem is based on a graph representation of the routing possibilities, including buffer insertion, and on identifying the least-power path between the interconnect source and the set of sinks. The novel multi-pin routing technique is tested by applying it to the ISPD and IBM benchmarks to verify the accuracy, complexity, and solution quality. Results obtained indicate that average power savings as high as 32% for the 130-nm technology are achieved with no impact on the maximum chip frequency.
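
The static baseline described in this abstract, a counter that raises the sleep signal once a functional unit has been idle for a prespecified number of cycles, can be sketched as follows. This is only an illustration of the counter idea. The function name, the boolean per-cycle trace format, and the choice to assert the signal in the same cycle in which the threshold is reached are assumptions, not details from the thesis.

```python
def static_sleep_signal(busy_trace, idle_threshold):
    """Per-cycle sleep signal of a static idle counter (hypothetical SSSG sketch).

    busy_trace: iterable of booleans, True when the functional unit is used in that cycle.
    idle_threshold: number of consecutive idle cycles after which sleep is asserted.
    """
    if idle_threshold < 1:
        raise ValueError("idle_threshold must be at least 1")
    sleep = []
    idle_cycles = 0
    for busy in busy_trace:
        # Any use of the unit resets the counter and wakes the unit up.
        idle_cycles = 0 if busy else idle_cycles + 1
        sleep.append(idle_cycles >= idle_threshold)
    return sleep
```

A predictor such as the DSSG would replace the fixed threshold with a decision based on current and previous standby behaviour. That logic is not given in the abstract and is therefore not sketched here.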

    Design methodology and productivity improvement in high speed VLSI circuits

    2017 Spring. Includes bibliographical references. To view the abstract, please see the full text of the document.