6 research outputs found

    Macro-Driven Circuit Design Methodology for High-Performance Datapaths

    Get PDF
    Datapath design is one of the most critical elements in the design of a high performance microprocessor. However datapath design is typically done manually, and is often custom style. This adversely impacts the overall productivity of the design team, as well as the quality of the design. In spite of this, very little automation has been available to the designers of high performance datapaths. In this paper we present a new "macrodriven " approach to the design of datapath circuits. Our approach, referred to as SMART (Smart Macro Design Advisor), is based on automatic generation of regular datapath components such as muxes, comparators, adders etc., which we refer to as datapath macros. The generated solution is based on designer provided constraints such as delay, load and slope, and is optimized for a designer provided cost metric such as power, area. Results on datapath circuits of a high-performance microprocessor show that this approach is very effective for both designer productivity as well as design quality

    Power-efficient design of 16-bit mixed-operand multipliers

    Get PDF
    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004.Includes bibliographical references (p. 53).Multiplication is an expensive and slow arithmetic operation, which plays an important role in many DSP algorithms. It usually lies in the critical-delay paths, having an effect on performance of the system as well as consuming large power. Consequently, significant improvements in both power and performance can be achieved in the overall DSP system by carefully designing and optimizing power and performance of the multiplier. This thesis explores several circuit-level techniques for power-efficiently designing multipliers, including supply voltage reduction, efficient multiplication algorithms, low power circuit logic styles, and transistor sizing using dynamic and static tuners. Based on these techniques, several 16-bit multipliers have been successfully designed and implemented in 0.13[micro]m CMOS technology at the supply voltage of 1.5V and 0.9V. The multipliers are modified to handle multiplications of two 16-bit operands in which each can be either signed magnitude or two's complement formats. Examining power-performance characteristics of these multipliers reveals that both array and tree structures are feasible solutions for designing 16-bit multipliers, and complementary CMOS and single-ended CPL-TG logics are promising candidates for power-efficient design. The appropriate choices of structures and logic styles depend on power and performance constraints of the particular design.by Sataporn Pornpromlikit.M.Eng

    Layout optimization in ultra deep submicron VLSI design

    Get PDF
    As fabrication technology keeps advancing, many deep submicron (DSM) effects have become increasingly evident and can no longer be ignored in Very Large Scale Integration (VLSI) design. In this dissertation, we study several deep submicron problems (eg. coupling capacitance, antenna effect and delay variation) and propose optimization techniques to mitigate these DSM effects in the place-and-route stage of VLSI physical design. The place-and-route stage of physical design can be further divided into several steps: (1) Placement, (2) Global routing, (3) Layer assignment, (4) Track assignment, and (5) Detailed routing. Among them, layer/track assignment assigns major trunks of wire segments to specific layers/tracks in order to guide the underlying detailed router. In this dissertation, we have proposed techniques to handle coupling capacitance at the layer/track assignment stage, antenna effect at the layer assignment, and delay variation at the ECO (Engineering Change Order) placement stage, respectively. More specifically, at layer assignment, we have proposed an improved probabilistic model to quickly estimate the amount of coupling capacitance for timing optimization. Antenna effects are also handled at layer assignment through a linear-time tree partitioning algorithm. At the track assignment stage, timing is further optimized using a graph based technique. In addition, we have proposed a novel gate splitting methodology to reduce delay variation in the ECO placement considering spatial correlations. Experimental results on benchmark circuits showed the effectiveness of our approaches

    Leading the Blind:Automated Transistor-Level Modeling for FPGA Architects

    Get PDF
    The design and development of innovative FPGA architectures hinge on the flexibility of its toolchain. Retargetable toolchains, like the Verilog-to-Routing (VTR) flow, have been developed to enable the testing of new FPGAs by mapping circuits onto easily-described and possibly theoretical architectures. However, in reality, the difficulty extends beyond having CAD tools that support the architectural changes: it is equally important for FPGA architects to be able to produce reliable delay and area models for these tools. In addition to having acute architectural intuitions, designing and optimizing the circuit at the transistor-level requires architects to have, as well, a particular set of electrical engineering skills and expertise. The process is also painstaking and time-consuming, rendering the comparison of a variety of architectures or the exploration of a wide design space quite complicated and even impossible in practice. In this work, we present a novel approach to model the delay and area of FPGA architectures with various structures and characteristics, quickly and with acceptable accuracy. Abstracting from the user the transistor-level design and optimization that normally accompany the model- ing process, this approach, called FPRESSO, can be used by any architect without prerequisites. We take inspiration from the way a standard-cell flow performs large-scale transistor-size optimization and apply the same concepts to FPGAs, only at a coarser granularity. Skilled designers prepare for FPRESSO a set of locally optimized libraries of basic parameterizable components with a variety of drive strengths. Then, inexperienced users specify arbitrary FPGA architectures as interconnects of these basic components. The architecture is globally optimized, within minutes, through a standard logic synthesis tool, by choosing the most fitting version of each cell and adding buffers wherever appropriate. The resulting delay and area characteristics are automatically returned, in a format suitable for the VTR flow. A correct modeling of any architecture requires not only an optimization of the logic components, but also a proper modeling of the wires connecting these components. This does not only include measuring the length of the wires to determine their respective resistance and capacitance, but also, minimizing their length to reduce the wireload effect on the overall performance. To that end, FPRESSO features an automatic and generic wire modeling approach based on a simulated annealing floorplanning algorithm, to estimate the wires between the different components of the FPGA architecture. To evaluate the results of FPRESSO and confirm the validity of its modeled architectures, we use it to explore a wide range of FPGA architectures. First, we repeat a known study that helped set the standards on the optimal Look-Up-Table (LUT) and cluster size for conventional FPGAs. We show, by comparing with the results of the study, that modeling in FPRESSO preserves the very same trends and conclusions, with significantly less effort. We then extend the search space to cover fracturable LUTs and sparse crossbars, and show how FPRESSO makes the exploration of a huge search space not only possible but easy, efficient, and affordable, for any class of VTR users

    Timing Closure in Chip Design

    Get PDF
    Achieving timing closure is a major challenge to the physical design of a computer chip. Its task is to find a physical realization fulfilling the speed specifications. In this thesis, we propose new algorithms for the key tasks of performance optimization, namely repeater tree construction; circuit sizing; clock skew scheduling; threshold voltage optimization and plane assignment. Furthermore, a new program flow for timing closure is developed that integrates these algorithms with placement and clocktree construction. For repeater tree construction a new algorithm for computing topologies, which are later filled with repeaters, is presented. To this end, we propose a new delay model for topologies that not only accounts for the path lengths, as existing approaches do, but also for the number of bifurcations on a path, which introduce extra capacitance and thereby delay. In the extreme cases of pure power optimization and pure delay optimization the optimum topologies regarding our delay model are minimum Steiner trees and alphabetic code trees with the shortest possible path lengths. We presented a new, extremely fast algorithm that scales seamlessly between the two opposite objectives. For special cases, we prove the optimality of our algorithm. The efficiency and effectiveness in practice is demonstrated by comprehensive experimental results. The task of circuit sizing is to assign millions of small elementary logic circuits to elements from a discrete set of logically equivalent, predefined physical layouts such that power consumption is minimized and all signal paths are sufficiently fast. In this thesis we develop a fast heuristic approach for global circuit sizing, followed by a local search into a local optimum. Our algorithms use, in contrast to existing approaches, the available discrete layout choices and accurate delay models with slew propagation. The global approach iteratively assigns slew targets to all source pins of the chip and chooses a discrete layout of minimum size preserving the slew targets. In comprehensive experiments on real instances, we demonstrate that the worst path delay is within 7% of its lower bound on average after a few iterations. The subsequent local search reduces this gap to 2% on average. Combining global and local sizing we are able to size more than 5.7 million circuits within 3 hours. For the clock skew scheduling problem we develop the first algorithm with a strongly polynomial running time for the cycle time minimization in the presence of different cycle times and multi-cycle paths. In practice, an iterative local search method is much more efficient. We prove that this iterative method maximizes the worst slack, even when restricting the feasible schedule to certain time intervals. Furthermore, we enhance the iterative local approach to determine a lexicographically optimum slack distribution. The clock skew scheduling problem is then generalized to allow for simultaneous data path optimization. In fact, this is a time-cost tradeoff problem. We developed the first combinatorial algorithm for computing time-cost tradeoff curves in graphs that may contain cycles. Starting from the lowest-cost solution, the algorithm iteratively computes a descent direction by a minimum cost flow computation. The maximum feasible step length is then determined by a minimum ratio cycle computation. This approach can be used in chip design for several optimization tasks, e.g. threshold voltage optimization or plane assignment. Finally, the optimization routines are combined into a timing closure flow. Here, the global placement is alternated with global performance optimization. Netweights are used to penalize the length of critical nets during placement. After the global phase, the performance is improved further by applying more comprehensive optimization routines on the most critical paths. In the end, the clock schedule is optimized and clocktrees are inserted. Computational results of the design flow are obtained on real-world computer chips
    corecore