269 research outputs found
AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture
CPU-FPGA heterogeneous architectures are attracting ever-increasing attention
in an attempt to advance computational capabilities and energy efficiency in
today's datacenters. These architectures provide programmers with the ability
to reprogram the FPGAs for flexible acceleration of many workloads.
Nonetheless, this advantage is often overshadowed by the poor programmability
of FPGAs whose programming is conventionally a RTL design practice. Although
recent advances in high-level synthesis (HLS) significantly improve the FPGA
programmability, it still leaves programmers facing the challenge of
identifying the optimal design configuration in a tremendous design space.
This paper aims to address this challenge and pave the path from software
programs towards high-quality FPGA accelerators. Specifically, we first propose
the composable, parallel and pipeline (CPP) microarchitecture as a template of
accelerator designs. Such a well-defined template is able to support efficient
accelerator designs for a broad class of computation kernels, and more
importantly, drastically reduce the design space. Also, we introduce an
analytical model to capture the performance and resource trade-offs among
different design configurations of the CPP microarchitecture, which lays the
foundation for fast design space exploration. On top of the CPP
microarchitecture and its analytical model, we develop the AutoAccel framework
to make the entire accelerator generation automated. AutoAccel accepts a
software program as an input and performs a series of code transformations
based on the result of the analytical-model-based design space exploration to
construct the desired CPP microarchitecture. Our experiments show that the
AutoAccel-generated accelerators outperform their corresponding software
implementations by an average of 72x for a broad class of computation kernels
Physical design algorithms for asynchronous circuits
Asynchronous designs have been demonstrated to be able to achieve both higher performance and lower power compared with their synchronous counterparts. It provides a very promising solution to the emerging challenges in advanced technology. However, due to the lack of proper EDA tool support, the design cycle for asynchronous circuits is much longer compared with the one for synchronous circuits. Thus, even with many advantages, asynchronous circuits are still not the mainstream in the industry. In this thesis, we provides several algorithms to resolve the emerging issues for the physical design of asynchronous circuits. Our proposed algorithms optimize asynchronous circuits using placement, gate sizing, repeater insertion and pipeline buffer insertion techniques. An incremental maximum cycle ratio algorithm is also proposed to speed up the timing analysis of asynchronous circuits
Timing optimization during the physical synthesis of cell-based VLSI circuits
Tese (doutorado) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Engenharia de Automação e Sistemas, Florianópolis, 2016.Abstract : The evolution of CMOS technology made possible integrated circuits with billions of transistors assembled into a single silicon chip, giving rise to the jargon Very-Large-Scale Integration (VLSI). The required clock frequency affects the performance of a VLSI circuit and induces timing constraints that must be properly handled by synthesis tools. During the physical synthesis of VLSI circuits, several optimization techniques are used to iteratively reduce the number of timing violations until the target clock frequency is met. The dramatic increase of interconnect delay under technology scaling represents one of the major challenges for the timing closure of modern VLSI circuits. In this scenario, effective interconnect synthesis techniques play a major role. That is why this thesis targets two timing optimization problems for effective interconnect synthesis: Incremental Timing-Driven Placement (ITDP) and Incremental Timing-Driven Layer Assignment (ITLA). For solving the ITDP problem, this thesis proposes a new Lagrangian Relaxation formulation that minimizes timing violations for both setup and hold timing constraints. This work also proposes a netbased technique that uses Lagrange multipliers as net-weights, which are dynamically updated using an accurate timing analyzer. The netbased technique makes use of a novel discrete search to relocate cells by employing the Euclidean distance to define a proper neighborhood. For solving the ITLA problem, this thesis proposes a network flow approach that handles simultaneously critical and non-critical segments, and exploits a few flow conservation conditions to extract timing information for each net segment individually, thereby enabling the use of an external timing engine. The experimental validation using benchmark suites derived from industrial circuits demonstrates the effectiveness of the proposed techniques when compared with state-of-the-art works.A evolução da tecnologia CMOS viabilizou a fabricação de circuitos integrados contendo bilhões de transistores em uma única pastilha de silício, dando origem ao jargão Very-Large-Scale Integration (VLSI). A frequência-alvo de operação de um circuito VLSI afeta o seu desempenho e induz restrições de timing que devem ser manipuladas pelas ferramentas de síntese. Durante a síntese física de circuitos VLSI, diversas técnicas de otimização são usadas para iterativamente reduzir o número de violações de timing até que a frequência-alvo de operação seja atingida. O aumento dramático do atraso das interconexões devido à evolução tecnológica representa um dos maiores desafios para o fluxo de timing closure de circuitos VLSI contemporâneos. Nesse cenário, técnicas de síntese de interconexão eficientes têm um papel fundamental. Por este motivo, esta tese aborda dois problemas de otimização de timing para uma síntese eficiente das interconexões de um circuito VLSI: Incremental Timing-Driven Placement (ITDP) e Incremental Timing-Driven Layer Assignment (ITLA). Para resolver o problema de ITDP, esta tese propõe uma nova formulação utilizando Relaxação Lagrangeana que tem por objetivo a minimização simultânea das violações de timing para restrições do tipo setup e hold. Este trabalho também propõe uma técnica que utiliza multiplicadores de Lagrange como pesos para as interconexões, os quais são atualizados dinamicamente através dos resultados de uma ferramenta de análise de timing. Tal técnica realoca as células do circuito por meio de uma nova busca discreta que adota a distância Euclidiana como vizinhança.Para resolver o problema de ITLA, esta tese propõe uma abordagem em fluxo em redes que otimiza simultaneamente segmentos críticos e não-críticos, e explora algumas condições de fluxo para extrair as informações de timing para cada segmento individualmente, permitindo assim o uso de uma ferramenta de timing externa. A validação experimental, utilizando benchmarks derivados de circuitos industriais, demonstra a eficiência das técnicas propostas quando comparadas com trabalhos estado da arte
Aceleração da legalização incremental mediante o uso de árvores espaciais
Dissertação (mestrado) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação, Florianópolis, 2017.Na síntese física de circuitos integrados, a etapa de legalização é responsável por remover sobreposições de células e alinhá-las com as linhas e colunas do circuito, enquanto minimiza o deslocamento das células. Esta etapa é aplicada não somente após o posicionamento global, mas também após etapas de otimização incremental tais como posicionamento incremental guiado por atraso, gate sizing e inserção de buffers. Quando utilizada em técnicas de otimização incremental, a legalização pode ser aplicada como um passo final, após cada iteração da otimização,ou de maneira incremental, após cada transformação no posicionamento. Infelizmente, técnicas recentes de legalização incremental utilizam estruturas de dados que não são adequadas para o armazenamento de informações sobre geometrias. Além disso, apesar de diferentes estratégias de legalização serem utilizadas por diferentes trabalhos de otimização incremental, estes trabalhos não apresentam resultados quantitativos do impacto destas estratégias no tempo de execução e qualidade da solução final. Este trabalho propõe uma técnica de legalização incremental utilizando uma estrutura de dados chamada R-tree, projetada para o armazenamento de informações sobre geometrias, permitindo buscas espaciais rápidas. A técnica proposta foi comparada atécnicas do estado da arte em legalização incremental, assim como às estratégias de legalização final e iterativa. Os resultados experimentais mostram que a técnica proposta é pelo menos 6 vezes mais rápida e realiza o mesmo número de legalizações quando comparado a outras técnicas de legalização incremental do estado da arte. Além disso, o algoritmo proposto é mais rápido que as estratégias de legalização final e iterativa, enquanto resulta em uma solução com perfil de densidade e comprimento das interconexões semelhante.Abstract : In the physical synthesis of digital circuits, circuit legalization removes overlaps and keeps cell alignment with circuit rows and sites while minimizing total cell displacement. Legalization is applied not only after global placement, but also after incremental optimization steps like incremental timing-driven placement, gate sizing, and buffer insertion. In the case of incremental optimization techniques, the legalization stepcan be applied as a final step, after each optimization iteration or incrementally, after each cell movement. Unfortunately, recent incremental legalization techniques employ data structures that are not suitable for handling geometry information. In addition, despite different legalization strategies are used by different works on incremental optimization, those works do not present quantitative results on how those strategies impact on the runtime and quality of the final solution. This work proposes a new legalization technique that relies on an R-tree, a data structure tailored to efficient geometry information storage, which allows for fast spatial search. The proposed technique was compared to state-of-the-art incremental legalization techniques, as well as to the final and iterative legalization strategies. Experimental results show that the proposed technique is at least 6 times faster and performs as many successful legalizations when compared to the related work on incremental legalization. In addition, it is faster than both the other two legalization strategies, while resulting in a solution with similar density profile and circuit wirelength
Broadening the Scope of Multi-Objective Optimizations in Physical Synthesis of Integrated Circuits.
In modern VLSI design, physical synthesis tools are primarily responsible for satisfying chip-performance constraints by invoking a broad range of circuit optimizations, such as buffer insertion, logic restructuring, gate sizing and relocation. This process is known as timing closure. Our research seeks more powerful and efficient optimizations to improve the state of the art in modern chip design. In particular, we integrate timing-driven relocation, retiming, logic cloning, buffer insertion and gate sizing in novel ways to create powerful circuit transformations that help satisfy setup-time constraints.
State-of-the-art physical synthesis optimizations are typically applied at two scales: i) global algorithms that affect the entire netlist and ii) local transformations that focus on a handful of gates or interconnections. The scale of modern chip designs dictates that only near-linear-time optimization algorithms can be applied at the global scope — typically limited to wirelength-driven placement and legalization. Localized transformations can rely on more time-consuming optimizations with accurate delay models. Few techniques bridge the gap between fully-global and localized optimizations. This dissertation broadens the scope of physical synthesis optimization to include accurate transformations operating between the global and local scales. In particular, we integrate groups of related transformations to break circular dependencies and increase the number of circuit elements that can be jointly optimized to escape local minima.
Integrated transformations in this dissertation are developed by identifying and removing obstacles to successful optimizations. Integration is achieved through mapping multiple operations to rigorous mathematical optimization problems that can be solved simultaneously. We achieve computational scalability in our techniques by leveraging analytical delay models and focusing optimization efforts on carefully selected regions of the chip. In this regard, we make extensive use of a linear interconnect-delay model that accounts for the impact of subsequent repeated insertion. Our integrated transformations are evaluated on high-performance circuits with over 100,000 gates.
Integrated optimization techniques described in this dissertation ensure graceful timing-closure process and impact nearly every aspect of a typical physical synthesis flow. They have been validated in EDA tools used at IBM for physical synthesis of high-performance CPU and ASIC designs, where they significantly improved chip performance.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/78744/1/iamyou_1.pd
On the Use of Directed Moves for Placement in VLSI CAD
Search-based placement methods have long been used for placing integrated circuits targeting the field programmable gate array (FPGA) and standard cell design styles. Such methods offer the potential for high-quality solutions but often come at the cost of long run-times compared to alternative methods.
This dissertation examines strategies for enhancing local search heuristics---and in particular, simulated annealing---through the application of directed moves. These moves help to guide a search-based optimizer by focusing efforts on states which are most likely to yield productive improvement, effectively pruning the size of the search space.
The engineering theory and implementation details of directed moves are discussed in the context of both field programmable gate array and standard cell designs. This work explores the ways in which such moves can be used to improve the quality of FPGA placements, improve the robustness of floorplan repair and legalization methods for mixed-size standard cell designs, and enhance the quality of detailed placement for standard cell circuits. The analysis presented herein confirms the validity and efficacy of directed moves, and supports the use of such heuristics within various optimization frameworks
High-Performance Placement and Routing for the Nanometer Scale.
Modern semiconductor manufacturing facilitates single-chip electronic systems that only five years ago required ten to twenty chips. Naturally, design complexity has grown within this period. In contrast to this growth, it is becoming common in the industry to limit design team size which places a heavier burden on design automation tools.
Our work identifies new objectives, constraints and concerns in the physical design of systems-on-chip, and develops new computational techniques to address them. In addition to faster and more relevant design optimizations, we demonstrate that traditional design flows based on ``separation of concerns'' produce unnecessarily suboptimal layouts. We develop new integrated optimizations that streamline traditional chains of loosely-linked design tools. In particular, we bridge the gap between mixed-size placement and routing by updating the objective of global and detail placement to a more accurate estimate of routed wirelength. To this we add sophisticated whitespace allocation, and the combination provides increased routability, faster routing,
shorter routed wirelength, and the best via counts of published techniques. To further improve post-routing design metrics, we present new global routing techniques based on Discrete Lagrange Multipliers (DLM) which produce the best routed wirelength results on recent benchmarks. Our work culminates in the integration of our routing techniques within an incremental placement flow to
improve detailed routing solutions, shrink die sizes and reduce total chip cost.
Not only do our techniques improve the quality and cost of designs, but also simplify design automation software implementation in many cases. Ultimately, we reduce the time needed for design closure through improved tool fidelity and the use of our incremental techniques for placement and routing.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/64639/1/royj_1.pd
AI/ML Algorithms and Applications in VLSI Design and Technology
An evident challenge ahead for the integrated circuit (IC) industry in the
nanometer regime is the investigation and development of methods that can
reduce the design complexity ensuing from growing process variations and
curtail the turnaround time of chip manufacturing. Conventional methodologies
employed for such tasks are largely manual; thus, time-consuming and
resource-intensive. In contrast, the unique learning strategies of artificial
intelligence (AI) provide numerous exciting automated approaches for handling
complex and data-intensive tasks in very-large-scale integration (VLSI) design
and testing. Employing AI and machine learning (ML) algorithms in VLSI design
and manufacturing reduces the time and effort for understanding and processing
the data within and across different abstraction levels via automated learning
algorithms. It, in turn, improves the IC yield and reduces the manufacturing
turnaround time. This paper thoroughly reviews the AI/ML automated approaches
introduced in the past towards VLSI design and manufacturing. Moreover, we
discuss the scope of AI/ML applications in the future at various abstraction
levels to revolutionize the field of VLSI design, aiming for high-speed, highly
intelligent, and efficient implementations
On The Engineering of a Stable Force-Directed Placer
Analytic and force-directed placement methods that simultaneously minimize wire length and spread cells are receiving renewed attention from both academia and industry. However, these methods are by no means trivial to implement---to date, published works have failed to provide sufficient engineering details to replicate results.
This dissertation addresses the implementation of a generic force-directed placer entitled FDP. Specifically, this thesis provides (1) a description of efficient force computation for spreading cells, (2) an illustration of numerical instability in this method and a means to avoid the instability, (3) metrics for measuring cell distribution throughout the placement area, and (4) a complementary technique that aids in minimizing wire length. FDP is compared to Kraftwerk and other leading academic tools including Capo, Dragon, and mPG for both standard cell and mixed-size circuits. Wire lengths produced by FDP are found to be, on average, up to 9% and 3% better than Kraftwerk and Capo, respectively. All told, this thesis confirms the validity and applicability of the approach, and provides clarifying details of the intricacies surrounding the implementation of a force-directed global placer
Recommended from our members
Interconnect optimizations for nanometer VLSI design
textAs the semiconductor technology scales into deeper sub-micron domain, billions of transistors can be used on a single system-on-chip (SOC) makes interconnection optimization more important roughly for two reasons. First, congestion, power, timing in routing and buffering requirements make inter- connection optimization more and more challenging. Second, gate delay get- ting shorter while the RC delay gets longer due to scaling. Study of interconnection construction and optimization algorithms in real industry flows and designs ends up with interesting findings. One used to be overlooked but very important and practical problem is how to utilize over- the-block routing resources intelligently. Routing over large IP blocks needs special attention as there is almost no way to insert buffers inside hard IP blocks, which can lead to unsolvable slew/timing violations. In current design flows we have seen, the routing resources over the IP blocks were either dealt as routing blockages leading to a significant waste, or simply treated in the same way as outside-the-block routing resources, which would violate the slew constraints and thus fail buffering. To handle that, this work proposes a novel buffering-aware over-the- block rectilinear Steiner minimum tree (BOB-RSMT) algorithm which helps reclaim the “wasted” over-the-block routing resources while meeting user-specified slew constraints. Proposed algorithm incrementally and efficiently migrates initial tree structures with buffering-awareness to meet slew constraints while minimizing wire-length. Moreover, due to the fact that timing optimization is important for the VLSI design, in this work, timing-driven over-the-block rectilinear Steiner tree (TOB-RST) is also studied to optimize critical paths. This proposed TOB-RST algorithm can be used in routing or post-routing stage to provide high-quality topologies to help close timing. Then a follow-up problem emerges: how to accomplish the whole routing with over-the-block routing resources used properly. Utilizing over-the- block routing resources could dramatically improve the routing solution, yet require special attention, since the slew, affected by different RC on different metal layers, must be constrained by buffering and is easily violated. Moreover, even of all nets are slew-legalized, the routing solution could still suffer from heavy congestion problem. A new global router, BOB-Router, is to solve the over-the-block global routing problem through minimizing overflows, wire-length and via count simultaneously without violating slew constraints. Based on my completed works, BOB-RSMT and BOB-Router tremendously improve the overall routing and buffering quality. Experimental results show that proposed over-the-block rectilinear Steiner tree construction and routing completely satisfies the slew constraints and significantly outperforms the obstacle-avoiding rectilinear Steiner tree construction and routing in terms of wire-length, via count and overflows.Electrical and Computer Engineerin
- …