289 research outputs found

    Timing optimization during the physical synthesis of cell-based VLSI circuits

    Doctoral thesis - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Engenharia de Automação e Sistemas, Florianópolis, 2016. Abstract: The evolution of CMOS technology has made it possible to integrate billions of transistors into a single silicon chip, giving rise to the term Very-Large-Scale Integration (VLSI). The target clock frequency affects the performance of a VLSI circuit and induces timing constraints that must be properly handled by synthesis tools. During the physical synthesis of VLSI circuits, several optimization techniques are applied to iteratively reduce the number of timing violations until the target clock frequency is met. The dramatic increase of interconnect delay under technology scaling is one of the major challenges for the timing closure of modern VLSI circuits, so effective interconnect synthesis techniques play a major role. This thesis therefore targets two timing optimization problems for effective interconnect synthesis: Incremental Timing-Driven Placement (ITDP) and Incremental Timing-Driven Layer Assignment (ITLA). For the ITDP problem, the thesis proposes a new Lagrangian Relaxation formulation that minimizes timing violations for both setup and hold constraints, together with a net-based technique that uses Lagrange multipliers as net weights, dynamically updated by an accurate timing analyzer. The net-based technique relies on a novel discrete search that relocates cells, using the Euclidean distance to define a proper neighborhood. For the ITLA problem, the thesis proposes a network-flow approach that handles critical and non-critical segments simultaneously and exploits a few flow-conservation conditions to extract timing information for each net segment individually, thereby enabling the use of an external timing engine. Experimental validation on benchmark suites derived from industrial circuits demonstrates the effectiveness of the proposed techniques compared with state-of-the-art works.
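
    The abstract does not spell out the update rule, but the idea of using Lagrange multipliers as dynamically updated net weights can be sketched roughly as below. The subgradient-style update, the timer.slack() query and every name are illustrative assumptions, not the thesis's actual formulation.

```python
# Rough sketch of a subgradient-style update that turns Lagrange multipliers
# into net weights for timing-driven placement. The update rule, the timing
# query and every name here are assumptions for illustration only.

def update_net_weights(nets, timer, step=0.1):
    """Raise the weight of nets with negative slack; relax the others."""
    for net in nets:
        slack = timer.slack(net)                 # slack reported by an external timer
        # Subgradient step: violating nets (negative slack) get heavier weights.
        net.weight = max(0.0, net.weight + step * (-slack))
    # Normalize so the weighted wirelength objective stays comparable
    # from one placement iteration to the next.
    total = sum(net.weight for net in nets) or 1.0
    for net in nets:
        net.weight /= total
```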

    Otimização de atraso pós-posicionamento explorando ramos não-críticos de árvores de Steiner (Post-placement delay optimization exploiting non-critical branches of Steiner trees)

    Master's dissertation - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação, Florianópolis, 2016. Abstract: The growing impact of interconnections on circuit performance has increased the importance of physical design over the last decade. In contemporary technologies, it is essential that circuit delay estimates take interconnect information into account, so that physical synthesis optimizations do not invalidate upstream optimizations. Timing-driven placement (TDP) is one of the optimization techniques used during physical synthesis. Given an initial circuit placement, TDP moves a limited number of cells aiming to reduce (or even correct, if possible) the circuit's timing violations, and it can be performed either globally or incrementally. This work proposes and evaluates an incremental TDP technique that repositions a subset of cells to optimize the delay of the most critical interconnections in the circuit while trying to preserve the quality of the initial placement. The technique explicitly models the interconnections as Steiner trees, which capture information about the topology of the final routing. Applied to previously optimized industrial circuits, the proposed technique produced average reductions of 34% and 62% in timing violations under the short and long maximum-displacement constraints, respectively.
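
    To see why an explicit tree model matters, the sketch below estimates the driver-to-sink length of a net on a Manhattan spanning tree, a crude stand-in for the Steiner trees used in the dissertation; the tree construction and the length-proportional delay assumption are illustrative, not the dissertation's actual model.

```python
# Illustrative sketch: estimating driver-to-sink interconnect length on an
# explicit tree topology instead of a lumped half-perimeter estimate.
# A Manhattan-distance spanning tree (Prim) stands in for a Steiner tree;
# a length-proportional delay model is assumed.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def spanning_tree_edges(pins):
    """Prim's algorithm over Manhattan distances; pin 0 is the driver."""
    in_tree, edges = {0}, []
    while len(in_tree) < len(pins):
        u, v = min(((i, j) for i in in_tree for j in range(len(pins))
                    if j not in in_tree),
                   key=lambda e: manhattan(pins[e[0]], pins[e[1]]))
        in_tree.add(v)
        edges.append((u, v))
    return edges

def driver_to_sink_length(pins, edges, sink):
    """Tree length along the unique path from the driver (pin 0) to a sink."""
    adj = {i: [] for i in range(len(pins))}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    stack, seen = [(0, 0.0)], {0}
    while stack:
        node, dist = stack.pop()
        if node == sink:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, dist + manhattan(pins[node], pins[nxt])))
    return float("inf")

pins = [(0, 0), (4, 1), (1, 5)]                     # driver plus two sinks
edges = spanning_tree_edges(pins)
print(driver_to_sink_length(pins, edges, sink=1))   # length seen by sink 1
```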

    Design Techniques for Energy-Quality Scalable Digital Systems

    Energy efficiency is one of the key design goals in modern computing. Increasingly complex tasks are executed on mobile devices and Internet of Things end-nodes, which are expected to operate for long time intervals, on the order of months or years, with the limited energy budgets provided by small form-factor batteries. Fortunately, many such tasks are error resilient, meaning that they can tolerate some relaxation in the accuracy, precision or reliability of internal operations without a significant impact on the overall output quality. The error resilience of an application may derive from a number of factors. The processing of analog sensor inputs measuring quantities from the physical world may not always require maximum precision, as the amount of information that can be extracted is limited by the presence of external noise. Outputs destined for human consumption can also tolerate small or occasional errors, thanks to the limited capabilities of our vision and hearing systems. Finally, some computational patterns commonly found in domains such as statistics, machine learning and operational research naturally tend to reduce or eliminate errors. Energy-Quality (EQ) scalable digital systems systematically trade off the quality of computations against energy efficiency, relaxing the precision, accuracy, or reliability of internal software and hardware components in exchange for energy reductions. This design paradigm is believed to offer one of the most promising solutions to the pressing need for low-energy computing. Despite these high expectations, the current state of the art in EQ scalable design suffers from important shortcomings. First, the great majority of techniques proposed in the literature focus only on processing hardware and software components. Nonetheless, for many real devices, processing contributes only a small portion of the total energy consumption, which is dominated by other components (e.g. I/O, memory or data transfers). Second, in order to fulfill its promises and become widespread in commercial devices, EQ scalable design needs to achieve industrial-level maturity. This involves moving from purely academic research based on high-level models and theoretical assumptions to engineered flows compatible with existing industry standards. Third, the time-varying nature of error tolerance, both among different applications and within a single task, should become more central in the proposed design methods. This involves designing "dynamic" systems, in which the precision or reliability of operations (and consequently their energy consumption) can be tuned at runtime, rather than "static" solutions, in which the output quality is fixed at design time. This thesis introduces several new EQ scalable design techniques for digital systems that take the previous observations into account. Besides processing, the proposed methods apply the principles of EQ scalable design to interconnects and peripherals, which are often relevant contributors to the total energy in sensor nodes and mobile systems, respectively. Regardless of the target component, the presented techniques pay special attention to the accurate evaluation of the benefits and overheads deriving from EQ scalability, using industrial-level models, and to the integration with existing standard tools and protocols. Moreover, all the works presented in this thesis allow the dynamic reconfiguration of output quality and energy consumption.
More specifically, the contributions of this thesis are divided into three parts. In the first body of work, the design of EQ scalable modules for processing hardware data paths is considered. Three design flows are presented, targeting different technologies and exploiting different ways to achieve EQ scalability, i.e., timing-induced errors and precision reduction. These works are inspired by previous approaches from the literature, namely Reduced-Precision Redundancy and Dynamic Accuracy Scaling, which are re-thought to make them compatible with standard Electronic Design Automation (EDA) tools and flows, providing solutions to overcome their main limitations. The second part of the thesis investigates the application of EQ scalable design to serial interconnects, which are the de facto standard for data exchanges between processing hardware and sensors. In this context, two novel bus encodings are proposed, called Approximate Differential Encoding and Serial-T0, which exploit the statistical characteristics of data produced by sensors to reduce the energy consumption on the bus at the cost of controlled data approximations. The two techniques achieve different results for data of different origins, but share the common features of allowing runtime reconfiguration of the allowed error and being compatible with standard serial bus protocols. Finally, the last part of the manuscript is devoted to the application of EQ scalable design principles to displays, which are often among the most energy-hungry components in mobile systems. The two proposals in this context leverage the emissive nature of Organic Light-Emitting Diode (OLED) displays to save energy by altering the displayed image, thus inducing an output quality reduction that depends on the amount of such alteration. The first technique implements an image-adaptive form of brightness scaling, whose outputs are optimized in terms of the balance between power consumption and similarity with the input. The second approach achieves concurrent power reduction and image enhancement by means of an adaptive polynomial transformation. Both solutions focus on minimizing the overheads associated with a real-time implementation of the transformations in software or hardware, so that these do not offset the savings in the display. For each of these three topics, results show that the aforementioned goal of building EQ scalable systems that are compatible with existing best practices and mature enough to be integrated in commercial devices can be effectively achieved. Moreover, they also show that very simple and similar principles can be applied to design EQ scalable versions of different system components (processing, peripherals and I/O), and to equip these components with knobs for the runtime reconfiguration of the energy-versus-quality tradeoff.
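
    As a loose illustration of the differential-encoding idea behind the proposed bus encodings, the sketch below transmits per-sample deltas and drops deltas within an error budget so the serial line toggles less; the threshold rule and the transition-count metric are assumptions, not the actual Approximate Differential Encoding or Serial-T0 schemes.

```python
# Loose illustration of the differential-encoding idea: send the delta from
# the previous sample and drop deltas within an error budget, so a slowly
# varying sensor signal produces fewer bit transitions on the serial line.
# This is NOT the thesis's Approximate Differential Encoding or Serial-T0;
# the threshold rule and the transition metric are assumptions.

def encode(samples, max_error=2):
    """Return the transmitted deltas under a bounded per-sample error."""
    deltas, prev = [], 0
    for s in samples:
        d = s - prev
        if abs(d) <= max_error:     # small change: approximate as "no change"
            d = 0
        deltas.append(d)
        prev += d                   # the decoder tracks the same running value
    return deltas

def serial_transitions(words, bits=8):
    """Bit transitions when the words are shifted out LSB-first."""
    stream = "".join(format(w & ((1 << bits) - 1), f"0{bits}b")[::-1] for w in words)
    return sum(a != b for a, b in zip(stream, stream[1:]))

raw = [100, 101, 102, 140, 141, 139]          # slowly varying sensor readings
approx = encode(raw)
print(serial_transitions(approx), "vs", serial_transitions(raw))
```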

    Algorithmic techniques for nanometer VLSI design and manufacturing closure

    As Very Large Scale Integration (VLSI) technology moves to the nanoscale regime, design and manufacturing closure becomes very difficult to achieve due to increasing chip and power density. Imperfections due to process, voltage and temperature variations aggravate the problem. Uncertainty in the electrical characteristics of individual devices and wires may cause significant performance deviations or even functional failures. These impose tremendous challenges to the continuation of Moore's law as well as to the growth of the semiconductor industry. Efforts are needed in both the deterministic design stage and the variation-aware design stage. This research proposes various innovative algorithms that address both stages in order to obtain designs with high frequency, low power and high robustness. For deterministic optimization, new buffer insertion and gate sizing techniques are proposed; for variation-aware optimization, new lithography-driven and post-silicon tuning-driven design techniques are proposed. For buffer insertion, a new slew buffering formulation is presented and is proved to be NP-hard. Despite this, a highly efficient algorithm that runs more than 90x faster than the best alternatives is proposed. The algorithm is also extended to handle continuous buffer locations and blockages. For gate sizing, a new algorithm is proposed to handle a discrete gate library, in contrast to the unrealistic continuous gate library assumed by most existing algorithms. Our approach is a continuous-solution-guided dynamic programming approach, which combines the high solution quality of dynamic programming with the short runtime of rounding a continuous solution. For lithography-driven optimization, the problem of cell placement considering manufacturability is studied. Three algorithms are proposed to handle cell flipping and relocation. They are based on dynamic programming and graph-theoretic approaches and provide different tradeoffs between variation reduction and wirelength increase. For post-silicon tuning-driven optimization, the problem of unified adaptivity optimization on logic and clock signal tuning is studied, which enables significant resource savings. The new algorithm is based on a novel linear programming formulation, which is solved by an advanced robust linear programming technique. The continuous solution is then discretized using binary-search-accelerated dynamic programming, batch-based optimization, and Latin Hypercube sampling based fast simulation.
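
    Buffer insertion algorithms of this family typically propagate sets of candidate solutions and prune dominated ones; the sketch below shows the textbook (capacitance, required time) pruning step used in van Ginneken-style buffer insertion as a generic illustration, not the slew-buffering algorithm proposed in this work.

```python
# Generic illustration of candidate pruning in van Ginneken-style buffer
# insertion: a candidate (downstream capacitance, required arrival time) is
# kept only if no other candidate has both lower capacitance and a later
# required time. This is a textbook device, not the thesis's slew-buffering
# formulation (ties in capacitance are not treated specially here).

def prune(candidates):
    """Keep the non-dominated (capacitance, required_time) candidates."""
    kept = []
    for cap, req in sorted(candidates):
        # A candidate survives only if it improves the required time over
        # every lower-capacitance candidate kept so far.
        if not kept or req > kept[-1][1]:
            kept.append((cap, req))
    return kept

print(prune([(2.0, 5.0), (3.0, 4.0), (1.5, 6.0), (4.0, 7.0)]))
# -> [(1.5, 6.0), (4.0, 7.0)]; the other two are dominated
```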

    Dynamic Resource Management of Network-on-Chip Platforms for Multi-stream Video Processing

    This thesis considers resource management in the context of parallel multiple video stream decoding on multicore/many-core platforms. Such platforms have tens or hundreds of on-chip processing elements connected via a Network-on-Chip (NoC). Inefficient task allocation configurations can negatively affect the communication cost and resource contention in the platform, leading to predictability and performance issues. Efficient resource management for large-scale complex workloads is considered a challenging research problem, especially when applications such as video streaming and decoding have dynamic and unpredictable workload characteristics. For these types of applications, runtime heuristic-based task mapping techniques are required. As the application and platform size increase, decentralised resource management techniques become more desirable, to overcome the reliability and performance bottlenecks of centralised management. In this work, several heuristic-based runtime resource management techniques targeting real-time video decoding workloads are proposed. Firstly, two admission control approaches are proposed: one is fully deterministic and highly predictable; the other is heuristic-based and balances predictability and performance. Secondly, a pair of runtime task mapping schemes is presented, which make use of limited known application properties together with communication-cost and blocking-aware heuristics. Combined with the proposed deterministic admission controller, these techniques can provide strict timing guarantees for hard real-time streams whilst improving resource usage. The third contribution in this thesis is a distributed, bio-inspired, low-overhead task re-allocation technique, which is used to further improve the timeliness and workload distribution of admitted soft real-time streams. Finally, this thesis explores parallelisation and resource management issues surrounding soft real-time video streams that have been encoded using complex encoding tools and modern codecs such as High Efficiency Video Coding (HEVC). Properties of real streams and decoding trace data are analysed to statistically model and generate synthetic HEVC video decoding workloads. These workloads are shown to have complex and varying task dependency structures and resource requirements. To address these challenges, two novel runtime task clustering and mapping techniques for Tile-parallel HEVC decoding are proposed. These strategies consider the workload's communication-to-computation ratio and stream-specific characteristics to balance predictability improvement and communication energy reduction. Lastly, several task-to-memory-controller port assignment schemes are explored to alleviate performance bottlenecks resulting from memory traffic contention.
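
    A minimal sketch of the kind of communication-cost-aware mapping heuristic discussed here: tasks that exchange the most data are greedily placed on mesh cores that are few hops apart. The mesh model, cost function and greedy order are illustrative assumptions, not the admission control or mapping schemes proposed in the thesis.

```python
# Minimal sketch of a communication-cost-aware greedy task mapping heuristic
# for a mesh NoC: tasks exchanging the most data are placed on cores that are
# few hops apart. The mesh model, cost function and greedy order are
# illustrative assumptions, not the schemes proposed in the thesis.

from itertools import product

def hops(a, b):
    """Manhattan hop distance between two (x, y) mesh coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def greedy_map(tasks, comm, mesh_w, mesh_h):
    """comm[(t1, t2)] = traffic volume; returns a task -> core mapping."""
    free = set(product(range(mesh_w), range(mesh_h)))
    mapping = {}
    # Place tasks in decreasing order of their total communication volume.
    order = sorted(tasks, key=lambda t: -sum(v for k, v in comm.items() if t in k))
    for t in order:
        def cost(core):
            # Weighted hop distance to already-placed communication partners.
            c = 0
            for (a, b), v in comm.items():
                if t == a and b in mapping:
                    c += v * hops(core, mapping[b])
                elif t == b and a in mapping:
                    c += v * hops(core, mapping[a])
            return c
        best = min(free, key=cost)
        mapping[t] = best
        free.remove(best)
    return mapping

comm = {("dec0", "dec1"): 10, ("dec1", "disp"): 4}   # toy traffic volumes
print(greedy_map(["dec0", "dec1", "disp"], comm, mesh_w=2, mesh_h=2))
```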

    Energy efficient enabling technologies for semantic video processing on mobile devices

    Semantic object-based processing will play an increasingly important role in future multimedia systems due to the ubiquity of digital multimedia capture/playback technologies and increasing storage capacity. Although the object-based paradigm has many undeniable benefits, numerous technical challenges remain before such applications become pervasive, particularly on computationally constrained mobile devices. A fundamental issue is the ill-posed problem of semantic object segmentation. Furthermore, on battery-powered mobile computing devices, the additional algorithmic complexity of semantic object-based processing compared to conventional video processing is highly undesirable from both a real-time operation and a battery-life perspective. This thesis attempts to tackle these issues by first constraining the solution space and focusing on the human face as a primary semantic concept of use to users of mobile devices. A novel face detection algorithm is proposed, which was designed from the outset to be amenable to offloading from the host microprocessor to dedicated hardware, thereby providing real-time performance and reducing power consumption. The algorithm uses an Artificial Neural Network (ANN), whose topology and weights are evolved via a genetic algorithm (GA). The computational burden of the ANN evaluation is offloaded to a dedicated hardware accelerator, which is capable of processing any evolved network topology. Efficient arithmetic circuitry, which leverages modified Booth recoding, column compressors and carry-save adders, is adopted throughout the design. To tackle the increased computational costs associated with object tracking and object-based shape encoding, a novel energy-efficient binary motion estimation architecture is proposed. Energy is reduced in the proposed motion estimation architecture by minimising the redundant operations inherent in the binary data. Both architectures are shown to compare favourably with the relevant prior art.
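
    The appeal of binary motion estimation is that, once frames are reduced to one bit per pixel, the block-matching cost collapses to an XOR followed by a population count, which maps to very simple hardware. The sketch below illustrates only this cost computation; the binarisation and search strategy of the proposed architecture are assumptions that are not modelled.

```python
# Illustrative sketch of the datapath idea behind binary motion estimation:
# with 1-bit pixels, the block-matching cost becomes an XOR followed by a
# population count instead of a sum of absolute differences on 8-bit pixels.
# Binarisation and the search strategy are assumptions; this is not the
# architecture proposed in the thesis.

def binary_block_cost(cur_rows, ref_rows):
    """Hamming distance between two binary blocks, one integer per row."""
    return sum(bin(c ^ r).count("1") for c, r in zip(cur_rows, ref_rows))

# Toy 4x4 blocks, one int per row (LSB = leftmost pixel of the row).
current   = [0b1100, 0b1100, 0b0011, 0b0011]
candidate = [0b1100, 0b1000, 0b0011, 0b0111]
print(binary_block_cost(current, candidate))  # -> 2 differing pixels
```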

    Advanced multilateration theory, software development, and data processing: The MICRODOT system

    The process of geometric parameter estimation to accuracies of one centimeter, i.e., multilateration, is defined and applications are listed. A brief functional explanation of the theory is presented. Next, various multilateration systems are described in order of increasing system complexity. Expected system accuracy is discussed from a general point of view and a summary of the errors is given. An outline of the design of a software processing system for multilateration, called MICRODOT, is presented next. The links of this software, which can be used for multilateration data simulations or operational data reduction, are examined on an individual basis. Functional flow diagrams are presented to aid in understanding the software capability. MICRODOT capability is described with respect to vehicle configurations, interstation coordinate reduction, geophysical parameter estimation, and orbit determination. Numerical results obtained from MICRODOT via data simulations are displayed for both hypothetical and real-world vehicle/station configurations, such as those used in the GEOS-3 Project. These simulations show the inherent power of the multilateration procedure.
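
    The core geometric step in multilateration can be sketched as an iterative least-squares (Gauss-Newton) solve that recovers a position from range measurements to stations at known coordinates; the station layout, noise-free ranges and solver details below are illustrative assumptions and do not reflect the MICRODOT implementation.

```python
# Minimal sketch of multilateration as iterative least squares: recover a
# point from ranges to stations at known coordinates. Station layout,
# noise-free ranges and solver details are illustrative assumptions only.

import numpy as np

def multilaterate(stations, ranges, guess, iters=10):
    """Estimate a 3-D position from ranges to known stations (Gauss-Newton)."""
    x = np.asarray(guess, dtype=float)
    for _ in range(iters):
        diffs = x - stations                      # vectors from stations to estimate
        dists = np.linalg.norm(diffs, axis=1)     # predicted ranges
        residual = dists - ranges                 # range residuals
        jacobian = diffs / dists[:, None]         # d(range)/d(position)
        # Solve the linearised least-squares step and update the estimate.
        step, *_ = np.linalg.lstsq(jacobian, residual, rcond=None)
        x -= step
    return x

stations = np.array([[0., 0., 0.], [10., 0., 0.], [0., 10., 0.], [0., 0., 10.]])
truth = np.array([3., 4., 5.])
ranges = np.linalg.norm(stations - truth, axis=1)   # synthetic, noise-free ranges
print(multilaterate(stations, ranges, guess=[1., 1., 1.]))  # ~ [3, 4, 5]
```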