33 research outputs found
An ICT image processing chip based on fast computation algorithm and self-timed circuit technique.
by Johnson, Tin-Chak Pang.Thesis (M.Phil.)--Chinese University of Hong Kong, 1997.Includes bibliographical references.AcknowledgmentsAbstractList of figuresList of tablesChapter 1. --- Introduction --- p.1-1Chapter 1.1 --- Introduction --- p.1-1Chapter 1.2 --- Introduction to asynchronous system --- p.1-5Chapter 1.2.1 --- Motivation --- p.1-5Chapter 1.2.2 --- Hazards --- p.1-7Chapter 1.2.3 --- Classes of Asynchronous circuits --- p.1-8Chapter 1.3 --- Introduction to Transform Coding --- p.1-9Chapter 1.4 --- Organization of the Thesis --- p.1-16Chapter 2. --- Asynchronous Design Methodologies --- p.2-1Chapter 2.1 --- Introduction --- p.2-1Chapter 2.2 --- Self-timed system --- p.2-2Chapter 2.3 --- DCVSL Methodology --- p.2-4Chapter 2.3.1 --- DCVSL gate --- p.2-5Chapter 2.3.2 --- Handshake Control --- p.2-7Chapter 2.4 --- Micropipeline Methodology --- p.2-11Chapter 2.4.1 --- Summary of previous design --- p.2-12Chapter 2.4.2 --- New Micropipeline structure and improvements --- p.2-17Chapter 2.4.2.1 --- Asymmetrical delay --- p.2-20Chapter 2.4.2.2 --- Variable Delay and Delay Value Selection --- p.2-22Chapter 2.5 --- Comparison between DCVSL and Micropipeline --- p.2-25Chapter 3. --- Self-timed Multipliers --- p.3-1Chapter 3.1 --- Introduction --- p.3-1Chapter 3.2 --- Design Example 1 : Bit-serial matrix multiplier --- p.3-3Chapter 3.2.1 --- DCVSL design --- p.3-4Chapter 3.2.2 --- Micropipeline design --- p.3-4Chapter 3.2.3 --- The first test chip --- p.3-5Chapter 3.2.4 --- Second test chip --- p.3-7Chapter 3.3 --- Design Example 2 - Modified Booth's Multiplier --- p.3-9Chapter 3.3.1 --- Circuit Design --- p.3-10Chapter 3.3.2 --- Simulation result --- p.3-12Chapter 3.3.3 --- The third test chip --- p.3-14Chapter 4. --- Current-Sensing Completion Detection --- p.4-1Chapter 4.1 --- Introduction --- p.4-1Chapter 4.2 --- Current-sensor --- p.4-2Chapter 4.2.1 --- Constant current source --- p.4-2Chapter 4.2.2 --- Current mirror --- p.4-4Chapter 4.2.3 --- Current comparator --- p.4-5Chapter 4.3 --- Self-timed logic using CSCD --- p.4-9Chapter 4.4 --- CSCD test chips and testing results --- p.4-10Chapter 4.4.1 --- Test result --- p.4-11Chapter 5. --- Self-timed ICT processor architecture --- p.5-1Chapter 5.1 --- Introduction --- p.5-1Chapter 5.2 --- Comparison of different architecture --- p.5-3Chapter 5.2.1 --- General purpose Digital Signal Processor --- p.5-5Chapter 5.2.1.1 --- Hardware and speed estimation : --- p.5-6Chapter 5.2.2 --- Micropipeline without fast algorithm --- p.5-7Chapter 5.2.2.1 --- Hardware and speed estimation : --- p.5-8Chapter 5.2.3 --- Micropipeline with fast algorithm (I) --- p.5-8Chapter 5.2.3.1 --- Hardware and speed estimation : --- p.5-9Chapter 5.2.4 --- Micropipeline with fast algorithm (II) --- p.5-10Chapter 5.2.4.1 --- Hardware and speed estimation : --- p.5-11Chapter 6. --- Implementation of self-timed ICT processor --- p.6-1Chapter 6.1 --- Introduction --- p.6-1Chapter 6.2 --- Implementation of Self-timed 2-D ICT processor (First version) --- p.6-3Chapter 6.2.1 --- 1-D ICT module --- p.6-4Chapter 6.2.2 --- Self-timed Transpose memory --- p.6-5Chapter 6.2.3 --- Layout Design --- p.6-8Chapter 6.3 --- Implementation of Self-timed 1-D ICT processor with fast algorithm (final version) --- p.6-9Chapter 6.3.1 --- I/O buffers and control units --- p.6-10Chapter 6.3.1.1 --- Input control --- p.6-11Chapter 6.3.1.2 --- Output control --- p.6-12Chapter 6.3.1.2.1 --- Self-timed Computational Block --- p.6-13Chapter 6.3.1.3 --- Handshake Control Unit --- p.6-14Chapter 6.3.1.4 --- Integer Execution Unit (IEU) --- p.6-18Chapter 6.3.1.5 --- Program memory and Instruction decoder --- p.6-20Chapter 6.3.2 --- Layout Design --- p.6-21Chapter 6.4 --- Specifications of the final version self-timed ICT chip --- p.6-22Chapter 7. --- Testing of Self-timed ICT processor --- p.7-1Chapter 7.1 --- Introduction --- p.7-1Chapter 7.2 --- Pin assignment of Self-timed 1 -D ICT chip --- p.7-2Chapter 7.3 --- Simulation --- p.7-3Chapter 7.4 --- Testing of Self-timed 1-D ICT processor --- p.7-5Chapter 7.4.1 --- Functional test --- p.7-5Chapter 7.4.1.1 --- Testing environment and results --- p.7-5Chapter 7.4.2 --- Transient Characteristics --- p.7-7Chapter 7.4.3 --- Comments on speed and power --- p.7-10Chapter 7.4.4 --- Determination of optimum delay control voltage --- p.7-12Chapter 7.5 --- Testing of delay element and other logic cells --- p.7-13Chapter 8. --- Conclusions --- p.8-1BibliographyAppendice
Design of application-specific instruction set processors with asynchronous methodology for embedded digital signal processing applications.
Kwok Yan-lun Andy.Thesis submitted in: November 2004.Thesis (M.Phil.)--Chinese University of Hong Kong, 2005.Includes bibliographical references (leaves 133-137).Abstracts in English and Chinese.Abstract --- p.iæèŠ --- p.iiAcknowledgements --- p.iiiList of Figures --- p.viiList of Tables and Examples --- p.xChapter 1. --- Introduction --- p.1Chapter 1.1. --- Motivation --- p.1Chapter 1.2. --- Objective and Approach --- p.4Chapter 1.3. --- Thesis Organization --- p.5Chapter 2. --- Related Work --- p.7Chapter 2.1. --- Coverage --- p.7Chapter 2.2. --- ASIP Design Methodologies --- p.8Chapter 2.3. --- Asynchronous Technology on Processors --- p.12Chapter 2.4. --- Summary --- p.14Chapter 3. --- Asynchronous Design Methodology --- p.15Chapter 3.1. --- Overview --- p.15Chapter 3.2. --- Asynchronous Design Style --- p.17Chapter 3.2.1. --- Micropipelines --- p.17Chapter 3.2.2. --- Fine-grain Pipelining --- p.20Chapter 3.2.3. --- Globally-Asynchronous Locally-Synchronous (GALS) Design --- p.22Chapter 3.3. --- Advantages of GALS in ASIP Design --- p.27Chapter 3.3.1. --- Reuse of Synchronous and Asynchronous IP --- p.27Chapter 3.3.2. --- Fine Tuning of Performance and Power Consumption --- p.27Chapter 3.3.3. --- Synthesis-based Design Flow --- p.28Chapter 3.4. --- Design of GALS Asynchronous Wrapper --- p.28Chapter 3.4.1. --- Handshake Protocol --- p.28Chapter 3.4.2. --- Pausible Clock Generator --- p.29Chapter 3.4.3. --- Port Controllers --- p.30Chapter 3.4.4. --- Performance of the Asynchronous Wrapper --- p.33Chapter 3.5. --- Summary --- p.35Chapter 4. --- Platform Based ASIP Design Methodology --- p.36Chapter 4.1. --- Platform Based Approach --- p.36Chapter 4.1.1. --- The Definition of Our Platform --- p.37Chapter 4.1.2. --- The Definition of the Platform Based Design --- p.37Chapter 4.2. --- Platform Architecture --- p.38Chapter 4.2.1. --- The Nature of DSP Algorithms --- p.38Chapter 4.2.2. --- Design Space of Datapath Optimization --- p.46Chapter 4.2.3. --- Proposed Architecture --- p.49Chapter 4.2.4. --- The Strategy of Realizing an Optimized Datapath --- p.51Chapter 4.2.5. --- Pipeline Organization --- p.59Chapter 4.2.6. --- GALS Partitioning --- p.61Chapter 4.2.7. --- Operation Mechanism --- p.63Chapter 4.3. --- Overall Design Flow --- p.67Chapter 4.4. --- Summary --- p.70Chapter 5. --- Design of the ASIP Platform --- p.72Chapter 5.1. --- Design Goal --- p.72Chapter 5.2. --- Instruction Fetch --- p.74Chapter 5.2.1. --- Instruction fetch unit --- p.74Chapter 5.2.2. --- Zero-overhead loops and Subroutines --- p.75Chapter 5.3. --- Instruction Decode --- p.77Chapter 5.3.1. --- Instruction decoder --- p.77Chapter 5.3.2. --- The Encoding of Parallel and Complex Instructions --- p.80Chapter 5.4. --- Datapath --- p.81Chapter 5.4.1. --- Base Functional Units --- p.81Chapter 5.4.2. --- Functional Unit Wrapper Interface --- p.83Chapter 5.5. --- Register File Systems --- p.84Chapter 5.5.1. --- Memory Hierarchy --- p.84Chapter 5.5.2. --- Register File Organization --- p.85Chapter 5.5.3. --- Address Generation --- p.93Chapter 5.5.4. --- Load and Store --- p.98Chapter 5.6. --- Design Verification --- p.100Chapter 5.7. --- Summary --- p.104Chapter 6. --- Case Studies --- p.105Chapter 6.1. --- Objective --- p.105Chapter 6.2. --- Approach --- p.105Chapter 6.3. --- Based versus Optimized --- p.106Chapter 6.3.1. --- Matrix Manipulation --- p.106Chapter 6.3.2. --- Autocorrelation --- p.109Chapter 6.3.3. --- CORDIC --- p.110Chapter 6.4. --- Optimized versus Advanced Commercial DSPs --- p.113Chapter 6.4.1. --- Introduction to TMS320C62x and SC140 --- p.113Chapter 6.4.2. --- Results --- p.115Chapter 6.5. --- Summary --- p.116Chapter 7. --- Conclusion --- p.118Chapter 7.1. --- When ASIPs encounter asynchronous --- p.118Chapter 7.2. --- Contributions --- p.120Chapter 7.3. --- Future Directions --- p.121Chapter A --- Synthesis of Extended Burst-Mode Asynchronous Finite State Machine --- p.122Chapter B --- Base Instruction Set --- p.124Chapter C --- Special Registers --- p.127Chapter D --- Synthesizable Model of GALS Wrapper --- p.130Reference --- p.13
Energy autonomous systems : future trends in devices, technology, and systems
The rapid evolution of electronic devices since the beginning of the nanoelectronics era has brought about exceptional computational power in an ever shrinking system footprint. This has enabled among others the wealth of nomadic battery powered wireless systems (smart phones, mp3 players, GPS, âŠ) that society currently enjoys. Emerging integration technologies enabling even smaller volumes and the associated increased functional density may bring about a new revolution in systems targeting wearable healthcare, wellness, lifestyle and industrial monitoring applications
NONLINEAR OPERATORS FOR IMAGE PROCESSING: DESIGN, IMPLEMENTATION AND MODELING TECHNIQUES FOR POWER ESTIMATION
1998/1999Negli ultimi anni passati le applicazioni multimediali hanno visto uno sviluppo notevole, trovando applicazione in un gran numero di campi. Applicazioni come video conferenze, diagnostica medica, telefonia mobile e applicazioni militari necessitano il trattamento di una gran mole di dati ad alta velocitĂ . Pertanto, l'elaborazione di immagini e di dati vocali Ăš molto importante ed Ăš stata oggetto di numerosi sforzi, nel tentativo di trovare algoritmi sempre piĂč veloci ed efficaci. Tra gli algoritmi proposti, noi crediamo che gli operatori razionali svolgano un ruolo molto importante, grazie alla loro versatilitĂ ed efficacia nell'elaborazione di dati. Negli ultimi anni sono stati proposti diversi algoritmi, dimostrando che questi operatori possono essere molto vantaggiosi in diverse applicazioni, producendo buoni risultati. Lo scopo di questo lavoro Ăš di realizzare alcuni di questi algoritmi e, quindi, dimostrare che i filtri razionali, in particolare, possono essere realizzati senza ricorrere a sistemi di grandi dimensioni e possono raggiungere frequenze operative molto alte. Una volta che il blocco fondamentale di un sistema basato su operatori razionali sia stato realizzato, esso pu6 essere riusato con successo in molte altre applicazioni. Dal punto di vista del progettista, Ăš importante avere uno schema generale di studio, che lo renda capace di studiare le varie configurazioni del sistema da realizzare e di analizzare i compromessi tra le variabili di progetto. In particolare, per soddisfare l'esigenza di metodi versatili per la stima della potenza, abbiamo sviluppato una tecnica di macro modellizazione che permette al progettista di stimare velocemente ed accuratamente la potenza dissipata da un circuito. La tesi Ăš organizzata come segue: Nel Capitolo 1 alcuni sono presentati alcuni algoritmi studiati per la realizzazione. Ne viene data solo una veloce descrizione, lasciando comunque al lettore interessato dei riferimenti bibliografici. Nel Capitolo 2 vengono discusse le architetture fondamentali usate per la realizzazione. Principalmente sono state usate architetture a pipeline, ma viene data anche una descrizione degli approcci oggigiorno disponibili per l'ottimizzazione delle temporizzazioni. Nel Capitolo 3 sono presentate le realizzazioni di due sistemi studiati per questa tesi. Gli approcci seguiti si basano su ASIC e FPGA. Richiedono tecniche e soluzioni diverse per il progetto del sistema, per cui Ă© interessante vedere cosa pu6 essere fatto nei due casi. Infine, nel Capitolo 4, descriviamo la nostra tecnica di macro modellizazione per la stima di potenza, dando una breve visione delle tecniche finora proposte e facendo vedere quali sono i vantaggi che il nostro metodo comporta per il progetto.In the past few years, multimedia application have been growing very fast, being applied to a large variety of fields. Applications like video conference, medical diagnostic, mobile phones, military applications require to handle large amount of data at high rate. Images as well as voice data processing are therefore very important and they have been subjected to a lot of efforts in order to find always faster and effective algorithms. Among image processing algorithms, we believe that rational operators assume an important role, due to their versatility and effectiveness in data processing. In the last years, several algorithms have been proposed, demonstrating that these operators can be very suitable in different applications with very good results. The aim of this work is to implement some of these algorithm and, therefore, demonstrate that rational filters, in particular, can be implemented without requiring large sized systems and they can operate at very high frequencies. Once the basic building block of a rational based system has been implemented, it can be successfully reused in many other applications. From the designer point of view, it is important to have a general framework, which makes it able to study various configurations of the system to be implemented and analyse the trade-off among the design variables. In particular, to meet the need far versatile tools far power estimation, we developed a new macro modelling technique, which allows the designer to estimate the power dissipated by a circuit quickly and accurately. The thesis is organized as follows: In chapter 1 we present some of the algorithms which have been studied for implementation. Only a brief overview is given, leaving to the interested reader some references in literature. In chapter 2 we discuss the basic architectures used for the implementations. Pipelined structures have been mainly used for this thesis, but an overview of the nowaday available approaches for timing optimization is presented. In chapter 3 we present two of the implementation designed for this thesis. The approaches followed are ASIC driven and FPGA drive. They require different techniques and different solution for the design of the system, therefore it is interesting to see what can be done in both the cases. Finally, in chapter 4, we describe our macro modelling techniques for power estimation, giving a brief overview of the up to now proposed techniques and showing the advantages our method brings to the design.XII Ciclo1969Versione digitalizzata della tesi di dottorato cartacea
Architectural Exploration of KeyRing Self-Timed Processors
RĂSUMĂ
Les derniĂšres dĂ©cennies ont vu lâaugmentation des performances des processeurs contraintes
par les limites imposĂ©es par la consommation dâĂ©nergie des systĂšmes Ă©lectroniques : des trĂšs
basses consommations requises pour les objets connectés, aux budgets de dépenses électriques
des serveurs, en passant par les limitations thermiques et la durée de vie des batteries des
appareils mobiles. Cette forte demande en processeurs efficients en énergie, couplée avec
les limitations de la rĂ©duction dâĂ©chelle des transistorsâqui ne permet plus dâamĂ©liorer les
performances Ă densitĂ© de puissance constanteâ, conduit les concepteurs de circuits intĂ©grĂ©s
Ă explorer de nouvelles microarchitectures permettant dâobtenir de meilleures performances
pour un budget Ă©nergĂ©tique donnĂ©. Cette thĂšse sâinscrit dans cette tendance en proposant
une nouvelle microarchitecture de processeur, appelĂ©e KeyRing, conçue avec lâintention de
rĂ©duire la consommation dâĂ©nergie des processeurs.
La frĂ©quence dâopĂ©ration des transistors dans les circuits intĂ©grĂ©s est proportionnelle Ă leur
consommation dynamique dâĂ©nergie. Par consĂ©quent, les techniques de conception permettant
de réduire dynamiquement le nombre de transistors en opération sont trÚs largement
adoptĂ©es pour amĂ©liorer lâefficience Ă©nergĂ©tique des processeurs. La technique de clock-gating
est particuliĂšrement usitĂ©e dans les circuits synchrones, car elle rĂ©duit lâimpact de lâhorloge
globale, qui est la principale source dâactivitĂ©. La microarchitecture KeyRing prĂ©sentĂ©e dans
cette thÚse utilise une méthode de synchronisation décentralisée et asynchrone pour réduire
lâactivitĂ© des circuits. Elle est dĂ©rivĂ©e du processeur AnARM, un processeur dĂ©veloppĂ© par
Octasic sur la base dâune microarchitecture asynchrone ad hoc. Bien quâil soit plus efficient
en Ă©nergie que des alternatives synchrones, le AnARM est essentiellement incompatible avec
les mĂ©thodes de synthĂšse et dâanalyse temporelle statique standards. De plus, sa technique
de conception ad hoc ne sâinscrit que partiellement dans les paradigmes de conceptions asynchrones.
Cette thÚse propose une approche rigoureuse pour définir les principes généraux
de cette technique de conception ad hoc, en faisant levier sur la littérature asynchrone. La
microarchitecture KeyRing qui en résulte est développée en association avec une méthode
de conception automatisĂ©e, qui permet de sâaffranchir des incompatibilitĂ©s natives existant
entre les outils de conception et les systÚmes asynchrones. La méthode proposée permet de
pleinement mettre Ă profit les flots de conception standards de lâindustrie microĂ©lectronique
pour réaliser la synthÚse et la vérification des circuits KeyRing. Cette thÚse propose également
des protocoles expérimentaux, dont le but est de renforcer la relation de causalité
entre la microarchitecture KeyRing et une réduction de la consommation énergétique des
processeurs, comparativement Ă des alternatives synchrones Ă©quivalentes.----------ABSTRACT
Over the last years, microprocessors have had to increase their performances while keeping
their power envelope within tight bounds, as dictated by the needs of various markets: from
the ultra-low power requirements of the IoT, to the electrical power consumption budget
in enterprise servers, by way of passive cooling and day-long battery life in mobile devices.
This high demand for power-efficient processors, coupled with the limitations of technology
scalingâwhich no longer provides improved performances at constant power densitiesâ, is
leading designers to explore new microarchitectures with the goal of pulling more performances
out of a fixed power budget. This work enters into this trend by proposing a new
processor microarchitecture, called KeyRing, having a low-power design intent.
The switching activity of integrated circuitsâi.e. transistors switching on and offâdirectly
affects their dynamic power consumption. Circuit-level design techniques such as clock-gating
are widely adopted as they dramatically reduce the impact of the global clock in synchronous
circuits, which constitutes the main source of switching activity. The KeyRing microarchitecture
presented in this work uses an asynchronous clocking scheme that relies on decentralized
synchronization mechanisms to reduce the switching activity of circuits. It is derived from
the AnARM, a power-efficient ARM processor developed by Octasic using an ad hoc asynchronous
microarchitecture. Although it delivers better power-efficiency than synchronous
alternatives, it is for the most part incompatible with standard timing-driven synthesis and
Static Timing Analysis (STA). In addition, its design style does not fit well within the existing
asynchronous design paradigms. This work lays the foundations for a more rigorous
definition of this rather unorthodox design style, using circuits and methods coming from the
asynchronous literature. The resulting KeyRing microarchitecture is developed in combination
with Electronic Design Automation (EDA) methods that alleviate incompatibility issues
related to ad hoc clocking, enabling timing-driven optimizations and verifications of KeyRing
circuits using industry-standard design flows. In addition to bridging the gap with standard
design practices, this work also proposes comprehensive experimental protocols that aims to
strengthen the causal relation between the reported asynchronous microarchitecture and a
reduced power consumption compared with synchronous alternatives.
The main achievement of this work is a framework that enables the architectural exploration
of circuits using the KeyRing microarchitecture