30 research outputs found
Doctor of Philosophy
dissertationPortable electronic devices will be limited to available energy of existing battery chemistries for the foreseeable future. However, system-on-chips (SoCs) used in these devices are under a demand to offer more functionality and increased battery life. A difficult problem in SoC design is providing energy-efficient communication between its components while maintaining the required performance. This dissertation introduces a novel energy-efficient network-on-chip (NoC) communication architecture. A NoC is used within complex SoCs due it its superior performance, energy usage, modularity, and scalability over traditional bus and point-to-point methods of connecting SoC components. This is the first academic research that combines asynchronous NoC circuits, a focus on energy-efficient design, and a software framework to customize a NoC for a particular SoC. Its key contribution is demonstrating that a simple, asynchronous NoC concept is a good match for low-power devices, and is a fruitful area for additional investigation. The proposed NoC is energy-efficient in several ways: simple switch and arbitration logic, low port radix, latch-based router buffering, a topology with the minimum number of 3-port routers, and the asynchronous advantages of zero dynamic power consumption while idle and the lack of a clock tree. The tool framework developed for this work uses novel methods to optimize the topology and router oorplan based on simulated annealing and force-directed movement. It studies link pipelining techniques that yield improved throughput in an energy-efficient manner. A simulator is automatically generated for each customized NoC, and its traffic generators use a self-similar message distribution, as opposed to Poisson, to better match application behavior. Compared to a conventional synchronous NoC, this design is superior by achieving comparable message latency with half the energy
Elastic bundles :modelling and architecting asynchronous circuits with granular rigidity
PhD ThesisIntegrated Circuit (IC) designs these days are predominantly System-on-Chips (SoCs).
The complexity of designing a SoC has increased rapidly over the years due to growing
process and environmental variations coupled with global clock distribution di culty.
Moreover, traditional synchronous design is not apt to handle the heterogeneous timing
nature of modern SoCs. As a countermeasure, the semiconductor industry witnessed
a strong revival of asynchronous design principles. A new paradigm of digital circuits
emerged, as a result, namely mixed synchronous-asynchronous circuits. With a wave
of recent innovations in synchronous-asynchronous CAD integration, this paradigm is
showing signs of commercial adoption in future SoCs mainly due to the scope for reuse
of synchronous functional blocks and IP cores, and the co-existence of synchronous and
asynchronous design styles in a common EDA framework.
However, there is a lack of formal methods and tools to facilitate mixed synchronousasynchronous
design. In this thesis, we propose a formal model based on Petri nets with
step semantics to describe these circuits behaviourally. Implication of this model in the
veri cation and synthesis of mixed synchronous-asynchronous circuits is studied. Till
date, this paradigm has been mainly explored on the basis of Globally Asynchronous
Locally Synchronous (GALS) systems. Despite decades of research, GALS design has
failed to gain traction commercially. To understand its drawbacks, a simulation framework
characterising the physical and functional aspects of GALS SoCs is presented.
A novel method for synthesising mixed synchronous-asynchronous circuits with varying
levels of rigidity is proposed. Starting with a high-level data ow model of a system which
is intrinsically asynchronous, the key idea is to introduce rigidity of chosen granularity
levels in the model without changing functional behaviour. The system is then partitioned
into functional blocks of synchronous and asynchronous elements before being transformed
into an equivalent circuit which can be synthesised using standard EDA tools
A High-Throughput, Low-Power Asynchronous Mesh-of-Trees Interconnection Network for the Explicit Multi-Threading (XMT) Parallel Architecture
This thesis presents an asynchronous (clockless) Mesh-of-Trees network that
consumes less power and area than the synchronous Mesh-of-Trees network,
while maintaining high throughput and low latency.
Two new asynchronous designs are proposed for the fundamental
pipelined components of the network (routing and arbitration),
which are optimized for power, area, latency and throughput.
Mixed-timing interfaces are added to create a mixed-timing network
which provides communication between synchronous and asynchronous
domains.
Two issues top the agenda of CPU design in the emerging many-core era:
programmers' productivity and power consumption.
Through its reliance on the richest available theory of parallel algorithms,
the eXplicit Multi-Threading (XMT) parallel architecture addresses
programmers' productivity.
The motivation for this work is to provide an effective
interconnection network for the XMT architecture
in terms of both performance and power consumption.
Performance of the XMT processor with the mixed-timing network is
measured for several applications
Design of asynchronous microprocessor for power proportionality
PhD ThesisMicroprocessors continue to get exponentially cheaper for end users following Moore’s
law, while the costs involved in their design keep growing, also at an exponential rate.
The reason is the ever increasing complexity of processors, which modern EDA tools
struggle to keep up with. This makes further scaling for performance subject to a high
risk in the reliability of the system. To keep this risk low, yet improve the performance,
CPU designers try to optimise various parts of the processor. Instruction Set Architecture
(ISA) is a significant part of the whole processor design flow, whose optimal design
for a particular combination of available hardware resources and software requirements
is crucial for building processors with high performance and efficient energy utilisation.
This is a challenging task involving a lot of heuristics and high-level design decisions.
Another issue impacting CPU reliability is continuous scaling for power consumption. For
the last decades CPU designers have been mainly focused on improving performance, but
“keeping energy and power consumption in mind”. The consequence of this was a development
of energy-efficient systems, where energy was considered as a resource whose
consumption should be optimised. As CMOS technology was progressing, with feature
size decreasing and power delivered to circuit components becoming less stable, the
energy resource turned from an optimisation criterion into a constraint, sometimes a critical
one. At this point power proportionality becomes one of the most important aspects
in system design. Developing methods and techniques which will address the problem
of designing a power-proportional microprocessor, capable to adapt to varying operating
conditions (such as low or even unstable voltage levels) and application requirements in
the runtime, is one of today’s grand challenges. In this thesis this challenge is addressed
by proposing a new design flow for the development of an ISA for microprocessors, which
can be altered to suit a particular hardware platform or a specific operating mode. This
flow uses an expressive and powerful formalism for the specification of processor instruction
sets called the Conditional Partial Order Graph (CPOG). The CPOG model captures
large sets of behavioural scenarios for a microarchitectural level in a computationally
efficient form amenable to formal transformations for synthesis, verification and automated
derivation of asynchronous hardware for the CPU microcontrol. The feasibility of
the methodology, novel design flow and a number of optimisation techniques was proven
in a full size asynchronous Intel 8051 microprocessor and its demonstrator silicon. The
chip showed the ability to work in a wide range of operating voltage and environmental
conditions. Depending on application requirements and power budget our ASIC supports
several operating modes: one optimised for energy consumption and the other one for
performance. This was achieved by extending a traditional datapath structure with an
auxiliary control layer for adaptable and fault tolerant operation. These and other optimisations
resulted in a reconfigurable and adaptable implementation, which was proven
by measurements, analysis and evaluation of the chip.EPSR
Custom Cell Placement Automation for Asynchronous VLSI
Asynchronous Very-Large-Scale-Integration (VLSI) integrated circuits have demonstrated many advantages over their synchronous counterparts, including low power consumption, elastic pipelining, robustness against manufacturing and temperature variations, etc. However, the lack of dedicated electronic design automation (EDA) tools, especially physical layout automation tools, largely limits the adoption of asynchronous circuits. Existing commercial placement tools are optimized for synchronous circuits, and require a standard cell library provided by semiconductor foundries to complete the physical design. The physical layouts of cells in this library have the same height to simplify the placement problem and the power distribution network. Although the standard cell methodology also works for asynchronous designs, the performance is inferior compared with counterparts designed using the full-custom design methodology. To tackle this challenge, we propose a gridded cell layout methodology for asynchronous circuits, in which the cell height and cell width can be any integer multiple of two grid values. The gridded cell approach combines the shape regularity of standard cells with the size flexibility of full-custom layouts. Therefore, this approach can achieve a better space utilization ratio and lower wire length for asynchronous designs. Experiments have shown that the gridded cell placement approach reduces area without impacting the routability. We have also used this placer to tape out a chip in a 65nm process technology, demonstrating that our placer generates design-rule clean results
Recommended from our members
On Multicast in Asynchronous Networks-on-Chip: Techniques, Architectures, and FPGA Implementation
In this era of exascale computing, conventional synchronous design techniques are facing unprecedented challenges. The consumer electronics market is replete with many-core systems in the range of 16 cores to thousands of cores on chip, integrating multi-billion transistors. However, with this ever increasing complexity, the traditional design approaches are facing key issues such as increasing chip power, process variability, aging, thermal problems, and scalability. An alternative paradigm that has gained significant interest in the last decade is asynchronous design. Asynchronous designs have several potential advantages: they are naturally energy proportional, burning power only when active, do not require complex clock distribution, are robust to different forms of variability, and provide ease of composability for heterogeneous platforms. Networks-on-chip (NoCs) is an interconnect paradigm that has been introduced to deal with the ever-increasing system complexity. NoCs provide a distributed, scalable, and efficient interconnect solution for today’s many-core systems. Moreover, NoCs are a natural match with asynchronous design techniques, as they separate communication infrastructure and timing from the computational elements. To this end, globally-asynchronous locally-synchronous (GALS) systems that interconnect multiple processing cores, operating at different clock speeds, using an asynchronous NoC, have gained significant interest. While asynchronous NoCs have several advantages, they also face a key challenge of supporting new types of traffic patterns. Once such pattern is multicast communication, where a source sends packets to arbitrary number of destinations. Multicast is not only common in parallel computing, such as for cache coherency, but also for emerging areas such as neuromorphic computing. This important capability has been largely missing from asynchronous NoCs. This thesis introduces several efficient multicast solutions for these interconnects. In particular, techniques, and network architectures are introduced to support high-performance and low-power multicast. Two leading network topologies are the focus: a variant mesh-of-trees (MoT) and a 2D mesh. In addition, for a more realistic implementation and analysis, as well as significantly advancing the field of asynchronous NoCs, this thesis also targets synthesis of these NoCs on commercial FPGAs. While there has been significant advances in FPGA technologies, there has been only limited research on implementing asynchronous NoCs on FPGAs. To this end, a systematic computeraided design (CAD) methodology has been introduced to efficiently and safely map asynchronous NoCs on FPGAs. Overall, this thesis makes the following three contributions. The first contribution is a multicast solution for a variant MoT network topology. This topology consists of simple low-radix switches, and has been used in high-performance computing platforms. A novel local speculation technique is introduced, where a subset of the network’s switches are speculative that always broadcast every packet. These switches are very simple and have high performance. Speculative switches are surrounded by non-speculative ones that route packets based on their destinations and also throttle any redundant copies created by the former. This hybrid network architecture achieved significant performance and power benefits over other multicast approaches. The second contribution is a multicast solution for a 2D-mesh topology, which is more complex with higher-radix switches and also is more commonly used. A novel continuous-time replication strategy is introduced to optimize the critical multi-way forking operation of a multicast transmission. In this technique, a multicast packet is first stored in an input port of a switch, from where it is sent through distinct output ports towards different destinations concurrently, at each output’s own rate and in continuous time. This strategy is shown to have significant latency and energy benefits over an approach that performs multicast using multiple distinct serial unicasts to each destination. Finally, a systematic CAD methodology is introduced to synthesize asynchronous NoCs on commercial FPGAs. A two-fold goal is targeted: correctness and high performance. For ease of implementation, only existing FPGA synthesis tools are used. Moreover, since asynchronous NoCs involve special asynchronous components, a comprehensive guide is introduced to map these elements correctly and efficiently. Two asynchronous NoC switches are synthesized using the proposed approach on a leading Xilinx FPGA in 28 nm: one that only handles unicast, and the other that also supports multicast. Both showed significant energy benefits with some performance gains over a state-of-the-art synchronous switch
Architectural Exploration of KeyRing Self-Timed Processors
RÉSUMÉ
Les dernières décennies ont vu l’augmentation des performances des processeurs contraintes
par les limites imposées par la consommation d’énergie des systèmes électroniques : des très
basses consommations requises pour les objets connectés, aux budgets de dépenses électriques
des serveurs, en passant par les limitations thermiques et la durée de vie des batteries des
appareils mobiles. Cette forte demande en processeurs efficients en énergie, couplée avec
les limitations de la réduction d’échelle des transistors—qui ne permet plus d’améliorer les
performances à densité de puissance constante—, conduit les concepteurs de circuits intégrés
à explorer de nouvelles microarchitectures permettant d’obtenir de meilleures performances
pour un budget énergétique donné. Cette thèse s’inscrit dans cette tendance en proposant
une nouvelle microarchitecture de processeur, appelée KeyRing, conçue avec l’intention de
réduire la consommation d’énergie des processeurs.
La fréquence d’opération des transistors dans les circuits intégrés est proportionnelle à leur
consommation dynamique d’énergie. Par conséquent, les techniques de conception permettant
de réduire dynamiquement le nombre de transistors en opération sont très largement
adoptées pour améliorer l’efficience énergétique des processeurs. La technique de clock-gating
est particulièrement usitée dans les circuits synchrones, car elle réduit l’impact de l’horloge
globale, qui est la principale source d’activité. La microarchitecture KeyRing présentée dans
cette thèse utilise une méthode de synchronisation décentralisée et asynchrone pour réduire
l’activité des circuits. Elle est dérivée du processeur AnARM, un processeur développé par
Octasic sur la base d’une microarchitecture asynchrone ad hoc. Bien qu’il soit plus efficient
en Ă©nergie que des alternatives synchrones, le AnARM est essentiellement incompatible avec
les méthodes de synthèse et d’analyse temporelle statique standards. De plus, sa technique
de conception ad hoc ne s’inscrit que partiellement dans les paradigmes de conceptions asynchrones.
Cette thèse propose une approche rigoureuse pour définir les principes généraux
de cette technique de conception ad hoc, en faisant levier sur la littérature asynchrone. La
microarchitecture KeyRing qui en résulte est développée en association avec une méthode
de conception automatisée, qui permet de s’affranchir des incompatibilités natives existant
entre les outils de conception et les systèmes asynchrones. La méthode proposée permet de
pleinement mettre à profit les flots de conception standards de l’industrie microélectronique
pour réaliser la synthèse et la vérification des circuits KeyRing. Cette thèse propose également
des protocoles expérimentaux, dont le but est de renforcer la relation de causalité
entre la microarchitecture KeyRing et une réduction de la consommation énergétique des
processeurs, comparativement Ă des alternatives synchrones Ă©quivalentes.----------ABSTRACT
Over the last years, microprocessors have had to increase their performances while keeping
their power envelope within tight bounds, as dictated by the needs of various markets: from
the ultra-low power requirements of the IoT, to the electrical power consumption budget
in enterprise servers, by way of passive cooling and day-long battery life in mobile devices.
This high demand for power-efficient processors, coupled with the limitations of technology
scaling—which no longer provides improved performances at constant power densities—, is
leading designers to explore new microarchitectures with the goal of pulling more performances
out of a fixed power budget. This work enters into this trend by proposing a new
processor microarchitecture, called KeyRing, having a low-power design intent.
The switching activity of integrated circuits—i.e. transistors switching on and off—directly
affects their dynamic power consumption. Circuit-level design techniques such as clock-gating
are widely adopted as they dramatically reduce the impact of the global clock in synchronous
circuits, which constitutes the main source of switching activity. The KeyRing microarchitecture
presented in this work uses an asynchronous clocking scheme that relies on decentralized
synchronization mechanisms to reduce the switching activity of circuits. It is derived from
the AnARM, a power-efficient ARM processor developed by Octasic using an ad hoc asynchronous
microarchitecture. Although it delivers better power-efficiency than synchronous
alternatives, it is for the most part incompatible with standard timing-driven synthesis and
Static Timing Analysis (STA). In addition, its design style does not fit well within the existing
asynchronous design paradigms. This work lays the foundations for a more rigorous
definition of this rather unorthodox design style, using circuits and methods coming from the
asynchronous literature. The resulting KeyRing microarchitecture is developed in combination
with Electronic Design Automation (EDA) methods that alleviate incompatibility issues
related to ad hoc clocking, enabling timing-driven optimizations and verifications of KeyRing
circuits using industry-standard design flows. In addition to bridging the gap with standard
design practices, this work also proposes comprehensive experimental protocols that aims to
strengthen the causal relation between the reported asynchronous microarchitecture and a
reduced power consumption compared with synchronous alternatives.
The main achievement of this work is a framework that enables the architectural exploration
of circuits using the KeyRing microarchitecture
Design methodology and productivity improvement in high speed VLSI circuits
2017 Spring.Includes bibliographical references.To view the abstract, please see the full text of the document