467 research outputs found
Recommended from our members
Design and performance optimization of asynchronous networks-on-chip
As digital systems continue to grow in complexity, the design of conventional synchronous systems is facing unprecedented challenges. The number of transistors on individual chips is already in the multi-billion range, and a greatly increasing number of components are being integrated onto a single chip. As a consequence, modern digital designs are under strong time-to-market pressure, and there is a critical need for composable design approaches for large complex systems.
In the past two decades, networks-on-chip (NoCâs) have been a highly active research area. In a NoC-based system, functional blocks are first designed individually and may run at different clock rates. These modules are then connected through a structured network for on-chip global communication. However, due to the rigidity of centrally-clocked NoCâs, there have been bottlenecks of system scalability, energy and performance, which cannot be easily solved with synchronous approaches. As a result, there has been significant recent interest in combing the notion of asynchrony with NoC designs. Since the NoC approach inherently separates the communication infrastructure, and its timing, from computational elements, it is a natural match for an asynchronous paradigm. Asynchronous NoCâs, therefore, enable a modular and extensible system composition for an âobject-orientâ design style.
The thesis aims to significantly advance the state-of-art and viability of asynchronous and globally-asynchronous locally-synchronous (GALS) networks-on-chip, to enable high-performance and low-energy systems. The proposed asynchronous NoCâs are nearly entirely based on standard cells, which eases their integration into industrial design flows. The contributions are instantiated in three different directions.
First, practical acceleration techniques are proposed for optimizing the system latency, in order to break through the latency bottleneck in the memory interfaces of many on-chip parallel processors. Novel asynchronous network protocols are proposed, along with concrete NoC designs. A new concept, called âmonitoring networkâ, is introduced. Monitoring networks are lightweight shadow networks used for fast-forwarding anticipated traffic information, ahead of the actual packet traffic. The routers are therefore allowed to initiate and perform arbitration and channel allocation in advance. The technique is successfully applied to two topologies which belong to two different categories â a variant mesh-of-trees (MoT) structure and a 2D-mesh topology. Considerable and stable latency improvements are observed across a wide range of traffic patterns, along with moderate throughput gains.
Second, for the first time, a high-performance and low-power asynchronous NoC router is compared directly to a leading commercial synchronous counterpart in an advanced industrial technology. The asynchronous router design shows significant performance improvements, as well as area and power savings. The proposed asynchronous router integrates several advanced techniques, including a low-latency circular FIFO for buffer design, and a novel end-to-end credit-based virtual channel (VC) flow control. In addition, a semi-automated design flow is created, which uses portions of a standard synchronous tool flow.
Finally, a high-performance multi-resource asynchronous arbiter design is developed. This small but important component can be directly used in existing asynchronous NoCâs for performance optimization. In addition, this standalone design promises use in opening up new NoC directions, as well as for general use in parallel systems. In the proposed arbiter design, the allocation of a resource to a client is divided into several steps. Multiple successive client-resource pairs can be selected rapidly in pipelined sequence, and the completion of the assignments can overlap in parallel.
In sum, the thesis provides a set of advanced design solutions for performance optimization of asynchronous and GALS networks-on-chip. These solutions are at different levels, from network protocols, down to router- and component-level optimizations, which can be directly applied to existing basic asynchronous NoC designs to provide a leap in performance improvement
Dataflow computers: a tutorial and survey
Journal ArticleThe demand for very high performance computer has encouraged some researchers in the computer science field to consider alternatives to the conventional notions of program and computer organization. The dataflow computer is one attempt to form a new collection of consistent systems ideas to improve both computer performance and to alleviate the software design problems induced by the construction of highly concurrent programs
Exploration and Design of Power-Efficient Networked Many-Core Systems
Multiprocessing is a promising solution to meet the requirements of near future applications. To get full benefit from parallel processing, a manycore system needs efficient, on-chip communication architecture. Networkon- Chip (NoC) is a general purpose communication concept that offers highthroughput, reduced power consumption, and keeps complexity in check by a regular composition of basic building blocks. This thesis presents power efficient communication approaches for networked many-core systems. We address a range of issues being important for designing power-efficient manycore systems at two different levels: the network-level and the router-level.
From the network-level point of view, exploiting state-of-the-art concepts such as Globally Asynchronous Locally Synchronous (GALS), Voltage/ Frequency Island (VFI), and 3D Networks-on-Chip approaches may be a solution to the excessive power consumption demanded by todayâs and future many-core systems. To this end, a low-cost 3D NoC architecture, based on high-speed GALS-based vertical channels, is proposed to mitigate high peak temperatures, power densities, and area footprints of vertical interconnects in 3D ICs. To further exploit the beneficial feature of a negligible inter-layer distance of 3D ICs, we propose a novel hybridization scheme for inter-layer communication. In addition, an efficient adaptive routing algorithm is presented which enables congestion-aware and reliable communication for the hybridized NoC architecture. An integrated monitoring and management platform on top of this architecture is also developed in order to implement more scalable power optimization techniques.
From the router-level perspective, four design styles for implementing power-efficient reconfigurable interfaces in VFI-based NoC systems are proposed. To enhance the utilization of virtual channel buffers and to manage their power consumption, a partial virtual channel sharing method for NoC routers is devised and implemented.
Extensive experiments with synthetic and real benchmarks show significant power savings and mitigated hotspots with similar performance compared to latest NoC architectures. The thesis concludes that careful codesigned elements from different network levels enable considerable power savings for many-core systems.Siirretty Doriast
Recommended from our members
Contributions to the Design of Asynchronous Macromodular Systems
In this thesis, I advocate the use of macromodules to design and build robust and performance-competitive asynchronous systems. The contributions of the work relate to different aspects of the design of asynchronous macromodular systems. First, an architectural optimization for 4-phase systems is introduced. The goal of the optimization is to increase the performance of a system by increasing the level of concurrent activity in the sequencing of data processing stages. In particular, three new asynchronous sequencers are designed, which increase the throughput of the system. Existing asynchronous data paths do not operate correctly at this increased level of concurrency: data hazards may result. Interlock mechanisms are introduced to insure correct operation. The technique can also be regarded as a low-power optimization: The increased throughput can be traded for a significant reduction in the power consumption of the entire system. SPICE simulation results show that the new sequencers allow roughly twice the throughput of non-concurrent sequencers. The simulations also show that, after voltage scaling, energy dissipation is reduced by a factor of 2.5. Second, the use of pulses for efficient inter-module synchronization is introduced. The idea is complemented with the definition of a pulse-mode handshake protocol and the characterization of Pulse-Burst Operation (PBO), an important extension to traditional pulse-mode operation. Also, a basic set of macromodules, that efficiently implement control operations such as sequencing, selection, iteration, concurrency control, resource sharing, and arbitration is presented. Modules for interfacing pulse-mode circuits with traditional 2-phase and 4-phase circuits are also included in the set. Finally, the design of a packet switch is used to demonstrate the viability of pulse-mode macromodules to implement complex, high performance systems. The switch organization, its asynchronous operation, and the low control overhead introduced by pulse-mode macromodules result in a design that can handle 2.4 times the target throughput of 155 Mbits/Sec. Also, the switch is characterized by very low input-to-output latency. These results suggest that pulse-mode macromodules can keep control overhead low without introducing complex, unsafe timing considerations, two necessary conditions to achieve robust, performance-competitive systems
Spatial parallelism in the routers of asynchronous on-chip networks
State-of-the-art multi-processor systems-on-chip use on-chip networks as their communication fabric. Although most on-chip networks are implemented synchronously, asynchronous on-chip networks have several advantages over their synchronous counterparts. Timing division multiplexing (TDM) flow control methods have been utilized in asynchronous on-chip networks extensively. The synchronization required by TDM leads to significant speed penalties. Compared with using TDM methods, spatial parallelism methods, such as the spatial division multiplexing (SDM) flow control method, achieve better network throughput with less area overhead.This thesis proposes several techniques to increase spatial parallelism in the routers of asynchronous on-chip networks.Channel slicing is a new pipeline structure that alleviates the speed penalty by removing the synchronization among bit-level data pipelines. It is also found out that the lookahead pipeline using early evaluated acknowledgement can be used in routers to further improve speed.SDM is a new flow control method proposed for asynchronous on-chip networks. It improves network throughput without introducing synchronization among buffers of different frames, which is required by TDM methods. It is also found that the area overhead of SDM is smaller than the virtual channel (VC) flow control method -- the most used TDM method. The major design problem of SDM is the area consuming crossbars. A novel 2-stage Clos switch structure is proposed to replace the crossbar in SDM routers, which significantly reduces the area overhead. This Clos switch is dynamically reconfigured by a new asynchronous Clos scheduler.Several asynchronous SDM routers are implemented using these new techniques. An asynchronous VC router is also reproduced for comparison. Performance analyses show that the SDM routers outperform the VC router in throughput, area overhead and energy efficiency.EThOS - Electronic Theses Online ServiceGBUnited Kingdo
Spatial parallelism in the routers of asynchronous on-chip networks
State-of-the-art multi-processor systems-on-chip use on-chip networks as their communication fabric. Although most on-chip networks are implemented synchronously, asynchronous on-chip networks have several advantages over their synchronous counterparts. Timing division multiplexing (TDM) flow control methods have been utilized in asynchronous on-chip networks extensively. The synchronization required by TDM leads to significant speed penalties. Compared with using TDM methods, spatial parallelism methods, such as the spatial division multiplexing (SDM) flow control method, achieve better network throughput with less area overhead.This thesis proposes several techniques to increase spatial parallelism in the routers of asynchronous on-chip networks.Channel slicing is a new pipeline structure that alleviates the speed penalty by removing the synchronization among bit-level data pipelines. It is also found out that the lookahead pipeline using early evaluated acknowledgement can be used in routers to further improve speed.SDM is a new flow control method proposed for asynchronous on-chip networks. It improves network throughput without introducing synchronization among buffers of different frames, which is required by TDM methods. It is also found that the area overhead of SDM is smaller than the virtual channel (VC) flow control method -- the most used TDM method. The major design problem of SDM is the area consuming crossbars. A novel 2-stage Clos switch structure is proposed to replace the crossbar in SDM routers, which significantly reduces the area overhead. This Clos switch is dynamically reconfigured by a new asynchronous Clos scheduler.Several asynchronous SDM routers are implemented using these new techniques. An asynchronous VC router is also reproduced for comparison. Performance analyses show that the SDM routers outperform the VC router in throughput, area overhead and energy efficiency.EThOS - Electronic Theses Online ServiceGBUnited Kingdo
Multi-resource approach to asynchronous SoC : design and tool support
As silicon cost reduces, the demands for higher performance and lower power consumption are ever increasing. The ability to dynamically control the number of resources employed can help balance and optimise a system in terms of its throughput, power consumption, and resilience to errors. The management of multiple resources requires building more advanced resource allocation logic than traditional 1-of-N arbiters posing the need for the efficient design flow supporting both the design and verification of such systems. Networks-on-Chip provide a good application example of distributed arbitration, in which the processor cores needing to transmit data are the clients; and the point-to-point links are the resources managed by routers. Building fast and smart arbiters can greatly benefit such systems in providing efficient and reliable communication service. In this thesis, a multi-resource arbiter was developed based on the Signal Transition Graph (STG) development flow. The arbiter distributes multiple active interchangeable resources that initiate requests when they are ready to be used. It supports concurrent resource utilization, which benefits creating asynchronous Multiple-Input-Multiple- Output (MIMO) queues. In order to deal with designs of higher complexity, an arbiter-oriented design flow is proposed. The flow is based on digital circuit components that are represented internally as STGs. This allows designing circuits without directly working with STGs but allowing their use for synthesis and formal verification. The interfaces for modelling, simulation, and visual model representation of the flow were implemented based on the existing modelling framework. As a result, the verification phase of the flow has helped to find hazards in existing Priority arbiter implementations. Finally, based on the logic-gate flow, the structure of a low-latency general purpose arbiter was developed. This design supports a wide variety of arbitration problems including the multi-resource management, which can benefit building NoCs employing complex and adaptive routing techniques.EThOS - Electronic Theses Online ServiceEPSRC grant GR/E044662/1 (STEP)GBUnited Kingdo
Design of variation-tolerant synchronizers for multiple clock and voltage domains
PhD ThesisParametric variability increasingly affects the performance of electronic circuits as
the fabrication technology has reached the level of 32nm and beyond. These
parameters may include transistor Process parameters (such as threshold
voltage), supply Voltage and Temperature (PVT), all of which could have a
significant impact on the speed and power consumption of the circuit, particularly
if the variations exceed the design margins. As systems are designed with more
asynchronous protocols, there is a need for highly robust synchronizers and
arbiters. These components are often used as interfaces between communication
links of different timing domains as well as sampling devices for asynchronous
inputs coming from external components. These applications have created a need
for new robust designs of synchronizers and arbiters that can tolerate process,
voltage and temperature variations.
The aim of this study was to investigate how synchronizers and arbiters should be
designed to tolerate parametric variations. All investigations focused mainly on
circuit-level and transistor level designs and were modeled and simulated in the
UMC90nm CMOS technology process. Analog simulations were used to measure
timing parameters and power consumption along with a âMonte Carloâ statistical
analysis to account for process variations.
Two main components of synchronizers and arbiters were primarily investigated:
flip-flop and mutual-exclusion element (MUTEX). Both components can violate the
input timing conditions, setup and hold window times, which could cause
metastability inside their bistable elements and possibly end in failures. The
mean-time between failures is an important reliability feature of any synchronizer
delay through the synchronizer.
The MUTEX study focused on the classical circuit, in addition to a number of
tolerance, based on increasing internal gain by adding current sources, reducing
the capacitive loading, boosting the transconductance of the latch, compensating
the existing Miller capacitance, and adding asymmetry to maneuver the metastable
point. The results showed that some circuits had little or almost no improvements,
while five techniques showed significant improvements by reducing Ď and
maintaining high tolerance.
Three design approaches are proposed to provide variation-tolerant
synchronizers. wagging synchronizer proposed to First, the is significantly
increase reliability over that of the conventional two flip-flop synchronizer. The
robustness of the wagging technique can be enhanced by using robust Ď latches or
adding one more cycle of synchronization. The second approach is the
Metastability Auto-Detection and Correction (MADAC) latch which relies on swiftly
detecting a metastable event and correcting it by enforcing the previously stored
logic value. This technique significantly reduces the resolution time down from
uncertain
synchronization technique is proposed to transfer signals between Multiple-
Voltage Multiple-Clock Domains (MVD/MCD) that do not require conventional
level-shifters between the domains or multiple power supplies within each
domain. This interface circuit uses a synchronous set and feedback reset protocol
which provides level-shifting and synchronization of all signals between the
domains, from a wide range of voltage-supplies and clock frequencies.
Overall, synchronizer circuits can tolerate variations to a greater extent by
employing the wagging technique or using a MADAC latch, while MUTEX tolerance
can suffice with small circuit modifications. Communication between MVD/MCD
can be achieved by an asynchronous handshake
without a need for adding level-shifters.The Saudi Arabian Embassy in London,
Umm Al-Qura University, Saudi Arabi
Driving the Network-on-Chip Revolution to Remove the Interconnect Bottleneck in Nanoscale Multi-Processor Systems-on-Chip
The sustained demand for faster, more powerful chips has been met by the
availability of chip manufacturing processes allowing for the integration of increasing
numbers of computation units onto a single die. The resulting outcome,
especially in the embedded domain, has often been called SYSTEM-ON-CHIP
(SoC) or MULTI-PROCESSOR SYSTEM-ON-CHIP (MP-SoC).
MPSoC design brings to the foreground a large number of challenges, one of
the most prominent of which is the design of the chip interconnection. With a
number of on-chip blocks presently ranging in the tens, and quickly approaching
the hundreds, the novel issue of how to best provide on-chip communication
resources is clearly felt.
NETWORKS-ON-CHIPS (NoCs) are the most comprehensive and scalable
answer to this design concern. By bringing large-scale networking concepts to
the on-chip domain, they guarantee a structured answer to present and future
communication requirements. The point-to-point connection and packet switching
paradigms they involve are also of great help in minimizing wiring overhead
and physical routing issues. However, as with any technology of recent inception,
NoC design is still an evolving discipline. Several main areas of interest
require deep investigation for NoCs to become viable solutions:
⢠The design of the NoC architecture needs to strike the best tradeoff among
performance, features and the tight area and power constraints of the onchip
domain.
⢠Simulation and verification infrastructure must be put in place to explore,
validate and optimize the NoC performance.
⢠NoCs offer a huge design space, thanks to their extreme customizability in
terms of topology and architectural parameters. Design tools are needed
to prune this space and pick the best solutions.
⢠Even more so given their global, distributed nature, it is essential to evaluate
the physical implementation of NoCs to evaluate their suitability for
next-generation designs and their area and power costs.
This dissertation performs a design space exploration of network-on-chip architectures,
in order to point-out the trade-offs associated with the design of
each individual network building blocks and with the design of network topology
overall. The design space exploration is preceded by a comparative analysis
of state-of-the-art interconnect fabrics with themselves and with early networkon-
chip prototypes. The ultimate objective is to point out the key advantages
that NoC realizations provide with respect to state-of-the-art communication
infrastructures and to point out the challenges that lie ahead in order to make
this new interconnect technology come true. Among these latter, technologyrelated
challenges are emerging that call for dedicated design techniques at all
levels of the design hierarchy. In particular, leakage power dissipation, containment
of process variations and of their effects. The achievement of the above
objectives was enabled by means of a NoC simulation environment for cycleaccurate
modelling and simulation and by means of a back-end facility for the
study of NoC physical implementation effects. Overall, all the results provided
by this work have been validated on actual silicon layout
Middleware services for distributed virtual environments
PhD ThesisDistributed Virtual Environments (DVEs) are virtual environments which allow
dispersed users to interact with each other and the virtual world through the
underlying network.
Scalability is a major challenge in building a successful DVE, which is directly
affected by the volume of message exchange. Different techniques have been
deployed to reduce the volume of message exchange in order to support large
numbers of simultaneous participants in a DVE. Interest management is a
popular technique for filtering unnecessary message exchange between users.
The rationale behind interest management is to resolve the "interests" of users
and decide whether messages should be exchanged between them. There are
three basic interest management approaches: region-based, aura-based and
hybrid approaches. However, if the time taken for an interest management
approach to determine interests is greater than the duration of the interaction, it
is not possible to guarantee interactions will occur correctly or at all. This is
termed the Missed Interaction Problem, which all existing interest management
approaches are susceptible to.
This thesis provides a new aura-based interest management approach, termed
Predictive Interest management (PIM), to alleviate the missed interaction
problem. PIM uses an enlarged aura to detect potential aura-intersections and
iii
initiate message exchange. It utilises variable message exchange frequencies,
proportional to the intersection degree of the objects' expanded auras, to restrict
bandwidth usage. This thesis provides an experimental system, the PIM system,
which couples predictive interest management with the de-centralised server
communication model. It utilises the Common Object Request Broker
Architecture (CORBA) middleware standard to provide an interoperable
middleware for DVEs. Experimental results are provided to demonstrate that
PIM provides a scalable interest management approach which alleviates the
missed interaction problem
- âŚ