5 research outputs found
Some unusual micropipeline circuits
Journal ArticleWe present a few unusual Micropipelines (Sutherland, CACM, September 1989) that employ the Muller C-ELEMENT or an extension of the C-ELEMENT called LOCKC (Liebchen and Gopalakrishnan, ICCD, 1992). We first describe two variations of the two-dimensional Micropipeline structure realized using ordinary C-ELEMENTs. These micropipelines can be used to control wavefront arrays (S.-Y.Kung et.al, IEEE Computer, 1987). Next, we present a ring style arbiter realized using a LocKC-based one-dimensional micropipeline. Finally, we present a solution to the symmetric crossbar arbitration problem posed by Tamir and Chi (IEEE Trans. Parallel and Dist Systems, Jan '93) using a circuit that employs the two-dimensional micropipeline as well as the LOCKC. We present various circuits to solve the symmetric crossbar arbitration problem, including ones that consume very little power when idling
Recommended from our members
Design and performance optimization of asynchronous networks-on-chip
As digital systems continue to grow in complexity, the design of conventional synchronous systems is facing unprecedented challenges. The number of transistors on individual chips is already in the multi-billion range, and a greatly increasing number of components are being integrated onto a single chip. As a consequence, modern digital designs are under strong time-to-market pressure, and there is a critical need for composable design approaches for large complex systems.
In the past two decades, networks-on-chip (NoC’s) have been a highly active research area. In a NoC-based system, functional blocks are first designed individually and may run at different clock rates. These modules are then connected through a structured network for on-chip global communication. However, due to the rigidity of centrally-clocked NoC’s, there have been bottlenecks of system scalability, energy and performance, which cannot be easily solved with synchronous approaches. As a result, there has been significant recent interest in combing the notion of asynchrony with NoC designs. Since the NoC approach inherently separates the communication infrastructure, and its timing, from computational elements, it is a natural match for an asynchronous paradigm. Asynchronous NoC’s, therefore, enable a modular and extensible system composition for an ‘object-orient’ design style.
The thesis aims to significantly advance the state-of-art and viability of asynchronous and globally-asynchronous locally-synchronous (GALS) networks-on-chip, to enable high-performance and low-energy systems. The proposed asynchronous NoC’s are nearly entirely based on standard cells, which eases their integration into industrial design flows. The contributions are instantiated in three different directions.
First, practical acceleration techniques are proposed for optimizing the system latency, in order to break through the latency bottleneck in the memory interfaces of many on-chip parallel processors. Novel asynchronous network protocols are proposed, along with concrete NoC designs. A new concept, called ‘monitoring network’, is introduced. Monitoring networks are lightweight shadow networks used for fast-forwarding anticipated traffic information, ahead of the actual packet traffic. The routers are therefore allowed to initiate and perform arbitration and channel allocation in advance. The technique is successfully applied to two topologies which belong to two different categories – a variant mesh-of-trees (MoT) structure and a 2D-mesh topology. Considerable and stable latency improvements are observed across a wide range of traffic patterns, along with moderate throughput gains.
Second, for the first time, a high-performance and low-power asynchronous NoC router is compared directly to a leading commercial synchronous counterpart in an advanced industrial technology. The asynchronous router design shows significant performance improvements, as well as area and power savings. The proposed asynchronous router integrates several advanced techniques, including a low-latency circular FIFO for buffer design, and a novel end-to-end credit-based virtual channel (VC) flow control. In addition, a semi-automated design flow is created, which uses portions of a standard synchronous tool flow.
Finally, a high-performance multi-resource asynchronous arbiter design is developed. This small but important component can be directly used in existing asynchronous NoC’s for performance optimization. In addition, this standalone design promises use in opening up new NoC directions, as well as for general use in parallel systems. In the proposed arbiter design, the allocation of a resource to a client is divided into several steps. Multiple successive client-resource pairs can be selected rapidly in pipelined sequence, and the completion of the assignments can overlap in parallel.
In sum, the thesis provides a set of advanced design solutions for performance optimization of asynchronous and GALS networks-on-chip. These solutions are at different levels, from network protocols, down to router- and component-level optimizations, which can be directly applied to existing basic asynchronous NoC designs to provide a leap in performance improvement
Doctor of Philosophy
dissertationSparse matrix codes are found in numerous applications ranging from iterative numerical solvers to graph analytics. Achieving high performance on these codes has however been a significant challenge, mainly due to array access indirection, for example, of the form A[B[i]]. Indirect accesses make precise dependence analysis impossible at compile-time, and hence prevent many parallelizing and locality optimizing transformations from being applied. The expert user relies on manually written libraries to tailor the sparse code and data representations best suited to the target architecture from a general sparse matrix representation. However libraries have limited composability, address very specific optimization strategies, and have to be rewritten as new architectures emerge. In this dissertation, we explore the use of the inspector/executor methodology to accomplish the code and data transformations to tailor high performance sparse matrix representations. We devise and embed abstractions for such inspector/executor transformations within a compiler framework so that they can be composed with a rich set of existing polyhedral compiler transformations to derive complex transformation sequences for high performance. We demonstrate the automatic generation of inspector/executor code, which orchestrates code and data transformations to derive high performance representations for the Sparse Matrix Vector Multiply kernel in particular. We also show how the same transformations may be integrated into sparse matrix and graph applications such as Sparse Matrix Matrix Multiply and Stochastic Gradient Descent, respectively. The specific constraints of these applications, such as problem size and dependence structure, necessitate unique sparse matrix representations that can be realized using our transformations. Computations such as Gauss Seidel, with loop carried dependences at the outer most loop necessitate different strategies for high performance. Specifically, we organize the computation into level sets or wavefronts of irregular size, such that iterations of a wavefront may be scheduled in parallel but different wavefronts have to be synchronized. We demonstrate automatic code generation of high performance inspectors that do explicit dependence testing and level set construction at runtime, as well as high performance executors, which are the actual parallelized computations. For the above sparse matrix applications, we automatically generate inspector/executor code comparable in performance to manually tuned libraries
Multi-resource approach to asynchronous SoC : design and tool support
As silicon cost reduces, the demands for higher performance and lower power consumption are ever increasing. The ability to dynamically control the number of resources employed can help balance and optimise a system in terms of its throughput, power consumption, and resilience to errors. The management of multiple resources requires building more advanced resource allocation logic than traditional 1-of-N arbiters posing the need for the efficient design flow supporting both the design and verification of such systems. Networks-on-Chip provide a good application example of distributed arbitration, in which the processor cores needing to transmit data are the clients; and the point-to-point links are the resources managed by routers. Building fast and smart arbiters can greatly benefit such systems in providing efficient and reliable communication service. In this thesis, a multi-resource arbiter was developed based on the Signal Transition Graph (STG) development flow. The arbiter distributes multiple active interchangeable resources that initiate requests when they are ready to be used. It supports concurrent resource utilization, which benefits creating asynchronous Multiple-Input-Multiple- Output (MIMO) queues. In order to deal with designs of higher complexity, an arbiter-oriented design flow is proposed. The flow is based on digital circuit components that are represented internally as STGs. This allows designing circuits without directly working with STGs but allowing their use for synthesis and formal verification. The interfaces for modelling, simulation, and visual model representation of the flow were implemented based on the existing modelling framework. As a result, the verification phase of the flow has helped to find hazards in existing Priority arbiter implementations. Finally, based on the logic-gate flow, the structure of a low-latency general purpose arbiter was developed. This design supports a wide variety of arbitration problems including the multi-resource management, which can benefit building NoCs employing complex and adaptive routing techniques.EThOS - Electronic Theses Online ServiceEPSRC grant GR/E044662/1 (STEP)GBUnited Kingdo