855 research outputs found
Recommended from our members
On Multicast in Asynchronous Networks-on-Chip: Techniques, Architectures, and FPGA Implementation
In this era of exascale computing, conventional synchronous design techniques are facing unprecedented challenges. The consumer electronics market is replete with many-core systems in the range of 16 cores to thousands of cores on chip, integrating multi-billion transistors. However, with this ever increasing complexity, the traditional design approaches are facing key issues such as increasing chip power, process variability, aging, thermal problems, and scalability. An alternative paradigm that has gained significant interest in the last decade is asynchronous design. Asynchronous designs have several potential advantages: they are naturally energy proportional, burning power only when active, do not require complex clock distribution, are robust to different forms of variability, and provide ease of composability for heterogeneous platforms. Networks-on-chip (NoCs) is an interconnect paradigm that has been introduced to deal with the ever-increasing system complexity. NoCs provide a distributed, scalable, and efficient interconnect solution for today’s many-core systems. Moreover, NoCs are a natural match with asynchronous design techniques, as they separate communication infrastructure and timing from the computational elements. To this end, globally-asynchronous locally-synchronous (GALS) systems that interconnect multiple processing cores, operating at different clock speeds, using an asynchronous NoC, have gained significant interest. While asynchronous NoCs have several advantages, they also face a key challenge of supporting new types of traffic patterns. Once such pattern is multicast communication, where a source sends packets to arbitrary number of destinations. Multicast is not only common in parallel computing, such as for cache coherency, but also for emerging areas such as neuromorphic computing. This important capability has been largely missing from asynchronous NoCs. This thesis introduces several efficient multicast solutions for these interconnects. In particular, techniques, and network architectures are introduced to support high-performance and low-power multicast. Two leading network topologies are the focus: a variant mesh-of-trees (MoT) and a 2D mesh. In addition, for a more realistic implementation and analysis, as well as significantly advancing the field of asynchronous NoCs, this thesis also targets synthesis of these NoCs on commercial FPGAs. While there has been significant advances in FPGA technologies, there has been only limited research on implementing asynchronous NoCs on FPGAs. To this end, a systematic computeraided design (CAD) methodology has been introduced to efficiently and safely map asynchronous NoCs on FPGAs. Overall, this thesis makes the following three contributions. The first contribution is a multicast solution for a variant MoT network topology. This topology consists of simple low-radix switches, and has been used in high-performance computing platforms. A novel local speculation technique is introduced, where a subset of the network’s switches are speculative that always broadcast every packet. These switches are very simple and have high performance. Speculative switches are surrounded by non-speculative ones that route packets based on their destinations and also throttle any redundant copies created by the former. This hybrid network architecture achieved significant performance and power benefits over other multicast approaches. The second contribution is a multicast solution for a 2D-mesh topology, which is more complex with higher-radix switches and also is more commonly used. A novel continuous-time replication strategy is introduced to optimize the critical multi-way forking operation of a multicast transmission. In this technique, a multicast packet is first stored in an input port of a switch, from where it is sent through distinct output ports towards different destinations concurrently, at each output’s own rate and in continuous time. This strategy is shown to have significant latency and energy benefits over an approach that performs multicast using multiple distinct serial unicasts to each destination. Finally, a systematic CAD methodology is introduced to synthesize asynchronous NoCs on commercial FPGAs. A two-fold goal is targeted: correctness and high performance. For ease of implementation, only existing FPGA synthesis tools are used. Moreover, since asynchronous NoCs involve special asynchronous components, a comprehensive guide is introduced to map these elements correctly and efficiently. Two asynchronous NoC switches are synthesized using the proposed approach on a leading Xilinx FPGA in 28 nm: one that only handles unicast, and the other that also supports multicast. Both showed significant energy benefits with some performance gains over a state-of-the-art synchronous switch
TFI-FTS: An efficient transient fault injection and fault-tolerant system for asynchronous circuits on FPGA platform
Designing VLSI digital circuits is challenging tasks because of testing the circuits concerning design time. The reliability and productivity of digital integrated circuits are primarily affected by the defects in the manufacturing process or systems. If the defects are more in the systems, which leads the fault in the systems. The fault tolerant systems are necessary to overcome the faults in the VLSI digital circuits. In this research article, an asynchronous circuits based an effective transient fault injection (TFI) and fault tolerant system (FTS) are modelled. The TFI system generates the faults based on BMA based LFSR with faulty logic insertion and one hot encoded register. The BMA based LFSR reduces the hardware complexity with less power consumption on-chip than standard LFSR method. The FTS uses triple mode redundancy (TMR) based majority voter logic (MVL) to tolerant the faults for asynchronous circuits. The benchmarked 74X-series circuits are considered as an asynchronous circuit for TMR logic. The TFI-FTS module is modeled using Verilog-HDL on Xilinx-ISE and synthesized on hardware platform. The Performance parameters are tabulated for TFI-FTS based asynchronous circuits. The performance of TFI-FTS Module is analyzed with 100% fault coverage. The fault coverage is validated using functional simulation of each asynchronous circuit with fault injection in TFI-FTS Module
Arquiteturas de Pipeline Assíncronas Register Less NULL Convention Logic (RL-NCL) Usando Portas Básicas
Asynchronous circuits is an alternative to design digital systems that is becoming the interest of many researchers in the digital design area mainly due to it’s low-power consumption and robustness. One of the most compelling design paradigms of asynchronous circuits is the NULL Convention Logic (NCL). The pipeline is a very common technique used in digital circuits to achieve high throughput. Although one can implement a pipeline using NCL gates, recent works have shown that register-less pipelines are possible using modified NCL gates. In this paper we propose two new Register-Less NCL (RL-NCL) pipeline architectures and two new methods to design NCL gates, which can be implemented even in Field Programmable Gate Arrays (FPGAs) or using the standard cells method. The new design of the proposed architecture was able to achieve an average area reduction of 27,32%, an average latency reduction of 14,1% and an average throughput increase of 5,54% comparing with the conventional NCL pipeline architecture.Los circuitos asíncronos son una alternativa para el diseño de sistemas digitales que se está convirtiendo en el interés de muchos investigadores en el área del diseño digital debido principalmente a su bajo consumo y robustez. Uno de los paradigmas de diseño más convincentes de los circuitos asíncronos es la NULL Convention Logic (NCL). La pipeline es una técnica muy común utilizada en circuitos digitales para lograr un alto rendimiento. Aunque se puede implementar una pipeline utilizando puertas NCL, trabajos recientes han demostrado que las pipelines sin registro son posibles utilizando puertas NCL modificadas. En este artículo, propusimos dos nuevas arquitecturas de pipeline Register-Less NCL (RL-NCL) y un paradigma de diseño, que pueden implementarse incluso en Field Programmable Gate Arrays (FPGA) o utilizando el método de celdas estándar. El nuevo diseño de la arquitectura propuesta logró una reducción media del área del 27,32%, una reducción media de la latencia del 14,1% y un aumento medio del rendimiento del 5,54% en comparación con la arquitectura de pipeline NCL convencional.Circuitos assíncronos é uma alternativa para projetar sistemas digitais que vem despertando o interesse de muitos pesquisadores na área de projeto digital principalmente devido ao seu baixo consumo de energia e robustez. Um dos paradigmas de projeto mais atraentes de circuitos assíncronos é o NULL Convention Logic (NCL). O pipeline é uma técnica muito comum usada em circuitos digitais para obter alto rendimento. Embora seja possível implementar um pipeline usando portas NCL, trabalhos recentes mostraram que pipelines sem registro são possíveis usando portas NCL modificadas. Neste artigo propomos duas novas arquiteturas de pipeline NCL Register-Less (RL-NCL) e dois novos métodos para projetar portas NCL, que podem ser implementadas até mesmo em Field Programmable Gate Arrays (FPGAs) ou usando o método de células padrão. O novo design da arquitetura proposta foi capaz de alcançar uma redução média de área de 27,32%, uma redução média de latência de 14,1% e um aumento médio de throughput de 5,54% em comparação com a arquitetura de pipeline NCL convencional
Track Extrapolation and Distribution for the CDF-II Trigger System
The CDF-II experiment is a multipurpose detector designed to study a wide
range of processes observed in the high energy proton-antiproton collisions
produced by the Fermilab Tevatron. With event rates greater than 1MHz, the
CDF-II trigger system is crucial for selecting interesting events for
subsequent analysis. This document provides an overview of the Track
Extrapolation System (XTRP), a component of the CDF-II trigger system. The XTRP
is a fully digital system that is utilized in the track-based selection of high
momentum lepton and heavy flavor signatures. The design of the XTRP system
includes five different custom boards utilizing discrete and FPGA technology
residing in a single VME crate. We describe the design, construction,
commissioning and operation of this system.Comment: 34 pages, 9 figures, submitted to Nucl.Inst.Meth.
Null convention logic circuits for asynchronous computer architecture
For most of its history, computer architecture has been able to benefit from a rapid scaling in semiconductor technology, resulting in continuous improvements to CPU design. During that period, synchronous logic has dominated because of its inherent ease of design and abundant tools. However, with the scaling of semiconductor processes into deep sub-micron and then to nano-scale dimensions, computer architecture is hitting a number of roadblocks such as high power and increased process variability. Asynchronous techniques can potentially offer many advantages compared to conventional synchronous design, including average case vs. worse case performance, robustness in the face of process and operating point variability and the ready availability of high performance, fine grained pipeline architectures. Of the many alternative approaches to asynchronous design, Null Convention Logic (NCL) has the advantage that its quasi delay-insensitive behavior makes it relatively easy to set up complex circuits without the need for exhaustive timing analysis. This thesis examines the characteristics of an NCL based asynchronous RISC-V CPU and analyses the problems with applying NCL to CPU design. While a number of university and industry groups have previously developed small 8-bit microprocessor architectures using NCL techniques, it is still unclear whether these offer any real advantages over conventional synchronous design. A key objective of this work has been to analyse the impact of larger word widths and more complex architectures on NCL CPU implementations. The research commenced by re-evaluating existing techniques for implementing NCL on programmable devices such as FPGAs. The little work that has been undertaken previously on FPGA implementations of asynchronous logic has been inconclusive and seems to indicate that asynchronous systems cannot be easily implemented in these devices. However, most of this work related to an alternative technique called bundled data, which is not well suited to FPGA implementation because of the difficulty in controlling and matching delays in a 'bundle' of signals. On the other hand, this thesis clearly shows that such applications are not only possible with NCL, but there are some distinct advantages in being able to prototype complex asynchronous systems in a field-programmable technology such as the FPGA. A large part of the value of NCL derives from its architectural level behavior, inherent pipelining, and optimization opportunities such as the merging of register and combina- tional logic functions. In this work, a number of NCL multiplier architectures have been analyzed to reveal the performance trade-offs between various non-pipelined, 1D and 2D organizations. Two-dimensional pipelining can easily be applied to regular architectures such as array multipliers in a way that is both high performance and area-efficient. It was found that the performance of 2D pipelining for small networks such as multipliers is around 260% faster than the equivalent non-pipelined design. However, the design uses 265% more transistors so the methodology is mainly of benefit where performance is strongly favored over area. A pipelined 32bit x 32bit signed Baugh-Wooley multiplier with Wallace-Tree Carry Save Adders (CSA), which is representative of a real design used for CPUs and DSPs, was used to further explore this concept as it is faster and has fewer pipeline stages compared to the normal array multiplier using Ripple-Carry adders (RCA). It was found that 1D pipelining with ripple-carry chains is an efficient implementation option but becomes less so for larger multipliers, due to the completion logic for which the delay time depends largely on the number of bits involved in the completion network. The average-case performance of ripple-carry adders was explored using random input vectors and it was observed that it offers little advantage on the smaller multiplier blocks, but this particular timing characteristic of asynchronous design styles be- comes increasingly more important as word size grows. Finally, this research has resulted in the development of the first 32-Bit asynchronous RISC-V CPU core. Called the Redback RISC, the architecture is a structure of pipeline rings composed of computational oscillations linked with flow completeness relationships. It has been written using NELL, a commercial description/synthesis tool that outputs standard Verilog. The Redback has been analysed and compared to two approximately equivalent industry standard 32-Bit synchronous RISC-V cores (PicoRV32 and Rocket) that are already fabricated and used in industry. While the NCL implementation is larger than both commercial cores it has similar performance and lower power compared to the PicoRV32. The implementation results were also compared against an existing NCL design tool flow (UNCLE), which showed how much the results of these implementation strategies differ. The Redback RISC has achieved similar level of throughput and 43% better power and 34% better energy compared to one of the synchronous cores with the same benchmark test and test condition such as input sup- ply voltage. However, it was shown that area is the biggest drawback for NCL CPU design. The core is roughly 2.5× larger than synchronous designs. On the other hand its area is still 2.9× smaller than previous designs using UNCLE tools. The area penalty is largely due to the unavoidable translation into a dual-rail topology when using the standard NCL cell library
- …