38 research outputs found

    Implementation of multi-CLB designs using quantum-dot cellular automata

    Get PDF
    CMOS scaling is currently facing a technological barrier. Novel technologies are being proposed to keep up with the need for computation power and speed. One of the proposed ideas is the quantum-dot cellular automata (QCA) technology. QCA uses quantum mechanical effects in the device at the molecular scale. QCA systems have the potential for low power, high density, and regularity. This thesis studies QCA devices and uses those devices to build a simple field programmable gate array (FPGA). The FPGA is a combination of multiple configure logical blocks (CLBs) tiled together. Most previous work on this area has focused on fixed logic and programmable interconnect. In contrast, the work at the Rochester Institute of Technology (RIT) has designed and simulated a configurable logic block (CLB) based on look-up tables (LUTs). This thesis presents a simple FPGA that consists of multiple copies of the CLB created by the RIT group. The FPGA is configured to emulate a ripple-carry adder and a bit-serial multiplier. The latency and throughput of both functions are analyzed. We employ a multilevel approach to design specification and simulation. QCADesigner software is used for layout and simulation of an individual CLB. For the FPGA, the high-level HDLQ Verilog library is used. This hybrid approach provides a high degree of confidence in reasonable simulation time

    Fault and Defect Tolerant Computer Architectures: Reliable Computing With Unreliable Devices

    Get PDF
    This research addresses design of a reliable computer from unreliable device technologies. A system architecture is developed for a fault and defect tolerant (FDT) computer. Trade-offs between different techniques are studied and yield and hardware cost models are developed. Fault and defect tolerant designs are created for the processor and the cache memory. Simulation results for the content-addressable memory (CAM)-based cache show 90% yield with device failure probabilities of 3 x 10(-6), three orders of magnitude better than non fault tolerant caches of the same size. The entire processor achieves 70% yield with device failure probabilities exceeding 10(-6). The required hardware redundancy is approximately 15 times that of a non-fault tolerant design. While larger than current FT designs, this architecture allows the use of devices much more likely to fail than silicon CMOS. As part of model development, an improved model is derived for NAND Multiplexing. The model is the first accurate model for small and medium amounts of redundancy. Previous models are extended to account for dependence between the inputs and produce more accurate results

    Matrix multiplication using quantum-dot cellular automata to implement conventional microelectronics

    Full text link
    Quantum-dot cellular automata (QCA) shows promise as a post silicon CMOS, low power computational technology. Nevertheless, to generalize QCA for next-generation digital devices, the ability to implement conventional programmable circuits based on NOR, AND, and OR gates is necessary. To this end, we devise a new QCA structure, the QCA matrix multiplier (MM), employing the standard Coulomb blocked, five quantum dot (QD) QCA cell and quasi-adiabatic switching for sequential data latching in the QCA cells. Our structure can multiply two N x M matrices, using one input and one bidirectional input/output data line. The calculation is highly parallelizable, and it is possible to achieve reduced calculation time in exchange for increasing numbers of parallel matrix multiplier units. We show convergent, ab initio simulation results using the Intercellular Hartree Approximation for one, three, and nine matrix multiplier units. The structure can generally implement any programmable logic array (PLA) or any matrix multiplication based operation.Comment: 14 pages, 9 figures, supplemental informatio

    Quantum-dot Cellular Automata: Review Paper

    Get PDF
    Quantum-dot Cellular Automata (QCA) is one of the most important discoveries that will be the successful alternative for CMOS technology in the near future. An important feature of this technique, which has attracted the attention of many researchers, is that it is characterized by its low energy consumption, high speed and small size compared with CMOS.  Inverter and majority gate are the basic building blocks for QCA circuits where it can design the most logical circuit using these gates with help of QCA wire. Due to the lack of availability of review papers, this paper will be a destination for many people who are interested in the QCA field and to know how it works and why it had taken lots of attention recentl

    MFPA: Mixed-Signal Field Programmable Array for Energy-Aware Compressive Signal Processing

    Get PDF
    Compressive Sensing (CS) is a signal processing technique which reduces the number of samples taken per frame to decrease energy, storage, and data transmission overheads, as well as reducing time taken for data acquisition in time-critical applications. The tradeoff in such an approach is increased complexity of signal reconstruction. While several algorithms have been developed for CS signal reconstruction, hardware implementation of these algorithms is still an area of active research. Prior work has sought to utilize parallelism available in reconstruction algorithms to minimize hardware overheads; however, such approaches are limited by the underlying limitations in CMOS technology. Herein, the MFPA (Mixed-signal Field Programmable Array) approach is presented as a hybrid spin-CMOS reconfigurable fabric specifically designed for implementation of CS data sampling and signal reconstruction. The resulting fabric consists of 1) slice-organized analog blocks providing amplifiers, transistors, capacitors, and Magnetic Tunnel Junctions (MTJs) which are configurable to achieving square/square root operations required for calculating vector norms, 2) digital functional blocks which feature 6-input clockless lookup tables for computation of matrix inverse, and 3) an MRAM-based nonvolatile crossbar array for carrying out low-energy matrix-vector multiplication operations. The various functional blocks are connected via a global interconnect and spin-based analog-to-digital converters. Simulation results demonstrate significant energy and area benefits compared to equivalent CMOS digital implementations for each of the functional blocks used: this includes an 80% reduction in energy and 97% reduction in transistor count for the nonvolatile crossbar array, 80% standby power reduction and 25% reduced area footprint for the clockless lookup tables, and roughly 97% reduction in transistor count for a multiplier built using components from the analog blocks. Moreover, the proposed fabric yields 77% energy reduction compared to CMOS when used to implement CS reconstruction, in addition to latency improvements

    Towards FPGA hardware in the loop for QCA simulation

    Get PDF
    As transistors begin to hit raw physical limits and performance barriers, other technologies are being researched to potentially replace conventional integrated circuit technology. Quantum-dot Cellular Automata (QCA) is one such technology which executes computations using coulomb interactions and quantum-mechanical effects. Part of this research is pursuant to the design of circuits which exploit QCA technology and take advantage of what it has to offer. These circuits must be simulated to ensure their functionality and help prove the viability of QCA. These simulations, like many scientific computing applications, can take a long time to complete; hours or days, depending on their size and complexity. Many scientific applications have benefitted from research into Field Programmable Gate Array (FPGA) application development, which has been used to accelerate the speed at which such simulations execute. This thesis investigates the possibility of using FPGAs to accelerate the simulation of QCA circuits. The hardware developed is a streaming type architecture using floating point arithmetic and hardware/software techniques. Hardware implementation shows the system to run slower than the existing software code, but demonstrates the ability to simulate a small QCA circuit. Analysis of the design reveals good potential for achieving speedup, and an alternate design is proposed to improve the execution time. In the course of this work, improvements to the existing software are also developed and contributed to the community

    An Energy-Efficient Reconfigurable DTLS Cryptographic Engine for Securing Internet-of-Things Applications

    Full text link
    This paper presents the first hardware implementation of the Datagram Transport Layer Security (DTLS) protocol to enable end-to-end security for the Internet of Things (IoT). A key component of this design is a reconfigurable prime field elliptic curve cryptography (ECC) accelerator, which is 238x and 9x more energy-efficient compared to software and state-of-the-art hardware respectively. Our full hardware implementation of the DTLS 1.3 protocol provides 438x improvement in energy-efficiency over software, along with code size and data memory usage as low as 8 KB and 3 KB respectively. The cryptographic accelerators are coupled with an on-chip low-power RISC-V processor to benchmark applications beyond DTLS with up to two orders of magnitude energy savings. The test chip, fabricated in 65 nm CMOS, demonstrates hardware-accelerated DTLS sessions while consuming 44.08 uJ per handshake, and 0.89 nJ per byte of encrypted data at 16 MHz and 0.8 V.Comment: Published in IEEE Journal of Solid-State Circuits (JSSC

    Architectural Solutions for NanoMagnet Logic

    Get PDF
    The successful era of CMOS technology is coming to an end. The limit on minimum fabrication dimensions of transistors and the increasing leakage power hinder the technological scaling that has characterized the last decades. In several different ways, this problem has been addressed changing the architectures implemented in CMOS, adopting parallel processors and thus increasing the throughput at the same operating frequency. However, architectural alternatives cannot be the definitive answer to a continuous increase in performance dictated by Moore’s law. This problem must be addressed from a technological point of view. Several alternative technologies that could substitute CMOS in next years are currently under study. Among them, magnetic technologies such as NanoMagnet Logic (NML) are interesting because they do not dissipate any leakage power. More- over, magnets have memory capability, so it is possible to merge logic and memory in the same device. However, magnetic circuits, and NML in this specific research, have also some important drawbacks that need to be addressed: first, the circuit clock frequency is limited to 100 MHz, to avoid errors in data propagation; second, there is a connection between circuit layout and timing, and in particular, longer wires will have longer latency. These drawbacks are intrinsic to the technology and for this reason they cannot be avoided. The only chance is to limit their impact from an architectural point of view. The first step followed in the research path of this thesis is indeed the choice and optimization of architectures able to deal with the problems of NML. Systolic Ar- rays are identified as an ideal solution for this technology, because they are regular structures with local interconnections that limit the long latency of wires; more- over they are composed of several Processing Elements that work in parallel, thus exploit parallelization to increase throughput (limiting the impact of the low clock frequency). Through the analysis of Systolic Arrays for NML, several possible im- provements have been identified and addressed: 1) it has been defined a rigorous way to increase throughput with interleaving, providing equations that allow to esti- mate the number of operations to be interleaved and the rules to provide inputs; 2) a latency insensitive circuit has been designed, that exploits a data communication protocol between processing elements to avoid data synchronization problems. This feature has been exploited to design a latency insensitive Systolic Array that is able to execute the Floyd-Steinberg dithering algorithm. All the improvements presented in this framework apply to Systolic Arrays implemented in any technology. So, they can also be exploited to increase performance of today’s CMOS parallel circuits. This research path is presented in Chapter 3. While Systolic Arrays are an interesting solution for NML, their usage could be quite limited because they are normally application-specific. The second re- search path addresses this problem. A Reconfigurable Systolic Array is presented, that can be programmed to execute several algorithms. This architecture has been tested implementing many algorithms, including FIR and IIR filters, Discrete Cosine Transform and Matrix Multiplication. This research path is presented in Chapter 4. In common Von Neumann architectures, the logic part of the circuit and the memory one are separated. Today bus communication between logic and memory represents the bottleneck of the system. This problem is addressed presenting Logic- In-Memory (LIM), an architecture where memory elements are merged in logic ones. This research path aims at defining a real LIM architectures. This has been done in two steps. The first step is represented by an architecture composed of three layers: memory, routing and logic. In the second step instead the routing plane is no more present, and its features are inherited by the memory plane. In this solution, a pyramidal memory model is used, where memories near logic elements contain the most probably used data, and other memory layers contain the remaining data and instruction set. This circuit has been tested with odd-even sort algorithms and it has been benchmarked against GPUs and ASIC. This research path is presented in Chapter 5. MagnetoElastic NML (ME-NML) is a technological improvement of the NML principle, proposed by researchers of Politecnico di Torino, where the clock system is based on the induced stretch of a piezoelectric substrate when a voltage is ap- plied to its boundaries. The main advantage of this solution is that it consumes much less power than the classic clock implementation. This technology has not yet been investigated from an architectural point of view and considering complex circuits. In this research field, a standard methodology for the design of ME-NML circuits has been proposed. It is based on a Standard Cell Library and an enhanced VHDL model. The effectiveness of this methodology has been proved designing a Galois Field Multiplier. Moreover the serial-parallel trade-off in ME-NML has been investigated, designing three different solutions for the Multiply and Accumulate structure. This research path is presented in Chapter 6. While ME-NML is an extremely interesting technology, it needs to be combined with other faster technologies to have a real competitive system. Signal interfaces between NML and other technologies (mainly CMOS) have been rarely presented in literature. A mixed-technology multiplexer is designed and presented as the basis for a CMOS to NML interface. The reverse interface (from ME-NML to CMOS) is instead based on a sensing circuit for the Faraday effect: a change in the polarization of a magnet induces an electric field that can be used to generate an input signal for a CMOS circuit. This research path is presented in Chapter 7. The research work presented in this thesis represents a fundamental milestone in the path towards nanotechnologies. The most important achievement is the de- sign and simulation of complex circuits with NML, benchmarking this technology with real application examples. The characterization of a technology considering complex functions is a major step to be performed and that has not yet been ad- dressed in literature for NML. Indeed, only in this way it is possible to intercept in advance any weakness of NanoMagnet Logic that cannot be discovered consid- ering only small circuits. Moreover, the architectural improvements introduced in this thesis, although technology-driven, can be actually applied to any technology. We have demonstrated the advantages that can derive applying them to CMOS cir- cuits. This thesis represents therefore a major step in two directions: the first is the enhancement of NML technology; the second is a general improvement of parallel architectures and the development of the new Logic-In-Memory paradigm
    corecore