28 research outputs found

    Strategies for Optimising DRAM Repair

    Dynamic Random Access Memories (DRAM) are large complex devices, prone to defects during manufacture. Yield is improved by the provision of redundant structures used to repair these defects. This redundancy is often implemented by the provision of excess memory capacity and programmable address logic allowing the replacement of faulty cells within the memory array. As the memory capacity of DRAM devices has increased, so has the complexity of their redundant structures, introducing increasingly complex restrictions and interdependencies upon the use of this redundant capacity. Currently redundancy analysis algorithms solving the problem of optimally allocating this redundant capacity must be manually customised for each new device. Compromises made to reduce the complexity, and human error, reduce the efficacy of these algorithms. This thesis develops a methodology for automating the customisation of these redundancy analysis algorithms. Included are: a modelling language describing the redundant structures (including the restrictions and interdependencies placed upon their use), algorithms manipulating this model to generate redundancy analysis algorithms, and methods for translating those algorithms into executable code. Finally these concepts are used to develop a prototype software tool capable of generating redundancy analysis algorithms customised for a specified device
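As a rough illustration of what a redundancy analysis algorithm does, the sketch below applies the widely used "must-repair" rule to a single array with spare rows and spare columns. It is not the thesis's generated algorithm: the function name `must_repair`, the fault representation as `(row, col)` pairs, and the single unrestricted array are all assumptions made for the example, and the device-specific restrictions and interdependencies described above are deliberately ignored.

```python
from collections import Counter

def must_repair(faults, spare_rows, spare_cols):
    """Apply the must-repair rule until no forced repair remains.

    faults: iterable of (row, col) positions of faulty cells (hypothetical format).
    Returns (rows_replaced, cols_replaced, remaining_faults), or None when a
    forced repair is needed but no spare of the required kind is left.
    """
    faults = set(faults)
    rows, cols = set(), set()
    while True:
        rows_left = spare_rows - len(rows)
        cols_left = spare_cols - len(cols)
        row_counts = Counter(r for r, _ in faults)
        col_counts = Counter(c for _, c in faults)
        # A row holding more faults than there are spare columns left
        # can only be fixed by a spare row (and symmetrically for columns).
        forced_row = next((r for r, n in sorted(row_counts.items()) if n > cols_left), None)
        forced_col = next((c for c, n in sorted(col_counts.items()) if n > rows_left), None)
        if forced_row is not None:
            if rows_left == 0:
                return None  # unrepairable with the given spares
            rows.add(forced_row)
            faults = {(r, c) for r, c in faults if r != forced_row}
        elif forced_col is not None:
            if cols_left == 0:
                return None
            cols.add(forced_col)
            faults = {(r, c) for r, c in faults if c != forced_col}
        else:
            # No forced choices remain; leftover faults need a search step.
            return rows, cols, faults
```

Any faults left after this pass still require a choice between row and column replacement, which is where the optimal-allocation problem described in the abstract begins.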

    Design and application of convergent cellular automata

    Systems made of many interacting elements may display unanticipated emergent properties. A system for which the desired properties are the same as those which emerge will be inherently robust. Currently available techniques for designing emergent properties are prohibitively costly for all but the simplest systems. The self-assembly of biological cells into tissues and ultimately organisms is an example of a natural dynamic distributed system of which the primary emergent behaviour is a fully operational being. The distributed process that co-ordinates this self-assembly is morphogenesis. By analysing morphogenesis with a cellular automata model we deduce a means by which this self-organisation might be achieved. This mechanism is then adapted to the design of self-organising patterns, reliable electronic systems and self-assembling systems. The limitations of the design algorithm are analysed, as is a means to overcome them. The cost of this algorithm is discussed and finally demonstrated with the design of a reliable arithmetic logic unit and a self-assembling, self-repairing and metamorphosising robot made of 12,000 cells

    Parallel implementation of fractal image compression

    Thesis (M.Sc.Eng.), University of Natal, Durban, 2000. Fractal image compression exploits the piecewise self-similarity present in real images as a form of information redundancy that can be eliminated to achieve compression. This theory, based on Partitioned Iterated Function Systems, is presented. As an alternative to the established JPEG, it provides a similar compression-ratio to fidelity trade-off. Fractal techniques promise faster decoding and potentially higher fidelity, but the computationally intensive compression process has prevented commercial acceptance. This thesis presents an algorithm mapping the problem onto a parallel processor architecture, with the goal of reducing the encoding time. The experimental work involved implementation of this approach on the Texas Instruments TMS320C80 parallel processor system. Results indicate that the fractal compression process is unusually well suited to parallelism, with speed gains approximately linearly related to the number of processors used. Parallel processing issues such as coherency, management and interfacing are discussed. The code designed incorporates pipelining and parallelism on all conceptual and practical levels, ensuring that all resources are fully utilised and achieving close to optimal efficiency. The computational intensity was reduced by several means, including conventional classification of image sub-blocks by content, with comparisons across class boundaries prohibited. A faster approach adopted was to perform estimate comparisons between blocks based on pixel value variance, identifying candidates for more time-consuming, accurate RMS inter-block comparisons. These techniques, combined with the parallelism, allow compression of 512×512-pixel, 8-bit images in under 20 seconds, while maintaining a 30 dB PSNR. This is up to an order of magnitude faster than reported for conventional sequential processor implementations.
Fractal based compression of colour images and video sequences is also considered. The work confirms the potential of fractal compression techniques, and demonstrates that a parallel implementation is appropriate for addressing the compression time problem. The processor system used in these investigations is faster than currently available PC platforms, but the relevance lies in the anticipation that future generations of affordable processors will exceed its performance. The advantages of fractal image compression may then be accessible to the average computer user, leading to commercial acceptance
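The two-stage block search described in this abstract (a cheap variance-based estimate that shortlists candidates for the accurate RMS comparison) can be sketched as follows. This is a minimal sequential illustration, not the thesis's TMS320C80 code: the function names, the `keep` shortlist size, and the least-squares contrast/brightness fit used for the RMS error are assumptions, and domain blocks are assumed to be already down-sampled to the range-block size.

```python
import numpy as np

def block_variance(block):
    """Cheap per-block statistic used for pre-screening."""
    return float(np.var(block))

def rms_error(range_block, domain_block):
    """RMS error after a least-squares fit of contrast s and brightness o."""
    r = np.asarray(range_block, dtype=float).ravel()
    d = np.asarray(domain_block, dtype=float).ravel()
    d_centered = d - d.mean()
    denom = float(np.dot(d_centered, d_centered))
    s = float(np.dot(d_centered, r - r.mean())) / denom if denom > 0 else 0.0
    o = r.mean() - s * d.mean()
    return float(np.sqrt(np.mean((s * d + o - r) ** 2)))

def best_domain(range_block, domain_blocks, keep=4):
    """Shortlist domains by variance, then pick the lowest RMS error among them."""
    target = block_variance(range_block)
    # Stage 1: rank all domain blocks by how close their variance is to the target.
    order = sorted(
        range(len(domain_blocks)),
        key=lambda i: abs(block_variance(domain_blocks[i]) - target),
    )
    shortlist = order[:keep]
    # Stage 2: run the expensive RMS comparison only on the shortlist.
    return min(shortlist, key=lambda i: rms_error(range_block, domain_blocks[i]))
```

Because the per-range-block searches are independent of one another, a loop over range blocks calling `best_domain` is the natural unit to distribute across processors.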

    Timing-Error Tolerance Techniques for Low-Power DSP: Filters and Transforms

    Low-power Digital Signal Processing (DSP) circuits are critical to commercial System-on-Chip design for battery powered devices. Dynamic Voltage Scaling (DVS) of digital circuits can reclaim worst-case supply voltage margins for delay variation, reducing power consumption. However, removing static margins without compromising robustness is tremendously challenging, especially in an era of escalating reliability concerns due to continued process scaling. The Razor DVS scheme addresses these concerns, by ensuring robustness using explicit timing-error detection and correction circuits. Nonetheless, the design of low-complexity and low-power error correction is often challenging. In this thesis, the Razor framework is applied to fixed-precision DSP filters and transforms. The inherent error tolerance of many DSP algorithms is exploited to achieve very low-overhead error correction. Novel error correction schemes for DSP datapaths are proposed, with very low-overhead circuit realisations. Two new approximate error correction approaches are proposed. The first is based on an adapted sum-of-products form that prevents errors in intermediate results reaching the output, while the second approach forces errors to occur only in less significant bits of each result by shaping the critical path distribution. A third approach is described that achieves exact error correction using time borrowing techniques on critical paths. Unlike previously published approaches, all three proposed are suitable for high clock frequency implementations, as demonstrated with fully placed and routed FIR, FFT and DCT implementations in 90nm and 32nm CMOS. Design issues and theoretical modelling are presented for each approach, along with SPICE simulation results demonstrating power savings of 21 – 29%. Finally, the design of a baseband transmitter in 32nm CMOS for the Spectrally Efficient FDM (SEFDM) system is presented. 
SEFDM systems offer bandwidth savings compared to Orthogonal FDM (OFDM), at the cost of increased complexity and power consumption, which is quantified with the first VLSI architecture

    FPGA-based high-performance neural network acceleration

    In the last ten years, Artificial Intelligence through Deep Neural Networks (DNNs) has penetrated virtually every aspect of science, technology, and business. Advances are rapid with thousands of papers being published annually. Many types of DNNs have been and continue to be developed -- in this thesis, we address Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Graph Neural Networks (GNNs) -- each with a different set of target applications and implementation challenges. The overall problem for all of these Neural Networks (NNs) is that their target applications generally pose stringent constraints on latency and throughput, but also have strict accuracy requirements. Much research has therefore gone into all aspects of improving NN quality and performance: algorithms, code optimization, acceleration with GPUs, and acceleration with hardware, both dedicated ASICs and off-the-shelf FPGAs. In this thesis, we concentrate on the last of these approaches. There have been many previous efforts in creating hardware to accelerate NNs. The problem designers face is that optimal NN models typically have significant irregularities, making them hardware unfriendly. One commonly used approach is to train NN models to follow regular computation and data patterns. This approach, however, can hurt the models' accuracy or lead to models with non-negligible redundancies. This dissertation takes a different approach. Instead of regularizing the model, we create architectures friendly to irregular models. Our thesis is that high-accuracy and high-performance NN inference and training can be achieved by creating a series of novel irregularity-aware architectures for Field-Programmable Gate Arrays (FPGAs). In four different studies on four different NN types, we find that this approach results in speedups of 2.1x to 3255x compared with carefully selected prior art; for inference, there is no change in accuracy. 
The bulk of this dissertation revolves around these studies, the various workload balancing techniques, and the resulting NN acceleration architectures. In particular, we propose four different architectures to handle, respectively, data structure level, operation level, bit level, and model level irregularities. At the data structure level, we propose AWB-GCN, which uses runtime workload rebalancing to handle Sparse Matrix Multiplication (SpMM) on extremely sparse and unbalanced input. With GNN inference as a case study, AWB-GCN achieves over 90% system efficiency, guarantees efficient off-chip memory access, and provides considerable speedups over CPUs (3255x), GPUs (80x), and a prior ASIC accelerator (5.1x). At the operation level, we propose O3BNN-R, which can detect redundant operations and prune them at run time. This works even for operations that are highly data-dependent and unpredictable. With Binarized NNs (BNNs) as a case study, O3BNN-R can prune over 30% of the operations without any accuracy loss, yielding speedups over state-of-the-art implementations on CPUs (1122x), GPUs (2.3x), and FPGAs (2.1x). At the bit level, we propose CQNN. CQNN embeds a Coarse-Grained Reconfigurable Architecture (CGRA) which can be programmed at runtime to support NN functions with various data-width requirements. Results show that CQNN can deliver µs-level Quantized NN (QNN) inference. At the model level, we propose FPDeep, especially for training. In order to address model-level irregularity, FPDeep uses novel model partitioning schemes to balance workload and storage among nodes. By using a hybrid of model and layer parallelism to train DNNs, FPDeep avoids the large gap that commonly occurs between training and testing accuracy due to improper convergence to sharp minimizers (caused by large training batches). Results show that FPDeep provides scalable, fast, and accurate training and leads to 6.6x higher energy efficiency than GPUs

    Efficient architectures for multidimensional discrete transforms in image and video processing applications

    PhD Thesis. This thesis introduces new image compression algorithms, their related architectures, and data transform architectures. The proposed architectures address current hardware design concerns such as power consumption, hardware usage, memory requirement, computation time and output accuracy. These concerns are crucial in multidimensional image and video processing applications. This research is divided into three image and video processing related topics: low complexity non-transform-based image compression algorithms and their architectures, architectures for multidimensional Discrete Cosine Transform (DCT), and architectures for multidimensional Discrete Wavelet Transform (DWT). The proposed architectures are parameterised in terms of wordlength, pipelining and input data size. Taking such parameterisation into account, efficient non-transform-based and low complexity image compression algorithms for better rate distortion performance are proposed. The proposed algorithms are based on the Adaptive Quantisation Coding (AQC) algorithm, and they achieve a controllable output bit rate and accuracy by considering the intensity variation of each image block. Their high speed, low hardware usage and low power consumption architectures are also introduced and implemented on Xilinx devices. Furthermore, efficient hardware architectures for multidimensional DCT based on the 1-D DCT Radix-2 and 3-D DCT Vector Radix (3-D DCT VR) fast algorithms have been proposed. These architectures attain fast and accurate 3-D DCT computation and provide high processing speed and power consumption reduction. In addition, this research also introduces two low hardware usage 3-D DCT VR architectures. Such architectures perform the computation of butterfly and post addition stages without using block memory for data transposition, which in turn reduces the hardware usage and improves the performance of the proposed architectures.
Moreover, parallel and multiplierless lifting-based architectures for the 1-D, 2-D and 3-D Cohen-Daubechies-Feauveau 9/7 (CDF 9/7) DWT computation are also introduced. The presented architectures represent an efficient multiplierless and low memory requirement CDF 9/7 DWT computation scheme using the separable approach. Furthermore, the proposed architectures have been implemented and tested using Xilinx FPGA devices. The evaluation results have revealed that a speed of up to 315 MHz can be achieved in the proposed AQC-based architectures. Further, a speed of up to 330 MHz and low utilisation rate of 722 to 1235 can be achieved in the proposed 3-D DCT VR architectures. In addition, in the proposed 3-D DWT architecture, the computation time of 3-D DWT for a data size of 144×176×8 pixels is less than 0.33 ms. Also, a power consumption of 102 mW at 50 MHz clock frequency using a 256×256-pixel frame size is achieved. The accuracy tests for all architectures have revealed that an infinite PSNR can be attained
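For readers unfamiliar with the accuracy metric, the standard definition of PSNR shows why an infinite value can occur. The symbols below ($x_i$ for the reference samples, $\hat{x}_i$ for the architecture's output, $N$ for the number of samples, and $\mathrm{MAX}$ for the peak sample value, e.g. 255 for 8-bit data) are notation assumed for this note, not taken from the thesis.

```latex
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \hat{x}_i\right)^2,
\qquad
\mathrm{PSNR} = 10\,\log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)\ \text{dB}
```

As $\mathrm{MSE} \to 0$ the PSNR grows without bound, so an infinite PSNR corresponds to output that matches the reference exactly.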

    Ultra Reliable Computing Systems

    For high security and safety applications as well as general purpose applications, it is necessary to have ultra reliable computing systems. This dissertation describes our system of self-testable and self-repairable digital devices, in particular EPLDs (Electrically Programmable Logic Devices). In addition to significantly improving the reliability of digital systems, our self-healing and re-configurable system design with added repair capability can also provide higher yields, lower testing costs, and faster time-to-market for the semiconductor industry. The digital system in our approach is composed of blocks, which realize combinational and sequential circuits using GALs (Generic Array Logic Devices). We describe three techniques for fault-locating and fault-repairing in these devices. We evaluated these methods, and compared them with devices that have no self-repair capability, by simulating the self-repair algorithms. Our simulations show that the lifetime of a GAL-based EPLD that uses our multiple self-repairing methods is longer than the lifetime of a GAL-based EPLD that uses a single self-repair method or no self-repair method. Specifically, our work demonstrates that the lifetime of a GAL can be increased by adding extra columns in the AND array of a GAL and extra output ORs in a GAL. It also gives information on how many extra columns and extra ORs a GAL needs and which self-repairing method should be used to guarantee a given lifetime. Thus, we can estimate an ideal point, where the maximum reliability can be reached with the minimum cost

    Comunicações e armazenamento de massa em sistemas embebidos escaláveis

    Master's in Electronics and Telecommunications Engineering. This dissertation is written within the scope of the ECU2010 project and aims to determine the best possible solution for information storage and high-speed communications in motorsport applications. The ECU2010 project is centred on the research of a new architecture of electronic control units (ECU) for motorsport, focusing on the control of internal combustion engines. The proposed architecture should be capable of controlling an internal combustion engine using state-of-the-art control models, but based on a distributed processing model consisting of processing modules that are self-sufficient in terms of communications and storage, and of intelligent sensors/actuators, each of which is able to perform low-level data pre-processing. Communication between modules is not discussed herein, nor is communication with the peripheral sensors/actuators. Instead, the focus is on communication between the processing modules and a control and monitoring device, hereinafter called the Host, which will typically be a personal computer or PDA. This document also analyses and proposes a solution for mass storage of information, mainly focused on historical data produced by monitoring, intermediate-processing and actuation variables. The aim is to produce a set of reconfigurable digital electronic IP cores implementing the features mentioned above in a Xilinx Spartan 3E FPGA which, together with purpose-built hardware, interface with the support and communication devices defined in the document

    SpiNNaker - A Spiking Neural Network Architecture

    20 years in conception and 15 in construction, the SpiNNaker project has delivered the world’s largest neuromorphic computing platform incorporating over a million ARM mobile phone processors and capable of modelling spiking neural networks of the scale of a mouse brain in biological real time. This machine, hosted at the University of Manchester in the UK, is freely available under the auspices of the EU Flagship Human Brain Project. This book tells the story of the origins of the machine, its development and its deployment, and the immense software development effort that has gone into making it openly available and accessible to researchers and students the world over. It also presents exemplar applications from ‘Talk’, a SpiNNaker-controlled robotic exhibit at the Manchester Art Gallery as part of ‘The Imitation Game’, a set of works commissioned in 2016 in honour of Alan Turing, through to a way to solve hard computing problems using stochastic neural networks. The book concludes with a look to the future, and the SpiNNaker-2 machine which is yet to come

    Proceedings of the 5th International Workshop on Reconfigurable Communication-centric Systems on Chip 2010 - ReCoSoC'10 - May 17-19, 2010 Karlsruhe, Germany. (KIT Scientific Reports ; 7551)

    ReCoSoC is intended to be a periodic annual meeting to expose and discuss gathered expertise as well as state of the art research around SoC related topics through plenary invited papers and posters. The workshop aims to provide a prospective view of tomorrow's challenges in the multibillion transistor era, taking into account the emerging techniques and architectures exploring the synergy between flexible on-chip communication and system reconfigurability