
    Dynamic partial reconfiguration for pipelined digital systems: a case study using a color space conversion engine

    In digital hardware design, reconfigurable devices such as Field Programmable Gate Arrays (FPGAs) offer a unique feature called partial reconfiguration (PR): the reprogramming of a subset of the reconfigurable logic during active operation. PR allows multiple hardware blocks to be consolidated into a single partition, which can be reprogrammed at run-time as desired. This may reduce the logic circuit (and silicon area) requirements and greatly extend functionality. Furthermore, dynamic partial reconfiguration (DPR) refers to PR that does not halt the system during reprogramming. This allows configuration to overlap with normal processing, potentially achieving better system performance than a static (halting) PR implementation. This work investigates the advantages and trade-offs of DPR as applied to an existing color space conversion (CSC) engine provided by Hewlett-Packard (HP). Two versions were created: a single-pipeline engine, which can only overlap tasks in specific sequences, and a dual-pipeline engine, which can overlap any consecutive tasks. Both were implemented in a Virtex-6 FPGA, with data communication over the PCI Express (PCIe) interface. Test results show improvements in execution speed and resource utilization, though some are minor due to intrinsic characteristics of the CSC engine pipeline. The dual-pipeline version outperformed the single-pipeline version in most test cases; future work will therefore focus on multiple-pipeline architectures.
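    The per-pixel work of such a CSC engine is a small matrix transform. The abstract does not disclose HP's conversion matrices, so the sketch below uses the standard BT.601 RGB-to-YCbCr coefficients purely as a stand-in; the pipeline structure and the DPR control logic are not modeled.

```python
import numpy as np

# BT.601 full-range RGB -> YCbCr matrix, used here only as a stand-in
# for the (undisclosed) conversions the HP engine reprograms via PR.
M = np.array([[ 0.299,     0.587,     0.114   ],
              [-0.168736, -0.331264,  0.5     ],
              [ 0.5,      -0.418688, -0.081312]])
OFFSET = np.array([0.0, 128.0, 128.0])

def csc(pixels):
    """Apply a 3x3 color space conversion to an (N, 3) pixel array.
    In hardware, each multiply-accumulate maps to a pipeline stage,
    and PR swaps in a different M/OFFSET pair for a new conversion."""
    return pixels @ M.T + OFFSET

rgb = np.array([[255, 0, 0], [0, 255, 0]], dtype=np.float64)
print(csc(rgb))
```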

    Efficient reconfigurable architectures for 3D medical image compression

    This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. Recently, the more widespread use of three-dimensional (3-D) imaging modalities, such as magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and ultrasound (US), has generated a massive amount of volumetric data. This has provided an impetus to the development of other applications, in particular telemedicine and teleradiology. In these fields, medical image compression is important, since both efficient storage and transmission of data through high-bandwidth digital communication lines are of crucial importance. Despite their advantages, most 3-D medical imaging algorithms are computationally intensive, with matrix transformation as the most fundamental operation in transform-based methods. There is therefore a real need for high-performance systems, whilst keeping architectures flexible to allow for quick upgradeability with real-time applications. Moreover, in order to obtain efficient solutions for large medical data volumes, an efficient implementation of these operations is of significant importance. Reconfigurable hardware, in the form of field programmable gate arrays (FPGAs), has been proposed as a viable system building block in the construction of high-performance systems at an economical price. FPGAs therefore seem an ideal candidate, offering inherent advantages such as massive parallelism, multimillion-gate counts, and special low-power packages. The key achievements of the work presented in this thesis are summarised as follows. Two architectures for the 3-D Haar wavelet transform (HWT) have been proposed, based on transpose-based computation and partial reconfiguration, suitable for 3-D medical imaging applications. These applications require continuous hardware servicing, and as a result dynamic partial reconfiguration (DPR) has been introduced. A comparative study of non-partial and partial reconfiguration implementations has shown that DPR offers many advantages and leads to a compelling solution for implementing computationally intensive applications such as 3-D medical image compression. Using DPR, several large systems are mapped to small hardware resources, and the area, power consumption, and maximum frequency are optimised and improved. Moreover, an FPGA-based architecture of the finite Radon transform (FRAT) with three design strategies has been proposed: direct implementation of pseudo-code with a sequential or a pipelined description, and a block random access memory (BRAM)-based method. An analysis with various medical imaging modalities has been carried out. Image de-noising using FRAT shows promising results in reducing Gaussian white noise in medical images, and the hardware implementation achieves promising trade-offs between maximum frequency, throughput, and area. Furthermore, a novel hardware implementation of a 3-D medical image compression system with context-based adaptive variable length coding (CAVLC) has been proposed. An evaluation of the 3-D integer transform (IT) and the discrete wavelet transform (DWT) with lifting scheme (LS) for the transform blocks reveals that the 3-D IT has lower computational complexity than the 3-D DWT, whilst the 3-D DWT with LS provides lossless compression, which is significantly useful for medical image compression.
    Additionally, an architecture of CAVLC that is capable of compressing high-definition (HD) images in real-time, without any buffer between the quantiser and the entropy coder, is proposed. Through judicious parallelisation, promising results have been obtained with limited resources. In summary, this research tackles the issues of massive 3-D medical data volumes that require compression, as well as hardware implementation to accelerate the slowest operations in the system. The results also reveal significant achievements in terms of architectural efficiency and application performance. Ministry of Higher Education Malaysia (MOHE), Universiti Tun Hussein Onn Malaysia (UTHM) and the British Council.
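    Of the blocks above, the 3-D HWT is the most self-contained. The thesis's transpose-based hardware architecture is not reproduced here, but a minimal sketch of a single-level separable 3-D Haar transform (the 1-D Haar butterfly applied along each axis in turn) shows the arithmetic the two proposed architectures implement; all names are illustrative.

```python
import numpy as np

def haar1d(x, axis):
    """One level of the 1-D Haar transform along one axis:
    scaled sums (low-pass) followed by differences (high-pass)."""
    x = np.moveaxis(x, axis, 0)
    even, odd = x[0::2], x[1::2]
    low  = (even + odd) / np.sqrt(2.0)   # approximation coefficients
    high = (even - odd) / np.sqrt(2.0)   # detail coefficients
    out = np.concatenate([low, high], axis=0)
    return np.moveaxis(out, 0, axis)

def haar3d(volume):
    """Single-level separable 3-D Haar transform of a volume with
    even dimensions, e.g. an MRI/CT block."""
    out = volume.astype(np.float64)
    for axis in range(3):
        out = haar1d(out, axis)
    return out

vol = np.random.rand(8, 8, 8)
coeffs = haar3d(vol)   # 8 octant subbands of shape (4, 4, 4)
```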

    Artificial neural networks acceleration on field-programmable gate arrays considering model redundancy

    Artificial Neural Networks (ANNs) have developed dramatically over the last ten years and have been successfully applied in many important areas. A natural follow-up topic is to deploy ANNs on a wider range of hardware platforms. However, modern ANN models may aim for millisecond- or even nanosecond-level latency per input while commonly requiring millions of operations and gigabytes of data access to compute each input. This intrinsic computational complexity poses hardware challenges for system implementation. Meanwhile, the integration of computing resources on hardware platforms is hampered by the slowing of Moore's Law. It is therefore important to study new design methods for ANN hardware systems that produce high model accuracy with low resource usage. The Field-Programmable Gate Array (FPGA) is a natural fit for this topic due to its reconfigurability and flexibility: it allows us to implement customised data paths and data representations in hardware, which makes it the primary platform in this research. The main topics discussed in this thesis are neural network redundancy and its impact on hardware systems. The main goal is to reduce hardware complexity by reducing neural network redundancy while maintaining accuracy. To achieve this, redundancy is first categorised into two types, model-level and data-level; each type is studied in isolation before both are combined in a single system design. First, to study model-level redundancy, the dropout algorithm is implemented as a way to reduce model-level redundancy during training and used here to reduce hardware cost. Our proposed system achieves a 50% reduction in DSP usage and 33% to 47% less on-chip memory compared to conventional implementations. Second, in terms of data-level redundancy, we study how data precision affects hardware cost and system throughput. Our experiments show that reduced-precision data incur negligible or even no accuracy loss relative to full-precision data on the tested benchmarks; in particular, 4-bit fixed point presents a good trade-off between model accuracy and hardware cost compared to the other tested data representations. Third, we study the interactive effect of reducing both model- and data-level redundancy and propose an FPGA accelerator design for a Redundancy-Reduced (RR-) MobileNet [Hea17]. Our proposed RR-MobileNet system achieves state-of-the-art latency of 7.85 ms per image for ImageNet inference. Finally, a design guideline is proposed as step-by-step guidance for redundancy-reduced neural network system design.
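    The abstract does not give the exact 4-bit fixed-point format used, so the sketch below assumes a symmetric scheme with two fractional bits; it shows the rounding and saturation that this kind of data-level redundancy reduction introduces.

```python
import numpy as np

def quantize_fixed_point(w, total_bits=4, frac_bits=2):
    """Symmetric fixed-point quantization: round each weight to the
    nearest representable value and saturate to the integer range
    of a signed total_bits-wide word."""
    scale = 2.0 ** frac_bits
    qmin = -(2 ** (total_bits - 1))
    qmax = 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(w * scale), qmin, qmax)
    return q / scale   # dequantized value actually used in inference

weights = np.random.randn(4, 4).astype(np.float32)
print(quantize_fixed_point(weights))   # values on a grid of 0.25, in [-2, 1.75]
```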

    Reconfigurable Instruction Cell Architecture Reconfiguration and Interconnects


    Efficient machine learning: models and accelerations

    One of the key enablers of the recent unprecedented success of machine learning is the adoption of very large models. Modern machine learning models typically consist of multiple cascaded layers, as in deep neural networks, and at least millions to hundreds of millions of parameters (i.e., weights) for the entire model. Larger-scale models tend to enable the extraction of more complex high-level features and therefore lead to a significant improvement in overall accuracy. On the other hand, the layered deep structure and large model sizes also increase computational and memory requirements. In order to achieve higher scalability, performance, and energy efficiency for deep learning systems, two orthogonal research and development trends have attracted enormous interest: acceleration and model compression. The underlying goal of both is to maintain high model quality, providing accurate predictions. In this thesis, we address these two problems and utilize different computing paradigms to solve real-life deep learning problems. To explore these two domains, this thesis first presents the cogent confabulation network for the sentence completion problem. We use the Chinese language as a case study to describe our exploration of cogent confabulation based text recognition models. The exploration and optimization of these models have been conducted through various comparisons, and the optimized network offers better sentence completion accuracy. To accelerate the sentence completion problem on a multi-processing system, we propose a parallel framework for the confabulation recall algorithm. The parallel implementation reduces runtime, improves recall accuracy by breaking the fixed evaluation order and introducing more generalization, and maintains balanced progress in status updates among all neurons. A lexicon scheduling algorithm is presented to further improve model performance. As deep neural networks have proven effective in many real-life applications and are increasingly deployed on low-power devices, we then investigate accelerating neural network inference with stochastic computing, a hardware-friendly approximate computing paradigm that requires a small hardware footprint and achieves high energy efficiency. Applying stochastic computing to deep convolutional neural networks, we design the functional hardware blocks and optimize them jointly to minimize the accuracy loss due to the approximation. Synthesis results show that the proposed design achieves remarkably low hardware cost and power/energy consumption. Modern neural networks usually contain a huge number of parameters that cannot fit into embedded devices, so compression of deep learning models, together with acceleration, also attracts our attention. We introduce structured-matrix-based neural networks to address this problem. A circulant matrix is one such structured matrix: it can be represented using a single vector, so the matrix is compressed. We further investigate a more flexible structure based on the circulant matrix, called the block-circulant matrix, which partitions a matrix into several smaller blocks and makes each submatrix circulant; the compression ratio is controllable.
    With the help of equivalent computation based on the Fourier transform, inference of the deep neural network can be accelerated energy-efficiently on FPGAs. We also optimize the training algorithm for block-circulant neural networks to obtain high accuracy after compression.
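    The FFT identity behind that speed-up is standard: multiplying by a circulant matrix is a circular convolution with its defining vector. Below is a minimal numpy sketch of a block-circulant fully-connected layer; the layer shape and block size are illustrative, and the FPGA implementation maps the FFTs to hardware rather than calling a library.

```python
import numpy as np

def circulant_matvec(c, x):
    """y = C @ x where C is the circulant matrix whose first column
    is c, computed in O(n log n) via C @ x = IFFT(FFT(c) * FFT(x))."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def block_circulant_layer(first_cols, x, block=4):
    """Fully-connected layer whose weight matrix is partitioned into
    (block x block) circulant submatrices, each stored as one vector;
    first_cols[i][j] is the defining vector of block (i, j)."""
    x_blocks = x.reshape(-1, block)
    rows = []
    for i in range(len(first_cols)):
        acc = np.zeros(block)
        for j, xb in enumerate(x_blocks):
            acc += circulant_matvec(first_cols[i][j], xb)
        rows.append(acc)
    return np.concatenate(rows)

rng = np.random.default_rng(0)
cols = [[rng.standard_normal(4) for _ in range(2)] for _ in range(2)]
x = rng.standard_normal(8)
y = block_circulant_layer(cols, x, block=4)   # an 8x8 layer stored as 4 length-4 vectors
```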

    Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration

    DNN accelerators are often developed and evaluated in isolation, without considering the cross-stack, system-level effects in real-world environments. This makes it difficult to appreciate the impact of System-on-Chip (SoC) resource contention, OS overheads, and programming-stack inefficiencies on overall performance/energy-efficiency. To address this challenge, we present Gemmini, an open-source*, full-stack DNN accelerator generator. Gemmini generates a wide design space of efficient ASIC accelerators from a flexible architectural template, together with flexible programming stacks and full SoCs with shared resources that capture system-level effects. Gemmini-generated accelerators have also been fabricated, delivering up to three orders-of-magnitude speedups over high-performance CPUs on various DNN benchmarks. (* https://github.com/ucb-bar/gemmini) Comment: To appear at the 58th IEEE/ACM Design Automation Conference (DAC), December 2021, San Francisco, CA, USA.
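    Gemmini's architectural template centers on a spatial (systolic) array executing matrix multiplications. As a software analogue only, the sketch below shows an output-stationary tiled matrix multiply, the dataflow such arrays commonly implement; it uses none of Gemmini's actual API, and the tile size is an arbitrary assumption.

```python
import numpy as np

def tiled_matmul(A, B, tile=16):
    """Output-stationary tiled matrix multiply: each (tile x tile)
    output block stays resident while A/B tiles stream through it,
    mirroring how a systolic GEMM unit accumulates partial sums."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % tile == 0 and N % tile == 0 and K % tile == 0
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = np.zeros((tile, tile), dtype=A.dtype)  # stays "in the array"
            for k in range(0, K, tile):
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C

A, B = np.random.randn(32, 32), np.random.randn(32, 32)
assert np.allclose(tiled_matmul(A, B), A @ B)
```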

    System-on-chip Computing and Interconnection Architectures for Telecommunications and Signal Processing

    This dissertation proposes novel architectures and design techniques targeting SoC building blocks for telecommunications and signal processing applications. Hardware implementation of Low-Density Parity-Check (LDPC) decoders is approached at both the algorithmic and the architectural level. LDPC codes are a promising coding scheme for future communication standards due to their outstanding error correction performance. This work proposes a methodology for analyzing the effects of finite-precision arithmetic on error correction performance and hardware complexity, and the methodology is employed throughout the co-design of the decoder. First, a low-complexity check node based on the P-output decoding principle is designed and characterized on a CMOS standard-cell library. Results demonstrate an implementation loss below 0.2 dB down to a BER of 10^{-8} and complexity savings of up to 59% with respect to other recent works. High-throughput and low-latency issues are addressed with modified single-phase decoding schedules. A new "memory-aware" schedule is proposed, requiring as little as 20% of the memory of traditional two-phase flooding decoding; additionally, throughput is doubled and logic complexity reduced by 12%. These advantages are traded off against error correction performance, making the solution attractive only for long codes, such as those adopted in the DVB-S2 standard. The "layered decoding" principle is extended to codes not specifically conceived for this technique. The proposed architectures exhibit complexity savings on the order of 40% in both area and power consumption, while the implementation loss is smaller than 0.05 dB. Most modern communication standards employ Orthogonal Frequency Division Multiplexing (OFDM) as part of their physical layer. The core of OFDM is the Fast Fourier Transform (FFT) and its inverse, which handle symbol (de)modulation. Requirements on throughput and energy efficiency call for FFT hardware implementation, while the ubiquity of the FFT suggests the design of parametric, re-configurable, and re-usable IP hardware macrocells. In this context, this thesis describes an FFT/IFFT core compiler particularly suited for the implementation of OFDM communication systems. The tool employs an accuracy-driven configuration engine which automatically profiles the internal arithmetic and generates a core with minimum operand bit-widths and thus minimum circuit complexity. The engine performs a closed-loop optimization over three different internal arithmetic models (fixed-point, block floating-point, and convergent block floating-point), using the numerical accuracy budget given by the user as a reference point. The flexibility and re-usability of the proposed macrocell are illustrated through several case studies which encompass all current state-of-the-art OFDM communication standards (WLAN, WMAN, xDSL, DVB-T/H, DAB, and UWB). Implementation results are presented for two deep sub-micron standard-cell libraries (65 and 90 nm) and commercially available FPGA devices. Compared with other FFT core compilers, the proposed environment produces macrocells with lower circuit complexity and the same system-level performance (throughput, transform size, and numerical accuracy). The final part of this dissertation focuses on the Network-on-Chip (NoC) design paradigm, whose goal is to build scalable communication infrastructures connecting hundreds of cores. A low-complexity link architecture for mesochronous on-chip communication is discussed.
    The link loosens the skew constraints in clock tree synthesis and enables frequency speed-up, power consumption reduction, and faster back-end turnarounds. The proposed architecture reaches a maximum clock frequency of 1 GHz in a 65 nm low-leakage CMOS standard-cell library. In a complex test case with a full-blown NoC infrastructure, the link overhead is only 3% of chip area and 0.5% of leakage power consumption. Finally, a new methodology, named metacoding, is proposed. Metacoding generates correct-by-construction, technology-independent RTL codebases for NoC building blocks. The RTL coding phase is abstracted and modeled with an object-oriented framework integrated within a commercial tool for IP packaging (the Synopsys CoreTools suite). Compared with traditional coding styles based on pre-processor directives, metacoding produces 65% smaller codebases and reduces the number of configurations to verify by up to three orders of magnitude.
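    As an illustration of the check-node arithmetic such LDPC decoders optimize, the sketch below implements the standard min-sum check-node update, which low-complexity check nodes approximate. The dissertation's P-output check node itself is not specified in the abstract, so this is a generic reference version.

```python
import numpy as np

def min_sum_check_node(llrs):
    """Min-sum check-node update: the message sent back on each edge
    carries the XOR (sign product) of the other edges' signs and the
    minimum of the other edges' magnitudes, obtained cheaply from the
    two smallest magnitudes overall."""
    llrs = np.asarray(llrs, dtype=np.float64)
    signs = np.sign(llrs)
    signs[signs == 0] = 1.0
    prod_sign = np.prod(signs)          # product over ALL edges
    mags = np.abs(llrs)
    order = np.argsort(mags)
    min1, min2 = mags[order[0]], mags[order[1]]
    # each edge excludes itself: the overall minimum edge gets min2
    out = np.where(np.arange(len(llrs)) == order[0], min2, min1)
    # prod_sign * signs[i] = product of the OTHER edges' signs
    return prod_sign * signs * out

print(min_sum_check_node([2.5, -0.7, 1.2, -3.1]))
```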

    Algorithms and Architectures for Secure Embedded Multimedia Systems

    Embedded multimedia systems provide real-time video support for applications in entertainment (mobile phones, internet video websites), defense (video surveillance and tracking) and the public domain (tele-medicine, remote and distance learning, traffic monitoring and management). With the widespread deployment of such real-time embedded systems, there has been increasing concern over the security and authentication of the associated multimedia data. While several (software) algorithms and hardware architectures have been proposed in the research literature to support multimedia security, these fail to address embedded applications, whose performance specifications impose tighter constraints on computational power and available hardware resources. The goals of this dissertation research are twofold: 1. To develop novel algorithms for joint video compression and encryption. The proposed algorithms reduce the computational requirements of multimedia encryption; in particular, we propose an approach that encrypts the compression parameters instead of the compressed bitstream. 2. To accelerate the proposed algorithms on reconfigurable computing platforms such as FPGAs and on VLSI circuits. We use signal processing knowledge to make the algorithms suitable for hardware optimization and reduce the critical path of the circuits using hardware-specific optimizations. The proposed algorithms ensure a considerable level of security for low-power embedded systems such as portable video players and surveillance cameras. These schemes incur little or no compression loss and preserve the desired properties of the compressed bitstream in the encrypted bitstream, ensuring secure and scalable transmission of videos over heterogeneous networks. They also support indexing, search, and retrieval in secure multimedia digital libraries. This property is crucial not only for police and armed forces retrieving information about a suspect from a large video database of surveillance feeds, but also extremely helpful for data centers (such as those used by YouTube, AOL, and Metacafe) in reducing the computational cost of searching for and retrieving desired videos.
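    As an illustration of parameter-domain encryption (not the dissertation's exact scheme, which the abstract does not detail), the sketch below flips the signs of quantized transform coefficients under a keystream: magnitudes, and hence the compressed size and format, are untouched, so the bitstream structure stays decodable while the content is scrambled. The keystream construction and parameter choice are illustrative assumptions.

```python
import hashlib

def keystream(key: bytes, n: int):
    """Pseudo-random bit stream from SHA-256 in counter mode
    (illustrative only; a real system would use a vetted stream cipher)."""
    bits = []
    counter = 0
    while len(bits) < n:
        block = hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        for byte in block:
            for k in range(8):
                bits.append((byte >> k) & 1)
        counter += 1
    return bits[:n]

def encrypt_signs(coeffs, key: bytes):
    """Selective encryption: flip each coefficient's sign according to
    one keystream bit (a no-op for zeros), leaving magnitudes and thus
    the compressed size untouched. Applying it twice decrypts."""
    ks = keystream(key, len(coeffs))
    return [c if b == 0 else -c for c, b in zip(coeffs, ks)]

params = [3, -1, 0, 7, -2, 5]          # e.g. quantized transform coefficients
enc = encrypt_signs(params, b"secret-key")
assert encrypt_signs(enc, b"secret-key") == params
```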