
    An Application-Specific Instruction Set for Accelerating Set-Oriented Database Primitives

    The key task of database systems is to efficiently manage large amounts of data. High query throughput and low query latency are essential for the success of a database system. Lately, research has focused on exploiting hardware features like superscalar execution units, SIMD, or multiple cores to speed up processing. Apart from these software optimizations for given hardware, even tailor-made processing circuits running on FPGAs have been built to run mostly stateless query plans with very high throughput. A similar idea, already considered three decades ago, is to build tailor-made hardware such as a database processor. Despite their superior performance, such application-specific processors were not considered beneficial because general-purpose processors eventually always caught up, so the high development costs did not pay off. In this paper, we show that developing a database processor is much more feasible nowadays thanks to the availability of customizable processors. We illustrate, by way of example, how to create an instruction set extension for set-oriented database primitives. The resulting application-specific processor not only provides high performance but also enables very energy-efficient processing. In various configurations, our processor requires more than 960x less energy than a high-end x86 processor while providing the same performance.
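    As a concrete illustration of the kind of set-oriented primitive such an instruction set extension targets, the sketch below is a plain software reference for intersecting two sorted sets. It is only a baseline for intuition; the function name and types are illustrative assumptions, and the paper's custom instructions are not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Merge-style intersection of two sorted sets; returns the number of
 * common elements written to out.  A software baseline only. */
size_t set_intersect(const uint32_t *a, size_t na,
                     const uint32_t *b, size_t nb,
                     uint32_t *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])        i++;
        else if (a[i] > b[j])   j++;
        else { out[k++] = a[i]; i++; j++; }   /* common element */
    }
    return k;
}
```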

    Design of a GF(64)-LDPC Decoder Based on the EMS Algorithm

    This paper presents the architecture, performance and implementation results of a serial GF(64)-LDPC decoder based on a reduced-complexity version of the Extended Min-Sum algorithm. The main contributions of this work concern the variable node processing, the codeword decision and the elementary check node processing. Post-synthesis area results show that the decoder occupies less than 20% of a Virtex 4 FPGA for a decoding throughput of 2.95 Mbps. The implemented decoder performs within 0.7 dB of the Belief Propagation algorithm for different code lengths and rates. Moreover, the proposed architecture can easily be adapted to decode very high Galois field orders, such as GF(4096) or higher, by slightly modifying a marginal part of the design.
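    For intuition, the sketch below shows the core operation of an Extended Min-Sum elementary check node in plain C: two truncated candidate lists of (LLR, GF(64) symbol) pairs are combined and only the NM most reliable sums are kept. The list size NM, the data types, and the naive enumeration are assumptions for illustration; the reduced-complexity hardware architecture of the paper, including its handling of duplicate symbols, is not reproduced here.

```c
#include <float.h>

#define NM 16   /* candidates kept per message (assumed) */

typedef struct { float llr; unsigned char sym; } cand_t;

/* Combine two sorted candidate lists and keep the NM smallest LLR sums
 * (smaller LLR = more reliable).  GF(2^6) addition of symbol indices is
 * a bitwise XOR.  Duplicate-symbol pruning is omitted for brevity. */
void ecn_combine(const cand_t in1[NM], const cand_t in2[NM], cand_t out[NM])
{
    for (int k = 0; k < NM; k++) { out[k].llr = FLT_MAX; out[k].sym = 0; }

    for (int i = 0; i < NM; i++)
        for (int j = 0; j < NM; j++) {
            float llr = in1[i].llr + in2[j].llr;          /* combined reliability */
            unsigned char sym = in1[i].sym ^ in2[j].sym;  /* GF(64) addition      */
            if (llr < out[NM - 1].llr) {                  /* among the NM best?   */
                int k = NM - 1;
                while (k > 0 && out[k - 1].llr > llr) { out[k] = out[k - 1]; k--; }
                out[k].llr = llr;
                out[k].sym = sym;
            }
        }
}
```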

    Hardware-efficient data compression in wireless intracortical brain-machine interfaces

    Brain-Machine Interfaces (BMIs) have emerged as a promising technology for restoring lost motor function in patients with neurological disorders and/or motor impairments, e.g. paraplegia, amputation, stroke, spinal cord injury, and amyotrophic lateral sclerosis. The past two decades have seen significant advances in BMI performance, largely driven by the invention and uptake of intracortical microelectrode arrays that can isolate the activity of individual neurons. However, the current paradigm involves the use of percutaneous connections, i.e. wires. These wires carry the information from the intracortical array implanted in the brain to outside the body, where it is used for neural decoding. They also carry significant long-term risks, ranging from infection to mechanical injury to impaired mobility and quality of life for the individual. There is therefore a desire to make intracortical BMIs (iBMIs) wireless, with the data communicated out using either electromagnetic or acoustic waves. Unfortunately, this consumes a significant amount of power, which is dissipated from the implant in the form of heat. Heating tissue can cause irreparable damage, so there are strict limits on the heat flux from implants into cortical tissue. Given the ever-increasing number of channels per implant, the required communication power now exceeds the acceptable cortical heat transfer limits, and this heating issue is hampering widespread clinical use. Effective data compression would bring wireless iBMIs (WI-BMIs) into alignment with heat transfer limits, enabling large channel counts and small implant sizes without risking tissue damage through heating. This thesis addresses the communication power problem from a signal processing and data compression perspective and is composed of two parts. In the first part, we investigate hardware-efficient ways to compress the Multi-Unit Activity (MUA) signal, the most common signal in modern iBMIs. In the second and final part, we look at efficient ways to extract and compress the high-bandwidth Entire Spiking Activity signal, which, while underexplored, has attracted significant interest given its ability to outperform the MUA signal in neural decoding. Overall, this thesis introduces hardware-efficient methods of extracting high-performing neural features and compressing them by an order of magnitude or more beyond the state of the art in ultra-low-power ways. This enables many more recording channels to be fit onto intracortical implants while remaining within cortical heat transfer safety and channel capacity limits.
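    As a hedged illustration of the general idea behind compressing the MUA signal, the sketch below bins threshold crossings into low-resolution per-bin counts. The bin length, the 2-bit saturation, and all names are assumptions for illustration only and do not reproduce the specific schemes developed in the thesis.

```c
#include <stdint.h>

#define BIN_SAMPLES 30   /* samples per bin, e.g. 1 ms at 30 kS/s (assumed) */

/* Count negative-going threshold crossings in one bin of one channel and
 * saturate the count to 2 bits (0..3), so only 2 bits per bin need to be
 * transmitted instead of the raw samples.  Illustration only. */
uint8_t mua_bin_count(const int16_t x[BIN_SAMPLES], int16_t thr)
{
    uint8_t count = 0;
    for (int n = 1; n < BIN_SAMPLES; n++)
        if (x[n] <= thr && x[n - 1] > thr)   /* crossing of a negative threshold */
            if (count < 3) count++;          /* saturate to the 2-bit range      */
    return count;
}
```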

    Acceleration of k-Nearest Neighbor and SRAD Algorithms Using Intel FPGA SDK for OpenCL

    Field Programmable Gate Arrays (FPGAs) have been widely used for accelerating machine learning algorithms. However, the high design cost and time of implementing FPGA-based accelerators with traditional HDL-based design methodologies have discouraged users from designing them. In recent years, a CAD tool called Intel FPGA SDK for OpenCL (IFSO) has allowed fast and efficient design of FPGA-based hardware accelerators from high-level specifications such as OpenCL, so that even software engineers with basic hardware design knowledge can design FPGA-based accelerators. In this thesis, IFSO has been used to explore FPGA acceleration of the k-Nearest-Neighbour (kNN) algorithm and of Speckle Reducing Anisotropic Diffusion (SRAD) simulation. kNN is a popular algorithm used in machine learning. Bitonic sorting and radix sorting algorithms were used within the kNN algorithm to check whether they provide any performance improvement. Acceleration of SRAD simulation was also explored. The experimental results obtained for these algorithms with FPGA-based acceleration were compared with state-of-the-art CPU implementations. The optimized algorithms were implemented on two different FPGAs (Intel Stratix A7 and Intel Arria 10 GX). Experimental results show that the FPGA-based accelerators provided similar or better execution times (up to 80X) and better power efficiency (75% reduction in power consumption) than a traditional platform such as a workstation based on two Intel Xeon E5-2620 series processors (each with 6 cores running at 2.4 GHz).
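    For reference, the bitonic sorting network mentioned above can be expressed in plain C as below; an OpenCL kernel for the FPGA would unroll these loops into a parallel compare-and-swap network. The array size and data type are assumptions for illustration, not the thesis' kernel code.

```c
#define N 8   /* must be a power of two (assumed size) */

static void compare_swap(float *a, int i, int j, int ascending)
{
    /* swap so that a[i] and a[j] end up in the requested order */
    if (ascending ? (a[i] > a[j]) : (a[i] < a[j])) {
        float t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

/* In-place bitonic sort: k is the size of the bitonic sequences being
 * merged, j the compare distance within a merge stage. */
void bitonic_sort(float *a)
{
    for (int k = 2; k <= N; k <<= 1)
        for (int j = k >> 1; j > 0; j >>= 1)
            for (int i = 0; i < N; i++) {
                int partner = i ^ j;
                if (partner > i)
                    compare_swap(a, i, partner, (i & k) == 0);
            }
}
```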

    Embedded Machine Learning: Emphasis on Hardware Accelerators and Approximate Computing for Tactile Data Processing

    Machine Learning (ML), a subset of Artificial Intelligence (AI), is driving the industrial and technological revolution of the present and future. We envision a world of smart devices that can mimic human behavior (sense, process, and act) and perform tasks that we once thought could only be carried out by humans. The vision is to achieve such a level of intelligence with affordable, power-efficient, and fast hardware platforms. However, embedding machine learning algorithms in many application domains such as the Internet of Things (IoT), prostheses, robotics, and wearable devices is an ongoing challenge, governed by the computational complexity of ML algorithms, the performance and availability of hardware platforms, and the application's budget (power constraints, real-time operation, etc.). In this dissertation, we focus on the design and implementation of efficient ML algorithms to handle these challenges. First, we apply Approximate Computing Techniques (ACTs) to reduce the computational complexity of ML algorithms. Then, we design custom hardware accelerators to improve the performance of the implementation within a specified budget. Finally, a tactile data processing application is adopted to validate the proposed exact and approximate embedded machine learning accelerators. The dissertation starts by introducing the various ML algorithms used for tactile data processing. These algorithms are assessed in terms of their computational complexity and of the hardware platforms available for implementation. Afterward, a survey of existing approximate computing techniques and hardware accelerator design methodologies is presented. Based on the findings of the survey, an approach for applying algorithmic-level ACTs to machine learning algorithms is provided. Then three novel hardware accelerators are proposed: (1) k-Nearest Neighbor (kNN) based on a selection-based sorter, (2) Tensorial Support Vector Machine (TSVM) based on shallow neural networks, and (3) Hybrid-Precision Binary Convolutional Neural Network (BCNN). The three accelerators offer real-time classification with substantial reductions in hardware resources and power consumption compared to existing implementations targeting the same tactile data processing application on FPGA. Moreover, the approximate accelerators maintain high classification accuracy, with a loss of at most 5%.
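    A hedged sketch of what a selection-based sorter for kNN does, in plain C: rather than fully sorting all distances, only the K best candidates are retained in a small insertion buffer. K, the data layout, and the function name are illustrative assumptions; the dissertation's hardware accelerator is not reproduced here.

```c
#include <float.h>

#define K 5   /* number of neighbours kept (assumed) */

/* Scan n precomputed distances and keep the indices of the K smallest,
 * maintained in ascending order in a small insertion buffer. */
void knn_select(const float *dist, int n, int idx_out[K])
{
    float best_d[K];
    for (int k = 0; k < K; k++) { best_d[k] = FLT_MAX; idx_out[k] = -1; }

    for (int i = 0; i < n; i++) {
        if (dist[i] < best_d[K - 1]) {          /* candidate for the top-K   */
            int k = K - 1;
            while (k > 0 && best_d[k - 1] > dist[i]) {
                best_d[k] = best_d[k - 1];      /* shift larger entries down */
                idx_out[k] = idx_out[k - 1];
                k--;
            }
            best_d[k] = dist[i];
            idx_out[k] = i;
        }
    }
}
```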

    An Architecture for High-throughput and Improved-quality Stereo Vision Processor

    This paper presents a VLSI architecture to achieve high-throughput, improved-quality stereo vision for real applications. The stereo vision processor generates gray-scale output images with depth information from input images taken by two CMOS Image Sensors (CIS). The depth estimator, which uses the sum of absolute differences (SAD) algorithm as the stereo matching technique, is implemented in hardware by exploiting pipelining and parallelism. To produce depth maps with improved quality in real time, pre- and post-processing units are adopted, and to enhance the adaptability of the system to real environments, special function registers (SFRs) are assigned to vision parameters. The design, in a 0.18 um standard CMOS technology, can operate at a 120 MHz clock, achieving over 140 frames/s of depth maps with a 320 x 240 image size and 64 disparity levels. Experimental results based on images taken in the real world and on the Middlebury data set are presented, together with comparison data against existing hardware systems and the hardware specifications of the proposed processor.
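    For intuition, a plain software reference of the SAD matching step is sketched below: for one pixel, every disparity candidate is scored by a window SAD and the best one is kept. The window size and image width are assumptions; only the 64 disparity levels come from the abstract, and the pipelined, parallel hardware organisation of the processor is not reflected in this code.

```c
#include <stdlib.h>
#include <stdint.h>
#include <limits.h>

#define WIDTH     320
#define WIN       7          /* assumed matching window              */
#define DISP_MAX  64         /* disparity levels, as in the abstract */

/* Return the best disparity for pixel (x, y); the caller must keep the
 * window and the full disparity range inside both rectified images. */
uint8_t sad_disparity(const uint8_t *left, const uint8_t *right, int x, int y)
{
    int best_d = 0;
    unsigned best_sad = UINT_MAX;

    for (int d = 0; d < DISP_MAX; d++) {
        unsigned sad = 0;
        for (int dy = -WIN / 2; dy <= WIN / 2; dy++)
            for (int dx = -WIN / 2; dx <= WIN / 2; dx++) {
                int l = left [(y + dy) * WIDTH + (x + dx)];
                int r = right[(y + dy) * WIDTH + (x + dx - d)];
                sad += (unsigned)abs(l - r);   /* accumulate window SAD */
            }
        if (sad < best_sad) { best_sad = sad; best_d = d; }
    }
    return (uint8_t)best_d;   /* depth value for pixel (x, y) */
}
```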

    Hardware implementation aspects of polar decoders and ultra high-speed LDPC decoders

    The goal of channel coding is to detect and correct errors that appear during the transmission of information. In the past few decades, channel coding has become an integral part of most communications standards, as it improves the energy efficiency of transceivers manyfold while requiring only a modest investment in digital signal processing capability. The most commonly used channel codes in modern standards are low-density parity-check (LDPC) codes and Turbo codes, which were the first two types of codes to approach the capacity of several channels while still being practically implementable in hardware. The decoding algorithms for LDPC codes, in particular, are highly parallelizable and suitable for high-throughput applications. A new class of channel codes, called polar codes, was introduced recently. Polar codes have an explicit construction and low-complexity encoding and successive cancellation (SC) decoding algorithms. Moreover, polar codes are provably capacity-achieving over a wide range of channels, making them very attractive from a theoretical perspective. Unfortunately, polar codes under standard SC decoding cannot compete, in terms of error-correcting performance, with the LDPC and Turbo codes used in current standards. For this reason, several improved SC-based decoding algorithms have been introduced. The most prominent is the successive cancellation list (SCL) decoding algorithm, which is powerful enough to approach the error-correcting performance of LDPC codes. The original SCL decoding algorithm was described in an arithmetic domain that is not well suited for hardware implementations, and it is not clear from that description how an efficient SCL decoder architecture can be implemented. To this end, in this thesis, we re-formulate the SCL decoding algorithm in two distinct arithmetic domains, we describe efficient hardware architectures to implement the resulting SCL decoders, and we compare the decoders with existing LDPC and Turbo decoders in terms of their error-correcting performance and their implementation efficiency. Due to ongoing technology scaling, the feature sizes of integrated circuits keep shrinking at a remarkable pace. As transistors and memory cells keep shrinking, it becomes increasingly difficult and costly (in terms of both area and power) to ensure that the implemented digital circuits always operate correctly. Thus, manufactured digital signal processing circuits, including channel decoder circuits, may not always operate correctly. Instead of discarding these faulty dies or using costly circuit-level fault mitigation mechanisms, an alternative approach is to try to live with certain malfunctions, provided that the algorithm implemented by the circuit is sufficiently fault-tolerant. In this spirit, in this thesis we examine decoding of polar codes and LDPC codes under the assumption that the memories used within the decoders are not fully reliable. We show that, in both cases, there is inherent fault tolerance, and we also propose methods to reduce the effect of memory faults on the error-correcting performance of the considered decoders.
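    As background, the two LLR kernels at the heart of successive cancellation (and SCL) decoding can be sketched in their common min-sum form as below; this is standard textbook material and does not capture the thesis' reformulated list decoder or its hardware architecture.

```c
#include <math.h>

/* f-node: combine two LLRs before the partial-sum bit is known
 * (min-sum approximation of the exact boxplus operation). */
static inline float polar_f(float a, float b)
{
    float sign = ((a < 0) != (b < 0)) ? -1.0f : 1.0f;
    return sign * fminf(fabsf(a), fabsf(b));
}

/* g-node: combine two LLRs once the partial-sum bit u is known. */
static inline float polar_g(float a, float b, unsigned u)
{
    return u ? b - a : b + a;
}
```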

    Implementation of watershed based image segmentation algorithm in FPGA

    The watershed algorithm is a commonly used method of solving the image segmentation problem. However, of the many variants of the watershed algorithm, not all are equally well suited for hardware implementation. Different algorithms are studied, and the watershed algorithm based on connected components is selected for implementation, as it exhibits the least computational complexity, offers good segmentation quality, and can be implemented in an FPGA. It also has simpler memory access than the other watershed-based image segmentation algorithms. This thesis proposes a new hardware implementation of the selected watershed algorithm. The main aim of the thesis is to implement an image segmentation algorithm in an FPGA that requires minimal hardware resources, has a low execution time, and is suitable for real-time applications. A pipelined architecture of the algorithm is designed, implemented in VHDL and synthesized for a Xilinx Virtex-4 FPGA. In the implementation, the image is loaded into external memory and the algorithm is repeatedly applied to it. To overcome the problem of over-segmentation, a pre-processing step is applied before the segmentation and implemented in the pipelined architecture. The pipelined pre-processing stage can operate at up to 228 MHz. The computation time for a 512 x 512 image is about 35 to 45 ms using one pipelined segmentation unit. A parallel architecture is also proposed that uses multiple segmentation units and is fast enough for real-time applications. The implemented and proposed architectures are excellent candidates for applications where high-speed performance is needed.
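    To make the connected-components building block concrete, the sketch below labels pixels that share a grey value with their left or upper neighbour using a simple two-pass union-find approach. The image size, the equal-value merge criterion, and all names are assumptions for illustration; the thesis implements a pipelined VHDL architecture, not this software reference.

```c
#include <stdint.h>

#define W 512
#define H 512

static int parent[W * H];

static int find_root(int p)             /* union-find root with path halving */
{
    while (parent[p] != p) { parent[p] = parent[parent[p]]; p = parent[p]; }
    return p;
}

static void unite(int a, int b)
{
    a = find_root(a); b = find_root(b);
    if (a != b) parent[b] = a;
}

void label_components(const uint8_t *img, int *labels)
{
    for (int p = 0; p < W * H; p++) parent[p] = p;

    /* first pass: merge with left and upper neighbours of equal value */
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            int p = y * W + x;
            if (x > 0 && img[p] == img[p - 1]) unite(p - 1, p);
            if (y > 0 && img[p] == img[p - W]) unite(p - W, p);
        }

    /* second pass: flatten every pixel to its component's root label */
    for (int p = 0; p < W * H; p++) labels[p] = find_root(p);
}
```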

    Real-time neural signal processing and low-power hardware co-design for wireless implantable brain machine interfaces

    Intracortical Brain-Machine Interfaces (iBMIs) have advanced significantly over the past two decades, demonstrating their utility in various aspects, including neuroprosthetic control and communication. To increase the information transfer rate and improve the devices' robustness and longevity, iBMI technology aims to increase channel counts to access more neural data while reducing invasiveness through miniaturisation and avoiding percutaneous connectors (wired implants). However, as the number of channels increases, the raw data bandwidth required for wireless transmission also increases and becomes prohibitive, requiring efficient on-implant processing to reduce the amount of data through data compression or feature extraction. The fundamental aim of this research is to develop methods for high-performance neural spike processing co-designed with low-power hardware that is scalable for real-time wireless BMI applications. The specific original contributions include the following. Firstly, a new method has been developed for hardware-efficient spike detection, which achieves state-of-the-art spike detection performance while significantly reducing hardware complexity. Secondly, a novel thresholding mechanism for spike detection has been introduced. By incorporating firing-rate information as a key determinant of the spike detection threshold, the adaptiveness of spike detection is improved. This allows spike detection to overcome the signal degradation that arises from scar tissue growth around the recording site, ensuring enduringly stable spike detection; as a consequence, long-term decoding performance is also notably improved. Thirdly, the relationship between spike detection performance and neural decoding accuracy has been shown to be nonlinear, offering new opportunities to further reduce transmission bandwidth by at least 30% with only minor decoding performance degradation. In summary, this thesis presents a journey toward designing ultra-hardware-efficient spike detection algorithms and applying them to reduce data bandwidth and improve neural decoding performance. The software-hardware co-design approach is essential for the next generation of wireless brain-machine interfaces with increased channel counts and a highly constrained hardware budget.
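    As a hedged illustration of a firing-rate-aware detection threshold, the sketch below tracks the background noise level with a running mean of the absolute signal and slowly nudges the threshold multiplier so that the measured firing rate stays near a target. Every constant and the exact update rule are assumptions for illustration; the hardware-efficient mechanism developed in the thesis may differ in detail.

```c
#include <stdlib.h>
#include <stdint.h>

typedef struct {
    float noise;        /* running estimate of mean |x|          */
    float mult;         /* threshold multiplier                  */
    float rate;         /* smoothed firing rate (spikes/sample)  */
    float target_rate;  /* desired firing rate                   */
} spike_det_t;

/* Process one sample; returns 1 if it exceeds the adaptive threshold. */
int detect_spike(spike_det_t *d, int16_t x)
{
    float ax = (float)abs(x);
    d->noise += 0.001f * (ax - d->noise);            /* slow noise tracking    */

    int spike = ax > d->mult * d->noise;             /* threshold crossing     */

    d->rate += 0.0001f * ((float)spike - d->rate);   /* smoothed firing rate   */
    d->mult += (d->rate > d->target_rate) ? 1e-4f : -1e-4f; /* steer rate to target */

    return spike;
}
```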