44 research outputs found

    Reliable Hardware Architectures of CORDIC Algorithm with Fixed Angle of Rotations

    Fixed-angle rotation of vectors is widely used in signal processing, graphics, and robotics. Various optimized coordinate rotation digital computer (CORDIC) designs have been proposed for uniform rotation of vectors through known, specified angles. Nevertheless, such hardware architectures are potentially vulnerable in the presence of faults. In this thesis, we propose efficient error detection schemes for two fixed-angle rotation designs, i.e., the Interleaved Scaling and Cascaded Single-rotation CORDIC. To the best of our knowledge, this work is the first to provide reliable architectures for these CORDIC variants. The former is suitable for low-area applications; hence, we propose recomputing-with-encoded-operands schemes that add negligible area overhead to the designs. The error detection schemes proposed for the latter variant are optimized for high-performance applications and hamper the performance of the architectures only negligibly. We present three variants of recomputing with encoded operands, coupled with signature-based schemes, to detect both transient and permanent faults. The overheads of the proposed designs are assessed through Xilinx FPGA implementations, and their effectiveness is benchmarked through error simulations. The results give confidence in the proposed architectures, which can be tailored to the reliability requirements and the overhead to be tolerated.
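As background for the rotation operation itself (a software sketch, not of the thesis's protected architectures; the function name is ours), fixed-angle CORDIC reduces to shift-add micro-rotations whose direction sequence and scale factor can be precomputed offline for the known angle:

```python
import math

def cordic_fixed_rotation(x, y, theta, n_iters=16):
    # For a fixed, known angle the direction of each micro-rotation
    # (sigma_i) and the scale factor K can be precomputed offline;
    # here they are derived on the fly for clarity.
    sigmas = []
    z = theta
    for i in range(n_iters):
        sigma = 1 if z >= 0 else -1
        sigmas.append(sigma)
        z -= sigma * math.atan(2.0 ** -i)
    K = 1.0
    for i in range(n_iters):
        K *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    # Shift-add micro-rotations: in hardware the products with 2^-i
    # are plain wire shifts, so no multipliers are needed.
    for i, sigma in enumerate(sigmas):
        x, y = x - sigma * y * 2.0 ** -i, y + sigma * x * 2.0 ** -i
    return K * x, K * y
```

With 16 iterations the residual angle error is on the order of atan(2^-15), so the rotated coordinates agree with an exact rotation to roughly four decimal places.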

    Bit Serial Systolic Architectures for Multiplicative Inversion and Division over GF(2<sup>m</sup>)

    Systolic architectures are capable of achieving high throughput by maximizing pipelining and by eliminating global data interconnects. Recursive algorithms with regular data flows are suitable for systolization. The computation of multiplicative inversion using algorithms based on the EEA (Extended Euclidean Algorithm) is particularly suitable for systolization. Implementations based on the EEA present a high degree of parallelism and pipelinability at the bit level, which can be readily optimized to achieve local data flow and to eliminate the global interconnects that represent the most important bottleneck in today's sub-micron design processes. The net result is a high clock rate and high performance from efficient systolic architectures. This thesis examines high-performance yet scalable implementations of multiplicative inversion and field division over Galois fields GF(2^m) for cryptographic applications, where the field dimension m may be very large (greater than 400) and either m or the defining irreducible polynomial may vary. For this purpose, many inversion schemes with different basis representations are studied; most importantly, variants of the EEA and of binary (Stein's) GCD computation are reviewed, and a set of common as well as contrasting characteristics of these variants is discussed. As a result, a generalized and optimized variant of the EEA is proposed which can compute division, with multiplicative inversion as a special case, with the divisor in either polynomial or triangular basis representation. Further results regarding Hankel matrix formation for double-basis inversion are provided, and the validity of using the same architecture to compute field division in either polynomial or triangular basis representation is proved. Next, a scalable unidirectional bit-serial systolic array implementation of the proposed EEA variant is presented, its complexity measures are defined, and these are compared against the best-known architectures. It is shown that, under the requirements specified above, the proposed architecture may achieve a higher clock rate than other designs while being more flexible and reliable and having a minimum number of inter-cell interconnects. The main contribution at the system architecture level is the substitution of all counter or adder/subtractor elements with a simpler distributed structure free of carry propagation delays. Further, a novel restoring mechanism for the result sequences of the EEA is proposed using a double delay element implementation. Finally, using this systolic architecture, a CMD (Combined Multiplier-Divider) datapath is designed and used as the core of a novel systolic elliptic curve processor. This EC processor uses affine coordinates to compute scalar point multiplication, which results in a very small control unit whose size is negligible with respect to the datapath for all practical values of m. The throughput of this EC processor, based on the bit-serial systolic architecture, is comparable with that of previously reported designs many times larger than itself.
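The core computation being systolized can be sketched in software as follows (a hedged, word-level model of EEA-based inversion with polynomial-basis operands encoded as integers; not the bit-serial array of the thesis, and the function names are ours):

```python
def gf2_deg(a):
    # Degree of a GF(2)[x] polynomial encoded as an integer
    # (bit i holds the coefficient of x^i).
    return a.bit_length() - 1

def gf2_inverse(a, f):
    # Extended Euclidean Algorithm over GF(2)[x]: find b such that
    # a * b = 1 (mod f), where f is the irreducible field polynomial.
    # Assumes a is nonzero and already reduced modulo f.
    u, v = a, f
    g1, g2 = 1, 0
    while u != 1:
        j = gf2_deg(u) - gf2_deg(v)
        if j < 0:
            u, v = v, u
            g1, g2 = g2, g1
            j = -j
        u ^= v << j      # addition in GF(2) is XOR; shift aligns leading terms
        g1 ^= g2 << j
    return g1
```

For example, in GF(2^4) with f = x^4 + x + 1 (encoded 0b10011), the inverse of x (0b0010) is x^3 + 1 (0b1001), since x·(x^3 + 1) = x^4 + x = 1 mod f.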

    Hardware-aware design, search, and optimization of deep neural networks

    Deep Learning has achieved remarkable progress in the last decade due to its powerful automatic representation capability for a variety of tasks, such as Image Recognition, Speech Recognition, and Machine Translation. This success is closely tied to network design, which is crucial to feature representation and has led to many innovative architectures such as the Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Graph Neural Network (GNN), and Transformer. A wide range of hardware platforms is available to accelerate Deep Neural Networks (DNNs), from general-purpose hardware such as CPUs to special-purpose devices such as the Tensor Processing Unit (TPU). High-performance computing systems such as GPUs effectively reduce the computation time of DNNs. With the slowing of Moore's law, research on Domain-Specific Hardware, which excels at its assigned tasks, has gained significance. It is therefore not straightforward to choose a platform that works in all scenarios, as the choice depends on the application and environment. Neural Architecture Search (NAS), a subset of Automatic Machine Learning (AutoML), automates the design of a neural network architecture for a given task and dataset without significant human intervention, saving researchers' manual effort and computation time. Hardware-aware Neural Architecture Search (HW-NAS) is a class of problems whose goal is to search for networks that are not only accurate on the given dataset but also hardware-efficient in terms of latency, energy, size, etc. The resulting searched models outperform manually designed networks in several aspects, such as model performance and inference latency on the actual hardware.
NAS and HW-NAS have been very successful in searching for efficient models that achieve state-of-the-art performance on many tasks, such as Image Classification, Object Detection, and Machine Translation. Pruning and Quantization are two important techniques for designing lightweight, memory-efficient, and hardware-friendly models for inference on a variety of devices such as CPUs, GPUs, ASICs, and FPGAs. These methods compress large networks into smaller models with negligible loss of accuracy or task performance. Neural Network Pruning removes redundant or unimportant weights, nodes, neurons, or filters that do not contribute significantly to model performance, thereby reducing the size and computational complexity of a model. Network Quantization converts high-precision model weights (32-bit floating point) to low precision (8-bit or 4-bit integer). Quantization has attracted much attention in academia and industry because inference can be performed at low precision with a negligible drop in accuracy, whereas training is carried out at high precision. Weight pruning, or element-wise pruning, shrinks the DNN model significantly and introduces considerable sparsity in the weight matrices. The uniform systolic arrays in the TPU and the Tensor Cores in the Volta and Turing GPU architectures are not explicitly designed to accelerate such sparse matrices; therefore, the speedup due to weight pruning is negligible despite removing 90% of the parameters. Several node pruning methods have since been developed to resolve the sparsity bottlenecks. However, these methods do not consider the underlying hardware dimensions (size of the array, number of CPUs) or Tensor Core precision, leading to suboptimal performance. We develop the Hardware Dimension Aware Pruning (HDAP) method for array-based accelerators, multi-core CPUs, and Tensor Core-enabled GPUs by considering the underlying dimensions of the system.
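The float-to-integer conversion described above can be illustrated with a minimal symmetric per-tensor scheme (a hedged sketch, not the thesis's search method; the function names are ours):

```python
def quantize_int8(weights):
    # Symmetric per-tensor quantization: map float weights onto the
    # int8 range [-127, 127] using a single scale factor derived from
    # the largest magnitude. Assumes at least one nonzero weight.
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights for comparison with the originals.
    return [v * scale for v in q]
```

Real deployments apply this per layer (or per channel) and often calibrate the scale on activation statistics rather than the raw maximum.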
Node-pruned networks obtained with the HDAP method achieved average speedups of 3.2x and 4.2x, whereas the baseline method attained average speedups of only 1.5x and 1.6x, on a Turing Tensor Core GPU and the Eyeriss architecture, respectively. Hardware systems are often prone to soft errors or permanent faults due to external conditions or internal scaling. Much work has been done in the past on systolic array implementations and their reliability concerns. However, their fault tolerance with respect to DNNs is not yet fully understood under a fault model. In our work, we first present a fault model, i.e., the different sequences in which faults can occur on the systolic array, and co-design a fault- and array-size-based pruning (FPAP) algorithm that bypasses the faults and removes internal redundancy at the same time for efficient inference. Tensor Cores in the Nvidia Ampere 100 (A100) GPU support (1) 2:4 fine-grained sparse pruning, where 2 out of every 4 elements are pruned, and (2) traditional dense multiplication, to achieve a good accuracy-performance trade-off. The A100 Tensor Core also takes advantage of 1-bit, 4-bit, and 8-bit multiplication to speed up model inference. Hence, finding the right matrix type (dense or 2:4 sparse) along with the precision for each layer becomes a combinatorial problem. Neural Architecture Search (NAS) can alleviate such problems by automating the architecture design process instead of a brute-force search. In this work, we propose (i) Mixed Sparse and Precision Search (MSPS), a NAS framework to search for efficient sparse and mixed-precision quantized models within a predefined search space and a fixed backbone network (e.g., ResNet50), and (ii) Architecture, Sparse and Precision Search (ASPS), to jointly search for the kernel size, the number of filters, and the sparse-precision combination of each layer.
We illustrate the effectiveness of our methods targeting the A100 Tensor Core on Nvidia GPUs by searching for efficient sparse mixed-precision networks based on ResNet50, achieving better accuracy-latency trade-offs than manually designed uniform sparse Int8 networks.
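The 2:4 fine-grained sparsity pattern referred to above can be sketched in a few lines (an illustrative magnitude-based version; the function name is ours, and production flows use Nvidia's sparsity tooling rather than hand-written pruning):

```python
def prune_2_4(weights):
    # 2:4 fine-grained structured pruning: in every group of 4 weights,
    # zero out the 2 with the smallest magnitude, producing the pattern
    # accepted by A100 Sparse Tensor Cores.
    pruned = list(weights)
    for g in range(0, len(pruned), 4):
        group = pruned[g:g + 4]
        # Indices of the two largest-magnitude elements in this group.
        keep = sorted(range(len(group)),
                      key=lambda i: abs(group[i]), reverse=True)[:2]
        for i in range(len(group)):
            if i not in keep:
                pruned[g + i] = 0.0
    return pruned
```

Because exactly half of each group survives, the hardware can store the kept values densely plus a small index, halving both memory traffic and multiply work.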

    Effective network grid synthesis and optimization for high performance very large scale integration system design

    Degree system: new; Ministry of Education report number: Kō 2642; degree type: Doctor of Engineering; date conferred: 2008/3/15; Waseda University degree number: Shin 480

    Efficient and Low-complexity Hardware Architecture of Gaussian Normal Basis Multiplication over GF(2<sup>m</sup>) for Elliptic Curve Cryptosystems

    In this paper, an efficient high-speed architecture for a Gaussian normal basis multiplier over the binary finite field GF(2^m) is presented. The structure is constructed from regular modules for computing exponentiation by powers of 2 and low-cost blocks for multiplication by normal elements of the binary field. Since the exponents are powers of 2, the modules are implemented by simple cyclic shifts in the normal basis representation. As a result, the multiplier has a simple structure with a low critical path delay. The efficiency of the proposed structure is studied in terms of area and time complexity through an implementation on the Virtex-4 FPGA family and an ASIC design in 180 nm CMOS technology. Comparison with other Gaussian normal basis multiplier structures verifies that the proposed architecture has better performance in terms of speed and hardware utilization.
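The cyclic-shift property the multiplier exploits can be sketched in a few lines (an illustrative software model of normal basis coefficient vectors, not the paper's hardware; the function names are ours):

```python
def square_normal_basis(coeffs):
    # In a normal basis {b, b^2, b^4, ...} of GF(2^m), squaring an
    # element is a single cyclic shift of its coefficient vector --
    # essentially free in hardware (just rewiring).
    return coeffs[-1:] + coeffs[:-1]

def power_2k(coeffs, k):
    # Raising to the power 2^k is k cyclic shifts; shifting by m
    # (the field dimension) returns the element unchanged.
    m = len(coeffs)
    k %= m
    return coeffs[-k:] + coeffs[:-k] if k else list(coeffs)
```

This is why exponentiation by powers of 2 costs no logic in a normal basis design, leaving the multiplication-by-normal-elements blocks as the only real arithmetic.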

    Survey of FPGA applications in the period 2000 – 2015 (Technical Report)

    Romoth J, Porrmann M, Rückert U. Survey of FPGA applications in the period 2000–2015 (Technical Report); 2017. Since their introduction, FPGAs have found their way into more and more fields of application. Their key advantage is the combination of software-like flexibility with performance otherwise typical of hardware. Nevertheless, every application field imposes special requirements on the computational architecture used. This paper provides an overview of the different topics FPGAs have been used for in the last 15 years of research, and why they have been chosen over other processing units such as CPUs.

    Multi-LSTM Acceleration and CNN Fault Tolerance

    This thesis addresses two problems in the field of Machine Learning: the acceleration of multiple Long Short Term Memory (LSTM) models on FPGAs, and the fault tolerance of compressed Convolutional Neural Networks (CNNs). LSTMs are an effective solution for capturing long-term dependencies in sequential data, such as sentences in Natural Language Processing applications, video frames in Scene Labeling tasks, or temporal series in Time Series Forecasting. To further boost their efficacy, especially in the presence of long sequences, multiple LSTM models are utilized in a hierarchical and stacked fashion. However, because of their memory-bound nature, efficiently mapping multiple LSTMs to a computing device becomes even more challenging. The first part of this thesis addresses the problem of mapping multiple LSTM models to an FPGA device by introducing a framework that adapts their memory requirements to the target architecture. For a similar accuracy loss, the proposed framework maps multiple LSTMs with a performance improvement of 3x to 5x over state-of-the-art approaches. In the second part of this thesis, we investigate the fault tolerance of CNNs, another effective deep learning architecture. CNNs are a dominant solution for image classification tasks but suffer from a high performance cost due to their computational structure. Because of their large parameter space, fetching their data from main memory typically becomes a performance bottleneck. To tackle this problem, various techniques for compressing their parameters have been developed, such as weight pruning, weight clustering, and weight quantization. However, reducing the memory footprint of an application can make its data more sensitive to faults.
For this thesis work, we conducted an analysis to verify the conditions for applying OddECC, a mechanism that supports ECCs of variable strength and size for different memory regions. Our experiments reveal that compressed CNNs, whose memory footprint is reduced by up to 86.3x by the aforementioned compression schemes, exhibit accuracy drops of up to 13.56% in the presence of random single-bit faults.
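A minimal sketch of the fault model behind such experiments, assuming single bit flips in IEEE-754 binary32 weights (the function name is ours):

```python
import struct

def flip_bit_float32(value, bit):
    # Inject a single bit flip into the IEEE-754 binary32 encoding of a
    # weight -- a simple software model of a random single-bit fault.
    (bits,) = struct.unpack('<I', struct.pack('<f', value))
    bits ^= 1 << bit              # flip the requested bit (0 = LSB of mantissa)
    (faulty,) = struct.unpack('<f', struct.pack('<I', bits))
    return faulty
```

Flipping a low mantissa bit perturbs a weight only slightly, while flipping a high exponent bit can change it by many orders of magnitude, which is why the impact of a fault depends so strongly on its position and on how the compressed data is laid out.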

    REAL-TIME ADAPTIVE PULSE COMPRESSION ON RECONFIGURABLE, SYSTEM-ON-CHIP (SOC) PLATFORMS

    New radar applications need to perform complex algorithms and process large quantities of data to generate useful information for users. This situation has motivated the search for better processing solutions, including low-power high-performance processors, efficient algorithms, and high-speed interfaces. In this work, a hardware implementation of adaptive pulse compression algorithms for real-time transceiver optimization is presented, based on a System-on-Chip architecture for reconfigurable hardware devices. This study also evaluates the performance of dedicated coprocessors as hardware accelerator units to speed up and improve computing-intensive tasks such as matrix multiplication and matrix inversion, which are essential for solving the covariance matrix. The trade-offs between latency and hardware utilization are also presented. Moreover, the system architecture takes advantage of the embedded processor, which is interconnected with the logic resources through high-performance buses, to perform floating-point operations, control the processing blocks, and communicate with an external PC through a customized software interface. The overall system functionality is demonstrated and tested for real-time operation using a Ku-band testbed together with a low-cost channel emulator for different types of waveforms.
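For background, classical (non-adaptive) pulse compression is simply correlation of the received samples with the transmitted code; the adaptive schemes implemented in this work instead derive the filter from an estimated covariance matrix. A hedged sketch of the classical case with a Barker-13 code (the function name is ours):

```python
def pulse_compress(rx, code):
    # Matched-filter pulse compression: slide the transmitted code over
    # the received samples and correlate at each lag. The peak of the
    # output localizes the target echo in time.
    n = len(code)
    return [sum(rx[i + j] * code[j] for j in range(n))
            for i in range(len(rx) - n + 1)]
```

For a length-13 Barker code, the compressed peak equals 13 while every sidelobe has magnitude at most 1, which is the pulse-compression gain the hardware pipeline must reproduce at sample rate.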