Search CORE

170 research outputs found

Efficient modular arithmetic units for low power cryptographic applications

Author: Modugu Rajashekhar Reddy
Publication venue: Scholars' Mine
Publication date: 01/01/2010

The demand for high security in energy-constrained devices such as mobiles and PDAs is growing rapidly. This leads to the need for efficient design of cryptographic algorithms which offer data integrity, authentication, non-repudiation and confidentiality of the encrypted data and communication channels. Public key cryptography is an ideal choice for data integrity, authentication and non-repudiation, whereas private key cryptography ensures the confidentiality of the data transmitted. The latter has an extremely high encryption speed, but it has certain limitations which make it unsuitable for use in certain applications. Numerous public key cryptographic algorithms are available in the literature which comprise modular arithmetic modules such as modular addition, multiplication, inversion and exponentiation. Recently, numerous cryptographic algorithms based on modular arithmetic have been proposed which are scalable, perform word-based operations and are efficient in various aspects. The modular arithmetic modules play a crucial role in the overall performance of the cryptographic processor. Hence, better results can be obtained by designing efficient arithmetic modules such as modular addition, multiplication, exponentiation and squaring. This thesis is organized into three papers. The first paper describes the efficient implementation of modular arithmetic units and the application of these modules in the International Data Encryption Algorithm (IDEA). The second paper describes the IDEA algorithm implementation using the existing techniques and using the proposed efficient modular units. The third paper describes the fault tolerant design of a modular unit which has online self-checking capability. --Abstract, page iv

Missouri University of Science and Technology (Missouri S&T): Scholars' Mine
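
The modular units named in the abstract above (modular multiplication and exponentiation, and their use in IDEA) can be illustrated with a small software reference model. This is only a sketch under stated assumptions: the function names are hypothetical, and it shows the textbook arithmetic (IDEA's multiplication modulo 2^16 + 1 with the word 0 standing for 2^16, and square-and-multiply exponentiation), not the hardware designs proposed in the thesis.

```python
# Software reference model; function names are hypothetical, not from the thesis.
MOD = 0x10001  # 2**16 + 1, the prime modulus used by IDEA's multiplication


def idea_mul(a: int, b: int) -> int:
    """Multiply two 16-bit words modulo 2**16 + 1, where the word 0 stands for 2**16."""
    a = a or 0x10000
    b = b or 0x10000
    product = (a * b) % MOD
    return product & 0xFFFF  # a result of 2**16 is encoded back as the word 0


def idea_mul_inv(a: int) -> int:
    """Multiplicative inverse under idea_mul, via Fermat's little theorem."""
    a = a or 0x10000
    return pow(a, MOD - 2, MOD) & 0xFFFF


def mod_exp(base: int, exponent: int, modulus: int) -> int:
    """Right-to-left square-and-multiply modular exponentiation."""
    result = 1 % modulus
    base %= modulus
    while exponent > 0:
        if exponent & 1:              # multiply when the current exponent bit is set
            result = (result * base) % modulus
        base = (base * base) % modulus  # square for the next bit
        exponent >>= 1
    return result
```

A model like this is typically useful as a golden reference when checking a hardware implementation, for example by comparing `idea_mul(a, b)` against simulated outputs for random 16-bit operands.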

Implementing Energy Parsimonious Circuits through Inexact Designs

Author: Lingamneni Avinash
Publication date: 01/01/2011

Inexact circuits, or circuits in which the accuracy of the output can be traded for cost (energy, delay and/or area) savings, have been receiving increasing attention of late due to invariable inaccuracies in nanometer-scale circuits and a concomitant growing desire for ultra low energy embedded systems. Most of the previous approaches to realize inexact circuits relied on scaling of circuit-level operational parameters (such as supply voltage) to achieve the cost and accuracy tradeoffs, and suffered from serious drawbacks of significant implementation overheads that drastically reduced the gains. In this thesis, two novel architecture-level approaches called Probabilistic Pruning and Probabilistic Logic Minimization are proposed to realize inexact circuits with zero overhead. Extensive simulations on various architectures of datapath elements and a prototype chip fabrication demonstrate that normalized gains as large as 2X-9.5X in Energy-Delay-Area product can be obtained for relative error as low as 10^-6 % to 1 % compared to corresponding conventional correct designs

DSpace at Rice University

Efficient analysis and storage of large-scale genomic data

Author: Klarqvist Marcus
Publication venue: University of Cambridge
Publication date: 01/09/2019

The impending advent of population-scaled sequencing cohorts involving tens of millions of individuals with matched phenotypic measurements will produce unprecedented volumes of genetic data. Storing and analysing such gargantuan datasets places computational performance at a pivotal position in medical genomics. In this thesis, I explore the potential for accelerating and parallelizing standard genetics workflows, file formats, and algorithms using both hardware-accelerated vectorization, parallel and distributed algorithms, and heterogeneous computing. First, I describe a novel bit-counting operation termed the positional population-count, which can be used together with succinct representations and standard efficient operations to accelerate many genetic calculations. In order to enable the use of this new operator and the canonical population count on any target machine I developed a unified low-level library using CPU dispatching to select the optimal method contingent on the available instruction set architecture and the given input size at run-time. As a proof-of-principle application, I apply the positional population-count operator to computing quality control-related summary statistics for terabyte-scaled sequencing readsets with >3,800-fold speed improvements. As another application, I describe a framework for efficiently computing the cardinality of set intersection using these operators and applied this framework to efficiently compute genome-wide linkage-disequilibrium in datasets with up to 67 million samples resulting in up to >60-fold improvements in speed for dense genotypic vectors and up to >250,000-fold savings in memory and >100,000-fold improvement in speed for sparse genotypic vectors. I next describe a framework for handling the terabytes of compressed output data and describe graphical routines for visualizing long-range linkage-disequilibrium blocks as seen over many human centromeres. 
Finally, I describe efficient algorithms for storing and querying very large genetic datasets and specialized algorithms for the genotype component of such datasets with >10,000-fold savings in memory compared to the current interchange format. Wellcome Trus

Apollo (Cambridge)
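
The two bit-counting ideas mentioned in the genomics abstract above can be sketched in a few lines. This is a plain, unvectorized illustration with hypothetical function names: the positional population count tallies, for each bit position, how many words have that bit set, and the cardinality of a set intersection is the ordinary population count of the bitwise AND of two bit vectors. The thesis's library uses CPU dispatching and vectorized instructions, none of which is reproduced here.

```python
# Unoptimized illustration; names and the 16-bit default width are assumptions.
from typing import List, Sequence


def positional_popcount(words: Sequence[int], width: int = 16) -> List[int]:
    """Return counts[i] = number of words whose bit i is set."""
    counts = [0] * width
    for word in words:
        for bit in range(width):
            counts[bit] += (word >> bit) & 1
    return counts


def intersection_cardinality(a: Sequence[int], b: Sequence[int]) -> int:
    """Size of the intersection of two sets stored as packed bit vectors."""
    # bin(x).count("1") is a portable population count for Python integers.
    return sum(bin(x & y).count("1") for x, y in zip(a, b))
```

For example, `positional_popcount([0b101, 0b001, 0b100], width=3)` counts bit 0 twice, bit 1 never, and bit 2 twice.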

High Performance Digital Circuit Techniques

Author: Sadrossadat Sayed Alireza
Publication venue: University of Waterloo
Publication date: 01/01/2009

Achieving high performance is one of the most difficult challenges in designing digital circuits. Flip-flops and adders are key blocks in most digital systems and must therefore be designed to yield highest performance. In this thesis, a new high performance serial adder is developed while low power consumption is attained. Also, a statistical framework for the design of flip-flops is introduced that ensures that such sequential circuits meet timing yield under performance criteria. Firstly, a high performance serial adder is developed. The new adder is based on the idea of having a constant delay for the addition of two operands. While conventional adders exhibit logarithmic delay, the proposed adder works at a constant delay order. In addition, the new adder's hardware complexity is linear in the word length, so it exhibits less area and power consumption than conventional high performance adders. The thesis presents the underlying algorithm used for the new adder, followed by simulation results. Secondly, this thesis presents a statistical framework for the design of flip-flops under process variations in order to maximize their timing yield. In nanometer CMOS technologies, process variations significantly impact the timing performance of sequential circuits, which may eventually cause their malfunction. Therefore, developing a framework for designing such circuits is essential. Our framework generates the values of the nominal design parameters, i.e., the size of gates and transmission gates of the flip-flop, such that maximum timing yield is achieved for flip-flops. While previous works focused on improving the yield of flip-flops, less research was done to improve the timing yield in the presence of process variations

University of Waterloo's Institutional Repository

Neural Network Accelerator Design for System on Chip

Author: Ammari Samaneh
Publication date: 02/11/2022

Machine learning (ML) is widely used in a variety of applications such as voice recognition, computer vision, image classification, object detection, and plenty of other use cases. Protecting data privacy, avoiding latency in different applications, and saving network bandwidth all motivate processing data locally without the need to transfer it to the cloud. This approach is called edge computing. It is challenging to design a deep learning accelerator suitable for edge devices. Two main factors affect the chip design. On-chip memory is the first and the most power and area consuming unit. The second one is multipliers. In this thesis, we are focusing on the latter. Most machine learning algorithms use convolution, which is calculated by multiplying and accumulating input feature maps and weights. Most of the deep learning accelerators use the precision scalable Multiply and Accumulate (MAC) architecture and an array of MAC units. Most of the chip’s area is taken up by the array of MAC units, especially multipliers, which also use a lot of power. This master’s thesis consists of two parts. First, a new deep learning accelerator architecture is proposed. Second, different multiplier algorithms are explored. These algorithms were implemented in the SystemVerilog language and synthesized via Cadence tools. The aim was to find a multiplier with smaller area, lower power consumption, and higher performance. In this work, the Braun multiplier, the Booth multiplier, the Baugh-Wooley array multiplier, the Wallace multiplier, the Parallel prefix Vedic multiplier, and the Modified-Booth multiplier are implemented. The power consumption, chip area usage, and performance of the multipliers at different clock frequencies are measured and considered to select the optimal multiplier. Then the precision flexibility feature is added to the selected multiplier algorithms to perform one 8-bit*8-bit, two 4-bit*4-bit, or four 2-bit*2-bit multiplications.
It is worth mentioning that both data (multiplicand) and weight (multiplier) can be in different bit widths, such as 1, 2, 4, or 8 bits. In the proposed deep learning accelerator, the area and power of the systolic array are measured and reported. Among all the multipliers, the signed flexible Modified Booth multiplier, which supports 2-, 4-, and 8-bit operands, is selected. It occupies 866.461 um2, and consumes 0.653 mW power at 1 GHz. The area and power of the systolic array with bit-precision flexibility are 283,564 mm2 and 223,156 mW power at 1 GHz, respectively

Trepo - Institutional Repository of Tampere University
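
The multiply-and-accumulate operation and the idea of sharing narrow multipliers across precisions, both mentioned in the accelerator abstract above, can be sketched generically. The following is an assumption-laden illustration with hypothetical names: it builds an unsigned 8-bit product from four 4-bit partial products and uses it in a MAC loop. It is not the signed Modified Booth design selected in the thesis, and the thesis's own mapping of sub-multipliers to precision modes may differ.

```python
# Generic unsigned decomposition; all names here are hypothetical.
from typing import Sequence


def mul4(a: int, b: int) -> int:
    """Model of a 4-bit x 4-bit unsigned multiplier block."""
    assert 0 <= a < 16 and 0 <= b < 16
    return a * b


def mul8_from_4(a: int, b: int) -> int:
    """8-bit x 8-bit unsigned product assembled from four 4-bit partial products."""
    assert 0 <= a < 256 and 0 <= b < 256
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    # a*b = (a_hi*b_hi << 8) + ((a_hi*b_lo + a_lo*b_hi) << 4) + a_lo*b_lo
    return (
        (mul4(a_hi, b_hi) << 8)
        + ((mul4(a_hi, b_lo) + mul4(a_lo, b_hi)) << 4)
        + mul4(a_lo, b_lo)
    )


def mac(inputs: Sequence[int], weights: Sequence[int]) -> int:
    """Multiply-and-accumulate over paired inputs and weights, as in a convolution tap sum."""
    acc = 0
    for x, w in zip(inputs, weights):
        acc += mul8_from_4(x, w)
    return acc
```

In low-precision mode the same `mul4` blocks could instead compute independent 4-bit products, which is the general motivation for precision-flexible multipliers.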

Techniques for Efficient Implementation of FIR and Particle Filtering

Publication venue: Linkoping University Electronic Press

Crossref

Flexible Baseband Modulator Architecture for Multi-Waveform 5G Communications

Authors: Ferreira João Canas, Ferreira Mário Lopes
Publication venue: IntechOpen
Publication date: 28/02/2020

The fifth-generation (5G) revolution represents more than a mere performance enhancement of previous generations: it will deeply transform the way humans and/or machines interact, enabling a heterogeneous expansion in the number of use cases and services. Crucial to the realization of this revolution is the design of hardware components characterized by high degrees of flexibility, versatility and resource/power efficiency. This chapter proposes a field-programmable gate array (FPGA)-oriented baseband processing architecture suitable for fast-changing communication environments such as 4G/5G waveform coexistence, noncontiguous carrier aggregation (CA) or centralized cloud radio access network (C-RAN) processing. The proposed architecture supports three 5G waveform candidates and is shown to be upgradable, resource-efficient and cost-effective. Through hardware virtualization, enabled by dynamic partial reconfiguration (DPR), the design space exploration of our architecture exceeds the hardware resources available on the Zynq xc7z020 device. Moreover, dynamic frequency scaling (DFS) enables the runtime adjustment of processing throughput and power reductions by up to 88%. The combined resource overhead for DPR and DFS is very low, and the reconfiguration latency stays two orders of magnitude below the control plane latency requirements proposed for 5G communications

IntechOpen