1,844 research outputs found
A versatile Montgomery multiplier architecture with characteristic three support
We present a novel unified core design which is extended to realize Montgomery multiplication in the fields GF(2n), GF(3m), and GF(p). Our unified design supports RSA and elliptic curve schemes, as well as the identity-based encryption which requires a pairing computation on an elliptic curve. The architecture is pipelined and is highly scalable. The unified core utilizes the redundant signed digit representation to reduce the critical path delay. While the carry-save representation used in classical unified architectures is only good for addition and multiplication operations, the redundant signed digit representation also facilitates efficient computation of comparison and subtraction operations besides addition and multiplication. Thus, there is no need for a transformation between the redundant and the non-redundant representations of field elements, which would be required in the classical unified architectures to realize the subtraction and comparison operations. We also quantify the benefits of the unified architectures in terms of area and critical path delay. We provide detailed implementation results. The metric shows that the new unified architecture provides an improvement over a hypothetical non-unified architecture of at least 24.88%, while the improvement over a classical unified architecture is at least 32.07%
A high-speed integrated circuit with applications to RSA Cryptography
Merged with duplicate record 10026.1/833 on 01.02.2017 by CS (TIS)The rapid growth in the use of computers and networks in government, commercial and
private communications systems has led to an increasing need for these systems to be
secure against unauthorised access and eavesdropping. To this end, modern computer
security systems employ public-key ciphers, of which probably the most well known is the
RSA ciphersystem, to provide both secrecy and authentication facilities.
The basic RSA cryptographic operation is a modular exponentiation where the modulus
and exponent are integers typically greater than 500 bits long. Therefore, to obtain reasonable
encryption rates using the RSA cipher requires that it be implemented in hardware.
This thesis presents the design of a high-performance VLSI device, called the WHiSpER
chip, that can perform the modular exponentiations required by the RSA cryptosystem
for moduli and exponents up to 506 bits long. The design has an expected throughput
in excess of 64kbit/s making it attractive for use both as a general RSA processor within
the security function provider of a security system, and for direct use on moderate-speed
public communication networks such as ISDN.
The thesis investigates the low-level techniques used for implementing high-speed arithmetic
hardware in general, and reviews the methods used by designers of existing modular
multiplication/exponentiation circuits with respect to circuit speed and efficiency.
A new modular multiplication algorithm, MMDDAMMM, based on Montgomery arithmetic,
together with an efficient multiplier architecture, are proposed that remove the
speed bottleneck of previous designs.
Finally, the implementation of the new algorithm and architecture within the WHiSpER
chip is detailed, along with a discussion of the application of the chip to ciphering and key
generation
Optimization of Supersingular Isogeny Cryptography for Deeply Embedded Systems
Public-key cryptography in use today can be broken by a quantum computer with sufficient resources. Microsoft Research has published an open-source library of quantum-secure supersingular isogeny (SI) algorithms including Diffie-Hellman key agreement and key encapsulation in portable C and optimized x86 and x64 implementations. For our research, we modified this library to target a deeply-embedded processor with instruction set extensions and a finite-field coprocessor originally designed to accelerate traditional elliptic curve cryptography (ECC). We observed a 6.3-7.5x improvement over a portable C implementation using instruction set extensions and a further 6.0-6.1x improvement with the addition of the coprocessor. Modification of the coprocessor to a wider datapath further increased performance 2.6-2.9x. Our results show that current traditional ECC implementations can be easily refactored to use supersingular elliptic curve arithmetic and achieve post-quantum security
BRAMAC: Compute-in-BRAM Architectures for Multiply-Accumulate on FPGAs
Deep neural network (DNN) inference using reduced integer precision has been
shown to achieve significant improvements in memory utilization and compute
throughput with little or no accuracy loss compared to full-precision
floating-point. Modern FPGA-based DNN inference relies heavily on the on-chip
block RAM (BRAM) for model storage and the digital signal processing (DSP) unit
for implementing the multiply-accumulate (MAC) operation, a fundamental DNN
primitive. In this paper, we enhance the existing BRAM to also compute MAC by
proposing BRAMAC (Compute-in-AM
rchitectures for
ultiply-cumulate). BRAMAC supports
2's complement 2- to 8-bit MAC in a small dummy BRAM array using a hybrid
bit-serial & bit-parallel data flow. Unlike previous compute-in-BRAM
architectures, BRAMAC allows read/write access to the main BRAM array while
computing in the dummy BRAM array, enabling both persistent and tiling-based
DNN inference. We explore two BRAMAC variants: BRAMAC-2SA (with 2 synchronous
dummy arrays) and BRAMAC-1DA (with 1 double-pumped dummy array).
BRAMAC-2SA/BRAMAC-1DA can boost the peak MAC throughput of a large Arria-10
FPGA by 2.6/2.1, 2.3/2.0, and
1.9/1.7 for 2-bit, 4-bit, and 8-bit precisions, respectively at
the cost of 6.8%/3.4% increase in the FPGA core area. By adding
BRAMAC-2SA/BRAMAC-1DA to a state-of-the-art tiling-based DNN accelerator, an
average speedup of 2.05/1.7 and 1.33/1.52 can
be achieved for AlexNet and ResNet-34, respectively across different model
precisions.Comment: 11 pages, 13 figures, 3 tables, FCCM conference 202
Introduction to Logic Circuits & Logic Design with VHDL
The overall goal of this book is to fill a void that has appeared in the instruction of digital circuits over
the past decade due to the rapid abstraction of system design. Up until the mid-1980s, digital circuits
were designed using classical techniques. Classical techniques relied heavily on manual design
practices for the synthesis, minimization, and interfacing of digital systems. Corresponding to this design
style, academic textbooks were developed that taught classical digital design techniques. Around 1990,
large-scale digital systems began being designed using hardware description languages (HDL) and
automated synthesis tools. Broad-scale adoption of this modern design approach spread through the
industry during this decade. Around 2000, hardware description languages and the modern digital
design approach began to be taught in universities, mainly at the senior and graduate level. There
were a variety of reasons that the modern digital design approach did not penetrate the lower levels of
academia during this time. First, the design and simulation tools were difficult to use and overwhelmed
freshman and sophomore students. Second, the ability to implement the designs in a laboratory setting
was infeasible. The modern design tools at the time were targeted at custom integrated circuits, which
are cost- and time-prohibitive to implement in a university setting. Between 2000 and 2005, rapid
advances in programmable logic and design tools allowed the modern digital design approach to be
implemented in a university setting, even in lower-level courses. This allowed students to learn the
modern design approach based on HDLs and prototype their designs in real hardware, mainly field
programmable gate arrays (FPGAs). This spurred an abundance of textbooks to be authored teaching
hardware description languages and higher levels of design abstraction. This trend has continued until
today. While abstraction is a critical tool for engineering design, the rapid movement toward teaching only
the modern digital design techniques has left a void for freshman- and sophomore-level courses in digital
circuitry. Legacy textbooks that teach the classical design approach are outdated and do not contain
sufficient coverage of HDLs to prepare the students for follow-on classes. Newer textbooks that teach
the modern digital design approach move immediately into high-level behavioral modeling with minimal
or no coverage of the underlying hardware used to implement the systems. As a result, students are not
being provided the resources to understand the fundamental hardware theory that lies beneath the
modern abstraction such as interfacing, gate-level implementation, and technology optimization.
Students moving too rapidly into high levels of abstraction have little understanding of what is going
on when they click the “compile and synthesize” button of their design tool. This leads to graduates who
can model a breadth of different systems in an HDL but have no depth into how the system is
implemented in hardware. This becomes problematic when an issue arises in a real design and there
is no foundational knowledge for the students to fall back on in order to debug the problem
Computer Science Principles with Java
This textbook is intended to be used for a first course in computer science, such as the College Board’s Advanced Placement course known as AP Computer Science Principles (CSP). This book includes all the topics on the CSP exam, plus some additional topics. It takes a breadth-first approach, with an emphasis on the principles which form the foundation for hardware and software. No prior experience with programming should be required to use this book. This version of the book uses the Java programming language.https://rdw.rowan.edu/oer/1018/thumbnail.jp
Recommended from our members
New methods for finite field arithmetic
We describe novel methods for obtaining fast software implementations of the arithmetic operations in the finite field GF(p) and GF(p[superscript k]). In GF(p) we realize an extensive speedup in modular addition and subtraction routines and some small speedup in the modular multiplication routine with an arbitrary prime modulus p which is of arbitrary length. The most important feature of the method is that it avoids bit-level operations which are slow on microprocessors and performs word-level operations which are significantly faster. The proposed method has applications in public-key cryptographic algorithms defined over the finite field GF(p), most notably the elliptic curve digital signature algorithm. The new method provides up to 13% speedup in the execution of the ECDSA algorithm over the field GF(p) for the length of p in the range 161≤k≤256. In the finite extension field GF(p[superscript k]) we describe two new methods for obtaining fast software implementations of the modular multiplication operation with an arbitrary prime modulus p, which has less bit-length than the word-length of a microprocessor and an arbitrary generator polynomial. The second algorithm is a significant improvement over the first algorithm by using the same concepts introduced in GF(p) arithmetic
- …