835 research outputs found
FPGA for High-Performance Bit-Serial Pipeline Datapath
Abstract | In this paper, we introduce our work on the chip design of a new FPGA chip for highperformance bit-serial pipeline datapath which is customized both in the logic architecture and routing architecture. The chip consists of 200k transistors on 3.5mm square substrate (excluding the IO pad area) using 0.5 2-metal process technology. The estimated clock frequency is 156MHz
BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable Computing
Matrix-matrix multiplication is a key computational kernel for numerous
applications in science and engineering, with ample parallelism and data
locality that lends itself well to high-performance implementations. Many
matrix multiplication-dependent applications can use reduced-precision integer
or fixed-point representations to increase their performance and energy
efficiency while still offering adequate quality of results. However, precision
requirements may vary between different application phases or depend on input
data, rendering constant-precision solutions ineffective. We present BISMO, a
vectorized bit-serial matrix multiplication overlay for reconfigurable
computing. BISMO utilizes the excellent binary-operation performance of FPGAs
to offer a matrix multiplication performance that scales with required
precision and parallelism. We characterize the resource usage and performance
of BISMO across a range of parameters to build a hardware cost model, and
demonstrate a peak performance of 6.5 TOPS on the Xilinx PYNQ-Z1 board.Comment: To appear at FPL'1
A Standalone FPGA-based Miner for Lyra2REv2 Cryptocurrencies
Lyra2REv2 is a hashing algorithm that consists of a chain of individual
hashing algorithms, and it is used as a proof-of-work function in several
cryptocurrencies. The most crucial and exotic hashing algorithm in the
Lyra2REv2 chain is a specific instance of the general Lyra2 algorithm. This
work presents the first hardware implementation of the specific instance of
Lyra2 that is used in Lyra2REv2. Several properties of the aforementioned
algorithm are exploited in order to optimize the design. In addition, an
FPGA-based hardware implementation of a standalone miner for Lyra2REv2 on a
Xilinx Multi-Processor System on Chip is presented. The proposed Lyra2REv2
miner is shown to be significantly more energy efficient than both a GPU and a
commercially available FPGA-based miner. Finally, we also explain how the
simplified Lyra2 and Lyra2REv2 architectures can be modified with minimal
effort to also support the recent Lyra2REv3 chained hashing algorithm.Comment: 13 pages, accepted for publication in IEEE Trans. Circuits Syst. I.
arXiv admin note: substantial text overlap with arXiv:1807.0576
Realizing arbitrary-precision modular multiplication with a fixed-precision multiplier datapath
Within the context of cryptographic hardware, the term scalability refers to the ability to process operands of any size, regardless of the precision of the underlying data path or registers. In this paper we present a simple yet effective technique for increasing the scalability of a fixed-precision Montgomery multiplier. Our idea is to extend the datapath of a Montgomery multiplier in such a way that it can also perform an ordinary multiplication of two n-bit operands (without modular reduction), yielding a 2n-bit result. This
conventional (nxn->2n)-bit multiplication is then used as a âsub-routineâ to realize arbitrary-precision Montgomery multiplication according to standard software algorithms such as Coarsely Integrated Operand Scanning (CIOS). We
show that performing a 2n-bit modular multiplication on an n-bit multiplier can be done in 5n clock cycles, whereby we assume that the n-bit modular multiplication takes n cycles. Extending a Montgomery multiplier for this extra
functionality requires just some minor modifications of the datapath and entails a slight increase in silicon area
Digital implementation of the cellular sensor-computers
Two different kinds of cellular sensor-processor architectures are used nowadays in various
applications. The first is the traditional sensor-processor architecture, where the sensor and the
processor arrays are mapped into each other. The second is the foveal architecture, in which a
small active fovea is navigating in a large sensor array. This second architecture is introduced
and compared here. Both of these architectures can be implemented with analog and digital
processor arrays. The efficiency of the different implementation types, depending on the used
CMOS technology, is analyzed. It turned out, that the finer the technology is, the better to use
digital implementation rather than analog
A CAD Tool for Synthesizing Optimized Variants of Altera\u27s Nios II Soft-Core Processor
Soft-core processors offer embedded system designers the benefits of customization, flexibility and reusability. Altera\u27s NIOS II soft-core processor is a popular, commercially available soft-core processor that can be implemented on a variety of Altera FPGAs. In this thesis, the Nios II soft-core processor from Altera Corporation was studied and a VHDL implementation, called UW_Nios II, was developed. UW_Nios II was developed to enable us to perform design space exploration (DSE) for the Nios II processor. It was evaluated and compared with Altera Nios II and shown to be competitive. SCBuild is an existing CAD tool that was developed to enable DSE of soft-core processors. We modified SCBuild to automatically explore the design space of the UW_Nios II using a genetic algorithm. This tool can accurately estimate the area and critical path delay of different variants of the UW_Nios II on a field programmable gate array. Through experiments conducted using SCBuild, it was shown that employing a genetic algorithm to explore the design space of parameterized Nios II core, with a large design space, helps designers find optimized variants of UW_Nios II
An FPGA-based quantum computing emulation framework based on serial-parallel architecture
Hardware emulation of quantum systems can mimic more efficiently the parallel behaviour of quantum computations, thus allowing higher processing speed-up than software simulations. In this paper, an efficient hardware emulation method that employs a serial- parallel hardware architecture targeted for field programmable gate array (FPGA) is proposed. Quantum Fourier transform and Groverâs search are chosen as case studies in this work since they are the core of many useful quantum algorithms. Experimental work shows that, with the proposed emulation architecture, a linear reduction in resource utilization is attained against the pipeline implementations proposed in prior works. The proposed work contributes to the formulation of a proof-of-concept baseline FPGA emulation framework with optimization on datapath designs that can be extended to emulate practical large-scale quantum circuits
Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing
Matrix-matrix multiplication is a key computational kernel for numerous
applications in science and engineering, with ample parallelism and data
locality that lends itself well to high-performance implementations. Many
matrix multiplication-dependent applications can use reduced-precision integer
or fixed-point representations to increase their performance and energy
efficiency while still offering adequate quality of results. However, precision
requirements may vary between different application phases or depend on input
data, rendering constant-precision solutions ineffective. BISMO, a vectorized
bit-serial matrix multiplication overlay for reconfigurable computing,
previously utilized the excellent binary-operation performance of FPGAs to
offer a matrix multiplication performance that scales with required precision
and parallelism. We show how BISMO can be scaled up on Xilinx FPGAs using an
arithmetic architecture that better utilizes 6-LUTs. The improved BISMO
achieves a peak performance of 15.4 binary TOPS on the Ultra96 board with a
Xilinx UltraScale+ MPSoC.Comment: Invited paper at ACM TRETS as extension of FPL'18 paper
arXiv:1806.0886
Square-rich fixed point polynomial evaluation on FPGAs
Polynomial evaluation is important across a wide range of application domains, so significant work has been done on accelerating its computation. The conventional algorithm, referred to as Horner's rule, involves the least number of steps but can lead to increased latency due to serial computation. Parallel evaluation algorithms such as Estrin's method have shorter latency than Horner's rule, but achieve this at the expense of large hardware overhead. This paper presents an efficient polynomial evaluation algorithm, which reforms the evaluation process to include an increased number of squaring steps. By using a squarer design that is more efficient than general multiplication, this can result in polynomial evaluation with a 57.9% latency reduction over Horner's rule and 14.6% over Estrin's method, while consuming less area than Horner's rule, when implemented on a Xilinx Virtex 6 FPGA. When applied in fixed point function evaluation, where precision requirements limit the rounding of operands, it still achieves a 52.4% performance gain compared to Horner's rule with only a 4% area overhead in evaluating 5th degree polynomials
- âŠ