High Performance Biological Pairwise Sequence Alignment: FPGA versus GPU versus Cell BE versus GPP
This paper explores the pros and cons of reconfigurable computing in the form of FPGAs for high-performance, efficient computing. In particular, the paper presents the results of a comparative study between three different acceleration technologies, namely, Field Programmable Gate Arrays (FPGAs), Graphics Processor Units (GPUs), and IBM's Cell Broadband Engine (Cell BE), in the design and implementation of the widely-used Smith-Waterman pairwise sequence alignment algorithm, with general-purpose processors as a base reference implementation. Comparison criteria include speed, energy consumption, and purchase and development costs. The study shows that FPGAs largely outperform all other implementation platforms on the performance-per-watt criterion and perform better than all other platforms on the performance-per-dollar criterion, although by a much smaller margin. Cell BE and GPU come second and third, respectively, on both the performance-per-watt and performance-per-dollar criteria. In general, in order to outperform other technologies on the performance-per-dollar criterion (using currently available hardware and development tools), FPGAs need to achieve at least two orders of magnitude speed-up compared to general-purpose processors and one order of magnitude speed-up compared to domain-specific technologies such as GPUs.
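For readers unfamiliar with the algorithm being accelerated, the following is a minimal, illustrative Smith-Waterman sketch in Python. The function name and the linear gap penalty are assumptions for illustration; the paper's actual implementations target FPGA, GPU, and Cell BE hardware, not software like this.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score between sequences a and b
    using the Smith-Waterman dynamic-programming recurrence with a
    simple linear gap penalty (illustrative parameters)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best local alignment score ending at a[:i], b[:j].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 term lets a local alignment restart anywhere.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

The quadratic dependency of each cell on its three neighbors is what makes the algorithm attractive for systolic FPGA implementations: all cells on an anti-diagonal can be computed in parallel.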
Two-dimensional DCT/IDCT architecture
A fully parallel architecture for the computation of a two-dimensional (2-D) discrete cosine transform (DCT), based on row-column decomposition, is presented. It uses the same one-dimensional (1-D) DCT unit for the row and column computations and (N² + N) registers to perform the transposition. It possesses features of regularity and modularity, and is thus well suited for VLSI implementation. It can be used for the computation of either the forward or the inverse 2-D DCT. Each 1-D DCT unit uses N fully parallel vector inner product (VIP) units. The design of the VIP units is based on a systematic design methodology using radix-2 arithmetic, which allows partitioning of the elements of each vector into small groups. Array multipliers without the final adder are used to produce the different partial product terms. This allows a more efficient use of 4:2 compressors for the accumulation of the products in the intermediate stages and reduces the number of accumulators from N to one. Using this procedure, the 2-D DCT architecture requires fewer than N² multipliers (in terms of area occupied) and only 2N adders. It can compute an N x N-point DCT at a rate of one complete transform per N cycles after an appropriate initial delay.
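The row-column decomposition the abstract relies on can be sketched in a few lines of Python: apply a 1-D DCT to every row, transpose, and apply the same 1-D DCT again, which mirrors how the architecture reuses a single 1-D DCT unit with a transposition register bank. The function names and the unscaled DCT-II definition are assumptions for illustration, not the paper's hardware design.

```python
import math

def dct1d(x):
    """Naive, unscaled 1-D DCT-II of a length-N sequence."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for n in range(N))
            for k in range(N)]

def dct2d(block):
    """2-D DCT by row-column decomposition: 1-D DCT on each row,
    transpose, 1-D DCT on each row again, transpose back."""
    rows = [dct1d(r) for r in block]
    transposed = [list(col) for col in zip(*rows)]   # the (N^2+N)-register step
    cols = [dct1d(c) for c in transposed]
    return [list(col) for col in zip(*cols)]
```

The transposition between the two 1-D passes is exactly the step the (N² + N)-register structure in the paper implements in hardware.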
Pipelined Two-Operand Modular Adders
The pipelined two-operand modular adder (TOMA) is one of the basic components used in digital signal processing (DSP) systems that use the residue number system (RNS). Such modular adders are used in binary/residue and residue/binary converters, residue multipliers and scalers, as well as within residue processing channels. The design of pipelined TOMAs is usually obtained by inserting an appropriate number of latch layers inside a nonpipelined TOMA structure. Hence their area is also determined by the number of latches, and the delay by the number of latch layers. In this paper we propose a new pipelined TOMA, based on a new nonpipelined TOMA, that has a smaller area and smaller delay than other known structures. Comparisons are made using data from a very-large-scale integration (VLSI) standard cell library.
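The basic operation such an adder implements can be sketched as follows: compute a + b and a + b - m in parallel, then select whichever result is in range. This is a behavioral sketch of generic two-operand modular addition, not the specific pipelined structure the paper proposes; the function name is an assumption.

```python
def modular_add(a, b, m):
    """Two-operand modular addition (a + b) mod m, structured the way a
    hardware TOMA computes it: two parallel adders plus a multiplexer.
    Assumes the residues satisfy 0 <= a, b < m."""
    s = a + b      # first adder: the plain sum, in [0, 2m - 2]
    t = s - m      # second adder: the sum reduced by the modulus
    return s if t < 0 else t   # multiplexer: select the in-range result
```

Because both candidate sums are formed concurrently, the critical path is one carry-propagate addition plus a select, which is the structure pipeline latch layers are inserted into.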
What is the Computational Value of Finite Range Tunneling?
Quantum annealing (QA) has been proposed as a quantum enhanced optimization
heuristic exploiting tunneling. Here, we demonstrate how finite range tunneling
can provide considerable computational advantage. For a crafted problem
designed to have tall and narrow energy barriers separating local minima, the
D-Wave 2X quantum annealer achieves significant runtime advantages relative to
Simulated Annealing (SA). For instances with 945 variables, this results in a
time-to-99%-success-probability that is ~10^8 times faster than SA
running on a single processor core. We also compared physical QA with Quantum
Monte Carlo (QMC), an algorithm that emulates quantum tunneling on classical
processors. We observe a substantial constant overhead against physical QA:
D-Wave 2X again runs up to ~10^8 times faster than an optimized
implementation of QMC on a single core. We note that there exist heuristic
classical algorithms that can solve most instances of Chimera structured
problems in a timescale comparable to the D-Wave 2X. However, we believe that
such solvers will become ineffective for the next generation of annealers
currently being designed. To investigate whether finite range tunneling will
also confer an advantage for problems of practical interest, we conduct
numerical studies on binary optimization problems that cannot yet be
represented on quantum hardware. For random instances of the number
partitioning problem, we find numerically that QMC, as well as other algorithms
designed to simulate QA, scale better than SA. We discuss the implications of
these findings for the design of next generation quantum annealers.
Comment: 17 pages, 13 figures. Edited for clarity, in part in response to comments. Added link to benchmark instance.
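As a concrete reference point for the classical baseline mentioned above, the following is a toy simulated-annealing solver for the number partitioning problem: each number gets a sign of +1 or -1 and the objective is to minimize the absolute signed sum. This is an illustrative sketch only; the function name, the linear cooling schedule, and all parameters are assumptions, not the optimized solvers benchmarked in the paper.

```python
import math
import random

def sa_number_partition(nums, steps=20000, t0=10.0, seed=0):
    """Simulated annealing for number partitioning: minimize
    |sum(signs[i] * nums[i])| over sign assignments in {-1, +1}^n."""
    rng = random.Random(seed)
    signs = [rng.choice((-1, 1)) for _ in nums]
    cost = abs(sum(s * x for s, x in zip(signs, nums)))
    best = cost
    for step in range(steps):
        # Linear cooling from t0 down to (almost) zero temperature.
        temp = t0 * (1 - step / steps) + 1e-9
        i = rng.randrange(len(nums))
        signs[i] = -signs[i]                     # propose: flip one sign
        new_cost = abs(sum(s * x for s, x in zip(signs, nums)))
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost                      # accept the move
            best = min(best, cost)
        else:
            signs[i] = -signs[i]                 # reject: undo the flip
    return best
```

Single-bit-flip moves like this are exactly the kind of local update that struggles with the tall, narrow energy barriers the crafted instances in the paper are designed around, which is what finite-range tunneling is claimed to help with.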