238 research outputs found

    Implementing the 2-D Wavelet Transform on SIMD-Enhanced General-Purpose Processors


    GRAAL: A Framework for Low-Power 3D Graphics Accelerators


    Reconfigurable Universal SAD/Multiplier Array

    In this paper, we investigate the collapsing of several multi-operand addition related operations into a single array. More specifically, we consider multiplication and the Sum of Absolute Differences (SAD) and propose an array capable of performing these operations for unsigned, signed-magnitude, and two's complement notations. The array, called a universal array, is divided into common and controlled logic blocks intended to be reconfigured dynamically. The proposed unit is built around three main operational fields, which are fed with the partial products or SAD addition terms needed to compute the desired operation. It is estimated that 66.6% of the (3:2) counter array is shared by the operations, providing an opportunity to reduce reconfiguration times. The synthesis results of the new structure for an FPGA device were compared against other multiplier organizations. They indicate that the proposed unit can process a 16-bit multiplication in 23.9 ns and an 8-input SAD in 29.8 ns using current FPGA technology. Even though the proposed structure incorporates more operations, the extra delay over conventional structures is very small (on the order of 1% compared to a Baugh-Wooley multiplier).
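
    The shared-array idea can be made concrete in software: the minimal C sketch below (our own illustration under simplified assumptions, not the paper's hardware design) shows that a 16-bit multiplication and an 8-input SAD both boil down to summing a list of addends, which is precisely the part a common (3:2) counter array implements.

        /* Minimal software sketch (not the paper's hardware): a 16-bit multiplication
         * and an 8-input SAD both reduce to summing a list of addends, which is the
         * part a shared (3:2) counter array implements in the proposed unit. */
        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* Stands in for the shared multi-operand reduction (the counter array). */
        static uint32_t reduce_sum(const uint32_t *addends, int n)
        {
            uint32_t sum = 0;
            for (int i = 0; i < n; i++)
                sum += addends[i];
            return sum;
        }

        /* Unsigned 16x16 multiplication: one shifted partial product per set bit. */
        static uint32_t mul16(uint16_t a, uint16_t b)
        {
            uint32_t pp[16];
            for (int i = 0; i < 16; i++)
                pp[i] = ((b >> i) & 1) ? (uint32_t)a << i : 0;
            return reduce_sum(pp, 16);
        }

        /* 8-input SAD: one absolute-difference addend per sample pair. */
        static uint32_t sad8(const uint8_t x[8], const uint8_t y[8])
        {
            uint32_t d[8];
            for (int i = 0; i < 8; i++)
                d[i] = (uint32_t)abs((int)x[i] - (int)y[i]);
            return reduce_sum(d, 8);
        }

        int main(void)
        {
            const uint8_t x[8] = {10, 20, 30, 40, 50, 60, 70, 80};
            const uint8_t y[8] = {12, 18, 33, 40, 45, 66, 70, 75};
            printf("mul16: %u, sad8: %u\n", mul16(1234, 5678), sad8(x, y));
            return 0;
        }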

    Serial Binary Multiplication with Feed-Forward Neural Networks

    In this paper we propose no-learning based neural networks for serial binary multiplication. We show that for "subarray-wise" generation of the partial product matrix and a data transmission rate of δ bits per cycle, the serial multiplication of two n-bit operands can be computed in ⌈n/δ⌉ serial cycles with an O(nδ)-size neural network, with maximum fan-in and weight values both in the order of O(δ log δ). The minimum delay for this scheme is in the order of ⌈√n⌉ + log n and corresponds to a data transmission rate of ⌈√n⌉ bits per cycle. For "column-wise" generation of the partial product matrix and a data transmission rate of 1 bit per cycle, the serial multiplication can be achieved with a delay of 2n − 1 + (k + 1)⌈log_k n⌉ using a (k + 1)-size neural network, a maximum weight of 2, and a maximum fan-in of 3k + 1. If a data transmission rate of δ bits per serial cycle is assumed, we prove a delay of ⌈n/δ⌉ + (δ + 1)⌈log n⌉ for a (δ + 1)(n − 1)-size neural network, a maximum weight of 2, and a maximum fan-in of 3δ + 1.
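
    As a rough illustration of the cycle count discussed above, the following behavioral C sketch (our own simplification, not the paper's neural-network construction) consumes the multiplier delta bits per serial cycle and accumulates the corresponding partial products, finishing after ⌈n/δ⌉ cycles.

        /* Behavioral sketch of serial multiplication with a data rate of delta bits
         * per cycle: the n-bit multiplier is consumed in ceil(n/delta) serial cycles,
         * each cycle contributing a small group of partial products to the running
         * sum. This models only the cycle count, not the paper's neural adders. */
        #include <stdint.h>
        #include <stdio.h>

        static uint64_t serial_multiply(uint32_t a, uint32_t b, int n, int delta, int *cycles)
        {
            uint64_t acc = 0;
            *cycles = 0;
            for (int base = 0; base < n; base += delta) {   /* one serial cycle per group */
                for (int i = base; i < base + delta && i < n; i++)
                    if ((b >> i) & 1u)
                        acc += (uint64_t)a << i;
                (*cycles)++;
            }
            return acc;                                     /* *cycles == ceil(n/delta) */
        }

        int main(void)
        {
            int cycles;
            uint64_t p = serial_multiply(40503u, 30001u, 16, 4, &cycles);
            printf("product = %llu in %d serial cycles\n", (unsigned long long)p, cycles);
            return 0;
        }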

    Reconfigurable Fixed Point Dense and Sparse Matrix-Vector Multiply/Add Unit

    In this paper, we propose a reconfigurable hardware accelerator for fixed-point matrix-vector multiply/add operations, capable of operating on both dense and sparse matrix formats. The prototyped hardware unit accommodates 4 dense or sparse matrix inputs and performs computations in a space-parallel design, achieving 4 multiplications and up to 12 additions at 120 MHz on an xc2vp100-6 FPGA device, for a throughput of 1.9 GOPS. A total of 11 units can be integrated in the same FPGA chip, achieving a performance of 21 GOPS.
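
    To make the accelerated operation concrete, here is a plain-C software sketch of fixed-point matrix-vector multiply/add in dense and CSR sparse form; the Q16.16 format and the CSR layout are illustrative assumptions, not the unit's actual data formats or degree of parallelism.

        /* Software sketch of fixed-point matrix-vector multiply/add in dense and
         * CSR sparse form. The Q16.16 format and CSR layout are assumptions for
         * illustration; the paper's unit defines its own formats and parallelism. */
        #include <stdint.h>
        #include <stdio.h>

        typedef int32_t q16_16;              /* fixed point: 16 integer, 16 fraction bits */

        static q16_16 q_mul(q16_16 a, q16_16 b) { return (q16_16)(((int64_t)a * b) >> 16); }

        /* y += A*x for a dense rows x cols matrix stored row-major. */
        static void dense_mv_add(int rows, int cols, const q16_16 *A,
                                 const q16_16 *x, q16_16 *y)
        {
            for (int r = 0; r < rows; r++)
                for (int c = 0; c < cols; c++)
                    y[r] += q_mul(A[r * cols + c], x[c]);
        }

        /* y += A*x for a CSR matrix: row_ptr[r]..row_ptr[r+1] index col_idx/val. */
        static void csr_mv_add(int rows, const int *row_ptr, const int *col_idx,
                               const q16_16 *val, const q16_16 *x, q16_16 *y)
        {
            for (int r = 0; r < rows; r++)
                for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++)
                    y[r] += q_mul(val[k], x[col_idx[k]]);
        }

        int main(void)
        {
            /* 2x2 example: A = [[1.5, 0], [0, 2.0]], x = [2.0, 3.0]. */
            q16_16 A[4] = {3 << 15, 0, 0, 2 << 16};
            q16_16 x[2] = {2 << 16, 3 << 16}, y_dense[2] = {0, 0}, y_csr[2] = {0, 0};
            int row_ptr[3] = {0, 1, 2}, col_idx[2] = {0, 1};
            q16_16 val[2] = {3 << 15, 2 << 16};       /* nonzeros of A in CSR order */
            dense_mv_add(2, 2, A, x, y_dense);
            csr_mv_add(2, row_ptr, col_idx, val, x, y_csr);
            printf("dense: %.2f %.2f  csr: %.2f %.2f\n",
                   y_dense[0] / 65536.0, y_dense[1] / 65536.0,
                   y_csr[0] / 65536.0, y_csr[1] / 65536.0);
            return 0;
        }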

    Reconfigurable Repetitive Padding Unit

    This paper proposes a reconfigurable processing unit that performs the MPEG-4 repetitive padding algorithm in real time. The padding unit has been implemented as a scalable systolic structure of processing elements. A generic array of PEs has been described in VHDL, and the functionality of the unit has been validated by simulations. In order to determine the chip area and speed of the padding structure, we have synthesized it for two FPGA families, Xilinx and Altera. The simulation results indicate that the proposed padding unit can operate over a wide frequency range, depending on the implemented configuration. It is shown that it can process from tens up to hundreds of thousands of MPEG-4 macroblocks per second. This allows the real-time requirements of all MPEG-4 profiles and levels to be met efficiently at trivial hardware cost. Finally, the trade-off between chip area and operating speed is discussed and possible configuration alternatives are proposed.
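
    For context, the following simplified C model sketches the repetitive padding the unit implements for a boundary macroblock: a horizontal pass fills transparent pixels from the nearest opaque pixel in each row, and a vertical pass fills whatever remains. Rounding details and the extended padding of fully transparent blocks are deliberately omitted, so this only approximates the MPEG-4 procedure.

        /* Simplified model of MPEG-4 repetitive padding for one 16x16 boundary
         * macroblock: a horizontal pass fills transparent pixels from the nearest
         * opaque pixel in the row (average when between two), then a vertical pass
         * fills the pixels that are still undefined. Rounding details and extended
         * padding of fully transparent blocks are left out of this sketch. */
        #include <stdint.h>
        #include <string.h>
        #include <stdio.h>

        #define MB 16

        /* Pad one line of MB pixels; 'def' flags the originally defined pixels. */
        static void pad_line(uint8_t *v, const uint8_t *def, int stride)
        {
            int left = -1;
            for (int i = 0; i < MB; i++) {
                if (def[i * stride]) { left = i; continue; }
                int right = -1;
                for (int j = i + 1; j < MB; j++)
                    if (def[j * stride]) { right = j; break; }
                if (left >= 0 && right >= 0)
                    v[i * stride] = (uint8_t)((v[left * stride] + v[right * stride]) / 2);
                else if (left >= 0)
                    v[i * stride] = v[left * stride];
                else if (right >= 0)
                    v[i * stride] = v[right * stride];
            }
        }

        /* tex: 16x16 texture (row-major), alpha: nonzero where the pixel is opaque. */
        static void repetitive_pad(uint8_t tex[MB * MB], const uint8_t alpha[MB * MB])
        {
            uint8_t def[MB * MB], def_h[MB * MB];
            for (int i = 0; i < MB * MB; i++)
                def[i] = def_h[i] = (alpha[i] != 0);

            for (int y = 0; y < MB; y++) {            /* horizontal pass, row by row */
                int any = 0;
                for (int x = 0; x < MB; x++) any |= def[y * MB + x];
                if (any) {
                    pad_line(&tex[y * MB], &def[y * MB], 1);
                    memset(&def_h[y * MB], 1, MB);    /* padded rows count as defined */
                }
            }
            for (int x = 0; x < MB; x++)              /* vertical pass, column by column */
                pad_line(&tex[x], &def_h[x], MB);
        }

        int main(void)
        {
            uint8_t tex[MB * MB] = {0}, alpha[MB * MB] = {0};
            for (int y = 0; y < 4; y++)               /* opaque 4x4 patch, value 100 */
                for (int x = 0; x < 4; x++) { tex[y * MB + x] = 100; alpha[y * MB + x] = 1; }
            repetitive_pad(tex, alpha);
            printf("padded corner pixels: %d %d\n", tex[15], tex[MB * MB - 1]);
            return 0;
        }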

    Integrating Uni- and Multicast Scheduling in Buffered Crossbar Switches

    Internet traffic is a mixture of unicast and multicast flows. Integrated schedulers capable of dealing with both traffic types have been designed mainly for Input Queued (IQ) buffer-less crossbar switches. Combined Input and Crossbar Queued (CICQ) switches, on the other hand, are known to perform better than their buffer-less predecessors due to their potential for simplifying the scheduling and improving the switching performance. The design of integrated schedulers for CICQ switches has thus far been neglected. In this paper, we propose a novel CICQ architecture that supports both unicast and multicast traffic, along with an appropriate scheduling algorithm. In particular, we propose an integrated round-robin based scheduler that efficiently services both unicast and multicast traffic simultaneously. Our scheme, named Multicast and Unicast Round robin Scheduling (MURS), is shown to outperform all existing schemes while keeping the hardware requirements simple. Simulation results suggest that the size of the internal buffers can be traded for the number of input multicast queues.
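
    To illustrate the round-robin principle such schedulers build on (a toy sketch only, not the MURS algorithm), the following C fragment shows a single rotating pointer granting service across a few unicast virtual output queues and one multicast queue at an input.

        /* Toy illustration of integrated round-robin arbitration at one switch input:
         * a single rotating pointer visits N unicast VOQs plus one multicast queue
         * and grants the first non-empty queue after the pointer. This is only a
         * sketch of the round-robin idea, not the MURS scheduler of the paper. */
        #include <stdio.h>

        #define NQ 5                      /* e.g. 4 unicast VOQs + 1 multicast queue */

        typedef struct {
            int occupancy[NQ];            /* queued cells per queue */
            int pointer;                  /* queue considered first this cycle */
        } rr_arbiter;

        /* Returns the granted queue index, or -1 if all queues are empty. */
        static int rr_grant(rr_arbiter *a)
        {
            for (int i = 0; i < NQ; i++) {
                int q = (a->pointer + i) % NQ;
                if (a->occupancy[q] > 0) {
                    a->occupancy[q]--;
                    a->pointer = (q + 1) % NQ;   /* advance past the granted queue */
                    return q;
                }
            }
            return -1;
        }

        int main(void)
        {
            rr_arbiter a = { .occupancy = {2, 0, 1, 0, 3}, .pointer = 0 };
            for (int cycle = 0; cycle < 7; cycle++)
                printf("cycle %d: grant queue %d\n", cycle, rr_grant(&a));
            return 0;
        }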

    Reconfigurable Multiple Operation Array

    In this paper, we investigate the collapsing of eight multi-operand addition related operations into a single, common (3:2) counter array. For this unit we consider multiplication in integer and fractional representations and the Sum of Absolute Differences (SAD) in unsigned, signed-magnitude, and two's complement notation. Furthermore, the unit also incorporates a Multiply-Accumulate (MAC) unit for two's complement notation. The proposed multiple-operation unit is built around 10 element arrays that can be reduced using well-known counter techniques and are fed with the data needed to perform the eight proposed operations. It is estimated that 6/8 of the basic (3:2) counter array is shared by the operations. The obtained results indicate that the unit can process a 4x4 SAD macro-block in 36.35 ns and takes 30.43 ns to process the remaining operations on a Virtex-II Pro xc2vp100-7ff1696 FPGA device.