63 research outputs found

    Accelerating Quadrature Methods for Option Valuation

    Full text link

    Performance Modeling of Virtualized Custom Logic Computations

    Get PDF
    Virtualization of custom logic computations (i.e., by sharing a fixed function across distinct data streams) provides a means of reusing hardware resources, particularly when resources are limited. This is common practice in traditional processors where more than one user can share processor resources. In this paper, we virtualize a custom logic block using C-slow techniques to support fine-grain context-switching. We then develop and present an analytic model for several performance measures (throughput, latency, input queue occupancy) for both fine-grained and coarse-grained context switching (to a secondary memory). Next, we calibrate the analytic performance model with empirical measurements. We then validate the model via discrete-event simulation and use the model to predict the performance and develop optimal schedules for virtualized logic computations. We present results for a Taylor series expansion of a cosine function with added feedback and an AES encryption cipher

    Pipelining Saturated Accumulation

    Get PDF
    Aggressive pipelining and spatial parallelism allow integrated circuits (e.g., custom VLSI, ASICs, and FPGAs) to achieve high throughput on many Digital Signal Processing applications. However, cyclic data dependencies in the computation can limit parallelism and reduce the efficiency and speed of an implementation. Saturated accumulation is an important example where such a cycle limits the throughput of signal processing applications. We show how to reformulate saturated addition as an associative operation so that we can use a parallel-prefix calculation to perform saturated accumulation at any data rate supported by the device. This allows us, for example, to design a 16-bit saturated accumulator which can operate at 280 MHz on a Xilinx Spartan-3(XC3S-5000-4) FPGA, the maximum frequency supported by the component's DCM

    A Multithreaded Soft Processor for SoPC Area Reduction

    Full text link
    The growth in size and performance of Field Programmable Gate Arrays (FPGAs) has compelled System-on-a-Programmable-Chip (SoPC) designers to use soft proces-sors for controlling systems with large numbers of intellec-tual property (IP) blocks. Soft processors control IP blocks, which are accessed by the processor either as peripheral de-vices or/and by using custom instructions (CIs). In large systems, chip multiprocessors (CMPs) are used to execute many programs concurrently. When these programs require the use of the same IP blocks which are accessed as periph-eral devices, they may have to stall waiting for their turn. In the case of CIs, the FPGA logic blocks that implement the CIs may have to be replicated for each processor. In both of these cases FPGA area is wasted, either by idle soft processors or the replication of CI logic blocks. This paper presents a multithreaded (MT) soft processor for area reduction in SoPC implementations. An MT proces-sor allows multiple programs to access the same IP without the need for the logic replication or the replication of whole processors. We first designed a single-threaded processor that is instruction-set compatible to Altera’s Nios II soft processor. Our processor is approximately the same size as the Nios II Economy version, with equivalent performance. We augmented our processor to have 4-way interleaved mul-tithreading capabilities. This paper compares the area us-age and performance of the MT processor versus two CMP systems, using Altera’s and our single-threaded processors, separately. Our results show that we can achieve an area savings of about 45 % for the processor itself, in addition to the area savings due to not replicating CI logic blocks. 1

    Pipelining saturated accumulation

    Full text link

    Semi-automated Design of High-performance Digital Circuits with Xilinx FPGAs

    Get PDF
    Tato diplomová práce se zabývá návrhem sekvenčních digitálních obvodů s ohledem na optimalizaci zpoždění. V práci je popsána problematika dvou technik, které jsou běžně používané při optimalizaci – stručně je popsána technika tzv. synchronizace registrů (angl. retiming), větší pozornost je však věnována technice tzv. zřetězení (angl. pipelining). V rámci praktické části byla vypracována forma abstrakce sekvenčních digitálních obvodů pomocí acyklických orientovaných grafů. Obvod je tak přenesen do roviny, ve které je jednodušší jej transformovat. Zároveň je představen nástroj pro polo-automatickou optimalizaci digitálních obvodů vyvíjených v prostředí Xilinx ISE Design Suite využitím techniky zřetězení.This master's thesis deals with sequential digital circuit design optimization concerning delay optimization. Two techniques commonly used for the optimization are described in the thesis – a brief description of the retiming technique and a more in-depth description of the pipelining technique. A form of abstraction of sequential digital circuits using Directed Acyclic Graphs (DAGs) was developed in the practical part of the thesis. This abstraction represents the circuit in a more manageable way for transformations. At the same time, a tool for semi-automatic digital circuit optimization using pipelining is introduced. This tool is compatible with Xilinx ISE Design Suite.

    Parallelization of dynamic programming recurrences in computational biology

    Get PDF
    The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays: FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are between 15-130x faster than a modern dual core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 3 GHz Intel Core 2 Duo processors. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors. Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms

    Design synthesis for dynamically reconfigurable logic systems

    Get PDF
    Dynamic reconfiguration of logic circuits has been a research problem for over four decades. While applications using logic reconfiguration in practical scenarios have been demonstrated, the design of these systems has proved to be a difficult process demanding the skills of an experienced reconfigurable logic design expert. This thesis proposes an automatic synthesis method which relieves designers of some of the difficulties associated with designing partially dynamically reconfigurable systems. A new design abstraction model for reconfigurable systems is proposed in order to support design exploration using the presented method. Given an input behavioural model, a technology server and a set of design constraints, the method will generate a reconfigurable design solution in the form of a 3D floorplan and a configuration schedule. The approach makes use of genetic algorithms. It facilitates global optimisation to accommodate multiple design objectives common in reconfigurable system design, while making realistic estimates of configuration overheads and of the potential for resource sharing between configurations. A set of custom evolutionary operators has been developed to cope with a multiple-objective search space. Furthermore, the application of a simulation technique verifying the lll results of such an automatic exploration is outlined in the thesis. The qualities of the proposed method are evaluated using a set of benchmark designs taking data from a real reconfigurable logic technology. Finally, some extensions to the proposed method and possible research directions are discussed
    corecore