38 research outputs found

    ASC: A stream compiler for computing with FPGAs

    Optimizing hardware function evaluation

    Object-oriented domain specific compilers for programming FPGAs

    Hotspot detection of SPEC CPU 2006 benchmarks with performance event counters

    Hotspot is the part of a program where most execution time is spent. Detecting the hotspot enables the optimization of the program. The performance event counters embedded in modern processors provide the hardware support for the hotspot detection. By sampling the instruction addresses of the running program with performance event counters, hotspot of the program can be statistically detected. This technical report describes our tool to find the sections of the code that are detected as the hotspot of the program with performance event counters. SPEC CPU 2006 benchmarks are tested with our tool and the results show the hotspot sections and overhead of the hotspot detection tool.
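The statistical step described in this abstract (attributing sampled instruction addresses to code sections) can be illustrated with a minimal sketch. It is not the authors' tool: the function name `find_hotspots`, the `(start, end, name)` region format, and the `<unknown>` bucket are all assumptions made for illustration, and the samples are taken as already collected rather than read from real performance counters.

```python
from bisect import bisect_right
from collections import Counter


def find_hotspots(samples, regions, top=3):
    """Rank code regions by the share of sampled addresses that fall in them.

    samples: iterable of sampled instruction addresses (ints)
    regions: non-overlapping (start, end, name) tuples, end exclusive
             (hypothetical format, e.g. derived from a symbol table)
    """
    regions = sorted(regions)
    starts = [start for start, _end, _name in regions]
    hits = Counter()
    total = 0
    for addr in samples:
        total += 1
        # Find the last region whose start address is <= addr.
        i = bisect_right(starts, addr) - 1
        if i >= 0 and addr < regions[i][1]:
            hits[regions[i][2]] += 1
        else:
            hits["<unknown>"] += 1
    # Report each region's fraction of all samples, most frequent first.
    return [(name, count / total) for name, count in hits.most_common(top)]
```

Called with four samples, three of which fall inside a region named `loop_a` and one inside `loop_b`, the function ranks `loop_a` first with a 0.75 share. More samples give a more reliable ranking, at the cost of higher sampling overhead.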

    Dataflow Design for Optimal Incremental SVM Training

    This paper proposes a new parallel architecture for incremental training of a Support Vector Machine (SVM), which produces an optimal solution based on manipulating the Karush-Kuhn-Tucker (KKT) conditions. Compared to batch training methods, our approach avoids re-training from scratch when training dataset changes. The proposed architecture is the first to adopt an efficient dataflow organisation. The main novelty is a parametric description of the parallel dataflow architecture, which deploys customisable arithmetic units for dense linear algebraic operations involved in updating the KKT conditions. The proposed architecture targets on-line SVM training applications. Experimental evaluation with real world financial data shows that our architecture implemented on Stratix-V FPGA achieved significant speedup against LIBSVM on Core i7-4770 CPU

    Optimizing logarithmic arithmetic on FPGAs

    This paper proposes optimizations of the methods and parameters used in both mathematical approximation and hardware design for logarithmic number system (LNS) arithmetic. First, we introduce a general polynomial approximation approach with an adaptive divide-in-halves segmentation method for evaluation of LNS arithmetic functions. Second, we develop a library generator that automatically generates optimized LNS arithmetic units with a wide bit-width range from 21 to 64 bits, to support LNS application development and design exploration. The basic arithmetic units are tested on practical FPGA boards as well as software simulation. When compared with existing LNS designs, our generated units provide in most case
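A rough sketch of what an adaptive divide-in-halves segmentation could look like is given here, under several assumptions that are not taken from the paper: each segment uses a degree-1 polynomial (a straight line through the segment endpoints), the error is estimated by probing a fixed number of points, and the target function is log2(1 + 2^x), the standard function needed for LNS addition. All function names are hypothetical.

```python
import math


def lns_add_fn(x):
    """log2(1 + 2^x), the function evaluated for LNS addition."""
    return math.log2(1.0 + 2.0 ** x)


def max_linear_error(f, lo, hi, probes=64):
    """Estimate the worst error of the line through (lo, f(lo)) and (hi, f(hi))."""
    y0, y1 = f(lo), f(hi)
    worst = 0.0
    for k in range(probes + 1):
        x = lo + (hi - lo) * k / probes
        approx = y0 + (y1 - y0) * (x - lo) / (hi - lo)
        worst = max(worst, abs(f(x) - approx))
    return worst


def segment(f, lo, hi, tol, max_depth=20):
    """Split [lo, hi] in halves until every segment meets the error tolerance."""
    if max_depth == 0 or max_linear_error(f, lo, hi) <= tol:
        return [(lo, hi)]
    mid = 0.5 * (lo + hi)
    return (segment(f, lo, mid, tol, max_depth - 1)
            + segment(f, mid, hi, tol, max_depth - 1))
```

For `segment(lns_add_fn, -16.0, 0.0, 1e-3)` the resulting segments are wide where the function is nearly flat (towards -16) and narrow where it curves most (towards 0), which is the point of adapting the segmentation instead of using uniform segments.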

    Parameterized function evaluation for FPGAs

    This paper presents parameterized module-generators for pipelined function evaluation using lookup tables, adders, shifters and multipliers. We discuss trade-offs involved between (1) full-lookup tables, (2) bipartite (lookup-add) units, (3) lookup-multiply units, and (4) shift-and-add based CORDIC units. For lookup-multiply units we provide equations estimating approximation errors and rounding errors which are used to parameterize the hardware units. The resources and performance of the resulting design can be estimated given the input parameters. The method is implemented as part of the PAM-Blox module generation environment. An example shows that the table-multiply unit produces competitive designs with data widths up to 20 bits when compared with shift-and-add based CORDIC units. Additionally, the table-multiply method can be used for larger data widths when evaluating functions not supported by CORDIC.
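As a software illustration of the lookup-multiply idea, the sketch below assumes one common arrangement: the high bits of the input select a table entry holding an offset and a slope, and the low bits are multiplied by the slope. This is an assumed reading, not the PAM-Blox implementation; the names `build_table` and `evaluate`, the input range [0, 1), and the use of floating-point coefficients instead of fixed-point words are simplifications for illustration.

```python
import math


def build_table(f, index_bits):
    """Precompute (offset, slope) pairs for 2**index_bits equal segments of [0, 1)."""
    n = 1 << index_bits
    h = 1.0 / n
    table = []
    for i in range(n):
        x0 = i * h
        c0 = f(x0)                 # function value at the segment start
        c1 = (f(x0 + h) - c0) / h  # secant slope across the segment
        table.append((c0, c1))
    return table


def evaluate(table, index_bits, total_bits, x_int):
    """Approximate f(x_int / 2**total_bits) with one lookup and one multiply."""
    low_bits = total_bits - index_bits
    idx = x_int >> low_bits                  # high bits address the table
    frac = x_int & ((1 << low_bits) - 1)     # low bits give the offset in the segment
    dx = frac / (1 << total_bits)
    c0, c1 = table[idx]
    return c0 + c1 * dx
```

With `table = build_table(math.sin, 6)`, calling `evaluate(table, 6, 16, x_int)` approximates the sine of a 16-bit input using a 64-entry table. Raising `index_bits` trades a larger table for a smaller approximation error, the kind of trade-off the abstract says is captured by its error equations.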

    Memory mapping for multi-die FPGAs

    This paper proposes an algorithm for mapping logical to physical memory resources on Field-Programmable Gate Arrays (FPGAs). Our greedy strategy based algorithm is specifically designed to facilitate timing closure on modern multi-die FPGAs for static-dataflow accelerators utilising most of the on-chip resources. The main objective of the proposed algorithm is to ensure that specific sub-parts of the design under consideration can fully reside within a single die to limit inter-die communication. The above is achieved by performing the memory mapping for each sub-part of the design separately while keeping allocation of the available physical resources balanced. As a result the number of inter-die connections is reduced on average by 50% compared to an algorithm targeting minimal area usage for real, complex applications using most of the on-chip’s resources. Additionally, our algorithm is the only one out of the four evaluated approaches which successfully produces place and route results for all 33 applications and benchmarks