    Document Classification Systems in Heterogeneous Computing Environments

    Datacenter workloads demand high-throughput, low-cost, and power-efficient solutions. In most data centers, operating costs dominate the infrastructure cost. The ever-growing amount of data and the critical need for higher-throughput, more energy-efficient document classification solutions motivated us to investigate alternatives to the traditional homogeneous CPU-based implementations of document classification systems. Several heterogeneous systems have been investigated in the past in which CPUs were combined with GPUs and FPGAs as system accelerators. The increasing complexity of FPGAs has made them interesting devices in heterogeneous computing environments, but it has also made them difficult to program using hardware description languages. We explore the trade-offs between high-level synthesis (HLS) and low-level synthesis when programming FPGAs. Low-level synthesis uses fewer hardware resources on the FPGA and offers higher throughput than the HLS tool, while the HLS tool can also target other heterogeneous computing devices such as multicore CPUs and GPUs. Based on our implementation experience and empirical results for data-centric applications, we conclude that power-efficient results can be achieved for this set of applications by programming FPGAs with either low-level or high-level synthesis.

    Experimental Comparison of Synthesis Tools Altera Quartus II and Synthagate

    The paper presents a comparison of the efficiency of Altera Quartus II, an industrial FPGA design tool, and Synthagate, a similar design tool of academic origin from the Syntezza company. The experiments were performed on a series of examples describing Mealy finite state machines (FSMs); one-hot state encoding was used in all cases. Area (the number of logic blocks used) was the main parameter for the comparison. The influence of the style of FSM description (in VHDL) on the quality of synthesis was also studied. The results show that in almost all cases Synthagate performs synthesis more efficiently, and substantially faster, than Altera Quartus II. Section I presents the motivation for the research. Section II recalls the notion of an FSM. Section III describes the problems that had to be solved to ensure the correctness of the experimental comparison. Section IV presents details of the state encoding used in the experiments. Section V presents the experimental results. Section VI describes problems related to the comparison that remain to be solved. Section VII presents conclusions from the experiments. Section VIII suggests possible reasons for the observed results.
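
    For readers less familiar with the notions compared here, the sketch below is a toy behavioral model, in Python, of a Mealy FSM with one-hot state encoding: a hypothetical "101" sequence detector whose state names, transition table, and run helper are invented purely for illustration, not the VHDL benchmark machines used in the paper.

        # Toy Mealy FSM: detects the overlapping pattern "101" in a bit stream.
        # One-hot encoding gives each state its own flip-flop, so exactly one bit is set.
        STATES = ["S0", "S1", "S2"]                           # hypothetical state names
        ONE_HOT = {s: 1 << i for i, s in enumerate(STATES)}   # S0=001, S1=010, S2=100

        # Mealy machine: (current state, input bit) -> (next state, output bit).
        TRANSITION = {
            ("S0", 0): ("S0", 0), ("S0", 1): ("S1", 0),
            ("S1", 0): ("S2", 0), ("S1", 1): ("S1", 0),
            ("S2", 0): ("S0", 0), ("S2", 1): ("S1", 1),       # "101" seen: assert output
        }

        def run(bits):
            """Return the output stream and the one-hot state register after each step."""
            state, outputs, registers = "S0", [], []
            for b in bits:
                state, out = TRANSITION[(state, b)]
                outputs.append(out)
                registers.append(format(ONE_HOT[state], "03b"))
            return outputs, registers

        print(run([1, 0, 1, 0, 1]))   # outputs [0, 0, 1, 0, 1]; one register bit set per step

    In hardware, one-hot encoding spends one flip-flop per state in exchange for simpler next-state logic, which is presumably why it was used in all cases of the comparison.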

    String Matching with Multicore CPUs: Performing Better with the Aho-Corasick Algorithm

    Multiple string matching is the problem of locating all occurrences of a given set of patterns in an arbitrary string. It is used in bio-computing applications, where such algorithms support information retrieval tasks such as sequence analysis and gene/protein identification. Extremely large amounts of string data have to be processed in these applications, so improving the performance of multiple string matching algorithms is always desirable. Multicore architectures can provide better performance by parallelizing multiple string matching algorithms. The Aho-Corasick algorithm is the one most commonly used for exact multiple string matching. The focus of this paper is the acceleration of the Aho-Corasick algorithm through a multicore CPU-based software implementation. Through our implementation and evaluation of the results, we show that our method performs better than the state of the art.
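
    As a rough illustration of the algorithm being accelerated, the following Python sketch builds an Aho-Corasick automaton (a pattern trie with failure links) and reports every pattern occurrence in a single pass over the text. It is a minimal sequential reference, not the paper's multicore implementation; a parallel version would additionally partition the text across threads, with overlap at chunk boundaries, which is not shown here.

        from collections import deque

        def build_automaton(patterns):
            """Aho-Corasick trie: per node a transition map, a failure link and output patterns."""
            trie = [{"next": {}, "fail": 0, "out": []}]           # node 0 is the root
            for pat in patterns:                                  # phase 1: insert every pattern
                node = 0
                for ch in pat:
                    if ch not in trie[node]["next"]:
                        trie.append({"next": {}, "fail": 0, "out": []})
                        trie[node]["next"][ch] = len(trie) - 1
                    node = trie[node]["next"][ch]
                trie[node]["out"].append(pat)
            # Phase 2: breadth-first pass sets each failure link to the longest proper
            # suffix of the node's path that is also a path in the trie.
            queue = deque(trie[0]["next"].values())
            while queue:
                u = queue.popleft()
                for ch, v in trie[u]["next"].items():
                    queue.append(v)
                    f = trie[u]["fail"]
                    while f and ch not in trie[f]["next"]:
                        f = trie[f]["fail"]
                    trie[v]["fail"] = trie[f]["next"].get(ch, 0)
                    trie[v]["out"] += trie[trie[v]["fail"]]["out"]   # inherit matches via suffixes
            return trie

        def search(text, trie):
            """Yield (start_index, pattern) for every occurrence of every pattern."""
            node = 0
            for i, ch in enumerate(text):
                while node and ch not in trie[node]["next"]:
                    node = trie[node]["fail"]
                node = trie[node]["next"].get(ch, 0)
                for pat in trie[node]["out"]:
                    yield i - len(pat) + 1, pat

        automaton = build_automaton(["he", "she", "his", "hers"])
        print(list(search("ushers", automaton)))   # [(1, 'she'), (2, 'he'), (2, 'hers')]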

    Decomposition and encoding of finite state machines for FPGA implementation

    xii + 187 pp.; 24 cm

    Hardware-Efficient Structure of the Accelerating Module for Implementation of Convolutional Neural Network Basic Operation

    This paper presents the structural design of a hardware-efficient module that implements the basic convolutional neural network (CNN) operation with reduced implementation complexity. For this purpose we use a modification of the Winograd minimal filtering method together with computation vectorization principles. The module calculates the inner products of two consecutive segments of the original data sequence, formed by a sliding window of length 3, with the elements of a filter impulse response. A fully parallel structure for calculating these two inner products based on the naive method requires 6 binary multipliers and 4 binary adders. Using the Winograd minimal filtering method makes it possible to construct a module structure that requires only 4 binary multipliers and 8 binary adders. Since a high-performance convolutional neural network can contain tens or even hundreds of such modules, this reduction can have a significant effect.
    Comment: 3 pages, 5 figures
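
    The operation counts above correspond to the Winograd F(2,3) identity: two outputs of a length-3 filter over a four-sample window are computed with 4 multiplications instead of 6, at the price of extra additions. The Python sketch below only checks that identity numerically; it is not the paper's hardware structure, and it assumes the two filter-side transform values can be precomputed (they are constants for a fixed impulse response), so only data-path multipliers and adders are counted.

        def naive_f23(d, g):
            """Direct evaluation: 6 multiplications and 4 additions for the two outputs."""
            y0 = d[0] * g[0] + d[1] * g[1] + d[2] * g[2]
            y1 = d[1] * g[0] + d[2] * g[1] + d[3] * g[2]
            return y0, y1

        def winograd_f23(d, g):
            """Winograd F(2,3): 4 multiplications and 8 data-path additions/subtractions."""
            # Filter transform: constants for a fixed impulse response (precomputed offline).
            gw = (g[0], (g[0] + g[1] + g[2]) / 2, (g[0] - g[1] + g[2]) / 2, g[2])
            # Data transform: 4 additions/subtractions, then the 4 multiplications.
            m0 = (d[0] - d[2]) * gw[0]
            m1 = (d[1] + d[2]) * gw[1]
            m2 = (d[2] - d[1]) * gw[2]
            m3 = (d[1] - d[3]) * gw[3]
            # Output transform: 4 more additions/subtractions.
            return m0 + m1 + m2, m1 - m2 - m3

        d, g = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.25]
        print(naive_f23(d, g))      # (-0.75, -1.0)
        print(winograd_f23(d, g))   # (-0.75, -1.0) -- same result with two fewer multiplications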