319 research outputs found

    A Comparative Study of Scheduling Techniques for Multimedia Applications on SIMD Pipelines

    Parallel architectures are essential in order to take advantage of the parallelism inherent in streaming applications. One particular branch of these employs hardware SIMD pipelines. In this paper, we analyse several scheduling techniques, namely ad hoc overlapped execution, modulo scheduling and modulo scheduling with unrolling, all of which aim to utilize this special architecture design efficiently. Our investigation focuses on improving throughput while analysing other metrics that are important for streaming applications, such as register pressure, buffer sizes and code size. Through experiments conducted on several media benchmarks, we present and discuss the trade-offs involved when selecting any one of these scheduling techniques. Comment: Presented at the DATE Friday Workshop on Heterogeneous Architectures and Design Methods for Embedded Image Systems (HIS 2015) (arXiv:1502.07241).
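
    A minimal sketch (not from the paper) of the idea behind overlapped execution and modulo scheduling: successive loop iterations are overlapped so that the load for iteration i+1 proceeds while iteration i computes and stores. The kernel and all names below are illustrative assumptions.

```cpp
// Hand-written software pipelining of a simple streaming loop, illustrating
// the kind of overlapped schedule that modulo scheduling produces automatically.
#include <cstddef>

void scale(const float* in, float* out, std::size_t n, float gain) {
    if (n == 0) return;
    // Prologue: issue the first load before the steady state begins.
    float loaded = in[0];
    // Steady state: iteration i's compute/store overlaps iteration i+1's load,
    // instead of running load, compute and store strictly back to back.
    for (std::size_t i = 0; i + 1 < n; ++i) {
        float next = in[i + 1];   // load for iteration i+1
        out[i] = loaded * gain;   // compute + store for iteration i
        loaded = next;
    }
    // Epilogue: drain the last in-flight value.
    out[n - 1] = loaded * gain;
}
```

    The value carried across iterations in `loaded` is the kind of live range that raises register pressure as the overlap depth grows, which is one of the metrics the study weighs against throughput.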

    Transformations of High-Level Synthesis Codes for High-Performance Computing

    Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increasing data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolving interface contention or increasing parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers to tap into the performance potential offered by specialized hardware architectures using HLS.
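
    As a hedged illustration of the kind of transformation such a toolbox contains (not code from the paper), the sketch below rewrites a 3-point stencil so that each input element is read once and reused through a small shift register, letting the loop pipeline with one new input per cycle. The pragma spelling follows common Xilinx/Vitis HLS usage and is an assumption about the target tool.

```cpp
// Data-reuse transformation sketch for HLS: replace repeated reads of
// in[i-1], in[i], in[i+1] with an explicit shift register ("window"),
// so the pipelined loop touches external memory only once per iteration.
constexpr int N = 1024;

void stencil3(const float in[N], float out[N]) {
    float window[3] = {0.f, 0.f, 0.f};
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1
        // Shift in one new element per iteration.
        window[0] = window[1];
        window[1] = window[2];
        window[2] = in[i];
        // Centered 3-point average; boundary elements out[0] and out[N-1]
        // are left untouched in this simplified sketch.
        if (i >= 2) {
            out[i - 1] = (window[0] + window[1] + window[2]) / 3.0f;
        }
    }
}
```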

    Performance and resource modeling for FPGAs using high-level synthesis tools

    High-performance computing with FPGAs is gaining momentum with the advent of sophisticated High-Level Synthesis (HLS) tools. The performance of a design is impacted by the input-output bandwidth, the code optimizations and the resource consumption, which makes performance estimation a challenge. This paper proposes a performance model which extends the roofline model to take into account the resource consumption and the parameters used in the HLS tools. A strategy is developed which maximizes the performance and the resource utilization within the area of the FPGA. The model is used to optimize the design exploration of a class of window-based image processing applications.
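
    A minimal sketch of the roofline-style bound that such a model extends (the function, the linear resource model and all parameter names are illustrative assumptions, not the paper's model): attainable throughput is capped either by how many compute units fit in the FPGA's resource budget or by off-chip bandwidth times operational intensity.

```cpp
// Roofline-style bound with a resource-dependent compute roof.
#include <algorithm>
#include <cmath>

double attainable_gflops(double flops_per_unit_per_cycle, double clock_ghz,
                         double available_luts, double luts_per_unit,
                         double bandwidth_gbs, double flops_per_byte) {
    // Compute roof: how many replicated units fit within the resource budget.
    double units = std::floor(available_luts / luts_per_unit);
    double compute_roof = units * flops_per_unit_per_cycle * clock_ghz;
    // Memory roof: classical roofline term, bandwidth times operational intensity.
    double memory_roof = bandwidth_gbs * flops_per_byte;
    return std::min(compute_roof, memory_roof);
}
```

    HLS parameters such as unroll factors would change `luts_per_unit` and `flops_per_unit_per_cycle` in a model like this, which is the kind of coupling between tool parameters, resources and performance the abstract describes.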

    Suggesting loop unrolling using a heuristic-guided approach

    Integrated Master's thesis. Informatics and Computing Engineering. Faculdade de Engenharia. Universidade do Porto. 201

    Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators

    We show that DNN accelerator micro-architectures and their program mappings represent specific choices of loop order and hardware parallelism for computing the seven nested loops of DNNs, which enables us to create a formal taxonomy of all existing dense DNN accelerators. Surprisingly, the loop transformations needed to create these hardware variants can be precisely and concisely represented by Halide's scheduling language. By modifying the Halide compiler to generate hardware, we create a system that can fairly compare these prior accelerators. As long as proper loop blocking schemes are used, and the hardware can support mapping replicated loops, many different hardware dataflows yield similar energy efficiency with good performance. This is because the loop blocking can ensure that most data references stay on-chip with good locality and the processing units have high resource utilization. How resources are allocated, especially in the memory system, has a large impact on energy and performance. By optimizing hardware resource allocation while keeping throughput constant, we achieve up to 4.2X energy improvement for Convolutional Neural Networks (CNNs), and 1.6X and 1.8X improvements for Long Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively. Comment: Published as a conference paper at ASPLOS 2020.
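
    A hedged sketch of what a Halide schedule looks like (ordinary CPU-targeted Halide, not the paper's modified hardware backend): loop blocking via tiling plus mapping the inner loops onto parallel resources. A 2D function stands in for the seven-deep DNN loop nest.

```cpp
// Loop order and parallelism expressed in Halide's scheduling language.
#include "Halide.h"
using namespace Halide;

int main() {
    Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");
    Func f("f");
    f(x, y) = cast<float>(x + y);   // placeholder computation

    // Schedule: tile (loop blocking), then map the inner/outer tile loops to
    // parallel resources, analogous to choosing a dataflow and PE array shape.
    f.tile(x, y, xo, yo, xi, yi, 8, 8)
     .vectorize(xi)
     .parallel(yo);

    f.realize({64, 64});
    return 0;
}
```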

    OPTIMIZATION OF FPGA-BASED PROCESSOR ARCHITECTURE FOR SOBEL EDGE DETECTION OPERATOR

    This dissertation introduces an optimized processor architecture for the Sobel edge detection operator on field programmable gate arrays (FPGAs). The processor is optimized through several techniques that aim to increase its throughput and reduce its logic utilization and memory usage. FPGAs offer high levels of parallelism, which the processor exploits to perform edge detection in parallel, increasing throughput and reducing logic utilization. To achieve this, the proposed processor consists of several Sobel instances that produce multiple output pixels in parallel. This parallelism enables data reuse within the processor block; moreover, the processor gains performance by a factor equal to the number of instances contained in the block. A processor with one row of Sobel instances exploits data reuse within a single image line in the calculation of the horizontal gradient. Data reuse within one and across multiple image lines is enabled by a processor with multiple rows of Sobel instances, which allows reuse of both the horizontal and vertical gradients. Through these optimization techniques, the proposed Sobel processor meets real-time performance constraints thanks to its high throughput, even at a considerably low clock frequency. In addition, the logic utilization of the processor is low compared to other Sobel processors when implemented on the Altera Cyclone II DE2-70.
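
    For reference, a software sketch of the Sobel computation that such a processor parallelizes (image layout and names are illustrative assumptions): both gradients are built from the same 3x3 window, which is exactly the overlap that line buffers and multi-row instance arrays reuse on the FPGA.

```cpp
// Reference Sobel operator on an 8-bit grayscale image (row-major layout).
#include <algorithm>
#include <cstdint>
#include <cstdlib>

void sobel(const uint8_t* img, uint8_t* out, int width, int height) {
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            // Fetch a neighbour of the current pixel from the 3x3 window.
            auto p = [&](int dy, int dx) {
                return static_cast<int>(img[(y + dy) * width + (x + dx)]);
            };
            // Horizontal and vertical gradients share the same window reads.
            int gx = -p(-1,-1) + p(-1,1) - 2*p(0,-1) + 2*p(0,1) - p(1,-1) + p(1,1);
            int gy = -p(-1,-1) - 2*p(-1,0) - p(-1,1) + p(1,-1) + 2*p(1,0) + p(1,1);
            // Approximate gradient magnitude, clamped to the 8-bit range.
            int mag = std::min(255, std::abs(gx) + std::abs(gy));
            out[y * width + x] = static_cast<uint8_t>(mag);
        }
    }
}
```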

    Simple SIMON: FPGA implementations of the SIMON 64/128 Block Cipher

    In this paper we present various hardware architecture designs for implementing the SIMON 64/128 block cipher as a cryptographic component offering encryption, decryption and self-contained key-scheduling capabilities, and we discuss the issues and design options we encountered and the trade-offs we made in implementing them. Finally, we present the results of our hardware architectures' implementation performance on the Xilinx Spartan-6 FPGA series. Comment: 20 pages.
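
    A minimal software sketch of the SIMON Feistel round that such hardware implements, shown here in C++ for reference only (key schedule, round count and decryption are omitted); SIMON 64/128 operates on two 32-bit words per block.

```cpp
// One SIMON encryption round on 32-bit words:
// (x, y) -> (y ^ f(x) ^ k, x), with f(x) = (x <<< 1 & x <<< 8) ^ (x <<< 2).
#include <cstdint>

static inline uint32_t rotl32(uint32_t v, unsigned r) {
    return (v << r) | (v >> (32u - r));
}

void simon_round(uint32_t& x, uint32_t& y, uint32_t k) {
    uint32_t fx = (rotl32(x, 1) & rotl32(x, 8)) ^ rotl32(x, 2);
    uint32_t new_x = y ^ fx ^ k;
    y = x;
    x = new_x;
}
```

    In hardware, the rotations are free wiring, so the round reduces to an AND and a few XORs, which is why area/throughput trade-offs largely come down to how many rounds are unrolled and how the key schedule is realized.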