2,065 research outputs found

    Spaceborne memory organization, phase 1 Final report

    Application of associative memories to data processing for future space vehicle

    Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors

    We introduce a high-performance, multi-threaded realization of the gemm kernel for the ARMv8.2 architecture that operates with 16-bit (half-precision) floating point operands. Our code is especially designed for efficient machine learning inference (and, to a certain extent, also training) with deep neural networks. The results on the NVIDIA Carmel multicore processor, which implements the ARMv8.2 architecture, show considerable performance gains for the gemm kernel, close to the theoretical peak acceleration that can be expected when moving from 32-bit arithmetic/data to 16-bit. Combined with the type of convolution operator arising in convolutional neural networks, the speed-ups are more modest, though still relevant.
    This work was supported by projects TIN2017-82972-R and RTI2018-093684-B-I00 from the Ministerio de Ciencia, Innovación y Universidades, project S2018/TCS-4423 of the Comunidad de Madrid, project PR65/19-22445 of the UCM, and project Prometeo/2019/109 of the Generalitat Valenciana.
    San Juan-Sebastian, P.; Rodríguez-Sánchez, R.; Igual, FD.; Alonso-Jordá, P.; Quintana-Ortí, ES. (2021). Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors. The Journal of Supercomputing. 77(10):11257-11269. https://doi.org/10.1007/s11227-021-03636-4
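    As a rough illustration of the kind of half-precision arithmetic described above, the sketch below computes one small gemm tile with ARMv8.2-A FP16 NEON intrinsics. It is a minimal example under stated assumptions (the 8x1 column-major micro-tile, the function name microkernel_8x1_fp16, and the build flags are illustrative), not the multi-threaded kernel from the paper.

        // Minimal sketch (not the authors' kernel): one 8x1 half-precision
        // micro-tile of C += A*B, using ARMv8.2-A FP16 NEON intrinsics.
        // Hypothetical build: g++ -O3 -march=armv8.2-a+fp16 fp16_gemm_sketch.cpp
        #include <arm_neon.h>
        #include <cstddef>

        // C(8x1) += A(8xk) * B(kx1); A is column-major with leading dimension lda.
        // All operands are IEEE half precision (__fp16).
        void microkernel_8x1_fp16(std::size_t k,
                                  const __fp16 *A, std::size_t lda,
                                  const __fp16 *B,
                                  __fp16 *C) {
            float16x8_t c = vld1q_f16(C);               // load 8 accumulators
            for (std::size_t p = 0; p < k; ++p) {
                float16x8_t a = vld1q_f16(A + p * lda); // 8 elements of column p of A
                float16x8_t b = vdupq_n_f16(B[p]);      // broadcast B(p)
                c = vfmaq_f16(c, a, b);                 // fused multiply-add in FP16
            }
            vst1q_f16(C, c);                            // write the updated tile back
        }

    The potential gain comes from the doubled SIMD width and halved memory traffic of 16-bit operands relative to 32-bit ones, which is where the roughly 2X theoretical peak mentioned in the abstract comes from.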

    Acceleration by Inline Cache for Memory-Intensive Algorithms on FPGA via High-Level Synthesis

    Using FPGA-based acceleration of high-performance computing (HPC) applications to reduce energy and power consumption is becoming an attractive option, thanks to the availability of high-level synthesis (HLS) tools that enable fast design cycles. However, obtaining good performance for memory-intensive algorithms, which often exchange large data arrays with external DRAM, still requires time-consuming optimization and good knowledge of hardware design. This article proposes a new design methodology based on dedicated application- and data-array-specific caches. These caches provide most of the benefits that can be achieved by hand-coding optimized DMA-like transfer strategies into the HPC application code, yet require only limited manual tuning (basically the selection of architecture and size), are neutral to the target HLS tool and technology (FPGA or ASIC), and do not require changes to the application code. We show experimental results obtained on five common memory-intensive algorithms from very diverse domains, namely machine learning, data sorting, and computer vision. We test the cost and performance of our caches against both out-of-the-box code originally optimized for a GPU and implementations manually optimized for FPGAs via HLS. The implementations using our caches achieved an 8X speedup and a 2X energy reduction on average with respect to the out-of-the-box versions, using only simple directive-based optimizations (e.g., pipelining). They also achieved comparable performance, with much less design effort, compared with the versions that were manually optimized for efficient memory transfers on an FPGA.
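    To make the cache idea above concrete, here is a minimal C++ sketch of a direct-mapped, read-only, per-array cache that an HLS kernel could use in place of a raw DRAM pointer. The class name, template parameters, and refill policy are illustrative assumptions, not the article's cache generator, and the sketch assumes the external array is padded to a whole number of cache lines.

        // Minimal sketch (not the article's implementation): a direct-mapped,
        // read-only cache dedicated to a single external (DRAM-resident) array.
        // The kernel indexes cache[i] exactly as it would index the raw array,
        // so the algorithm's code does not change; only the declaration does.
        #include <cstdint>
        #include <cstring>

        template <typename T, unsigned LINE_WORDS, unsigned NUM_LINES>
        class InlineReadCache {                       // hypothetical name
            const T  *mem;                            // external DRAM array
            T         data[NUM_LINES][LINE_WORDS];    // on-chip line storage
            uint32_t  tag[NUM_LINES];
            bool      valid[NUM_LINES] = {};
        public:
            explicit InlineReadCache(const T *external) : mem(external) {}

            T operator[](uint32_t idx) {
                uint32_t word = idx % LINE_WORDS;
                uint32_t line = (idx / LINE_WORDS) % NUM_LINES;
                uint32_t t    = idx / LINE_WORDS;
                if (!valid[line] || tag[line] != t) { // miss: refill the whole line
                    std::memcpy(data[line], mem + t * LINE_WORDS,
                                LINE_WORDS * sizeof(T));
                    tag[line]   = t;
                    valid[line] = true;
                }
                return data[line][word];              // hit: served from on-chip memory
            }
        };

        // Hypothetical usage inside an HLS kernel: wrap the DRAM pointer once,
        // keep the original indexing untouched.
        float sum(const float *dram_a, unsigned n) {
            InlineReadCache<float, 16, 64> a(dram_a);
            float s = 0.f;
            for (unsigned i = 0; i < n; ++i) s += a[i];
            return s;
        }

    The point of the wrapper is the same as in the article: external memory is touched in whole-line transfers on misses while the application code stays unchanged, and only the cache architecture and size need manual tuning.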

    Spaceborne memory organization Interim report

    Associative memory applications in unmanned space vehicle