59 research outputs found

    EDRA: A Hardware-assisted Decoupled Access/Execute Framework on the Digital Market

    EDRA was a Horizon 2020 FET Launchpad project that focused on the commercialization of the Decoupled Access-Execute Reconfigurable (DAER) framework - developed within the FET-HPC EXTRA project - on Amazon's Elastic Compute Cloud (EC2) FPGA-based infrastructure. The delivered framework encapsulates DAER into an EC2 virtual machine (VM) and uses a simple, directive-based, high-level application programming interface (API) to facilitate application mapping to the underlying hardware architecture. EDRA's Minimum Viable Product (MVP) is an accelerator for the Phylogenetic Likelihood Function (PLF), one of the cornerstone functions in most phylogenetic inference tools, achieving up to 8x performance improvement over optimized software implementations. Towards entering the market, research revealed that Europe is an extremely promising geographic region on which to focus the project's dissemination, MVP promotion, and advertising efforts.
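
    The abstract mentions a directive-based API but does not show its syntax, so here is a minimal C sketch of what such an offload directive typically looks like; the pragma name "edra accelerate" and its data-movement clauses are invented for illustration and are not the project's documented interface.

        /* Hypothetical directive-based offload sketch in the spirit of the
           framework described above; the pragma spelling and clauses are
           assumptions, not EDRA's real syntax. */
        #include <stddef.h>

        void saxpy(float *y, const float *x, float a, size_t n)
        {
            /* Mark the loop for mapping onto the FPGA-side decoupled
               access/execute pipeline: an access unit streams x and y while
               an execute unit performs the multiply-add. */
            #pragma edra accelerate copy_in(x[0:n]) copy_inout(y[0:n])
            for (size_t i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }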

    INDEPENDENT PROJECT BASED LEARNING: APPLYING KNOWLEDGE BY CREATING A LOW-COST SUPERCOMPUTER WITH RASPBERRY PI

    This article details the process of one-on-one project-based learning and its benefits to the student. Students increasingly seek opportunities to gain experience with developing technologies that interest them personally but may be beyond the planned curriculum. This paper explores an independent project in which a student created a supercomputer using the parallel processing power of ten Raspberry Pi boards. Traditionally, computing executes one instruction at a time, usually on a single processor, and the speed at which instructions complete depends on how fast data moves through the hardware. Parallel computing processes instructions faster by breaking a large task into smaller tasks that are processed simultaneously in a coordinated effort, as the sketch below illustrates. Parallel computing is typically handled by supercomputers that range in cost from hundreds of thousands to over a billion dollars.
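
    To make the coordinated task-splitting concrete, the following is a minimal MPI sketch in C of the style of program such a Raspberry Pi cluster runs; it is an illustrative assumption, not code from the student's project.

        /* Each process sums a slice of the index range 0..N-1; rank 0
           combines the partial sums. Illustrative sketch only. */
        #include <mpi.h>
        #include <stdio.h>

        #define N 1000000L

        int main(int argc, char **argv)
        {
            int rank, size;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            /* Break the large task into one contiguous chunk per process. */
            long chunk = N / size;
            long lo = rank * chunk;
            long hi = (rank == size - 1) ? N : lo + chunk;
            double local = 0.0, total = 0.0;
            for (long i = lo; i < hi; i++)
                local += (double)i;

            /* Coordinated combination of the partial results on rank 0. */
            MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            if (rank == 0)
                printf("sum = %.0f\n", total);
            MPI_Finalize();
            return 0;
        }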

    An ultra low-power hardware accelerator for automatic speech recognition

    Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at a high energy cost that is not affordable within the tiny power budget of mobile devices. Hardware acceleration can reduce the power consumption of ASR systems while delivering high performance. In this paper, we present an accelerator for large-vocabulary, speaker-independent, continuous speech recognition. It focuses on the Viterbi search algorithm, which represents the main bottleneck in an ASR system. The proposed design includes innovative techniques to improve the memory subsystem, since memory is identified as the main bottleneck for performance and power in the design of these accelerators. We propose a prefetching scheme tailored to the needs of an ASR system that hides main memory latency for a large fraction of the memory accesses with a negligible impact on area. In addition, we introduce a novel bandwidth-saving technique that removes 20% of the off-chip memory accesses issued during the Viterbi search. The proposed design outperforms software implementations running on the CPU by orders of magnitude and achieves a 1.7x speedup over a highly optimized CUDA implementation running on a high-end GeForce GTX 980 GPU, while reducing by two orders of magnitude (287x) the energy required to convert speech into text.
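
    Since the Viterbi search is the stated bottleneck, a toy C implementation of the underlying dynamic-programming recurrence may help readers see what the accelerator computes at scale; the model sizes and probabilities below are assumed for illustration, and the paper's design searches a vastly larger decoding graph.

        /* Viterbi decoding on a toy HMM: S states, O observation symbols,
           a length-T observation sequence. All parameters are made up. */
        #include <stdio.h>

        #define S 3
        #define O 2
        #define T 4

        int main(void)
        {
            double init[S]     = {0.6, 0.3, 0.1};
            double trans[S][S] = {{0.7, 0.2, 0.1},
                                  {0.3, 0.5, 0.2},
                                  {0.2, 0.3, 0.5}};
            double emit[S][O]  = {{0.9, 0.1},
                                  {0.4, 0.6},
                                  {0.2, 0.8}};
            int obs[T] = {0, 1, 1, 0};

            double v[T][S];  /* best probability of any path ending in state s */
            int back[T][S];  /* backpointers for recovering that path */

            for (int s = 0; s < S; s++)
                v[0][s] = init[s] * emit[s][obs[0]];

            for (int t = 1; t < T; t++)
                for (int s = 0; s < S; s++) {
                    v[t][s] = 0.0;
                    back[t][s] = 0;
                    for (int p = 0; p < S; p++) {
                        double cand = v[t-1][p] * trans[p][s];
                        if (cand > v[t][s]) { v[t][s] = cand; back[t][s] = p; }
                    }
                    v[t][s] *= emit[s][obs[t]];
                }

            /* Trace back the most likely state sequence. */
            int best = 0;
            for (int s = 1; s < S; s++)
                if (v[T-1][s] > v[T-1][best]) best = s;
            int path[T];
            path[T-1] = best;
            for (int t = T-1; t > 0; t--)
                path[t-1] = back[t][path[t]];

            for (int t = 0; t < T; t++)
                printf("%d ", path[t]);
            printf("\n");
            return 0;
        }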

    Compiler Discovered Dynamic Scheduling of Irregular Code in High-Level Synthesis

    Dynamically scheduled high-level synthesis (HLS) achieves higher throughput than static HLS for codes with unpredictable memory accesses and control flow. However, excessive dataflow scheduling results in circuits that use more resources and have a slower critical path, even when only a part of the circuit exhibits dynamic behavior. Recent work has shown that marking parts of a dataflow circuit for static scheduling can save resources and improve performance (hybrid scheduling), but the dynamic part of the circuit still bottlenecks the critical path. We propose instead to selectively introduce dynamic scheduling into static HLS. This paper presents an algorithm for identifying code regions amenable to dynamic scheduling and shows a methodology for introducing dynamically scheduled basic blocks, loops, and memory operations into static HLS. Our algorithm is informed by modulo scheduling and can be integrated into any modulo-scheduled HLS tool. On a set of ten benchmarks, we show that our approach achieves, on average, up to 3.7x and 3x speedups against dynamic and hybrid scheduling, respectively, with an area overhead of 1.3x and frequency degradation of 0.74x when compared to static HLS. To appear in the 33rd International Conference on Field-Programmable Logic and Applications (2023).
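
    A short C example clarifies what "irregular code" means here: the histogram loop below carries a data-dependent store-to-load dependence through hist[idx[i]], so a static schedule must conservatively assume the worst-case dependence distance, whereas a dynamically scheduled circuit stalls only when two iterations actually touch the same bin. This is a standard illustration of the class of code in question, not necessarily one of the paper's ten benchmarks.

        /* Unpredictable memory accesses: hist[idx[i]] may alias across
           iterations, which defeats static loop pipelining. */
        void histogram(const int *idx, const float *w, float *hist, int n)
        {
            for (int i = 0; i < n; i++)
                hist[idx[i]] += w[i];
        }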

    A low-power, high-performance speech recognition accelerator

    Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at a high energy cost, which is not affordable within the tiny power budget of mobile devices. Hardware acceleration reduces the energy consumption of ASR systems while delivering high performance. In this paper, we present an accelerator for large-vocabulary, speaker-independent, continuous speech recognition. It focuses on the Viterbi search algorithm, which represents the main bottleneck in an ASR system. The proposed design consists of innovative techniques to improve the memory subsystem, since memory is the main bottleneck for performance and power in the design of these accelerators. It includes a prefetching scheme tailored to the needs of ASR systems that hides main memory latency for a large fraction of the memory accesses with a negligible impact on area. Additionally, we introduce a novel bandwidth-saving technique that removes 20 percent of the off-chip memory accesses. Finally, we present a power-saving technique that significantly reduces the leakage power of the accelerator's scratchpad memories, providing between 8.5 and 29.2 percent reduction in total power dissipation. Overall, the proposed design outperforms implementations running on the CPU by orders of magnitude and achieves speedups between 1.7x and 5.9x for different speech decoders over a highly optimized CUDA implementation running on a GeForce GTX 980 GPU, while reducing the energy by 123-454x.
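
    The prefetching scheme itself is hardware, but its principle has a simple software analogue: fetch the data needed a fixed number of iterations ahead so that memory latency overlaps with useful work. The C sketch below uses the GCC/Clang __builtin_prefetch intrinsic and an assumed lookahead distance purely to illustrate that principle; it is not the paper's mechanism.

        /* Indirect, cache-unfriendly loads, softened by prefetching a
           fixed distance ahead. AHEAD is an assumed tuning parameter. */
        void gather_scores(const int *state_ids, const float *scores,
                           float *out, int n)
        {
            enum { AHEAD = 8 };
            for (int i = 0; i < n; i++) {
                if (i + AHEAD < n)
                    __builtin_prefetch(&scores[state_ids[i + AHEAD]], 0, 1);
                out[i] = scores[state_ids[i]];
            }
        }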