59 research outputs found

    EDRA: A Hardware-assisted Decoupled Access/Execute Framework on the Digital Market

    EDRA was a Horizon 2020 FET Launchpad project that focused on the commercialization of the Decoupled Access-Execute Reconfigurable (DAER) framework - developed within the FET-HPC EXTRA project - on Amazon's Elastic Compute Cloud (EC2) FPGA-based infrastructure. The delivered framework encapsulates DAER into an EC2 virtual machine (VM) and uses a simple, directive-based, high-level application programming interface (API) to facilitate application mapping to the underlying hardware architecture. EDRA's Minimum Viable Product (MVP) is an accelerator for the Phylogenetic Likelihood Function (PLF), one of the cornerstone functions in most phylogenetic inference tools, achieving up to 8x performance improvement over optimized software implementations. Towards entering the market, research revealed that Europe is an extremely promising geographic region on which to focus the project's dissemination, MVP promotion, and advertising efforts.
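
    The abstract mentions a directive-based API but does not show its syntax, so here is a minimal C sketch of what such an offload directive typically looks like; the pragma name "edra accelerate" and its data-movement clauses are invented for illustration and are not the project's documented interface.

        /* Hypothetical directive-based offload sketch in the spirit of the
           framework described above; the pragma spelling and clauses are
           assumptions, not EDRA's real syntax. */
        #include <stddef.h>

        void saxpy(float *y, const float *x, float a, size_t n)
        {
            /* Mark the loop for mapping onto the FPGA-side decoupled
               access/execute pipeline: an access unit streams x and y while
               an execute unit performs the multiply-add. */
            #pragma edra accelerate copy_in(x[0:n]) copy_inout(y[0:n])
            for (size_t i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }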

    INDEPENDENT PROJECT BASED LEARNING: APPLYING KNOWLEDGE BY CREATING A LOW-COST SUPERCOMPUTER WITH RASPBERRY PI

    This article details the process of one-on-one project-based learning and its benefits to the student. Students increasingly seek opportunities to gain experience with developing technologies that interest them personally but may be beyond the planned curriculum. This paper explores an independent project in which a student created a supercomputer using the parallel processing power of ten Raspberry Pi boards. Traditionally, computing executes one instruction at a time, usually on a single processor, and the speed at which instructions complete depends on how fast data moves through the hardware. Parallel computing processes instructions faster by breaking a large task into smaller tasks that are processed simultaneously in a coordinated effort, as the sketch below illustrates. Parallel computing is typically handled by supercomputers that range in cost from hundreds of thousands to over a billion dollars.
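
    To make the coordinated task-splitting concrete, the following is a minimal MPI sketch in C of the style of program such a Raspberry Pi cluster runs; it is an illustrative assumption, not code from the student's project.

        /* Each process sums a slice of the index range 0..N-1; rank 0
           combines the partial sums. Illustrative sketch only. */
        #include <mpi.h>
        #include <stdio.h>

        #define N 1000000L

        int main(int argc, char **argv)
        {
            int rank, size;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            /* Break the large task into one contiguous chunk per process. */
            long chunk = N / size;
            long lo = rank * chunk;
            long hi = (rank == size - 1) ? N : lo + chunk;
            double local = 0.0, total = 0.0;
            for (long i = lo; i < hi; i++)
                local += (double)i;

            /* Coordinated combination of the partial results on rank 0. */
            MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            if (rank == 0)
                printf("sum = %.0f\n", total);
            MPI_Finalize();
            return 0;
        }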

    An ultra low-power hardware accelerator for automatic speech recognition

    Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at a high energy cost that is not affordable within the tiny power budget of mobile devices. Hardware acceleration can reduce the power consumption of ASR systems while delivering high performance. In this paper, we present an accelerator for large-vocabulary, speaker-independent, continuous speech recognition. It focuses on the Viterbi search algorithm, which represents the main bottleneck in an ASR system. The proposed design includes innovative techniques to improve the memory subsystem, since memory is identified as the main bottleneck for performance and power in the design of these accelerators. We propose a prefetching scheme tailored to the needs of an ASR system that hides main memory latency for a large fraction of the memory accesses with a negligible impact on area. In addition, we introduce a novel bandwidth-saving technique that removes 20% of the off-chip memory accesses issued during the Viterbi search. The proposed design outperforms software implementations running on the CPU by orders of magnitude and achieves a 1.7x speedup over a highly optimized CUDA implementation running on a high-end GeForce GTX 980 GPU, while reducing by two orders of magnitude (287x) the energy required to convert speech into text.
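
    Since the Viterbi search is the stated bottleneck, a toy C implementation of the underlying dynamic-programming recurrence may help readers see what the accelerator computes at scale; the model sizes and probabilities below are assumed for illustration, and the paper's design searches a vastly larger decoding graph.

        /* Viterbi decoding on a toy HMM: S states, O observation symbols,
           a length-T observation sequence. All parameters are made up. */
        #include <stdio.h>

        #define S 3
        #define O 2
        #define T 4

        int main(void)
        {
            double init[S]     = {0.6, 0.3, 0.1};
            double trans[S][S] = {{0.7, 0.2, 0.1},
                                  {0.3, 0.5, 0.2},
                                  {0.2, 0.3, 0.5}};
            double emit[S][O]  = {{0.9, 0.1},
                                  {0.4, 0.6},
                                  {0.2, 0.8}};
            int obs[T] = {0, 1, 1, 0};

            double v[T][S];  /* best probability of any path ending in state s */
            int back[T][S];  /* backpointers for recovering that path */

            for (int s = 0; s < S; s++)
                v[0][s] = init[s] * emit[s][obs[0]];

            for (int t = 1; t < T; t++)
                for (int s = 0; s < S; s++) {
                    v[t][s] = 0.0;
                    back[t][s] = 0;
                    for (int p = 0; p < S; p++) {
                        double cand = v[t-1][p] * trans[p][s];
                        if (cand > v[t][s]) { v[t][s] = cand; back[t][s] = p; }
                    }
                    v[t][s] *= emit[s][obs[t]];
                }

            /* Trace back the most likely state sequence. */
            int best = 0;
            for (int s = 1; s < S; s++)
                if (v[T-1][s] > v[T-1][best]) best = s;
            int path[T];
            path[T-1] = best;
            for (int t = T-1; t > 0; t--)
                path[t-1] = back[t][path[t]];

            for (int t = 0; t < T; t++)
                printf("%d ", path[t]);
            printf("\n");
            return 0;
        }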

    Compiler Discovered Dynamic Scheduling of Irregular Code in High-Level Synthesis

    Dynamically scheduled high-level synthesis (HLS) achieves higher throughput than static HLS for codes with unpredictable memory accesses and control flow. However, excessive dataflow scheduling results in circuits that use more resources and have a slower critical path, even when only a part of the circuit exhibits dynamic behavior. Recent work has shown that marking parts of a dataflow circuit for static scheduling can save resources and improve performance (hybrid scheduling), but the dynamic part of the circuit still bottlenecks the critical path. We propose instead to selectively introduce dynamic scheduling into static HLS. This paper presents an algorithm for identifying code regions amenable to dynamic scheduling and shows a methodology for introducing dynamically scheduled basic blocks, loops, and memory operations into static HLS. Our algorithm is informed by modulo scheduling and can be integrated into any modulo-scheduled HLS tool. On a set of ten benchmarks, we show that our approach achieves, on average, up to 3.7x and 3x speedups against dynamic and hybrid scheduling, respectively, with an area overhead of 1.3x and frequency degradation of 0.74x when compared to static HLS. To appear in the 33rd International Conference on Field-Programmable Logic and Applications (2023).
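
    A short C example clarifies what "irregular code" means here: the histogram loop below carries a data-dependent store-to-load dependence through hist[idx[i]], so a static schedule must conservatively assume the worst-case dependence distance, whereas a dynamically scheduled circuit stalls only when two iterations actually touch the same bin. This is a standard illustration of the class of code in question, not necessarily one of the paper's ten benchmarks.

        /* Unpredictable memory accesses: hist[idx[i]] may alias across
           iterations, which defeats static loop pipelining. */
        void histogram(const int *idx, const float *w, float *hist, int n)
        {
            for (int i = 0; i < n; i++)
                hist[idx[i]] += w[i];
        }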

    A low-power, high-performance speech recognition accelerator

    Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at a high energy cost, which is not affordable within the tiny power budget of mobile devices. Hardware acceleration reduces the energy consumption of ASR systems while delivering high performance. In this paper, we present an accelerator for large-vocabulary, speaker-independent, continuous speech recognition. It focuses on the Viterbi search algorithm, which represents the main bottleneck in an ASR system. The proposed design consists of innovative techniques to improve the memory subsystem, since memory is the main bottleneck for performance and power in the design of these accelerators. It includes a prefetching scheme tailored to the needs of ASR systems that hides main memory latency for a large fraction of the memory accesses with a negligible impact on area. Additionally, we introduce a novel bandwidth-saving technique that removes 20 percent of the off-chip memory accesses. Finally, we present a power-saving technique that significantly reduces the leakage power of the accelerator's scratchpad memories, providing between 8.5 and 29.2 percent reduction in total power dissipation. Overall, the proposed design outperforms implementations running on the CPU by orders of magnitude and achieves speedups between 1.7x and 5.9x for different speech decoders over a highly optimized CUDA implementation running on a GeForce GTX 980 GPU, while reducing the energy by 123-454x.
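
    The prefetching scheme itself is hardware, but its principle has a simple software analogue: fetch the data needed a fixed number of iterations ahead so that memory latency overlaps with useful work. The C sketch below uses the GCC/Clang __builtin_prefetch intrinsic and an assumed lookahead distance purely to illustrate that principle; it is not the paper's mechanism.

        /* Indirect, cache-unfriendly loads, softened by prefetching a
           fixed distance ahead. AHEAD is an assumed tuning parameter. */
        void gather_scores(const int *state_ids, const float *scores,
                           float *out, int n)
        {
            enum { AHEAD = 8 };
            for (int i = 0; i < n; i++) {
                if (i + AHEAD < n)
                    __builtin_prefetch(&scores[state_ids[i + AHEAD]], 0, 1);
                out[i] = scores[state_ids[i]];
            }
        }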