Automatic Speech Recognition (ASR) is quickly becoming a mainstream technology, mainly driven by the outstanding accuracy achieved by modern systems based on machine learning. However, these systems often require billions of arithmetic operations to decode a second of audio, and relying on cloud services for ASR is usually inconvenient. Even though deployment of ASR systems directly on the edge is highly desirable, the requirements for high performance and low energy consumption, combined with the fast pace of evolution and heterogeneity of existing ASR systems, result in challenges for effective deployment of ASR on edge devices. In this work, we propose a programmable accelerator to efficiently support a variety of ASR implementations. We estimate the performance of our system by implementing a recently proposed streaming ASR system and show that it can perform real-time streaming decoding with a tight power budget and low area footprint while offering great flexibility to implement a variety of different models.

This work has been supported by the CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020 program (grant No 833057), the Spanish State Research Agency (MCIN/AEI) under grant PID2020-113172RB-I00, the ICREA Academia program and the Spanish MICINN Ministry under grant BES-2017-080605.

Peer Reviewed

Postprint (published version)

Arnau Montañés, José María

González Colás, Antonio María

Pinto Rivero, Daniel

English

UPCommons. Portal del coneixement obert de la UPC

A Programmable Accelerator for Streaming Automatic Speech Recognition on Edge Devices

Dennis Pinto, Universitat Politècnica de Catalunya, Barcelona, Spain, dpinto@ac.upc.edu

Jose-Maria Arnau, Universitat Politècnica de Catalunya, Barcelona, Spain, jarnau@ac.upc.edu

Antonio González, Universitat Politècnica de Catalunya, Barcelona, Spain, antonio@ac.upc.edu

Abstract: Automatic Speech Recognition (ASR) is quickly becoming a mainstream technology, mainly driven by the outstanding accuracy achieved by modern systems based on machine learning. However, these systems often require billions of arithmetic operations to decode a second of audio, and relying on cloud services for ASR is usually inconvenient. Even though deployment of ASR systems directly on the edge is highly desirable, the requirements for high performance and low energy consumption, combined with the fast pace of evolution and heterogeneity of existing ASR systems, result in challenges for effective deployment of ASR on edge devices. In this work, we propose a programmable accelerator to efficiently support a variety of ASR implementations. We estimate the performance of our system by implementing a recently proposed streaming ASR system and show that it can perform real-time streaming decoding with a tight power budget and low area footprint while offering great flexibility to implement a variety of different models.

Keywords: Programmable Accelerator, ASR, Speech Recognition

I. INTRODUCTION

Voice-based applications are quickly becoming mainstream. This widespread adoption is fueled by the outstanding improvement experienced by the Automatic Speech Recognition (ASR) systems that power them [1]–[3], [8].

Despite the impressive progress in automatic speech recognition technology, deployment of ASR systems on edge devices remains challenging and thus ASR systems are commonly deployed on servers.
This approach is problematic due to high decoding latency and high energy consumption from the network subsystem of the edge device [5]. However, the biggest concerns come from the security and privacy issues related to sending personal data to external servers.

In order to move ASR to the edge, edge devices must provide enough computing power to perform expensive DNN inferences and graph searches. The compute power requirements can be fulfilled by including hardware accelerators in existing SoCs [6]. However, generic accelerators may not be enough to provide consistent performance, and very specific accelerators may quickly become obsolete given the number of alternative implementations available for ASR and the fast pace of innovation in the ASR field.

[Figure 1. Diagram of the accelerator. Top-level blocks: Command Decoder, Conf. Memory, ASR Controller, Hypothesis Controller, Hypothesis Memory, PE pool, i-cache, shared memory, and model memory/d-cache. Blocks inside each PE: instruction decoder (RISC-V ISA), register bank, i-cache, d-cache, LSU, ALU, int->fp, Log, Exp, Cos, and vector MUL, ADD, ACUM and MAC units.]

In this work, we propose a programmable accelerator for ASR. A programmable accelerator can handle a large variety of different ASR implementations while providing enough compute power in an efficient manner through some degree of specialization. We explore a design that enables enough computing performance in a low-power setup for real-time streaming decoding. Furthermore, we provide a programming model that abstracts away some of the complexities of the system for ease of programming and flexibility.

II. ACCELERATOR

The accelerator, depicted in Figure 1, relies on a pool of general-purpose cores (PEs) to support parallel execution. Each core has a data cache and an instruction cache and implements a RISC-V ISA with extensions for vector operations and special functions.

A. Command Decoder

The accelerator is accessed through a set of commands. These include commands to set up the beam size of the hypothesis expansion, configure the kernels that implement the ASR system, start a decoding step and finish the current decoding process. The command decoder stores the required parameters in the ConfMemory, which is later accessed by the ASR controller during run time.

B. ASR Controller

Figure 2 shows the overall process of decoding an utterance with our accelerator. This process is controlled by the ASR Controller. In streaming decoding, the signal is decoded in decoding steps. Each step decodes a few milliseconds of the audio signal. An external process captures an audio segment and commands the accelerator to start a decoding step. Each decoding step is divided into two stages: (1) the first stage is the acoustic scoring, which consists of producing acoustic score vectors by processing the raw audio segment; (2) after the acoustic scores are generated, the hypothesis expansion stage starts. That stage generates transcription hypotheses from the acoustic scores and additional models, such as lexicon and language models.

During acoustic scoring, the ASR controller launches a sequence of parallel kernels which collectively implement the feature extraction and acoustic model processes in the ASR system. These kernels access data from the shared memory, which can be used to store and retrieve intermediate results, and the model memory, which is used to pre-load model parameters, such as DNN weights. The kernels are launched sequentially, meaning that the next kernel will not start executing until all the threads from the current kernel have finished.

Every kernel consists of a kernel program and a setup program. The setup program reserves and frees space in the shared memory, configures the model memory DMA to pre-fetch model data and checks whether there are enough inputs for the kernel program to execute.
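The sequential launch rule and the role of the setup program can be illustrated with a small software model. This is only a sketch under stated assumptions, not the accelerator's actual programming interface: the names `Kernel`, `run_acoustic_scoring` and `make_kernel` are hypothetical, a Python thread pool stands in for the PE pool, a dictionary stands in for the shared memory, and each setup program runs to completion before its kernel threads are dispatched (no overlap between setup and kernel threads is modeled).

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Kernel:
    # Hypothetical software stand-in for one acoustic scoring kernel.
    name: str
    num_threads: int
    setup: Callable[[dict], bool]         # returns False if inputs are missing
    program: Callable[[dict, int], None]  # body of one kernel thread


def run_acoustic_scoring(kernels: List[Kernel], shared: dict,
                         num_pes: int = 8) -> List[str]:
    """Launch kernels one after another; return the names of those that ran."""
    executed = []
    with ThreadPoolExecutor(max_workers=num_pes) as pool:
        for kernel in kernels:
            # The setup program decides whether the kernel can run at all.
            if not kernel.setup(shared):
                break  # not enough inputs: stop this decoding step
            futures = [pool.submit(kernel.program, shared, tid)
                       for tid in range(kernel.num_threads)]
            # Barrier: the next kernel starts only after every thread
            # of the current kernel has finished.
            for future in futures:
                future.result()
            executed.append(kernel.name)
    return executed


def make_kernel(name: str, num_threads: int, source: str) -> Kernel:
    """Build a toy kernel that reads `source` and writes to `name`."""
    def setup(shared: dict) -> bool:
        if source not in shared:
            return False
        shared[name] = [None] * num_threads  # reserve output space
        return True

    def program(shared: dict, tid: int) -> None:
        shared[name][tid] = len(shared[source]) + tid  # placeholder work

    return Kernel(name, num_threads, setup, program)


shared = {"audio": [0.0] * 160}
kernels = [
    make_kernel("features", 4, "audio"),
    make_kernel("layer0", 6, "features"),
    make_kernel("layer1", 6, "missing_input"),  # its setup will fail
    make_kernel("layer2", 6, "layer1"),
]
executed = run_acoustic_scoring(kernels, shared)
print(executed)
```

In this toy run the third kernel's setup finds no input, so the step stops after the first two kernels, which mirrors a decoding step being cut short until more input arrives.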
Figure 3 shows how the different threads are scheduled in the PE pool during acoustic scoring. Each square represents a PE executing a setup thread (yellow) or a kernel thread (blue). (1) First, the setup thread of kernel 0 is dispatched. It configures the DMA to load the model data for kernel 0 in model memory and waits for it to finish. (2) The execution of the following kernels (as_i in the figure) starts by dispatching the setup thread for the next kernel (as_{i+1}) alongside the kernel threads of as_i. (3) The ASR controller keeps dispatching as_i threads until the kernel is completely executed. If a setup thread determines that the corresponding kernel threads cannot be launched (4), it will notify the controller. Additionally, it can pre-fetch the model data for kernel 0 to skip step (1) during the next decoding step. After the current kernel finishes (5), the controller will interrupt the decoding step and wait for the next decoding command, which will start a new decoding step from (1) or (2), depending on whether the model data for kernel 0 is pre-loaded or not. (6) The setup for the hypothesis expansion phase is launched alongside the threads for the last acoustic scoring kernel. Finally, when all the threads for the last acoustic scoring kernel finish (7), the accelerator ends the acoustic scoring phase.

After the acoustic scoring phase, the ASR controller enters the hypothesis expansion phase. For this phase, the programmer provides only one kernel. The controller dispatches as many threads of this kernel as the number of hypotheses that resulted from the previous hypothesis expansion phase. Each thread reads a hypothesis and appends to it every possible acoustic token to generate new hypotheses. During this stage, the model memory acts as a data cache to leverage the existing locality in the access to the graph structures [7]. The setup program of the hypothesis expansion kernel determines how many hypothesis expansions must be performed. This is useful when the acoustic scoring phase has produced more than one scoring vector.

[Figure 2. ASR decoding in the proposed accelerator. Each decoding step (Decoding Step 0, 1, 2, 3, ...) consists of an acoustic scoring phase followed by a hypothesis expansion phase. Acoustic scoring kernels shown: Feature extraction, AM DNN layer 0 to AM DNN layer 5, ..., AM DNN layer N. Parenthesized values in the figure, in extraction order: 100, 1200, 1520, 500, 500, 1200, 1520, 9000. Hypothesis expansion kernel: Hyp. Expansion (nHyps) x3.]

[Figure 3. Scheduling of kernel and setup threads in the PE pool. Annotations: configure DMA to load as_0 weights; execute setup_{i+1} alongside as_i threads; configure DMA to load as_{i+1} weights; interrupt decoding step; finish acoustic scoring; DMA; model memory. Numbered markers 1 to 7 correspond to the steps described in the text.]

C. Hypothesis Controller

The hypothesis expansion threads send the generated hypotheses to the hypothesis controller, which sorts them according to their score and prunes them according to the beam width and the maximum number of threads supported by the hardware. This controller stores the received hypotheses in the Hypothesis memory. The non-pruned hypotheses are kept there in between decoding steps. Hypothesis expansion threads access the hypothesis memory through the hypothesis controller.

III. RESULTS

We configure the architecture with 8 cores, each containing 4 KB of i-cache and 24 KB of d-cache. The vector units are of width 4. Outside of the cores, we include 1 MB of prefetch buffer/d-cache, 512 KB of shared memory, 64 KB of i-cache and 24 KB of hypothesis memory. This configuration results in around 12 mm² at 22 nm (64% of which is dedicated to the core pool) and provides a peak performance of 32 GMAC/s at 500 MHz. We estimate the peak power at about 1.8 W.

To estimate performance, we implement a recent end-to-end CTC-based ASR system [4].
It consists of a TDS network that extracts acoustic scores from MFCC features. The hypotheses are expanded by traversing a lexicon tree and a graph-based language model.

The acoustic scoring phase consists of 1 kernel to compute MFCC features and 79 kernels (18 convolutions, 29 fully-connected layers and 32 LayerNorms) to implement the TDS network, each preceded by its corresponding setup thread. We only implement each type once and reuse the kernel code as explained in Section II-B.

The ASR system generates 100 MFCC frames per second of audio and the TDS network applies a sub-sampling factor of 8 to the input. This means that decoding a second of audio requires 13 decoding steps. We estimate that it takes about 520 ms to decode a second of audio in the proposed accelerator, which is faster than real time. We also estimate that the average power dissipation is slightly over 1 W.

ACKNOWLEDGMENT

This work has been supported by the CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020 program (grant No 833057), the Spanish State Research Agency (MCIN/AEI) under grant PID2020-113172RB-I00, the ICREA Academia program and the Spanish MICINN Ministry under grant BES-2017-080605.

REFERENCES

[1] “Speech recognition on librispeech test-clean,” https://paperswithcode.com/sota/speech-recognition-on-librispeech-test-clean, [Online; accessed 29-Oct-2021].

[2] “Speech and voice recognition market size, share & industry analysis, by component (solution, services), by technology (voice recognition, speech recognition), by deployment (on-premises, cloud), by end-user (healthcare, IT and telecommunications, automotive, BFSI, government, legal, retail, travel and hospitality and others) and regional forecast, 2019 – 2026,” 2019.

[3] Y.-A. Chung, Y. Zhang, W. Han, C.-C. Chiu, J. Qin, R. Pang, and Y. Wu, “W2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training,” arXiv preprint arXiv:2108.06209, 2021.

[4] A. Hannun, A. Lee, Q. Xu, and R. Collobert, “Sequence-to-sequence speech recognition with time-depth separable convolutions,” arXiv preprint arXiv:1904.02619, 2019.

[5] G. P. Perrucci, F. H. Fitzek, and J. Widmer, “Survey on energy consumption entities on the smartphone platform,” in 2011 IEEE 73rd Vehicular Technology Conference (VTC Spring). IEEE, 2011, pp. 1–6.

[6] D. Pinto, J.-M. Arnau, and A. González, “Design and evaluation of an ultra low-power human-quality speech recognition system,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 17, no. 4, pp. 1–19, 2020.

[7] R. Yazdani, A. Segura, J.-M. Arnau, and A. Gonzalez, “An ultra low-power hardware accelerator for automatic speech recognition,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016, pp. 1–12.

[8] Y. Zhang, J. Qin, D. S. Park, W. Han, C.-C. Chiu, R. Pang, Q. V. Le, and Y. Wu, “Pushing the limits of semi-supervised learning for automatic speech recognition,” arXiv preprint arXiv:2010.10504, 2020.
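As a supplementary note to Section III, the reported timing figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in that section; the assumption that one decoding step corresponds to one sub-sampled score vector, rounded up to a whole number of steps, is ours and is not stated explicitly in the text, and the per-step time is a plain average rather than a measured value.

```python
import math

# Values reported in Section III.
MFCC_FRAMES_PER_SECOND = 100
SUBSAMPLING_FACTOR = 8
DECODE_MS_PER_AUDIO_SECOND = 520.0  # estimated decoding time

# Assumption: one decoding step per sub-sampled score vector, rounded up.
score_vectors_per_second = MFCC_FRAMES_PER_SECOND / SUBSAMPLING_FACTOR
decoding_steps_per_second = math.ceil(score_vectors_per_second)

# Real-time factor: processing time divided by audio duration (< 1 is
# faster than real time).
real_time_factor = DECODE_MS_PER_AUDIO_SECOND / 1000.0

# Average time available to each decoding step under the assumption above.
avg_ms_per_step = DECODE_MS_PER_AUDIO_SECOND / decoding_steps_per_second

print(f"score vectors per second: {score_vectors_per_second}")
print(f"decoding steps per second: {decoding_steps_per_second}")
print(f"real-time factor: {real_time_factor:.2f}")
print(f"average time per step: {avg_ms_per_step:.1f} ms")
```

Under these assumptions the 13 decoding steps and the 520 ms estimate correspond to a real-time factor of 0.52 and an average of 40 ms per decoding step.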

A programmable accelerator for streaming automatic speech recognition on edge devices

https://upcommons.upc.edu/bitstream/2117/373474/1/A%20Programmable%20Accelerator%20CogArch2022.pdf
