# Language Modeling on a SpiNNaker2 Neuromorphic Chip

Khaleelulla Khan Nazeer<sup>1,+</sup>, Mark Schöne<sup>1</sup>, Rishav Mukherji<sup>2</sup>, Bernhard Vogginger<sup>1</sup>, Christian Mayr<sup>1,3</sup>, David Kappel<sup>4</sup> and Anand Subramoney<sup>5</sup>

<sup>1</sup>*Chair of Highly-Parallel VLSI-Systems and Neuro-Microelectronics, Technische Universität Dresden, Germany*

<sup>2</sup>*Birla Institute of Technology and Science, Pilani – Goa Campus, Goa, India*

<sup>3</sup>*Centre for Tactile Internet (CeTI) with Human-in-the-Loop, Technische Universität Dresden, Germany*

<sup>4</sup>*Institut für Neuroinformatik, Ruhr Universität Bochum, Bochum, Germany*

<sup>5</sup>*Dept. of Computer Science, Royal Holloway, University of London, Egham, United Kingdom*

<sup>+</sup>Email: khaleelulla.khan@tu-dresden.de

*Abstract*—As large language models continue to scale in size rapidly, so too does the computational power required to run them. Event-based networks on neuromorphic devices offer a potential way to reduce energy consumption for inference significantly. However, to date, most event-based networks that can run on neuromorphic hardware, including spiking neural networks (SNNs), have not achieved task performance even on par with LSTM models for language modeling. As a result, language modeling on neuromorphic devices has seemed a distant prospect. In this work, we demonstrate the first-ever implementation of a language model on a neuromorphic device – specifically the SpiNNaker2 chip – based on a recently published event-based architecture called the EGRU. SpiNNaker2 is a many-core neuromorphic chip designed for large-scale asynchronous processing, and the EGRU is architected to leverage such hardware efficiently while maintaining competitive task performance. This implementation marks the first time a neuromorphic language model matches LSTMs, setting the stage for taking task performance to the level of large language models. We also demonstrate results on a gesture recognition task based on inputs from a DVS camera. Overall, our results showcase the feasibility of this neuro-inspired neural network in hardware, highlighting significant gains versus conventional hardware in energy efficiency for the common use case of single batch inference.

*Index Terms*—Neuromorphic, Language model, Energy efficient, Sparse activity, Sparse weights

## I. INTRODUCTION

Most deep learning systems, from edge to cloud, rely on highly regular SIMD processing. The tremendous success of this processing paradigm has encouraged further convergence of hardware accelerators and algorithms to high throughput SIMD systems [\[1\]](#page-3-0). At the same time, deep learning algorithms exhibit a surprisingly high degree of inherent sparsity, which SIMD accelerators are unable to exploit. Experimental and theoretical studies have shown that a large fraction of connections can be removed entirely without sacrificing learning accuracy [\[2\]](#page-4-0), [\[3\]](#page-4-1). Furthermore, deep learning models can operate on highly sparse representations without sacrificing precision [\[4\]](#page-4-2)–[\[7\]](#page-4-3). These findings challenge the design principles of today's SIMD-based deep learning systems from an energy efficiency perspective. Communication is the central energy and latency cost factor in contemporary computer architectures [\[8\]](#page-4-4). Dense matrix operations are omnipresent in deep learning and require $\mathcal{O}(n^2)$ messages for $n$-dimensional representations. This unfavorable scaling is under growing pressure, as the density of computational operations in hardware grows about twice as fast per year as memory and interconnect bandwidth.

In this work, we present a sparsely connected and sparsely activated architecture running on a processor that can take advantage of this unstructured sparsity for energy efficiency. More specifically, we present an implementation of a sparse network based on the EGRU [\[6\]](#page-4-5) architecture on a SpiNNaker2 chip. The EGRU [\[6\]](#page-4-5) is a recently proposed event-based network that naturally exhibits high levels of activity sparsity and was shown to have high levels of task performance on language modeling and gesture recognition tasks, among others. To make full use of its potential, we implement it on the SpiNNaker2 chip, which is a digital neuromorphic system optimized for sparse communication and event-based processing [\[9\]](#page-4-6). Our implementation operates on units with unstructured sparse connectivity that communicate sparsely in time. Both operations can be accelerated on SpiNNaker2, but not on conventional SIMD architectures. We choose language modeling as our demonstrator. Since the EGRU is a recurrent network, it is able to exploit the temporal inductive bias of sequence modeling tasks such as language modeling for computationally efficient processing. While transformers [\[10\]](#page-4-7) are the dominant architecture for language modeling, they are computationally very expensive, which makes it even more urgent to find an energy-efficient alternative. This first demonstration of the energy gains achievable using a recurrent architecture on neuromorphic hardware will set the stage for neuromorphic language modeling using even more powerful recurrent architectures [\[11\]](#page-4-8), [\[12\]](#page-4-9).

## II. RELATED WORK

Recent advances in machine learning have led to increased interest in energy-efficient hardware accelerators. Hardware-software co-design for machine learning accelerators has been used to target scaling to extremely large models [\[13\]](#page-4-10), [\[14\]](#page-4-11). More recently, there has been an increased focus on making transformer-based neural networks more efficient using accelerators for conventional hardware (see [\[15\]](#page-4-12) for a review). A 4-bit quantized accelerator in 5 nm presented recently [\[16\]](#page-4-13) demonstrated high energy efficiency and throughput. Spiking variants of popular transformer architectures have also recently been introduced [\[17\]](#page-4-14)–[\[19\]](#page-4-15), but no advantage on custom hardware has been reported yet. Neuromorphic LSTM accelerators have been developed using FPGAs [\[20\]](#page-4-16), systolic arrays [\[21\]](#page-4-17), and memristors [\[22\]](#page-4-18). A hybrid LSTM/spiking neuron architecture was implemented on Intel's Loihi chip, demonstrating energy gains [\[23\]](#page-4-19). None of these LSTM-based approaches has been scaled to standard NLP benchmark tasks yet. Spiking LSTM [\[24\]](#page-4-20) and EGRU [\[6\]](#page-4-5) are two attempts at bringing event-based properties to the respective base architectures; both allow for full-precision gates and graded spike communication between units. Other related approaches include sigma-delta quantised networks, which communicate only quantised changes in activations to the next layer in a feed-forward network [\[25\]](#page-4-21), and their extension to recurrent networks [\[4\]](#page-4-2). An FPGA accelerator for sparsely connected and sparsely communicating Delta Networks, which greatly reduces the required memory access, was presented in [\[26\]](#page-4-22).

## III. BACKGROUND

#### *A. The SpiNNaker2 System*

SpiNNaker2 is an accelerator for large-scale event-based and asynchronous processing [\[9\]](#page-4-6). The chip consists of 152 processing elements (PEs) connected via a network-on-chip (NoC). Each processing element is composed of an Arm M4f core, 128 kB SRAM, and a set of accelerators for exponential functions, random number generation and multiply-accumulate (MAC) operations. The total of 19 MB on-chip SRAM is accompanied by 2 GB LPDDR4 memory. Communication between the PEs in a single chip can be implemented by direct memory access (DMA) to other PEs' local memory. The local SRAM is organized into four memory banks of 32 kB each. One bank is usually reserved for program memory, and the other three hold values such as RNN weights and intermediate variables. See Table [I](#page-3-1) for the memory footprint of the EGRU language model.

#### *B. Event-based Gated Recurrent Unit*

The Gated Recurrent Unit (GRU) is an effective recurrent neural network that has been widely adopted for sequence modeling [\[27\]](#page-4-23). To reduce the communication between logical neurons, [\[6\]](#page-4-5) apply a biologically inspired thresholding mechanism to the GRU. In this model, called Event-based Gated Recurrent Unit (EGRU), a layer consists of  $n$  neurons with output y and state c. A sparse output  $y = (y_1, \ldots, y_n)$  is generated from the GRU cell state  $\mathbf{c} = (c_1, \dots, c_n)$  via the following mechanism

$$
y_i^{\langle t \rangle} = c_i^{\langle t \rangle} H\left(c_i^{\langle t \rangle} - \vartheta_i\right), \quad H(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} \tag{1}
$$

Only the sparse output y is communicated between neurons to compute the update gate u and the reset gate r of the GRU

$$
\mathbf{u}^{\langle t \rangle} = \sigma \left( \mathbf{W}_u \left[ \mathbf{x}^{\langle t \rangle}, \ \mathbf{y}^{\langle t-1 \rangle} \right] + \mathbf{b}_u \right) \tag{2}
$$

$$
\mathbf{r}^{\langle t \rangle} = \sigma \left( \mathbf{W}_r \left[ \mathbf{x}^{\langle t \rangle}, \ \mathbf{y}^{\langle t-1 \rangle} \right] + \mathbf{b}_r \right) \,. \tag{3}
$$

As outlined in [\[28\]](#page-4-24), the sparse state y and the gates u and r compute a proposed state z and the new cell state c

$$
\mathbf{z}^{\langle t \rangle} = g\left(\mathbf{W}_z \left[\mathbf{x}^{\langle t \rangle}, \ \mathbf{r}^{\langle t \rangle} \circ \ \mathbf{y}^{\langle t-1 \rangle}\right] + \mathbf{b}_z\right) \tag{4}
$$

$$
\mathbf{c}^{\langle t \rangle} = \mathbf{u}^{\langle t \rangle} \circ \mathbf{z}^{\langle t \rangle} + (1 - \mathbf{u}^{\langle t \rangle}) \circ \mathbf{c}^{\langle t-1 \rangle} - \mathbf{s}^{\langle t \rangle}. \tag{5}
$$

Similar to biologically plausible spiking neural networks, [\[6\]](#page-4-5) subtract already communicated signals $\mathbf{y}$ from the cell state via the reset term $\mathbf{s}^{\langle t \rangle} = \boldsymbol{\vartheta} H (\mathbf{c}^{\langle t \rangle} - \boldsymbol{\vartheta})$. During training, the surrogate function $\frac{dH}{dc} = \lambda \max(1 - |c|/\epsilon, 0)$ provides gradients below the threshold.
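To make the data flow of Eqs. (1)–(5) concrete, the following is a minimal pure-Python sketch of a single EGRU time step on dense lists. It is not the SpiNNaker2 implementation. Two points are assumptions that the text does not fix: the activation $g$ is taken to be tanh, and the reset term is computed from the units that emitted an event at the previous step. The function and parameter names (`egru_step`, `affine`, the keys of `p`) are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, v, b):
    """Return W @ v + b for a dense matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def egru_step(x, c_prev, p, theta):
    """One EGRU time step; returns (y, c) for the current step."""
    # Eq. (1) at t-1: only units at or above threshold emitted an event.
    fired = [1.0 if c >= th else 0.0 for c, th in zip(c_prev, theta)]
    y_prev = [c * f for c, f in zip(c_prev, fired)]

    # Eqs. (2) and (3): the gates see the input and the sparse previous output.
    xy = x + y_prev  # list concatenation [x, y_prev]
    u = [sigmoid(a) for a in affine(p["Wu"], xy, p["bu"])]
    r = [sigmoid(a) for a in affine(p["Wr"], xy, p["br"])]

    # Eq. (4). ASSUMPTION: g is tanh.
    xry = x + [ri * yi for ri, yi in zip(r, y_prev)]
    z = [math.tanh(a) for a in affine(p["Wz"], xry, p["bz"])]

    # Eq. (5). ASSUMPTION: the reset term uses the events emitted at t-1.
    s = [th * f for th, f in zip(theta, fired)]
    c = [ui * zi + (1.0 - ui) * ci - si
         for ui, zi, ci, si in zip(u, z, c_prev, s)]

    # Eq. (1) at t: the new sparse output.
    y = [ci if ci >= th else 0.0 for ci, th in zip(c, theta)]
    return y, c
```

Every entry of `y_prev` that is zero contributes nothing to the three matrix products, which is the activity sparsity an event-based implementation can skip.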

#### *C. Language Modeling with EGRU*

Word-level language modeling is a popular benchmark task to measure the performance of sequence models, including RNNs. A language model processes a sequence of words $w_1, \ldots, w_t \in \mathcal{D}$ from a dictionary $\mathcal{D}$, and predicts the conditional distribution $p(w_{t+1}|w_1, \ldots, w_t)$. Its training objective is to minimize the cross entropy $H(p,q)$ between this prediction $p$ and a one-hot encoding $q$ of the actual next word in the sequence. The standard metric for measuring performance is perplexity (PPL), the exponential of the cross entropy, $e^{H(p,q)}$. Artificial texts can be generated by a trained language model by iteratively sampling from the next-word distribution $p$ predicted by the model.
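As a small illustration of the metric, the sketch below computes perplexity from the probabilities a model assigned to the words that actually occurred. With a one-hot target, the cross entropy for one position reduces to the negative log-probability of the true next word; averaging over positions (an assumption about the evaluation protocol, which the text does not spell out) and exponentiating gives the PPL. The function name is hypothetical.

```python
import math


def perplexity(true_word_probs):
    """PPL from the model probability of each actual next word."""
    # With a one-hot target, cross entropy per position is -log p(true word).
    nll = [-math.log(p) for p in true_word_probs]
    mean_cross_entropy = sum(nll) / len(nll)
    return math.exp(mean_cross_entropy)
```

A model that always spreads its probability uniformly over $V$ words has a perplexity of $V$, so PPL can be read as an effective branching factor.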

We trained three EGRU layers without skip connections to process word embedding vectors drawn from a learned lookup table, similar to [\[29\]](#page-4-25). The model estimates the likelihood of the next word in a sequence by computing the dot-product similarity between the output vector of the final EGRU layer and all word embedding vectors of the dictionary. Softmax applied to this set of values serves as an estimate of the conditional distribution $p$. The dimension of the word embedding vectors and of the final layer's cell state was 750. The dimension of the intermediate layers' cell state was 1350. We used a model from Mukherji et al. [\[28\]](#page-4-24) trained on the WikiText-2 dataset [\[30\]](#page-4-26) with a parameter sparsity of 95 % per weight matrix. The model weights were stored in SRAM in compressed sparse row (CSR) format. The three EGRU layers were implemented on 150 PEs.
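The output stage described above can be sketched in a few lines: dot products between the final-layer output and every embedding vector, followed by a softmax. This is an illustrative pure-Python version with a hypothetical function name. Subtracting the maximum logit is a standard numerical-stability step and is not taken from the source.

```python
import math


def next_word_distribution(h, embeddings):
    """Softmax over dot products of h with each word embedding."""
    logits = [sum(hi * ei for hi, ei in zip(h, e)) for e in embeddings]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [v / total for v in exps]
```

Because the same embedding table serves as input lookup and output projection, the final layer's dimension has to equal the embedding dimension, which is why both are 750 here.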

#### *D. DVS gesture recognition*

We also evaluated our model on gesture prediction, using the DVS128 Gesture Dataset [\[31\]](#page-4-27). This dataset contains 11 gestures from 29 subjects recorded with a DVS128 event camera [\[32\]](#page-4-28). Each event encodes a relative change of illumination and is given as spatio-temporal coordinates of X/Y position on the 128×128-pixel sensor and time stamp.

Our model consisted of a CNN feature extraction head and two EGRU layers of 256 units each. The dimension of the features extracted by the CNN was 512. Finally, a linear layer was used to predict the class of the gesture. For this task we used only dense weights and stored them directly in SRAM, since the model is small enough to fit in local memory. The two EGRU layers were implemented on 128 PEs.

## IV. SPINNAKER2 IMPLEMENTATION

#### *A. Implementation of EGRU on a single processing element*

We were able to fit the simplest EGRU model on a single PE of SpiNNaker2. Three operations need to be performed as part of the EGRU algorithm: 1) input matrix multiplication, 2) recurrent matrix multiplication, and 3) pointwise operations. For a single-PE implementation, we can simply execute these operations sequentially. No data transfer is needed, as all the results are available in local memory. Although there is a multiply-accumulate (MAC) accelerator on SpiNNaker2, we do not use it in this application, so that we can take full advantage of EGRU's dynamic sparsity.

#### *B. Parallelization approaches*

Since any realistic model, including our larger EGRU models, will not be small enough to fit onto a single PE, we need to split the network over multiple PEs. To do this, we split each layer and place its neurons on different PEs. This approach reduces the communication and synchronization required within the network. The output generated by the units placed on a single PE determined the output of that PE. This output $y^{(t)}$, at time $t$, needed to be broadcast to the rest of the units in the EGRU layer. On receiving such a broadcast, each PE concatenated the outputs from all other PEs with the output of the units stored locally to form the next recurrent input. This broadcast was implemented by sending internal NoC packets between PEs. This operation is illustrated in Fig. [1](#page-2-0), and the procedure is presented in Algorithm [1](#page-2-1).
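The neuron-wise split can be simulated on a host machine to check that it leaves the layer output unchanged. In the sketch below, each simulated PE holds the weight rows of its own units, computes its slice of the output from the full previous output, and the slices are then concatenated, which stands in for the NoC broadcast. The contiguous placement in `split_rows` and the simplified per-PE update (a recurrent product followed by a threshold instead of the full EGRU update) are assumptions made for brevity. All names are hypothetical.

```python
def split_rows(n_units, n_pes):
    """Contiguous unit ranges per PE (hypothetical placement)."""
    base, extra = divmod(n_units, n_pes)
    ranges, start = [], 0
    for pe in range(n_pes):
        size = base + (1 if pe < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def pe_step(R_local, y_prev_full, theta=0.5):
    """Work of one PE: local rows times the full previous output."""
    # Stand-in for the full EGRU update: recurrent product plus threshold.
    pre = [sum(w * v for w, v in zip(row, y_prev_full)) for row in R_local]
    return [p if p >= theta else 0.0 for p in pre]


def layer_step(R, y_prev, n_pes):
    """Run all PEs, then concatenate their slices (the 'broadcast')."""
    slices = [pe_step(R[lo:hi], y_prev) for lo, hi in split_rows(len(R), n_pes)]
    return [v for s in slices for v in s]
```

Each PE needs the complete `y_prev` but only its own rows of the weight matrix, so only one dimension of the matrix shrinks as PEs are added.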

With this parallelization, we only split one dimension of the $R$ weight matrix. Since the second dimension was as large as the number of units in a layer, the recurrent weight matrix was still too large to fit in individual PE memory. To mitigate this, we used a 95% pruned EGRU model for language modeling. The pruned weights were stored in compressed sparse row (CSR) format. In this format, the non-zero (NZ) elements of the matrix are represented using three one-dimensional arrays. These contain the NZ values, the column indices of the NZ elements, and the extents of the rows, which together require $2 \cdot NZ + N_{rows} + 1$ memory entries.
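The CSR layout and its memory count can be illustrated with a short pure-Python sketch. It builds the three arrays from a dense matrix, multiplies the result with a vector, and counts entries according to $2 \cdot NZ + N_{rows} + 1$. Skipping zero entries of the input vector inside the row loop is one simple way to combine weight sparsity with activity sparsity. It is an assumption for illustration, not a description of the SpiNNaker2 kernel. All function names are hypothetical.

```python
def to_csr(dense):
    """Return (values, col_idx, row_ptr) for a dense list-of-rows matrix."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, w in enumerate(row):
            if w != 0.0:
                values.append(w)
                col_idx.append(j)
        row_ptr.append(len(values))  # extent of the row just finished
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, y):
    """Sparse matrix times vector, skipping silent (zero) inputs."""
    out = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            yj = y[col_idx[k]]
            if yj != 0.0:  # unit j emitted no event: nothing to add
                acc += values[k] * yj
        out.append(acc)
    return out


def csr_entries(nz, n_rows):
    """Number of stored entries: values + column indices + row extents."""
    return 2 * nz + n_rows + 1
```

For a hypothetical 81 × 2700 slice at 95 % sparsity, there are about 10 935 non-zeros, so `csr_entries(10935, 81)` gives 21 952 entries, compared with 218 700 for the dense slice.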

#### *C. Dataset and pre-processing*

*1) Language Modeling:* The model was trained and validated on the WikiText-2 dataset. The text was tokenised and split into sequences of length 70. The embeddings were precomputed and transferred to the LPDDR4 memory.

*2) DVS:* We combined the DVS raw event times into 'frames' by binning them over time windows of 25 ms, and then downscaled them to  $32 \times 32$  pixels using a maxpool layer. The dataset was pre-processed and the features extracted using the CNN head. The extracted features were stored in the LPDDR4 memory.
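The binning and downscaling steps can be sketched as follows. Events are taken to be `(x, y, t_us)` tuples with timestamps in microseconds, frames accumulate per-pixel event counts, and polarity is ignored. All three are assumptions, since the text does not specify the frame encoding. A 4 × 4 max-pool follows from reducing 128 × 128 to 32 × 32. Function names are hypothetical.

```python
def maxpool(frame, k):
    """Non-overlapping k x k max pooling of a square 2-D list."""
    n = len(frame) // k
    return [[max(frame[i * k + di][j * k + dj]
                 for di in range(k) for dj in range(k))
             for j in range(n)]
            for i in range(n)]


def events_to_frames(events, window_us=25_000, sensor=128, out=32):
    """Bin (x, y, t_us) events into count frames, then max-pool them."""
    if not events:
        return []
    k = sensor // out  # pooling factor, 4 for 128 -> 32
    t0 = min(t for _, _, t in events)
    n_frames = (max(t for _, _, t in events) - t0) // window_us + 1
    frames = [[[0] * sensor for _ in range(sensor)] for _ in range(n_frames)]
    for x, y, t in events:
        frames[(t - t0) // window_us][y][x] += 1  # one count per event
    return [maxpool(f, k) for f in frames]
```

Each resulting 32 × 32 frame would then be passed to the CNN head for feature extraction.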

<span id="page-2-1"></span>**Algorithm 1** EGRU algorithm for multi-PE implementation

**procedure** EGRU

- **Input:**
	- Network configuration parameters
	- Input data
	- Output data destination
- **Output:** Processed output data
- **Initialization:** Initialize temporary variables.
- **while** run is true **do**
	- Check for input data availability.
	- Process input data and prepare it for computation.
	- **for** each time step $t$ **do**
		- $Wx \leftarrow$ matrix multiplication: $W_x \times x_t$.
		- $Rh \leftarrow$ matrix multiplication: $W_r \times y_{t-1}$.
		- Point-wise operation on $Wx$ and $Rh$.
	- Store output data.
	- Wait for host to read output data.
	- **if** run is false **then** stop.



<span id="page-2-0"></span>Fig. 1. EGRU operations and distribution strategy. The figure shows the computation performed on a single PE as part of a multi-core implementation; the grayed-out portions are computed on other PEs. $W$ and $R$ are kernels processing the current input $x_t$ and the previous output $y_{t-1}$, respectively. $f\{\cdot\}$ represents point-wise operations.

## V. RESULTS

We measured the time required by the EGRU operations using an internal timer. This timer ticks at a rate of 1 MHz and decrements a counter. We logged the timer value at various points in the algorithm to estimate the time spent in each stage. The results of this profiling are shown in Fig. [2](#page-3-2). As can be seen, the most expensive part was the recurrent matrix multiplication (*egru internal*). Broadcasting the layer activations was a comparatively cheap operation, since the broadcast uses efficient NoC packets. In the single-chip case, the algorithm was therefore bottlenecked by memory reads and writes rather than by communication. However, this might not hold for multi-chip communication.
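Converting logged values of a down-counting timer into stage durations is simple arithmetic, sketched below. At 1 MHz, one tick corresponds to one microsecond, and because the counter decrements, the elapsed time is the earlier reading minus the later one. The 32-bit counter width used for wrap-around handling and the stage labels in the usage example are assumptions. The function names are hypothetical.

```python
TIMER_HZ = 1_000_000  # 1 MHz tick rate, as stated in the text
COUNTER_BITS = 32     # ASSUMPTION: counter width is not given in the text


def elapsed_us(start_ticks, end_ticks):
    """Microseconds between two readings of a down-counting timer."""
    ticks = (start_ticks - end_ticks) % (1 << COUNTER_BITS)
    return ticks * 1_000_000 // TIMER_HZ


def stage_durations(checkpoints):
    """checkpoints: [(label, counter_value), ...] in program order."""
    return {label: elapsed_us(prev, cur)
            for (_, prev), (label, cur) in zip(checkpoints, checkpoints[1:])}
```

For example, `stage_durations([("start", 1000), ("input_matmul", 900), ("recurrent_matmul", 300)])` attributes 100 µs to the first stage and 600 µs to the second.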

<span id="page-3-1"></span>TABLE I MEMORY FOOTPRINT OF 95% PRUNED EGRU MODEL. WEIGHT MATRICES STORED IN SPARSE CSR FORMAT. TOTAL AVAILABLE INSTRUCTION MEMORY IS 32 KB AND THE AVAILABLE DATA MEMORY IS 96 KB.

|             | Instruction | Debug | Weights | Variables |
|-------------|-------------|-------|---------|-----------|
| Memory (KB) | 17.9        |       | 88.3    | 2.6       |

<span id="page-3-3"></span>TABLE II EGRU LM ON SPINNAKER2 COMPARISON WITH GPU



#### *A. Power and energy consumption*

The power consumption of the EGRU language model is shown in Table [II](#page-3-3). For inference, our implementation on SpiNNaker2 consumes only a fraction of a watt. Whereas the time required on SpiNNaker2 scales linearly with batch size, the GPU can process larger batch sizes in the same time. Hence, at larger batch sizes GPUs tend to be more efficient. This is only shown for the DVS task, because the LM task has a memory bottleneck that only allows a batch size of one on a single chip. See Table [III](#page-3-4) for this energy comparison. The test accuracy on the classification task was identical for the GPU and SpiNNaker2 implementations, which is consistent with numerical equivalence, since we perform 32-bit floating-point operations on both architectures.
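The energy values in Table III follow from power multiplied by the per-sample time. The short check below recomputes them from the tabulated power and time entries and compares the results with the tabulated energies. The numbers are copied from Table III, while the dictionary keys and the function name are hypothetical labels for the four table columns.

```python
def energy_joules(power_w, time_ms):
    """Energy per sample from average power and per-sample time."""
    return power_w * time_ms / 1000.0


# (power in W, time in ms, reported energy in J), copied from Table III.
table_iii = {
    "A100, column 1": (61.0, 1.09, 0.067),
    "A100, column 2 (batch size 6)": (61.0, 0.19, 0.011),
    "SpiNNaker2, column 1": (0.39, 60.19, 0.023),
    "SpiNNaker2, column 2 (batch size 6)": (0.42, 56.20, 0.023),
}

recomputed = {name: energy_joules(p, t) for name, (p, t, _) in table_iii.items()}
```

The recomputed values agree with the reported ones to within about 1 mJ. SpiNNaker2 is roughly three times more energy-efficient per sample in the first column, and the A100 is ahead at batch size 6.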

## VI. OUTLOOK

The successful implementation of a language model on the SpiNNaker2 chip using the EGRU event-based architecture represents a significant milestone in the field of neuromorphic computing. We compare our implementation with one on an Nvidia GPU and show a real energy advantage in the single batch size setting, which we expect to be the most relevant for inference, especially on edge devices. We also identified several bottlenecks in our implementation that need improvement for even further efficiency. In particular, quantizing the model will allow us to work with even tighter memory constraints to fit larger models and further increase energy efficiency, albeit with a slight reduction in task performance. Quantization will also allow us to take advantage of the MAC accelerator available on the SpiNNaker2 chip. We also plan to further scale the deployed model to multiple chips, as SpiNNaker2 is designed for efficient distributed computing.

<span id="page-3-4"></span>TABLE III EGRU DVS GESTURE PREDICTION ON SPINNAKER2 COMPARISON WITH GPU. TIME AND ENERGY MEASUREMENTS NORMALIZED OVER BATCH SIZE.

| Measurement  | Nvidia A100 |       | SpiNNaker2 |       |
|--------------|-------------|-------|------------|-------|
| Batch size   | 1           | 6     | 1          | 6     |
| Power (W)    | 61.0        | 61.0  | 0.39       | 0.42  |
| Time (ms)    | 1.09        | 0.19  | 60.19      | 56.20 |
| Energy (J)   | 0.067       | 0.011 | 0.023      | 0.023 |
| Accuracy (%) | 96.83       | 96.83 | 96.83      | 96.83 |

<span id="page-3-2"></span>Fig. 2. Profiling of internal EGRU operations for language modeling. The EGRU model has 1350 units and 3 layers and is 95% pruned. Weights are stored in sparse CSR format.

Scaling up neuromorphic language models to more contemporary large sizes by harnessing very recent innovations in recurrent architectures [\[33\]](#page-4-29), [\[34\]](#page-4-30) is the self-evident next step. Bringing the performance on par with standard deep learning also suggests expanding the range of real-world applications for neuromorphic hardware in future work, including real-time applications, for which it is well suited. Overall, our implementation has demonstrated, for the first time, that challenging machine learning tasks are not beyond the scope of neuromorphic computing, and it heralds the beginning of more mainstream use of neuromorphic devices as a complement to GPUs for appropriate use cases.

## ACKNOWLEDGMENT

This work was partially funded by the German Federal Ministry of Education and Research (BMBF) and the Free State of Saxony within the ScaDS.AI center of excellence for AI research. This work was partially funded by the German BMBF within the KI-ASIC project (16ES0996). Khaleelulla Khan Nazeer is funded by the German Federal Ministry of Education and Research (BMBF), funding reference 16ME0729K, joint project "EVENTS". Mark Schöne is fully funded by the Bosch Research Foundation. David Kappel is funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) project ESCADE (01MN23004A). Christian Mayr is affiliated with the German Research Foundation (DFG, Deutsche Forschungsgemeinschaft) as part of Germany's Excellence Strategy – EXC 2050/1 – Project ID 390696704 – Cluster of Excellence "Centre for Tactile Internet with Human-in-the-Loop" (CeTI) of Technische Universität Dresden.

## REFERENCES

- <span id="page-3-0"></span>[1] Paul Barham and Michael Isard. Machine learning systems are stuck in a rut. In *Proceedings of the Workshop on Hot Topics in Operating Systems*, HotOS '19, page 177–183, New York, NY, USA, 2019. Association for Computing Machinery.
- <span id="page-4-0"></span>[2] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. *arXiv preprint arXiv:1803.03635*, 2018.
- <span id="page-4-1"></span>[3] Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. *J. Mach. Learn. Res.*, 22(1), jan 2021.
- <span id="page-4-2"></span>[4] Daniel Neil, Jun Haeng Lee, Tobi Delbruck, and Shih-Chii Liu. Delta networks for optimized recurrent network computation. In Doina Precup and Yee Whye Teh, editors, *Proceedings of the 34th International Conference on Machine Learning*, volume 70 of *Proceedings of Machine Learning Research*, pages 2584–2593. PMLR, 06–11 Aug 2017.
- [5] Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, and Sanjiv Kumar. The lazy neuron phenomenon: On emergence of activation sparsity in transformers. In *The Eleventh International Conference on Learning Representations*, 2023.
- <span id="page-4-5"></span>[6] Anand Subramoney, Khaleelulla Khan Nazeer, Mark Schöne, Christian Mayr, and David Kappel. Efficient recurrent architectures through activity sparsity and sparse back-propagation through time. In *The Eleventh International Conference on Learning Representations*, 2023.
- <span id="page-4-3"></span>[7] Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, and Mehrdad Farajtabar. Relu strikes back: Exploiting activation sparsity in large language models. In *NeurIPS*, 2023.
- <span id="page-4-4"></span>[8] Mark Horowitz. 1.1 computing's energy problem (and what we can do about it). In *2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC)*, pages 10–14, 2014.
- <span id="page-4-6"></span>[9] Hector Andres Gonzalez, Jiaxin Huang, Florian Kelber, Khaleelulla Khan Nazeer, Tim Hauke Langer, Chen Liu, Matthias Aleander Lohrmann, Amirhossein Rostami, Mark Schöne, Bernhard Vogginger, Timo Wunderlich, Yexin Yan, Mahmoud Akl, and Christian Mayr. SpiNNaker2: A large-scale neuromorphic system for event-based and asynchronous machine learning. In *First Workshop on Machine Learning with New Compute Paradigms*, 2023.
- <span id="page-4-7"></span>[10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In *Proceedings of the 31st International Conference on Neural Information Processing Systems*, NIPS'17, page 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc.
- <span id="page-4-8"></span>[11] Albert Gu, Karan Goel, and Christopher Re. Efficiently modeling long sequences with structured state spaces. In *International Conference on Learning Representations*, 2022.
- <span id="page-4-9"></span>[12] Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Xiangru Tang, Bolun Wang, Johan S. Wind, Stansilaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Jian Zhu, and Rui-Jie Zhu. Rwkv: Reinventing rnns for the transformer era, 2023.
- <span id="page-4-10"></span>[13] Karl Freund and Patrick Moorhead. The graphcore second-generation ipu, 2020.
- <span id="page-4-11"></span>[14] Sean Lie. Cerebras architecture deep dive: First look inside the hardware/software co-design for deep learning. *IEEE Micro*, 43(3):18–30, 2023.
- <span id="page-4-12"></span>[15] Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qijing Huang, Kurt Keutzer, Michael W Mahoney, et al. Full stack optimization of transformer inference: a survey. *arXiv preprint arXiv:2302.14017*, 2023.
- <span id="page-4-13"></span>[16] Ben Keller, Rangharajan Venkatesan, Steve Dai, Stephen G Tell, Brian Zimmer, Charbel Sakr, William J Dally, C Thomas Gray, and Brucek Khailany. A 95.6-tops/w deep learning inference accelerator with per-vector scaled 4-bit quantization in 5 nm. *IEEE Journal of Solid-State Circuits*, 58(4):1129–1141, 2023.
- <span id="page-4-14"></span>[17] Jiyuan Zhang, Lulu Tang, Zhaofei Yu, Jiwen Lu, and Tiejun Huang. Spike transformer: Monocular depth estimation for spiking camera. In *European Conference on Computer Vision*, pages 34–52. Springer, 2022.
- [18] Rui-Jie Zhu, Qihang Zhao, and Jason K Eshraghian. Spikegpt: Generative pre-trained language model with spiking neural networks. *arXiv preprint arXiv:2302.13939*, 2023.
- <span id="page-4-15"></span>[19] Malyaban Bal and Abhronil Sengupta. Spikingbert: Distilling bert to train spiking language models using implicit differentiation. *arXiv preprint arXiv:2308.10873*, 2023.
- <span id="page-4-16"></span>[20] Zhanrui Sun, Yongxin Zhu, Yu Zheng, Hao Wu, Zihao Cao, Peng Xiong, Junjie Hou, Tian Huang, and Zhiqiang Que. Fpga acceleration of lstm based on data for test flight. In *2018 IEEE International Conference on Smart Cloud (SmartCloud)*, pages 1–6. IEEE, 2018.
- <span id="page-4-17"></span>[21] Francesco Conti, Lukas Cavigelli, Gianna Paulin, Igor Susmelj, and Luca Benini. Chipmunk: A systolically scalable 0.9 mm 2, 3.08 gop/s/mw@ 1.2 mw accelerator for near-sensor recurrent neural network inference. In *2018 IEEE Custom Integrated Circuits Conference (CICC)*, pages 1–4. IEEE, 2018.
- <span id="page-4-18"></span>[22] Kamilya Smagulova and Alex Pappachen James. A survey on lstm memristive neural network architectures and applications. *The European Physical Journal Special Topics*, 228(10):2313–2324, 2019.
- <span id="page-4-19"></span>[23] Arjun Rao, Philipp Plank, Andreas Wild, and Wolfgang Maass. A long short-term memory for ai applications in spike-based neuromorphic hardware. *Nature Machine Intelligence*, 4(5):467–479, 2022.
- <span id="page-4-20"></span>[24] Ali Lotfi Rezaabad and Sriram Vishwanath. Long short-term memory spiking networks and their applications. In *International Conference on Neuromorphic Systems 2020*, ICONS 2020, New York, NY, USA, 2020. Association for Computing Machinery.
- <span id="page-4-21"></span>[25] Peter O'Connor and Max Welling. Sigma delta quantized networks. In *International Conference on Learning Representations*, 2017.
- <span id="page-4-22"></span>[26] Chang Gao, Tobi Delbruck, and Shih-Chii Liu. Spartus: A 9.4 top/s fpga-based lstm accelerator exploiting spatio-temporal sparsity. *IEEE Transactions on Neural Networks and Learning Systems*, pages 1–15, 2022.
- <span id="page-4-23"></span>[27] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, *Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL*, pages 1724–1734. ACL, 2014.
- <span id="page-4-24"></span>[28] Rishav Mukherji, Mark Schöne, Khaleelulla Khan Nazeer, Christian Mayr, and Anand Subramoney. Activity sparsity complements weight sparsity for efficient RNN inference. In *First Workshop on Machine Learning with New Compute Paradigms*, 2023.
- <span id="page-4-25"></span>[29] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. In *International Conference on Learning Representations*, 2018.
- <span id="page-4-26"></span>[30] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In *International Conference on Learning Representations*, 2017.
- <span id="page-4-27"></span>[31] Arnon Amir, Brian Taba, David Berg, Timothy Melano, Jeffrey McKinstry, Carmelo Di Nolfo, Tapan Nayak, Alexander Andreopoulos, Guillaume Garreau, Marcela Mendoza, Jeff Kusnitz, Michael Debole, Steve Esser, Tobi Delbruck, Myron Flickner, and Dharmendra Modha. A low power, fully event-based gesture recognition system. page 10.
- <span id="page-4-28"></span>[32] P. Lichtsteiner, C. Posch, and T. Delbruck. A 128 x 128 120db 30mw asynchronous vision sensor that responds to relative intensity change. In *2006 IEEE International Solid State Circuits Conference - Digest of Technical Papers*, pages 2060–2069. ISSN: 2376-8606.
- <span id="page-4-29"></span>[33] Albert Gu and Tri Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces, December 2023.
- <span id="page-4-30"></span>[34] Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena Hierarchy: Towards Larger Convolutional Language Models, April 2023.