Abstract-Asynchronous handshaken interchip links are very popular among neuromorphic full-custom chips due to their delay-insensitive and high-speed properties. Of special interest are those links that minimize bit-line transitions for power saving, such as the two-phase handshaken non-return-to-zero (NRZ) 2-of-7 protocol used in the SpiNNaker chips. Interfacing such custom chip links to field-programmable gate arrays (FPGAs) is of great interest, because it allows additional functionalities to be explored and exploited to produce more versatile systems. Present-day commercial FPGAs typically operate in synchronous mode, which makes it necessary to incorporate synchronizers when interfacing with asynchronous chips. This introduces extra latency and precludes pipelining, degrading transmission speed, particularly when each communication packet consists of multiple symbols. In this brief, we present a technique that learns to estimate the delay of a symbol transaction, thus allowing fast pipelining from symbol to symbol. The technique has been tested on links between FPGAs and SpiNNaker chips, achieving the same throughput as fully asynchronous synchronizerless links between SpiNNaker chips. The links have been tested for periods of over one week without any transaction failure. Verilog codes of FPGA circuits are available as additional material for download.
I. INTRODUCTION
NEUROMORPHIC chips and systems typically use the asynchronous four-phase handshaken address event representation (AER) scheme to interchange information in an event-driven manner [1], [2] for vision systems [3], [4] and robotics [5]. The recently available multi-ARM-core SpiNNaker chips [6] (intended for simulating large-scale neuromorphic systems) use a special multisymbol, very low-power, two-phase handshaking non-return-to-zero (NRZ) protocol [7], which is called 2-of-7 [8]. Each link is unidirectional and uses eight lines (seven for data and one for Ack, i.e., Acknowledge). A symbol is transmitted by changing the state of exactly two data lines, which also signals a Request for the handshaking. Although there are 21 possible ways to choose two lines out of seven, only 17 are used by SpiNNaker (16 data symbols and one "End-of-Packet" symbol). This way, data symbols can be represented by 4-bit nibbles, as illustrated in Fig. 1. SpiNNaker chips can communicate a packet (also called "event") of either short format (44 bits) with 11 4-bit symbols or long format (76 bits) with 19 4-bit symbols.
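The 2-of-7 NRZ idea described above can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the symbol-to-line-pair table used here is hypothetical (the real assignment is the one in Fig. 1), the nibble ordering is assumed, and all names are introduced for illustration.

```python
from itertools import combinations

NUM_LINES = 7

# All ways of picking 2 of the 7 data lines: C(7, 2) = 21 candidates.
PAIRS = list(combinations(range(NUM_LINES), 2))

# HYPOTHETICAL code table: the first 16 pairs carry nibbles 0x0..0xF and the
# 17th pair is End-of-Packet. The real SpiNNaker assignment is given in Fig. 1.
EOP = 16
TOGGLE_MASK = {sym: (1 << a) | (1 << b) for sym, (a, b) in enumerate(PAIRS[:17])}
MASK_TO_SYMBOL = {mask: sym for sym, mask in TOGGLE_MASK.items()}


def send_symbol(line_state: int, symbol: int) -> int:
    """Return the new 7-bit line state after toggling the symbol's two lines (NRZ)."""
    return line_state ^ TOGGLE_MASK[symbol]


def receive_symbol(prev_state: int, new_state: int) -> int:
    """Recover the symbol from the two lines that changed between two states."""
    return MASK_TO_SYMBOL[prev_state ^ new_state]


def packet_to_symbols(value: int, n_nibbles: int) -> list[int]:
    """Split a packet into 4-bit symbols and append EOP.

    Least-significant-nibble-first ordering is an assumption of this sketch.
    """
    return [(value >> (4 * i)) & 0xF for i in range(n_nibbles)] + [EOP]
```

With an 8-bit header and 32-bit data, a short packet yields 10 nibbles plus EOP, i.e., the 11 symbols mentioned in the text. Because the lines are toggled rather than driven to absolute levels, the receiver only needs the previous and current line states to decode a symbol.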
The structure of a packet/event is 8-bit header, 32-bit data, 32-bit optional payload (extra data for the long format), and "End-of-Packet" symbol. Fig. 2 shows a commercial SpiNN5 board hosting 48 SpiNNaker chips. Each chip connects to six neighbor chips (north, south, east, west, northeast, and southwest), emulating a hexagonal grid [6]. Each chip-to-chip connection contains a pair of 8-bit 2-of-7 links, one for each direction. Interchip links (which need minimum PCB trace length) can exchange short-format events at a rate of about 6 Meps (mega events per second), which amounts to about 15 ns per symbol transaction. At the top of the board, one can see three Spartan6 field-programmable gate arrays (FPGAs).
They connect to some of the SpiNNaker chips through a bidirectional pair of 8-bit 2-of-7 NRZ asynchronous links. At the top of Fig. 2, we highlight two such links: link A between the central FPGA F2 and the SpiNNaker chip U19 on the top row, and link B between the same FPGA and the chip U58, which is far away and close to the bottom right edge. The circuitry inside the FPGA is clocked and requires the use of synchronizers to interface properly with external asynchronous links [9], [10]. In the next section, we briefly describe how such a standard link would operate, achieving a maximum average throughput of 3.51 Meps from the FPGA to the SpiNNaker chip. Afterward, in Section III, we present the new proposed approach, which can reach up to 6.89 Meps from the FPGA to the SpiNNaker chip. Our proposed approach could not be used to improve the speed of the reverse direction (from SpiNNaker chip to FPGA) because it would require modifying the SpiNNaker chip itself. Nonetheless, the proposed scheme can be included in future versions of the SpiNNaker chip.
1549-7747 © 2016 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
II. CONVENTIONAL SYNCHRONIZATION APPROACH
Figs. 3 and 4 show, respectively, the block diagram and the timing of a link between an FPGA transmitter (TX) side and a SpiNNaker chip receiver (RX) side, using a conventional two-D-flip-flop synchronizer. Short 32-bit (or long 64-bit) events are provided to a finite-state machine (FSM), which converts them into the sequence of 11 (or 19) 4-bit symbols, loading each into the 4-bit register in Fig. 3. After this, an "Encoder" activates the 2-of-7 bits that need to change, according to Fig. 1. These bits are XOR-ed with the previous output to produce the new output, which is stored in a 7-bit output register. This register holds the new 2-of-7 Data-and-Rqst value Data_INT. Once it is available, it needs a delay t_PO, due to I/O output pad buffering, to leave the FPGA as Data_EXT, plus an interchip PCB trace delay t_pcb to become visible at the SpiNNaker chip input. The SpiNNaker chip RX port requires a delay, here called t_SP1, between detecting the 2-of-7 Data-and-Rqst and providing its acknowledge signal Ack_SP, which after t_pcb makes Ack_EXT visible at the FPGA external input. The FPGA input pad introduces an additional delay t_PI until the asynchronous acknowledge signal Ack_INT is visible inside the FPGA. After this, the synchronization circuit, a standard two-D-flip-flop delay line, requires two additional clock edges to make a synchronized version of the acknowledge signal, sAck, available. At this point, a new Data_INT value can be made available at the next clock edge. From the Spartan6 FPGA manufacturer specifications, we know that t_PI ≈ 1.2 ns and that t_PO may vary between 1.7 and 5.9 ns, depending on output pad settings. For our settings, t_PO ≈ 3.0 ns. In Fig. 4, we can observe that, if Δt_2 = t_PO + t_SP1 + 2t_pcb + t_PI is between two and three clock cycles (10-15 ns), then a full symbol transaction can be done in five clock cycles (25 ns).
If 10 ns < Δt_2 < 15 ns, then 5.8 ns < t_SP1 + 2t_pcb < 10.8 ns. Otherwise, if 10.8 ns < t_SP1 + 2t_pcb < 15.8 ns, a symbol transaction requires six clock cycles (30 ns). Implementing the scheme in Fig. 4 on link A in Fig. 2 results in five-cycle transactions, while on link B it results in six-cycle transactions. In summary, for the 48-chip PCB in Fig. 2, a symbol transaction takes five or six clock cycles, depending on the length of the PCB traces. In the next section, we propose a method to reduce this time to two clock cycles for both links. It requires changing the FSM in Fig. 3, i.e., the sender side of the link. We could therefore only test it by changing the FSM in the FPGA, and consequently we present results only for the case of transmitting data from the FPGA to the SpiNNaker chip.
Time t_SP1 is typically quite stable for each SpiNNaker link, except when the interchip circuitry applies back pressure (i.e., delays Ack) because of internal traffic saturation.
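The timing budget of this section can be checked numerically with a short Python sketch. It assumes a simplified model (not taken from the authors' Verilog) in which the acknowledge is first sampled at the clock edge following its arrival and then needs two further edges in the synchronizer. The function and constant names are hypothetical.

```python
import math

T_PI_NS = 1.2   # FPGA input pad delay quoted in the text
T_PO_NS = 3.0   # FPGA output pad delay for the authors' pad settings


def round_trip_ns(t_sp1_plus_2tpcb_ns: float) -> float:
    """Delta t_2 = t_PO + t_SP1 + 2*t_pcb + t_PI, in nanoseconds."""
    return T_PO_NS + t_sp1_plus_2tpcb_ns + T_PI_NS


def cycles_per_symbol(delta_t2_ns: float, clk_period_ns: float,
                      sync_stages: int = 2) -> int:
    """Clock cycles per symbol under the simplified model.

    ceil(delta_t2 / T) cycles pass before Ack_INT can first be sampled,
    and the two-flip-flop synchronizer adds sync_stages more edges.
    """
    return math.ceil(delta_t2_ns / clk_period_ns) + sync_stages


# Example: an external delay of 8 ns (inside the 5.8-10.8 ns window)
# and one of 13 ns (inside the 10.8-15.8 ns window), at a 5-ns clock.
short_link = cycles_per_symbol(round_trip_ns(8.0), clk_period_ns=5.0)
long_link = cycles_per_symbol(round_trip_ns(13.0), clk_period_ns=5.0)
```

Under this model the two example delays give five and six cycles per symbol, matching the two cases discussed in the text.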
III. PROPOSED PREDICTIVE SYNCHRONIZATION SCHEME
The proposed synchronization scheme is based on the following observation in Fig. 4. The SpiNNaker RX side of the link is, in principle, ready to receive a new 2-of-7 Data-and-Rqst as soon as it has provided the acknowledge signal Ack_SP at time t_SPReady. However, due to the synchronization with two D-flip-flops on the FPGA side, the FPGA cannot provide a new Data_INT until four clock edges later. Here, we propose a scheme in which the FPGA "learns" to forecast, reliably, the minimum number of clock cycles required before sending a new symbol (without waiting to receive each symbol's synchronized acknowledge signal sAck). Meanwhile, during the multisymbol transmission of a packet, an independent parallel process counts the total number of acknowledge signals actually received, to make sure that the full packet transaction was completed successfully.
The proposed algorithm for the transmitter FSM in Fig. 3 is shown in Fig. 5. This FSM sends out the 11 (or 19) event/packet symbols without waiting for individual Acks from the receiver side. It simply waits for a "Symbol Period" (i.e., a number of clock cycles) before sending the next symbol. A parallel process (not shown in the figure) counts the number of Ack signals received and generates an internal "Packet Ack" signal once all of them have been received. The operation of the FSM in Fig. 5 is as follows. The first state, S0, waits for a new (32- or 64-bit) event/packet. After this, the corresponding sequence of symbols must be sent. For this, an extra state S2 is included, which waits for a given number of clock cycles ("Symbol Period") before sending the next symbol. After all header and data symbols have been sent, the "End-of-Packet" (EOP) symbol is also sent. The FSM then enters state S3, where it waits for the internal "Packet Ack" signal. This signal is triggered only if the receiver has acknowledged all symbols sent. Once "Packet Ack" is received, an optional "Inter-Packet Gap" (IPG) wait time can be included to allow the receiver some extra time for event/packet processing. If not all symbol Acks have been received within a given "Time-out" period, state S3 branches out through its "Packet NOT sent" output, indicating a failure in the event/packet transmission. In this case, the FSM tries to resend the event/packet. For this, it first sends an EOP symbol to the receiver and waits for the corresponding Ack in state S4. If this Ack is not received, it waits for some time to let the receiver recover and then retries the transmission of an EOP symbol. This situation may occur when the SpiNNaker internal event handling circuitry applies back pressure (because of event traffic saturation) or when there is a transient fault/disconnection in the transmission line.
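The behavior of this FSM can be explored with a cycle-level Python sketch. It is a toy model, not the authors' Verilog: the receiver is assumed to acknowledge each accepted symbol a fixed number of cycles after it is launched and to lose any symbol launched before the previous Ack. The name of state S1, the function name, and the parameter names are hypothetical, and the S4 resend path and the Inter-Packet Gap are omitted.

```python
from enum import Enum, auto


class State(Enum):
    S1_SEND = auto()        # launch one symbol (name assumed for this sketch)
    S2_WAIT = auto()        # wait out the rest of the Symbol Period
    S3_PACKET_ACK = auto()  # wait for the internal "Packet Ack" or time-out


def send_packet(n_symbols: int, symbol_period: int, rx_ack_cycles: int,
                timeout: int = 64) -> tuple[bool, int]:
    """Simulate one packet; return (packet_acked, cycle at which S3 exits)."""
    state, sent, acks, wait = State.S1_SEND, 0, 0, 0
    ack_times = []          # cycles at which the toy receiver raises an Ack
    rx_ready_at = 0         # toy receiver accepts a symbol only from this cycle on
    deadline = 0
    cycle = 0
    while True:
        acks += ack_times.count(cycle)        # parallel Ack-counting process
        if state is State.S1_SEND:
            if cycle >= rx_ready_at:          # symbol accepted by the receiver
                rx_ready_at = cycle + rx_ack_cycles
                ack_times.append(rx_ready_at)
            sent += 1                         # otherwise the symbol is lost
            if sent == n_symbols:
                state, deadline = State.S3_PACKET_ACK, cycle + timeout
            elif symbol_period > 1:
                state, wait = State.S2_WAIT, symbol_period - 1
        elif state is State.S2_WAIT:
            wait -= 1
            if wait == 0:
                state = State.S1_SEND
        else:                                 # State.S3_PACKET_ACK
            if acks == n_symbols:
                return True, cycle            # "Packet Ack"
            if cycle >= deadline:
                return False, cycle           # "Packet NOT sent"
        cycle += 1
```

For example, `send_packet(11, symbol_period=2, rx_ack_cycles=2)` completes with all 11 Acks counted, whereas `send_packet(11, symbol_period=1, rx_ack_cycles=2)` ends in the time-out branch because every other symbol is launched too early in this toy receiver model.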
At startup, there is a "learning process" in which the FSM adjusts its parameter "Symbol Period." Initially, this period is set to "1" (one clock cycle), and it is increased progressively until stable communication is reached. For each "Symbol Period" value, two weights are defined. The first weight is the rate of failure, and the second is the rate of success. Every time state S3 leaves through its "Yes" output, the success weight is increased. If state S3 leaves through its "NO" output, the failure weight is increased. During learning, the "Time-out" parameter values are reduced to speed up learning. The startup learning process converges relatively fast, although the convergence rate also depends on weight granularity and initial state. In our case, we used 8-bit weights, and both of them (success and failure) were initially set to "0." Convergence time was on the order of 500 μs. After this, no more failures were detected, even when running the links for over one week. If there are transient faults during the startup learning process, it will converge to very conservative Ack/Rqst intervals. Therefore, during startup, the system and all physical connections should be in optimum conditions.
So far, we have discussed the situation of sending events from the FPGA (as TX) to the SpiNNaker chip (as RX). In this case, it is the TX that learns to forecast the intersymbol delay and that detects whenever an event/packet has not been sent. In order to implement this forecasting/acceleration capability for the reverse direction without changing the SpiNNaker chip, we would need to change the RX side in the FPGA. For this, the receiver in the FPGA would need to send out Acks before obtaining the synchronized versions of the 2-of-7 Data-and-Rqst transitions. In case of failure, only the RX circuit in the FPGA would be aware of it, and the TX in the SpiNNaker chip would not be able to resend the event/packet. There are three obvious approaches for solving this. First, add an extra 2-of-7 command to the table in Fig. 1, so that the FPGA can request the retransmission of an event/packet. Second, implement this new algorithm inside the SpiNNaker chip in its TX ports. Third, implement a slower upper layer in software to detect event/packet loss and request a new retransmission. The first two options require a redesign of the SpiNNaker chip, and we leave these as suggestions for future versions. The third solution is beyond the scope of this brief. In the next section, we provide experimental results for the link direction from the FPGA (TX) to the SpiNNaker chip (RX).
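Returning to the startup learning process described earlier in this section, the following Python sketch shows one way such a search could be organized. The text does not give the exact update rule, so the decision thresholds (`fail_limit`, `success_limit`), the upper bound `max_period`, and the `try_packet` callback are all assumptions made for illustration. Only the starting period of 1, the two 8-bit weights, and their initial value of 0 come from the text.

```python
from typing import Callable

WEIGHT_MAX = 255  # 8-bit weights, as stated in the text


def learn_symbol_period(try_packet: Callable[[int], bool],
                        fail_limit: int = 4,
                        success_limit: int = 32,
                        max_period: int = 16) -> int:
    """Return the smallest Symbol Period judged reliable under this toy rule.

    try_packet(period) sends one training packet with the given Symbol
    Period and returns True if state S3 exited through "Yes".
    """
    period = 1                        # learning starts at one clock cycle
    while period <= max_period:
        success = failure = 0         # both weights start at 0
        while success < success_limit and failure < fail_limit:
            if try_packet(period):    # S3 left through "Yes"
                success = min(success + 1, WEIGHT_MAX)
            else:                     # S3 left through "NO" (time-out)
                failure = min(failure + 1, WEIGHT_MAX)
        if success >= success_limit:
            return period             # stable communication reached
        period += 1                   # too many failures: slow down
    raise RuntimeError("no reliable Symbol Period found")
```

As a usage example, `learn_symbol_period(lambda p: p >= 2)` models a link that only works from two cycles per symbol upward and settles on a period of 2. A transient fault during training would push the failure weight up and lead to a larger period, which mirrors the remark above about conservative intervals.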
IV. EXPERIMENTAL RESULTS
Exhaustive tests have been performed on the 48-chip SpiNNaker PCB shown in Fig. 2 to test the performance of packet/event communication from an FPGA to a SpiNNaker chip. The results shown here focus on two such links: "Link-A" between FPGA "F2" and SpiNNaker chip "U19," which is one of the shortest links on the PCB, and "Link-B" between the same FPGA and chip "U58," which is one of the longest links. Experimental characterizations were performed by generating sequences of numbers with a counter on one end and checking the sequence on the other end. Failure-free transmissions were obtained after a few hundred microseconds of training and stayed failure free for long periods (we tested for over one week).
Experimental measurements were done using Xilinx's built-in logic analyzer module "ChipScope." This tool allows monitoring FPGA internal signals with reference to the FPGA internal clock. For our experiments, we set this internal clock to either 200 MHz (5-ns period) or 100 MHz (10-ns period). Fig. 6 shows ChipScope screen captures for different measurements. For each measurement, we show the same four signals: ACK_IN, which corresponds to Ack_INT in Fig. 4; ACK_IN_SYNC, which is sAck in Fig. 4; DATA_OUT_HEX, which is Data_INT in Fig. 4; and SYMBOL_NUM (not shown in Fig. 4), which counts the symbol number within the packet/event. At the top of each subfigure, the ticks indicate the clock cycle number.
Fig. 6(a) illustrates the case of using the conventional synchronization approach on link A for a short packet with a 200-MHz clock. As can be seen, 57 clock cycles are needed to transmit all 11 symbols, which corresponds to 3.51 Meps. Transmission of one symbol requires five clock cycles. Fig. 6(b) shows the same case when implementing the predictive handshaking approach. As can be seen, one 11-symbol packet now needs only 29 clock cycles, which corresponds to 6.89 Meps. Each symbol can be reliably transmitted in only two clock cycles (see signals DATA_OUT_HEX and SYMBOL_NUM), although there is now an extra seven-cycle overhead after transmitting all symbols. Interestingly, the parallel independent process in charge of counting the transitions at ACK_IN normally needs two clock cycles, although sometimes it needs one or three (see signals ACK_IN and ACK_IN_SYNC). Fig. 6(c) and (d) illustrate the same setup as Fig. 6(a) and (b) but for a long 19-symbol packet. Similarly, Fig. 6(e) and (f) show the same as Fig. 6(a) and (b) but with the clock set to 100 MHz. This illustrates the situation for a slower FPGA. As can be seen, for the conventional synchronization approach, four clock cycles per symbol are required (instead of five) and 46 per packet (instead of 57). This is because the fixed delay Δt_2 in Fig. 4 spans fewer clock cycles. For the same reason, in the predictive handshaking approach, one symbol can now be transmitted in just one clock cycle. On the other hand, the overhead remains seven clock cycles; thus, the overall delay for a short-packet transaction is 18 cycles, resulting in a speed improvement factor of 2.55.
Fig. 7. Measured parameters are "Pck Rt" packet rate (in mega events per second), "Nc Pck" number of clock cycles per packet, and "Nc Sym" number of clock cycles per symbol. Vertical columns show measurements for links A and B, for short packets (11 symbols) and long packets (19 symbols), at 200-MHz and 100-MHz FPGA clock frequencies, for "Normal" (conventional) synchronization scheme and for "Fast" predictive handshaking scheme. Horizontal rows are repeated for three different FPGA output pad bias settings ("current" and "slew-rate"): "fast" is 12 mA per pad with nominal 1.71-ns delay, "slow" is 6 mA per pad (recommended) with nominal 3.00-ns delay, and "quiet" is 2 mA per pad with 5.47-ns delay.
Fig. 7 shows the measured packet/event rate (Pck Rt) and the number of clock cycles per symbol (Nc Sym) and per packet/event (Nc Pck) for all experimental setups: for links A and B, for 100-MHz and 200-MHz clock frequencies, for short and long packets, and for three different settings of the FPGA output pads (setting SLOW with 6 mA per pad, which corresponds to all cases shown in Fig. 6; setting FAST with 12 mA per pad; and setting QUIET with 2 mA per pad).
The packet transaction speed improvement ranges from a factor of about 2 (1.96 for link A, short packet, 200 MHz) to a factor of almost 3 (2.89 for link A, long packet, 100 MHz). The Verilog codes used for these setups are provided as additional material for download.
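The packet rates and improvement factors quoted in this section follow directly from the measured cycle counts. The Python sketch below reproduces that arithmetic for the link A, short-packet cases. The helper names are hypothetical and only the cycle counts stated in the text are used.

```python
def packet_rate_meps(clk_mhz: float, cycles_per_packet: int) -> float:
    """Packet rate in mega events per second: clock frequency / cycles per packet."""
    return clk_mhz / cycles_per_packet


def speedup(normal_cycles: int, fast_cycles: int) -> float:
    """Improvement factor of the predictive scheme at a fixed clock frequency."""
    return normal_cycles / fast_cycles


# Link A, short packet, cycle counts taken from the text.
rate_normal_200 = packet_rate_meps(200.0, 57)   # conventional, 200 MHz
rate_fast_200 = packet_rate_meps(200.0, 29)     # predictive, 200 MHz
gain_200 = speedup(57, 29)                      # 200-MHz clock
gain_100 = speedup(46, 18)                      # 100-MHz clock
```

These evaluate to roughly 3.51 Meps, 6.90 Meps, 1.97, and 2.56, respectively, which agree with the quoted 3.51 Meps, 6.89 Meps, 1.96, and 2.55 if the quoted figures are truncated rather than rounded.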
V. CONCLUSION
A scheme for accelerating asynchronous handshaken multisymbol packet transmissions between an asynchronous module and a synchronous one has been proposed and successfully tested on a 48-chip SpiNNaker board. The scheme exploits the fact that, within the same packet, the transaction delay per symbol remains stable and can be "learned" by the sending circuit. Symbol Acks are counted by a separate process in parallel to verify correct packet transmission. In case of failure, the packet is resent. Exhaustive tests have been performed on the 48-chip SpiNNaker board for different PCB trace lengths, packet sizes, clock frequencies, and pad delays. Once trained, the transmission stays stable and failure free. The proposed scheme can help to relieve the traffic bottleneck between SpiNNaker and PCBs, as this bandwidth is limited by the throughput between the FPGA and SpiNNaker chips on board.
