Abstract. The FAST protocol is a data compression protocol that is widely used in high-frequency trading. So far, implementations of FAST protocol decoding have mainly relied on software. Customized hardware can effectively reduce the decoding delay compared with software; however, most FPGA hardware architectures are serial. This paper proposes a parallel FPGA hardware structure for FAST protocol decoding that adapts to 40Gbps bandwidth. The structure includes 3 modules: a field dividing module, a field matching module and parallel decoding units. The practicability and efficiency of the proposed architecture are verified on the SystemC platform, where it takes 173ns to decode a message with 64 fields. Finally, the influence of three factors on the decoding delay is discussed: the number of decoding units, the field type and the sequence length of the message.
Introduction
In financial markets, high-frequency trading (HFT) is a type of algorithmic trading characterized by high speeds, high turnover rates, and high order-to-trade ratios that leverages high-frequency financial data and electronic trading tools [1]. In general, HFT can be viewed as a primary form of algorithmic trading in finance [2] [3] [4]. [5] estimates that HFT on average initiated 10-40% of trading volume in equities, and 10-15% of volume in foreign exchange and commodities in 2016. HFT is therefore playing an increasingly important role in equity and foreign exchange markets.
The Financial Information eXchange (FIX) protocol is an electronic communication protocol initiated in 1992 for the international real-time exchange of information related to securities transactions and markets [6]. The FAST protocol (FIX Adapted for Streaming) is a technology standard developed by FIX Protocol Ltd., specifically aimed at optimizing data representation on the network. It is used to support high-throughput, low-latency data communications between financial institutions. In particular, it offers significant compression capabilities for the transport of high-volume market data feeds and for extremely low-latency applications [7, 8]. These characteristics make the FAST protocol part of the technological base on which HFT relies.
40Gbps Ethernet (40GbE) and 100Gbps Ethernet (100GbE) are groups of computer networking technologies for transmitting Ethernet frames at rates of 40 and 100Gbps, respectively. On June 17, 2010, the IEEE officially approved the 802.3ba-2010 standard, which started the commercialization of 40Gbps and 100Gbps Ethernet [9]. 100GbE focuses on network convergence and trunk line transmission, while 40GbE is oriented to all kinds of applications. Therefore, making full use of 40GbE to transmit HFT information can effectively increase the transmission speed, reduce the transmission delay, and provide a strong competitive advantage. On the other hand, a wider bandwidth means that more data arrives in each cycle, which requires a change in the hardware structure.
In this paper, a hardware structure is introduced to decode FAST protocol data, which takes full advantage of 40Gbps bandwidth. The efficiency of this architecture has been verified by SystemC simulation, and it performs better than the designs reported in previous studies.
Related Works
Before hardware decoders appeared, applications mainly used software to decode FAST protocol data, with a latency of about 2.1 μs on a 1000Mbps network. Compared with software decoding, hardware decoding is more flexible and has a shorter transmission delay. [10] provides a very good summary of FPGA hardware for high-frequency trading, describing several program trading systems that use hardware and showing their advantages and disadvantages compared with software solutions. An integrity specific accelerator using FPGA is presented in [11], which proposes different approaches to reduce latency. The whole system is implemented on a Xilinx Virtex-4 FX100 FPGA board. Adapting to 1000Mbps bandwidth, the system spends 500 ns on decoding a FAST message. [12] presents a multi-template processing engine that decodes certain messages using FPGA. Adapting to 10Gbps bandwidth, [13] presents a hardware structure that accelerates the decoding process in parallel and takes 380.4 ns to decode a message of 64 fields. However, it does not support complex template formats or a variety of field types. [14] implements an investment strategy for financial securities in parallel on FPGA, with a speedup of more than 17000 compared to a high-performance PC.
Background
In the FAST protocol, decoding uses "operators" to recover field values, and a message is made up of a number of fields. A field can be a string or an integer, as specified by the template definition. The FAST protocol provides several mechanisms to reduce the bandwidth needed for message transmission.
FAST Template
The FAST protocol does not transmit every data item together with its related information. Instead, the related information is placed in a FAST template, which is shared between sender and receiver before message transmission. In other words, some information is fixed before communication starts and only the variable information needs to be delivered. In this way, the FAST template reduces the amount of data to be transferred and saves bandwidth effectively. Templates are usually held in XML files.
Stop Bit
A field in the FAST protocol does not have a fixed size and may contain several bytes. Therefore, a stop bit is used to separate the fields. The first (most significant) bit of each byte is defined as the stop bit; when it is set to '1', the byte is the last byte of the current field. Consider the 3-byte binary stream: 00101011 10010100 11001101. According to the stop bit rule, after the stop bits are removed, the binary stream is divided into two fields: 01010110010100 and 1001101.
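The stop bit rule can be illustrated with a minimal software sketch. The function name split_fields is hypothetical and is not part of the proposed hardware; the sketch simply walks the bytes in order and silently drops any trailing bytes that do not yet end with a stop bit.

```python
def split_fields(data: bytes) -> list[str]:
    """Split a byte stream into fields using the stop-bit rule.

    Each field is returned as a bit string with the stop bits removed.
    Trailing bytes without a terminating stop bit are ignored in this sketch.
    """
    fields = []
    payload = ""
    for byte in data:
        payload += format(byte & 0x7F, "07b")  # keep the 7 data bits
        if byte & 0x80:                         # stop bit set: last byte of field
            fields.append(payload)
            payload = ""
    return fields


stream = bytes([0b00101011, 0b10010100, 0b11001101])
print(split_fields(stream))  # ['01010110010100', '1001101']
```

Applied to the 3-byte stream of the example, the sketch yields the same two fields as the text.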
Presence Map
The Presence Map (Pmap) is itself a field, which is used to determine whether or not a field is present in a message. The Pmap appears in front of all other fields in every message, and each bit of the Pmap corresponds to one field of the message. If the bit is set to '1', the field is present in the binary stream.
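As a hedged illustration of how the Pmap can be consulted, the sketch below assumes the simplified one-bit-per-field mapping described above and assumes that the Pmap is stop-bit encoded like any other field, so that each byte contributes 7 map bits. The function names and the field names price, size and time are hypothetical.

```python
def pmap_bits(pmap_bytes: bytes) -> list[bool]:
    """Return the presence bits of a stop-bit encoded Pmap (7 bits per byte)."""
    bits = []
    for byte in pmap_bytes:
        for shift in range(6, -1, -1):          # skip bit 7, the stop bit
            bits.append(bool((byte >> shift) & 1))
    return bits


def present_fields(pmap_bytes: bytes, field_names: list[str]) -> list[str]:
    """List the fields whose Pmap bit is '1' (one bit per field, in order)."""
    bits = pmap_bits(pmap_bytes)
    # Fields beyond the end of the Pmap are treated as absent.
    return [name for i, name in enumerate(field_names)
            if i < len(bits) and bits[i]]


# Hypothetical template with three fields; the Pmap byte carries bits 1010000.
print(present_fields(bytes([0b11010000]), ["price", "size", "time"]))
# ['price', 'time']
```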
FAST Operator
The FAST protocol defines a series of operators to encode and decode data. Every field of a message corresponds to an operator specified in the FAST template. There are 6 kinds of operators: Constant, Default, Copy, Increment, Delta and Tail. In some cases there is no operator for a field, which means that the field value is the same as the transmitted value. Each operator has its own way of compressing data with the help of the Pmap, the previous value and other details. Table 1 lists the relationship between each operator and the details it relies on.
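The behavior of the operators can be sketched in software as follows. This is a simplified reading of the operator rules for mandatory integer fields only, not the decoding unit of this paper: it assumes that the previous value is already defined, omits the Tail operator and optional (nullable) fields, and uses hypothetical names throughout.

```python
def decode_int_field(operator, present, wire_value, previous, initial=None):
    """Decode one mandatory integer field; return (value, new_previous).

    operator   -- "constant", "default", "copy", "increment" or "delta"
    present    -- True if a value for this field is in the stream
    wire_value -- the value read from the stream (ignored when absent)
    previous   -- the previous value kept for this field (assumed defined)
    initial    -- the initial value given by the template, if any
    """
    if operator == "constant":
        return initial, previous                # never transmitted
    if operator == "default":
        return (wire_value if present else initial), previous
    if operator == "copy":
        value = wire_value if present else previous
        return value, value
    if operator == "increment":
        value = wire_value if present else previous + 1
        return value, value
    if operator == "delta":
        value = previous + wire_value           # delta is always transmitted
        return value, value
    raise ValueError(f"unsupported operator: {operator}")


print(decode_int_field("increment", False, None, 7))   # (8, 8)
print(decode_int_field("delta", True, -3, 100))        # (97, 97)
```

Copy, Increment and Delta update the previous value, which is why the decoder has to keep per-field state between messages.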
Sequence
A sequence is a group of fields that is transferred several times in succession. The length of the sequence is transmitted in front of the sequence itself. The purpose of a sequence is to update a group of fields several times within one message instead of transmitting several separate messages.
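The layout of a sequence can be sketched as follows. For illustration only, the sketch assumes that the stream has already been reduced to a flat list of decoded integer values and ignores operators and any per-entry Pmap; the function name and the field names price and size are hypothetical.

```python
def read_sequence(values, group_fields):
    """Read a sequence from a flat list of already decoded values.

    values[0] is the sequence length; it is followed by that many repetitions
    of the field group. Returns (entries, number_of_values_consumed).
    """
    length = values[0]
    entries = []
    pos = 1
    for _ in range(length):
        group = values[pos:pos + len(group_fields)]
        entries.append(dict(zip(group_fields, group)))
        pos += len(group_fields)
    return entries, pos


# Hypothetical group of two fields repeated twice, followed by another field.
entries, used = read_sequence([2, 100, 5, 101, 7, 999], ["price", "size"])
print(entries)  # [{'price': 100, 'size': 5}, {'price': 101, 'size': 7}]
print(used)     # 5
```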
Implementation
To adapt to a 40Gbps network, the design receives 128 bits in every cycle at a hardware frequency of 312.5MHz, which is readily achievable at present. Our design is divided into 3 parts: the field dividing module, the field matching module and the parallel decoding units. To achieve higher performance and lower latency, the 3 modules form a pipelined structure as shown in Fig. 1.
Figure 1. Schematic View of the Structure.
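The relation between the clock frequency, the data path width and the line rate can be checked with a short calculation. The constant names below are illustrative only.

```python
CLOCK_HZ = 312.5e6        # hardware clock frequency
BITS_PER_CYCLE = 128      # data path width

line_rate_gbps = CLOCK_HZ * BITS_PER_CYCLE / 1e9
cycle_ns = 1e9 / CLOCK_HZ
bytes_per_cycle = BITS_PER_CYCLE // 8

print(line_rate_gbps)     # 40.0
print(cycle_ns)           # 3.2
print(bytes_per_cycle)    # 16
```

One cycle therefore lasts 3.2ns and delivers 16 bytes, which is the amount of data the field dividing module has to handle at once.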
Field Dividing Module
As described above, 128 bits of data arrive in one cycle, and these 128 bits may contain several fields. The purpose of this module is to divide the data into fields according to the stop bit rule described previously. The 128 bits of data consist of 16 bytes, so there are 16 stop bits. One advantage of this design is that the module can examine the 16 stop bits in parallel. However, there is a special case in which one field is not transferred completely within one cycle. For instance, a 3-byte field may be split so that its first 2 bytes arrive one cycle earlier than its last byte. A ring register is designed to solve this problem. Fig. 2 shows the structure. There are 32 registers connected end to end to form a ring. Each register can store one byte of data, so the whole ring register can store the 32 bytes of data arriving in 2 cycles. A read pointer points to the first '0' after the last '1'. In Fig.3, the stop bits from the current cycle are 01000100 10001010; together with the '000' left over from the last cycle, there are 5 fields, and the corresponding stop bits are 00001, 0001, 001, 0001, 01. The read pointer for the next cycle then points to register No.15.
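The handling of fields that straddle two cycles can be modeled in software as shown below. This is a behavioral sketch only: it replaces the ring register and the read pointer with a simple carry buffer, scans the stop bits in a loop rather than in parallel, and uses hypothetical function names.

```python
def stop_bits(chunk: bytes) -> list[int]:
    """Extract the stop bit (most significant bit) of every byte."""
    return [byte >> 7 for byte in chunk]


def divide_fields(carry: bytes, chunk: bytes):
    """Split the carried bytes plus the new 16-byte chunk into complete fields.

    Returns (fields, new_carry); new_carry holds the bytes of a field whose
    last byte has not arrived yet.
    """
    window = carry + chunk
    fields = []
    start = 0
    for i, bit in enumerate(stop_bits(window)):
        if bit:                                  # last byte of a field
            fields.append(window[start:i + 1])
            start = i + 1
    return fields, window[start:]


# Stop-bit pattern from the example; the data bits are all zero here.
carry = bytes(3)                                          # '000' from last cycle
chunk = bytes(0x80 * int(b) for b in "0100010010001010")  # current cycle
fields, new_carry = divide_fields(carry, chunk)
print([len(f) for f in fields])  # [5, 4, 3, 4, 2]
print(len(new_carry))            # 1
```

With the stop bits of the example, the model produces the same 5 fields, and the single remaining byte is carried over to the next cycle.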
There are two cases when the field is not transferred in one cycle. When the read pointer is located in register No.0 to No.15, the start address is shown in equation (4-1 
Field Matching Module
The function of this module is to match the data from the field dividing module with the command information from the FAST template. Before the implementation, we collected some FAST templates used in practice and carried out some preparatory work. This module can convert the template information into binary command information. In other words, this module is flexible with respect to different FAST templates, and it supports up to 64 fields per template, which is more than the longest of the templates we collected.
Because of the existence of sequences, the matching cannot simply select several successive commands from the template. We therefore divide the template and control the matching with a state machine, as shown in Fig.3.
Parallel Decoding Units
Under 40Gbps bandwidth, a 16-byte binary stream arrives in one cycle. According to the templates we collected, there are mainly 3 kinds of field format: int32, int64 and string. Because each byte carries a stop bit and only 7 data bits, transmitting an int32 field needs up to 5 bytes and transmitting an int64 field needs up to 10 bytes, while the length of a string is not fixed. Therefore, using 4 decoding units to decode in parallel is sufficient and reasonable, which is confirmed by the later simulations. The structure of one decoding unit is shown in Fig.4. The Bytes Joint Unit discards the stop bits and combines the bytes into a complete field. To achieve this, the FAST protocol handles ASCII fields differently from other fields. In ASCII fields, every byte represents a character and several bytes compose a string. For non-ASCII fields, several bytes compose an integer after the stop bits are dropped. The Field Decoding Unit performs the decoding according to the operator, the input data and other related information. In the FAST protocol, some operators depend on the previous value, so a Pre-value Storing Unit is necessary and the decoding unit needs to update the previous value in every cycle.
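The byte counts and the task of the Bytes Joint Unit can be illustrated with a minimal sketch. It handles unsigned integers and ASCII strings only; signed integers, which require sign extension, are left out, and all function names are hypothetical.

```python
import math


def encoded_length(bits: int) -> int:
    """Maximum number of bytes needed when each byte carries 7 data bits."""
    return math.ceil(bits / 7)


def join_integer(field: bytes) -> int:
    """Drop the stop bits and join the 7-bit groups into an unsigned integer."""
    value = 0
    for byte in field:
        value = (value << 7) | (byte & 0x7F)
    return value


def join_ascii(field: bytes) -> str:
    """Drop the stop bits and read each byte as one ASCII character."""
    return "".join(chr(byte & 0x7F) for byte in field)


print(encoded_length(32), encoded_length(64))          # 5 10
print(join_integer(bytes([0b00101011, 0b10010100])))   # 5524
print(join_ascii(b"FAS" + bytes([ord("T") | 0x80])))   # FAST
```

The integer example reuses the first field of the stop bit example, whose 14 data bits 01010110010100 equal 5524.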
Experimental Results
The design performance is measured on the SystemC platform, and the simulation sets the cycle to 3.2ns, which corresponds to 312.5MHz. First, it takes 173ns to decode a message with 64 fields using 4 decoding units, and Fig.5 shows that using 4 decoding units is optimal. Then, we keep the number of fields fixed and explore the effect of the field format on the decoding latency without sequences. The result is shown in Fig.6. To make the result more obvious, the string length is set to 16. The result shows that int32 is decoded faster than int64 and int64 faster than string, so we conclude that the longer the binary stream of a field is, the greater the decoding latency is. Finally, we explore the effect of the sequence length on the decoding latency. The result is shown in Fig.7. Decoding a sequence of 6 fields takes roughly 10.5ns on average.
Conclusion
A parallel FAST decoding structure is presented in this paper. The advantage of the design is that the structure adapts to 40Gbps bandwidth and can match all kinds of FAST templates. Because the ring register of the Field Dividing Module has 32 one-byte registers, the structure can only process strings whose length is no more than 16. Fortunately, the length of strings transferred by the FAST protocol rarely exceeds 16 in practical applications.
At present, using hardware to decode trading information is very popular because of its low latency. The structure proposed in this paper makes full use of 40Gbps bandwidth and performs better than most other designs. It takes 173ns to decode a message of 64 fields. In contrast, a similar structure in [13] takes 380.4ns to decode a message of the same length under 10Gbps bandwidth.
