Abstract: Linear network coding [1] requires arithmetic operations over Galois fields, more specifically over finite extension fields. While coding over GF(2) reduces to simple XOR operations, this field is less suitable for practical applications of random linear network coding due to the high probability of linear dependencies [2] and therefore of redundant coded packets. Coding over larger fields such as GF(16) and GF(256) largely avoids that issue, but is significantly slower. SIMD vector extensions of processors such as AVX 2 on x86-based systems or NEON on ARM-based devices offer the potential to increase performance by orders of magnitude [3], [4].
I. INTRODUCTION
For linear network coding, a transmitter generates a linear combination of packets that have previously been received. Generally speaking, generating a single encoded packet requires a linear combination of $N$ source packets, which can be expressed as a matrix-vector multiplication over a given extension field. The number of packets $N$ is generally referred to as the generation size. The arithmetic complexity to generate a single encoded packet is therefore in $O(M \cdot N)$, where $M$ denotes the packet length in number of data words. In case of random linear network coding, the coefficients forming the coding vector are randomly drawn from a finite extension field. For those applications, the generation size $N$ is limited to a small number of packets, commonly less than 256, as larger generations imply higher worst-case delays in case decoding is not possible due to packet loss. Therefore, the arithmetic complexity of individual operations dominates the computation time in those cases [6]-[8]. While naive approaches using general purpose registers or simple lookup tables are severely limited in throughput, performance can be increased by multiple orders of magnitude when using vector instructions [3], [4].
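The influence of the field size on linear dependencies can be quantified with a short calculation. The sketch below assumes that $N$ coding vectors of length $N$ are drawn uniformly at random from GF($q$), in which case they are linearly independent with probability $\prod_{i=1}^{N} (1 - q^{-i})$. The function name `full_rank_probability` is chosen for illustration only and is not part of libmoepgf.

```c
#include <stddef.h>

/*
 * Probability that n coding vectors of length n, each drawn uniformly at
 * random from GF(q), are linearly independent: prod_{i=1..n} (1 - q^-i).
 * Illustrative helper, not part of libmoepgf.
 */
double full_rank_probability(unsigned q, unsigned n)
{
    double p = 1.0;
    double q_inv_pow = 1.0; /* holds q^-i in iteration i */

    for (unsigned i = 1; i <= n; i++) {
        q_inv_pow /= (double)q;
        p *= 1.0 - q_inv_pow;
    }
    return p;
}
```

For a generation size of 16, this evaluates to roughly 0.29 for GF(2) and to more than 0.99 for GF(256), which illustrates why the larger fields are preferred despite their higher arithmetic cost.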
In this paper we extend the finite field library libmoepgf [4] by implementations for the AVX 512 instruction set extensions as offered by Skylake-X based Xeon processors as well as the upcoming Intel Ice Lake microarchitecture. The latter introduces AVX 512 for the first time to the mobile and low-power segment as well as to desktop computers.
We start in Section II with a formal description of network coding operations. The support for AVX 512 of different microarchitectures is discussed in Section III. Section IV presents the new AVX 512 kernels using compiler intrinsics and discusses necessary changes compared to the AVX 2-based implementation. In Section V, we demonstrate the performance advantages of the new implementation for different finite fields and packet sizes, and present a survey of different computing platforms and the best achievable results on each. Section VI concludes the paper.
II. ARITHMETICS OF LINEAR NETWORK CODING
A data packet of length $l$ bit can be considered as a sequence of $M = l/n$ data words, where $n$ denotes the word size in bit. The individual words $a_i$ for $i \in \{1, 2, \ldots, M\}$ are elements of the finite extension field $\mathbb{F}_q$ with $q = 2^n$ elements. Of particular interest for (random) linear network coding are the finite extension fields with 2, 4, 16, and 256 elements, which we refer to as GF(2), GF(4), GF(16), and GF(256). Assuming that $l$ is a multiple of 8 bit, which is reasonable as memory of most computers is accessible byte-wise, $l/n$ is also guaranteed to be an integral value for those fields. Note that for GF(2) words represent individual bits while for GF(16) a word is a nibble, i. e., a data word of 4 bit.
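The mapping between bytes and data words can be sketched in a few lines of C. The helper names and the bit order within a byte (words are counted from the least significant bits upwards) are assumptions made for illustration; the text does not prescribe a particular layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of n-bit words in a packet of len bytes; n must be 1, 2, 4, or 8. */
size_t words_per_packet(size_t len, unsigned n)
{
    return len * (8 / n);
}

/*
 * Extract word i (0-based) from a packet of n-bit words.
 * Assumption: within a byte, words are counted from the least significant
 * bits upwards.
 */
uint8_t get_word(const uint8_t *pkt, size_t i, unsigned n)
{
    size_t per_byte = 8 / n;
    unsigned shift = (unsigned)(i % per_byte) * n;
    uint8_t mask = (uint8_t)((1u << n) - 1);

    return (uint8_t)((pkt[i / per_byte] >> shift) & mask);
}
```

Under this layout, a 1500 byte packet holds 3000 words over GF(16), and the byte 0xa7 contains the two GF(16) words 0x7 and 0xa.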
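A packet can therefore be written as a vector $\mathbf{a} = [a_1, a_2, \ldots, a_M]^T$. Given a generation of $N$ source packets $\mathbf{a}_1, \ldots, \mathbf{a}_N$, those packets form a matrix
$$\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_N] \in \mathbb{F}_q^{M \times N}. \qquad (1)$$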
To generate an encoded packet $\mathbf{b}$, we need a coding vector $\mathbf{c} \in \mathbb{F}_q^N$. Depending on whether or not we consider random linear network coding, the components of $\mathbf{c}$ are either drawn independently and identically distributed from $\mathbb{F}_q$ or chosen by means of some deterministic algorithm. The actual encoding operation consists of the matrix-vector multiplication
$$\mathbf{b} = \mathbf{A}\mathbf{c} = \sum_{i=1}^{N} c_i \mathbf{a}_i. \qquad (2)$$
Given the extension fields mentioned above, addition over $\mathbb{F}_q$ is always a bit-wise XOR operation, while multiplication is a polynomial multiplication modulo a reduction polynomial specific to the respective field. According to (2), we need efficient algorithms to multiply a vector $\mathbf{a}$ by some constant value $c$ and to add the result to an array used as accumulator, both over the respective extension field. That operation is commonly known as multiply and add (madd). Such an algorithm suitable for vector instruction set extensions was proposed by Anvin [9] and later by Plank et al. [3] and is referred to as the shuffle algorithm. As its name implies, this algorithm requires a shuffle instruction that permutes words in vector registers. Another algorithm called imul was proposed by us in [4]. It does not require any special instructions, but its complexity depends linearly on the word size, i. e., the larger the word size, the more operations are required. Both algorithms are implemented for the SSSE 3, AVX 2, and NEON instruction set extensions in libmoepgf [4].
We now extend those implementations to the AVX 512 instruction set extensions, which double the register width and, in theory, provide a twofold speedup compared to the AVX 2 implementations.
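As a point of reference for the vectorized algorithms, the madd operation and the resulting encoding step can be written as a naive scalar sketch over GF(256). The reduction polynomial $x^8 + x^4 + x^3 + x^2 + 1$ (0x11d) and all function names are assumptions made for this illustration; they are not taken from libmoepgf, which may use a different polynomial.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed reduction polynomial x^8 + x^4 + x^3 + x^2 + 1 (not from libmoepgf). */
#define GF256_POLY 0x11d

/* Multiply two elements of GF(256) bit by bit (shift and XOR). */
uint8_t gf256_mul(uint8_t x, uint8_t y)
{
    uint16_t acc = 0;
    uint16_t xs = x; /* x times the current power of the polynomial variable */

    for (int bit = 0; bit < 8; bit++) {
        if (y & (1u << bit))
            acc ^= xs;
        xs <<= 1;
        if (xs & 0x100)
            xs ^= GF256_POLY; /* reduce as soon as degree 8 is reached */
    }
    return (uint8_t)acc;
}

/* madd: b := b + c * a over GF(256), one word (byte) at a time. */
void gf256_madd(uint8_t *b, const uint8_t *a, uint8_t c, size_t len)
{
    if (c == 0)
        return; /* multiplication by zero adds nothing */
    for (size_t i = 0; i < len; i++)
        b[i] ^= gf256_mul(c, a[i]);
}

/* Encode one packet from n source packets: b = sum over k of c[k] * src[k]. */
void gf256_encode(uint8_t *b, const uint8_t *const *src, const uint8_t *c,
                  size_t n, size_t len)
{
    for (size_t i = 0; i < len; i++)
        b[i] = 0;
    for (size_t k = 0; k < n; k++)
        gf256_madd(b, src[k], c[k], len);
}
```

The encoder calls madd once per source packet, which makes the $O(M \cdot N)$ complexity explicit and shows why the per-word cost of madd dominates the encoding time.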
III. AVX 512 SUPPORT
AVX 512 is a family of instruction set extensions widely introduced by Intel with the Skylake-X server processors in 2017. Before that, a subset of the AVX 512 extensions was available in Intel's Xeon Phi processors, which were derived from the Larrabee project [10]. Skylake-X processors vary in the number of execution units for fused multiply-and-add (FMA). In addition, Intel uses rather complex mechanisms to determine the maximum operating frequency depending on the number of active cores, the instruction set extensions in use, and even the specific instructions being executed.
A. Variants of AVX 512
As mentioned before, AVX 512 is not a single instruction set extension but a whole family of extensions. The parts relevant to this paper are AVX 512-F (foundation), which is supported by all processors supporting AVX 512, as well as AVX 512-BW (byte and word), which is so far only supported by Skylake-X and Ice Lake processors. Without AVX 512-BW, there is no support for byte-wise shuffle operations, which makes a port of the AVX 2 shuffle algorithm impossible. However, the imul algorithm is still usable.
Since AVX 512-BW is supported by all Skylake-X and Ice Lake processors, this is a rather theoretical case relevant only for the discontinued Xeon Phi processors and certain add-in accelerators. Nevertheless, libmoepgf differentiates between these feature sets and enables an AVX 512-F variant of the imul algorithm if necessary.
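The distinction between the two feature sets can be illustrated with a small runtime dispatch sketch. The enum and function names are hypothetical and do not reflect the actual libmoepgf interface; the detection relies on the GCC/Clang built-ins `__builtin_cpu_init` and `__builtin_cpu_supports` and falls back to a generic kernel on other compilers or architectures.

```c
#include <stdbool.h>

/* Hypothetical kernel identifiers, for illustration only. */
enum madd_kernel {
    KERNEL_FALLBACK,         /* AVX 2, SSSE 3, NEON, or scalar code */
    KERNEL_IMUL_AVX512F,     /* requires only AVX 512-F */
    KERNEL_SHUFFLE_AVX512BW  /* requires AVX 512-BW for byte-wise shuffles */
};

/* Pick the kernel according to the rules described in the text. */
enum madd_kernel select_kernel(bool has_avx512f, bool has_avx512bw)
{
    if (has_avx512f && has_avx512bw)
        return KERNEL_SHUFFLE_AVX512BW;
    if (has_avx512f)
        return KERNEL_IMUL_AVX512F;
    return KERNEL_FALLBACK;
}

/* Query the CPU at runtime (GCC/Clang on x86 only; generic fallback otherwise). */
enum madd_kernel detect_kernel(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return select_kernel(__builtin_cpu_supports("avx512f") != 0,
                         __builtin_cpu_supports("avx512bw") != 0);
#else
    return select_kernel(false, false);
#endif
}
```

Keeping the decision in a pure function such as `select_kernel` makes the dispatch rule testable independently of the machine the code runs on.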
B. Special frequency ranges for AVX code
Intel has been using different maximum frequencies for their processors since AVX 2. Unfortunately, Intel does not make the clocking behaviour of their processors transparent to end users and publishes those details only as part of their specification update manual [11]. Assuming that the CPU is not thermally limited or otherwise constrained, current Skylake-X based processors have three different frequency levels referred to as license levels (LVL) [12]. Within a given LVL, the number of active cores further restricts the maximum frequency. We do not further consider multiple active cores in this paper since we limit our benchmarks to a single thread. LVL 0 refers to the fastest turbo frequencies applicable to normal code, LVL 1 is slower and applies to most AVX 2 instructions, while LVL 2 is the slowest level.
Whether an AVX 512 instruction is executed within the envelope of LVL 1 or LVL 2 depends on whether it is a so-called light or heavy instruction. Heavy instructions are those executed on the FMA units, while light instructions comprise most logic and byte-wise instructions, in particular the shuffle operation.
Depending on the processor model, the maximum frequencies between LVL 1 and LVL 2 might differ significantly. Since the imul algorithm is based on packed integer multiplications, which are considered heavy instructions, the processor runs much slower compared to the shuffle algorithm that only uses light instructions. Table I shows these frequencies for the processors we use for comparison in this paper. While the Intel Xeon Gold 6130 reduces its frequency by just 200 MHz when executing AVX 512 heavy code on a single core compared to normal code, the Intel Xeon Silver 4116 throttles by 1200 MHz under the same conditions [11] . Consequently, the gain of AVX 512 enabled code may be much less than one might expect.
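The impact of the license levels on the achievable gain can be estimated with a simple model. The following is a rough sketch that assumes a purely compute-bound kernel whose throughput scales with the register width times the clock frequency. The symbols $S$, $f_{512}$, and $f_{256}$ are introduced here for illustration, and the frequencies in the example are hypothetical values, not entries of Table I.

```latex
\[
  S \approx \frac{512 \cdot f_{512}}{256 \cdot f_{256}}
    = 2 \cdot \frac{f_{512}}{f_{256}},
  \qquad \text{e.g.}\quad
  f_{256} = 3.0\,\mathrm{GHz},\; f_{512} = 2.0\,\mathrm{GHz}
  \;\Rightarrow\; S \approx 1.33 .
\]
```

Under this model, the ideal factor of two is only reached if the processor sustains the same frequency for AVX 512 code as for AVX 2 code.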
AMD so far has no special operating frequencies for AVX code, but the AMD Epyc 7xx1 series only supports 128 bit operations natively, i. e., AVX 2 operations are split into two 128 bit operations. This changed with the 7xx2 series, but there is still no support for AVX 512.
However, we found that our AMD Ryzen 1700, which has a maximum single-core turbo of 3.7 GHz, reduces its clock speed to 3.5 GHz under heavy load, i. e., it did not achieve its maximum advertised single-core turbo regardless of the offered load, although it was limited neither by thermal nor by power constraints.
IV. IMPLEMENTATION
We implement AVX 512 variants of both the shuffle and imul algorithms for GF(2), GF(4), GF(16), and GF(256). The following sections show the implementations for GF(4) as an example, as they contain all special cases of the larger fields but allow for a more compact representation since fewer constants are required. The algorithms are integrated into libmoepgf and its benchmark application.
A. shuffle algorithm
Algorithm 1 shows the implementation of the shuffle algorithm using C compiler intrinsics to make sure proper AVX 512 code is produced. The function expects two regions a and b corresponding to a source packet $\mathbf{a}_i$ and the accumulator array $\mathbf{b}$ as well as a coding coefficient $c_i$ represented by the constant value c. It then calculates $\mathbf{b} := \mathbf{b} + c_i \cdot \mathbf{a}_i$. The register variables in Algorithm 1 are preloaded with various constants, in particular the lookup tables and bit masks needed for the shuffle algorithm. Afterwards, the two trivial cases $c_i \in \{0, 1\}$ are caught, which result in either no operation or a simple XOR between the input arrays. Otherwise, temporary registers are preloaded and the shuffle algorithm performs the respective madd operation.
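The table lookup at the core of the shuffle algorithm can be illustrated with a portable scalar sketch for GF(4). This is not Algorithm 1 itself: it emulates the lookup one byte at a time, whereas a vector implementation would perform 64 such lookups in parallel with a byte-wise shuffle instruction such as the AVX 512-BW intrinsic `_mm512_shuffle_epi8`. The function names and the use of a single 16-entry table for both nibbles are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply two elements of GF(4); the reduction polynomial is x^2 + x + 1. */
uint8_t gf4_mul(uint8_t x, uint8_t y)
{
    uint8_t r = 0;

    if (y & 1)
        r ^= x;
    if (y & 2)
        r ^= (uint8_t)(x << 1);
    if (r & 4)
        r ^= 7; /* replace x^2 by x + 1 */
    return r & 3;
}

/* b := b + c * a over GF(4), with four 2-bit words packed into each byte. */
void gf4_madd_lookup(uint8_t *b, const uint8_t *a, uint8_t c, size_t len)
{
    uint8_t tbl[16];

    if (c == 0)
        return; /* trivial case: nothing to add */
    if (c == 1) {
        for (size_t i = 0; i < len; i++)
            b[i] ^= a[i]; /* trivial case: plain XOR */
        return;
    }

    /* tbl[v] holds the products of c with the two words packed in nibble v. */
    for (unsigned v = 0; v < 16; v++)
        tbl[v] = (uint8_t)(gf4_mul(c, (uint8_t)(v & 3)) |
                           (gf4_mul(c, (uint8_t)(v >> 2)) << 2));

    for (size_t i = 0; i < len; i++) {
        uint8_t lo = tbl[a[i] & 0x0f];               /* lower two words */
        uint8_t hi = (uint8_t)(tbl[a[i] >> 4] << 4); /* upper two words */

        b[i] ^= (uint8_t)(lo | hi);
    }
}
```

Because the table has only 16 entries of one byte each, it fits into a single 128 bit lane of a vector register, which is what makes the shuffle instruction usable as a parallel lookup.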
B. imul algorithm
On platforms that support AVX 512-F (foundation) but not AVX 512-BW, the imul algorithm shown in Algorithm 2 is used instead. The algorithm again catches trivial cases first and then sets up a number of temporary registers with constant values: mi contains the bit masks used to isolate individual coefficients within a word, repeated as many times as words fit into the 512 bit registers, and sp contains the powers of the constant factor $c_i$ for the respective extension field. Afterwards, the loop literally performs the required polynomial multiplications on $c_i$ and $b_i$ using only AVX 512-F instructions. The result is XORed into the destination register, which corresponds to the add part of the madd operation. While this algorithm is guaranteed to work on all devices supporting AVX 512, it has two drawbacks. First, its complexity depends on the word length, as the number of multiplications and logical operations within the main loop grows with the word length of the respective extension field. Second, the integer multiplications used in the main loop are heavy instructions and therefore subject to lower frequencies as discussed in Section III-B.
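The principle behind the imul algorithm can be sketched with ordinary 64 bit integers instead of 512 bit registers. The sketch below is an illustration for GF(4) under stated assumptions, not Algorithm 2 itself: the masks `m0` and `m1` play the role of mi, the constants `s0` and `s1` play the role of sp, and the function names are hypothetical. Each isolated coefficient is either 0 or 1 at the lowest bit of its word, so an integer multiplication by a constant smaller than 4 places either 0 or that constant into every word without carries into neighbouring words.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Multiply a GF(4) element by x modulo x^2 + x + 1. */
uint8_t gf4_mul_x(uint8_t v)
{
    v = (uint8_t)(v << 1);
    if (v & 4)
        v ^= 7; /* replace x^2 by x + 1 */
    return v & 3;
}

/*
 * b := b + c * a over GF(4) with 32 words per 64 bit integer.
 * Only complete blocks of 8 bytes are processed.
 */
void gf4_madd_imul64(uint8_t *b, const uint8_t *a, uint8_t c, size_t len)
{
    const uint64_t m0 = 0x5555555555555555ULL; /* coefficient of x^0 per word */
    const uint64_t m1 = 0xaaaaaaaaaaaaaaaaULL; /* coefficient of x^1 per word */
    const uint64_t s0 = c;                     /* c * x^0 */
    const uint64_t s1 = gf4_mul_x(c);          /* c * x^1 */

    if (c == 0)
        return; /* trivial case: nothing to add */

    for (size_t i = 0; i + 8 <= len; i += 8) {
        uint64_t va, vb;

        memcpy(&va, a + i, 8);
        memcpy(&vb, b + i, 8);
        vb ^= (va & m0) * s0;        /* partial products for x^0 */
        vb ^= ((va & m1) >> 1) * s1; /* partial products for x^1 */
        memcpy(b + i, &vb, 8);
    }
}
```

For GF(4) two masks and two multiplications per register suffice, while GF(256) would need eight of each under the same scheme, which reflects the dependence of the complexity on the word length.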
V. EVALUATION
We first evaluate the implementation of the shuffle and imul algorithms for AVX 512 vs. the AVX 2 variant proposed in [4] on an Intel Xeon Gold 6130 CPU in Section V-A, showing significant performance advantages when making use of AVX 512-BW in particular. Section V-B contains a survey of the best performing settings for different processors, showing the microarchitectural advances over the past few years.
A. Performance depending on ISA extensions
Figure 1 shows the test results on an Intel Xeon Gold 6130. The process is pinned to a specific core of the CPU. Each measurement point is averaged over $4 \cdot 10^7$ repetitions to preclude any short-term effects from influencing the measurement. Figures 1a-1d show the encoding throughput with varying packet sizes ranging from 256 B up to 8 KiB for the different Galois fields. The encoding throughput is defined as the amount of encoded packets measured in Gbit/s based on 16 uncoded source packets of the respective size that are randomly combined, i. e., the test resembles random linear network coding with a generation size of 16. The vertical red lines indicate the packet sizes at which the working set fills the L1, L2, and L3 caches, respectively. For instance, the Intel Xeon Gold 6130 has 1 MiB of L2 cache per core, which means that 16 packets of 64 KiB exactly fill the L2 cache, as indicated by the middle red line.
The results for GF(256) (Figure 1a) and GF(16) (Figure 1b) clearly show a significant drop in performance once the working set exceeds the L2 cache size. While the results for GF(4) (Figure 1c) and in particular GF(2) (Figure 1d) also drop at the L2 cache boundary, there is another significant drop at a packet size of 32 KiB. While a slight drop in performance at or (depending on the cache size) even before the cache boundary can result from the L2 cache being inclusive and from the working set consisting not only of 16 source packets but also of an array used as accumulator for the encoded packet, such significant drops must have other reasons. Interestingly, this phenomenon was neither observed on the Haswell-based processor of our original publication [4], nor is it reproducible with any processor other than those based on Skylake-X; even the desktop derivatives of Skylake (not Skylake-X) do not show this behavior.
At the moment, we do not have a conclusive explanation for this behaviour, but we can rule out errors in the benchmark itself (since the other processors behave as expected) as well as any mitigations of security issues on Intel processors (Spectre, Meltdown, etc.), as we conducted the tests both with and without those mitigations and observed virtually no difference in performance.
In any case, performance is severely limited by memory bandwidth, resulting in a significant drop once the L2 cache is exceeded. As expected, the fast algorithms using AVX 2 and AVX 512 are affected most, which demonstrates the massive processing power of the CPU as long as its execution units are kept busy. On the other hand, too small packet sizes result in too much overhead, e. g., setup and teardown code for loops. For the commonly used GF(256), maximum performance is achieved for packet sizes between 512 B and 16 KiB, which covers a significant range of common packet sizes for networking applications.
The reason why the encoding throughput over GF(4) is considerably higher compared to GF(16) and GF(256) is that in case of GF(4) half of the operations are trivial: in case of multiplication by 0 or 1 (see Section IV), the operations reduce to a null or copy/XOR operation, and as there are only four elements in GF(4), these cases make up half of all operations.
Although it is obvious, it should be noted that increasing the generation size of course increases the working set, e. g. doubling the generation size cuts the packet size at which the algorithms become bandwidth limited in half.
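The relation between generation size, packet size, and cache size can be made explicit with a small helper. This is a minimal sketch; the function name and the option to account for the accumulator array are assumptions added for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Largest packet size in bytes for which the working set of a generation
 * still fits into a cache of cache_bytes. If include_accumulator is set,
 * the array holding the encoded packet is counted as one more packet.
 */
size_t max_packet_size(size_t cache_bytes, size_t generation_size,
                       bool include_accumulator)
{
    size_t packets = generation_size + (include_accumulator ? 1 : 0);

    return cache_bytes / packets;
}
```

With 1 MiB of L2 cache, a generation size of 16 yields 64 KiB per packet and a generation size of 32 yields 32 KiB. Counting the accumulator as a 17th packet lowers the limit for a generation size of 16 to slightly more than 60 KiB.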
Given that modern CPUs are able to encode at speeds of 30 Gbit/s over GF(256) using a single core only, the encoding operations do not pose a significant overhead on those systems anymore. However, on older processors or embedded systems, things might look different which is discussed in the following section.
B. Comparison between CPUs
To compare the performance of different CPUs, we evaluate our library on eight different processors listed in Table I. Since the performance depends on two major factors, namely SIMD support (and therefore the chosen algorithm) and packet size (and therefore cache size), we define the following benchmark procedure:
1) Each CPU is benchmarked with the fastest algorithm available, i. e., making use of the specific SIMD extensions available.
2) The result for each CPU is the average of results for working sets fitting in the L2 cache.
The results are shown in Figure 2. The first three bars for each Galois field show the results of recent Intel Xeon processors, which are the only processors under test supporting the new AVX 512 instruction set extensions. According to Table I, the faster Xeon Gold 6130 uses a frequency of 3.6 GHz compared to 2.9 GHz of the Xeon-D2166 and Xeon Silver 4116 because the shuffle algorithm is preferred in all cases over the imul algorithm. Consequently, the difference in encoding throughput is almost solely based on the difference in clock frequency. The fact that the Xeon-D2166 is slightly faster than the Xeon Silver 4116 despite identical clock speeds is most likely related to background tasks on the latter, because we were unable to dedicate the machine to benchmarks.
The fourth bar represents the aging Intel Xeon E5-2696 v3, which is a processor based on the Haswell microarchitecture, i. e., two generations before Skylake. Considering that this processor does not support AVX 512 but only AVX 2, the results are exceptionally good.
The fifth bar shows the results for a Xeon 2650, which is based on the Sandy Bridge EP microarchitecture from 2012 and thus a legacy system. It does not even support AVX 2 and therefore libmoepgf falls back to SSE 2 / SSSE 3 instruction set extensions.
Columns six and seven represent an AMD Ryzen 1700 desktop processor and an AMD Epyc 7601 server processor. Under perfect test conditions one would expect the difference to be solely due to clock speed since both processors use exactly the same CPU core complexes. However, the Epyc falls behind significantly, which we believe is caused by background load on this machine, as we were not able to dedicate it to benchmark purposes. When comparing against the faster Intel CPUs, we have to keep in mind that these AMD processors only execute 128 bit operations natively. Consequently, AVX 2 operations have to be split into two operations and AVX 512 is not supported at all. The newer Epyc 7xx2 and Ryzen 3000 CPUs are expected to be on par with Intel CPUs supporting AVX 2.
As a comparison for current embedded systems, we also include the performance on a Broadcom BCM 2711 as used on the Raspberry Pi Model 4B, which is an ARMv8-based CPU supporting the NEON instruction set extensions. Considering that this CPU does not even have a passive cooler and is clocked at only 1.5 GHz, the results are quite astonishing: even over GF(256) we achieve over 1 Gbit/s of encoded throughput.
VI. CONCLUSION
We can conclude from the results that increases in encoding throughput almost solely stem from advances in SIMD support and clock frequency. Main memory throughput is of less importance due to the ratio between packet and cache sizes. In case of memory limitations, AVX 2 or even AVX 512 support does not offer a significant benefit, as shown in Section V-A.
However, within the range of packet sizes interesting for networking applications, we see a substantial increase in encoding throughput when using AVX 512 extensions. The increase is not a perfect twofold scaling compared to AVX 2, which is partially but not completely explained by lower frequencies when AVX 512 code is executed. A similar phenomenon was already observed in [4] when comparing SSSE 3 to AVX 2 implementations.
The new implementations are integrated into libmoepgf, which is published under the LGPL at [5]. We welcome suggestions for improving libmoepgf.
