Abstract: We propose a non-binary stochastic decoding algorithm for low-density parity-check (LDPC) codes over GF($q$) with degree-two variable nodes, called the Adaptive Multiset Stochastic Algorithm (AMSA). The algorithm uses multisets, an extension of sets that allows multiple occurrences of an element, to represent probability mass functions, which simplifies the structure of the variable nodes. The run-time complexity of one decoding cycle using AMSA is $O(q)$ for conventional memory architectures, and $O(1)$ if a custom memory architecture is used. Two fully-parallel AMSA decoders are implemented on FPGA for two (192,96) (2,4)-regular codes over GF(64) and GF(256), both achieving a maximum clock frequency of 108 MHz. The GF(64) decoder has a coded throughput of 65 Mb/s at 2.4 dB when using conventional memory, while a decoder using the custom memory version can achieve 698 Mb/s at the same signal-to-noise ratio. In terms of frame error rate (FER), the GF(64) version of the algorithm is only 0.04 dB away from the floating-point SPA performance, and for the GF(256) code the difference is 0.2 dB. To the best of our knowledge, this is the first fully-parallel non-binary LDPC decoder over GF(256) reported in the literature.
, IEEE 802.16e (WiMAX) [9], and others. Additionally, LDPC codes have found use in storage systems [10], [11].
Non-binary LDPC codes have been shown to have better resilience against burst errors [11] and mixed types of noise and interference [12], are better suited for higher-order modulation, and provide a considerable performance boost to medium and short length codes. Because of these properties, non-binary LDPC codes were studied within the DAVINCI project, which aims at further reducing the gap between state-of-the-art performance of practical codes and Shannon capacity [13], [14].
The performance improvements associated with the generalization of LDPC codes to non-binary fields come at a cost. The complexity of the Sum-Product Algorithm (SPA) over GF($q$) becomes $O(q^2)$, which limits the feasibility of non-binary decoders to lower-order fields. There have been multiple attempts at tackling the complexity problem with algorithms like FFT-SPA [15], Log-SPA [16], and Extended Min-Sum (EMS) [17], but the complexity remains high and a fully-parallel implementation of these decoders has not been shown to be practical to date. The highest previously reported throughput for a hardware implementation of the GF(64) code used in this work is 3.8 Mbit/s [18].
Stochastic decoding for LDPC codes was introduced in [19] as a way of reducing hardware complexity [20] while matching and even improving on the performance of reference algorithms like SPA in both binary [19], [21], [22] and non-binary [23] cases. In stochastic decoding, the symbols themselves are sent as messages instead of the probabilities of the symbols, with the probabilities being encoded in the statistics of the stream. Despite the implementation advantages of non-binary stochastic decoders, designing a fully-parallel decoder remains challenging, especially for high-order fields. In this paper, we present a new stochastic decoding algorithm for non-binary LDPC codes that considerably reduces the complexity of the computations performed in the variable nodes (VNs) while inheriting the benefits of very simple check nodes (CNs) and interleaver circuit [20], [24]. The algorithm is called the Adaptive Multiset Stochastic Algorithm (AMSA).
Additionally, this work successfully extends the redecoding technique [22] to non-binary codes. Redecoding is a technique, introduced originally for relaxed half-stochastic (RHS) decoding of binary LDPC codes, that improves bit-error-rate (BER) performance and lowers error floors by making multiple decoding attempts on codewords that fail to decode initially.
Furthermore, we propose a fully-parallel architecture for AMSA, which is used for implementing LDPC decoders for two practical codes from the DAVINCI project [13], [14]. The GF(64) decoder achieves a coded throughput of 65 Mbit/s using conventional memory and 698 Mbit/s using custom memory at an SNR of 2.4 dB, which are, to the best of our knowledge, the highest reported throughputs for this code. We show that the AMSA decoder architecture scales gracefully with the order of the field $q$. The length of the memory blocks scales with $q$, while the width of the memory blocks, the number of wires in the decoder, the length of the registers, and the size of the control logic scale with $\log_2 q$. Moreover, we are not aware of any other reported fully-parallel non-binary LDPC decoder implementation for GF(256).
The term conventional memory is used in this paper to refer to memory architectures where the task of writing $n$ identical words takes a number of steps that is proportional to $n$.
A background on LDPC decoding and relevant topics is given in Section II. The Adaptive Multiset Stochastic Algorithm and its hardware implementation are presented in Sections III and IV. Section V is concerned with the analysis of the simulation results. Finally, the conclusion is found in Section VI.
II. LDPC DECODING OVER GF($q$)

A. SPA Decoding Over GF($q$)
SPA was originally extended to non-binary LDPC codes over GF($q$) in [29], and in [15] it was shown that the check node computations can be simplified by introducing permutation nodes on the edges of the Tanner graph as shown in Fig. 1. These additional nodes perform a permutation on the pmf messages exchanged between the variable and check nodes. Note that the subscripts in the message names specify the source and destination nodes of the messages, e.g. $U_{vp}$ is a message from variable node $v$ to permutation node $p$. Using this notation, the SPA algorithm for non-binary LDPC codes can be expressed in the following way:
1) Initialization
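Each variable node $v$ is initialized with the channel likelihoods $L_v[a] = \prod_{i=1}^{p} P(y_i \mid a_i)$, where $L_v[a]$ is the channel likelihood of the symbol $a$ specified by the bits $a_1 \dots a_p$ with $p = \log_2 q$, $a_i$ is the $i$th bit of the symbol, and $y_i$ is the corresponding noisy bit received.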
2) Product step
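For each variable node $v$ the output messages $U_{vp_t}$ are computed for all the symbols in GF($q$), i.e. for all $a \in$ GF($q$):
$$U_{vp_t}[a] = L_v[a] \prod_{k \neq t} V_{p_k v}[a] \quad (1)$$
where $t, k = 1, \dots, d_v$ index the edges of the variable node, $L_v[a]$ are the channel likelihoods computed in the initialization step, $V_{p_k v}$ are the variable node incoming messages, and all the products are term-by-term products. Note that after this step $U_{vp_t}$ requires normalization so that $\sum_a U_{vp_t}[a] = 1$.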
3) Permutation step
A Tanner graph edge between a variable node $v$ and a check node $c$ corresponds to a non-zero element $h$ in the H matrix of the non-binary code. The output of this step is a pmf message $U_{pc}$ which is a permutation of the input message $U_{vp}$:
$$U_{pc}[b] = U_{vp}[a] \quad \text{for all } a \in \mathrm{GF}(q) \quad (2)$$
where the bits $a_1 \dots a_p$ and $b_1 \dots b_p$ correspond to symbols $a$ and $b$ respectively, such that $b = h \cdot a$. Passing a message on the same edge in the opposite direction, from a check node $c$ to a variable node $v$, involves the inverse permutation given by $h^{-1}$.
4) Check step
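The sum-product update computes the output messages $V_{cp_t}$ corresponding to a check node $c$:
$$V_{cp_t}[b_t] = \sum_{(b_k)_{k \neq t} \in S(b_t)} \; \prod_{k \neq t} U_{p_k c}[b_k] \quad (3)$$
where $S(b_t)$ is the set of symbol combinations $(b_k)_{k \neq t}$ such that $b_t + \sum_{k \neq t} b_k = 0$ over GF($q$), $b_k$ is the symbol on the $k$th edge of the check node, and $t, k = 1, \dots, d_c$.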
The complexity of one decoding iteration for SPA is $O(N d_v q^2)$, where $N$ is the codeword length and $d_v$ is the average column weight in matrix H, and is dominated by the complexity of the sum-product computation.
Log-SPA, the logarithm domain variation of the SPA algorithm [30], converts the multiplication operations into additions, but does not reduce the computational complexity. The FFT-SPA algorithm [11], [31] performs the check node computations in the frequency domain and reduces the complexity to $O(M d_c\, q \log q)$, where $M$ is the number of mutually independent rows in H, and $d_c$ is the mean row weight. The Extended Min-Sum Algorithm (EMS) [15] decreases the size of the non-binary messages from $q$ to $n_m$ and reduces the per-message complexity to the order of $n_m \log n_m$ at the cost of some decoding performance loss.
B. Stochastic Decoding Over GF($q$)
Stochastic decoding is inspired by the technique of stochastic computing [32] where quantities are represented as Bernoulli sequences of bits and the information is conveyed through the statistics of the stream. This stream representation allows for complex computations to be implemented with simple hardware, and reduces the number of interconnecting wires required. The result of these reductions in complexity are circuits that can sustain higher clock rates [32] .
The stochastic concept can be extended to the non-binary case, where a stochastic stream of symbols represents the distribution of probabilities of the symbols. For a stream of symbols from GF(4), for example, the probability associated with each of the four possible symbols is the number of times that symbol occurs in the stream divided by the length of the stream. The distributions created this way are always normalized because $\sum_a n_a / n = 1$, where $n_a$ is the number of times symbol $a$ has been observed and $n$ is the total number of symbols. It is important to note the difference in the nature of the messages in SPA or MSA, and the ones used in stochastic decoding. In stochastic decoding a message consists of a single symbol from GF($q$) [19]. The probability distribution associated with the stochastic stream can be inferred as shown above. In contrast, in SPA and MSA a message is an expression of a distribution of probabilities, a probability mass function (pmf) represented by a vector of length $q$, while in EMS the length is reduced to $n_m$ [15]. The amount of information needed to represent an SPA or MSA message is $qB$ bits (or $n_m B$ bits for EMS), where $B$ is the number of bits required to store each probability or likelihood. The number of bits needed to represent a stochastic message is $\log_2 q$, making stochastic messages considerably more compact than their SPA, MSA, and even EMS counterparts.
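As a minimal sketch of how a pmf is inferred from a stochastic stream, the snippet below counts relative frequencies and compares message sizes. The stream contents, the integer labelling of the GF(4) symbols as 0 to 3, and the function names are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from fractions import Fraction

def stream_to_pmf(stream, q):
    """Relative frequency n_a / n of each symbol a = 0..q-1 in the stream."""
    counts = Counter(stream)
    n = len(stream)
    return [Fraction(counts[a], n) for a in range(q)]

def message_bits(q, prob_bits):
    """Bits per message: a full pmf (q probabilities) versus one symbol."""
    return {"pmf": q * prob_bits, "stochastic": (q - 1).bit_length()}

# Hypothetical GF(4) stream; symbols are labelled by the integers 0..3.
stream = [1, 3, 1, 0, 1, 2, 1, 3]
pmf = stream_to_pmf(stream, q=4)
sizes = message_bits(q=64, prob_bits=8)
```

Because every probability is a count divided by the same stream length, the resulting pmf sums to one without any explicit normalization step.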
The small size of the stochastic message brings two benefits. Firstly, it reduces the hardware complexity of the non-binary decoders both in the number of interconnection wires and in the processing units themselves [20] . Secondly, shorter messages improve the average throughput when the received vector is close to a codeword and few cycles are enough for convergence. With alternative approaches, even if one iteration is required, larger messages have to be passed between the nodes resulting in reduced throughput.
It is also important to point out the difference between an SPA, MSA, or EMS decoding iteration and a stochastic decoding iteration. In the former case during one iteration the nodes exchange full pmf messages, while in the stochastic case only one symbol is exchanged per edge. To emphasize this difference, the stochastic decoding iteration is referred to as a decoding cycle (DC) [33] .
Finally, SPA, MSA, and EMS decoders follow a deterministic trajectory, and when a local optimum is reached it results in decoding failure. Stochastic decoders, on the other hand, follow stochastic trajectories, meaning that repeating the decoding process can yield an alternative path that avoids the local optimum. It has been shown in [22] that stochastic decoders are capable of decoding some of the frames that initially failed to decode. The method involves restarting the decoder with the same received soft-values vector but using a different random sequence. This technique is called redecoding, and was shown to improve the BER performance and lower the error floors.
Stochastic decoding was proposed in [19] for the binary case, then extended to non-binary LDPC codes in [24] . Using the same message names as in Fig. 1 , but noting that here a message is no longer a pmf, but rather a symbol, the algorithm can be described as follows:
1) Initialization
For each variable node $v$, the output messages $U_{vp}$ are initialized according to the channel likelihood values obtained from the channel model.
2) Variable node update
Given the variable node input messages $V_{p_k v}$, the output messages $U_{vp_t}$ are computed:
$$U_{vp_t} = \begin{cases} a & \text{if } V_{p_k v} = a \text{ for all } k \neq t \\ r & \text{otherwise} \end{cases} \quad (4)$$
where $r$ is a random sample generated from the statistics. In other words, if the input messages on all edges other than the current one agree on the same value $a$, this value becomes the output for the current edge. When such an agreement is not reached, a value is randomly picked from a stored history of values.
3) Permutation step
In the direction from the variable node $v$ to the check node $c$, the permutation operation is given by:
$$U_{pc} = h \cdot U_{vp} \quad (5)$$
where $h$ is the non-zero value from check matrix H corresponding to the edge from $v$ to $c$. For the inverse direction, when sending a message from a check node to a variable node, the value $h^{-1}$ is used.
4) Check node computation
A non-binary check node output message is computed by summing the input messages from all edges except the edge corresponding to the output message itself:
$$V_{cp_t} = \sum_{k \neq t} U_{p_k c} \quad (6)$$
where the sum is over GF($q$).
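The four steps above can be sketched for a toy field as follows. This is a hedged illustration only: it assumes GF(4) with symbols encoded as integers 0 to 3 and reduction polynomial $x^2 + x + 1$, and the helper names (`gf_mul`, `gf_inv`, `variable_node`, `check_node`) and the `fallback_samples` argument standing in for the stored history are hypothetical.

```python
from functools import reduce
from operator import xor

def gf_mul(a, b, p=2, poly=0b111):
    """Multiply in GF(2^p); poly is the reduction polynomial (x^2 + x + 1 here)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << p):
            a ^= poly
    return result

def gf_inv(h, q=4):
    """Brute-force multiplicative inverse of a non-zero element of GF(4)."""
    return next(x for x in range(1, q) if gf_mul(h, x) == 1)

def variable_node(inputs, fallback_samples):
    """Rule (4): forward the agreed value, else a sample from stored history."""
    outputs = []
    for t in range(len(inputs)):
        others = inputs[:t] + inputs[t + 1:]
        agree = all(x == others[0] for x in others)
        outputs.append(others[0] if agree else fallback_samples[t])
    return outputs

def check_node(inputs):
    """Rule (6): each output is the GF(2^p) sum (XOR) of all other inputs."""
    total = reduce(xor, inputs, 0)
    return [total ^ x for x in inputs]

permuted = gf_mul(2, 3)                    # permutation step (5) with h = 2
restored = gf_mul(gf_inv(2), permuted)     # inverse direction uses h^{-1}
```

Addition in a field of characteristic two is a bitwise XOR, which is why XOR-ing the total with an edge's own input yields the sum over all the other edges.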
A common problem for stochastic decoders using the above algorithm is latching, the undesired scenario when a group of nodes form a cycle and lock into a state of reduced or no switching resulting in poor bit-error-rate (BER) performance [21] , [33] . It has been shown that such non-binary stochastic decoders are only functional for when [24] .
C. Relaxed Half-Stochastic Decoding Over GF($q$)
The Relaxed Half-Stochastic (RHS) decoding algorithm for LDPC codes was proposed in [22] and represents a combination of SPA and stochastic decoding techniques. The algorithm uses successive relaxation to convert stochastic streams into LLR values. An RHS decoder can be seen as a hybrid decoder operating in both LLR and stochastic domains. Structurally, the difference between an RHS and a stochastic decoder is the variable node, with the interleaver and the check nodes being identical.
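In the non-binary version, Tracking Forecast Memories (TFMs) [34] are used to store the probabilities associated with the corresponding stochastic streams. For an incoming symbol $s$ the memories are updated according to the following rule:
$$P_{t+1}[a] = \begin{cases} (1-\beta) P_t[a] + \beta & \text{if } a = s \\ (1-\beta) P_t[a] & \text{otherwise} \end{cases} \quad (7)$$
for all $a \in$ GF($q$), where $P_t[a]$ is the probability of the stream symbol being equal to symbol $a$ at time $t$, and $\beta$ is the relaxation parameter.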
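A minimal sketch of this relaxation update, assuming the pmf is held as a list indexed by integer symbol labels (the function name `tfm_update` and the example values are illustrative):

```python
from fractions import Fraction

def tfm_update(pmf, symbol, beta):
    """Scale every probability by (1 - beta), then add beta to the incoming symbol."""
    return [(1 - beta) * p + (beta if a == symbol else 0)
            for a, p in enumerate(pmf)]

uniform = [Fraction(1, 4)] * 4
updated = tfm_update(uniform, symbol=2, beta=Fraction(1, 2))
```

The update touches all $q$ entries for every incoming symbol, which is the per-symbol cost that a multiset representation is meant to avoid.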
The RHS algorithm was successfully generalized for GF($q$) and was shown in [23] to have a performance close to that of SPA. Additionally, in [35] an optimized version of RHS was introduced for the case when variable nodes are of degree two, called RD2, that eliminates the need to perform term-by-term pmf multiplications in the variable node. In both cases, the computational complexities are lower than previously reported for non-binary LDPC decoders, but a fully-parallel implementation still remains challenging.
D. Non-Binary LDPC Decoders in Literature
Table I shows a few non-binary decoders reported in the literature. In [26] Spagnol et al. propose a serial architecture and FPGA implementation for a GF(8) LDPC decoder. In [25] a partially-parallel implementation of the EMS algorithm is given for GF(4). In [27] an architecture for decoding non-binary quasi-cyclic LDPC codes is proposed along with an ASIC implementation for a (620, 310) GF(32) code. An architecture that works for higher-order fields is presented without implementation in [17]. An optimized version of this architecture with an FPGA implementation is provided in [18].
More recently, several partially-parallel implementations have been reported [36], [37], [38] for codes over GF(32), with clock frequencies in the 200-260 MHz range, and reported throughputs in the range of 47-66 Mbit/s.
A fully-parallel implementation for a (160,80) code over GF(64) is reported in [28], and the details are included for comparison in Table I.
III. THE ADAPTIVE MULTISET STOCHASTIC ALGORITHM
In this section we propose a new stochastic decoding algorithm called the Adaptive Multiset Stochastic Algorithm (AMSA). The prominent feature of this algorithm is that it scales gracefully with the order of the field both in terms of run-time and hardware complexity, making it possible to implement practical fully-parallel LDPC decoders over higher order fields like GF(64) or GF(256).
AMSA relies on the properties of multisets, a generalization of the concept of sets that allows for multiple instances of the same element, for efficient storage and computation of beliefs, but also on the inherent simplicity of degree-two variable nodes.
The algorithm enables a substantially less complex variable node, while retaining the already simple permutation nodes and check nodes as in [15] and [24] , respectively.
In what follows we show how non-binary stochastic LDPC decoding can be expressed in terms of operations on multisets, define AMSA, and give its computational complexity.
A. Multiset Representation of a Probability Mass Function
Multisets, a generalization of sets, allow for multiple instances of the same element. The cardinality of a multiset $M$ is denoted by $|M|$ and represents the total number of instances of elements.
Definition: Let $M$ be a multiset containing symbols from GF($q$) with $P_M[a] = n_a / |M|$ being the probability of finding symbol $a$ in $M$, where $n_a$ is the number of times $a$ appears in $M$. A random variable that takes on values in GF($q$) has an associated pmf which is defined by the probabilities $P[a]$. As seen in the example provided in Section II-B, a pmf can be computed from any multiset of symbols. Here we show that it is always possible to construct a multiset representation of a given pmf, such that the difference between the probabilities $P[a]$ and $P_M[a]$ is less than $\epsilon$, for any $\epsilon > 0$, and all $a \in$ GF($q$).
Proposition 3.1: For any given pmf defined by probabilities $P[a]$ and any given $\epsilon > 0$, there exists a multiset $M$ such that $|P[a] - P_M[a]| < \epsilon$ for all $a \in$ GF($q$). (See Appendix A for the proof.)
In fact, a multiset can be used as an approximate representation of a probability mass function that has the advantage of being simple to sample and not requiring normalization. Table II illustrates that in order to increase a probability $P_M[a]$ by $\Delta$, one or more instances of symbol $a$ are added to $M$. The exact number of instances to add, $k$, can be calculated by solving the equation $(n_a + k)/(|M| + k) = P_M[a] + \Delta$, where $n_a$ is the number of instances of $a$ already in $M$. Sampling a pmf involves calculating a cumulative distribution function (CDF) $F$, and finding the symbol $a$ for which $F[a]$ is closest in value to $u$, the realization of a uniform random variable between 0 and 1. In contrast, sampling a multiset representation of a pmf is trivial, and is equivalent to picking a random symbol from $M$.
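The contrast between the two representations can be sketched as below. The closed form $k = |M|\Delta / (1 - P_M[a] - \Delta)$ follows from solving $(n_a + k)/(|M| + k) = n_a/|M| + \Delta$ for $k$. The function names are hypothetical, and the pmf sampler uses standard inverse-transform sampling as a stand-in for the CDF search described in the text.

```python
import random
from bisect import bisect_right
from fractions import Fraction
from itertools import accumulate

def instances_to_add(n_a, size, delta):
    """Solve (n_a + k) / (size + k) = n_a / size + delta for k."""
    target = Fraction(n_a, size) + delta
    return Fraction(size) * delta / (1 - target)

def sample_pmf(pmf, u):
    """Inverse-transform sampling: build the CDF, then search it for u."""
    cdf = list(accumulate(pmf))
    return min(bisect_right(cdf, u), len(pmf) - 1)

def sample_multiset(multiset, rng):
    """Sampling a multiset is a single uniformly random read."""
    return rng.choice(multiset)

k = instances_to_add(n_a=2, size=8, delta=Fraction(1, 4))
```

In this example two of eight stored symbols equal $a$; adding $k = 4$ more gives $6/12 = 1/2$, i.e. the original $1/4$ raised by $\Delta = 1/4$. In general $k$ is not an integer, which is why a rounding rule is needed when symbols are actually added.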
B. Algorithm Definition and Analysis
In [23], [35] pmfs are used to represent the statistics of the stochastic streams associated with edges in the Tanner graph. However, the variable node update involves recalculating probabilities, and sampling takes up to $q$ steps and requires computing or updating a cumulative distribution function.
AMSA addresses these problems by using multisets instead of pmfs. Fig. 2 illustrates the structure of a degree-two variable node with its input symbols $V_0$ and $V_1$, which, together with the corresponding channel likelihoods $L[V_0]$ and $L[V_1]$, are used to update the multisets $M_0$ and $M_1$. These multisets are associated with the two edges of the variable node. The output symbols are samples from these multisets.
In the context of this configuration, we define three routines that operate on multisets: the Add routine (Algorithm 1), the Remove routine (Algorithm 2), and the Sample routine (Algorithm 3). Together these routines constitute AMSA.
1) The Add Routine: This routine is shown in Algorithm 1 and updates a multiset $M$ by adding zero or more instances of the incoming symbol $s$ to it. When one or more symbols are added, the probability associated with $s$ is increased while the probabilities of all other symbols are decreased. The routine computes a real-valued number of symbols to add and makes use of the floor operator because only an integer number of symbols can be added to $M$. The fractional part is compared against $u$, the realization of a uniform random variable, to decide whether or not an additional symbol should be added, so that the expected number of symbols added equals the real-valued number computed before the floor is applied. The term $N - |M|$ that enters this computation is the difference between the maximum capacity $N$ of $M$ and its current size, and can be interpreted as the empty part of $M$, or the spare capacity of $M$.
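Adding $k$ instances of the incoming symbol $s$ to a multiset of size $|M|$ updates the probabilities as
$$P_{t+1}[a] = \begin{cases} (1-\beta_t) P_t[a] + \beta_t & \text{if } a = s \\ (1-\beta_t) P_t[a] & \text{otherwise} \end{cases} \quad (8)$$
with $\beta_t = k/(|M| + k)$, where $P_t[a]$ is the probability of symbol $a$ at time $t$. Note that this update maintains the probabilities normalized, i.e. $\sum_a P_t[a] = 1$ for all $t$. Equation (8) is recognizable as the RHS update rule from (7), but instead of using a constant term $\beta$, it uses $\beta_t$, a function of the likelihood corresponding to the incoming symbol $s$.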
This result is confirmed by experimental data. Fig. 3 shows how the probability of the correct symbol evolves during the decoding process in a TFM using the non-binary RHS update from (7) and AMSA as defined in this section. At any given point in time, the difference between the two probabilities is due to the fact that the multiset used by AMSA is only an approximation of a pmf.
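The stochastic rounding used by the Add routine can be sketched as follows. Algorithm 1 is not reproduced in this text, so the target count is an assumption made purely for illustration: it is taken to be the likelihood of the incoming symbol multiplied by the spare capacity. The function name and arguments are likewise hypothetical.

```python
import math
import random

def add_routine(multiset, symbol, likelihood, capacity, rng):
    """Append a stochastically rounded number of copies of `symbol`."""
    spare = capacity - len(multiset)
    target = likelihood * spare          # ASSUMPTION: not confirmed by the source
    count = math.floor(target)
    # Round up with probability equal to the fractional part, so that the
    # expected number of added symbols equals `target`.
    if rng.random() < target - count:
        count += 1
    multiset.extend([symbol] * count)
    return count

rng = random.Random(1)
half_full = [0, 1, 2, 3]
added = add_routine(half_full, symbol=3, likelihood=0.5, capacity=8, rng=rng)
full = [0] * 8
added_when_full = add_routine(full, symbol=3, likelihood=0.9, capacity=8, rng=rng)
```

Under this assumption a full multiset has zero spare capacity and receives no new symbols, and since the likelihood is at most one the count never exceeds the spare capacity.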
2) The Remove Routine: The goal of the Remove routine is to uniformly and gradually remove symbols from multiset $M$. The decision whether to remove a symbol or not is based on the result of a probabilistic experiment in which $u$, the realization of a uniform random variable, is compared against a threshold that depends on the occupancy of $M$. Since at most one symbol is removed per invocation, the expected number of symbols removed equals the probability that this comparison succeeds. Intuitively, when $|M|$ is close to the maximum capacity, it is more likely that a symbol will be removed, and as $|M|$ approaches zero, i.e. $M$ becomes empty, it is less likely that a symbol will be removed. Note that the Remove routine enforces the lower bound on the cardinality of $M$, so that $M$ never becomes empty.
Lemma 3.2: Let $n_a$ be the number of times symbol $a$ appears in $M$, and let $p_r$ be the probability that the Remove routine removes a symbol. Then the probability that the Remove routine removes symbol $a$ is $p_r\, n_a / |M|$. As shown in Proposition 3.3, Remove does not change the expected value of the probabilities of the symbols in $M$. This means that by invoking the Add and Remove routines, the expected values of the probabilities will be updated as in (8).
Proposition 3.3: Let $P[a]$ be the probability of symbol $a$ in $M$, and let $E[P'[a]]$ be the expected probability of the same symbol after the invocation of Remove, then $E[P'[a]] = P[a]$, i.e. the expected value of the probability is not changed by Remove.
Proof: Looking at a symbol $a$ from $M$, the Remove routine is an experiment with three outcomes: no symbol is removed, symbol $a$ is removed, and another symbol is removed. Let $p_i$ be the probability of outcome $i$, then $p_1 = 1 - p_r$, and, from Lemma 3.2, $p_2 = p_r\, n_a / |M|$, and, finally, $p_3 = p_r (1 - n_a/|M|)$. When no symbol is removed, the probability is unchanged, and remains equal to $n_a/|M|$. When $a$ is removed, its new probability is $(n_a - 1)/(|M| - 1)$. Finally, when a symbol other than $a$ is removed, the probability of $a$ is $n_a/(|M| - 1)$. We can now compute the expected value:
$$E[P'[a]] = (1 - p_r)\frac{n_a}{|M|} + p_r \frac{n_a}{|M|} \cdot \frac{n_a - 1}{|M| - 1} + p_r \left(1 - \frac{n_a}{|M|}\right) \frac{n_a}{|M| - 1} = \frac{n_a}{|M|} = P[a].$$
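The invariance stated in Proposition 3.3 can be checked with exact rational arithmetic. In the sketch below the removal probability is left as a free parameter (here called `p_remove`, an illustrative name), since the identity holds for any value of it.

```python
from fractions import Fraction

def expected_probability_after_remove(n_a, size, p_remove):
    """Expected probability of symbol a after one Remove invocation."""
    p = Fraction(n_a, size)
    nothing_removed = (1 - p_remove) * p
    a_removed = p_remove * p * Fraction(n_a - 1, size - 1)
    other_removed = p_remove * (1 - p) * Fraction(n_a, size - 1)
    return nothing_removed + a_removed + other_removed

# Exhaustive check over small multisets and a few removal probabilities.
invariant_holds = all(
    expected_probability_after_remove(n_a, size, p_r) == Fraction(n_a, size)
    for size in range(2, 10)
    for n_a in range(size + 1)
    for p_r in (Fraction(0), Fraction(1, 3), Fraction(7, 8), Fraction(1))
)
```

The check relies only on the removed symbol being chosen uniformly among the stored instances, which is what makes the probability of removing $a$ proportional to $n_a$.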
3) The Sample Routine: The Sample routine generates random symbols according to the probabilities of the symbols in $M$. Unlike Add and Remove, this routine does not modify $M$. The probability that Sample returns symbol $a$ is equal to $n_a/|M|$, where $n_a$ is the number of times $a$ appears in $M$.
Having introduced the multiset representation of probability mass functions, and the mathematical properties of the Remove, Add, and Sample routines, we present the steps of AMSA decoding:
1) Initialization. When the soft-decision sequence is received from the channel, the decoder front-end uses it to compute, for each variable node, a set of initial likelihoods for each symbol in GF($q$). Additionally, random samples are generated according to the likelihoods and fill the multisets $M_0$ and $M_1$.
2) Variable node update. The variable node processing starts with the Add routine and is followed by the Remove routine, applied to each multiset $M_i$, $i = 0, 1$. The variable node outputs are outcomes of the Sample routine from $M_0$ and $M_1$, as depicted in Fig. 1. Furthermore, the variable node belief is computed using the likelihoods $L[V_i]$ of the input symbols $V_i$, $i = 0, 1$. During the first iteration the multisets $M_0$ and $M_1$ are filled to their maximum capacity. Therefore, no symbols are added. The output symbols, in this case, are generated as usual.
3) Permutation step follows (5).
4) Check node computation follows (6).
5) Tentative decoding. If the beliefs computed in the variable nodes satisfy all the parity equations, decoding stops. Otherwise, decoding continues until a maximum preset number of decoding cycles is reached.
Note that in non-binary stochastic LDPC decoding, the variable node update is, by far, the most complex computation, with the permutation step being equivalent to a single multiplication, and the check node requiring only a summation of values over GF($q$). AMSA, and its variable node update presented here, stand out from the other approaches in the literature [22], [23], [33], [34], [39] in several ways. First, it solves the problem of variable node input agreement, seen in (4), where, as the order of the field grows, it becomes increasingly unlikely that the condition is satisfied. Second, AMSA avoids the hybrid approach of the RHS algorithm, where both pmfs and stochastic messages are used, which requires a conversion between the two and increases hardware complexity, especially in the non-binary case. Finally, AMSA exhibits two properties: low computational complexity (as shown in Table III) and low hardware complexity (as shown in Section IV), which enabled the implementation of a fully-parallel non-binary LDPC decoder over GF(256), which, to the best of our knowledge, is the first such implementation.
D. Complexity Analysis
Table III presents the upper bounds on the number of operations needed for each stage of the proposed decoding process. Note that in the case of the Add routine, the complexity is more intuitively $O(d_v N)$, where $d_v$ is the degree of the variable node and $N$ is the maximum size of a multiset, but since $N$ scales linearly with $q$ for all practical purposes, it was presented as $O(q)$ to simplify comparison with other results in the literature.
The second column of the table gives the upper limit on the run-time of the algorithm computations using conventional memory, which requires $n$ steps in order to write $n$ identical values. This approach was used for the FPGA implementation presented in Section IV. Whenever operations can be carried out in parallel and implemented as such in hardware, the corresponding reduction in complexity has been considered. For instance, the Sample routine is shown to require $O(d_v)$ operations, as it is required for the $d_v$ edges of the variable node. However, in the FPGA implementation all edges are instantiated and can execute the routine in parallel in $O(1)$ time. In fact, all the multisets associated with edges are sampled in parallel in one clock cycle. Similarly, the rest of the routines can be executed in parallel to reduce the run-time complexity.
The third column provides the run-time complexities when custom-designed memories are available, which are capable of writing an identical value in multiple locations in only one clock cycle. In this case, all the variable node computations have a run-time complexity of $O(1)$, which further increases the throughput of the decoder.
Table IV presents the memory space requirements for AMSA. Note that a symbol is represented in hardware by $\log_2 q$ bits and that the multisets are implemented as memories, while $B$ is the number of bits used to represent probabilities.
E. Redecoding
It has been shown in [22] that if a stochastic decoder fails to decode a codeword, it may succeed by trying to decode it again using different random sequences.
Redecoding can be seen as a tradeoff mechanism between error-rate performance and latency. A redecoding configuration has two parameters: the number of attempts and the maximum number of decoding cycles for each attempt. Latency scales with the product of these two parameters. These parameters can be changed at run-time, making the AMSA decoder suitable for variable-latency applications.
The technique of redecoding has been successfully extended to non-binary stochastic decoding in this work in order to improve performance and lower the error floor.
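The control flow of redecoding can be sketched as a simple retry loop. The decoder itself is abstracted behind a hypothetical callable `decode_once(received, max_cycles, rng)` that returns a success flag and a candidate word; the per-attempt seeding scheme is an illustrative assumption.

```python
import random

def redecode(decode_once, received, attempts, max_cycles, seed=0):
    """Retry decoding the same received vector with a fresh random sequence."""
    for attempt in range(attempts):
        rng = random.Random(seed + attempt)   # different randomness per attempt
        success, word = decode_once(received, max_cycles, rng)
        if success:
            return word, attempt + 1
    return None, attempts

# Stub decoder for demonstration only: it succeeds on its third invocation.
calls = []
def stub_decoder(received, max_cycles, rng):
    calls.append(max_cycles)
    return len(calls) == 3, list(received)

word, attempts_used = redecode(stub_decoder, [1, 2], attempts=5, max_cycles=100)
worst_case_cycles = 5 * 100
```

The received soft values are reused unchanged in every attempt; only the random sequence differs, and the worst-case latency is the product of the two parameters.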
IV. CIRCUIT IMPLEMENTATION
In this section we consider two fully-parallel AMSA decoder implementations: AMSA-128 over GF(64), and AMSA-512 over GF(256), where AMSA-$N$ denotes a configuration of the decoder that uses multisets of maximum size $N$. The GF(64) and GF(256) codes are (192,96) (2,4)-regular LDPC codes used in the DAVINCI project [13]. Table V presents a detailed comparison between the decoders.
Fig. 4. The interfaces of a channel likelihood memory and an edge memory. Probabilities are represented using $B$ bits, and symbols using $\log_2 q$ bits. Note that both types of memories are dual-port.
The FPGA platform used for implementation is the EP4SGX230-KF40C2 chip from the Altera Stratix IV GX family. It provides, among other things, 182,400 Adaptive Lookup Tables (ALUTs), 182,400 registers, 1,235 M9K memory blocks, and 1,288 18×18-bit Digital Signal Processing (DSP) blocks.
A. Variable Node
The channel likelihood values are represented in fixed-point format using $B$ bits. The likelihood values are organized in a $q \times B$-bit dual-port memory, as illustrated in Fig. 4. Within each variable node the Add and the belief computation routines make use of the likelihood values. In both cases the access is read-only. The only time the likelihood memory is written to is during the initialization phase of the algorithm.
The hardware representation for a multiset is a memory array of length $N$, an approach similar to the Edge Memories (EMs) that were introduced for the binary case in [33] and then extended to GF(16) in [24]. In other words, we impose upper and lower bounds on the cardinality of the multisets: $1 \le |M_i| \le N$, $i = 0, 1$. For practical reasons and efficient utilization of memory resources, $N$ is chosen to be a power of 2. On the Altera Stratix IV FPGA platform used in this work, the so-called M9K memory blocks were used for this purpose. Each block has a capacity of 8192 bits with configurable dual read-write ports. Both codes used in this work have 384 edges in the Tanner graph and therefore use 384 M9K memory blocks.
The Add, Remove, and Sample routines make use of randomly generated bits. This implementation uses a 32-bit linear feedback shift register (LFSR) in each variable node. The LFSRs use a feedback polynomial that achieves the maximum-length sequence of $2^{32} - 1$ [40].
1) Implementation of the Remove Routine: In the hardware implementation, we represent $M$ with a memory of length $N$, which is implicitly an ordered collection. Normally, if the order of the elements had to be preserved, removing an arbitrary element from the memory would imply shifting up to $N - 1$ elements by one position to the left, which would take $O(N)$ steps to perform. Fortunately, in AMSA we are not concerned with the order of the elements in the memory. In this case, removing an element can be achieved by overwriting it with the last element in the sequence and reducing the length of the sequence by one. This operation takes only $O(1)$ steps. The hardware implementation of the Remove routine can be expressed as in Algorithm 4, where $L$ is the size of the multiset, and $M[i]$ represents the $i$th element in the memory array. The test is done by a comparator which controls the read-write mode of the edge memory. Note that it is possible to avoid explicitly fetching the value of the last element from the memory each time. It can be stored and updated in a register, as shown in Fig. 6.
Algorithm 4: The set of commands implementing the Remove routine.
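The constant-time removal described above (overwrite the victim with the last valid element, then shrink the length) can be sketched as follows. The body of Algorithm 4 is not reproduced in this text, so this is only an illustration: the probabilistic test that decides whether a removal happens is omitted, and the function names are hypothetical.

```python
import random

def remove_at(memory, length, index):
    """Overwrite slot `index` with the last valid element; return the new length."""
    memory[index] = memory[length - 1]
    return length - 1

def remove_random(memory, length, rng):
    """Remove one uniformly chosen element, never emptying the multiset."""
    if length <= 1:
        return length                      # lower bound on the cardinality
    return remove_at(memory, length, rng.randrange(length))

memory = [5, 6, 7, 8, 0, 0, 0, 0]          # capacity 8, first 4 slots valid
new_length = remove_at(memory, 4, 1)        # removes the element 6
```

Only one memory write and one length update are needed, independent of the capacity, because the order of the stored symbols carries no meaning for a multiset.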
2) Implementation of the Add Routine:
The circuit for computing the number of symbols to add is given in Fig. 7, and it uses a multiplier, two comparators, and two multiplexers. On the FPGA system used here, each of the 1,288 DSP blocks provides an 18×18-bit multiplier, more than enough to instantiate one for each of the 384 edges. On platforms where DSP blocks are not available, it is possible to use truncated multipliers [41], [42].
When using conventional memory modules, the task of writing symbols to the edge memory is controlled by a state machine as in Fig. 8.
3) Implementation of the Sample Routine: One way to generate random integers in the $[0, L-1]$ range, when $L$ is variable and not necessarily a power of 2, is to first generate a random integer $r$ from a larger range and then compute the remainder of $r$ divided by $L$. Unfortunately, using an integer divider is not practical. Alternatively, we use an adapted version of the acceptance-rejection sampling method [43]. Since the Sample routine has to generate a sample at every invocation, we use a series of fallback values in case of rejection as shown in Fig. 9. Here $u$ is a realization of a uniform random variable; the figure uses a bit-slice notation for subsequences of the bits of $u$, indexed from the most significant bit, and a concatenation operator for joining the specified sequences of bits.
Three comparators are used in parallel to implement three acceptance-rejection tests. If the sample used in the first test passes the test it is routed to the output; otherwise we fall back to the sample on the next level, and so on. If all the tests fail, the last fallback value is one that is guaranteed to be a valid sample. One can create a longer chain of comparators for more uniform sampling; however, the tradeoff is a longer critical path for the computation of the output.
Fig. 9. A circuit for the Sample routine as implemented for AMSA-128, where the output is used as address on the edge memory in order to read a random symbol.
Observe that the final distribution of values of is not truly uniform in the mathematical sense, but rather biased towards values closer to . This is due to replacing the most significant bits in with those from . The bias can be directed to the other end of the interval by replacing the most significant bits of with zeros. Alternatively, the bias can be reduced by using a random bit to multiplex between the two options. Experimental results have shown that AMSA decoders converge faster to the correct codeword when sampling is biased as in the former case (as shown in Fig. 9), even when compared with uniform sampling. This is because the Add routine adds new incoming symbols at the end of the memory, resulting in a non-uniform ordering of the symbols, and in this case using more recent results seems more suitable. An alternative approach is to modify the Add routine to randomly interleave the symbols as it adds them to the memory, resulting in a slight improvement in performance, but at the cost of increased hardware complexity and additional memory operations.
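The fallback chain of acceptance-rejection tests can be modeled in software along the following lines. This is a sketch under stated assumptions rather than a transcription of Fig. 9: it assumes that each fallback level replaces one more most significant bit of the random word with the corresponding bit of the largest valid index, and that the final fallback is that largest valid index itself. The function name `sample_index` and the parameters `n_bits` and `n_tests` are hypothetical; the defaults (7 bits, three tests) are chosen to match a memory of up to 128 entries.

```python
def sample_index(length, r, n_bits=7, n_tests=3):
    """Return an index in [0, length) derived from a random word r.

    Assumed scheme (not taken verbatim from the paper): level k keeps the
    k most significant bits of (length - 1) and fills the remaining low
    bits from r. In hardware the tests would run in parallel; here they
    are evaluated in priority order.
    """
    if not 1 <= length <= (1 << n_bits):
        raise ValueError("length must be in [1, 2**n_bits]")
    top = length - 1  # largest valid index, always an acceptable sample
    for k in range(n_tests):
        mask = (1 << (n_bits - k)) - 1          # low bits taken from r
        candidate = (top & ~mask) | (r & mask)  # high bits taken from top
        if candidate < length:                  # acceptance test
            return candidate
    return top  # last fallback, valid by construction
```

With `length = 128` every 7-bit word is accepted at the first level, so the output is uniform. For other lengths the fallbacks concentrate probability near the end of the valid segment, which is the kind of bias towards recently added symbols discussed above.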
B. Check Node
The check node in stochastic decoders is considerably simpler than the SPA equivalent, as seen in (6), and it is a sum over , implementable directly with XOR gates. Besides the output messages, the check nodes have an additional output bit to signal whether the parity check is satisfied, which is determined based on the belief symbols of the variable nodes, rather than the edge outputs.
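Since addition in a binary extension field is a bitwise XOR of the symbol representations, the check node behavior can be sketched as below. This is a minimal model assuming that each outgoing message is the field sum of the other incoming symbols and that the parity flag tests whether the (already permuted) belief symbols sum to zero; the function names are hypothetical.

```python
from functools import reduce
from operator import xor


def check_node_outputs(inputs):
    """For each edge, return the XOR (field sum) of all other inputs.

    Symbols are integers whose bits are the binary representation of a
    field element; in characteristic 2, subtracting an input from the
    total is the same as XORing it again.
    """
    total = reduce(xor, inputs, 0)
    return [total ^ x for x in inputs]


def parity_satisfied(beliefs):
    """True when the belief symbols on all edges sum to zero."""
    return reduce(xor, beliefs, 0) == 0
```

For a degree-4 check node, `check_node_outputs` returns four symbols, each of which would make the parity check hold if the corresponding variable node adopted it.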
C. Permutation Block
On each edge of the Tanner graph (see Fig. 10 ) there are three multiplication-by-constant operations performed: one for the message from the variable node to the check node, another one for the reverse direction, and one for the belief message from the variable node to the check node. Even though the number of such blocks is large, as shown in Table V , the hardware complexity of each of them is small.
A multiplication by a constant is efficiently implemented by a -bit LUT, assuming is a power of 2. Modern FPGA platforms provide 6-input LUT resources which can be used directly for or combined for . A GF(64) permutation node is implemented using six 6-input LUTs, while the GF(256) version requires fifteen 6-input LUTs.
Fig. 10 (caption fragment). The multiplication factor is the non-zero value from matrix associated with this edge. Message is the variable node belief.
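To illustrate why a constant multiplier reduces to a lookup table, the sketch below tabulates multiplication by a fixed non-zero element of GF(64). The primitive polynomial x^6 + x + 1 is an assumption made for this example (the paper's field construction is not shown here), and the function names are hypothetical.

```python
def gf_mul(a, b, m=6, poly=0b1000011):
    """Multiply two elements of GF(2^m) in polynomial basis.

    `poly` encodes the reduction polynomial; the default x^6 + x + 1 is
    an assumed choice for GF(64).
    """
    result = 0
    while b:
        if b & 1:
            result ^= a      # add the current multiple of a
        b >>= 1
        a <<= 1              # multiply a by x
        if a & (1 << m):
            a ^= poly        # reduce modulo the field polynomial
    return result


def constant_multiplier_lut(h, m=6, poly=0b1000011):
    """Table mapping every field element x to h * x."""
    return [gf_mul(x, h, m, poly) for x in range(1 << m)]
```

Each of the six output bits of such a table is a Boolean function of the six input bits, which is consistent with the count of six 6-input LUTs per GF(64) permutation node given above. For a non-zero constant the table is a permutation of the field elements.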
D. Synthesis Results
The FPGA synthesis results for the GF(64) AMSA-128 and GF(256) AMSA-512 designs are given in Table VI , while Fig. 11 shows the floor plan of the FPGA chip for the GF(256) AMSA-512 decoder.
Synthesis results confirm the complexity analysis done in Section III-D and summarized in Table IV . The total amount of memory used scaled, as predicted, with . Similarly, the size of the variable node control logic, which is built using ALUTs and registers, scales, as predicted, with . Note that the number of memory blocks did not increase because in both cases the number of edge memories is the same, and each edge memory fits in a MD block.
E. ASIC-Specific Considerations
In this section, we combine the Add and Remove routines into a single memory write operation, based on the observation that since both routines are non-deterministic, there are three possible scenarios to consider (see Fig. 12 ). We also note that by designing a memory with a custom address decoder that enables multiple cells at the same time this write operation can be done in one cycle. Fig. 13 provides an architecture for such an SRAM memory. The Address Decoder used in the circuit is a standard address decoder. The Mask Overlay unit sets high all the address lines in the segment . Thus, having selected multiple SRAM cells, the data will be written to multiple locations of the memory in one cycle. Even though more than the necessary symbols are written (see Algorithm 1 line 8) in the segment of the memory, only will be in , the rest being outside and, thus, not having any effect.
We can consider the Address Decoder and Mask Overlay pair as a Custom Address Decoder for the SRAM memory block. In order to estimate how much larger the Custom Address Decoder is than the standard Address Decoder, both have been implemented in VHDL and compared in terms of logical resources used. In the case of the size increased by 43% while for the size increased by 38%. Considering that the custom address decoder represents only a part of the total area of an SRAM block (the rest being occupied by the SRAM cells, sense amplifiers, etc.), the overall area increase for an SRAM memory block is rather insignificant. The use of such custom SRAM memory enables an ASIC-specific architecture of the decoder, which, at the cost of additional memory controller complexity, increases throughput significantly (as shown in Section V-B).
V. SIMULATION RESULTS AND ANALYSIS
A. Performance
The performance of several configurations of the AMSA algorithm is given in Fig. 14 and Fig. 15 for the Additive White Gaussian Noise (AWGN) channel. As can be seen, the decoder configurations with larger values of have better performance. This is because when is larger the multiset representation is more precise, as shown in Section III-A.
Another parameter that affects performance is redecoding. In Fig. 14 and Fig. 15, the performance is compared for the same configuration with and without redecoding but keeping the total number of decoding cycles equal. For example, for the GF(64) AMSA-256 decoder the frame error rate at is with maximum allowed decoding cycles, while with 5 redecoding attempts of maximum decoding cycles each, the performance improves to . For the GF(64) code, AMSA-512 is only 0.04 dB away from the floating-point SPA performance at an FER of . On the other hand, when using the GF(256) version of the code, the difference between the SPA results and AMSA-1024 is about 0.2 dB at an FER of . Note that the SPA algorithm used here uses a floating-point representation, and a fully-parallel SPA decoder for the codes used in this work is, at the moment, impractical.
B. Throughput and Latency
As the SNR increases, the AMSA decoder takes fewer decoding cycles to complete decoding. This implies that the throughput is a function of the SNR, as shown in Fig. 16. To simplify the comparison of throughput between the conventional memory and custom memory architectures, we use the same clock frequency in both cases.
As Table I shows, the highest reported throughput for a hardware implementation for the GF(64) (192,96) code is 3.8 Mbit/s at an SNR of 2.4 dB [18] using EMS with the Bubble Check algorithm. We note that, in this case, AMSA presents a significant improvement in throughput.
It is interesting to compare the performance results of this implementation to the one in [28], which has a reported throughput of 1.15 Gbit/s and uses a code over GF(64). We note that the AMSA-256 GF(64) implementation achieves the same throughput at an SNR of approximately 3.1 dB, even though the clock frequency in this case is much lower. Additionally, the error correction performance in [28] is good in the 3.2-3.6 dB SNR range, but very limited in the 2.0-2.4 dB range used in this work.
In this case the limit on the number of decoding cycles is set to even though the average number of decoding cycles at of 2.4 dB is approximately 180. Note that while there are multiple possible redecoding configurations for a given total number of decoding cycles, some of these configurations perform better than others. In the case of the AMSA-256 GF(64) decoder, the best results seem to be obtained with configurations like , , or , , while configurations where generally perform poorly. This observation is confirmed by the settling curves presented in Fig. 17, which show the frame error rate settling slowly as the number of decoding cycles increases.
For the run-time complexity ASIC architecture a decoding cycle corresponds to one clock cycle, meaning that for a clock frequency of , the latency introduced by a decoder with is approximately 0.5 ms. In the case of SPA, the average number of iterations at of 2.4 dB is approximately 4 for both the GF(64) and GF(256) versions of the code. In this case latency cannot be directly estimated without knowing how many clock cycles an SPA iteration takes for the particular implementation, though a fully-parallel implementation of non-binary SPA decoders for these codes seems to be impractical.
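The relation between average decoding cycles, clock frequency, and coded throughput can be made concrete with a small calculation. The sketch below assumes that frames are decoded back to back with no loading or unloading overhead, and that one decoding cycle takes a fixed number of clock cycles (one for the custom memory architecture); the function names and the `clocks_per_decoding_cycle` parameter are hypothetical.

```python
def coded_throughput_bps(n_symbols, bits_per_symbol, f_clk_hz,
                         avg_decoding_cycles, clocks_per_decoding_cycle=1):
    """Coded bits decoded per second, ignoring any I/O overhead."""
    frame_bits = n_symbols * bits_per_symbol
    clocks_per_frame = avg_decoding_cycles * clocks_per_decoding_cycle
    return frame_bits * f_clk_hz / clocks_per_frame


def worst_case_latency_s(max_decoding_cycles, f_clk_hz,
                         clocks_per_decoding_cycle=1):
    """Time to exhaust the decoding-cycle budget for one frame."""
    return max_decoding_cycles * clocks_per_decoding_cycle / f_clk_hz


# GF(64) (192,96) code: 192 symbols of 6 bits, 108 MHz clock,
# roughly 180 decoding cycles on average at 2.4 dB.
estimate = coded_throughput_bps(192, 6, 108e6, 180)
print(f"{estimate / 1e6:.1f} Mbit/s")
```

With the rounded figure of 180 cycles this gives about 691 Mbit/s, which is in line with the 698 Mbit/s reported for the custom memory architecture; the small difference is expected because the average cycle count is only quoted approximately.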
VI. CONCLUSION
Based on a multiset representation for probability mass functions, the Adaptive Multiset Stochastic Algorithm was introduced and applied for the non-binary stochastic decoding of LDPC codes with . Additionally, the concept of redecoding was applied to non-binary LDPC decoding and shown to improve the decoding performance.
AMSA was used for the FPGA implementation of two fully-parallel LDPC decoders over GF(64) and GF(256). To the best of our knowledge, this is the first reported fully-parallel LDPC decoder over GF(256).
The GF(64) decoder achieves a coded throughput of 65 Mbit/s at when using conventional memory, and 698 Mbit/s at the same when using custom memory. At an FER of the GF(64) version is only 0.04 dB away from the floating-point SPA performance, while the difference for the GF(256) decoder is 0.2 dB. Furthermore, these implementations demonstrate the highest throughput reported for the particular codes used.
A memory architecture was proposed that reduces the run-time complexity of an AMSA decoding cycle to . The estimated throughput for a fully-parallel decoder using this architecture is 698 Mbit/s for the GF(64) decoder and 512 Mbit/s for the GF(256) decoder, both at 2.4 dB and at a clock frequency of 108 MHz.
There are several possible directions for future work, such as generalizing the algorithm for (especially if better non-binary codes over high-order fields become available), gaining a better understanding of the redecoding process and how to optimally choose the parameter, evaluating the impact of the presented decoding techniques on circuit power consumption, understanding why the decoding performance gap between SPA and AMSA is larger for GF(256) than for GF(64), improving the latency, etc.
Note that for all , and that it is always possible to create a large enough multiset such that
APPENDIX B MEMORY ACCESS SCHEDULE FOR FPGA IMPLEMENTATION
As shown in Fig. 2, each variable node uses three dual-port memory blocks: one to store the channel likelihoods table , and two to store the multisets and , respectively. The following table details the memory access schedule for the state machine in Fig. 8.
