A Joint Source/Channel Approach to Strengthen Embedded Programmable Devices against Flash Memory Errors by Maurizio Martina et al.
Maurizio Martina, Member, IEEE, Carlo Condo, Guido Masera, Senior Member, IEEE, Maurizio Zamboni
Abstract—Reconfigurable embedded systems can take advan-
tage of programmable devices, such as microprocessors and
FPGAs, to achieve high performance and flexibility. Support for
flexibility often comes at the expense of large amounts of non-volatile
memory. Unfortunately, non-volatile memories, such
as multi-level-cell (MLC) NAND flash, exhibit a high raw bit
error rate that is mitigated by employing different techniques,
including error correcting codes. Recent results show that low-
density-parity-check (LDPC) codes are good candidates to im-
prove the reliability of MLC NAND flash memories especially
when page size increases. This work proposes to use a joint
source/channel approach, based on a modified arithmetic code
and LDPC codes, to achieve both data compression and improved
system reliability. The proposed technique is then applied to the
configuration data of FPGAs and experimental results show the
superior performance of the proposed system with respect to
the state of the art. Indeed, the proposed system can achieve bit-error-rates
as low as about $10^{-8}$ for cell-to-cell coupling strength
factors well above 1.0.
Index Terms—arithmetic coding, LDPC coding, flash memo-
ries, FPGA
I. INTRODUCTION
Modern embedded systems can take advantage of the re-
configurability offered by programmable devices, such as mi-
croprocessors and Field-Programmable-Gate-Arrays (FPGAs).
Unfortunately, the amount of configuration data increases with
both the size of programmable resources and the number
of possible configurations, requiring a large non-volatile
memory, such as flash memory, to store them. Besides, very
high density Multi-Level-Cell (MLC) NAND-flash memories,
implemented with highly scaled technologies, have a reduced
noise margin and a high raw Bit-Error-Rate (BER) [1]. This
unreliability is related to threshold voltage susceptibility, which,
according to [2], is mainly caused by program disturb, read
disturb and limited retention time. Different strategies are employed
to counter these phenomena, e.g. [3], [4], where Low-Density-Parity-Check
(LDPC) codes and Data-Pattern-Aware (DPA) error prevention techniques
are exploited to counter cell-to-cell interference and vulnerable
threshold voltage levels, respectively. In particular, DPA error
prevention techniques could be used to remap the output of the LDPC
encoder to jointly improve system reliability. A further step to improve
the reliability is reported in [5], where the authors increase the
error correction capability of the LDPC code by using data
compression. However, experiments shown in [5] rely on a
negligible compression ratio (less than 5%) where the residual
BER after LDPC decoding is nearly the same as the BER on
decompressed data. On the contrary, when high compression
ratios are employed, a small residual BER on compressed
data leads to high BER on decompressed data. The novel
contribution of this work is to go one step beyond the solution
proposed in [5] and to propose a technique that improves
the BER performance even when higher compression ratios
are applied. Inspired by the idea developed in [6] for image
transmission over wireless channels, in this work we propose
a joint source/channel approach to combat MLC NAND-flash
errors, induced by cell-to-cell interference, and to apply it to
the compression of FPGA configuration data. The proposed
system exploits a modified arithmetic code (MAC) not only
to achieve high data compression but also to improve the BER
performance of LDPC codes. Experimental results show the
superior performance of the proposed system with respect to
state of the art techniques based on LDPC codes. To the best
of our knowledge, this is the first work that applies
joint source/channel techniques to concurrently compress data
to be stored in a flash memory and protect them against errors.
II. BACKGROUND AND PROPOSED TECHNIQUE
Efficient compression with arithmetic codes (ACs) is obtained
when input symbols are mutually unrelated. Unfortunately,
this is not always the case, and a reversible decorrelation
algorithm (DA) may be required. As an example, in [7],
[8] AC is used to compress the bitstream of Xilinx Virtex and
Virtex II Pro FPGAs. In [7] a DA tailored to the structure
of the bitstream is employed: the DA exploits the
presence of different regions (e.g. Configurable Logic Blocks,
I/O Blocks and Block RAMs) to find, for each region, the
most probable bit-pattern. Each pattern is then applied to the
proper region of the bitstream by performing a bitwise XOR
operation. The data are then coded by the AC. A similar approach
with a proper DA can be used for other devices as well.
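The XOR-based decorrelation described above can be illustrated with a minimal Python sketch. It is not the DA of [7]: the region names, the frame granularity and the choice of "most frequent frame" as the most probable pattern are assumptions made only for illustration.

```python
from collections import Counter


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def decorrelate(regions):
    """Map each region (a list of equal-length frames) to (pattern, residual frames)."""
    result = {}
    for name, frames in regions.items():
        # Assumption: the most probable pattern is the most frequent frame.
        pattern = Counter(frames).most_common(1)[0][0]
        result[name] = (pattern, [xor_bytes(f, pattern) for f in frames])
    return result


def restore(decorrelated):
    """Invert decorrelate(): XOR-ing with the same pattern recovers the frames."""
    return {name: [xor_bytes(f, pattern) for f in residual]
            for name, (pattern, residual) in decorrelated.items()}


# Hypothetical toy bitstream split into two regions.
regions = {
    "CLB": [b"\x0f\x00", b"\x0f\x00", b"\x0f\x01"],
    "BRAM": [b"\xff\xff", b"\xff\xff"],
}
decorrelated = decorrelate(regions)
```

After the XOR, frames that match the pattern become all-zero, which raises the probability of 0 seen by the arithmetic coder; since XOR is its own inverse, the operation is reversible as long as the patterns are stored.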
In general AC-based techniques can be described as follows.
Let $x = \{x_0, x_1, \ldots, x_{K-1}\}$ be the binary array representing
the data to be stored in the non-volatile memory (made of
$K$ bits) and $y = \{y_0, y_1, \ldots, y_{K-1}\}$ the result obtained by
applying an optional DA to achieve a stream of bits where
the occurrence of 0 is very high ($P_0 \geq 0.9$) with respect
to the occurrence of 1 ($P_1 = 1 - P_0$). The AC maps $y$
onto the sequence $z = \{z_0, z_1, \ldots, z_{K'-1}\}$, where $K' < K$
is the length of the compressed sequence $z$. The compressed
sequence $z$ is obtained as the result of a recursive process and
represents the probability interval of $y$. The AC initializes the
probability interval as $[0, 1)$ and, for each bit to be encoded,
selects the corresponding probability interval portion, $P_0$ for
0 or $P_1$ for 1. After $K$ iterations an interval $I(y)$, whose width
corresponds to the input sequence probability, is obtained and
encoded by means of the shortest binary sequence belonging
to it. This sequence represents the value $v(z) \in I(y)$. On the
other hand, the AC decoder progressively selects the intervals
represented by $v(z)$ and obtains $y$.
Figure 1. General block scheme of a Modified-Arithmetic-Coding-based (MAC-based) system for configuration data (CD) compression/error-correction with optional Decorrelation
Algorithm (DA) and Low-Density-Parity-Check (LDPC) codes. Details: i) MAC encoding process and probability interval update; ii) flash memory structure
and capacitance-coupling ($C_y$); iii) tree representation of the MAC decoding process.
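The interval narrowing described above can be sketched in Python with exact rational arithmetic. This is an idealized illustration rather than a practical coder: it uses unbounded precision instead of renormalization, and the function names are hypothetical.

```python
from fractions import Fraction
from math import ceil


def ac_interval(bits, p0):
    """Narrow [0, 1) once per input bit; return (low, width) of I(y)."""
    low, width = Fraction(0), Fraction(1)
    for b in bits:
        if b == 0:
            width *= p0          # keep the lower portion, of relative size P0
        else:
            low += width * p0    # skip the portion assigned to 0
            width *= 1 - p0      # keep the upper portion, of relative size P1
    return low, width


def shortest_codeword(low, width):
    """Shortest bit string z such that v(z) = 0.z (binary) lies in [low, low + width)."""
    n = 1
    while True:
        num = ceil(low * 2**n)   # smallest multiple of 2^-n not below low
        if Fraction(num, 2**n) < low + width:
            return format(num, "b").zfill(n)
        n += 1


low, width = ac_interval([0, 0, 1, 0], Fraction(9, 10))
code = shortest_codeword(low, width)
```

With $P_0 = 0.9$ the width of the final interval equals $0.9^3 \cdot 0.1$, the probability of the input sequence, and two coded bits are enough to identify it.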
In this work we propose to use the MAC [6], namely to
strengthen AC by using a ternary alphabet in the encoding.
The basic idea of the MAC is to have a forbidden symbol $\mu$,
which is never encoded and whose probability is fixed to an
arbitrary value $P_\mu = \epsilon$. The forbidden symbol perturbs the
width of the intervals corresponding to 0 and 1, leading to
$I_0 = P_0(1 - \epsilon)$ and $I_1 = P_1(1 - \epsilon)$ (see the left part of Fig. 1)
and providing a form of coding redundancy per encoded bit
$$\rho = -\log_2(1 - \epsilon). \tag{1}$$
At the decoder side, the presence of $\mu$ can be used for
error detection: if the MAC decoder detects the forbidden
symbol, then the compressed data contain errors. Let $\hat{z}$ be
the received compressed data containing errors; the coding
redundancy associated with the forbidden symbol can be used
by the decoder to select the best estimate of the encoded
sequence, received from the error-prone flash memory. Thus,
when errors occur, the decoder should find the path that
does not contain the forbidden symbol in the tree of all the
possible decoded sequences (see the right part of Fig. 1),
namely the decoder behaves as a Maximum-Likelihood (ML)
decoder. In other words, let $\tilde{y}$ be a generic $K$-bit sequence;
the decoder selects $\hat{y}$ among all possible $\tilde{y}$ sequences such
that $\hat{m} = P(\hat{z}|\hat{y}) \geq P(\hat{z}|\tilde{y}) = \tilde{m}$, where $\hat{z}$ is the received
stream of coded bits, $\tilde{z}$ is the arithmetic coded version of $\tilde{y}$
and contains $\tilde{K}' = K'$ bits. Since AC is a reversible process,
$\tilde{y}$ can be replaced by $\tilde{z}$ and we can rewrite the problem in
logarithmic form as
$$\ln[\tilde{m}] = \sum_{j=0}^{K'-1} \ln[P(\hat{z}_j|\tilde{z}_j)], \tag{2}$$
where $P(\hat{z}_j|\tilde{z}_j)$ is the transition probability of each bit of $\hat{z}$.
Finally, depending on the optional DA used at the encoder
side, the decoded sequence $\hat{y}$ is used to obtain $\hat{x}$, the best
estimate of $x$. Unfortunately, the complexity of exploring
the tree of all possible decoded sequences is prohibitive and
sub-optimal search strategies, such as the $M$-algorithm, must
be employed [6]. Indeed, the $M$-algorithm is a breadth-first
technique to explore the tree that limits the search space to $M$
paths at each depth in the tree. The ML decoder guides the
tree exploration and, for each node, the MAC decoding operations
shown in Algorithm 1 are computed, where $E_i$ is the left
endpoint of interval $I_i$ and the endpoints are initialized as $E_0 = 0$,
$E_1 = P_0(1 - \epsilon)$ and $E_2 = (1 - \epsilon)$. In particular, lines from 1 to
7 correspond to the MAC decoding, namely to output '0' ('1')
or to detect the forbidden symbol depending on the interval
in which $v(\hat{z})$ lies. Lines from 8 to 22 correspond to the
interval renormalization, namely to keep limited the precision
needed to represent the intervals. Finally, lines from 23 to 24 update
the $E_i$ values. As shown in Fig. 1 the proposed system includes
an LDPC code encoder and the corresponding decoder as
well. Namely, the AC compressed sequence $z$ is coded by
an LDPC encoder that produces $w = \{w_0, w_1, \ldots, w_{N-1}\}$
with $N > K'$. If errors occur then the sequence read from the
flash memory is $\hat{w}$ and the LDPC decoder produces $\hat{z}$, that is
the input of the MAC decoder.
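The role of the forbidden symbol in error detection can be illustrated with a short Python sketch. It is a simplified model, not the decoder used in this work: exact rational arithmetic replaces the renormalization of Algorithm 1, no tree search is performed, and the function names are hypothetical.

```python
from fractions import Fraction


def mac_interval(bits, p0, eps):
    """Interval (low, width) of a bit sequence when a share eps is forbidden."""
    low, width = Fraction(0), Fraction(1)
    for b in bits:
        if b == 0:
            width = width * p0 * (1 - eps)
        else:
            low = low + width * p0 * (1 - eps)
            width = width * (1 - p0) * (1 - eps)
    return low, width


def mac_decode(v, n_bits, p0, eps):
    """Decode n_bits from the value v; return None if the forbidden symbol is hit."""
    low, width = Fraction(0), Fraction(1)
    out = []
    for _ in range(n_bits):
        e1 = low + width * p0 * (1 - eps)   # boundary between symbols 0 and 1
        e2 = low + width * (1 - eps)        # start of the forbidden region
        if low <= v < e1:
            out.append(0)
            width = e1 - low
        elif e1 <= v < e2:
            out.append(1)
            low, width = e1, e2 - e1
        else:
            return None                     # forbidden symbol: errors detected
    return out


p0, eps = Fraction(9, 10), Fraction(1, 10)
bits = [0, 0, 1, 0]
low, width = mac_interval(bits, p0, eps)
decoded = mac_decode(low + width / 2, len(bits), p0, eps)
corrupted = mac_decode(Fraction(95, 100), len(bits), p0, eps)
```

A value inside the interval of the encoded sequence decodes correctly, whereas a value that falls in the unused share of the interval is flagged; in a tree search, such a path would be discarded.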
III. EXPERIMENTAL SETUP AND RESULTS
A. System parameter values
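Let $R_S$, $R_C$ and $R_M$ be the code-rate of the whole system,
the LDPC code and the MAC respectively, where
$$R_S = R_C R_M, \qquad R_M = \frac{1}{H + \rho}, \tag{3}$$
$\rho$ is the coding redundancy per encoded bit in (1)
and $H$ is the entropy of $y$ [6]. Then, combining (1) and (3)
we obtain
$$\epsilon = 1 - 2^{-(R_C/R_S - H)}. \tag{4}$$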
As a possible application of the proposed technique, this
work targets configuration data of FPGAs. Thus, we analyzed
several configuration data for Xilinx FPGAs, including the
ones in [7], [8], and we observed that $0.93 \leq P_0 \leq 0.95$,
corresponding to $0.29 \leq H \leq 0.37$. Moreover, since in [5] a
2 bit/cell NAND flash with 4 kB page ($K = 32768$) and a rate-15/16
LDPC code are considered, we fixed $R_S = 15/16$ and
$N = 34952$ in the following tests to have a fair comparison
with previous works. As shown in (4) the value of $\epsilon$ depends
on $R_C$ and vice-versa. It is worth noting that the minimum
possible $R_C$ is achieved when $\epsilon = 0$, that is $\rho = 0$. Thus,
from (3) we infer that $R_C \geq R_S H$; when $P_0 = 0.95$ this leads
to about $R_C \geq 1/4$. The opposite case is $R_C = 1$ (no LDPC
code is used) that in (4) gives $\epsilon \approx 0.42$ when $P_0 = 0.95$.
As a consequence, given an LDPC code with $1/4 \leq R_C \leq 1$
the MAC is configured to obtain $R_S = 15/16$. However,
fixing $R_S$ limits the flexibility of the proposed system and
the maximum compression ratio ($K'/K$) is about 11%.
The LDPC codes employed in this paper are regular ones with
$N = 34952$ and the column weight of the parity-check matrix $\mathbf{H}$ is 4 as in [3], [5].
The decoding algorithm is the layered normalized-min-sum
with a normalization factor of 0.75 and performs at most 30 iterations. The
data are represented in the form of Logarithmic-Likelihood-Ratios
(LLRs) as 2's complement values on 8 bits and, when
LDPC decoding is finished, are used to compute $\ln[P(\hat{z}_j|\tilde{z}_j)]$,
that is required in (2), resorting to the Max-Log approximation
routinely employed in turbo and LDPC code decoding:
$$\ln[P(\hat{z}_j|\tilde{z}_j = 1)] \approx \begin{cases} 0 & \text{if } \lambda[\hat{z}_j] \geq 0 \\ \lambda[\hat{z}_j] & \text{else} \end{cases} \tag{5}$$
$$\ln[P(\hat{z}_j|\tilde{z}_j = 0)] \approx \begin{cases} -\lambda[\hat{z}_j] & \text{if } \lambda[\hat{z}_j] \geq 0 \\ 0 & \text{else} \end{cases} \tag{6}$$
where $\lambda[\hat{z}_j]$ denotes the LLR of $\hat{z}_j$.
The maximum number of paths explored with the $M$-algorithm
has been set to 1024.
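The branch metrics in (5) and (6), the path metric in (2) and the pruning step of the M-algorithm can be sketched in Python as follows. The sketch assumes that a positive LLR favors bit 1, which is the convention implied by (5) and (6); the function names and the example LLR values are hypothetical.

```python
import heapq


def log_prob_bit(llr, bit):
    """Max-Log approximation of ln P(z_hat_j | z_tilde_j = bit), as in (5)-(6)."""
    if bit == 1:
        return 0 if llr >= 0 else llr
    return -llr if llr >= 0 else 0


def path_metric(llrs, candidate_bits):
    """Equation (2): sum of the per-bit log transition probabilities."""
    return sum(log_prob_bit(l, b) for l, b in zip(llrs, candidate_bits))


def keep_best(candidates, m):
    """Pruning step of the M-algorithm: keep the m paths with the largest metric."""
    return heapq.nlargest(m, candidates, key=lambda c: c[0])


llrs = [12, -7, 3, -20]                      # hypothetical LDPC decoder output
paths = [[1, 0, 1, 0], [1, 1, 1, 0], [0, 0, 0, 0]]
scored = [(path_metric(llrs, p), p) for p in paths]
survivors = keep_best(scored, 2)
```

A candidate that agrees with the sign of every LLR pays no penalty, while each disagreement costs the magnitude of the corresponding LLR, so unreliable bits are the cheapest to flip during the search.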
The MLC NAND flash memory has been modeled as
described in [3] for all-bit-line structures (shown in the middle
part of Fig. 1). Each cell of an MLC NAND flash can be
programmed to one of the possible $L$ levels with a threshold
voltage $V_t^{(k)}$ with $k = 0, \ldots, L-1$. The statistical distribution
of $V_t^{(k)}$ changes depending on the considered level. When
a cell is erased the threshold voltage $V_t^{(0)}$ can be modeled
as a Gaussian random variable with $\mu_e$ and $\sigma_e$ as mean
and standard deviation respectively. On the contrary, a cell
programmed at level $k$ ($k \neq 0$) has a threshold voltage
distribution that is uniform in the range $[V_p^{(k)}, V_p^{(k)} + \Delta V_{pp}]$,
where $V_p^{(k)}$ is the verify voltage for level $k$ and $\Delta V_{pp}$ is the
incremental program voltage step.
Algorithm 1 MAC decoding operations.
1: if $v(\hat{z}) \in [E_0, E_1)$ then
2:   $E'_0 \leftarrow E_0$, $E'_1 \leftarrow E_1$, output 0
3: else if $v(\hat{z}) \in [E_1, E_2)$ then
4:   $E'_0 \leftarrow E_1$, $E'_1 \leftarrow E_2$, output 1
5: else
6:   $\mu$ detected, exit
7: end if
8: done $\leftarrow 0$
9: while done $= 0$ do
10:   if $[E'_0, E'_1] \subset [0, 1/2)$ then
11:     $E'_0 \leftarrow 2E'_0$, $E'_1 \leftarrow 2E'_1$
12:     $v(\hat{z}) \leftarrow 2v(\hat{z})$
13:   else if $[E'_0, E'_1] \subset [1/2, 1)$ then
14:     $E'_0 \leftarrow 2(E'_0 - 1/2)$, $E'_1 \leftarrow 2(E'_1 - 1/2)$
15:     $v(\hat{z}) \leftarrow 2(v(\hat{z}) - 1/2)$
16:   else if $[E'_0, E'_1] \subset [1/4, 3/4)$ then
17:     $E'_0 \leftarrow 2(E'_0 - 1/4)$, $E'_1 \leftarrow 2(E'_1 - 1/4)$
18:     $v(\hat{z}) \leftarrow 2(v(\hat{z}) - 1/4)$
19:   else
20:     done $\leftarrow 1$
21:   end if
22: end while
23: $E_0 = E'_0$, $I' = E'_1 - E'_0$
24: $E_1 = E'_0 + P_0(1 - \epsilon)I'$, $E_2 = E_1 + P_1(1 - \epsilon)I'$
Unfortunately, cell-to-cell interference caused by parasitic
capacitance-coupling effects (see the middle part of Fig. 1)
distorts the threshold voltage distribution of
programmed cells. As argued in [3] this effect can be taken
into account considering vertical capacitance-coupling among
different word-lines only ($C_y$), and introducing i) the vertical
coupling ratio $\gamma_y = C_y / C_{Tot}$, where $C_{Tot}$ is the total
capacitance of the victim cell, ii) the cell-to-cell coupling
strength factor $s$ to obtain $\gamma'_y = \gamma_y \cdot s$. Thus, the threshold
voltage of a cell is perturbed when one vertical neighboring
cell is programmed. Programming an erased cell to level $k$
generates a threshold voltage shift $\Delta V_t^{(k)} = V_t^{(k)} - V_t^{(0)}$ and
induces on a victim cell a cell-to-cell interference $\gamma'_y \Delta V_t^{(k)}$.
Assuming $V_t^{(k)}$ and $V_t^{(0)}$ as independent random variables, and
that the probability that a cell is programmed to a certain level is
uniform, the distribution of the total cell-to-cell interference
can be derived [3] and used to compute $\lambda[\hat{w}_i]$, the LLR of the
$i$-th bit stored in a cell. As in [3], [5] we target a 2 bit/cell
NAND flash with the standard Gray code mapping, namely 11,
10, 00 and 01 correspond to levels 0, 1, 2, and 3 respectively.
The other parameters are equal to the ones used in [3].
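The threshold voltage model and the vertical cell-to-cell interference described in this subsection can be sketched in Python as a simple sampler. All numerical values below are hypothetical placeholders chosen for illustration, since the actual parameters are those of [3]; the function names are hypothetical as well.

```python
import random

# Hypothetical parameters in volts; the paper takes its values from [3].
MU_E, SIGMA_E = 1.4, 0.35          # erased-state mean and standard deviation
V_P = {1: 2.6, 2: 3.2, 3: 3.93}    # verify voltage of each programmed level
DELTA_VPP = 0.2                    # incremental program voltage step
GAMMA_Y = 0.08                     # vertical coupling ratio C_y / C_Tot


def sample_vt(level, rng):
    """Threshold voltage of an isolated cell: Gaussian if erased, uniform otherwise."""
    if level == 0:
        return rng.gauss(MU_E, SIGMA_E)
    return rng.uniform(V_P[level], V_P[level] + DELTA_VPP)


def victim_vt(victim_level, neighbor_level, s, rng):
    """Victim threshold voltage after one vertical neighbor has been programmed."""
    v = sample_vt(victim_level, rng)
    if neighbor_level != 0:
        # Shift of the neighbor from its erased state to the programmed level.
        shift = sample_vt(neighbor_level, rng) - sample_vt(0, rng)
        v += GAMMA_Y * s * shift   # interference scaled by gamma'_y = gamma_y * s
    return v


rng = random.Random(0)
isolated = sample_vt(2, rng)
no_coupling = victim_vt(1, 3, 0.0, rng)
```

Sweeping $s$ in such a sampler widens the distribution of programmed cells, which is the effect that the BER curves of the next subsection are plotted against.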
B. Simulation results
In Fig. 2 BER versus cell-to-cell coupling strength factor
$s$ performance is shown for different cases. The solid star-marked
and diamond-marked curves, referred to as “LDPC
only”, show the performance of the rate-19/20 ($K = 32794$,
$N = 34520$) and the rate-15/16 ($K = 32768$, $N = 34952$)
LDPC codes used in [3] and [5] respectively. On the other
hand, the solid curves with square and circle marks represent
the case $R_C = 9/10$, $M = 1024$ for $P_0 = 0.93$ and $P_0 = 0.95$
respectively. As can be observed, these test cases achieve
significantly better BER performance than the LDPC only
case. Even better results are achieved with $R_C = 5/6$ (solid
curves with cross and plus marks). Since $M$ represents the
maximum number of paths explored at each depth in the tree
of all the possible decoded sequences, it is directly related to
the speed, the amount of memory and the complexity of the
MAC decoder. In order to increase the speed and to reduce
the complexity of the MAC decoder, the same test cases have
been considered with $M = 128$ (dashed curves in Fig. 2).
As can be observed, even if there is a performance loss
with respect to the case with $M = 1024$, the proposed system
still performs better than the LDPC only case. Experimental
results for $R_C = 1/4$, corresponding to $\epsilon = 0$, and $R_C = 1$,
corresponding to $\epsilon = 0.42$, are not shown as they achieve a
BER in the order of $10^{-4}$ in all the considered cell-to-cell
coupling strength factor range. It is worth noting that these
results are still valid for other applications, given that system
parameters belong to the same ranges.
Figure 2. BER versus cell-to-cell coupling strength factor s comparison. Axes: strength factor $s$ (0.9 to 1.6) versus BER ($10^{-9}$ to $10^{-1}$). Curves: LDPC only $R_S = 19/20$ [3]; LDPC only $R_S = 15/16$ [5]; $P_0 \in \{0.93, 0.95\}$ with $R_C \in \{9/10, 5/6\}$ and $M \in \{1024, 128\}$.
C. Implementation area and throughput discussion
Several LDPC decoder architectures described in the liter-
ature can be used to implement the proposed system, e.g. [9],
[10]. On the contrary, to the best of our knowledge, the only
work addressing an implementation of the MAC decoder is
[11]. However, in [11] the MAC decoder acts as a MAP decoder,
whereas the proposed technique relies on an ML decoder.
Thus, we simplified the architecture used in [11] by removing
the backward unit required to implement the BCJR algorithm.
As in [11], the MAC decoder architecture contains two AC
decoders to perform 0 and 1 extension in the tree exploration,
a forward unit to implement (2), and two FIFOs to store the status
of the AC in the $M$ explored nodes. As argued in [11], since
the first FIFO stores the metrics obtained from 0 extension and
the second FIFO the metrics belonging to 1 extension, they
are sorted. Thus, with a comparator one can compute the $M$
candidates avoiding the use of a large sorter.
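The selection of the $M$ candidates from two sorted FIFOs can be illustrated in software with a two-way merge that performs one comparison per selected candidate. This is only a behavioral sketch of the idea, not the hardware of [11]; it assumes that each FIFO holds metrics in decreasing order, and the function name is hypothetical.

```python
def best_m(fifo0, fifo1, m):
    """Pick the m largest metrics from two lists sorted in decreasing order."""
    out = []
    i = j = 0
    while len(out) < m and (i < len(fifo0) or j < len(fifo1)):
        # A single comparison between the two FIFO heads decides the next survivor.
        if j >= len(fifo1) or (i < len(fifo0) and fifo0[i] >= fifo1[j]):
            out.append(fifo0[i])
            i += 1
        else:
            out.append(fifo1[j])
            j += 1
    return out


# Hypothetical metrics of the 0-extensions and 1-extensions of three paths.
survivors = best_m([-1, -4, -9], [-2, -3, -20], 4)
```

Because both inputs are already ordered, the cost grows linearly with $M$ instead of requiring a full sort of the $2M$ extended paths.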
According to [5] an LDPC decoder for NAND flash memories
with $R_S = 15/16$, as the one we considered, requires 1.32
mm$^2$ of area on a 65 nm standard cell technology [12] and,
with a clock frequency of 300 MHz, can sustain a throughput
of a few Gb/s. Moreover, in [5] it is shown that adding rate
adaptivity comes at a negligible area overhead (1.36 mm$^2$
for a dual-rate decoder). The implementation of the MAC
architecture proposed in [11] for $M = 128$ works with a
sliding-window-based scheduling. When configured as an ML
decoder with a window of 125 data, the MAC architecture
requires nearly 2 mm$^2$ on a 90 nm standard cell technology
(about 1.5 mm$^2$ on a 65 nm technology). Given that a 300
MHz clock frequency is employed for the MAC architecture,
it sustains a throughput of 1.36 Mb/s. As a consequence,
the proposed system recovers a 10 MB bitstream in about a
minute. This value is compatible with the reconfiguration time
measured in [13] for Xilinx Virtex II Pro devices. Indeed, in
[13] it is shown that reconfiguring a Xilinx Virtex II Pro FPGA
with a 14.6 kB bitstream requires 101 ms, corresponding to
about 70 s for a 10 MB bitstream.
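The timing figures above can be checked with a few lines of Python. The sketch assumes decimal units (1 MB = $10^6$ bytes, 1 kB = $10^3$ bytes) and a reconfiguration time that scales linearly with the bitstream size; the variable names are hypothetical.

```python
def transfer_time_s(size_bytes, throughput_bit_per_s):
    """Time needed to process size_bytes at the given throughput."""
    return size_bytes * 8 / throughput_bit_per_s


BITSTREAM_BYTES = 10e6                        # 10 MB bitstream

# MAC decoder sustaining 1.36 Mb/s.
mac_time_s = transfer_time_s(BITSTREAM_BYTES, 1.36e6)

# Reconfiguration time of [13]: 101 ms for 14.6 kB, scaled linearly.
reconfiguration_time_s = 0.101 * BITSTREAM_BYTES / 14.6e3
```

Under these assumptions the MAC decoder needs roughly 59 s and the reconfiguration roughly 69 s, so the two figures are of the same order.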
IV. CONCLUSIONS
In this work a joint source/channel technique to improve
the robustness of configuration data of programmable devices
against error occurrence in MLC NAND flash memories has
been presented. Inspired by the joint source/channel paradigm,
this work shows that, by concatenating a MAC and an LDPC
code, both data compression and improved error correction
capability can be achieved with a limited area overhead.
The proposed system features superior BER performance
with respect to state of the art techniques relying on LDPC
codes and achieves BER values of about $10^{-8}$ for cell-to-cell
coupling strength factors $s$ well above 1.0.
REFERENCES
[1] C. Zambelli, D. Bertozzi, A. Chimenton, and P. Olivo, “Non
volatile memory partitioning scheme for technology-based performance-
reliability trade-off,” IEEE Embedded Systems Letters, vol. 3, no. 1, pp.
13–15, Mar 2011.
[2] N. Mielke, T. Marquart, N. Wu, J. Kessenich, H. Belgal, E. Schares,
F. Trivedi, E. Goodness, and L. R. Nevill, “Bit error rate in NAND
flash memories,” in IEEE International Reliability Physics Symposium,
2008, pp. 9–19.
[3] G. Dong, N. Xie, and T. Zhang, “On the use of soft-decision error cor-
rection codes in NAND flash memory,” IEEE Transactions on Circuits
and Systems I, vol. 58, no. 2, pp. 429–439, Feb 2011.
[4] J. Guo, Z. Chen, D. Wang, Z. Shao, and Y. Chen, “DPA: A data pattern
aware error prevention technique for nand flash lifetime extension,” in
Asia and South Pacific Design Automation Conf., 2014, pp. 592–597.
[5] N. Xie, G. Dong, and T. Zhang, “Using lossless data compression in
data storage systems: Not for saving space,” IEEE Transactions on
Computers, vol. 60, no. 3, pp. 335–345, Mar 2011.
[6] M. Grangetto, B. Scanavino, G. Olmo, and S. Benedetto, “Iterative
decoding of serially concatenated arithmetic and channel codes with
JPEG2000 applications,” IEEE Transactions on Image Processing,
vol. 16, no. 6, pp. 1557–1567, Jun 2007.
[7] M. Martina, G. Masera, A. Molino, F. Vacca, L. Sterpone, and M. Violante,
“A new approach to compress the configuration information
of programmable devices,” in Design Automation and Test in Europe
conference, 2006, pp. 6–10.
[8] L. Sterpone and M. Violante, “A new decompression system for the
configuration process of SRAM-based FPGAs,” in ACM Great Lakes
Symposium on VLSI, 2007, pp. 241–246.
[9] J. Kim and W. Sung, “Rate-0.96 LDPC decoding VLSI for soft-decision
error correction of NAND flash memory,” IEEE Transactions on VLSI,
vol. 22, no. 5, pp. 1004–1015, May 2014.
[10] C. Condo, M. Martina, and G. Masera, “VLSI implementation of a multi-
mode turbo/LDPC decoder architecture,” IEEE Transactions on Circuits
and Systems - I, vol. 60, no. 6, pp. 1441–1454, Jun 2013.
[11] S. Zezza, S. Nooshabadi, and G. Masera, “A 2.63 Mbit/s VLSI im-
plementation of SISO arithmetic decoders for high performance joint
source channel codes,” IEEE Transactions on Circuits and Systems I,
vol. 60, no. 4, pp. 951–964, Apr 2013.
[12] A. Pulimeno, M. Graziano, and G. Piccinini, “UDSM trends comparison:
From technology roadmap to UltraSparc Niagara2,” IEEE Trans. on
VLSI, vol. 20, no. 7, pp. 1341–1346, Jul 2012.
[13] K. Papadimitriou, A. Anyfantis, and A. Dollas, “An effective framework
to evaluate dynamic partial reconfiguration in FPGA systems,” IEEE
Transactions on Instrumentation and Measurement, vol. 59, no. 6, pp.
1642–1651, Jun 2010.
