AES is one of the most popular encryption and decryption algorithms for data security applications. Meanwhile, data randomization (or "homogeneous") technology has been applied to reduce the bit error rate (BER) of MLC and TLC flash memory. Here, the AES algorithm was found to be an efficient replacement for the orthogonal polynomials that normally carry out the homogeneous function by scrambling data. This paper puts forward a novel hardware architecture in which an embedded AES hardware engine provides both homogeneous and data encryption/decryption functions concurrently, eliminating the randomization engine based on a Linear Feedback Shift Register (LFSR). This simplifies the flash controller and reduces the die size, because an independent homogeneous hardware engine is no longer necessary in a flash memory system that already embeds the AES security algorithm. Finally, an SSD controller designed with this architecture was silicon proven.
Introduction
NAND flash memory dominates today's mobile storage market. For example, SD/MMC cards with flash memory inside replaced film in cameras all over the world ten years ago. Likewise, the Solid State Disk (SSD) is now set to replace the traditional motor-driven hard disk in high-end computing systems including, but not limited to, notebook computers and network servers.
It is well known that NAND flash itself is not a perfect medium. It contains a large number of randomly scattered bad blocks and requires on-the-fly error correction. These limitations are dramatically worse in Multi-Level Cell (MLC) and Triple-Level Cell (TLC) NAND flash than in Single-Level Cell (SLC) flash. Therefore, MLC flash demands a higher error correction (ECC) ability than SLC flash. Many ECC algorithms have been developed in the industry. BCH, proposed separately by Bose, Ray-Chaudhuri and Hocquenghem [1], is regarded as one of the most popular. Many of today's flash controllers declare an ECC ability of 24 bit, 48 bit or even 72 bit per 1 KB payload with the BCH algorithm. Disappointingly, there are still failure cases in MLC and TLC flash memory caused by a BER beyond even this enhanced ECC ability. Low-Density Parity-Check (LDPC) codes were then applied, with higher expectations of ECC ability.
Based on a study of the MLC and TLC bit error models and their physical working mechanism, a technology called "Homogeneous" is applied to reduce the probability of bit errors being generated and to relieve the pressure on the ECC algorithm. Although homogeneous is a concept that is also applied in other fields, it plays a more important role in flash controllers.
In the meantime, data security is becoming more important in today's storage systems. The Advanced Encryption Standard (AES) [2] is the well-known algorithm for data security.
Here an AES engine is designed as an embedded module that not only performs encryption/decryption but also concurrently carries out the homogeneous function to reduce the flash memory bit error rate, so it is not necessary to embed both an AES engine and an independent homogeneous engine in SD/MMC or SSD controllers.
Homogeneous technology
Bit errors in the basic flash memory are regarded as randomly distributed. This is accurate enough for SLC flash memory and a decent estimate for the MLC or TLC BER. The situation is very similar to BER analysis in communication systems. However, the much higher bit error rate in MLC and TLC compared with that of SLC is obviously not caused by this kind of random bit error mechanism; "Burst Error" and "Inter-Page Error" become the dominant error cases.
There are two logic bits stored in a physical memory cell in MLC flash memory. These two bits are actually recognized by four voltage levels on the same floating gate. This means SLC has a larger noise margin. Therefore, the much higher bit error rate is doubtlessly an inherent vice of MLC and TLC.
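The noise-margin argument can be illustrated with a rough sketch. It assumes, purely for illustration, that the voltage levels are evenly spaced across a fixed, normalized threshold-voltage window; the function name `level_spacing` and the 1.0 window are hypothetical and do not come from any flash datasheet.

```python
def level_spacing(bits_per_cell, window=1.0):
    """Spacing between adjacent voltage levels in a fixed window.

    Assumption: levels are evenly spaced, which real devices only approximate.
    """
    levels = 2 ** bits_per_cell          # SLC: 2, MLC: 4, TLC: 8
    return window / (levels - 1)


for name, bits in (("SLC", 1), ("MLC", 2), ("TLC", 3)):
    print(f"{name}: {2 ** bits} levels, spacing {level_spacing(bits):.3f}")
```

Under this simplified model the spacing shrinks from the full window for SLC to one third for MLC and one seventh for TLC, which is why the same amount of disturbance causes more misread bits.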
In communication systems, random errors and burst errors can be suppressed by scrambling the data packets. Here, the same technology, the so-called Homogeneous, is introduced to deal with bit errors in flash memory. Most of the mathematical algorithms that were originally developed as white noise sources can be applied for the Homogeneous purpose because the basic theory is the same.
It should be noted that Homogeneous cannot correct any error bits. Homogeneous only helps distribute error bits more evenly and cuts the "peak" BER down to an average level. Assume a BCH algorithm protects a 1 K byte payload, and a certain TLC flash memory requires an ECC ability of K bits, defined by the peak number of error bits. With homogeneous, a flash controller can handle it with an ECC ability as small as 2K/3 or even K/2 bits.
Here is an example of an algorithm that realizes homogeneous (randomization) by data scrambling with a Linear Feedback Shift Register (LFSR) [3]. The Fibonacci implementation of the polynomial in Eq. (1) can be realized in hardware as a free-running LFSR. In Fig. 1, the initial value of the LFSR seed (D0-D15) shall be FFFF.
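A minimal software sketch of such a scrambler follows. The polynomial of Eq. (1) is not reproduced here, so the taps below (x^16 + x^14 + x^13 + x^11 + 1, a commonly cited 16-bit feedback polynomial) are an assumption, as are the function names `keystream` and `scramble`; only the 16-bit width and the FFFF seed come from the text.

```python
def keystream(n_bytes, seed=0xFFFF):
    """Generate n_bytes of pseudo-random data from a 16-bit Fibonacci LFSR.

    Assumed taps: x^16 + x^14 + x^13 + x^11 + 1 (not necessarily Eq. (1)).
    """
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("an all-zero seed locks the LFSR")
    out = bytearray()
    for _ in range(n_bytes):
        byte = 0
        for _ in range(8):
            byte = (byte << 1) | (state & 1)            # output bit
            fb = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
            state = (state >> 1) | (fb << 15)           # shift in feedback
        out.append(byte)
    return bytes(out)


def scramble(data, seed=0xFFFF):
    """XOR data with the LFSR keystream; the same call descrambles."""
    return bytes(d ^ k for d, k in zip(data, keystream(len(data), seed)))


page = bytes(2048)                   # worst case: an all-zero 2 K byte page
stored = scramble(page)              # pattern actually written to flash
assert scramble(stored) == page      # reading back restores the data
```

Because scrambling is a plain exclusive OR with a reproducible sequence, the read path only needs the same seed to recover the original data.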
NAND flash memory is accessed in units of a "Page", whose size depends on the flash memory part number, for example 2 K bytes or 8 K bytes. Unfortunately, the bits in the same cell of MLC or TLC flash memory are not located on the same page. The 2-bit information in each cell of MLC flash is distributed over two associated pages: the LSB page (Least Significant Bit Page) and the MSB page (Most Significant Bit Page). These two associated pages are called Paired Pages or Shared Pages. Flash memory chips produced by different vendors, or even different flash part numbers from the same vendor, have different paired page structures. For example, MT29F64G08CBAA, an MLC NAND flash memory by Micron [4, 5], has the Paired Page structure shown in Fig. 2.
Because the 2-bit information is held in the same floating gate, the states of the two pages affect one another. This kind of error happens between the paired pages and is therefore called "Inter-Page Error".
Homogeneous is enhanced here to act on the paired pages through a group of orthogonal polynomials, and thereby eliminate, or at least minimize, the influence between the paired pages. Each page is scrambled by one of the polynomials. The property of orthogonal polynomials makes the paired pages "separated" (shown in Fig. 3). Here, Ki is a logic switch selecting the corresponding data stream, and each polynomial is a hardware engine consisting of D flip-flops and exclusive OR gates in a real silicon chip. The Homogeneous function is normally implemented by a group of polynomial engines. It can be made simpler with only one polynomial if that polynomial can provide a group of white noise patterns separated by a distance longer than the flash page size. The polynomial (1), with its physical circuit in Fig. 1, is a good example that meets this requirement. Starting from different seeds, which are the reset values of the D flip-flops, the polynomial (1) can generate a group of vectors, and each vector (pattern) can be applied to homogenize the corresponding flash page.
Assume the target flash is of TLC type with 256 pages per block and 8 K bytes per page. At least eight seeds should be prepared for eight vectors, and each vector with its corresponding seed is applied to one flash page, because the paired pages consist of eight physical flash pages. Of course, the vectors should be kept 8 K bytes apart. A list of eight seeds is given in Fig. 4 as an example.
Efficiency of homogeneous
A lab test result is shown in Fig. 5. The test target is two flash memory chips randomly selected from one thousand samples of Micron's TLC type flash, which had 64 G bit density with the following features: Each flash chip was tested with 15 rounds of whole space scanning. This amounts to a total of $K = 1.2 \times 10^{8}$ BCH error detections. In Fig. 5, each point shows the number of detections falling in a given range of error bits. For example, point X is the number of detections N for which the BCH result indicated 5, 6, 7 or 8 error bits. The number of detections is rescaled as "Log (N)". N is very close to K because the majority of BCH results indicated no error or fewer than 4 error bits. From the curve in Fig. 5, the ECC ability has to exceed 60 bit per 1 K byte payload to obtain minimum reliability when homogeneous is not applied, while a 40 bit or even 36 bit ECC ability is good enough with Homogeneous.
Data security by AES algorithm
The Advanced Encryption Standard (AES) is a specification for the encryption of electronic data established by the U.S. National Institute of Standards and Technology (NIST) in 2001. Having been adopted by the U.S. government, AES superseded the Data Encryption Standard (DES) and became one of the most popular data security algorithms. Basically, it is a symmetric-key algorithm, in which the same key is used for both encrypting and decrypting the data.
Federal Information Processing Standards Publication 197 (FIPS-197) describes the AES algorithm in detail. As FIPS-197 defines it, the AES encryption algorithm uses fairly straightforward techniques for substitution and permutation, except for the MixColumns routine. The MixColumns routine uses special addition and multiplication. The addition and multiplication used by AES are based on mathematical field theory. In particular, AES is based on a field called GF($2^8$). For the AES algorithm, the irreducible polynomial is:
$m(x) = x^8 + x^4 + x^3 + x + 1$
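The field arithmetic can be made concrete with a short sketch. It uses the AES reduction polynomial from FIPS-197, $x^8 + x^4 + x^3 + x + 1$ (0x11B as a bit mask); the function names `gf_add` and `gf_mul` are illustrative only.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1


def gf_add(a, b):
    """Addition in GF(2^8) is a bitwise exclusive OR."""
    return a ^ b


def gf_mul(a, b):
    """Multiply two bytes as polynomials over GF(2), reduced modulo AES_POLY."""
    result = 0
    while b:
        if b & 1:              # add the current multiple of a
            result ^= a
        a <<= 1                # multiply a by x
        if a & 0x100:          # degree 8 reached: reduce
            a ^= AES_POLY
        b >>= 1
    return result


print(hex(gf_mul(0x57, 0x83)))  # 0xc1, the worked example in FIPS-197
```

In hardware both operations reduce to exclusive OR gates and shifts, which is why MixColumns maps naturally onto the same kind of logic as an LFSR.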
For both its Cipher and Inverse Cipher, the AES algorithm uses a round function composed of four different byte-oriented transformations: 1) SubBytes() transformation: it is a non-linear byte substitution that operates independently on each byte of the state using a substitution table (S-box). This is shown in Fig. 6. The other three transformations defined in FIPS-197 are ShiftRows(), MixColumns() and AddRoundKey(). Several modes of operation have been defined so that a block cipher can provide confidentiality for messages of arbitrary length. They are the ECB, CBC, OFB and CFB modes [6]. These modes, with the exception of ECB, require an initialization vector (IVector), a sort of 'dummy block', to kick off the process for the first real block and also to provide some randomization for the process.
In most cases the IVector does not need to be secret, but it is important to make sure that an IVector is never reused with the same key. In CBC mode, the IVector must, in addition, be randomly generated during encryption. Fig. 7 is a diagram showing the normal implementation. First, the data is encrypted by AES, and then homogenized by an exclusive OR operation with the output stream selected from a group of orthogonal polynomials.
Double functions of AES engine in flash controller
Although the AES algorithm was not developed to randomize signals or produce white noise, it does encrypt data and make the data "unpredictable" for attackers. AES encrypted data can therefore be regarded as random patterns, which is the theoretical target of Homogeneous. Thus, it is worthwhile to study AES encryption as a homogeneous algorithm. If the result is positive, a designer can get rid of the polynomials for homogeneous in any system that already contains AES security. A group of orthogonal polynomials, or a set of different seeds for one polynomial, is assigned to break the interactions among the paired pages. For the AES method, a unique security key for each of the paired pages provides the same benefit. Of course, such keys should be functions of the page number, expressed as follows:
Furthermore, if a system designer does not like "changing keys" because of the risk or difficulty of key management, there is a simpler alternative: changing the IVector:
For the AES-256 algorithm, the key is 256 bits and the IVector is 128 bits. Here, two simple but effective assignments are applied for lab verification:
(1) Key = 16 byte original key + page number. Assume the original key is 256'h1234_0000_0000_*_0000; then the encryption keys for the pages in a physical block are as follows:
256'h1234_0000_0000_*_0000,
256'h1234_0000_0000_*_0001,
256'h1234_0000_0000_*_0002,
256'h1234_0000_0000_*_0003,
......
(2) IVector = any preset value + page number. The simplest preset value is zero, so the IVectors are as follows:
128'h0000_0000_0000_0000,
128'h0000_0000_0000_0001,
128'h0000_0000_0000_0002,
128'h0000_0000_0000_0003,
......
Fig. 8 is the diagram of using AES as a double function engine for both data security and Homogeneous. All IVectors in Fig. 8 can have preset values as described above. The test result given in Fig. 9 shows that AES suppresses the bit errors in a TLC flash application just as the polynomial method does. Curve 0 is the test result without Homogeneous, and Curve 1 is the result with homogeneous by a polynomial with a set of seeds. These two curves are the same as those in Fig. 5. Although the difference is tiny, Curve 2 and Curve 3 are both slightly lower than Curve 1, indicating that the two AES implementations perform slightly better than the polynomial method. Therefore, the first benefit of the AES method is that it gets rid of special polynomials for homogeneous in data secured NAND flash memory applications.
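The two per-page assignments used for lab verification can be sketched as follows. The sketch assumes that the per-page value is the base value plus the page number (for the example values, whose low bits are zero, an exclusive OR would give the same result), and that values are stored big-endian; `page_key` and `page_iv` are hypothetical helper names, and the AES call itself is left out.

```python
KEY_BITS, IV_BITS = 256, 128
ORIGINAL_KEY = 0x1234 << (KEY_BITS - 16)   # 256'h1234_0000_..._0000
PRESET_IV = 0                              # simplest preset value


def page_key(page_number, original_key=ORIGINAL_KEY):
    """Assignment (1): a distinct 256-bit key for every page."""
    value = (original_key + page_number) % (1 << KEY_BITS)
    return value.to_bytes(KEY_BITS // 8, "big")


def page_iv(page_number, preset=PRESET_IV):
    """Assignment (2): one fixed key, a distinct 128-bit IVector per page."""
    value = (preset + page_number) % (1 << IV_BITS)
    return value.to_bytes(IV_BITS // 8, "big")


for page in range(4):
    print(page, page_key(page).hex()[-8:], page_iv(page).hex()[-8:])
```

Either value would then be handed to the AES engine together with the page data. Because pages of the same paired-page group always receive different keys or IVectors, identical plaintext on paired pages still produces unrelated stored patterns.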
Another benefit of homogeneous is that, whether it is implemented by AES with double functions or by scrambling with polynomials, it lowers the peak number of bit errors. This greatly helps reduce the gate count or silicon die size, and finally cuts the cost.
The complexities of the syndrome computation and Chien search in the BCH ECC engine are proportional to the number of correctable errors and the codeword length. A BCH engine that can detect and correct a maximum of E bit errors needs an equivalent gate count G for a given semiconductor process. G can be roughly estimated as follows:
Here, g is a constant for a given process. In the above sample, without homogeneous or AES, the TLC flash required a 72 bit or even higher ECC ability. Homogeneous with AES or polynomials brought it down to 48 bit for equivalent reliability. The logic synthesis result is shown in Table I: the gate count of the 48 bit ECC engine is 75% of that of the 72 bit ECC engine.
For a high speed implementation, the ECC engine consumes a large share of the real silicon die area. This is significant for Solid State Disk (SSD) controller design because the controller manages multiple channels of flash memory. The most common SSD controller specifications declare eight or ten channels running in parallel. That is to say, an SSD controller must be embedded with eight or even more ECC hardware modules. Reducing the gate count of the ECC engine is therefore very helpful in designing a cost-effective SSD controller.
Fig. 9. AES took the role of homogeneous with high performance.
Furthermore, an SSD controller with a security function must contain an AES engine, and an AES engine can also randomize data, so an independent homogeneous hardware engine does not need to be implemented. In this way, the SSD controller becomes more cost effective by getting rid of the group of orthogonal polynomials for homogeneous in Fig. 3. Although the gate count (or die size) reduction in each channel is not large (almost 1% compared with that of the ECC engine in Table I), the whole architecture becomes simpler.
Finally, an SSD controller for TLC flash memory may have an area, including the AES algorithm and an ECC engine with a maximum endurance of 72 error bits, of 15856213 according to Table I. An SSD controller with a data security function was designed with the methodologies described in this paper. A total of five BCH ECC modules, each able to correct 48 error bits, were embedded. The AES engine was designed to play the roles of both data security and homogeneous concurrently, with performance up to 300 MB/s. Fig. 10 is a snap-shot of the real silicon chip under a microscope. This SSD controller, measuring 3628 µm × 3956 µm in die size, has one SATA-II interface and ten flash channels in total. The maximum throughput tested was 260 MB/s for sequential read bursts with the desired reliability.
Fig. 10. The snap-shot of the SSD controller
Conclusion
Homogeneous is effective in suppressing the bit error rate in MLC or TLC flash memories and improves the reliability of storage systems protected by ECC algorithms. AES, which serves as the data encryption and decryption engine in most storage devices, was found to provide the homogeneous function concurrently. An AES hardware module was built to provide both data security and homogeneous functions, so that an independent homogeneous module for scrambling data was no longer necessary. To verify the analysis described above, a SATA-II SSD controller design was turned into a real silicon chip, and this novel structure was silicon proven.
