Abstract-The limited endurance of flash memories is a major design concern for enterprise storage systems. We propose a method to increase it by using relative (as opposed to fixed) cell levels and by representing the information with Write Asymmetric Memory (WAM) codes. Overall, our new method enables faster writes, improved reliability, and improved endurance by allowing multiple writes between block erasures. We study the capacity of the new WAM codes with relative levels, where the information is represented by multiset permutations induced by the charge levels, and show that it achieves the capacity of any other WAM code with the same number of writes. Specifically, we prove that it has the potential to double the total capacity of the memory. Since capacity can be achieved only with cells that have a large number of levels, we propose a new architecture that consists of multi-cells, each an aggregation of a number of floating gate transistors.
I. INTRODUCTION
Flash memory is the most widely used type of non-volatile electronic memory [1]. The amount of charge stored in a flash memory cell can be quantized into $q \geq 2$ discrete values in order to represent up to $\log_2 q$ bits. (The cell is called a single-level cell (SLC) if $q = 2$, and a multi-level cell (MLC) if $q > 2$.) We call the $q$ states of a cell its levels: level 0, level 1, ..., level $q - 1$. The charge is quantized into the discrete levels by an appropriate set of threshold levels. The level of a cell can be increased by injecting charge into the cell, and decreased by removing charge from the cell. Flash memories have the prominent property that although it is relatively easy to increase a cell's level, it is very costly to decrease it. This follows from the fact that flash memory cells are organized in blocks, where every block has about $10^5$ to $10^6$ cells. To decrease any cell's level, the whole block needs to be erased (which means removing the charge from all the cells of the block) and then reprogrammed. Block erasures are not only slow and energy consuming, but also significantly reduce the longevity of flash memories, because every block can endure only $10^4$ to $10^5$ erasures with guaranteed quality [1]. Therefore, it is highly desirable to minimize the number of block erasures.
We can store $\log_2 q$ bits on a flash cell with $q$ levels. With that representation, each time we want to update the data on the memory, we have to erase the whole block. We call this representation method "the trivial scheme". We could also use slightly more sophisticated update schemes. For example, we could store only 1 bit in each cell, according to the parity of the level of the cell. If the cell is in level 3, for example, it stores the value 1. Using this scheme, we can update the data $q - 1$ times before a block erasure is required. We call this scheme "the parity scheme". Update schemes like the parity scheme can be especially useful for enterprise applications of flash memory, where the endurance of the memory becomes a major design concern. Update schemes are also known as write once memory (WOM) codes for $q = 2$ [6], and write asymmetric memory (WAM) codes for $q > 2$ [3]. In this paper we focus on the WAM model. The capacity of WAM codes was studied in [4], but no capacity-achieving constructions are known.
A main trade-off in the design of WAM codes is between the number of times we can update the memory and the amount of data the memory can store at a time. We call the amount of data stored at a time the instantaneous rate of the code. In general, the higher the instantaneous rate, the lower the number of times we can update the memory. In order to settle this trade-off, we focus on optimizing the product of these two values. In other words, we optimize the total amount of data the memory stores between two erasures. We call that number the total rate of the code. Once the total rate is optimized, we optimize the instantaneous rate under that constraint. Returning to the previous examples, the trivial scheme has an instantaneous rate of $\log_2 q$ bits per cell. Its total rate is the same, since we can write the data only once between block erasures. The parity scheme, however, has an instantaneous rate of only 1 bit per cell. Its total rate is $q - 1$ bits per cell, since it allows for $q - 1$ writes between block erasures. So for $q > 2$ the parity scheme is better by this criterion.
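The comparison between the two schemes can be checked numerically. The following is a minimal sketch, not part of the original analysis; the function names are hypothetical, and `parity_write` assumes that a write raises the level by one only when the stored parity has to change.

```python
import math

def trivial_scheme_rates(q):
    """Trivial scheme: log2(q) bits per cell, one write per erase cycle."""
    instantaneous = math.log2(q)
    writes = 1
    return instantaneous, instantaneous * writes

def parity_scheme_rates(q):
    """Parity scheme: 1 bit per cell (the level parity), q - 1 writes per erase cycle."""
    instantaneous = 1.0
    writes = q - 1
    return instantaneous, instantaneous * writes

def parity_write(level, bit):
    """Return the lowest level >= `level` whose parity equals `bit` (assumed write rule)."""
    return level if level % 2 == bit else level + 1

if __name__ == "__main__":
    for q in (2, 4, 8, 16):
        # Each tuple is (instantaneous rate, total rate) in bits per cell.
        print(q, trivial_scheme_rates(q), parity_scheme_rates(q))
```

Since every write raises a level by at most one, a cell that starts at level 0 can absorb $q - 1$ writes in the worst case, which is where the total rate of $q - 1$ bits per cell comes from.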
In MLC flash memory, the process of writing a specific level on a cell is designed to cautiously approach the target level from below so as to avoid undesired block erasures in case of overshoots. Consequently, these attempts require many programming cycles, and they work only up to a moderate number of levels per cell. In order to avoid that problem, it was suggested to represent the data by a set of n cells, according to the permutation induced by the relative charge levels of the individual cells [5] . When we inject charge into a cell, we only need to make its charge level higher than that of the previous cell in the permutation, and therefore there is no risk of overshooting. Another advantage of representing data by relative levels is that the threshold levels are no longer needed. This mitigates the effects of retention in the cells (slow charge leakage). That method was called rank modulation.
In this paper we extend this idea and suggest using permutations of a given multiset. That is, we still use the relative levels of the cells instead of the absolute levels, but allow multiple cells to be in the same relative level. We use multisets with the same multiplicity for all the elements. That is, the number of cells in each relative level is equal. Using multiset permutations, we still benefit from all of the advantages of rank modulation. In addition, we gain more flexibility, and we show in this work that this flexibility can result in better performance.
While the values of the cells don't need to be quantized using thresholds, we still use discrete levels for the analysis. This is to allow easy and fair analysis, and because there should still be a certain charge difference between the cells in order to limit errors. When we use a discrete model, the problem of designing update schemes with relative levels becomes a special case of the WAM problem. Namely, we are interested in a class of WAM codes where the data is represented only by the multiset permutation induced by the levels of the cells. We call this class of codes rank modulation WAM (RMWAM) codes. We define the capacity of the model as the tightest upper bound on the amount of information that can be stored on the memory over multiple writing cycles, and study the capacity of the rank modulation WAM model. We show that when $q$ is large, it can achieve the capacity of the more general WAM model for the same number of writes. Specifically, it is possible to store almost 2 bits per cell at a given time, while reusing the memory close to $q$ times. That is twice the amount of information that is stored with the "parity scheme" from the example above. One caveat for that result is that in practical flash memory devices, $q$ is a moderately small number. In order to tackle this obstacle, we propose a method to achieve high values of $q$ with the existing cell technology. The main idea is to combine several floating gate transistors into a virtual cell, which we call a multi-cell.
The rest of the paper is organized as follows. In Section II we present the notations and definitions. In Section III we study the cost of updating the memory under the RMWAM model. In Section IV we state and discuss the main result of the paper, the capacity of the model. Section V describes the proof of the capacity theorem, and finally, in Section VI we present the proposed structure of multi-cell flash memory.
II. DEFINITIONS AND NOTATIONS
A write asymmetric memory (WAM) is a $q$-ary information storage medium consisting of $n$ cells. The $q$ states of a cell are also called levels: from level 0 to level $q - 1$. A cell can change from level $i$ to level $j$ if and only if $i < j$. The initial state of all the cells is 0. We want to reuse the WAM for $T$ successive cycles. We only consider the following case: the encoder knows the previous state of the memory and the decoder does not. The encoder and decoder can use arbitrary codes for every cycle, and there are no decoding errors (zero-error case). For the vectors $x^n = (x_1, \dots, x_n)$ and $y^n = (y_1, \dots, y_n)$,
we denote $x^n \Rightarrow y^n$ if and only if $x_i \leq y_i$ for $i = 1, 2, \dots, n$.
We will use the binary logarithm in the rest of the paper. The $T$-tuple $(R_1, \dots, R_T)$ is called the rate vector of this code. The closure of the set of all rate vectors, $A_T$, is called the capacity region of the WAM. The maximum total number of information bits stored in one storage cell of the WAM during the $T$ updating cycles is
$$C_T = \max \Big\{ \sum_{t=1}^{T} R_t : (R_1, \dots, R_T) \in A_T \Big\}.$$
Fu and Vinck [4] showed that $C_T = \log \binom{T+q-1}{q-1}$. Since we want to use only the relative values of the cells, we use permutations of a multiset. We use a multiset of $l$ elements (not including repetitions), where the multiplicity of each element is $z$. The cardinality of the multiset is thus $n = lz$. We denote the set of all permutations of a multiset of $l$ elements with multiplicities $z$ as $S_{l,z}$. Let $c = (c_1, c_2, \dots, c_n)$, with $c_i \in \{0, 1, \dots, q - 1\}$, be the state of an array of $n$ flash cells, each having $q$ discrete levels. We further assume that the cell levels induce a multiset permutation $\sigma \in S_{l,z}$, where $\sigma^{-1}(i)$ is the set of all cells with relative level $i$.
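To make the relation between cell levels and multiset permutations concrete, here is a small sketch. It rests on an assumption that is not spelled out in the text: the cells are sorted by level and cut into $l$ groups of $z$ cells, and every cell of one group must lie strictly below every cell of the next group. The function names are hypothetical, and cells are indexed from 0 rather than from 1.

```python
from math import factorial

def induced_permutation(c, z):
    """Map cell levels c to relative levels; sigma[i] is the relative level of cell i.

    Assumption (not from the text): cells are sorted by level and cut into
    groups of z, and each group must sit strictly below the next group.
    Raises ValueError when the levels do not separate this way.
    """
    n = len(c)
    if n % z != 0:
        raise ValueError("number of cells must be a multiple of z")
    order = sorted(range(n), key=lambda i: c[i])
    sigma = [0] * n
    for pos, cell in enumerate(order):
        sigma[cell] = pos // z
    for rank in range(1, n // z):
        # Highest cell of group rank-1 must be below lowest cell of group rank.
        if c[order[rank * z - 1]] >= c[order[rank * z]]:
            raise ValueError("levels do not separate into groups of size z")
    return sigma

def num_multiset_permutations(l, z):
    """|S_{l,z}| = (l z)! / (z!)^l."""
    return factorial(l * z) // factorial(z) ** l

if __name__ == "__main__":
    print(induced_permutation([5, 0, 5, 1], 2))
    print(num_multiset_permutations(2, 2))
```

Note that cells within the same group may hold different absolute levels; only the group a cell falls into carries information.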
Definition 2.
An $(l, z, T, M)$-Rank-Modulation Write-Asymmetric-Memory (RMWAM) code is a WAM code for which:
2) The map $g : S_{l,z} \to I$ depends only on the multiset permutation induced by the cell levels.
Since $M_t = M$ for $t = 1, \dots, T$, it follows that $R_t = R = \frac{1}{lz} \log M$ for $t = 1, \dots, T$. Therefore we call $R$ the instantaneous rate of the code, and $RT$ the total rate (also known as the sum rate). In addition, we define $C = C_T / T$, and call $C$ the instantaneous capacity of the RMWAM model and $C_T$ its total capacity.
III. COST OF UPDATE
To change the multiset permutation from $\sigma$ to $\sigma'$, we program the cells based on their relative levels in $\sigma'$, so that every cell's level increases as little as possible. Let $c' = (c'_1, c'_2, \dots, c'_n)$ denote the new cell levels to be set. First, for each $i \in (\sigma')^{-1}(0)$, we set $c'_i = c_i$. Then, for $j = 1, 2, \dots, l - 1$, and for each $i \in (\sigma')^{-1}(j)$, we set
$$c'_i = \max \Big\{ c_i,\ \max_{k \in (\sigma')^{-1}(j-1)} c'_k + 1 \Big\}.$$
Given two cell states $c$ and $c'$, let $\mathrm{cost}(c \to c')$ denote the cost of changing the cell state from $c$ to $c'$. We define the cost as the difference between the levels of the highest cell before and after the update operation. Namely,
$$\mathrm{cost}(c \to c') = \max_{i} c'_i - \max_{i} c_i.$$
In order to calculate the cost from this definition, we need to simulate the update operation. We now present an equivalent definition of the cost that can be calculated directly from the current multiset permutation and the multiset permutation to be written. The lemma is a generalization of Theorem 1 from [2], and it further assumes that $c_i = \sigma(i)$ for $i = 1, \dots, n$.
Lemma 1. $\mathrm{cost}(\sigma \to \sigma') = \max_{i \in \{1, \dots, n\}} \big( \sigma(i) - \sigma'(i) \big)$.
In other words, the cost is the asymmetric infinity metric.
Proof: Assume by induction on $\sigma'(i)$ that
In the base case, $\sigma'(i) = 0$, so there is no $j$ such that $\sigma'(j) < \sigma'(i)$. Therefore, $c'_i = c_i = \sigma(i)$, as described in the programming process. For $\sigma'(i) > 0$,
And the induction is proven. Now let $\sigma'(i) = l - 1$:
We now show an example of an update operation, and calculate the cost according to the two equivalent definitions.
Example 1. Let $(l, z) = (3, 1)$ and $c = (0, 1, 2)$, so $\sigma = [0, 1, 2]$. Now let $\sigma' = [1, 2, 0]$. We increase the levels of the cells to represent $\sigma'$. First we set $c'_3 = c_3 = 2$. Then we set $c'_1 = \max\{c_1, c'_3 + 1\} = \max\{0, 3\} = 3$. Finally we set $c'_2 = \max\{c_2, c'_1 + 1\} = \max\{1, 4\} = 4$. The cost of the update is $c'_2 - c_3 = 4 - 2 = 2$. We can also calculate it directly from the multiset permutations: $\sigma = [0, 1, 2]$ and $\sigma' = [1, 2, 0]$, so $\sigma - \sigma' = [-1, -1, 2]$, and the maximum is 2, so this is the cost.
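The update procedure and the two ways of computing the cost can be sketched in a few lines. This is an illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, cells are indexed from 0 (so cell 3 of Example 1 is index 2), and the rank-based cost takes the starting levels to be $c_i = \sigma(i)$.

```python
def update_levels(c, sigma_new, l):
    """Program the cells so that they induce sigma_new, raising each level minimally.

    c[i] is the current level of cell i; sigma_new[i] is its target relative level.
    """
    new = list(c)  # cells with target relative level 0 keep their current level
    for j in range(1, l):
        # Highest level already assigned to the cells of relative level j - 1.
        below = max(new[k] for k in range(len(c)) if sigma_new[k] == j - 1)
        for i in range(len(c)):
            if sigma_new[i] == j:
                new[i] = max(c[i], below + 1)
    return new

def cost_by_simulation(c, sigma_new, l):
    """Cost as the rise of the highest cell level caused by the update."""
    return max(update_levels(c, sigma_new, l)) - max(c)

def cost_by_ranks(sigma, sigma_new):
    """Cost as the largest drop in relative level over all cells."""
    return max(s - t for s, t in zip(sigma, sigma_new))

if __name__ == "__main__":
    c = [0, 1, 2]          # also sigma, since c_i = sigma(i)
    target = [1, 2, 0]
    print(update_levels(c, target, 3))       # [3, 4, 2]
    print(cost_by_simulation(c, target, 3))  # 2
    print(cost_by_ranks(c, target))          # 2
```

Both computations return 2 on Example 1, matching the hand calculation.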
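Finally, for a fixed $\sigma \in S_{l,z}$, set
$$B_{l,z,r}(\sigma) = \{ \sigma' \in S_{l,z} : \mathrm{cost}(\sigma \to \sigma') \leq r \}.$$
We note that $|B_{l,z,r}(\sigma)|$ is independent of $\sigma$.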
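The independence of the ball size from $\sigma$ can be checked by brute force for small parameters. The sketch below takes $B_{l,z,r}(\sigma)$ to be the set of multiset permutations reachable from $\sigma$ at cost at most $r$ (the reading suggested by its use in Section V), and computes the cost as the largest drop in relative level, as in Example 1. The function names are hypothetical, and the enumeration is only practical for small $l$ and $z$.

```python
from itertools import permutations

def multiset_permutations(l, z):
    """All distinct arrangements of the multiset with l elements of multiplicity z."""
    base = [rank for rank in range(l) for _ in range(z)]
    return sorted(set(permutations(base)))

def ball(sigma, r, l, z):
    """Multiset permutations reachable from sigma with cost at most r."""
    return [t for t in multiset_permutations(l, z)
            if max(s - u for s, u in zip(sigma, t)) <= r]

def ball_sizes(l, z, r):
    """Set of ball sizes observed over every possible centre sigma."""
    return {len(ball(sigma, r, l, z)) for sigma in multiset_permutations(l, z)}

if __name__ == "__main__":
    for l, z, r in [(3, 1, 1), (2, 2, 1), (3, 2, 1)]:
        # A single value in each set means the size does not depend on sigma.
        print((l, z, r), ball_sizes(l, z, r))
```

The result is expected by symmetry: relabeling the cells maps the ball around one centre onto the ball around any other.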
IV. CAPACITY
In the following we present an expression for the capacity of the RMWAM model. Theorem 1. $C_T$ is maximal when $T = q - l + 1$, and in this case:
The proof of Theorem 1 will be given in Section V. As a corollary of the theorem, we look at three different cases:
1) The case of (l, z) = (2, ∞).
Therefore we can store up to 1 bit in each cell in each updating cycle. In fact, in this case, a trivial code that assigns a different message index to each multiset permutation achieves the capacity.
2) The case of (l, z) = (∞, 1).
Therefore in this case we can also store up to 1 bit in each cell in each updating cycle. However, here it is not easy to design a code that achieves the capacity, and that problem is still open.
So we can store up to 2 bits per cell in each updating cycle in this case. Hence $C_T = 2(q - l + 1)$, and in the case of $q \gg l$, $C_T \to 2q$. We notice that in that case the total capacity of the WAM model is the same [4]. That is since for WAM,
So in this case, the total capacity of the RMWAM model is the same as that of the WAM model.
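The comparison with the general WAM model can be illustrated numerically using the formula of [4] quoted in Section II, $C_T = \log \binom{T+q-1}{q-1}$. The sketch below evaluates it at $T = q$ writes and divides by $q$; the choice $T = q$ and the function names are assumptions made for illustration.

```python
from math import comb, log2

def wam_total_capacity(T, q):
    """Total WAM capacity in bits per cell: log2 C(T + q - 1, q - 1), from [4]."""
    return log2(comb(T + q - 1, q - 1))

def rmwam_total_capacity_limit(q, l):
    """The value 2(q - l + 1) stated in the text for the third case."""
    return 2 * (q - l + 1)

if __name__ == "__main__":
    for q in (4, 16, 64, 256, 1024):
        # The ratio to q grows toward 2 as q increases.
        print(q, wam_total_capacity(q, q) / q)
```

The printed ratios stay below 2 and move toward it as $q$ grows, which is consistent with the claim that both models approach a total capacity of $2q$ when $q \gg l$.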
V. PROOF OF THEOREM 1

A. Converse Part
If we want to guarantee that the cost of each update operation is no more than $r$, we must set $M \leq |B_{l,z,r}(\sigma)|$. Otherwise, when we would like to write the message $m$, we cannot guarantee that there is a multiset permutation in $B_{l,z,r}(\sigma)$ that represents $m$. We let $K_r = \frac{1}{lz} \log |B_{l,z,r}(\sigma)|$. With $R \leq K_r$, we cannot guarantee to write more than $(q - l + 1)/r$ times, so $RT$ is at most $(q - l + 1) K_r / r$. In Lemma 2 we calculate $|B_{l,z,r}(\sigma)|$ in order to achieve an explicit bound on the rate. The lemma is a generalization of Theorem 2 from [2] for general $z$.
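For small parameters the bound $(q - l + 1) K_r / r$ can be evaluated by brute force, which shows how it behaves as $r$ changes. The sketch below is illustrative only: it reads $K_r$ as $\frac{1}{lz} \log_2 |B_{l,z,r}(\sigma)|$ with $n = lz$ cells, takes the cost to be the largest drop in relative level, and counts the ball around one fixed $\sigma$, relying on the statement that its size does not depend on $\sigma$. The function names are hypothetical.

```python
from itertools import permutations
from math import log2

def ball_size(l, z, r):
    """|B_{l,z,r}(sigma)| for the sorted multiset permutation sigma, by enumeration."""
    sigma = tuple(rank for rank in range(l) for _ in range(z))
    candidates = set(permutations(sigma))
    return sum(1 for t in candidates
               if max(s - u for s, u in zip(sigma, t)) <= r)

def converse_bound(l, z, q, r):
    """Upper bound (q - l + 1) * K_r / r on the total rate RT."""
    k_r = log2(ball_size(l, z, r)) / (l * z)
    return (q - l + 1) * k_r / r

if __name__ == "__main__":
    l, z, q = 3, 1, 8
    for r in (1, 2):
        print(r, ball_size(l, z, r), converse_bound(l, z, q, r))
```

In this small instance the bound is larger for $r = 1$ than for $r = 2$, in line with the theorem's choice of $T = q - l + 1$ writes.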
So the instantaneous rate of such a code is asymptotically optimal. If we show that the cost is always 1, it follows that the total rate is also asymptotically optimal.
Suppose $\{F_m\}_{m=1}^{M}$ is a partition of $S_{l,z}$, i.e., $F_m \cap F_{m'} = \emptyset$ for $m \neq m'$, and $\bigcup_{m=1}^{M} F_m = S_{l,z}$. We now show that there exists a partition of $S_{l,z}$ such that for any $\sigma \in S_{l,z}$ and any $m \in \{1, \dots, M\}$, there exists a vector $\sigma' \in F_m$ such that $\mathrm{cost}(\sigma \to \sigma') = 1$. We use a random coding method. With every $\sigma \in S_{l,z}$, we associate a random index $r_\sigma$ which is uniformly distributed over the data set $I = \{1, \dots, M\}$, and all these random indices are independent. Define
This implies that when $n = lz$ is sufficiently large, there exists a partition of $S_{l,z}$ such that the cost of each update is 1.
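The random coding argument can be imitated on a toy instance: label every multiset permutation with an independent uniform index and check whether every message is reachable from every state at low cost. The sketch below is a small experiment, not the proof. It accepts any update of cost at most 1 (which includes staying in place at cost 0), retries with fresh labels until the property holds, and uses hypothetical names, labels $0, \dots, M - 1$, and a fixed seed.

```python
import random
from itertools import permutations

def find_partition(l, z, M, seed=0, max_tries=1000):
    """Search for a labeling of S_{l,z} in which every label is reachable
    from every permutation with cost at most 1. Returns a dict or None."""
    rng = random.Random(seed)
    base = tuple(rank for rank in range(l) for _ in range(z))
    perms = sorted(set(permutations(base)))
    # Precompute the cost-1 ball around every permutation.
    balls = {s: [t for t in perms
                 if max(a - b for a, b in zip(s, t)) <= 1]
             for s in perms}
    for _ in range(max_tries):
        label = {s: rng.randrange(M) for s in perms}  # independent uniform indices
        if all({label[t] for t in balls[s]} == set(range(M)) for s in perms):
            return label
    return None

if __name__ == "__main__":
    label = find_partition(l=4, z=1, M=2)
    print("partition found:", label is not None)
```

For these toy parameters a suitable labeling is found after few attempts; the asymptotic statement in the text concerns large $n = lz$, which this experiment does not reach.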
VI. MULTI-CELL FLASH MEMORY
NAND flash memory is the most widely used type of flash memory for general-purpose storage. In NAND flash, several floating gate transistors are connected in series, and we read or write only one of them at a time. We propose to replace each transistor with a multi-cell of $m$ transistors connected in parallel, with their control gates connected together, as shown in Figure 1. That way, their currents add up in read operations, and the read precision increases by a factor of $m$, allowing $mq$ levels to be stored in a single multi-cell. In write operations, we write the same value to all the transistors, such that the sum of their charge levels gives us the desired total level. We suspect that the error rate would be similar to that of a traditional flash cell.
If we store data on $n$ transistors that form $n/m$ multi-cells of $mq$ levels without a WAM code, we would get a rate of $R = RT = (n/m) \log_2 (mq)$. This is less than the $n \log_2 q$ we would get using traditional cells. However, if we use RMWAM codes, we could get a total capacity approaching $2nq$ both with multi-cells and with traditional cells. In order to approach a total capacity of $2nq$ with RMWAM, the number of updates the code can take must be much greater than the number of relative levels we use. By using multi-cells, we increase $T$ at the expense of $R$, and thus approach $C_T$ faster.
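The arithmetic behind this comparison is easy to reproduce. The sketch below uses hypothetical function names and example values of $n$, $m$ and $q$ chosen only for illustration; the RMWAM figure is the limiting value of 2 bits per level per cell quoted in the text, not an exact capacity.

```python
from math import log2

def uncoded_bits_traditional(n, q):
    """n traditional cells with q levels each, written once: n * log2(q) bits."""
    return n * log2(q)

def uncoded_bits_multicell(n, m, q):
    """n transistors grouped into n/m multi-cells of m*q levels, written once."""
    return (n / m) * log2(m * q)

def rmwam_total_limit(cells, levels):
    """Limiting total capacity of roughly 2 * levels bits per cell (from the text)."""
    return 2 * cells * levels

if __name__ == "__main__":
    n, m, q = 1024, 4, 4
    print(uncoded_bits_traditional(n, q))   # 2048.0
    print(uncoded_bits_multicell(n, m, q))  # 1024.0
    # With RMWAM codes both organizations have the same limiting total: 2 * n * q.
    print(rmwam_total_limit(n, q), rmwam_total_limit(n // m, m * q))
```

Without coding, grouping transistors loses capacity, while the limiting RMWAM total is the same $2nq$ in both organizations because $(n/m) \cdot mq = nq$.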
VII. CONCLUSIONS
In this paper, we studied the capacity of rank-modulation write asymmetric memory codes. As a class of WAM codes, RMWAM codes can allow faster updates and better protection against errors in flash memories, since they don't require discrete threshold levels. We showed that the capacity of RMWAM codes approaches the capacity of WAM codes. In addition, we presented a new flash cell structure (multi-cell) that can increase the number of levels in the cells.
VIII. ACKNOWLEDGMENTS
This work was partially supported by an NSF grant ECCS-0801795 and a BSF grant 2010075. The author would like to acknowledge that Qing Li from Texas A&M University derived Lemmas 1 and 2 independently.
