Abstract
Introduction
Satellite missions frequently need to acquire large amounts of data. The acquisitions are performed in a hostile environment, where many kinds of errors are possible, and normally produce large quantities of data that must be collected for transmission to the earth station. Efficient data storage requires a fast and large mass memory. In addition, an effective memory must be robust with respect to mechanical stress and radiation. Thanks to technology improvements, it is now possible to build fault-tolerant solid state mass memories (SSMM) based on DRAMs, which assure a good level of speed and integration. Efficient integration is very important in order to reduce the power consumption and the mass of the payload. The drawback of this approach is the large number of errors that a DRAM-based mass memory presents. In this context, error detecting and correcting codes (ECC) are a suitable solution for improving the tolerance of solid state mass memories to soft and hard errors [3]. In [6] we presented the structure of a mass memory able to collect a large amount of acquired data and based on a suitable ECC. This memory was proposed for a satellite developed for a scientific mission, where many measurement results, coming from different sensors, must be stored. The main task of these sensors is to measure the energy of particles. These data must be written quickly into the memory during particle detection, and read quickly when they have to be sent to the communication device during data transmission from the satellite to the earth station. For the considered application, we defined the memory architecture using a statistical relation between error sources and design parameters. This selection model considers some factors that are typical of the considered application.
The error correcting capability of the memory must assure the correction of the errors that can be generated between two memory accesses (for writing or reading) and must handle a possible single chip failure, caused for example by latch-up. Moreover, the implementation of the coder and decoder must be simple enough to allow real-time operation. The required redundancy is another key factor in the selection of a particular error correcting code. The selection model proposed in [6] considers the reliability of the system and does not take into account other constraints that are also important in satellite applications. For this reason we decided to generalize the selection process by developing a decisional model that includes other aspects. The paper discusses the following points. Section 2 considers the possible faults induced on DRAM integrated circuits by high-energy radiation [1] [6]. Section 3 shows the relation between the code parameters and the reliability of the SSMM of [6]. Finally, in Section 4, we describe the approach based on operational research theory that will be used for the design of the SSMM.
Evaluation of the possible faults
The radiation present in the space environment poses a risk to all earth-orbiting satellites as well as to missions toward other planets. In this environment the sources of electronic device faults are chip failure and radiation damage. Chip failure depends on the fabrication process and the operational conditions. Radiation damage is induced by charged particles such as high-energy electrons, protons, alpha particles and heavy ions. There are two types of radiation damage: total dose effects and Single Event Effects (SEE). Total dose effects are cumulative ionization damage, whereas an SEE is caused by a single high-energy ion passing through a device. SEEs include Single Event Upsets (SEU) and Single Event Latchups (SEL). SEUs cause soft errors, and SELs can be destructive under certain conditions. Total dose radiation effects can be reduced by using suitable shields. On the other hand, SEE susceptibility does not change significantly with shielding. As a consequence, in this work we only consider the presence of SEEs. In particular, our analysis is based on typical DRAM values of reliability and SEU rate [1].
Relation between code properties and MTBF of the disk
The DRAMs used in mass storage applications have a 4 or 8 bit word length. A chip failure may therefore produce multiple simultaneous errors on all the bits of a 4 or 8 bit memory word. For this reason, we have chosen the class of multiple-error-correcting codes known as Reed-Solomon (RS) codes. RS codes are an important and popular family of maximum distance codes based on q-ary symbols. The maximum distance property means that the redundancy is minimal for a given correction capability. The use of q-ary symbols allows us to define a symbol that corresponds to a 4 or 8 bit word, stored in a single memory chip. An r-error-correcting RS code with symbols from $GF(q^m)$ has the following parameters:
• Block length: $n = q^m - 1$ symbols
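As a rough illustration of these parameters, the sketch below computes the length, dimension and minimum distance of an RS code and derives a shortened code from it. It assumes the standard RS properties ($n - k = 2r$ parity symbols, minimum distance $2r + 1$); the function names `rs_parameters` and `shorten` are hypothetical helpers introduced only for this example.

```python
def rs_parameters(m: int, r: int, q: int = 2) -> dict:
    """Parameters of an r-error-correcting RS code over GF(q^m).

    Assumes the standard RS relations: n = q^m - 1, n - k = 2r, d_min = 2r + 1.
    """
    n = q**m - 1
    k = n - 2 * r
    return {"n": n, "k": k, "d_min": 2 * r + 1, "rate": k / n}


def shorten(n: int, k: int, drop: int) -> tuple:
    """Shorten a code by fixing `drop` data symbols to zero and not storing them."""
    return n - drop, k - drop


# 8-bit symbols (m = 8) and r = 4 correctable symbols give the (255, 247) code.
full = rs_parameters(m=8, r=4)

# Dropping 183 data symbols yields a (72, 64) shortened code.
n_s, k_s = shorten(full["n"], full["k"], drop=183)
print(full, (n_s, k_s), k_s / n_s)
```

Shortening leaves the number of parity symbols unchanged, so the correction capability r is preserved while the code rate decreases.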
The choice of the RS code parameter values (i.e. codeword and data length) depends on the code rate, the failure statistics of memory chips and the SEU statistics. Moreover, our analysis is based on the following assumptions :
(i) the SEUs follow a Poisson distribution $P_k(t)$ with $\lambda_{SEU}$ = SEU rate = $10^{-6}$ upsets/(bit·day) [1].
(ii) the DRAMs work in the random failure period with hazard rate $\lambda_{DRAM} = 10^{-6}\ \mathrm{h}^{-1}$. The reliability function is then $R_{DRAM}(t) = e^{-\lambda_{DRAM} t}$.
(iii) the memory is organized in modules of n chips each. Each chip is arranged as a DRAM of (64/m) Mwords of m bits.
(iv) data recording cycle = 24 h.
With these hypotheses, we have calculated the disk memory reliability function, equation (3).
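Assumptions (i) and (ii) can be made concrete with a short numerical sketch. It assumes the usual Poisson probability mass function for the number of upsets and the constant-hazard-rate reliability $e^{-\lambda t}$; it does not reproduce equation (3). The helper names and the choice of a 72-symbol, 8-bit codeword as the exposed unit are assumptions made for illustration.

```python
import math

SEU_RATE = 1e-6      # upsets per bit per day, assumption (i)
LAMBDA_DRAM = 1e-6   # chip failures per hour, assumption (ii)


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k events for a Poisson process with the given mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def seu_probability(k: int, bits: int, days: float) -> float:
    """Probability of exactly k upsets in `bits` bits exposed for `days` days."""
    return poisson_pmf(k, SEU_RATE * bits * days)


def dram_reliability(hours: float) -> float:
    """Probability that one chip survives `hours` hours (constant hazard rate)."""
    return math.exp(-LAMBDA_DRAM * hours)


codeword_bits = 72 * 8          # hypothetical: one 72-symbol codeword, 8-bit symbols
mission_hours = 3 * 365 * 24    # three-year mission

print("P(no upset in one codeword over 24 h):", seu_probability(0, codeword_bits, 1))
print("P(one upset in one codeword over 24 h):", seu_probability(1, codeword_bits, 1))
print("single-chip reliability over 3 years:", dram_reliability(mission_hours))
```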
Using equation (3) we have computed the MTTF (Mean Time To Failure) of the memory due to chip failures and the MTTCE (Mean Time To Codeword Error) due to Single Event Upsets [6]. We have observed that low values of MTTF and MTTCE correspond to a code rate R ≈ 1 (m = 8). The code rate is an important parameter because it determines the complexity and the size of the disk memory. Therefore, to increase the MTTCE and MTTF values, we have considered the shortened (72,64) RS code [2]. Table 1 shows the parameters of this code. To obtain a further improvement, we have introduced s spare chips. In this case, we obtain the parameter values shown in Table 2. In this table we use the disk memory reliability function given by (3), with the corresponding DRAM reliability function given by (4).
Using Table 2 we have chosen s = 8 as the best trade-off among memory module size (n+s), overhead (or code rate) and reliability for a mission of 3 years. A possible architecture of the disk memory that uses these parameters is shown in [6]. We assume that one module consists of 80 memory banks (8 × 8 Mbit DRAM chips). The network interface can access a set of external registers containing information about the bank status, the status of the bank supply voltage, and the addresses of the spare chips.
Since one module can allocate 512 MB of memory, 8 modules are needed to achieve 4 GB net storage capacity. 
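The capacity figures can be checked with a few lines of arithmetic. The sketch below assumes 64 Mbit chips (from assumption (iii)), the (72,64) code and s = 8 spare chips; the variable names are illustrative only.

```python
CHIP_MBIT = 64           # each chip stores (64/m) Mwords of m bits = 64 Mbit
N, K, S = 72, 64, 8      # codeword chips, data chips, spare chips

chips_per_module = N + S                    # physical chips in one module
net_mbyte_per_module = K * CHIP_MBIT // 8   # only the K data chips hold net data
modules_for_4gb = (4 * 1024) // net_mbyte_per_module
chip_overhead = chips_per_module / K        # physical chips per data chip

print(chips_per_module, net_mbyte_per_module, modules_for_4gb, chip_overhead)
```

With these values a module holds 80 chips, of which 64 carry net data, so the parity and spare chips add 25% to the chip count.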
Decision model
In the previous sections we defined an architecture described by the mathematical model given by (3) and (4). This model represents a relation between failure sources and the design parameters (n,k,s). We chose the code type and the values of (n,k,s) that represent the best compromise between data block length and reliability. The model only considers the reliability of the system for a particular code and does not take into account other constraints that are important in satellite applications. For this reason we decided to generalize the SSMM design process by developing a decisional model that includes other aspects and other types of error correcting code. As a first step we identify all the possible solutions suitable for our problem. Then, as a second step, we search for the optimal solution among them. A good method for finding such an optimal solution is a decision model based on the following scheme:
We have to find $x^* \in F$ such that $\varphi(x^*) \le \varphi(x)\ \forall x \in F$, where $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$ is the decisional variable vector, corresponding to the vector of the design parameters, $F \subseteq \mathbb{R}^n$ is the space of admissible solutions, $\varphi: F \to \mathbb{R}$ is the target function (or cost function) and $x^*$ is the optimal solution. For SSMM design, defining the space F requires defining the relations among the constraint specifications. At present, we are working to define these relations for the different constraints.

A first result is a global relation for the code parameters, derived from information theory, that allows extending the search for the best solution to a large set of codes rather than a particular code. Figure 2.a shows a theoretical model of a mass memory that treats it as a transmission channel. During the latency period the noise corrupts the integrity of the data. If the noise is generated by SEUs, the B.E.R. depends on the latency time (LT) of the data and is given by

$$\mathrm{BER}(LT) \approx \mathrm{SEU_{rate}} \cdot LT \ll 1 \qquad (6)$$

Figure 2.b shows the correction scheme. The correction device performs the inverse operation of the noise source. It uses the correction data (the output of the virtual observer) and the channel output to restore data integrity. The scheme of Figure 2.b allows us to use Shannon's channel coding theorem [5]. This theorem states that the word error probability can be made arbitrarily small by increasing the code length n and/or reducing the code rate R, provided only that the code rate does not exceed the channel capacity C [5]. Given the set of binary block and convolutional codes, the attainable word error probability ($p_w$) and error event probability ($p_e$) over any discrete channel are bounded by

$$p_w \le 2^{-n E(R)}, \quad R \le C \qquad (7)$$

for block codes, and by the corresponding bound (8) for convolutional codes. $E(R)$ is the random coding error exponent, whose typical behavior is shown in Figure 3. In this figure R is the code rate, n is the codeword length for block codes, and $N_0$ and L are the convolutional code parameters [5]. Starting from (7) and (8), we can rewrite (7) using a simpler expression, equation (9):
In (9), $R_0$ is the cutoff rate, which is a characteristic of the channel. Several cutoff rate expressions are given in [5]. In particular, for the BSC and for block codes the cutoff rate is given by equation (10), where $p_b$ is the attainable bit error probability which, for $p_b \ll 1$, is approximately the Bit Error Rate (B.E.R.). Using equation (9) we can reduce the set of codes that can be useful for our application. For example, a convolutional code presents low values of $p_w$ because, for a given value of R, it has a higher value of the error exponent. Nevertheless, the requirement of a high rate, necessary for reducing the redundancy, calls for higher values of n. Equation (9) is a good and simple tool for selecting the code type among all the possible codes in the space F. Moreover, knowledge of the physical constraints on the codewords (related to the maximum module size) may be sufficient to reduce the set of possible channel codes. In the following, we apply equations (6), (9) and (10) to an example of a decisional model developed to optimize the decisional variables TMW (memory washing cycle), k (dataword length), n (codeword length), m (number of bits per symbol) and BER, for a given LT (latency time). In the example, this model is used for the application discussed in Sections 2 and 3 [6], and the results obtained are compared with those derived from the model in Section 3.
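To give a feel for how (6), (9) and (10) interact, the sketch below evaluates them numerically. Since (9) and (10) are not reproduced in this text, it assumes their standard textbook forms: the cutoff-rate bound $p_w \le 2^{-n(R_0 - R)}$ for $R \le R_0$, and the binary symmetric channel cutoff rate $R_0 = 1 - \log_2\left(1 + 2\sqrt{p_b(1-p_b)}\right)$. Measuring n in bits (72 symbols of 8 bits) is a further assumption, and the function names are hypothetical.

```python
import math

SEU_RATE = 1e-6  # upsets per bit per day, as in Section 3


def ber_at_latency(lt_days: float) -> float:
    """Equation (6): raw bit error rate accumulated over the latency time."""
    return SEU_RATE * lt_days


def cutoff_rate_bsc(p_b: float) -> float:
    """Assumed form of (10): cutoff rate of a binary symmetric channel."""
    return 1.0 - math.log2(1.0 + 2.0 * math.sqrt(p_b * (1.0 - p_b)))


def word_error_bound(n_bits: int, rate: float, r0: float) -> float:
    """Assumed form of (9): 2^(-n (R0 - R)); uninformative (1.0) when R > R0."""
    if rate > r0:
        return 1.0
    return 2.0 ** (-n_bits * (r0 - rate))


n_bits, rate = 72 * 8, 64 / 72   # illustrative (72,64) code with 8-bit symbols
for lt in (10, 100, 1000):
    p_b = ber_at_latency(lt)
    r0 = cutoff_rate_bsc(p_b)
    print(lt, p_b, r0, word_error_bound(n_bits, rate, r0))
```

Under these assumptions the cutoff rate shrinks as the latency time grows, so a fixed code rate leaves less margin $R_0 - R$ and the bound on $p_w$ becomes looser.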
An example of decision model
Memory washing lets us reduce the memory overhead (that is, increase the code rate). The problem is choosing the best value of the washing period (TMW). Decreasing TMW improves the soft error correction capability but also reduces the time available for data storage. An index that measures the memory washing inefficiency is the ratio between the memory washing duration and the memory washing cycle (TMW). We suppose:
i) a single bi-directional data bus;
ii) one ECC device for every 2 GB of data memory;
iii) an ECC device with an encoding and decoding throughput ($f_t$) of 50 MB/s;
iv) a maximum codeword length of 72 symbols, so n ≤ 72;
v) a symbol length of 8 bits.
Under these assumptions, the memory washing inefficiency (MWI) is given by equation (11).
We can observe that, when TMW varies (for fixed MWI), the n/k ratio (memory overhead) must be modified. But varying n/k also changes the B.E.R. at LT. The question is: what is the best solution? An efficient way to answer this question is to search for the optimal solution of a decision model with the following decision variables: TMW, k, n, and the B.E.R. calculated at a fixed value of LT. The space of admissible solutions is defined by (12), where $BER_0$ is the memory bit error rate without the ECC system and i is the number of memory washing cycles calculated at LT (latency time). For the sake of simplicity we introduce the following constraints on the parameter values: latency time LT = 10, 100, 1000 days; $\mathrm{B.E.R.}_i$ (maximum B.E.R. value calculated at LT) ≤ $10^{-11}$; and code rate $0.6 \le R \le R_0$. To evaluate the performance of a possible solution, we have chosen the evaluation function given by relation (13). $I_1$ and $I_2$ are two performance indices: $I_1$ is the ratio between the cost overhead and the error correction capability, the latter given by $-\log_{10}(\mathrm{B.E.R.})$ at a given LT; $I_2$ is the ratio between the memory washing inefficiency, given by (11), and the error correction capability. We can attribute different importance to $I_1$ and $I_2$ by varying the coefficients $\alpha_1$ and $\alpha_2$. For simplicity we have chosen $\alpha_1 = \alpha_2 = 1$. Equation (12.5) is derived from [5] and gives the value of the B.E.R. for a given attainable word error probability $p_w$, calculated at the end of the i-th memory washing cycle.
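The trade-off between TMW and n/k can be illustrated numerically. The sketch below uses the definition of MWI as the ratio between washing duration and TMW, but the washing duration itself is an assumption, not equation (11): one washing pass is taken to read and rewrite all stored codewords (2 GB of data expanded by n/k) through the ECC device at $f_t$ = 50 MB/s over the single bus. All function names are hypothetical.

```python
DATA_MB = 2 * 1024   # data memory served by one ECC device, in MB
F_T = 50.0           # ECC encoding/decoding throughput, in MB/s


def washing_duration_s(n: int, k: int, passes: int = 2) -> float:
    """Assumed washing time: stored bytes (data * n/k) read and written back."""
    return passes * DATA_MB * (n / k) / F_T


def mwi(t_mw_s: float, n: int, k: int) -> float:
    """Memory washing inefficiency: washing duration over washing cycle TMW."""
    return washing_duration_s(n, k) / t_mw_s


def washing_cycles(lt_days: float, t_mw_s: float) -> float:
    """Number of washing cycles i completed within the latency time LT."""
    return lt_days * 86400 / t_mw_s


for t_mw in (600.0, 3600.0, 86400.0):
    print(t_mw, mwi(t_mw, 72, 64), washing_cycles(10, t_mw))
```

Under this assumed model, shortening TMW increases both the number of washing cycles within LT and the fraction of time spent washing, which is the tension the decision model has to resolve.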
Using a MATLAB toolbox we have calculated the optimal solutions (Table 3), which show the following property: as the data latency (LT) increases, we have to decrease the data length and increase the $\mathrm{B.E.R.}_i$ value in order to keep the value of the evaluation function the same. These solutions are general because they are obtained without specifying a particular code. The next step is the search for the best code that respects the optimized values of n, k and B.E.R. With this methodology we can reduce the time needed for the code search. In fact, as shown, the selection process does not start from a fixed set of codes that must be analyzed and characterized a priori. It is sufficient to solve the optimization problem and then search for a code that satisfies the solution obtained.
Table 3. Parameters of (72,64) RS code.
For example, for LT = 10 days we can use a shortened (72,68) RS code. By contrast, the RS code is inefficient for LT = 100 or 1000 days because n-k must be a power of 2. For (n,k) = (72,63) we can use a (72,64) RS code with one data byte fixed at 0. For (n,k) = (72,65) we can use a (72,68) RS code with three data bytes fixed at 0. In these two cases other possible codes need to be investigated. The decision model discussed here is an example that, for simplicity of presentation, does not take into account other important aspects such as the chip failure rate and the reliability constraint.
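The mapping from an optimized (n,k) pair to an RS code described above can be written as a small helper. This is a sketch of the rule as stated in the text (the parity count n-k is reduced to the largest power of 2 that does not exceed it, and the surplus data bytes are fixed at 0); the function name `nearest_rs` is hypothetical.

```python
def nearest_rs(n: int, k: int) -> tuple:
    """Map an optimized (n, k) to an RS code whose parity count is a power of 2.

    Returns (n, rs_k, zero_bytes): the RS code (n, rs_k) and the number of
    data bytes that must be fixed at 0 to carry only k useful data bytes.
    """
    parity = n - k
    if parity < 1:
        raise ValueError("the code needs at least one parity symbol")
    rs_parity = 1 << (parity.bit_length() - 1)   # largest power of 2 <= parity
    rs_k = n - rs_parity
    return n, rs_k, rs_k - k


for pair in ((72, 68), (72, 63), (72, 65)):
    print(pair, "->", nearest_rs(*pair))
```

The zero-filled bytes are stored capacity that carries no data, which illustrates why the RS code is described as inefficient for these two cases.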
Conclusions
In this paper we have discussed two design methodologies for solid state mass memories in satellite applications. Because of radiation damage, the introduction of error correcting codes is required. The work presents a first methodology for choosing the code and for evaluating the number of spare chips in order to match the constraints related to the required endurance. Moreover, we have proposed a more general decisional model to find the best trade-off among the different parameters, taking into account the different design constraints. Depending on the constraints introduced, the final decisional model may be very complex. We have shown the main characteristics of the proposed methodology, as well as its effectiveness, by applying it to a simple example.
