State-of-the-art automatic reliability analyses as used in system-level design approaches mainly rely on Binary Decision Diagrams (BDDs) and, thus, face two serious problems:
INTRODUCTION
Although shrinking CMOS technology sizes allow more complex systems to be integrated on a single chip and thus benefit important design objectives like monetary cost and volume, the resulting increase in process variation is a major issue for the reliability of the designed components. The requirement to design reliable systems from these unreliable components has made reliability one of the main objectives of modern automatic embedded system-level design approaches.
Formal state-of-the-art reliability analysis techniques like [5, 8] are based on the abstraction of the system-level design to a Boolean function given as a Binary Decision Diagram (BDD). The exponential worst-case complexity of BDDs leads to two serious drawbacks for the scalability of the analysis approaches: (1) The BDDs exhaust available memory during their construction and/or (2) the size of the final BDDs is often up to several orders of magnitude larger than the available memory. These drawbacks are critical for the so-called up-scaling, i.e., combining several subsystems and components to form the complex overall system that results in an explosion of the required memory. To make a contribution towards scalable system-level reliability analysis techniques, the work at hand proposes two approaches to cope with the abovementioned problem of oversized BDDs.
Contributions. Several known automatic reliability analyses rely on the use of temporary variables during the construction of the BDDs. This arises for example from an analysis that only considers hardware component defects, but needs to consider the binding of processes to determine the state of the system, i.e., whether the system works properly or not. The variables used for the process binding are excluded using exists quantification, such that the final BDD reflects the reliability based on component defects. Thus, in contrast to the final BDD, the size of the temporary BDDs at construction time is critical. To cope with this problem, an early quantification approach based on a heuristic that is tailored for system-level reliability analysis is proposed that aims to quantify variables as early as possible to keep the size of the BDDs during construction small.
For highly complex problems, the proposed early quantification method might fail due to oversized BDDs as well. As a remedy, the work at hand proposes a novel approach where a simulation is assisted by a state-of-the-art SAT-solver to approximate the reliability. This efficient simulation makes it possible to carry out a high number of simulation runs and thus trades the memory consumption of the BDD-based approaches for runtime. The simulative approach is capable of achieving a very accurate reliability analysis with a reasonable runtime overhead even for large and very complex systems where known exact methods fail.
The proposed methodologies are compared to state-of-the-art reliability analysis approaches on several test cases to give evidence of their scalability.
Outline. The rest of the paper is outlined as follows: Section 2 discusses related work. While the problem targeted in this paper is outlined in Sec. 3, the proposed reliability analysis approaches are introduced in Sec. 4 and Sec. 5. Section 6 presents experimental results before the paper is concluded in Sec. 7.
RELATED WORK
The importance of reliability as an objective in automatic embedded system design is supported by a large number of approaches that have been published in recent years. These approaches aim to increase the reliability of the system by checkpointing and dynamic voltage scaling [21], improved scheduling techniques, cf. [10, 22], reliability-aware selection of components [11], or introducing redundancy at component-level [13, 17, 20] or process-level [5, 6, 8]. However, one can observe that nearly all reliability-increasing techniques result in an increase of cost in one or even several objectives like runtime, monetary costs, area consumption, or power consumption. Thus, finding optimal system implementations with respect to multiple and often conflicting objectives requires an accurate analysis technique for each objective. Most of the approaches discussed so far either perform reliability analysis using simplified failure models like a constant failure probability [13], rely on so-called series-parallel structures for the analysis that cannot model important embedded system design aspects like resource reuse [3], or target reliability analysis at lower levels of abstraction using extensive simulation [22]. State-of-the-art system-level reliability analyses that can analyze arbitrary embedded systems at system-level using Boolean functions encoded in BDDs are presented in [8] and [5]. The scalability of the methodology proposed in [5] is improved by an application-specific early quantification approach presented in [6]. In contrast, the work at hand proposes a generalized early quantification heuristic that even outperforms the approach from [6]. In [9], the approach from [8] is extended by using a specific variant of BDDs which perform better for the considered test cases but still suffer from the drawbacks of common BDDs.
Early quantification is a technique that aims to cope with the size explosion of BDDs during their construction known from several domains, most prominently from formal verification, cf., e.g., [15] . A smorgasbord of approaches for early quantification is available like [12, 19] with [7] being the most closely related approach. In comparison, the approach proposed in the work at hand uses a divide-and-conquer heuristic based on finding cuts in a graph-based representation of the dependencies between relations. Moreover, the relations are not explicitly given, but are a result of the clustering of conjunctions of terms. This heuristic is especially tailored for Boolean functions that have a Conjunctive Normal Form (CNF) related structure as typical for formal system-level reliability analysis.
Monte-Carlo simulation has been widely studied as an appropriate method to perform reliability analysis for large systems with complex behavior, cf. [1]. To the best of the authors' knowledge, this is the first SAT-assisted simulation approach that implements an efficient system-level reliability analysis.
PROBLEM DEFINITION
Reliability analysis approaches commonly require determining the so-called reliability function $R : \mathbb{R}^+ \to \mathbb{R}_{[0,1]}$ of the overall system, which returns the probability that the lifetime $\tau_{LT}$ of the system is greater than a certain time $\tau$:

$$R(\tau) = P[\tau_{LT} > \tau] \tag{1}$$
It holds that R(0) = 1 and R(∞) = 0. The reliability function allows all important reliability-related measures like Mean-Time-To-Failure (MTTF) or the Mission Time (MT) to be computed. To determine the reliability function, knowledge about the state of the system in case of failures and defects is needed. Thus, most formal reliability analysis techniques rely on the so-called structure function ϕ. This Boolean function $\varphi : \{0,1\}^{|X|} \to \{0,1\}$ takes a vector $x = (x_1, \ldots, x_i, \ldots, x_{|X|})$ encoding the states of all system components X, i.e., $x_i = 1$ if component i works properly and $x_i = 0$ if it failed, and returns the state of the system as 1 if the system works properly and 0 if the system failed, respectively.
The main challenge for reliability analysis techniques is the representation of ϕ by appropriate data structures that are mostly based on Binary Decision Diagrams (BDDs) [2] . An example on how to derive the reliability function from a BDD-encoded structure function can be found in [6] and is outlined as follows: Using a specific Shannon-decomposition as proposed in [16] , the probability P of a proper working system at time τ is determined by traversing the BDD representing ϕ:
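This function determines the probability that a structure function ϕ evaluates to 1 at a given time τ, depending on the reliability function $R_x(\tau)$ of each component x with $R_x : \mathbb{R}^+ \to \mathbb{R}_{[0,1]}$. The reliability function of the overall system is then given by:

$$R(\tau) = P(\tau, \varphi)$$

[Figure 1: schematic comparison of the common approach and early quantification.]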
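The traversal described above can be illustrated with a minimal Python sketch. It applies the Shannon decomposition directly to a structure function given as a Python callable instead of a BDD, so it enumerates all assignments and only serves to show the recursion; a BDD-based implementation would perform the same computation once per shared node. The example system, the exponential component model $R_x(\tau) = e^{-\lambda_x \tau}$, the failure rates, and all function names are assumptions made for illustration.

```python
import math

def prob_working(phi, order, rel, fixed=None):
    """Probability that phi evaluates to True, by Shannon decomposition.

    phi   : structure function taking a dict {component: bool}
    order : list of component names (the decomposition order)
    rel   : dict {component: probability that the component works}
    """
    fixed = dict(fixed or {})
    if len(fixed) == len(order):
        # All components decided: terminal case, phi is a constant 0 or 1.
        return 1.0 if phi(fixed) else 0.0
    x = order[len(fixed)]
    p_high = prob_working(phi, order, rel, {**fixed, x: True})   # x works
    p_low = prob_working(phi, order, rel, {**fixed, x: False})   # x failed
    return rel[x] * p_high + (1.0 - rel[x]) * p_low

# Hypothetical system: component a in series with the parallel pair (b, c).
def phi(s):
    return s["a"] and (s["b"] or s["c"])

# Assumed exponential component reliabilities R_x(tau) = exp(-lambda_x * tau).
failure_rate = {"a": 0.01, "b": 0.05, "c": 0.05}
tau = 10.0
rel = {x: math.exp(-lam * tau) for x, lam in failure_rate.items()}

p = prob_working(phi, ["a", "b", "c"], rel)
closed_form = rel["a"] * (1.0 - (1.0 - rel["b"]) * (1.0 - rel["c"]))
print(p, closed_form)
```

For this small series-parallel example the recursion agrees with the textbook closed form, which gives a quick sanity check of the decomposition.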
EARLY QUANTIFICATION
In many automatic reliability analysis approaches, temporary variables Y represented by the binary vector $y = (y_1, \ldots, y_i, \ldots, y_{|Y|})$ are required in the structure function construction process to determine the correct state of the system, but are removed as soon as the overall structure is generated to derive ϕ:

$$\varphi(x) = \exists y : \psi(x, y)$$
Imagine the case where the state of the system depends on both the availability of hardware components and the ability of the system to correctly establish communication among processes. Thus, variables for the communicating processes Y are needed to construct ψ(x, y), with $\psi : \{0,1\}^{|X|+|Y|} \to \{0,1\}$. However, if the assumed failure model is the defect of components, the process variables in Y need to be existentially quantified to derive the desired structure function ϕ(x). The existential quantification ensures that ϕ(x) evaluates to 1 only if there exists a feasible process binding that establishes a correct communication for a given state x of the hardware components. The temporary variables lead to BDDs exhausting available memory during the construction process, although the BDD without the temporary variables might fit into memory.
This size explosion during the construction of the data structures is known from formal verification as well and can be faced using so-called early quantification techniques. These techniques aim to partition the problem into subproblems such that temporary variables can be quantified as soon as the construction of the BDD for each subproblem is completed. The resulting BDDs of the subproblems are combined afterwards. Given the function $s_{max} : \{0,1\}^{\{0,1\}^n} \to \mathbb{N}$ that returns the maximum memory consumption of a Boolean function with n variables during the construction of the BDD, it can be observed that
with $\bullet \in \{\wedge, \vee\}$. That means, the earlier variables can be quantified during the construction of the BDD, the smaller the maximum size of the BDD during construction. A schematic representation of early quantification is depicted in Fig. 1. When Boolean functions are combined using disjunction (logical or, ∨), early quantification becomes trivial since it holds:

$$\exists y : g(x, y) \vee h(x, y) = (\exists y : g(x, y)) \vee (\exists y : h(x, y)) \tag{6}$$
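On the other hand, Boolean functions combined using conjunction (logical and, ∧) are challenging with respect to early quantification. It holds:

$$\exists y : g(x, y) \wedge h(x, y) = (\exists y : g(x, y)) \wedge (\exists y : h(x, y)) \quad \text{if } \forall y \in Y : \neg\big(c(g, y) \wedge c(h, y)\big) \tag{7}$$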
with $c : \{0,1\}^{\{0,1\}^{|X|+|Y|}} \times Y \to \{0,1\}$ that evaluates to 1 if a variable is contained in a function, i.e., if the function is not invariant to a variable, and 0 if the variable is not contained in the function, respectively. Thus, the condition states that an early quantification is only allowed if each variable to quantify is contained in at most one of the functions g(x, y) and h(x, y). In most cases, the functions will share variables and, thus, Eq. (7) is not applicable. To overcome this problem, the work at hand proposes a transformation of Eq. (7) to allow early quantification based on a special partitioning of the variables Y. In this partitioning, three sets are determined such that two sets $Y_g$ and $Y_h$ consist of variables that are contained in one function only, while the third set $Y_z$ consists of the variables that are shared by both functions. With corresponding subvectors $y_g$, $y_h$, $y_z$ of y being defined, it holds:

$$\exists y : g(x, y) \wedge h(x, y) = \exists y_z : \big(\exists y_g : g(x, y)\big) \wedge \big(\exists y_h : h(x, y)\big) \tag{8}$$
Following this early quantification scheme allows to early quantify the two functions g and h with respect to variables that are contained in only one function, combine the resulting data structures using conjunction, and finally quantify the variables contained in both functions. Thus, this approach is capable of making use of early quantification where former approaches like [6] , where early quantification is only enabled if Eq. (7) is applicable, fail.
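As a small sanity check of this scheme, the following Python sketch compares direct quantification of a conjunction with the partitioned variant: variables private to g and to h are quantified first, and the shared variables are quantified after the conjunction. Boolean functions are plain Python callables and quantification is done by brute-force enumeration rather than on BDDs. The functions `g` and `h`, the variable names, and the helper `exists` are hypothetical and chosen only for illustration.

```python
from itertools import product

def exists(f, ys):
    """Return the function (exists ys : f), computed by enumeration."""
    ys = list(ys)
    def quantified(env):
        return any(f({**env, **dict(zip(ys, vals))})
                   for vals in product([False, True], repeat=len(ys)))
    return quantified

# Hypothetical functions: y1 occurs only in g, y2 only in h, y3 in both.
def g(e):
    return (e["x1"] or not e["y1"]) and (e["y1"] or e["y3"])

def h(e):
    return (e["x2"] or e["y2"]) and (not e["y2"] or not e["y3"])

Y_g, Y_h, Y_z = ["y1"], ["y2"], ["y3"]

# Direct: build the conjunction first, then quantify all temporary variables.
direct = exists(lambda e: g(e) and h(e), Y_g + Y_h + Y_z)

# Early: quantify private variables per function, conjoin, quantify the rest.
g_early = exists(g, Y_g)
h_early = exists(h, Y_h)
early = exists(lambda e: g_early(e) and h_early(e), Y_z)

agree = all(direct({"x1": a, "x2": b}) == early({"x1": a, "x2": b})
            for a, b in product([False, True], repeat=2))
print(agree)
```

With BDDs, the benefit lies in the intermediate results `g_early` and `h_early` depending on fewer variables than `g` and `h`; the enumeration above only demonstrates that both orders yield the same function.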
In several reliability analysis techniques, partitioning is used to speed-up the analysis process by preventing outsized BDDs. Commonly, these approaches rely on domain-specific knowledge about the structure of the system for the partitioning, cf. [6] . Given Eq. (8) , the work at hand proposes a partitioning that partitions temporary variables based on a given Boolean function solely and, thus, does not need to take into account the given application and/or architecture.
For the partitioning and without loss of generality, a Boolean function ψ(x, y) of the form $\psi(x, y) = \bigwedge_{t \in T} t(x, y)$ is assumed since disjunctions can be trivially early quantified following Eq. (6). Each term t ∈ T is a Boolean function
Given ψ in the above form, the structure function is defined as follows:

$$\varphi(x) = \exists y : \bigwedge_{t \in T} t(x, y)$$
It is assumed that each term contains at least one variable y ∈ Y . Otherwise, these terms trivially fulfill Eq. (7) and, thus, can be analyzed independently. 
The dependency graph for ψ can be found in Fig. 2(a). Two possible partitions are shown in Fig. 2(b) and Fig. 2(c). While the first partition can be considered good since it allows 4 variables to be quantified early with only one variable being left for quantification after combining the functions, the latter partition can be considered bad since no variable can be quantified early. A good partitioning fulfills the following minimum-requirement:
In other words, a good partitioning aims to determine two large partitions $Y_g$ and $Y_h$ while keeping the set $Y_z$ of variables that are included in both partitions to a minimum, such that many variables can be quantified early and only few variables are left for quantification after the functions are combined. This requirement allows a maximum benefit from the early quantification approach given in Eq. (8).
In the following, an algorithm is presented that makes use of the proposed early quantification scheme to construct the structure function ϕ.
Preprocessing
Given a Boolean function of the form given in Eq. (9), several well-known preprocessing techniques are applied. Important in this context is the rule of absorption, which allows terms to be eliminated. The effect can be visualized by the corresponding dependency graph before and after preprocessing: Due to absorption, edges $(y, y') \in E_y$ can be removed if terms can be excluded such that two variables y and y' are no longer contained in the same term. The resulting graph has a decreased problem complexity and, in some cases, decays to a set of independent subgraphs, cf. Fig. 3. These subgraphs can, following Eq. (7), be analyzed independently. In particular, the resulting subgraphs in the dependency graph correspond to the partitioning approach presented in [6]. However, in [6], domain-specific knowledge is used to find these independent subgraphs, based on a special analysis of the system during the construction of the Boolean function from the structure of the system. In contrast, the approach proposed in the work at hand is applicable to any given structure function.

[Algorithm 1 (dac), surviving lines of the listing: 9: return $\exists y_z : z(x, y)$ 10: end if]
Divide-and-Conquer Algorithm
In the following, an algorithm is used that recursively applies Eq. (8) to implement an efficient early quantification approach in a divide-and-conquer fashion: Given the set of components $X = \{x_1, \ldots, x_{|X|}\}$ of a Boolean function ϕ(x), the set of temporary variables Y that are quantified, and the function $\rho : 2^Y \to 2^T$ with $\rho(Y') = \{t \mid t \in T \wedge y \in Y' \wedge c(t, y)\}$ that returns all terms that contain at least one variable $y \in Y' \subseteq Y$, the approach can be formulated as follows:
The recursion is performed by the function dac that is outlined in Alg. 1. The algorithm requires the temporary variables that need to be considered for the current partition, cf. line 0. If the number of temporary variables in the partition is less than a given threshold, the recursion ends and the partition is transformed into a BDD. If the partition contains too many temporary variables, the recursion works as follows: After the two partitions $Y_g$ and $Y_h$ are determined using the divide function, cf. line 3, the BDDs for the partitions $Y_g$ and $Y_h$ are constructed and early quantified following the recursion scheme, cf. lines 4 and 5. After $Y_z$ is determined in line 6, the partitions are combined using conjunction in line 8 and a quantification with respect to the variables $Y_z$ in line 9 completes the recursion.
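The recursion can be sketched in Python as follows. This is not the listing of Alg. 1: Boolean functions are callables instead of BDDs, quantification is brute-force enumeration, and the divide step is replaced by a naive split of the term list into two halves (an assumption made to keep the sketch short; the divide function of the paper searches for a good cut instead). The names `dac`, `exists`, `conj`, the threshold parameter, and the example terms are all illustrative.

```python
from itertools import product

def exists(f, ys):
    """Return the function (exists ys : f), computed by enumeration."""
    ys = list(ys)
    def quantified(env):
        return any(f({**env, **dict(zip(ys, vals))})
                   for vals in product([False, True], repeat=len(ys)))
    return quantified

def conj(fs):
    """Conjunction of a collection of Boolean functions."""
    fs = list(fs)
    return lambda env: all(f(env) for f in fs)

def dac(terms, Y, threshold=1):
    """terms: list of (function, set of temporary variables it contains).
    Y: temporary variables occurring only in these terms."""
    if len(Y) <= threshold or len(terms) < 2:
        # Small enough: conjoin all terms, then quantify in one step.
        return exists(conj(f for f, _ in terms), Y)
    mid = len(terms) // 2                      # naive split (assumption)
    left, right = terms[:mid], terms[mid:]
    in_left = set().union(*(v for _, v in left))
    in_right = set().union(*(v for _, v in right))
    Y_g = (Y & in_left) - in_right             # private to the left half
    Y_h = (Y & in_right) - in_left             # private to the right half
    Y_z = Y - Y_g - Y_h                        # shared, quantified last
    g = dac(left, Y_g, threshold)
    h = dac(right, Y_h, threshold)
    return exists(conj([g, h]), Y_z)

# Hypothetical chain of terms over components x1, x2 and temporaries y1..y3.
terms = [
    (lambda e: e["x1"] or e["y1"], {"y1"}),
    (lambda e: not e["y1"] or e["y2"], {"y1", "y2"}),
    (lambda e: not e["y2"] or e["y3"], {"y2", "y3"}),
    (lambda e: not e["y3"] or e["x2"], {"y3"}),
]
Y = {"y1", "y2", "y3"}
phi = dac(terms, Y)
reference = exists(conj(f for f, _ in terms), Y)
for a, b in product([False, True], repeat=2):
    print(a, b, phi({"x1": a, "x2": b}))
```

For the chain above, the first split yields $Y_g = \{y_1\}$, $Y_h = \{y_3\}$, and $Y_z = \{y_2\}$, so two of the three temporary variables are quantified before the two halves are combined.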
Divide
As outlined in Eq. (10), determining a good partition is crucial for the effectiveness of the proposed early quantification approach. This partitioning is performed by the divide function. This function aims to find a set of candidate partitions and determines the best partitioning with respect to Eq. (10). To this end, the divide algorithm first performs an ordering of the nodes Y to derive an ordered set $P = (y_1 < \ldots < y_i < \ldots < y_{|Y|})$, cf. Alg. 2.

[Algorithm 2, surviving lines of the listing: 3: Select y ∈ Y \ P with max(w(y, P)) 4: P := P ∪ {y} 5: end while 6: return P]

The ordering algorithm starts with an empty ordered set P and generates the required dependency graph, cf. lines 0 and 1. It iteratively adds new nodes to the set with the next node to add being the one with the maximum weight, cf. lines 3 and 4. The weight w of a node is defined as follows:
$$w(y, P) = \begin{cases} \ldots, & \text{if } w_o(y) > 0 \\ w_i(y), & \text{else} \end{cases} \tag{12}$$

with
Given the ordered set P, the divide function determines the best cut with respect to Eq. (10) by a linear search in the ordered set that results in two sets $G = \{y_1, \ldots, y_i\}$ and $H = \{y_{i+1}, \ldots, y_{|Y|}\}$. Given G and H, the desired subsets $Y_g$ and $Y_h$ are derived by
and returned by the divide function.
SAT-ASSISTED SIMULATION
This section proposes a novel SAT-assisted Monte-Carlo simulation technique. With growing system complexity, analytical methods may become impracticable or even unusable because the final BDDs exhaust available memory as well. An alternative category of reliability analysis techniques that can target more complex systems is based on simulation, i.e., Monte-Carlo simulation. Simulation has the drawback that its accuracy depends on the number of simulation runs that can be performed. On the other hand, the memory needed by the introduced formal methods can be traded for runtime and, thus, the problem of outsized BDDs is avoided. Using a state-of-the-art SAT-solver based on the DPLL backtracking algorithm [4], a very compact data structure, i.e., a Conjunctive Normal Form (CNF), is tested for satisfiability to determine the current system state whenever needed by the simulation. This makes it possible to carry out hundreds of simulation runs in a very short time, even for large and complex systems where known exact methods fail. If the system function is not directly given in CNF like in [8, 6], several efficient techniques are known to transform any Boolean function into CNF, cf. [18].
The iterative SAT-assisted Monte-Carlo simulation approach works as follows: First, a set $\Gamma_\varphi$ of N times-to-failure is determined for the overall system encoded in ϕ(x) by
Algorithm 3 mcs(ϕ(x)): SAT-assisted Monte-Carlo simulation.
Require: ϕ(x)
Ensure: Γ is an ordered set
1: Γ := timesToFailure(X)
2: for all (x, γ) ∈ Γ do
3: ϕ(x) := ϕ(x) ∧ ¬x
4: if ¬sat(ϕ(x)) then
5: return γ // time to failure
6: end if
7: end for

The function $mcs : \{0,1\}^{\{0,1\}^n} \to \mathbb{R}^+$ carries out one simulation run based on a given structure function ϕ(x). The function mcs is outlined in Alg. 3. The algorithm first computes a set Γ of times-to-failure in ascending order that contains a specific time-to-failure γ for each component x ∈ X of the structure function using the function $timesToFailure : 2^X \to 2^{(X \times \mathbb{R}^+)}$, cf. line 1. The time-to-failure of each system component x is determined by using inverse transform sampling based on the reliability function $R_x(\tau)$ of the component:

$$\gamma = R_x^{-1}(r)$$
with $r \in \mathbb{R}_{[0,1]}$ being a random number. For each element (x, τ) ∈ Γ and with respect to the order of Γ, the structure function ϕ is combined with a negated component variable ¬x using conjunction, cf. line 3. This corresponds to the component x being failed. The SAT-solver is invoked using the function $sat : \{0,1\}^{\{0,1\}^n} \to \{0,1\}$ that returns true if the structure function can be satisfied, i.e., if the overall system works properly. If the overall system failed, cf. line 4, the time-to-failure of the component that failed last corresponds to the overall system time-to-failure and is returned in line 5.
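One simulation run as described above can be sketched in Python. The sketch makes several assumptions that are not taken from the paper: the structure function is given as a CNF with DIMACS-style integer literals, a tiny recursive backtracking check stands in for a real DPLL-based SAT-solver, and every component has an exponential reliability function $R_x(\tau) = e^{-\lambda_x \tau}$, for which inverse transform sampling yields $\gamma = -\ln(r)/\lambda_x$. The example system and all names are hypothetical.

```python
import math
import random

def sat(clauses):
    """Tiny backtracking satisfiability check (stand-in for a SAT-solver).
    clauses: list of frozensets of non-zero integer literals."""
    if not clauses:
        return True
    if any(len(c) == 0 for c in clauses):
        return False
    lit = next(iter(clauses[0]))
    for choice in (lit, -lit):
        # Drop satisfied clauses, remove the falsified literal elsewhere.
        reduced = [c - {-choice} for c in clauses if choice not in c]
        if sat(reduced):
            return True
    return False

def times_to_failure(rates, rng):
    """Sample one time-to-failure per component, sorted ascending.
    Assumes R_x(t) = exp(-rate * t); 1 - random() lies in (0, 1]."""
    samples = [(-math.log(1.0 - rng.random()) / lam, x)
               for x, lam in rates.items()]
    return sorted(samples)

def mcs(clauses, rates, rng):
    """One simulation run: fail components in order until phi is unsat."""
    phi = list(clauses)
    for gamma, x in times_to_failure(rates, rng):
        phi.append(frozenset([-x]))      # component x has failed
        if not sat(phi):
            return gamma                 # system time-to-failure
    raise ValueError("system works even with all components failed")

# Hypothetical system: variables 1 and 2 are two processors, variable 3 is a
# temporary binding variable (task bound to processor 1 if true, else 2).
clauses = [frozenset([-3, 1]), frozenset([3, 2])]
rates = {1: 1.0, 2: 1.0}                 # assumed failure rates

rng = random.Random(0)
gammas = [mcs(clauses, rates, rng) for _ in range(2000)]
print(sum(gammas) / len(gammas))         # sample mean of the lifetimes
```

Note that the temporary binding variable never has to be quantified explicitly: the satisfiability check searches for a feasible binding on each call, which is why the CNF can stay compact.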
Given the times-to-failure Γϕ, the desired reliability function of the system as given in Eq. (1), is approximated as follows:
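The approximation formula itself is not reproduced in this text. A natural estimator, assumed here, is the empirical survival function: the fraction of simulated system lifetimes that exceed τ. The Python sketch below implements this estimator together with the corresponding MTTF estimate, which for non-negative samples equals the sample mean because the area under the empirical step function is the average lifetime. The function names and the sample values are illustrative.

```python
from bisect import bisect_right

def reliability_estimate(times):
    """Return R(tau) ~ |{gamma in times : gamma > tau}| / N."""
    ordered = sorted(times)
    n = len(ordered)
    def R(tau):
        # bisect_right counts the samples that are <= tau.
        return (n - bisect_right(ordered, tau)) / n
    return R

def mttf_estimate(times):
    """Integral of the empirical reliability function (sample mean)."""
    return sum(times) / len(times)

# Hypothetical times-to-failure from four simulation runs.
times = [2.0, 5.0, 1.0, 4.0]
R = reliability_estimate(times)
print(R(0.0), R(2.0), R(4.5), R(10.0))   # 1.0 0.5 0.25 0.0
print(mttf_estimate(times))              # 3.0
```

Sorting once and using binary search keeps each evaluation of the estimate logarithmic in the number of runs, which matters when the function is sampled at many points.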
EXPERIMENTAL RESULTS
To give evidence of the effectiveness of the proposed approaches, a comparison to state-of-the-art reliability analysis approaches is given: (1) The proposed early quantification method (EQ) and the SAT-assisted simulation approach (SAT) are compared to a reliability analysis without early quantification like [5] (COMMON) and the partitioning technique presented in [6] (GLRHT08). As described in Sec. 4.1, GLRHT08 corresponds to the preprocessing proposed in the work at hand and is compared based on this preprocessing in order to be independent of the system model used in [6]. (2) The SAT-assisted Monte-Carlo simulation is compared to the reliability analysis presented in [8] that does not use temporary variables such that a comparison to early quantification techniques is impossible. For comparison, a similar algorithm (IH08*) is used that is based on BDDs instead of the so-called TPDDs proposed in [8] that, however, leads to a data structure of comparable size and complexity.

Testsuite. For the comparison of the proposed and state-of-the-art approaches, a testsuite containing various system-level design specifications is arranged: The testsuite contains both real-world and synthetic test cases. The real-world examples exhibit a certain structure in both application and architecture that is the result of the structured development process. This structure often allows for a better partitioning and an easier analysis. On the other hand, the synthetic examples lack that structure because these examples are randomly generated. This randomness commonly makes analysis harder and often leads to significantly larger data structures when compared to structured test cases of equal size. 7 real-world specifications from the data-streaming as well as the automotive domain are chosen. The complexity of the real-world test cases ranges from about 50 tasks with 30 available resources up to about 250 tasks with about 1000 available resources.
Moreover, 8 synthetic test cases are generated. The complexity of the synthetic test cases ranges from 50 tasks with 25 available resources to 150 tasks with 75 available resources. For each of the 15 test cases, 10 implementations of different complexities with respect to the BDD sizes are generated: Using very few resources with marginal task redundancy creates implementations of low complexity, whereas using many resources with a high amount of task redundancy results in implementations of high complexity. The result is a testsuite of 150 test cases that covers a broad variety of test instances ranging from small examples up to highly-complex real-world test cases that also max out state-of-the-art design space exploration and performance evaluation approaches, cf. [14] . The experiments are carried out on an Intel Pentium 4 3.00 GHz Dual Core machine with 1.5 GB RAM. The number of simulation runs for the SAT approach is set to 2000.
The results for the comparison of SAT, EQ, GLRHT08, and COMMON are depicted in Fig. 4.

Runtime. The scatter plots show that in very few cases, the overhead resulting from the proposed early quantification can slightly increase runtime $\tau_{RT}$. However, for more complex systems to analyze, the runtime of EQ is significantly lower than the runtime of both former approaches COMMON and GLRHT08. The runtime of the proposed SAT approach is significantly larger for all test cases where the exact methods were able to deliver feasible results. However, the advantage of SAT lies in its scalability, discussed in the following.

Scalability. Since the paper focuses on the scalability of the proposed methods, the number of test cases where no feasible analysis was possible due to outsized BDDs is taken as a measure for scalability. The COMMON approach performs worst and fails in 95 test cases. GLRHT08 performs better, but still fails in 35 cases. The proposed EQ approach fails only in 18 cases and never failed where one of the known approaches succeeded. Thus, the proposed early quantification approach outperforms known approaches on reasonably complex test cases. However, only the proposed SAT approach was able to solve each test case. Note that one test case from the automotive area was one order of magnitude larger with respect to the number of components compared to all other test cases. Thus, the proposed SAT-assisted simulation approach has the best performance in terms of scalability. However, it should be replaced by the proposed early quantification approach whenever possible to take advantage of the lower runtime and exact results of EQ.

Accuracy. Since EQ, GLRHT08, and COMMON are exact approaches with the SAT approach only being an approximation of the reliability function of the system, the relative error in percent is determined based on the 132 test cases where an exact reliability function could be derived.
The relative error is determined based on the Mean-Time-To-Failure $MTTF = \int_0^\infty R(\tau)\,d\tau$ that is derived from the approximated and exact reliability functions. The relative error for 2000 simulation runs per test case is 1.51% with a standard deviation of 1.02%. For a reliability analysis at system-level, the accuracy can be considered very good, especially with respect to the SAT approach having its main application where known exact methods fail. For comparison, the relative error for 500 runs is 4.01% with a standard deviation of 2.08% while the relative error for 4000 runs is 1.18% with a standard deviation of 0.94%.
The results of the comparison of SAT and IH08* can be found in Fig. 5. The memory-consuming BDDs that result from the analysis of both transient and permanent faults highlight the scalability of the SAT approach. While SAT was able to analyze every test case successfully, IH08* was able to deliver feasible results in 12 test cases only and failed for 138 test cases. This shows the ability of the SAT-assisted simulation to increase scalability and its good performance, especially when more sophisticated analysis techniques lead to outsized BDDs already for relatively small test cases.
CONCLUSION
This paper proposes two approaches to tackle the problem of memory-exhausting Binary Decision Diagrams (BDDs) in state-of-the-art automatic system-level reliability analysis of embedded systems. The contributions of this paper are (1) a symbolic early quantification technique that keeps the size of the BDDs during construction small and (2) a SAT-assisted simulation approach that delivers appropriate results for large and complex systems where the final BDDs used in exact approaches exhaust available memory. A testsuite of 150 test cases consisting of synthetic examples as well as problem instances from the data-streaming and automotive domain has been used to show the scalability of the proposed approaches. The presented early quantification approach outperforms known exact methods and decreased the number of test cases where the BDDs exhausted available memory by nearly 50% compared to the best known approach. The SAT-assisted simulation was capable of analyzing all given test cases at the cost of an increased runtime. Thus, the approaches presented in the work at hand enable the application of reliability analysis techniques to problems of industrial relevance where known approaches from literature failed due to the problem size.
In the future, other early quantification approaches shall be applied to formal reliability analysis and compared to the proposed approach. Moreover, the applicability of the SAT-assisted simulation approach to take the repair of resources into account shall be investigated.
