Introduction
Now that computers and communication systems are proliferating in all kinds of devices and home appliances, high dependability is no longer restricted to systems used in traditional safety- or mission-critical applications, such as space and aircraft or (nuclear) power control systems. An important difference with these traditional systems, however, is that although high dependability is a key concern, achieving it should be affordable in terms of costs. Hence, high dependability must be achieved as a by-product of a sound design and implementation trajectory, at almost no additional cost. Therefore, dependability evaluation techniques are being integrated in design frameworks, to enable a cost-efficient comparison of design alternatives with respect to the dependability requirements.
Although the standard theory of reliability engineering has been around for many years now [20], the actual use of these methods during the design of computer and communication systems is far less common. Nevertheless, a wide variety of modeling approaches has been developed for evaluating system dependability. We categorize them in three classes: (1) General-purpose models, such as CTMCs, stochastic Petri nets (SPNs) [3] and their extensions; stochastic process algebras (SPAs) [14, 15]; interactive Markov chains (IMCs) [13], Input/Output IMCs (I/O-IMCs) [5], and stochastic activity networks (SANs) as used in UltraSAN and Möbius [19]. These approaches are general-purpose, serving the specification and validation of a wide variety of quantitative properties of computer and communication systems, and certainly not of dependability properties only. (2) In contrast, several dependability-specific approaches have also been developed, such as reliability block diagrams (RBDs), the System Availability Estimator (SAVE) language [12], dynamic RBDs (DRBDs) [9]; dynamic fault trees (DFTs) [10] and extended fault trees (eFTs) [7]; OpenSESAME [21], and TANGRAM [8]. (3) Finally, for some architectural (design) languages specific extensions have been developed to allow for dependability analysis, most notably, the error annex of the architectural description language AADL [2], and the UML dependability profile [17].
* This research has been partially funded by the Netherlands Organization for Scientific Research (NWO) under FOCUS/BRICKS grant numbers 642.000.505 (MOQS) and 542.000.504 (VeriGem); by the EU under grant numbers IST-004527 (ARTIST2); and by the DFG/NWO bilateral cooperation programme under project number DN 62-600 (VOSS2).
1-4244-2398-9/08/$20.00 ©2008 IEEE
We have identified five criteria that a good dependability formalism should, in our opinion, satisfy: (1) Modeling effort: how easy is it to model a system and its dependability aspects? (2) Expressiveness: what features (repair, spare management, different failure modes, etc.) can be modeled and can new ones easily be added? (3) Formal semantics: is the meaning of the models unambiguously clear? (4) Compositionality: we distinguish between (4a) compositional modeling, meaning that a model can be created by composing smaller submodels, and (4b) compositional state space generation and reduction. Compositional state space generation means that the state space of the entire model is constructed out of the state spaces of its constituent subcomponents. Compositional state space reduction means that the global state space of a multi-component system is obtained by repeated composition and reduction (e.g., by bisimulation reduction). (5) Tool support: are tools available for automatic analysis?
The general-purpose formalisms, specifying system models in terms of states and transitions, have the advantage of being very flexible (hence, expressive) and precise. With these formalisms, however, it is often difficult to specify dependability models, since they do not provide any dependability-specific constructs, which in turn may lead to specifications that are hard to understand and thus potentially subject to modeling errors. We also found that dependability-specific approaches score relatively low on expressiveness; although each of them incorporates certain dependability constructs, none of them includes them all. Although we agree that it is impossible to include all possible features, we do think that a modeling approach should be extensible (cf. Section 3.6), so as to be able to accommodate any needs, including future ones. Architectural languages require limited modeling effort, since they annotate architectural models (which play an important role throughout the design). However, these languages, as we know them, lack a formal and compositional semantics and tool support for automatic dependability evaluation, although recently some work in this direction has been done [18].
In this paper, we therefore propose a new, formally well-rooted and extensible framework for dependability evaluation that satisfies the five criteria we have discussed above: Arcade (for architectural dependability evaluation). In addition, we define our framework in an architectural style, i.e., we define a system model in terms of components or entities that (directly) map to actual physical/logical system components. In fact, our framework is ultimately intended to be incorporated into an architectural design language. Arcade defines a system as a set of interacting components, where each component is provided with a set of operational/failure modes, time-to-failure/repair distributions, and failure/repair dependencies. Arcade models have a semantics in terms of I/O-IMCs, thus pinning down their interpretation in an unambiguous way. Moreover, the compositional state space generation and reduction technique for I/O-IMCs also enables an efficient analysis of very large Arcade models.
The paper is further structured as follows. In Section 2 we provide background on IMCs and I/O-IMCs, the underlying semantical models used in the remainder of this paper. Section 3 introduces the Arcade modeling approach. Section 4 describes the currently employed tool-chain to evaluate Arcade models, whereas Section 5 reports on two case studies. Section 6 concludes the paper.
Input/Output Interactive Markov Chains
Input/Output interactive Markov chains (I/O-IMCs) [5] are a combination of Input/Output automata (I/O-automata) [16] and interactive Markov chains (IMCs) [13]. I/O-IMCs distinguish two types of transitions: (1) interactive transitions, labeled with actions (also called signals); (2) Markovian transitions, labeled with rates λ, indicating that the transition can only be taken after a delay that is governed by an exponential distribution with parameter λ. [...] Two I/O-IMCs synchronize if either (1) they are both ready to accept the same input action or (2) one is ready to output an action a! and the other is ready to receive that same action (i.e., has input action a?).
¹ In the sequel we often omit these self-loops for the sake of clarity and simplicity of the I/O-IMC representation.
I/O-IMCs can be combined with a parallel composition operator "||" to build larger I/O-IMCs out of smaller ones. The behavior of P = Q||R, i.e., the parallel composition of I/O-IMCs Q and R, is the joint behavior of its constituent I/O-IMCs and can be described as follows:
1. If an action does not require synchronization, then Q and R can evolve independently, i.e., if Q (R) can make any transition (interactive or Markovian) and behaves afterwards as Q' (R'), the same behavior is possible in the parallel context, i.e., Q||R can evolve to Q'||R (Q||R').
2. If an action of an interactive transition requires synchronization, then both I/O-IMCs Q and R must be able to perform that action at the same time, i.e., Q||R evolves simultaneously into Q'||R'. Note that when an output and an input action synchronize, the result is an output action.
As in process algebras, the hiding operator hide A in P makes output actions in a set A internal, such that no further synchronization is possible over actions in A. More details on the I/O-IMC formalism can be found in [5].
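The two composition rules can be made concrete with a small sketch. The Python code below is only an illustration under simplifying assumptions, not the formal definition from [5]: the class `IOIMC`, the function `compose`, and the example component and action names are hypothetical, internal actions and the hiding operator are left out, and the full product state space is built without pruning unreachable states. Input-enabledness is modeled by treating a missing input transition as an implicit self-loop.

```python
from itertools import product


class IOIMC:
    """A tiny I/O-IMC: interactive transitions carry actions, Markovian ones carry rates."""

    def __init__(self, states, inputs, outputs, interactive, markovian):
        self.states = set(states)
        self.inputs = set(inputs)            # actions that can be received (a?)
        self.outputs = set(outputs)          # actions that can be emitted (a!)
        self.interactive = set(interactive)  # triples (source, action, target)
        self.markovian = list(markovian)     # triples (source, rate, target)

    def targets(self, state, action):
        """Successors of `state` on `action`; a missing input is an implicit self-loop."""
        found = {t for (s, a, t) in self.interactive if s == state and a == action}
        if not found and action in self.inputs:
            return {state}
        return found


def compose(q, r):
    """Parallel composition Q || R over the full product of the two state spaces."""
    shared = (q.inputs | q.outputs) & (r.inputs | r.outputs)
    outputs = q.outputs | r.outputs
    inputs = (q.inputs | r.inputs) - outputs  # output + input synchronize to an output
    interactive, markovian = set(), []
    for sq, sr in product(q.states, r.states):
        # Rule 1: non-shared actions and all Markovian delays interleave.
        for (s, a, t) in q.interactive:
            if s == sq and a not in shared:
                interactive.add(((sq, sr), a, (t, sr)))
        for (s, a, t) in r.interactive:
            if s == sr and a not in shared:
                interactive.add(((sq, sr), a, (sq, t)))
        markovian += [((sq, sr), rate, (t, sr)) for (s, rate, t) in q.markovian if s == sq]
        markovian += [((sq, sr), rate, (sq, t)) for (s, rate, t) in r.markovian if s == sr]
        # Rule 2: shared actions are taken by both components at the same time.
        for a in shared:
            for tq, tr in product(q.targets(sq, a), r.targets(sr, a)):
                interactive.add(((sq, sr), a, (tq, tr)))
    return IOIMC(set(product(q.states, r.states)), inputs, outputs, interactive, markovian)


# Hypothetical example: a component that fails after an exponential delay and
# announces it with failed!, composed with a monitor that listens for failed?.
component = IOIMC({"up", "announce", "down"}, set(), {"failed"},
                  {("announce", "failed", "down")}, [("up", 0.01, "announce")])
monitor = IOIMC({"idle", "alarm"}, {"failed"}, set(),
                {("idle", "failed", "alarm")}, [])
system = compose(component, monitor)
```

In the composed model the action `failed` remains an output, which mirrors the remark that an output synchronizing with an input yields an output.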
Arcade: Semantics and Syntax
This section describes the semantics and syntax of Arcade. We have identified three main building blocks with which we can, in a modular fashion, construct a system model: (1) a Basic Component (BC), (2) a Repair Unit (RU), and (3) a Spare Management Unit (SMU). These building blocks interact with each other by sending and receiving input/output actions. The semantics of these building blocks and their interactions is based on the I/O-IMC framework. In the following, we describe each of these building blocks.
Basic component
[...] (2) defining the BC's failure model. In theory, there could be a different failure model (failure behavior) for each of the BC's operational modes; however, for simplicity we will restrict these differences.
Operational modes
A basic component can be in various operational modes (OM). Examples of operational modes include active versus inactive, which are two typical modes of operation when a component is used as a primary or as a spare. We define operational modes in terms of groups of operational modes. A group of operational modes defines a set of mutually exclusive operational modes, e.g., active mode versus inactive mode. At the I/O-IMC level, a mode corresponds to an operational state. Thus, each OM group defines a set of operational states. If a BC has multiple OM groups, then the BC operational states consist of the cross-product of the operational states of the different OM groups. For example, let us assume a BC has two OM groups: inactive/active and on/off. In this case, the BC has four operational states, namely: (active,on), (active,off), (inactive,on), and (inactive,off). Switching from one operational mode to another needs to be defined as an input action at the I/O-IMC level. The mode switching or transition is thus triggered by some external event. Fig. 2 shows, for the example, the two OM groups (along with the mode switches) of the BC and the resulting operational states.
² For readability, all input self loops have been omitted.
A user can essentially specify any number of OM groups as long as for each group the mode switches are clearly defined. At this point, we have identified a predefined set of OM groups from which a user can choose:
1. Active/inactive: As explained earlier, this OM group allows the modeling of a component acting as a spare (and thus typically having a reduced failure rate while in inactive mode). The activation and deactivation signals (causing the mode switching) are managed by a spare management unit (cf. Section 3.3).
2. On/off: This group allows one to model, for instance, the fact that if the power fails, then the BC is shut down and can no longer fail (i.e., its failure rate equals zero).
3. Accessible/inaccessible: This group is used to model a non-destructive functional dependency (as in, e.g., [10, 21]); for example, a database becomes inaccessible if the bus linking to it fails. Switching from accessible to inaccessible does not mean that the component has failed (hence, no repair is initiated). However, to the environment, i.e., the rest of the system, an inaccessible component might or might not be, depending on the system at hand, viewed as a failed component. While defining a BC, the user has to specify if inaccessibility is seen by the environment as a failure or not (cf. Section 3.5).
4. Normal/degraded: This group is useful to model degraded modes of operation. A prime example is load sharing, where a component switches to a degraded operational mode (and consequently exhibits an increased failure rate) in case the component with which it is sharing the load fails. It is of course possible to model a group with more than two modes, e.g., normal/degraded 1/degraded 2/.../degraded n.
Whenever a mode switch (except for the active/inactive OM group) occurs, this is due to failure or repair events of other components (this is further explained in Section 3.5).
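The cross-product construction of operational states from OM groups is easy to check mechanically. The following Python sketch is illustrative only; the function name `operational_states` and the representation of a group as a tuple of mode names are assumptions, not Arcade syntax.

```python
from itertools import product


def operational_states(om_groups):
    """Cross-product of the modes of each OM group, in the order the groups are listed."""
    return list(product(*om_groups))


# The two OM groups from the example in the text.
groups = [("active", "inactive"), ("on", "off")]
states = operational_states(groups)
```

For the two groups of the example this yields the four states (active,on), (active,off), (inactive,on) and (inactive,off); adding a third two-mode group would double the count to eight.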
Failure model
We attach a failure model to each BC operational state. For simplicity and to keep the framework well-structured, the failure model of each operational state is essentially the same except for possibly different values of Markovian rates.
The failure model describes how a BC fails, i.e., how it moves from an operational (or up) state to a failed (or down) state and vice versa. We distinguish two ways in which a component can fail: (1) an inherent failure [...] specified as a model and no destructive functional dependency⁵. Note the different rates used in the model. There is, of course, a syntactic way for specifying these rates (cf. Section 3.5).
⁴ With their sum being 1.
⁵ For readability, we omitted the transitions from the four unnumbered states which are similar to the ones for states 1 through 4.
Repair unit
[...] of all rates related to repair times. In short, the RU listens to failure signals output by one or many components, picks a component (given some policy) and initiates a repair operation according to a specific repair rate, and finally outputs the appropriate repaired! signal when the repair is finished.
This procedure is then repeated. We allow at most one RU per component.
So far we have considered the following repair configurations/strategies: (1) dedicated repair, where each component has its own RU, (2) first come first served (FCFS), (3) [...] with two repair rates, μ_m and μ_df, respectively [...]
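The listen, pick, repair, signal cycle of a repair unit with the FCFS strategy can be sketched as a queue. This Python fragment is a simplified illustration, not the RU I/O-IMC model: the class `FcfsRepairUnit`, its method names, and the `<component>.repaired!` signal naming are hypothetical, and the exponentially distributed repair delay is abstracted into an explicit call to `on_repair_finished`.

```python
from collections import deque


class FcfsRepairUnit:
    """Repairs failed components one at a time, in order of failure arrival."""

    def __init__(self):
        self.queue = deque()  # failed components, oldest failure first

    def on_failed(self, component):
        """React to a failure signal; a component is queued at most once."""
        if component not in self.queue:
            self.queue.append(component)

    def in_repair(self):
        """The component currently being repaired, or None if the RU is idle."""
        return self.queue[0] if self.queue else None

    def on_repair_finished(self):
        """Complete the current repair and return the signal to emit."""
        component = self.queue.popleft()
        return f"{component}.repaired!"


ru = FcfsRepairUnit()
ru.on_failed("pump2")
ru.on_failed("pump1")
first_signal = ru.on_repair_finished()
```

A dedicated repair strategy corresponds to the special case in which each unit only ever receives failure signals from a single component.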
Spare management unit
The spare management unit (SMU) handles the activation and deactivation of spare components. Two configurations are possible at this point:
1. One primary and one spare: In this configuration, the assumption⁶ is that the primary component is always in active mode, and thus always providing the service whenever it is operational. In fact, the primary component does not have an inactive mode per se and is therefore never activated or deactivated by the SMU.
When the primary fails, the SMU activates the spare component, which takes over from the primary. As soon as the primary is up again, the spare is deactivated and the primary resumes operation. The I/O-IMC model of the SMU is shown in Fig. 8.
2. One primary and two or more spares: This configuration can be modeled based on the previous configuration; however, due to lack of space it will not be further discussed here.
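The behavior of the first configuration (one primary, one spare) amounts to a two-state controller. The Python sketch below is an informal rendering of that description, not the I/O-IMC of Fig. 8; the class `SpareManagementUnit`, its method names and the returned signal strings are assumptions made for illustration.

```python
class SpareManagementUnit:
    """One primary, one spare: the spare is active exactly while the primary is down."""

    def __init__(self):
        self.spare_active = False

    def on_primary_failed(self):
        """Primary failure: activate the spare unless it is already active."""
        if self.spare_active:
            return None
        self.spare_active = True
        return "activate_spare!"

    def on_primary_up(self):
        """Primary restored: deactivate the spare so the primary resumes operation."""
        if not self.spare_active:
            return None
        self.spare_active = False
        return "deactivate_spare!"


smu = SpareManagementUnit()
signals = [smu.on_primary_failed(), smu.on_primary_failed(), smu.on_primary_up()]
```

Returning `None` for a repeated signal plays the role of the input self-loops that are omitted from the figures.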
System failure evaluation
Once all the basic components and units have been defined along with their interactions and dependencies, we need to specify the condition under which the whole system is failed or operational. We chose a fault tree representation (i.e., an AND/OR expression whose literals are failure modes of the BCs)⁷ as the system evaluation criterion⁸. A fault tree also has a corresponding I/O-IMC model [6]. Thus, the entire system failure/operation is represented as an I/O-IMC. A simple example would be a system comprised of two redundant processors; the system fails if both processors fail. In this case, the whole system failure/operation would be modeled by a fault tree consisting of a repairable AND gate with the two processors as inputs.
The repairable AND gate represents the overall system failure/operation and has a corresponding I/O-IMC model [6].
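To illustrate what such an AND/OR expression states, the Python sketch below evaluates a failure condition against the set of components that are currently down. It only captures the static Boolean condition, not the I/O-IMC semantics of a repairable gate from [6]; the helper names `AND`, `OR` and `failed` and the component names `P1` and `P2` are hypothetical.

```python
def failed(name):
    """Literal: true when the named component is in the set of failed components."""
    return lambda down: name in down


def AND(*inputs):
    """Gate that is failed only when all of its inputs are failed."""
    return lambda down: all(gate(down) for gate in inputs)


def OR(*inputs):
    """Gate that is failed as soon as one of its inputs is failed."""
    return lambda down: any(gate(down) for gate in inputs)


# Two redundant processors: the system fails only if both have failed.
system_down = AND(failed("P1"), failed("P2"))
```

Because the condition is re-evaluated on the current set of failed components, a repair of either processor makes the expression false again, which is the intuition behind a repairable gate.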
⁶ Other assumptions, e.g., treating both components symmetrically, are possible at the cost of complicating the SMU I/O-IMC model.
⁷ We can also use the K/M gate as a shorthand notation.
⁸ We can also consider adding the Priority-AND gate [10].
[...] Line (4) specifies if the inaccessibility of the BC is seen as a failure by the environment (cf. Section 3.1.1). Line (7) defines the time-to-failure distribution for each operational state⁹ (e.g., in Fig. 5, the BC has four operational states, therefore the user needs to provide four distributions). Line (8) defines the n probabilities corresponding to n failure modes. Line (9) defines the time-to-repair distributions for each of the n failure modes and the distribution associated to the destructive functional dependency. Finally, line (10) specifies the condition under which the BC fails due to a destructive functional dependency.
All the distributions defined in lines (7) and (9) can, in general, be any phase-type distribution (see an example in Section 5).
⁹ The order in which the OM groups are listed determines which distribution matches which operational state. The same goes for the repair distributions w.r.t. failure modes.
System failure evaluation syntax
(1) SYSTEM DOWN: AND/OR expression
Line (1) defines the condition under which the system is failed (cf. Section 3.4 for more details). The elementary conditions under which the system fails are expressed in terms of the failure modes that are defined for the component. If more than one failure mode is defined for a component, then the user has to specify the failure mode that is relevant for the system failure evaluation. For example, if component X has two failure modes, and mode 2 is relevant for the evaluation, then the user writes X.down.m2 to state that mode 2 is the relevant failure mode. If there is only one failure mode, we can simply write X.down.
Extensibility
Arcade is extensible in the sense that it is easy to incorporate new or additional dependability constructs the user may think are important for his/her needs. All that has to be done is to provide the syntax, i.e., the Arcade specification of that additional construct, and its semantics in terms of an I/O-IMC model. State space generation, reduction and analysis do not have to be changed at all.
As an example, a simple failover time (i.e., the time it takes for an SMU to detect the primary failure and activate the spare component is exponentially distributed rather than instantaneous) can be added to the framework in the following way: First, Arcade's syntax is extended (here for an SMU with one primary and one spare):
(1) SMU: Name
(2) COMPONENTS: primary, sp 1
(3) FAILOVER-TIME: exp(δ)
Secondly, the I/O-IMC model has to be defined (Fig. 9), which is an extension of the semantic model of an SMU (Fig. 8).
System dependability evaluation
To evaluate Arcade models, we use a three-step approach, similar to the one in [5], using the CADP tool set [11].
First, we translate (according to the models defined in Section 3) all basic components, spare management units, repair units, and system failure evaluation models into their underlying I/O-IMCs. This translation step has not been automated yet.
The second step is to combine these models to obtain the overall system model. To this end, we use the Composer tool [5], which incrementally composes (using the well-defined parallel composition operator) the I/O-IMC models. Each composition step is followed by an aggregation (i.e., state minimization or reduction) step. The order in which the I/O-IMC models are composed is given by the user. This compositional aggregation approach has proved to be crucial in alleviating the state-space explosion problem. The output of the Composer tool is a single I/O-IMC, modeling the entire system. This I/O-IMC has two output signals: failed! to denote the failure and up! for the restoration of the system. Our Composer tool, which uses the CADP tool set, fully automates the composition and aggregation steps.
In a third step, we convert this system I/O-IMC into a labeled CTMC on which standard CTMC solution techniques to compute availability and reliability can be performed. This step has been automated, using the CADP tool set.
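As a rough idea of what a standard CTMC solution technique for steady-state availability looks like, the Python sketch below computes the stationary distribution of a small CTMC by uniformization followed by power iteration. It is not the CADP-based tool chain; the function `steady_state`, the three-state example (two identical components sharing one repair unit) and the rates `fail` and `repair` are assumptions chosen purely for illustration, and the chain is assumed to be irreducible.

```python
def steady_state(rates, n, iterations=2000):
    """Stationary distribution of an irreducible n-state CTMC given {(i, j): rate}."""
    exit_rate = [sum(r for (i, _), r in rates.items() if i == s) for s in range(n)]
    lam = 1.1 * max(exit_rate)  # uniformization rate, strictly above every exit rate
    pi = [1.0 / n] * n
    for _ in range(iterations):
        # One step of the uniformized chain P = I + Q / lam.
        nxt = [pi[s] * (1.0 - exit_rate[s] / lam) for s in range(n)]
        for (i, j), r in rates.items():
            nxt[j] += pi[i] * r / lam
        pi = nxt
    return pi


# State k = number of failed components (0, 1 or 2); hypothetical rates.
fail, repair = 1.0, 10.0
rates = {(0, 1): 2 * fail, (1, 2): fail, (1, 0): repair, (2, 1): repair}
pi = steady_state(rates, 3)
availability = pi[0] + pi[1]  # the system is down only when both components are down
```

For this birth-death chain the result can be cross-checked by hand: the stationary probabilities are proportional to 1, 0.2 and 0.02, so the availability equals 1 - 0.02/1.22.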
Case studies
To demonstrate the feasibility and usability of Arcade, we address two case studies from the literature. In Section 5.1 we analyze a distributed database system (DDS), which was evaluated in [19] using SANs. In Section 5.2, we analyze a cooling system of a nuclear reactor, which was evaluated in [7] using eFTs.
Figure 9. The SMU I/O-IMC model with failover time.
Distributed database architecture
This system consists of two processors, one of which is a spare; four disk controllers, divided into two sets; and 24 hard disks, divided in 6 clusters, i.e., each cluster consisting of four disks. Each controller is responsible for three disk clusters, and each of the twelve disks, which the controller set is responsible for, is accessible by any of the two controllers in the respective set. Furthermore, each processor can access each of the four disk controllers. The processors are administrated by a spare management unit and share one repair unit. For each disk controller set and disk cluster there is a separate repair unit responsible. All repair units choose the next item to be repaired according to an FCFS repair strategy.
The system is down if (at least) one of the following conditions is met: (1) all processors are down, or (2) in at least one controller set, no controller is operational, or (3) more than one disk in a cluster is down.
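The three system-down conditions can be written as a single predicate over component counts. The Python function below is an illustrative restatement of those conditions; the function name `dds_down` and its count-based parameters are assumptions and do not correspond to the Arcade SYSTEM DOWN expression used for this case study.

```python
def dds_down(processors_up, controllers_up_per_set, disks_down_per_cluster):
    """True when the distributed database system is down.

    processors_up: number of operational processors (0 to 2)
    controllers_up_per_set: operational controllers in each of the 2 sets
    disks_down_per_cluster: failed disks in each of the 6 clusters
    """
    return (processors_up == 0                                     # condition (1)
            or any(up == 0 for up in controllers_up_per_set)       # condition (2)
            or any(down > 1 for down in disks_down_per_cluster))   # condition (3)
```

Note that a single failed disk per cluster is tolerated, so up to six disks may be down simultaneously without bringing the system down, provided they are all in different clusters.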
Arcade model
The Arcade models for the components of the DDS system are fairly simple. Most components have no distinguished OMs, except the spare processor, which has OM group (inactive, active). If there are no special OMs to be considered, the line OPERATIONAL MODES can be omitted.
[...]
Using the overall CTMC we can analyze the steady-state availability (A) and reliability (R(t)) of the distributed database system. Table 1 shows the results of this analysis compared to the SAN-based results in [19]. Note that the reliability results in this table are based on the definition of reliability used in [19], i.e., the probability of having no system failures within a certain mission time assuming that no component is ever repaired. Because of the discrepancy in reliability results we have also verified our results for the DDS system with the DFT tool Galileo [ 1].¹¹
Table 1. Dependability analysis for DDS
Reactor Cooling System
This case study was described in [22, 7]. In [7], the system was modeled using the eFT approach.
The reactor cooling system (RCS) consists of a reactor, two parallel pump lines, a heat exchanger and a bypass system for the heat exchanger. Each of the two pump lines consists of a single pump, a single filter and a number of control valves. The heat exchanging unit consists of the heat exchanger itself, a number of valves and one filter. The bypass system can be opened and closed by means of two motor driven valves.
¹¹ It is possible to use DFTs here because we do not consider repair.
All components, except the reactor itself whose failure behavior is not considered here, are subject to failures and are repairable. The filters and the heat exchanger are either operational or failed. The valves can fail in two different modes, either stuck-open or stuck-closed. The pumps have two different operational modes and one failure mode. They are either fully operational, or in a degraded operational mode, which is reached if one of the two pumps fails. In degraded mode, the remaining pump will fail with a higher failure rate. We refer to this operational mode as degraded mode. This is indeed a typical load sharing situation. Except for the two pumps, which share a single repair unit with an FCFS repair strategy, each component has its own dedicated repair unit.
The system is down if either none of the two pump lines is operational, or both the heat exchanger and the bypass system are not operational. A pump line is defective if one of its components is defective, where for the valves, only the stuck-closed case is considered to be a relevant failure. The heat exchanging unit is defective if the heat exchanger itself fails or one of its accompanying filters or valves fails. Finally, the bypass line fails if one of the motor driven valves is stuck-closed.
[...]
3. The filters can be either free or blocked, where the latter state is the failure case. In Arcade terminology this means to be either "up" or "down"¹³.
COMPONENT FP1
TIME TO FAILURE exp(2.19 · 10⁻⁶)
TIME TO REPAIR exp(0.1)
4. The heat exchanger can be either up or down; it fails with rate 1.14 · 10⁻⁶. The repair rate is again 0.1.
COMPONENT HX
TIME TO FAILURE exp(1.14 · 10⁻⁶)
TIME TO REPAIR exp(0.1)
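As a quick plausibility check on the rates given for the filter and the heat exchanger, one can compute the steady-state unavailability of a single repairable component in isolation, which for exponential failure rate λ and repair rate μ equals λ/(λ+μ). The Python sketch below does this; the function name is hypothetical, and the resulting numbers describe each component with its own dedicated repair unit, not the system-level results of this case study.

```python
def steady_state_unavailability(failure_rate, repair_rate):
    """Long-run fraction of time a single repairable component spends in the down state."""
    return failure_rate / (failure_rate + repair_rate)


# Rates taken from the component definitions of the filter FP1 and the heat exchanger HX.
u_fp1 = steady_state_unavailability(2.19e-6, 0.1)
u_hx = steady_state_unavailability(1.14e-6, 0.1)
```

Since the failure rates are about five orders of magnitude smaller than the repair rate, both values are very close to the ratio λ/μ, i.e., roughly 2.19 · 10⁻⁵ and 1.14 · 10⁻⁵.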
In [7] the repair policies are not clearly specified. From the remarks w.r.t. the repair of the pumps, we conclude that there are dedicated repair units for all elements; thus, we will assign to each component its own repair unit, except for the pumps, which have a common repair unit. After generating the CTMC models for the pump and the heat exchanger subsystem, we could apply the technique of modularization [7] to compute the reliability and availability of the RCS. The CTMC for the pump subsystem has 10,404 states and 109,662 transitions. The CTMC for the heat exchanger subsystem (including the bypass) has 240 states and 1,668 transitions. The largest model encountered during generation had 98,056 states and 411,688 transitions. Unfortunately, in [7] no state space size was given, thus no comparison is possible in this case.
¹² The Arcade models for the remaining valves are similar.
¹³ Arcade models for other filters are similar.
For a mission time of, for example, 50 hours, the system unavailability and unreliability are 6.52100 · 10⁻¹⁰ and 52.9242 · 10⁻¹⁰, respectively. Our unavailability results coincide with the results in [7].
Summary and conclusions
In this paper we have proposed a new framework for dependability evaluation named Arcade. The framework is based on the formal and compositional I/O-IMC semantics. Moreover, its compositional aggregation technique has shown to be very effective in combating the state space explosion problem during analysis. The Arcade approach is extensible. Furthermore, we envision Arcade as a step towards a design language for large and complex systems. Indeed, the ultimate goal is to integrate Arcade in a design environment based on, e.g., AADL or UML. It is important to note that although the syntax of the Arcade language bears resemblance to SAVE, the approaches are truly different. Where in SAVE the actual semantics of the models was hidden in a software program that coded the translation from that syntax to a large (flat) Markov chain, Arcade has a formal semantical model that allows for compositional modeling and state space generation and reduction, and also facilitates the extension of the modeling language.
As for the future, we plan to work on a further automation of the tool chain, as well as to connect to design approaches based on AADL and UML. Furthermore, where we now use relatively simple fault-tree-like expressions to specify system failure (cf. Section 3.5.4), we plan to allow for CSL-type expressions [4], thus querying more complex measures than system reliability or availability.
