This version is available at https://strathprints.strath.ac.uk/59169/
2 Related Work
The design of reconfigurable systems is an ongoing research challenge. While many works have concentrated on analysing the influence of homogeneous redundancies [12-16], approaches focusing on the evaluation of heterogeneous redundancies are scarce [7, 10]. Heterogeneous redundancies can take many forms: design diversity [17], analytical redundancies [18], or redundancies arising from overlapped system functions [4]. In our approach we focus on identifying and exploiting implicit redundancy which may exist in an application. Detailed knowledge and a mathematical formulation of the system are typically needed to obtain analytical redundancy relations [18]. However, the complexity of the mathematical formulation increases with the system size, and this has led us to adopt a function-based viewpoint that uses qualitative attributes (see also Subsection 5.2). The use of functional alternatives to compensate for component failures is discussed in [19]. The authors use weighted sums to combine different attributes and compare the overall utility of alternative configurations. The shared redundancy concept is presented in [20] with the goal of reusing processing units in the presence of software component failures. The authors perform availability and cost evaluations using Fault Trees and Monte Carlo simulations. Implicit redundancies are also aligned with the goal of reusing components [21]. That paper describes an adaptation model used to specify, for each component, its implicit redundancies and quality constraints. Component Fault Trees and Markov chains are used to estimate failure probabilities. Similarly, the integrated modular avionics paradigm shares the goal of replacing software units via standardized generic hardware modules [22]. Its goal is not to use heterogeneous redundancies in highly networked scenarios, but to exploit replaceable processing units in reconfiguration.
While the influence of fault detection, reconfiguration and communication implementations on system design has been addressed for homogeneous redundancies, to the best of our knowledge, these mechanisms have been assumed ideal for heterogeneous redundancies. Evaluating the faulty behaviour of these implementations yields an approach that adheres more closely to reality and consequently provides a more accurate estimation of dependability. In D3H2, dependability is a key performance criterion when deciding between alternative reconfiguration strategies. Due to the complex, dynamic and repairable nature of the systems, we need a dependability approach which is able to specify:
(S1) Time-dependent behaviour of system configurations.
(S2) Modular or hierarchical system failure behaviour to manage the complexity of the model and be able to trace from the design model to the dependability model and vice-versa.
(S3) Repair behaviour of hardware, software and communication resources of the system.
(S4) Any cumulative distribution function for failure and repair events.
(S5) User-defined reconfiguration strategies according to the defined configuration priorities.
There is a wealth of recent development in dependability analysis from which D3H2 could benefit. Dynamic Fault Trees (DFT) extend Fault Trees to integrate system dynamics [23]. Dynamic Fault Trees have been extended to address repairable systems by embedding repair mechanisms in the failure specification logic [24]. Similarly, Dynamic Reliability Block Diagrams (DRBD) are based on the dynamic extension of Reliability Block Diagrams [25]. In DRBD each block is modelled with three possible states: operating, standby, and failed. Transitions between these states are defined with four events: wakeup, sleep, repair, and failure. For the dependability assessment the DRBD approach defines cause-effect relationships between connected blocks. HiP-HOPS [26] is a modular dependability analysis approach which integrates dynamic analysis with design optimization and safety requirement allocation using meta-heuristics. The designer makes failure annotations in the design model and HiP-HOPS synthesizes Dynamic Fault Trees used for subsequent analysis and optimisation of the system design [27, 28]. Boolean Driven Markov Processes (BDMP) [29] integrate Markov chains and Fault Trees to specify the dynamic failure behaviour. In a BDMP model, different events (or leaves) can trigger other events in the Fault Tree dynamically. The leaves are specified with predefined Markov chains.
The modular system failure specification has been addressed for different dynamic dependability models such as Dynamic Fault Trees [30, 31]. Other dependability analysis approaches integrate the modular specification logic in the dependability specification formalism through the transformation of a high-level component-based model into a low-level dependability analysis formalism for quantification. State-Event Fault Trees (SEFT) [32] combine the specification of Component Fault Trees with state-machine representations in order to specify the failure behaviour of repairable systems in a modular way. To quantify the SEFT model, it is transformed into an underlying Deterministic and Stochastic Petri net model. Similarly, Generalized Fault Trees (GFT) [33] rely on transformations to solve high-level GFT models which combine parametric and repairable DFT concepts. To quantify Generalized Fault Tree models, they are transformed into Stochastic Well-Formed Nets. Although these top-level formalisms are modular, their transformation into a low-level formalism results in a flat dependability analysis model.
Table 1 displays the analysed dynamic dependability analysis techniques and the properties they address. Most approaches in Table 1 address temporal analysis, can be applied in a modular fashion, can deal with repair, and admit any cumulative distribution function for component failures. However, their approaches to repair require users to make a priori assumptions about the repair process, and these assumptions have a static character. For instance, DFT spare gates require predefined repair priorities [24], DRBD embeds possible dependencies [25], and BDMP defines the reactivation logic for inter-dependent components with predefined trigger mechanisms [29]. Fixing elements of the repair logic, however, has its drawbacks; for instance, it is difficult, if not impossible, to represent situations where repair is dynamically decided. Although a few techniques have been extended with more flexible mechanisms (e.g., BDMP [34]), the representation and analysis of dynamic repair scenarios remains a research challenge that we try to address within D3H2.
Overview of the D3H2 Methodology
D3H2 integrates the modelling and analysis activities as shown in Figure 2. Systems are specified as a set of interacting hardware, software, and communication resources, including their interfaces and provided functionality.
Figure 2: D3H2 design methodology [4].
The main approaches integrated in the D3H2 methodology are listed below:
• The Functional Modelling Approach specifies the functional model, including system functions and related attributes such as the physical location in which these functions are performed and the list of resources necessary to perform them (see Subsection 5.1).
• The Compatibility Analysis identifies compatible implementations (i.e., redundancies) in the functional model. To use these compatible implementations, it may be necessary to aggregate additional resources and perform reallocation of new elements. Subsequently, reconfiguration strategies and reconfiguration priorities are defined (see Subsection 5.2).
• The Extended Functional Modelling Approach (see Subsection 5.3) revisits the functional model to include the fault detection and reconfiguration functions needed to implement the strategies identified in Compatibility Analysis. The functional model is also extended to include allocation of hardware/software (HW/SW) resources to the system functions. At this point, a HW/SW architecture emerges and the effect of design improvements on dependability and cost can be assessed.
• The Dependability and Cost Evaluation Approach predicts the dependability and cost of the HW/SW architecture. Via iterative application and comparison of results, it enables informed trade-off decisions between candidate design decisions and incurred cost (see Section 6).
The HW/SW architecture needs to be evaluated to verify whether the initial requirements are met. If they are not satisfied there are two options: Option A takes the process to an earlier activity and iterates from there, while Option B moves the design process back to its starting point so that the design requirements are reconsidered. Depending on the requirements, Option A redirects the design flow to an intermediate design step: redundancy-related design decisions are reconsidered through the application of the Compatibility Analysis (e.g. replacing homogeneous redundancies with heterogeneous redundancies to reduce design costs), whereas health management functions are reconsidered through the Extended Functional Modelling Approach (e.g. reducing fault detection implementation redundancies to reduce design costs). Generally, the application of the Compatibility Analysis implies the application of the Extended Functional Modelling Approach. The reconsideration of design requirements in Option B results in the redesign of the functional model.
Note that the fault hypothesis that underpins the dependability analysis in D3H2 is the occurrence of permanent, but potentially repairable, dynamic failures of hardware, software, and communication components, which are manifested as loss of function (omission failure) or delivery of function out of context (commission failure) [8].
The four approaches of D3H2 will be discussed with the aid of a railway system which is introduced next and described in more detail in [6].
Train Car Door Status Control System
The door status control is a safety-critical function which determines the safe operation of door open and close actions. It has dependencies with other systems of the train, and the door operations are controlled by the driver depending on the status of the train, e.g. the doors must remain closed while the train is running. Each door in the train has sensors and control buttons for the passengers and the driver. Figure 3 shows the door status control configuration including both hardware and logical dependency models. There is one opening and one closing button for the driver, connected to the processing unit of the driver (PU Driver), and each door throughout the train has: one opening button for passengers, one door speed sensor, one door open detection sensor, one door closed detection sensor and one obstacle detection sensor. All these sensors, their controllers, and the door control algorithm are located in the processing unit PU Door. In the train there is a component called TCMS (Train Control and Monitoring System), which monitors and controls different critical systems of the train such as traction and doors. This component is homogeneously duplicated in two reliable processing units (PU TCMS) for safety purposes. The TCMS receives information about the speed of the train and does not allow the driver to open the doors while the train is running. To this end, the TCMS sends an enable signal to the driver to indicate whether door opening or closing is safe (Enable Door Driver, EDD). Using the information of the Enable Door Driver signal, the driver sends an enable signal to the controller of each door (Enable Door Passenger, EDP) to act safely on opening/closing the doors, while taking into account whether the train is moving and whether there is an obstacle in the door (cf. Figure 3b). All the processing units of the door status control system are connected to the Multifunction Vehicle Bus (MVB) [35].
Other systems in the train are connected to Ethernet (e.g., video surveillance) and CAN (e.g., fire protection) communication networks. An interconnecting gateway enables the communication between processing units connected to different communication networks.
5 System Design using D3H2
Functional Modelling Approach
The Functional Modelling Approach specifies the functional operation of the system in a top-down manner. Inspired by SADT (Structured Analysis & Design Technique) [36], a set of tokens aids in the systematic specification of the key operational parts of the system, starting from a set of high-level functions (e.g., different railway train operations: train operating properly, train stopped) and tracing down to the resources necessary to perform these functions:
• A high level function consists of a set of Main Functions (MF), e.g., train operating properly = {traction system OK, signalling system OK, braking system OK, air conditioning control OK, . . . }.
• Main functions are performed in possibly different Physical Locations (PLs), e.g., a single air conditioning control implementation may span a whole train car or each car compartment in a train car may have its own air conditioning control.
• A main function consists of a set of subfunctions (SF), e.g., input, control and output subfunctions.
• A subfunction may have multiple implementations (#) to carry out the subfunction and these are ordered with respect to their priority.
• Each implementation requires a set of hardware, software and communication resources.
For simplicity, the token-based specification process focuses on main functions and a first level of decomposition from main functions to subfunctions. However, the Functional Modelling Approach is extensible to N functional levels. The implementation of a subfunction of a generic main function is fully specified as follows:
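As an illustration of the token hierarchy listed above, the following minimal sketch stores one implementation as a record. It assumes that a token is written as [MF].[PL].[SF].[#] and is associated with a set of resources; the class name `Implementation`, its field names, the separator characters and the example values are hypothetical and are not taken from Table 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Implementation:
    """One implementation of a subfunction, identified by its tokens."""
    main_function: str    # MF token
    location: str         # PL token
    subfunction: str      # SF token
    priority: int         # '#' token; 1 = highest priority (assumed convention)
    resources: frozenset  # hardware, software and communication resources

    def token(self) -> str:
        # Assumed textual form: [MF].[PL].[SF].[#]
        return (f"{self.main_function}.{self.location}."
                f"{self.subfunction}.#{self.priority}")


# Illustrative entry only; the actual rows of Table 2 may differ.
door_closed = Implementation(
    main_function="DoorStatusControl",
    location="Car1-ZoneA",
    subfunction="DoorClosedDetection",
    priority=1,
    resources=frozenset({"DoorClosedSensor", "PU_Door", "MVB"}),
)
print(door_closed.token())
```

Keeping the record immutable makes it usable as a dictionary key, which is convenient when later steps compare implementations pairwise.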
To define the physical location of system functions consistently, a physical location map is defined for the physical structure. Figure 4 shows the physical location map of a hypothetical train, where each car of the train is comprised of different compartments (Zone A, Zone B). Based on the token-based specification defined in Eq. (1), Table 2 describes the functional model of the door status control (cf. Figure 3). Door open commands are generated by passengers and the driver, but the door close command is controlled only by the driver. These input subfunctions are directed toward the door control algorithm (DCA) subfunction [#11], which determines when and how to close the doors through the door manipulation (DM) subfunction [#12]. Note that the final decision on opening/closing the door relies on the Enable Door Passenger (EDP) signal, which is determined by the driver. Table 2 also shows the functional model of the video surveillance main function, which is connected to the Ethernet communication network and is located in the same physical location as the door status control main function (cf.
Compatibility Analysis
The Compatibility Analysis identifies heterogeneous redundancies based on tokens of the functional model (cf. Eq. (1)). There may exist two compatibility cases among the system implementations defined in the functional model:
• Natural compatibility is the case of implementations carrying out the same subfunction in compatible physical locations.
• Forced compatibility is the case of implementations carrying out different but potentially equivalent subfunctions located at compatible physical locations.
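The two compatibility cases can be sketched as a small classification function. This is only an illustration under stated assumptions: location compatibility and subfunction equivalence are supplied by the designer as sets of unordered pairs, and the names `Impl`, `classify`, and the example subfunctions are hypothetical rather than taken from the paper's tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impl:
    subfunction: str
    location: str


def classify(a, b, compatible_locations, equivalent_subfunctions):
    """Return 'natural', 'forced' or None for a pair of implementations."""
    same_or_compatible = (
        a.location == b.location
        or frozenset({a.location, b.location}) in compatible_locations
    )
    if not same_or_compatible:
        return None
    if a.subfunction == b.subfunction:
        return "natural"  # same subfunction, compatible locations
    if frozenset({a.subfunction, b.subfunction}) in equivalent_subfunctions:
        return "forced"   # different but potentially equivalent subfunctions
    return None


# Designer knowledge (illustrative values only).
compatible_locations = {frozenset({"Car1-ZoneA", "Car1-ZoneB"})}
equivalent_subfunctions = {frozenset({"ObstacleDetection", "VideoCapture"})}

sensor_a = Impl("ObstacleDetection", "Car1-ZoneA")
sensor_b = Impl("ObstacleDetection", "Car1-ZoneB")
camera = Impl("VideoCapture", "Car1-ZoneA")
far_camera = Impl("VideoCapture", "Car3-ZoneA")
```

With these inputs, `classify(sensor_a, sensor_b, ...)` yields a natural compatibility and `classify(sensor_a, camera, ...)` a forced one, while the camera in another car is not a candidate.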
To identify heterogeneous redundancies, we look for matching subfunctions and compatible physical locations in the functional model and determine whether the analysed implementations are compatible. We define compatible physical locations according to the location of subfunctions (cf. Table 2 are located in a compatible physical location. Based on engineering design knowledge, we can identify that the video surveillance can provide a compatible implementation to the door status control function by reusing the camera and adding image processing software to perform different functions. Specifically, the following heterogeneous redundancies can be implemented reusing the video surveillance camera:
As a result of the compatibility analysis, the designer can select different homogeneous or heterogeneous redundancy strategies for each subfunction. Apart from the identified heterogeneous redundancies, it is possible to add homogeneous redundancies by duplicating existing sensors. For instance, for the door status control function in Table 2 the homogeneous and heterogeneous redundancy decisions in Table 3 can be adopted. Communication integrates MVB and Ethernet communication networks and their connecting gateway.
There are several approaches in the diagnostics and fault-tolerant control community focused on identifying analytic redundancies systematically [18]. A number of approaches in this area evaluate whether it is possible to provide the same service with a combination of the remaining sensors, i.e., whether there exists an alternative analytic equation which uses a different set of variables (resources) to provide the same service. The identification of redundancies focuses on the relations among system equations and variables. That is, if there exists redundant information about the system structure (i.e., if there are more equations than variables to be determined), there may also exist alternative ways to define a variable.
The exhaustive characterization and mathematical formulation of complex systems is not trivial and in some cases is infeasible. The identification of analytic redundancies is typically feasible at subsystem level, but the complexity of the mathematical formulation increases dramatically at system level. Additional complexity exists in highly networked scenarios where systems consist of many subsystems, which are all interconnected through a communication network. In general, the formal identification and categorisation of heterogeneous redundancies for complex systems is a challenging task. This is especially pronounced for non-evident redundancies arising from forced compatibilities, because there is no direct relationship between them.
Reconfiguration strategies integrate the functional model with redundancies. They define all possible realizations of the main function, comprised of the necessary subfunctions and prioritized implementations. The prioritization is based on the weighted sum of functional degradation, failure probability and cost of the implementation [4]. The functional degradation depends on the relative physical distance (applicable for heterogeneous redundancies arising from natural compatibilities). For heterogeneous redundancies arising from forced compatibilities, the designer's knowledge is necessary.
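A weighted-sum prioritization of this kind can be sketched as follows. The sketch assumes that the three attributes are normalized to [0, 1] and that a lower score means a higher priority; the weights, attribute values and the function name `rank_implementations` are illustrative assumptions, not values from [4].

```python
def rank_implementations(candidates, weights):
    """Sort candidates by ascending weighted sum (lower score = higher priority)."""
    def score(c):
        return (weights["degradation"] * c["degradation"]
                + weights["failure_probability"] * c["failure_probability"]
                + weights["cost"] * c["cost"])
    return sorted(candidates, key=score)


# Hypothetical, normalized attribute values for two implementations.
candidates = [
    {"name": "camera_based", "degradation": 0.3,
     "failure_probability": 0.05, "cost": 0.2},
    {"name": "duplicated_sensor", "degradation": 0.0,
     "failure_probability": 0.02, "cost": 0.8},
]
weights = {"degradation": 0.5, "failure_probability": 0.3, "cost": 0.2}

ranked = rank_implementations(candidates, weights)
print([c["name"] for c in ranked])
```

Changing the weights changes the ordering, which is how the designer's preferences between degradation, dependability and cost enter the reconfiguration priorities.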
Extended Functional Modelling Approach
The Extended Functional Modelling Approach augments the functional model by adding health management functions and implementations: fault detection to detect the incorrect operation of an implementation and reconfiguration to recover from implementation failures. We have defined the following mechanisms and protocols for fault detection and reconfiguration subfunctions:
• Fault detection (FD): each subfunction has an associated fault detection subfunction (FD SF). The FD SF is located at the destination processing unit, where the information of the source processing unit is used to detect communication omission failures directly.
• Reconfiguration (R): each subfunction has its own reconfiguration subfunction (R SF), which receives fault detection (FD SF) signals and sends reconfiguration signals to subfunction implementations.
• Fault detection of the reconfiguration (FD R): each reconfiguration implementation (R SF) has its own fault detection mechanism (FD R SF) implemented in keepalive configuration. Each R SF implementation sends keepalive signals to all its FD R SF implementations to indicate that it is operating. If no keepalive signal is received during a time-slot, the R SF implementation is assumed to have failed. When this happens, the FD R SF implementation sends an activation signal to the available R SF implementation with the highest priority.
• Communication is considered at resource level.
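The keepalive protocol between R SF and FD R SF implementations described above can be sketched as follows. This is a simplified, single-monitor illustration: the class name `KeepaliveMonitor`, the explicit `now` timestamps and the timeout value standing in for the time-slot are assumptions made for the example.

```python
class KeepaliveMonitor:
    """FD R SF sketch: watches keepalives of the active R SF implementation."""

    def __init__(self, priorities, timeout):
        # priorities: R SF implementation names, highest priority first
        self.priorities = list(priorities)
        self.timeout = timeout
        self.available = set(priorities)
        self.active = self.priorities[0]
        self.last_seen = 0.0

    def keepalive(self, sender, now):
        # Only keepalives from the active implementation are relevant.
        if sender == self.active:
            self.last_seen = now

    def check(self, now):
        """Return the implementation to activate, or None if nothing changes."""
        if now - self.last_seen <= self.timeout:
            return None
        # No keepalive within the time-slot: assume the active R SF failed.
        self.available.discard(self.active)
        for name in self.priorities:
            if name in self.available:
                self.active = name
                self.last_seen = now
                return name  # activation signal to the best available R SF
        self.active = None
        return None


monitor = KeepaliveMonitor(["R1", "R2"], timeout=1.0)
monitor.keepalive("R1", now=0.5)
```

Calling `monitor.check(2.0)` after the last keepalive at 0.5 activates `R2`; repair of `R1` is deliberately left out of this sketch.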
There is no uniquely valid solution when allocating health management implementations. The adopted decisions predefine the behaviour of health management mechanisms so that it is possible to design and evaluate HW/SW architectures systematically.
Since fault detection and reconfiguration are subfunctions of a given main function, they are also modelled using tokens (FD SF, R SF, FD R SF). Accordingly it is possible to analyse alternative fault detection and reconfiguration strategies. Figure 5 describes the closed-loop operation of a system deployed in a highly networked scenario including input, control and output subfunctions. The operation of the HW/SW architecture is described for the output subfunction with redundancies. Overlapped rectangles describe alternative implementations for the same subfunction.
Extending the functional model of the door status control main function in Table 2, Table 4 displays the HW/SW architecture including the identified heterogeneous redundancies (cf. Table 3) and their health management mechanisms. Namely, for each subfunction with redundancies, a single fault detection implementation (FD SF), duplicated reconfiguration implementations (R SF), and duplicated fault detection of the reconfiguration (FD R SF) implementations have been selected.
The HW/SW architecture design step can be automated [4] and implemented in real systems [5]. As for the automation, the token-based annotations make it possible to parse the HW/SW architecture from a design model (e.g. Simulink [37]) which includes the designer's decisions with respect to the level and type of redundancy and health management strategies. For implementation, each processing unit needs a wrapper that ensures the interchangeability between compatible implementations and a reconfiguration mechanism to redirect its information. Furthermore, the units with FD R SF implementations need to monitor keepalive signals to check the correct operation of the active R SF implementation [5].
Dependability and Cost Evaluation Approach for Repairable Systems

Concepts and Notation
The failure model of the HW/SW architectures considers the possible failure modes of its health management mechanisms and functional implementations: fault detection implementations (FD SF, FD R SF) fail in omission (O) when they do not detect a failure that has occurred, and in false positive (FP) when they falsely report a failure that has not occurred; reconfiguration implementations fail in omission when they fail to act on a needed reconfiguration; and failures of subfunction implementations (SF) cover omission and incorrect value failure modes. Implementations are reconfigured sequentially for non-repairable systems [5]. However, for repairable systems, it is necessary to check the status of all subfunction implementations to know which implementation is active and to reconfigure to the implementation with the highest priority (cf. Figure 1). Implementation i becomes active if at initialisation it has the highest priority among the implementations for the same subfunction, or when the active implementation fails and implementation i has the highest priority among the available implementations. The logical and temporal combinations of failure and repair events are specified using repairable Dynamic Fault Tree gates (cf. Table 5). The use of these gates is limited to expressing certain events with predefined failure and repair logic, but more flexible failure and repair specification logics are also needed to model non-predefined random events (see Subsection 6.3). Table 6 defines the notation of the failure events and working events according to their subfunction and failure modes. For brevity, in subsequent characterizations we omit the common part ([MF].[PL]).
The failure specification of each resource is defined by sampling randomly the failure and repair times according to their cumulative distribution functions along the system lifetime. The methodology supports any cumulative distribution function, but for the sake of demonstration and without loss of generality, in subsequent probabilistic characterizations exponential failure distributions are assumed. In line with this assumption, the failure specification of resources (F Res ) is defined according to their failure rates (λ Res ) and repair rates (µ Res ). 
The same equation holds for the specification of the omission failures of fault detection (FD SF), reconfiguration (R SF), and fault detection of the reconfiguration (FD R SF - F FD R i O). Accordingly, false positive failures of fault detection implementations (F FD FP and F FD R i FP) are specified with failure and repair distributions and parameters.
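The random sampling of failure and repair times for a single resource can be illustrated as follows. The sketch assumes exponential distributions, as in the text, so that times to failure and times to repair are drawn with rates λ Res and µ Res respectively; the function name `sample_timeline`, the rate values and the seed are illustrative choices.

```python
import random


def sample_timeline(failure_rate, repair_rate, lifetime, rng):
    """Return [(start, end, state)] alternating 'up' and 'down' over [0, lifetime]."""
    t, state, timeline = 0.0, "up", []
    while t < lifetime:
        rate = failure_rate if state == "up" else repair_rate
        # Time to the next failure (if up) or repair (if down), truncated at lifetime.
        end = min(t + rng.expovariate(rate), lifetime)
        timeline.append((t, end, state))
        t, state = end, ("down" if state == "up" else "up")
    return timeline


# Hypothetical rates per hour and a fixed seed for repeatability.
timeline = sample_timeline(failure_rate=1e-3, repair_rate=1e-1,
                           lifetime=10_000.0, rng=random.Random(1))
uptime = sum(end - start for start, end, state in timeline if state == "up")
print(f"fraction of time up: {uptime / 10_000.0:.3f}")
```

Other distributions can be substituted by replacing the `expovariate` call, which is in line with the statement that the methodology is not tied to the exponential case.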
Dependability Analysis Algorithm
The dependability analysis algorithm compositionally defines the combinations of subfunction implementation failures that prevent the HW/SW architecture from performing its intended subfunction. The failure of any subfunction necessary for a main function provokes the immediate failure of that main function. Hence, from this point onwards, we will only consider the failure of a subfunction. To express these events we use equations with the logic gates defined in Table 5.
The subfunction fails (F SF ) when all implementations have failed (F All Impl. ), an implementation fails and reconfiguration does not happen (failure unresolved, F Unresolved ), or its input dependencies have failed (F Dependencies ):
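Read as a gate expression, the sentence above corresponds to a three-input OR over the named events. The following is a sketch of that reading, where $F_{SF}$, $F_{All\,Impl.}$, $F_{Unresolved}$ and $F_{Dependencies}$ stand for the events written F SF, F All Impl., F Unresolved and F Dependencies in the text; the exact notation of Eq. (3) in the paper may differ.

```latex
F_{SF} = \mathrm{OR}\left( F_{All\,Impl.},\; F_{Unresolved},\; F_{Dependencies} \right)
```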
Assuming that we have N SF implementations of the subfunction, the F All Impl. event happens when each implementation fails or is detected as failed:
The failure unresolved (F Unresolved ) occurs when the active implementation fails and either the fault is not detected (failure undetected event) or the reconfiguration itself fails (reconfiguration failed event). For each implementation there are different failure unresolved events (F Unr. Imp i ) because each implementation has different failure probabilities:
To define the failure unresolved event (F Unr. Imp i ) we introduce two new events. The first event occurs when first the reconfiguration subfunction fails and then the i th implementation of the subfunction fails when it is active (reconfiguration sequence failure, F R Seq. i ):
The second event occurs when first the fault detection of the subfunction fails and then the i th implementation of the subfunction fails when it is active (fault detection sequence failure, F FD Seq. i ): 
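The ordering in these two sequence events can be illustrated with a priority-AND over failure times. This is a deliberately simplified sketch: it assumes permanent failures without repair, represents each event by its occurrence time (None if it has not occurred), assumes implementation i is active when it fails, and uses a plain priority-AND as a stand-in for the repairable gates of Table 5. All function names are hypothetical.

```python
def pand(t_first, t_second):
    """Priority-AND: both events occurred and the first strictly precedes the second."""
    return (t_first is not None
            and t_second is not None
            and t_first < t_second)


def failure_unresolved(t_reconf_fail, t_fd_fail, t_impl_fail):
    """True if implementation i fails after its reconfiguration or fault detection failed."""
    f_r_seq = pand(t_reconf_fail, t_impl_fail)   # reconfiguration sequence failure
    f_fd_seq = pand(t_fd_fail, t_impl_fail)      # fault detection sequence failure
    return f_r_seq or f_fd_seq


# Reconfiguration fails at t=10, the active implementation at t=20: unresolved.
print(failure_unresolved(10.0, None, 20.0))
```

If the health management mechanism fails only after the implementation has already failed, the failure has been handled in time and the event does not occur.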
Dependencies address the influence of Input (I) and Control (C) subfunctions on Control and Output (O) subfunctions, respectively. A Control subfunction failure directly causes the Output subfunction failure (C→O). The influence of an Input subfunction on a Control subfunction depends on the control configuration of the system, i.e. whether it is Closed Loop (C CL) or Open Loop (C OL):
Assuming that W C X =OR(W C X 1 , . . . , W C X N W ) means that any of the N W implementations of the C X subfunction are working (where X = {CL, OL}), equations in (10) describe the different input subfunctions that affect each control configuration (I CL→C CL, I OL→C OL). 
The reconfiguration failure is a special subfunction and therefore F R is developed like Eq. (3), except that there are no additional dependencies:
F All R Impl. indicates the failure of all reconfiguration implementations and F R Unresolved designates the failure unresolved condition of the reconfiguration. Assuming M reconfiguration implementations:
F R Unresolved happens when the M implementations of the reconfiguration's fault detection fail simultaneously. This is a direct consequence of a design choice: all fault detection implementations of the reconfiguration (FD R SF) are active, homogeneous redundancies (keepalive implementations):
The false positive of the reconfiguration's fault detection occurs when all reconfiguration's fault detection implementations raise the false positive condition simultaneously. Although the system may operate correctly when a false positive occurs, it has to assume that the information provided by the fault detection is correct, since there is no mechanism to detect the incorrect operation of fault detection. The fault detection failure F FD depends on the operation of the destination subfunction (SF Dest ), because the fault detection implementation is located at the same processing unit. Hence, F SF Dest influences directly F FD .
When the fault detection implementation fails, the change of destination subfunction's (SF Dest ) implementation determines its reconfiguration. We assume that the change of destination subfunction's implementation activates the corresponding fault detection implementation and the previous one is deactivated. Eq. (14) describes the fault detection subfunction failure case when fault detection subfunction has K implementations:
The failure of the i th fault detection implementation while it is active (F FD Dest i | Act) expresses the following event: either the i th destination subfunction or the i th fault detection implementation fails while active (note that the i th fault detection implementation and the SF Dest i implementation are located at the same processing unit):
To avoid creating loops, the influence of dependencies is taken into account at the subfunction's failure level (cf. Eq. (3)). At this level, the failure of any dependent subfunction leads directly to the subfunction failure.
Implementation
Stochastic Activity Networks (SAN) [38] meet all the requirements to specify the dependability evaluation model of HW/SW architectures, including the specification of: time-dependent scenarios; modular system behaviour; repair behaviour; any cumulative distribution function; and user-defined reconfiguration strategies (cf. Section 2).
Preliminaries on SAN
SAN was first introduced in the mid-1980s [39] and it has been used for performance, dependability and performability evaluations [6, 40, 41]. SAN makes use of reduced base models [42] to alleviate the state-explosion problem, and it extends stochastic Petri nets by generalizing the stochastic relationships and adding mechanisms for hierarchical models [38]. Figure 6 shows the SAN modelling constructs. Places represent the state of the modelled system. Each place contains tokens defining the marking of the place: a standard place contains an integer number of tokens, while extended places contain data types other than integers (e.g. float, array). We denote the marking function of the place x as m(x), e.g. m(x) = 1 means that the place x has a marking equal to one.
There are two types of activities: instantaneous activities, which complete in a negligible amount of time, and timed activities, whose duration has an effect on the system performance and whose completion time can be a constant or a random value. The random value is governed by a probability distribution function defining the time to fire the activity.
Activities fire based on the conditions defined over the marking of the network and their effect is to modify the marking of the places. The completion of an activity of any kind is enabled by a particular marking of a set of places. The presence of at least one token in each input place enables the firing of the activity removing the token from its input place(s) and placing it in the output place(s).
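The basic enabling and firing rule described above can be sketched with a marking stored as a dictionary from place names to token counts. This is only an illustration of the rule for standard places and direct arcs, not of any particular SAN tool; the function and place names are hypothetical.

```python
def enabled(marking, input_places):
    """An activity is enabled if every input place holds at least one token."""
    return all(marking[p] >= 1 for p in input_places)


def fire(marking, input_places, output_places):
    """Return the marking after completion: one token moves from inputs to outputs."""
    if not enabled(marking, input_places):
        raise ValueError("activity is not enabled in this marking")
    new_marking = dict(marking)
    for p in input_places:
        new_marking[p] -= 1
    for p in output_places:
        new_marking[p] += 1
    return new_marking


# A resource modelled with two places; a 'fail' activity moves the token.
m0 = {"working": 1, "failed": 0}
m1 = fire(m0, input_places=["working"], output_places=["failed"])
print(m1)
```

Returning a new dictionary rather than mutating the input keeps earlier markings available, which is convenient when exploring several possible firings from the same state.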
Another way of enabling activities consists of utilising input and output gates. Gates make SAN general and powerful enough to model complex real situations. They determine the marking of the network through user-defined C++ rules. Input gates control the enabling of activities and define the marking changes that will occur when an activity completes. A set of places is connected to the input gate and the input gate is connected to an activity. A Boolean condition enables the activity connected to the gate, and a function determines the effect of the activity completion on the marking of the places connected to the gate. Output gates specify the effect of activity completion on the marking of the places connected to the output gate. An output function defines the marking changes that occur when the activity completes.
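The gate mechanism described above can be sketched in a few lines of Python. This is only an illustration of the enabling and marking-update logic: the class `Activity`, the place names `Working` and `Failed`, and the helper functions are hypothetical and do not correspond to the C++ interface of any SAN tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Marking = Dict[str, int]  # place name -> number of tokens


@dataclass
class Activity:
    """Hypothetical stand-in for a SAN activity guarded by gates."""
    name: str
    enabled: Callable[[Marking], bool]    # input-gate Boolean condition
    input_fn: Callable[[Marking], None]   # input-gate marking change
    output_fn: Callable[[Marking], None]  # output-gate marking change

    def fire(self, m: Marking) -> bool:
        """Complete the activity only if its input gate enables it."""
        if not self.enabled(m):
            return False
        self.input_fn(m)
        self.output_fn(m)
        return True


def take_working(m: Marking) -> None:
    m["Working"] -= 1  # remove the token from the input place


def put_failed(m: Marking) -> None:
    m["Failed"] += 1  # add a token to the output place


fail = Activity(
    name="fail",
    enabled=lambda m: m["Working"] >= 1,
    input_fn=take_working,
    output_fn=put_failed,
)

marking = {"Working": 1, "Failed": 0}
fired_first = fail.fire(marking)   # enabled: the token moves to Failed
fired_second = fail.fire(marking)  # not enabled: the marking is unchanged
```

The timing of activities (instantaneous or timed) is deliberately left out; the sketch only covers when an activity may complete and how the marking changes.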
SAN models which include the specified SAN elements form a SAN atomic model (see Figure 10 "Reusable Block" column). The join operator links SAN models through a compositional tree structure in a unique composed model (e.g., see Figure 8 ). It is possible to link atomic models, composed models, or combinations thereof. Composed and atomic SAN models are linked through join operators using shared places between them. Thus, the analyst can focus on specific characteristics through fit-for-purpose atomic/composed models and later join independently validated models to obtain a more complex composed model.
The performance measurements are carried out through reward functions defined over the designed model. Reward functions are defined based on the marking of the network (state reward function) or completion of activities (impulse reward function) and they are evaluated as the expected value of the reward function. For a complete and formal definition of SAN please refer to 38 .
Dependability Evaluation Approach Specification in SAN
Figure 7 shows the specification of the dependability analysis algorithm, which comprises the following models and activities:
• Functional Modelling: for each subfunction (SF) its resources, implementations, and the reconfiguration logic are specified using SAN atomic models. The same modelling process applies for each fault detection (FD SF), reconfiguration (R SF) and reconfiguration's fault detection (FD R SF) subfunction implementations.
• Failure Logic Modelling: the failure logic of the gates used in Eqs. (2)-(15) is modelled in SAN.
• SAN Synthesis: according to the dependability analysis algorithm, SAN composed models are created linking resources, implementations, reconfiguration logic and failure logic. Composed models are constructed by creating shared places between implementations and failure gates. They define implementation-level failures (cf. Eq. (2)) and they are linked to define subfunction and main function level failures (cf. Eq. (3)).
Functional Modelling: for each subfunction, its different implementations, resources, and reconfiguration logic are specified using SAN atomic models. For instance, assuming that the implementation Impl 1 is comprised of resources Res 1 and Res 2 , Figure 8 shows the SAN atomic specification of (a) resources (Res 1 ); (b) implementations (Impl 1 ); and (c) the SAN composed model that links implementations and resources via shared places.
As modelled in the resource specification (Figure 8a ), Res 1 (and Res 2 ) transitions between working and failed states according to its failure and repair cumulative distribution functions (F (t), R(t)). Initially resources are assumed to be operative (<m(Res 1 Working), m(Res 1 Failure)> = <1, 0>) and implementations can be in working or standby state, e.g., Impl 1 is working (<m(Impl 1 Working), m(Impl 1 Failure), m(Impl 1 Standby)> = <1, 0, 0>).
According to the atomic implementation specification, when Res 1 or Res 2 fails, Impl 1 switches to the failure state (see the logic in the F Impl 1 input gate). When both resources Res 1 and Res 2 are repaired, Impl 1 switches to the standby state (see the logic in the R Impl 1 input gate). If Impl 1 is in the standby state and receives a reconfiguration signal (m(Impl 1 Reconfigure)=1), it instantaneously returns to the working state (see the atomic model of the implementation specification, Figure 8b ).
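For intuition, the two-state behaviour of a single resource has a closed form in the special case where both F(t) and R(t) are exponential. This is an assumption made only for this sketch, since the SAN model accepts any cumulative distribution function. Starting from the working state, the probability of finding the resource failed at time t is λ/(λ+µ) · (1 − exp(−(λ+µ)t)). The failure rate used below is a hypothetical value; the repair rate matches the µ = 0.5 per year assumed in the case study.

```python
import math


def unavailability(lam: float, mu: float, t: float) -> float:
    """P(resource is failed at time t), two-state Markov model.

    lam: constant failure rate, mu: constant repair rate (same time unit
    as t). Assumes the resource is working at t = 0.
    """
    total = lam + mu
    return lam / total * (1.0 - math.exp(-total * t))


# Hypothetical failure rate of 0.1 per year, repair rate of 0.5 per year.
u_10_years = unavailability(lam=0.1, mu=0.5, t=10.0)
u_steady = 0.1 / (0.1 + 0.5)  # limiting value for large t
```

The value grows monotonically from zero towards the steady-state ratio λ/(λ+µ), which is why repair bounds the failure probability of a repairable resource over long missions.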
The composed model of the implementation links atomic models of resources and implementations sharing their dependent places: Res 1 Failure and Res 2 Failure (Figure 8c ). This modelling process is repeated for all the implementations and their constituent resources.
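The implementation behaviour described above can be summarised as a small state machine. The sketch below is a plain-Python illustration, not the SAN model itself: the class and method names are hypothetical, and the shared places Res i Failure are represented by a dictionary of Boolean flags.

```python
from enum import Enum


class State(Enum):
    WORKING = "working"
    FAILED = "failed"
    STANDBY = "standby"


class Implementation:
    """Implementation that depends on a set of resources."""

    def __init__(self, resources):
        # Mirrors the shared "Res i Failure" places: True means failed.
        self.resource_failed = {r: False for r in resources}
        self.state = State.WORKING

    def resource_event(self, resource, failed):
        """Record a resource failure (True) or repair (False)."""
        self.resource_failed[resource] = failed
        if any(self.resource_failed.values()):
            # Any failed resource fails the implementation.
            self.state = State.FAILED
        elif self.state is State.FAILED:
            # All resources repaired: wait in standby.
            self.state = State.STANDBY

    def reconfigure(self):
        """Reconfiguration signal: standby returns to working."""
        if self.state is State.STANDBY:
            self.state = State.WORKING


impl1 = Implementation(["Res1", "Res2"])
impl1.resource_event("Res1", True)    # Impl1 fails
impl1.resource_event("Res1", False)   # Impl1 goes to standby
state_before_signal = impl1.state
impl1.reconfigure()                   # Impl1 resumes operation
state_after_signal = impl1.state
```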
After specifying all the implementations and resources, it is necessary to define the reconfiguration logic between implementations. Figure 9 shows the reconfiguration process for Impl 1 and Impl 2 assuming that Impl 1 has higher priority than Impl 2 . The SAN atomic model of the reconfiguration (Reconfig SF) defines the reconfiguration process:
• The implementation with the highest priority starts operating (Impl 1 ).
• When Impl 1 fails the next implementation in standby state with the highest priority is activated (Impl 2 ).
• When the failed implementation is repaired, it returns to the standby state and it remains in standby state until the implementation that is active fails.
• When the implementation which does not have the highest priority fails, standby implementations are checked according to their priority. In this case, if Impl 1 is in standby state when Impl 2 fails, it returns to the active operation.
This process can be extended to N implementations; the reconfiguration priorities are encoded as if-else-if statements over the states of the implementations.
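The priority rules listed above can be sketched as a selection function. The function name, the string-valued states and the ordered priority list are illustrative assumptions; in the SAN model the same logic is written as if-else-if statements inside the gates of Reconfig SF.

```python
def select_active(states, priority, active):
    """Return the implementation that should be active, or None.

    states:   dict mapping name -> "working", "standby" or "failed"
    priority: names ordered from highest to lowest priority
    active:   name of the currently active implementation, or None
    """
    # A working active implementation keeps running, even if a
    # higher-priority implementation has been repaired meanwhile.
    if active is not None and states[active] == "working":
        return active
    # Otherwise activate the standby implementation of highest priority.
    for name in priority:
        if states[name] == "standby":
            return name
    return None  # every implementation has failed


order = ["Impl1", "Impl2"]

# Impl1 fails while active: Impl2 takes over.
after_failure = select_active(
    {"Impl1": "failed", "Impl2": "standby"}, order, "Impl1")

# Impl1 is repaired: it stays in standby while Impl2 keeps working.
after_repair = select_active(
    {"Impl1": "standby", "Impl2": "working"}, order, "Impl2")

# Impl2 fails later: the repaired Impl1 is reactivated.
after_second_failure = select_active(
    {"Impl1": "standby", "Impl2": "failed"}, order, "Impl2")
```

The caller is expected to move the selected implementation from standby to working, which corresponds to setting its Reconfigure place in the SAN model.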
To implement the reconfiguration logic the atomic model Reconfig SF in Figure 9 is joined with the composed models of Impl 1 and Impl 2 (cf. Figure 8c ) creating shared places between the implementations and the reconfiguration logic for each implementation: Impl i Failure, Impl i Reconfigure, and Impl i Standby, where i identifies the implementation, i={1, 2}.
Figure 9: Specification of the reconfiguration process.
Failure Logic Modelling: in order to implement the logic in the equations of the dependability analysis algorithm it is necessary to model in SAN the logic of repairable Dynamic Fault Tree gates (see Table 5 ). Figure 10 shows the specification of repairable Dynamic Fault Tree gates in SAN using state machines and their corresponding SAN model. In the state machine the initial state is indicated with an arc, failure states are identified with doubled circles, and F x and R x indicate failure and repair events of x. The resultant reusable blocks are used to create the equations of the dependability analysis algorithm systematically. Note that the repairable Dynamic Fault Tree gates in Figure 10 are directly extendible to gates with N inputs and they can be used in a broader context for the evaluation of any complex repairable Dynamic Fault Tree model. The behaviour of the repairable gates has been validated using other repairable Dynamic Fault Tree analysis tools 24 .
SAN Synthesis: linking the design and operation logic for all the system resources, implementations, and subfunctions and then connecting them with failure gates leads to the synthesis in SAN of the equations of the dependability analysis algorithm. The algorithm is applied bottom-up using Eqs. (2)-(15), starting from resources and implementations (Eq. (2)) up to the subfunction failure (Eq. (3) ).
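As a rough illustration of an event-driven repairable gate with N inputs, the sketch below tracks failure (F x) and repair (R x) events and recomputes the gate output after each one. It assumes that the gate set includes the static AND and OR gates; the class name is hypothetical, and sequence-dependent gates, which must remember the order of events, are not covered.

```python
class RepairableGate:
    """Repairable AND/OR gate driven by failure and repair events."""

    def __init__(self, kind, inputs):
        if kind not in ("AND", "OR"):
            raise ValueError("only static AND/OR gates are sketched here")
        self.kind = kind
        self.failed = {x: False for x in inputs}  # all inputs start working

    @property
    def output_failed(self):
        values = self.failed.values()
        # AND fails when every input has failed; OR when any input has.
        return all(values) if self.kind == "AND" else any(values)

    def on_failure(self, x):
        """Event F x: input x fails. Returns the new gate output."""
        self.failed[x] = True
        return self.output_failed

    def on_repair(self, x):
        """Event R x: input x is repaired. Returns the new gate output."""
        self.failed[x] = False
        return self.output_failed


gate = RepairableGate("AND", ["A", "B", "C"])
trace = [
    gate.on_failure("A"),  # one input failed: gate still working
    gate.on_failure("B"),
    gate.on_failure("C"),  # all inputs failed: gate fails
    gate.on_repair("B"),   # one repair restores the AND gate
]
```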
For instance, Figure 11 shows F All Impl. event (cf. Eq. (4)) assuming that the subfunction under study is comprised of two implementations. The same modelling process applies to the remainder of the equations of the Dependability Evaluation Approach. In this way compositional dependability evaluation of complex reconfigurable systems is achieved by linking the dependability analysis algorithm with component-based SAN models of system elements. Note that the reconfiguration model for each subfunction (cf. Figure 9 ) is linked at the subfunction failure level (cf. Eq. (3)) so as to reconfigure subfunction implementations consistently.
Door Status Control Case Study Application
Starting from the functional model of the door status control in Table 2 , we have identified heterogeneous redundancies for different subfunctions. For instance, it is possible to reuse a video surveillance camera to provide redundancies for the door open detection, door closed detection, obstacle detection, and door velocity subfunctions (see Subsection 5.2). Table 3 displays alternative redundancy strategies that can be considered at the design phase.
To use these redundancies, the HW/SW architecture is designed adding fault detection and reconfiguration mechanisms. In the HW/SW architecture displayed in Table 4 we have assumed that for each subfunction with redundancies we have one fault detection subfunction (FD SF), two reconfiguration (R SF), and two fault detection of the reconfiguration (FD R SF) implementations.
The cost assessment of the designed architecture is carried out by adding up the cost of hardware and software resources. The cost of software components is quantified by considering their development cost, assuming that it will be paid off in X years (let us assume X=4 years for calculation purposes). We classify four types of SW components: fault detection (SW FD), reconfiguration (SW R), fault detection of the reconfiguration (SW FD R) and Control-Detector (SW Det). The development cost of each of these four software components is counted once for the different subfunction implementations: once developed, the components are adapted for the related subfunction implementations.
This assumption is adopted because the grouped subfunction implementations are closely related and they do not need a significant development cost (the cost of N variants is not N times the cost of a single software variant 43 ): fault detection implementations adapt to different subfunctions by modifying subfunction-specific time/value thresholds. The development cost of reconfiguration implementations does not differ between subfunctions because the reactivation logic remains the same. The fault detection implementations of a reconfiguration differ only in the keepalive timeout and their development is independent of any subfunction. All the control-detector software implementations have a similar logic.
Hardware cost is evaluated using the sensor, controller and actuator costs obtained from suppliers. The labour cost related to mounting/testing is considered for sensors and actuators assuming 10 minutes per sensor (actuator) at a rate of 60 €/hour. Downtime cost is measured as the combination of travels lost while the train was stopped (travels lost); people in each travel (people travel); and cost of a ticket per person (ticket cost):
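\[
\begin{aligned}
\text{downtime cost} &= \text{travels lost} \times \text{people travel} \times \text{ticket cost} \\
\text{travels lost} &= \text{travels hour} \times \text{downtime} \\
\text{downtime} &= \text{failure probability} \times \text{mission time}
\end{aligned}
\]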
We assume that we do not have to stop the whole train to fix a failure in a car. Besides, we adopt the following values for a short-distance train (≤ 50 km): travels hour = 2; people travel = 20; ticket cost = 1 €; mission time = 30 years. We will evaluate the failure probability at the time instant T = 30 years.
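The downtime and labour cost definitions above can be evaluated with a short script. Two points are assumptions of this sketch rather than statements of the paper: the failure probability of 0.1 is a made-up input, and the conversion between the mission time in years and the hourly travel rate uses 8760 hours per year (continuous service), which the text does not specify.

```python
HOURS_PER_YEAR = 8760  # assumption: service runs 24 h a day, 365 days a year


def downtime_cost(failure_probability, mission_time_years=30,
                  travels_hour=2, people_travel=20, ticket_cost=1.0,
                  hours_per_year=HOURS_PER_YEAR):
    """Downtime cost in euros, following the three relations above."""
    downtime_hours = failure_probability * mission_time_years * hours_per_year
    travels_lost = travels_hour * downtime_hours
    return travels_lost * people_travel * ticket_cost


def labour_cost(n_devices, minutes_per_device=10, rate_per_hour=60.0):
    """Mounting/testing labour cost in euros for sensors and actuators."""
    return n_devices * minutes_per_device / 60.0 * rate_per_hour


# Hypothetical failure probability of 0.1 at T = 30 years.
example_downtime_cost = downtime_cost(0.1)
# Six devices at 10 minutes each and 60 euros per hour.
example_labour_cost = labour_cost(6)
```

Because the downtime cost scales linearly with the failure probability and with the mission time, even small differences in failure probability between architectures can outweigh hardware and labour costs over a 30-year mission.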
Regarding failure rates, resources with the same characteristics have been grouped in Table 7: the pressure sensor covers the open, closed and obstacle detection sensors; PU gathers the characteristics of all the different processing units; and communications include the MVB and Ethernet communication protocols and their gateway. For software components, plausible values are assumed. The repair rate of all components is assumed to be µ = 0.5 y⁻¹.
We have analysed the failure probabilities of different HW/SW architectures with alternative redundancy strategies by applying the dependability analysis algorithm (cf. Subsection 6.2) and synthesizing the equations of the Dependability Evaluation Approach in SAN (cf. Subsection 6.3.2). Table 8 displays the analysed redundancy strategies, which use the redundancies displayed in Table 3, and Table 9 displays the implementations of the health management mechanisms used for the set of subfunctions with redundancies, denoted as SF={DOD, DCD, OD, DV}. The HW/SW architecture in Table 4 displays the implementation of the health management configuration in Table 9 for the different subfunctions with redundancies of the door status control main function.
Figure 12 and Table 10 show, respectively, the relative failure probability and relative cost of the different HW/SW architectures for the alternative redundancy strategies displayed in Table 8, normalized with respect to the architecture without redundancies (cf. Table 2 ). The following improvements have been observed at T=20 years with respect to the configuration without redundancies (cf. Figure 12 ): (#1): 42% better; (#2): 42.57% better; (#3): 43.23% better; (#4): 44.07% better; and (#5): 44.74% better.
When considering only the cost of hardware, software and communication implementations, heterogeneous redundancy configurations are cheaper than homogeneous redundancy configurations. However, once downtime costs are included, the less reliable the architecture, the higher its cost. Accordingly, when downtime is taken into account, heterogeneous redundancy configurations are more expensive than homogeneous redundancy configurations.
To examine the influence of reconfiguration strategies we have evaluated the failure probability for input, control, and output subfunctions with different reconfiguration arrangements for input subfunction implementations. Table 11 displays the arrangement of reconfigurations, where the subscript indicates the priority of the software reconfiguration implementation. All these configurations have the same fault detection configuration displayed in Table 9 .
The system failure probability does not vary with the number and distribution of reconfiguration implementations. However, focusing on Eq. (6) and Eq. (11), there are some properties worth mentioning. Taking the door closed detection subfunction as a reference (note that the remaining input subfunctions, i.e. door open detection, obstacle detection and door velocity, are characterized equally), Table 12 shows the failure probability of the reconfiguration sequence failure event (F R.Seq. DCD, Eq. (6)) and the reconfiguration subfunction failure event (F R DCD, Eq. (11)) at T=10 years. These events have been analysed for different values of the failure rates of the health management software implementations (fault detection, reconfiguration, and reconfiguration's fault detection): SW FD, SW R, SW FD R. We have modified the failure rates of these software resources together (denoted collectively as λ SW HM ) to see the effect on the failure probability. The following characteristics are identified in Table 12 :
• As the number of redundant implementations of reconfiguration increases, the failure probability of F R.Seq. SF and F R SF decreases.
• As the failure rate of the health management implementations increases, the failure probability of F R.Seq. SF and F R SF also increases.
• F R.Seq. SF is lower than F R SF due to the sequence-dependent constraint (cf. Eq. (6)).
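The effect of a sequence-dependent constraint can be illustrated with a simplified two-event example. It is not Eq. (6) or Eq. (11) of the algorithm: it assumes two independent, non-repairable events A and B with constant (hypothetical) failure rates, and compares the probability that both have occurred by time t with the probability that they occurred in the order A then B.

```python
import math


def p_both_failed(lam_a, lam_b, t):
    """P(A and B have both failed by time t), in any order."""
    return (1 - math.exp(-lam_a * t)) * (1 - math.exp(-lam_b * t))


def p_a_then_b(lam_a, lam_b, t):
    """P(A fails first and B fails afterwards, both by time t).

    Obtained by integrating f_B(s) * F_A(s) over [0, t] for
    exponential failure times.
    """
    total = lam_a + lam_b
    return (1 - math.exp(-lam_b * t)) - lam_b / total * (1 - math.exp(-total * t))


# Hypothetical rates (per year) and a 10-year horizon.
lam_a, lam_b, horizon = 0.05, 0.2, 10.0
unordered = p_both_failed(lam_a, lam_b, horizon)
ordered = p_a_then_b(lam_a, lam_b, horizon)
reverse_ordered = p_a_then_b(lam_b, lam_a, horizon)
```

The two orderings partition the unordered event, so each ordered probability is strictly smaller than the unordered one. This is the general reason why an event with an ordering requirement has a lower probability than the same combination of failures without it.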
Taking the HW/SW architecture with the redundancy configuration #1 as the reference configuration (see Table 8 ), the influence of fault detection, reconfiguration and communication implementations has been analysed assuming their ideal and real behaviour. Figure 13 shows the failure probability of these configurations.
Figure 13: Door status control failure probability with ideal assumptions (real configuration; configuration with ideal reconfiguration; configuration with ideal fault detection; configuration with ideal communication).
Figure 13 shows that the influence of the communication is greater than that of the health management implementations because the communication influences many subfunctions and implementations at the same time. In this case, there is no difference in the influence of fault detection and reconfiguration implementations and their influence can be considered negligible (cf. Table 13 ).
Conclusions and Future Work
In this paper we have extended the recently proposed D3H2 methodology to model and evaluate repairable systems for the cost-effective design of dependable reconfigurable systems. Prioritized repair strategies are taken into account including components with complex logic and repeated events. The compositional modelling in D3H2 improves traceability between design and dependability models.
Application of the method to a railway case study has confirmed that the reuse of system resources reduces system cost compared with the addition of extra hardware components. However, this is only true when the additional cost incurred from the increased failure probability of the system is not greater than the extra cost of homogeneous redundancy. When excluding downtime costs, heterogeneous redundancies are cheaper than homogeneous redundancies. However, downtime cost is higher with less reliable architectures and it is more penalising than hardware, software, and communication costs. D3H2 assists in the trade-off analysis between these properties and it enables informed decision making. The D3H2 methodology also includes the effect of health management mechanisms on system dependability. In many cases their effect on system performance may not be significant, but assuming that they are ideal may result in an optimistic system evaluation. Therefore, their effect needs to be evaluated, especially for safety-critical systems.
When evaluating reconfiguration strategies, distributed reconfiguration strategies have shown a lower failure probability than the centralised reconfiguration redundancies in the analysed case study. However, it should be noted that the effect of increasing reconfiguration redundancies on system failure probability is attenuated because there are sequence-dependent intermediate, lower-level failure events. That is, the failure of the reconfiguration subfunction occurs when first the reconfiguration mechanism fails and then the subfunction implementation failure occurs. This time-dependent condition constrains the effect of increasing reconfiguration redundancies on the system failure probability.
As shown in the case study, the optimisation of design decisions with respect to the level and type of redundancy and reconfiguration strategies, so as to maximize dependability and minimize cost, is feasible within the D3H2 methodology. We acknowledge that the methodology assumes a design rationale and process, which designers may not wish to use in every application. However, the innovative and useful aspects of D3H2 such as dependability modelling can be adopted within other design methods. Our future work on D3H2 will focus on the following extensions:
• Automatic extraction of the dependability evaluation models: this approach would alleviate modelling errors (e.g., using meta-modelling techniques 47 ) and accordingly enable the implementation of meta-heuristics, e.g., extending the work in 2, 13 to automate and optimise design decisions. One possible direction is synthesis of D3H2 with model-based dependability analysis techniques 13 .
• Formal identification of heterogeneous redundancies: this is a challenging task for complex systems because there may not be a deterministic relationship between variables. Further refinement of the proposed identification approach could focus on formalising engineering knowledge or exploring multi-physics based modelling formalisms 48 .
• Verification of heterogeneous redundancies: include architecture-specific requirements such as timeliness constraints 49 or memory and processing capacity.
• Quality degradation caused by the use of heterogeneous redundancies: analyse other properties than the failure probability.
• Repair and maintenance strategies: the train operates through different phases and it is possible to schedule repair and maintenance actions accordingly. For instance, if an asset is not critical, it can be left in the failed state until the train reaches a railway depot, where all such assets are repaired together. For critical assets, condition-based maintenance techniques 50 can be considered to monitor the condition of components and schedule maintenance before their failure occurs, reducing downtime costs 51 .
• Application of the D3H2 methodology at the overall system level including interactions and dependencies between all the system main functions through high level functions. 
