Safety Analysis for Vehicle Guidance Systems with Dynamic Fault Trees by Ghadhab, Majdi et al.
Safety Analysis for Vehicle Guidance Systems with Dynamic Fault Trees
Majdi Ghadhab (a), Sebastian Junges (b), Joost-Pieter Katoen (b), Matthias Kuntz (a), Matthias Volk (b)
(a) BMW AG, Munich, Germany
(b) RWTH Aachen University, Aachen, Germany
Abstract
This paper considers the design-phase safety analysis of vehicle guidance systems.
The proposed approach constructs dynamic fault trees (DFTs) to model a
variety of safety concepts and E/E architectures for drive automation. The
fault trees can be used to evaluate various quantitative measures by means of
model checking. The approach is accompanied by a large-scale evaluation: The
resulting DFTs with up to 300 elements constitute larger-than-before DFTs, yet
the concepts and architectures can be evaluated in a matter of minutes.
Keywords: Model Checking, Hardware Partitioning, Dynamic Fault Trees
1. Introduction
Motivation. Cars are nowadays equipped with functions, often realised in software,
that, e.g., improve driving comfort and driver assistance (with a tendency
towards autonomous driving). These functions impose high demands on the
required functional safety. ISO 26262 [1] is the basic norm for developing
safety-critical functions in the automotive setting. It enables car manufacturers
to develop safety-critical devices—in the sense that malfunctioning can harm
persons—according to an agreed technical state-of-the-art. The safety-criticality
is technically measured in terms of the so-called Automotive Safety Integrity
Level (ASIL). This level takes into account driving situations, failure occurrence,
the possible resulting physical harm, and the controllability of the malfunction-
ing by the driver. The result is classified from QM (no special safety measures
required) up to the most stringent level ASIL D (with ASIL A, B, C in between).
To meet the functional safety requirements, it is crucial to execute the software
© 2019. This preprint version is made available under the CC-BY-NC-ND 4.0 license,
http://creativecommons.org/licenses/by-nc-nd/4.0/. The full paper is available in RESS,
DOI 10.1016/j.ress.2019.02.005.
Funding: This work was supported by the CDZ project CAP and the DFG RTG 2236
“UnRAVeL”.
Email address: matthias.volk@cs.rwth-aachen.de (Matthias Volk)
Preprint submitted to Reliability Engineering & System Safety, March 14, 2019
arXiv:1903.05361v1 [cs.SE] 13 Mar 2019
Figure 1: Overview of the model-based safety approach. The figure relates the functional block diagram (blocks B1 ... B4), the E/E architecture, the hardware fault trees (H1, H2, Bus1) and the hardware assignment to Sect. 4.1 and Sect. 4.2.1-4.2.3; the analysis refers to Fig. 13.
functions with a sufficiently low probability of undetected dangerous hardware
failures. This paper considers the design-phase safety analysis of the vehicle
guidance system, a key functional block of a vehicle with a high safety integrity
level (ASIL D, i.e., allowing not more than 10^-8 residual hardware failures per
hour). The key point of our approach is to: (1) manually construct dynamic
fault trees [2] (DFTs) from industrial system descriptions and combine them (in
an automated manner) with hardware failure models for several partitionings of
functions on hardware, and (2) analyse the resulting overall DFTs by means of
probabilistic model checking [3, 4, 5].
A model-based approach. Fig. 1 summarises the approach, in relation to the
structure of this paper. The failure behaviour of the functional architecture,
given as a functional block diagram (FBD), is expressed as a two-level DFT:
the upper level models a system failure in terms of block failures Bi while the
lower level models the causes of block failures Bi. The use of fault trees is
natural: They are a well-known model in reliability engineering. No familiarity
with additional formalisms is required. Fault trees for hardware components are
typically provided by manufacturers. Failures in function blocks can easily be
described by fault trees. Using DFTs rather than static fault trees makes it possible
to model warm and cold redundancies, spare components, and state-dependent
faults; cf. [6]. Each functional block is assigned to a hardware platform for which
(by assumption) a provided DFT Hi models its failure behaviour. Depending on
the partitioning, the communication goes via different fallible buses that are also
modelled by DFTs Busi. From the partitioning, and the DFTs of the hardware
and the functional level, an overall DFT is constructed (in an automated manner)
consisting of three layers: (1) the system layer; (2) the block layer; and (3) the
hardware layer. Details are discussed in Sect. 4.
Analysis. We exploit probabilistic model checking (PMC) [3, 4, 5] to analyse the
DFT of the overall vehicle guidance system. PMC can be used as a black-box
algorithm—no expertise in PMC is needed to understand its outcomes—and
supports various metrics that go beyond reliability and MTTF [7]. While they
are all expressible by a combination of PMC queries, the number of queries is
prohibitively large for some measures relevant for the safety analysis of highly
automated cars. Therefore, we developed dedicated algorithms to compute these
measures within the probabilistic model checker Storm [8], by reusing building
blocks for standard PMC queries. In contrast to simulation and statistical
model checking [9, 10], where results are obtained with a given statistical
confidence, PMC provides hard guarantees that the safety objectives are met.
These guarantees are important as ISO 26262 requires that “metrics are verifiable
and precise enough to differentiate between different architectures” [1, 5:8.2].
Whereas most ISO 26262-based analyses focus on single and dual-point
failures, PMC naturally supports the analysis of multi-point failures of the
vehicle guidance system’s DFT. Consideration of multi-point failures is highly
relevant, as “it is necessary to consider multiple-point failures of a higher order
than two in the analysis when the technical safety concept is based on redundant
safety mechanisms.” [1, 5:9.4.3.2]. To limit the computation time, we extend
Storm’s capabilities for approximate computation: Instead of a precise value,
we compute sound upper and lower bounds for the measures.
Contributions. The main contribution of this paper is two-fold: We report on
the usage of dynamic fault trees for safety analysis in a potential automotive
setting. While standard fault tree analysis is part of ISO 26262, the usage of
DFTs in this field is new. This paper shows how the additional features offered
by DFTs help to create faithful models of the considered scenarios. These models
are then used to analyse the given scenarios. To increase the applicability of
DFTs as a method for probabilistic safety assessment in an industrial setting, we
give concrete building blocks to work with, e.g. redundancy and faults covered
by fallible safety mechanisms.
A clear benefit of the usage of DFTs is that all these methods are integrated
in existing off-the-shelf analysis tools, which provide sound error bounds. The
usage of DFTs reduces the amount of domain-specific knowledge in the analysis,
and thus supports a more model-oriented approach. In this paper, we take this
model-oriented approach to investigate the effect of different hardware partitionings
on a range of metrics. The generated DFTs are to the best of our knowledge
the largest real-life ones in the literature—larger trees have only been artificially
created for scalability purposes [11]. Notably, this paper is the first to consider
model-checking based approaches for DFT analysis on real-life case studies.
A short version of this paper was presented in [12]. The three major extensions
to the earlier paper are: (1) A more detailed step-by-step description of the
methodology. (2) New and faster algorithms to compute a number of safety
measures, and a more thorough explanation of the algorithms involved. (3) More
thorough experiments and a detailed analysis of the results.
Remark. This work has been carried out in cooperation with BMW AG. The
proposed concepts and architectures are exemplary for real-life systems. No
implication on actual safety concepts or E/E architectures implemented by
BMW AG can be derived from these examples. The same remark applies to
any quantity (failure rates, obtained metrics, ...) presented in this paper.
2. Vehicle Guidance
The most challenging safety topic in the automotive industry is currently
driving automation, where the driving responsibility is moving partly or even
Figure 2: Different functional block diagrams for vehicle guidance: (a) Nominal function; (b) SC1: Triple Modular Redundancy (TMR); (c) SC2: Nominal path and safety path; (d) SC3: Main path and fallback path. Each diagram connects the sensors s1, ..., sn to the actuators a1, ..., ak.
entirely from the driver to the embedded vehicle intelligence. Rising liability
questions make it crucial to develop functional safety concepts that are adequate for
the intended automation level and to provide evidence regarding the availability
and the reliability of these concepts.
2.1. Scenario
As a real-life case study from the automotive domain, we consider the
functional block diagram (FBD) in Fig. 2(a) representing the skeletal structure
of automated driving. Data collected from different sensors (cameras, radars,
ultrasonic, etc.) are synthesised and fused to generate a model of the current
driving situation in the Environment Perception (EP) functional block. This
model is used by the Trajectory Planning (TP) functional block to build a
driving path with respect to the current driving situation and the intended
trip. The Actuator Management (AM) functional block ensures the control
of the different actuators (powertrain, brakes, etc.) following the calculated
driving path. Thus, the blocks in the FBD fulfil tasks: The tasks are realised by
(potentially redundant) functional blocks, connected by lines that depict dataflow.
We would like to stress that these diagrams are not reliability block diagrams, in which
the system is operational as long as a path through operational blocks exists.
2.2. Modelling Safety Concepts
2.2.1. Technical safety concepts
Based on the criticality of the vehicle guidance function, especially when
the driver is out-of-the-loop, ASIL D, the highest level, applies to the safety
goal of following a safe trajectory. According to the automation level, the
vehicle guidance function must be designed as fail-operational, i.e., the system
should safely continue to operate for a certain time after a failure of one of its
components. Different design patterns have been developed and implemented in
safety-critical systems with fail-operational behaviour and high safety levels, cf.
e.g. [13]. The variety of possibilities is illustrated by the following three concepts:
Figure 3: Different E/E architectures: (a) E/E architecture A; (b) E/E architecture B; (c) E/E architecture C.
SC1 - Triple Modular Redundancy (TMR), Fig. 2(b): The nominal function for
vehicle guidance is replicated into three paths, each fulfilling ASIL B. A Voter,
fulfilling ASIL D, ensures that any single incorrect path is eliminated.
SC2 - Nominal path and safety path, Fig. 2(c): Consists of two different paths,
a nominal path (n-Path) and a safety path (s-Path) in hot-standby mode. The n-
Path provides a full extent trajectory—including comfort functions not necessary
for safe operation—with ASIL QM and the s-Path a reduced extent trajectory but
with highest safety integrity level ASIL D. The reduced extent safety trajectory
is generated from a reduced s-EP (safety Environment Perception) and reduced
s-TP (safety Trajectory Planning). The Trajectory Checking and Selection
(TCS) verifies whether the trajectory calculated by the n-Path is within the safe
range calculated by the s-Path or not. In the case of failure, the s-Path takes
over the control and the safe trajectory with reduced extent is followed by the
AM. In this case, the system is considered to be degraded.
SC3 - Main path and fallback path, Fig. 2(d): Similar to SC2, although the main
path (m-Path) is now developed according to ASIL D in order to detect its own
hardware failures and signal them to the Switch. The Switch then hands over
the control of the AM to a fallback path (fb-Path), operated in cold-standby,
with ASIL D. Upon activation of the fb-Path, the system is considered to be
degraded.
2.2.2. Partitioning on E/E architecture
The next design step consists of extending the nominal E/E architecture
for vehicle guidance and partitioning the blocks of every safety concept on
its elements. The nominal E/E architecture is represented in Fig. 3(a). The
vehicle guidance function is implemented on an ADAS-platform (Advanced
Driver Assistance System) which is connected to all sensors. A number of
dedicated ECUs (Electronic Control Units) control the actuators. On an I-ECU
(Integration ECU), additional, non-dedicated actuation functions can be
implemented. Naturally, implementing all blocks from the safety concepts on
the ADAS in Architecture A defeats the purpose of the redundant paths.
Fig. 3 gives further illustrative examples for E/E architectures for the different
safety concepts: For SC1, Architecture B (Fig. 3(b)) allows an implementation
of the three redundant paths on separate ADAS-cores. The Voter could then
be implemented on the I-ECU. For SC2, the following two implementations
both yield ASIL D for the safety path, each with TCS and AM on the I-ECU:
(1) Executing the nominal path on one ADAS and redundant execution of the
s-Path on two ADAS-cores in lock-step mode, using Architecture B. (2) Encoded
execution [14] of the s-Path on a single ADAS+-core in Architecture C (Fig. 3(c)),
where the + refers to the additional hardware resources to run an encoded s-Path.
An E/E architecture for SC3 could be realised on Architecture C, where the
m-Path is implemented on ADAS1 and the fb-Path on ADAS+2. Alternatives
are considered in our experiments in Sect. 6.
2.2.3. Hardware platforms and faults
We assume that all hardware platforms can completely recover from transient
faults (e.g. by restarting the affected path), so that only transient faults directly
leading to a system failure are of importance. Transient faults quickly vanish.
Thus, the probability of an additional transient fault occurring is small. We
assume that this probability is negligible and that during a transient fault no other
faults occur [1]. This assumption amounts to modelling only single transient faults.
2.3. Measures
The safety goal for the considered systems is to avoid wrong vehicle guidance,
i.e., following unsafe trajectories. As the system is designed to be fail-operational,
the system should be able to maintain its core functionality for a certain time,
e.g. 10 seconds—even in the presence of faults. The safety goal is violated, if e.g.
two out of three TMR paths or both the n-Path and the s-Path fail. The goal
is also violated if e.g. a failure of the n-Path is not detected. The safety goal
is classified as ASIL D. We stress that safe faults do not need to be considered.
For the sake of conciseness, we define the complement of probability p as 1− p.
Several measures give insight into the safety performance of the different
safety concepts: System reliability refers to the probability that the system
safely operates during the considered operational lifetime. To obtain the average
failure probability per hour, the complement of the reliability is divided by the
lifetime in hours. Besides the reliability, the mean time to failure (MTTF) is a standard
measure of interest. We also consider degraded states in which some faults
already occurred in the system. If the system is in a degraded state it still safely
operates but provides reduced functionality. The following measures focusing on
degraded states reflect insights also relevant for customer satisfaction:
1) the probability that the system provides the full functionality at time t,
2) the fraction of system failures which occur without being in a degraded state
before,
3) the expected time to failure upon entering a degraded state,
4) the criticality of a degraded state, in terms of the probability that the system
fails within e.g. a typical drive cycle of one hour [1, 5:9.4] while being degraded
already, and
5) the effect on the overall system reliability when imposing limits on the time
a system remains operational in a degraded state.
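The average failure probability per hour mentioned above can be sketched numerically. This is a minimal illustration, assuming that scaling with the lifetime means dividing the complement of the reliability by the lifetime in hours; the function name and the example numbers are hypothetical and not taken from the paper.

```python
def average_failure_probability_per_hour(reliability: float, lifetime_hours: float) -> float:
    """Complement of the reliability, averaged over the operational lifetime."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be a probability in [0, 1]")
    if lifetime_hours <= 0:
        raise ValueError("lifetime must be positive")
    unreliability = 1.0 - reliability  # complement of the reliability
    return unreliability / lifetime_hours


# Hypothetical numbers: reliability 0.9999 over a lifetime of 10,000 hours.
afh = average_failure_probability_per_hour(0.9999, 10_000.0)
```

With these illustrative numbers the result is on the order of 10^-8 per hour, the magnitude named for ASIL D in the introduction.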
Figure 4: Node types in ((a)-(d)) static and (all) dynamic fault trees: (a) BE, (b) VOTk, (c) OR, (d) AND, (e) SEQ, (f) PAND, (g) SPARE, (h) FDEP.
It is important to consider the robustness or sensitivity of all measures w.r.t.
changes in the failure rates. Furthermore, it is beneficial for a proper analysis to
consider the system under (hypothesised) combinations of events.
3. Technical Background
3.1. Fault trees
Fault trees [15, 6] (FTs) are directed acyclic graphs (DAGs) with typed nodes
(AND, OR, etc.). Nodes of type T are referred to as “a T”. Nodes without
children (successors in the DAG) are basic events (BEs, Fig. 4(a)). Each BE is
equipped with some failure rate, or is assumed to not fail by itself (a dummy
event). The BEs of a fault tree F are denoted F_BE. Other nodes are gates
(Fig. 4(b)-(h)). We say a BE fails if the event occurs; a gate “fails” if its failure
condition over its children holds. The top-level event (TLE(F)) is a specifically
marked node of an FT F. TLE(F) fails iff the FT F fails.
3.1.1. Static fault trees
The key gate for static fault trees (SFTs, gates (b)-(d)) is the voting gate
(denoted VOTk) with threshold k and at least k children. A VOTk-gate fails if at least k
of its children have failed. A VOT1-gate equals an OR-gate, while a VOTk-gate
with k children equals an AND-gate.
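The relation between VOT, OR and AND can be made concrete with a small sketch. It assumes that the failure status of each child is already known as a Boolean; the function names are illustrative only.

```python
from typing import Sequence


def vot_fails(k: int, children_failed: Sequence[bool]) -> bool:
    """A VOTk-gate fails once at least k of its children have failed."""
    if not 1 <= k <= len(children_failed):
        raise ValueError("threshold k must satisfy 1 <= k <= number of children")
    return sum(children_failed) >= k


def or_fails(children_failed: Sequence[bool]) -> bool:
    # An OR-gate is a VOT1-gate.
    return vot_fails(1, children_failed)


def and_fails(children_failed: Sequence[bool]) -> bool:
    # An AND-gate is a VOTk-gate with exactly k children.
    return vot_fails(len(children_failed), children_failed)
```

For instance, `vot_fails(2, [True, False, True])` holds, which is the two-out-of-three condition used for the TMR voter.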
3.1.2. Dynamic fault trees
For fail-operational systems such as future vehicle guidance systems, essential
concepts such as (cold) redundancies and complex dependencies cannot be
modelled faithfully with static fault trees, cf. e.g. [16]. Dynamic fault trees
(DFTs) [17] are an extension well-suited to model and analyse the advanced
concepts. A recent account of precise DFT semantics, including corner cases
omitted here, is given in [18]. The presentation below is a more intuitive summary
of these semantics. DFTs additionally contain the following gates:
Sequence-enforcers. Sequence enforcers (SEQ, Fig. 4(e)) do not propagate
failures, but restrict the order in which BEs can fail: their children may only fail
left-to-right. Thus, SEQs exclude certain failure orders in the model. Contrary
to a widespread belief, SEQs cannot in general be modelled by SPAREs
(introduced below) [16]. For example, SEQs with children being gates cannot be
modelled with SPAREs. SEQs appear in the ISO 26262 [1, 10-B.3], where they
are indicated by the boxed L.
Priority-and. The priority-and (PAND, Fig. 4(f)) fails iff all children fail ordered
from left-to-right. If the children fail in any other order, e.g. a child failed before
its left sibling, the PAND cannot fail.
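The order-dependence of the PAND can be illustrated as follows. This is a sketch under the assumption that each child has either a known failure time or has not failed, and that simultaneous failures count as "not in order"; the precise treatment of such corner cases is left to the formal semantics in [18].

```python
from typing import Optional, Sequence


def pand_fails(failure_times: Sequence[Optional[float]]) -> bool:
    """failure_times[i] is the failure time of the i-th child (left to right),
    or None if that child has not failed."""
    if any(t is None for t in failure_times):
        return False  # not all children have failed
    # All children failed: require strictly increasing failure times.
    return all(a < b for a, b in zip(failure_times, failure_times[1:]))
```

An AND-gate over the same children would fail in both orders, whereas `pand_fails([2.0, 1.0])` is false because the right child failed first.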
Spare-gates. Spare-gates (SPARE, Fig. 4(g)) model spare-management and support
warm and cold standby. Warm (cold) standby corresponds to a reduced
(zero) failure rate. Like an AND, a SPARE fails if all children have failed.
Additionally, the SPARE activates its children from left to right: A child is
activated as soon as all children to its left have failed. Activating, and thereby
using, a child increases its failure rate. The children of a SPARE are
assumed to be roots of independent subtrees; these subtrees are called modules.
Upon activation of the root of a module, the full module is activated.
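The left-to-right activation rule can be sketched as below. The sketch assumes a single SPARE whose children are not shared with other gates, and it tracks only which children are activated and whether the gate has failed; failure rates and modules are not modelled. All names are hypothetical.

```python
from typing import List, Sequence, Set


def activated_children(children: Sequence[str], failed: Set[str]) -> List[str]:
    """A child is activated once all children to its left have failed."""
    activated = []
    for child in children:
        activated.append(child)   # every child to the left of this one has failed
        if child not in failed:
            break                 # children further right stay in standby
    return activated


def spare_fails(children: Sequence[str], failed: Set[str]) -> bool:
    # Like an AND, the SPARE fails if all of its children have failed.
    return all(child in failed for child in children)
```

For children `["primary", "spare1", "spare2"]`, a failure of `spare1` while in standby does not activate anything; only after `primary` has failed is the next child activated.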
Functional dependencies. Functional dependencies (FDEP, Fig. 4(h)) ease mod-
elling of feedback-loops. FDEPs have a trigger (a node) and a dependent event
(a BE). Instead of propagating failure upwards, upon failure of the trigger, the
dependent event fails. While FDEPs are syntactic sugar in SFTs, they cannot
be expressed by other gates in DFTs [11].
Activation dependencies. To overcome syntactic restrictions induced by SPAREs
and to allow greater flexibility with activation, we use activation dependencies
(ADEPs), as proposed in [16, Sect. 3E]. If the activation source is activated, the
activation destination is also activated. We typically use ADEPs in conjunction
with an FDEP, where the activation sources are the dependent events and the
activation destination is the trigger.
3.2. Markov Chains
The semantics of DFTs can be expressed in terms of Markov models, more
explicitly continuous-time Markov chains (CTMCs) [19].
Definition 1 (CTMC). A CTMC is a tuple C = (S, P, R, L) with
• S a finite set of states,
• P : S × S → [0, 1] a stochastic matrix with ∑_{s′ ∈ S} P(s, s′) = 1 for all s ∈ S,
• R : S → ℝ_{>0} a function assigning an exit rate R(s) to each state s ∈ S,
• L : S → 2^{AP} a labeling function assigning a set of atomic propositions
L(s) to each state s ∈ S.
The exit rate specifies the rate of a negative exponential distribution governing
the residence time in each state. The transition rate between states s and s′ is
defined as R(s, s′) = R(s) · P (s, s′). State labels are used in expressing desired
properties over CTMCs, and are used here to identify failed or degraded states
of the DFT.
Figure 5: A CTMC with states s0 (label ∅), s1 (label {a}), s2 (label {a}) and s3 (label {b}); the edges carry the transition rates 3, 6, 5, 4 and 7.
Example. Fig. 5 shows a CTMC with four states and its transition rates.
The corresponding exit rates and transition probabilities can be derived from
the transition rates. The exit rate for s0 is 3 + 5 = 8. The transition
probabilities are P(s0, s1) = 3/8 and P(s0, s2) = 5/8. State
s2 is labelled with a, state s3 is labelled with b.
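The derivation in this example can be reproduced with a few lines of code. The sketch only uses the two outgoing rates of s0 stated in the text (3 to s1 and 5 to s2); the helper name is hypothetical.

```python
from fractions import Fraction
from typing import Dict, Tuple


def exit_rate_and_probabilities(rates: Dict[str, int]) -> Tuple[int, Dict[str, Fraction]]:
    """Derive R(s) and P(s, .) from the outgoing transition rates R(s, .) of one state,
    using R(s, s') = R(s) * P(s, s')."""
    exit_rate = sum(rates.values())
    probabilities = {target: Fraction(rate, exit_rate) for target, rate in rates.items()}
    return exit_rate, probabilities


# Outgoing transition rates of s0 as given in the example.
exit_rate, probs = exit_rate_and_probabilities({"s1": 3, "s2": 5})
```

Here `exit_rate` is 8 and the probabilities are 3/8 and 5/8; multiplying each probability by the exit rate gives back the transition rates.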
4. Creating the Dynamic Fault Trees
This section describes in detail the approach from Fig. 1. In particular, we
discuss how to systematically create an FT consisting of three layers. The top-
level event is assumed to represent a safety-critical failure of the vehicle-guidance
system. From top to bottom, the layers are:
1. the system layer describes how the top-level event occurs due to failing
function blocks (e.g., EP or TP). Thus, the layer’s basic events describe
the failure of a function block.
2. the block layer describes how a function block can fail, taking into account
the possibility of failing computation, or wrong inputs. Thus, the layer’s
basic events describe failures of the E/E-architecture: either failures
of the hardware platform on which the function block is
executed, or failures of buses that realise the communication between
function blocks.
3. the hardware layer describes how the hardware platforms can fail, based
on failures of hardware components, which are considered as basic events.
The system- and block layer depend on the functional architecture of the system,
in particular on the structure and information captured in a (given) functional
block diagram. The fault trees for the hardware are typically constructed by the
manufacturers. The connections between the block layer and the hardware layer
are inferred from the E/E-architecture and the hardware partitioning.
4.1. From Function Block Diagram to the system- and block layer
We discuss the creation of the system- and block layer of the fault tree, based
on a functional architecture and a function block diagram for this architecture.
Definition 2 (Block diagram). A block diagram D = (Blc, ↝) is a finite directed
graph. The vertices Blc are a set of blocks, the edges ↝ ⊆ Blc × Blc are called channels.
Given a channel (B, B′) ∈ ↝, we call B the source and B′ the target. The set
in_B = {e ∈ ↝ | e ∈ Blc × {B}} denotes the input channels of block B, and
out_B = {e ∈ ↝ | e ∈ {B} × Blc} denotes its output channels. The block diagram
may contain cycles, cf. Fig. 8(b).
Within the scope of this paper, we advocate the structured manual creation
of the system- and block layer. A further discussion is given in Sect. 7.2. Below,
we discuss the creation of both layers.
Example 1. Consider the function block diagram in Fig. 6. Formally, the
diagram is given by Blc = {B1, B2, B3} and ↝ = {(B1, B3), (B2, B3)}.
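The block diagram of Example 1 and the input and output channel sets of Definition 2 can be written down directly. This is a small sketch with hypothetical names, representing channels as (source, target) pairs.

```python
from typing import Set, Tuple

Channel = Tuple[str, str]  # (source block, target block)

blocks: Set[str] = {"B1", "B2", "B3"}
channels: Set[Channel] = {("B1", "B3"), ("B2", "B3")}


def input_channels(block: str, channels: Set[Channel]) -> Set[Channel]:
    """Channels whose target is the given block."""
    return {c for c in channels if c[1] == block}


def output_channels(block: str, channels: Set[Channel]) -> Set[Channel]:
    """Channels whose source is the given block."""
    return {c for c in channels if c[0] == block}
```

In this toy diagram B3 has two input channels and no output channel, while B1 and B2 each have a single output channel.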
Figure 6: Toy-example function block diagram
Figure 7: Block fault trees: (a) Block FT; (b) Block FT – Voter; (c) Block FT – Switch
Remark. Typically, fault trees are created top-down. We deviate from this
scheme to ease the presentation, and show the potential for automation.
4.1.1. Creation of the block layer
For the systematic construction of the block layer, we assume that the possible
causes of a block to fail are described by an appropriate block fault tree. The
causes include faulty input channels.
Creating fault trees for each block. In a first step, we create a fault tree F^B for
each block B ∈ Blc in the block diagram. To reflect the interaction within both
the design pattern and the influence from the assigned hardware later on, we
annotate each block fault tree F^B with relevant information.
• For each input channel of B, we define a BE for faulty input.
• For each output channel c of B, we define an element in F^B which causes
B’s output on c to be faulty. Later, we propagate the faulty output to the
block fault tree of the target block along the channel c.
• We define a BE for the failure of the hardware platform that executes B.
Formally, we capture these fault trees as follows.
Definition 3 (Block fault tree). A block fault tree for block B is a tuple
(F^B, I^B, O^B, Y^B), where F^B is a fault tree, I^B : in_B → F^B_BE denotes input faults,
O^B : out_B → F^B output failures, and Y^B ∈ F^B_BE the hardware fault BE.
A fault tree for a block B with two input channels (c1, c2) and one output
channel (c3) is given in Fig. 7(a). The block fails if the hardware fails, if
some internal fault occurs, or if an input is faulty. Typically, the hardware failure
(formally, Y^B = hw) is connected to the hardware fault tree, see Sect. 4.2.1. The
Figure 8: Connecting block fault trees: (a) Simple block diagram with connected block fault tree; (b) Feedback loop and connected block FT.
internal fault can be used to support sub-blocks or for additional fallible events.
Faulty input means the failure of any of the inputs (formally, I^B(c1) = in 1,
I^B(c2) = in 2). The output is faulty if the block fails (formally, O^B(c3) = block).
Adaptations of the standard scheme are possible:
• For the voter block, as in triple modular redundancy (SC1), we assume
that the voter fails if two out of three inputs are faulty. Thus, we obtain a
block fault tree as depicted in Fig. 7(b), with three inputs, in which the block
fails if two out of three inputs fail, as reflected by a VOT2-gate.
• For the switch, as in SC3, the switching mechanism may fail. This only
leads to a failure when the path has to be switched. This behaviour is
reflected by the DFT in Fig. 7(c): the Switch fails if it either uses the
wrong path or if all input is wrong. The path is wrong if the switching
mechanism fails before the primary input fails, i.e., it can no longer switch
to an operational path, as reflected by a PAND-gate.
Block fault trees can easily be extended to several hardware failures, to in-
put/output faults of several types, etc.
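The three block fault trees of Fig. 7 can be paraphrased as Boolean conditions. The following sketch is one reading of the figure and the text, not the authors' implementation: it assumes that the standard and voter blocks are an OR over hardware fault, internal fault and input fault, that "in 1" is the primary input of the switch, and that failure times are given as numbers (or None if the event has not occurred). All function and parameter names are hypothetical.

```python
from typing import Optional, Sequence


def standard_block_fails(hw: bool, intern: bool, inputs: Sequence[bool]) -> bool:
    # Fig. 7(a): hardware fault, internal fault, or any faulty input.
    return hw or intern or any(inputs)


def voter_block_fails(hw: bool, intern: bool, inputs: Sequence[bool]) -> bool:
    # Fig. 7(b): the input fault is a VOT2 over the three inputs.
    return hw or intern or sum(inputs) >= 2


def switch_block_fails(t_switching: Optional[float],
                       t_primary: Optional[float],
                       t_secondary: Optional[float]) -> bool:
    # Fig. 7(c): all input wrong (AND), or wrong path (PAND: the switching
    # mechanism fails strictly before the primary input fails).
    all_input_wrong = t_primary is not None and t_secondary is not None
    wrong_path = (t_switching is not None and t_primary is not None
                  and t_switching < t_primary)
    return all_input_wrong or wrong_path
```

Note that a switching fault alone does not fail the switch block in this sketch; it only matters once the primary input fails afterwards.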
Connecting block fault trees. The goal of this step is to connect the inputs and
outputs of the block fault trees as specified by the relation ↝. The connected
block fault tree consists of the disjoint union of all block fault trees. Additionally,
for each channel c = (B, B′) in the block diagram, we connect the output failure
O^B(c) with the input fault I^{B′}(c). The connection is realised by means of an
FDEP with trigger O^B(c) and dependent event I^{B′}(c). As we only consider the
block layer, we do not have a top-level event.
Example 2. We illustrate the connected block fault tree in Fig. 8(a) for a toy
example. For each block, we use a standard block as in Fig. 7(a). We assume
Figure 9: Task-based construction of the system layer: (a) System layer for SC2; (b) System layer for SC3.
faulty input to the incoming channels of B3 if the respective source block fails.
A major benefit of FDEPs is that they allow feedback loops to be modelled
faithfully [15]. We illustrate the support for feedback loops in Fig. 8(b): For three
blocks as shown on top, the three DFTs are connected via FDEPs. If, e.g., B1
fails, the failure is propagated to the input of B2, etc. Using FDEPs, cyclic
dependencies can be modelled flexibly by (acyclic) fault trees.
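The effect of the FDEP connections, including feedback loops, can be mimicked by a simple fixpoint computation. This sketch assumes standard blocks as in Fig. 7(a), i.e. any faulty input makes the target block fail; it illustrates the propagation idea only and is not the DFT semantics used by the analysis tool. Names are hypothetical.

```python
from typing import Iterable, Set, Tuple

Channel = Tuple[str, str]  # (source block, target block)


def propagate_failures(channels: Set[Channel], initially_failed: Iterable[str]) -> Set[str]:
    """Blocks that fail once failures are forwarded along all channels."""
    failed = set(initially_failed)
    worklist = list(failed)
    while worklist:
        source = worklist.pop()
        for src, tgt in channels:
            if src == source and tgt not in failed:
                failed.add(tgt)       # faulty output of src is faulty input of tgt
                worklist.append(tgt)
    return failed


# Feedback loop as in Fig. 8(b): B1 -> B2 -> B3 -> B1.
loop = {("B1", "B2"), ("B2", "B3"), ("B3", "B1")}
```

Because each block is added to the failed set at most once, the computation terminates even on cyclic block diagrams.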
4.1.2. Creation of the system layer
The goal of this step is to express the occurrence of a safety-critical failure as
a fault tree over the failure of the blocks. Towards this goal, we proceed top-down.
A task-based partitioning of the DFT is helpful. The fault tree, i.e. the top-level
event, is assumed to fail if any of the tasks can no longer be executed (OR). A
task fails if no block can realise the task anymore (AND). We do not need to
consider error propagation through the blocks at this point, as this is already
handled by the connections in the connected block fault tree. We consider the
system layer for SC2 in Fig. 9(a): The system fails if either the planning, the
selection, or the actuator management fails. Planning fails if both the nominal
and the safety path fail.
To model cold/warm redundancy, we replace AND by the more general
SPARE-gate. Let us illustrate this for SC3. Recall that the fb-Path is in cold
standby. The trajectory planning and environment perception in the fallback path only
operate once they are required, i.e., as soon as the nominal path fails. This behaviour is
captured by using a SPARE instead of an AND, as in Fig. 9(b).
Example 3. Let us continue Example 2. Assume that the blocks B1 and B2
execute the same task with B2 in hot standby, and B3 executes another task. The
resulting DFT is depicted in Fig. 10. Green colour indicates the system layer
and blue colour the block layer.
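The task-based system layer of Example 3 can be evaluated with a short sketch: an OR over tasks, each of which is an AND over the blocks that realise it. The sketch assumes hot standby (so AND rather than SPARE suffices) and takes as input the set of blocks whose block fault tree has failed, i.e. after any propagation along channels. Names are hypothetical.

```python
from typing import Dict, Sequence, Set


def task_fails(realising_blocks: Sequence[str], failed_blocks: Set[str]) -> bool:
    # AND: the task fails if no block can realise it anymore.
    return all(block in failed_blocks for block in realising_blocks)


def system_fails(tasks: Dict[str, Sequence[str]], failed_blocks: Set[str]) -> bool:
    # OR: the top-level event fails if any task can no longer be executed.
    return any(task_fails(blocks, failed_blocks) for blocks in tasks.values())


# Example 3: B1 and B2 execute the same task, B3 executes another task.
tasks = {"task 1": ["B1", "B2"], "task 2": ["B3"]}
```

A failure of B1 alone leaves task 1 intact at this layer, whereas a failure of B3 fails task 2 and thus the system.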
Figure 10: Putting the system layer (green) and the block layer (blue) together
4.2. Adding the hardware layer to the system- and block layers
In this section, we discuss how to combine the system- and block layer fault
tree with the hardware layer. We assume additional input fault trees for all
hardware components, an E/E-architecture and a hardware assignment. Below,
we first discuss the additional inputs and then consider how to combine them into
a complete fault tree.
4.2.1. Fault trees for hardware platforms and buses
We assume that the (D)FTs to model hardware failures are provided by the
manufacturers, and we do not make any assumptions about the structure of
these FTs. We briefly illustrate how to integrate coverage and both transient
and permanent faults in DFTs, analogous to [1, 10:B]. We would like to stress
the dynamic nature of coverage by fallible safety mechanisms.
Consider the DFT depicted in Fig. 11. A failure results from either a transient
or permanent error. Each type has its own corresponding safety mechanism.
We consider the transient error. A transient error occurs if either the transient
fault is not covered by the safety mechanism, or the fault is covered but the
safety mechanism has failed before. We model the latter by a SEQ.1 It can
neither be modelled faithfully by a static FT nor a PAND. A PAND would model
that a covered error is never propagated if the covered fault occurs before. The
permanent error is modelled similarly.
¹ For other analysis purposes a model using a PAND might be more adequate.
Figure 11: Hardware Fault Tree
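The dynamic nature of coverage can be made concrete with a small three-state Markov model. This is a sketch of one possible reading of the transient branch of Fig. 11, not the construction used in the paper: a transient fault occurs with rate `fault_rate`, a fraction `coverage` of faults is covered, and the safety mechanism fails with rate `mech_rate`. A covered fault is assumed to vanish while the mechanism works; once the mechanism has failed, every fault leads to an error. All names are hypothetical.

```python
import math


def transient_error_probability(fault_rate, coverage, mech_rate, t):
    """Probability that a transient error has occurred by time t."""
    uncovered = (1.0 - coverage) * fault_rate
    leave_ok = uncovered + mech_rate          # exit rate while the mechanism works
    p_mech_ok = math.exp(-leave_ok * t)       # no error, mechanism still working
    if math.isclose(leave_ok, fault_rate):
        # Degenerate case: both exit rates coincide.
        p_mech_failed = mech_rate * t * math.exp(-fault_rate * t)
    else:
        # Mechanism failed at some point, but no fault has occurred since.
        p_mech_failed = (
            mech_rate
            * (math.exp(-fault_rate * t) - math.exp(-leave_ok * t))
            / (leave_ok - fault_rate)
        )
    return 1.0 - p_mech_ok - p_mech_failed


# Rates as in Sect. 6.1.2, evaluated at 10,000 hours.
p_asil_d = transient_error_probability(1e-4, 0.99, 1e-5, 10_000)
p_asil_b = transient_error_probability(1e-4, 0.90, 1e-5, 10_000)
```

The order of events matters here: a covered fault is only harmful if the safety mechanism has failed earlier, which is the behaviour a static fault tree cannot express.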
4.2.2. Hardware Assignment and E/E Architecture
We briefly discuss the assumptions about the hardware assignment and the
E/E architecture. We formalise the concepts as follows:
Definition 4 (E/E-architecture). An E/E-architecture is a tuple (H, Bus) of
a set H of hardware platforms, and a set Bus of buses. Each bus is a transitive
relation over hardware platforms, i.e., Bus ⊆ 2^{H×H}.
Often, buses are equivalence relations, i.e., they connect a set of platforms to
each other, as e.g. in FlexRay [20] or CAN [21]. We assume that all hardware
platforms p ∈ H contain an internal bus internal_p = {(p, p)} ∈ Bus. Consequently,
function blocks on the same hardware platform can always communicate.
Example 4. We formalise the E/E-architecture B from Fig. 3(b). The hardware
platforms H are:
{s_1, ..., s_n, ADAS1, ADAS2, ADAS3, I-ECU, ECU_1, ..., ECU_k, a_0, ..., a_k},
and the set Bus is given as
{ECU-actuator_0, ..., ECU-actuator_k, CAN-BUS} ∪ {internal_p | p ∈ H},
with
• ECU-actuator_0 = {(I-ECU, a_0), (a_0, I-ECU)},
• ECU-actuator_i = {(ECU_i, a_i), (a_i, ECU_i)}, and
• CAN-BUS = (H \ {a_0, ..., a_k}) × (H \ {a_0, ..., a_k}).
Definition 5 (Hardware assignment). Given a block diagram D = (Blc, ▷) with
blocks Blc and channels ▷, and an E/E-architecture (H, Bus), a hardware
assignment is a function h : D → H ∪ Bus s.t. h(Blc) ⊆ H and h(▷) ⊆ Bus ∪ H.
More precisely, hardware assignments map internal channels to hardware
platforms, and all other channels to buses. A hardware assignment is consistent
if the source and target of each channel are connected by a bus. Thus, the
assignment is consistent if
(B, B′) ∈ ▷ ⟹ ∃C ∈ Bus with (h(B), h(B′)) ∈ C.
A hardware assignment trivially supports mapping several function blocks to
the same hardware. In this case, we assume that a failure of a function block does
not affect other function blocks on the same hardware. If necessary, dependent
failures of function blocks can be modelled explicitly in the DFT using FDEPs.
We require that any function block runs on at most one hardware platform. If
the same function should be implemented on several hardware platforms, we
require an explicit duplication in the function block diagram.
Remark. We do not consider whether a hardware assignment is feasible, i.e.,
whether the hardware platforms provide the (computational) resources to realise
the functions.
Block-based assignments. Typically, we are only given an assignment from blocks
to hardware platforms; the channel-assignment then follows from the E/E
architecture. That is, we assume that any channel between two blocks assigned
to the same hardware platform is realised by the internal bus. For channels that
connect blocks which are assigned to different platforms, we select the unique
bus (e.g., the CAN-bus) that connects these two platforms. If there is no unique
bus, then manual intervention is required to specify the proper bus.
Example 5. Consider the block diagram of Example 2, and the E/E-architecture
as in Example 4. The mapping of the blocks h(B1) = ECU1, h(B2) = h(B3) =
ADAS1 induces a unique mapping of the channels according to the scheme
described above: h((B1, B3)) = CAN-BUS and h((B2, B3)) = internal_ADAS1.
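The block-based scheme and the consistency condition can be sketched in a few lines. The code below is an illustration under assumptions: it instantiates architecture B with n = 2 sensors and k = 1, and the helper names `clique`, `assign_channels` and `is_consistent` are hypothetical and do not stem from the tool-chain.

```python
from itertools import product


def clique(platforms):
    """Bus connecting every listed platform with every other one."""
    return frozenset(product(platforms, repeat=2))


# Toy instance of architecture B (n = 2 sensors, k = 1 is an assumption).
H = {"s1", "s2", "ADAS1", "ADAS2", "ADAS3", "I-ECU", "ECU1", "a0", "a1"}
actuators = {"a0", "a1"}
buses = {
    "ECU-actuator0": frozenset({("I-ECU", "a0"), ("a0", "I-ECU")}),
    "ECU-actuator1": frozenset({("ECU1", "a1"), ("a1", "ECU1")}),
    "CAN-BUS": clique(H - actuators),
}
buses.update({f"internal_{p}": frozenset({(p, p)}) for p in H})


def assign_channels(block_to_hw, channels, buses):
    """Derive the channel assignment from a block-to-platform assignment."""
    assignment = {}
    for source, target in channels:
        pair = (block_to_hw[source], block_to_hw[target])
        if pair[0] == pair[1]:
            candidates = [f"internal_{pair[0]}"]   # same platform: internal bus
        else:
            candidates = sorted(n for n, rel in buses.items() if pair in rel)
        if len(candidates) != 1:
            raise ValueError(f"no unique bus for channel {(source, target)}")
        assignment[(source, target)] = candidates[0]
    return assignment


def is_consistent(block_to_hw, channels, buses):
    """Every channel must be realisable by at least one bus."""
    return all(
        any((block_to_hw[s], block_to_hw[t]) in rel for rel in buses.values())
        for s, t in channels
    )


block_to_hw = {"B1": "ECU1", "B2": "ADAS1", "B3": "ADAS1"}
channels = [("B1", "B3"), ("B2", "B3")]
channel_map = assign_channels(block_to_hw, channels, buses)
```

For the assignment of Example 5, `channel_map` maps (B1, B3) to CAN-BUS and (B2, B3) to internal_ADAS1. The `ValueError` branch corresponds to the case in which manual intervention is required.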
4.2.3. Constructing a complete fault tree
To obtain a complete DFT of the vehicle guidance system, we take the disjoint
union of the DFT F for the system and block layer, and the hardware DFTs
{F_y | y ∈ H ∪ Bus}. The DFT F contains annotations Y_B for each block B.
For each B ∈ Blc, an FDEP with trigger TLE(F_{h(B)}) and dependent event Y_B is
added, and an ADEP in the reverse direction. The FDEP ensures that a failure
of the hardware platform causes the function blocks executed on that hardware
to fail. The ADEP ensures that the hardware FTs are correctly activated if
the corresponding function blocks are activated. Furthermore, for each channel
c = (B, B′) ∈ ▷, an FDEP with trigger TLE(F_{h(c)}) and dependent event I_{B′}(c)
is added. The FDEP ensures that if the bus fails, the input of the target block
fails as well.
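The construction amounts to adding a fixed set of dependencies per block and per channel. As a hedged sketch, the following function lists these dependencies as plain tuples. The tuple encoding and the function name `connect_layers` are assumptions for illustration and do not reflect the data structures of the actual tool-chain.

```python
def connect_layers(block_to_hw, channel_to_hw):
    """List the dependencies that glue block FTs to hardware FTs.

    Each entry is (kind, trigger, dependent event), written as strings.
    """
    dependencies = []
    for block, hw in sorted(block_to_hw.items()):
        # Hardware failure causes the block to fail.
        dependencies.append(("FDEP", f"TLE(F_{hw})", f"Y_{block}"))
        # Activating the block activates the hardware FT (reverse direction).
        dependencies.append(("ADEP", f"Y_{block}", f"TLE(F_{hw})"))
    for (source, target), hw in sorted(channel_to_hw.items()):
        # Bus failure causes the corresponding input of the target block to fail.
        dependencies.append(
            ("FDEP", f"TLE(F_{hw})", f"I_{target}({source},{target})")
        )
    return dependencies


# Hardware assignment of Example 5.
block_to_hw = {"B1": "ECU1", "B2": "ADAS1", "B3": "ADAS1"}
channel_to_hw = {("B1", "B3"): "CAN-BUS", ("B2", "B3"): "internal_ADAS1"}
dependencies = connect_layers(block_to_hw, channel_to_hw)
```

For three blocks and two channels this yields eight dependencies: one FDEP and one ADEP per block, and one FDEP per channel.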
Example 6. We connect the system and block layer from Example 3 with hard-
ware fault trees. The connection is based on the hardware assignment in Exam-
ple 5. The relevant fragment of the complete fault tree is depicted in Fig. 12.
Blue colour indicates the block fault trees and red colour the hardware fault trees.
Figure 12: Fragment of a complete fault tree for a toy example
Figure 13: Overview of the DFT analysis (the DFT is rewritten into a simplified DFT, from which a CTMC is generated; a safety measure, e.g., MTTF, is converted into a model-checking query, e.g., ET(♦ failed); Storm analyses the CTMC against the query and yields the result)
5. Analysing the Dynamic Fault Trees
Our aim is to analyse the complete FTs constructed as in Sect. 4 with respect
to the eight measures described in Sect. 2.3. We build upon the model checker
Storm [8], a state-of-the-art tool for the automated analysis of DFTs [7].
The full tool-chain for the analysis is depicted in Fig. 13. First, the DFT is
simplified, and converted into a CTMC. Details are given in Sect. 5.1. Second,
a measure is translated into a model-checking query, as detailed in Sect. 5.2.
Then, the result for the query on the CTMC is computed based on traditional
probabilistic model checking, as detailed in Sect. 5.3.
5.1. Model generation
In the following we describe the translation from a DFT into a CTMC. This
translation is a key step in the analysis process. The main challenge is the
possible state-space explosion. Several techniques such as rewriting DFTs [11],
or partial state-space generation [7] have been developed to address the issue.
5.1.1. Rewriting DFTs
The structured creation of FTs induces an extensive structure, which is often
counterproductive for the analysis performance. Thus, the first step is to rewrite
the given FT using the techniques from [11]. Rewriting aims to automatically
simplify the FT to improve the performance of its analysis while preserving the
semantics of the original FT. Examples of rewriting include the elimination of
superfluous levels of ORs, and the removal of FDEPs whose triggers already lead
to the failure of the top-level event.
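The elimination of superfluous OR levels can be sketched as follows. This is only an illustration of one rewrite rule on a simple tree encoding (a basic event is a string, a gate is a pair of gate type and child list). It is not the rewriting framework of [11], and it assumes that the inner ORs have no further parents and are not referenced by dependencies.

```python
def flatten_or(node):
    """Merge an OR that is a direct child of another OR into its parent."""
    if isinstance(node, str):          # basic event
        return node
    gate, children = node
    flat = []
    for child in map(flatten_or, children):
        if gate == "OR" and not isinstance(child, str) and child[0] == "OR":
            flat.extend(child[1])      # lift the grandchildren
        else:
            flat.append(child)
    return (gate, flat)


tree = ("OR", [("OR", ["A", ("OR", ["B", "C"])]), ("AND", ["D", "E"])])
simplified = flatten_or(tree)
```

The simplified tree has a single OR over A, B, C and the AND gate, and fails for exactly the same sets of failed basic events as the original tree.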
5.1.2. State space generation
Next, the simplified FT is translated into a CTMC [7, 22] by a process
referred to as state-space generation. Transitions in the CTMC correspond to
BEs failing. States record in which order BEs have failed and administer the
status of children of SPAREs, e.g., whether they are currently in use. Additionally,
states are labelled to denote which events have failed.
The state-space generation is computationally the most expensive step as
it explores the complete state space defined by the successive failures of BEs.
To improve the performance, several optimisations are used [7]. For example,
Storm exploits symmetric structures occurring in the FT modelling the sensors.
When using the default translation, the FT is converted into a Markov
Automaton [23], an extension of CTMCs with non-determinism. The non-
determinism stems from different orders in which FDEPs propagate their failure.
However, in our setting, the order of FDEP failures does not influence the
obtained analysis result. Therefore, we fix a failure order for dependent events
and restrict ourselves to MAs without non-determinism, that is, CTMCs.
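To illustrate the idea of state-space generation, the following sketch explores the states of a small static fault tree. It is a strong simplification and not the algorithm of [7]: states are sets of failed BEs (sufficient for AND and OR, whereas dynamic gates additionally require the failure order), failed states are not explored further, and no symmetry reduction is applied. The tree encoding and all names are assumptions.

```python
from collections import deque


def evaluate(node, failed):
    """Return True if the (static) fault tree node has failed."""
    if isinstance(node, str):
        return node in failed
    gate, children = node
    results = [evaluate(child, failed) for child in children]
    if gate == "AND":
        return all(results)
    if gate == "OR":
        return any(results)
    raise ValueError(f"unsupported gate {gate}")


def generate_ctmc(tree, rates):
    """Breadth-first exploration; returns transition rates and failed states."""
    initial = frozenset()
    transitions, failed_states = {}, set()
    queue, seen = deque([initial]), {initial}
    while queue:
        state = queue.popleft()
        if evaluate(tree, state):
            failed_states.add(state)
            transitions[state] = {}            # failed states are absorbing
            continue
        successors = {
            state | {be}: rate for be, rate in rates.items() if be not in state
        }
        transitions[state] = successors
        for successor in successors:
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return transitions, failed_states


tree = ("OR", [("AND", ["A", "B"]), "C"])
transitions, failed_states = generate_ctmc(tree, {"A": 1e-5, "B": 1e-5, "C": 1e-7})
```

Even for this tree with three BEs, the exploration yields seven states, four of which are failed. The exponential growth in the number of BEs is the reason for the optimisations mentioned above.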
5.1.3. Transient faults
Transient faults are considered during the state-space generation. Recall
from Sect. 2.2.3 that if a transient fault occurs, either the system fails directly, or
the fault disappears and the system returns to its previous state. During the
state-space generation, transient faults are considered in each state similarly to
regular BE faults. However, the corresponding transition is only added to the
CTMC if the TLE has failed at the target state of the particular transition.
5.2. Converting measures
The generated CTMCs are analysed against properties defined in continuous
stochastic logic (CSL) with reward extensions [24]. We convert the safety mea-
sures introduced in Sect. 2.3 into model-checking queries composed of standard
CSL properties.
5.2.1. CSL properties
Sets of states in a CTMC are described by Boolean combinations over the
state labelings. Thus, for a CTMC constructed from a fault tree, sets of states
can be described by combinations of failed and operational events. The following
three standard CSL properties can be solved efficiently and are the building
blocks for our queries.
             Measure      Model-checking query
System       Reliability  1 − P(♦^{≤t} failed)
             AFH          (1/lifetime) · P(♦^{≤lifetime} failed)
             MTTF         ET(♦ failed)
Degradation  FFA          1 − P(♦^{≤t} (failed ∨ degraded))
             FWD          P((¬degraded) U^{≤t} (¬degraded ∧ failed))
             MTDF         Σ_{s∈degraded} (P(¬degraded U s) · ET^s(♦ failed))
             MDR          argmin_{s∈degraded} (1 − P^s(♦^{≤t} failed))
             FLOD         Σ_{s∈degraded} (P(¬degraded U^{≤t} s) · P^s(♦^{≤drivecycle} failed))
             SILFO        1 − (FWD + FLOD)
Table 1: Model-checking queries
Reach-avoid probability. Given a state s ∈ S, a set of target states and a set of
bad states, the property P^s(¬bad U target) describes the probability to eventually
reach a target state from state s without visiting a bad state in-between. If the
set of bad states is empty, the reach-avoid probability reduces to P^s(♦ target),
and is just a reachability probability. If s is the initial state, then we omit the
superscript and write P(♦ target).
Time-bounded reach-avoid probability. Incorporating an additional time-bound
t, the property P^s(¬bad U^{≤t} target) describes the probability to reach a target
state (while avoiding bad states) from state s within time-bound t. As
before, time-bounded reachability is described by P^s(♦^{≤t} target).
Expected time. Given a state s and a set of target states, the property ET^s(♦ target)
describes the expected time to reach a target state from s. The expected time
is only defined if the probability to reach target is one.
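Time-bounded reachability on a CTMC is commonly computed by uniformisation. The following pure-Python sketch shows this standard technique on a dictionary-based CTMC. It is meant as an illustration only and makes no claim about how Storm implements the computation. The function name, the encoding `rates[s][s2]` and the truncation threshold `eps` are assumptions, and the simple Poisson recursion is only suitable for moderate values of q · t.

```python
import math


def bounded_reachability(rates, initial, targets, t, eps=1e-12):
    """Approximate P(♦≤t targets) from `initial` by uniformisation.

    rates[s][s2] is the transition rate from s to s2; every state is a key.
    """
    states = list(rates)
    # Target states are made absorbing.
    exit_rate = {s: 0.0 if s in targets else sum(rates[s].values()) for s in states}
    q = max(exit_rate.values()) or 1.0         # uniformisation rate
    dist = {s: 0.0 for s in states}
    dist[initial] = 1.0
    weight = math.exp(-q * t)                  # Poisson(q*t) probability of k = 0
    result, accumulated, k = 0.0, 0.0, 0
    while accumulated < 1.0 - eps and k < 100_000:
        result += weight * sum(dist[s] for s in targets)
        accumulated += weight
        k += 1
        weight *= q * t / k                    # Poisson probability of k jumps
        nxt = {s: 0.0 for s in states}
        for s in states:                       # one step of the uniformised DTMC
            nxt[s] += dist[s] * (1.0 - exit_rate[s] / q)
            if s not in targets:
                for s2, rate in rates[s].items():
                    nxt[s2] += dist[s] * rate / q
        dist = nxt
    return result


chain = {"ok": {"degraded": 1.0}, "degraded": {"failed": 3.0}, "failed": {}}
p_failed = bounded_reachability(chain, "ok", {"failed"}, t=1.0)
```

For the two-step chain above, the result can be checked against the closed form 1 − (3e^{−t} − e^{−3t})/2 of the hypoexponential distribution.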
5.2.2. Conversion
The conversion of the measures in Sect. 2.3 into model-checking queries
is given in Tab. 1. The atomic proposition failed denotes states where the
top-level event in the FT has failed and degraded denotes states where only
reduced functionality is available.
The first three model-checking queries consider the safety performance of the
complete system. For reliability, the time-bounded reachability of a failed state is
considered. The complement of this probability describes the probability that the
system is still operational at time-bound t. The average failure-probability
per hour (AFH) is obtained similarly by using the lifetime of the system as
time-bound and afterwards normalising by the lifetime. Mean time to failure
(MTTF) is the expected time until the top-level event fails. In the considered
DFTs, the expected time is always defined.
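For small static systems, the three system measures have closed forms, which is useful for sanity-checking a model. The sketch below considers only the sensor subsystem of the set-up in Sect. 6.1 in isolation (four identical sensors, at least two required) and assumes independent exponential failures without coverage. The function names are hypothetical.

```python
import math


def k_of_n_reliability(k, n, rate, t):
    """Probability that at least k of n identical components work at time t."""
    p = math.exp(-rate * t)                      # one component still works
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1)
    )


def k_of_n_mttf(k, n, rate):
    """Expected time until fewer than k components work."""
    # With i working components the next failure occurs with rate i * rate.
    return sum(1.0 / (i * rate) for i in range(k, n + 1))


def afh(k, n, rate, lifetime):
    """Average failure probability per hour over the given lifetime."""
    return (1.0 - k_of_n_reliability(k, n, rate, lifetime)) / lifetime


sensor_rate = 1e-7                               # per hour, as in Sect. 6.1.2
sensor_mttf = k_of_n_mttf(2, 4, sensor_rate)     # equals (13/12) / rate
sensor_afh = afh(2, 4, sensor_rate, 10_000)
```

These values concern the sensors alone and are not comparable to the system-level results, which also include the function blocks, hardware platforms and buses.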
The second set of model-checking queries describes the influence of degrada-
tion in the system.
1) Full Function Availability (FFA) describes the time-bounded probability that
the system provides full functionality, i.e., it is neither failed nor degraded. It
is described as the complement of the time-bounded reachability of a failed
or degraded state.
2) Failure Without Degradation (FWD) describes the time-bounded probability
that the system fails without being degraded first. It is the time-bounded
reach-avoid probability of reaching a failed state without reaching a degraded
state.
3) Mean Time from Degradation to Failure (MTDF) describes the expected
time from the moment of degradation to system failure. It is obtained by
taking the expected time of failure for each degraded state and scaling it with
the probability to reach this state while not being degraded before.
4) Minimal Degraded Reliability (MDR) describes the criticality of degraded
states by giving the worst-case failure probability when using the system in a
degraded state. For all degraded states the time-bounded reachability of a
TLE failure is computed. The MDR is the minimum over the complement of
this result for all degraded states.
5) System Integrity under Limited Fail-Operation (SILFO) considers the system-
wide impact of limiting the degraded operation time. SILFO is split into two
parts considering failures without degradation (FWD) and failures with degra-
dation (FLOD). Failure under Limited Operation in Degradation (FLOD)
describes the probability of failure when imposing a time limit for using
a degraded system. For all degraded states the time-bounded reachability
probability of a failed state is computed within the restricted time-bound
given by a drive cycle. This value is scaled by the time-bounded reach-avoid
probability of reaching a degraded state without degradation before.
For sensitivity analysis, several model-checking queries on CTMCs of DFTs with
different failure rates are performed.
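The degradation measures can be checked by hand on the smallest CTMC that contains a degraded state. The sketch below assumes a hypothetical three-state chain with rates a (operational to degraded), b (operational to failed) and c (degraded to failed). The rates and the function name are illustrative and not taken from the case study.

```python
import math


def degradation_measures(a, b, c, t, drive_cycle):
    """Measures of Tab. 1 for: ok -a-> degraded, ok -b-> failed, degraded -c-> failed."""
    leave_ok = a + b
    left_by_t = 1.0 - math.exp(-leave_ok * t)    # initial state left within t
    reach_degraded = a / leave_ok                # first jump leads to degraded
    fwd = b / leave_ok * left_by_t               # fails directly within t
    flod = reach_degraded * left_by_t * (1.0 - math.exp(-c * drive_cycle))
    return {
        "FFA": 1.0 - left_by_t,
        "FWD": fwd,
        "MTDF": reach_degraded * (1.0 / c),
        "MDR": math.exp(-c * t),                 # only one degraded state
        "FLOD": flod,
        "SILFO": 1.0 - (fwd + flod),
    }


measures = degradation_measures(a=2e-5, b=1e-6, c=1e-4, t=10_000, drive_cycle=1.0)
```

In this toy chain, FLOD is much smaller than FWD because the drive cycle is short compared to the mean time spent in the degraded state.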
5.3. Analysis via model checking
The resulting CTMC is analysed w.r.t. the given model-checking queries by
using state-of-the-art probabilistic model checkers such as Storm [8].
5.3.1. Model checking
The first five measures are computed with a single model-checking query each.
Furthermore, it is possible to compute (time-bounded) reach-avoid probabilities
starting in a state s for all states s ∈ S simultaneously. Thus, MDR can be
obtained by computing P^s(♦^{≤t} failed) for all degraded states in a single query
and afterwards computing the minimum over the complement of all results.
However, the computation of MTDF and SILFO requires the computation
of (time-bounded) reach-avoid probabilities reaching a degraded state s for
all degraded states s ∈ S independently. A naive implementation therefore
would require a model-checking query for each degraded state, increasing the
computation time drastically. Using similar ideas as in [25], we implemented an
improved algorithm computing (time-bounded) reach-avoid probabilities for all
states simultaneously. The improved algorithm works as follows.
Figure 14: Approximation, based on [7] (from the DFT a partial state space is generated on the fly; CTMCs for the upper and the lower bound are model checked, yielding an approximation [l, u] that contains the exact result; the partial state space is extended for refinement)
The standard approach for reach-avoid probabilities is a “backwards” com-
putation of the probabilities starting from the target states. The result is a
vector of the (time-bounded) probabilities to reach a target state for each state.
The idea here now is to perform a “forwards” computation of the probabili-
ties starting from the initial state. The result is a vector of (time-bounded)
probabilities to reach each target state from the initial state. The improved
algorithm performs a model-checking query only once for all degraded states at
the same time. Thus, the computation of MTDF can be reduced to performing
one model-checking query for P(¬degraded U s), one query for ET^s(♦ failed) and
combining the results afterwards. The computation of SILFO can be performed
similarly by combining the results of the three model-checking queries for
FWD, P(¬degraded U^{≤t} s) and P^s(♦^{≤drivecycle} failed).
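The forward computation can be sketched for the time-unbounded case. The code below is an illustration under assumptions and not the implementation in Storm: it requires an acyclic CTMC in which the failed states are the only absorbing states, and it propagates probability mass through the embedded jump chain in topological order. The names and the example chain are hypothetical.

```python
from graphlib import TopologicalSorter


def topological_order(rates):
    predecessors = {s: set() for s in rates}
    for s, successors in rates.items():
        for s2 in successors:
            predecessors[s2].add(s)
    return list(TopologicalSorter(predecessors).static_order())


def first_degraded_probabilities(rates, initial, degraded):
    """P(¬degraded U s) for all degraded s, in a single forward pass."""
    mass = {s: 0.0 for s in rates}
    mass[initial] = 1.0
    for s in topological_order(rates):
        if s in degraded or not rates[s]:
            continue                      # mass stops at the first degraded state
        exit_rate = sum(rates[s].values())
        for s2, rate in rates[s].items():
            mass[s2] += mass[s] * rate / exit_rate
    return {s: mass[s] for s in degraded}


def expected_times_to_absorption(rates):
    """ET^s(♦ failed) for all s, in a single backward pass."""
    expected = {}
    for s in reversed(topological_order(rates)):
        exit_rate = sum(rates[s].values())
        expected[s] = 0.0 if not rates[s] else 1.0 / exit_rate + sum(
            rate / exit_rate * expected[s2] for s2, rate in rates[s].items()
        )
    return expected


rates = {
    "ok": {"d1": 2.0, "d2": 1.0, "failed": 1.0},
    "d1": {"d2": 1.0, "failed": 1.0},
    "d2": {"failed": 2.0},
    "failed": {},
}
reach = first_degraded_probabilities(rates, "ok", {"d1", "d2"})
times = expected_times_to_absorption(rates)
mtdf = sum(reach[s] * times[s] for s in reach)
```

One forward pass and one backward pass suffice to obtain MTDF, independent of the number of degraded states.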
5.3.2. Evidence
To flexibly support the analysis of degraded states, we use the concept of
evidence. Evidence is given as a set of BEs considered as already failed, and
all possible failure orderings (traces) of these BEs are examined. Following the
traces from the initial state yields a set S′ of states describing the system status
based on the given evidence. Computing a model-checking query using evidence
gives results starting in each state s ∈ S′. When the complete state space has
been built before, it is beneficial to reuse it for analysing the degraded system.
5.3.3. Approximation
The main bottleneck of the DFT analysis is the state-space explosion problem.
For MTTF computations, the problem can be alleviated by building only the
most relevant fraction of the state space and deriving approximate results [7].
The idea is visualised in Fig. 14. For a given DFT a partial state space is
generated. From this partial state space two CTMCs are built. One CTMC
captures the behaviour over-approximating the exact result and one CTMC
under-approximates the result. The former is obtained by assuming that all
remaining (i.e., still operational) BEs have to fail in order for the DFT to fail, the
latter results by assuming that a failure of any single BE leads to a failure. By
model checking both CTMCs an approximation [l, u] can be derived. The exact
result for the original DFT is guaranteed to lie in this interval. If the interval
[l, u] is too coarse, the state space is refined by exploring additional parts of the
state space. Incremental refinement of the state space approximates the exact
result up to the desired precision.
We extend the approximation algorithm of [7] to also compute the reliability.
To over-approximate the probability of a system failure we assume that each
terminal state in the partial state space corresponds to the failure of the TLE.
The CTMC for the under-approximation is constructed by assuming that each
terminal state is absorbing. Due to the absorption, the TLE cannot fail in the
future if it has not failed in one of the terminal states.
6. Experiments
We show the applicability of the proposed methodology on systems and
concepts similar to those from Sect. 2.2. The generated (anonymized) DFTs are
available on our website.²
6.1. Set-up
All experiments are executed on a 2.9 GHz Intel Core i5 with 8 GB RAM
and a time-out (TO) of one hour. We consider the three safety concepts SC1,
SC2 and SC3. For each safety concept, we construct a system- and block layer
fault tree as described in Sect. 4.1. All scenarios include four sensors, of which
at least two are required for safe operation, and four actuators, which are all
required for safe operation.
6.1.1. Different partitioning schemes
We discuss several partitioning schemes for the safety concepts, which we
refer to as scenarios. Each scenario is defined by the safety concept, the used
architecture with possible adaptions, the hardware assignment, and the fraction
of sensors and actuators which have to be operational. An overview of (a selection
of) considered scenarios is given in Tab. 2. We consider E/E-architectures A,
B and C depicted in Fig. 3, and vary the architectures, by e.g. considering a
redundant bus or introducing more ADAS platforms. Additionally we consider
partitions which do not use the I-ECU, and instead assign functions to the
ADAS2. Furthermore, we scale the number of sensors and actuators and the
required number of sensors for safe operation.
6.1.2. Failure rates
For presentation purposes we assume the following failure rates, which do not
necessarily reflect reality and especially do not reflect any system from BMW AG.
We assume that function blocks, e.g., EP, are free of systematic (internal) faults.
² http://www.stormchecker.org/publications/dfts-for-vehicle-guidance-systems

Table 2: Scenarios
Scenario  Safety concept  Architecture  Adaptions       Sensors  Actuators
I         SC1             B             —               2/4      4/4
II        SC2             B             —               2/4      4/4
III       SC2             C             ADAS+           2/4      4/4
IV        SC3             C             —               2/4      4/4
V         SC2             A             —               2/4      4/4
VI        SC2             B             removed I-ECU   2/4      4/4
VII       SC2             B             5 ADAS, 2 BUS   2/8      7/7
VIII      SC2             B             8 ADAS, 2 BUS   2/8      7/7

Sensors, actuators, and (I-)ECUs have failure rates of 10^{-7}/h. In the ADAS
hardware platforms transient faults occur with rate 10^{-4}/h and permanent faults
occur with rate 10^{-5}/h. All faults can be detected by a safety mechanism which
itself fails with rate 10^{-5}/h. The safety mechanism for ASIL D covers 99% of
the faults, the safety mechanism for ASIL B covers 90%, and the one for ASIL
QM covers 60% of the faults. For the ADAS+ platform, the failure rates increase
by a factor 10 and the safety mechanism covers 99.9% of all faults.
6.1.3. Tool-support
The complete workflow depicted in Fig. 1, i.e., both the generation of the
DFTs and their analysis, is supported by a Python tool-chain. Given
a safety concept as a function block diagram and the FT for each block, the
tool-chain first automatically generates an FT where dependencies are inserted
according to the data flow in the safety concept. Given an E/E architecture with
a partitioning and hardware FTs, and the system- and block layer FT generated
before, the complete FT is automatically constructed. This generation of the
complete FT is performed in milliseconds. Finally, the analysis of the complete
FT as depicted in Fig. 13 is performed fully automatically as well. Timings for
the analysis are given in Tab. 5.
We describe the failure rates in the hardware FTs symbolically, i.e., as
parameters. Thus, changes in coverage or failure rates require only a different
instantiation of the parameters, instead of reconstructing the FT.
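The benefit of symbolic rates can be sketched as a parameter table that is instantiated per platform. The code below is an assumption-laden illustration and not the format used by the tool-chain: the split into covered and uncovered rates, the dictionary keys, and the choice to leave the failure rate of the safety mechanism unscaled are all hypothetical. The numbers are those of Sect. 6.1.2.

```python
BASELINE = {"transient": 1e-4, "permanent": 1e-5, "mechanism": 1e-5}  # per hour
COVERAGE = {"QM": 0.60, "B": 0.90, "D": 0.99}


def instantiate(asil, scale=1.0, coverage=None):
    """Return concrete rates for one hardware FT.

    `scale` multiplies the fault rates (e.g. 10 for ADAS+), and `coverage`
    overrides the ASIL default (e.g. 0.999 for ADAS+).
    """
    c = COVERAGE[asil] if coverage is None else coverage
    params = {"coverage": c, "mechanism": BASELINE["mechanism"]}
    for kind in ("transient", "permanent"):
        rate = BASELINE[kind] * scale
        params[f"{kind}_covered"] = c * rate
        params[f"{kind}_uncovered"] = (1.0 - c) * rate
    return params


adas_asil_d = instantiate("D")
adas_plus = instantiate("D", scale=10.0, coverage=0.999)
```

A sensitivity analysis then only calls `instantiate` with different arguments, while the structure of the fault tree stays untouched.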
6.2. Evaluation
In the following we consider the scenarios of Tab. 2. The characteristics
of the corresponding DFTs and CTMCs can be found in Tab. 3. The first
column identifies the scenario. The following three columns give the number
of BEs, dynamic gates and the total number of nodes in the DFT. The last
three columns describe the CTMC obtained after generating the state space and
applying reduction techniques. The columns contain the number of states and
transitions, and the percentage of degraded states in the CTMC.
The analysis results for the measures from Sect. 5.2 are given in Tab. 4.
Notice that SC1 does not contain degraded states and in scenario V the system
fails before reaching a degradation. We use a lifetime of 10,000 hours, and a
drive cycle of 1 hour [1, 5:9.4]. The times for generating the CTMC from the
DFT and computing each measure on the CTMC are given in Tab. 5.

Table 3: Model characteristics
        DFT                     CTMC
Scen.   #BE   #Dyn.   #Elem.   #States     #Trans.      Degrad.
I       76    25      233      5,377       42,753       —
II      70    23      211      5,953       50,049       19%
III     57    19      168      1,153       7,681        17%
IV      57    21      170      385         1,985        12%
V       58    19      185      193         897          0%
VI      65    21      199      1,201       8,241        20%
VII     96    30      266      109,369     1,148,785    19%
VIII    114   36      305      5,179,105   84,454,945   11%

Table 4: Obtained measures with operational lifetime=10,000h and drive cycle=1h (x^c indicates
complement 1 − x)
        System                       Degradation
Scen.   rel.^c   AFH      MTTF      FFA^c    FWD      MTDF     MDR      FLOD     SILFO^c
I       1.6E-2   1.6E-6   8.6E+4    —        —        —        —        —        —
II      1.0E-2   1.0E-6   3.4E+5    5.2E-2   1.0E-2   2.3E+5   2.9E-1   4.3E-8   1.0E-2
III     1.2E-2   1.2E-6   1.1E+5    5.2E-2   1.1E-2   2.1E+4   7.4E-1   2.7E-7   1.1E-2
IV      1.0E-2   1.0E-6   3.1E+5    1.6E-2   1.0E-2   1.6E+5   2.1E-1   6.5E-9   1.0E-2
V       6.0E-2   6.0E-6   6.9E+4    6.0E-2   6.0E-2   0        0        0        6.0E-2
VI      1.1E-2   1.1E-6   3.4E+5    5.3E-2   1.1E-2   2.3E+5   2.1E-1   4.7E-8   1.1E-2
VII     1.7E-2   1.7E-6   2.8E+5    5.8E-2   1.7E-2   1.7E+5   3.7E-1   7.2E-8   1.7E-2
VIII    1.7E-2   1.7E-6   2.7E+5    9.8E-2   1.6E-2   2.0E+5   4.3E-1   1.4E-7   1.6E-2

Fig. 15 illustrates the obtained measures for a variety of concepts and architec-
tures. In Fig. 15(a) the complement of the reliability, i.e., the failure probability,
over a lifetime of 50,000 hours is given for the scenarios I-IV. Fig. 15(b) de-
picts the AFH. Fig. 15(c) compares the sensitivity of the failure probability
for scenarios III and IV. For the sensitivity analysis we change the ASIL levels
of the hardware FTs, i.e., change the coverage of the safety mechanism. In
both scenarios the solid lines are obtained with the baseline failure rates and
coverages for the hardware components as given in Sect. 6.1.2. The dashed
(dotted) lines are obtained assuming an increased (decreased) coverage according
to an increased (decreased) ASIL level, respectively. The graph in Fig. 15(d)
displays the SILFO for the safety concepts with degraded states.
Results for the approximation are given in Fig. 16. Fig. 16(a) illustrates
the lower and upper bound for the failure probability in scenario VIII w.r.t.
the computation time. Fig. 16(b) depicts the approximation error over the
computation time for the largest scenarios. Moreover, computing an approximate
reliability allowing a 1% relative error on scenario VIII requires only 206,410
states and could be computed within 12 seconds.
Table 5: Timings
                  I       II      III     IV      V       VI      VII      VIII
CTMC generation   0.52s   0.51s   0.08s   0.05s   0.02s   0.10s   12.02s   2043.78s
reliability^c     0.03s   0.03s   0.00s   0.00s   0.00s   0.00s   1.00s    82.84s
AFH               0.03s   0.04s   0.01s   0.00s   0.00s   0.01s   0.93s    157.94s
MTTF              0.01s   0.01s   0.00s   0.00s   0.00s   0.00s   0.18s    26.43s
FFA^c             —       0.02s   0.01s   0.00s   0.00s   0.00s   0.54s    64.41s
FWD               —       0.02s   0.00s   0.00s   0.00s   0.00s   1.46s    61.79s
MTDF              —       0.48s   0.10s   0.02s   0.02s   0.10s   10.78s   826.73s
MDR               —       0.49s   0.09s   0.02s   0.03s   0.09s   11.02s   829.27s
FLOD              —       1.08s   0.20s   0.05s   0.04s   0.21s   28.49s   2945.81s
SILFO^c           —       1.10s   0.20s   0.05s   0.04s   0.21s   29.95s   3007.60s

7. Discussion
7.1. Analysis of results
We evaluate the obtained results w.r.t. the assumed failure rates. The
evaluation is not intended as a recommendation for a specific partitioning
scheme; rather, it illustrates that design space exploration can be performed
efficiently. In the following, we evaluate the results of scenarios
I-IV as an example. The variety of measures computed allows some insight into the effect of
different safety concepts and the role of degradation.
The AFH and MTTF indicate that the system reliability of scenarios II and
IV is superior to that of III and I. The differences between II and IV are
marginal, and III is better than I. These claims can also be deduced from
Fig. 15(a) and Fig. 15(b). The similarity between I and III stems from the fact
that in both cases the system fails if two paths fail, i.e., two out of three in
the TMR of I, or normal and safety path in III. It is interesting to see that the
encoding in ADAS+ for III only marginally improves the reliability, because
the better fault coverage of the encoding is outweighed by the higher load on
the hardware. In scenario II, all three paths—one normal and two redundant
safety paths—have to fail before the complete system fails. In scenario IV, the
fallback path has a reduced hardware load as long as the primary path is still
operational, leading to a smaller failure rate for this path.
However, from the sensitivity analysis in Fig. 15(c) we can deduce that the
lessons are only valid with respect to the assumed failure rates. In particular,
for scenario IV increasing the safety coverage only has a marginal effect on the
failure probability. Thus, the benefit of increasing fault coverage in platforms
depends on the chosen architecture.
Scenarios II and IV differ in their failure behaviour of degraded states as seen
in Fig. 15(d). When limiting the driving time in the degraded state to one hour,
scenario II offers a better reliability than IV whereas in the overall reliability
the difference is marginal. The FLOD is orders of magnitude smaller than FWD
in all scenarios. Thus, the duration of the drive cycle is insignificant for SILFO.
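The relation between the measures can be verified directly on the rounded values of Tab. 4. The following sketch only performs arithmetic on the published numbers. The 5% tolerance accounts for rounding to two significant digits and is an assumption.

```python
LIFETIME_H = 10_000

# scenario: (rel^c, AFH, FWD, FLOD, SILFO^c), copied from Tab. 4
table4 = {
    "II": (1.0e-2, 1.0e-6, 1.0e-2, 4.3e-8, 1.0e-2),
    "III": (1.2e-2, 1.2e-6, 1.1e-2, 2.7e-7, 1.1e-2),
    "IV": (1.0e-2, 1.0e-6, 1.0e-2, 6.5e-9, 1.0e-2),
}

ratios = {}
for scenario, (rel_c, afh, fwd, flod, silfo_c) in table4.items():
    # AFH is the failure probability normalised by the lifetime.
    assert abs(rel_c / LIFETIME_H - afh) <= 0.05 * afh
    # The complement of SILFO is FWD + FLOD.
    assert abs((fwd + flod) - silfo_c) <= 0.05 * silfo_c
    ratios[scenario] = flod / fwd
```

In all three scenarios the ratio FLOD/FWD is below 10^{-4}, so SILFO is dominated by failures without prior degradation.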
The results in Fig. 16 show that we can obtain tight approximation results
for the standard measures within seconds, even for the largest scenarios.
Figure 15: Analysis results. Panels: (a) Probability of failure (failure probability over time in hours, scenarios I-IV); (b) Average failure-prob. per hour (AFH over lifetime in hours, scenarios I-IV); (c) Sensitivity analysis (failure probability over time in hours for scenarios III and IV with increased and decreased coverage); (d) SILFO (failure probability over time in hours, scenarios II-IV)
7.2. General methodology
We discuss the methodology based on generating DFTs for function block
diagrams and hardware assignments.
Direct translation to CTMCs. A direct translation from the system description to
CTMCs is arguably more flexible, and allows overhead to be kept to a minimum.
However, even a naive translation is necessarily complex and error-prone, and
the resulting CTMCs are typically too large to be comprehensible. It is hard to
give the modeller feedback on the meaning of the constructed CTMC. Fault trees,
in contrast, are comparably small, and contain more structure. Additionally,
state-space generation has to be implemented with performance in mind, which
adds further opportunities for errors in a direct translation.
Moreover, an additional benefit of reusing the DFT formalism is due to the
presence of tool-support. The state-space generation only had to be slightly
adapted, which is significantly easier than the construction of an efficient state-
space generation from scratch.
Figure 16: Approximation results. Panels: (a) Failure probability for VIII (lower and upper bound over computation time in seconds); (b) Approximation error (error bound over computation time in seconds for scenarios VII and VIII)
Automation of fault tree generation. Manual creation of the system- and block
layer of the fault tree has some advantages, noteworthy are:
• The semantics of the function block diagram do not need to be formalised.
In particular, function block diagrams contain implicit assumptions, e.g.,
assumptions about different failure behaviour of voters, or channels with
different meanings. Manual creation can adapt for these subtle differences.
• Constructing the FT is an important step in the development-process of
safety-critical systems [1].
Due to the structure, as discussed in Sect. 4, our prototype made several default
suggestions, e.g., for block FTs, that greatly reduce the required manual effort.
The generation of the complete fault tree given the system and block layers is
fully automated. The automation is essential for the proposed methodology, as
this allows a push-button comparison of the various possible variants of hardware
assignments.
Using dynamic fault trees. Using DFTs as the underlying model has several
advantages. Fault trees are a well-known concept in reliability engineering. Their
hierarchical structure allows for a faithful model of the different layers in the
considered scenarios. Using DFTs instead of static fault trees provides more
expressiveness. For example, most of the proposed measures for degraded states
cannot be computed on static fault trees. The proposed fault trees contain several
features only present in DFTs, including the gates PAND (for switches), SPARE
(for cold redundancy), and SEQ (in safety mechanisms). Functional dependencies
are heavily used, in particular to simplify the representation of feedback loops.
The claiming mechanism of SPARE gates is not used. The claiming mechanism
has traditionally led to some strong separation assumptions of the subtrees
under SPAREs, and restricts some possibilities for simplification. DFTs
traditionally lack activation dependencies, but they could be added in a
straightforward way. In particular, they do not seem more complex than the existing gates or
dependencies. Thus, most features of DFTs are present in the generated fault
trees and DFTs with the addition of activation dependencies seem a suitable
formalism to assess the failure behaviour.
Analysis methods. Fig. 15(b) indicates that the average failure-probability per
hour (AFH) varies for different operation life times. This observation justifies
the analysis w.r.t. different measures and time horizons. Tab. 3 indicates
that reduction techniques successfully alleviate the state-space problem. The
generated state space remains small even for hundreds of elements. The size of
the state space depends largely on the scenario. Naturally, latent faults increase
the state space, but then the effectiveness of the approximation increases as
shown in Fig. 16. Using CTMCs as an underlying model allows checking a
wide variety of measures out-of-the-box. Tab. 5 indicates that most of the
measures can be computed within seconds even on the largest models. The more
complex measures such as MTDF and SILFO require a tailored implementation to
avoid performing model-checking queries for each degraded state. However, the
tailored implementation was able to reuse the existing building blocks of the
model checker Storm. The approximation algorithm computes tight results for
the reliability and MTTF within seconds. Thus, the approximation scales well
for large scenarios with millions of states which can be analysed quickly by only
building the most relevant fraction of the state space.
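The dependence of the AFH on the operational lifetime can be reproduced on a toy model. The sketch below assumes that the AFH over a lifetime t is the unreliability at time t divided by t, and it uses a hypothetical system of two redundant components with exponentially distributed failures. The failure rate and the lifetimes are illustrative values, not taken from the case study.

```python
import math


def unreliability_parallel(rate, t, n=2):
    """Probability that all n independent components with the given
    constant failure rate have failed by time t."""
    return (1.0 - math.exp(-rate * t)) ** n


def afh(rate, t, n=2):
    """Average failure probability per hour over the lifetime [0, t].
    Assumed definition: unreliability at time t divided by t."""
    return unreliability_parallel(rate, t, n) / t


rate = 1e-4  # hypothetical failure rate per hour
lifetimes = [1_000, 10_000, 100_000]  # hypothetical lifetimes in hours

afh_values = [afh(rate, t) for t in lifetimes]
for t, value in zip(lifetimes, afh_values):
    print(f"lifetime {t:>7} h: AFH = {value:.3e} per hour")
```

In this toy model the AFH first rises and then falls as the lifetime grows, so a single time horizon would give an incomplete picture.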
7.3. Related work
Earlier work [26] considers an automotive case study where functional blocks
are translated to static fault trees without treating the partitioning on hardware
architectures. [27] has a similar setting but focuses more on causal explanations
and less on analysis performance of large-scale models. The evaluation of various
options from the design space by translating them to fault trees and applying
fault tree analysis has also been considered for air traffic control [28]. The
effect of different topologies of a FlexRay bus has been assessed using FTA in
[29], which also identified the need for modelling dynamic aspects. The analysis of
architecture decisions under safety aspects has been considered in e.g. [30] using
a dedicated description language and an analytical evaluation. Safety analysis
for component-based systems has been considered in [31], using state-event fault
trees. Qualitative FTA has been used in [32] for ISO 26262 compliant evaluation
of hardware. Different hardware partitionings are constructed and analysed using
an Architecture Description Language (ADL) in [33]. ADL-based dependability
analysis has been investigated for several languages, e.g., AADL [34], UML [35],
Arcade [36], and HiP-HOPS [37]. These approaches typically have a steeper
learning curve than the use of DFTs. The powerful Möbius analysis tool [38]
has recently been extended with dynamic reliability block diagrams [39]. Model checking
for safety analysis has been proposed by, e.g., [40], which focuses on AltaRica
and does not cover probabilistic aspects.
DFTs are a subclass of the more expressive state/event fault trees (SEFTs)
[41], but efficient analysis techniques for SEFTs are lacking. Various variants and
analysis techniques for DFTs exist [16]. A precise comparison of the semantics
of state-based approaches for DFT analysis is given in [18]. Model checking for
DFTs was first proposed by [22]. The performance of that approach suffers from
an intrinsic overhead. Static/Dynamic fault trees [42] are a subclass of DFTs
and allow for efficient analysis, but cannot express ordered
failures and warm redundancy. To support a richer class of failure distributions
and improve scalability, rare event simulation for Fault-Maintenance Trees [43]
has been considered [44].
7.4. Future Work
Future work can be partitioned into two directions:
• improved modelling, either by improved expressive power, or by conciseness.
• improved analysis, either by the support for a broader range of properties,
or by improved performance for existing properties.
Regarding modelling, we would like to investigate the modelling of involved
error propagation schemes, possibly containing deterministic timing information.
A possible direction would be to consider a combination with timed failure
propagation graphs [45]. Similar combinations have been considered within the
COMPASS project [40]. It would be interesting to consider the use of Boolean
logic driven Markov processes [46] or SEFTs [41].
For an improved analysis, a more rigorous treatment of transient faults and
the combination with degraded states seems promising, for which we may draw
inspiration from [43]. Moreover, for an improved failure rate sensitivity analysis,
we would like to investigate the use of parametric Markov models [47, 48].
8. Conclusion
We presented a model-based approach to the safety analysis of vehicle guidance
systems using dynamic fault trees. The approach (see Fig. 1) takes the
system functions and their mapping onto the hardware architecture into account.
Its main benefit is flexibility: new partitionings and architectural changes
can easily and automatically be accommodated. The use of DFTs instead of
static FTs allows for a more faithful model, e.g., incorporating warm and cold
redundancies, and order-dependent failures. The obtained DFTs were analysed
with probabilistic model checking. Due to tailored state-space generation [7]
and reduction techniques, the analysis of these DFTs—with up to 100 basic
events—is a matter of minutes.
References
[1] ISO, ISO 26262: Road Vehicles - Functional Safety (2011).
[2] J. B. Dugan, S. J. Bavuso, M. Boyd, Fault trees and sequence dependencies,
in: Proc. of RAMS, 1990, pp. 286–293.
[3] J.-P. Katoen, The probabilistic model checking landscape, in: Proc. of LICS,
ACM, 2016, pp. 31–45.
[4] M. Z. Kwiatkowska, Model checking for probability and time: from theory
to practice, in: LICS, IEEE Computer Society, 2003, p. 351.
[5] C. Baier, Probabilistic model checking, in: Dependable Software Systems
Engineering, Vol. 45 of NATO Science for Peace and Security Series - D:
Information and Communication Security, IOS Press, 2016, pp. 1–23.
[6] E. Ruijters, M. Stoelinga, Fault tree analysis: A survey of the state-of-the-
art in modeling, analysis and tools, Computer Science Review 15-16 (2015)
29–62.
[7] M. Volk, S. Junges, J.-P. Katoen, Fast dynamic fault tree analysis by
model checking techniques, IEEE Trans. Industrial Informatics 14 (1) (2018)
370–379.
[8] C. Dehnert, S. Junges, J.-P. Katoen, M. Volk, A storm is coming: A modern
probabilistic model checker, in: CAV (2), Vol. 10427 of LNCS, Springer,
2017, pp. 592–600.
[9] A. Legay, B. Delahaye, S. Bensalem, Statistical model checking: An overview,
in: RV, Vol. 6418 of LNCS, Springer, 2010, pp. 122–135.
[10] G. Agha, K. Palmskog, A survey of statistical model checking, ACM Trans.
Model. Comput. Simul. 28 (1) (2018) 6:1–6:39.
[11] S. Junges, D. Guck, J.-P. Katoen, A. Rensink, M. Stoelinga, Fault trees on
a diet: automated reduction by graph rewriting, Formal Asp. of Comput.
(2017) 1–53.
[12] M. Ghadhab, S. Junges, J.-P. Katoen, M. Kuntz, M. Volk, Model-based
safety analysis for vehicle guidance systems, in: SAFECOMP, Vol. 10488 of
LNCS, Springer, 2017, pp. 3–19.
[13] A. Armoush, F. Salewski, S. Kowalewski, Design pattern representation for
safety-critical embedded systems, JSEA 2 (1) (2009) 1–12.
[14] M. Ghadhab, J. Kaienburg, M. Süßkraut, C. Fetzer, Is software coded
processing an answer to the execution integrity challenge of current and
future automotive software-intensive applications?, in: Proc. of AMAA,
Springer, 2016, pp. 263–275.
[15] M. Stamatelatos, W. Vesely, J. B. Dugan, J. Fragola, J. Minarick, J. Rails-
back, Fault Tree Handbook with Aerospace Applications, NASA Headquar-
ters, 2002.
[16] S. Junges, D. Guck, J.-P. Katoen, M. Stoelinga, Uncovering dynamic fault
trees, in: Proc. of DSN, IEEE, 2016, pp. 299–310.
[17] J. B. Dugan, S. J. Bavuso, M. A. Boyd, Fault trees and sequence dependen-
cies, in: 1990 Annual Reliability and Maintainability, 1990, pp. 286–293.
[18] S. Junges, J.-P. Katoen, M. Stoelinga, M. Volk, One net fits all - A unifying
semantics of dynamic fault trees using GSPNs, in: Proc. of Petri Nets, Vol.
10877 of LNCS, Springer, 2018, pp. 272–293.
[19] C. Baier, B. R. Haverkort, H. Hermanns, J.-P. Katoen, Performance evalua-
tion and model checking join forces, Commun. ACM 53 (9) (2010) 76–85.
[20] ISO, ISO 17458: Road Vehicles - FlexRay communications system (2013).
[21] ISO, ISO 11898: Road Vehicles - Controller area network (CAN) (2015).
[22] H. Boudali, P. Crouzen, M. Stoelinga, A rigorous, compositional, and exten-
sible framework for dynamic fault tree analysis, IEEE Trans on Dependable
Secure Comput 7 (2) (2010) 128–143.
[23] C. Eisentraut, H. Hermanns, L. Zhang, On probabilistic automata in contin-
uous time, in: Proc. of LICS, IEEE Computer Society, 2010, pp. 342–351.
[24] C. Baier, E. M. Hahn, B. R. Haverkort, H. Hermanns, J.-P. Katoen, Model
checking for performability, Mathematical Structures in Computer Science
23 (4) (2013) 751–795.
[25] J.-P. Katoen, M. Z. Kwiatkowska, G. Norman, D. Parker, Faster and
symbolic CTMC model checking, in: PAPM-PROBMIV, Vol. 2165 of LNCS,
Springer, 2001, pp. 23–38.
[26] M. L. McKelvin, A. Sangiovanni-Vincentelli, Fault tree analysis for the
design exploration of fault tolerant automotive architectures, in: SAE
Technical Paper, SAE International, 2009, pp. 1–8.
[27] M. Kölbl, S. Leue, Automated functional safety analysis of automated
driving systems, in: FMICS, Vol. 11119 of LNCS, Springer, 2018, pp. 35–51.
[28] M. Gario, A. Cimatti, C. Mattarei, S. Tonetta, K. Y. Rozier, Model checking
at scale: Automated air traffic control design space exploration, in: CAV
(2), Vol. 9780 of LNCS, Springer, 2016, pp. 3–22.
[29] K. L. Leu, J. E. Chen, C. L. Wey, Y. Y. Chen, Generic reliability analysis
for safety-critical flexray drive-by-wire systems, in: Proc. of ICCVE, 2012,
pp. 216–221.
[30] V. Rupanov, C. Buckl, L. Fiege, M. Armbruster, A. Knoll, G. Spiegelberg,
Employing early model-based safety evaluation to iteratively derive E/E
architecture design, Sci. Comput. Program. 90 (2014) 161–179.
[31] L. Grunske, B. Kaiser, Y. Papadopoulos, Model-driven safety evaluation
with state-event-based component failure annotations, in: Proc. of CBSE,
Vol. 3489 of LNCS, Springer, 2005, pp. 33–48.
[32] N. Adler, S. Otten, M. Mohrhard, K. D. Müller-Glaser, Rapid safety
evaluation of hardware architectural designs compliant with ISO 26262, in: Proc.
of RSP, IEEE, 2013, pp. 66–72.
[33] M. Walker, M.-O. Reiser, S. T. Piergiovanni, Y. Papadopoulos, H. Lönn,
C. Mraidha, D. Parker, D.-J. Chen, D. Servat, Automatic optimisation of
system architectures using EAST-ADL, J. of Syst. Software 86 (10) (2013)
2467–2487.
[34] M. Bozzano, A. Cimatti, J.-P. Katoen, V. Y. Nguyen, T. Noll, M. Roveri,
Safety, dependability and performance analysis of extended AADL models,
Comput. J. 54 (5) (2011) 754–775.
[35] F. Leitner-Fischer, S. Leue, QuantUM: Quantitative safety analysis of UML
models, in: Proc. of QAPL, Vol. 57 of EPTCS, 2011, pp. 16–30.
[36] H. Boudali, P. Crouzen, B. R. Haverkort, M. Kuntz, M. Stoelinga, Architec-
tural dependability evaluation with Arcade, in: Proc. of DSN, IEEE, 2008,
pp. 512–521.
[37] D.-J. Chen, R. Johansson, H. Lönn, Y. Papadopoulos, A. Sandberg,
F. Törner, M. Törngren, Modelling support for design of safety-critical
automotive embedded systems, in: Proc. of SAFECOMP, Vol. 5219 of
LNCS, Springer, 2008, pp. 72–85.
[38] T. Courtney, S. Gaonkar, K. Keefe, E. Rozier, W. H. Sanders, Möbius 2.3:
An extensible tool for dependability, security, and performance evaluation
of large and complex system models, in: Proc. of DSN, IEEE, 2009, pp.
353–358.
[39] K. Keefe, W. H. Sanders, Reliability analysis with dynamic reliability block
diagrams in the Möbius modeling tool, ICST Trans. Security Safety 3 (10).
[40] M. Bozzano, A. Cimatti, O. Lisagor, C. Mattarei, S. Mover, M. Roveri,
S. Tonetta, Safety assessment of AltaRica models via symbolic model
checking, Sci. Comput. Program. 98 (2015) 464–483.
[41] B. Kaiser, C. Gramlich, M. Förster, State/event fault trees - A safety
analysis model for software-controlled systems, Rel. Eng. & Sys. Safety
92 (11) (2007) 1521–1537.
[42] O. Bäckström, Y. Butkova, H. Hermanns, J. Krcál, P. Krcál, Effective
static and dynamic fault tree analysis, in: SAFECOMP, Vol. 9922 of LNCS,
Springer, 2016, pp. 266–280.
[43] E. Ruijters, D. Guck, P. Drolenga, M. Peters, M. Stoelinga, Maintenance
analysis and optimization via statistical model checking - evaluating a train
pneumatic compressor, in: QEST, Vol. 9826 of LNCS, Springer, 2016, pp.
331–347.
[44] E. Ruijters, D. Reijsbergen, P.-T. de Boer, M. Stoelinga, Rare event simula-
tion for dynamic fault trees, in: SAFECOMP, Vol. 10488 of LNCS, Springer,
2017, pp. 20–35.
[45] B. Bittner, M. Bozzano, A. Cimatti, Automated synthesis of timed failure
propagation graphs, in: IJCAI, IJCAI/AAAI Press, 2016, pp. 972–978.
[46] M. Bouissou, J.-L. Bon, A new formalism that combines advantages of
fault-trees and Markov models: Boolean logic driven Markov processes, Rel.
Eng. & Sys. Safety 82 (2) (2003) 149–163.
[47] M. Ceska, P. Pilar, N. Paoletti, L. Brim, M. Z. Kwiatkowska, PRISM-PSY:
precise GPU-accelerated parameter synthesis for stochastic systems, in:
TACAS, Vol. 9636 of LNCS, Springer, 2016, pp. 367–384.
[48] T. Quatmann, C. Dehnert, N. Jansen, S. Junges, J.-P. Katoen, Parameter
synthesis for Markov models: Faster than ever, in: ATVA, Vol. 9938 of
LNCS, 2016, pp. 50–67.
