Abstract. This report describes a discrete-time model of the startup phase of a FlexRay network. The startup behaviour of this network is analysed in the presence of several faults. It is shown that in certain cases a faulty node can prevent the network from communicating altogether. One previously unknown scenario is uncovered.
Introduction
In the year 2000, a consortium was established with the goal to design a new, time-triggered communication protocol for use in the automotive industry that would outperform CAN and TTP in both speed and reliability. At the end of 2009, the consortium was disbanded, leaving a final version of a time-triggered protocol called FlexRay. The protocol definition is currently being transformed into an ISO standard.
Already in 2006, the first commercially available cars were equipped with FlexRay networks, whose higher bandwidth enables new algorithms for vehicle control.
Since FlexRay will be the basis for communication in many vehicles to come, we would like to establish that the protocol itself is inherently correct, i.e., that implementing a system according to the latest specification leads to a system that behaves predictably. We restrict ourselves to analysing the startup behaviour of FlexRay networks.
As the specification has been declared final, there is no way to avoid any unwanted behaviour that might occur (that is, any behaviour that an end user might find inconvenient). For any such behaviour we would therefore also like to know whether an external party such as a so-called central bus guardian can be used to improve such behaviour without having to alter the protocol.
In this report, we start by giving a brief overview of the FlexRay protocol. We then discuss some of the research done towards verifying parts of the FlexRay protocol, and we explain how our research compares. After this, we give a more detailed description of the startup phase of a FlexRay network, which is the part of the protocol that we wish to analyse. We show how the startup phase can be modelled using the mCRL2 language. We then analyse the startup behaviour in several scenarios with noisy channels and faulty nodes. To conclude, we discuss the results and suggest how the startup behaviour of FlexRay networks might be improved by a central bus guardian.
[...] use a series of minislots (possibly 0) to send its on-demand data, followed by a single minislot that indicates the end of the transmission. The next minislot is then assigned to the highest priority node that did not yet send anything. As the length of the dynamic segment is fixed, it may be the case that no more bandwidth is available for lower priority nodes if higher priority nodes claim many minislots.
The FlexRay protocol states that nodes may be connected to at most two channels. A channel is simply a communication medium such as a copper wire or optical fibre (depending on hardware support). The two channels may be used to create redundancy, or to connect to two different sets of nodes.
Related work
The FlexRay protocol has been studied quite extensively, from different perspectives. In this section we give a brief overview of previous studies known to us, and describe the aspects that these studies cover.
The German Verisoft project is a project in the automotive realm that has the verification of the FlexRay protocol as a sub-goal. Within this project, Kühnel et al. aim to provide a framework in which distributed applications can be verified if they use a combination of an OSEKtime compliant (real-time) operating system, FTCom (a fault-tolerant communication layer for OSEKtime operating systems) and FlexRay communication [17]. They use the FOCUS [5] language as a basis for their toolkit. A model is created on a specification that is "based on" the FlexRay protocol specification version 2.0. They apply the following abstractions to simplify their model [16]:
1. All clocks are synchronised, i.e. clock synchronisation is not modelled.
2. Start-up behaviour is not modelled (because the clocks are synchronised).
3. The coding/decoding process is not modelled.
4. Bus guardians are not modelled.
5. Only the static segment of a communication cycle is modelled, not the dynamic segment.
6. Slots are assumed to have a size of one tick in FOCUS.
7. Lossiness of the channel is not modelled.
8. Single-channel nodes are modelled.
The above lists the aspects of FlexRay that are not modelled, but we could not find a textual explanation of the part of FlexRay that is incorporated in the model. Instead, a specification in FOCUS is given directly, and it is shown, using a theorem prover, that the used interface specification of the FlexRay component is indeed refined by the presented model [22].
The clock synchronisation mentioned in abstraction 1 is a modification of a clock synchronisation protocol described by Lundelius and Lynch [18]. Barsotti et al. have verified (amongst other protocols) the latter [1, 11] using a combination of a theorem prover and an SMT solver. Zhang notes, however, that the correctness of the FlexRay clock synchronisation protocol does not trivially follow from the correctness of Lundelius and Lynch's algorithm [25].
The start-up behaviour (not modelled because of abstraction 2) is addressed by Malinský in [19, 20]. He uses Uppaal to create a timed-automata representation of a system consisting of two coldstart nodes and one non-coldstart node. Using a few different settings for a number of FlexRay parameters, this system is checked for deadlock, and it is checked that the system starts up normally. It should be noted that this setup is not a valid FlexRay setup, because in a network with three nodes all nodes must be coldstart nodes, according to [8]. However, the requirements document [9] does state that startup must succeed when, due to a fault, only two coldstart nodes are active.
Local bus guardians [7] are considered by Zhang, who proves three functional properties under the assumption of synchronised clocks [26]. This partially addresses abstraction 4, although central bus guardians [6] are not taken into account. Zhang does however mention research done towards systems employing a central bus guardian in another industry standard, Time-Triggered Architecture.
Pop et al. have looked into properties related to the dynamic segment in a FlexRay communication cycle as meant in abstraction 5. Their analysis is however of a rather different kind, as they provide a schedulability analysis based on a correctly working protocol [21]. Such results can be extremely useful in proving correctness of a distributed application, as it may guarantee that, under given circumstances, data will travel through the system. Botaschanjan et al. present a methodology that uses the framework developed in the Verisoft project [4]. The method aims at formally modelling tasks (a concept taken from the automotive industry) in the FOCUS tool. The application logic can then be verified, while automatic, verified code generation should ensure that these properties still hold in the deployed system. Later this approach is refined and formalised [3].
Steiner uses the SAL model checker to find failures in the startup protocol [23, 24] . He identifies a scenario in which the system does not start up due to a single fail-silent node.
In this report, we try to establish some correctness results on the discrete behaviour during the startup phase of the FlexRay protocol. This aim resembles that of Malinský, but we do not focus on modelling timing properties. The approach of Steiner et al. is comparable, but their model is hand-crafted and therefore does not have an obvious one-to-one correspondence with the specification. The time model chosen in their model is also coarser than the one we present. Our aim is to create a model that can be easily seen to correspond to the specification document, and of which we can describe fairly accurately the characteristics that do not match reality. 
The FlexRay startup phase
We study the behaviour of a FlexRay network in the startup phase. The startup phase starts when a node is awake (i.e., it is powered on and ready to send and receive), and ends when the node has successfully integrated into the network or when it has made a number of unsuccessful integration attempts.
Our aim is to create a model that allows us to analyse the state behaviour of the FlexRay protocol during startup. We focus on creating a model that has a direct link with the SDL models in the FlexRay protocol specification.
We wish to assess the correctness of the FlexRay startup protocol itself. The protocol description is not very clear about what correct operation entails, so we discuss this issue separately in Section 4.2.
The model that we provide later in this document is not complete; like the models mentioned in Section 3, we must simplify the problem to make it tractable. The applied simplifications are discussed in Section 4.3. Finally, we discuss the model itself in Section 4.4.
Operation
The startup phase in a FlexRay network is initiated by so-called coldstart nodes. These nodes have the ability to start communication on a network. Regular nodes must wait for communication to start on the network, after which they may join in. Coldstart nodes send special 'startup frames' on fixed locations in the schedule. These startup frames are the only frames that are sent during startup of a network, and they are sent every cycle.
All nodes start listening to the bus as soon as they are awake. If a listening node decodes a startup frame header or a collision avoidance symbol (CAS) from the bus, then it will assume that communication has already started and it will try to integrate into the schedule.
Integrating into ongoing communication is done by first waiting for the same startup frame header in two adjacent communication cycles, so the node's clock can be adjusted to account for any relative clock drift that the sender might have. Once it has adjusted its clock speed, it will listen for a few more cycles to see whether communication is indeed still going as expected, and eventually it will join in.
If a node does not hear anything on the bus for two communication cycles, it sends a collision avoidance symbol, after which it starts sending startup frames according to its schedule. The CAS is always sent at a specific point in a communication cycle, so, effectively, every node waits for a different amount of time after sending a CAS before sending a startup frame. It is this timing difference that implements a leader election protocol: should more than one node (almost) simultaneously send a CAS, then the first of them to send a startup frame will cause the others to stop sending. These nodes will then once more listen to the bus, and integrate to the remaining node.
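The leader election described above can be illustrated with a small Python sketch. It is not part of the FlexRay specification or of our mCRL2 model: the function name `elect_leader` is hypothetical, and the sketch assumes that all contenders send their CAS in the same cycle and that the node with the lowest identifier owns the earliest slot.

```python
def elect_leader(contenders):
    """Return (leader, backed_off) for nodes that sent a CAS simultaneously.

    Assumption (illustrative only): the node with the lowest identifier is
    the first one to send a startup frame after the CAS.
    """
    if not contenders:
        raise ValueError("at least one contending node is required")
    # The first startup frame on the bus comes from the earliest slot.
    leader = min(contenders)
    # Every other contender hears that frame, stops sending, and goes
    # back to listening so that it can integrate on the remaining node.
    backed_off = sorted(n for n in contenders if n != leader)
    return leader, backed_off
```

For example, `elect_leader({3, 1, 2})` yields node 1 as the leader, with nodes 2 and 3 returning to the listening role.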
After a node has successfully started communicating on the bus, it observes the traffic for another two cycles and restarts the whole procedure if it does not see the other parties anymore. The network traffic of a successful startup is depicted schematically in Figure 1 .
In our report, we assume that all nodes are coldstart nodes, and hence we will sometimes talk about 'nodes', rather than 'coldstart nodes'. Furthermore, we assume that all frames are in fact startup frames, so again we do not distinguish between the two.
Correctness
Although the FlexRay specification documents do not say what correct operation means, we do find the following in the requirements document [9]:
To say that "a cluster is able to start up in the presence of a fault" has a meaning that depends on the type of fault.
- Fault class 1: The fault is associated to a channel or a star: All nodes which are intended to participate in communication and are connected to the other channel reach a state where they communicate to one another as scheduled. They reach this state within a defined maximum time.
- Fault class 2: The fault is associated to a node: All fault-free nodes which are intended to participate in communication reach a state where they communicate to one another as scheduled. They reach this state within a defined maximum time.
- Fault class 3: Transient fault: All nodes which are intended to participate in communication reach a state where they communicate to one another as scheduled. They reach this state within a defined maximum time (For a value see the requirements specification).
The requirement then gives some faults that FlexRay networks should be robust against. In some cases, the use of both FlexRay channels is required to ensure correct operation. The scope of this document only covers single-channel networks, and therefore all faults in class 1 are considered fatal (if the only used channel is not working, then no communication can take place).
The class 2 and 3 faults within the scope of our investigations are the following (we again quote, and numbering corresponds to the numbering in [9]):
13. For a given time of less than one frame length, all present channels are forced to an arbitrary pattern.
14. A bus driver in a node cannot receive anything.
15. A bus driver in a node cannot transmit anything.
17. A coldstart node sends sporadically CASs. After occurrence, the fault does not manifest itself for at least 10 communication cycles.
18. No node is operational except of 2 fault free nodes and these two nodes are assigned to perform startup ("coldstart nodes"). (Req ID 326)
Our aim is to verify that in these cases, the correctly functioning nodes in a FlexRay network will indeed start up as usual. We saw before that, according to the requirement, startup has succeeded when all non-faulty nodes communicate to one another as scheduled. This will never be the case in, for instance, fault scenario 17, as the communication will be periodically interrupted by the CAS.
We reinterpret the definition of correct startup to mean that on all nodes the startup protocol has terminated successfully, and during one cycle in which no startup protocol is active anymore, every frame that is sent by a non-faulty node is received by all other non-faulty nodes, unless a faulty node or transient fault prevents reception.
Simplifications
Although our choice of modelling language would permit us to exactly describe every detail of the FlexRay protocol, we must abstract away some of the complexity in order to be able to automatically analyse our model. The aim is to have a model of which the behaviours are a subset of the behaviours that a real system might display. Note that this means that proving the model correct is not enough to conclude that the actual system will behave correctly. However, any unwanted behaviour that is detected in the model will indeed be observable in the actual system. Important but straightforward restrictions are that in our model, only a single channel is used, each node is assigned a single static slot, and all frames are startup frames. The dynamic segment remains unused.
Two concepts that require more complicated measures are the notions of time and data. We discuss these in more detail, and try to give an intuition about the implications of the simplifications for the accuracy of our model.
Time. In [20], Malinský and Novák present a model of the startup phase of a FlexRay network. Their approach is to make no assumption about the correct working of the clock synchronisation protocol, and they subsequently show that several parameters of their model can be adjusted in such a way that the network model does not properly start up.
Our approach is fundamentally different in a number of ways. We assume a discrete-time global clock, i.e., we assume that for all nodes in the network the clock drift is zero and that the nodes' timed behaviour can be modelled using time slots. The advantage of this approach is that the clock synchronisation process need not be modelled, which is a great advantage indeed as it is a very data-intensive process. The downside is that the model becomes less realistic, as many scenarios are not permitted by our model (namely the ones in which the clocks of the nodes are not synchronised). However, because we only discard behaviour by making this assumption, any faulty behaviour in the model should still correspond to faulty behaviour in the real system.
The resolution for time slots is chosen to be one bit length (gdBit in [8]). When clock synchronisation works correctly, the external behaviour of a node can be modelled using this time base, as all lengths of symbols sent over the bus are defined (either directly or indirectly) as multiples of one bit length.
Another issue concerning time is caused by the semantics of SDL. Signals are sent to processes, which store them in queues. Each such queue is partially visible, the invisible part modelling signals that have not reached their destination yet, and the visible part modelling those signals that are ready to be processed by the receiver.
We assume that signals are delivered immediately, thus avoiding the need for the invisible part of the queue. The need for the visible part of the queue stems from the fact that a process can be busy performing a calculation when events arrive. We assume that calculations are also done instantly. Since the queues in the SDL semantics are FIFO, we can now simply process events when they come in.
This abstraction comes at the cost of modelling only part of the possible behaviour of the real system. By synchronising the bus communication per bit, we cannot detect any faults in the clock synchronisation protocol. This however is outside the scope of our investigations.
It is more difficult to assess what part of the state behaviour we might have excluded by this synchronisation. Obviously, we cannot model a system in which one node starts sending a symbol less than a bit time after another node; a scenario that might occur when more than one node is sending a CAS symbol. It is unclear to what extent this may influence the behaviour of the system.
Data. During the startup phase, the only data communicated over the bus are collision avoidance symbols and startup frames. As mentioned before, our time model assumes a global clock with a 1-bit resolution, so these symbols are encoded one bit at a time.
We do not wish to model the sending of actual bit patterns, as that would lead to an enormous amount of allowed combinations. Instead, we model the relevant symbols by using an encoding that models some relevant properties.
The symbols that are relevant during the startup procedure are the following.
FRAME_HEADER(id): A (startup) frame header sent by node id. According to the specification, the frame header would carry information such as the frame identifier, the length of the payload that follows, a CRC, the cycle counter and some indicator bits.
FRAME_BODY(id): The body of a (startup) frame sent by node id. The body in reality carries the payload of a frame and a CRC checksum.
Both the frame header and the frame body contain a CRC checksum. This check is in place to detect corrupted frames. We make the assumption that this check is flawless, and model this by sending the id of the sender along with every bit in a frame. This enables the receiver to decide whether a string of received bits forms a valid frame.
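The idea of tagging every bit with its sender can be illustrated as follows. This is a minimal Python sketch, not the mCRL2 encoding itself: the `Bit` type, the `decode_header` function and the `header_length` parameter are hypothetical names introduced for illustration only.

```python
from typing import NamedTuple, Optional, Sequence


class Bit(NamedTuple):
    kind: str               # e.g. "FRAME_HEADER", "FRAME_BODY", "NOISE"
    sender: Optional[int]   # id of the sending node; None if there is none


def decode_header(bits: Sequence[Bit], header_length: int) -> Optional[int]:
    """Return the sender id if `bits` forms a valid frame header, else None.

    Assumption: a header is valid exactly when it consists of
    `header_length` header bits that all carry the same sender id. This
    mimics a flawless CRC check: any mixing of sources is detected.
    """
    if header_length <= 0 or len(bits) != header_length:
        return None
    first = bits[0]
    if first.kind != "FRAME_HEADER" or first.sender is None:
        return None
    return first.sender if all(b == first for b in bits) else None
```

A run of header bits from node 1 is accepted, whereas a run in which a single bit stems from node 2 is rejected, regardless of what the corresponding bit patterns on a real bus would have been.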
[Figure 2: symbol encoding. Recoverable labels: length(FRAME_HEADER(id)); None None ... None; CHIRP length.]
The following two symbols are much simpler, but pose another technical difficulty: these symbols are defined as a period during which the bus is in a certain state. The lengths of these periods are given in microseconds, so the symbols cannot in general be seen as bit patterns. However, the periods are both multiples of gdBit, which is the length of one bit. In our model, we will therefore model them as bit patterns, because our time model allows it.
CAS: A collision avoidance symbol. The length of this symbol is defined by the protocol constants cdCASRxLowMin and cdCAS, which are set to 29 gdBit and 30 gdBit, respectively.
CHIRP: The channel idle recognition point (CHIRP) is defined as an offset relative to the last time activity was seen on the channel. The length of this symbol is defined by the protocol constant cChannelIdleDelimiter, which is defined to be 11 gdBit.
The mCRL2 encoding of these symbols is schematically depicted in Figure 2. It is clear that this encoding does not allow us to model certain corner cases; every scenario in which frame data from two different sources is accidentally interpreted as a valid frame cannot be modelled, for instance. This is because in our model the recipient of a bit can identify the sender of that bit, which corresponds to the assumption that the CRC check can detect any form of data corruption.
Model
We use the mCRL2 modelling language [12, 14] to model a small FlexRay network. In the following, we assume a basic knowledge of mCRL2, and explain how our model relates to the FlexRay protocol specification.
Structure
We model a network of three coldstart nodes during the startup phase. The structure of a single node, as presented in the FlexRay specification, is shown in Figure 3. We will only model the Process Operation Control component in detail, and the simplifications described in the previous section allow us to not model the components shown in grey altogether. For the remaining components, a very abstract model is created.
Figure 4 shows a high level model of our system. Three nodes and one bus are each modelled as a separate process. The processes run in parallel, and communicate by synchronising actions. Each process has a notion of a clock tick, which is modelled using an action. We require that if a clock tick occurs in one process, it must simultaneously occur in all other processes. This type of synchronisation is called barrier synchronisation; the dotted line represents the barrier.
The Put and Get actions occur between clock ticks, and define what a process writes and reads in the current time slot. The system will in each time slot first perform three Put actions, one for each node, followed by three Get actions. In a fault-free scenario, the data that is read by each node will be the same; it will be the combination of the data that was written by the three nodes. A more detailed view of the Node and Bus processes is given in Figure 5 . Their interface consists of the actions involved in the barrier (again indicated by the dotted line), and the actions named put, get, put' and get'. The first two represent a node respectively writing to or reading from the bus. The last two represent the bus receiving data from a node and providing data to a node. The put and put' together form the Put action from Figure 4 when they occur simultaneously. Moreover, we do not allow put or put' to occur on their own; a node can only read data from the bus if the bus is providing that data, and likewise, a node can only write data to the bus if the bus is receiving it. This type of synchronisation is a widely used technique to model communication between processes running in parallel.
The Node process consists of three subprocesses that run in parallel: the CODEC, MAC and POC. These components correspond closely with the SDL processes described in [8]. Figure 5 also shows the flow of data between the components. The names on the arrows are the actions involved in the communication between components (which is again modelled in the same way as the communication between the nodes and the bus). For a single node, the POC (process operation control) process is the process that drives the startup protocol. It is defined in terms of SDL macros in [8] Chapter 7.2. Figure 7 shows the parts of the POC process that are modelled.
Bus model
We wish to verify properties of a network with deaf nodes (nodes that cannot read anything from the bus) and mute nodes (nodes that cannot write anything to the bus). The easiest way to deal with this is to see such faulty behaviour as a trait of the physical bus. The physical bus is modelled as a process that reads, per time slot, a signal from all connected nodes that are not mute and delivers the combination of all those signals back to all connected nodes that are not deaf. If a node is not sending, it is modelled as a node that is sending silence. The combination of a signal and silence is that signal. The combination of two signals is defined to be noise.
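The behaviour of the bus within a single time slot can be sketched in a few lines of Python. This is an illustration of the description above rather than a transcription of the mCRL2 process; the names `combine` and `bus_slot` and the string constants are hypothetical.

```python
SILENCE = "silence"
NOISE = "noise"


def combine(a, b):
    """Combine two values written to the bus in the same time slot."""
    if a == SILENCE:
        return b        # silence combined with a signal is that signal
    if b == SILENCE:
        return a
    return NOISE        # two signals (or a signal and noise) give noise


def bus_slot(writes, mute=frozenset(), deaf=frozenset()):
    """Process one time slot.

    writes: dict mapping node id -> value the node tries to write.
    mute:   nodes whose writes never reach the bus.
    deaf:   nodes to which nothing is delivered.
    Returns a dict mapping every non-deaf node to the value it reads.
    """
    on_bus = SILENCE
    for node, signal in writes.items():
        if node not in mute:
            on_bus = combine(on_bus, signal)
    return {node: on_bus for node in writes if node not in deaf}
```

With `writes = {1: "CAS", 2: "silence", 3: "silence"}` every node reads `"CAS"`; if nodes 1 and 2 both write a signal, every node reads noise, unless one of the two writers is mute.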
When a bit has been sent and received in this manner, the bus action marks the progress of time and the process repeats itself (it is part of the clock tick barrier).
In the model in appendix C, two variants of the bus are included. One of them reads in an arbitrary order, and one reads in a fixed order. The former was used for verification purposes (because it is the more general model), but the model as listed above was used to generate the traces in section 5.
[...] quicker. The reason the model that performs reads and writes in an arbitrary order is more time-consuming to generate is that it generates more states, and uses set operations where the ordered model needs only enumerated and integral types.
Fig. 7. The POC startup phase, taken from [8]. Grey parts are not included in our model.
POC model
Because the FlexRay protocol specification is already quite formal (in the sense that a systematic notation is used to describe processes), we set out to create our model in a way that remains as close to the specification as possible. Because a large part of the specification is abstracted away, we cannot use an automated conversion to an mCRL2 model, but where our model attains the same degree of detail as the original specification, we attempt to make our translation as systematic as possible. We have taken the diagrams from Chapter 7.2 in [8] as the basis for our model. The semantics of these diagrams is given in [15] .
To create a model that would not inevitably blow up to unreasonable proportions, we use a slightly simplified version of these semantics. We assume that the calculations in the diagrams do not take time; moving from one state to the next can then be seen as an instantaneous change. Secondly, we assume that signals do not take time to be delivered. In [15] , event queues are used to capture the semantics of these phenomena. In a finite-state model, this would lead to a very large state space, as the state of the queues must be taken into account. Our simplifications eliminate the need for queues. The downside is that not all possible sequences of events that may occur in reality are modelled; for instance, a signal from another process never arrives when the receiving process is busy.
We note here that the FlexRay specification does not claim to use the precise SDL semantics (in fact, it claims that it may not do so). Given that the specification is at the detail level of a reference implementation, this is not surprising, as the use of event queues would be infeasible in a hardware implementation. The specification does however not give any guidelines on how to interpret the diagrams, which is why we have taken the official SDL semantics as a starting point.
Using the simplified semantics, the SDL diagrams in Chapter 7.2 of [8] were translated into mCRL2 code (using a slightly altered syntax, see appendix A).
As an example, we show the part of the mCRL2 code that models the 'coldstart collision resolution' diagram in the specification ( [...]
- SyncCalcResult. This event is emitted by the clock synchronisation process, just before a new communication cycle starts. We assume a global clock with a resolution of one bit, of which the ticks are modelled by the bit actions. Rather than waiting for four SyncCalcResult events, we instead look at the global clock to decide when four cycles have passed.
- header received on A/B. This event is emitted by CODEC right after a frame header has been received. It is modelled using the decode action with a FRAME_HEADER parameter.
- symbol decoded on A/B. This event is emitted by CODEC right after a collision avoidance symbol or media test symbol is decoded. It is again modelled using the decode action, this time with a CAS parameter.
In the mCRL2 model, we need to absorb decode(FRAME(id')) events (not doing so would prevent the source from sending the event, thus potentially leading to deadlock states). In the semantics of SDL [15] , this corresponds to discarding an event from the event queue when it cannot be processed.
To allow time to progress in a state, we must additionally allow the bit action to occur. The above piece of code implements a timeout of four cycles by letting time progress for four cycles, and then moving to the 'coldstart consistency check' state. If a CAS or frame header is received before the timeout occurs, the startup is aborted.
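The timeout logic described in this paragraph can be paraphrased in Python as follows. This is a sketch under stated assumptions, not the mCRL2 code of the model: the function name, the event strings and the `bits_per_cycle` parameter are hypothetical, and the state reached on abort is deliberately left abstract.

```python
def collision_resolution(events, bits_per_cycle, cycles=4):
    """Run the 'coldstart collision resolution' state on a list of events.

    events: sequence of "bit" (one tick of the global clock) or the name
            of a decoded symbol, e.g. "CAS" or "FRAME_HEADER".
    Returns the name of the state reached after consuming the events.
    """
    if bits_per_cycle <= 0:
        raise ValueError("bits_per_cycle must be positive")
    remaining = cycles * bits_per_cycle
    for event in events:
        if event == "bit":
            remaining -= 1
            if remaining == 0:
                # The timeout expired without any disturbance.
                return "coldstart consistency check"
        elif event in ("CAS", "FRAME_HEADER"):
            # Another node is active on the bus: abort the startup attempt.
            return "startup aborted"
        # Any other decode event is absorbed, so the sender never blocks.
    return "coldstart collision resolution"
```

Counting bits of the global clock replaces waiting for four SyncCalcResult events, which is only sound under the global clock assumption made earlier.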
MAC
Media access control is modelled by the mCRL2 process below. It starts in inactive mode (not shown) and can then receive macCAS, macStart and macStop commands. When a macCAS command is received, it requests the CODEC to send a collision avoidance symbol and then switches to active mode. In the specification, MAC waits for one macrotick before sending the CAS, but it seems that the waiting for the macrotick is not intended as a delay, but merely as a check that the clock synchronisation process has started. Since we assume a global clock, we do not need to model this delay.
When a macStart command is received, the MAC proceeds to active mode, and when it receives a macStop command, it goes back to inactive mode. In active mode, MAC periodically requests CODEC to encode a frame.
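The mode switching of the MAC can be summarised by a small state machine. The Python sketch below is illustrative only: the class `Mac`, the parameters `slot_start` and `cycle_length`, and the request strings are hypothetical, and it assumes that "periodically" means once per communication cycle at the start of the node's slot and that commands are accepted in either mode.

```python
class Mac:
    """Toy media access control with an inactive and an active mode."""

    def __init__(self, slot_start, cycle_length):
        self.active = False
        self.slot_start = slot_start      # assumed bit offset of the node's slot
        self.cycle_length = cycle_length  # assumed cycle length in bits

    def command(self, cmd):
        """Handle a POC command; return the requests issued to the CODEC."""
        if cmd == "macCAS":
            # No macrotick delay is modelled, as a global clock is assumed.
            self.active = True
            return ["encode CAS"]
        if cmd == "macStart":
            self.active = True
        elif cmd == "macStop":
            self.active = False
        else:
            raise ValueError(f"unknown command: {cmd}")
        return []

    def tick(self, t):
        """Called once per bit time t; request a frame at the slot start."""
        if self.active and t % self.cycle_length == self.slot_start:
            return ["encode frame"]
        return []
```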
CODEC
The CODEC is modelled as a process that either reads from or writes to the bus. When in reading mode, it processes bits it reads from the bus, and writes silence to the bus (i.e., it does not write anything to the bus). When it is in writing mode, bits it reads from the bus are ignored, and it writes an encoding of the last requested symbol to the bus.
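The two modes of the CODEC can be sketched as follows. The Python class below is an illustration under assumptions, not the mCRL2 process: the names `Codec`, `encode`, `put` and `get` are chosen to mirror the prose, and a symbol is assumed to be given as a list of already encoded bits.

```python
SILENCE = "silence"


class Codec:
    """Toy CODEC that either reads from or writes to the bus."""

    def __init__(self):
        self.outgoing = []     # bits of the requested symbol still to be sent
        self.writing = False   # True if the current time slot is a write slot
        self.received = []     # bits passed on for decoding

    def encode(self, symbol_bits):
        """Request transmission of a symbol given as a list of bits."""
        self.outgoing = list(symbol_bits)

    def put(self):
        """Value written to the bus in the current time slot."""
        self.writing = bool(self.outgoing)
        return self.outgoing.pop(0) if self.writing else SILENCE

    def get(self, bit):
        """Value read from the bus in the current time slot."""
        if not self.writing:
            # Reading mode: process the bit. In writing mode it is ignored.
            self.received.append(bit)
```

The `writing` flag is set in `put` and consulted in `get`, so that the bit read in the slot in which the last bit of a symbol is written is still ignored.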
Remainder
It can be seen from the context diagrams in [8] that the POC process communicates with all other processes during the startup process. We covered MAC and CODEC, and we argue that we may safely omit models for the remaining processes.
By assuming a global clock, we can avoid modelling the macrotick generation, clock synchronisation startup and clock synchronisation processes. Furthermore we assume that nodes are never in coldstart inhibit mode. This eliminates the influence of the controller host interface, so we can also omit a model for that process. During startup, only the clock synchronisation depends on events generated by frame and symbol processing, so this process is also ignored.
Verification
From the viewpoint of the POC, the faults mentioned in Section 4.2 can be seen as instances of a few general problems that may occur. Either a node is not able to send anything, a node is not able to receive anything, or the bus misbehaves in such a way that symbols are not always transmitted correctly. Only the periodic resetting of a node requires the node to display slightly more complicated behaviour.
Since for the POC a noisy signal is observably equal to no signal at all (the CODEC simply does not generate events), we model a limited set of scenarios. Each of the descriptions below describes two scenarios: one in which the node with the lowest identifier is the faulty node, and one in which another node is faulty. This is necessary because the protocol relies on a leader election mechanism that is not quite symmetric: although the process descriptions for startup are the same for every node, the leader that will be elected depends on the configuration of the nodes. The candidate configured with the lowest identifier will be elected as leader. At least one failure scenario (viz. the resetting node scenario below) is known [23] that is only possible if the node with the lowest identifier is the faulty node.
In this manner, the following categories of scenarios are modelled.
Two nodes. A faulty node does not switch on at all, so effectively there are only two nodes present in the network.
Silent node. A faulty node is not able to send anything. Although we model this separately, we note that this scenario is equivalent to the two-node scenario if we are not interested in the behaviour of the faulty node. We include this scenario because it shows that the silent node is still able to integrate into the communication correctly, albeit in a read-only mode.
Deaf node. A faulty node does not receive anything.
Resetting node. A node resets itself periodically.
Noisy channel. Signals sent by nodes are corrupted on the channel. We use a noise model that consists of a burst length and a maximum backoff time. The burst length determines the maximum number of sequential bits that are corrupted; the maximum backoff time determines the maximum number of sequential bits that pass through the channel unaltered. Due to practical limitations, we were only able to model this scenario in a two-node setting.
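To make the two parameters of the noise model concrete, the following Python sketch checks whether a given corruption pattern is one that such a channel could produce. It is an illustration only and not part of the mCRL2 model: the function name, the representation of a pattern as a list of Booleans (True meaning a corrupted bit), and the decision to leave the first and last clean runs unbounded are all assumptions.

```python
from itertools import groupby
from typing import Sequence


def conforms_to_noise_model(corrupted: Sequence[bool],
                            burst_length: int,
                            max_backoff: int) -> bool:
    """Check a finite corruption pattern against the two noise parameters.

    Assumption: corrupted[i] is True when bit i is corrupted. A run of
    corrupted bits may be at most burst_length long, and a run of clean
    bits strictly between two bursts may be at most max_backoff long.
    Clean runs at the very start or end are not bounded, because a finite
    pattern does not show where they begin or end.
    """
    # Collapse the pattern into (is_corrupted, run_length) pairs.
    runs = [(flag, len(list(group))) for flag, group in groupby(corrupted)]
    for position, (is_corrupted, length) in enumerate(runs):
        if is_corrupted:
            if length > burst_length:
                return False
        else:
            between_bursts = 0 < position < len(runs) - 1
            if between_bursts and length > max_backoff:
                return False
    return True
```

For example, with a burst length of 2 and a maximum backoff time of 2, the pattern clean, corrupt, corrupt, clean, clean, corrupt is accepted, whereas three corrupted bits in a row are rejected.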
For each of these scenarios, we check that the correctly functioning nodes start up. We do this by checking three properties. The first is absence of deadlock; our model is constructed in such a way that we do not expect a deadlock to occur (we always allow time to progress, so a deadlock would indicate an error in the model).
Absence of deadlock is checked while generating the statespace. The other two properties are formulated in the first-order modal µ-calculus (see, e.g., [13]). For brevity, we use mathematical syntax rather than concrete mCRL2 µ-calculus syntax, and extra statements to help the mCRL2 toolset (e.g., to prevent quantifiers from being expanded forever) are left out. The concrete formulae are given in Appendix B. It is important to note that these formulae only represent the intended properties correctly if the system they are checked on is deadlock free, as otherwise the [true]ϕ subformulae might trivially hold.
The second property asserts that eventually all correctly functioning nodes enter normal operation exactly once, an event that is flagged by the enter operation action. It is expressed by the formula in Figure 8, in which N is the total number of nodes and C is the set of correctly functioning nodes.
The last property says that eventually all correctly functioning nodes will keep receiving each other's messages. Even though our model is not intended to model the ongoing traffic after startup, we have constructed our model in such a way that this property should hold. If this property does not hold, then it is likely that the nodes did not synchronise correctly. The formula in Figure 9 states this property formally.
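As an informal illustration of the "exactly once" requirement in the second property, the Python sketch below checks a single finite trace. It is much weaker than the µ-calculus formula, which quantifies over all behaviours of the system; the trace representation as (action, node) pairs, the spelling `enter_operation`, and the function name are assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def all_enter_exactly_once(trace: Iterable[Tuple[str, int]],
                           correct_nodes: Set[int]) -> bool:
    """Check one finite trace for the 'exactly once' requirement.

    Assumption: the trace is a sequence of (action, node) pairs and the
    action that flags entering normal operation is spelled
    "enter_operation". Faulty nodes are ignored.
    """
    entries = Counter(node for action, node in trace
                      if action == "enter_operation")
    return all(entries[node] == 1 for node in correct_nodes)
```

A trace in which a correctly functioning node never enters normal operation, or enters it twice, fails the check; what a faulty node does has no influence on the result.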
Verification of these properties is done by linearising the mCRL2 specification and combining it with the formulae to form parameterised Boolean equation systems. These are instantiated to Boolean equation systems, which are in turn reduced modulo stuttering equivalence on parity games. The resulting smaller equation systems are then solved. A description of this procedure can be found in [10]. We use the July 2011 release of the mCRL2 toolset.
We note that it is also possible to check eventual startup and eventual communication manually. The eventual startup property can be checked by hiding all actions but enter operation and then reducing the statespace using branching bisimulation. The eventual communication property can be checked by hiding all actions but Decode, reducing the statespace using branching bisimulation, and then manually inspecting all strongly connected components.
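The manual inspection of strongly connected components can be supported by a small script. The Python sketch below is not part of the mCRL2 toolset and is not the procedure used for the results in this report: it assumes that the reduced statespace has been exported as a list of (source, label, target) triples with integer states, computes the strongly connected components with Tarjan's algorithm, and flags every component containing a cycle whose internal transitions do not carry all required labels. The function names and label strings are hypothetical. A component that is not flagged can still contain a cycle that misses a label, so the script only narrows down what has to be inspected by hand.

```python
from collections import defaultdict


def strongly_connected_components(transitions):
    """Tarjan's algorithm on a list of (source, label, target) triples."""
    graph = defaultdict(list)
    states = set()
    for source, _label, target in transitions:
        graph[source].append(target)
        states.update((source, target))

    index, lowlink = {}, {}
    stack, on_stack, components = [], set(), []

    def visit(state):
        index[state] = lowlink[state] = len(index)
        stack.append(state)
        on_stack.add(state)
        for successor in graph[state]:
            if successor not in index:
                visit(successor)
                lowlink[state] = min(lowlink[state], lowlink[successor])
            elif successor in on_stack:
                lowlink[state] = min(lowlink[state], index[successor])
        if lowlink[state] == index[state]:
            # state is the root of a component: pop it off the stack.
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == state:
                    break
            components.append(component)

    for state in sorted(states):
        if state not in index:
            visit(state)
    return components


def components_to_inspect(transitions, required_labels):
    """Return components with a cycle that misses a required label."""
    flagged = []
    for component in strongly_connected_components(transitions):
        internal = {label for source, label, target in transitions
                    if source in component and target in component}
        # Components without internal transitions contain no cycle.
        if internal and not set(required_labels) <= internal:
            flagged.append(component)
    return flagged
```

The recursive formulation is only suitable for small, reduced statespaces such as the ones inspected manually here; a larger graph would need an iterative variant.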
Results
No faulty nodes. The statespace is deadlock free, and both properties hold on the system.
Two nodes. The statespace is deadlock free, and both properties hold on the system.
Silent node. The statespace is deadlock free, and both properties hold on the system. Manually inspecting the branching-bisimulation reduced statespace reveals that the failing node can in this case enter normal operation using the wrong schedule (see Figure 10). The clock synchronisation process will allow this scenario, and frame and symbol processing will also not detect the mistake while the startup protocol has not finished. The mistake is harmless, however, because the silent node cannot disturb ongoing communication. As soon as normal operation is entered, the clock correction process or the frame and symbol processing process of the faulty node will notice the error. A next attempt to integrate will succeed, because there is then already ongoing traffic. Notice that the above scenario is possible because a process is not able to read while writing (as can be seen from Figure 3-18 in [8]).
Deaf node. The statespace is deadlock free, but neither of the µ-calculus properties holds.
In case of a deaf node, there is a possibility of the network not starting at all. Figure 11 shows such a scenario. The deaf node can choose to align its frames with those of another startup node, causing only the headers of the other node to be readable on the bus. The non-faulty node that is broadcasting startup frames will not detect that every sent frame is corrupted by the faulty node. Because the non-faulty node's frame headers are untouched, all other nodes will wait until it gives up after the maximum number of startup attempts. When there are more than three coldstart nodes in the system, the remaining nodes will be able to start normally after that. In Figure 11 however, there are only three, and the remaining coldstart node cannot start the system by itself. In this case, the entire network fails to start. This scenario again is due to the fact that the CODEC cannot read and write simultaneously.
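The counting argument behind this scenario can be written down in a few lines of Python. The sketch assumes, as the scenario suggests, that a coldstart node cannot start the network by itself, so that at least two functioning coldstart nodes must remain once the deaf node and the node that gave up are discounted. The function name and the default threshold of two are assumptions made for illustration.

```python
def can_start_after_leader_gives_up(coldstart_nodes: int,
                                    min_nodes_for_startup: int = 2) -> bool:
    """Check whether enough coldstart nodes remain to start the network.

    Two coldstart nodes are discounted: the deaf node, and the non-faulty
    node that gave up after the maximum number of startup attempts.
    The threshold of two remaining nodes is an assumption.
    """
    remaining = coldstart_nodes - 2
    return remaining >= min_nodes_for_startup
```

With three coldstart nodes the check fails, which corresponds to the situation in Figure 11; with four it succeeds.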
Resetting node. The statespace is deadlock free, but neither of the µ-calculus properties holds. Although this scenario was already known (it was described in [24]), the emergence of the trace in Figure 12 gives us confidence that our model is correct. The trace shows that the leading node may cause startup of the network to fail by resetting itself every time it has sent a frame. In fact, it would just have to send the frame header, but the way we modelled our reset behaviour does not allow this. It should be noted that in this scenario it is required that node 1 be the faulty node, which is not necessary in the scenarios for deaf and silent nodes.
Noisy channel. The results depend very much on the parameters of the noise model. For an arbitrary noise pattern, the system is obviously not guaranteed to start up, since the channel could simply corrupt all traffic going through it. The noise model we chose guarantees that some information will come through. Checking exactly for which values of maximum burst size and maximum backoff period the system starts correctly is too big a task, but simply trying a few settings soon gives an idea of how robust the system is. We made the following observation.
If there is noise on the channel for too long while nodes are trying to commence the startup procedure, then obviously startup may fail. The interesting scenarios are those in which some information can be communicated.
However, if the minimum backoff time is less than the time needed for fault-free startup, then one of the sync frames of the leading coldstart nodes can always be corrupted, causing either the schedule initialisation or the consistency check of the other nodes to fail. If the presence of noise is the only anomaly in the system, then the minimum backoff time being at least the time required for fault-free startup is enough to guarantee that the system will come up.
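The condition in this observation can be restated as a check on a concrete corruption pattern. The Python sketch below measures the shortest clean gap between two consecutive bursts and compares it with the number of bit times needed for fault-free startup. It only restates the observation and is not a verification result; the function names, the Boolean pattern representation (True meaning a corrupted bit), and the treatment of patterns with fewer than two bursts are assumptions.

```python
from itertools import groupby
from typing import Optional, Sequence


def shortest_gap_between_bursts(corrupted: Sequence[bool]) -> Optional[int]:
    """Length of the shortest clean run strictly between two bursts.

    Returns None when the pattern contains fewer than two bursts, in which
    case no backoff period is observed at all.
    """
    runs = [(flag, len(list(group))) for flag, group in groupby(corrupted)]
    gaps = [length for position, (flag, length) in enumerate(runs)
            if not flag and 0 < position < len(runs) - 1]
    return min(gaps) if gaps else None


def meets_startup_condition(corrupted: Sequence[bool],
                            startup_bits: int) -> bool:
    """True if every observed backoff period lasts at least startup_bits."""
    gap = shortest_gap_between_bursts(corrupted)
    # Assumption: with no observed backoff period the condition is vacuous.
    return gap is None or gap >= startup_bits
```

For a pattern with clean gaps of three and two bits between bursts, the shortest gap is two, so the condition is met for a startup time of two bit times but not for three.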
Techniques
We created a model that is very close to the original specification. As a consequence the model is rather complex, and its state space is large for realistic values of the various parameters. The results in this paper are obtained by explicitly generating the statespaces of systems in which these parameters have been set to (unrealistically) small values. The resulting statespaces are small enough to minimise modulo branching bisimulation, and the reduced statespaces are in turn small enough to check manually.
A natural choice when dealing with extremely large systems is to use symbolic model checking. In our case, however, the bottleneck does not seem to be in the memory needed to store the statespace, but in the time needed to generate it; a problem that cannot be overcome by using for instance the symbolic model checker LTSMIN [2]. Using this tool we tried to check that the startup failed action does not occur in a fault-free setting. However, after 9 days of computing, only 2.2 billion states had been checked. Attempts to speed up the state space generation by reducing the size of the process equations did not lead to better results, and different search strategies (BFS, chaining) seem to perform equally poorly.
The µ-calculus properties were checked by using parity game reduction and parity game solvers, as described in [10]. Although the parity game reduction did speed up the solving process significantly, the real bottleneck was generating the game (or, equivalently, the Boolean equation system) itself, which took over three days for the second µ-calculus property in the scenario with a resetting leader node.
Specification
As was already mentioned in [23], the absence of a fault hypothesis makes it difficult to assess whether the specification specifies a correctly operating system.
Modelling the specification is difficult, because the specification is given only at a high level of detail. This is ideal for implementing the system, because the mapping to an implementation is very straightforward. However, it makes it virtually impossible to abstract away implementation details: due to implementation considerations, it seems, the specification lacks a hierarchical structure; functionality like clock synchronisation is spread out over multiple components. In this sense, the specification does not so much specify a general networking protocol, but rather a reference implementation of one. A lack of explanation on a more abstract level makes it extremely difficult to understand what the exact functionality of the protocol is, and makes it hard to assess whether simplifications of a model are valid abstractions.
Conclusion
We have modelled a large part of the FlexRay startup protocol using a discrete time model. We presented a notion of correct operation of the startup protocol, and mechanically verified that in a number of fault scenarios the startup protocol operates correctly. The model revealed two scenarios in which a network does not start up; one of these scenarios was previously unknown.
A Modified mCRL2 syntax
In order to create a more structured specification, an extension of the mCRL2 syntax was used. This syntax is not meant to provide a semantic extension of the language, but merely to syntactically abbreviate some common constructs so as to remove clutter. We therefore merely give an explanation of how to read the syntax, but do not recommend using it in any other setting.
The syntax is best explained using an example. Consider the following specification: This specification is translated to the following, pure mCRL2 specification.
proc X(n: Nat) = X'A(n);
proc X'A(n: Nat) = a(n) . X'B(n, 0);
proc X'B(n: Nat, m: Nat) =
    (m < 10)  -> b . X'B(n, m + 1)
  + (m == 10) -> b . X(n + 1);
The intuition is that every nested block defines some parameters that are shared by the states defined in it. The states are mCRL2 process specifications, and can therefore again be defined in terms of nested blocks.
A state may add parameters to the list provided in its associated nested statement. Duplicate parameter names are not allowed. When a process or state name is used within a state, a name lookup is done first in the scope of the current nested block, then in the one above that etc.
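The lookup rule can be illustrated with a short Python sketch based on `collections.ChainMap`, which searches a list of mappings from first to last. The scope contents below mirror the example above, in which states A and B of process X are translated to X'A and X'B; the function name and the dictionary representation are assumptions and do not describe how the translation was actually implemented.

```python
from collections import ChainMap


def resolve_name(name: str, scopes: list) -> str:
    """Resolve a process or state name, innermost nested block first.

    Assumption: scopes is a list of dictionaries ordered from the current
    nested block outwards, each mapping a local name to the name used in
    the translated, pure mCRL2 specification.
    """
    return ChainMap(*scopes)[name]


# Scopes as seen from inside the nested block of process X.
inner_block = {"A": "X'A", "B": "X'B"}
top_level = {"X": "X"}
```

Here `resolve_name("B", [inner_block, top_level])` yields `X'B`, whereas `X` is not found in the inner block and is therefore resolved at the top level. A name that is declared in no scope raises a `KeyError`.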
Obviously, name clashes are a big problem in this construction. The models presented in this document therefore use unique variable and process names in such a way that the translation scheme does not cause any ambiguity.

    && Y(remove(CORRECT_NODES, s(nextsym(s, CORRECT_NODES))), nextsym(s, CORRECT_NODES)) ) )

  + sum id': Sender .
      decode(id, FRAME_HEADER(id')) . ColdstartListen(tStartupNoise=0)
  + sum id': Sender .
      decode(id, FRAME(id')) . bit . init_sched(id) . InitialiseSchedule(0, id')
  + reset(id) . AbortStartup(attempts=0)
  + bit .
    ( (tStartup >= 2 * CYCLE_length - 1 || tStartupNoise >= 4 * CYCLE_length - 1) ->
      ( is_idle(true) .
          macCAS . abort(id) . ColdstartCollisionResolution(attempts=attempts+1, timer=-length(CAS))
      + is_idle(false) .
          ColdstartListen(tStartup=0, tStartupNoise=tStartupNoise + 1)
      )
    + (tStartup < 2 * CYCLE_length - 1 && tStartupNoise < 4 * CYCLE_length - 1) ->
      ( is_idle(true) .
          ColdstartListen(tStartup=tStartup + 1, tStartupNoise=tStartupNoise + 1)
      + is_idle(false) .
          ColdstartListen(tStartup=0, tStartupNoise=tStartupNoise + 1)
      )
    )
  );

% In InitialiseSchedule, the node waits for the clock synchronisation process
% to finish its initialisation. In this model, we abstracted away from clock
% synchronisation, and therefore this state can be left as soon as the second
% frame of the sender that this node is synchronising on (syncon) is received.
state InitialiseSchedule(timer: Nat, syncon: Sender) =
  ( decode(id, CAS) . InitialiseSchedule()
  + sum id': Sender .
      decode(id, FRAME_HEADER(id')) . InitialiseSchedule()
  + sum id': Sender .
      decode(id, FRAME(id')) .
      ( (id' == syncon) ->
          ( (timer == CYCLE_length - 1) ->
              bit .
              IntegrationColdstartCheck(
                Int2Nat(FRM_START(syncon) + length(FRAME(id'))),
                syncon, false, false, false, false)
            <> AbortStartup() )
        <> InitialiseSchedule() )
