. (2016) 
Introduction
Traditional point-to-point control systems are currently being replaced by networks in modern manufacturing systems. These networks reduce cost by decreasing the amount of electrical wiring and also decrease maintenance costs. These Networked Control Systems (NCSs) transmit packets that have real-time constraints. These packets are small and frequent. They originate from a Sensor node (S) that samples a physical phenomenon, such as temperature, either regularly (in clock-driven systems) or when there is a system change (in event-driven systems). A packet is then sent to a controller (K) that decapsulates it, calculates the control decision, encapsu-models. Section 6 concludes this research.
Two-Cell Architecture
In-line architectures have numerous applications in many industrial processes; an example is having two separate machines operating in tandem (in-line production) with the second machine working on functions released by the first one.
In addition, by interconnecting two machines, each with its own NCS, the system as a whole is expected to become more reliable through added fault-tolerance measures. In such a fault-tolerant architecture, the second cell can take over operation of both cells in case of the occurrence of a failure in the first one.
For each individual cell, an S2A (Sensor-to-Actuator) architecture utilizing Switched Ethernet is employed, as in [21]; thus, all sensor control packets are transmitted directly to all actuators every sampling period. The system sampling period is fixed at 694 µs, corresponding to a sampling frequency of 1440 Hz, as in [21]. Additionally, each sensor and actuator sends a single packet to a supervisor node for monitoring purposes.
For fault-tolerance at the level of the control function, each actuator has an integrated controller (K) in addition to an extra network interface (E), as shown in Figure 1. During normal operation in the absence of any faults, the integrated controller (K) is responsible for taking the required control action based on the received sensor data. However, when the integrated controller (K) fails, the supervisor (Sup) node enters the cell's control loop by generating the required control action and transmitting it to the affected actuator's extra network interface (E). In Figure 1, Sup1 represents the supervisor in cell 1, Sup2 the supervisor in cell 2 and Ki,j controller j in cell i (i = 1, 2 and j = 1 to 4).
Thus, the control traffic flows can change based on the failure state; as such, all potential failure states must be investigated individually in order to guarantee that all system deadlines are met with no control packet losses. 
Fault-Free (FF) Scenario
When all system components are operational, various packets are exchanged over the network. These control packets must be successfully transmitted, with no packet loss, within the required system deadlines in order to guarantee correct system behavior.
Sensors: All the sensors in cell one, shown in Figure 1 , send packets to all four actuators in cell one. In addition, the sensors send their data to both cells' supervisors. Similarly, the sensors in cell two send packets to all four actuators in cell two and also send data to both cells' supervisors.
Actuators: The controllers within the actuators in cell one receive packets from the sensors in cell one. The controllers within the actuators in cell two receive packets from the sensors in cell two. Actuators from both cells send additional monitoring packets to both cells' supervisors.
Supervisors: The two supervisors receive data from all the sensors and all the actuators in both cells to allow for fault-tolerance. Finally, each supervisor sends a watchdog signal to the other supervisor every 347 microseconds (half the sampling period) to indicate to the other supervisor that it is functioning properly. This message is only 10 Bytes, as opposed to control messages, which are 100 Bytes. If a supervisor does not receive a watchdog signal from the other supervisor, the supervisor assumes the other one has failed and takes over its responsibilities. Thus, for correct operation of the proposed system, it is assumed that the supervisors are fail silent.
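The watchdog exchange between the two supervisors can be illustrated with a minimal sketch. Only the 347 µs watchdog period and the fail-silent assumption come from the text; the class name `WatchdogMonitor`, its method names and the choice of declaring a failure after two silent periods are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass

WATCHDOG_PERIOD_US = 347.0  # from the text: half of the 694 us sampling period


@dataclass
class WatchdogMonitor:
    """Tracks the peer supervisor's watchdog messages (illustrative only)."""

    # Assumption: the peer is declared failed after this many silent periods.
    missed_periods_allowed: int = 2
    last_heard_us: float = 0.0
    peer_failed: bool = False

    def on_watchdog(self, now_us: float) -> None:
        """Record the arrival time of a watchdog message from the peer."""
        self.last_heard_us = now_us

    def check(self, now_us: float) -> bool:
        """Return True if this supervisor should take over the peer's duties."""
        silent_for = now_us - self.last_heard_us
        if silent_for > self.missed_periods_allowed * WATCHDOG_PERIOD_US:
            # Fail-silent assumption: a failed supervisor simply stops sending.
            self.peer_failed = True
        return self.peer_failed


monitor = WatchdogMonitor()
monitor.on_watchdog(now_us=347.0)
alive = monitor.check(now_us=694.0)        # one silent period: no takeover yet
taken_over = monitor.check(now_us=1500.0)  # prolonged silence: take over
```

Because the supervisors are assumed to be fail silent, the absence of messages is the only failure indication this sketch needs; a supervisor that kept sending incorrect watchdog messages would not be detected by it.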
Fault Tolerant (FT) Scenarios
In the proposed fault-tolerant two-cell architecture, both cells can still continue operating normally even if certain components fail within the system. Next is a description of the different fault-tolerant scenarios. It is important to note that, since TMR is employed at the level of the sensor nodes, the failure of any one sensor can be tolerated by the proposed control system.
Supervisor in Cell One Fails
If Sup1 from Figure 1 fails, the system will remain operational with Sup2 receiving data from all the sensors and actuators in cell one, and it will also receive data from all the sensors and actuators in cell two. The failure of the supervisor will not increase the sensor to actuator delay in either cell.
Supervisor in Cell Two Fails
Similar to the previous scenario but Sup2 fails instead of Sup1.
Controllers in Cell One Fail
If any of the controllers (or all of them as a worst case scenario) within the actuators in cell one fail (K1,1; K1,2; K1,3; K1,4 from Figure 1 ), Sup1 will detect and take over the function of the failed controllers, relaying data to the actuators through the Ethernet port (E) within each actuator. The Supervisor is able to detect any controller failure since all controllers send additional monitoring packets to the supervisors every sampling period. There will be a slight increase in the sensor to actuator delay for the affected actuators in cell one because data must go from the sensors to the supervisor and then the actuators, but in cell two the delay will not be affected because data will travel directly from the sensors to the actuators.
Controllers in Cell Two Fail
Similar to the previous scenario, but the controllers within the actuators in cell two (K2,1; K2,2; K2,3; K2,4) fail instead and Sup2 takes over the function of the failed controllers.
All Controllers in Both Cells Fail
If the controllers within the actuators fail in cell one and cell two (K1,1; K1,2; K1,3; K1,4; K2,1; K2,2; K2,3; K2,4 from Figure 1) simultaneously (as a worst case scenario), supervisor one will take over the function of the failed controllers in cell one and supervisor two will take over the function of the failed controllers in cell two. In this situation, data must go from sensor to supervisor first before going from supervisor to the actuators. As a result, there is a slight increase in the sensor to actuator delay in both cells compared to the delay in the fault-free scenario.
Supervisor and Controllers in Cell One Fail
If (K1,1; K1,2; K1,3; K1,4; Sup1 from Figure 1 ) fail, Sup2 receives packets from the sensors in cell one to relay them to the actuators in cell one through the Ethernet port in each actuator, and it will also receive packets from the sensors and controllers in cell two. There is a notable increase in the sensor to actuator delay in cell one, but no increase in delay in cell two.
Supervisor and Controllers in Cell Two Fail
Similar to the previous scenario but (K2,1; K2,2; K2,3; K2,4; Sup2) fail instead and Sup1 takes over control of the failed cell.
Supervisor in Cell One Fails, Controllers in Cell Two Fail
If (K2,1; K2,2; K2,3; K2,4; Sup1 from Figure 1 ) fail, Sup2 receives packets from the sensors and controllers in cell one, but there is no increase in the sensor to actuator delay in cell one. However, in cell two, Sup2 must also undertake the function of the failed controllers within the actuators in the second cell; as a result, there is an increased delay in cell two.
Supervisor in Cell Two Fails, Controllers in Cell One Fail
Similar to the previous scenario but (K1,1; K1,2; K1,3; K1,4; Sup2) fail instead and Sup1 takes over control of the actuators in cell one.
Supervisor in Cell One Fails, Controllers in Both Cells Fail
If (K1,1; K1,2; K1,3; K1,4; K2,1; K2,2; K2,3; K2,4; Sup1 from Figure 1 ) fail, Sup2 must undertake the function of the failed controllers in both cells. However, because cell two is much closer to supervisor two, the end-to-end delay in cell one is much larger than the delay in cell two.
Supervisor in Cell Two Fails, Controllers in Both Cells Fail
Similar to the previous scenario but (K1,1; K1,2; K1,3; K1,4; K2,1; K2,2; K2,3; K2,4; Sup2) fail and Sup1 must instead undertake the function of the failed controllers in both cells.
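The scenarios above follow a simple rule: a cell is driven by its integrated controllers while they work, otherwise by its own supervisor, otherwise by the other cell's supervisor. The sketch below enumerates the states under that reading. The function name `controlling_nodes` and the simplification of treating all controllers of a cell as one all-or-nothing flag (the worst case used in the text) are assumptions made for illustration.

```python
from itertools import product


def controlling_nodes(sup1_ok, sup2_ok, k1_ok, k2_ok):
    """Return which node executes the control function for (cell 1, cell 2).

    k1_ok / k2_ok are True when the integrated controllers of that cell work
    (assumption: all controllers of a cell fail together, as in the worst case).
    """
    if not (sup1_ok or sup2_ok):
        raise ValueError("both supervisors failed: not one of the listed scenarios")

    def for_cell(controllers_ok, own_sup, own_sup_ok, other_sup):
        if controllers_ok:
            return "integrated controllers"
        # Controllers failed: the cell's own supervisor takes over if it works,
        # otherwise the supervisor of the other cell does.
        return own_sup if own_sup_ok else other_sup

    return (
        for_cell(k1_ok, "Sup1", sup1_ok, "Sup2"),
        for_cell(k2_ok, "Sup2", sup2_ok, "Sup1"),
    )


# All states with at least one working supervisor, excluding the fault-free one.
scenarios = [
    state
    for state in product([True, False], repeat=4)
    if (state[0] or state[1]) and not all(state)
]

for state in scenarios:
    print(state, "->", controlling_nodes(*state))
```

Counting the enumerated states gives 11, which matches the number of fault-tolerant scenarios listed in this section.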
Delay Analysis
Through analysis of the model, the worst-case end-to-end delay of the control packets is calculated for both Fast Ethernet and Gigabit Ethernet. As previously mentioned, one of the restrictions on the proposed model is that the control action must be taken within 694 μs; the worst-case delay must therefore not exceed this limit in order to ensure correct control operation. The following analysis focuses on the last packet transmitted from the final sensor node; this represents the worst-case scenario because all previously sent packets are queued ahead of this particular packet, so it takes the longest to be transmitted over the network. Processing delay is not taken into account because previous work has shown it to be negligible compared to the other delays [22]. Thus, the amount of time required for the transmission of a single packet over a particular link is given by:

$$D_{\text{packet}} = D_{\text{transmission}} + D_{\text{propagation}}$$
The total end-to-end delay for the worst-case packet flow is given by:

$$D_{\text{total}} = D_{\text{packet}} \times N$$

where $N$ is the total number of packets transmitted sequentially.
The link transmission delay ($D_{\text{transmission}}$) is the amount of time required for all of the packet's bits to be transmitted onto the link. It depends on the packet length $L$ (bits) and the link transmission rate $R$ (bps) [23]:

$$D_{\text{transmission}} = \frac{L}{R}$$
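The length of the packet is fixed at 100 Bytes at the application layer; however, additional packet and frame header overhead (approximately 58 Bytes) must be taken into consideration, giving $L = 158 \times 8 = 1264$ bits. All the links are Gigabit Ethernet in one scenario and Fast Ethernet in the second scenario, therefore:

$$D_{\text{transmission}}^{\text{Gigabit}} = \frac{1264}{10^{9}} \approx 1.264\ \mu\text{s}, \qquad D_{\text{transmission}}^{\text{Fast}} = \frac{1264}{10^{8}} \approx 12.64\ \mu\text{s}$$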
The propagation delay ($D_{\text{propagation}}$) is the time taken for the packet to travel from the sender to the receiver; it depends on the link length $d$ (m) and the propagation speed $s$ (m/s) [23].
The length between each node and the switch is $d = 1.5$ m and the propagation speed in the Ethernet links is $s = 2 \times 10^{8}$ m/s.
$$D_{\text{propagation}} = \frac{d}{s} = \frac{1.5}{2 \times 10^{8}} = 7.5\ \text{ns}$$
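Hence, the total end-to-end delay is given by:

$$D_{\text{total}} = N \times \left( D_{\text{transmission}} + D_{\text{propagation}} \right) \qquad (8)$$

where $N$ is the total number of packets transmitted sequentially.
Using the delays obtained above as constants and substituting the appropriate values in (8), the sensor to actuator delays for all possible system states (including fault-free and fault-tolerant scenarios) are calculated next.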
Fault-Free (FF) Scenario
For the fault-free scenario, the worst-case queue contains 20 packets (16 packets from the sensors to the switch and 4 packets from the switch to the actuators), following the same analysis methodology as in [12]. Thus, the total end-to-end delays for Fast and Gigabit Ethernet can be obtained as follows: 
Fault-Tolerant (FT) Scenarios
Following the same delay calculation methodology, Table 1 summarizes the delay analysis for all aforementioned Fault-Tolerant (FT) scenarios. 
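The delay methodology can be sketched numerically for the fault-free case. The sketch assumes that the worst-case delay is the number of sequentially transmitted packets multiplied by the sum of the transmission and propagation delays of one packet, with a 158-Byte frame (100 Bytes of payload plus roughly 58 Bytes of overhead), 1.5 m links and the nominal 100 Mbps and 1 Gbps link rates. The function and constant names are hypothetical, and the printed values are outputs of this sketch, not figures taken from Table 1.

```python
FRAME_BITS = (100 + 58) * 8      # payload plus approximate header overhead
LINK_LENGTH_M = 1.5              # node-to-switch distance from the text
PROPAGATION_SPEED_MPS = 2e8      # propagation speed from the text
DEADLINE_S = 694e-6              # control deadline (one sampling period)
RATES_BPS = {"Fast Ethernet": 100e6, "Gigabit Ethernet": 1e9}


def per_packet_delay(rate_bps):
    """Transmission plus propagation delay of one packet on one link."""
    return FRAME_BITS / rate_bps + LINK_LENGTH_M / PROPAGATION_SPEED_MPS


def worst_case_delay(n_packets, rate_bps):
    """Delay of the last packet when n_packets are transmitted back to back."""
    return n_packets * per_packet_delay(rate_bps)


# Fault-free scenario: 16 sensor packets plus 4 switch-to-actuator packets.
fault_free = {name: worst_case_delay(20, rate) for name, rate in RATES_BPS.items()}

for name, delay in fault_free.items():
    print(f"{name}: {delay * 1e6:.2f} us, deadline met: {delay < DEADLINE_S}")
```

Under these assumptions the fault-free delay is about 253 µs on Fast Ethernet and about 25.4 µs on Gigabit Ethernet, both below the 694 µs deadline. The fault-tolerant scenarios differ only in the packet count passed to `worst_case_delay`.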
Simulation vs. Analytical Results
The proposed two-cell fault-tolerant architecture was simulated on OMNeT++ [24] in the fault-free scenario as well as under all outlined failure scenarios. For all simulated scenarios, all control packets were transmitted successfully with no packet losses. Additionally, all observed control packet delays were less than the required system deadline.
In Figure 2 , the maximum end-to-end delay for the control packets transmitted from the 16 sensors to the 4 actuators in cell 2 is shown in the absence of any failures. The x-axis represents the Simulation Time (seconds) and the y-axis shows the End-to-end Delay (in seconds). The end-to-end delays from the simulation include packet transmission, propagation, queuing, encapsulation and decapsulation delays.
The observed end-to-end delays were deterministic due to the periodic nature of the control traffic combined with the use of Switched Ethernet. In all simulated scenarios, the observed percentage error between the simulated and analytical delay results did not exceed 5% as shown in Table 2 .
Reliability Modeling
The reliability of the proposed two-cell architecture will be calculated next. Two values will be calculated: the Control Function Reliability (CFR) and the Node Reliability (NR).
Control Function Reliability (CFR)
CFR(t) focuses only on the components where the control function is executed, i.e., the supervisors (Sup1 and Sup2) and the four controllers connected to the four actuators in both cells (Ki,j, with i = 1, 2 and j = 1 to 4). It will be assumed that, if the system loses its observability (both supervisors fail), this will be considered as a system failure even if both cells are still operational. There is one controller in each actuator and there are four actuators in each cell; in addition, there is a supervisor in each cell, which means the control function for the two-cell system is based on 10 components.
In order to calculate CFR(t), all the different situations are analyzed and the failure states are identified. There are $2^{10} = 1024$ situations to consider since there are 10 components involved in the Control Function. A small scale model is first analyzed, using Algorithm I of Figure 3, with one controller and one supervisor per cell.
It was observed that all failure states have one feature in common: in each failure state, both supervisors are in a failure state, regardless of the state of the controllers. The rationale behind this finding is as follows: assuming the Ethernet port of the actuators is always operational, the controllers Ki,j will not cause a system failure because, even if they fail, the supervisors can take over their function. This is clear from the 11 scenarios described in Section 3.2. However, when both supervisors fail at the same time, the system loses its observability, which is considered to be a system failure as mentioned above (even if all the controllers are working). Hence, CFR(t) for the two-cell system is:
The reliability $CFR_{sim}(t)$ is that of the simplex system (two cells without any fault-tolerance). Note that:
In the above equation, it is assumed that both supervisors have the same failure rate. The same is assumed for all controllers connected to the actuators. The reliability of any of the 8 controllers in both cells is $R_k$.
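The small scale analysis can be reproduced with a brute-force enumeration of all component states. This is a sketch in the spirit of Algorithm I, not a reproduction of it: it assumes independent component failures, treats the system as operational whenever at least one supervisor works, and uses the hypothetical symbol `r_sup` for the supervisor reliability together with made-up numerical values.

```python
from itertools import product


def cfr_operational(sup1, sup2, k1, k2):
    """Control function survives while at least one supervisor works.

    The controller states k1 and k2 do not matter because a working
    supervisor can take over for a failed controller.
    """
    return sup1 or sup2


def enumerate_reliability(is_operational, reliabilities):
    """Sum the probabilities of all operational states (independent failures)."""
    total = 0.0
    for state in product([True, False], repeat=len(reliabilities)):
        if is_operational(*state):
            probability = 1.0
            for works, r in zip(state, reliabilities):
                probability *= r if works else (1.0 - r)
            total += probability
    return total


r_sup, r_k = 0.9, 0.8  # hypothetical component reliabilities at some time t

states = list(product([True, False], repeat=4))
failure_states = [s for s in states if not cfr_operational(*s)]

cfr_enumerated = enumerate_reliability(cfr_operational, [r_sup, r_sup, r_k, r_k])
cfr_closed_form = 1.0 - (1.0 - r_sup) ** 2  # both supervisors must fail

print(len(states), len(failure_states), cfr_enumerated, cfr_closed_form)
```

For the small model, 4 of the 16 states are failure states, and in all of them both supervisors are down. The enumeration agrees with the closed form of two supervisors in parallel, which is what the observation in the text implies under the independence assumption.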
Node Reliability (NR)
As mentioned above, another perspective would include the Ethernet ports of both sensors and actuators in the reliability calculations (in addition to both supervisors and the 4 controllers connected to the actuators in each cell), i.e., all nodes connected to the network fabric. As a result, each cell comprises 16 sensor Ethernet ports, 1 supervisor, 4 controllers and 4 actuator Ethernet ports, i.e., 25 components per cell and 50 components for the two-cell system.
Since there are 50 components to analyze, $2^{50}$ situations have to be studied. As for CFR(t), a small scale model is first analyzed following Algorithm I with one controller, one sensor Ethernet port, one supervisor and one actuator Ethernet port for each cell, resulting in a total of 8 components, or $2^{8} = 256$ situations.
Of the 256 situations, precisely 27 are operational and the remaining 229 situations are failure states. By observing the 27 states in which the system is operational, the following observations can be made.
Sensors: If any of the sensor Ethernet ports fail for any reason in either cell, the system fails. Controllers: If the controllers connected to the actuators fail, the system will continue to work on the condition that one of the supervisors is working and that the corresponding Ethernet ports in the actuators are working as well. Supervisors: At least one out of the two supervisors must be working at all times. If both supervisors fail at the same time, the observability of the system is lost which is assumed to cause a system failure.
Actuator Ethernet Ports: The system can continue working normally even if the Ethernet ports in the actuators are down, but only if the controllers attached to the corresponding actuators are working. For each actuator, its controller and Ethernet port form a 1-of-2 system. If a controller fails inside the actuator and the Ethernet port inside the same actuator fails at the same time, the system fails. Hence, NR(t) can be calculated as follows:
where $R_s$ is the reliability of the sensor's Ethernet port, $R_a$ is the reliability of the actuator's Ethernet port and $R_k$ is the reliability of the controller attached to the actuator. Note that $NR_{sim}(t)$ is the reliability of the simplex system from a nodes point of view.
It is again assumed that both supervisors have the same failure rate. All sensors also have the same failure rate as do all controllers.
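The small scale node model can be checked by enumerating its 256 states against the rules listed above: every sensor Ethernet port must work, at least one supervisor must work, and each actuator needs either its controller or its Ethernet port. The sketch assumes independent failures; the symbol `r_sup` and all numerical values are hypothetical.

```python
from itertools import product

COMPONENTS = ("s1", "s2", "sup1", "sup2", "k1", "k2", "a1", "a2")


def nr_operational(s1, s2, sup1, sup2, k1, k2, a1, a2):
    """Operational rule of the small model (one sensor and actuator per cell)."""
    sensors_ok = s1 and s2                    # every sensor port is required
    supervisor_ok = sup1 or sup2              # observability needs one supervisor
    actuators_ok = (k1 or a1) and (k2 or a2)  # 1-of-2 per actuator
    return sensors_ok and supervisor_ok and actuators_ok


def state_probability(state, reliabilities):
    """Probability of one state, assuming independent components."""
    probability = 1.0
    for works, r in zip(state, reliabilities):
        probability *= r if works else 1.0 - r
    return probability


states = list(product([True, False], repeat=len(COMPONENTS)))
operational = [state for state in states if nr_operational(*state)]
failed = len(states) - len(operational)

r_s, r_sup, r_k, r_a = 0.99, 0.9, 0.8, 0.7  # hypothetical reliabilities
reliabilities = (r_s, r_s, r_sup, r_sup, r_k, r_k, r_a, r_a)

nr_enumerated = sum(state_probability(s, reliabilities) for s in operational)
nr_closed_form = (
    r_s**2 * (1 - (1 - r_sup) ** 2) * (1 - (1 - r_k) * (1 - r_a)) ** 2
)

print(len(operational), failed, nr_enumerated, nr_closed_form)
```

The enumeration yields 27 operational states and 229 failure states, and the summed probability matches a product of three blocks: sensors in series, supervisors in parallel, and one controller-or-port pair per actuator. Scaling the exponents to 32 sensor ports and 8 actuators gives the corresponding expression for the full two-cell system under the same assumptions.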
Two-Cell Architecture with TMR
Although connecting two cells together is expected to increase system reliability (whether CFR(t) or NR(t)), there is still the problem of single points of failure, i.e., the sensors. If any one of the 32 sensors fails, the whole system will fail. In [13] (as mentioned in Section 2), it was shown that applying Triple Modular Redundancy (TMR) at the sensor level was feasible and that it increased system reliability as expected.
The same approach is going to be implemented next; TMR is applied to all 32 sensors (for the two cells). A sensor will only fail if at least two of its three modules fail. The application of TMR results in 48 sensors in cell one and 48 sensors in cell two, giving a total of 96 sensors in the system. And while the increase from 32 sensors to 96 sensors means an increase in costs, the system is expected to become much more reliable. Such an expensive but extremely reliable architecture is appropriate for applications in sensitive environments such as the nuclear industry or the space industry where reliability is the most important factor in system design.
On the other hand, such an increase in the number of sensors will significantly increase network traffic. Note that any control packet must not have an end-to-end delay greater than the 694 µs system control deadline. Clearly, the fault-free and fault-tolerant scenarios will be identical to those described in Sections 3.1 and 3.2. However, the delays are expected to increase significantly due to the extra traffic generated by 96 sensors instead of 32.
Fault-Free (FF) and Fault-Tolerant (FT) Scenarios
In addition to the fault-free scenario, this system must meet the required control deadline under all the fault-tolerant scenarios stated in Section 3.2. Furthermore, the system tolerates the failure of one sensor in each TMR group during mission time, as long as the other two sensors of that group are working.
Although the sensor to actuator delays experienced in the proposed model in Section 3 were under the 694 microsecond limit, there is an expected increase in the delays for the proposed TMR model because of the large amount of extra traffic due to TMR. In addition to meeting the required control delay deadline, the proposed TMR model must guarantee zero control packet loss.
Delay Analysis (with TMR)
Calculations for TMR applied to two cells are obtained in the same manner depicted in Section 3.3 for two cells connected to each other without TMR. The only difference is an increase in the number of packets in the system. Furthermore, calculations (and OMNeT++ simulations) were conducted only using Gigabit Ethernet links because it was shown in [13] that using Fast Ethernet causes the delay to exceed the 694 µs deadline when applied to cells with TMR sensors. The analytical sensor to actuator delays are calculated using equation (8).
Fault-Tolerant (FT) Scenarios
Following the same delay calculation methodology, Table 3 summarizes the delay analysis for all aforementioned Fault-Tolerant (FT) scenarios.
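To see why the extra TMR traffic matters, the same per-packet delay model can be inverted to estimate how many back-to-back packets fit within the deadline on each link type. This is only a rough budget under the assumptions used earlier (158-Byte frames, 1.5 m links, nominal 100 Mbps and 1 Gbps rates); the function name is hypothetical and the packet counts of the individual TMR scenarios are not reproduced here.

```python
import math

FRAME_BITS = (100 + 58) * 8   # payload plus approximate header overhead
PROPAGATION_S = 1.5 / 2e8     # 7.5 ns per link
DEADLINE_S = 694e-6           # control deadline


def max_packets_within_deadline(rate_bps):
    """Largest number of sequential packets whose total delay meets the deadline."""
    per_packet = FRAME_BITS / rate_bps + PROPAGATION_S
    return math.floor(DEADLINE_S / per_packet)


budget_fast = max_packets_within_deadline(100e6)
budget_gigabit = max_packets_within_deadline(1e9)

print(f"Fast Ethernet budget: {budget_fast} packets")
print(f"Gigabit Ethernet budget: {budget_gigabit} packets")
```

Under these assumptions the budget is 54 packets on Fast Ethernet and 545 on Gigabit Ethernet. With 48 TMR sensors per cell, the sensor traffic alone leaves little headroom on Fast Ethernet, which is in line with the finding cited from [13].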
Simulation vs. Analytical Results (with TMR)
The proposed two-cell fault-tolerant architecture, after applying TMR on the sensor nodes, was simulated in OMNeT++ [24] in the fault-free scenario as well as under all outlined failure scenarios. For all simulated scenarios, all control packets were transmitted successfully with no packet losses. Additionally, all observed control packet delays were less than the required system deadline. In all simulated scenarios, the highest observed percentage error between the simulated and analytical delay results was 6.08%, as shown in Table 3.
Reliability Modeling

Control Function Reliability (CFR)
From a control function point of view, which only focuses on the supervisors and the controllers within the actuators, the reliability equation does not change from the two-cell system without TMR, even with the addition of the extra sensors, because the sensors are not taken into consideration in this reliability model. Hence, the reliability equations are still the same as in (13) and (14).
Node Reliability (NR)
System reliability can also be calculated by taking into account the sensors and the Ethernet ports within the actuators, in addition to the supervisors and the controllers within the actuators. In this case, the equation differs only in the sensor block; the rest of the equation from (15) remains unchanged.
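The effect of TMR on the sensor block can be sketched as follows. Each TMR group is modeled as a 2-of-3 system with independent modules and a perfect voter, and the rest of the node model is taken as supervisors in parallel and one controller-or-port pair per actuator. These modeling choices, the function names, the symbol `r_sup` and all numerical values are assumptions for illustration and are not the case study parameters.

```python
def tmr_reliability(r):
    """Reliability of a 2-of-3 group (independent modules, perfect voter)."""
    return 3 * r**2 - 2 * r**3


def node_reliability(r_sensor_block, r_sup, r_k, r_a, n_sensors=32, n_actuators=8):
    """Node reliability of the two-cell system for a given sensor block."""
    sensors = r_sensor_block**n_sensors                  # all sensor blocks needed
    supervisors = 1 - (1 - r_sup) ** 2                   # at least one supervisor
    actuators = (1 - (1 - r_k) * (1 - r_a)) ** n_actuators  # controller or port
    return sensors * supervisors * actuators


r_s, r_sup, r_k, r_a = 0.99, 0.95, 0.95, 0.95  # hypothetical reliabilities

nr_plain = node_reliability(r_s, r_sup, r_k, r_a)
nr_tmr = node_reliability(tmr_reliability(r_s), r_sup, r_k, r_a)

print(f"without TMR: {nr_plain:.4f}, with TMR: {nr_tmr:.4f}")
```

Only the first argument changes between the two calls, mirroring the statement that TMR affects the sensor block alone. Because the sensor term is raised to the 32nd power, even a small improvement per sensor block has a large effect on the overall value.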
Case Study
For the two proposed fault-tolerant architectures, a case study was conducted to quantify overall system reliability compared to a corresponding simplex system. An exponential Time To Failure (TTF) is assumed with time measured in days. Table 4 summarizes the assumed case study parameters. Based on the case study parameters assumed in Table 4 , the CFR and NR for the proposed system will be compared to a corresponding system with no TMR as well as a simplex system.
Control Function Reliability (CFR)
The CFR for the two proposed architectures is shown, and compared to the reliability of a corresponding simplex system with no fault-tolerance, in Figure 4 .
It can be seen that, from a Control Function perspective, there is no difference in overall system reliability between the two proposed fault-tolerant architectures (with and without TMR). However, in both cases, a significant improvement in CFR can be seen compared to the corresponding simplex architecture.
Node Reliability (NR)
Figure 5 illustrates the NR for the three studied system architectures. It can be seen that, when Node Reliability is taken into account, the proposed fault-tolerant architecture with TMR shows significant improvement in reliability compared to that without TMR as well as the simplex architecture. This is due to the large number of sensor nodes (32) utilized across the architecture's two cells. Without TMR, each of these sensor nodes is a single point of failure for the entire system. 
Conclusions
Fault-tolerant design is essential for a robust Networked Control System (NCS) with high reliability and a long operational lifetime. With the increasing complexity of NCSs consisting of a large number of nodes such as sensors, controllers and actuators, the probability of the occurrence of any single failure increases significantly. Without fault-tolerance, the occurrence of a single fault in any one node can lead to the failure of the entire control system, resulting in lengthy downtimes and consequently significant production losses. In this paper, the architecture of a two-cell fault-tolerant NCS is developed on top of both unmodified Fast and Gigabit Switched Ethernet. The proposed architecture models a production line composed of two identical machines, each based on a Sensor-to-Actuator (S2A) control architecture.
Fault-tolerance is first applied at the controller level across both cells. An extra network interface is added to each actuator node in addition to the integrated controller node. In case of failure of the integrated controller, a supervisor node becomes part of the control loop and sends the required control action to the affected actuator's added network card. Under all possible failure scenarios, it was shown that the proposed fault-tolerant architecture fulfils the required control system deadline with zero dropped or over-delayed packets under both Fast and Gigabit Ethernet (both analytically and through OMNeT++ simulations).
Additionally, the fault-tolerance of the proposed architecture was expanded to the level of the sensor nodes through Triple Modular Redundancy (TMR). The application of TMR at the sensor nodes led to a significant increase in the number of control packets transmitted across the proposed architecture, making Fast Ethernet unsuitable for meeting the required control system deadline. It was shown that the proposed fault-tolerant architecture fulfils the required control system deadline with zero dropped or over-delayed packets using Gigabit Ethernet under all possible failure scenarios.
Two reliability modeling methodologies were illustrated to quantify the achievable improvement in lifetime compared to a corresponding simplex architecture with no fault-tolerance: Control Function Reliability (CFR) and Node Reliability (NR). CFR only considers the probability of failure of the integrated controllers and supervisors while NR also takes into account the probability of failure of the sensor Ethernet ports and the added integrated network interfaces. A case study was carried out for a typical industrial system. It was shown that, for both modeling methodologies, the proposed fault-tolerant architectures significantly improve overall system reliability.
