Abstract-The IEC 61499 standard provides means to specify distributed control systems in terms of function blocks. The execution model is event driven (asynchronous), where triggering events may be associated with data (and seen as messages). In this paper we propose a low-complexity implementation technique that allows the end-to-end response time of event chains spanning a set of networked devices to be assessed. We develop a method that provides safe end-to-end response times, taking both intra- and inter-device delivery delays into account. As a use case we study the implementation onto (single-core) ARM-Cortex based devices communicating over a switched Ethernet network. For the analysis we define a generic switch model and an experimental setup allowing us to study the impact of network topology as well as 802.1Q quality of service in a mixed-criticality setting. Our results indicate that safe sub-millisecond end-to-end response times can be obtained using the proposed approach.
I. INTRODUCTION
Traditional SCADA control systems typically imply the use of costly PLCs communicating over field buses, such as legacy Modbus [1] and PROFIBUS [2], and the more recent Ethernet-based POWERLINK [3], EtherCAT [4] and PROFINET [2] buses. In order to establish end-to-end timing behaviour, much attention generally has to be paid to the control system design, partitioning, and deployment. In this paper we take an alternative approach based on the recent IEC 61499 standard [5].
In contrast to the traditional scan-based (time-triggered) PLC behaviour (as implied by the IEC 61131 standard [6]), IEC 61499 adopts an event-triggered (asynchronous) execution model. In the presented work we exploit the benefits of asynchronous execution at both the device and the network layer. In particular, asynchronous execution allows preemptive execution of the function block network at device level and, at network level, asynchronous communication using standard Ethernet frames (with optional 802.1Q quality of service), as implemented in Commercial Off-The-Shelf (COTS) as well as rugged (industrial) Ethernet switches. In order to establish safe end-to-end timing properties of distributed IEC 61499 models, the presented work covers the execution model, requirements for the per-device scheduling, a generic Ethernet switch model, and the impact of general purpose traffic. This allows us to reason about timing properties under different network topologies, link speeds, and quality of service configurations.
Our results indicate that the proposed approach is capable of safe sub-millisecond end-to-end response times even using cost-efficient COTS components. In particular, we show that, assuming that the switches implement predictable 802.1Q quality of service, the impact of non-critical traffic can be bounded and safe response times obtained. This allows for flexible deployment, where the same physical network is shared between the timing-critical control applications and general purpose usage. Our results are safe with respect to the worst case, while further improvements can be obtained if the Maximum Transmission Unit (MTU) of general purpose (best effort) traffic is controlled.
We can conclude that the presented approach reduces the complexity of distribution (in comparison to the traditional SCADA approach), reduces device and network complexity (it requires neither complex time-triggered architectures like scan-based PLCs, nor any dedicated network components/protocols), and is highly flexible (allowing network sharing between real-time and best-effort devices). Moreover, the presented approach opens up the possibility of automating the generation of interfaces (SIFBs) interconnecting FB networks, for which correct-by-construction data delivery can be ensured. Such automation would allow the designer to focus on properties of the application at model level, independent of networking and deployment specifics.
II. BACKGROUND

A. Task and Resource model
The task and resource model adopted for this presentation follows that of the Stack Resource Policy (SRP) [7]. For this presentation we use task instance and task interchangeably. In short, the SRP task and resource model is captured in the following:
• $t_i$ is a task with run-to-completion semantics,
• $p(t_i)$ is the static priority for $t_i$ used for scheduling,
• $e_i$ is an event triggering the execution of the corresponding task $t_i$,
• $r_j$ defines a resource; a task may claim resources in a LIFO nested manner.
SRP provides pre-emptive, deadlock free, single-core scheduling for systems with shared resources (in this context, claiming a resource can be seen as entering a critical section). Given WCETs for the tasks and their inner claims, response time analysis is readily available [7] .
The RTFM-kernel has been developed from the outset on the basis of the SRP task and resource model [8]. The kernel API amounts to C code macros, with predictable and low overhead (typically a few machine instructions on the ARM-Cortex family of processors). It extends the task model with the ability to emit events (hence task chains can be expressed). Under the common assumption that the inter-arrival time of triggering events is larger than the associated task response time, each task can be associated with an interrupt handler, the underlying interrupt hardware can be exploited as a (single unit) event buffer, and critical sections can be enforced through interrupt masking. In this way the actual scheduling is by and large performed by the interrupt hardware.
B. IEC 61499 to RTFM-core Task and Resource model
In previous work, a mapping from device/resource level IEC 61499 models to the RTFM-core task and resource model has been proposed [9]. At the edges of the IEC 61499 resources, Service Interface Blocks (SIFBs) are required for inter-resource communication. The IEC 61499 standard does not cover the implementation of SIFBs. For this work we propose RTFM-core as the means of implementation. This brings the advantage that the complete system (at device level) can be implemented in a uniform manner, allowing the SIFBs to be part of the device level scheduling without introducing any additional/external run-time system.
1) System model example:
a) IEC 61499 System model: Figure 1 depicts an IEC 61499 model developed in the 4DIAC IDE [10]. The SIFB instances Ea1 and Ec1 capture the external events and trigger the execution of the actions associated with a_1.i_1 and c_1.i_1 respectively. Figure 2 depicts the ECC of b1_b3, showing the associated actions taskb1 for input event i1 and taskb3 for input event i3 respectively. In RTFM-core the system is expressed in terms of event-triggered tasks with named critical sections. The external event $e_{a1}$ triggers the execution of $a_1$, which in turn triggers the execution of $b_1$, which in turn triggers $b_2$. For part of their execution, $b_1$ and $b_3$ enter a named critical section $r$ in order to get exclusive access to the shared resource (in this case a data structure). This corresponds to the sequential behaviour of the ECC in b1_b3. Since each link in RTFM-core is treated as a message (i.e., an event with associated data), there is no need for explicit data wiring and multiplexing (F_MUX_2_1), which renders a succinct model.
C. Monolithic single-core deployment
For deployment onto single-core platforms, the rtfm-compiler derives the task set from the input program and performs SRP analysis (i.e., derives the resource ceilings on basis of the per-task resource utilisation). The generated code compiled together with the (target specific) RTFM-kernel primitives render an executable for the deployment. The kernel primitives implement the SRP scheduling policy by exploiting the underlying interrupt hardware. (For the ARM-Cortex family of processors, the scheduling overhead is bound to a few clock cycles [8] .) Given Worst Case Execution Time (WCET) for tasks (and their critical sections) we can deploy available SRP based methods for deriving per task response times [7] .
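The ceiling derivation mentioned above can be illustrated with a minimal sketch. It assumes the usual SRP rule for static priorities, where the ceiling of a resource is the highest priority among the tasks that claim it. The function name `resource_ceilings` and the priority values are hypothetical and are not taken from the rtfm-compiler; only the fact that $b_1$ and $b_3$ share the resource $r$ comes from the example system.

```python
from typing import Dict, Set


def resource_ceilings(priority: Dict[str, int],
                      uses: Dict[str, Set[str]]) -> Dict[str, int]:
    """Ceiling of a resource = highest static priority among tasks claiming it."""
    ceilings: Dict[str, int] = {}
    for task, resources in uses.items():
        for r in resources:
            ceilings[r] = max(ceilings.get(r, 0), priority[task])
    return ceilings


# Hypothetical static priorities (higher number = more urgent).
priority = {"a1": 3, "b1": 2, "b2": 1, "b3": 4}
# Per-task resource utilisation: b1 and b3 claim the shared resource r.
uses = {"a1": set(), "b1": {"r"}, "b2": set(), "b3": {"r"}}

ceilings = resource_ceilings(priority, uses)
print(ceilings)  # {'r': 4}
```

With these assumed priorities, while $b_1$ holds $r$ the system ceiling is 4, so only tasks with priority above 4 may preempt; this is what makes the blocking time of $b_3$ bounded by a single critical section.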
1) Monolithic response times:
The end-to-end response times for a task chain can be safely estimated by accumulating the response times along the chain. E.g., for the example system the response time of the task chain e a1 → a 1 
D. Ethernet technology
Ethernet based buses are emerging in industrial applications as a cost and performance efficient alternative to proprietary field buses.
In the following we briefly review a set of common topologies. (Messages in the figures relate to the distribution example introduced later, in Section III.) Figure 4 depicts point-to-point topologies. In legacy systems, a single coaxial cable was used (left in Figure 4), with a collision detection and resend mechanism for which bounded transmission times are hard or impossible to guarantee. Alternatively, collision-free communication is possible by using multiple interfaces and dedicated (point-to-point) cabling; however, such solutions are costly and bulky (right in Figure 4).
A switched (full duplex) star (single layer) network mapping of the system is given in Figure 9. Figure 11 depicts the scenario of a layered network, with local switches $S_1$ and $S_2$ and a backbone switch $S_3$. In this case the network is shared between real-time traffic and best effort traffic.
1) Ethernet framing:
An Ethernet frame (Figure 5, top) has a payload of 46 to 1500 bytes (octets). An 802.1Q (VLAN) frame (Figure 5, bottom) has a payload of 42 to 1500 bytes. In both cases this yields a minimum frame size of 64 bytes (including the header and CRC/FCS). The large minimum frame size is a legacy from non-switched (shared media) networks, for which the collision detection (CSMA/CD) mechanism depends on the propagation delay of the signal.
Including the Preamble (8 bytes) and the Inter Frame Gap (12 bytes), the total size of a minimal frame on the wire is 84 bytes. The transmission time for a complete minimal frame over a link is hence 84 * 8/ls (where ls is the link speed), amounting to 67.2/6.72/0.672μs over 10Mbit/s, 100Mbit/s and 1Gbit/s links respectively. This gives the minimum period for transmissions over a link (and hence the maximum blocking time incurred by the transmission of a minimal sized packet). Considering only packet delivery, we can exclude the Inter Frame Gap (12 bytes), giving 72 * 8/ls, i.e., 57.6/5.76/0.576μs over 10Mbit/s, 100Mbit/s and 1Gbit/s links respectively.
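The arithmetic above can be reproduced with a small sketch. The helper name `tx_time_us` and the constant names are illustrative choices; the byte counts (8 bytes preamble, 12 bytes Inter Frame Gap, 64 bytes minimal frame) are those stated in the text.

```python
PREAMBLE_BYTES = 8     # preamble, as counted in the text
IFG_BYTES = 12         # Inter Frame Gap
MIN_FRAME_BYTES = 64   # minimal frame including header and CRC/FCS


def tx_time_us(frame_bytes: int, link_bps: float, include_ifg: bool = True) -> float:
    """Time in microseconds that a frame occupies a link of the given speed."""
    wire_bytes = frame_bytes + PREAMBLE_BYTES + (IFG_BYTES if include_ifg else 0)
    return wire_bytes * 8 / link_bps * 1e6


for name, bps in (("10 Mbit/s", 10e6), ("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9)):
    period = tx_time_us(MIN_FRAME_BYTES, bps)                        # with IFG
    delivery = tx_time_us(MIN_FRAME_BYTES, bps, include_ifg=False)   # without IFG
    print(f"{name}: period {period:.3f} us, delivery {delivery:.3f} us")
```

The first value per link is the minimum transmission period (84 bytes on the wire); the second is the delivery time of the frame itself (72 bytes).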
E. Ethernet based field buses
The term Industrial Ethernet (IE) is commonly used, referring to rugged components relying on Ethernet technology.
A gamut of Ethernet-based field buses (such as POWERLINK [3] and EtherCAT [4]) has been introduced; for an overview see e.g., [11]. They have in common a focus on predictable timing, hence they are collectively referred to as Real-Time Ethernet.
III. DISTRIBUTED SYSTEMS IN RTFM-CORE
In the following we discuss distribution aspects of RTFM-core models from a generic perspective, allowing RTFM-core models to be derived from corresponding IEC 61499 models, written directly in the RTFM-core language [12], or any mix thereof.
A. Distributed single-core deployment
In a distributed setting, the task set (and the associated resources) are partitioned onto a set of devices. For this presentation we make the assumption that resources cannot be shared among tasks executing on different devices.
B. End-to-End Response Time, a holistic approach
A safe end-to-end response time estimate of a task chain spanning multiple devices can be computed by the accumulated sum of intra-device response time and the inter-device delivery times.
As a use case, we study Switched Ethernet, the impact of network topologies, 802.1Q quality of service configurations and the sharing of physical links between real-time and best effort domains. For the intra-device response time we rely on the previous discussion (Section II-C).
C. Holistic device level response times
In our setting we consider device drivers as part of the application at device level, which allows us to assess intra-device response times in a holistic manner. The ARM-Cortex M3 based LPC1769 micro-controller provides an on-chip Ethernet MAC (EMAC) with DMA functionality, triggering the event $e_{aRX}$ when a buffer has been received. The LPC1769 bus matrix allows the EMAC DMA to read and write packet data using a dedicated SRAM, so that interference is isolated to reading out/setting up the packet payload. For task $a_{RX}$ and function $fa_{TX}$ this amounts to a few shared memory accesses, for which the worst-case latency can be deduced from the data sheets. Under the assumption that messages will be consumed before reoccurring, a single (non-protected) buffer for each unique message is sufficient (see figure). However, since the function $fa_{TX}$ can be executed (preemptively) on behalf of both tasks $a_1$ and $a_2$, its operations on the EMAC must be protected in a critical section (marked blue).
The EMAC transmit and receive functionality operates on circular descriptor tables using producer and consumer indexes.
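To illustrate the producer/consumer indexing, the following is a minimal software model of a circular descriptor table. It is a sketch only: the class `DescriptorRing` and its method names are hypothetical, and the convention that one slot is left unused to distinguish a full table from an empty one is an assumption of this sketch, not a statement about the EMAC register interface.

```python
class DescriptorRing:
    """Circular descriptor table with separate producer and consumer indexes."""

    def __init__(self, size: int) -> None:
        if size < 2:
            raise ValueError("ring needs at least two descriptors")
        self.size = size
        self.produce = 0  # next descriptor the producer will fill
        self.consume = 0  # next descriptor the consumer will process

    def is_empty(self) -> bool:
        return self.produce == self.consume

    def is_full(self) -> bool:
        # Assumed convention: one slot stays unused so full != empty.
        return (self.produce + 1) % self.size == self.consume

    def allocate(self) -> int:
        """Producer side: claim the next descriptor (round robin)."""
        if self.is_full():
            raise BufferError("no free descriptor")
        index = self.produce
        self.produce = (self.produce + 1) % self.size
        return index

    def release(self) -> int:
        """Consumer side: retire the oldest outstanding descriptor."""
        if self.is_empty():
            raise BufferError("nothing to consume")
        index = self.consume
        self.consume = (self.consume + 1) % self.size
        return index


ring = DescriptorRing(4)
slots = [ring.allocate() for _ in range(3)]  # descriptors 0, 1, 2; ring now full
```

For transmission the software acts as producer and the hardware as consumer; for reception the roles are reversed. Because only one side writes each index, the index updates themselves need no locking; the critical section discussed above protects the multi-step producer operation when it can be preempted by another caller.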
a) Transmit:
The operation of $fa_{TX}$ amounts to writing the message payload into the dedicated send buffer (in the shared SRAM), allocating a descriptor (round robin 
Notice that tasks $a_1$ and $a_2$ include the execution of the queuing function $fa_{TX}$. For device B we have the following:
For the scheduling analysis, we have the task instances b 
D. Link Layer Model
In order to assess the impact of networking we first study a simple (directly connected) point-to-point topology that allows us to isolate the link layer, Figure 4 (right). For the discussion we consider only nodes A and B (hence for each device a single network interface suffices).
1) Output queueing:
The actual scheduling of the output link is performed by the function $fa_{TX}$. For our implementation we deploy a simple FIFO mechanism, where packets are queued in order of arrival at the function $fa_{TX}$. That is, in the worst case, a message will be blocked/interfered with by all other outgoing messages. Assume that we have $n$ outgoing messages, and $tr(m_j)$ is the transmission time for a message $m_j$, then the blocking time is given as:
The total time for delivery includes the transmission time of $m_i$, giving us a total delivery time (or network delay) over a link:
In the case that all messages fit the minimum Ethernet payload (46/42 bytes for Ethernet II/802.1Q respectively), we have a common $d(m) = n \cdot tr(m)$, which as seen in Section II-D amounts to $n$ * 67.2/6.72/0.672μs over a 10Mbit/s, 100Mbit/s and 1Gbit/s link respectively.
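The FIFO reasoning above can be sketched as follows, under the reading that a message $m_i$ is blocked by every other outgoing message, i.e. a blocking time of $\sum_{j \neq i} tr(m_j)$ and a delivery time $d(m_i)$ equal to that blocking time plus $tr(m_i)$. The function names `fifo_blocking` and `fifo_delivery` are hypothetical, and the four-message scenario is an illustrative assumption.

```python
from typing import Sequence


def fifo_blocking(tr: Sequence[float], i: int) -> float:
    """Worst-case blocking of message i: all other outgoing messages go first."""
    return sum(t for j, t in enumerate(tr) if j != i)


def fifo_delivery(tr: Sequence[float], i: int) -> float:
    """Worst-case delivery time d(m_i): blocking plus own transmission time."""
    return fifo_blocking(tr, i) + tr[i]


# Assumed scenario: n = 4 minimal frames on a 100 Mbit/s link (6.72 us each).
tr_us = [6.72] * 4
print(f"blocking {fifo_blocking(tr_us, 0):.2f} us, "
      f"delivery {fifo_delivery(tr_us, 0):.2f} us")
```

For equally sized messages the delivery time reduces to $n \cdot tr(m)$, which is 26.88μs for the four minimal frames assumed here.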
One could think of improvements to the above queuing mechanism, either by allowing non standard framing (reducing the transmission time), or by priority queuing.
2) Input queueing:
On the receiver side, packets arrive in the order defined by the output queueing discussed above. The propagation delay over the link can be neglected for most practical cases.
E. Response time in a point-to-point network
We can now derive a safe response time for task chains spanning multiple devices over a point-to-point Ethernet network.
Let $T$ be a global set of tasks, and $|T|$ the size of the set; let $M$ be a global set of messages, and $|M|$ its size. Let $TC(e)$ be the set of tasks triggered by the event $e$ (i.e., tasks along the task chain headed by the event $e$). Let $MC(e)$ be the set of messages along the task chain headed by the event $e$.
For a triggering event e passing the devices D and the links L we have:
$D_{rp}(e)$ being the accumulated response times over all devices, $L_{rp}(e)$ the accumulated link delays and $E_{rp}(e)$ the total end-to-end response time (delay).
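One way to read these definitions is $D_{rp}(e) = \sum_{t \in TC(e)} rp(t)$, $L_{rp}(e) = \sum_{m \in MC(e)} d(m)$ and $E_{rp}(e) = D_{rp}(e) + L_{rp}(e)$. The sketch below accumulates the three quantities under that reading. The function name `end_to_end`, the message name `m1`, and all numeric values are hypothetical placeholders, not measured or derived figures.

```python
from typing import Dict, Sequence, Tuple


def end_to_end(tc: Sequence[str], mc: Sequence[str],
               rp: Dict[str, float], d: Dict[str, float]) -> Tuple[float, float, float]:
    """Return (D_rp, L_rp, E_rp) for one task chain.

    tc -- tasks along the chain, TC(e)
    mc -- messages along the chain, MC(e)
    rp -- per-task response times (device level)
    d  -- per-message delivery times (link level)
    """
    d_rp = sum(rp[t] for t in tc)   # accumulated device response times
    l_rp = sum(d[m] for m in mc)    # accumulated link delays
    return d_rp, l_rp, d_rp + l_rp


# Placeholder values in microseconds, for illustration only.
rp = {"a1": 5.0, "b1": 7.0, "b2": 4.0}
d = {"m1": 13.5}

D, L, E = end_to_end(["a1", "b1", "b2"], ["m1"], rp, d)
print(D, L, E)  # 16.0 13.5 29.5
```

The estimate is safe as long as each `rp` and `d` entry is itself a safe upper bound; it is pessimistic in that it assumes the worst case occurs at every hop simultaneously.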
Taking the running example, we have the chain:
Given the assumption that the inter-arrival time of $e$ is larger than $E_{rp}(e)$, we can exclude the blocking and interference by tasks $t_j$ along the same chain ($t_j \in TC(e)$) from the computation of $rp(t_i)$ (at device level). The same reasoning goes for the link delays, where messages belonging to the same chain cannot block each other.
For the given example the result would be the same. However, in cases where the task chain traverses the same nodes (and links) multiple times, a tighter, yet safe, estimate can be obtained.
F. Switch Model
In order to derive the response time over a switched network, we first have to define a model of the switch at hand. We assume that switches are non-blocking at link speed, and able to store and forward all frames as long as the output bandwidth (for each port) is sufficient. Figure 10 gives an architecture outline: incoming frames are buffered and inserted (by reference) into the corresponding output queue (or queues in the case of multi-cast). The frame (if any) referenced at the head of each output queue is transmitted from the input queue, through the switching fabric, to the output port.
The output queuing mechanism may also vary: FIFO queuing, i.e., merging streams in order of arrival; or 802.1Q quality of service, i.e., merging streams according to QoS/PCP priority. (Actual implementations may feature extended QoS, based on VLAN priority groups, and/or stream management using double tagging inside service provider networks.)
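The difference between the two output queuing mechanisms can be illustrated with a small model of a single egress port. This is a sketch under simplifying assumptions: the class `EgressPort` is hypothetical, strict priority by PCP value is assumed for the 802.1Q case, and frames are only reordered while waiting (a frame already in transmission is never preempted, which the model does not represent).

```python
import heapq
from itertools import count
from typing import List, Tuple


class EgressPort:
    """Output queue of one switch port: FIFO or strict priority by PCP."""

    def __init__(self, use_qos: bool) -> None:
        self.use_qos = use_qos
        self._heap: List[Tuple[int, int, str]] = []
        self._seq = count()  # arrival order, breaks ties within a priority

    def enqueue(self, frame: str, pcp: int = 0) -> None:
        # Higher PCP means higher priority; heapq pops the smallest key.
        key = -pcp if self.use_qos else 0
        heapq.heappush(self._heap, (key, next(self._seq), frame))

    def dequeue(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def drain(port: EgressPort) -> List[str]:
    """Transmit everything currently queued, in scheduling order."""
    out = []
    while len(port):
        out.append(port.dequeue())
    return out


arrivals = [("be1", 0), ("rt", 7), ("be2", 0)]
fifo, qos = EgressPort(use_qos=False), EgressPort(use_qos=True)
for frame, pcp in arrivals:
    fifo.enqueue(frame, pcp)
    qos.enqueue(frame, pcp)

fifo_order = drain(fifo)  # ['be1', 'rt', 'be2']
qos_order = drain(qos)    # ['rt', 'be1', 'be2']
```

Under FIFO the real-time frame waits behind every earlier best-effort frame, whereas under strict priority it only has to wait for a frame whose transmission has already started.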
G. Response time in a switched star network
Taking the outset of our example, Figure 6 , and the topology Figure 9 , we take a closer look at the end-to-end response time for the task chains headed by e c1 .
For chain (a), we find that the messages $m_2$ and $m_4$ pass two links (indexes above show the order of delivery). Each link delay can be derived analogously to Section III-D, taking into account the queueing characteristics and scheduling overhead of the switch as discussed in the previous section. The end-to-end response time can then be derived analogously to Section III-E.
H. Switched tree model
Figure 11 depicts a tree network topology with edge switches $S_1$ and $S_2$ and a backbone switch $S_3$. The ports connecting to the backbone switch are so-called trunks, merging traffic from different streams. In this case the links to/from $S_3$ convey mixed traffic from the real-time and best effort domains, exposing the messages m 
I. Response time in a switched tree network
Taking the outset of our running example, we take a closer look at the end-to-end response time for the task chains headed by e c1 . The communication for (a) is depicted in Figure 11 , indexes above show the order of delivery:
The response time can be derived analogously to Section III-G, however we will need to take into account blocking and queuing overhead implied by best-effort traffic for the backbone (trunk) links.
IV. EXPERIMENTS
A. Experimental Setup
1) Devices:
Experiments have been conducted on the Embedded Artists (EA) LPCXpresso1769 development board running at 96MHz. The board features a LAN8720 PHY, connected to a 10/100 Base-Transformer (RJ45 jack) featured on the EA base board. (100Mbit/s per default for the experiments.)
2) Switches:
The experiments were conducted using a set of modern Cisco Catalyst 2960 S switches (running OS version 12.2) [13]. The switch provides 4 output (egress) queues, where queue 1 has priority (expedited unless empty); the other queues are expedited to share or shape the remaining bandwidth under configurable Weighted Tail Drop (WTD) parameters.
3) Basic Setup:
Figure 12 depicts the generic experimental setup, allowing us to observe device/link layer and switching overhead, as well as implications of network topology and interference by best effort traffic. An initial message (tagged with an index of 1) is produced by $a_1$. Upon receiving a correct packet (w.r.t. destination and index), device A generates a new message to device B with an incremented index, while device B simply forwards any incoming packet (as received) to device A. Task $a_2$ toggles a digital output on each invocation (allowing external monitoring of the round-trip time). The minimal Ethernet frame size (60 bytes + CRC/FCS) was deployed, out of which only 16 bytes (including MAC, EtherType/size, and index) were actually used.
4) 802.1Q Setup:
For the experiments involving 802.1Q, frames from the real-time domain were tagged with the EtherType 0x8100, PCP/CoS 7 (network), DEI 0 (non-drop), and VLAN 2, giving a frame size of 64 bytes (according to the Cisco interpretation of minimal frame size).
The best effort traffic was generated (and received) on ordinary PCs (devices C and D) using the tools Ostinato 0.6 (for packet generation) and Wireshark (for packet analysis).
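As an illustration of the tagging used for the real-time frames, the following sketch packs the 4-byte 802.1Q tag: the 16-bit TPID 0x8100 followed by the 16-bit TCI, which holds PCP (3 bits), DEI (1 bit) and VLAN ID (12 bits). The helper name `vlan_tag` is hypothetical; the field values are those stated for the experiments.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value identifying an 802.1Q tagged frame


def vlan_tag(pcp: int, dei: int, vid: int) -> bytes:
    """Build the 4-byte 802.1Q tag (TPID + TCI) in network byte order."""
    if not (0 <= pcp <= 7 and dei in (0, 1) and 0 <= vid <= 0xFFF):
        raise ValueError("field out of range")
    tci = (pcp << 13) | (dei << 12) | vid
    return struct.pack("!HH", TPID_8021Q, tci)


# Real-time domain settings from the experiments: PCP 7, DEI 0, VLAN 2.
tag = vlan_tag(pcp=7, dei=0, vid=2)
print(tag.hex())  # 8100e002
```

In a tagged frame these four bytes are inserted between the source MAC address and the original EtherType/size field.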
B. Experiment 1: Point-to-point network
For this experiment, devices A and B were directly connected. This shows the overhead of the link layer transmission time, the EMAC hardware and the device driver. The round-trip time, covering A (receiving a frame, extracting the payload, verifying the index, sending a frame) and B (receiving a frame, extracting the payload, sending a frame with the same payload), was measured at 33μs (with a jitter of < 1μs). This experiment confirms the low link layer software and hardware overhead, indicating end-to-end response times in the range of 15μs.
C. Switched star network
For this experiment, devices A and B were connected over a (single) switch. This experiment focuses on the real-time domain (and isolates the effect of the switch on real-time traffic). No additional devices were connected. Again the round-trip time was measured with the logic analyser, yielding 55μs (with a measured jitter of < 1μs).
D. Switched single-layer network with best effort traffic
For this experiment, real-time devices (A and B) and best effort devices (D and E) were connected over a (single) switch. Real-time and best-effort traffic were separated by port (i.e., all real-time traffic is transmitted between devices A and B, while all best effort traffic is transmitted between devices D and E). For the experiment an average of 5000 packets/s (MTU 1518 bytes) was sent between devices D and E (emulating a generic and heavy IP traffic load). To stress buffering effects, packets were sent in bursts of 10 at a rate of 500 bursts/s. The measured round-trip time was still 55μs (with a measured jitter of < 1μs).
E. Switched tree network
For these experiments we tunnelled all traffic over a common backbone (trunk) with the link speed set to 100Mbit/s for Ex4, ..., Ex6 and 1Gbit/s for Ex7. Two edge switches were used, connecting (A, D) and (B, E) respectively. In the first experiment (Ex4) only real-time traffic was emitted. A round-trip delay of 103μs with a jitter of < 1μs was observed.
The second experiment (Ex5) contains both real-time and best-effort streams. The observed best case was 103μs, while the observed worst case of 340μs was below the theoretical bound $103 + 2 \cdot tr(MTU) = 103 + 2 \cdot 122 = 347$μs.
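The bound can be recomputed with a short sketch. Two interpretations are assumed here and are not stated explicitly in the text: $tr(MTU)$ is taken as the delivery time of a 1518 byte frame plus the 8 byte preamble (excluding the Inter Frame Gap), which matches the quoted 122μs at 100Mbit/s; and the factor 2 is read as one maximal best-effort frame blocking the real-time frame on the trunk in each direction of the round trip. The function names are hypothetical.

```python
PREAMBLE_BYTES = 8  # preamble, as counted in Section II-D


def tr_us(frame_bytes: int, link_bps: float) -> float:
    """Delivery time of one frame in microseconds (Inter Frame Gap excluded)."""
    return (frame_bytes + PREAMBLE_BYTES) * 8 / link_bps * 1e6


def roundtrip_bound_us(base_us: float, be_frame_bytes: int,
                       trunk_bps: float, trunk_traversals: int = 2) -> float:
    """Undisturbed round trip plus one blocking best-effort frame per trunk traversal."""
    return base_us + trunk_traversals * tr_us(be_frame_bytes, trunk_bps)


# Ex5: 103 us undisturbed round trip, 1518 byte best-effort frames, 100 Mbit/s trunk.
bound = roundtrip_bound_us(103.0, 1518, 100e6)
print(f"{bound:.2f} us")  # 347.16 us
```

Without rounding $tr(MTU)$ to 122μs the bound is about 347.2μs. Under the same assumptions the blocking term alone shrinks to roughly 11.5μs for 64 byte best-effort frames at 100Mbit/s and to roughly 24.4μs for 1518 byte frames on a 1Gbit/s trunk.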
In the third experiment (Ex6) we reduce the best effort MTU to 64 bytes. Results show that the overhead of best effort traffic can be efficiently mitigated by controlling the MTU.
In the final experiment (Ex7) we observe the effect of a Gigabit backbone. This indicates that even without controlling the MTU of best effort traffic, Gigabit backbone efficiently mitigates the blocking overhead.
V. EVALUATION AND DISCUSSION
Guaranteed real-time services over standard Ethernet have been studied from a modelling and simulation perspective (e.g., in [14]). The response time of IEC 61499 applications has been studied at device level, e.g., [15], and over switched Ethernet, e.g., [16]. Our models extend [14], [15], [16] and our experiments indicate their validity. The problem of shared output queuing to QoS has been identified and discussed in [17]. In our case we need only two priority classes supported by the switch in order to provide real-time guarantees.
Our results are safe with respect to the worst case, assessing the end-to-end response in a holistic manner including the overhead of receive/transmit hardware and software. Our experiments show that the adverse effects of best effort traffic can be mitigated by controlling the MTU and/or by using a faster backbone.
In cases where absolute performance requirements cannot be met by the presented approach, specialised solutions and technologies (e.g., EtherCAT) may be combined with the presented work, making it possible to further reduce end-to-end response times, with (potentially) tighter bounds in the analysis.
A target for future work is to further automate the design process for distribution, allowing networking components to be generated directly from 61499/RTFM models, and to integrate the proposed methods for analysis in the RTFM-4-FUN tool suite. In order to ensure correctness of the presented analysis and future automation features, rigorous formalisation and extraction of certified code is possible through, e.g., Coq [18]. Another area of interest is deriving a generic test bed, allowing fast and accurate characterisation of switch configurations and facilitating deployment into real-life environments.
VI. CONCLUSIONS
In this paper we have discussed end-to-end deadlines for distributed IEC 61499 applications communicating over switched Ethernet networks. The end-to-end analysis takes device level response time, network link layer transmission times, network topology as well as switch specifics (up to layer 2) into consideration. For the analysis and implementation, the system is translated into concurrent tasks and resources (critical sections) in the RTFM-core language. The static nature of the RTFM-core model allows device level response times to be assessed using available schedulability analysis. For each link the number of (outstanding) messages in the system is precisely bounded, thus allowing the blocking and queuing time to be safely estimated. To assess the impact of switches in the network a generic model is presented. Our experiments, conducted on commercially available micro-controllers and Ethernet switches, indicate that safe end-to-end response times well below 1ms are feasible, even over tree network topologies and in the presence of best-effort traffic.
