Abstract-Latest
I. INTRODUCTION

The next generation of Fusion devices envisages plasma scenarios at the scale of a future power plant. The mission is not only to demonstrate the feasibility of Fusion power but also to ensure that there will be no negative impact on equipment, people, environment or investment. Diagnostics are an essential part of a Nuclear Fusion reactor's operation, providing an extensive set of measurements of plasma behaviour. Diagnostics for machine protection and plasma control are mission-critical and will consequently face stringent reliability and availability demands, which extend to the associated control and data acquisition (C&DA) systems [1]. Availability studies for the next generation devices have been performed since the first years of their conception. A first approximation was to target, for Fusion reactors, the same inherent availability level as a Fission reactor (70% to 80%) [2]. A more recent RAMI study for ITER indicates a figure of 60% [3]. However, the breakdown of this value across ITER's multiple functionalities, tasks and subsystems reveals a figure of 99% for C&DA systems, setting new challenges for the instrumentation architecture involved in terms of high availability (HA). The present work describes the development of a modular C&DA system, built on the Advanced Telecommunication Computing Architecture (ATCA) standard [4], using PCI Express (PCIe) over the ATCA backplane Fabric Channel Interface, and connecting with an external PCIe host computer. The system architecture takes advantage of both ATCA and PCIe resources for HA, namely the ATCA Hot Swap and PCIe Hot Plug mechanisms, which, combined with redundancy, contribute to improving system reliability and availability [5].
II. ATCA FEATURES FOR HIGH AVAILABILITY
The ATCA form-factor offers generous layout area, power per slot and data throughput, including sub-specifications for the PCIe, Serial RapidIO and Ethernet transmission protocols. Most importantly for the present subject, it provides resources to achieve high availability (HA), featuring redundant power modules, fans, shelf manager (ShM) units and backplane topologies. The ShM is responsible for handling these resources in order to implement fault-tolerant operation and achieve HA: it monitors and controls the health, cooling and power of the ATCA infrastructure, and interfaces with the overall system manager controller.
The concept of Field Replaceable Unit (FRU) is essential to the ATCA architecture, meaning that each hardware component of the shelf can be replaced, adding modularity to the system. Furthermore, FRUs may be replaced (inserted or extracted) without powering off the shelf, maintaining the desired level of service and increasing system availability. This protection mechanism is named Hot Swap. Fig. 1 represents a typical 14-slot ATCA shelf. Backplane connections may establish several network topologies (Star, Dual-star, and Full-mesh), depending on the shelf model, which allows redundant paths to be implemented for data and timing signals. In this case, the configuration is a dual-star, where logical slots 1 and 2 are hub slots. Once installed, hub blades will provide point to point links (stars S1 and S2)
0018-9499 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
III. SYSTEM ARCHITECTURE
The ATCA C&DA system is composed of two types of hardware modules. The node slots of the shelf hold digitizing units (ATCA-IOP) [6]. These modules may be configured either as ADC or DAC units, for a total of 48 input/output (IO) channels per board, with galvanic isolation up to 700 V. Digitized data are processed in a Virtex-6 Field Programmable Gate Array (FPGA) [7] and made available to the backplane PCIe network by a dual PCIe endpoint, also implemented in the FPGA. The hub slots are filled with ATCA-PTSW-AMC4 PCIe switch blades, which handle data from the node endpoints (ATCA-IOP) and interface with an external host computer [8]. Both types of modules contain a CoreIPM OPMA2368 IPMC for local hardware management, including Hot Swap support [9]. This system was primarily developed to be used in dual-star topology, hence the dual PCIe endpoint per acquisition channel on the digitizing blades. A variety of redundancy schemes may be implemented within the dual-star setup by using redundant blades, assisted by the ATCA Hot Swap mechanism.
A. 2N Redundancy of all Blades
This configuration, depicted in Fig. 2, uses redundancy for every node and hub blade (and corresponding host computer). The shelf is split into two identical halves (two stars), typically operating in active-standby mode. If any active hardware module fails, the system manager controller toggles the active and standby roles of each half, allowing the malfunctioning board to be replaced without loss of service. 2N redundancy does not require any signal cabling changes for failover operation since each analogue signal cable is routed to each redundant pair of digitizer channels. This option has a higher cost per channel since every channel is duplicated. However, it ensures the highest level of availability.
B. N + M Redundancy of Node Blades
This configuration is shown in Fig. 3. It uses the same dual-star with dual hubs and host computers, but there are now only M spare (standby) node blades for N active node blades (M < N). Because there are more active than spare node blades, the fail-over process after a node failure may require moving the cabling from one node blade to another. This may lower the availability since the analogue cabling has to be re-routed manually. A hub blade failure is handled as before, since there are still two (redundant) hubs. N = M yields a 2N redundancy configuration, as in the preceding section, while M = 0 means no redundancy of nodes, which is the lowest cost per active channel. The N + M scheme (M < N) is a compromise solution that lowers cost at the expense of node availability.
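The cost versus availability trade-off of the N + M scheme can be illustrated with a simple k-out-of-n model. The following is a minimal sketch, not a result from this work: it assumes that node blades fail independently, that any spare can replace any active blade, and that fail-over (including manual re-cabling) is instantaneous. The function name and the per-blade availability values are hypothetical.

```python
from math import comb


def node_pool_availability(n_active: int, m_spare: int, blade_availability: float) -> float:
    """Probability that at least n_active of the n_active + m_spare blades are up.

    Assumptions (illustrative only): independent blade failures and
    instantaneous fail-over, so manual re-cabling time is ignored.
    """
    total = n_active + m_spare
    a = blade_availability
    # Sum the binomial probabilities of having k working blades, k >= n_active.
    return sum(
        comb(total, k) * a**k * (1.0 - a) ** (total - k)
        for k in range(n_active, total + 1)
    )


if __name__ == "__main__":
    # Hypothetical per-blade availability of 0.99 and 10 active blades.
    for m in (0, 1, 2):
        print(m, node_pool_availability(10, m, 0.99))
```

With M = 0 the expression reduces to the product of the N individual blade availabilities, which is the no-redundancy case described above. In practice the manual re-cabling time would reduce the benefit of the spares relative to this idealized figure.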
IV. ATCA HOT SWAP
The insertion and extraction of blades is supported by the Hot Swap features of the ATCA specification. Hot Swap is a protection mechanism that turns the payload power network of a given slot on or off, allowing its FRU to be safely inserted or extracted using the board's pair of Handles. The lower Handle activates a Handle Switch (HS), which is the Hot Swap sensor that indicates to the IPMC whether the Handle is open or closed. The IPMC negotiates with the ShMC for permission to carry out the requested FRU activation/deactivation, according to a state machine triggered by Hot Swap events, using the IPMI protocol. During this procedure, the IPMC informs the operator of the Hot Swap FRU state through the blue LED on the FRU front panel, using specified blinking patterns, which is especially important for signalling whether the board is ready to be safely extracted. This is illustrated in Fig. 4.
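The Hot Swap state machine can be sketched as a small transition table. This is a simplified illustration, not the IPMC firmware: the state names follow the M0 to M6 FRU states commonly described for the ATCA base specification (the communication-lost state M7 is omitted), and the event names are hypothetical labels chosen for readability.

```python
from enum import Enum


class FruState(Enum):
    M0_NOT_INSTALLED = 0
    M1_INACTIVE = 1
    M2_ACTIVATION_REQUEST = 2
    M3_ACTIVATION_IN_PROGRESS = 3
    M4_ACTIVE = 4
    M5_DEACTIVATION_REQUEST = 5
    M6_DEACTIVATION_IN_PROGRESS = 6


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (FruState.M0_NOT_INSTALLED, "blade_inserted"): FruState.M1_INACTIVE,
    (FruState.M1_INACTIVE, "handle_closed"): FruState.M2_ACTIVATION_REQUEST,
    (FruState.M2_ACTIVATION_REQUEST, "shm_grants_activation"): FruState.M3_ACTIVATION_IN_PROGRESS,
    (FruState.M3_ACTIVATION_IN_PROGRESS, "payload_powered_on"): FruState.M4_ACTIVE,
    (FruState.M4_ACTIVE, "handle_opened"): FruState.M5_DEACTIVATION_REQUEST,
    (FruState.M5_DEACTIVATION_REQUEST, "shm_grants_deactivation"): FruState.M6_DEACTIVATION_IN_PROGRESS,
    (FruState.M6_DEACTIVATION_IN_PROGRESS, "payload_powered_off"): FruState.M1_INACTIVE,
    (FruState.M1_INACTIVE, "blade_extracted"): FruState.M0_NOT_INSTALLED,
}


def step(state: FruState, event: str) -> FruState:
    """Return the next state; events not valid in the current state are ignored."""
    return TRANSITIONS.get((state, event), state)


def run(state: FruState, events) -> FruState:
    """Apply a sequence of events in order and return the final state."""
    for event in events:
        state = step(state, event)
    return state


def safe_to_extract(state: FruState) -> bool:
    """In this sketch a blade may be pulled only once its payload is unpowered."""
    return state is FruState.M1_INACTIVE


EXTRACTION_EVENTS = ["handle_opened", "shm_grants_deactivation", "payload_powered_off"]

if __name__ == "__main__":
    final = run(FruState.M4_ACTIVE, EXTRACTION_EVENTS)
    print(final.name, safe_to_extract(final))
```

Opening the Handle alone only moves the FRU to a deactivation request; extraction becomes safe after the ShMC grants the request and the payload power is removed, which is the condition the blue LED signals to the operator.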
V. PCI EXPRESS HOT SWAP/HOT PLUG
ATCA Hot Swap provides insertion/extraction of blades at the hardware level. However, at the host computer level, the operating system (OS) needs to be aware of these events in order to correctly insert/remove the corresponding PCIe devices and to open or close the related software applications. Neither the ATCA base specification nor its extension "PCI Express Advanced Switching for ATCA" describes such a direct relationship between ATCA Hot Swap and PCIe [10]. The PCIe base specification states that it supports "Hot Plug and Hot Swap solutions" [11], yet further reading reveals more information regarding Hot Plug as "hot-add and hot-removal of adapters", while the term "Hot Swap" is no longer mentioned. The distinction between PCIe "Hot Plug" and "Hot Swap" is made by another author, who states that PCIe Hot Plug is a standard "derived from revision 1.0 of the Standard Hot Plug Controller specification" for a "graceful" or "no-unexpected" methodology, whereas PCIe Hot Swap does not have any standard and devices may be added or removed "without special consideration" [12]. Because ATCA Hot Swap is itself a "graceful" method (there is a warning notification to prepare the hardware for insertion/extraction of FRUs), it seems adequate to establish a relationship with the standard PCIe Hot Plug.
A. PCI Express Hot Plug
As mentioned previously, PCIe Hot Plug supports the hot-add/removal of adapters, meaning the standard PCIe form-factors, and defines a set of Hot Plug elements and their respective behaviours. Other form-factors must define which of these elements are to be implemented, and how. The ATCA form-factor specification does not, as yet, define such an implementation. The solution found was to establish a customized relationship between the native ATCA Hot Swap resources, the standard PCIe Hot Plug elements and the local hardware PCIe Hot Plug support. For the current architecture, the hardware for PCIe switching is based on the PLX PEX 8696 device [13], present in the ATCA-PTSW-AMC4 hubs, which features PCIe Hot Plug support. Mechanical and sensor elements are provided by the ATCA Hot Swap mechanism (Handles, Handle Switch and IPMC). Table I shows which PCIe Hot Plug elements could be fulfilled by the current architecture. The Attention Button and the Electromechanical Interlock do not have a direct equivalent on the ATCA form-factor.
Hot Plug elements generate Hot Plug events, supported by the associated Downstream Port. For the current architecture, these are the Downstream Ports of the PEX 8696 of the ATCA-PTSW-AMC4 hubs, which link to each PCIe endpoint of the ATCA-IOP nodes. Hot Plug events, listed in Table II, are registered in the PEX 8696 Slot Control Register.
Once the Downstream Port receives a Hot Plug event, it must notify the software through a corresponding Hot Plug interrupt. Upon reception of a specified sequence of Hot Plug interrupts from a Downstream Port, the software layer proceeds to insert/remove the corresponding device and/or takes the appropriate actions on software applications using the device.
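To illustrate how software can tell which Hot Plug events a Downstream Port has latched, the sketch below decodes the event bits of a 16-bit slot status word. The bit positions follow the Slot Status register layout of the PCIe base specification as commonly documented; they are an assumption here and should be checked against the PEX 8696 data book and Table II before use. The function name is hypothetical.

```python
from typing import List

# Event ("changed"/"detected") bits of the PCIe Slot Status register.
# Assumed layout; state bits such as Presence Detect State are not events
# and are therefore not listed.
SLOT_STATUS_EVENT_BITS = {
    0: "Attention Button Pressed",
    1: "Power Fault Detected",
    2: "MRL Sensor Changed",
    3: "Presence Detect Changed",
    4: "Command Completed",
    8: "Data Link Layer State Changed",
}


def decode_hot_plug_events(slot_status: int) -> List[str]:
    """Return the names of the event bits set in a Slot Status value."""
    return [
        name
        for bit, name in SLOT_STATUS_EVENT_BITS.items()
        if slot_status & (1 << bit)
    ]


if __name__ == "__main__":
    # Example value with the presence-detect and link-state event bits set.
    print(decode_hot_plug_events(0x0108))
```

A host-side handler would typically read this value after a Hot Plug interrupt, act on each reported event, and then clear the handled bits.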
B. ATCA Form-factor Implementation of PCIe Hot Plug
This section describes an implementation of PCIe Hot Plug for the ATCA-IOP node blades linked to the Downstream Ports of the PEX 8696 switch on the ATCA-PTSW-AMC4 hub. On consulting the Hot Plug process for this hardware component, two issues were found. First, the Hot Plug event signaling interface is not physically implementable, since the PCIe endpoints of the ATCA-IOP nodes do not implement these signals and there are no sideband signals. The only PCIe signaling established between nodes and hubs on the ATCA Fabric Channel interface is in-band data. However, the PEX 8696 allows Hot Plug events to be triggered by software, through I2C access to its Slot Control register. The solution found was to use the ATCA-IOP IPMC not only to negotiate ATCA Hot Swap with the ShMC, but also to message the PCIe Hot Plug events to the ATCA-PTSW-AMC4 hub IPMC, since the hub IPMC has I2C access to the PEX 8696. With this methodology, the Hot Plug interrupts may then be generated by the PEX 8696 to the host. The second issue is that the specified Hot Plug process for the PEX 8696 uses the Attention Button event, and the ATCA form-factor does not have a corresponding physical element. Therefore, the Attention Button event and interrupt were removed from the original process. The remaining required process events were maintained and implemented according to Tables I and II.
C. Example of Node Blade Extraction
This section gives an example of PCIe Hot Plug as implemented for the IPFN C&DA system, using the process presented in the previous section. Fig. 5 shows the setup for the extraction of an ATCA-IOP node blade, with a numbered sequence described in Table III. The reverse process (node insertion) is analogous. In general, the node blade IPMC messages events to the hub IPMC, which causes interrupts to be generated that notify the PCIe host of the necessary actions: opening/terminating applications and/or inserting/removing the corresponding PCIe devices. The node blade may then be physically inserted or removed, through ATCA Hot Swap, while the remaining nodes (not represented in the figure) keep working. If the node board fails, a spare node takes over from the failing one, using any of the redundancy schemes presented in Section III, preserving service availability.
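On a Linux host, the device removal and re-insertion steps can also be exercised by hand through the PCI sysfs interface, which is useful for testing the application side independently of the interrupt path. The sketch below is not the procedure of Table III: it swaps the interrupt-driven Hot Plug flow for manual writes to the standard `remove` and `rescan` sysfs files. The device address used in the example and the `sysfs_root` parameter (added so the functions can be tried against a scratch directory) are hypothetical.

```python
from pathlib import Path

DEFAULT_SYSFS = Path("/sys")


def remove_pci_device(bdf: str, sysfs_root: Path = DEFAULT_SYSFS) -> None:
    """Ask the kernel to detach the device at the given domain:bus:device.function.

    Writing "1" to the device's 'remove' file unbinds its driver and removes
    it from the PCI tree. Requires root privileges on a real system.
    """
    (Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "remove").write_text("1")


def rescan_pci_bus(sysfs_root: Path = DEFAULT_SYSFS) -> None:
    """Ask the kernel to re-enumerate all PCI buses, picking up inserted devices."""
    (Path(sysfs_root) / "bus" / "pci" / "rescan").write_text("1")


if __name__ == "__main__":
    # Hypothetical address of one ATCA-IOP endpoint as seen by the host.
    remove_pci_device("0000:05:00.0")
    # ... blade is extracted and a replacement inserted via ATCA Hot Swap ...
    rescan_pci_bus()
```

Applications using the device should be closed before the removal call, mirroring the order of actions described above for the interrupt-driven process.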
VI. RESULTS AND CONCLUSIONS
A C&DA system was conceived to exhibit HA properties, aiming to fulfil the demands of Nuclear Fusion diagnostics. The current hardware is part of the "ITER Catalog of I&C products - Fast Controllers" [14]. Preliminary tests of node blade insertion and extraction were successfully performed at IPFN, according to the procedures described in the previous section. Blades were extracted from and inserted into the shelf with correct activation/deactivation of the respective software devices and applications, using Linux OS and in-house developed data acquisition software [15]. A demonstration of the system was presented at the ANIMMA 2015 conference, using a 100 meter fiber optics link between the ATCA shelf and the host PC, which allows the computer to be placed farther away from the shelf and from the effects of single event upsets [16]. A test plan is currently under way to assess availability by setting appropriate availability goals and creating fault scenarios to evaluate system behaviour, starting from the redundancy scenarios described herein.
The ATCA form-factor not only contains intrinsic HA features but also provides redundancy resources that can be used to increase system availability, especially at the data transmission level, on the ATCA backplane. The current solution for relating PCIe device Hot Plug to ATCA Hot Swap is customized for the particular application and hardware components but aims to be compatible with other types of hardware and OS. The ATCA specification could benefit, in future revisions, from further definitions for the implementation of PCIe Hot Plug, as well as from further elaboration on the differences between PCIe Hot Plug and Hot Swap, as a means to standardize hardware and software procedures, increasing compatibility and facilitating the development of instrumentation systems and components.
