Abstract-Achieving higher levels of plasma performance control in present fusion experiments requires that diagnostics be upgraded to deliver processed physical parameters in real-time (RT). A key element in a diagnostic RT upgrade is the data acquisition system (DAS), which should be capable of delivering the acquired data to the data processing resources with very low latency and in the shortest possible time. Adequate standard commercial solutions with these characteristics are not easily found on the market, which most often leads to the development of complex custom high-performance designs from the ground up. A mixed solution, partially based on commercial off-the-shelf (COTS) components, is under development to upgrade the existing ASDEX Upgrade (AUG) broadband reflectometry diagnostic so that a full demonstration of plasma position control using RT reflectometry density profile measurements can be performed. The 8-channel (12-bit/100 MSPS) DAS being designed features a PCI Express (PCIe) x8 interface to enable direct memory access (DMA) data transfers with throughputs in excess of 1 GB/s. The use of COTS components resulted in a faster hardware design cycle without compromising system performance and flexibility. The architecture of the system and its main design constraints, as well as the system integration in the AUG RT diagnostic network, are discussed herein. Preliminary benchmark results for data throughput and overall measurement latency are also presented.
I. INTRODUCTION
A growing trend in Fusion experiments like ASDEX Upgrade (AUG) is upgrading diagnostics for real-time (RT) operation. The access to processed physical parameters in RT opens the way to higher levels of plasma performance control.
A key element in such diagnostic upgrades is the data acquisition system (DAS), which must be capable of delivering the acquired data to the data processing resources with very low latency and in the shortest possible time. When large volumes of acquired data or a high number of acquisition channels are involved, and/or the processed measurement cycle is very demanding, high-performance custom-built designs are in general required. However, many diagnostics do not have such demanding requirements and can therefore make do with solutions partially based on commercial off-the-shelf (COTS) components. In this paper we present the DAS developed for the demonstration of the reflectometry-based plasma position control technique, an ITER-relevant demonstration presently underway in AUG [1]. In this context, the goal of the RT reflectometry diagnostic is to produce density profile measurements and position estimates for plasma position feedback control in the fastest AUG position control cycle, i.e. ≈ 1 ms [2]. For this purpose, a DMA-capable PCIe (x8) DAS was designed, using mainly COTS components, to remove all possible hardware latencies from the complete RT measurement cycle. The 8-channel (12-bit/100 MSPS) DAS uses a Xilinx Virtex-5 SX series FPGA to guarantee a data throughput in excess of 1 GB/s between the ADC's local buffer memory and the data processing server's RAM. This capability removes the need for a commercial hard-RT operating system (OS) to manage both the acquisition hardware and the RT data processing tasks. The continuous integration of hard-RT capabilities into standard open-source OSs like Linux makes them ideal candidates for these applications. The resulting flexibility in system development made it easier to integrate the designed system in the AUG RT software framework.
In the following sections we briefly describe the RT reflectometry measurements and the resulting system requirements and constraints, the proposed DAS architecture, preliminary benchmark results, and the system integration in the AUG RT network.
II. DIAGNOSTIC CHARACTERISTICS AND REAL-TIME OPERATIONAL REQUIREMENTS
The AUG RT reflectometry diagnostic will produce two types of on-line results: high-field side (HFS) and low-field side (LFS) density profiles, and estimates for the inner and outer separatrix position. These measurements are obtained using two broadband O-mode reflectometers [3] probing the plasma HFS and LFS at the equatorial plane. Interference signals, resulting from the swept operation of the K, Ka, Q and V band microwave sources of both reflectometers, are sampled to produce density profiles covering the 0.3-6.0 × 10^19 m^-3 density range. In total, eight signals are synchronously acquired in frames of N samples, as sketched in Fig. 1(a). The microwave sources are swept in 25 µs and require a settling time of 10 µs. If acquired at 40 MSPS, each sweep/frame has N = 1K samples. The number of samples per sweep can be doubled by increasing the sample rate to 80 MSPS.
A burst of four sequential broadband sweeps is required to produce a single RT density profile. As the targeted RT reflectometry measurement rate is 1 kHz, bursts of e.g. 4 frames × 1024 samples × 8 channels have to be acquired every 1 ms. This measurement rate is compatible with the fastest AUG plasma position control cycle and, in more demanding ELMy H-mode regimes where measurements might be affected by ELM activity, ensures that enough valid measurements are available to produce position estimates for slower control cycles (typically 10 ms for ITER, the future International Thermonuclear Experimental Reactor). Fig. 1(b) shows a simplified schematic diagram of the complete position control cycle. This is a two-phase process that starts with the acquisition and RT processing of the physically relevant parameters to be delivered to the discharge control system (DCS). These tasks, running locally in the RT diagnostic, are followed by the position control actuation processing, running in the DCS.
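The burst size quoted above translates directly into a data volume per measurement cycle. The following minimal sketch works through that arithmetic. The storage width of two bytes per sample is an assumption made here for illustration (the text only states 12-bit resolution), and the function names are hypothetical.

```python
FRAMES_PER_BURST = 4       # sweeps per density profile (from the text)
SAMPLES_PER_FRAME = 1024   # N = 1K samples per sweep at 40 MSPS (from the text)
CHANNELS = 8               # synchronously acquired signals (from the text)
BURST_RATE_HZ = 1000       # one burst every 1 ms (from the text)
BYTES_PER_SAMPLE = 2       # ASSUMPTION: 12-bit samples stored in 16-bit words


def burst_size_bytes(frames: int, samples_per_frame: int, channels: int,
                     bytes_per_sample: int = BYTES_PER_SAMPLE) -> int:
    """Size of one acquisition burst in bytes."""
    return frames * samples_per_frame * channels * bytes_per_sample


def sustained_rate_mb_s(burst_bytes: int, burst_rate_hz: int) -> float:
    """Average data rate in MB/s (1 MB = 10**6 bytes)."""
    return burst_bytes * burst_rate_hz / 1e6


burst_1k = burst_size_bytes(FRAMES_PER_BURST, SAMPLES_PER_FRAME, CHANNELS)
burst_2k = burst_size_bytes(FRAMES_PER_BURST, 2 * SAMPLES_PER_FRAME, CHANNELS)
rate_1k = sustained_rate_mb_s(burst_1k, BURST_RATE_HZ)

print(f"1K frames: {burst_1k} bytes per burst, {rate_1k:.1f} MB/s sustained")
print(f"2K frames: {burst_2k} bytes per burst")
```

Under the 16-bit storage assumption, a burst of 2K-sample frames occupies 131072 bytes (128 KiB). The average data rate is modest; it is the need to move each burst quickly within the 1 ms cycle that drives the throughput requirement.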
The control cycle can be broken down into the following subtasks: (i) data acquisition (temporary local storage of data in the acquisition system), (ii) data uploading (DMA transfer to the RAM buffer of the host system device driver), (iii) data management (including system IRQ response and buffered data adaptation and replication to the user space RT processing task), (iv) RT data processing (to calculate density profiles and separatrix position estimates), (v) communication of processed data to the discharge control system (DCS), and (vi) position control actuation processing. The duration of these tasks is either fixed, e.g. (i), (v) and (vi), or dependent on the host system performance, e.g. (iv), on the choices made for the OS, e.g. (iii) and the IRQ response, or on the acquisition hardware design, e.g. (ii).
A neural network based technique was developed [4] to calculate reflectometry density profiles and produce separatrix position estimates in RT. An optimized multithreaded implementation of the calculation/estimation codes was obtained using OpenMP. Benchmarking this code has shown that at least half of the targeted 1 ms measurement period is available for all the non-data-processing activities involved in the complete control measurement cycle (Fig. 1(b)). Considering that the data acquisition accounts for just ≈ 130 µs of the available > 500 µs, great flexibility is gained in the choice of RT OS and system hardware architecture. This was the main reason to choose the standard RT Linux kernel (mainline kernel with Real-time-Preempt patches applied) [5] over more latency-optimized RT OS implementations like RTAI [6] (Real-Time Application Interface), Xenomai [7] or even commercial solutions like VxWorks [8]. As writing software for an RT Linux kernel based system requires no special API, no or only minor adaptation of the RT processing codes or hardware device drivers is required. Since, in this application, an IRQ response latency of a few tens of microseconds could be afforded, a trade-off between code simplicity and optimal IRQ latency was made, resulting in large benefits in terms of software development and debugging times. As for the complete RT position control cycle, achieving a ≈ 1-1.3 ms mark will depend essentially on the final implementation of the RT data processing code. Because processing the diagnostic inputs and producing the required actuations takes ≈ 400 µs, the desired control cycle can only be attained if less than 500 µs are used to calculate the reflectometry position data. So far, an average calculation time of ≈ 170 µs was measured when all 8 cores of the prototype system host were used.
In the final implementation, some of these cores will be reserved permanently for the OS and/or hardware related or communication tasks. However, if required, further optimizations such as increasing the number of CPUs (the original OpenMP code is scalable) or converting some calculations to fixed point are still possible. Additionally, some parts of the data processing can always be implemented in the FPGA to further improve the cycle time.
III. COTS BASED ACQUISITION SYSTEM ARCHITECTURE
The main characteristics of the acquisition system are: 8 channels, 12-bit resolution, 80-100 MSPS, 128 KB burst FIFO memory, PCIe bus interface. These satisfy the operational requirements mentioned in the previous section. The justification for the most design-constraining characteristic, i.e. the PCIe bus interface, is twofold. It stems directly from the need to maximize the data uploading throughput and from the fact that the fastest data bus available on powerful multiprocessor server motherboards is a multiple-lane PCIe interface (in general materialized in a x8 or x16 lane slot). Using the standard PCIe interface format, rather than the compact PCIe format (cPCIe) more widely adopted in industry, minimizes the number of data bus switches between the acquisition board PCIe endpoint and the microprocessor and RAM buses. By using the larger PCIe slots, usually directly connected to the motherboard North Bridge, DMA data transfer latencies are minimized and data throughput is maximized. Besides optimizing system performance, bringing the complete acquisition system into the server's rack mount case allows for an "all-in-a-box", compact and self-contained RT diagnostic data acquisition and processing system.
978-1-4244-7110-2/10/$26.00 ©2010 IEEE
The two main hardware building blocks were easily found on the COTS market: an ADC evaluation board featuring a serial LVDS interface and a quad-channel 12-bit, 105 MSPS ADC, and a PCIe FPGA development board with an 8-lane PCIe 1.0 bus. Among the many available options, the latter was chosen to feature a Xilinx Virtex-5 SX FPGA (XC5VSX50T/95T), since this device integrates a built-in hardware x8 PCIe endpoint. Choosing a host motherboard that supports a PCIe TLP (Transaction Layer Packet) payload size of at least 128 bytes is enough to guarantee ≈ 70% of the maximum theoretical data throughput (2.0 GB/s) of an 8-lane PCIe bus. The chosen FPGA family has enough internal memory resources to implement the large FIFO required to temporarily store the burst data, and specialized DSP units to allow for future in-FPGA data processing. Such development boards also come with extra functionality such as SFP connectors, Gigabit Ethernet PHYs, USB and RS232 ports, on-board and slotted DDR2 SDRAM memory and multiple programmable clock sources. Above all, they feature special expansion connectors for customized user application daughter cards, whose pins are directly routed to the FPGA single-ended or differential IO pins and IO clock resources.
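The 2.0 GB/s figure follows from the PCIe 1.0 line rate (2.5 GT/s per lane) and its 8b/10b coding, and the ≈ 70% fraction quoted above then gives the expected usable throughput. The sketch below reproduces this arithmetic. The per-packet overhead of 20 bytes (framing, sequence number, a three-doubleword header and link CRC) is an assumption used only to illustrate why a larger TLP payload helps; it yields an upper bound that ignores flow control and acknowledgement traffic, so it comes out higher than the ≈ 70% stated in the text. All function names are hypothetical.

```python
def pcie_gen1_bandwidth_gb_s(lanes: int) -> float:
    """Theoretical one-way bandwidth of a PCIe 1.0 link in GB/s."""
    line_rate_gbit_s = 2.5         # PCIe 1.0 signalling rate per lane
    data_bits_per_symbol = 8       # 8b/10b coding: 8 data bits ...
    line_bits_per_symbol = 10      # ... carried in 10 line bits
    data_gbit_s = lanes * line_rate_gbit_s * data_bits_per_symbol / line_bits_per_symbol
    return data_gbit_s / 8         # bits -> bytes


def tlp_efficiency(payload_bytes: int, overhead_bytes: int = 20) -> float:
    """Payload fraction of one TLP; overhead_bytes is an ASSUMED per-packet cost."""
    return payload_bytes / (payload_bytes + overhead_bytes)


theoretical = pcie_gen1_bandwidth_gb_s(lanes=8)
effective = 0.70 * theoretical     # the ~70 % fraction is the value quoted in the text

print(f"theoretical x8 bandwidth: {theoretical:.1f} GB/s")
print(f"usable at ~70 %:          {effective:.1f} GB/s")
print(f"packet-level bound, 128-byte payload: {tlp_efficiency(128):.0%}")
```

Even at 70% of the theoretical rate, the link comfortably exceeds the 1 GB/s design target.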
In practice, the only hardware that had to be developed to build the described acquisition system was one such piggy-back daughter card, used essentially to interface the FPGA development board to the ADC evaluation boards. The remaining components, the centrally synchronized timing device and the low-latency RT network interface board, are required to integrate the diagnostic in the AUG RT diagnostic network. The simplified block diagram of Fig. 2 shows these components and how the system interfaces with the outside world, i.e., with the reflectometry microwave circuitry and with the discharge control system (DCS) via either the high or low latency RT networks.
The correct timing synchronization between all data acquisition systems participating in the AUG RT diagnostic network and the DCS is an important prerequisite and is achieved through the uTDC hardware [11], a timing device developed in-house at IPP (Max-Planck-Institut für Plasmaphysik). To guarantee that all uTDCs in this distributed system always operate at exactly the same (64-bit) time value, they are all connected to one central timer via a unidirectional fiber network in a star topology. Every millisecond the central timer distributes the current system time and synchronization information, to which the phase-locked loop circuit of each uTDC can lock. The accuracy of the uTDC timing device, available as standard PCI or compact PCI pluggable boards, is presently 20 ns. These devices can produce complex timing signals to control ADC boards using two on-board independent programmable pulse generators (PPGs).
In a first step the RT-Reflectometry diagnostic will be interconnected with the ASDEX Upgrade control network by standard Gigabit Ethernet. This solution provides latencies in the range of several hundreds of microseconds which will be sufficient for the testing phase. Later on it will be replaced by a VMIC reflective memory interconnection which guarantees true hard real-time operation with bounded latencies of only a few tens of microseconds.
IV. HARDWARE AND FIRMWARE DEVELOPMENT
The significant advantage of using COTS components is that it limits the complexity and the amount of hardware to develop. As previously mentioned, this system only needed a daughter card to provide an interface between the PCIe FPGA board and the ADC evaluation boards. The main "hardware" development effort, however, was put into programming the FPGA so that the acquired data buffering and very high speed transfer requirements were properly satisfied. Again, careful planning and an adequate architecture design help to limit the complexity of the operations performed inside the FPGA. The sheer performance of recent multi-core processors (the system motherboard supports two quad-core 3.0 GHz Xeon processors with 12 MB of L2 cache) allows the migration of most of the low-level data management functionality, such as sample grouping and reordering or even data filtering, from the FPGA to the RT data processing tasks without severe penalties in terms of overall performance. On the other hand, precious development and debugging time is gained, since programming applications in C or C++ is a much easier task than programming FPGAs in hardware description languages such as VHDL or Verilog. This is particularly true when placing and routing complex designs, with very large data buses working at several hundreds of MHz, is involved. The next subsections describe the custom interface board and the FPGA firmware functionality in more detail.
A. Custom Interface Board Design
The custom interface board has four main functions, as can be seen in the block diagram of Fig. 3. First of all, it is used to route the 8 LVDS DDR data streams (two per channel) and the frame and data clocks from each ADC evaluation board to the FPGA development board. This board allows the connection of up to 4 ADC boards, i.e. 16 12-bit/100 MSPS acquisition channels, via 4 high speed socket strips. The interconnection between the board and the ADC boards is made through 50 Ω high speed cable assemblies that also carry the single-ended signals used for the ADC serial programming interface. A 5 V powered DC-DC conversion sub-module locally generates digital and filtered analog 3.3 V supply voltages to feed all 4 ADC boards and the on-board circuitry simultaneously. Three single-ended IO connectors are available for feeding trigger signals in or out. The voltage level of these signals can be locally converted between 2.5 V and 3.3 V. Finally, a fully programmable PLL was implemented to produce a high quality, low jitter sample clock to drive the ADCs in phase with the 10 MHz synchronization clock generated by the uTDC board. The chosen PLL has LVDS and LVPECL output stages. The LVPECL outputs are used to drive the ADCs, and two LVDS replicas of the acquisition clock are forwarded to an output connector and back to the FPGA. The jitter characteristics of the PLL/internal VCO were evaluated using the PLL supplier's own simulator/loop filter calculator. Although the ADC boards could be equipped with the 14-bit versions of the ADC (also natively supported by the interface board), the jitter characteristics of the output LVPECL acquisition clocks by themselves limit the obtainable effective number of bits (ENOB) to ≈ 11.4 in the 40-100 MSPS sampling frequency range. Therefore, no advantage would be gained by using higher resolution ADCs.
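The argument that clock jitter caps the useful resolution can be made concrete with the standard textbook relations between ENOB, SINAD and aperture jitter. The sketch below is not the supplier's simulator and does not reproduce the evaluation described above; the input frequency and RMS jitter in the last line are hypothetical placeholders, not values from this work.

```python
import math


def sinad_db_from_enob(enob: float) -> float:
    """SINAD of a converter with the given effective number of bits."""
    return 6.02 * enob + 1.76


def enob_from_sinad_db(sinad_db: float) -> float:
    """Effective number of bits corresponding to a given SINAD."""
    return (sinad_db - 1.76) / 6.02


def jitter_limited_snr_db(f_in_hz: float, jitter_rms_s: float) -> float:
    """SNR ceiling set by sampling clock jitter for a sine input at f_in_hz."""
    return -20.0 * math.log10(2.0 * math.pi * f_in_hz * jitter_rms_s)


# SINAD equivalent to the ENOB quoted in the text, and the ideal 12/14-bit values.
sinad_quoted = sinad_db_from_enob(11.4)
sinad_12bit = sinad_db_from_enob(12.0)
sinad_14bit = sinad_db_from_enob(14.0)

# HYPOTHETICAL inputs, for illustration only (not taken from the paper).
example_enob = enob_from_sinad_db(jitter_limited_snr_db(f_in_hz=50e6, jitter_rms_s=1e-12))
```

Once the jitter-limited SNR falls below the ideal 12-bit value, the extra bits of a 14-bit converter would only digitize noise, which is the point made in the text.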
The PLL has two reference clock sources to which it can lock: the external uTDC 10 MHz synchronization clock and an internal low jitter 10 MHz oscillator. If the selected reference (usually the external uTDC sync clock) fails, the PLL switches automatically to the fallback reference.
B. FPGA Embedded Functionality
The choice of the FPGA family was critical to ensure that the amount of firmware development was kept as low as possible. As stated, the Virtex-5 family integrates a hardware implementation of an x8 PCIe endpoint. By using a third-party DMA IP core, the full functionality of a DMA-capable x8 PCIe interface was unlocked. Fig. 4 shows a simplified block diagram of the logic programmed into the FPGA, which can be grouped into two main macro blocks in charge of: i) the acquisition data flow and ii) the generic logic control.
The acquisition data flow macro block contains the blocks required to receive, store and format the data blocks to be uploaded to the host by the DMA management module through the DMA and PCIe EP cores. The LVDS front end receives the ADC differential DDR data stream pairs and the bit (dclk) and frame (fclk) clocks, and uses the FPGA built-in delay and deserializer resources to reconstruct each 12-bit sample. Since the system synchronously acquires data from 8 microwave channels, the samples are grouped in a 96-bit bus running at a maximum of 100 MHz (when the maximum sampling rate is used). As the ADC clocks run continuously, the frames of samples are formatted and synchronized with the acquisition triggers received from the uTDC in the Acquisition data buffering control logic block. This block can be programmed to accept different frame sizes and numbers of frames per burst. It also provides the required signals to the write interface of the data buffering block. This memory block implements a 128 KB asymmetric FIFO with a 128-bit write interface, operating at a maximum of 100 MHz, and a 64-bit read interface, operating at a fixed 250 MHz rate. The read side is connected to the DMA management module, responsible for the data uploading DMA transfer.
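The grouping of eight 12-bit samples into one 96-bit word, and the bandwidths of the two FIFO ports, can be illustrated in software. This is a sketch of the idea only: the bit ordering (channel 0 in the least significant bits) is an assumption, since the text does not describe how samples are laid out or how the 96-bit group maps onto the 128-bit write port, and the function names are hypothetical.

```python
SAMPLE_BITS = 12
CHANNELS = 8
SAMPLE_MASK = (1 << SAMPLE_BITS) - 1


def pack_samples(samples: list[int]) -> int:
    """Pack one sample per channel into a 96-bit word (ASSUMED: channel 0 in the LSBs)."""
    if len(samples) != CHANNELS:
        raise ValueError(f"expected {CHANNELS} samples, got {len(samples)}")
    word = 0
    for channel, sample in enumerate(samples):
        if not 0 <= sample <= SAMPLE_MASK:
            raise ValueError(f"sample {sample} does not fit in {SAMPLE_BITS} bits")
        word |= sample << (channel * SAMPLE_BITS)
    return word


def unpack_samples(word: int) -> list[int]:
    """Inverse of pack_samples: recover the eight 12-bit samples."""
    return [(word >> (channel * SAMPLE_BITS)) & SAMPLE_MASK for channel in range(CHANNELS)]


def port_bandwidth_gb_s(width_bits: int, clock_mhz: float) -> float:
    """Peak bandwidth of a FIFO port in GB/s (1 GB = 10**9 bytes)."""
    return (width_bits / 8) * (clock_mhz * 1e6) / 1e9


samples = [0x001, 0x7FF, 0x800, 0xFFF, 0x123, 0x456, 0x789, 0xABC]
word = pack_samples(samples)
write_bw = port_bandwidth_gb_s(128, 100.0)   # write port, from the text
read_bw = port_bandwidth_gb_s(64, 250.0)     # read port, from the text
```

With the port widths and clocks given in the text, the read side (2.0 GB/s peak) is faster than the write side (1.6 GB/s peak), so the DMA engine can drain the FIFO faster than the ADCs can fill it.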
The generic control logic macro block handles all the remaining configuration functions, such as synchronizing the frame id and timer counters with the frame acquisition, handling the acquisition-dependent programmable internal trigger subsystem (used, among other things, to generate IRQ request timing) and delivering PLL and ADC configuration data to the serial program interface block. To perform these functions, the Main Control Logic block uses a set of IO registers handled by the DMA core Slave management module. These registers are used to bring configuration data and logic triggering signals to the board and to reflect the status of the various system components. One of these registers is used to access the 128-bit contents of the frame timer and id counter. This long word carries the 32-bit frame id (frame count), the start time (48-bit) of the first frame in the burst and the time (48-bit) of a programmable acquisition-dependent condition, corresponding to the data present in the FIFO. The 48-bit time counter in this block works with a 200 MHz clock (also used in the delay lines present in the LVDS deserializer interface) generated by an internal PLL. From an external 100 MHz reference clock, this PLL also generates a 20 MHz clock, used in the serial programming interface of the acquisition PLL and ADCs, and a phase-synchronized replica of the 100 MHz clock. Fig. 5 shows pictures of the developed interface board in "piggy-back" on the PCIe FPGA board and of the ADC module sub-assembly and respective high speed cabling.
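On the host side, the 128-bit frame word has to be split into its three fields and the counter values converted to time. The sketch below shows one way this could be done. The field order within the word is an assumption (the text gives the field widths but not their positions); the 5 ns tick follows from the 200 MHz counter clock. The names are hypothetical.

```python
from dataclasses import dataclass

TICK_NS = 5                    # one period of the 200 MHz counter clock
MASK_32 = (1 << 32) - 1
MASK_48 = (1 << 48) - 1


@dataclass(frozen=True)
class FrameInfo:
    frame_id: int              # 32-bit frame count
    start_time_ns: int         # start of the first frame in the burst
    condition_time_ns: int     # time of the programmable condition


def decode_frame_word(word: int) -> FrameInfo:
    """Split the 128-bit word into its fields.

    ASSUMED layout (not given in the text):
    bits 31..0 frame id, bits 79..32 start time, bits 127..80 condition time.
    """
    frame_id = word & MASK_32
    start_ticks = (word >> 32) & MASK_48
    condition_ticks = (word >> 80) & MASK_48
    return FrameInfo(frame_id, start_ticks * TICK_NS, condition_ticks * TICK_NS)


# Example: frame 7, started at tick 1000, condition registered 4000 ticks later.
example_word = (5000 << 80) | (1000 << 32) | 7
info = decode_frame_word(example_word)
latency_us = (info.condition_time_ns - info.start_time_ns) / 1000.0

# Time before the 48-bit counter wraps around, in days.
wrap_days = (1 << 48) * TICK_NS * 1e-9 / 86400.0
```

The difference between the two timestamps gives a latency directly in counter ticks, and at 5 ns per tick the 48-bit counter wraps only after roughly 16 days, far longer than any acquisition session.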
V. PRELIMINARY BENCHMARKING
At the time of writing, the functionality described in the block diagram of Fig. 4 was not yet fully implemented (no working asymmetric data FIFO). However, its partial implementation allowed the preliminary benchmarking of two critical aspects of the design: system response to acquisition hardware generated IRQs and DMA data throughput. To test the system response times, the uTDC board was programmed to generate trigger events corresponding to the acquisition of a predefined number of bursts of 4 sweeps. The programmable internal logic in the acquisition board FPGA was configured to generate an IRQ after the occurrence of the first trigger of a burst. The corresponding burst number and timestamp are automatically registered using the frame id counter and the 48-bit timer (∆t = 5 ns), respectively. The logic was also programmed to register a timestamp when a register-programmed flag is set to 1. Using this mechanism, we could measure the latency required to serve the IRQ (acquisition system device driver) and activate the high priority user RT data processing task (RTDPT). Fig. 6 shows the latency distributions when the frame timer is read: a) right at the beginning of the RTDPT (IRQ response + RTDPT activation latencies), b) after the acquired data has been de-interleaved and copied to the local memory buffer, c) after the de-interleaved locally buffered data is stored in a large memory buffer (RAM). These measurements were performed with an unloaded host, frames of 1K samples, and simulating acquisition sessions of one minute (typical AUG discharge duration is 10 s max.), i.e. sets of 60000 IRQs per distribution. In this setup, the RTDPT activation time remained below 20 µs, and 50 µs after receiving an IRQ the RTDPT can start processing the data. Storing each burst of data in a large buffer in memory (RAM) adds only < 40 µs.
This last step is required so that, at the end of the discharge, all acquired raw data can be stored on disk in a Level-0 shotfile. All in all, < 100 µs are required for the complete data management step (green bar in Fig. 1(b)). Doubling the frame size to 2K samples retains the same RTDPT activation timings but roughly doubles the data de-interleaving and storage timings.
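The de-interleaving step mentioned above reorders the uploaded burst so that each channel's frames are contiguous. A minimal sketch follows, assuming the burst arrives sample-interleaved (one sample from each channel in turn), which the text does not specify. The function name and layout are hypothetical, and a production RT task would do this in C or C++ on preallocated buffers rather than with Python lists.

```python
def deinterleave(burst: list[int], channels: int, frames: int,
                 samples_per_frame: int) -> list[list[list[int]]]:
    """Return data indexed as [channel][frame][sample].

    ASSUMED input layout: ch0, ch1, ..., ch(N-1), ch0, ch1, ... with the
    frames of a burst following each other without gaps.
    """
    expected = channels * frames * samples_per_frame
    if len(burst) != expected:
        raise ValueError(f"expected {expected} samples, got {len(burst)}")
    result = []
    for channel in range(channels):
        stream = burst[channel::channels]          # every channels-th sample
        result.append([
            stream[f * samples_per_frame:(f + 1) * samples_per_frame]
            for f in range(frames)
        ])
    return result


# Tiny example: 2 channels, 2 frames, 3 samples per frame.
demo = deinterleave(list(range(12)), channels=2, frames=2, samples_per_frame=3)
```

Since every sample is touched once, the cost of this step scales linearly with the burst size, consistent with the observation that doubling the frame size roughly doubles the de-interleaving and storage times.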
To benchmark the DMA transfer, we used the test reference design provided by the DMA IP core supplier. Fig. 7 shows a plot of the instantaneous data throughput (half duplex mode) for reading from (red line) and writing to (blue line) a FIFO memory in the FPGA of the PCIe development board. As can be seen, sustained DMA transfers of > 1.4 GB/s were achievable with the DMA IP core reference design. Such a high data throughput is possible because the motherboard chipset used allows a maximum PCIe TLP (Transaction Layer Packet) payload size of 128 bytes, enough to guarantee ≈ 70% of the maximum theoretical data throughput (2.0 GB/s) of an 8-lane PCIe 1.0 link. Fig. 1(b) was sketched using a more conservative 1.0 GB/s transfer rate (orange bar) and a data management time of 100 µs. For the diagram with 2K samples per frame, the data uploading time and the data de-interleaving and storage times were simply doubled. In both cases, the time available for processing the acquired data in RT is greater than 500 µs. A further optimization can still be made by initiating the data transfer before the end of the 130 µs acquisition time. The time gained, marked by the arrow and the dotted lines, can be directly added to the available computation time.
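The time budget behind Fig. 1(b) can be checked with a few lines of arithmetic. The sketch below uses the figures given in the text (1 ms period, 130 µs acquisition, 1.0 GB/s transfer rate, 100 µs data management) and assumes, for illustration, that each 12-bit sample is transferred as a 16-bit word. For the 2K case the whole data management time is doubled, which is slightly pessimistic since the RTDPT activation time does not change. The function names are hypothetical.

```python
PERIOD_US = 1000.0        # targeted 1 kHz measurement rate
ACQUISITION_US = 130.0    # acquisition time quoted in the text
THROUGHPUT_GB_S = 1.0     # conservative DMA rate used for Fig. 1(b)
BYTES_PER_SAMPLE = 2      # ASSUMPTION: 12-bit samples stored in 16-bit words
FRAMES, CHANNELS = 4, 8   # burst structure from the text


def upload_time_us(burst_bytes: int, throughput_gb_s: float) -> float:
    """DMA transfer time in microseconds (1 GB = 10**9 bytes)."""
    return burst_bytes / (throughput_gb_s * 1e9) * 1e6


def processing_budget_us(samples_per_frame: int, management_us: float) -> float:
    """Time left for RT data processing when the steps run strictly in sequence."""
    burst_bytes = FRAMES * samples_per_frame * CHANNELS * BYTES_PER_SAMPLE
    upload_us = upload_time_us(burst_bytes, THROUGHPUT_GB_S)
    return PERIOD_US - ACQUISITION_US - upload_us - management_us


budget_1k = processing_budget_us(1024, management_us=100.0)
budget_2k = processing_budget_us(2048, management_us=200.0)

print(f"1K frames: {budget_1k:.0f} us left for processing")
print(f"2K frames: {budget_2k:.0f} us left for processing")
```

Under these assumptions both cases leave more than 500 µs for processing, in agreement with the statement above, and overlapping the upload with the acquisition would enlarge the margin further.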
These preliminary tests show that no severe penalties were imposed by the chosen RT OS implementation and that the targeted 1.0 GB/s transfer rate can be easily achieved.
VI. INTEGRATION IN THE ASDEX UPGRADE RT DIAGNOSTIC NETWORK
An important application of real-time capable data acquisition systems is the integration with a feedback control system. In the case of ASDEX Upgrade, this means connecting the diagnostic to the Discharge Control System (DCS) via an RT network [10]. The DCS is responsible for the controlled execution of a plasma discharge and the coordination of all participants which have an impact on the process. Typical sections which formerly resided only in the control domain of the system are now being migrated into the data acquisition domain, with plasma diagnostics having hard real-time properties. This migration process is driven by the continuous integration of hard real-time capabilities into standard operating systems like Solaris or Linux, under which the data acquisition software is running.
For a most efficient integration, the implementation of RT-reflectometry must be as compliant as possible with the architecture of the ASDEX-Upgrade standard RT-diagnostic [12] , shown in Fig. 8 . The "Level-0" process is responsible for raw data acquisition and storage. This part is controlled by a "shotfile header" which describes the parameters of the data acquisition process, like sample rate, data type and the mappings of raw signals to physical signal names. After the discharge, a "Level-0 shotfile" embodying acquired raw data and all relevant parameters and information used to describe it, is generated for later data analysis.
The second component is the "Level-1" process, which takes raw data from the Level-0 process and produces, by means of a particular rtDiagAnalysis code, physically meaningful results which are immediately published as real-time signals on the real-time communication network. These signals are read by the Discharge Control System (DCS) for further processing and can be subscribed to by other diagnostics to enhance their own Level-1 data acquisition processes. This component is controlled by a Level-1 shotfile header defining the raw data used as well as the resulting signals, signal groups and parameters, which are stored in a "Level-1 shotfile" after a plasma shot. An additional XML file describes all incoming and outgoing signals which are exchanged over the real-time communication network. All communication with the DCS is handled by the library modules rtDiagLib/Control. Selected parameters can be configured on the fly by passing them on the command line to the Level-1 main() function. Both processes, Level-0 and Level-1, communicate through the library module rrlib, based on POSIX shared memory inter-process communication functions.
To ease the development and implementation of standard real-time diagnostics, an rtLevel-1 framework was developed [9]. It provides a generic template for a Level-1 process that can be adapted to the individual requirements of any diagnostic. Initially, the rtLevel-1 framework was only available for the Solaris platform and strongly biased towards the in-house developed SIO data acquisition hardware [12]. For the RT reflectometry diagnostic integration, the rtLevel-1 framework was ported to Linux (64-bit) and extended to support custom non-SIO hardware. The other libraries required to write Level-0 and Level-1 shotfiles and to interface with the DCS framework over the RT network were already available for the Linux platform.
VII. CONCLUSION
It was shown that, by using available COTS components, compact low latency and high throughput data acquisition systems can be built with limited hardware development. This option implies that the main development effort is displaced to the programming of the embedded FPGA devices. However, the high performance of cheap and widely available multicore multiprocessor host servers directly contributes to limit the complexity programmed into these FPGAs. In fact, if this complexity is moved to RT user-space tasks running on the acquisition system host, the use of parallel programming paradigms such as OpenMP and optimized parallel digital signal processing function libraries allows for a quicker development and prototyping cycle of the required data processing and management algorithms. In the end, a solution capable of satisfying demanding RT measurement cycle rates, without the need for complex and time consuming development of FPGA based data processing codes, is achievable.
For RT diagnostics compatible with overall system response times greater than ≈ 50-100 µs, the standard RT Linux kernel (mainline kernel with RT-preempt patches) has the characteristics required to guarantee adequately deterministic behavior. Using it brings overall implementation simplicity and access to much broader hardware and software support. These advantages alone justify, in these cases, its choice over a standard hard real-time OS, especially because its somewhat longer response times can be partly compensated by the reduction in data transfer time obtained with very high throughput, low latency acquisition systems like the one described herein.
The prototype of this system, integrated in the ASDEX Upgrade RT network, is expected to be operational during the 2010 experimental campaign. Experiments of plasma position feedback control using reflectometry measurements are planned for H-mode discharges and also for ELM-free regimes.
ACKNOWLEDGMENT
This work, supported by the European Communities and the Instituto Superior Técnico, has been carried out within the Contract of Association between EURATOM and IST. Financial support was also received from the Fundação para a Ciência e Tecnologia in the frame of the Contract of Associated Laboratory. The views and opinions expressed herein do not necessarily reflect those of the European Commission, IPP, IST and FCT.
