Abstract-The neutron detection system for characterizing the emissivity in the ITER tokamak during DD and DT experiments poses serious challenges to the performance of the diagnostic control and data acquisition (CDAcq) system. The ongoing design of the ITER radial neutron camera (RNC) diagnostic comprises 26 lines of sight (LOS) for complete plasma inspection. The CDAcq system aims at meeting the ITER requirement of delivering the measurement of the real-time neutron emissivity profile with a time resolution and control cycle time of 10 ms at a peak event rate of 2 Mevents/s per LOS. This measurement demands the generation of the neutron spectra for each LOS with neutron/gamma discrimination and pile-up rejection. The neutron spectra can be computed entirely in the host CPU, or the host can use processed data coming from the system field-programmable gate array (FPGA). The number of neutron counts extracted from the spectra is then used to calculate the neutron emissivity profile using an inversion algorithm. Moreover, the acquired event-based raw data must be made available to the ITER data network, without local data storage, for postprocessing. The data production at the 2-Mevents/s rate can reach a maximum data throughput of 0.5 GB/s per channel, which motivates the evaluation of real-time data compression techniques in the RNC. To meet the demands of the project, a CDAcq prototype has been used to design and test a high-performance distributed software architecture that takes advantage of multicore CPU technology to cope with the requirements. This paper describes the design of the real-time architecture, the spectra algorithms (pulse height analysis, neutron/gamma discrimination, and pile-up correction), and the inversion algorithm used to calculate the emissivity profile. Preliminary tests that evaluate the system performance with synthetic data are presented.
I. INTRODUCTION
THE radial neutron camera (RNC) is an ITER diagnostic that aims to deliver, in real time and with a 10-ms control cycle, the profile of the plasma neutron emission for machine control purposes. Each line of sight (LOS) provides the integrated measurement of the neutron flux by means of spectrometers and neutron flux monitors. These measurements provide the inputs for the neutron emissivity calculation using an inversion algorithm that uses the magnetic surfaces and the integrated neutron flux [3]-[5]. Fig. 1 depicts the system architecture with 16 external and 6 internal port LOSs and the plasma magnetic surfaces used to calculate the neutron emissivity at the points of intersection between the LOSs and the magnetic surfaces.
The design of a high-performance distributed software architecture using multicore CPU technology, capable of coping with the highly demanding system requirements, was validated using a control and data acquisition (CDAcq) prototype. This contribution describes the real-time architecture, the algorithms that construct the neutron and gamma energy spectra (including pulse height analysis, neutron/gamma discrimination, and pile-up correction), and the inversion algorithm that calculates the neutron emissivity profile. The preliminary tests made to validate and evaluate the system performance with synthetic data are also described.
II. OVERALL SYSTEM ARCHITECTURE
To obtain the necessary measurements that comply with the system requirements, the ITER RNC diagnostic runs a set of functions implemented in a variety of algorithms performing the following tasks.
• Acquisition of the streamed data from radiation detectors.
• Detection of pulses (events) to build a sequence of relevant data for the measurement, while discarding the stream where no pulses were detected.
• Pulse processing in order to provide count rates and pulse height spectra.
• Reconstruction of the neutron emissivity profile (by processing the neutron count rate data together with the magnetic flux information).
While the algorithms that compute the neutron emissivity profile in real time must run within a 10-ms cycle in order to provide the measurement for advanced control purposes, the stream of raw detector pulse data must be delivered for storage for offline physics studies. For this purpose, data compression algorithms were implemented to reduce the size of the detector pulse data before they are sent to the ITER archiving system.
0018-9499 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
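The pulse detection task in the list above, which keeps only the parts of the stream that contain events, can be illustrated with a minimal Python sketch. It assumes a simple fixed-threshold trigger with a fixed-length window; the function name `detect_events` and the parameters `threshold`, `pre`, and `length` are hypothetical and are not taken from the RNC firmware.

```python
from typing import List, Tuple


def detect_events(stream: List[int], threshold: int,
                  pre: int = 2, length: int = 8) -> List[Tuple[int, List[int]]]:
    """Return (trigger_index, window) pairs; samples outside windows are dropped.

    Assumption: an event is any sample at or above `threshold`; the window
    starts `pre` samples before the trigger and is `length` samples long.
    """
    events = []
    i = 0
    while i < len(stream):
        if stream[i] >= threshold:
            start = max(0, i - pre)
            events.append((i, stream[start:start + length]))
            i = start + length  # do not re-trigger inside the same window
        else:
            i += 1
    return events


# Synthetic stream with two pulses separated by baseline samples.
stream = [0, 1, 0, 0, 9, 14, 8, 4, 2, 1, 0, 0, 0, 1, 0, 12, 7, 3, 1, 0]
events = detect_events(stream, threshold=5)
kept = sum(len(window) for _, window in events)
print(f"{len(events)} events, {kept} of {len(stream)} samples kept")
```

Only the samples inside the event windows would be passed on to pulse processing and archiving in such a scheme; the rest of the stream is discarded.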
III. SYSTEM PROTOTYPE
This section describes the CDAcq prototype designed to test the multicore CPU technology distributed software architecture, the implemented algorithms, and the performance of the solutions presented to meet the demanding RNC requirements.
The most recent RNC project design review has defined the use of 16 ex-port LOS and 6 in-port LOS. Nevertheless, the acquisition prototype presented in this section uses up to four inputs in the host to validate the implementation of the data distribution, data compression, and pulse processing algorithms. The final system design will replicate this prototype up to the number of necessary channels, with at least one channel per LOS.
Although ATCA [6] or PXIe [7] architectures may be used for the final system design, the decision is still pending and will be based on the performance test results, cubicle space optimization, and reduction of unnecessary costs. Thus, a prototype architecture based on peripheral component interconnect express (PCIe) was chosen as a cost-effective solution capable of supporting the necessary performance tests, software architecture tools, and critical algorithms [8].
The test and development prototype hardware and software specifications feature the following.
• Real-time OS: Scientific Linux 7.0 with RT kernel.
• In-house developed device drivers.
• Host PC running on Intel(R) Core(TM) i7-5930K CPU at 3.50 GHz with 6 cores and 12 independent software threads, 64-GB memory, and 256-GB SSD.
• Xilinx KC705 evaluation board featuring a Kintex 7 field-programmable gate array (FPGA).
• In-house developed firmware featuring analog-to-digital converter (ADC) readout, PCIe direct memory access (DMA) data streaming management, event detection, and pulse analysis [1] .
• In-house developed analog input FPGA Mezzanine Card (FMC-AD2-1600) with two 12-bit ADCs at up to 1.6 GHz [8].
Fig. 2 depicts the hardware setup of the prototype, which includes one FMC digitizer that connects to the FPGA using the Xilinx high pin count (HPC) bus. The system can also host a second Xilinx KC705 simultaneously, increasing the number of analog channels per host. Nevertheless, the inclusion of more channels demands that the host has enough performance and CPU cores available to run all the algorithms.
IV. SOFTWARE ARCHITECTURE
This section describes the design of the real-time architecture that acquires the pulse information from the detectors and builds the particle spectra using pulse integration, neutron/gamma separation, and pile-up correction algorithms. The inversion algorithm that computes the emissivity profile will run in a different dedicated host, which is connected to the main one through a dedicated Ethernet link. Fig. 3 shows the software architecture of the present RNC prototype, including the data links and the hierarchy between the main software modules, which are as follows.
• Linux device driver.
• Data receiver and distribution.
• Pulse processing for particle energy spectrum construction.
• Data compressing and archiving.
• Raw data archiver (for testing and validation).
The use of a shared memory permits the real-time distribution of the pulse data among several clients so that the most recent pulse data are available for different purposes. The pulse processing application calculates the particle energy spectrum for real-time measurement, while the data compression [2] is used to store or deliver to the archiving system the raw pulse data for offline analysis and physics studies. The raw pulse data archiver may be used when no data compression is desired or for testing and validation activities.
To optimize the performance and the real-time deterministic completion of the tasks, each software module runs in dedicated isolated cores. The load of each CPU core was measured during preliminary tests to help dimension the distribution of tasks per CPU core and the number of CPU cores per task. For the present prototype tests, the following needs were identified for optimal performance at 2 Mevents/s.
• Two logical cores for the use of the operating system.
• One logical core per KC705 for the use of the device driver.
• One logical core per KC705 for the use of the data receiver and distribution application.
• Three logical cores per channel for the use of the pulse processing algorithm.
• Five logical cores per KC705 for the data compression algorithm [2].
Fig. 4 depicts the breakdown structure of the processing tasks identified to obtain the real-time measurement of the neutron emissivity profile [9], [10], which are as follows.
• Task 1: Acquire and process neutron detector pulses, building the necessary energy spectrum for neutrons and gammas.
• Task 2: Retrieve and calculate the necessary inputs from the plasma equilibrium data.
• Task 3: Calculate the neutron emissivity profile using the inputs from the previous tasks and running an inversion algorithm based on the Tikhonov regularization method [3] .
The identified tasks were then distributed in the real-time software architecture design, which parallelizes the algorithms by: 1) running the neutron count detection continuously within the control cycle time slots; 2) calculating the emissivity profile using the data acquired in the previous cycle and sending the data to the real-time control network within the same control cycle slot; and 3) performing the real-time measurement control cycle in 10 ms. Within this scheme, the tasks behave as follows.
• Task 1: Runs immediately after the acquisition starts, before any other task, for the complete period of a control cycle.
• Task 2a: Retrieves the plasma equilibrium data necessary to reconstruct the magnetic flux surfaces.
• Task 2b: Uses data retrieved in the previous control cycle and computes the needed inputs from the plasma equilibrium data.
• Task 3: Calculates the neutron emissivity profile using the neutron and magnetic flux data available from the previous cycle as well as calculations made in the present control cycle.
• The tasks run in parallel using different CPUs for improved performance and independent run-time processing.
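The cycle-by-cycle pipelining described above can be sketched as follows. The sketch only tracks which control cycle's data each task consumes; the function `schedule` and the dictionary keys are hypothetical names, and the actual execution of the tasks on separate cores is not modeled.

```python
CYCLE_MS = 10  # control cycle length stated in the requirements


def schedule(n_cycles: int):
    """List, for each control cycle, the cycle whose data each task works on."""
    plan = []
    for n in range(n_cycles):
        prev = n - 1 if n > 0 else None  # nothing to process in the first cycle
        plan.append({
            "cycle": n,
            "start_ms": n * CYCLE_MS,
            "task1_acquires": n,    # pulses acquired during the whole cycle n
            "task2a_retrieves": n,  # equilibrium data fetched during cycle n
            "task2b_uses": prev,    # inputs computed from data of cycle n-1
            "task3_uses": prev,     # emissivity profile for data of cycle n-1
        })
    return plan


for row in schedule(3):
    print(row)
```

Under these assumptions, the emissivity profile published during cycle n refers to the neutron and equilibrium data acquired during cycle n-1, so the measurement is delayed by one control cycle.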
A. Data Distribution
The host computer receives the pulse data using DMA channels programmed by the host and managed in the FPGA. The DMA transfer engine may generate hardware interrupts to signal the host that the data are available. The developed Linux PCIe driver makes the data received from the hardware available to the data distribution application by storing them in an internal memory. On top of the driver, a data distribution application prepares the data in a shared memory using a circular buffer. The data can be accessed by client applications using the data pointers managed by the distribution application, which logs any data lost between the client application and the distribution application.
Although the use of polling mechanisms in the driver is not the usual approach for incoming data with a variable event rate, the tests using the interrupt mechanism showed poor stability because of errors in the transmission of the interrupt packets at the maximum event rate. An architecture using a different device driver approach, with a polling mechanism and an internal data transmission correction algorithm, was therefore implemented to improve the performance and reliability. This enables the device driver to automatically check for and recover missing data blocks, so that data losses are recovered transparently for the client applications. For the polling mechanism, the synchronization is maintained using a status field transferred via DMA. The device driver implementation using both mechanisms is detailed in [11].
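The shared buffer with per-client read pointers and loss accounting can be sketched as below. This is a single-process illustration only: the classes `RingBuffer` and `Client` are hypothetical, the buffer is a plain Python list rather than real shared memory, and here the loss counter is kept by the client for simplicity, whereas the text assigns the logging to the distribution application.

```python
class RingBuffer:
    """Fixed-size circular buffer; the writer never blocks and may overwrite."""

    def __init__(self, n_slots: int):
        self.slots = [None] * n_slots
        self.n_slots = n_slots
        self.write_count = 0  # total number of blocks ever written

    def write(self, block):
        self.slots[self.write_count % self.n_slots] = block
        self.write_count += 1


class Client:
    """Reader with its own pointer; counts blocks overwritten before being read."""

    def __init__(self, ring: RingBuffer):
        self.ring = ring
        self.read_count = 0  # sequence number of the next block to read
        self.lost = 0

    def read_all(self):
        ring = self.ring
        oldest = max(0, ring.write_count - ring.n_slots)
        if self.read_count < oldest:  # the writer lapped this client
            self.lost += oldest - self.read_count
            self.read_count = oldest
        out = []
        while self.read_count < ring.write_count:
            out.append(ring.slots[self.read_count % ring.n_slots])
            self.read_count += 1
        return out


ring = RingBuffer(4)
client = Client(ring)
for block in range(3):
    ring.write(block)
first = client.read_all()    # client keeps up: nothing lost
for block in range(3, 10):
    ring.write(block)        # seven blocks written into four slots
second = client.read_all()   # the three oldest unread blocks were overwritten
print(first, second, client.lost)
```

Because the writer never waits for readers, a slow client cannot stall the acquisition; it only loses data, and the loss is counted rather than hidden.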
B. Data Compression
Besides the real-time measurements for control purposes, the RNC diagnostic data acquisition system is also required to send the pulse data to the ITER data archiving system at a sustained 2-Mevents/s peak event rate, producing up to 0.5 GB/s of data per channel.
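The figures quoted above imply the following per-event and per-cycle data volumes. This is a simple worked calculation, assuming 1 GB = 10^9 bytes; the variable names are illustrative only.

```python
EVENT_RATE = 2_000_000      # events/s per channel (peak rate from the requirements)
THROUGHPUT = 500_000_000    # bytes/s per channel (0.5 GB/s, assuming 1 GB = 1e9 B)
CYCLE_MS = 10               # control cycle in milliseconds

events_per_cycle = EVENT_RATE * CYCLE_MS // 1000   # events in one 10-ms cycle
bytes_per_event = THROUGHPUT // EVENT_RATE         # average bytes per event
bytes_per_cycle = THROUGHPUT * CYCLE_MS // 1000    # bytes produced per cycle

print(events_per_cycle, bytes_per_event, bytes_per_cycle)
```

At the peak rate, each channel therefore produces about 20 000 events and 5 MB of raw data per control cycle, or an average of about 250 bytes per event.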
To minimize the size of the data transferred to the data archiving system, a lossless compression algorithm was implemented, providing compression speeds high enough to comply with the system requirements. During the RNC diagnostic prototype phase, the LZ4 compression algorithm was tested with the data distributed across multiple processor cores, reaching up to 1.5 MB/s and 37% space saving [2]; these results are important for dimensioning the system in the final design phase.
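The idea of splitting the data into chunks that are compressed losslessly in parallel can be sketched as follows. As an assumption made to keep the example self-contained, `zlib` from the Python standard library stands in for LZ4, and a thread pool stands in for the dedicated cores; the function names and the chunk size are hypothetical, and the resulting figures say nothing about the performance reported in [2].

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_chunks(data: bytes, chunk_size: int, workers: int = 4):
    """Split `data` into chunks and compress each chunk independently."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Level 1 favors speed over ratio, in the spirit of a fast compressor.
        return list(pool.map(lambda c: zlib.compress(c, 1), chunks))


def decompress_chunks(blobs) -> bytes:
    """Restore the original byte stream from independently compressed chunks."""
    return b"".join(zlib.decompress(b) for b in blobs)


def space_saving(original: bytes, blobs) -> float:
    """Space saving = 1 - compressed size / original size."""
    return 1.0 - sum(len(b) for b in blobs) / len(original)


# Synthetic, highly regular test data (not representative of detector pulses).
raw = bytes((i * i) % 7 for i in range(100_000))
blobs = compress_chunks(raw, chunk_size=16_384)
restored = decompress_chunks(blobs)
saving = space_saving(raw, blobs)
print(len(blobs), restored == raw, round(saving, 3))
```

Compressing chunks independently is what allows the work to be spread over several cores, at the cost of a slightly lower compression ratio than compressing the stream as a whole.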
C. Pulse Processing
The pulse processing algorithm aims at delivering the neutron and gamma energy spectra by processing the pulse data acquired from the detectors for each LOS. Fig. 6 depicts the real-time pulse processing algorithm flowchart, including the main functions.
• Baseline evaluation.
• Saturation detection.
• Pile-up detection.
• Signal integration (energy calculation).
• Signal and particle separation.
• Construction of energy spectrum per particle.
For real-time energy calibration, the use of an LED of known energy is foreseen. Applying the LED correction at the end of the control cycle to calibrate the energy spectra in real time is under evaluation. In former offline pulse processing implementations, the LED correction was applied to each pulse individually; however, to improve the performance of the pulse processing algorithm, the LED correction may be applied only to the final spectra. This has implications for the resolution and may introduce a deviation from the real spectrum that must be validated. Another envisaged solution is to apply the correction to every pulse based on the LED energy detected in the previous control cycle. This would minimize these implications at the cost of a small increase in processing time. Moreover, the offset variation is stable enough to permit the correction to be made using the LED correction of the previous 10-ms cycle.
The number of pile-up and saturated signals detected is used as a correction factor for statistical spectrum count correction. The investigation of pile-up separation algorithms in the FPGA and in the host computer is foreseen for the future, but the present results with the statistical correction have been validated.
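The chain of functions listed above can be sketched for a single pulse window as follows. Several choices here are assumptions made for illustration and are not taken from the RNC implementation: the particle separation uses a tail-to-total charge ratio, pile-up is flagged by counting threshold crossings, and the statistical correction is a uniform scale factor. All names and default parameters are hypothetical.

```python
ADC_MAX = 4095  # full scale of a 12-bit ADC


def process_pulse(samples, n_base=4, threshold=50.0, tail_start=4, psd_cut=0.1):
    """Classify one pulse window; returns (label, energy or None)."""
    if max(samples) >= ADC_MAX:                      # saturation detection
        return "saturated", None
    baseline = sum(samples[:n_base]) / n_base        # baseline evaluation
    s = [x - baseline for x in samples]
    # Pile-up detection: more than one upward threshold crossing in the window.
    crossings = sum(1 for a, b in zip(s, s[1:]) if a < threshold <= b)
    if crossings == 0:
        return "empty", None
    if crossings > 1:
        return "pileup", None
    peak = s.index(max(s))
    total = sum(s[n_base:])                          # signal integration (energy)
    tail = sum(s[peak + tail_start:])                # charge in the pulse tail
    label = "neutron" if total > 0 and tail / total > psd_cut else "gamma"
    return label, total


def build_spectra(pulses, n_bins=8, e_max=4000.0):
    """Histogram accepted pulses per particle and apply a statistical correction."""
    spectra = {"neutron": [0] * n_bins, "gamma": [0] * n_bins}
    accepted = rejected = 0
    for p in pulses:
        label, energy = process_pulse(p)
        if label in spectra:
            b = max(0, min(int(energy / e_max * n_bins), n_bins - 1))
            spectra[label][b] += 1
            accepted += 1
        elif label in ("pileup", "saturated"):
            rejected += 1
    # Assumed correction: rejected pulses follow the accepted distribution.
    scale = (accepted + rejected) / accepted if accepted else 0.0
    corrected = {k: [c * scale for c in v] for k, v in spectra.items()}
    return spectra, corrected, rejected


gamma = [10, 10, 10, 10, 510, 260, 110, 40, 15, 10, 10, 10]        # fast decay
neutron = [10, 10, 10, 10, 510, 360, 260, 210, 160, 110, 60, 30]   # slow decay
pileup = [10, 10, 10, 10, 510, 110, 20, 410, 110, 20, 10, 10]      # two pulses
saturated = [10, 10, 10, 10, 4095, 4095, 2000, 500, 100, 10, 10, 10]

spectra, corrected, rejected = build_spectra([gamma, neutron, pileup, saturated])
print(spectra, rejected)
```

In this sketch the rejected pulses never enter the histograms; they only raise the scale factor applied to the final counts, which is one simple way to read the statistical correction described above.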
D. Neutron Emissivity Profile Algorithm
The ITER neutron emissivity profile is calculated using an inversion algorithm applying the Tikhonov regularization method [3]. Fig. 1 shows the RNC geometry, which is composed of: 1) lower and higher plasma inspection LOSs using in-port detectors and 2) center plasma inspection LOSs using ex-port detectors. The neutron emissivity profile is calculated at the intersection points of the different LOSs and the magnetic flux surfaces. Fig. 4 describes the algorithm used for the complete calculation. The RNC measurements (b_signal) and the L-matrix are obtained in tasks 1 and 2, whereas the regularization matrix (D) is precalculated based on the system geometry. In task 3, the inversion algorithm that obtains the 2-D neutron emissivity profile is performed [3].
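A generic Tikhonov-regularized inversion can be sketched in a few lines. The sketch solves the standard normal equations (L^T L + alpha D^T D) x = L^T b, which minimize ||L x - b||^2 + alpha ||D x||^2; the small matrices, the first-difference form of D, and the value of alpha are assumptions chosen for illustration and do not reproduce the geometry or regularization used in [3].

```python
def transpose(a):
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def solve(a, y):
    """Solve a x = y by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [y[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def tikhonov(L, b, D, alpha):
    """Return x minimizing ||L x - b||^2 + alpha * ||D x||^2."""
    Lt, Dt = transpose(L), transpose(D)
    LtL, DtD = matmul(Lt, L), matmul(Dt, D)
    n = len(LtL)
    A = [[LtL[i][j] + alpha * DtD[i][j] for j in range(n)] for i in range(n)]
    rhs = [sum(Lt[i][k] * b[k] for k in range(len(b))) for i in range(n)]
    return solve(A, rhs)


# Toy problem: 4 lines of sight crossing 3 flux surfaces (illustrative values).
L = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 1.0]]
x_true = [1.0, 2.0, 3.0]
b = [sum(row[j] * x_true[j] for j in range(3)) for row in L]
D = [[-1.0, 1.0, 0.0], [0.0, -1.0, 1.0]]  # assumed first-difference smoothing

x_exact = tikhonov(L, b, D, alpha=0.0)     # no regularization
x_smooth = tikhonov(L, b, D, alpha=1e3)    # strong smoothing
print(x_exact, x_smooth)
```

With alpha set to zero the noiseless toy data are reproduced exactly, while a large alpha flattens the profile; in practice alpha trades fidelity to the measurements against smoothness of the reconstruction.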
V. VALIDATION AND PERFORMANCE TESTS
This section details the results of the algorithm validation and performance tests for more critical algorithms and global prototype architecture.
A. Pulse Processing
To validate the pulse processing algorithm, the two-channel CAEN DT5800D generator was used. Each channel was programmed with an exponential-decay signal with a different decay time, aiming at emulating the neutron and gamma particle detector signals. A different spectrum was programmed in each channel output. Fig. 7 depicts the signal shapes and the corresponding spectra for neutron and gamma particles. Both signal generator output channels were mixed using a Mini-Circuits ZX10R-14-S+, which allows a signal bandwidth from dc to 10 GHz. The mixed signal is then fed into the system FMC analog input to be acquired and processed by both the algorithm in the FPGA firmware and the algorithm in the host PC.
The algorithm validation was made by comparing the known input spectra signals with the output spectra from the FPGA algorithm and host algorithm. Fig. 8 shows the spectra outputs for both algorithms that are in agreement with each other and moreover with the known inputs from the signal generator.
The signal generator was also programmed to feed the prototype system with different pulse event rates, aiming at measuring the performance of the host algorithm. The processing time for three different event rates (2.0, 1.5, and 1.0 Mevents/s), using one logical core with a 10-ms control cycle running the complete data analysis, was measured over several cycles. Fig. 9 depicts the performance results by plotting the processing times of the data acquired during 10 ms over several control cycles for 2.0, 1.5, and 1.0 Mevents/s. For 1.0 and 1.5 Mevents/s, the processing times were always under 10 ms, which is within the limit set by the specification for the system processing time. For 2.0 Mevents/s, however, all cycles needed more than 10 ms to process the data, which is out of the system specification. To bring the 2.0-Mevents/s processing time under 10 ms with the host algorithm, the implementation must be upgraded by allocating two logical cores for parallel pulse processing.
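The number of logical cores needed to fit the processing into one control cycle can be estimated with a short calculation. The sketch assumes ideal linear scaling with no synchronization overhead, and it uses the single-core processing time of 12 ms at 2.0 Mevents/s reported in the conclusions; the function name `cores_needed` is hypothetical.

```python
import math


def cores_needed(single_core_ms: float, cycle_ms: float = 10.0) -> int:
    """Minimum number of cores so that the per-core time fits in one cycle.

    Assumption: the work splits evenly and scales linearly with core count.
    """
    return math.ceil(single_core_ms / cycle_ms)


n_cores = cores_needed(12.0)          # 12 ms measured on one logical core
per_core_ms = 12.0 / n_cores          # ideal per-core time after splitting
per_event_us = 12.0 * 1000 / 20_000   # 20 000 events arrive in a 10-ms cycle
print(n_cores, per_core_ms, per_event_us)
```

Under this idealized assumption, two cores would leave each core about 6 ms of work per cycle, which is consistent with the two-core allocation proposed above; real scaling would have to be confirmed by measurement.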
B. Neutron Emissivity Profile Algorithm
The validation of the neutron emissivity reconstruction algorithm was performed to check whether the code runs and converges to the corresponding simulated emissivity for relevant ITER scenarios. Moreover, for the purpose of testing the performance and correctness of the algorithms, the input data for the computation were obtained from simulated data for the corresponding ITER scenarios.
The algorithms depicted in Fig. 4, namely task 2, the computation of the L-matrix using the equilibrium data, and task 3, the inversion algorithm using Tikhonov regularization and the calculation of the emissivity profile, were implemented using the Multithreaded Application Real-Time executor (MARTe) in two generic application modules (GAMs) [12]. Fig. 10 shows the processing times for the neutron emissivity profile for measurement data with different levels of random noise. The complete task always runs in under 1.5 ms, which is compliant with the system specification.
A cross check of the reconstructed emissivity against the corresponding simulated emissivity, for different levels of random noise (1%, 3%, and 12%) using 20 magnetic surfaces, was done for the relevant ITER scenarios. Fig. 11 depicts the results of this validation. The only noticeable difference found is for the 12% input error in the DD-LOW scenario.
VI. CONCLUSION
This section presents the most relevant achievements of the prototyping activity and the important conclusions that can be drawn for the improvement of the final system design.
A system architecture has been presented to help the final RNC system design and specifications. The system prototype has been used to:
• implement the most critical algorithms;
• validate the results of the algorithms;
• measure the performance of the system.
The performance tests provided valuable information to size the optimal system configuration in terms of processing needs and number of acquisition channels per CPU.
The inversion algorithm to calculate the neutron emissivity profile can run in less than 2 ms, which is fully compliant with the system specifications.
In contrast, the pulse processing algorithm runs in 12 ms using one CPU core at the peak event rate of 2.0 Mevents/s, which is not compliant with the system specifications. Nevertheless, the use of more CPU cores to parallelize the host algorithm is one solution that may be implemented in the current prototype to fulfill the system requirements. In addition, the FPGA pulse processing algorithm, which is already implemented and runs in compliance with the system timing specification, should be used to improve the performance of the system [1].
Currently, tests with the prototype system using two FPGA Xilinx modules in the same host, providing the host with four acquisition channels to acquire and process in real time, are ongoing.
ACKNOWLEDGMENT
This paper is dedicated to the memory of Prof. C. Correia, who is no longer among us.
