To achieve the aim of the international thermonuclear experimental reactor (ITER) radial neutron camera diagnostic, the data acquisition prototype must sustain a peak event rate of 2 MHz per channel. The data are acquired and processed using an IPFN FPGA Mezzanine Card (FMC-AD2-1600) with two digitizer channels of 12-bit resolution and a sampling rate up to 1.6 GSamples/s, mounted in a peripheral component interconnect express (PCIe) evaluation board from Xilinx (KC705) installed in the host PC. The acquired data in the event-based data path are streamed to the host through PCIe ×8 direct memory access with a maximum data throughput of ≈0.5 GB/s of raw data per channel (event based), ≈1 GB/s per digitizer, and up to 1.6 GB/s in continuous mode. The prototype architecture comprises a host PC with two KC705 modules and four channels, producing up to 2 GB/s in event mode and up to 3.2 GB/s in continuous mode.
I. INTRODUCTION
THE radial neutron camera (RNC) is a key international thermonuclear experimental reactor (ITER) diagnostic aiming at the real-time measurement of the neutron emissivity to characterize the neutron emission that will be produced by the ITER tokamak [1]-[6]. To achieve the aim of the RNC diagnostic, the data acquisition prototype must sustain a peak event rate of 2 MHz per channel. The data are acquired and processed using IPFN FPGA Mezzanine Cards (FMC-AD2-1600) with two digitizer channels of 12-bit resolution sampling up to 1.6 GSamples/s. These in-house developed cards are mounted in peripheral component interconnect express (PCIe) evaluation boards from Xilinx (KC705) [5].
The prototype architecture comprises one host PC with two installed KC705 modules and four channels, producing the expected 2 GB/s of data in event mode (0.5 GB/s per channel). Moreover, in the case of continuous acquisition due to a strong pileup effect, a maximum data throughput of 3.2 GB/s (0.8 GB/s per channel) can be achieved [5].
LZ4 is a lossless compression algorithm that provides a compression speed of 400 MB/s per core and scales with multicore CPUs. It appears to be the fastest compression algorithm with a relevant compression ratio compared with other dictionary-encoding and entropy-encoding algorithms [7], [8].
During the RNC diagnostic prototype phase, LZ4 was chosen to evaluate the feasibility of implementing real-time data compression in the host PC to reduce the data throughput delivered to the ITER archiving system [6].
LZ4 is also suitable for implementation in FPGAs [9], [10] or in graphics processing units (GPUs), like other lossless algorithms [11]-[14], which can be a valuable feature for future developments.
This paper presents the implemented solution and the achieved results, which contribute to the RNC diagnostic specification. A brief overview of the system and software architecture is provided in Section II. The preliminary results that contribute to the design of the implemented architecture are presented in Section III. Section IV presents the tests and results with the developed solution and selected compression algorithm. This paper ends with Section V, devoted to the conclusions and future work remarks.
0018-9499 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
II. SYSTEM ARCHITECTURE
Fig. 1 depicts the system architecture [15], highlighting the compression software path. The system was designed to support two IPFN FPGA Mezzanine Cards installed in the PCIe evaluation boards from Xilinx (KC705) and connected to the host through the PCIe ×8 slots. The data production using a configuration downsampled to 400 MSamples/s is up to 1 GB/s per board in event mode and can be increased to 1.6 GB/s per board in continuous mode [16], providing higher data throughput for stress tests.
The host computer hardware specification is as follows.
1) Motherboard: ASUS Rampage V Extreme with 4xPCIe 3.0/2.0 ×16 slots.
2) CPU: Intel Core i7-5930K at 3.50 GHz supporting Intel Hyperthreading Technology (6 cores, 12 threads).
3) Memory and storage: 64 GB of RAM and a 256-GB solid-state drive.
Scientific Linux 7 runs as the operating system with kernel 3.10-rt and LZ4 version 1.7.5, providing a test environment similar to the ITER control, data access and communication (CODAC) system. Furthermore, this distribution enables the test of future Red Hat Linux kernel versions.
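The continuous-mode rates quoted for the prototype can be cross-checked with a short calculation. This is a minimal sketch, assuming that each 12-bit sample is stored in a 16-bit (2-byte) word and that 1 GB means 10^9 bytes; neither assumption is stated in the text, and the function name is illustrative only.

```python
# Assumption (not stated in the text): each 12-bit sample occupies a 2-byte word.
BYTES_PER_SAMPLE = 2
SAMPLE_RATE_HZ = 400e6      # downsampled rate quoted in the text
CHANNELS_PER_BOARD = 2      # two digitizer channels per FMC card
BOARDS = 2                  # two KC705 modules in the host PC


def continuous_rate_gb_s(sample_rate_hz, channels, bytes_per_sample=BYTES_PER_SAMPLE):
    """Raw continuous-mode data rate in GB/s, taking 1 GB = 1e9 bytes."""
    return sample_rate_hz * bytes_per_sample * channels / 1e9


per_channel = continuous_rate_gb_s(SAMPLE_RATE_HZ, 1)
per_board = continuous_rate_gb_s(SAMPLE_RATE_HZ, CHANNELS_PER_BOARD)
system = per_board * BOARDS

print(f"per channel: {per_channel:.1f} GB/s")
print(f"per board:   {per_board:.1f} GB/s")
print(f"system:      {system:.1f} GB/s")
```

Under these assumptions the calculation gives 0.8 GB/s per channel, 1.6 GB/s per board, and 3.2 GB/s for the two-board system, which agrees with the continuous-mode figures given above.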
A Linux device driver interfaces the hardware with the high-level applications, supporting data transfers up to 1.6 GB/s per board. The control and data acquisition software includes a shared memory layer to distribute the acquired real-time data across the client applications when several clients need to use the data at the same time. Future improvements may take advantage of the shared memory mechanisms for interprocess communication provided in the ITER CODAC data archiving system [17]-[19].
At the application level, software modules were developed for:
1) real-time data compression to reduce the data size;
2) data pulse processing for energy and particle discrimination; and
3) real-time raw archiving for test purposes with low data rate acquisitions.
In the presented tests, the compression application was directly connected to the device driver to evaluate its performance limit.
The introduction of the shared memory layer should not affect performance since the approach to read the data from the shared memory must be, in the worst case, as fast as the direct reads from the device driver.
A. Software Architecture
Fig. 2 shows the compression application software architecture and its interface to the device driver. The compression application is based on the task farm approach, with a master thread that launches a pool of threads with a configured number of worker threads.
The device driver implements a kernel thread with an internal circular buffer that stores, in real time, the data transferred from the hardware until it is read by the consumer applications. Two pointers are implemented for the circular buffer (a read pointer and a write pointer) to control the read operations from the consumer applications and to check for data loss between the device driver and the consumer applications. In addition, the device driver implements algorithms to check for data loss between the hardware and the device driver.
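The read/write pointer scheme can be illustrated with a small user-space model. This is a hedged sketch and not the driver code: the class name, the byte-wise copy, and the policy of overwriting the oldest unread data on overrun are all assumptions made for illustration.

```python
class CircularBuffer:
    """Byte ring buffer with monotonically increasing read and write counters."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = bytearray(capacity)
        self.write_ptr = 0   # total bytes written by the producer
        self.read_ptr = 0    # total bytes consumed by the reader
        self.lost_bytes = 0  # bytes overwritten before being read

    def write(self, data):
        """Producer side: always accept data, overwriting the oldest bytes."""
        for b in data:
            self.buf[self.write_ptr % self.capacity] = b
            self.write_ptr += 1
        # If the writer got more than one full buffer ahead, data was lost.
        overrun = self.write_ptr - self.read_ptr - self.capacity
        if overrun > 0:
            self.lost_bytes += overrun
            self.read_ptr += overrun  # skip the overwritten region

    def read(self, max_bytes):
        """Consumer side: return up to max_bytes of unread data, oldest first."""
        n = min(max_bytes, self.write_ptr - self.read_ptr)
        out = bytes(self.buf[(self.read_ptr + i) % self.capacity] for i in range(n))
        self.read_ptr += n
        return out
```

Comparing the distance between the two counters with the buffer capacity is what allows the loss check: as long as the reader keeps up, `lost_bytes` stays at zero.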
The master thread reads the available data from the device driver in real time, packs it into data blocks with configurable data size, and distributes it across the configured number of worker threads. Each data block is tagged with an id to be used by the worker threads to store the compressed data in the correct position of a shared buffer between worker threads. The work balance algorithm distributes the next block to be compressed to the first available core in the pool but other options can be used.
The worker threads implement the compression algorithm and are responsible for the parallel compression in real time. In the present tests, the worker threads store the compressed data in a memory buffer but can deliver it through the network directly to the data archiver.
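The master/worker scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `zlib` at level 1 from the Python standard library stands in for LZ4, and the names `split_blocks` and `task_farm` are hypothetical. The executor hands each queued block to the first idle worker, which mirrors the "first available" balancing policy.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_blocks(data, block_size):
    """Master side: pack the input into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def task_farm(data, block_size, n_workers):
    """Compress blocks in parallel; block ids keep the output in input order."""
    blocks = split_blocks(data, block_size)
    results = [None] * len(blocks)  # shared buffer indexed by block id

    def worker(block_id, block):
        # Stand-in compressor (assumption): zlib level 1 instead of LZ4.
        results[block_id] = zlib.compress(block, 1)

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(worker, i, b) for i, b in enumerate(blocks)]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return results


if __name__ == "__main__":
    payload = bytes(range(256)) * 1000
    compressed = task_farm(payload, block_size=10_000, n_workers=4)
    restored = b"".join(zlib.decompress(c) for c in compressed)
    print(len(compressed), restored == payload)
```

Because each worker writes only to the slot given by its block id, the compressed blocks end up in acquisition order regardless of which worker finishes first.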
The device driver, master thread, and worker threads run in isolated cores (detached from the kernel scheduling, preventing its usage by the operating system), taking advantage of the CPU affinity feature, which is the ability to direct a specific task, or process, to use a specified core.
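CPU affinity can be set from user space on Linux. The sketch below uses the standard-library calls `os.sched_getaffinity` and `os.sched_setaffinity`, which are available only on Linux; the helper name `pin_to_core` is hypothetical. Isolating cores from the kernel scheduler is a separate system-level configuration and is not covered here.

```python
import os


def pin_to_core(core):
    """Restrict the caller to a single core and return the previous CPU set.

    Linux only: relies on os.sched_getaffinity / os.sched_setaffinity.
    Threads created afterwards inherit the affinity of their creator.
    """
    previous = os.sched_getaffinity(0)
    os.sched_setaffinity(0, {core})
    return previous


if __name__ == "__main__" and hasattr(os, "sched_setaffinity"):
    core = min(os.sched_getaffinity(0))   # pick any core we are allowed to use
    previous = pin_to_core(core)
    print("pinned to", os.sched_getaffinity(0))
    os.sched_setaffinity(0, previous)     # restore the original CPU set
```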
III. PRELIMINARY TESTS
To identify the compression speed, compression ratio, and space saving achieved with different configurations of the LZ4 algorithm, several tests were done using LZ4 default and LZ4 high compression (HC). In LZ4 default, an acceleration option can be configured to obtain a better compression speed at the expense of the compression ratio. In LZ4 HC, an HC derivative of LZ4, the compression level can be configured to improve the compression ratio at the expense of the compression speed.
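The three figures of merit can be written down explicitly. The text does not give their formulas, so the definitions below are the commonly used ones and are stated as an assumption; the function names and the use of 1 MB = 2^20 bytes are illustrative.

```python
def compression_ratio(original_bytes, compressed_bytes):
    """Uncompressed size divided by compressed size (larger is better)."""
    return original_bytes / compressed_bytes


def space_saving(original_bytes, compressed_bytes):
    """Fraction of the original size removed by compression (0 to 1)."""
    return 1.0 - compressed_bytes / original_bytes


def compression_speed_mb_s(original_bytes, seconds):
    """Uncompressed input consumed per second, in MB/s (1 MB = 2**20 bytes)."""
    return original_bytes / 2**20 / seconds


if __name__ == "__main__":
    # Illustrative numbers only: a 10 MB block compressed to 7.5 MB in 25 ms.
    original, compressed = 10 * 2**20, int(7.5 * 2**20)
    print(compression_ratio(original, compressed))
    print(space_saving(original, compressed))
    print(compression_speed_mb_s(original, 0.025))
```

With these definitions, a space saving of 25% corresponds to a compression ratio of about 1.33.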
The input data for these tests were collected with real radiation sources at the Frascati neutron generator (FNG) during the tests in January 2018 and, for comparison, from a waveform generator simulating a gamma-ray-type signal.
Table I presents the LZ4 default compression tests comparing the different sources and acceleration factors. The results suggest that the even acceleration factors (2, 4, 6, 8, 10, and 12) offer a better tradeoff between compression speed and compression ratio. However, acceleration factor 1 gives the best compression ratio.
Table II presents the LZ4 HC tests comparing the different sources and compression levels. With a data throughput of 256 MB/s and up to eight CPU cores running in parallel, only the first three compression levels can be used without missing data. The results show a better space saving than LZ4 default, but at a significant cost in compression speed.
Table III presents the theoretical number of cores needed to compress a data throughput of 1 GB/s in real time with different LZ4 default acceleration factors and different LZ4 HC compression levels. Using the LZ4 default algorithm, there is no difference in the number of needed cores between acceleration factors 2 and 5, but the space saving is reduced by ≈5%. With acceleration factor 1, the space saving increases by ≈2%, but one more core is needed.
The LZ4 HC variant is not feasible because it needs a minimum of 13 available cores (10 more than LZ4 default) to improve the space saving by ≈6%.
The tests also confirm that the compression speed and ratio are similar for the signals acquired in the real environment and the signals from the waveform generator.
IV. TESTS AND RESULTS
The tests were based on a pulse-type signal from a waveform generator simulating a gamma-ray distribution. The acquisition tests use different pulsewidth configurations to produce distinct acquisition data rates up to 1.5 GB/s. Each acquisition test lasts 60 min, in agreement with the ITER long pulse acquisitions. All tests use a data block size of 10 MB for compression, except the tests that verify the impact of other block sizes.
To compress data with maximum compression ratio available, the LZ4 default algorithm with acceleration factor 1 was selected.
A. Number of Cores
Table IV presents the relationship between the data loss and the number of cores for different data acquisition rates.
Based on the results, the minimum number of needed cores per data acquisition rate with no data loss is as follows.
1) 1 core to compress up to 256 MB/s.
2) 2 cores for 512 MB/s.
3) 3 cores for 768 MB/s.
4) 4 cores for 1 GB/s.
5) 6 cores for 1.5 GB/s.
These results are consistent with the preliminary results for acceleration factor 1, which showed a compression speed of ≈300 MB/s per core.
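The consistency between the measured core counts and the ≈300 MB/s per-core speed can be checked with a ceiling division. This is a rough estimate, assuming that 1 GB/s and 1.5 GB/s correspond to 1024 MB/s and 1536 MB/s (in line with the 1024 MB/s rate quoted for the CPU usage tests) and that throughput scales linearly with the number of cores; `cores_needed` is an illustrative name.

```python
import math

PER_CORE_MB_S = 300  # approximate per-core speed quoted for acceleration factor 1


def cores_needed(rate_mb_s, per_core_mb_s=PER_CORE_MB_S):
    """Smallest number of worker cores whose combined speed covers the rate."""
    return math.ceil(rate_mb_s / per_core_mb_s)


# Assumption: 1 GB/s = 1024 MB/s and 1.5 GB/s = 1536 MB/s.
rates_mb_s = [256, 512, 768, 1024, 1536]
estimate = {rate: cores_needed(rate) for rate in rates_mb_s}
print(estimate)
```

Under these assumptions the estimate is 1, 2, 3, 4, and 6 cores, which matches the measured minimum for every tested rate.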
B. Core Usage
The CPU used supports Intel hyperthreading technology, providing 12 logical cores to the operating system based on its six physical cores. This architecture can result in slight differences compared with tests on 12 dedicated cores, which can produce small improvements.
Table V presents the average usage of the cores dedicated to the master and worker threads for each acquisition data rate.
TABLE V: AVERAGE CORE USAGE
TABLE VI: COMPRESSION STATISTICS
TABLE VII: BLOCK SIZES COMPARISON
C. Compression Statistics
Table VI summarizes the compression results of the tests with different pulsewidths.
D. Block Sizes
Table VII presents the test of different data block sizes with the same input signal and configuration (one channel with pulsewidth 128 and an acquisition data rate of 512 MB/s).
The results suggest that the data block size does not improve the space saving; however, it reduces the standard deviation of the compression speed during the pulse.
E. CPU and Memory Usage
Fig. 3 shows the CPU and memory usage during three acquisitions from one board, each lasting 10 s, with an acquisition data rate of 1024 MB/s. There are two dedicated cores for the operating system, one isolated core for the device driver thread, one isolated core for the compression master thread, and four isolated cores for the compression worker threads. The memory allocated to the data buffers at run time is around 6 GB.
Fig. 4 shows the CPU and memory usage during three acquisitions from two boards simultaneously, each lasting 10 s, with a data acquisition rate of 512 MB/s per board. There are two dedicated cores for the operating system, and each board uses five isolated cores (one for the device driver thread, one for the compression master thread, and three for the worker threads). The memory allocated to the data buffers at run time is around 10 GB.
F. Relationship Between Space Saving and Pulsewidth
Fig. 5 depicts the relationship between space saving and pulsewidth for signals using one and two analog-to-digital converters (ADCs).
Independently of the number of ADCs, a relation between space saving and pulsewidth can be identified. With a greater pulsewidth, the relative space saving increases, which may be related to the type of acquired data.
V. CONCLUSION
This contribution evaluates the feasibility of data compression implementation in the host PC and contributes to the RNC diagnostic specification.
The presented architecture is scalable and adjustable. The number of worker threads can be configured to comply with different algorithms and data throughput.
The stress tests show a stable solution during 60-min acquisitions with data acquisition rates up to 1.5 GB/s, using a maximum of six worker threads in parallel.
The system was also tested on Fedora Linux 27 with kernel 4.16. Fedora is the community version of Red Hat Linux, which supports the ITER CODAC system. The results were similar, which validates the developed architecture for future kernel versions of Red Hat Linux.
Based on the presented tests, to compress 1 GB/s from one board in real time, a minimum of five cores are needed (one master and four worker threads). Using two boards simultaneously to acquire 1 GB/s in each, the system will need 14 cores (two cores for the operating system, two cores for the device driver, two cores for the master thread, and eight cores for the worker threads). There are several commercial CPUs that support this architecture.
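The core budget above follows a simple pattern: a fixed number of cores for the operating system, plus one device driver core, one master core, and the worker cores for each board. A minimal sketch of that bookkeeping follows; the function name and the default of two operating system cores are taken from the tested configurations, not from a stated formula.

```python
def total_cores(boards, workers_per_board, os_cores=2):
    """Cores for the OS plus, per board, one driver, one master, and the workers."""
    per_board = 1 + 1 + workers_per_board
    return os_cores + boards * per_board


print(total_cores(boards=2, workers_per_board=4))  # two boards at 1 GB/s each
print(total_cores(boards=1, workers_per_board=4))  # one-board test at 1024 MB/s
print(total_cores(boards=2, workers_per_board=3))  # two-board test at 512 MB/s per board
```

The first call reproduces the 14 cores stated above. The other two give 8 and 12 cores for the one-board and two-board CPU usage tests, the latter using all 12 logical cores of the host CPU.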
The preliminary tests with two boards acquiring simultaneously showed a possible performance decrease. Acquiring 512 MB/s per board, the system needs three worker cores per board, instead of two with a single board. This may be related to the use of Intel hyperthreading technology instead of dedicated cores; intensive tests with two hardware modules acquiring simultaneously are scheduled as a future task.
With the tested signals, the maximum achieved space saving with the LZ4 algorithm was between 25% and 40%. Changes in the signal configuration can influence the compression ratio.
In the future, the data compression can be implemented in the GPU or FPGA to compare the results with the host PC. There are some possible advantages to be tested, but the FPGA implementation needs a new data path to the host, since the processing algorithms need decompressed data in real time. This increases the data transfer throughput between the FPGA and the host PC, which can be a demanding task for the Linux device driver.
ACKNOWLEDGMENT
This paper is in memory of Prof. C. Correia who is no longer among us.
This publication reflects the views only of the author, and Fusion for Energy cannot be held responsible for any use which may be made of the information contained therein.
