Exploiting graphic processing units parallelism to improve intelligent data acquisition system performance in JET's correlation reflectometer by Nieto, J. et al.
Exploiting Graphic Processing Units Parallelism to 
Improve Intelligent Data Acquisition System 
Performance in JET's Correlation Reflectometer 
J. Nieto, G. de Arcas, J. Vega, M. Ruiz, J.M. López, E. Barrera, A. Murari, A. Fonseca and JET EFDA contributors 
Abstract-The performance of intelligent data acquisition 
systems relies heavily on their processing capabilities and local 
bus bandwidth, especially in applications with high sample rates 
or a high number of channels. This is the case for the self-adaptive 
sampling rate data acquisition system installed as a pilot 
experiment in the KG8B correlation reflectometer at JET. The 
system, which is based on the ITMS platform, continuously 
adapts the sample rate during the acquisition depending on the 
signal bandwidth. To do so, it must transfer the acquired 
data to a memory buffer in the host processor and run 
computationally heavy algorithms on each data block. The processing 
capabilities of the host CPU and the bandwidth of the PXI bus 
limit the maximum sample rate that can be achieved, and therefore 
the maximum bandwidth of the phenomena that can be 
studied. 
Graphic processing units (GPUs) are becoming an alternative 
for speeding up compute-intensive kernels of scientific, imaging 
and simulation applications. However, integrating this 
technology into data acquisition systems is not a straightforward 
step, not to mention exploiting their parallelism efficiently. 
This paper discusses the use of GPUs with new high speed data 
bus interfaces to improve the performance of the self adaptive 
sampling rate data acquisition system installed on JET. 
Integration issues are discussed and performance evaluations are 
presented. 
I. INTRODUCTION 
The pursuit of higher-performance plasma scenarios in reactor-class devices, with high energy content and long 
pulses, has motivated the measurement of more quantities, 
with higher spatial and time resolution, leading to an 
exponential increase in the amount of data to be analyzed, 
which cannot be handled with traditional methods [1]. The 
traditional paradigm of analyzing the data after the shot must 
be abandoned, and approaches such as the one proposed 
through the Intelligent Test and Measurement System (ITMS) 
platform [2] must be adopted to provide more information 
during the shot. The pilot data acquisition system developed 
for JET's KG8B correlation reflectometer [3] is an example of 
such a system, and it is being used to analyze potential issues and 
candidate solutions of this new approach. The system 
implements a self-adaptive sampling data acquisition 
mechanism: data are acquired with a variable 
sampling rate that is continuously adapted to the 
bandwidth of the input signals, thereby optimizing the 
volume of data generated without losing any information. To 
achieve this goal the system must process a 
considerable amount of data in real time to compute the 
bandwidth of the input signals. This is a typical 
problem of these systems, and it leads to the need for 
multi-parallel, scalable processing capabilities in data acquisition 
architectures. ITMS provides a framework to develop such 
applications with high-level tools, enabling the programmer to 
use several processing units (CPUs) or even FPGA-based 
cards from high-level programming languages such as 
LabVIEW [4]. In this paper, the possible use of GPUs in this 
architecture is analyzed to determine implementation issues 
and their advantages. 
II. SYSTEM DESCRIPTION 
The system has been developed with a commercial 
Hewlett-Packard Z600 workstation that hosts two 
quad-core Xeon X5550 processors at 2.66 GHz with 4 GB of 
DDR3 RAM, and an NVIDIA Tesla C1060 processor board 
with 4 GB of GDDR3 RAM and 240 stream processors 
running at 1.3 GHz. The system setup was chosen to maximize 
the data transfer rate between both processors by using a 
PCI Express Gen 2 bus with 16 lanes, which allows a 
maximum data transfer rate of 4 GB/s in each direction. The 
software running in the host processor has been developed 
with National Instruments' LabVIEW 8.6.1 under the 32-bit 
Microsoft Windows XP operating system. The software 
running in the GPU (Tesla board) has been developed using 
the drivers and the runtime system of NVIDIA's Compute 
Unified Device Architecture (CUDA) library v2.3, and 
Microsoft Visual Studio 2008. 
III. DEVELOPMENT METHODOLOGY 
The main goal of integrating GPUs into the ITMS platform is 
to evaluate their performance for processing data in real time 
and, at the same time, to analyze the development process in 
order to determine whether they can be used from high-level 
applications. Given the internal 
architecture of GPUs, the algorithms used for the 
processing must have a high degree of parallelization in order 
to achieve the first goal. This parallelism can be 
inherent to the algorithm, due to its nature (e.g. the Fast Fourier 
Transform), or it can be forced by applying techniques such 
as data parallelization. Therefore, the 
NOCOFE algorithm used for bandwidth estimation in KG8B’s 
adaptive sampling rate DAQ system was chosen as a good 
example of a general signal processing algorithm, as it contains 
both natively parallel (FFT) and non-natively parallel (filters, 
searches) routines. 
Regarding the development methodology, LabVIEW was 
chosen as the development environment, as it is the basis of the 
ITMS architecture. The process relies on the 
LabVIEW GPU Computing Toolkit, which allows the 
CUDA library to be used from LabVIEW. This toolkit consists of a set of 
functions, named VIs and shown in Fig. 1, which make 
low-level calls to the CUDA runtime to use the hardware 
resources. 
Fig. 1. LabVIEW GPU Computing Toolkit Palette. 
It is important to note that the code running in the 
GPU is called asynchronously by threads running on the host 
under the LabVIEW environment. It is therefore possible to 
run the code in the GPU and the code in 
the host in parallel, but synchronization mechanisms must be taken into 
account in order to develop a consistent application. 
The workflow for executing an application in the GPU 
from LabVIEW is as follows: 
• Create a context in a CUDA-enabled device. 
• Determine which memory resources the GPU will 
need and reserve them from the host. 
• Transfer the data from the host memory to the GPU 
memory. 
• Launch the CUDA function to run in the GPU. This 
must be embedded in a DLL containing calls to the 
kernel. 
• Transfer the results from the GPU memory to the 
host memory. 
• Free memory resources from the GPU and close the 
context. 
The LabVIEW code implementing the abovementioned 
process is shown in Fig. 2. 
Fig. 2. LabVIEW G-code. 
All function calls are performed synchronously, except the call to 
the processing routine in the GPU. This is handled by 
the function that moves the data back to the host memory, which 
waits until the operation in the GPU is completed. To 
obtain the expected performance, data must be transferred 
between the host and the GPU memory at high speed. In the 
system tested this is guaranteed by the PCI Express bus, 
which provides sustained data transfer rates of 4 GB/s. 
The difficult part of the process is porting the 
algorithm to GPU code, as this code cannot be written 
in high-level programming languages such as LabVIEW. 
To implement the NOCOFE algorithm, the tasks shown 
in Fig. 3 must be implemented. The code must be written in C 
and encapsulated in a DLL that will be called from 
LabVIEW. When writing the code, it is important to look for 
the highest degree of parallelization, so each operation must 
be analyzed in detail. The following subsections describe how 
each operation of the NOCOFE algorithm has been 
implemented to exploit the processing capabilities of the GPU. 
Fig. 3. Tasks in the DLL. 
A. Power Spectrum 
The first step in computing the power spectrum is the Fast 
Fourier Transform. Since the FFT has a high inherent degree 
of parallelization, its implementation has been straightforward, 
as NVIDIA provides an optimized implementation for its 
CUDA-enabled devices through a library named cuFFT. The 
function implements a parallel algorithm and also allows for 
data parallelization by passing the input data in an appropriate 
way. Then, the single-sided power spectrum must be computed 
according to the definition: 
$$S_{xx} = \frac{F\{x\} \, F^{*}\{x\}}{n^{2}}$$
where F{x} is the FFT of the signal, F*{x} its conjugate, and n the 
number of elements in x. This operation must be repeated for 
each output value of the FFT, as the cuFFT library only returns 
the spectrum components between 0 and π. 
To complete this calculation, a kernel function that maximizes 
the use of the Tesla processor board has been 
developed. Parallelization has been achieved by developing a 
kernel function that implements the operation for one 
sample, and launching this function in parallel as many times 
as there are input samples. The kernel function computes 
the sum of the squared real and imaginary parts of 
each spectral component, divided by the square of the number 
of components. This function is called from the DLL 
in the host processor by launching as many threads as 
there are FFT output values. 
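The per-sample operation described above can be sketched in plain C. This is an illustrative stand-in and not the authors' kernel: the type `complex_f` and the function names are hypothetical, and the loop in `power_spectrum` serially emulates what the GPU does by running one thread per FFT output value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical complex sample layout (real and imaginary parts). */
typedef struct { float re; float im; } complex_f;

/* Work done by one "thread": one spectral component.
 * On the GPU, idx would be derived from the block and thread indices. */
static void power_spectrum_element(const complex_f *fft_out, float *power,
                                   size_t n, size_t idx)
{
    if (idx >= n) return;                 /* guard for surplus threads */
    float re = fft_out[idx].re;
    float im = fft_out[idx].im;
    /* F{x} * conj(F{x}) = re^2 + im^2, divided by n^2 */
    power[idx] = (re * re + im * im) / ((float)n * (float)n);
}

/* Serial stand-in for launching one thread per FFT output value. */
void power_spectrum(const complex_f *fft_out, float *power, size_t n)
{
    assert(fft_out != NULL && power != NULL && n > 0);
    for (size_t idx = 0; idx < n; ++idx)
        power_spectrum_element(fft_out, power, n, idx);
}
```

Because each output element depends only on the input element with the same index, no communication between threads is needed, which is what makes this step map directly onto the GPU.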
B. Normalize-DB Scale 
This task consists of normalizing each sample of the power 
spectrum calculated above and converting it to decibels (dB). 
As there is no optimized function available from NVIDIA for 
this operation, a kernel function has been developed using the 
same strategy as in the previous task. Fig. 4 shows the kernel 
function (GPU) code and how this function is called in parallel 
from the DLL in the host (CPU), as many times as the number 
of input values to analyze. The expression 
<<<Blocks,ThreadsPerBlocks>>> launches Blocks × 
ThreadsPerBlocks threads, each of them running the 
kernel function called. The parameter ThreadsPerBlocks is 
chosen depending on the maximum number of threads that 
can run in one block on the GPU (e.g. 512 in this case), and the 
parameter Blocks is chosen depending on the number of input 
samples. A synchronization call (cudaThreadSynchronize() in 
Fig. 4) must then be issued so that all threads finish before 
proceeding to the next step. Regarding the 
implementation of the kernel function, a second degree of 
optimization is achieved by using the primitive functions 
included in the NVIDIA library. 
normalizespectrum_H<<<Blocks, ThreadsPerBlocks>>>(Results, r…
cudaThreadSynchronize();
Desviation_H<<<Blocks_d, ThreadsPerBlocks_d>>>(Data_Out, Results);
cudaThreadSynchronize();
Fig. 4. C code running on the GPU and the CPU. 
C. Filter 
The type of filter chosen has a direct impact on the 
performance of the algorithm, as some implementations 
parallelize poorly whereas others parallelize well. It 
is important to test several alternatives to find the best 
compromise between functionality and performance. In this 
case the filter is used to smooth the output of the Power 
Spectrum function so that the search process that comes afterwards 
produces more consistent results when analyzing similar 
power spectra. Therefore an FIR filter with 30 taps has been 
chosen. FIR filters have a high degree of parallelization since 
their output values depend only on the input values and 
the coefficients. This is not the case for IIR filters, where 
several previous input and output values must be 
accumulated in order to compute each output value. Again there is 
no optimized function in the NVIDIA libraries to perform this 
operation, so it has been implemented in C following the same 
strategy as before. Parallelization has been achieved by 
developing a kernel function that implements the filter 
equation, and launching one thread per output sample to call 
that function in parallel as many times as needed. 
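A minimal C sketch of the one-thread-per-output-sample idea follows. It is not the authors' implementation: the function names are hypothetical, the filter is assumed to be causal, and samples before the start of the block are assumed to be zero. The outer loop stands in for the parallel launch.

```c
#include <assert.h>
#include <stddef.h>

/* One output sample of a causal FIR filter:
 *   y[i] = sum over k of h[k] * x[i - k],  k = 0 .. taps-1
 * Samples before the start of the block are treated as zero. */
static float fir_output_sample(const float *x, const float *h,
                               size_t taps, size_t i)
{
    float acc = 0.0f;
    for (size_t k = 0; k < taps && k <= i; ++k)
        acc += h[k] * x[i - k];
    return acc;
}

/* Serial stand-in for launching one thread per output sample. */
void fir_filter(const float *x, float *y, size_t n,
                const float *h, size_t taps)
{
    assert(x != NULL && y != NULL && h != NULL && taps > 0);
    for (size_t i = 0; i < n; ++i)
        y[i] = fir_output_sample(x, h, taps, i);
}
```

Each output reads only inputs and coefficients, so all outputs can be computed independently. An IIR filter would also read earlier outputs, which forces the samples to be computed in order.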
D. Search 
In order to compute the bandwidth of the signal, the 
corner frequency, i.e. the point where the filtered power 
spectrum meets the noise level, must be computed. This is done in two steps. 
First, the noise level is obtained as the mean plus the standard 
deviation of the last 20% of the power spectrum. Both 
operations have been implemented as kernel functions, and 
parallelization has been achieved by dividing the input data 
block into several sub-blocks that are processed in parallel. 
Then, the corner frequency is obtained by searching the 
filtered power spectrum array starting at its maximum value. 
Searching is a sequential operation, so it is not possible to 
parallelize it directly. Therefore, the only possibility is data 
parallelization (e.g. running one search per acquired input 
channel). 
E. Decimate 
The last step is to decimate the input signals depending on the 
maximum bandwidth value of the acquired channels. This 
function has been fully parallelized by following the same 
approach as in the previous examples. 
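As an illustration of why decimation parallelizes fully, the C sketch below keeps every factor-th sample, with each output index computed independently. The rule used to derive the factor from the bandwidth, including the `margin` parameter, is an assumption of this sketch and is not taken from the paper. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Largest integer factor that keeps the decimated sample rate at or above
 * margin * bandwidth. margin = 2 is the Nyquist limit. Both the rule and
 * the margin are assumptions of this sketch. */
size_t decimation_factor(double sample_rate_hz, double bandwidth_hz,
                         double margin)
{
    assert(sample_rate_hz > 0.0 && bandwidth_hz > 0.0 && margin >= 2.0);
    double ratio = sample_rate_hz / (margin * bandwidth_hz);
    return ratio < 1.0 ? 1 : (size_t)ratio;   /* truncation rounds down */
}

/* Keep every factor-th sample. Each output index is independent, so on the
 * GPU one thread per output sample could do this work.
 * Returns the number of output samples written. */
size_t decimate(const float *in, size_t n, size_t factor, float *out)
{
    assert(in != NULL && out != NULL && factor > 0);
    size_t n_out = (n + factor - 1) / factor;
    for (size_t j = 0; j < n_out; ++j)
        out[j] = in[j * factor];
    return n_out;
}
```

For example, a channel sampled at 2 MS/s with an estimated bandwidth of 100 kHz and a margin of 2.5 gives a factor of 8.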
IV. EXPERIMENTAL RESULTS 
In order to evaluate the performance of the GPU, execution 
times must be measured. There are two possibilities: to use the 
clock of the CPU, which has a resolution of 
milliseconds; or to use the timers in the GPU, which have a 
resolution of half a microsecond. In the first case the time 
measurements are taken from LabVIEW; in the second they 
are taken from the DLL by using calls to primitive functions 
from the CUDA library. The latter has been chosen as it has a 
much better resolution. 
First, the algorithm execution time has been measured for 
different input block sizes. The execution time of each step of 
the algorithm has been determined in order to locate the weakest 
points of the process. Table I shows these results, in 
milliseconds, for each step of the algorithm depending on the 
data block size used to acquire each channel. The Search 
process is computationally the most expensive because it has the 
lowest parallelization level. 
Further kernel launches in the DLL host code (cf. Fig. 4): 
MaximumValue_phase¿_H<<<Blocks_b, ThreadsPerBlocks_b>>>(maximum, index);
cudaThreadSynchronize();
firfilter_H<<<Blocks_f, ThreadsPerBlocks_p>>>(Results, Data_Out);
cudaThreadSynchronize();
Mean_H<<<Blocks_m, ThreadsPerBlocks_m>>>(Data_Out, means);
cudaThreadSynchronize();
TABLE I. EXECUTION TIME IN MS FOR EACH TASK 
Data block   Power      Normalize   Filter   Search   Decimate 
size/ch      Spectrum   dB scale 
16Ks         0.936      0.615       0.110    1.988    0.13 
32Ks         0.945      0.677       0.176    3.438    0.24 
64Ks         0.958      0.776       0.320    6.227    1.20 
128Ks        1.196      0.974       0.620    11.715   2.39 
It is also clear from the results that a task parallelized by the 
user performs worse than one for which 
an optimized library is available: the execution time of all the 
manually parallelized tasks depends 
more strongly on the data block size than that of the FFT. 
This is shown for clarity in Table II. However, one must also 
consider that the FFT is the task with the highest native 
parallelization degree, so a direct comparison 
with the rest of the functions is not fair. Further efforts could 
be made to improve the parallelization of the other tasks, such 
as tuning the allocation of the data in the GPU 
among the different types of memory defined in the CUDA 
memory model, but this was not the objective of this work. 
TABLE II. INCREASE IN EXECUTION TIME DEPENDING ON THE DATA BLOCK SIZE 
Data block   Power      Normalize   Filter   Search   Decimate 
size/ch      Spectrum   dB scale 
16Ks         -          -           -        -        - 
32Ks         1%         10%         60%      73%      85% 
64Ks         2%         26%         191%     213%     823% 
128Ks        28%        58%         464%     489%     1738% 
In order to analyze the performance improvement obtained by 
using GPUs, a version of the same algorithm was also 
implemented in LabVIEW and run in the host processor. 
Table III shows the system improvement obtained depending 
on the data block size of each acquired channel. It must be 
considered that the execution time of the GPU includes the 
data transfer between the host and the CUDA-enabled device 
memory, as data acquisition control is still reserved to the host 
processor. Therefore the possible benefits of the GPU depend to 
some extent on the availability of high-speed data buses. 
TABLE III. COMPARED PERFORMANCE OF THE GPU AND THE HOST 
Data block   LabVIEW           GPU exec.   Performance 
size/ch      exec. time (ms)   time (ms)   improvement 
16Ks         6                 4           33.3% 
32Ks         14                6           57.1% 
64Ks         26                10          61.5% 
128Ks        51                18          64.7% 
The processing time of a data block limits the maximum 
sample rate that can be used to acquire that data in real time. 
Table IV shows the improvement obtained by using the GPUs 
in terms of the maximum sample rate that can be achieved. 
This is a crucial parameter of the system, as it limits the spectral 
bandwidth of the system, and therefore the sources of 
information that can be analyzed. 
It should be remembered that the data block size, together with the 
sample rate, determines the available processing time and the 
time resolution of the system. Therefore, small block sizes are 
interesting from the point of view of the time resolution, while 
bigger ones might provide better performance, as is the case with 
the GPU. 
TABLE IV. MAXIMUM SAMPLE RATES (MS/S) 
Data block   LabVIEW   GPU (data transfer   Improvement 
size/ch                + processing) 
16Ks         2.73      4.09                 50.0% 
32Ks         2.34      5.46                 133.3% 
64Ks         2.52      6.55                 160.0% 
128Ks        2.57      7.28                 183.3% 
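The figures in Tables III and IV are consistent with simple arithmetic on the execution times, assuming that 1 Ks means 1024 samples and that a block must be processed within the time needed to acquire the next one. The C helpers below are a hedged sketch of that arithmetic, and their names are hypothetical.

```c
#include <assert.h>

/* Maximum sustainable sample rate in MS/s: block_samples samples must be
 * processed in exec_time_ms milliseconds. */
double max_sample_rate_msps(double block_samples, double exec_time_ms)
{
    assert(block_samples > 0.0 && exec_time_ms > 0.0);
    return block_samples / (exec_time_ms * 1e-3) / 1e6;
}

/* Reduction in execution time, in percent (as in Table III). */
double time_reduction_pct(double host_ms, double gpu_ms)
{
    assert(host_ms > 0.0);
    return 100.0 * (host_ms - gpu_ms) / host_ms;
}

/* Gain in maximum sample rate, in percent (as in Table IV). */
double rate_gain_pct(double host_ms, double gpu_ms)
{
    assert(gpu_ms > 0.0);
    return 100.0 * (host_ms / gpu_ms - 1.0);
}
```

For the 16Ks row, 16384 samples in 6 ms give about 2.73 MS/s for LabVIEW and 16384 samples in 4 ms give about 4.10 MS/s for the GPU, a 33.3% shorter execution time and a 50% higher sample rate.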
It has been shown that developing efficient code for 
GPUs requires detailed knowledge of the processor 
architecture and some expertise in algorithm 
parallelization. Software development tools, e.g. compilers 
and high-level tools, are evolving much more slowly than 
microelectronics, creating a gap that complicates the 
widespread use of these technologies. However, 
the massive processing capabilities of GPUs counteract 
this problem by compensating for the lack of optimization 
with more processing power. It has therefore been shown 
that it is possible to integrate GPUs in DAQ applications from 
high-level programming languages, although it is not a 
straightforward process. Another open question is the 
availability of GPU-based solutions for industrial 
environments, e.g. PXI-based boards, or the interconnection 
between existing 1U systems and these systems. 
REFERENCES 
[1] J. Vega, A. et al., “New Developments at JET in Diagnostics, Real Time 
Control, Data Acquisition and Information Retrieval on the route to 
ITER”. Fus. Eng. and Des., vol. 84, issue 12, pp. 2136-2144, December 
2009. 
[2] E. Barrera et al. 
[3] G. de Arcas et al., “Self-adaptive sampling rate data acquisition in JET’s 
correlation reflectometer”. Rev. Sci. Instrum., vol. 79, issue 10, pp. 
10F336-10F336-4, 2008. 
[4] J. González et al., “Configuration and supervision of advanced 
distributed data acquisition and processing systems for long pulse 
experiments using JINI technology”. Fus. Eng. and Des., vol. 84, issue 2-6, 
pp. 832-836, June 2009. 
