: Future experiments in high-energy physics will pose stringent requirements to computing, in particular to real-time data processing. As an example, the CBM experiments intends to perform online data selection exclusively in software, without using any hardware trigger, at extreme interaction rates of up to 10 MHz. In this article, we describe how heterogeneous computing platforms, in particular Graphical Processing Units (GPUs), can be used to solve the associated computing problems, giving the example of an event selection process sensitive to J/ψ decays using muon detectors. We investigate and compare different parallel computing paradigms (OpenCL, OpenMP, MPI, CUDA) on both CPU and GPU architectures and demonstrate that the problem under consideration can be accommodated with a moderate deployment of hardware resources, provided their compute power is made optimal use of.
Introduction
The computing demands for modern experiments in high-energy particle and nuclear physics are increasingly challenging. This holds in particular for experiments relying heavily on real-time data processing. An example is the Compressed Baryonic Matter experiment (CBM) planned at the future FAIR facility in Darmstadt, Germany. CBM will study strongly interacting matter by investigating collisions of heavy nuclei at very high interaction rates of up to 10 7 /s, enabling access to extremely rare physics probes. A key feature of the experiment is a data taking concept without any hardware trigger. All data from the various detector systems will be forwarded to a computing farm, where the collision data will be inspected for potentially interesting physics signatures and consequently either accepted or rejected. The data selection performed in software requires significant processing power, and thus the efficient use of the available computing architectures, making use of parallel processing features, is indispensable.
In modern computing architectures, both multi-core CPU systems and auxiliary accelerators like Graphical Processing Units (GPU) are playing a pivotal role. GPUs consist of a set of multiprocessors designed to obtain the best performance with graphics computing. Nevertheless, Figure 1 . Left: the CBM experiment setup with its detector systems; right: the CBM muon detection system with a pictorial view of a J/ψ decay into µ + and µ − their computational power can be also used, within certain limits, for general-purpose computing, an application type which is becoming more and more popular. Extracting good performance from GPU processors, however, is not trivial and requires architecture-specific optimizations.
In this article, we present a study on the usage of GPUs for a specific task within the CBM online data processing, namely the selection of J/ψ candidate events. After a short introduction to the experiment and its computing challenge in section 2, section 3 describes the event selection algorithm in detail. Section 4 briefly summarises the computing platforms and testing conditions for this study. The implementation of the algorithm and its performance on various computing platforms and for different parallel computing paradigms are reported in sections 5 and 6. The results will be summarised in section 7 followed by an outlook and acknowledgements.
The CBM experiment and its muon detection system
The Compressed Baryonic Matter (CBM) experiment [1] is a dedicated relativistic heavy-ion collision experiment at the upcoming FAIR accelerator centre at Darmstadt, Germany [2] . Figure 1 (left) shows the setup of the CBM experiment in the muon configuration. For CBM, energetic ions (p beam = 3.5A − 35A GeV for Au ions) will be collided with a fixed target, producing a large number of particles which will be registered in several detector systems serving for momentum determination and particle identification. One of these detector systems is the Muon Chamber System (MUCH) [3] shown in the right panel of Figure 1 . It is placed after the Silicon Tracking System (STS) consisting of eight silicon-strip tracking stations inside the dipole magnet [4] . The MUCH system allows the separation of muons from other particles by several massive absorber slices interlayed with position-sensitive detector stations. In its full configuration, the system comprises of one graphite and five iron slabs of thicknesses varying from 20 cm to 100 cm, with detector triplets inserted in the gap between two consecutive slabs. The last station consisting of a detector triplet is called trigger station and is positioned after a one meter thick iron absorber. One objective of the experiment is the measurement of J/ψ meson production via their decay into a muon pair (J/ψ → µ + µ − ). At CBM energies, however, the J/ψ multiplicity is expected to be extremely small -of the order of 10 −7 per collision [3] . The experiment will thus be operated at extreme interaction rates in order to obtain sufficient statistics. To reduce the total data rate to a recordable value, J/ψ candidate events must be selected in real-time during data taking, rejecting the vast majority of the collisions. The trigger signature is obvious from the right panel of Fig. 1 : two muons traversing the absorber system and being registered simultaneously in the trigger station. The trajectories of the muon candidates have to point back to the collision vertex (target), since the J/ψ decays promptly at the vertex.
The trigger signature will be evaluated exclusively in software on the so-called FLES (FirstLevel Event Selection) computing cluster [5] hosted in the Green Cube building at GSI [6] . Efficient trigger algorithms are a prerequisite for the affordability of the FLES cluster.
3 The real-time event selection process
Trigger signature
The signature for J/ψ → µ + µ − candidate events is a rather simple one and is visualized in the right panel of Fig. 1 . The two daughter muons, having a high momentum due to the large q value of the decay, traverse all absorber layers and reach the trigger station, while hadrons, electrons, and low-momentum muons will be absorbed. Since the J/ψ decays promptly (cτ = 7.1 · 10 −21 s), the decay products practically originate from the primary (collision) vertex, i.e., from the target. Owing again to the high momentum of the muons, their trajectories can be approximated by straight lines even in the bending plane of the dipole magnetic field. The trigger station provides three position measurements, allowing to check the back-pointing to the primary vertex. The signature of a candidate event is thus the simultaneous registration of two particles in the trigger station which can be extrapolated backward to the target. To study the feasibility of this selection procedure, Fig. 2 (left) shows the number of hits per event for the trigger station (combined for all three layers) for a sample of background events (not containing a J/ψ decay) simulated in the CBM setup. The average number is about 45 for the trigger station and 15 for each of its layers. The dominant source of this background are secondary muons originating from weak decays of Λ and K 0 S between the target and the trigger station. The right panel of Fig. 2 shows the spatial distribution in the transverse plane x − y of these background hits. The void area in the centre is not instrumented, since it hosts the vacuum pipe for the non-interacting beam.
Algorithm
The algorithm corresponding to the signature described in the previous section operates on all data from the MUCH detector within one event (collision). The steps are:
1. Create all possible triplets of hits in the trigger station, with one hit from each layer. For the application to simulated events used in the following, data are read from a file produced by the detector simulation (GEANT3 or GEANT4 [7] plus detailed detector response implementation). When deployed for data taking, the algorithm will receive the input (hit) data from the online data stream. Hit data are grouped into events; they provide three-dimensional coordinate information (x, y, z).
Evaluation of the trigger algorithm
In order to assess the performance of the trigger algorithm, the following sets of simulated data were produced and studied: D1. Events containing only one decay J/ψ → µ + µ − . The phase-space distribution of the J/ψ were generated using the PLUTO generator [8].
D2. Background events (central Au+Au at p beam = 35A GeV) generated with the UrQMD model [9] .
D3. Background events with one embedded decay J/ψ → µ + µ − in every event.
The following performance figures are used: a) efficiency: the fraction of embedded signal events selected by the algorithm; b) efficiency under acceptance (EUA): the fraction of selected embedded signal events in which both decay muons have hits in all three layers of the trigger station; c) background suppression factor (BSF): the ratio of all background events to background events selected by the trigger algorithm.
The efficiency is mainly determined by the geometrical coverage of the muon detection system. Typical values are about 39 %. The reason for the EUA to be smaller than unity is one or both signal tracks not passing the χ 2 cut. With the cut values named above, the EUA is 85.4 %. The BSF of 71.4 shows that the primary aim of the algorithm -suppression of a large fraction of the input data rate -is reached; the probability to find a chance pair of triplets passing the χ 2 cut is still not negligible. The choice of architecture and programming paradigm may well depend on the specific computing problem to be solved. For the algorithm described in the previous section, we have tested implementations on the following two heterogeneous platforms: S2. An AMD-based HP Server with four AMD Opteron 2.6 GHz processors comprising 16 cores with 4 GB/core RAM (total 64 cores).
Heterogeneous computing
The algorithm was implemented on these setups with different programming paradigms, namely: a) Using both Intel processors (12 cores in total) of the S1 setup with OpenMP, MPI and OpenCL; b) Using all four AMD processors (64 cores in total) of the S2 setup with OpenMP, MPI and OpenCL; c) Using the Tesla C2075 GPU of the S1 setup (448 cores) with CUDA and OpenCL; d) Using the Quadro 4000 GPU of the S1 setup (256 cores) with CUDA and OpenCL.
5 Implementation on NVIDIA GPUs
NVIDIA GPU C2075 and thread architecture
The Tesla C2075 GPU used in this work consists of an array of 14 so-called Streaming Multiprocessors (SM). Each SM contains 32 Scalar Processor (SP) cores. Therefore, in total 14 · 32 = 448 cores are available in a single GPU card. All SPs within an SM share resources such as registers and memory. The instruction issue unit issues the same instruction across each SP inside a SM. Thus, they execute the same instruction at any time, a concept known as SIMT -Single Instruction Multiple Threads. Figure 4 shows a diagrammatic representation of the GPU architecture. The thread hierarchy is of two levels. At the topmost level there exists a grid of thread blocks. At the second level, the thread blocks are organized as an array of threads. A function to be executed on the GPUs is known as kernel. Kernel execution takes place in the form of a batch of threads organized as a grid of thread blocks. The thread blocks are scheduled across SMs. Each block comprises of many warps of 32 threads. Threads belonging to the same warp execute the same instruction over different data. The efficiency of computation is best when the threads follow the same execution path for the majority of the computation. Execution divergence, when threads of a warp follow different execution paths, is handled automatically inside the hardware with slight penalty on execution time. The size of thread blocks (number of threads per block) and number of blocks can be managed by the programmer.
The CUDA platform and memory hierarchy for the Tesla C2075 GPU
The Compute Unified Device Architecture (CUDA) has been developed by NVIDIA for performing general-purpose computing on NVIDIA GPU using parallel computation features. The CUDA memory organization is hierarchical. Each thread has its own private local memory other than registers, which are 32,768 per SM. The Tesla C2075 GPU includes a configurable (as instruction or data) L1 cache per SM block and a unified L2 cache for all processor cores. All threads inside a block share different memory spaces -per block shared memory. The size of a block local shared memory is 48 KB. Its lifetime equals that of the block and is characterized by low memory access times. Shared memory comprises of a sequence of 32-bit words called banks. There also exists a global memory shared by all threads across all thread blocks, having the lifetime of the application. The size of the global memory is 6 GB. The global memory access time is larger compared to other memories. Global memory comprises of 128 byte segment sequence, and at any time, memory requests for 16 threads (a half warp) are serviced together. Each segment corresponds to a memory transaction. If the threads in a half warp access data spread across different memory segments (uncoalesced memory request), the corresponding multiple memory transactions would lower the performance.
Implementation with CUDA
The event selection algorithm described in section 3 was implemented in the C language using the CUDA API and then compiled with the NVIDIA compiler (nvcc). To optimize the event selection code for the GPU, we analysed the program and performed memory arrangement according to the GPU architecture as discussed above. We also worked to optimize the CPU to GPU data transfer time, which is significant as the data volume is large.
In our first approach, the entire event hit data of the setup, consisting of the STS and MUCH detectors, was transferred to the GPU. The following steps were taken:
1. Data are read in from a file; in the actual experiment, the data will be deployed in shared memory by the data acquisition software.
2. Memory on the GPU device is allocated for a chosen number of events n ev .
3. The number of blocks b and the number of threads per block t are selected, optimising the GPU computation time.
4. The event selection algorithm is executed in GPU threads for n ev events in parallel, with n ev ≤ b · t.
The list of selected events is transferred back to the CPU.
The GPU schedules and balances the selected number of blocks and threads on the available SMs (14 for C2075) and SPs (32 where x, y, z are the hit coordinates in configuration space; the first index denotes the event, the second one the consecutive number of the hit in the event.
Our investigation showed that this data arrangement suffers from uncoalesced memory access for all the x, y, z coordinates. By its SIMT architecture, CUDA executes 32 threads of a block simultaneously; therefore all 32 threads should read from the global memory in a single or double read instruction. To cope with this, we rearranged the data such that coalesced memory access is possible. First, we introduced separate data containers for each coordinate axis, and second, hits of different events are arranged together: In the course of further optimizing the process, we found that the majority of time is taken by the global read of data by each thread, thereby requiring a reduction in global read time as data reside in the global memory and not in the shared or private memory of the GPU. By construction of the algorithm, the number of global reads for each thread is proportional to the number of events n ev . Each event contains about 5000 hits, and every global read takes around 300-400 clock cycles. For the computation, however, only a small fraction of these data are used, namely hits in the last (trigger) station, which are about 15 per layer per event (see Fig. 2 ). Thus, we introduced a filtering of the data on the CPU host side, such that only hit data in the trigger station are transferred to GPU [17] .
The importance of the optimization steps is illustrated in Fig. 5 (left) , showing the per-event GPU execution time for the various implementations on the Tesla GPU and respectively the perevent CPU to GPU data transfer time. This study was performed for up to 4,000 events because of memory limitations of the GPU. The processing time is reduced by a factor of two from the first implementation (i1) to the one properly using coalesced memory (i2). The data transfer time is the same for both implementations since the same data are transferred. Filtering of input data at the host side (i3) gives a reduction by about two orders of magnitude for the per event-execution compared to (i2) and one order of magnitude for the per-event data transfer time.
Figure 5 (right) and Table 1 compare the per-event GPU execution time and data transfer time for implementation i3 to the singe-thread execution time on CPU. The data filtering on the host side relaxes the restrictions imposed by the limited GPU memory, such that a larger number of events (tested up to 80,000) can be processed at a time. The data transfer time is lower by one order of magnitude compared to the execution time; moreover, it can be hidden by performing computation and transfer in parallel. The measurements demonstrate the importance to load the GPU with sufficient data in order to make optimal use of its capacity. Compared to the single-thread execution on the CPU, we obtain a speed-up of 35 for a data set of 40k events by using the Tesla GPU. Table 1 shows that for more than 40k events, the speed-up with respect to the single-thread CPU is slightly reduced, indicating that the optimal data load on the GPU is reached with this event number. Our results show that about 3 · 10 5 events per second can be processed on the mentioned single Tesla GPU. 
Implementation with OpenCL and comparison with CUDA
The previous section has demonstrated that making optimal use of a GPU with the CUDA API is far from trivial and requires sophisticated optimization of the data arrangement. The OpenCL programming paradigm [11] offers an architecture-independent alternative. However, for a beginner OpenCL seems difficult as far as its syntax and programming procedure are concerned. Writing a small "Hello World" program in OpenCL needs creating platform, device, context, and command queue, then memory allocation via create & writing buffer, program object creation via creating source, building program, program execution via create kernel, enqueueing kernel, reading back buffer, etc. At first sight, this seems cumbersome compared to CUDA which provides an easy terminology for writing programs [13] . On the other hand, OpenCL programs can be compiled via available C or C++ compilers, unlike CUDA which requires a vendor-specific compiler. Once OpenCL programs are written and compiled, then can be executed on any device, whether GPU, CPU, or APU [10], whereas CUDA can be executed only on NVIDIA GPUs. Both CUDA and OpenCL treat the CPU as host, but for CUDA only the NVIDIA GPU is a device, whereas OpenCL treats any hardware as computing device by creating an instruction queue that can be executed on all available computing resources. We thus investigated OpenCL as an open-source solution for heterogeneous programming. Figure 6 shows the per event execution time of the event selection algorithm for different numbers of events on the Tesla and Quadro GPUs using the implementations in CUDA and in OpenCL. We find the OpenCL code execution time to be slightly larger than that of the CUDA code on both Tesla and Quadro, showing that CUDA is better optimized to the NVIDIA GPU architectures. However, the difference is modest and seems a reasonable price for the flexibility offered by an architecture-independent code. Comparing Tesla and Quadro, we find the Tesla GPU to be more powerful, which becomes visible at large-enough input data (number of events). The hardware differences between these two GPU cards are manifold -processor speed, global memory size, number of computing cores etc. We conclude that the Tesla GPU seems more appropriate for our problem than the Quadro GPU.
6 Implementation on multi-core CPU An alternative to using GPU accelerator co-processors is to make use of the multi-core CPU architecture present in contemporary computers. Concurrency on CPU cores can be established using OpenMP, MPI, and OpenCL, all of which are open-source programming tools. We tested implementations of the event selection algorithm for all three of these programming paradigms on the two platforms S1 and S2 the GPUs of the S1 setup were idle or in open condition). Figure 7 (left) compares the throughput (number of events executed per second) on the Intel Xeon processors (2 x 6 cores) of setup S1 in dependence of the number of cores (threads) used in parallel for a sample of 20000 events. We find for both implementations a linear scaling with the number of threads up to 12 threads, from when on the throughput decreases again. This signals that from this point onwards, the context switching time starts to dominate the total process time. The same test was performed on the setup S2 (4 x AMD Opteron 16 cores) as shown in the right panel of Figure 7 , obtaining similar results for 64 threads. The throughput for the MPI implementation is higher than that of OpenMP, the difference being larger on AMD than on Intel. We attribute this to the fact that the process does not use shared memory or inter-thread communication. The comparison of the performances found with OpenCL, OpenMP and MPI on the two hardware architectures is shown in Fig. 8 for different numbers of events (up to 80,000) processed at a time. The number of threads is 12 for Intel and 64 for AMD, as shown to be optimal by Fig. 7 . On both platforms, we find OpenCL to perform significantly better than the other two implementations. The reason in our opinion is the better ability of OpenCL to produce vectorized code in an automatized way. The performance of OpenMP and MPI can possibly be improved by manual vectorization. A comparison of the Intel and AMD processors is not straightforward because of the different number of computing units. However, AMD appears less performant than Intel for all implementations since the number of parallel threads is 5 times larger than for the Intel CPUs, but the per event execution time is only 2.5 times smaller.
Conclusions
We have described the development of an event selection algorithm for the CBM-MUCH detector data and a systematic study for the implementation of the event selection process using different parallel computing paradigms like OpenMP, MPI, and OpenCL for multi-core CPU architectures, and CUDA and OpenCL for many-core architectures like NVIDIA GPUs. For both platforms, the event selection procedure suppresses the archival data rate by almost two orders of magnitude without significantly reducing the signal efficiency, thus satisfying the CBM requirements for high-rate data taking.
On GPUs, we have found a speed-up of 35 with respect to the single-thread execution on CPU. This result, however, is only obtained after careful optimization of the implementation in CUDA.
OpenCL on NVIDIA GPUs is found to perform only slightly worse than CUDA. Our results show that about 3 · 10 5 events per second can be processed on a single GPU card. Present hardware supports up to four GPUs on a single motherboard. This suggests that the targeted CBM interaction rate of 10 7 events per second can be accommodated by a small number of servers properly equipped with GPUs.
In a multi-core CPU environment, we have compared OpenCL, OpenMP and MPI as concurrency paradigms. We find OpenCL to perform best in such architectures. A linear scaling of the data throughput with the number of parallel threads is observed up to the number of available physical cores. In the powerful S2 setup with total 64 AMD cores, we find that about 3 · 10 6 events can be processed per second, which is already close to the targeted event rate of 10 7 /s. This demonstrates that the computing demands of the CBM experiment for the real-time selection of J/ψ candidate events can be achieved by properly making use of the parallel capacities of heterogeneous computing architectures.
Comparing the different programming paradigms, we find the cross-platform OpenCL to be a proper choice for heterogeneous computing environments. Once a program is written in OpenCL, it can be executed on a variety of platforms (CPU / GPU) without changing the code. Modern computer architectures come with a combination of multi-CPU cores with accelerator cards like GPUs. For this kind of systems, OpenCL provides a suitable solution to simultaneously exploit all available compute units for a given application. It also provides the flexibility to future improvements in computing architectures, which is of particular importance for CBM as an experiment in the construction stage.
Outlook
The data selection procedure developed and investigated in this article relies on data aggregated into events, corresponding to a single nucleus-nucleus interaction. The data acquisition of the CBM experiment, however, will deliver free-streaming data not associated to single events by a hardware trigger. To properly account for this situation, not only the spatial coordinates, but also the time measurement of each hit must be considered. This will increase the complexity of the current, rather simple algorithm. We are working together with the CBM collaboration towards extending the algorithm to event building and selection from the real online data stream and also will investigate the throughput on multi-core, many-core platforms in parallel.
