Abstract-Power density constraints are limiting the performance improvements of modern CPUs.
I. INTRODUCTION
T HE Large Hadron Collider (LHC) at CERN is the highest energy collider ever constructed. It consists of two counter-circulating proton beams made to interact in four locations around a 17 mile ring straddling the border between Switzerland and France. It is by some measures the largest man-made scientific device on the planet. The goal of the LHC is to probe the basic building blocks of matter and their interactions. In 2012, the Higgs boson was discovered by the CMS and ATLAS collaborations. Experimentally, we collide proton beams at the center of our detectors and, by measuring the energy and momentum of the escaping particles, infer the existence of massive particles that were created and decayed in the pp collision and measure those massive particles' properties. In all cases, track reconstruction, i.e., the determination of the trajectories of charged particles, plays a key role in identifying particles and measuring their charge and momentum. The track reconstruction as a whole is the most computationally complex and time consuming, sensitive to increased activity in the detector, and traditionally, least amenable to parallelization. The speed of reconstruction has a direct impact on how much data can be stored from the 40 MHz collisions rate. The most CPU-intensive part of the event selection in the online trigger process is track reconstruction, to the point that it can only be applied to a few % of the input event rate. At the same time, the tracker is the most precise instrument in CMS. Tracking is thus the essential tool in making surgical decisions in the online selection to optimally use the limited output bandwidth for interesting physics events. The speed of track reconstruction software thus limits both the total output rate of the experiments via the first cap, and the surgical precision with which interesting events can be selected via the second cap. Our research aims to lift those caps by vastly speeding up the online tracking reconstruction. The large time spent in track reconstruction will become even more important in the HL-LHC era of the Large Hadron Collider, where the increase in event rate will lead to an increase in detector occupancy ("pile-up", PU), leading to an exponential gain in time taken to do track reconstruction, as can be seen in Fig. 1 . In the Figure, A correlated issue is the change in computing architectures in the last decade. Around 2005, the computing processor market reached a turning point: power density limitations in chips ended the long-time trend of ever-increasing clock speeds, and our applications no longer immediately run exponentially faster on subsequent generations of processors. This is true even though the underlying transistor count continues to increase per Moore's law. Increases in processing speed such as those required by the time increases in Fig. 1 will no longer come 'for free' from faster processors. New processors instead are aggregates of 'small cores' that in toto still show large performance gains from generation to generation, but their usage requires a re-work of our software to exploit. The processors in question include ARM, GPGPU and the Intel Xeon Phi; in this work we target the Xeon Phi architecture.
II. KALMAN FILTER TRACKING
To realize the new performance gains, a change is required to move from the sequential applications of today to vectorized, parallelized applications of tomorrow. The algorithm we are targeting for parallelization is a Kalman Filter (KF) based algorithm [1] . KF-based tracking algorithms are widely used since they naturally include estimates of scattering along the trajectory of the particle due to multiple scattering off massive detectors. Other algorithms, more naturally suited to parallelization and coming from the image processing community, have been explored by others. These include Hough Transforms and Cellular Automata, among others (see, for instance, [2] .) However, these are not the main algorithms in use at the LHC today, whereas there is extensive understanding how KF algorithms perform. KF algorithms have proven to be robust and perform well in the difficult experimental environment of the LHC. By porting these algorithms to parallel architectures, we aim to port this robust tool to new architectures. Past work by our group has shown progress in sub-stages of the KF algorithm in simplified detectors (see, e.g. our presentations at ACAT2014 [3] and CHEP2015 [4] ). Fig. 2 shows a result from the track building part of the problem, where we decide which hits to group together as coming from the passage of a single charged particle, as a function of the increased usage of the processors' vector registers. The black line shows "ideal behavior" -perfect scaling, and the blue line shows our measured results. We observe a significant speedup compared to the baseline, but still room for improvement with respect to the ideal behavior. All results are for the Xeon Phi. We have now implemented an end-to-end algorithm with a semirealistic detector model. This talk represents a status report.
A. Optimized Matrix Library MATRIPLEX
The computational problem of Kalman Filter-based tracking consists of a sequence of matrix operations on matrices of sizes from N × N = 3 × 3 up to N × N = 6 × 6. To allow maximum flexibility for exploring SIMD operations on small-dimensional matrices, and to decouple the specific computations from the high level algorithm, we have developed a new matrix library, MATRIPLEX. The MATRIPLEX memory Fig. 2 . Track building part of the problem, where we decide which hits to group together as coming from the passage of a single charged particle, as a function of the increased usage of the processors' vector registers.
layout is optimized for the loading of vector registers for SIMD operations on a set of matrices. MATRIPLEX includes a code generator for generation of optimized matrix operations supporting symmetric matrices and on-the-fly matrix transposition. Patterns of elements which are known by construction to be zero or one can be specified, and the resulting generated code will be optimized accordingly to reduce unnecessary register loads and arithmetic operations. The generated code can be either standard C++ or simple intrinsic macros that can be easily mapped to architecture-specific intrinsic functions.
III. CURRENT STATUS
We have performed extensive studies of the performance of our software. Both the physics performance and the code performance were examined. For the latter, we used the INTEL VTUNE [5] suite of tools to identify bottlenecks and understand the impacts of our optimization attempts. In particular, as can be expected, we determined that memory management is of critical importance. To this effect we describe below several studies to optimize memory performance, and discuss the results of these studies.
A. Cloning Engine Studies
In Tab. I we present results from our "cloning engine" (CE) studies. In profiling our application it was observed that a significant amount of time was spent in operations associated with memory management. In our tracking algorithm, new data structures have to be created when new track candidates are examined. Significant time is also spent loading data into our local caches. In the CE studies, we examine offloading the memory management work to a worker process. The first strategy, labeled "cloning engine" in the Table, gangs all memory operations together; speedup is observed by removing redundant operations. In the second strategy, labeled "threaded cloning engine", the memory management task is delegated to a second thread. We observe increased performance and vector unit efficiency with CE studies compared to the original model and increased performance for both Xeon and Xeon Phi using the threaded CE compared to the non-threaded CE.
B. Reduced Data Structures
The size of the data structures used in our algorithm have a crucial impact on the timing performance. In particular, the data structures representing the energy deposits in the detector (the "hits") and the reconstructed particle trajectories (the "tracks") must be optimized in size to fit into the fastest cache memory. Object-oriented data structures require more memory than arrays. Therefore, we replaced the Kalman covariance matrices with basic C-style arrays rather than C++-style classes. These changes allowed us to reduce the size in memory of the track objects by 20% and the hit object by 40%. Tab. II shows the result of these studies. The table only shows Xeon numbers; Xeon Phi performance is similar.
In addition to moving to C-style data structures, we optimized the contents of the data structures. Our original implementation of these objects were designed for algorithmic ease of use and contained data members not strictly needed for the track reconstruction. For instance, Monte Carlo truth information, i.e., information about how the trajectories were generated, were stored in the hit objects. The track objects contained (duplicate) copies of the hits themselves, rather than smaller references to the hits in their original locations. We rewrote the data structures to optimize performance and maintain ease of use by carefully considering what was needed for physics output only and keeping auxiliary information in associated data structures.
We find significant speed-ups from the reduced data formated (labeled "r.d.f" in the table). We also find that with these changes, the improvement of the CE studies is reduced; presumably since the amount of memory churn overall, which the CE improves, has now been globally reduced.
C. Performance Scaling -parallelization Figure 3 shows the parallelization performance scaling of our code on the Xeon architecture. The plot on the right show the time to process 20,000 tracks as a function of the number of threads. The curve in green represent ideal scaling from the first data point. The blue and red curves show two different ways of distributing the data across threads. On the right the same data is plotted but now as a function of 1/n threads ; this figure can be used to extract the serial fraction of the code. According to Amdahl's law, the time spent on n threads consists of a serial fraction B and a parallel fraction (1 − B)/n threads multiplied by the time per thread, T (1).
The B parameter can be extracted from the data by fitting to this function, and results are shown for different versions of our code. In the Figure, a big improvement was found by optimizing how data structures are re-initialized on each event, leading a reduction of the serial fraction from 26% to 9%. There is a significant residual contribution to non-ideal scaling due to variation of occupancy within threads: some threads simply take longer than others. In our group there is work ongoing to define strategies for an efficient 'next in line' approach or a dynamic reallocation of thread resources to even out timing across threads.
IV. CONCLUSION We have made significant progress in parallelized and vectorized Kalman-filter-based tracking R&D on Xeon and Xeon Phi architectures. We have developed a good understanding of bottlenecks and limitations of our implementation. New versions of the code are faster and exhibit scaling closer to ideal performance. We are continuing to pursue new ideas to further improve performance Though it was not discussed in the talk, we have also developed tools to process fully realistic data, with encouraging preliminary results.
The project is solid and promising; however, much work remains.
ACKNOWLEDGMENT
This work was supported by the U.S. National Science Foundation. 4 Scaling Plots Excluding Eta Bin Setup On the right the same data is plotted but now as a function of 1/n threads ; this figure can be used to extract the serial fraction of the code.
