Abstract-An important step in the ATLAS upgrade program is the installation of a tracking processor, the Fast Tracker (FTK), whose goal is to identify the tracks of charged particles produced in the LHC 14 TeV proton-proton collisions. The collisions will generate thousands of hits in each layer of the silicon tracker, making track identification a very challenging computational problem. At the core of the FTK is the associative memory (AM) system, made of hundreds of AM ASIC chips specifically designed to allow pattern identification in high-density environments at very high speed. This component organizes the computational steps of the track identification, providing huge computing power for a specific application. The AM system will in fact be able to reconstruct tracks in tens of microseconds. Within the FTK team there has also been a constant effort to maintain a detailed emulation of the system, in order to predict the impact of single component features on the final performance and on the ATLAS data acquisition system. The FTK emulation is, however, demanding software; we describe the efforts to obtain the best performance using commercial computing devices and some ideas for its future evolution.
number of available patterns, speed and consumption of the chip. The integration in the trigger and data-acquisition (TDAQ) infrastructure of the ATLAS experiment [5] is also quite challenging, because the silicon tracker was not designed for this specific task and a more general approach has been required.
A complete simulation package of the FTK processor and the AM chip has been developed over the past years. This tool has been extremely important to evaluate the impact of hardware design decisions on the tracking performance in terms of resolution, efficiency and fake tracks. For the hardware development the software has also been important to balance and optimize the system. The optimization of this software is challenging, because the algorithms used in the system and its overall size require an intensive use of computing resources. This document describes the AM system, the main goals of the emulation software, and the challenge of providing flexible emulation software that makes optimal use of the standard computing resources offered by ATLAS.
II. THE ATLAS FAST TRACKER PROCESSOR
The FTK processor is an electronic subsystem installed in the ATLAS TDAQ infrastructure. The processor is expected to reconstruct the trajectories of charged particles originating in the p-p collisions produced by the LHC at an energy of 14 TeV. The FTK is expected to enter the TDAQ from the fall of 2015. The detector coverage will initially be limited to the ATLAS barrel region, and will then be expanded to the whole geometrical range covered by the ATLAS tracking system. The tracks are reconstructed by the FTK processor from the data of both the pixel and the strip silicon sensors installed in the innermost part of the ATLAS inner detector (ID), at a distance between 5 and 50 cm from the collision point. The ID is composed of 12 concentric layers: 4 layers of pixel sensors in the inner part, followed by 4 layers of strip sensor pairs. The incoming data are preprocessed at each Level-1 trigger accept, at a maximum rate of 100 kHz, and the track reconstruction is expected to be performed within 100 μs. This enables the High-Level Trigger (HLT) online computing farm to use these tracks as input to extract interesting features from the events and to decide which events should be retained by the trigger system for data analysis.
The FTK pipeline begins with an input stage implemented in ATCA boards, the Data Formatter (DF), which receives the raw data from the ID, performs cluster finding on the incoming data, and distributes the data to the next components (see Fig. 1). The DF sends the data to 64 pipelines, organized as 4 sequential barrels, each with 16 angular slices. Each pipeline is also referred to as a tower and covers about 1.2 units in pseudo-rapidity (η) and 22.5 degrees in the transverse plane. For each tower there are two boards with 64 AM chips each, which perform the pattern matching, and a rear module, the FTK auxiliary (AUX) card, which converts the incoming data for the pattern matching and performs a first fitting stage to remove fake tracks. The AM system recognizes tracks using only 8 layers out of the 12, allowing 1 missing layer. The 8-layer track candidates from the AUX card are sent to the 2nd stage fit board (SSB), where they are refined by adding up to 4 hits, with an improvement in resolution and a better rejection of fake tracks. The final board is the FTK to Level 2 Interface (FLIC), where the data from the SSB are collected and prepared for the ATLAS HLT processing farm.
The system is designed to reconstruct tracks with a minimum transverse momentum of 1 GeV and a minimum polar angle from the beam direction of about 8 degrees, which represents the limit of the ID reconstruction. The absolute track reconstruction efficiency is about 93% for muon tracks, and the average fraction of fake tracks in the highest expected pileup scenario, with 80 overlapping collisions, would be about 5% [6]. These parameters depend on optimization studies and can change during the final optimization.
III. THE ATLAS FTK SIMULATION SOFTWARE
A simulation of the FTK was initially designed in 2005 and evolved during the design studies of the system, with a major update in 2010. The goal of the simulation is to describe the logic of the main parts of the processor, making it possible to verify how different setups influence the performance of the system, both in terms of load on the electronic components and of quality of the output tracks. The current code is a C++ library of more than 20 thousand lines, able to describe the logic of the main components of the FTK pipeline, as shown in Fig. 2. The code is highly modular, which makes it easy to evolve and to test new features, and it can be used as standalone software or as a part of the main ATLAS simulation and software framework, ATHENA.
As anticipated, all the functionalities are described, mostly grouped in 2 main blocks: road finding (RF) and track fitting (TF). The RF part controls the input; here the clustering functionality of the FTK clustering mezzanine is performed, as well as the distribution of the hits to the towers as done by the DF. The input module is controlled by the main RF simulation object, and the hits are then distributed to C++ classes representing the AM bank of one FTK tower, composed of 2 VME boards and rear modules, with 64 AM chips and several FPGAs each. The hits are sent to each emulated AM module according to its geometrical acceptance; the AM emulation is explained in much more detail in section V. The RF output is composed of a list of roads and the associated hits, stored in different containers to save space, because many roads can share the same SS and hence the same hits. Statistical information on the RF workload is also saved with the roads and the associated hits; it is useful to provide information on the DF load in the events, such as a summary of the number of input hits before and after clustering. According to the simulation setup, the RF output can be saved on disk or made available to the next step, the TF, by just temporarily storing its content in memory. The TF emulation loops over the list of roads, retrieves the hits associated with each road, composes all the track candidates with one hit from the SS of each layer, and performs a track fit using up to 8 hits. Tracks with hits in all the expected layers but a high χ² are fitted again as incomplete tracks, removing one hit at a time in each layer, to recover the case in which a full road is obtained by combining a track found without all the possible hits with a noise hit, which is important in the higher pileup scenarios.
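The candidate composition step of the TF can be illustrated with a minimal sketch, assuming a road is represented simply as a list of hit identifiers per layer. The names `HitId`, `RoadHits` and `full_combinations` are hypothetical and are not taken from the FTK code; the sketch covers only the enumeration of full candidates (one hit in every layer), not the fit or the refit with a missing hit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical hit type: an identifier is enough for this sketch.
using HitId = int;
// hits_per_layer[l] holds the hits of the road found in layer l.
using RoadHits = std::vector<std::vector<HitId>>;

// Returns every combination that takes exactly one hit from each layer.
// If any layer has no hit, no full combination exists and the result is empty.
std::vector<std::vector<HitId>> full_combinations(const RoadHits& hits_per_layer) {
    std::vector<std::vector<HitId>> combos;
    const std::size_t n_layers = hits_per_layer.size();
    if (n_layers == 0) return combos;
    for (const auto& layer : hits_per_layer)
        if (layer.empty()) return combos;

    std::vector<std::size_t> idx(n_layers, 0);  // "odometer" over the layers
    while (true) {
        std::vector<HitId> combo(n_layers);
        for (std::size_t l = 0; l < n_layers; ++l)
            combo[l] = hits_per_layer[l][idx[l]];
        combos.push_back(combo);

        // Advance the odometer: increment the last layer and carry leftwards.
        std::size_t l = n_layers;
        while (l > 0) {
            --l;
            if (++idx[l] < hits_per_layer[l].size()) break;
            idx[l] = 0;
            if (l == 0) return combos;  // wrapped around: all combinations done
        }
    }
}
```

The number of candidates is the product of the hit multiplicities in each layer, which is one reason why the fitting time grows with pileup.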
The fit stage just described is the first fitting stage, using up to 8 layers and 11 independent measurements; this fit stage is followed by a second one, already described in the brief FTK workflow description, where up to 11 layers are used, with 16 independent measurements to use in the fit. The
computing framework and generally common in installation for HEP computing. The data persistency uses ROOT, ensuring the flexibility to store the complex objects used to describe roads and tracks, while allowing evolution of the data format and performance scalability.
The simulation is characterized by the use of large sets of constants, representing the patterns used by the RF and the values used by the fits (see Table I). The final FTK system will have about 1 billion patterns. Each pattern is a sequence of 8 numbers of 18 bits, which in the simulation are stored as 32-bit integers; a ninth integer stores a reference to the fit constants, so each pattern takes 9 integers of 32 bits, for a minimal request of about 36 GB of memory. The fits require about 20 thousand sets of constants per tower. Each set is composed of 11 × 12 values for the first stage fit, 16 × 17 values for the final fit, and about 20 integers of additional information used during the extension between the two fitting stages. The fit constants are also 32-bit values and hence require in total more than 2 GB of RAM. The memory requirement is extremely high and has required the development of precise strategies to simulate the whole FTK system. In particular, we were forced to divide the simulation task into independent jobs, each controlling only a fraction of the system, which can be smaller than a single tower. This segmentation of the simulation task makes it fit in a standard machine, where standard means a machine with about 2 GB of memory per core and up to 4 GB of memory per process.
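The memory figures quoted above follow from simple arithmetic, reproduced in the sketch below. The function names are illustrative, and the count of 64 towers used in the usage note is taken from section II; decimal units (1 GB = 10^9 bytes) are assumed.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed by the pattern bank: each pattern is stored as
// `words_per_pattern` 32-bit integers (8 SS words + 1 fit-constant reference).
inline std::uint64_t pattern_bank_bytes(std::uint64_t n_patterns,
                                        std::uint64_t words_per_pattern = 9) {
    assert(words_per_pattern > 0);
    return n_patterns * words_per_pattern * sizeof(std::uint32_t);
}

// Bytes needed by the fit constants of the whole system.
// One set = 11x12 values (first stage) + 16x17 values (final fit)
//         + about 20 integers for the extension between the two stages.
inline std::uint64_t fit_constants_bytes(std::uint64_t n_towers,
                                         std::uint64_t sets_per_tower) {
    const std::uint64_t values_per_set = 11 * 12 + 16 * 17 + 20;  // 424 values
    return n_towers * sets_per_tower * values_per_set * sizeof(std::uint32_t);
}
```

With 10^9 patterns, `pattern_bank_bytes` gives 3.6 × 10^10 bytes (36 GB); with 64 towers and 20 thousand sets per tower, `fit_constants_bytes` gives about 2.17 × 10^9 bytes, consistent with the "more than 2 GB" quoted in the text.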
The current segmentation uses 256 jobs, 4 for each hardware tower (see Fig. 3). The jobs of each tower are executed in sequence, with a partial merge at the end of the last job. This proliferation of jobs complicates the emulation workflow but allows running with a minimal amount of memory, about 1.5 GB for the large banks. This limitation can be removed by using special high-memory machines, but they are a rare resource in the ATLAS computing model, mostly used for tests and for small MC event productions, so relying on them would reduce the CPU resources effectively usable.
The emulation performance has been evaluated as a function of the pattern bank size and of the pileup conditions of the physics events. Fig. 4 shows that the emulation of the road finder stage dominates the overall execution time; only at the highest pileup levels does the track fitter start to be important.
IV. THE AM BOARD AND AM CHIP COMPUTING POWER
The AM chip in the FTK performs an extremely intensive task. The hit positions are converted by the AUX into coarser super-strips (SS), which are sent through the P3 connector from the AUX card to the AM board (AMBSLP). Within the AM board the chips are distributed over 4 Large Area Mezzanine Boards (LAMBs), each populated with 16 chips, for a total of 64 chips per AM board.
The version of the AM chip used for the FTK will be the AMchip06, able to store 128k patterns per chip, working at a clock frequency of 100 MHz, with a power consumption of about 2.5 W per chip. Each pattern is composed of 8 words of 18 bits, sent through independent buses. The chip will be able to make a comparison for all the locations at every clock cycle. The comparison of a single word can use the "don't-care" feature, which allows the precision of the match to be tuned in each location. These numbers and characteristics will allow up to 1 billion patterns to be available in the final FTK system.
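The effect of the "don't-care" feature on a single word comparison can be sketched as a masked equality test. The encoding below (a stored value plus a mask of ignored bits) is an assumption made for illustration only; it is not a description of how the AMchip06 stores don't-care bits internally, and the names `PatternWord` and `matches` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kWordMask = (1u << 18) - 1;  // 18-bit words

// Hypothetical encoding of one stored pattern word: `value` is the SS
// identifier and `dont_care` has a 1 for every bit ignored in the comparison.
struct PatternWord {
    std::uint32_t value;
    std::uint32_t dont_care;
};

// True if the incoming SS agrees with the stored word on every bit that is
// not flagged as don't-care.
inline bool matches(const PatternWord& w, std::uint32_t ss) {
    assert((ss & ~kWordMask) == 0);  // the incoming SS must fit in 18 bits
    const std::uint32_t care = ~w.dont_care & kWordMask;
    return ((w.value ^ ss) & care) == 0;
}
```

In this sketch, setting the lowest bit of `dont_care` makes one stored word accept two adjacent SS values, which is the sense in which the feature tunes the precision of the match in each location.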
All the communications within the AM boards use high-speed serial links, with a massive number of links and an extremely large bandwidth. To bring the hits to the AM chips and to collect the results, each AMBSLP has about 750 serial links, with a total bandwidth of about 200 GB/s. The whole FTK processor has 128 boards, meaning that about 25 TB/s are used for the pattern matching functionality alone.
To simplify the comparison of the AM system with standard CPUs in terms of computing power, we can round the width of the buses to 4 words of 32 bits for each pattern. At each clock cycle 128 thousand comparisons of 4 words are then made, so roughly 500 thousand comparisons per clock; at 100 MHz this corresponds to about 5×10^13 comparisons per second, i.e. 50×10^6 million comparison instructions per second (MIP/s). Scaling this to the whole AM system, composed of 128 boards with 64 chips each, gives about 400 G MIP/s. Such a huge number is not achievable with CPUs, excluding supercomputers.
V. THE AM SIMULATION SOFTWARE
As described in section III, the FTK group has developed software able to emulate the logic of the system. The AM emulation module is the most important and the most resource-consuming part of the whole simulation. With respect to the hardware pipeline shown in Fig. 1, the AM module, for technical reasons, integrates or controls functionalities that in hardware belong to the AUX or even to the DF board (see Fig. 5).
The AM emulation module is connected to the FTK simulation infrastructure that controls the RF stage. The road finder algorithm can in general control many AM modules. In practice it loads only a fraction of the pattern bank.
In order to emulate the AM working principle, the code needs some internal structures to store the data held in the chip, the status of the match during an event, and the data to be sent to the next emulation step. The first data structure is the pattern table, which stores the description of the patterns. This table has a number of columns equal to the number of layers used by the chip plus one integer used as a reference to the fit constants; for the current FTK this means 9 columns. The number of rows is equal to the number of patterns loaded in the module, which is limited to several million. This table is loaded at the initialization of the emulation. A second data structure, related to the previous one, is a table that records for each event the status of the match between the incoming hits and the patterns. This structure has the same layout, but a single bit is enough for each entry; to achieve this, the bits of 8 patterns are packed together.
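The two tables described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the FTK implementation: the class names `PatternTable` and `MatchStatus` are hypothetical, and the bit layout (for each layer, the bits of 8 consecutive patterns packed into one byte) is one possible reading of the packing described in the text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kLayers = 8;             // layers used by the AM
constexpr std::size_t kColumns = kLayers + 1;  // 8 SS words + fit-constant reference

// Pattern table: one row of kColumns 32-bit integers per pattern,
// stored contiguously and filled once at initialization.
class PatternTable {
public:
    explicit PatternTable(std::size_t n_patterns) : data_(n_patterns * kColumns, 0) {}
    std::size_t size() const { return data_.size() / kColumns; }
    std::uint32_t& at(std::size_t pattern, std::size_t column) {
        assert(pattern < size() && column < kColumns);
        return data_[pattern * kColumns + column];
    }
private:
    std::vector<std::uint32_t> data_;
};

// Per-event match status: one bit per (pattern, layer). For each layer the
// bits of 8 consecutive patterns share one byte (assumed layout).
class MatchStatus {
public:
    explicit MatchStatus(std::size_t n_patterns)
        : n_patterns_(n_patterns),
          bytes_per_layer_((n_patterns + 7) / 8),
          bits_(kLayers * bytes_per_layer_, 0) {}
    void set(std::size_t pattern, std::size_t layer) {
        assert(pattern < n_patterns_ && layer < kLayers);
        std::uint8_t& byte = bits_[layer * bytes_per_layer_ + pattern / 8];
        byte = static_cast<std::uint8_t>(byte | (1u << (pattern % 8)));
    }
    bool test(std::size_t pattern, std::size_t layer) const {
        assert(pattern < n_patterns_ && layer < kLayers);
        return (bits_[layer * bytes_per_layer_ + pattern / 8] >> (pattern % 8)) & 1u;
    }
    void clear() { bits_.assign(bits_.size(), 0); }  // called at the start of each event
    std::size_t bytes() const { return bits_.size(); }
private:
    std::size_t n_patterns_;
    std::size_t bytes_per_layer_;
    std::vector<std::uint8_t> bits_;
};
```

With this layout the status table needs one bit where the pattern table needs a 32-bit integer, so its footprint is 32 times smaller for the 8 layer columns.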
In order to match an incoming hit with the patterns, the simplest approach is a linear scan. This has proved to be extremely time-consuming because of the size of the pattern bank. To avoid this drawback, an additional data structure is used to index the patterns by SS. In this way an incoming hit is looked up in the pattern index of its layer and the list of patterns containing its SS is quickly retrieved, so that the match status can be updated without a linear scan. As a further improvement, this is done only when an SS is found for the first time in an event; the following hits in the same SS are only accumulated, because they will be used for the track fitting in the dedicated emulation module.
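The indexing strategy described above can be sketched as an inverted index from SS to pattern identifiers, one per layer. The sketch assumes hash maps as the index container and, for brevity, keeps the match status as one byte per pattern with one bit per layer; the names `SSIndex` and `EventMatcher` are hypothetical and the actual FTK data structures may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr std::size_t kLayers = 8;
using PatternId = std::uint32_t;
using SuperStrip = std::uint32_t;
using Pattern = std::array<SuperStrip, kLayers>;  // one SS per layer

// Inverted index: for each layer, maps an SS to the patterns that contain it.
class SSIndex {
public:
    explicit SSIndex(const std::vector<Pattern>& patterns) {
        for (PatternId p = 0; p < patterns.size(); ++p)
            for (std::size_t l = 0; l < kLayers; ++l)
                index_[l][patterns[p][l]].push_back(p);
    }
    const std::vector<PatternId>& lookup(std::size_t layer, SuperStrip ss) const {
        static const std::vector<PatternId> kEmpty;
        const auto it = index_[layer].find(ss);
        return it == index_[layer].end() ? kEmpty : it->second;
    }
private:
    std::array<std::unordered_map<SuperStrip, std::vector<PatternId>>, kLayers> index_;
};

// Per-event state: the index is consulted only the first time an SS is seen;
// later hits in the same SS are only counted (accumulated for the fit).
class EventMatcher {
public:
    EventMatcher(const SSIndex& index, std::size_t n_patterns)
        : index_(index), layer_mask_(n_patterns, 0) {}

    // Returns true if the SS of this hit was new in the event.
    bool add_hit(std::size_t layer, SuperStrip ss) {
        assert(layer < kLayers);
        std::size_t& n_hits = hits_per_ss_[layer][ss];  // created as 0 if absent
        if (++n_hits > 1) return false;                 // SS already processed
        for (PatternId p : index_.lookup(layer, ss))
            layer_mask_[p] = static_cast<std::uint8_t>(layer_mask_[p] | (1u << layer));
        return true;
    }
    std::uint8_t layer_mask(PatternId p) const { return layer_mask_[p]; }
private:
    const SSIndex& index_;
    std::vector<std::uint8_t> layer_mask_;
    std::array<std::unordered_map<SuperStrip, std::size_t>, kLayers> hits_per_ss_;
};
```

The cost of processing a hit then depends on the number of patterns sharing its SS rather than on the size of the bank, at the price of the extra memory used by the index.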
A final structure related to the AM operation is the list of "fired patterns". When the last hit of the event is received, the real chip starts scanning the patterns and sends the indices of the ones above a selected threshold to the next step of the pipeline. In the emulation this scan is unnecessary: the code keeps a list of the patterns that have matched in enough layers, and this list is updated while the hits are received. This avoids a linear scan of the patterns, because the list of roads is always available.
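The incremental bookkeeping of the fired patterns can be sketched as follows. The class name `FiredPatternList` and its interface are hypothetical; the threshold of 7 matched layers used in the usage note corresponds to the 8 layers with 1 missing layer allowed, as stated in section II.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using PatternId = std::uint32_t;

// Tracks which layers have matched for each pattern in the current event and
// maintains the list of patterns that reached the threshold, so that no scan
// over the whole bank is needed at the end of the event.
class FiredPatternList {
public:
    FiredPatternList(std::size_t n_patterns, unsigned threshold)
        : threshold_(threshold), layer_mask_(n_patterns, 0), n_matched_(n_patterns, 0) {}

    // Records that `pattern` matched in `layer`. The pattern is appended to
    // the fired list exactly once, when its matched-layer count reaches the threshold.
    void on_match(PatternId pattern, unsigned layer) {
        assert(pattern < layer_mask_.size() && layer < 8);
        const std::uint8_t bit = static_cast<std::uint8_t>(1u << layer);
        if (layer_mask_[pattern] & bit) return;  // layer already counted
        layer_mask_[pattern] = static_cast<std::uint8_t>(layer_mask_[pattern] | bit);
        if (++n_matched_[pattern] == threshold_) fired_.push_back(pattern);
    }

    const std::vector<PatternId>& fired() const { return fired_; }

    // Clears the per-event state (simple version: resets every pattern).
    void reset() {
        layer_mask_.assign(layer_mask_.size(), 0);
        n_matched_.assign(n_matched_.size(), 0);
        fired_.clear();
    }
private:
    unsigned threshold_;
    std::vector<std::uint8_t> layer_mask_;
    std::vector<unsigned> n_matched_;
    std::vector<PatternId> fired_;
};
```

For example, with `FiredPatternList list(n_patterns, 7)` a pattern enters `list.fired()` as soon as its seventh distinct layer matches, and an eighth match does not add it again.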
Despite the optimization effort, as shown in Fig. 4, the AM emulation remains the top consumer of resources in the FTK emulation. This part of the code takes 75% of the execution time at the highest pileup level, and more at lower pileup levels. This is why the ongoing optimization efforts focus mostly on this component.
VI. CONCLUSIONS
The AM system used by the FTK is extremely powerful and tailor-made for specific algorithms, such as parallel pattern matching in dense environments. Its memory access bandwidth and number of comparisons per second have no equal in commercial resources. This powerful system allows track reconstruction within a very limited latency budget.
The emulation of a large hardware system such as the FTK processor is extremely complex. It requires describing the functionalities and the interactions of many different boards. Such a simulation is also resource demanding, with extremely large requirements in memory and CPU for the execution of entire events. The emulation of a single event requires almost 6 minutes at the highest pileup level foreseen for ATLAS in Run II, with most of the time spent in the emulation of the AM system.
The AM system is in fact the most challenging part: the previous section has clarified the complexity of implementing the AM working principle in software. In particular, it is evident that a workflow that is extremely natural and fast in this special chip becomes very slow and convoluted when mimicked in software. Commercial CPU systems need to reproduce a workflow optimized for a system with a memory bandwidth and a number of comparisons per second that are many orders of magnitude better than those of any regular combination of CPU and RAM. Despite the continuous optimization effort, the emulation will remain slower than the hardware it emulates.
Reaching good time performance for this emulation is important for two main reasons. First, the emulation will be used to monitor the FTK hardware components in order to check for errors and malfunctions of the system. Second, ATLAS needs to emulate the system as accurately as possible in order to parameterize the effect of the FTK on quantities that are relevant for physics analysis, namely the efficiency and quality of the FTK tracks and of the objects built using these tracks.
