Abstract-The ALICE High Level Trigger (HLT) combines and processes the full information from all major detectors in a large computer cluster. The data rate is reduced in two ways: the event rate is lowered by selecting interesting events (software trigger), and the event size is lowered by selecting sub-events and by advanced data compression. Reconstruction chains for the barrel detectors and the forward muon spectrometer have been benchmarked. The HLT receives a replica of the raw data via the standard ALICE DDL link into a custom PCI receiver card (HLT-RORC). These boards also provide an FPGA co-processor for the data-intensive tasks of pattern recognition. Some of the pattern recognition algorithms (cluster finder, Hough transformation) have been re-designed in VHDL to be executed in the Virtex-4 FPGA on the HLT-RORC. HLT prototypes were operated during the beam tests of the TPC and TRD detectors. The input and output interfaces to DAQ and the data flow inside the HLT were successfully tested. A full-scale prototype of the dimuon-HLT achieved the expected data flow performance. This system was finally embedded in a GRID-like system of several distributed clusters, demonstrating the scalability and fault-tolerance of the HLT.
I. INTRODUCTION
The ALICE experiment at the LHC will investigate Pb-Pb collisions at a center of mass energy of about 5.5 TeV per nucleon pair and p-p collisions at 14 TeV. The detectors are optimized for charged particle multiplicities of up to 8000 in the central rapidity region.
The main central tracking detector, the Time Projection Chamber (TPC), is read out by about 600 000 channels, producing a data size of up to 75 MB per event for central Pb-Pb collisions (the most extreme scenario). The overall event rate is limited by the foreseen bandwidth to permanent storage of 1.25 GB/s. With no further reduction, the ALICE TPC can only accumulate central Pb-Pb events at up to 20 Hz. Higher event rates are possible with online event selection and/or data compression. Both applications require a real-time analysis of the detector information. To accomplish the pattern recognition tasks at an incoming data rate of 10-20 GB/s, a massively parallel computing system, the High Level Trigger (HLT) system, is under construction [2].
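As a rough cross-check of the rates quoted above, the following minimal sketch recomputes the storage-limited event rate and the corresponding input data rate. It assumes decimal units (1 GB = 1000 MB) and takes the central Pb-Pb rate of 200 Hz from the conclusion; all names are illustrative.

```python
# Back-of-the-envelope check of the quoted rates.
# Assumption: decimal units, 1 GB = 1000 MB.
EVENT_SIZE_MB = 75.0              # central Pb-Pb TPC event size (upper limit)
STORAGE_BANDWIDTH_MB_S = 1250.0   # 1.25 GB/s to permanent storage
CENTRAL_RATE_HZ = 200.0           # central Pb-Pb rate quoted in the conclusion


def max_event_rate_hz(bandwidth_mb_s, event_size_mb):
    """Highest event rate that fits into a given bandwidth."""
    return bandwidth_mb_s / event_size_mb


storage_limited_rate_hz = max_event_rate_hz(STORAGE_BANDWIDTH_MB_S, EVENT_SIZE_MB)
input_rate_gb_s = CENTRAL_RATE_HZ * EVENT_SIZE_MB / 1000.0
required_reduction = input_rate_gb_s * 1000.0 / STORAGE_BANDWIDTH_MB_S

print(f"storage-limited rate: {storage_limited_rate_hz:.1f} Hz")
print(f"input rate at {CENTRAL_RATE_HZ:.0f} Hz: {input_rate_gb_s:.1f} GB/s")
print(f"required reduction factor: {required_reduction:.1f}")
```

Under these assumptions the storage-limited rate comes out at roughly 17 Hz, of the order of the 20 Hz quoted above, and a 200 Hz central rate corresponds to about 15 GB/s, inside the quoted 10-20 GB/s range.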
II. DATA FLOW AND ARCHITECTURE
The High Level Trigger combines and processes the full information from all major detectors in a large computer cluster. A farm of clustered SMP-nodes (about 400 nodes), based on off-the-shelf PCs and connected with a high-bandwidth, low-overhead network, provides the necessary computing power for event reconstruction. The HLT farm is designed to be completely fault-tolerant, avoiding all single points of failure. Based on the publisher-subscriber principle, a generic communication framework has been developed, which allows the construction of any hierarchy of communication processing elements. Figure 1 shows a sketch of the architecture of the system adapted to the anticipated data flow from the ALICE detectors. The TPC consists of 36 sectors, each sector being divided into 6 sub-sectors. Data from each sub-sector is transferred via optical fibers from the detector front-end into the ReadOut Receiver Cards of the DAQ system (D-RORC), from where a copy is sent to the HLT-RORC. The HLT-RORCs are interfaced to the receiving nodes through their internal PCI bus. Each HLT-RORC provides an FPGA co-processor for the data-intensive local tasks of the pattern recognition and enough external memory to store several dozen event fractions.
The overall architecture of the system is driven by the inherent readout granularity and the requirement for a complete event reconstruction and trigger decision. The internal topology will have a tree-like structure, where the results of the processing on one layer (e.g. track segments on sector level) are merged at a higher layer (sector merging and track fitting). Finally, all local results will be collected from the sub-detectors and combined on a global level, where the complete event can be reconstructed and trigger decisions can be issued.
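The tree-like topology can be illustrated with a minimal sketch for the TPC part, using the 36 sectors and 6 sub-sectors given above. The processing and merging functions are hypothetical placeholders, not the HLT components themselves.

```python
# Illustrative sketch of a tree-like processing topology for the TPC.
# process_subsector() and merge() are hypothetical stand-ins.
N_SECTORS = 36
N_SUBSECTORS = 6


def process_subsector(sector, subsector):
    """Placeholder for local processing of one sub-sector's data."""
    return [f"segment(sector={sector}, subsector={subsector})"]


def merge(results):
    """Placeholder for merging the results of one layer into the next."""
    merged = []
    for result in results:
        merged.extend(result)
    return merged


def reconstruct_event():
    sector_level = []
    for sector in range(N_SECTORS):
        # Lowest layer: one result per sub-sector (one readout link each).
        local = [process_subsector(sector, sub) for sub in range(N_SUBSECTORS)]
        # Middle layer: merge the sub-sector results of one sector.
        sector_level.append(merge(local))
    # Top layer: combine all sectors into the global event.
    return merge(sector_level)


event = reconstruct_event()
print(len(event), "local results merged into one event")
```

With one input per sub-sector, this layout has 36 x 6 = 216 leaves, 36 sector-level nodes and a single global node.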
III. ONLINE PATTERN RECOGNITION
The main processing task is to reconstruct the tracks in the TPC and, in a final stage, to combine the tracking information from all detectors. Given the uncertainties of the anticipated particle multiplicities, different approaches are being considered for the TPC track reconstruction.
The conventional approach to TPC track reconstruction consists of a Cluster Finder and a subsequent Track Follower. In a first step, the Cluster Finder reconstructs the cluster centroids from the generated two-dimensional charge distributions in the TPC pad-row planes. Together with the position of the pad-row planes, the centroids are interpreted as three-dimensional space points along the particle trajectories and serve as input for the Track Follower, which connects the space points into track segments. A final helix fit of the track segments provides the track parameters and thus the kinematic properties of the particles.
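The centroid reconstruction of an isolated cluster can be sketched as a charge-weighted mean over the pad and time directions. The data layout (a mapping from pad and time bin to ADC value) and the function name are assumptions made for illustration only.

```python
# Minimal sketch of a centroid calculation for one isolated cluster.
# The input format {(pad, time_bin): adc} is an assumed, illustrative layout.
def cluster_centroid(charges):
    """Return (pad centroid, time centroid, total charge) of a cluster."""
    total = sum(charges.values())
    if total <= 0:
        raise ValueError("cluster has no charge")
    pad = sum(p * q for (p, _), q in charges.items()) / total
    time = sum(t * q for (_, t), q in charges.items()) / total
    return pad, time, total


# A symmetric toy cluster centered on pad 11, time bin 100.
example = {
    (10, 100): 5,
    (11, 100): 20,
    (12, 100): 5,
    (11, 99): 10,
    (11, 101): 10,
}
pad_c, time_c, charge = cluster_centroid(example)
print(pad_c, time_c, charge)
```

Such a weighted mean is only meaningful for well-separated clusters; overlapping charge distributions would need a deconvolution step.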
Such an approach has been implemented and evaluated on simulated ALICE TPC data [3]. The algorithms were originally developed for the STAR L3 trigger [4] and consist of a straightforward center-of-gravity calculation of cluster centroids and a Track Follower which applies conformal mapping to the space points. The latter enables the circular tracks to be fitted by a linear parametrization, thereby significantly reducing the computational requirements. The overall measured performance of the reconstruction chain, represented by the tracking efficiency as a function of the transverse momentum, is shown in Figure 2. The tracking efficiency for a multiplicity density of 4000 is similar to that achieved by the standard offline reconstruction chain. The algorithm is relatively fast and is therefore well suited for the lower multiplicity regime. For higher multiplicities the observed tracking performance deteriorates. This is due to the increasing detector occupancy, which gives rise to a significant number of overlapping clusters. In such a scenario the Cluster Finder fails to reconstruct the cluster centroids because it cannot deconvolute overlapping charge distributions. Information about the tracks is needed prior to reconstructing the cluster centroids in order to fit the individual distributions to a known shape. This can be done since the cluster shape depends mainly on the track parameters; together with the knowledge of the number of tracks contributing to a given cluster, the deconvolution can be based on a two-dimensional Gauss fit. Such an approach has been evaluated by applying an implementation of the Hough Transform to the raw ADC data and subsequently fitting the clusters to a two-dimensional Gauss function based on the found track candidates. However, this gray-scale Hough Transform produces too many candidates, which results in too many fake tracks. A better approach is a counting Hough Transform [5].
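The effect of the conformal mapping can be illustrated with a short sketch. It assumes the standard mapping u = x/(x^2 + y^2), v = y/(x^2 + y^2) and a circular track passing through the origin; a circle with center (a, b) then becomes the straight line 2au + 2bv = 1. The plain least-squares fit and all names are illustrative, not the STAR or ALICE code.

```python
import math


def conformal_map(x, y):
    """Map a space point (x, y) to conformal space (u, v)."""
    r2 = x * x + y * y
    return x / r2, y / r2


def fit_line(points):
    """Least-squares fit v = slope * u + intercept (assumes a non-vertical line)."""
    n = len(points)
    su = sum(u for u, _ in points)
    sv = sum(v for _, v in points)
    suu = sum(u * u for u, _ in points)
    suv = sum(u * v for u, v in points)
    slope = (n * suv - su * sv) / (n * suu - su * su)
    intercept = (sv - slope * su) / n
    return slope, intercept


# Toy track: circle through the origin with center (a, b), radius 50.
a, b = 30.0, 40.0
radius = math.hypot(a, b)
phi_origin = math.atan2(-b, -a)  # direction from the center to the origin
hits = [
    (a + radius * math.cos(phi_origin + d), b + radius * math.sin(phi_origin + d))
    for d in (0.2, 0.4, 0.6, 0.8, 1.0)
]

slope, intercept = fit_line([conformal_map(x, y) for x, y in hits])
# Expected from 2au + 2bv = 1: slope = -a/b, intercept = 1/(2b).
print(slope, intercept)
```

The circle parameters follow directly from the fitted line, which is why a linear fit is sufficient for tracks that originate at the mapping origin.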
The counting Hough Transform takes into account that the TPC is a continuous tracking device, so that all padrows contribute to a good track. Large gaps indicate fake candidates, and parameter space bins containing gaps are removed from the filling procedure. In addition, the parameter space is linearized using a conformal mapping. Both methods speed up the transformation and result in a simple peak structure in the parameter space. The obtained tracking efficiency as a function of track transverse momentum is shown in Figure 3. The efficiency is better than 95% for transverse momenta larger than about 0.5 GeV/c and does not depend on the event multiplicity. The abundance of fake track candidates is less than 5%.
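The idea of counting padrows in a linearized parameter space can be sketched as follows. This is a generic, simplified illustration and not the algorithm of [5]: it assumes a straight-line Hough transform in conformal space with the normal form rho = u cos(theta) + v sin(theta), counts the distinct padrows per bin, and rejects bins with large padrow gaps only after filling. Binning, ranges and names are arbitrary choices.

```python
import math
from collections import defaultdict


def conformal_map(x, y):
    r2 = x * x + y * y
    return x / r2, y / r2


def counting_hough(points, n_theta=180, n_rho=100, rho_max=0.01):
    """points: (padrow, x, y) tuples. Returns {(i_theta, i_rho): set of padrows}.

    Only theta in [0, pi) and rho in [0, rho_max) are scanned (a simplification).
    """
    width = rho_max / n_rho
    trig = [(math.cos(i * math.pi / n_theta), math.sin(i * math.pi / n_theta))
            for i in range(n_theta)]
    accumulator = defaultdict(set)
    for padrow, x, y in points:
        u, v = conformal_map(x, y)
        for i, (c, s) in enumerate(trig):
            rho = u * c + v * s
            if 0.0 <= rho < rho_max:
                accumulator[(i, int(rho / width))].add(padrow)
    return accumulator


def largest_gap(padrows):
    """Largest number of missing padrows between two contributing padrows."""
    rows = sorted(padrows)
    return max((hi - lo - 1 for lo, hi in zip(rows, rows[1:])), default=0)


# Toy track: circle through the origin, one hit per padrow (20 padrows).
theta0 = 30 * math.pi / 180      # direction of the circle center
rho0 = 0.00505                   # 1 / (2 * radius)
radius = 1.0 / (2.0 * rho0)
cx, cy = radius * math.cos(theta0), radius * math.sin(theta0)
hits = [(k,
         cx - radius * math.cos(theta0 + 0.1 + 0.025 * k),
         cy - radius * math.sin(theta0 + 0.1 + 0.025 * k))
        for k in range(20)]

acc = counting_hough(hits)
candidates = {b: rows for b, rows in acc.items() if largest_gap(rows) <= 2}
peak = max(candidates, key=lambda b: len(candidates[b]))
print(peak, len(candidates[peak]))
```

In this toy case all 20 padrows fall into a single bin, which gives the simple peak structure described above; bins fed by unrelated hits collect only a few padrows or show large gaps.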
The overall computing time needed for the TPC tracking for different multiplicities is shown in Table I. The reference platform was an Intel Pentium 4 (2.8 and 3 GHz, respectively), which corresponds to a performance rating of approximately 1k SPECint. The CF + TF approach produces track parameters as well as space points for refitting and analysis, while the fast Hough transform only yields track parameters. Assuming a multiplicity as predicted by many models based on RHIC results, a farm of about 1000 CPUs would suffice to solve the pattern recognition task within the time budget of about 5 ms.
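One way to read these numbers is sketched below. It assumes that the 5 ms budget is the mean time between events at the central Pb-Pb rate of 200 Hz quoted in the conclusion, and that the events are distributed over the farm with ideal parallel scaling; both are assumptions of this sketch, and the names are illustrative.

```python
import math

EVENT_RATE_HZ = 200.0  # central Pb-Pb rate quoted in the conclusion
N_CPUS = 1000          # farm size quoted in the text


def time_between_events_s(rate_hz):
    """Mean time between two consecutive events."""
    return 1.0 / rate_hz


def cpus_needed(cpu_seconds_per_event, rate_hz):
    """CPUs required to keep up with the rate, assuming ideal scaling."""
    return math.ceil(cpu_seconds_per_event * rate_hz)


budget_s = time_between_events_s(EVENT_RATE_HZ)   # about 0.005 s
cpu_time_per_event_s = N_CPUS * budget_s          # about 5 s of CPU time
print(f"time between events: {budget_s * 1e3:.1f} ms")
print(f"CPU time available per event: {cpu_time_per_event_s:.1f} s")
print(f"CPUs for 5 s per event: {cpus_needed(5.0, EVENT_RATE_HZ)}")
```

Under these assumptions, 1000 CPUs leave about 5 s of total CPU time per event while still delivering one result every 5 ms on average.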
IV. ITS TRACKING AND TRIGGER
The tracks found in the TPC are followed into the Inner Tracking System (ITS). The offline code was used for the processing of the ITS data and the tracking [6]. The efficiency of the combined tracking (Figure 3, gray curve) is slightly lower than for the TPC only (black curve). The impact parameter resolution is dominated by the resolution of the innermost layer of the silicon pixel detector. A transverse resolution of 60 microns has been achieved, comparable to offline results. Based on this impact parameter resolution, track candidates stemming from a secondary vertex can be selected. The open charm finder used here is the offline code processing HLT tracks. The invariant mass resolution (Figure 5) is about 2-3 times larger than the offline result. The rate of background events can be reduced by a factor of 20. The computing time needed by the ITS processing and tracking and by the open charm finder for different multiplicities is shown in Table II. The reference platform was a 1.3k SPECint machine. Only the silicon pixel and silicon strip detectors were included in the HLT processing. The processing is fast, both for the ITS part and the open charm trigger.
V. I/O INTERFACE TO DAQ
The HLT system interfaces to the DAQ via the DDL. The detector data is split on the D-RORC and a copy is sent to the DIU on the H-RORC. The HLT system ships the trigger decision, modified and compressed data, as well as additional data (ESD) back to DAQ via a DDL. The data flow into and out of the HLT has been successfully tested in the TPC test beam setup at the PS (see Figure 6).
VI. IMPLEMENTATION
The HLT system has three main components. The first is a farm of clustered SMP-nodes, based on off-the-shelf PCs and connected with a high-bandwidth, low-overhead network. The second is a set of custom PCI receiver cards (H-RORCs), which receive a replica of the raw data via the standard ALICE DDL link and also provide an FPGA co-processor for the data-intensive tasks of the pattern recognition. The third is a generic communication framework based on the publisher-subscriber principle, which allows the construction of any hierarchy of communication processing elements and guarantees fault-tolerance.
A. FPGA co-processor
The final design of the H-RORC is shown in Figure 7. Some of the pattern recognition algorithms (cluster finder) have been re-designed in VHDL for execution in the Virtex-4 FPGA; they have been simulated, synthesized and then benchmarked in hardware. The fast Hough transform is currently being implemented in VHDL.
B. Data transport framework
The design of the framework used to construct the data flow inside the HLT cluster is based on the publisher-subscriber paradigm, in which subscribers inform a publisher of their interest in the data offered [7]. From this point on, the publisher announces new events to its registered subscribers as they become available. In the design of this interface, particular emphasis is placed on efficiency, flexibility and fault tolerance. Efficiency is required because the analysis of the event data will need a very significant amount of CPU power. The transport of data should therefore use as few CPU resources as possible, so that as much CPU time as possible remains for processing. This is achieved in the framework by not transporting actual data between the framework's components. Instead, data is placed into a shared memory segment by its publishing object, and descriptors of that data are transmitted to the subscribers via named pipes. When all subscribers have informed the publisher that they have finished processing an event, it is released and the shared memory can be re-used. The primary mechanism for providing flexibility is the separation of the framework into components (data flow, data processing and data sink components), which can be connected in different configurations so that any processing hierarchy can be constructed. As the publisher-subscriber interface supports dynamic connections and disconnections at runtime, the system configuration can be adapted while it is active. This dynamic reconfiguration is also one of the major features supporting the fault-tolerance of a system built with this framework. It allows for the replacement of failed components during runtime and also for the addition and/or removal of components as required for the reaction to events occurring in a system. A second major building block for fault-tolerance is related to the bridge components connecting different nodes.
These components also have the ability to establish connections dynamically at runtime, not only for reestablishing existing connections but also for new connections between nodes. Through this mechanism it becomes possible to isolate faulty nodes in the system and replace them with other, previously unused nodes.
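The descriptor-based transport described in this section can be illustrated with a small in-process sketch. A bytearray stands in for the shared memory segment and a queue per subscriber stands in for the named pipe; the class names, the fixed-size slot allocation and the method names are assumptions of this sketch and not the framework's interface.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class EventDescriptor:
    """What is sent to subscribers instead of the event data itself."""
    event_id: int
    offset: int
    size: int


class Subscriber:
    def __init__(self, name):
        self.name = name
        self.inbox = deque()  # stands in for the named pipe
        self.seen = []

    def process_all(self, publisher):
        while self.inbox:
            desc = self.inbox.popleft()
            # Read the payload in place from the (mock) shared memory.
            payload = bytes(publisher.buffer[desc.offset:desc.offset + desc.size])
            self.seen.append((desc.event_id, payload))
            publisher.event_done(self.name, desc.event_id)


class Publisher:
    def __init__(self, n_slots, slot_size):
        self.slot_size = slot_size
        self.buffer = bytearray(n_slots * slot_size)  # mock shared memory
        self.free_slots = list(range(n_slots))
        self.subscribers = []
        self.pending = {}  # event_id -> (slot, names still processing)

    def subscribe(self, subscriber):
        self.subscribers.append(subscriber)

    def publish(self, event_id, payload):
        if len(payload) > self.slot_size:
            raise ValueError("payload larger than slot")
        if not self.free_slots:
            raise RuntimeError("no free slot in buffer")
        slot = self.free_slots.pop()
        offset = slot * self.slot_size
        self.buffer[offset:offset + len(payload)] = payload
        desc = EventDescriptor(event_id, offset, len(payload))
        self.pending[event_id] = (slot, {s.name for s in self.subscribers})
        for s in self.subscribers:
            s.inbox.append(desc)  # only the descriptor is transmitted
        if not self.subscribers:
            self._release(event_id)

    def event_done(self, name, event_id):
        _, waiting = self.pending[event_id]
        waiting.discard(name)
        if not waiting:  # all subscribers have finished this event
            self._release(event_id)

    def _release(self, event_id):
        slot, _ = self.pending.pop(event_id)
        self.free_slots.append(slot)


pub = Publisher(n_slots=2, slot_size=16)
tracker, monitor = Subscriber("tracker"), Subscriber("monitor")
pub.subscribe(tracker)
pub.subscribe(monitor)
pub.publish(1, b"clusters-1")
tracker.process_all(pub)
slots_free_after_first = len(pub.free_slots)  # slot still held by "monitor"
monitor.process_all(pub)
slots_free_after_all = len(pub.free_slots)    # slot released for re-use
```

The slot is returned to the free list only after the last subscriber has reported completion, which mirrors the release rule stated above.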
One of the important challenges in the ALICE HLT will be the management of the large number of framework component processes distributed in the cluster. It has to be ensured that all processes are started and connected in the correct order. For this purpose a system, the TaskManager [9], has been developed to control and supervise the framework components.
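The ordering requirement can be illustrated with a generic dependency sort. This is not the TaskManager implementation: the component names are hypothetical, and the sketch assumes that each component may only be started after the component it subscribes to.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical component chain: each key lists the components that
# must already be running before it can be started and connected.
dependencies = {
    "cluster_finder": {"rorc_publisher"},
    "tracker": {"cluster_finder"},
    "sector_merger": {"tracker"},
    "global_merger": {"sector_merger"},
}

start_order = list(TopologicalSorter(dependencies).static_order())
print(start_order)
```

For a real configuration with many parallel chains, the same sort would yield one valid start order among several; a cyclic configuration raises `graphlib.CycleError`.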
VII. GRID-LIKE HLT CONFIGURATION
In the ALICE High Level Trigger a GRID approach would in principle be feasible, as the HLT does not have any fixed latency requirements for its trigger decision. A globally distributed test of the HLT system was intended as a proof-of-principle demonstration and feasibility study of online grid-like systems. In order to create a full global north-south axis as well as some east-west expansion, two further sites in Tromsoe/Norway and Dubna/Russia have been included in the setup, in addition to the listed HLT collaboration institutes.
For the test a configuration was chosen that mimics a part of the HLT processing, incorporating input from TPC and Dimuon detectors [8]. At three of the sites, Bergen, Tromsoe and Dubna, the components were set up to correspond to cluster finding on data from the TPC detector. Output data produced at these three sites was then sent to Heidelberg, where it was merged for TPC tracking. The output produced by these four components was then sent to Cape Town. In Cape Town the mock-up TPC data was merged with mock-up Dimuon data generated by another processing chain. This chain simulated the processing of Dimuon detector data from cluster finding up to tracking. As the last step in the processing chain, the tracked mock-up data was merged with the received TPC data. In a real setup this component would be the location where the trigger decision would be made and/or where the completely reconstructed event data could be written to permanent storage. The full setup is shown in Figure 8. This test ran unattended for more than 15 hours. During this time more than 500,000 events were passed through the mock-up processing chain. The event rate was limited by the network to about 10 Hz.
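The quoted run time, event rate and event count are mutually consistent, as the following short check shows. It treats the rounded values from the text (15 hours, 10 Hz) as exact, which is an assumption of this sketch.

```python
# Consistency check of the grid test figures, using the rounded values.
RUN_TIME_HOURS = 15
EVENT_RATE_HZ = 10

expected_events = RUN_TIME_HOURS * 3600 * EVENT_RATE_HZ
print(expected_events)
```

A run of 15 hours at 10 Hz corresponds to 540,000 events, in line with the more than 500,000 events reported.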
VIII. CONCLUSION
The current TPC tracking performance shows that a sufficient event reconstruction within the central Pb-Pb event rate of 200 Hz will be achievable for multiplicity densities of 4000. For higher densities, cluster deconvolution based on track parameters becomes necessary. In this scenario the fast linearized Hough Transform has proven to be efficient and fast up to multiplicity densities of 8000 for transverse momenta larger than about 0.5 GeV/c. The ITS can be included in the HLT processing scheme with sufficient efficiency and moderate CPU requirements. An open charm trigger is feasible. The final design of the custom H-RORC is under way, as is a VHDL implementation of the fast Hough transform. The I/O interface to DAQ has been successfully tested in the TPC test beam. It has been demonstrated that distributed grid-like online systems are feasible in principle, provided that the necessary conditions are met.
