The STAR level-3 trigger is a MYRINET-interconnected ALPHA processor farm, performing online tracking of N ≥ 8000 particles (N ≤ 45 points per track) with a design input rate of R = 100 Hz. A large-scale prototype system was tested in 12/99 with laser and cosmic particle events.
Introduction
The RHIC accelerator at Brookhaven National Laboratory, USA, will investigate Au+Au collisions with √s ≤ 200 A GeV and p+p collisions with √s ≤ 500 GeV. The STAR experiment [1] is a large-scale, cylindrical, symmetric 4π detector. Physics data taking will start in 2000 with a full-size Time Projection Chamber (TPC, inner radius R = 0.6 m, outer radius R = 2 m) with 24 sectors of 6912 pads each. TPCs are specifically suitable for detecting high-density charged particle fluxes in high-multiplicity nucleus-nucleus events.
Architecture

The STAR trigger architecture
The STAR trigger system is subdivided into 4 hierarchic levels. The level-0 output rate is 10^5 Hz; levels 1 and 2, as well as coincidence with the TPC gating, reduce the rate by one order of magnitude each. The level-3 trigger is supposed to reduce an input rate of 10^2 Hz to the final DAQ rate of R = 1 Hz at an expected TPC event size of ≈15 MB. Task examples for the level-0/-1/-2 trigger stages are (a) selection of central and peripheral Au+Au events based upon multiplicity (a function of impact parameter) and (b) rejection of beam-gas events with a vertex far from the interaction point. The tasks of the level-3 trigger are event selections based upon the online reconstructed track parameters of each particle. Several applications have been proposed for Au+Au collisions. On the one hand, there are high p_T trigger applications such as enrichment of heavy (anti-)fragments (e.g. He), STAR EMC (Electromagnetic Calorimeter) calibration using tagged high p_T negatively charged particles, QCD hard parton scattering (leading high p_T hadrons in jets), and c and b quark decays (e.g. high p_T leptons). On the other hand, online invariant mass reconstruction of J/ψ, Υ → e+e− is proposed, as suppression of cc̄ and bb̄ production is commonly regarded as one of the most promising signatures of the quark-gluon plasma. Additionally, for p+p collisions, a level-3 trigger algorithm for filtering of ≤700 pile-up events in the TPC (at the highest luminosity, L = 2 × 10^(…) cm^-2 s^-1) per level-0 trigger is being developed.

Fig. 1. STAR level-3 trigger system architecture as used in the system test in 12/99 (cf. Section 8). Event building of level-3 specific events has been performed locally on the Global-L3 CPU; integration into the STAR event building is foreseen.
The STAR DAQ architecture
The level-3 trigger architecture is closely related to the STAR DAQ architecture, in which one VME crate is mapped onto each physical TPC sector (two TPC sectors in the first stage). Each crate contains a Sector Broker, i.e. a Motorola MVME-2306 VME board carrying a PowerPC 604 (300 MHz, VxWorks), as the TPC sector master controller. The Sector Broker carries a MYRINET interface (cf. Section 5) for (a) raw data transfer to the main STAR event builder and (b) connection to the level-3 track finder CPUs. Moreover, each DAQ crate also contains six VME receiver boards, each carrying three mezzanine cards with
• one Intel i960 CPU (33 MHz, VxWorks) for (a) data formatting and (b) running the level-3 cluster finder,
• 4 MB of dual-ported VRAM for buffering and pipelining of raw data of 12 events.
Further details about the hardware are described elsewhere [2, 3] .
The STAR level-3 trigger architecture
Level-3 trigger scheme consists of two main components:
• The sector level-3 part ("Sector-L3") is mapped onto one physical TPC sector. It contains (a) the level-3 cluster finder (cf. Section 3) and (b) the level-3 track finder (cf. Section 4). Data transfer of cluster data and track data is performed by MYRINET (cf. Section 5). Typical data sizes per event are ≈85 MB for raw data, ≈15 MB after zero suppression (DAQ taping event size), ≈3 MB after cluster finding and ≈0.5 MB after track finding.
• The global level-3 part ("Global-L3") consists of 1, 2, ..., n (n = 1 in the first stage) master CPUs for the whole STAR TPC, collecting all track data via MYRINET and issuing the level-3 decision.

Fig. 1 shows the schematic level-3 trigger architecture, as installed for the system test in 12/99 (cf. Section 8). The project development of the level-3 trigger can be subdivided into two main stages.
• In the first stage (installed 12/99), for both the DAQ and the level-3 trigger system, two physical TPC sectors are mapped onto one logical level-3 sector, which contains only one track finder CPU. Thus, all trigger rate design values must be scaled down accordingly. The level-3 trigger will employ TPC data only, and the input trigger rate is R ≤ 25 Hz.
• In the second stage, envisaged for 2001, one TPC sector maps onto one level-3 sector, which then contains up to 4 parallelized track finder CPUs. The level-3 trigger will employ additional track information from the SVT (Silicon Vertex Tracker) [4], and the final design value for the input trigger rate is R = 100 Hz (limited by the TPC frontend readout rate).

Monte-Carlo simulations were performed using the event generator HIJING 1.31, which is based on a QCD-inspired model for jet production [5].
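As a rough cross-check of the event sizes listed above, the average throughput of each stage can be estimated as event size times event rate. The following is a minimal sketch under that assumption only; the function name and the even split of cluster data over 24 sectors are illustrative choices, not part of the STAR software.

```python
# Illustrative estimate: throughput = event size x event rate.
# Event sizes (MB per event, whole TPC) are the values quoted in the text.
SIZES_MB = {"zero_suppressed": 15.0, "clusters": 3.0, "tracks": 0.5}
N_SECTORS = 24  # TPC sectors


def throughput_mb_per_s(event_size_mb, rate_hz, n_links=1):
    """Average throughput in MB/s, assuming an even split over n_links."""
    return event_size_mb * rate_hz / n_links


if __name__ == "__main__":
    # DAQ taping at the final rate of 1 Hz
    print(throughput_mb_per_s(SIZES_MB["zero_suppressed"], 1.0))
    # Track data for the whole TPC at the design input rate of 100 Hz
    print(throughput_mb_per_s(SIZES_MB["tracks"], 100.0))
    # Cluster data per sector at the first-stage input rate of 25 Hz
    print(throughput_mb_per_s(SIZES_MB["clusters"], 25.0, N_SECTORS))
```

Under these assumptions the estimates (15 MB/s, 50 MB/s and about 3.1 MB/s per sector) are close to the throughput figures quoted in Sections 3 and 5.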
Cluster finder
For a TPC readout, one ADC channel is indexed by a pad number (rφ-direction, e.g. 88 pads for the innermost padrow at R = 0.6 m) and a drift timebin number (z-direction, 512 timebins per pad). Clusters are contiguous (rφ, z) regions with ADC values above threshold. In a first step, for each TPC cluster the center-of-gravity (weighted mean according to ADC values) is calculated to obtain particle hit xyz-coordinates.
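The center-of-gravity step can be sketched as an ADC-weighted mean over the samples of one cluster. This is a minimal illustration, not the i960 implementation: the function name, the tuple layout (pad, timebin, adc) and the threshold parameter are assumptions, and the conversion from pad/timebin units to xyz-coordinates (which needs the TPC geometry and drift velocity) is left out.

```python
def cluster_centroid(samples, threshold=0):
    """ADC-weighted mean pad and timebin of one cluster.

    samples: iterable of (pad, timebin, adc) tuples belonging to one cluster.
    Returns (mean_pad, mean_timebin, total_charge).
    """
    total = 0.0
    pad_sum = 0.0
    time_sum = 0.0
    for pad, timebin, adc in samples:
        if adc <= threshold:
            continue  # ignore samples at or below threshold
        total += adc
        pad_sum += adc * pad
        time_sum += adc * timebin
    if total == 0.0:
        raise ValueError("no samples above threshold")
    return pad_sum / total, time_sum / total, total


if __name__ == "__main__":
    # Three neighbouring pads in one timebin, symmetric charge sharing
    cluster = [(10, 100, 1), (11, 100, 2), (12, 100, 1)]
    print(cluster_centroid(cluster))
```

The returned total charge corresponds to the ADC sum that is shipped together with the center-of-gravity as cluster data.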
The cluster finder algorithm runs on the Intel i960 CPUs on the DAQ receiver boards (cf. Section 2.2). There are 18 i960s per TPC sector, 432 for the whole TPC. Input to the cluster finder are zero-suppressed TPC raw data, stored in the VRAM buffer. The output cluster data, i.e. (a) cluster center-of-gravity and (b) cluster total charge (ADC sum), are sent via VME to the Sector Broker, which itself ships the data via MYRINET to the level-3 track finder CPU (expected data transfer rate of ≈3 MB/s per TPC sector).
The cluster finder time constraint is τ ≤ 10 ms (input rate R = 100 Hz). Benchmarks on the i960 were performed for 600 clusters (realistic Au+Au scenario) on the TPC's innermost padrow. A position resolution of σ(rφ) ≈ 37 µm and σ(z) ≈ 13 µm could be obtained with an algorithm running within τ = 7.5 ms (absolute cluster finding efficiency of 93%). The clusters and reconstructed centers-of-gravity are shown in Fig. 2 (left). If two clusters are merged, an additional deconvolution subroutine must be started, consuming 6.0% more CPU time than in the case of two separated clusters. Fig. 2 (right) shows the reconstructed clusters for a STAR beam-gas event, recorded in 07/99. Different TPC clock timing leads to a maximum drift timebin of t = 348 in that case (x-axis).
Track finder
In the STAR solenoid magnetic field of B = 0.5 T, charged particle tracks can be parametrised as helices, which appear as circles in an xy-projection. Monte-Carlo simulations predict for a central Au+Au collision (impact parameter b ≤ 2.0 fm) that the track finder algorithm must be able to fit at least N ≈ 400 tracks per event per TPC sector, each consisting of N ≤ 45 points (given by the number of padrows). This shall be referred to as the "Au+Au benchmark event" hereafter.
The fast track finder algorithm has been developed specifically for the level-3 trigger project [6]. It employs conformal mapping, i.e. a transformation of a circle into a straight line, followed by a fit with a follow-your-nose method. A given space point (x, y) is transformed into a conformal space point (x', y') according to the equations x' = (x − x_t)/r² and y' = (y_t − y)/r², using r² = (x − x_t)² + (y − y_t)². The transformation requires the knowledge of one point (x_t, y_t) on the track trajectory, either (a) the interaction point (vertex constraint for primary tracks) or (b) the first point associated with the track (no vertex constraint for secondary tracks). Fig. 3 shows an example of 1000 simulated level-3 trigger tracks of positively and negatively charged particles (p_T ≤ 1.0 GeV/c). Fig. 4 shows the track finder efficiency as a function of p_T (top) and pseudorapidity η (bottom) for 50,000 Monte-Carlo generated tracks (independent of particle type). The efficiency of ≥90% for |η| ≤ 1.2 and p_T ≥ 400 MeV/c is well suited for high p_T trigger applications (cf. Section 2.1), the p_T resolution being e.g. 14.9 MeV/c (RMS) for p_T = 500 MeV/c.

Fig. 4. Level-3 track finder efficiency as a function of p_T (top) and pseudorapidity (bottom) for 50,000 Monte-Carlo generated tracks (assumed hit resolutions of σ(rφ) = 500 µm and σ(z) = 2 mm).

The track finder time constraint is given by τ_track ≤ τ_buffer − τ_cluster, where τ_cluster ≤ 10 ms is the time necessary for cluster finding (cf. Section 3). In the first project stage τ_buffer is given by the buffer time of 12 pipelined events (τ_buffer = 12 × 10 ms), in the final stage by one event only (τ_buffer = 10 ms). Timing benchmarks for different CPU platforms have been described in detail in Ref. [2]. The fastest available CPU platform is the ALPHA 21264 (64 bit). For the Au+Au benchmark event, an ALPHA XP1000 (500 MHz) prototype machine gave a result of τ_track = 88 ms, to be compared with τ_track = 135 ms for e.g. a Pentium II (450 MHz). Fine tuning of initialization parameters (e.g. the number of slices for the follow-your-nose search range) led to a significant improvement of the processing speed (τ_track = 39 ms). If, additionally, a dE/dx truncated mean value is calculated for each track, an additional time of 8 ms is needed. A truncated mean is calculated by (a) transformation of ADC values into dE/dx (e.g. gain calibration, pedestal subtraction), (b) list sorting and (c) calculation of the mean of the lowest 70% (tail truncation). The current system consists of 12 ALPHA DS-10 (466 MHz) machines.

The ALPHA 21264 chip provides two additional significant advantages. Firstly, it is the first ALPHA chip with a hardware sqrt() function implementation (~30 sqrt() calls per track). Secondly, it has 128 bit wide memory access to 2 MB of level-2 cache on chip, important for fast "data digging" (only 64 bit wide access to external cache for Intel Pentium). Further ALPHA 21264 hardware details are described elsewhere [2, 7]. Linux was chosen as the operating system, with the kernel Linux 2.2.12 running stably on ALPHA. Each DS-10 machine boots Linux diskless, with console messages routed via serial port and ethernet to any arbitrary terminal, thus enabling a single user to remotely control the whole processor farm.

Background DMA (Direct Memory Access) data transfers do not consume any CPU time (except for bus arbitration and interrupt handling). Thus, the processor is free for tasks such as track finding or data formatting.
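The conformal mapping used by the track finder in this section can be illustrated with a short sketch. It assumes the reference point on the trajectory is the origin (vertex constraint) and checks numerically that points on a circle through that point land on a straight line in conformal space. All function names are hypothetical; this is not the level-3 production code, and the follow-your-nose fit itself is not shown.

```python
import math


def conformal_map(x, y, xt=0.0, yt=0.0):
    """Map (x, y) to conformal space using reference point (xt, yt)."""
    dx, dy = x - xt, y - yt
    r2 = dx * dx + dy * dy
    if r2 == 0.0:
        raise ValueError("point coincides with the reference point")
    return dx / r2, -dy / r2


def circle_through_origin(cx, cy, n_points=10, step=0.1):
    """Points on the circle centred at (cx, cy) that passes through (0, 0)."""
    radius = math.hypot(cx, cy)
    phi0 = math.atan2(-cy, -cx)  # angle at which the circle hits the origin
    return [
        (cx + radius * math.cos(phi0 + step * k),
         cy + radius * math.sin(phi0 + step * k))
        for k in range(1, n_points + 1)
    ]


def line_residual(u, v, cx, cy):
    """Zero if (u, v) lies on the line 2*cx*u - 2*cy*v = 1."""
    return 2.0 * cx * u - 2.0 * cy * v - 1.0


if __name__ == "__main__":
    for x, y in circle_through_origin(3.0, 4.0):
        u, v = conformal_map(x, y)
        print(f"{u:+.4f} {v:+.4f} residual={line_residual(u, v, 3.0, 4.0):+.1e}")
```

The straight line follows directly from the circle equation: for a circle through the origin, x² + y² = 2·cx·x + 2·cy·y, and dividing by r² gives 2·cx·x' − 2·cy·y' = 1. A linear fit in conformal space is therefore sufficient to recover the circle parameters.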
Network
The level-3 trigger system requires, for each TPC sector, a high-bandwidth network connection between a PMC adapter (on the Sector Broker VME board) and a PCI adapter (in the Sector-L3 ALPHA). The estimated whole-system throughput is R = 52 Mbytes/s (level-3 track data, Au+Au benchmark event), to be added to R ≈ 15 Mbytes/s DAQ data throughput. Two different point-to-point, full-duplex networks, Scalable Coherent Interface (SCI) [8] and MYRINET, were tested for application in the level-3 trigger. Due to missing DMA capabilities, Gigabit Ethernet was not considered. The latency L is defined as the difference between the time at which an interrupt (e.g. end-of-package) is issued on the receiver and the sender acknowledge time.
The basic architectures differ significantly: SCI utilizes a ring topology, MYRINET a switch topology. In a ring topology, any failure (e.g. a single faulty cable) affects the complete system, in a switch topology only a single point-to-point connection. Thus, e.g. SCI requires the Sector Broker to carry two PMC adapters (different rings), while MYRINET requires (16-port) switches between the PMC and the PCI side. Fig. 5 shows the bandwidth and latency as a function of buffer size for both MYRINET and SCI (PCI-PCI point-to-point DMA benchmark). Typical buffer sizes for the level-3 trigger are b = 128 byte for messages and b ≥ 20 kbyte for data transfers. In both cases, the bandwidth is limited by the PCI bus to R ≈ 60–70 Mbytes/s. In the case of SCI, the maximum bandwidth is already achieved for small buffer sizes b ≥ 64 bytes (corresponding to the SCI payload), but saturates at a lower level (R = 62 Mbyte/s). However, former D310 versions also achieved R = 72 Mbyte/s (cf. Fig. 6 in Ref. [2]); only a recent hardware revision led to a bandwidth reduction of ΔR/R ≈ 14%. For both networks, the CPU usage is as low as ≤12% due to DMA. In the case of SCI the (one-way) latency of L = 2–3 µs is smaller than in the case of MYRINET (L ≈ 30–40 µs) due to an extra MYRINET software layer. The latency limit for the STAR DAQ design is L ≤ 100 µs. The final decision for the usage of MYRINET was driven by (a) long-term test stability issues, (b) hardware revision status and availability and (c) free availability of MYRINET driver software as "open source" for numerous platforms (e.g. Linux/Intel, Linux/ALPHA, VxWorks).
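The two quantitative comparisons in this section, the SCI bandwidth reduction and the latency limit, amount to simple arithmetic that can be sketched as follows. The numbers are the ones quoted in the text; the helper names and the choice of the upper end of each latency range as worst case are assumptions made for illustration.

```python
# Values quoted in the text (bandwidth in Mbyte/s, latency in microseconds).
SCI_BANDWIDTH_OLD = 72.0   # former D310 versions
SCI_BANDWIDTH_NEW = 62.0   # after the recent hardware revision
LATENCY_LIMIT_US = 100.0   # STAR DAQ design limit
LATENCY_RANGE_US = {"SCI": (2.0, 3.0), "MYRINET": (30.0, 40.0)}


def relative_reduction(old, new):
    """Fractional reduction when going from old to new."""
    return (old - new) / old


def within_latency_limit(network, limit_us=LATENCY_LIMIT_US):
    """True if the worst-case (upper) latency stays within the limit."""
    _, worst = LATENCY_RANGE_US[network]
    return worst <= limit_us


if __name__ == "__main__":
    drop = relative_reduction(SCI_BANDWIDTH_OLD, SCI_BANDWIDTH_NEW)
    print(f"SCI bandwidth reduction: {drop:.1%}")
    for name in LATENCY_RANGE_US:
        print(name, "within limit:", within_latency_limit(name))
```

Both networks stay within the latency limit under this reading, which is consistent with the final choice being driven by stability, hardware availability and driver software rather than by latency.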
Global Level-3 trigger
The Global-L3 CPU (a) collects track data from all Sector-L3 machines, (b) runs a level-3 decision algorithm based on event characteristics (e.g. invariant mass reconstruction, further examples in Section 2.1) and (c) issues the level-3 yes/no decision to the event builder (MYRINET message). Merging of low p_T tracks split by TPC sector boundaries is foreseen for the future. Multiple Global-L3 processes and/or CPUs will cover different physics decision tasks simultaneously, the yes/no decision being issued as a logical OR. Both a Pentium III (600 MHz) and an ALPHA XP1000 (500 MHz) have been tested as Global-L3, showing no difference in MYRINET performance. As an example for a global decision algorithm, invariant mass reconstruction for 100 particle pairs (J/ψ → e+e−, high p_T, cut p_T ≥ 1.5 GeV/c pre-applied) requires a CPU time of t = 0.4 ms on ALPHA.

Fig. 6. 3D event display (cf. Section 7) for a level-3 processed cosmic particle event (top) and a laser event (bottom), recorded in the system test (cf. Section 8) in 12/99 (level-3 cluster centers-of-gravity, zy side view).

The step from one up to 12 Sector-L3 CPUs also marks the step from synchronous to asynchronous mode (data arriving in random sequence).

For laser events the track finder was not operational, because straight tracks result in unreasonable p_T → ∞ values.
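The invariant mass reconstruction named in this section as an example of a global decision algorithm can be sketched as follows. This is an illustration under stated assumptions, not the Global-L3 code: tracks are given as momentum triples (px, py, pz) in GeV/c, all candidates are treated as electrons, the p_T cut of 1.5 GeV/c is taken from the text, and the mass window used in the accept decision is a hypothetical parameter.

```python
import math

ELECTRON_MASS = 0.000511  # GeV/c^2


def energy(p, mass=ELECTRON_MASS):
    """Relativistic energy for momentum p = (px, py, pz)."""
    px, py, pz = p
    return math.sqrt(px * px + py * py + pz * pz + mass * mass)


def transverse_momentum(p):
    return math.hypot(p[0], p[1])


def invariant_mass(p1, p2, mass=ELECTRON_MASS):
    """Invariant mass of a two-particle system."""
    e = energy(p1, mass) + energy(p2, mass)
    px, py, pz = (p1[i] + p2[i] for i in range(3))
    m2 = e * e - (px * px + py * py + pz * pz)
    return math.sqrt(max(m2, 0.0))  # guard against rounding below zero


def pair_masses(positives, negatives, pt_min=1.5):
    """Masses of all opposite-sign pairs passing the p_T cut."""
    pos = [p for p in positives if transverse_momentum(p) >= pt_min]
    neg = [p for p in negatives if transverse_momentum(p) >= pt_min]
    return [invariant_mass(a, b) for a in pos for b in neg]


def accept_event(masses, m_low, m_high):
    """Accept if any pair falls inside the (hypothetical) mass window."""
    return any(m_low <= m <= m_high for m in masses)


if __name__ == "__main__":
    positives = [(1.55, 0.0, 0.0), (1.0, 0.0, 0.0)]  # second fails the cut
    negatives = [(-1.55, 0.0, 0.0)]
    masses = pair_masses(positives, negatives)
    print(masses, accept_event(masses, 2.9, 3.3))
```

Applying the p_T cut before pairing keeps the number of combinations small, which is in line with the text's statement that the cut is pre-applied.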
3D event display
In order to visualize and browse a large number of events (level-3 accept and reject cases) quickly, a fast 3D event display has been developed. The C++ program is based upon the high-performance graphics library OpenGL (Mesa 3.0 library). The graphical user interface has been designed using the Qt 2.0 library. The hardware, a Pentium III (600 MHz) with an (OpenGL-supporting) nVidia RIVA TNT2 graphics adapter, allows the display of 250,000 clusters (N = 8000 tracks) as well as mouse-controlled operations such as rotating and zooming in quasi-realtime without significant delay.
System test
In 12/99, a detailed system test of the level-3 trigger system (cf. Fig. 1) was performed in two stages.
(1) While the STAR TPC was switched off, empty events (event size identical to that of an Au+Au collision event) were used for tests of high-rate and long-term stability of one single sector subpath (one VME crate, one Sector-L3, one Global-L3). For messages only (72 messages per event), a trigger input rate of R ≈ 600 Hz could be processed stably for 1.2 million events. For messages and data transfer, a bandwidth of R = 48 Mbytes/s was achieved (buffer size b = 196 kbyte), lower than the MYRINET bandwidth (Fig. 5, left) due to the message protocol.
(2) While the STAR TPC and the STAR magnet (B = 0.5 T) were switched on, a test of the complete level-3 trigger prototype system (432 Intel i960, 12 Sector-L3, one Global-L3) was performed with laser and cosmic particle events. Event displays for one event of each type (using the 3D event display, cf. Section 7) are shown in Fig. 6.
For single cosmic particle tracks, a very high track finding efficiency of ≥95% was achieved, although loose cuts (in order to be able to find tracks with z even outside the TPC) led to a number of reconstructed tracks N > 1 in ~30% of all cases. In total, 15,000 level-3 specific events (cluster and track data) were recorded successfully. Further cosmic data taking is planned for 01-03/2000. The start of the RHIC physics program is envisaged for 04/2000.
Summary
The STAR level-3 trigger system will perform online track finding for high-multiplicity Au+Au collider events (N ≥ 8000 tracks, N ≤ 45 points per track), utilizing high-performance ALPHA 21264 CPUs and high-bandwidth MYRINET network interfaces (expected data transfer rate ≈52 Mbytes/s). A global level-3 CPU will perform tasks such as online invariant mass reconstruction and issue an accept/reject decision with a design input rate of R = 100 Hz (R ≤ 25 Hz for the prototype system in 12/99).
