The Compact Muon Solenoid experiment at the Large Hadron Collider reads out event fragments of an average size of 2 kB from around 650 detector front-ends at a rate of up to 100 kHz. The first stage of event-building is performed by the Super-Fragment Builder employing custom-built electronics and a Myrinet optical network. It reduces the number of fragments by one order of magnitude, thereby greatly decreasing the requirements for the subsequent event-assembly stage. Back-pressure from the down-stream event-processing or variations in the size and rate of events may give rise to buffer overflows in the subdetectors' front-end electronics, which would result in data corruption and would require a time-consuming re-sync procedure to recover. The Trigger-Throttling System protects against these buffer overflows. It provides fast feedback from any of the subdetector front-ends to the trigger so that the trigger can be throttled before buffers overflow. This paper reports on new performance measurements and on the recent successful integration of a scaled-down setup of the described system with the trigger and with front-ends of all major subdetectors. The on-going commissioning of the full-scale system is discussed.
I. INTRODUCTION
The Compact Muon Solenoid (CMS) experiment [1] at CERN's Large Hadron Collider (LHC) will search for new physics at the TeV scale such as the Higgs mechanism or Super-Symmetry. At its design luminosity of $10^{34}$ cm$^{-2}$ s$^{-1}$ the LHC will provide proton-proton collisions at a center-of-mass energy of 14 TeV with a bunch crossing frequency of 40 MHz. Each bunch crossing will give rise to about 20 inelastic collisions in which new particles will be created. These particles or their decay products will be recorded by the subdetector systems of CMS comprising approximately $5.5 \times 10^{7}$ read-out channels. After zero-suppression the total event size per bunch crossing is on average approximately 1 MB.
A highly selective online-selection process accepts of the order of $10^{2}$ events per second to be stored for offline analysis. In CMS, this process consists of only two levels. The Level-1 Trigger [2], a dedicated system of custom-built pipelined electronics, first reconstructs trigger objects (muons, electrons/photons, jets etc.) from coarsely segmented data of the muon and calorimeter subdetectors. Based on concurrent trigger algorithms which include cuts on transverse momentum, energy and event topology, it accepts events at an average rate of 100 kHz.
All further steps of online event processing including the read-out, data transport to the surface and event-building at an aggregate data rate of 1 Tb/s, high level trigger processing and data storage are handled by the CMS Data Acquisition (DAQ) System [3] (Fig. 1). The read-out channels are grouped into approximately 650 data sources by the Front-End Driver (FED) electronics, which in general deliver event fragments of approximately 2 kB. Full event data are buffered during the 3 μs latency of the Level-1 Trigger and pushed into the DAQ System upon a Level-1 accept. To optimize the utilization of available bandwidth, some data sources with smaller fragment size are merged by Front-end Read-out Links (FRLs) resulting in a total number of around 500 balanced data sources, from which events have to be built. The fully assembled events are passed to the filter farm which executes the high-level trigger decision based on reconstruction algorithms similar to those of the full offline reconstruction.
An innovative two-stage event building architecture [3] has been developed for the CMS experiment. A Super-Fragment Builder first builds larger fragments (super-fragments) from groups of fragments. These super-fragments are delivered to hosts (Read-out Units, RUs) in different sets (DAQ slices) such that all super-fragments of an event are delivered to the same DAQ slice. Full events are then built from the super-fragments inside each slice. Requirements for the second and final event-building stage are considerably reduced with respect to a one-stage event builder. The number of inputs is reduced by a factor equal to the number of fragments merged into each super-fragment, while the aggregate throughput requirements per DAQ slice are reduced by a factor equal to the number of DAQ slices, since each of the DAQ slices processes events independently. The design allows for a staged deployment of the DAQ System as DAQ slices may be added as needed, e.g., as a function of the luminosity delivered by the LHC.
Unlike in a one-stage system, where the final performance can only be demonstrated with the full system available, the performance of the two-stage system scales by construction with the number of DAQ slices. The larger fragment size also results in better utilization of network bandwidth in the final event building stage: measurements show a throughput increase by a factor of approximately 2.5 when the fragment size is increased from 2 kB to 16 kB [4] .
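The rates quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check: it assumes decimal units (1 kB = 1000 B) and uses the rounded averages given in the text, and the function names are illustrative.

```python
# Consistency check of the data rates quoted in the introduction.
# Assumptions: decimal units (1 kB = 1e3 B) and the rounded averages from the text.

EVENT_SIZE_B = 1e6      # average event size after zero-suppression (~1 MB)
FRAGMENT_SIZE_B = 2e3   # average fragment size per data source (~2 kB)
L1_RATE_HZ = 100e3      # average Level-1 accept rate (100 kHz)


def aggregate_rate_tbps(event_size_b: float, rate_hz: float) -> float:
    """Aggregate event-building rate in terabits per second."""
    return event_size_b * rate_hz * 8 / 1e12


def source_rate_mb_per_s(fragment_size_b: float, rate_hz: float) -> float:
    """Sustained rate per data source in megabytes per second."""
    return fragment_size_b * rate_hz / 1e6


if __name__ == "__main__":
    print(f"aggregate:  {aggregate_rate_tbps(EVENT_SIZE_B, L1_RATE_HZ):.1f} Tb/s")
    print(f"per source: {source_rate_mb_per_s(FRAGMENT_SIZE_B, L1_RATE_HZ):.0f} MB/s")
```

With these inputs the aggregate comes out at 0.8 Tb/s, consistent with the quoted order of 1 Tb/s, and each data source has to sustain 200 MB/s.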
The CMS event building architecture includes back-pressure all the way from the filter farm through the event builder and Super-Fragment Builder to the Front-End Drivers. Through this mechanism, FEDs are prevented from sending data to the DAQ system in case of down-stream congestion. The amount of data received by the FEDs on the other hand is determined by the trigger rate and by the detector occupancy. In order to prevent buffer overflow and data corruption in the FEDs or front-end electronics, a trigger throttling system provides fast feedback to the Level-1 Trigger and throttles the rate or disables the trigger when buffers are close to full.
The present paper focuses on the Super-Fragment Builder and on the Trigger Throttling System, the two components of the CMS DAQ system which are at the interface with the subdetector systems.
II. THE SUPER-FRAGMENT BUILDER
The Super-Fragment Builder Input Stage receives event fragments from 650 FEDs and merges data from FEDs with smaller fragment size, producing around 500 fragments of balanced size. Multiple networks then build a configurable number of super-fragments from these sources and distribute them to the DAQ slices. In the default configuration, super-fragments are built from 8 sources and distributed to 8 DAQ slices with 72 Read-out Units (PCs running Linux) per slice. A configuration with 16 DAQ slices of 36 hosts is also possible with the same hardware.
Gigabit Ethernet (the fastest variant of Ethernet at the time of system design) and Myrinet [5] technologies have been evaluated for the Super-Fragment Builder networks. Myrinet is a high-performance packet-communication and switching technology for clusters, composed of network interface cards (NICs) and cross-bar switches interconnected with bidirectional fiber-optic point-to-point links. The switches employ wormhole routing and provide link-level flow control with guaranteed packet delivery.
Myrinet was chosen over Gigabit Ethernet because of its superior link speed of 2 Gb/s, its lower latency and its link-level flow control, which guarantees loss-less data transmission without the need for a CPU-intensive high-level protocol stack such as TCP/IP. Myrinet NICs contain user-programmable RISC processors which facilitate integration with the custom-developed FRL electronics. A further advantage is the cost-effective fiber-optic data transmission which makes it possible to cover the distance of 200 m from the underground counting room to the surface computing farm.
A. Super-Fragment Builder Input Stage
FEDs are interfaced to the Myrinet NICs via custom-built Front-End Read-out Links (FRLs). FRLs are compact-PCI boards with an internal PCI bus hosting a commercial Myrinet NIC. Up to 16 FRLs are controlled and monitored by a PC. FRLs receive data from one or two FEDs using LVDS links according to the SLINK64 specification [6]. At the design clock speed of 50 MHz, the data transfer rate per link is 400 MB/s (one 64-bit word per clock cycle). The link provides feedback lines in order to signal back-pressure and to initiate an automatic self test. Data are sent in packets with a 64-bit header word containing information such as the event number and a 64-bit trailer word containing a CRC calculated over the header and packet data.
All FEDs use a common type of SLINK Sender mezzanine card, which provides a small buffer of 1.6 kB. The card checks the CRC of incoming data packets in order to detect transmission errors between the FED and the SLINK Sender card. In case of an error it replaces the CRC with the correct one and sets a flag in the trailer. In the FRL, the CRC is checked again in order to detect transmission errors over the SLINK. When the FRL receives data from two FEDs, the data packets are merged. Data are buffered in memories of 64 kB size and pushed into the Myrinet NIC in fixed size packets via the internal 64 bit/66 MHz PCI bus. The FRL provides monitoring capabilities such as histograms of fragment size distribution or the possibility to spy on events.
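The two-step CRC handling described above can be illustrated with a small sketch. It is not the SLINK64 format: the CRC algorithm (`zlib.crc32` is used as a stand-in), the trailer layout and the position of the flag bit are assumptions made purely for illustration, as are all function names.

```python
import struct
import zlib

WORD = 8                    # SLINK64 words are 64 bits wide
CRC_MASK = 0xFFFFFFFF       # stand-in: CRC-32 stored in the low 32 trailer bits
CRC_FIXED_FLAG = 1 << 32    # hypothetical position of the "CRC was replaced" flag


def make_packet(event_number: int, payload: bytes) -> bytes:
    """Frame a payload with a 64-bit header and a 64-bit trailer carrying the CRC."""
    header = struct.pack(">Q", event_number)
    crc = zlib.crc32(header + payload)  # CRC over header and packet data
    return header + payload + struct.pack(">Q", crc)


def _split(packet: bytes) -> tuple[bytes, int]:
    """Separate header+payload from the trailer word."""
    (trailer,) = struct.unpack(">Q", packet[-WORD:])
    return packet[:-WORD], trailer


def sender_card_check(packet: bytes) -> bytes:
    """On a CRC mismatch, store the correct CRC and set the flag in the trailer."""
    body, trailer = _split(packet)
    expected = zlib.crc32(body)
    if (trailer & CRC_MASK) != expected:
        trailer = expected | CRC_FIXED_FLAG
    return body + struct.pack(">Q", trailer)


def frl_check(packet: bytes) -> tuple[bool, bool]:
    """Return (crc_ok, flagged_by_sender_card) as seen at the FRL."""
    body, trailer = _split(packet)
    crc_ok = (trailer & CRC_MASK) == zlib.crc32(body)
    return crc_ok, bool(trailer & CRC_FIXED_FLAG)
```

Because the sender card rewrites the CRC after flagging an error, a failed check at the FRL points to the SLINK itself, while a set flag points to the path between the FED and the sender card.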
B. Super-Fragment Builder Network
Each Myrinet NIC contains two bi-directional optical data ports with 2 Gb/s data rate, which are connected to two independent Myrinet switch fabrics (rails). This two-rail configuration doubles the bandwidth to 4 Gb/s per FRL and provides redundancy.
The software running on the Myrinet NICs has been custom developed in C. The Myrinet NICs on the FRLs are programmed to send packets to a destination address assigned on the basis of the event number and a look-up-table. The algorithm provides load balancing over the two rails as well as re-transmission in the rare case of transmission errors over the fibers or hardware failure. In case of failure of one rail, all traffic is automatically re-directed to the other rail. The Myrinet NICs hosted by the Read-out Units are programmed to concatenate fragments with the same event number in order to build the super-fragment.
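A minimal sketch of the destination and rail selection described above follows. The actual NIC software is written in C and its algorithm is not given here, so the modulo-indexed look-up table, the strict alternation between rails and all names are assumptions; re-transmission is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class FrlRouter:
    """Toy model of destination and rail selection on an FRL's Myrinet NIC."""

    slice_lut: list[int]  # hypothetical table: (event number mod length) -> DAQ slice
    rail_up: list[bool] = field(default_factory=lambda: [True, True])

    def destination(self, event_number: int) -> int:
        """All FRLs share the same table, so an event always lands in one slice."""
        return self.slice_lut[event_number % len(self.slice_lut)]

    def rail(self, packet_number: int) -> int:
        """Alternate packets over the two rails; fall back if one rail is down."""
        preferred = packet_number % 2
        if self.rail_up[preferred]:
            return preferred
        if self.rail_up[1 - preferred]:
            return 1 - preferred
        raise RuntimeError("both rails are down")


if __name__ == "__main__":
    router = FrlRouter(slice_lut=list(range(8)))
    print([router.destination(e) for e in range(10)])
    router.rail_up[1] = False  # simulate failure of the second rail
    print([router.rail(p) for p in range(4)])
```

Keying the destination on the event number alone is what lets independent FRLs agree on the DAQ slice without communicating with each other.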
Instead of using an individual switch for each super-fragment, a single large switch fabric is used per rail. This has the advantage that the composition of super-fragments can easily be reconfigured in order to balance super-fragment size or to route avoiding faulty hardware. The switch fabric consists of four layers of 16 × 16 cross-bar switches (Myrinet Xbar32 components) arranged in a rearrangeably non-blocking Clos [7], [8] topology as illustrated in Fig. 2. The first two layers of cross-bar switches are located in three Clos-256 enclosures in the underground counting room while the last two layers are located in another three Clos-256 enclosures on the surface. The first and fourth layer are composed of 36 cross-bar switches each, thus allowing up to 576 FRLs and 576 RU PCs to be connected. The inner two layers are composed of 48 cross-bar switches each.
Within the constraints of the connectivity between the inner two layers as shown in Fig. 2, it is possible to define completely independent paths for the packets through the first three layers of cross-bars. (This is trivial if all data sources of a super-fragment are connected to the same cross-bar in the first layer.) The task of building super-fragments is then confined to a single 16 × 16 cross-bar in the fourth layer. Fig. 3 shows performance measurements for 8 × 8 super-fragment building as a function of the fragment size using two rails. In agreement with simulations, at the average fragment size of 2 kB, the throughput per input node (and, since the number of inputs equals the number of outputs, also per output node) is about 60% of the link speed of 4 Gb/s due to head-of-line blocking [9]. The increase in throughput with respect to previous measurements [10] is due to the newer generation of Myrinet switches employed and to optimizations in the custom-developed system software executed on the Myrinet NICs. At the measured per-node throughput of 300 MB/s for variable size fragments, the aggregate throughput of the Super-Fragment Builder for the full eight-slice system with 576 RU nodes is approximately 1.4 Tb/s. For 16 × 16 super-fragment building, a per-node throughput of 250 MB/s has been measured for variable size fragments with an average size of 2 kB. For both tested configurations, the measured throughput exceeds the required throughput of 200 MB/s needed for the transport of 2 kB fragments at the nominal trigger rate of 100 kHz.
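The port counts and throughput figures quoted for the switch fabric follow from simple arithmetic, reproduced in the sketch below. Decimal units (1 MB = 10^6 B, 1 Gb = 10^9 b) are assumed and the variable names are illustrative.

```python
# Cross-check of the switch-fabric and throughput figures quoted above.
# Assumption: decimal units (1 MB = 1e6 B, 1 Gb = 1e9 b).

PORTS_PER_XBAR = 16       # 16 x 16 cross-bar switches
EDGE_XBARS = 36           # cross-bars in the first (and in the fourth) layer
LINK_SPEED_GBPS = 4.0     # two rails of 2 Gb/s each
MEASURED_MB_S = 300.0     # per-node throughput, 8 x 8, variable-size fragments
FRAGMENT_B = 2e3          # average fragment size
L1_RATE_HZ = 100e3        # nominal trigger rate

edge_ports = EDGE_XBARS * PORTS_PER_XBAR                      # FRL (or RU) ports
link_speed_mb_s = LINK_SPEED_GBPS * 1e9 / 8 / 1e6             # link speed in MB/s
utilization = MEASURED_MB_S / link_speed_mb_s                 # fraction of link speed
aggregate_tbps = edge_ports * MEASURED_MB_S * 1e6 * 8 / 1e12  # all nodes, in Tb/s
required_mb_s = FRAGMENT_B * L1_RATE_HZ / 1e6                 # needed per node

print(edge_ports, utilization, round(aggregate_tbps, 2), required_mb_s)
```

The results are 576 ports per edge layer, a link utilization of 60%, an aggregate of about 1.38 Tb/s and a requirement of 200 MB/s per node, in line with the rounded values in the text.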
For constant size fragments (plot with square markers) the Super-Fragment Builder enters by itself into a barrel-shifter mode. Bandwidth is optimally used when the size of the fragments including all headers is equal to a multiple of the maximum underlying Myrinet packet size of 4096 bytes. By implementing barrel-shifter traffic shaping in the software running on the Myrinet NICs, a maximum throughput of approximately 2 Tb/s can be achieved also for variable size fragments, if needed.
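The idea behind barrel-shifter traffic shaping can be sketched as a rotating schedule in which, during each time slot, every source sends to a different destination. This is a generic illustration of the technique under the assumption of equal numbers of sources and destinations, not the scheme implemented on the Myrinet NICs; the function names are hypothetical.

```python
def barrel_shifter_schedule(n_nodes: int, n_slots: int) -> list[list[int]]:
    """schedule[t][src] is the destination that source `src` sends to in slot t."""
    return [[(src + t) % n_nodes for src in range(n_nodes)] for t in range(n_slots)]


def is_conflict_free(schedule: list[list[int]]) -> bool:
    """True if no destination is targeted by two sources in the same slot."""
    return all(len(set(slot)) == len(slot) for slot in schedule)


if __name__ == "__main__":
    for slot in barrel_shifter_schedule(n_nodes=4, n_slots=4):
        print(slot)
```

Since the destinations within a slot form a permutation of the sources, no two packets compete for the same output port, which removes the head-of-line blocking that limits throughput for unshaped traffic.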
III. THE TRIGGER THROTTLING SYSTEM
The CMS DAQ system is designed to handle event data with 2 kB fragment size at a sustained rate of 100 kHz. Level-1 Trigger thresholds and pre-scales will be optimized in order to fully utilize the available bandwidth. While the average trigger rate will be set to 100 kHz, there may be fluctuations in the instantaneous rate, which are however limited by the Level-1 Trigger System through the trigger rules given in Table I. The event size varies from event to event as it depends on detector occupancy. These statistical variations may cause buffers in the front-end electronics or in the Front-End Drivers to fill up, even in the absence of back-pressure from the DAQ system.
The DAQ System provides buffering at various levels in order to cope with moderate variations in instantaneous rate and event size and with variations in throughput of the filter farm or the bandwidth to disk storage. If buffers do fill up in the DAQ System, back-pressure will be propagated all the way to the Front-End Drivers, thus constituting a further reason for buffers in the FEDs to fill up.
Buffer overflows, which would result in data corruption and would require a lengthy re-sync operation to recover, are avoided by the synchronous Trigger Throttling System (TTS) illustrated in Fig. 1. It provides a hardwired feed-back path with a latency of less than 1 μs from each of the 650 FEDs to the Level-1 Trigger. Through this path, FEDs may request the Level-1 Trigger to enter into a reduced rate state (which is implemented by changing to a more restrictive set of trigger rules) or into a state where triggers are disabled completely. These states should be activated only for a small fraction of the time as they introduce dead-time. The Level-1 Trigger counts the exact times (numbers of bunch crossings) spent in reduced rate state and disabled state for each luminosity section (a period of about 1 minute of data taking during which pre-scales are fixed) and records these times in the conditions database for later analysis.
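The dead-time bookkeeping per luminosity section amounts to summing bunch crossings per trigger state. The sketch below illustrates this with run-length encoded intervals; the state names, the data layout and the function names are assumptions made for illustration only.

```python
def accumulate_dead_time(intervals: list[tuple[str, int]]) -> dict[str, int]:
    """Sum the bunch crossings spent in each trigger state.

    `intervals` is a hypothetical run-length encoding of one luminosity section:
    a list of (state, number_of_bunch_crossings) pairs.
    """
    totals = {"normal": 0, "reduced": 0, "disabled": 0}
    for state, crossings in intervals:
        totals[state] += crossings
    return totals


def disabled_fraction(totals: dict[str, int]) -> float:
    """Fraction of bunch crossings during which triggers were disabled."""
    total = sum(totals.values())
    return totals["disabled"] / total if total else 0.0


if __name__ == "__main__":
    section = [("normal", 900), ("reduced", 60), ("disabled", 30), ("disabled", 10)]
    totals = accumulate_dead_time(section)
    print(totals, disabled_fraction(totals))
```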
TTS signals issued by the FEDs are 4-bit signals indicating one of the states Ready, Warning, Busy, Out-of-Sync, Error or Disconnected (listed in order of increasing priority). Signals are transmitted over 4 LVDS pairs. The states Ready, Warning and Busy indicate the amount of buffering left and whether the trigger needs to be switched to a reduced rate state or disabled. Given the trigger rules in Table I and a TTS latency of 1 μs, a FED has to be able to accept two more triggers after it changes into the Busy state. The state Out-of-Sync indicates that synchronization was lost, possibly due to an extremely large event. The state Error indicates that an error occurred in the front-end electronics. The Level-1 Trigger attempts to recover automatically from these states by issuing a Level-1 Resync or Level-1 Reset command via the Timing, Trigger and Control (TTC) system.
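How a FED might derive the Ready, Warning and Busy states from its remaining buffer space can be sketched as follows. The thresholds are hypothetical tuning parameters, not values from any FED firmware; the only constraint taken from the text is that at least two more triggers must still fit once Busy is asserted.

```python
from enum import IntEnum


class Tts(IntEnum):
    """Subset of TTS states that reflect buffer occupancy (values are ranks only)."""

    READY = 0
    WARNING = 1
    BUSY = 2


def fed_state(free_event_slots: int, warning_margin: int = 8, busy_margin: int = 2) -> Tts:
    """Map the remaining buffer space to a TTS state (thresholds are hypothetical)."""
    if busy_margin < 2:
        # Two more triggers may still arrive after Busy has been asserted.
        raise ValueError("busy_margin must leave room for at least two triggers")
    if warning_margin < busy_margin:
        raise ValueError("warning_margin must not be smaller than busy_margin")
    if free_event_slots <= busy_margin:
        return Tts.BUSY
    if free_event_slots <= warning_margin:
        return Tts.WARNING
    return Tts.READY


if __name__ == "__main__":
    for free in (20, 8, 2):
        print(free, fed_state(free).name)
```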
In the case of the tracker subdetector, buffer space in the front-end chips is very limited. Since these chips are mounted directly on the detector, feedback signals would have to travel an additional distance to the underground counting room. In order to shorten the feedback loop and save buffer space in the front-ends, dedicated emulator modules have been designed. These modules, which are located in close proximity to the Level-1 Trigger, receive the triggers, emulate the buffer levels of the front-end chips and send fast feedback signals directly to the trigger with a control loop latency of only 0.25 μs [11].
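The principle of emulating front-end buffer levels from the trigger stream alone can be illustrated with a toy model. The buffer depth, the fixed read-out time per event and the Busy margin below are hypothetical parameters; the real emulator modules reproduce the actual front-end chip logic, which is not described here.

```python
class FrontEndBufferEmulator:
    """Toy emulator: tracks buffer occupancy using only the received triggers."""

    def __init__(self, depth: int, readout_crossings: int, busy_margin: int = 1):
        self.depth = depth                          # events the buffer can hold
        self.readout_crossings = readout_crossings  # crossings to read out one event
        self.busy_margin = busy_margin              # free slots at which Busy is raised
        self.occupied = 0
        self._countdown = 0

    def tick(self, trigger: bool) -> bool:
        """Advance one bunch crossing; return True if Busy should be asserted."""
        # Drain: the oldest event leaves the buffer when its read-out completes.
        if self.occupied > 0:
            self._countdown -= 1
            if self._countdown == 0:
                self.occupied -= 1
                self._countdown = self.readout_crossings
        # Fill: a Level-1 accept occupies one more buffer slot.
        if trigger:
            if self.occupied == 0:
                self._countdown = self.readout_crossings
            self.occupied += 1
        return self.depth - self.occupied <= self.busy_margin


if __name__ == "__main__":
    emulator = FrontEndBufferEmulator(depth=4, readout_crossings=10)
    print([emulator.tick(True) for _ in range(3)])
```

Because the emulator sees the same triggers as the front-end chips, it can predict their buffer state without waiting for a signal to travel back from the detector.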
For maximum flexibility, FEDs are grouped into 32 TTC partitions which may be operated independently of each other. The Level-1 Trigger Control System separately distributes triggers to these 32 TTC partitions and separately receives trigger throttling signals for each TTC partition. TTS signals from all FEDs in a TTC partition thus need to be merged with low latency. Dedicated Fast Merging Modules (FMMs) have been designed for this task. These modules can merge and monitor up to 20 inputs and have quad outputs. Optionally, FMMs can be configured to merge two independent groups of ten inputs with two independent twin outputs.
A separate asynchronous Trigger Throttling System defines a two-way communication path between the Level-1 Trigger Control System and the DAQ System using the same type of signals as the synchronous TTS. For each of up to eight groups of TTC partitions defined in the Trigger Control System, the Trigger Control System sends the state of the group to the DAQ system. The DAQ system sends a TTS signal per group in order to throttle or disable triggers or to resynchronize or reset the group in case of synchronization problems detected by the filter farm. Asynchronous TTS signals are sent and received by the DAQ system using a modified FMM card with 12 inputs and 12 outputs.
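Merging the TTS signals of a partition can be sketched as taking the highest-priority state among the inputs, following the priority order in which the states were listed earlier. That the FMMs merge in exactly this way is an assumption of this sketch, and the enum values are priority ranks chosen for illustration, not the 4-bit wire encoding.

```python
from enum import IntEnum

MAX_FMM_INPUTS = 20  # an FMM merges and monitors up to 20 inputs


class TtsState(IntEnum):
    """TTS states ordered by increasing priority (ranks, NOT the wire encoding)."""

    READY = 0
    WARNING = 1
    BUSY = 2
    OUT_OF_SYNC = 3
    ERROR = 4
    DISCONNECTED = 5


def merge(inputs: list[TtsState]) -> TtsState:
    """Return the highest-priority state among the inputs of one merging module."""
    if len(inputs) > MAX_FMM_INPUTS:
        raise ValueError(f"at most {MAX_FMM_INPUTS} inputs per module")
    return max(inputs, default=TtsState.READY)


if __name__ == "__main__":
    fed_states = [TtsState.READY, TtsState.WARNING, TtsState.READY]
    print(merge(fed_states).name)
```

Modules can be cascaded with the same function: merging the outputs of several modules gives the same result as merging all of their inputs at once, because taking the maximum is associative.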
IV. THE MAGNET TEST AND COSMIC CHALLENGE
The Magnet Test and Cosmic Challenge (MTCC) in the second half of 2006 was a major milestone towards the completion of CMS. The entire detector was successfully closed for the first time with the solenoid magnet and parts of most subdetectors installed. Subdetectors were connected to the front-end electronics and FEDs and interfaced to a prototype of the DAQ system.
Despite the fact that the number of FEDs and the trigger rate were two to three orders of magnitude smaller than in the final system, the MTCC was an important test for the CMS DAQ system. All types of hardware components and prototypes of all software components of the final DAQ System were tested and integrated with FEDs of all but three of the CMS subdetectors.
A total of 18 FEDs from seven subsystems were connected to the DAQ system as detailed in Table II. One super-fragment was built per subsystem and processed by a single DAQ slice with 7 RU hosts. The Super-Fragment Builder System consisted of 18 input elements (SLINK sender card, SLINK cable, FRL and Myrinet NIC), two Clos-256 Myrinet enclosures and seven RU PCs equipped with Myrinet NICs. The Myrinet enclosures were equipped with a switch fabric that also supported larger scale performance tests with up to 64 FRLs and up to 64 RU PCs: the first and fourth layer of the switch fabric consisted of eight cross-bar switches each, supporting two rails with 64 inputs and 64 outputs. The second and third layer consisted of 16 cross-bars each, with half of the outputs of each second layer cross-bar connected to half of the inputs of a third layer cross-bar. The Trigger Throttling System consisted of six Fast Merging Modules, one for each subsystem (except for the trigger), and an emulator module for the tracker subsystem. The outputs of the six FMMs were merged by a merger FMM in order to be able to use a Local Trigger Controller instead of the final Level-1 Trigger, which became available only during the final phase of the test.
In the first phase of the test, each subdetector was individually integrated with the DAQ system, verifying correct reset of the event counter at the start of a run, correct data transmission without CRC errors at reduced rate, correct handling of back-pressure applied through the SLINK and correct generation of TTS signals. All problems found during these first integration tests were resolved by modifications to front-end and FED firmware. For some subdetectors, support for operation at the full Level-1 Trigger rate of 100 kHz is still being finalized. The trigger rate from cosmic muons was in general between 100 Hz and 300 Hz depending on trigger settings. This mode of operation was supported by all participating subdetectors thus allowing data to be read out and events to be built.
In the final phase of the test, the magnet was successfully ramped to a field of 4 T. The whole detector was operated as a single experiment for the first time allowing curved tracks of cosmic muons to be acquired and stored to disk. Some of these tracks were simultaneously detected by the muon systems, calorimeters and tracker (Fig. 4) proving that the entire system had been synchronized.
The Trigger Throttling System worked smoothly throughout the MTCC. The FMM monitoring tools were used to monitor all TTS inputs and in one case allowed a flaky TTS connector to be quickly identified. The Local Trigger Controller, which was used during most of the MTCC, stopped triggers immediately when FEDs changed into the Warning or Busy states. The global Level-1 Trigger, which was tested towards the end of the MTCC, also provides a reduced-rate state.
In order to test the Trigger Throttling System integrated with the Global Level-1 Trigger, the Global Trigger was programmed to generate random triggers at different rates which were distributed to a Front-End Driver. At trigger rates greater than 2.8 kHz the FED received back-pressure from the DAQ System and requested to throttle (TTS state Warning) or stop (TTS state Busy) the trigger. Depending on the generated input trigger rate (not shown), the FED's TTS state stayed at Ready, changed between Ready and Warning or changed between Warning and Busy as illustrated in Fig. 5(b). At input trigger rates close to 2.8 kHz the TTS state changed between all three states. The effective average trigger rate after throttling was limited to approximately 3 kHz by the available bandwidth of data storage to disk as shown in Fig. 5(a). The frequency of transitions between TTS states, which may in the future serve as an indicator of problems, varied as a function of the input trigger rate as shown in Fig. 5(c).
This test demonstrates that all components of the trigger throttling control loop from FEDs through the Fast Merging Modules to the Level-1 Trigger and back to the FEDs are operational and that monitoring features are in place. Parameters such as the buffer levels at which FEDs switch between TTS states and the trigger rules to be used for the reduced rate state still need to be fine-tuned in order to achieve an optimal utilization of the available bandwidth to the DAQ.
shielded TTS cables and around 60 FMMs housed in eight Compact-PCI crates have been installed in the underground counting room.
SLINK and TTS cables in the underground counting room have been tested with a mobile FED emulator board. A few thousand cycles of test patterns have been sent over each link. Besides identifying a small number of bad contacts or soldering problems, this test also served to verify the correct labeling and routing of the cables.
In a second stage, FEDs were installed and SLINK and TTS cables were connected. For the SLINKs a self-test was initiated by the FRLs, requiring only that the SLINK sender cards on the FEDs were powered. For the Trigger Throttling System an automatic in situ test procedure has been devised. This procedure is executed under the control of the CMS Run Control System and may be repeated at any time in the future on a regular basis or if problems are suspected. One after the other, FEDs are instructed to generate certain TTS signals or sequences of TTS signals. These are then captured by the FMM transition history memories and verified. The installation of the FEDs and the tests of their connectivity to the DAQ system are ongoing at the time of writing.
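The verification step of the in situ TTS test boils down to comparing the sequence a FED was asked to generate with the sequence captured in the transition history. A minimal sketch of such a comparison is given below; the representation of states as strings and the function name are assumptions, and no actual Run Control or FMM interface is modelled.

```python
from itertools import zip_longest


def tts_mismatches(requested: list[str], captured: list[str]) -> list[int]:
    """Return the positions at which the captured history differs from the request.

    Missing or surplus transitions are reported as mismatches as well.
    """
    pairs = zip_longest(requested, captured)
    return [i for i, (want, got) in enumerate(pairs) if want != got]


if __name__ == "__main__":
    requested = ["Ready", "Warning", "Busy", "Ready"]
    captured = ["Ready", "Warning", "Ready"]  # e.g. one transition was lost
    print(tts_mismatches(requested, captured))
```

An empty result means that the cable, the FMM input and the FED output under test all behaved as expected for that sequence.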
VI. SUMMARY
The CMS Super-Fragment Builder based on Myrinet and the Trigger Throttling System have been presented. Performance measurements show a throughput of 300 MB/s per node for 8 × 8 super-fragment building, exceeding the required throughput of 200 MB/s per node. Installation of both systems is now close to completion. All types of hardware and software components and their interfaces to most of the subdetectors have successfully been tested in a scaled-down setup with reduced trigger rate and event size during the Magnet Test and Cosmic Challenge at the end of 2006. Commissioning of the full-scale system is underway and will continue over the next few months in order to be ready for the first LHC collisions in 2008.
