Abstract-The Data Acquisition System of the Compact Muon Solenoid experiment at the Large Hadron Collider reads out event fragments of an average size of 2 kilobytes from around 650 detector front-ends at a rate of up to 100 kHz. The first stage of event-building is performed by the Super-Fragment Builder employing custom-built electronics and a Myrinet optical network. It reduces the number of fragments by one order of magnitude, thereby greatly decreasing the requirements for the subsequent event-assembly stage. By providing fast feedback from any of the front-ends to the trigger, the Trigger Throttling System prevents buffer overflows in the front-end electronics due to variations in the size and rate of events or due to backpressure from the down-stream event-building and processing. This paper reports on the recent successful integration of a scaled-down setup of the described system with the trigger and with front-ends of all major sub-detectors and discusses the ongoing commissioning of the full-scale system.
I. INTRODUCTION
THE Compact Muon Solenoid (CMS) experiment [1] at CERN's Large Hadron Collider (LHC) will search for new physics at the TeV scale such as the Higgs mechanism or Super-Symmetry. At its design luminosity of $10^{34}$ cm$^{-2}$ s$^{-1}$ the LHC will provide proton-proton collisions at a center-of-mass energy of 14 TeV with a bunch crossing frequency of 40 MHz. Each bunch crossing will give rise to about 20 inelastic collisions in which new particles may be created. Decay products of these particles are recorded by the sub-detector systems of CMS comprising approximately $5.5 \times 10^{7}$ readout channels. After zero-suppression the total event size per bunch crossing is on average 1 MB. A highly selective online-selection process accepts of the order of $10^{2}$ events per second to be stored for offline analysis. In CMS, this process consists of only two levels. The Level-1 Trigger [2], a dedicated system of custom-built pipelined electronics, first reconstructs trigger objects (muons, electrons/photons, jets, ...) from coarsely segmented data of the muon and calorimeter sub-detectors. Based on concurrent trigger algorithms which include cuts on transverse momentum, energy and event topology, it accepts interesting events at an average rate of 100 kHz.
All further steps of on-line event processing including the read-out, data transport to the surface and event-building at an aggregate data rate of 1 Terabit/s, high-level trigger processing and data storage are handled by the CMS Data Acquisition (DAQ) System [3]. The readout channels are grouped into approximately 650 data sources by the Front-End Driver (FED) electronics, which in general deliver event fragments of approximately 2 kB. Full event data are buffered during the 3 µs latency of the Level-1 Trigger and pushed into the DAQ System upon a Level-1 accept. To optimize utilization of available bandwidth, some data sources with smaller fragment size are merged by the DAQ System, resulting in a total number of around 500 balanced data sources, from which events have to be built. The fully assembled events are passed to the filter farm, which executes the high-level trigger decision based on reconstruction algorithms similar to the full off-line reconstruction.
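The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check under stated assumptions: it takes 1 kB as 1000 bytes and treats every balanced source as delivering exactly the nominal 2 kB, and the constant and function names are illustrative rather than taken from the DAQ software.

```python
# Consistency check of the quoted DAQ figures (values taken from the text).
# Assumptions: 1 kB = 1000 bytes; every source delivers the nominal 2 kB.
LEVEL1_RATE_HZ = 100_000        # average Level-1 accept rate
N_BALANCED_SOURCES = 500        # data sources after merging small fragments
FRAGMENT_BYTES = 2_000          # nominal fragment size


def event_size_bytes(n_sources: int, fragment_bytes: int) -> int:
    """Total event size if every source contributes one nominal fragment."""
    return n_sources * fragment_bytes


def aggregate_rate_bits_per_s(event_bytes: int, rate_hz: int) -> int:
    """Aggregate event-building data rate in bits per second."""
    return event_bytes * 8 * rate_hz


event_bytes = event_size_bytes(N_BALANCED_SOURCES, FRAGMENT_BYTES)
rate_bps = aggregate_rate_bits_per_s(event_bytes, LEVEL1_RATE_HZ)
print(f"event size     : {event_bytes / 1e6:.1f} MB")
print(f"aggregate rate : {rate_bps / 1e12:.1f} Tbit/s")
```

With these inputs the event size comes out at 1 MB, matching the average quoted in the introduction, and the aggregate rate at 0.8 Tbit/s, which is of the order of the quoted 1 Terabit/s.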
An innovative two-stage event building architecture [3] has been developed for the CMS experiment. A Super-Fragment Builder first builds larger fragments from N fragments with 1 ≤ N < 31. These super-fragments are delivered to hosts (Readout Units, RUs) in M different sets (DAQ slices) in a round-robin scheme so that all super-fragments of an event are delivered to the same DAQ slice. The Super-Fragment Builder considerably reduces the requirements for the second and final event-building stage with respect to a one-stage event builder. The number of inputs is reduced by a factor of N while the aggregate throughput requirements per DAQ slice are reduced by a factor of M, since each of the DAQ slices processes events independently. The larger fragment size results in better utilization of network bandwidth in the final event building stage [4]. Furthermore, the design allows for a staged deployment of the DAQ System as DAQ slices may be added as needed, e.g. as a function of the luminosity delivered by the LHC.
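The effect of the two parameters N and M can be illustrated with a short sketch. It assumes a plain modulo for the round-robin slice assignment and uses N = 8 and M = 8 purely as example values; the function names are hypothetical and do not come from the CMS software.

```python
import math


def daq_slice(event_number: int, n_slices: int) -> int:
    """Round-robin slice assignment (assumed here to be a plain modulo).

    Every super-fragment of a given event uses the same event number,
    so all of them end up in the same DAQ slice.
    """
    return event_number % n_slices


def second_stage_requirements(n_sources: int, total_bytes_per_s: float,
                              n_merge: int, n_slices: int):
    """Inputs and throughput seen by the second stage of ONE DAQ slice."""
    inputs = math.ceil(n_sources / n_merge)      # reduced by a factor of N
    throughput = total_bytes_per_s / n_slices    # reduced by a factor of M
    return inputs, throughput


# Example values only: 500 sources, 100 GB/s in total, N = 8, M = 8.
inputs, per_slice = second_stage_requirements(500, 100e9, 8, 8)
print(inputs, per_slice / 1e9)
```

In this example each slice has to build events from 63 super-fragments at 12.5 GB/s instead of 500 fragments at 100 GB/s, which is the reduction the text attributes to the Super-Fragment Builder.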
The CMS event building architecture includes backpressure all the way from the filter farm through the event builder and super-fragment builder to the Front-End Drivers. Through this mechanism, FEDs are prevented from sending data to the DAQ System in case of down-stream congestion. The amount of data received by the FEDs, on the other hand, is determined by the trigger rate and by the detector occupancy. In order to prevent buffer overflow and data corruption in the FEDs or front-end electronics, a Trigger Throttling System provides fast feedback to the trigger and throttles the rate or [6] . At the design clock speed of 50 MHz, the data transfer rate per link is 400 MB/s. The link provides feedback lines in order to signal backpressure and to initiate an automatic self test. Data are sent in packets with a header word containing information such as the event number and a trailer word containing a CRC calculated over the header and packet data.
All FEDs use a common SLINK Sender mezzanine card, which provides a small buffer of 1.6 kB. The card checks the CRC of incoming data packets in order to detect transmission errors between the FED and SLINK Sender card. In case of an error it replaces the CRC with the correct one and sets a flag in the trailer.
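The check-and-repair behaviour of the sender card can be sketched as follows. This is an illustration only: the packet layout (an 8-byte header holding the event number, a 4-byte trailer holding a 16-bit CRC and a 16-bit flag field), the flag bit `CRC_ERROR_FLAG` and the use of the CRC-CCITT routine `binascii.crc_hqx` are all assumptions made for the example and are not the actual SLINK format.

```python
import binascii
import struct

CRC_ERROR_FLAG = 0x0001   # hypothetical flag bit in the trailer
TRAILER = ">HH"           # hypothetical trailer: 16-bit CRC, 16-bit flags


def crc16(data: bytes) -> int:
    """CRC-CCITT from the standard library; a stand-in for the link CRC."""
    return binascii.crc_hqx(data, 0xFFFF)


def build_packet(event_number: int, payload: bytes) -> bytes:
    """Header (event number) + payload + trailer (CRC over header and payload)."""
    header = struct.pack(">Q", event_number)
    trailer = struct.pack(TRAILER, crc16(header + payload), 0)
    return header + payload + trailer


def check_and_repair(packet: bytes):
    """Return (packet, error_seen); on a mismatch fix the CRC and set the flag."""
    body, trailer = packet[:-4], packet[-4:]
    received_crc, flags = struct.unpack(TRAILER, trailer)
    expected_crc = crc16(body)
    if received_crc == expected_crc:
        return packet, False
    repaired = body + struct.pack(TRAILER, expected_crc, flags | CRC_ERROR_FLAG)
    return repaired, True


good = build_packet(42, b"fragment data")
corrupted = bytearray(good)
corrupted[10] ^= 0xFF                      # flip bits in the payload
repaired, error_seen = check_and_repair(bytes(corrupted))
```

Replacing the CRC means that a later check on the next link only reports errors introduced on that link, while the flag in the trailer preserves the information that the packet was already damaged upstream.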
The FRL receives data from one or two FEDs and merges data packets in the latter case. The CRC is checked again in order to detect transmission errors over the SLINK. Data are buffered in memories of 64 kB size and pushed into the Myrinet NIC in fixed-size packets via the internal 64-bit/66 MHz PCI bus.
The FRL provides extensive monitoring capabilities such as histograms of fragment size distribution or the possibility to spy on events.
B. Super-Fragment Builder Network
Each Myrinet NIC contains two bi-directional optical data ports with 2 Gb/s link speed (250 MB/s bandwidth), which are connected to two independent Myrinet switch fabrics. This two-rail configuration doubles the bandwidth to 4 Gb/s per FRL and provides redundancy.
The Myrinet NICs on the FRLs are programmed to send packets to a destination address assigned on the basis of the event number and a look-up-table. The algorithm provides load balancing over the two rails as well as re-transmission in the rare case of packet loss or corruption due to hardware failure. The Myrinet NICs hosted by the Readout Units concatenate fragments with the same event number in order to build the super-fragment.
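The two roles described above, destination assignment on the sending side and concatenation on the receiving side, can be sketched in a few lines. The sketch assumes that the look-up table is indexed by the event number modulo the table length and that fragments are concatenated in order of a source identifier; both are illustrative choices, the class and function names are hypothetical, and rail balancing and re-transmission are not modelled.

```python
from collections import defaultdict


def destination(event_number: int, lookup_table: list):
    """Pick the destination from the event number and a look-up table.

    Indexing the table with a modulo is an assumption for this sketch.
    """
    return lookup_table[event_number % len(lookup_table)]


class ReadoutUnitNic:
    """Collects fragments per event number and emits the super-fragment."""

    def __init__(self, n_sources: int):
        self.n_sources = n_sources
        self._pending = defaultdict(dict)   # event number -> {source: data}

    def receive(self, event_number: int, source_id: int, fragment: bytes):
        parts = self._pending[event_number]
        parts[source_id] = fragment
        if len(parts) < self.n_sources:
            return None                      # still waiting for fragments
        del self._pending[event_number]
        # Concatenate in source order (ordering is an assumption).
        return b"".join(parts[s] for s in sorted(parts))


lut = ["ru0", "ru1", "ru2"]                  # hypothetical destinations
nic = ReadoutUnitNic(n_sources=2)
first = nic.receive(7, 1, b"B")              # incomplete: returns None
super_fragment = nic.receive(7, 0, b"A")     # complete: returns b"AB"
```

Because every sender evaluates the same table with the same event number, all fragments of an event converge on one Readout Unit without any coordination between the senders.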
Instead of using an individual N × M switch for each super-fragment, a single large switch fabric is used per rail. This has the advantage that the composition of super-fragments can easily be reconfigured in order to balance super-fragment size or to route around faulty hardware.
The switch fabric consists of four layers of 16×16 crossbar switches (Myrinet Xbar32 components) arranged in a rearrangeably non-blocking Clos [7] topology as illustrated in Fig. 2. The first two layers of cross-bar switches are located in three Clos-256 enclosures in the underground counting room while the second two layers are located in another three Clos-256 enclosures on the surface. The first and fourth layer are composed of 36 cross-bar switches each, thus allowing up to 576 FRLs and 576 RU PCs to be connected. The inner two layers are composed of 48 cross-bar switches each.
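The port counts follow directly from the number of cross-bars per layer. The short check below reads "16×16" as 16 inputs and 16 outputs per cross-bar, which is an interpretation of the text, and the names are illustrative.

```python
PORTS_PER_SIDE = 16          # a 16x16 cross-bar: 16 inputs, 16 outputs
OUTER_CROSSBARS = 36         # first and fourth layer
INNER_CROSSBARS = 48         # second and third layer


def ports_per_side(n_crossbars: int, ports: int = PORTS_PER_SIDE) -> int:
    """Number of inputs (or outputs) offered by one layer of cross-bars."""
    return n_crossbars * ports


edge_capacity = ports_per_side(OUTER_CROSSBARS)    # FRLs or RU PCs
inner_capacity = ports_per_side(INNER_CROSSBARS)   # links into an inner layer
print(edge_capacity, inner_capacity)
```

The outer layers give the 576 connections quoted above, and the inner layers offer 768 ports per side, more than the 576 links arriving from an outer layer.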
Within the constraints of the connectivity between the inner two layers as shown in Fig. 2, it is possible to define completely independent paths for the packets through the first three layers of cross-bars. (This is trivial if all data sources of a super-fragment are connected to the same cross-bar in the first layer.) The task of building super-fragments is then confined to a single 16×16 cross-bar in the fourth layer. Fig. 3 shows performance measurements for 8×8 super-fragment building as a function of the packet size using two rails. In agreement with simulations, at the average fragment size of 2 kB, per-node throughput is about 60% of the link speed of 4 Gb/s due to head-of-line blocking. At 
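To put the 60% figure in context, the measured per-node throughput can be compared with what a single input has to sustain at the design rate. The sketch below uses only numbers quoted in the text and assumes decimal units (1 MB = 10^6 bytes) and nominal 2 kB fragments on every trigger.

```python
RAIL_BANDWIDTH_MB_S = 250        # one 2 Gb/s rail, as quoted in the text
N_RAILS = 2
EFFICIENCY_PERCENT = 60          # measured fraction of link speed at 2 kB
FRAGMENT_BYTES = 2_000           # average fragment size
LEVEL1_RATE_HZ = 100_000         # design trigger rate

link_mb_s = RAIL_BANDWIDTH_MB_S * N_RAILS                    # 4 Gb/s
achieved_mb_s = link_mb_s * EFFICIENCY_PERCENT // 100        # measured
required_mb_s = FRAGMENT_BYTES * LEVEL1_RATE_HZ // 1_000_000 # per input

print(f"achieved {achieved_mb_s} MB/s, required {required_mb_s} MB/s")
```

Under these assumptions the measured 300 MB/s per node exceeds the 200 MB/s needed for 2 kB fragments at 100 kHz, so head-of-line blocking still leaves a margin of a factor of 1.5.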
III. THE TRIGGER THROTTLING SYSTEM
The CMS DAQ System is designed to handle event data with 2 kB fragment size at a sustained rate of 100 kHz. Level-1 trigger thresholds will be optimized in order to fully utilize the available bandwidth. While the average trigger rate will be set to 100 kHz, there may be fluctuations in the instantaneous rate, which are however limited by trigger rules similar to the ones given in Table 1. The event size varies from event to event as it depends on the total multiplicity of tracks. These statistical variations may cause buffers in the front-end electronics or in the Front-End Drivers to fill up, even in the absence of back-pressure from the DAQ System. The DAQ System provides buffering at various levels in order to cope with moderate variations in instantaneous rate and event size and with variations in throughput of the filter farm or the bandwidth to disk storage. If buffers do fill up in the DAQ System, backpressure will be propagated all the way to the Front-End Drivers, thus constituting a further reason for buffers in the FEDs to fill up.
Buffer overflows, which would result in data corruption and would require a lengthy re-sync operation to recover, are avoided by the synchronous Trigger Throttling System (Fig. 1). Table 1 and a TTS latency of 1 µs, a FED has to be able to accept 3 more triggers after it changes into state Busy. In the case of the tracker sub-detector, buffer space in the front-end chips is very limited. Since these chips are mounted directly on the detector, feedback signals would have to travel an additional distance to the underground counting room. In order to shorten the feedback loop and save buffer space in the front-ends, dedicated emulator modules have been designed. These modules, which are located in close proximity to the Level-1 Trigger, receive the triggers, emulate the buffer levels of the front-end chips and send fast feedback signals directly to the trigger with a control loop latency of only 0.25 µs [8].
The 
IV. THE MAGNET TEST AND COSMIC CHALLENGE
The Magnet Test and Cosmic Challenge (MTCC) in the second half of 2006 was a major milestone towards the completion of CMS. The entire detector was successfully closed for the first time with the solenoid magnet and parts of most sub-detectors installed. Sub-detectors were connected to the front-end electronics and FEDs and interfaced to a prototype of the DAQ System.
Despite the fact that the number of FEDs and the trigger rate were two to three orders of magnitude smaller than in the final system, the MTCC was an important test for the CMS DAQ System. All types of hardware components and prototypes of all software components of the final DAQ System were tested and integrated with FEDs of all but three of the CMS sub-detectors. A total of 18 FEDs of 7 subsystems were connected to the DAQ System as detailed in Table 2. One super-fragment was built per subsystem and processed by a single DAQ slice with 7 RU hosts. The Super-Fragment Builder System consisted of 18 input elements (SLINK sender card + SLINK cable + FRL + Myrinet NIC), two Clos-256 Myrinet enclosures and 7 RU PCs equipped with Myrinet NICs. The Myrinet enclosures were equipped with a switch fabric that also supported larger-scale performance tests with up to 64 FRLs and up to 64 RU PCs: the first and fourth layer of the switch fabric consisted of 8 cross-bar switches each, supporting 2 rails with 64 inputs and 64 outputs. The second and third layer consisted of 16 cross-bars each, with half of the outputs of each second-layer cross-bar connected to half of the inputs of a third-layer cross-bar.
The Trigger Throttling System consisted of 6 Fast Merging Modules, one for each sub-system (except for the trigger) and an emulator module for the tracker sub-system. The outputs of the 6 FMMs were merged by a merger FMM in order to be able to use a Local Trigger Controller instead of the final Level-1 Trigger, which became available only during the final phase of the test.
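The two-level merging used in this setup can be illustrated with a small sketch. It is a sketch under assumptions rather than the FMM logic itself: it models only three states, of which `READY` is a name introduced here for the unthrottled case, it assumes that a merger simply forwards the most severe of its inputs, and the sub-system names are placeholders.

```python
from enum import IntEnum


class TtsState(IntEnum):
    """Simplified state set; the ordering by severity is an assumption."""
    READY = 0      # hypothetical name for "no throttling requested"
    WARNING = 1
    BUSY = 2


def merge(states) -> TtsState:
    """Forward the most severe input state (assumed merging rule)."""
    return max(states, default=TtsState.READY)


def triggers_allowed(state: TtsState) -> bool:
    """A controller that stops triggers on Warning as well as on Busy."""
    return state == TtsState.READY


# One merger per sub-system, then a merger of the mergers.
per_subsystem = {
    "subsystem_a": [TtsState.READY, TtsState.WARNING],
    "subsystem_b": [TtsState.READY, TtsState.READY],
}
first_level = [merge(inputs) for inputs in per_subsystem.values()]
top_level = merge(first_level)
print(top_level.name, triggers_allowed(top_level))
```

With a severity ordering of this kind, merging in two levels gives the same result as merging all inputs at once, which is what allows a single merged signal to be presented to the trigger controller.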
In the first phase of the test, each sub-detector was individually integrated with the DAQ System, verifying correct reset of the event counter at the start of a run, correct data transmission without CRC errors at reduced rate, correct handling of back-pressure applied through the SLINK and correct generation of TTS signals. All problems found during these first integration tests were resolved by modifications to front-end and FED firmware. For some sub-detectors, support for operation at the full Level-1 trigger rate of 100 kHz is still being finalized. The trigger rate during the MTCC was limited since no high-level event filtering was available and therefore all events were stored to disk. The prototype storage manager had a maximum throughput of 40 MB/s, limiting the trigger rate to approximately 200 Hz. This mode of operation was supported by all participating sub-detectors, thus allowing data to be read out and events to be built. In the final phase of the test, the magnet was successfully ramped to a field of 4 T. The whole detector was operated as a single experiment for the first time, allowing curved tracks of cosmic muons to be acquired and stored to disk. Some of these tracks were simultaneously detected by the muon systems, calorimeters and tracker (Fig. 4), proving that the entire system had been synchronized.
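The relation between storage throughput and trigger rate is a simple division. The sketch below assumes that the storage manager was the only bottleneck and that 1 MB = 10^6 bytes; the event size it yields is an inference from the two quoted numbers, not a figure stated in the text.

```python
STORAGE_THROUGHPUT_BYTES_PER_S = 40_000_000   # prototype storage manager
TRIGGER_RATE_HZ = 200                         # approximate MTCC limit


def max_trigger_rate_hz(throughput_bytes_per_s: float,
                        event_bytes: float) -> float:
    """Highest sustainable rate when every event is written to disk."""
    return throughput_bytes_per_s / event_bytes


# Average event size implied by the two quoted numbers (an inference).
implied_event_bytes = STORAGE_THROUGHPUT_BYTES_PER_S // TRIGGER_RATE_HZ
print(implied_event_bytes)
print(max_trigger_rate_hz(STORAGE_THROUGHPUT_BYTES_PER_S, implied_event_bytes))
```

The two numbers together imply an average event size of roughly 200 kB for the MTCC configuration.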
The Trigger Throttling System worked smoothly throughout the MTCC. The FMM monitoring tools were used to monitor all TTS inputs and in one case allowed a flaky TTS connector to be quickly identified. The Local Trigger Controller, which was used during most of the MTCC, stopped triggers immediately when FEDs changed into the Warning or Busy states. The Global Level-1 Trigger, which was tested towards the end of the MTCC, also provides a reduced-rate state. Fig. 5b Commissioning of the full-scale system is underway and will continue over the next few months in order to be ready for the first LHC collisions at the end of 2007.
