Abstract-The CMS data acquisition system is designed to build and filter events originating from 476 detector data sources at a maximum trigger rate of 100 kHz. Different architectures and switch technologies have been evaluated for this purpose. Events will be built in two stages. The first stage will be a set of event builders called front-end driver (FED) builders. These will be based on Myrinet technology and will pre-assemble groups of about eight data sources. The second stage will be a set of event builders called readout builders, which will build the full events. A single readout builder will build events from about 60 sources of 16 kB fragments at a rate of 12.5 kHz. In this paper, we present the design of a readout builder based on TCP/IP over Gigabit Ethernet and the refinements that were required to achieve the design throughput. These refinements cover the architecture of the readout builder, the setup of TCP/IP, and the hardware selection.
to reduce this rate in two steps: a hardware Level-1 trigger which has a maximum accept rate of 100 kHz and a high level software trigger (HLT) with an additional rejection of a factor of . All events that pass Level-1 are sent to a computer farm (filter farm) which performs reconstruction and selection of events using the full data.
The DAQ system is designed to be modular in order to facilitate expansion as the luminosity increases and to retain the flexibility to change the implementation of parts of the system when new technologies become available or new requirements are identified. The design of the CMS data acquisition system and of the high level trigger is described in detail in the DAQ Technical Design Report (TDR) [3].
An overview of the data flow within the CMS DAQ is shown in Fig. 1, proceeding from top to bottom. At the top are the front-end drivers (FEDs), which are the subdetector-specific data sources. They feed the 476 front-end readout links (FRLs), each of which merges the data of up to two FEDs into one stream. The 2 kB outputs of the FRLs are then assembled into larger event fragments of 16 kB (super fragments) by the FED builders and are distributed to up to eight independent readout builders. The readout builders consist of three types of software components running on Linux PCs. The readout units (RUs) receive the event fragments from the FED builders and distribute them to the builder units (BUs), which assemble full events and pass them to the filter units (FUs). The FUs then have access to the full detector information for selecting events to send to the mass storage system.
The BUs and FUs run in a single PC and receive the data from the RUs via TCP/IP on Gigabit Ethernet (GbE). This configuration is referred to as the trapezoidal configuration, because there are more BUs than RUs. A combined BU and FU PC is referred to as a BU-FU. A more detailed diagram of an individual readout builder is shown in Fig. 2.
Each readout builder has an event manager (EVM) that manages the data flow in the readout builder and keeps track of the memory occupancy of the RUs. The event manager is able to request a reduced trigger rate if one of the RUs is running out of memory for buffering incoming super fragments. The event manager deals with three logically separate networks. It uses two distinct control networks: one to send control messages to the RUs to adjust the distribution of the super fragments, and one to receive data requests from the BUs. The readout builder network, in contrast, is used to send the super fragments from the RUs to the BUs.
0018-9499/$25.00 © 2008 IEEE
Fig. 1. Simplified sketch of the CMS DAQ system. From the subdetector-specific sources (FEDs), data are assembled into larger fragments by the FED builders and distributed to up to eight readout builders, each supervised by an event manager (EVM). The figure also shows some components that do not belong to the DAQ system, such as the global trigger processor (GTP) and the trigger throttle system (TTS), which are responsible for triggering the detector. The number of reported inputs (FEDs, FRLs, …) includes the additional sources used by the TOTEM experiment [4].
With event sizes of approximately 1 MB and each readout builder building events at 12.5 kHz, the data throughput of an RU PC will be approximately 200 MB/s (in and out). This paper describes measurements of the Gigabit Ethernet network and PC throughput which demonstrate that the design requirements can be achieved.
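The per-RU figure of roughly 200 MB/s follows from the super-fragment size and the readout builder rate. The short Python sketch below reproduces that arithmetic. The function names are illustrative, and decimal units (1 kB = 1000 B, 1 MB = 10^6 B) are an assumption that the text does not state.

```python
# Back-of-the-envelope check of the per-RU throughput quoted in the text.
# Assumption (not stated in the paper): decimal units, 1 kB = 1000 B.

SUPER_FRAGMENT_BYTES = 16_000   # ~16 kB super fragment handled by one RU per event
EVENT_BYTES = 1_000_000         # ~1 MB full event
BUILDER_RATE_HZ = 12_500        # events per second in one readout builder


def ru_throughput_mb_per_s(fragment_bytes: int, rate_hz: int) -> float:
    """Data rate through one RU in MB/s (the same rate flows in and out)."""
    return fragment_bytes * rate_hz / 1e6


def sources_per_builder(event_bytes: int, fragment_bytes: int) -> float:
    """Number of super-fragment sources needed to make up one full event."""
    return event_bytes / fragment_bytes


ru_rate = ru_throughput_mb_per_s(SUPER_FRAGMENT_BYTES, BUILDER_RATE_HZ)
n_sources = sources_per_builder(EVENT_BYTES, SUPER_FRAGMENT_BYTES)
print(f"RU throughput: {ru_rate:.0f} MB/s, sources per builder: {n_sources:.1f}")
```

Under these assumptions the source count comes out near the "about 60 sources" quoted in the abstract.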
Running the BU and FU components on the same PCs is a deviation from the original design described in the DAQ TDR [3], which had them on separate PCs connected by a second network. This modification, described in [5], removes the need for one network and one layer of PCs, but requires more TCP/IP sockets and larger switches because each RU has to communicate with more BUs.
II. IMPLEMENTATION OVERVIEW
Myrinet technology is used in the FED builder and for the data transfer from the detector to the surface. The technology has a data link speed of 2 Gb/s and low latency, implements hardware flow control, and has an on-board CPU in the interface card. Furthermore, Myrinet provides a fiber optic solution for the 200 m distance to the surface at a relatively modest cost. Two parallel data paths are used to achieve 300 MB/s after taking into account the efficiency of the Myrinet switches for input fragment sizes of 2 kB. Further details on the FED builder are described in [6].
The readout builder network, which connects the RUs to the BUs, is implemented with up to four parallel Gigabit Ethernet links, referred to as rails, to achieve an aggregate bandwidth of up to 490 MB/s, more than twice the design requirement.
The DAQ software is built on the XDAQ framework [7] running on commodity Linux machines.
III. TCP/IP SETUP
A transport software package called asynchronous TCP/IP (ATCP) 2 was developed to decouple the DAQ applications from the networking software. It also avoids blocking when more than one host is trying to send data to the same network interface of another host. ATCP places all outgoing messages in separate queues according to their destination and processes the queues asynchronously in another thread. It writes to (or reads from) a given socket until the operation would block, then moves on to another socket and continues there.
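The queue-per-destination behaviour described above can be illustrated with a small Python sketch. This is not the ATCP code: the class `RoundRobinSender` and its methods are hypothetical names, and the dedicated worker thread is left out, so that a caller (or a thread) simply invokes `pump()` repeatedly.

```python
import socket
from collections import deque


class RoundRobinSender:
    """Toy per-destination send queues over non-blocking sockets (not ATCP itself)."""

    def __init__(self):
        self._socks = {}    # destination name -> non-blocking socket
        self._queues = {}   # destination name -> deque of pending buffers

    def add_destination(self, name, sock):
        sock.setblocking(False)
        self._socks[name] = sock
        self._queues[name] = deque()

    def post(self, name, payload: bytes):
        """Queue a message; returns immediately without touching the socket."""
        self._queues[name].append(memoryview(payload))

    def pending(self) -> int:
        """Total number of bytes still waiting in all queues."""
        return sum(len(buf) for q in self._queues.values() for buf in q)

    def pump(self) -> int:
        """One pass over all destinations; returns the number of bytes sent."""
        sent_total = 0
        for name, queue in self._queues.items():
            sock = self._socks[name]
            while queue:
                try:
                    n = sock.send(queue[0])
                except BlockingIOError:
                    break  # this socket would block: move on to the next one
                sent_total += n
                if n == len(queue[0]):
                    queue.popleft()          # whole message written
                else:
                    queue[0] = queue[0][n:]  # keep the unsent tail for later
        return sent_total
```

In this sketch a slow destination only delays its own queue: when one socket's buffer is full, `pump()` carries on with the remaining destinations. A production implementation would normally wait for writability (for example with `select`) rather than poll.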
The choice of TCP/IP over Gigabit Ethernet has the advantage of using completely standard hardware and software. Because TCP/IP is a reliable protocol, the application does not need to handle packet loss, which typically occurs when operating close to wire speed. The main drawback is the considerable usage of machine resources for its operation. In order to achieve good performance, the following design choices were made.
• We use Ethernet Jumbo frames. By increasing the maximum transmission unit (MTU) from the standard 1500 B to 7000 B, we observe an increase in the performance of approximately 50%.
• We implement multirail operation. This is done by using independent physical networks, selected according to the source and destination hosts. Fig. 3 shows a possible two-rail configuration of the RU-BU communication.
• Readout builder control messages and data have different requirements in terms of latency and throughput. Control messages must be delivered with low latency but need little throughput, whereas data require high throughput. In the configuration of TCP/IP in Linux, the Nagle algorithm [8] can be turned on or off. The Nagle algorithm is a means of improving the efficiency of TCP/IP networks by reducing the number of packets that need to be sent over the network. When the Nagle algorithm is on and there are not many messages in the pipeline, messages may be delivered with a very large latency. On the other hand, when the Nagle algorithm is off, we observe a significant decrease of the throughput with time. The optimal readout builder performance is obtained by using different sockets for data and control messages and setting the Nagle algorithm on for data and off for control messages.
2 See http://xdaqwiki.cern.ch/index.php/Ptatcp.
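On a per-socket basis, the Nagle algorithm is controlled with the standard `TCP_NODELAY` socket option. The Python sketch below shows one way to create separate data and control sockets with different settings. The helper names are hypothetical, and the text does not say how ATCP applies the option internally.

```python
import socket


def make_tcp_socket(nagle: bool) -> socket.socket:
    """Create a TCP socket with the Nagle algorithm enabled or disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 disables Nagle; 0 (the default) leaves it enabled.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0 if nagle else 1)
    return sock


def nagle_enabled(sock: socket.socket) -> bool:
    """True if the Nagle algorithm is active on this socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) == 0


# Data socket: Nagle on, favouring throughput for large super fragments.
data_sock = make_tcp_socket(nagle=True)
# Control socket: Nagle off, so small messages are sent without delay.
control_sock = make_tcp_socket(nagle=False)
```

Keeping the two traffic classes on separate sockets is what allows each to have its own setting, since the option applies to the whole connection.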
IV. TESTS OF GBE EVENT BUILDER ARCHITECTURE
The performance of the system as a whole and of individual components was measured using PCs with two single-core 2.6 GHz Xeon processors interconnected with a Force 10 E1200 switch and two Gigabit Ethernet rails for each PC.
With the prototype system of 16 RUs × 60 BU-FUs, corresponding to almost one quarter of a full-scale readout builder, we measure the throughput as a function of fragment size. Fig. 4 shows that for a fragment size of 8 kB the network utilization is almost 100%. The fragment size of 16 kB produced by the FED builders is sufficiently above this threshold, so a high network utilization is maintained even with fluctuations in fragment size. The full line speed to the switch (readout builder network) is achieved because the trapezoidal configuration has more reading than writing ports (more BU-FUs than RUs), so the output buffers have low occupancy. For these tests, the FU software is a dummy and does not include the CPU and memory loads associated with the actual trigger algorithms.
To approach the full-scale test in terms of the number of connections from RUs to BU-FUs, we run multiple BU-FU applications (up to four per node). By doing this we obtain similar results up to a configuration of 16 RUs × 240 virtual BU-FUs, where virtual refers to the fact that each BU-FU PC ran multiple applications at the same time. This result shows that there is little or no decrease in performance when increasing the number of output sockets from RUs to BU-FUs, although the total throughput is lower than it will be in the full system, because there are only 16 RUs as inputs instead of 72.
For the design filter unit input of 50 MB/s, only 5%-10% of the CPU is used to build events, leaving the bulk of the processing power for the event selection algorithms.
V. MEASUREMENTS USING PCS WITH TWO DUAL-CORE PROCESSORS
Several different PC architectures have been tested for the RU component. Based on our design requirements and a tendering process, we have selected PCs equipped with two dual-core processors. In this section, we present measurements using the selected PC model. The PC is a Dell PE 2950 equipped with two Woodcrest Xeon 2 GHz processors (E5130), a 1.3 GHz front-side bus, two PCI-X slots, and one PCI-Express slot for full-height expansion cards.
A. Memory Throughput
In the RU configuration with Myrinet DMA input and Gigabit Ethernet TCP/IP output, performance was limited by memory throughput on earlier-generation PCs. This is because each fragment is moved through memory four times: the Myrinet card writes it to memory by PCI-X DMA, the CPU reads it from user-space memory and writes it back to kernel-space memory, and finally the GbE card reads it from memory by PCI-E DMA. A more complex building algorithm in the RU might require an additional super-fragment copy in the PC, which adds one read and one write and thus moves the data six times.
In tests using a Dell PE 2850 with two single-core processors, an 800 MHz front-side bus, and non-fully-buffered DIMMs, we find a memory bandwidth of 1900 MB/s and measure a throughput of 450-470 MB/s. This is consistent with being limited by the memory bandwidth, which would predict 1900/4 = 475 MB/s. With one additional copy of the data, we expect 1900/6 ≈ 320 MB/s and we measure 330 MB/s.
The newer Dell PE 2950 PCs described above, with fully buffered DIMMs, have a memory bandwidth above 4000 MB/s, so memory throughput is no longer a limitation.
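The memory-bound estimates in this subsection amount to dividing the measured memory bandwidth by the number of times each fragment crosses the memory bus. The Python sketch below restates that model. The function name is illustrative, and the model deliberately ignores any other bottleneck.

```python
def memory_bound_throughput(mem_bw_mb_s: float, transfers_per_fragment: int) -> float:
    """Upper bound on RU throughput (MB/s) if memory bandwidth is the only limit."""
    return mem_bw_mb_s / transfers_per_fragment


# Dell PE 2850: 1900 MB/s measured memory bandwidth.
pe2850_four = memory_bound_throughput(1900, 4)  # DMA in, CPU read, CPU write, DMA out
pe2850_six = memory_bound_throughput(1900, 6)   # one extra copy adds a read and a write

# Dell PE 2950: the text quotes "above 4000 MB/s", so this is a lower bound.
pe2950_four = memory_bound_throughput(4000, 4)

print(f"PE 2850, 4 transfers: {pe2850_four:.0f} MB/s")
print(f"PE 2850, 6 transfers: {pe2850_six:.0f} MB/s")
print(f"PE 2950, 4 transfers: at least {pe2950_four:.0f} MB/s")
```

For the PE 2950 the bound lies well above the 490 MB/s needed to saturate four Gigabit Ethernet links, which is consistent with the statement that memory throughput is no longer the limitation.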
B. RU and BU Performance
Tests of the PCs as RUs have been carried out with one and two Myrinet inputs and six GbE rails. Measurements with two Myrinet inputs were made in order to evaluate the possibility of running two RU processes in a PC and to measure the fully saturated throughput of the PC. Fig. 5 shows the measured total throughput of the PC as a function of the fragment size from 2 kB to 32 kB in both configurations. With six rails and two Myrinet input cards, the PC reaches its maximum throughput of 640 MB/s at a fragment size of about 8 kB, which does not quite saturate the six links. We find that the PCs can fully saturate four Gigabit Ethernet links, achieving a throughput of 490 MB/s.
C. Single Quad-Core Processor PC
The two dual-core processor PC was compared with a single quad-core processor PC. The purpose of this test was to explore the possibility of using a single quad-core processor and leaving a socket on the motherboard free for future upgrades.
Comparing the memory throughput, we found that the two dual-core processor PC was much faster than the single quad-core PC, which has a memory throughput of only 2700 MB/s. This memory bandwidth does not leave a sufficient margin for achieving the design throughput and reduces the flexibility for other potential modifications.
VI. EVENT BUILDER AND FILTER FARM CONFIGURATION

A. PCs
We have now purchased 640 2U (two rack units high) Dell PE 2950 PCs equipped with two dual-core processors that will be used in the long term as readout units. We are installing a Myrinet card and a four-port Gigabit Ethernet card in each of them. The Myrinet card is an M3F2-PCIXE-2 with two fiber links, and the Gigabit Ethernet card is a Silicom PEG4i with four links, based on the Intel 82571EB controller.
As shown in Section V, we are in principle able to achieve a throughput of 490 MB/s per PC, although the throughput may be limited by components upstream and downstream of the readout builder.
B. Switches
The final choice for switches was to use Force 10 E1200 switches equipped with 90-port line cards. Fully loaded, this switch accommodates 1260 Gigabit Ethernet ports. The connector density is achieved using mini-RJ21 connectors (one for every six ports), which are then connected to the PCs using patch panels to standard RJ45 connectors located in the racks. The current implementation of these line cards is oversubscribed. Each group of 90 sequential ports can have a maximum aggregate throughput of 11 Gb/s in plus 11 Gb/s out. Full-throughput 90-port line cards are expected to be available at the beginning of 2008. The chassis are compatible with both types of line cards.
VII. COMMISSIONING CONFIGURATION
All the scaling-related tests described above were performed using an earlier generation of PCs and smaller configurations than the full system. During the commissioning phase, part of the 640 PCs will be used as BU-FUs. The baseline trapezoidal configuration with two slices of 72 RU and 248 BU-FU PCs each should achieve a rate of at least 25 kHz. The other configurations that have been considered during the development, including the original DAQ TDR design, will also be reevaluated. These configurations share common cabling, and the difference is only a matter of software configuration. The choice of configuration for the first run may also depend on the high level trigger output requirements.
VIII. FUTURE DEPLOYMENT
The event builder and filter farm will be scaled to 50 kHz in time for the first LHC physics run in 2008. This upgrade will involve the purchase and installation of additional filter nodes and line cards. At that time, full-throughput line cards will be available. After the commissioning run, we will finalize the details of the readout builder architecture and decide whether to replace the oversubscribed line cards with full-throughput line cards. We will then be able to use four slices of readout units and have enough filter units to match the needs of the high level trigger for a Level-1 trigger rate of 50 kHz.
Full deployment to eight slices and 100 kHz will be scheduled according to the requirements of the experiment. At that time, all the PCs we are currently installing will be used as readout units, and at least four Force 10 E1200 chassis will be needed to connect on the order of 2000 filter nodes.
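The staging from two to four to eight slices scales linearly with the per-slice figures given in the text (12.5 kHz and roughly 1 MB per event). The Python sketch below tabulates the resulting trigger rate and aggregate throughput, together with the PC count of the commissioning configuration. The names are illustrative and decimal units are assumed.

```python
RATE_PER_SLICE_KHZ = 12.5   # event-building rate of one readout builder (slice)
EVENT_SIZE_MB = 1.0         # approximate full event size
RUS_PER_SLICE = 72          # RU PCs per slice in the baseline configuration
BUFUS_PER_SLICE = 248       # BU-FU PCs per slice in the commissioning configuration


def slice_totals(n_slices: int) -> dict:
    """Trigger rate and aggregate throughput for a given number of slices."""
    rate_khz = n_slices * RATE_PER_SLICE_KHZ
    return {
        "l1_rate_khz": rate_khz,
        # kHz * MB = 10^3 events/s * 10^6 B = GB/s
        "throughput_gb_s": rate_khz * EVENT_SIZE_MB,
    }


def commissioning_pcs(n_slices: int) -> int:
    """PCs needed when RUs and BU-FUs are both drawn from the same pool of PCs."""
    return n_slices * (RUS_PER_SLICE + BUFUS_PER_SLICE)


for n in (2, 4, 8):
    print(n, slice_totals(n))
print("commissioning PCs for 2 slices:", commissioning_pcs(2))
```

With two slices the commissioning configuration uses exactly the 640 purchased PCs, and eight slices correspond to the full 100 kHz design rate.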
IX. SUMMARY
We have presented the design and prototyping of the second stage of the CMS event builder. This stage will be implemented using TCP/IP on Gigabit Ethernet. In order to achieve the design throughput of 12.5 GB/s per slice, with eight parallel slices giving an aggregate bandwidth of 100 GB/s, the TCP/IP usage has been optimized and up to four parallel Gigabit Ethernet links are used per PC. The network will be interconnected using several high-port-density Force 10 switches.
In order to validate and optimize the design, several measurements have been made. On individual PCs, we demonstrate a throughput of 490 MB/s. Using a quarter-scale readout builder and lower-performance PCs with only two links, we demonstrate a throughput of 240 MB/s, suggesting that the design scales well. Based on these results, we expect to be able to achieve the design throughput with the full system. This system is currently being installed with 640 PCs equipped with Myrinet and quad Gigabit Ethernet interfaces.
