Abstract-The CMS Data Acquisition System is designed to build and filter events originating from 476 detector data sources at a maximum trigger rate of 100 KHz. Different architectures and switch technologies have been evaluated to accomplish this purpose. Events will be built in two stages. The first stage will be a set of event builders called FED Builders. These will be based on Myrinet technology and will pre-assemble groups of about 8 data sources. The second stage will be a set of event builders called Readout Builders. These will perform the building of full events. A single Readout Builder will build events from 72 sources of 16 KB fragments at a rate of 12.5 KHz. In this paper we present the design of a Readout Builder based on TCP/IP over Gigabit Ethernet and the optimization that was required to achieve the design throughput. This optimization includes the architecture of the Readout Builder, the setup of TCP/IP, and hardware selection.
I. INTRODUCTION
The Compact Muon Solenoid experiment (CMS) [1] is a general purpose detector which will operate at the Large Hadron Collider (LHC) [2], situated at the CERN laboratories in Geneva. The beam crossing rate at LHC will be 40 MHz and the event sizes will be approximately 1 MB, so it will be impossible to store all the interactions. This input rate must be reduced to the order of 100 Hz, the maximum rate feasible for data storage and off-line processing. CMS has chosen to reduce this rate in two steps: a hardware Level-I trigger which has a maximum accept rate of 100 KHz and a High Level software Trigger (HLT) with an additional rejection of a factor of 10^3.
All events that pass Level-I are sent to a computer farm (Filter Farm) which performs reconstruction and selection of events using the full data.
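As a quick consistency check of the two-step rate reduction, the factors implied by the numbers above can be written out. This is only arithmetic on the quoted rates, treating the 40 MHz crossing rate as the input rate and the 100 KHz Level-I accept rate as its maximum:

```latex
\[
  \frac{40\ \text{MHz}}{100\ \text{kHz}} = 400
  \quad\text{(minimum Level-I reduction)},
  \qquad
  \frac{100\ \text{kHz}}{10^{3}} = 100\ \text{Hz}
  \quad\text{(rate after the HLT)}.
\]
```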
The DAQ system is designed to be modular in order to facilitate expansion as the luminosity increases and to retain the flexibility to change the implementation of parts of the system when technologies are available or new requirements are identified. The design of the CMS Data Acquisition System and of the High Level Trigger is described in detail in the DAQ Technical Design Report (TDR) [3].
An overview of the data flow within the CMS DAQ is shown in Fig. 1 proceeding from top to bottom. At the top are the Front-End Drivers (FEDs) which are the sub-detector specific data sources and which feed the 476 Front-End Readout Links (FRLs) which merge the data of up to two FEDs into one stream. The Fig. 2 .
With event sizes of approximately 1 MB and each Readout Builder building events at 12.5 KHz, the data throughput of an RU PC will be approximately 200 MB/s (in and out). This paper describes measurements of the Gigabit Ethernet network and PC throughput which demonstrate that the design requirements can be achieved.
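The 200 MB/s figure follows from the fragment size and the building rate. The sketch below reproduces that arithmetic. It assumes decimal units (1 KB = 1000 bytes, 1 MB = 10^6 bytes), and the function and constant names are illustrative, not taken from the CMS software.

```python
# Sketch: per-RU throughput from fragment size and event-building rate.
# Assumption: decimal units (1 KB = 1000 bytes, 1 MB = 10**6 bytes).

FRAGMENT_SIZE_BYTES = 16_000   # 16 KB fragment delivered by one FED Builder
BUILD_RATE_HZ = 12_500         # 12.5 kHz per Readout Builder
SOURCES_PER_BUILDER = 72       # sources feeding one Readout Builder


def ru_throughput_mb_per_s(fragment_bytes: int, rate_hz: int) -> float:
    """Data rate through one RU (the same in and out), in MB/s."""
    return fragment_bytes * rate_hz / 1e6


def event_size_mb(fragment_bytes: int, n_sources: int) -> float:
    """Size of a fully built event, in MB."""
    return fragment_bytes * n_sources / 1e6


if __name__ == "__main__":
    print(ru_throughput_mb_per_s(FRAGMENT_SIZE_BYTES, BUILD_RATE_HZ))
    print(event_size_mb(FRAGMENT_SIZE_BYTES, SOURCES_PER_BUILDER))
```

With these inputs the RU throughput is 200 MB/s, and 72 fragments of 16 KB give an event of about 1.15 MB, consistent with the quoted event size of approximately 1 MB.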
Running the BU and FU components on the same PCs is a deviation from the original design described in the DAQ TDR [3], which had them on separate PCs connected by a second network. This modification, described in [4]
* We implement multi-rail operation. This is done by using independent physical networks depending on the source and destination hosts. Fig. 3 shows a possible two rail configuration of the RU-BU communication.
* Control messages and data sent to the Readout Builder have different requirements in terms of latency and throughput. Control messages must be delivered with low latency and have low throughput, while data require high throughput. In the configuration of TCP/IP in Linux the Nagle algorithm [8] can be turned on or off. The Nagle algorithm is a means of improving the efficiency of TCP/IP networks by reducing the number of packets that need to be sent over the network. When the Nagle algorithm is on and there are not many messages in the pipeline, messages may be delivered with a very large latency. On the other hand, when the Nagle algorithm is off, we observe a significant decrease of the throughput with time. The optimal Readout Builder performance is obtained by using different sockets for data and control messages and setting the Nagle algorithm on for data and off for control messages.
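The per-socket Nagle setting described above can be sketched with the standard TCP_NODELAY socket option, which disables the Nagle algorithm when set to 1. The example below is a minimal illustration in Python, not the Readout Builder code itself. The helper names make_socket and nagle_is_enabled are hypothetical.

```python
import socket


def make_socket(nagle_enabled: bool) -> socket.socket:
    """Create a TCP socket with the Nagle algorithm explicitly on or off."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 disables Nagle; TCP_NODELAY = 0 leaves it enabled.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY,
                    0 if nagle_enabled else 1)
    return sock


def nagle_is_enabled(sock: socket.socket) -> bool:
    """Return True if the Nagle algorithm is active on this socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) == 0


if __name__ == "__main__":
    # Separate sockets for the two traffic classes (hypothetical names).
    data_sock = make_socket(nagle_enabled=True)       # high throughput
    control_sock = make_socket(nagle_enabled=False)   # low latency
    print("data socket Nagle on:", nagle_is_enabled(data_sock))
    print("control socket Nagle on:", nagle_is_enabled(control_sock))
    data_sock.close()
    control_sock.close()
```

Keeping the two traffic classes on separate sockets is what allows each one to get its own setting, since TCP_NODELAY applies to a whole connection rather than to individual messages.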
IV. TESTS OF GBE EVENT BUILDER ARCHITECTURE
The performance of the system as a whole and of individual components was measured using PCs with two single-core 2.6 GHz Xeon processors interconnected with a Force 10 E1200 switch and two Gigabit Ethernet rails for each PC.
With the prototype system of 16 RUs x 60 BU-FUs, corresponding to almost one quarter of a full-scale Readout Builder, we measure the throughput as a function of fragment size. Fig. 4 shows that for a fragment size of 8 KB network utilization is almost 100%. The fragment size of 16 KB produced by the FED Builders is sufficiently above this threshold, so even with fragment size fluctuations a high network utilization is maintained. The full line speed to the switch is achieved because the trapezoidal configuration has more reading than writing ports, so the output buffers have low occupancy. For these tests, the FU software is a dummy and does not include the CPU and memory loads associated with the actual trigger algorithms. To approach the full scale test in terms of the number of connections from RU to BU-FUs, we ran multiple BU-FU applications (up to 4 per node). By doing this we obtained 116 RU X 608ULF0 I.
Fig. 5 shows the measured total throughput of the PC as a function of the fragment size from 2 KB to 32 KB in both configurations. The PC reaches the maximum throughput achievable with six rails and two Myrinet inputs of 640 MB/s for a fragment size of about 8 KB, not quite saturating six links. We find the PCs can fully saturate four Gigabit Ethernet links, achieving a throughput of 490 MB/s.
C. Single Quad-Core Processor PC
The PC with two dual-core processors was compared with a single quad-core processor PC. The purpose of this test was to explore the possibility of using a single quad-core processor and leaving a socket on the motherboard free for future upgrades.
Comparing the memory throughput, we saw that the PC with two dual-core processors was much faster than the single quad-core PC, which has a memory throughput of only 2400 MB/s. This memory bandwidth does not leave a sufficient margin for achieving the design throughput and reduces the flexibility for other potential modifications.
VI. EVENT BUILDER AND FILTER FARM CONFIGURATION
A. PCs
We have now purchased 640 2U Dell PE 2950 PCs equipped with two dual-core processors that will be used in the long term as Readout Units. We are installing on each of them a Myrinet card and a 4 port Gigabit Ethernet card. The Myrinet card is an M3F2-PCIXE-2 with two fiber links, and the Gigabit Ethernet card is a Silicom PEG4i with four links, based on the Intel 82571EB controller.
As it was shown in Section V, we are in principle able to achieve a throughput of 
VII. COMMISSIONING CONFIGURATION
All the scaling related tests described above were performed using an earlier generation of PCs and smaller configurations than the full system. During the commissioning phase some of the 640 PCs will be used as BU-FUs. The baseline trapezoidal configuration with two slices of 72 RU and 248 BU-FU PCs should achieve a rate of at least 25 KHz. The other configurations that have been considered during the development will also be reevaluated, including the original DAQ TDR design. These configurations share common cabling and the difference is only a matter of software configuration. The choice of configuration for the first run may also depend on the High Level Trigger output requirements.
[8] Nagle J., "Congestion control in IP/TCP internetworks", RFC 896, 1984.
VIII. FUTURE DEPLOYMENT
The Event Builder and Filter Farm will be scaled to 50 KHz for the first LHC Physics run in 2008 at the nominal initial luminosity of 2 x 10^33 cm^-2 s^-1. This upgrade will involve the purchase and installation of additional filter nodes and line cards. At that time, full throughput line cards will be available. After the commissioning run, we will finalize the details of the Readout Builder architecture and decide whether to replace the oversubscribed line cards with full throughput line cards. We will then be able to use 4 slices of Readout Units and have enough Filter Units to match the needs of the High Level Trigger for a Level-I trigger rate of 50 KHz.
Full deployment to 8 slices and 100 KHz will be scheduled according to the requirements of the experiment. At that time, all the PCs we are currently installing will be used as Readout Units and at least 4 Force 10 E1200 chassis will be needed to connect on the order of 2000 filter nodes.
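The staged deployment scales linearly with the number of slices, each slice being one Readout Builder running at 12.5 KHz. The sketch below tabulates the trigger rate and aggregate event-building throughput for the 2, 4 and 8 slice stages. It assumes an event size of exactly 1 MB for the throughput figure, and the function names are illustrative only.

```python
# Sketch: how trigger rate and aggregate throughput scale with slices.
# Assumptions: 12.5 kHz per slice and 1 MB events (decimal units).

SLICE_RATE_KHZ = 12.5    # event-building rate of one Readout Builder
EVENT_SIZE_MB = 1.0      # approximate full event size
RUS_PER_SLICE = 72       # Readout Units per slice
BU_FUS_PER_SLICE = 248   # BU-FU PCs per slice in the commissioning setup


def trigger_rate_khz(n_slices: int) -> float:
    """Level-I trigger rate sustained by n_slices, in kHz."""
    return n_slices * SLICE_RATE_KHZ


def aggregate_throughput_gb_per_s(n_slices: int) -> float:
    """Total event-building throughput: kHz times MB equals GB/s."""
    return trigger_rate_khz(n_slices) * EVENT_SIZE_MB


def commissioning_pc_count(n_slices: int = 2) -> int:
    """PCs used in the commissioning trapezoidal configuration."""
    return n_slices * (RUS_PER_SLICE + BU_FUS_PER_SLICE)


if __name__ == "__main__":
    for stage, n in [("commissioning", 2), ("first physics run", 4),
                     ("full deployment", 8)]:
        print(stage, n, trigger_rate_khz(n), aggregate_throughput_gb_per_s(n))
    print("commissioning PCs:", commissioning_pc_count())
```

The commissioning count of 2 x (72 + 248) = 640 matches the number of PCs purchased, and the 2, 4 and 8 slice stages give 25, 50 and 100 KHz respectively.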
IX. SUMMARY
We have presented the design and prototyping of the second stage of the CMS Event Builder. This stage will be implemented using TCP/IP on Gigabit Ethernet. In order to achieve the design throughput of 12.5 GB/s in each of eight parallel slices, for an aggregate bandwidth of 100 GB/s, the TCP/IP usage has been optimized and up to four parallel Gigabit Ethernet links are used per PC. The network will be interconnected using several high port density Force 10 switches.
In order to validate and optimize the design, several measurements have been made. On individual PCs, we demonstrate a throughput of 490 MB/s. Using a quarter-scale Readout Builder and lower performance PCs with only two links, we demonstrate a 240 MB/s throughput, suggesting that the design scales well. Based on these results, we expect to be able to achieve the design throughput with the full system. This system is currently being installed with 640 PCs equipped with Myrinet and quad Gigabit Ethernet interfaces.
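One way to compare the two measurements is to normalize them per link. The sketch below does this under the assumption that the 490 MB/s result used four Gigabit Ethernet links, as in the single-PC test, and that the raw line rate of one link is 125 MB/s (1 Gb/s, ignoring protocol overhead). The helper names are illustrative.

```python
# Sketch: per-link throughput for the two measurements in the summary.
# Assumption: raw Gigabit Ethernet line rate of 1 Gb/s = 125 MB/s,
# with Ethernet and TCP/IP protocol overhead ignored.

GBE_LINE_RATE_MB_PER_S = 125.0


def per_link_mb_per_s(total_mb_per_s: float, n_links: int) -> float:
    """Average throughput carried by each link, in MB/s."""
    return total_mb_per_s / n_links


def line_rate_fraction(total_mb_per_s: float, n_links: int) -> float:
    """Per-link throughput as a fraction of the raw line rate."""
    return per_link_mb_per_s(total_mb_per_s, n_links) / GBE_LINE_RATE_MB_PER_S


if __name__ == "__main__":
    for label, total, links in [("single PC, four links", 490.0, 4),
                                ("quarter-scale builder, two links", 240.0, 2)]:
        print(label, per_link_mb_per_s(total, links),
              round(line_rate_fraction(total, links), 3))
```

The per-link values come out at 122.5 MB/s and 120 MB/s, so the multi-PC system stays close to the single-PC figure on a per-link basis.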
