The Large Hadron Collider beauty (LHCb) experiment is designed to study the differences between particles and antiparticles as well as very rare decays in the charm and beauty sector at the Large Hadron Collider (LHC). The detector will be upgraded in 2019, and a new trigger-less readout system will be implemented in order to significantly increase its efficiency and take full advantage of the machine luminosity provided at the LHCb collision point. In the upgraded system, both event building and event filtering will be performed in software for all the data produced in every bunch crossing of the LHC. In order to transport the full data rate of 32 Tb/s, we will use custom field-programmable gate array (FPGA) readout boards (PCIe40) and state-of-the-art off-the-shelf network technologies. The full event-building system will require around 500 interconnected servers. From a networking point of view, event-building traffic has an all-to-all pattern, requiring careful design of the network architecture to avoid congestion at the data rates foreseen. In order to maximize link utilization, different techniques can be adopted in areas such as traffic shaping, network topology, and routing optimization. The size of the system makes it very difficult to test at production scale before the actual procurement. We resort, therefore, to network simulations as a powerful tool for finding the optimal configuration. We present an accurate low-level description of an InfiniBand-based network with event-building-like traffic. We show a comparison between simulated and reduced-scale systems and how changes in the input parameters affect the performance.
I. INTRODUCTION
The Large Hadron Collider beauty (LHCb) experiment [1] will receive a substantial upgrade [2] during the Long Shutdown 2 (LS2) of the Large Hadron Collider (LHC), scheduled to start in December 2018 and expected to end by February 2021. One of the major changes during this upgrade will be the installation of a completely new data acquisition (DAQ) system without any low-level hardware trigger, allowing all event selection to be performed by sophisticated algorithms implemented in software using the latest calibration and detector alignment data available, and thus providing high yields with little or no contamination from unwanted events in the samples selected for offline data analysis. To implement a trigger-less readout, the full data rate of ∼32 Tb/s produced by the detector must be forwarded by the event-building network. To achieve this total throughput, we are targeting a system composed of ∼500 hosts interconnected using 100-Gb/s networking technology, as shown in Fig. 1. In order to design and build a system of this complexity, we need extensive planning and testing. For this reason, we developed the DAQ protocol-independent performance evaluator (DAQPIPE). This event-building benchmark application generates real event-building traffic and can be configured in multiple ways in order to experiment with different network configurations and technologies. The only drawback of using DAQPIPE is that it needs a computing cluster with a suitable network; therefore, in order to test the scalability of the system, we need access to high-performance computing (HPC) clusters equipped with 100-Gb/s-capable interconnection networks. Because of the relatively small number of suitable systems available in the world, the waiting list for a slot in an external HPC facility can be very long, and the network configuration may be suboptimal for event-building tests.
In this paper, we present a low-level simulation model that can be used in parallel with tests on real systems to speed up the design of the event-building network for a trigger-less readout system.
0018-9499 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
II. LHCb EVENT BUILDER ARCHITECTURE
In this section, we briefly describe the DAQ architecture of the LHCb experiment for Run 3 of the LHC. We focus on the network side of the system; see the technical design report (TDR) [2] for a complete description.
A. Event Building Architecture
The LHCb event builder is composed of three main logical units as follows.
• Builder Unit: It receives and aggregates the fragments into full events.
• Readout Unit: It collects the fragments from the peripheral component interconnect express (PCIe)-based DAQ board and sends them to the builder units (BUs).
• Event Manager: It assigns which event is built on which BU.
During the event-building process, every event is assigned by the event manager (EM) to a single BU. After the event is assigned to it, the BU collects all the fragments of that event from every readout unit (RU) in a one-to-one way. As depicted in Fig. 2, a BU and an RU are aggregated into a single node, generating a "folded" event builder. Because the data traffic always flows from the RUs to the BUs, this architecture fully exploits the full-duplex nature of the network and halves the number of physical machines needed in the final system compared to a one-directional event builder.
In the collective communication schema, the traffic pattern of a folded event builder can be compared to an all-to-all personalized exchange with a different data size for every fragment.
In order to reduce the network congestion generated by an all-to-all personalized exchange, we use the linear shifting traffic scheduling technique, which can be explained as follows.
• We divide the all-to-all exchange into N phases, where N is the total number of nodes.
• In every phase, every node sends data to exactly one destination and receives data from exactly one source.
• During phase n, node i sends to node (n + i) % N, where the % symbol indicates the modulo operation.
• When a previously agreed condition is met, all the nodes synchronously switch from phase n to phase n + 1; usually, the switch is triggered either by the number of sent events or by a fixed time window.
If the aforementioned conditions are respected for all the phases, then we have an ideal linear shifting schedule. Fulfilling all the conditions requires strong time synchronization, which is difficult to achieve on real systems; in practice, a moderate phase desynchronization can be tolerated.
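The ideal schedule described above can be sketched in a few lines of Python. This is only an illustration of the rule (n + i) % N; the function names are hypothetical and are not taken from DAQPIPE or from the simulation model.

```python
def linear_shift_destination(node: int, phase: int, n_nodes: int) -> int:
    """Destination of `node` during `phase` in an ideal linear shift."""
    return (phase + node) % n_nodes


def phase_schedule(phase: int, n_nodes: int) -> list[int]:
    """Destinations of nodes 0..n_nodes-1 during one phase."""
    return [linear_shift_destination(i, phase, n_nodes) for i in range(n_nodes)]


def is_one_to_one(phase: int, n_nodes: int) -> bool:
    """True if every node receives from exactly one source in this phase."""
    return sorted(phase_schedule(phase, n_nodes)) == list(range(n_nodes))
```

For example, `phase_schedule(1, 4)` gives the destinations of nodes 0 to 3 during phase 1, and `is_one_to_one` checks that the phase is a permutation, i.e., that no destination is addressed by two sources at once. In phase 0 every node addresses itself, which in a folded event builder corresponds to the RU and BU hosted on the same machine.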
B. Event-Building Network
From the networking point of view, the event-building traffic tends to create congestion and high link utilization among all the nodes, and therefore, the selected network topology has to be nonblocking and provide full bisection bandwidth.
For the implementation of the LHCb event-building network, we decided to use a folded Clos network [3], such as the one depicted in Fig. 3, often referred to as a fat tree. We selected this particular topology because: 1) it fulfills the aforementioned requirements and 2) it is widely adopted and supported by switch vendors. In particular, the OpenSM subnet manager used in InfiniBand-based networks provides an optimized routing algorithm for fat-tree topologies [4]. This algorithm uses a constant one-to-one correspondence between the selected spine switch and the leaf-switch port to which the destination node is connected, providing a conflict-free path for all the packets that follow a perfect linear shifter.
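The conflict-free property can be checked on a toy model. The sketch below assumes a two-level fat tree in which every leaf switch has as many uplinks as hosts and exactly one link to each spine, and it models the routing rule simply as "the spine index equals the destination's port index on its leaf". This is a simplification for illustration, not the OpenSM implementation; all names are hypothetical.

```python
from collections import Counter


def route(src: int, dst: int, hosts_per_leaf: int):
    """Return (src_leaf, spine, dst_leaf), or None if both hosts share a leaf."""
    src_leaf, dst_leaf = src // hosts_per_leaf, dst // hosts_per_leaf
    if src_leaf == dst_leaf:
        return None  # traffic stays inside the leaf switch
    spine = dst % hosts_per_leaf  # destination port index selects the spine
    return src_leaf, spine, dst_leaf


def max_link_load(phase: int, n_leaves: int, hosts_per_leaf: int) -> int:
    """Largest number of flows sharing one leaf-spine link in a given phase."""
    n_nodes = n_leaves * hosts_per_leaf
    up, down = Counter(), Counter()
    for src in range(n_nodes):
        dst = (phase + src) % n_nodes  # ideal linear shift
        hop = route(src, dst, hosts_per_leaf)
        if hop is None:
            continue
        src_leaf, spine, dst_leaf = hop
        up[(src_leaf, spine)] += 1
        down[(spine, dst_leaf)] += 1
    return max(list(up.values()) + list(down.values()), default=0)
```

Under these assumptions, `max_link_load` never exceeds 1 for any phase: the sources attached to one leaf address destinations with distinct port indices, so they are spread over distinct spines, and each spine-to-leaf downlink carries the traffic of a single destination.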
C. Event-Building Benchmark: DAQPIPE
DAQPIPE [5]-[7] is a small benchmark application to test network fabrics for the future LHCb upgrade. It emulates an event builder based on a local area network, and it supports multiple network technologies through different communication libraries like MPI [8], LIBFABRIC [9], VERBS [10], and PSM2 [11].
DAQPIPE can be used either in a PUSH or PULL schema and it supports different traffic shaping strategies to reduce network congestion. The software has a modular architecture and can use different backends to evaluate multiple network libraries.
The software provides an implementation of all the logical blocks required by the LHCb event building and emulates reading data from a real DAQ board connected to the detector. All the fragments of the same emulated event are then sent through the network using the desired communication library and protocol, and are aggregated in the BU selected by the EM.
In order to reduce traffic congestion, DAQPIPE provides a linear-shift-like traffic shaping, without enforcing strong synchronization among the nodes. The lack of a perfect schedule creates temporary saturation on specific links, possibly causing performance degradation. This effect is mitigated by sending multiple fragments to different destinations at the same time from every RU. The number of fragments in flight per RU and the number of events processed in parallel can be configured via two parameters as follows.
• Credits: Number of events processed in parallel by the BU.
• Parallel Sends: Number of fragments of the same event in flight to the same BU from the different RUs.
The credits are processed in a fully asynchronous way, and they mitigate the impact of temporary link congestion and latency toward the EM. The parallel sends define a sliding window in the linear shift scheduling and absorb the latency between BUs and RUs.
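One possible reading of the two parameters is sketched below. It assumes that the sliding window covers consecutive phases of the linear shift and that the number of fragments in flight toward one BU is bounded by the product of the two parameters, as the definitions above suggest. Both points are assumptions made for illustration and not a description of the actual DAQPIPE code; the function names are hypothetical.

```python
def window_sources(bu: int, first_phase: int, parallel_sends: int,
                   n_nodes: int) -> list[int]:
    """RUs with a fragment of one event in flight toward `bu`.

    In phase n, node i sends to (n + i) % n_nodes, so the RU addressing
    `bu` in phase n is (bu - n) % n_nodes. The window spans
    `parallel_sends` consecutive phases starting at `first_phase`.
    """
    width = min(parallel_sends, n_nodes)
    return [(bu - (first_phase + j)) % n_nodes for j in range(width)]


def max_fragments_in_flight(credits: int, parallel_sends: int) -> int:
    """Upper bound on fragments in flight toward one BU (assumed bound)."""
    return credits * parallel_sends
```

With 8 nodes and 3 parallel sends, the first window for BU 0 is `window_sources(0, 0, 3, 8)`, i.e., the RUs that would address BU 0 in phases 0, 1, and 2 of an ideal linear shift.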
III. SIMULATION MODEL
The simulation model we developed is implemented using the Objective Modular Network Testbed in C++ (OMNeT++) framework [12]; this discrete event simulator primarily targets network simulations and offers multiple tools that can be used to accomplish different tasks, from describing the network topology to gathering advanced statistics from the simulated design. In order to simulate the LHCb DAQ system, we mainly need two components: an accurate description of the network and precise modeling of the DAQ traffic.
Mellanox Technologies has already contributed an OMNeT++-based InfiniBand flow control unit (FLIT)-level simulation model. This model supports link-level flow control, static lookup-table-based routing, arbitration between multiple virtual lanes (VLs), packet generation and fragmentation, and packet arbitration; however, it is no longer maintained by the original author and does not support the 100-Gb/s flavor of InfiniBand [i.e., enhanced data rate (EDR)]. Therefore, we decided to expand the library capabilities to fulfill our requirements and to make it as accurate as possible. In order to obtain a realistic model behavior, we fine-tuned the parameters using information collected from real hardware available in our test laboratory. In particular, we focused on buffer sizes, network latency, link flow control, packet arbitration, and the latency and jitter of our entire software stack, including PCIe communication overheads. The source code of the full project can be accessed via git [14].
A. Modules Description
OMNeT++ uses modules as fundamental building blocks; hereinafter, we provide a brief description of the main ones implemented.
• IBOutBuf: A buffer for outgoing FLITs.
• IBInBuf: A set of buffers for incoming FLITs; every VL has a dedicated buffer. Every work request (WR) is first fragmented into packets and then all the FLITs are generated for every packet.
• IBSink: It receives the packets and notifies the IBApp module upon completion. This feature is critical for simulating real-world applications because it makes the IBApp aware of the inbound traffic and able to reply to incoming messages.
Fig. 4 depicts how modules can be interconnected to generate more complex units.
B. Topologies
In order to implement network topologies, OMNeT++ provides the network description (NED) language, which can be used to generate hierarchical and parametric networks. By using this powerful and flexible tool, we implemented a parametric description of a fat-tree network. To analyze and compare against measured data collected on HPC clusters, we also implemented a Python script that generates NED code by parsing the subnet manager information of the real cluster topology. In this way, we can study the ideal topologies and compare them against real-world systems with small imperfections such as missing nodes, swapped cables, and suboptimal routing.
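As a rough illustration of how NED code can be generated from Python, the sketch below emits a flat two-level fat tree. It is not the script used in this work: the module types `Host` and `Switch` and the gate names `port`, `down`, and `up` are hypothetical placeholders, and the topology is built from two integers rather than parsed from subnet manager output.

```python
def fat_tree_ned(n_leaves: int, hosts_per_leaf: int, name: str = "FatTree") -> str:
    """Return NED text for a two-level fat tree (hypothetical module types)."""
    n_spines = hosts_per_leaf  # assumption: one uplink per attached host
    lines = [
        f"network {name}",
        "{",
        "    submodules:",
        f"        host[{n_leaves * hosts_per_leaf}]: Host;",
        f"        leaf[{n_leaves}]: Switch;",
        f"        spine[{n_spines}]: Switch;",
        "    connections:",
    ]
    for leaf in range(n_leaves):
        # Host-to-leaf links; "++" appends to the next free gate index.
        for port in range(hosts_per_leaf):
            host = leaf * hosts_per_leaf + port
            lines.append(f"        host[{host}].port <--> leaf[{leaf}].down++;")
        # One link from this leaf to every spine switch.
        for spine in range(n_spines):
            lines.append(f"        leaf[{leaf}].up++ <--> spine[{spine}].down++;")
    lines.append("}")
    return "\n".join(lines)
```

Calling `print(fat_tree_ned(2, 2))` produces a small network with four hosts, two leaf switches, and two spine switches. A script working from real cluster data would replace the two nested loops with the links reported by the subnet manager, which is how imperfections such as missing nodes or swapped cables would enter the simulated topology.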
C. Traffic Injectors
Accurate traffic modeling is a key component for obtaining precise and realistic network simulations; in particular, in this project, we replicate both synthetic and real application traffic. Our main target is to simulate the event-building system of the LHCb experiment; therefore, a particular effort was put into an accurate replication of the DAQPIPE traffic. Moreover, we implemented two linear shifters with different phase definitions. A brief description of the implemented traffic injectors follows.
• Fixed-Size Linear Shifter: It shifts the destination after injecting a fixed number of bytes.
• Time-Window Linear Shifter: It shifts the destination after a fixed time interval. This injector uses a fixed grace period to absorb jitter; during this period, the nodes are not allowed to send data, resulting in increased stability at the expense of lower theoretical throughput.
• Simulated DAQPIPE: An injector that replicates the real DAQPIPE traffic. This traffic generator allows the user to change all the relevant parameters as in the real software.
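The trade-off introduced by the grace period of the time-window linear shifter can be made concrete with a small sketch. It assumes that the grace period sits at the beginning of every window and that all nodes share a common clock; neither detail is stated above, and the function names are hypothetical.

```python
def time_window_destination(node: int, t: float, window: float,
                            n_nodes: int) -> int:
    """Destination of `node` at time `t` when the phase changes every `window`."""
    phase = int(t // window) % n_nodes
    return (phase + node) % n_nodes


def may_send(t: float, window: float, grace: float) -> bool:
    """True outside the grace period (assumed at the start of each window)."""
    return (t % window) >= grace


def max_link_utilization(window: float, grace: float) -> float:
    """Fraction of each window that is available for sending data."""
    return (window - grace) / window
```

With a window of 100 time units and a grace period of 10, at most 90% of the link time can carry data, which is the theoretical throughput cost paid for the added tolerance to jitter.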
IV. PARAMETER ESTIMATION
The simulation model has several different parameters that need to be tuned and optimized to replicate the behavior of real InfiniBand systems. For our event-building studies, we are interested in 100-Gb/s networking solutions; therefore, we tuned the model to replicate InfiniBand EDR hardware. In particular, we used a Mellanox SB7700 [15] EDR switch and Mellanox MT27700 [16] ConnectX-4 host channel adapters (HCAs).
Most of the basic parameters can be extracted from the InfiniBand architecture specification [17] , e.g., bandwidth, header overhead, encoding overhead, link flow control behavior, etc. Advanced and hardware specific parameters can be estimated by performing real measurements and reverse engineering on the actual hardware.
Crucial values for our simulations are the switch buffer size, the link layer latency, and the PCIe latency. The buffer size of the switch is critical for a realistic description of the link-level flow control of InfiniBand, and an accurate modeling of the latency is needed to reproduce the same link congestion as in a real system.
Fig. 5. Setup used to generate congestion and estimate the switch buffer size. Host0 sends data to Host2 at full speed; at the same time, Host1 sends packets of different sizes to create controlled congestion.
A. Switch Buffer Estimation
In order to measure the switch buffer size, we can use two different techniques [18]: analyzing the link-level flow control packets or generating congestion and monitoring the congestion indicator on the various ports.
Decoding the information from the flow control packets produces a more accurate measurement, but it requires a low-level InfiniBand protocol analyzer. Because there are no EDR-capable protocol analyzers available on the market, we decided to use the second strategy and estimate the amount of buffering available in every switch port by measuring the performance counters.
The setup used is depicted in Fig. 5 and the procedure used to create congestion is as follows.
• Host0 sends continuously to Host2 at full speed.
• Host1 sends packets of increasing size to Host2 at regular intervals to create congestion.
• By reading the PortXmitWait counter and knowing the packet size, we can estimate the buffer size of the switch.
Following this procedure, we estimated a buffer size of 64 KiB per port per VL with 4 VLs enabled.
B. Link Layer Latency Estimation
In order to measure the link layer latency without using external protocol analyzers, we decided to use the hardware time-stamping feature of the IEEE 1588-2008 standard, i.e., the Precision Time Protocol (PTP), as implemented in the Mellanox HCAs.
The path latency measured using PTP was 170 ns in total between two directly connected hosts using a 3-m-long direct attached copper cable.
C. PCIe Latency Modeling
The final piece in our model tuning is a realistic model of the combined latency introduced by the PCIe bus and the InfiniBand software stack; because of the non-real-time nature of modern computing systems and software, we decided to perform real-world latency measurements and replicate this behavior in our simulation model.
The latency was measured using the ib_write_lat [19] benchmark and subtracting the link layer latency; therefore, this measurement includes all the time needed by the hardware and software chain to make a packet available to the link layer. Fig. 6 shows the histogram of the latency measurements. The simulation model draws random numbers from this distribution to replicate the latency and jitter of the real system.
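Drawing from a measured histogram can be sketched as follows. The sketch picks a bin with probability proportional to its count and then a uniform value inside that bin; the uniform spread within a bin is an assumption, the function names are hypothetical, and the default of 170 ns is the link layer latency reported above for the specific cable used in the measurement. The actual model is implemented in OMNeT++ and may sample differently.

```python
import random


def software_latency_ns(measured_ns: float, link_latency_ns: float = 170.0) -> float:
    """Remove the link layer latency from an end-to-end measurement."""
    return measured_ns - link_latency_ns


def make_latency_sampler(bin_edges_ns, counts, seed=None):
    """Return a function that draws latencies from a histogram.

    bin_edges_ns has one more element than counts, as in a standard histogram.
    """
    if len(bin_edges_ns) != len(counts) + 1:
        raise ValueError("need len(bin_edges_ns) == len(counts) + 1")
    rng = random.Random(seed)
    bins = list(zip(bin_edges_ns[:-1], bin_edges_ns[1:]))

    def sample() -> float:
        # Pick a bin according to its weight, then a point inside the bin.
        low, high = rng.choices(bins, weights=counts, k=1)[0]
        return rng.uniform(low, high)

    return sample
```

A sampler built from the bin edges and counts of the measured histogram can then be called once per simulated message to obtain a latency value that follows the same distribution, including its tail.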
V. RESULTS
In this section, we present some results obtained by simulating DAQPIPE with the aforementioned simulation model. In particular, we provide a comparison between the simulation and measured data and a comparison of two different network topologies.
Fig. 7 shows a comparison between the simulated and the real DAQPIPE for different values of the credits and parallel sends parameters. The measured data were collected on an HPC cluster of 64 nodes interconnected via a fat-tree-like network with missing nodes, swapped cables, and nonideal routing. The simulation uses a replica of the topology and routing of the real system. From this plot, we can confirm that the simulation replicates the trend and the absolute value of the measurements performed on the real system within a 30% error. In particular, for the configurations of most interest to us, i.e., the high-throughput ones, the simulation is well within 20% of the measured data. Given that the focus is on the scalability of those configurations, and given the complexity of simulating such a system, we are satisfied with this level of accuracy.
In Fig. 8, we present a performance comparison of the simulated DAQPIPE on two different topologies: a clean fat tree of 72 nodes and an HPC cluster of 64 nodes. The purpose of this comparison is to show the performance degradation introduced by a topology not optimized for linear-shifting traffic. The network topology and the routing algorithm on the HPC cluster are suboptimal for our specific use case and do not allow linear shifting without conflicts; therefore, we expect a lower bandwidth.
As we can see from the plot, the performance loss is highly dependent on the parameters and can be as high as 50%; nevertheless, the bandwidth drop for the fastest configuration is 6%.
We can conclude that a nonideal topology affects the performance of DAQPIPE and makes it more unstable. The performance drop can vary significantly and is highly influenced by the configuration parameters and the topology itself. Configurations with many parallel sends are affected more severely because they increase the number of nodes communicating in every phase, especially in conjunction with a high credit count, which increases the probability of local link congestion.
VI. CONCLUSION
We have implemented an accurate low-level model of our event-building traffic based on the InfiniBand EDR fabric. We have measured different parameters of the simulation model to achieve realistic results.
We have validated our simulation and traffic model against the measured data obtained on an HPC cluster, and our simulation model is capable of replicating the data within 20% for the configurations we are interested in. Considering that a low-level simulation of a complex network system is very challenging and that the purpose of this paper is to evaluate different optimization strategies, we are satisfied with the level of precision achieved by this simulation model. In particular, the amount of machine time needed on physical infrastructures can be significantly reduced by performing preliminary studies and parameter tuning on simulated systems.
