Abstract-The LHCb data acquisition after 2019 will need to perform event-building at an aggregated bandwidth of 32 Tbit/s. Apart from the technological challenges described in various papers, also at this conference, the key challenge is to come up with an architecture that minimises the cost while providing a system that scales well and can be maintained by a small team for a long time. In this paper we present the analyses we have carried out to minimise the cost, the R&D topics we derived from them, and how we combined all this into a coherent proposal for a system that not only fits the budgetary constraints of LHCb today but will also profit from any mainstream technological development. We achieve this by aligning our system needs as much as possible with data-centre mass-market commercial off-the-shelf (COTS) products, by minimising the number of optical interconnects, and by optimising the physical layout of the system. This system requires only one piece of custom-made hardware, and even this could, for a smaller setup, be replaced by a commercially available item. We believe that the reasoning behind this design can be beneficial to any large, high-rate data acquisition system.
I. INTRODUCTION
Over the past decade the I/O capability of modern computers has reached enormous speeds. Applications that used to be in the realm of FPGAs and ASICs are nowadays within the grasp of common CPUs. A good example is the mass storage market. Ten years ago, state-of-the-art high-performance storage systems were all based on FPGAs; today these architectures have all but died out. Most systems use common servers with x86-compatible CPUs and Unix-based operating systems under the hood.
With the advent of PCIe Gen3 and the Intel Sandy Bridge architecture, a single, dual core server can reach a total I/O bandwidth of 500 Gbit/s. That is 500 Gbit/s of input and 500 Gbit/s of output, for a total bidirectional bandwidth of 1 Tbit/s. For comparison, the current DAQ system of the LHCb experiment has a total bandwidth of approximately 400 Gbit/s and could be streamed through a single server, with bandwidth to spare! Another cornerstone of large-scale DAQ systems, the readout network, has recently also seen a substantial boost in performance. 100 Gbit/s capable network cards are available today for InfiniBand (IB), and 100 Gbit/s Ethernet is just around the corner.
This explosion of throughput capability has made it possible, in fact necessary, to abandon the classical, expensive, crate based read-out board solutions and bring the data into a computer one step earlier in the read-out chain.
The authors are with CERN, Switzerland.
II. THE HIDDEN COST OF A CRATE BASED READ-OUT
The cost of a crate based read-out is not only the price of the hardware and infrastructure. A crate based read-out attached to a local area network uses either (expensive) single-board computers or an FPGA-based board, which transforms the detector data into an industry standard network format. To reduce overall complexity, simple event-building protocols are preferred. Data are pushed onto the network and sent toward dedicated machines, which assemble the individual detector snapshot fragments into complete events.
This simple push mode usually necessitates large, very expensive buffers in the read-out network. These costs can be mitigated by adding buffer memory to the FPGA boards; however, this increases the complexity of the firmware, which then needs more complicated protocols and memory interface code. Another aspect is the network interface itself. It can be provided either by instantiating a network core on the FPGA or by adding an interface card to the board. Today this means Ethernet only, which is no longer necessarily the most cost-efficient solution at very high bandwidths. One might also want to use precious logic cells for something more important than network code.
Yet another disadvantage is the high density of these solutions. While small optical links are coming down in price, copper cable assemblies are still cheaper if power consumption is not a concern and distances can be kept short.
These cable assemblies require significant front-panel space and sometimes additional ASICs. In custom boards this drives up the cost of the solution, since one cannot easily exploit the economy of scale, which drives the COTS market.
III. A PCIE BASED READ-OUT
To overcome these disadvantages, we are currently developing an FPGA based PCIe card which connects directly to the detector. By using a common protocol for read-out, controls and fast control, we can use the same board for these three major tasks. The card uses high density optical connectors on the front plate to connect to the detector. It uses a built-in PCIe hard core, which comes for free with most modern FPGAs and has a very low footprint in terms of logic cell usage. The card will be Gen3 and 16 lanes wide. It will be capable of a sustained throughput of 100 Gbit/s for DAQ and controls purposes.
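The throughput target can be checked against the raw link rate with a short calculation. The sketch below uses the PCIe Gen3 signalling rate of 8 GT/s per lane and its 128b/130b line encoding; it deliberately ignores packet header and flow-control overhead, so the numbers are upper bounds rather than achievable payload rates. The function name is chosen for illustration only.

```python
# Back-of-the-envelope PCIe Gen3 link bandwidth (raw, one direction).
# Packet (TLP) headers and flow control are NOT modelled here, so real
# payload throughput will be lower than these figures.

GEN3_GT_PER_S_PER_LANE = 8.0    # PCIe Gen3 signalling rate per lane
GEN3_ENCODING = 128.0 / 130.0   # 128b/130b line encoding efficiency


def raw_gbit_per_s(lanes: int) -> float:
    """Raw one-directional bit rate of a Gen3 link in Gbit/s."""
    return lanes * GEN3_GT_PER_S_PER_LANE * GEN3_ENCODING


for lanes in (8, 16):
    print(f"x{lanes}: {raw_gbit_per_s(lanes):.1f} Gbit/s raw")
```

Under these assumptions a 16 lane link offers about 126 Gbit/s raw, so a sustained 100 Gbit/s leaves roughly a fifth of the raw rate for protocol overhead.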
978-1-4799-3659-5/14/$31.00 ©2014 IEEE
Since this card plugs into a CPU, the expensive buffer task can be moved from the network to the server, where memory is extremely cheap and plentiful. The CPU also allows the implementation of more complex event-building algorithms, which can make the read-out more robust and are easier to develop and maintain¹. Once the data is inside the server, the choice of network technology is only limited by what is available. It allows the choice to be made at a much later time, which means more bandwidth per price unit and a safer, more future proof choice.
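To illustrate what event building in software amounts to, the following minimal sketch collects fragments per event in host memory and releases an event once every source has contributed. It is not the LHCb implementation: the class name, the method name and the use of integer source IDs are assumptions made for this example, and a production builder would also need timeouts and handling of missing fragments.

```python
from collections import defaultdict


class EventBuilder:
    """Hypothetical builder: buffers fragments until an event is complete."""

    def __init__(self, source_ids):
        self.expected = frozenset(source_ids)
        # event_id -> {source_id: payload}; this is the buffer that lives
        # in cheap server memory instead of in the network switches.
        self.pending = defaultdict(dict)

    def add_fragment(self, event_id, source_id, payload):
        """Store one fragment; return the assembled event or None."""
        if source_id not in self.expected:
            raise ValueError(f"unknown source {source_id}")
        fragments = self.pending[event_id]
        fragments[source_id] = payload
        # All keys are validated above, so equal sizes means "complete".
        if len(fragments) == len(self.expected):
            del self.pending[event_id]
            # Concatenate in a fixed source order for a stable layout.
            return b"".join(fragments[s] for s in sorted(self.expected))
        return None


builder = EventBuilder(source_ids=[0, 1, 2])
builder.add_fragment(7, 2, b"c")          # incomplete, returns None
builder.add_fragment(7, 0, b"a")          # incomplete, returns None
event = builder.add_fragment(7, 1, b"b")  # last fragment completes event 7
```

Fragments may arrive in any order; the assembled event is always laid out in ascending source order.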
IV. CURRENT R&D PROJECTS
Our current R&D is focusing on two major subjects. The first is the development of the card and its drivers. Writing a high speed I/O driver is not a simple task, and surprisingly little knowledge and few resources are currently available within the HEP and the OSS communities. We are currently using Altera Stratix V based PCIe development cards for developing the firmware and drivers. We have completed a first version of an 8 lane interface which is stable and can sustain a throughput of more than 50 Gbit/s. Our next goal is to extend this interface to a 16 lane version. This version uses two 8 lane interfaces as a base and a PLX PCIe bridge to merge them into a single 16 lane stream. A schematic of the planned interface card can be seen in figure 1.
The second subject is the change of the classic read-out network topology and event-building protocols. Since the CPU the card plugs into has plenty of computing capability, we are trying to move the usually distributed event-building task into the servers that house the read-out board. This compactification allows us to do the event-building within a very small, dedicated, high bandwidth network, illustrated in figure 2. This in turn allows the use of short cables, which can be copper instead of optical.
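The load such a compact building network has to carry can be estimated with a simple model. The sketch below assumes that every read-out server also acts as a builder and that events are assigned to builders round-robin by event number; both the assignment scheme and the function names are assumptions made for illustration, not a description of the protocol under development.

```python
def builder_for_event(event_id: int, n_servers: int) -> int:
    """Assumed round-robin rule: which server assembles this event."""
    return event_id % n_servers


def per_server_network_load(input_gbit_s: float, n_servers: int) -> float:
    """Gbit/s each server sends into the building network.

    A server keeps the fragments of the events it builds itself (a
    fraction 1/n_servers) and ships the rest to the other builders.
    By symmetry it receives the same amount from its peers.
    """
    return input_gbit_s * (n_servers - 1) / n_servers


for n in (2, 4, 32):
    load = per_server_network_load(100.0, n)
    print(f"{n:3d} servers: {load:.1f} Gbit/s sent and received per server")
```

In this model the per-server load approaches, but never exceeds, the input rate of the read-out card, so a full-duplex network link of the same speed as the card is sufficient.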
We have already shown the feasibility of 100 Gbit/s event building within a single machine. The studies were performed on an InfiniBand network using two servers. One server ran the event building software under test, while the other provided simulated data traffic from all the other nodes in the event building network. A GPU was used to simulate the data coming from a local read-out board on server one. Our current R&D focus in this area is a test on a larger network: we are scaling up the software used in the two machine test to run on bigger clusters.
¹ C/C++ code instead of VHDL.
Another side effect of this change in network architecture is the separation of the event-building network from the event filter network. Since the filtering of events will be CPU bound, high speed I/O will not be necessary in the filter network. This opens the door to a cheaper, slower solution, in which only a limited number of high speed links are needed as uplinks to the event-building network.
An additional opportunity for cost reduction is the fan-out stage to the farm. To leave some breathing room for fluctuations and safety factors for as yet undetermined detector occupancies, most detector links will actually run at far less than the maximum bandwidth of 100 Gbit/s. This means we can save on output links from the event building servers to the farm by doing the building only on a subset of the servers. This subset will run close to 100 Gbit/s, while the other building servers will only relay their data to them via the building network. We can then connect only these active servers to the farm and roll out high speed uplinks according to the actual bandwidth needs of the detector.
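The saving on uplinks can be estimated as follows. The sketch assumes a hypothetical system of 320 read-out servers (32 Tbit/s divided by 100 Gbit/s per card) that each receive only 40 Gbit/s from the detector; both numbers are placeholders chosen for the example, not measured occupancies, and no headroom below the 100 Gbit/s link capacity is reserved.

```python
import math

LINK_CAPACITY_GBIT_S = 100.0  # capacity of one uplink to the farm


def active_builders_needed(input_rates_gbit_s,
                           link_capacity=LINK_CAPACITY_GBIT_S):
    """Smallest number of builders (one uplink each) for the total rate."""
    total = sum(input_rates_gbit_s)
    return math.ceil(total / link_capacity)


# Hypothetical occupancy: 320 servers, each fed at 40 Gbit/s.
rates = [40.0] * 320
needed = active_builders_needed(rates)
print(f"{needed} of {len(rates)} servers need a farm uplink")
```

With these placeholder numbers only 128 of the 320 servers would need a high speed uplink; the remaining servers relay their fragments over the building network.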
V. CONCLUSION
We have analysed the major cost factors in the current LHCb DAQ system and what they mean for the future. By replacing classical, crate based read-out solutions with PCIe, we keep our options open for future network technologies and avoid an early lock-in to what is essentially Ethernet only.
Furthermore, the use of PCIe allows us to move data into computers much earlier in the DAQ process. We can now use the virtually limitless memory and the CPU power of a modern server to implement more sophisticated event building protocols that allow the use of cheap, non-buffering switches. The computer based read-out also allows us to choose the physical transport (copper or optical) closer to the deployment stage of the system. We have shown that 100 Gbit/s event building and read-out are already possible today, and we are well on our way to having a larger scale demonstration system soon.
