Abstract-During the third long shutdown of the CERN Large Hadron Collider, the CMS detector will undergo a major upgrade to prepare for Phase-2 of the CMS physics program, starting around 2026. The upgraded CMS detector will be read out at an unprecedented data rate of up to 50 Tbit/s, with an event rate of 750 kHz selected by the Level-1 hardware trigger and an average event size of 7.4 MB. Complete events will be analyzed by the High-Level Trigger (HLT) using software algorithms running on standard processing nodes, potentially augmented with hardware accelerators. Selected events will be stored permanently at a rate of up to 7.5 kHz for offline processing and analysis. This paper presents the baseline design of the DAQ and HLT systems for Phase-2, taking into account the projected evolution of high-speed network fabrics for event building and distribution, and the anticipated performance of general-purpose CPUs. In addition, some opportunities offered by reading out and processing parts of the detector data at the full LHC bunch-crossing rate (40 MHz) are discussed.
I. INTRODUCTION
After the third Long Shutdown (LS3), the upgraded High-Luminosity LHC (HL-LHC) will provide an instantaneous luminosity of 7.5 × 10³⁴ cm⁻² s⁻¹ (levelled), with a pileup of up to 200 interactions per bunch crossing.
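The quoted pileup follows directly from the luminosity. As a back-of-the-envelope check, assuming an inelastic proton-proton cross section of about 80 mb and roughly 2760 colliding bunch pairs at the 11.245 kHz LHC revolution frequency (neither value is quoted in the text):

```latex
\langle \mathrm{PU} \rangle
  \approx \frac{\mathcal{L}\,\sigma_{\mathrm{inel}}}{n_b\, f_{\mathrm{rev}}}
  = \frac{(7.5\times10^{34}\,\mathrm{cm^{-2}\,s^{-1}})
          \,(8\times10^{-26}\,\mathrm{cm^{2}})}
         {2760 \times 11\,245\,\mathrm{s^{-1}}}
  \approx 193,
```

consistent with the quoted pileup of up to 200 interactions per bunch crossing.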
Along with the increased statistics, and thereby extended physics reach, the higher HL-LHC luminosity brings several challenges. The higher particle fluxes and total radiation doses require more radiation-tolerant detector and front-end electronics technologies. And while the increased instantaneous luminosity boosts the probability of observing rare interactions, it also complicates triggering, reconstruction, and analysis of the event data, due to the higher number of proton-proton collisions within the same bunch crossing (the so-called 'pileup').
The design goal for the CMS Phase-2 upgrade is 'to maintain the current excellent performance in terms of efficiency, resolution, and background rejection for all physics objects' [1]. In order to achieve this, the main focus points are:
• Improved tracker and muon detector coverage towards the forward regions, and increased granularity, aimed at preserving the performance of the CMS particle flow reconstruction algorithms [2] at higher pileup and occupancy.
• Inclusion of tracking information in the first-level trigger [3]. Double-sided detector modules will correlate hit pairs into 'stubs' with a coarse transverse momentum estimate. Only the data corresponding to tracks with pT ≥ 2 GeV will be transmitted to the tracker back-end. This self-seeding approach results in an immediate tenfold rate reduction. The baseline design uses Hough transforms implemented in FPGAs to fit up to 2500 tracks per bunch crossing, within the 4 µs latency allotment (out of a total trigger latency of 12.5 µs); a toy illustration of the transform is sketched below. The result is a significant improvement in the first-level trigger turn-on curves.
• Addition of timing information to the front-end data of several subdetectors, in order to help with pileup suppression by adding coincidence information to spatial hits.
• The introduction of a dedicated MIP (minimum-ionizing particle) timing detector [4], closely integrated with the outer silicon tracker. Maintaining vertexing resolution at high pileup by spatial tracking improvements alone is extremely challenging. The combination of timing information with the tracker hits allows clustering, tracking, and vertexing to take place in four dimensions instead of in just spatial 3D.

Apart from the increased throughput and rate requirements of the upgraded detector, the Phase-2 upgrade offers a unique moment of reflection. Unlike typical end-of-life replacements, or even the Phase-1 upgrade, the CMS-wide Phase-2 upgrade, since it replaces both front-end and back-end electronics, allows breaking backward compatibility. As such, it is an excellent moment to reassess current approaches and implementations.
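Returning to the Hough-transform track finding mentioned in the list above: each stub at radius r and azimuth φ votes for all track hypotheses (q/pT, φ0) compatible with it, and real tracks appear as peaks in the accumulator. The following minimal sketch illustrates only the idea; the binning, field constant, and stub threshold are illustrative choices, not the actual firmware parameters.

```python
import numpy as np

# Toy r-phi Hough transform for Level-1 tracker stubs (illustrative only).
# To first order, a track from the beamline satisfies
#   phi0 = phi + k * r * (q/pT),  with k = 0.3 * B / 2
# (B in tesla, r in metres, pT in GeV), so each stub maps to a straight
# line in the (q/pT, phi0) parameter plane.
B = 3.8                                     # CMS solenoid field [T]
k = 0.3 * B / 2.0

qpt_bins = np.linspace(-0.5, 0.5, 64)       # pT >= 2 GeV --> |q/pT| <= 0.5 GeV^-1
phi0_bins = np.linspace(-np.pi, np.pi, 256)
acc = np.zeros((qpt_bins.size, phi0_bins.size), dtype=np.uint16)

def fill(stubs):
    """Accumulate votes; stubs are (r [m], phi [rad]) pairs."""
    for r, phi in stubs:
        for i, qpt in enumerate(qpt_bins):
            j = np.digitize(phi + k * r * qpt, phi0_bins) - 1  # phi wrap-around ignored
            if 0 <= j < phi0_bins.size:
                acc[i, j] += 1

def candidates(min_stubs=5):
    """Cells where enough tracker layers line up form track candidates."""
    return np.argwhere(acc >= min_stubs)
```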
The original CMS trigger-DAQ design [7] combines a hardware trigger (Level-1), implemented in custom electronics, with a High-Level Trigger implemented in software and running on commodity compute nodes. This design has proven to be very flexible and scalable [8], [9], and the Phase-2 baseline DAQ design builds on this same architecture.
The upgrade of the Level-1 trigger falls outside the scope of this paper, but is detailed in [10]. An interim design report has been prepared for the Phase-2 CMS DAQ upgrade [11], and will be followed by a Technical Design Report for the DAQ and High-Level Trigger upgrade by mid-2021. The remainder of this paper discusses the overall design and scale of the Phase-2 CMS DAQ system, with a focus on open questions and ongoing R&D.

II. BASELINE DESIGN

Figure 1 shows an overview of the baseline Phase-2 DAQ design. The subdetector front-ends and back-ends are not in the scope of the DAQ/HLT project, but in the context of the Phase-2 upgrade they present an interesting transition. All Phase-2 CMS front-end-to-back-end communication uses the CERN GBT [12] and Versatile Link [13] or lpGBT [14] and Versatile Link+ [15] links, combining data transport in one direction with clock and fast-control distribution in the other. This means that there are no longer any timing-specific ASICs in the front-ends, and that the central timing and fast-control system only interfaces with the subdetector back-end electronics (and no longer with any on-detector end-points).
Subdetector back-ends will transmit data on custom point-to-point links at 16 or 25 Gbit/s (an evolution of the CERN S-link [16] protocol) to a DAQ and Timing Hub in the same crate, which concentrates and balances these data for transmission to the surface counting room.
The data-to-surface (D2S) network will be based on commercially available hardware, and use a standard protocol. A networked event builder will assemble all back-end event fragments into events, and transfer them to the High-Level Trigger filter farm. Events accepted by the HLT are buffered locally in anticipation of transfer to the CERN computing center for permanent storage.
At a pileup of 200 proton-proton collisions, the upgraded CMS detector will produce an estimated event size of ≈7.4 MB. At the design Level-1 accept rate of 750 kHz, this corresponds to an event-builder throughput of 44 Tbit/s. The HLT must then reduce this to an output rate of 7.5 kHz to storage.
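These headline numbers are mutually consistent, as a quick sanity check shows (all values from the text; the conversion assumes 1 MB = 10^6 bytes):

```python
# Quick consistency check of the Phase-2 DAQ headline numbers.
event_size_bytes = 7.4e6   # average event size at pileup 200
l1_rate_hz       = 750e3   # Level-1 accept rate
hlt_out_hz       = 7.5e3   # HLT output rate to storage

evb_throughput_tbps = event_size_bytes * 8 * l1_rate_hz / 1e12
storage_gbytes_s    = event_size_bytes * hlt_out_hz / 1e9

print(f"event building: {evb_throughput_tbps:.1f} Tbit/s")  # -> 44.4 Tbit/s
print(f"storage:        {storage_gbytes_s:.1f} GB/s")       # -> 55.5 GB/s
```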
III. DAQ AND TIMING HUB
One of the more fundamental changes of the Phase-2 DAQ upgrade is a tighter integration among trigger control, fast control, data-flow monitoring, and the subdetector back-end electronics. The form factor of choice for the Phase-2 CMS back-end and trigger-DAQ electronics is ATCA [17], [18]. Each back-end crate will be equipped with a DAQ and Timing Hub (DTH; see Fig. 2). This hub board has two central functions: it interfaces the boards in the crate to the central Timing and Trigger Control and Distribution System (TCDS), and it bridges the custom back-end electronics to the standard data-to-surface network. On the TCDS side the DTH also provides stand-alone functionality for single-crate commissioning and testing. The DAQ side will receive back-end data on custom point-to-point links (either 24 × 16 Gbit/s or 16 × 25 Gbit/s) via the front panel, and concentrate and balance these data for transmission on the D2S network (likely again via the front panel). The latter functionality also requires large buffers to decouple the back-ends from any network performance fluctuations.
A single DTH is designed to provide a DAQ throughput of 400 Gbit/s per crate (and is therefore often called a 'DTH400'), which is sufficient for most subdetectors. For those subdetectors that require a higher throughput, a companion board will be designed, the DAQ800, providing an additional throughput of 800 Gbit/s but no timing functionality.
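The 'large buffers' mentioned above can be sized with simple arithmetic: the buffer must absorb the full per-crate input rate for as long as a network stall may last. A sketch, in which the stall tolerance is an assumption (it is not quoted in the text):

```python
# Illustrative DTH buffer sizing against D2S back-pressure.
input_rate_gbps   = 400      # per-crate DAQ throughput of a DTH400 (from the text)
stall_tolerance_s = 0.050    # assumed worst-case network stall: 50 ms

buffer_bytes = input_rate_gbps * 1e9 / 8 * stall_tolerance_s
print(f"buffer needed: {buffer_bytes / 1e9:.1f} GB")  # -> 2.5 GB per crate
```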
The DTH is the main custom hardware deliverable for the Phase-2 DAQ system, and plays a crucial role in the distribution of clock and timing information from the central system to the back-end electronics. A detailed prototyping program is under way, starting with an 'evaluation' prototype instrumented to prove timing distribution performance in collaboration with back-end developers and timing experts. The next steps include an updated prototype, produced in moderate quantities, targeting test stands and firmware and/or software development setups. The schedule leaves sufficient room for an additional revision if necessary, either to fix oversights or to catch up with advances in technology, before the production round of the final hardware to be installed at the experiment.
IV. DATA-TO-SURFACE AND EVENT BUILDING NETWORKS
As was the case in previous generations of the CMS DAQ system, the data-to-surface network will be based on commercial hardware and standard protocols. The choice of the exact protocol and network technology is still under consideration, and should become clear by the time of the DAQ Technical Design Report in 2021. One important constraint arises from the fact that the source of the D2S network is the DTH custom hardware.

Fig. 2. A DAQ and Timing Hub (DTH) in a back-end ATCA crate. Front-end to back-end connectivity is point-to-point. Level-1 trigger information is diverted by the back-end boards, and the data are sent to the DTH on point-to-point optical links. The DTH balances and concentrates these data, and sends them on to the commercial data-to-surface network.

The DTH data reception and data handling will be implemented in an FPGA, suggesting a preference for a D2S protocol that can also be implemented in an FPGA. The CMS DAQ group has experience with TCP/IP implementations in various FPGA families, and at multiple speeds [19], [20]. The baseline D2S technology at the time of writing is 100 Gbit/s Ethernet (4 × 25 Gbit/s), using 100GBASE-CWDM4 interfaces to handle the approximately 250 m cable length. Table I shows the estimated throughput requirements on the D2S network for the different CMS subdetectors. Allowing for headroom and non-optimal link use due to imperfect balancing based on the separation into subdetectors, the estimated size of the D2S network is O(600) 100 Gbit/s links. Implementing this in a flexible and cost-effective fashion is one of the challenges of the Phase-2 DAQ upgrade.
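The O(600) figure follows from the aggregate throughput plus margins. The sketch below scales the aggregate number instead of summing the per-subdetector rows of Table I (which are not reproduced here), so the two margin factors are assumptions:

```python
import math

# Illustrative estimate of the D2S link count from the aggregate throughput.
evb_throughput_gbps = 44e3   # total event-builder throughput (from the text)
link_speed_gbps     = 100    # 100GBASE-CWDM4 links
headroom            = 1.2    # assumed safety margin
imbalance           = 1.15   # assumed penalty for per-subdetector granularity

links = math.ceil(evb_throughput_gbps / link_speed_gbps * headroom * imbalance)
print(links)  # -> 608, consistent with the quoted O(600)
```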
The D2S network brings the data to the surface counting room. There, the baseline design foresees a set of commodity I/O servers handling the assembly of event fragments into events. These I/O servers will be interconnected by a dedicated, high-performance event-building network. Studies are ongoing to investigate the relative merits of a traditional 'linear' network layout, in which data flow unidirectionally through the network, versus a 'folded' network architecture, which combines the data read-out from the D2S network in the same hosts as the event building. In the folded configuration each host serves both as source and as destination for the event-building traffic, which flows through the same network link. This makes the 'folded' approach intrinsically more efficient in terms of network usage. The 'linear' approach, on the other hand, benefits from performance tuning exploiting the unidirectional data flow. The main (cost) difference, however, is expected to lie in the number of required event-builder switch ports, which can be significantly lower in a 'folded' architecture.
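The port-count argument can be made quantitative with a toy model; the node count below is illustrative, not a design figure:

```python
# Toy comparison of event-builder switch ports for the two layouts.
# 'Linear': readout units (sources) and builder units (destinations) are
# distinct hosts, each needing its own switch port.
# 'Folded': each host is both source and destination, and its single
# full-duplex link carries both traffic directions.
n_hosts = 100   # illustrative number of I/O servers

linear_ports = n_hosts + n_hosts   # source ports + destination ports
folded_ports = n_hosts             # one bidirectional port per host

print(linear_ports, folded_ports)  # -> 200 vs 100: half the ports when folded
```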
V. HIGH-LEVEL TRIGGER FARM

Even with the addition of tracking information to the Level-1 trigger, the Phase-2 operating conditions will require an increase in the Level-1 output rate. Simulations show that, in order not to cut into the physics signals, an increase from 100 kHz to 750 kHz is required [5]. This in turn requires a significant increase in HLT compute capacity. Figure 3 shows both the projected increase in per-event compute time at a pileup of 200 and the evolution of compute power and heat load per HLT node. Combined with the 7.5-fold rate increase, it is clear that a more advanced solution than naive scaling of the data center is necessary.
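The scale of the problem can be put in numbers: the required HLT capacity grows as the product of the rate increase and the per-event time increase. The per-event times below are illustrative placeholders, not the curves of Fig. 3:

```python
# Rough HLT capacity scaling (per-event times are assumed, not measured).
rate_now_khz, rate_p2_khz = 100, 750   # Level-1 accept rates (from the text)
t_now_ms, t_p2_ms = 1.0, 5.0           # assumed per-event HLT time, now vs pileup 200

capacity_factor = (rate_p2_khz / rate_now_khz) * (t_p2_ms / t_now_ms)
print(f"HLT capacity must grow by ~{capacity_factor:.0f}x")  # -> ~38x in this example
```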
One possibility is the off-loading of compute-intensive tasks to dedicated accelerator hardware, based, for example, on GPUs or FPGAs. A preliminary study [21] applying GPUs to the pixel tracking shows a fourfold gain in speed, combined with a 30% reduction in power, compared to the current CPU-only implementation.
The current CMS R&D program spans three possible strategies to add accelerators to the HLT farm:
• Equipping all HLT compute nodes with co-processors. Of all the approaches considered this is the most homogeneous solution, which at the same time makes it the hardest to optimize: different tasks could benefit from different types of co-processors or accelerators, and homogeneous compute nodes would require replicating the accelerator hardware throughout the whole HLT farm. In addition, this solution requires careful balancing between the CPU and accelerator workloads in order to be profitable.
• Creating a co-processor offload service on the event-builder network. This approach neatly sections off the special hardware in a 'corner' of the event-builder network, and also trivially supports any admixture of different co-processors and/or accelerators. The downside, however, is that it requires additional high-bandwidth, low-latency connectivity between the HLT compute nodes and the co-processors. This latter point may affect the cost-benefit balance of this approach.
• Integrating the co-processors directly in the I/O nodes receiving the data from the D2S network. While this is the most heterogeneous approach of the three, it has the benefit that it can be tailored easily by specializing the co-processors to the subdetector data being received. The challenge then lies in an effective separation of the accelerated pre-processing stage from the subsequent HLT filtering stage. The most important downside of this approach is that the acceleration cannot be exploited for 'on-demand' tasks, since it is not part of a consecutive filter chain as is the case in the HLT. An important caveat applies from the point of view of cost: low-cost consumer-grade GPUs do not lend themselves to integration in rack-based server infrastructure, forcing the choice of more expensive GPUs designed for data-center use.
While it is clear that naive scaling of the HLT farm will not be sustainable for Phase-2, no clear candidate solution has been pinpointed yet. One important open question is how much of the HLT workload can be effectively ported to benefit from accelerators. The road ahead depends not only on the outcome of the CMS studies, but also on the evolution of technologies and prices in industry. By the time of the Phase-2 DAQ TDR, the ongoing R&D is expected to enable the formulation of a roadmap towards a final HLT design in time for installation around 2026.
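This open question is essentially Amdahl's law: the farm-level gain from accelerators is bounded by the fraction of the HLT workload that can be ported. A minimal illustration, in which the 4x kernel speedup echoes the pixel-tracking study [21] and the portable fractions are assumptions:

```python
# Amdahl's law applied to HLT offloading: overall speedup when a fraction f
# of the workload runs on an accelerator with kernel speedup s.
def overall_speedup(f, s):
    return 1.0 / ((1.0 - f) + f / s)

s = 4.0  # kernel speedup, as in the pixel-tracking GPU study
for f in (0.1, 0.3, 0.5, 0.8):
    print(f"f={f:.0%}: farm speedup {overall_speedup(f, s):.2f}x")
# f=10%: 1.08x, f=30%: 1.29x, f=50%: 1.60x, f=80%: 2.50x
```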
VI. NON-BASELINE STUDIES
The baseline Phase-2 DAQ design presented in the interim design report [11] was carefully selected for its technical feasibility: in the absence of external constraints, the system could be implemented with currently available technologies. From the previous sections it will be clear, however, that this baseline design can (and must) be improved in several respects.
In addition to investigations into improvements of the baseline design, several studies are ongoing that aim to provide additional functionality and/or physics reach.
One interesting possible extension could be the addition of a parallel '40 MHz scouting' read-out of all Level-1 trigger primitives, augmented with the raw detector information where bandwidth allows (e.g., the muon systems). Such a system would present an 'opportunistic experiment' in parallel to the CMS data-taking, providing access to data from multiple successive bunch crossings. Given the enormous data volumes involved, these data cannot be stored permanently. Instead, a rolling window of data would be kept, and analyzed using a combination of on-the-fly feature extraction and query-based analyses. This approach could give access to physics signatures that are not easily accessible with the traditional hardware-trigger approach. One example is the search for rare, long-lived particles. Such studies would require simultaneous access to detector information from multiple bunch crossings, which is difficult to achieve, and inefficient, in a traditional hardware trigger. Other physics cases include searches for very rare signatures that are hidden by very high-rate backgrounds, such that it is impossible to design a hardware trigger that is sufficiently selective to reduce the trigger rate without affecting the signal. The analysis of the scouting data could serve as 'anomaly hunting', and lead to a targeted trigger menu should any interesting signs of new physics be found.
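The 'rolling window' constraint is easy to quantify. A sketch, in which the per-crossing payload and window depth are assumptions (the text gives neither):

```python
# Sizing a rolling buffer for 40 MHz scouting (payload and window assumed).
bx_rate_hz    = 40e6   # LHC bunch-crossing rate
payload_bytes = 2e3    # assumed Level-1 trigger-primitive payload per crossing
window_s      = 60     # assumed rolling-window depth: one minute

rate_tbps = bx_rate_hz * payload_bytes * 8 / 1e12
buffer_tb = bx_rate_hz * payload_bytes * window_s / 1e12
print(f"{rate_tbps:.2f} Tbit/s sustained, {buffer_tb:.1f} TB kept per minute")
# -> 0.64 Tbit/s and 4.8 TB: feasible to hold transiently, but permanent
#    storage at this rate is clearly excluded, as the text argues.
```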
In addition to the physics potential, such a '40 MHz scouting' system would provide an invaluable contribution in the form of monitoring and calibration information with almost infinite statistics.
VII. SUMMARY AND OUTLOOK

Table II shows a numerical comparison between the current Run-2 CMS DAQ system and the expected Phase-2 DAQ system. From those numbers, and from the description in this paper, it can be concluded that, technologically, the Phase-2 DAQ system could already be realized at the time of writing. The real challenges lie in the design of a cost-effective architecture that also satisfies the financial, powering, and cooling constraints.

The prototyping program for the DAQ and Timing Hub should start delivering answers within the next year, and the overall design of the CMS Phase-2 DAQ and High-Level Trigger should become clearer by the time of its Technical Design Report, due by mid-2021.
