    Opportunistic usage of the CMS online cluster using a cloud overlay

    After two years of maintenance and upgrade, the Large Hadron Collider (LHC), the largest and most powerful particle accelerator in the world, has started its second three-year run. Around 1500 computers make up the CMS (Compact Muon Solenoid) Online cluster. This cluster is used for data acquisition by the CMS experiment at CERN, selecting and sending to storage around 20 TB of data per day, which are then analysed by the Worldwide LHC Computing Grid (WLCG) infrastructure that links hundreds of data centres worldwide. 3000 CMS physicists can access and process data, and are always seeking more computing power and data. The backbone of the CMS Online cluster is composed of 16000 cores, which provide as much computing power as all CMS WLCG Tier1 sites combined (352K HEP-SPEC-06 in the CMS cluster versus 300K across the CMS Tier1 sites). This computing power can significantly speed up the processing of data, so an effort has been made to allocate the resources of the CMS Online cluster to the grid when it isn't used to its full capacity for data acquisition. This occurs during the maintenance periods when the LHC is non-operational, which amounted to 117 days in 2015. During 2016, the aim is to increase the availability of the CMS Online cluster for data processing by making the cluster accessible during the time between two physics collisions, while the LHC and the beams are being prepared. This is usually the case for a few hours every day, which would vastly increase the computing power available for data processing. Work has already been undertaken to provide this functionality: an OpenStack cloud layer has been deployed as a minimal overlay that leaves the primary role of the cluster untouched. This overlay also abstracts the different hardware and networks that the cluster is composed of. Operating the cloud (starting and stopping the virtual machines) is another challenge that has been overcome, since the cluster has only a few hours to spare during the aforementioned beam preparation. By improving the virtual image deployment and integrating the OpenStack services with the core services of the data acquisition on the CMS Online cluster, it is now possible to start a thousand virtual machines within 10 minutes and to turn them off within seconds. This document explains the architectural choices made to reach a fully redundant and scalable cloud with minimal impact on the running cluster configuration while maintaining maximal segregation between the services. It also presents how to cold start 1000 virtual machines 25 times faster, using tools commonly utilised in all data centres.
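
    The abstract quotes starting a thousand virtual machines within about 10 minutes through the OpenStack overlay. As a rough illustration of what such a bulk cold start looks like from the client side, the sketch below uses the openstacksdk Python library; the cloud, image, flavour and network names are hypothetical placeholders, and this is not the paper's actual deployment tooling, which additionally relies on optimised image distribution and integration with the DAQ core services.

        # Illustrative sketch only: boot a batch of VMs through the OpenStack API.
        # The cloud, image, flavor and network names are hypothetical placeholders.
        import openstack

        conn = openstack.connect(cloud="online-overlay")            # entry assumed to exist in clouds.yaml
        image = conn.compute.find_image("wlcg-worker")               # pre-staged worker image (assumed)
        flavor = conn.compute.find_flavor("m1.large")                # VM size (assumed)
        network = conn.network.find_network("cloud-overlay-net")     # overlay network (assumed)

        # Submit all boot requests first so the scheduler can work on them in parallel...
        servers = [
            conn.compute.create_server(
                name=f"worker-{i:04d}",
                image_id=image.id,
                flavor_id=flavor.id,
                networks=[{"uuid": network.id}],
            )
            for i in range(1000)
        ]

        # ...then wait for each instance to reach the ACTIVE state.
        for server in servers:
            conn.compute.wait_for_server(server)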

    Performance of the new DAQ system of the CMS experiment for run-2

    The data acquisition system (DAQ) of the CMS experiment at the CERN Large Hadron Collider (LHC) assembles events at a rate of 100 kHz, transporting event data at an aggregate throughput of more than 100 GB/s to the High-Level Trigger (HLT) farm. The HLT farm selects and classifies interesting events for storage and offline analysis at an output rate of around 1 kHz. The DAQ system has been redesigned during the accelerator shutdown in 2013-2014. The motivation for this upgrade was twofold. Firstly, the compute nodes, networking and storage infrastructure were reaching the end of their lifetimes. Secondly, in order to maintain physics performance with higher LHC luminosities and increasing event pileup, a number of sub-detectors are being upgraded, increasing the number of readout channels as well as the required throughput, and replacing the off-detector readout electronics with a MicroTCA-based DAQ interface. The new DAQ architecture takes advantage of the latest developments in the computing industry. For data concentration, 10/40 Gbit/s Ethernet technologies are used, and a 56 Gbit/s InfiniBand FDR Clos network (total throughput ≈ 4 Tbit/s) has been chosen for the event builder. The upgraded DAQ-HLT interface is entirely file-based, essentially decoupling the DAQ and HLT systems. The fully built events are transported to the HLT over 10/40 Gbit/s Ethernet via a network file system. The collection of events accepted by the HLT and the corresponding metadata are buffered on a global file system before being transferred off-site. The monitoring of the HLT farm and of the data-taking performance is based on the Elasticsearch analytics tool. This paper presents the requirements, implementation and performance of the system. Experience is reported from the first year of operation with LHC proton-proton runs as well as with the heavy-ion lead-lead runs in 2015.
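
    Taking the abstract's own figures at face value (100 kHz input rate, roughly 100 GB/s aggregate throughput, about 1 kHz HLT output), a quick back-of-the-envelope calculation gives the average event size and the bandwidth leaving the HLT; the snippet below assumes, as a simplification, that accepted events have the same average size as built events.

        # Back-of-the-envelope figures derived from the rates quoted in the abstract.
        event_rate_hz = 100e3            # level-1 accept rate: 100 kHz
        builder_throughput_Bps = 100e9   # aggregate event-builder throughput: ~100 GB/s (bytes per second)
        hlt_output_rate_hz = 1e3         # HLT accept rate: ~1 kHz

        avg_event_size_B = builder_throughput_Bps / event_rate_hz   # ~1 MB per event
        hlt_output_Bps = hlt_output_rate_hz * avg_event_size_B      # ~1 GB/s towards storage (same size assumed)

        print(f"average event size ~ {avg_event_size_B / 1e6:.1f} MB")
        print(f"HLT output bandwidth ~ {hlt_output_Bps / 1e9:.1f} GB/s")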

    New operator assistance features in the CMS Run Control System

    The Run Control System of the Compact Muon Solenoid (CMS) experiment at CERN is a distributed Java web application running on Apache Tomcat servers. During Run-1 of the LHC, many operational procedures have been automated. When detector high voltages are ramped up or down, or upon certain beam mode changes of the LHC, the DAQ system is automatically partially reconfigured with new parameters. Certain types of errors, such as those caused by single-event upsets, may trigger an automatic recovery procedure. Furthermore, the top-level control node continuously performs cross-checks to detect sub-system actions becoming necessary because of changes in configuration keys, changes in the set of included front-end drivers, or because of potential clock instabilities. The operator is guided to perform the necessary actions through graphical indicators displayed next to the relevant command buttons in the user interface. Through these indicators, consistent configuration of CMS is ensured. However, manually following the indicators can still be inefficient at times. A new operator assistant has therefore been developed that can automatically perform all the necessary actions in a streamlined order. If additional problems arise, the new assistant tries to recover from these automatically. With the new assistant, a run can be started from any state of the sub-systems with a single click. An ongoing run may be recovered with a single click, once the appropriate recovery action has been selected. We review the automation features of the CMS Run Control System and discuss the new assistant in detail, including first operational experience.
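
    As a purely conceptual illustration of the "start a run from any state with a single click" idea, the toy sketch below walks a subsystem through a fixed, streamlined order of actions until it reaches a Running state. The states, actions and transitions are invented for illustration only and do not reflect the actual Java-based Run Control implementation.

        # Toy illustration: derive the ordered actions needed to reach "Running"
        # from an arbitrary starting state. All states and actions are invented.
        NEXT_ACTION = {
            "Error":      ("recover",   "Halted"),
            "Halted":     ("configure", "Configured"),
            "Configured": ("start",     "Running"),
        }

        def drive_to_running(state):
            """Return the streamlined list of actions that leads to 'Running'."""
            actions = []
            while state != "Running":
                action, state = NEXT_ACTION[state]
                actions.append(action)
            return actions

        print(drive_to_running("Error"))  # ['recover', 'configure', 'start']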

    The CMS Data Acquisition - Architectures for the Phase-2 Upgrade

    The upgraded High Luminosity LHC, after the third Long Shutdown (LS3), will provide an instantaneous luminosity of 7.5×10³⁴ cm⁻² s⁻¹ (levelled), at the price of extreme pileup of up to 200 interactions per crossing. In LS3, the CMS detector will also undergo a major upgrade to prepare for Phase-2 of the LHC physics program, starting around 2025. The upgraded detector will be read out at an unprecedented data rate of up to 50 Tb/s and an event rate of 750 kHz. Complete events will be analysed by software algorithms running on standard processing nodes, and selected events will be stored permanently at a rate of up to 10 kHz for offline processing and analysis. In this paper we discuss the baseline design of the DAQ and HLT systems for Phase-2, taking into account the projected evolution of high-speed network fabrics for event building and distribution, and the anticipated performance of general-purpose CPUs. Implications on hardware and infrastructure requirements for the DAQ "data center" are analysed. Emerging technologies for data reduction are considered. Novel possible approaches to event building and online processing, inspired by trending developments in other areas of computing dealing with large masses of data, are also examined. We conclude by discussing the opportunities offered by reading out and processing parts of the detector, wherever the front-end electronics allows, at the machine clock rate (40 MHz).
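
    The headline Phase-2 numbers (up to 50 Tb/s read out at 750 kHz, with up to 10 kHz kept for offline analysis) imply an average event size and a storage bandwidth that follow directly from division and multiplication; the snippet below only combines these quoted figures and assumes accepted events have the same average size as built events.

        # Rough event-size and storage-bandwidth estimates from the quoted Phase-2 figures.
        readout_bps = 50e12        # detector readout: up to 50 Tb/s (bits per second)
        event_rate_hz = 750e3      # level-1 accept rate: 750 kHz
        storage_rate_hz = 10e3     # events kept for offline analysis: up to 10 kHz

        avg_event_size_B = (readout_bps / 8) / event_rate_hz   # ~8.3 MB per event
        storage_Bps = storage_rate_hz * avg_event_size_B       # ~83 GB/s to permanent storage

        print(f"average event size ~ {avg_event_size_B / 1e6:.1f} MB")
        print(f"storage bandwidth  ~ {storage_Bps / 1e9:.0f} GB/s")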

    Performance of the CMS Event Builder

    The data acquisition system (DAQ) of the CMS experiment at the CERN Large Hadron Collider (LHC) assembles events at a rate of 100 kHz. It transports event data at an aggregate throughput of ~100 GB/s to the high-level trigger (HLT) farm. The CMS DAQ system has been completely rebuilt during the first long shutdown of the LHC in 2013/14. The new DAQ architecture is based on state-of-the-art network technologies for the event building. For the data concentration, 10/40 Gb/s Ethernet technologies are used together with a reduced TCP/IP protocol implemented in FPGA for a reliable transport between custom electronics and commercial computing hardware. A 56 Gb/s InfiniBand FDR Clos network has been chosen for the event builder. We report on the performance of the event-builder system and the steps taken to exploit the full potential of the network technologies.
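
    A simple lower bound on the scale of the event-builder network follows from the quoted numbers: carrying roughly 100 GB/s over nominal 56 Gb/s FDR links requires at least 15 links. The snippet below does only this division and deliberately ignores protocol overhead, required headroom and the actual Clos topology of the system.

        # Lower bound on the number of 56 Gb/s FDR links needed for ~100 GB/s,
        # ignoring protocol overhead, headroom and the real Clos topology.
        import math

        aggregate_bps = 100e9 * 8   # ~100 GB/s expressed in bits per second
        fdr_link_bps = 56e9         # nominal InfiniBand FDR line rate

        min_links = math.ceil(aggregate_bps / fdr_link_bps)
        print(f"at least {min_links} FDR links")   # -> at least 15 FDR links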