
    The CMS Timing and Control Distribution System

    The Compact Muon Solenoid (CMS) experiment operating at the CERN (European Organization for Nuclear Research) Large Hadron Collider (LHC) is in the process of upgrading several of its detector systems. Adding more individual detector components brings the need to test and commission those components separately from existing ones, so as not to compromise physics data-taking. The CMS Trigger, Timing and Control (TTC) system had reached its limits in terms of the number of separate elements (partitions) that could be supported. A new Timing and Control Distribution System (TCDS) has been designed, built and commissioned to overcome this limit. It also brings additional functionality to facilitate parallel commissioning of new detector elements. We describe the new TCDS and its components, and show results from the first operational experience with it in CMS.
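
    The partition concept above can be illustrated with a minimal sketch, assuming nothing about the real TCDS implementation: partition names, component names and commands below are invented for illustration only.

```python
# Illustrative sketch (not the actual TCDS software) of the partition idea:
# detector components are grouped into independent partitions so that a new
# component can be commissioned without disturbing physics data-taking.

class Partition:
    def __init__(self, name, components):
        self.name = name
        self.components = list(components)
        self.state = "idle"

    def send_command(self, command):
        # In the real system this would distribute timing and control signals;
        # here we only record the commanded state for every component.
        self.state = command
        return {c: command for c in self.components}

# Hypothetical layout: existing sub-detectors keep taking data while a new
# detector element is exercised in its own partition.
physics = Partition("physics", ["tracker", "ecal", "hcal", "muon"])
commissioning = Partition("new_element_test", ["new_detector_module"])

physics.send_command("start_run")          # undisturbed data-taking
commissioning.send_command("calibrate")    # parallel commissioning
print(physics.state, commissioning.state)  # -> start_run calibrate
```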

    Boosting Event Building Performance using Infiniband FDR for CMS Upgrade

    As part of the CMS upgrade during the first long shutdown of the LHC (LS1), the CMS data acquisition system is incorporating Infiniband FDR technology to boost event-building performance for operation from 2015 onwards. Infiniband promises a substantial increase in data transmission speed compared to the older 1GE network used during the 2009-2013 LHC run. Several options are available to developers when choosing a foundation for software upgrades, including the uDAPL (DAT Collaborative) and Infiniband verbs (OFED) libraries. Due to advances in technology, the CMS data acquisition system will be able to achieve the required throughput of 100 kHz with increased event sizes, while reducing the number of nodes, by using a combination of 10GE, 40GE and 56 Gb/s Infiniband FDR. This paper presents the analysis and results of a comparison between GE and Infiniband solutions, as well as a look at how they integrate into an event-building architecture, while preserving the scalability, efficiency and deterministic latency expected in a high-end data acquisition network.
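
    A rough back-of-the-envelope calculation shows why the link technology matters. The 100 GB/s aggregate figure is taken from the event-builder abstracts later in this list; the usable fraction of the nominal line rate is an assumption made purely for illustration.

```python
# Back-of-the-envelope link-count estimate for a ~100 GB/s event builder
# (aggregate figure quoted in the event-builder abstracts in this list).
# The usable fraction of the nominal line rate is an assumed overhead factor.

aggregate_bytes_per_s = 100e9          # ~100 GB/s total event-building traffic
usable_fraction = 0.8                  # assumed protocol/efficiency overhead

for name, line_rate_bits in [("10GE", 10e9), ("40GE", 40e9), ("FDR IB (56 Gb/s)", 56e9)]:
    usable_bytes = line_rate_bits / 8 * usable_fraction
    links = aggregate_bytes_per_s / usable_bytes
    print(f"{name:18s} -> about {links:.0f} links to carry 100 GB/s")
```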

    Recent experience and future evolution of the CMS High Level Trigger System

    The CMS experiment at the LHC uses a two-stage trigger system, with events flowing from the first-level trigger at a rate of 100 kHz. These events are read out by the data acquisition system (DAQ), assembled in memory in a farm of computers, and finally fed into the high-level trigger (HLT) software running on the farm. The HLT software selects interesting events for offline storage and analysis at a rate of a few hundred Hz. The HLT algorithms consist of sequences of offline-style reconstruction and filtering modules, executed on a farm of O(10000) CPU cores built from commodity hardware. Experience from the 2010-2011 collider run is detailed, as well as the current architecture of the CMS HLT and its integration with the CMS reconstruction framework and the CMS DAQ. The short- and medium-term evolution of the HLT software infrastructure is discussed, with future improvements aimed at supporting extensions of the HLT computing power and at addressing remaining performance and maintenance issues.
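
    A minimal sketch of the "sequence of reconstruction and filtering modules" idea is given below. It is not the CMS software framework; the toy reconstruction, selection and event content are invented for illustration.

```python
# Minimal sketch of an HLT-style path (not the CMS software framework): an
# event is passed through a sequence of reconstruction and filter steps and is
# kept for storage only if every filter in the path accepts it.

def reconstruct_jets(event):
    # toy "reconstruction": treat every calorimeter deposit above 20 (GeV) as a jet
    event["jets"] = [e for e in event["calo_deposits"] if e > 20.0]
    return event, True

def require_two_jets(event):
    # toy "filter": keep the event only if at least two jets were found
    return event, len(event["jets"]) >= 2

hlt_path = [reconstruct_jets, require_two_jets]

def run_hlt(event, path=hlt_path):
    for step in path:
        event, accepted = step(event)
        if not accepted:
            return False        # event rejected, stop processing this path
    return True                 # event selected for offline storage

events = [{"calo_deposits": [5.0, 30.0, 45.0]}, {"calo_deposits": [10.0, 25.0]}]
selected = [ev for ev in events if run_hlt(ev)]
print(f"selected {len(selected)} of {len(events)} events")   # -> selected 1 of 2
```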

    Distributed error and alarm processing in the CMS data acquisition system

    The error and alarm system for the data acquisition of the Compact Muon Solenoid (CMS) at CERN was successfully used for the physics runs at the Large Hadron Collider (LHC) during the first three years of activity. Error and alarm processing entails the notification, collection, storage and visualization of all exceptional conditions occurring in the highly distributed CMS online system, using a uniform scheme. Alerts and reports are shown online by web application facilities that map them onto graphical models of the system as defined by the user. A persistency service keeps a history of all exceptions that have occurred, allowing subsequent retrieval of user-defined time windows of events for later playback or analysis. This paper describes the architecture and the technologies used, and deals with operational aspects during the first years of LHC operation. In particular we focus on performance, stability, and integration with the CMS sub-detectors.
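
    The two core roles described above, collecting exceptional conditions and retrieving them by time window, can be sketched as follows. This is an illustration only, not the actual CMS architecture; source names and fields are invented.

```python
# Illustrative sketch (not the actual CMS software) of alarm collection with a
# persistent history that can be queried by time window for later playback.
import time
from dataclasses import dataclass, field

@dataclass
class Alarm:
    timestamp: float
    source: str
    severity: str
    message: str

@dataclass
class AlarmCollector:
    history: list = field(default_factory=list)

    def notify(self, source, severity, message):
        # store every exceptional condition with a timestamp
        self.history.append(Alarm(time.time(), source, severity, message))

    def window(self, t_start, t_end):
        # retrieve a user-defined time window of past alarms
        return [a for a in self.history if t_start <= a.timestamp <= t_end]

collector = AlarmCollector()
collector.notify("readout_unit_07", "error", "link down")        # invented source
collector.notify("builder_unit_02", "warning", "backpressure")   # invented source
recent = collector.window(time.time() - 60, time.time())
print(len(recent), "alarms in the last minute")
```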

    Achieving High Performance with TCP over 40GbE on NUMA architectures for CMS Data Acquisition

    TCP and the socket abstraction have barely changed over the last two decades, but at the network layer there has been a giant leap from a few megabits to 100 gigabits in bandwidth. At the same time, CPU architectures have evolved into the multi-core era and applications are expected to make full use of all available resources. Applications in the data acquisition domain based on the standard socket library and running on a Non-Uniform Memory Access (NUMA) architecture are unable to reach full efficiency and scalability unless the software is adequately aware of the IRQ (interrupt request), CPU and memory affinities. During the first long shutdown of the LHC, the CMS DAQ system is being upgraded for operation from 2015 onwards, and a new software component has been designed and developed in the CMS online framework for transferring data with sockets. This software wraps the low-level socket library to ease higher-level programming, with an API based on an asynchronous, event-driven model similar to the DAT uDAPL API. It is an event-based application with NUMA optimizations that allows a high throughput of data across a large distributed system. This paper describes the architecture, the technologies involved and the performance measurements of the software in the context of the CMS distributed event building.
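
    The two ingredients mentioned above, CPU affinity and an asynchronous, event-driven socket loop, can be sketched in a few lines. The real component is C++ in the CMS online framework; the Python stand-in below is only an illustration, and the port number and core set are arbitrary.

```python
# Minimal sketch of an affinity-aware, event-driven receiver (illustration
# only; the real component is a C++ socket wrapper in the CMS online framework).
import os
import socket
import selectors

# Pin the process to a chosen set of cores, e.g. those local to the NIC's NUMA
# node (core list here is purely illustrative; Linux-only system call).
if hasattr(os, "sched_setaffinity"):
    os.sched_setaffinity(0, {0, 1})

sel = selectors.DefaultSelector()

def on_readable(conn):
    data = conn.recv(65536)
    if data:
        pass                         # hand the buffer to the event-building layer
    else:
        sel.unregister(conn)
        conn.close()

def on_accept(server):
    conn, _ = server.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, on_readable)

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 9999))       # arbitrary illustrative port
server.listen()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, on_accept)

while True:                          # asynchronous, event-driven dispatch loop
    for key, _ in sel.select():
        key.data(key.fileobj)        # invoke the callback registered for the socket
```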

    Upgrade of the CMS Event Builder

    The data acquisition system of the Compact Muon Solenoid experiment at CERN assembles events at a rate of 100 kHz, transporting event data at an aggregate throughput of 100 GB/s. By the time the LHC restarts after the 2013/14 shutdown, the current computing and networking infrastructure will have reached the end of its lifetime. This paper presents design studies for an upgrade of the CMS event builder based on advanced networking technologies such as 10/40 Gb/s Ethernet and Infiniband. The results of performance measurements with small-scale test setups are shown.
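
    The two headline numbers quoted above already fix the average event size the builder must handle, as the short worked arithmetic below shows (derived only from the figures in the abstract).

```python
# Average event size implied by the quoted rate and aggregate throughput.
throughput = 100e9      # bytes per second (100 GB/s aggregate)
rate = 100e3            # events per second (level-1 accept rate, 100 kHz)
print(f"average event size ~ {throughput / rate / 1e6:.0f} MB")   # ~1 MB
```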

    DAQExpert - An expert system to increase CMS data-taking efficiency

    The efficiency of the data acquisition (DAQ) of the Compact Muon Solenoid (CMS) experiment for LHC Run 2 is constantly being improved. A significant factor affecting the data-taking efficiency is the experience of the DAQ operator. One of the main responsibilities of the DAQ operator is to carry out the proper recovery procedure in case of a failure of data-taking. At the start of Run 2, understanding the problem and finding the right remedy could take a considerable amount of time (up to many minutes). Operators relied heavily on the support of on-call experts, also outside working hours. Wrong decisions taken under time pressure sometimes led to additional overhead in recovery time. To increase the efficiency of CMS data-taking we developed a new expert system, DAQExpert, which provides shifters with optimal recovery suggestions instantly when a failure occurs. DAQExpert is a web application that analyses frequently updated monitoring data from all DAQ components and identifies problems based on expert knowledge expressed in small, independent logic modules written in Java. Its results are presented in real time in the control room via a web-based GUI and a sound system, in the form of a short description of the current failure and the steps to recover.
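
    The logic-module idea can be sketched as below. The real DAQExpert modules are small Java classes; this Python sketch only illustrates the pattern, and the monitoring fields, module name and recovery advice are hypothetical.

```python
# Sketch of the logic-module pattern described above (the real modules are
# written in Java): each module inspects the latest monitoring snapshot and,
# if its condition fires, contributes a short description and a recovery
# suggestion for the shifter. All field names below are hypothetical.

class StuckReadoutModule:
    name = "Readout stuck"

    def condition(self, snapshot):
        # fire when the trigger rate is zero while some readout units report backpressure
        return snapshot.get("trigger_rate_hz", 0) == 0 and snapshot.get("backpressured_units")

    def recovery(self, snapshot):
        units = ", ".join(snapshot["backpressured_units"])
        return (f"Trigger rate is zero while {units} apply backpressure: "
                "stop the run and reconfigure the affected subsystem (illustrative advice only).")

def analyze(snapshot, modules):
    # evaluate all independent logic modules against the same monitoring snapshot
    return [(m.name, m.recovery(snapshot)) for m in modules if m.condition(snapshot)]

snapshot = {"trigger_rate_hz": 0, "backpressured_units": ["readout_unit_12"]}
for name, advice in analyze(snapshot, [StuckReadoutModule()]):
    print(f"[{name}] {advice}")
```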

    The CMS Data Acquisition - Architectures for the Phase-2 Upgrade

    The upgraded High Luminosity LHC, after the third Long Shutdown (LS3), will provide an instantaneous luminosity of 7.5×10³⁴ cm⁻² s⁻¹ (levelled), at the price of extreme pileup of up to 200 interactions per crossing. In LS3, the CMS detector will also undergo a major upgrade to prepare for Phase-2 of the LHC physics program, starting around 2025. The upgraded detector will be read out at an unprecedented data rate of up to 50 Tb/s and an event rate of 750 kHz. Complete events will be analysed by software algorithms running on standard processing nodes, and selected events will be stored permanently at a rate of up to 10 kHz for offline processing and analysis. In this paper we discuss the baseline design of the DAQ and HLT systems for Phase-2, taking into account the projected evolution of high-speed network fabrics for event building and distribution, and the anticipated performance of general-purpose CPUs. Implications for hardware and infrastructure requirements of the DAQ "data center" are analysed. Emerging technologies for data reduction are considered. Novel possible approaches to event building and online processing, inspired by trending developments in other areas of computing dealing with large masses of data, are also examined. We conclude by discussing the opportunities offered by reading out and processing parts of the detector, wherever the front-end electronics allows, at the machine clock rate (40 MHz).
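
    The Phase-2 figures quoted above already determine the average event size and the storage bandwidth, as the short worked arithmetic below shows (derived only from the numbers in the abstract).

```python
# Average event size and storage throughput implied by the quoted Phase-2 figures.
readout_bits_per_s = 50e12          # up to 50 Tb/s out of the detector
l1_rate = 750e3                     # 750 kHz events into the DAQ
hlt_rate = 10e3                     # up to 10 kHz stored for offline analysis

event_size_bytes = readout_bits_per_s / 8 / l1_rate
print(f"average event size ~ {event_size_bytes / 1e6:.1f} MB")                # ~8.3 MB
print(f"storage throughput ~ {event_size_bytes * hlt_rate / 1e9:.0f} GB/s")   # ~83 GB/s
```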

    Performance of the CMS Event Builder

    The data acquisition system (DAQ) of the CMS experiment at the CERN Large Hadron Collider (LHC) assembles events at a rate of 100 kHz. It transports event data at an aggregate throughput of ~100 GB/s to the high-level trigger (HLT) farm. The CMS DAQ system has been completely rebuilt during the first long shutdown of the LHC in 2013/14. The new DAQ architecture is based on state-of-the-art network technologies for the event building. For the data concentration, 10/40 Gb/s Ethernet technologies are used together with a reduced TCP/IP protocol implemented in FPGA for a reliable transport between custom electronics and commercial computing hardware. A 56 Gb/s Infiniband FDR Clos network has been chosen for the event builder. We report on the performance of the event-builder system and the steps taken to exploit the full potential of the network technologies.
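
    The core event-building step can be illustrated with a toy sketch: fragments arriving from the readout nodes are grouped by event number and an event is complete once one fragment from every source has been received. This is not the CMS implementation; the number of sources and the payloads are invented.

```python
# Toy sketch of event building (not the CMS implementation): group incoming
# fragments by event number and declare an event complete when one fragment
# from every readout source is present.
from collections import defaultdict

N_SOURCES = 4                       # number of readout nodes, illustrative

pending = defaultdict(dict)         # event number -> {source id: fragment payload}

def add_fragment(event_number, source_id, payload):
    pending[event_number][source_id] = payload
    if len(pending[event_number]) == N_SOURCES:
        fragments = pending.pop(event_number)
        return b"".join(fragments[s] for s in sorted(fragments))   # built event
    return None                     # still waiting for missing fragments

# feed fragments in arbitrary arrival order
for src in (2, 0, 3, 1):
    event = add_fragment(42, src, f"frag{src}".encode())
print(event)                        # complete event assembled from 4 fragments
```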

    Health and Performance Monitoring of the Online Computer Cluster of CMS

    The CMS experiment's online cluster consists of 2300 computers and 170 switches or routers operating on a 24-hour basis. This huge infrastructure must be monitored in such a way that the administrators are proactively warned of any failure or degradation in the system, in order to avoid or minimize downtime that could lead to the loss of data-taking. The number of metrics monitored per host varies from 20 to 40 and covers basic host checks (disk, network, load) and application-specific checks (service running), in addition to hardware monitoring. The sheer number of hosts and checks per host stretches the limits of many monitoring tools and requires careful use of various configuration optimizations to work reliably. The initial monitoring system used in the CMS online cluster was based on Nagios, but it suffered from various drawbacks and did not work reliably in the expanded cluster. The CMS cluster administrators investigated the different open-source tools available and chose a fork of Nagios called Icinga, with several plugin modules to enhance its scalability. The Gearman module provides a queuing system for all checks and their results, allowing easy load balancing across worker nodes. Supported modules allow the grouping of checks into a single request, thereby significantly reducing the network overhead of running a set of checks on a group of nodes. The PNP4nagios module provides graphing capability to Icinga, using round-robin database (RRD) files. Additional software (rrdcached) optimizes access to the RRD files and is vital in order to support the required number of operations. Furthermore, to make the best use of the monitoring information and notify the appropriate communities of any issue with their systems, much work was put into grouping the checks according to, for example, the function of the machine, the services running, the sub-detectors to which they belong, and the criticality of the computer. An automated system to generate the configuration of the monitoring system has been produced to facilitate its evolution and maintenance. The use of these performance-enhancing modules and the work on grouping the checks have yielded impressive performance improvements over the previous Nagios infrastructure, allowing many more metrics per second to be monitored than with the previous system. Furthermore, the design allows the infrastructure to grow easily without the need to rethink the monitoring system as a whole.
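
    The two ideas described above, grouping checks for routing notifications and bundling all checks for a host into a single request, can be sketched as follows. This is not the actual Icinga/Gearman configuration; host names, check names and group labels are invented for illustration.

```python
# Illustrative sketch (not the real Icinga/Gearman setup): tag each check with
# the groups used for notification routing, and bundle all checks for a host
# into one request to reduce network overhead. All names below are invented.
from collections import defaultdict

checks = [
    # host,         check,        groups (function, sub-detector, criticality)
    ("readout-01",  "disk",       {"readout", "tracker", "critical"}),
    ("readout-01",  "load",       {"readout", "tracker", "critical"}),
    ("builder-09",  "hlt-procs",  {"builder", "hlt-farm", "critical"}),
    ("monitor-01",  "rrdcached",  {"monitoring", "best-effort"}),
]

# one bundled request per host instead of one request per check
batched = defaultdict(list)
for host, check, groups in checks:
    batched[host].append(check)
for host, host_checks in batched.items():
    print(f"submit to worker queue: {host} -> {host_checks}")

# notify only the communities responsible for the failing groups
def recipients(groups):
    routing = {"tracker": "tracker-oncall", "hlt-farm": "hlt-oncall",
               "monitoring": "sysadmins"}          # hypothetical routing table
    return {routing[g] for g in groups if g in routing}

print(recipients({"readout", "tracker", "critical"}))   # -> {'tracker-oncall'}
```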