research

Characterizing, managing and monitoring the networks for the ATLAS data acquisition system

Abstract

Particle physics studies the constituents of matter and the interactions between them. Many of the elementary particles do not exist under normal circumstances in nature. However, they can be created and detected during energetic collisions of other particles, as is done in particle accelerators. The Large Hadron Collider (LHC) being built at CERN will be the world's largest circular particle accelerator, colliding protons at energies of 14 TeV. Only a very small fraction of the interactions will give raise to interesting phenomena. The collisions produced inside the accelerator are studied using particle detectors. ATLAS is one of the detectors built around the LHC accelerator ring. During its operation, it will generate a data stream of 64 Terabytes/s. A Trigger and Data Acquisition System (TDAQ) is connected to ATLAS -- its function is to acquire digitized data from the detector and apply trigger algorithms to identify the interesting events. Achieving this requires the power of over 2000 computers plus an interconnecting network capable of sustaining a throughput of over 150 Gbit/s with minimal loss and delay. The implementation of this network required a detailed study of the available switching technologies to a high degree of precision in order to choose the appropriate components. We developed an FPGA-based platform (the GETB) for testing network devices. The GETB system proved to be flexible enough to be used as the ba sis of three different network-related projects. An analysis of the traffic pattern that is generated by the ATLAS data-taking applications was also possible thanks to the GETB. Then, while the network was being assembled, parts of the ATLAS detector started commissioning -- this task relied on a functional network. Thus it was imperative to be able to continuously identify existing and usable infrastructure and manage its operations. In addition, monitoring was required to detect any overload conditions with an indication where the excess demand was being generated. We developed tools to ease the maintenance of the network and to automatically produce inventory reports. We created a system that discovers the network topology and this permitted us to verify the installation and to track its progress. A real-time traffic visualization system has been built, allowing us to see at a glance which network segments are heavily utilized. Later, as the network achieves production status, it will be necessary to extend the monitoring to identify individual applications' use of the available bandwidth. We studied a traffic monitoring technology that will allow us to have a better understanding on how the network is used. This technology, based on packet sampling, gives the possibility of having a complete view of the network: not only its total capacity utilization, but also how this capacity is divided among users and software applicati ons. This thesis describes the establishment of a set of tools designed to characterize, monitor and manage complex, large-scale, high-performance networks. We describe in detail how these tools were designed, calibrated, deployed and exploited. The work that led to the development of this thesis spans over more than four years and closely follows the development phases of the ATLAS network: its design, its installation and finally, its current and future operation

    Similar works