16 research outputs found

    PanDA for ATLAS Distributed Computing in the Next Decade

    No full text
    The Production and Distributed Analysis (PanDA) system has been developed to meet ATLAS production and analysis requirements for a data-driven workload management system capable of operating at the Large Hadron Collider (LHC) data processing scale. Heterogeneous resources used by the ATLAS experiment are distributed worldwide at hundreds of sites, thousands of physicists analyse the data remotely, the volume of processed data is beyond the exabyte scale, dozens of scientific applications are supported, while data processing requires more than a few billion hours of computing usage per year. PanDA performed very well over the last decade including the LHC Run 1 data taking period. However, it was decided to upgrade the whole system concurrently with the LHC’s first long shutdown in order to cope with rapidly changing computing infrastructure. After two years of reengineering efforts, PanDA has embedded capabilities for fully dynamic and flexible workload management. The static batch job paradigm was discarded in favor of a more automated and scalable model. Workloads are dynamically tailored for optimal usage of resources, with the brokerage taking network traffic and forecasts into account. Computing resources are partitioned based on dynamic knowledge of their status and characteristics. The pilot has been re-factored around a plugin structure for easier development and deployment. Bookkeeping is handled with both coarse and fine granularities for efficient utilization of pledged or opportunistic resources. Leveraging direct remote data access and federated storage relaxes the geographical coupling between processing and data. An in-house security mechanism authenticates the pilot and data management services in off-grid environments such as volunteer computing and private local clusters. The PanDA monitor has been extensively optimized for performance and extended with analytics to provide aggregated summaries of the system as well as drill-down to operational details. There are as well many other challenges planned or recently implemented, and adoption by non-LHC experiments such as bioinformatics groups successfully running Paleomix (microbial genome and metagenomes) payload on supercomputers. In this talk we will focus on the new and planned features that are most important to the next decade of distributed computing workload management

    ATLAS Cloud Computing R&D project

    No full text
    The computing model of the ATLAS experiment was designed around the concept of grid computing and, since the start of data taking, this model has proven very successful. However, new cloud computing technologies bring attractive features to improve the operations and elasticity of scientific distributed computing. ATLAS sees grid and cloud computing as complementary technologies that will coexist at different levels of resource abstraction, and two years ago created an R&D working group to investigate the different integration scenarios. The ATLAS Cloud Computing R&D has been able to demonstrate the feasibility of offloading work from grid to cloud sites and, as of today, is able to integrate transparently various cloud resources into the PanDA workload management system. The ATLAS Cloud Computing R&D is operating various PanDA queues on private and public resources and has provided several hundred thousand CPU days to the experiment. As a result, the ATLAS Cloud Computing R&D group has gained a significant insight into the cloud computing landscape and has identified points that still need to be addressed in order to fully utilize this technology.\nThis contribution will explain the cloud integration models that are being evaluated and will discuss ATLAS’ learning during the collaboration with leading commercial and academic cloud providers

    ATLAS Cloud R&D

    No full text
    The computing model of the ATLAS experiment was designed around the concept of grid computing and, since the start of data taking, this model has proven very successful. However, new cloud computing technologies bring attractive features to improve the operations and elasticity of scientific distributed computing. ATLAS sees grid and cloud computing as complementary technologies that will coexist at different levels of resource abstraction, and two years ago created an R&D working group to investigate the different integration scenarios. The ATLAS Cloud Computing R&D has been able to demonstrate the feasibility of offloading work from grid to cloud sites and, as of today, is able to integrate transparently various cloud resources into the PanDA workload management system. The ATLAS Cloud Computing R&D is operating various PanDA queues on private and public resources and has provided several hundred thousand CPU days to the experiment. As a result, the ATLAS Cloud Computing R&D group has gained a significant insight into the cloud computing landscape and has identified points that still need to be addressed in order to fully utilize this technology. This contribution will explain the cloud integration models that are being evaluated and will discuss ATLAS’ learning during the collaboration with leading commercial and academic cloud providers

    ATLAS Distributed Computing Automation

    No full text
    The ATLAS Experiment benefits from computing resources distributed worldwide at more than 100 WLCG sites. The ATLAS Grid sites provide over 100k CPU job slots, over 100 PB of storage space on disk or tape. Monitoring of status of such a complex infrastructure is essential. The ATLAS Grid infrastructure is monitored 24/7 by two teams of shifters distributed world-wide, by the ATLAS Distributed Computing experts, and by site administrators. In this paper we summarize automation efforts performed within the ATLAS Distributed Computing team in order to reduce manpower costs and improve the reliability of the system. Different aspects of the automation process are described: from the ATLAS Grid site topology provided by the ATLAS Grid Information System, via automatic site testing by the HammerCloud, to automatic exclusion from production or analysis activities
    corecore