173 research outputs found
ATLAS Distributed Computing Evolution: Developments and Demonstrators Towards HL-LHC
The computing challenges at the HL-LHC require fundamental changes to the distributed computing models that have served the experiments well throughout the LHC era. ATLAS planning for HL-LHC computing started in 2020 with a Conceptual Design Report outlining the various challenges to explore. This was followed in 2022 by a roadmap defining concrete milestones and the associated effort required. Today, ATLAS is proceeding further with a set of "demonstrators": focused R&D on specific topics described in the roadmap. The demonstrators cover areas such as optimised tape writing and access, data recreation on demand, and the use of commercial clouds.
WLCG Authorisation from X.509 to Tokens
The WLCG Authorisation Working Group was formed in July 2017 with the objective to understand and meet the needs of a future-looking Authentication and Authorisation Infrastructure (AAI) for WLCG experiments. Much has changed since the early 2000s, when X.509 certificates presented the most suitable choice for authorisation within the grid; progress in token-based authorisation and identity federation has provided an interesting alternative with notable advantages in usability and compatibility with external (commercial) partners. The need for interoperability in this new model is paramount as infrastructures and research communities become increasingly interdependent. Over the past two years, the working group has made significant steps towards identifying a system to meet the technical needs highlighted by the community during staged requirements-gathering activities. Enhancement work has been possible thanks to externally funded projects, allowing existing AAI solutions to be adapted to our needs. A cornerstone of the infrastructure is the reliance on a common token schema in line with evolving standards and best practices, allowing for maximum compatibility and easy cooperation with peer infrastructures and services. We present the work of the group and an analysis of the anticipated changes in the authorisation model in moving from X.509 to token-based authorisation. A concrete example of token integration in Rucio is presented.
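To illustrate what such a common token schema looks like in practice, here is a minimal Python sketch of inspecting a WLCG-profile access token with the PyJWT library; the issuer URL and claim values are invented for illustration, and signature verification is deliberately skipped.

    # Sketch: inspecting a WLCG Common JWT Profile access token with PyJWT.
    # Issuer URL and claim values below are illustrative, not real.
    import jwt  # PyJWT

    def inspect_wlcg_token(encoded_token: str) -> dict:
        # A real relying party must verify the signature against the
        # issuer's published keys; verification is skipped here for brevity.
        claims = jwt.decode(encoded_token, options={"verify_signature": False})
        assert claims.get("wlcg.ver") == "1.0"  # WLCG profile version claim
        return claims

    # Shape of a typical payload under the WLCG profile:
    example_claims = {
        "wlcg.ver": "1.0",
        "iss": "https://iam.example.org/",         # hypothetical issuer
        "sub": "b0b-1234-opaque-id",               # opaque subject identifier
        "aud": "https://wlcg.cern.ch/jwt/v1/any",  # "any service" audience
        "exp": 1700000000,
        "scope": "storage.read:/ storage.modify:/atlas",  # capability-based authz
        "wlcg.groups": ["/atlas", "/atlas/production"],   # group-based authz
    }

The schema foresees both capability-based (scope) and group-based (wlcg.groups) authorisation, so individual services can adopt whichever model fits them.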
Updates to the ATLAS Data Carousel Project
The High Luminosity upgrade to the LHC (HL-LHC) is expected to deliver scientific data at the multi-exabyte scale. In order to address this unprecedented data storage challenge, the ATLAS experiment launched the Data Carousel project in 2018. Data Carousel is a tape-driven workflow whereby bulk production campaigns with input data resident on tape are executed by staging and promptly processing a sliding window of that data on a disk buffer, such that only a small fraction of the inputs is pinned on disk at any one time. Data Carousel is now in production for ATLAS in Run 3. In this paper, we provide updates on recent Data Carousel R&D projects, including data-on-demand and tape smart writing. Data-on-demand removes from disk data that has not been accessed for a predefined period; when users request such data, it is either staged back from tape or recreated by following the original production steps. Tape smart writing employs intelligent algorithms for file placement on tape in order to retrieve data more efficiently, and is our long-term strategy for achieving optimal tape usage in Data Carousel.
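A minimal sketch of the sliding-window idea, assuming hypothetical stage_from_tape, process, and release_from_buffer helpers; the production implementation lives inside the ATLAS workflow management system and is asynchronous rather than sequential.

    # Sliding-window staging: at most `window_size` inputs are pinned on
    # the disk buffer at any one time. stage_from_tape(), process() and
    # release_from_buffer() are hypothetical placeholders.
    from collections import deque

    def data_carousel(input_files, window_size=100):
        window = deque()
        for f in input_files:
            if len(window) >= window_size:
                release_from_buffer(window.popleft())  # free buffer space
            stage_from_tape(f)    # tape -> disk buffer
            window.append(f)
            process(f)            # prompt processing while staged
        while window:
            release_from_buffer(window.popleft())      # drain the last window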
Methods of Data Popularity Evaluation in the ATLAS Experiment at the LHC
The ATLAS Experiment at the LHC generates petabytes of data that are distributed among 160 computing sites all over the world and processed continuously by various central production and user analysis tasks. The popularity of data is typically measured as the number of accesses and plays an important role in resolving data management issues: deleting, replicating, and moving data between tapes, disks, and caches. These data management procedures have so far been carried out in a semi-manual mode, and we have now focused our efforts on automating them, making use of historical knowledge about existing data management strategies. In this study we describe the sources of information about data popularity and demonstrate their consistency. Based on the calculated popularity measurements, various distributions were obtained. Auxiliary information about replication and task processing allowed us to evaluate the correspondence between the number of tasks with popular data executed per site and the number of replicas per site. We also examine the popularity of user analysis data, which is much less predictable than that of central production and requires more indicators than just the number of accesses.
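As a concrete illustration of the basic measure, a minimal sketch that counts accesses per dataset over a fixed window; the access-record layout (dataset name plus timestamp) is an assumption for the example.

    # Sketch: popularity as the number of accesses per dataset within a
    # time window. The record layout {"dataset", "timestamp"} is assumed.
    from collections import Counter
    from datetime import datetime, timedelta, timezone

    def popularity(access_log, window_days=90):
        cutoff = datetime.now(timezone.utc) - timedelta(days=window_days)
        return Counter(
            rec["dataset"]
            for rec in access_log
            if rec["timestamp"] >= cutoff
        )

    # most_common(10) -> replication candidates; datasets absent from the
    # counter -> candidates for deletion or for migration to tape.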
Extending Rucio with modern cloud storage support
Rucio is a software framework designed to facilitate scientific collaborations in efficiently organising, managing, and accessing extensive volumes of data through customizable policies. The framework enables data distribution across globally distributed locations and heterogeneous data centres, integrating various storage and network technologies into a unified federated entity. Rucio offers advanced features like distributed data recovery and adaptive replication, and it exhibits high scalability, modularity, and extensibility.
Originally developed to meet the requirements of the high-energy physics experiment ATLAS, Rucio has been continuously expanded to support LHC experiments and diverse scientific communities. Recent R&D projects within these communities have evaluated the integration of both private and commercially-provided cloud storage systems, leading to the development of additional functionalities for seamless integration within Rucio. Furthermore, the underlying systems, FTS and GFAL/Davix, have been extended to cater to specific use cases.
This contribution focuses on the technical aspects of this work, particularly the challenges encountered in building a generic interface for self-hosted cloud storage, such as MinIO or the CEPH S3 Gateway, and established providers like Google Cloud Storage and Amazon Simple Storage Service. Additionally, the integration of decentralised clouds like SEAL is explored. Key aspects, including authentication and authorisation, direct and remote access, and throughput and cost estimation, are highlighted, along with shared experiences from daily operations.
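One recurring building block in such integrations is time-limited signed-URL access; below is a minimal boto3 sketch against an S3-compatible endpoint (endpoint, credentials, bucket, and object key are placeholders), which applies equally to MinIO, the CEPH S3 Gateway, or Amazon S3.

    # Sketch: presigned-URL read access on any S3-compatible store.
    # Endpoint, credentials, bucket and object key are placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.example.org",  # self-hosted or commercial
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    # The URL grants read access for one hour without exposing the
    # long-term credentials to the client that uses it.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "atlas-data", "Key": "scope/file.root"},
        ExpiresIn=3600,
    )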
The ATLAS experiment software on ARM
With the increased dataset obtained during Run 3 of the LHC at CERN, and an expected further increase of more than one order of magnitude for the HL-LHC, the ATLAS experiment is reaching the limits of the current data processing model in terms of traditional CPU resources based on the x86_64 architecture, and an extensive programme of software upgrades towards the HL-LHC has been set up. The ARM architecture is becoming a competitive and energy-efficient alternative: surveys indicate its increased presence in HPCs and commercial clouds, some WLCG sites have expressed their interest, and chip makers are developing their next-generation solutions on ARM architectures, sometimes combining ARM and GPU processors in the same chip. Consequently, it is important that the ATLAS software embraces the change and is able to successfully exploit this architecture. We report on the successful porting to ARM of the Athena software framework, which is used by ATLAS for both online and offline computing operations, and on the successful validation of simulation workflows running on ARM resources. For this we have set up an ATLAS Grid site using ARM-compatible middleware and containers on Amazon Web Services (AWS) ARM resources. The ARM version of Athena is fully integrated in the regular software build system and distributed in the same way as other software releases. In addition, the workflows have been integrated into the HEPscore benchmark suite, the planned WLCG-wide replacement of the HepSpec06 benchmark used for Grid site pledges. In the overall porting process we have used resources on AWS, Google Cloud Platform (GCP) and CERN. A performance comparison of the different architectures and resources will be discussed.
Evolution of the open-source data management system Rucio for LHC Run-3 and beyond
Rucio, the distributed data management system of the ATLAS experiment, already manages more than 400 petabytes of physics data on the grid. Rucio was incrementally improved throughout LHC Run-2 and is currently being prepared for the HL-LHC era of the experiment. Alongside these improvements, the system is evolving into a full-scale generic data management system for applications beyond ATLAS, or even beyond high-energy physics. This contribution focuses on the development roadmap of Rucio for LHC Run-3, covering topics such as event-level data management, generic metadata support, and increased usage of networks and tapes. At the same time, Rucio is evolving beyond the original ATLAS requirements. This includes additional authentication mechanisms, generic database compatibility, deployment and packaging of the software stack in containers, and a paradigm shift to a full-scale open-source project.
Rucio - Scientific data management
Rucio is an open-source software framework that provides scientific collaborations with the functionality to organize, manage, and access their data at scale. The data can be distributed across heterogeneous data centers at widely distributed locations. Rucio was originally developed to meet the requirements of the high-energy physics experiment ATLAS, and is now continuously extended to support the LHC experiments and other diverse scientific communities. In this article, we detail the fundamental concepts of Rucio, describe the architecture along with implementation details, and give operational experience from production usage.
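To make the fundamental concepts tangible, a brief sketch of the Rucio Python client follows; the scope, dataset name, and RSE expression are placeholder values, and a configured Rucio server with an authenticated account is assumed.

    # Sketch: registering a dataset and requesting replication via a rule.
    # Scope, names and RSE expression are placeholders; a configured
    # Rucio server and account are assumed.
    from rucio.client import Client

    client = Client()

    # Data identifiers (DIDs) are scope:name pairs; files attach to datasets.
    client.add_dataset(scope="user.jdoe", name="user.jdoe.my.dataset")
    client.attach_dids(
        scope="user.jdoe", name="user.jdoe.my.dataset",
        dids=[{"scope": "user.jdoe", "name": "file.root"}],
    )

    # Declarative replication: ask for two copies on Tier-1 storage and
    # let the Rucio daemons schedule the necessary transfers.
    client.add_replication_rule(
        dids=[{"scope": "user.jdoe", "name": "user.jdoe.my.dataset"}],
        copies=2,
        rse_expression="tier=1",
    )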
Accelerating science: The usage of commercial clouds in ATLAS Distributed Computing
The ATLAS experiment at CERN is one of the largest scientific machines built to date and will have ever-growing computing needs as the Large Hadron Collider collects an increasingly large volume of data over the next 20 years. ATLAS is conducting R&D projects on Amazon Web Services and Google Cloud as complementary resources for distributed computing, focusing on some of the key features of commercial clouds: lightweight operation, elasticity, and availability of multiple chip architectures.
The proof of concept phases have concluded with the cloud-native, vendor-agnostic integration with the experiment's data and workload management frameworks. Google Cloud has been used to evaluate elastic batch computing, ramping up ephemeral clusters of up to O(100k) cores to process tasks requiring quick turnaround. Amazon Web Services has been exploited for the successful physics validation of the Athena simulation software on ARM processors.
We have also set up an interactive facility for physics analysis, allowing end-users to spin up private, on-demand clusters for parallel computing with up to 4,000 cores, or to run GPU-enabled notebooks and jobs for machine learning applications.
The success of the proof of concept phases has led to the extension of the Google Cloud project, where ATLAS will study the total cost of ownership of a production cloud site over 15 months with 10k cores on average, fully integrated with distributed grid computing resources, and will continue the R&D projects.
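For a sense of scale, a back-of-envelope estimate of the compute volume covered by that TCO study, using only the figures quoted above:

    # Rough scale of the 15-month, 10k-core TCO study quoted above.
    cores = 10_000               # average core count
    hours = 15 * 30 * 24         # ~15 months ~= 10,800 hours
    print(f"{cores * hours:.1e} core-hours")  # ~1.1e+08 core-hours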