RNTuple performance: Status and Outlook
Upcoming HEP experiments, e.g. at the HL-LHC, are expected to increase the
volume of generated data by at least one order of magnitude. In order to retain
the ability to analyze the influx of data, full exploitation of modern storage
hardware and systems, such as low-latency high-bandwidth NVMe devices and
distributed object stores, becomes critical. To this end, the ROOT RNTuple I/O
subsystem has been designed to address performance bottlenecks and shortcomings
of ROOT's current state-of-the-art TTree I/O subsystem. RNTuple provides a
backwards-incompatible redesign of the TTree binary format and access API that
evolves the ROOT event data I/O for the challenges of the upcoming decades. It
focuses on a compact data format, on performance engineering for modern storage
hardware, for instance through making parallel and asynchronous I/O calls by
default, and on robust interfaces that are easy to use correctly. In this
contribution, we evaluate the RNTuple performance for typical HEP analysis
tasks. We compare the throughput delivered by RNTuple to popular I/O libraries
outside HEP, such as HDF5 and Apache Parquet. We demonstrate the advantages of
RNTuple for HEP analysis workflows and provide an outlook on the road to its
use in production.

Comment: 5 pages, 5 figures; submitted to the proceedings of the 20th
International Workshop on Advanced Computing and Analysis Techniques in
Physics Research (ACAT 2021).
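The compactness argument for columnar formats such as RNTuple can be illustrated with a small self-contained sketch (plain Python, not the ROOT API; the event fields are invented for illustration). Storing each field contiguously places similar values next to each other, which typically compresses better than interleaved row-wise records and lets an analysis read only the columns it needs:

```python
import struct
import zlib

# Hypothetical events with two fields; names are illustrative only.
events = [{"pt": 10.5 * i, "charge": i % 2} for i in range(1000)]

# Row-wise layout: all fields of one event stored together.
row_wise = b"".join(struct.pack("<di", e["pt"], e["charge"]) for e in events)

# Columnar layout (the RNTuple/Parquet approach): each field stored
# contiguously, so runs of similar values compress well and a single
# column can be scanned without touching the others.
col_pt = b"".join(struct.pack("<d", e["pt"]) for e in events)
col_charge = b"".join(struct.pack("<i", e["charge"]) for e in events)
columnar = col_pt + col_charge

# Both layouts hold the same bytes overall, but compress differently.
print(len(zlib.compress(row_wise)), len(zlib.compress(columnar)))
```

This is only a layout demonstration; the actual RNTuple format adds type-aware column encodings, clustering, and checksumming on top of the columnar idea.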
The Need for a Versioned Data Analysis Software Environment
Scientific results in high-energy physics and in many other fields often rely
on complex software stacks. In order to support reproducibility and scrutiny of
the results, it is good practice to use open source software and to cite
software packages and versions. With ever-growing complexity of scientific
software on one side and with IT life-cycles of only a few years on the other
side, however, it turns out that despite source code availability the setup and
the validation of a minimal usable analysis environment can easily become
prohibitively expensive. We argue that there is a substantial gap between
merely having access to versioned source code and the ability to create a data
analysis runtime environment. In order to preserve all the different variants
of the data analysis runtime environment, we developed a snapshotting file
system optimized for software distribution. We report on our experience in
preserving analysis environments in high-energy physics, such as the software
landscape used to discover the Higgs boson at the Large Hadron Collider.
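The deduplication that makes such a snapshotting file system affordable rests on content-addressable storage, a technique also described in the thesis abstract further below: objects are stored under the hash of their content, so files shared between environment snapshots are stored only once. A minimal sketch (plain Python, in the spirit of but not the actual CernVM-FS implementation; class and path names are invented):

```python
import hashlib

class ContentAddressableStore:
    """Toy content-addressable store: objects are keyed by the hash of
    their content, so identical files across snapshots deduplicate."""

    def __init__(self):
        self._objects = {}  # digest -> content

    def put(self, content: bytes) -> str:
        digest = hashlib.sha1(content).hexdigest()
        self._objects[digest] = content  # idempotent for equal content
        return digest

    def get(self, digest: str) -> bytes:
        return self._objects[digest]

store = ContentAddressableStore()
lib = b"shared library bytes"

# Two environment snapshots referencing the same file: the catalog
# entries differ per snapshot, but the content is stored only once.
snapshot_v1 = {"lib/libCore.so": store.put(lib)}
snapshot_v2 = {"lib/libCore.so": store.put(lib)}
```

A real system additionally versions the catalogs themselves, so that any historical snapshot of the runtime environment can be mounted on demand.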
The GrGen.NET User Manual (www.grgen.net)
GrGen.NET is a graph rewrite tool enabling elegant and
convenient development of graph rewriting applications with
comparable performance to conventionally developed ones.
GrGen.NET uses attributed, typed, and directed multigraphs with
multiple inheritance on node and edge types. Extensive graphical
debugging integrated into an interactive shell complements the
feature highlights of GrGen.NET. This user manual contains both
normative statements, in the sense of a reference manual, and an
informal guide to the features and usage of GrGen.NET.
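The core operation of a graph rewrite tool — match a subgraph pattern, then replace it — can be sketched in a few lines (plain Python over edge triples; the rule and node names are invented, and GrGen.NET's own rules are written declaratively in its rule language rather than as callbacks):

```python
# A directed, labeled multigraph as a set of (source, label, target) triples.

def apply_rule(graph, pattern_label, rewriter):
    """Single-step rewrite: find one edge matching the pattern label and
    replace it with the edges produced by the rewriter callback."""
    match = next((e for e in graph if e[1] == pattern_label), None)
    if match is None:
        return graph                       # rule does not apply
    src, _, dst = match
    return (graph - {match}) | set(rewriter(src, dst))

# Invented rule: an "owns" edge is rewritten into an inverse "ownedBy" edge.
g = {("alice", "owns", "car1"), ("bob", "likes", "car1")}
g = apply_rule(g, "owns", lambda s, d: [(d, "ownedBy", s)])
```

Real tools like GrGen.NET add typed nodes and edges with inheritance, efficient pattern-matching plans, and rule application strategies on top of this basic match-and-replace step.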
A caching mechanism to exploit object store speed in High Energy Physics analysis
Data analysis workflows in High Energy Physics (HEP) read data written in the ROOT columnar format. Such data has traditionally been stored in files that are often read via the network from remote storage facilities, which represents a performance penalty especially for data processing workflows that are I/O bound. To address that issue, this paper presents a new caching mechanism, implemented in the I/O subsystem of ROOT, which is independent of the storage backend used to write the dataset. Notably, it can be used to leverage the speed of high-bandwidth, low-latency object stores. The performance of this caching approach is evaluated by running a real physics analysis on an Intel DAOS cluster, both on a single node and distributed on multiple nodes.

This work benefited from the support of the CERN Strategic R&D Programme on Technologies for Future Experiments [1] and from grant PID2020-113656RB-C22 funded by Ministerio de Ciencia e Innovacion MCIN/AEI/10.13039/501100011033. The hardware used to perform the experimental evaluation involving DAOS (HPE Delphi cluster described in Sect. 5.2) was made available thanks to a collaboration agreement with Hewlett-Packard Enterprise (HPE) and Intel. User access to the Virgo cluster at the GSI institute was given for the purpose of running the benchmarks using the Lustre filesystem.

Padulano, V. E.; Tejedor Saavedra, E.; Alonso-Jordá, P.; López Gómez, J.; Blomer, J. (2022). A caching mechanism to exploit object store speed in High Energy Physics analysis. Cluster Computing, 1-16. https://doi.org/10.1007/s10586-022-03757-211
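The key property of such a cache — it sits between the reader and an arbitrary storage backend, so repeated reads of hot data avoid the backend's latency — can be sketched with a small backend-independent LRU cache (plain Python, not the actual ROOT implementation; class and key names are invented):

```python
from collections import OrderedDict

class CachingReader:
    """Toy backend-independent read cache: data fetched from any backend
    (file, network, object store) is kept in an LRU cache, so repeated
    reads of hot pages never hit the backend again."""

    def __init__(self, backend_read, capacity=4):
        self._backend_read = backend_read    # callable: key -> bytes
        self._cache = OrderedDict()
        self._capacity = capacity
        self.backend_calls = 0               # for observing cache behavior

    def read(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)     # mark as most recently used
            return self._cache[key]
        self.backend_calls += 1
        data = self._backend_read(key)
        self._cache[key] = data
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return data

# The backend could be a POSIX file, a remote protocol, or a DAOS object
# store; here a plain dict stands in for it.
backend = {"page0": b"aaaa", "page1": b"bbbb"}
reader = CachingReader(backend.__getitem__)
reader.read("page0")
reader.read("page0")   # second read is served from the cache
```

Because the cache keys are independent of how the bytes are stored, the same mechanism works unchanged on top of a fast object store, which is the scenario evaluated in the paper.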
PROOF as a Service on the Cloud: a Virtual Analysis Facility based on the CernVM ecosystem
PROOF, the Parallel ROOT Facility, is a ROOT-based framework which enables
interactive parallelism for event-based tasks on a cluster of computing nodes.
Although PROOF can be used simply from within a ROOT session with no additional
requirements, deploying and configuring a PROOF cluster used to be far less
straightforward. Recently, great effort has been spent on enabling the
provisioning of generic PROOF analysis facilities with zero configuration,
with the added advantage of improving both stability and scalability, making
the deployment operations feasible even for the end user. Since a
growing amount of large-scale computing resources are nowadays made available
by Cloud providers in a virtualized form, we have developed the Virtual
PROOF-based Analysis Facility: a cluster appliance combining the solid CernVM
ecosystem and PoD (PROOF on Demand), ready to be deployed on the Cloud and
leveraging some peculiar Cloud features such as elasticity. We will show how
this approach is effective both for sysadmins, who will have little or no
configuration to do to run it on their Clouds, and for the end users, who are
ultimately in full control of their PROOF cluster and can even easily restart
it by themselves in the unfortunate event of a major failure. We will also show
how elasticity leads to a more efficient and uniform usage of Cloud resources.

Comment: Talk from Computing in High Energy and Nuclear Physics 2013
(CHEP2013), Amsterdam (NL), October 2013; 7 pages, 4 figures.
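PROOF's processing model for event-based tasks — split the event range into packets, process the packets in parallel, and merge the partial results — can be sketched as follows (plain Python threads stand in for PROOF workers; the analysis, a simple histogram, and its binning are invented):

```python
from concurrent.futures import ThreadPoolExecutor

def process_range(events):
    """Worker task: run the analysis on one slice of the event list and
    return a partial result (here, a histogram as a dict)."""
    hist = {}
    for e in events:
        bin_index = e % 4
        hist[bin_index] = hist.get(bin_index, 0) + 1
    return hist

def merge(partials):
    """Merge step: combine per-worker partial results into one histogram,
    analogous to PROOF's merging of worker outputs."""
    total = {}
    for partial in partials:
        for bin_index, count in partial.items():
            total[bin_index] = total.get(bin_index, 0) + count
    return total

events = list(range(100))
chunks = [events[i::4] for i in range(4)]   # static split among 4 workers
with ThreadPoolExecutor(max_workers=4) as pool:
    result = merge(pool.map(process_range, chunks))
```

Because each packet is processed independently, adding or removing workers only changes how the event range is partitioned — which is exactly why an elastic, Cloud-provisioned cluster fits this workload.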
Software Challenges For HL-LHC Data Analysis
The high energy physics community is discussing where investment is needed to
prepare software for the HL-LHC and its unprecedented challenges. The ROOT
project is one of the central software players in high energy physics since
decades. From its experience and expectations, the ROOT team has distilled a
comprehensive set of areas that should see research and development in the
context of data analysis software, for making best use of HL-LHC's physics
potential. This work shows what these areas could be, why the ROOT team
believes investing in them is needed, which gains are expected, and where
related work is ongoing. It can serve as an indication for future research
proposals and collaborations.
ROOT for the HL-LHC: data format
This document discusses the state, roadmap, and risks of the foundational
components of ROOT with respect to the experiments at the HL-LHC (Run 4 and
beyond). As foundational components, the document considers in particular the
ROOT input/output (I/O) subsystem. The current HEP I/O is based on the TFile
container file format and the TTree binary event data format. The work going
into the new RNTuple event data format aims at superseding TTree, to make
RNTuple the production ROOT event data I/O that meets the requirements of Run 4
and beyond
A Roadmap for HEP Software and Computing R&D for the 2020s
Particle physics has an ambitious and broad experimental programme for the coming decades. This programme requires large investments in detector hardware, either to build new facilities and experiments, or to upgrade existing ones. Similarly, it requires commensurate investment in the R&D of software to acquire, manage, process, and analyse the sheer amounts of data to be recorded. In planning for the HL-LHC in particular, it is critical that all of the collaborating stakeholders agree on the software goals and priorities, and that the efforts complement each other. In this spirit, this white paper describes the R&D activities required to prepare for this software upgrade.

Peer reviewed.
Decentralized Data Storage and Processing in the Context of the LHC Experiments at CERN
The computing facilities used to process data for the experiments at the Large Hadron Collider (LHC) at CERN are scattered around the world. The embarrassingly parallel workload allows for use of various computing resources, such as computer centers comprising the Worldwide LHC Computing Grid, commercial and institutional cloud resources, as well as individual home PCs in “volunteer clouds”. Unlike data, the experiment software and its operating system dependencies cannot be easily split into small chunks. Deployment of experiment software on distributed grid sites is challenging since it consists of millions of small files and changes frequently.

This thesis develops a systematic approach to distribute a homogeneous runtime environment to a heterogeneous and geographically distributed computing infrastructure. A uniform bootstrap environment is provided by a minimal virtual machine tailored to LHC applications. Based on a study of the characteristics of LHC experiment software, the thesis argues for the use of content-addressable storage and decentralized caching in order to distribute the experiment software. In order to utilize the technology at the required scale, new methods of pre-processing data into content-addressable storage are developed. A co-operative, decentralized memory cache is designed that is optimized for the high peer churn expected in future virtualized computing clusters. This is achieved using a combination of consistent hashing with global knowledge about the worker nodes’ state.

The methods have been implemented in the form of a file system for software and Conditions Data delivery. The file system has been widely adopted by the LHC community and the benefits of the presented methods have been demonstrated in practice.
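The consistent-hashing technique mentioned in the abstract can be sketched with a minimal hash ring (plain Python; node names, the cached path, and the replica count are invented, and the actual design additionally combines the ring with global knowledge of worker-node state):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring: a key is owned by the next node clockwise
    on the ring, so when nodes join or leave (peer churn) only a small
    fraction of keys move to a different owner."""

    def __init__(self, nodes, replicas=64):
        self._ring = []  # sorted list of (point, node)
        for node in nodes:
            # Virtual nodes smooth out the key distribution per node.
            for i in range(replicas):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        point = self._hash(key)
        # First ring entry clockwise from the key's hash point.
        idx = bisect.bisect(self._ring, (point,)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["worker1", "worker2", "worker3"])
owner = ring.node_for("/software/lib/libCore.so")
```

Every peer computes the same owner for a given key without coordination, which is what makes the cache co-operative and decentralized.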