24 research outputs found

    A caching mechanism to exploit object store speed in High Energy Physics analysis

    Full text link
    [EN] Data analysis workflows in High Energy Physics (HEP) read data written in the ROOT columnar format. Such data has traditionally been stored in files that are often read via the network from remote storage facilities, which represents a performance penalty especially for data processing workflows that are I/O bound. To address that issue, this paper presents a new caching mechanism, implemented in the I/O subsystem of ROOT, which is independent of the storage backend used to write the dataset. Notably, it can be used to leverage the speed of high-bandwidth, low-latency object stores. The performance of this caching approach is evaluated by running a real physics analysis on an Intel DAOS cluster, both on a single node and distributed on multiple nodes.This work benefited from the support of the CERN Strategic R&D Programme on Technologies for Future Experiments [1] and from grant PID2020-113656RB-C22 funded by Ministerio de Ciencia e Innovacion MCIN/AEI/10.13039/501100011033. The hardware used to perform the experimental evaluation involving DAOS (HPE Delphi cluster described in Sect. 5.2) was made available thanks to a collaboration agreement with Hewlett-Packard Enterprise (HPE) and Intel. User access to the Virgo cluster at the GSI institute was given for the purpose of running the benchmarks using the Lustre filesystem.Padulano, VE.; Tejedor Saavedra, E.; Alonso-Jordá, P.; López Gómez, J.; Blomer, J. (2022). A caching mechanism to exploit object store speed in High Energy Physics analysis. Cluster Computing. 1-16. https://doi.org/10.1007/s10586-022-03757-211

    Prototyping a ROOT-based distributed analysis workflow for HL-LHC: the CMS use case

    Full text link
    The challenges expected for the next era of the Large Hadron Collider (LHC), both in terms of storage and computing resources, provide LHC experiments with a strong motivation for evaluating ways of rethinking their computing models at many levels. Great efforts have been put into optimizing the computing resource utilization for the data analysis, which leads both to lower hardware requirements and faster turnaround for physics analyses. In this scenario, the Compact Muon Solenoid (CMS) collaboration is involved in several activities aimed at benchmarking different solutions for running High Energy Physics (HEP) analysis workflows. A promising solution is evolving software towards more user-friendly approaches featuring a declarative programming model and interactive workflows. The computing infrastructure should keep up with this trend by offering on the one side modern interfaces, and on the other side hiding the complexity of the underlying environment, while efficiently leveraging the already deployed grid infrastructure and scaling toward opportunistic resources like public cloud or HPC centers. This article presents the first example of using the ROOT RDataFrame technology to exploit such next-generation approaches for a production-grade CMS physics analysis. A new analysis facility is created to offer users a modern interactive web interface based on JupyterLab that can leverage HTCondor-based grid resources on different geographical sites. The physics analysis is converted from a legacy iterative approach to the modern declarative approach offered by RDataFrame and distributed over multiple computing nodes. The new scenario offers not only an overall improved programming experience, but also an order of magnitude speedup increase with respect to the previous approach

    First implementation and results of the Analysis Grand Challenge with a fully Pythonic RDataFrame

    Get PDF
    The growing amount of data generated by the LHC requires a shift in how HEP analysis tasks are approached. Efforts to address this computational challenge have led to the rise of a middle-man software layer, a mixture of simple, effective APIs and fast execution engines underneath. Having common, open and reproducible analysis benchmarks proves beneficial in the development of these modern tools. One such benchmark is provided by the Analysis Grand Challenge (AGC), which represents a specification for realistic analysis pipelines. This contribution presents the first AGC implementation that leverages ROOT RDataFrame, a powerful, modern and scalable execution engine for the HENP use cases. The different steps of the benchmarks are written with a composable, flexible and fully Pythonic API. RDataFrame can then transparently run the computations on all the cores of a machine or on multiple nodes thanks to automatic dataset splitting and transparent workload distribution. The portability of this implementation is shown by running on various resources, from managed facilities to open cloud platforms for research, showing usage of interactive and distributed environments

    ROOT’s RNTuple I/O Subsystem: The Path to Production

    Get PDF
    The RNTuple I/O subsystem is ROOT’s future event data file format and access API. It is driven by the expected data volume increase at upcoming HEP experiments, e.g. at the HL-LHC, and recent opportunities in the storage hardware and software landscape such as NVMe drives and distributed object stores. RNTuple is a redesign of the TTree binary format and API and has shown to deliver substantially faster data throughput and better data compression both compared to TTree and to industry standard formats. In order to let HENP computing workflows benefit from RNTuple’s superior performance, however, the I/O stack needs to connect efficiently to the rest of the ecosystem, from grid storage to (distributed) analysis frameworks to (multithreaded) experiment frameworks for reconstruction and ntuple derivation. With the RNTuple binary format soon arriving at its first production release, we present RNTuple’s feature set, integration efforts, and its performance impact on the time-to-solution. We show the latest performance figures of RDataFrame analysis code of realistic complexity, comparing RNTuple and TTree as data sources. We discuss RNTuple’s approach to functionality critical to the HENP I/O (such as multithreaded writes, fast data merging, schema evolution) and we provide an outlook on the road to its use in production

    I/O performance studies of analysis workloads on production and dedicated resources at CERN

    Get PDF
    The recent evolutions of the analysis frameworks and physics data formats of the LHC experiments provide the opportunity of using central analysis facilities with a strong focus on interactivity and short turnaround times, to complement the more common distributed analysis on the Grid. In order to plan for such facilities, it is essential to know in detail the performance of the combination of a given analysis framework, of a specific analysis and of the installed computing and storage resources. This contribution describes performance studies performed at CERN, using the EOS disk-based storage, either directly or through an XCache instance, from both batch resources and highperformance compute nodes which could be used to build an analysis facility. A variety of benchmarks, both synthetic and based on real-world physics analyses and their corresponding input datasets, are utilized. In particular, the RNTuple format from the ROOT project is put to the test and compared to the latest version of the TTree format, and the impact of caches is assessed. In addition, we assessed the difference in performance between the use of storage system specific protocols, like XRootd, and FUSE. The results of this study are intended to be a valuable input in the design of analysis facilities, at CERN and elsewhere

    Software Challenges For HL-LHC Data Analysis

    Full text link
    The high energy physics community is discussing where investment is needed to prepare software for the HL-LHC and its unprecedented challenges. The ROOT project is one of the central software players in high energy physics since decades. From its experience and expectations, the ROOT team has distilled a comprehensive set of areas that should see research and development in the context of data analysis software, for making best use of HL-LHC's physics potential. This work shows what these areas could be, why the ROOT team believes investing in them is needed, which gains are expected, and where related work is ongoing. It can serve as an indication for future research proposals and cooperations

    ROOT for the HL-LHC: data format

    Full text link
    This document discusses the state, roadmap, and risks of the foundational components of ROOT with respect to the experiments at the HL-LHC (Run 4 and beyond). As foundational components, the document considers in particular the ROOT input/output (I/O) subsystem. The current HEP I/O is based on the TFile container file format and the TTree binary event data format. The work going into the new RNTuple event data format aims at superseding TTree, to make RNTuple the production ROOT event data I/O that meets the requirements of Run 4 and beyond

    Blurring High Energy Physics Data Analysis Techniques and Data Science Approaches

    No full text
    Scientific research has always been intertwined to a certain degree with Computing. Even more so over the last few years, during which the needs for resources in terms of storage and processing power have increased exponentially. This holds true for many different joint collaborations in fields such as biology, medicine, earth sciences, physics and astrophysics, among which CERN definitely represents a notable example. Being the largest centre for research in the High Energy Physics (HEP) field, it has always kept pushing for new discoveries in its executive program. The collaborative efforts of thousands of scientists worldwide have led to important results, most notably in recent years the discovery of the Higgs boson, officially announced in 2012 by the researchers at CMS and ATLAS, the two main experiments taking data at the LHC collider. This strenuous work demands the most advanced technological instruments to recreate the physics events and at the same time hardware and software that keep up with the computing needs. But while HEP has been historically at the forefront in developing solutions to cope up with these requirements, in the recent years other fields and industries have experienced steady advances, helped by an unprecedented abundance of data. The research field born to exploit data, namely Data Science, has brought to the table new computing techniques that may well fit the needs of HEP. In this thesis, a programming model commonly used in Data Science, namely MapReduce, will be exploited to work with the most prominent software for HEP analysis, ROOT. The first will be used in the implementation available under the Apache Spark framework to allow for distributing computations over a remote cluster, while the latter will provide the interface to common HEP data formats and analysis models through one of its latest additions, namely RDataFrame. PyRDF, a purposely developed package, will glue all the components together and will be used to showcase how this new model can affect the workflow of a physics analysis

    Distributed Computing Solutions for High Energy Physics Interactive Data Analysis

    Full text link
    [ES] La investigación científica en Física de Altas Energías (HEP) se caracteriza por desafíos computacionales complejos, que durante décadas tuvieron que ser abordados mediante la investigación de técnicas informáticas en paralelo a los avances en la comprensión de la física. Uno de los principales actores en el campo, el CERN, alberga tanto el Gran Colisionador de Hadrones (LHC) como miles de investigadores cada año que se dedican a recopilar y procesar las enormes cantidades de datos generados por el acelerador de partículas. Históricamente, esto ha proporcionado un terreno fértil para las técnicas de computación distribuida, conduciendo a la creación de Worldwide LHC Computing Grid (WLCG), una red global de gran potencia informática para todos los experimentos LHC y del campo HEP. Los datos generados por el LHC hasta ahora ya han planteado desafíos para la informática y el almacenamiento. Esto solo aumentará con futuras actualizaciones de hardware del acelerador, un escenario que requerirá grandes cantidades de recursos coordinados para ejecutar los análisis HEP. La estrategia principal para cálculos tan complejos es, hasta el día de hoy, enviar solicitudes a sistemas de colas por lotes conectados a la red. Esto tiene dos grandes desventajas para el usuario: falta de interactividad y tiempos de espera desconocidos. En años más recientes, otros campos de la investigación y la industria han desarrollado nuevas técnicas para abordar la tarea de analizar las cantidades cada vez mayores de datos generados por humanos (una tendencia comúnmente mencionada como "Big Data"). Por lo tanto, han surgido nuevas interfaces y modelos de programación que muestran la interactividad como una característica clave y permiten el uso de grandes recursos informáticos. A la luz del escenario descrito anteriormente, esta tesis tiene como objetivo aprovechar las herramientas y arquitecturas de la industria de vanguardia para acelerar los flujos de trabajo de análisis en HEP, y proporcionar una interfaz de programación que permite la paralelización automática, tanto en una sola máquina como en un conjunto de recursos distribuidos. Se centra en los modelos de programación modernos y en cómo hacer el mejor uso de los recursos de hardware disponibles al tiempo que proporciona una experiencia de usuario perfecta. La tesis también propone una solución informática distribuida moderna para el análisis de datos HEP, haciendo uso del software llamado ROOT y, en particular, de su capa de análisis de datos llamada RDataFrame. Se exploran algunas áreas clave de investigación en torno a esta propuesta. Desde el punto de vista del usuario, esto se detalla en forma de una nueva interfaz que puede ejecutarse en una computadora portátil o en miles de nodos informáticos, sin cambios en la aplicación del usuario. Este desarrollo abre la puerta a la explotación de recursos distribuidos a través de motores de ejecución estándar de la industria que pueden escalar a múltiples nodos en clústeres HPC o HTC, o incluso en ofertas serverless de nubes comerciales. Dado que el análisis de datos en este campo a menudo está limitado por E/S, se necesita comprender cuáles son los posibles mecanismos de almacenamiento en caché. En este sentido, se investigó un sistema de almacenamiento novedoso basado en la tecnología de almacenamiento de objetos como objetivo para el caché. En conclusión, el futuro del análisis de datos en HEP presenta desafíos desde varias perspectivas, desde la explotación de recursos informáticos y de almacenamiento distribuidos hasta el diseño de interfaces de usuario ergonómicas. Los marcos de software deben apuntar a la eficiencia y la facilidad de uso, desvinculando la definición de los cálculos físicos de los detalles de implementación de su ejecución. Esta tesis se enmarca en el esfuerzo colectivo de la comunidad HEP hacia estos objetivos, definiendo problemas y posibles soluciones que pueden ser adoptadas por futuros investigadores.[CA] La investigació científica a Física d'Altes Energies (HEP) es caracteritza per desafiaments computacionals complexos, que durant dècades van haver de ser abordats mitjançant la investigació de tècniques informàtiques en paral·lel als avenços en la comprensió de la física. Un dels principals actors al camp, el CERN, acull tant el Gran Col·lisionador d'Hadrons (LHC) com milers d'investigadors cada any que es dediquen a recopilar i processar les enormes quantitats de dades generades per l'accelerador de partícules. Històricament, això ha proporcionat un terreny fèrtil per a les tècniques de computació distribuïda, conduint a la creació del Worldwide LHC Computing Grid (WLCG), una xarxa global de gran potència informàtica per a tots els experiments LHC i del camp HEP. Les dades generades per l'LHC fins ara ja han plantejat desafiaments per a la informàtica i l'emmagatzematge. Això només augmentarà amb futures actualitzacions de maquinari de l'accelerador, un escenari que requerirà grans quantitats de recursos coordinats per executar les anàlisis HEP. L'estratègia principal per a càlculs tan complexos és, fins avui, enviar sol·licituds a sistemes de cues per lots connectats a la xarxa. Això té dos grans desavantatges per a l'usuari: manca d'interactivitat i temps de espera desconeguts. En anys més recents, altres camps de la recerca i la indústria han desenvolupat noves tècniques per abordar la tasca d'analitzar les quantitats cada vegada més grans de dades generades per humans (una tendència comunament esmentada com a "Big Data"). Per tant, han sorgit noves interfícies i models de programació que mostren la interactivitat com a característica clau i permeten l'ús de grans recursos informàtics. A la llum de l'escenari descrit anteriorment, aquesta tesi té com a objectiu aprofitar les eines i les arquitectures de la indústria d'avantguarda per accelerar els fluxos de treball d'anàlisi a HEP, i proporcionar una interfície de programació que permet la paral·lelització automàtica, tant en una sola màquina com en un conjunt de recursos distribuïts. Se centra en els models de programació moderns i com fer el millor ús dels recursos de maquinari disponibles alhora que proporciona una experiència d'usuari perfecta. La tesi també proposa una solució informàtica distribuïda moderna per a l'anàlisi de dades HEP, fent ús del programari anomenat ROOT i, en particular, de la seva capa d'anàlisi de dades anomenada RDataFrame. S'exploren algunes àrees clau de recerca sobre aquesta proposta. Des del punt de vista de l'usuari, això es detalla en forma duna nova interfície que es pot executar en un ordinador portàtil o en milers de nodes informàtics, sense canvis en l'aplicació de l'usuari. Aquest desenvolupament obre la porta a l'explotació de recursos distribuïts a través de motors d'execució estàndard de la indústria que poden escalar a múltiples nodes en clústers HPC o HTC, o fins i tot en ofertes serverless de núvols comercials. Atès que sovint l'anàlisi de dades en aquest camp està limitada per E/S, cal comprendre quins són els possibles mecanismes d'emmagatzematge en memòria cau. En aquest sentit, es va investigar un nou sistema d'emmagatzematge basat en la tecnologia d'emmagatzematge d'objectes com a objectiu per a la memòria cau. En conclusió, el futur de l'anàlisi de dades a HEP presenta reptes des de diverses perspectives, des de l'explotació de recursos informàtics i d'emmagatzematge distribuïts fins al disseny d'interfícies d'usuari ergonòmiques. Els marcs de programari han d'apuntar a l'eficiència i la facilitat d'ús, desvinculant la definició dels càlculs físics dels detalls d'implementació de la seva execució. Aquesta tesi s'emmarca en l'esforç col·lectiu de la comunitat HEP cap a aquests objectius, definint problemes i possibles solucions que poden ser adoptades per futurs investigadors.[EN] The scientific research in High Energy Physics (HEP) is characterised by complex computational challenges, which over the decades had to be addressed by researching computing techniques in parallel to the advances in understanding physics. One of the main actors in the field, CERN, hosts both the Large Hadron Collider (LHC) and thousands of researchers yearly who are devoted to collecting and processing the huge amounts of data generated by the particle accelerator. This has historically provided a fertile ground for distributed computing techniques, which led to the creation of the Worldwide LHC Computing Grid (WLCG), a global network providing large computing power for all the experiments revolving around the LHC and the HEP field. Data generated by the LHC so far has already posed challenges for computing and storage. This is only going to increase with future hardware updates of the accelerator, which will bring a scenario that will require large amounts of coordinated resources to run the workflows of HEP analyses. The main strategy for such complex computations is, still to this day, submitting applications to batch queueing systems connected to the grid and wait for the final result to arrive. This has two great disadvantages from the user's perspective: no interactivity and unknown waiting times. In more recent years, other fields of research and industry have developed new techniques to address the task of analysing the ever increasing large amounts of human-generated data (a trend commonly mentioned as "Big Data"). Thus, new programming interfaces and models have arised that most often showcase interactivity as one key feature while also allowing the usage of large computational resources. In light of the scenario described above, this thesis aims at leveraging cutting-edge industry tools and architectures to speed up analysis workflows in High Energy Physics, while providing a programming interface that enables automatic parallelisation, both on a single machine and on a set of distributed resources. It focuses on modern programming models and on how to make best use of the available hardware resources while providing a seamless user experience. The thesis also proposes a modern distributed computing solution to the HEP data analysis, making use of the established software framework called ROOT and in particular of its data analysis layer implemented with the RDataFrame class. A few key research areas that revolved around this proposal are explored. From the user's point of view, this is detailed in the form of a new interface to data analysis that is able to run on a laptop or on thousands of computing nodes, with no change in the user application. This development opens the door to exploiting distributed resources via industry standard execution engines that can scale to multiple nodes on HPC or HTC clusters, or even on serverless offerings of commercial clouds. Since data analysis in this field is often I/O bound, a good comprehension of what are the possible caching mechanisms is needed. In this regard, a novel storage system based on object store technology was researched as a target for caching. In conclusion, the future of data analysis in High Energy Physics presents challenges from various perspectives, from the exploitation of distributed computing and storage resources to the design of ergonomic user interfaces. Software frameworks should aim at efficiency and ease of use, decoupling as much as possible the definition of the physics computations from the implementation details of their execution. This thesis is framed in the collective effort of the HEP community towards these goals, defining problems and possible solutions that can be adopted by future researchers.Padulano, VE. (2023). Distributed Computing Solutions for High Energy Physics Interactive Data Analysis [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/19310

    25th International Conference on Computing in High Energy & Nuclear Physics

    No full text
    Thanks to its RDataFrame interface, ROOT now supports the execution of the same physics analysis code both on a single machine and on a cluster of distributed resources. In the latter scenario, it is common to read the input ROOT datasets over the network from remote storage systems, which often increases the time it takes for physicists to obtain their results. Storing the remote files much closer to where the computations will run can bring latency and execution time down. Such a solution can be improved further by caching only the actual portion of the dataset that will be processed on each machine in the cluster, reusing it in subsequent executions on the same input data. This paper shows the benefits of applying different means of caching input data in a distributed ROOT RDataFrame analysis. Two such mechanisms will be applied to this kind of workflow with different configurations, namely caching on the same nodes that process data or caching on a separate server
    corecore