Parallelization and optimization of a High Energy Physics analysis with ROOT’s RDataFrame and Spark

Abstract

After a remarkable era full of great discoveries, particle physics has an ambitious and broad experimental program that aims to expand the limits of our knowledge about the nature of our universe. The roadmap for the coming decades is full of big challenges, with the Large Hadron Collider (LHC) at the forefront of scientific research. The discovery of the Higgs boson in 2012 is a major milestone in the history of physics that has inspired many new theories based on its presence. Located at CERN, the European Organization for Nuclear Research, the LHC will remain the most powerful accelerator in the world for years to come. In order to extend its discovery potential, a major hardware upgrade will take place in the 2020s to increase the number of collisions produced by a factor of five beyond its design value. The upgraded machine, called the High-Luminosity LHC (HL-LHC), is expected to produce some 30 times more data than the LHC has produced so far. Since the total amount of LHC data already collected is close to an exabyte (10^18 bytes), the foreseen evolution of hardware technology alone will not be enough to face the challenge of processing such increasing volumes of data. Software will therefore have to cover that gap: there is a need for tools that can express physics analyses in a high-level way and automatically parallelize them on new parallel and distributed infrastructures. The High Energy Physics (HEP) community has developed specialized solutions for processing experiment data over decades. However, data sizes comparable to those of HEP are becoming common in other fields, and recent breakthroughs in Big Data and Cloud platforms raise the question of whether such technologies could be applied to the domain of physics data analysis. This thesis is carried out in the context of a collaboration between different CERN departments whose aim is to provide web-based interactive services as the entry point for scientists to cutting-edge distributed data processing frameworks such as Spark. In this context, the thesis exploits the aforementioned services to run a real analysis of the CERN TOTEM experiment on 4.7 TB of data. In addition, it explores the benefits of a new high-level programming model for physics analysis, called RDataFrame, by translating the original code of the TOTEM analysis to RDataFrame. The following sections describe for the first time the detailed process of translating a data analysis of this magnitude to the programming model offered by RDataFrame. Moreover, we compare the performance of both codes and provide results gathered from local and distributed analyses. The results are promising and show how the processing time of the dataset can be reduced by multiple orders of magnitude with the new analysis model.
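
To give an idea of the programming model discussed above, the sketch below shows, in Python, the declarative style in which RDataFrame analyses are expressed: derived columns are defined, cuts are applied and results are booked, while the framework runs the event loop and can parallelize it transparently. The tree, file and branch names (tree, events.root, px, py) are hypothetical illustrations, not code from the TOTEM analysis; the distributed runs on Spark described in this thesis keep essentially the same analysis code and change only how the data frame is constructed.

    import ROOT

    # Optional: let ROOT run the event loop on all available cores.
    ROOT.EnableImplicitMT()

    # Hypothetical tree, file and branch names, for illustration only.
    df = ROOT.RDataFrame("tree", "events.root")

    # Declarative chain: define a derived column, apply a cut, book a histogram.
    h_pt = (df.Define("pt", "sqrt(px*px + py*py)")
              .Filter("pt > 10.0", "pt cut")
              .Histo1D(("h_pt", "p_{T};p_{T} [GeV];Events", 100, 0.0, 100.0), "pt"))

    # The event loop only runs lazily, when a result is first accessed.
    h_pt.Draw()

Because results are booked lazily, all requested histograms and counters are filled in a single pass over the data, which is what allows the framework to parallelize the same analysis code transparently, whether on local threads or on a distributed backend.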
