Big Data in HEP: A comprehensive use case study
Experimental Particle Physics has been at the forefront of analyzing the
world's largest datasets for decades. The HEP community was the first to develop
suitable software and computing tools for this task. In recent times, new
toolkits and systems collectively called Big Data technologies have emerged to
support the analysis of Petabyte and Exabyte datasets in industry. While the
principles of data analysis in HEP have not changed (filtering and transforming
experiment-specific data formats), these new technologies use different
approaches and promise a fresh look at analysis of very large datasets and
could potentially reduce the time-to-physics with increased interactivity. In
this talk, we present an active LHC Run 2 analysis, searching for dark matter
with the CMS detector, as a testbed for Big Data technologies. We directly
compare the traditional NTuple-based analysis with an equivalent analysis using
Apache Spark on the Hadoop ecosystem and beyond. In both cases, we start the
analysis with the official experiment data formats and produce publication
physics plots. We will discuss advantages and disadvantages of each approach
and give an outlook on further studies needed.
Comment: Proceedings for the 22nd International Conference on Computing in High
Energy and Nuclear Physics (CHEP 2016)
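As an illustration of the shift in style this comparison involves, here is a
minimal PySpark sketch of the filter-and-transform pattern the abstract
describes; the input path and column names (nMuon, met_pt, muon_pt) are
hypothetical placeholders, not the analysis's real schema.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("hep-skim").getOrCreate()

    # Hypothetical flat-NTuple-like dataset in a columnar format.
    events = spark.read.parquet("hdfs:///data/cms/flat_ntuple.parquet")

    # Event selection and a derived quantity, expressed declaratively
    # instead of as an explicit event loop over an NTuple.
    selected = (
        events
        .filter((F.col("nMuon") >= 2) & (F.col("met_pt") > 200.0))
        .withColumn("leading_mu_pt", F.col("muon_pt")[0])
    )

    # Aggregate into a histogram-like summary for plotting downstream.
    hist = selected.groupBy(
        (F.floor(F.col("met_pt") / 25.0) * 25).alias("met_bin")
    ).count()
    hist.orderBy("met_bin").show()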
The Awkward World of Python and C++
There are undeniable benefits of binding Python and C++ to take advantage of
the best features of both languages. This is especially relevant to the HEP and
other scientific communities that have invested heavily in the C++ frameworks
and are rapidly moving their data analyses to Python. Version 2 of Awkward
Array, a Scikit-HEP Python library, introduces a set of header-only C++
libraries that do not depend on any application binary interface. Users can
directly include these libraries in their compilation instead of linking
against platform-specific libraries. This new development makes the integration
of Awkward Arrays into other projects easier and more portable, as the
implementation is easily separable from the rest of the Awkward Array codebase.
The code is minimal; it does not include all of the code needed to use Awkward
Arrays in Python, nor does it include references to Python or pybind11. C++
users can use it to make arrays and then copy them to Python without any
specialized data types - only raw buffers, strings, and integers. This C++ code
also simplifies the process of just-in-time (JIT) compilation in ROOT. This
implementation approach removes some drawbacks, such as the difficulty of
packaging projects with native dependencies. In this paper, we demonstrate the
technique to integrate C++ and Python using a header-only approach. We also
describe the implementation of a new LayoutBuilder and a GrowableBuffer.
Furthermore, examples of wrapping the C++ data into Awkward Arrays and exposing
Awkward Arrays to C++ without copying them are discussed.
Comment: 6 pages, 2 figures; submitted to ACAT 2022 proceedings
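To make the raw-buffer handoff concrete, a minimal Python-side sketch follows:
ak.from_buffers reconstructs an array from a form (a JSON string), a length (an
integer), and named raw buffers, the only things that cross the language
boundary. The buffers here are NumPy stand-ins for what the header-only
LayoutBuilder would fill on the C++ side.

    import numpy as np
    import awkward as ak

    # Toy stand-ins for buffers a C++ LayoutBuilder would have filled:
    # offsets for a list structure and a flat float64 payload.
    container = {
        "node0-offsets": np.array([0, 2, 2, 5], dtype=np.int64),
        "node1-data": np.array([1.1, 2.2, 3.3, 4.4, 5.5], dtype=np.float64),
    }

    # The form is a JSON string describing the layout; the length is the
    # number of outer-list entries. Only strings, integers, and raw
    # buffers cross the boundary, as the paper describes.
    form = """
    {
        "class": "ListOffsetArray",
        "offsets": "i64",
        "content": {"class": "NumpyArray", "primitive": "float64",
                    "form_key": "node1"},
        "form_key": "node0"
    }
    """

    array = ak.from_buffers(form, 3, container)
    print(array.to_list())  # [[1.1, 2.2], [], [3.3, 4.4, 5.5]]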
Awkward Arrays in Python, C++, and Numba
The Awkward Array library has been an important tool for physics analysis in
Python since September 2018. However, some interface and implementation issues
have been raised in Awkward Array's first year that argue for a
reimplementation in C++ and Numba. We describe those issues, the new
architecture, and present some examples of how the new interface will look to
users. Of particular importance is the separation of kernel functions from data
structure management, which allows a C++ implementation and a Numba
implementation to share kernel functions, and the algorithm that transforms
record-oriented data into columnar Awkward Arrays.
Comment: To be published in CHEP 2019 proceedings, EPJ Web of Conferences;
post-review update
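A small sketch of that record-oriented-to-columnar transformation, using
ak.from_iter from the current library; the event and field names are invented
for illustration.

    import awkward as ak

    # Record-oriented input: a list of events, each with a
    # variable-length list of particle records.
    events = [
        {"met": 17.2, "muons": [{"pt": 32.1, "eta": 0.5}]},
        {"met": 45.8, "muons": []},
        {"met": 9.6, "muons": [{"pt": 20.4, "eta": -1.2},
                               {"pt": 11.7, "eta": 2.0}]},
    ]

    # ak.from_iter performs the record-oriented -> columnar
    # transformation: the result stores one contiguous buffer per
    # field rather than one Python object per record.
    array = ak.from_iter(events)

    print(array.muons.pt)       # [[32.1], [], [20.4, 11.7]]
    print(ak.num(array.muons))  # [1, 0, 2] -- list lengths per event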
HEP Software Foundation Community White Paper Working Group - Data and Software Preservation to Enable Reuse
In this chapter of the High Energy Physics Software Foundation Community
Whitepaper, we discuss the current state of infrastructure, best practices, and
ongoing developments in the area of data and software preservation in high
energy physics. A re-framing of the motivation for preservation to enable
re-use is presented. A series of research and development goals in software and
other cyberinfrastructure that will aid in the enabling of reuse of particle
physics analyses and production software are presented and discussed.
Awkward Just-In-Time (JIT) Compilation: A Developer’s Experience
Awkward Array is a library for performing NumPy-like computations on nested, variable-sized data, enabling array-oriented programming on arbitrary data structures in Python. However, imperative (procedural) solutions can sometimes be easier to write or faster to run. Performant imperative programming requires compilation; JIT-compilation makes it convenient to compile in an interactive Python environment. Various functions in Awkward Array JIT-compile a user's code into executable machine code. They use several different techniques, but reuse parts of each other's implementations. We discuss the techniques used to accelerate Awkward Array with JIT-compilation, focusing on RDataFrame, cppyy, and Numba, particularly Numba on GPUs: conversions of Awkward Arrays to and from RDataFrame; standalone cppyy; passing Awkward Arrays to and from Python functions compiled by Numba; passing Awkward Arrays to Python functions compiled for GPUs by Numba; and header-only libraries for populating Awkward Arrays from C++ without any Python dependencies.
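For instance, passing an Awkward Array into a Numba-compiled Python function
looks like the following minimal sketch; the field names are invented for
illustration.

    import awkward as ak
    import numba as nb

    @nb.njit
    def leading_pt_sum(events):
        # Imperative loop over a nested Awkward Array inside a Numba
        # JIT-compiled function; iteration and field access are lowered
        # to machine code, with no Python objects involved.
        total = 0.0
        for event in events:
            best = -1.0
            for muon in event.muons:
                if muon.pt > best:
                    best = muon.pt
            if best > 0.0:
                total += best
        return total

    events = ak.Array([
        {"muons": [{"pt": 32.1}, {"pt": 11.7}]},
        {"muons": []},
        {"muons": [{"pt": 20.4}]},
    ])
    print(leading_pt_sum(events))  # 52.5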
Analysis Description Languages for the LHC
An analysis description language is a domain specific language capable of
describing the contents of an LHC analysis in a standard and unambiguous way,
independent of any computing framework. It is designed for use by anyone with
an interest in, and knowledge of, LHC physics, i.e., experimentalists,
phenomenologists and other enthusiasts. Adopting analysis description languages
would bring numerous benefits for the LHC experimental and phenomenological
communities ranging from analysis preservation beyond the lifetimes of
experiments or analysis software to facilitating the abstraction, design,
visualization, validation, combination, reproduction, interpretation and
overall communication of the analysis contents. Here, we introduce the analysis
description language concept and summarize the current efforts ongoing to
develop such languages and tools to use them in LHC analyses.Comment: Accepted contribution to the proceedings of The 8th Annual Conference
on Large Hadron Collider Physics, LHCP2020, 25-30 May, 2020, onlin
The Scikit HEP Project -- overview and prospects
Scikit-HEP is a community-driven and community-oriented project with the goal
of providing an ecosystem for particle physics data analysis in Python.
Scikit-HEP is a toolset of approximately twenty packages and a few "affiliated"
packages. It expands the typical Python data analysis tools for particle
physicists. Each package focuses on a particular topic, and interacts with
other packages in the toolset, where appropriate. Most of the packages are easy
to install in many environments; much work has been done this year to provide
binary "wheels" on PyPI and conda-forge packages. The Scikit-HEP project has
been gaining interest and momentum, by building a user and developer community
engaging collaboration across experiments. Some of the packages are being used
by other communities, including the astroparticle physics community. An
overview of the overall project and toolset will be presented, as well as a
vision for development and sustainability.
Comment: 6 pages, 3 figures; Proceedings of the 24th International Conference
on Computing in High Energy and Nuclear Physics (CHEP 2019), Adelaide,
Australia, 4-8 November 2019
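A minimal sketch of how a few of the Scikit-HEP packages interoperate in a
typical workflow (uproot for I/O, awkward for jagged data, hist for
histogramming); the file, tree, and branch names are hypothetical placeholders.

    import uproot
    import awkward as ak
    from hist import Hist

    # Hypothetical file, tree, and branch names; the point is how the
    # packages hand data to one another.
    events = uproot.open("events.root")["Events"].arrays(["Muon_pt"])

    # Flatten the jagged per-event muon lists and fill a histogram.
    muon_pt = ak.flatten(events["Muon_pt"])
    h = Hist.new.Reg(50, 0, 100, name="pt", label="Muon pT [GeV]").Double()
    h.fill(pt=muon_pt)
    print(h)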
Using Big Data Technologies for HEP Analysis
The HEP community is approaching an era where the excellent performance of
the particle accelerators in delivering collisions at high rates will force the
experiments to record a large amount of information. The growing size of the
datasets could potentially become a limiting factor in the capability to
produce scientific results in a timely and efficient manner. Recently, new
technologies and new approaches have been developed in industry to meet the
need to retrieve information as quickly as possible when analyzing PB- and
EB-scale datasets.
Providing the scientists with these modern computing tools will lead to
rethinking the principles of data analysis in HEP, making the overall
scientific process faster and smoother.
In this paper, we present the latest developments and the most recent
results on the usage of Apache Spark for HEP analysis. The study aims to
evaluate the efficiency of the new tools both
quantitatively, by measuring their performance, and qualitatively, focusing on
the user experience. The first goal is achieved by developing a data reduction
facility: working together with CERN Openlab and Intel, CMS replicates a real
physics search using Spark-based technologies, with the ambition of reducing
1 PB of public data collected by the CMS experiment to 1 TB of data in a
format suitable for physics analysis within 5 hours.
The second goal is achieved by implementing multiple physics use-cases in
Apache Spark using as input preprocessed datasets derived from official CMS
data and simulation. By performing different end-analyses up to the publication
plots on different hardware, feasibility, usability, and portability are
compared to those of a traditional ROOT-based workflow.
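The reduction-facility pattern described above can be sketched in a few lines
of PySpark; the paths, column names, and cuts here are illustrative
placeholders, not the actual CMS workflow.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cms-data-reduction").getOrCreate()

    # Hypothetical path to a large columnar copy of the input data.
    full = spark.read.parquet("hdfs:///cms/open_data/")

    # Keep only the columns and events an analysis needs, then write
    # back a compact, analysis-ready dataset: the PB -> TB reduction.
    reduced = (
        full
        .select("run", "event", "met_pt", "jet_pt", "jet_eta")
        .filter(F.col("met_pt") > 150.0)
    )

    reduced.write.mode("overwrite").parquet("hdfs:///cms/reduced/")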
Is Julia ready to be adopted by HEP?
The Julia programming language was created 10 years ago and is now a mature and stable language with a large ecosystem including more than 8,000 third-party packages. It was designed for scientific programming to be a high-level and dynamic language, as Python is, while achieving runtime performance comparable to C/C++ or even faster. With this, we ask ourselves whether the Julia language and its ecosystem are now ready for adoption by the High Energy Physics community. We report on a number of investigations and studies of the Julia language that have been done for various representative HEP applications, ranging from computing-intensive initial processing of experimental and simulated data to final interactive data analysis and plotting. Aspects of collaborative development of large software projects within a HEP experiment have also been investigated: scalability with large development teams, continuous integration and code testing, code reuse, language interoperability to enable an adiabatic migration of packages and tools, software installation and distribution, training of the community, and benefiting from developments in industry and academia in other fields.