50 research outputs found

    Big Data in HEP: A comprehensive use case study

    Full text link
    Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was the first to develop suitable software and computing tools for this task. In recent times, new toolkits and systems collectively called Big Data technologies have emerged to support the analysis of Petabyte and Exabyte datasets in industry. While the principles of data analysis in HEP have not changed (filtering and transforming experiment-specific data formats), these new technologies use different approaches and promise a fresh look at the analysis of very large datasets that could potentially reduce the time-to-physics with increased interactivity. In this talk, we present an active LHC Run 2 analysis, searching for dark matter with the CMS detector, as a testbed for Big Data technologies. We directly compare the traditional NTuple-based analysis with an equivalent analysis using Apache Spark on the Hadoop ecosystem and beyond. In both cases, we start the analysis with the official experiment data formats and produce publication-level physics plots. We will discuss the advantages and disadvantages of each approach and give an outlook on further studies needed. Comment: Proceedings for the 22nd International Conference on Computing in High Energy and Nuclear Physics (CHEP 2016).
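    The filter-and-transform pattern described above maps naturally onto Spark DataFrame operations. The minimal sketch below is illustrative only and is not taken from the analysis presented in the talk; the dataset path, the column names (met_pt, n_jets, jet_pt) and the cut values are hypothetical.

        # Hypothetical sketch of the filter/transform pattern on a Spark DataFrame.
        # Column names and cut values are illustrative, not those of the CMS
        # dark matter analysis described above.
        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F

        spark = SparkSession.builder.appName("hep-analysis-sketch").getOrCreate()

        events = spark.read.parquet("events.parquet")  # assumed columnar input

        selected = (
            events
            .where((F.col("met_pt") > 200.0) & (F.col("n_jets") >= 1))  # event selection
            .select("met_pt", "jet_pt")                                 # keep analysis columns
        )

        # A simple aggregate standing in for the input to a physics plot
        summary = selected.agg(F.count("*").alias("n_pass"),
                               F.mean("met_pt").alias("mean_met"))
        summary.show()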

    A Ceph S3 Object Data Store for HEP

    Full text link
    We present a novel data format design that obviates the need for data tiers by storing individual event data products in column objects. The objects are stored and retrieved through Ceph S3 technology, with a layout designed to minimize metadata volume and maximize data processing parallelism. Performance benchmarks of data storage and retrieval are presented. Comment: CHEP2023 proceedings, to be published in EPJ Web of Conferences.
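    For context, the sketch below shows what storing and retrieving a single column object through an S3-compatible interface (such as a Ceph RADOS Gateway) can look like with boto3. The endpoint, bucket, key naming and NumPy serialization are assumptions made for illustration; the object layout described in the paper is not reproduced here.

        # Minimal sketch of writing and reading one column object via an
        # S3-compatible endpoint.  Endpoint, credentials, bucket and key are
        # hypothetical; the paper's actual layout is not reproduced.
        import io
        import boto3
        import numpy as np

        s3 = boto3.client(
            "s3",
            endpoint_url="https://ceph-rgw.example.org",  # hypothetical Ceph RGW endpoint
            aws_access_key_id="ACCESS_KEY",
            aws_secret_access_key="SECRET_KEY",
        )

        column = np.random.default_rng(0).normal(size=10_000)  # stand-in column data

        buf = io.BytesIO()
        np.save(buf, column)                                   # serialize the column
        s3.put_object(Bucket="hep-columns",
                      Key="dataset/muon_pt/part-0000.npy",
                      Body=buf.getvalue())

        obj = s3.get_object(Bucket="hep-columns", Key="dataset/muon_pt/part-0000.npy")
        restored = np.load(io.BytesIO(obj["Body"].read()))
        assert np.array_equal(column, restored)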

    Using Big Data Technologies for HEP Analysis

    Full text link
    The HEP community is approaching an era where the excellent performance of the particle accelerators in delivering collisions at high rate will force the experiments to record a large amount of information. The growing size of the datasets could potentially become a limiting factor in the capability to produce scientific results in a timely and efficient way. Recently, new technologies and new approaches have been developed in industry to answer the need to retrieve information as quickly as possible when analyzing PB and EB datasets. Providing the scientists with these modern computing tools will lead to rethinking the principles of data analysis in HEP, making the overall scientific process faster and smoother. In this paper, we present the latest developments and the most recent results on the usage of Apache Spark for HEP analysis. The study aims at evaluating the efficiency of the new tools both quantitatively, by measuring the performance, and qualitatively, by focusing on the user experience. The first goal is achieved by developing a data reduction facility: working together with CERN openlab and Intel, CMS replicates a real physics search using Spark-based technologies, with the ambition of reducing, in 5 hours, 1 PB of public data collected by the CMS experiment to 1 TB of data in a format suitable for physics analysis. The second goal is achieved by implementing multiple physics use cases in Apache Spark, using as input preprocessed datasets derived from official CMS data and simulation. By performing different end-analyses up to the publication plots on different hardware, feasibility, usability and portability are compared to those of a traditional ROOT-based workflow.
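    As a hedged illustration of the end-analysis step mentioned above, the sketch below turns a reduced columnar dataset into the binned values behind a plot using plain PySpark; the dataset path, column name and binning are hypothetical.

        # Hypothetical end-analysis step: histogram one column of a reduced
        # dataset in a distributed way.  Path, column name and binning are
        # assumptions, not taken from the CMS study described above.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("hep-plot-sketch").getOrCreate()

        reduced = spark.read.parquet("reduced_ntuple.parquet")

        # 1 GeV bins from 60 to 120 GeV; RDD.histogram returns (bin_edges, counts)
        edges, counts = (
            reduced.select("dimuon_mass")
            .rdd.map(lambda row: row["dimuon_mass"])
            .histogram(list(range(60, 121)))
        )
        print(list(zip(edges[:-1], counts)))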

    Editorial: Innovative Analysis Ecosystems for HEP Data

    Get PDF
    This editorial summarizes the contributions to the Frontiers Research Topic “Innovative Analysis Ecosystems for HEP Data”, established under the Big Data and AI in High Energy Physics section and appearing in the Frontiers in Big Data and Frontiers in Artificial Intelligence journals.

    Code of Conduct poster

    No full text
    Know the code of conduct (poster).

    Coffea Columnar Object Framework For Effective Analysis

    Get PDF
    The coffea framework provides a new approach to High-Energy Physics analysis, via columnar operations, that improves time-to-insight, scalability, portability, and reproducibility of analysis. It is implemented with the Python programming language, the scientific Python package ecosystem, and commodity big data technologies. To achieve this suite of improvements across many use cases, coffea takes a factorized approach, separating the analysis implementation and the data delivery scheme. All analysis operations are implemented using the NumPy or awkward-array packages, which are wrapped to yield user code whose purpose is quickly intuited. Various data delivery schemes are wrapped into a common front-end which accepts user inputs and code, and returns user-defined outputs. We will discuss our experience in implementing analysis of CMS data using the coffea framework, along with a discussion of the user experience and future directions.
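    The columnar style referred to above can be illustrated with plain awkward-array and NumPy, without coffea's own processor and data-delivery machinery; the field names, toy values and cuts below are hypothetical.

        # Illustration of columnar (no explicit event loop) analysis with
        # awkward-array and NumPy.  Toy data and cuts are hypothetical and
        # do not use coffea's own API.
        import awkward as ak
        import numpy as np

        # Jagged (per-event, variable-length) muon kinematics for three toy events,
        # with muons assumed to be pT-sorted within each event
        muon_pt = ak.Array([[45.0, 12.0], [8.0], [63.0, 31.0, 5.0]])
        muon_eta = ak.Array([[0.3, -1.2], [2.1], [-0.4, 0.9, 1.7]])

        # Object selection as an array-level boolean mask
        good = (muon_pt > 20.0) & (abs(muon_eta) < 2.4)
        good_pt = muon_pt[good]

        # Event selection: require at least two selected muons
        two_muon_events = good_pt[ak.num(good_pt) >= 2]

        # Histogram the leading selected-muon pT of the surviving events
        leading_pt = ak.to_numpy(two_muon_events[:, 0])
        counts, edges = np.histogram(leading_pt, bins=10, range=(0.0, 100.0))
        print(counts)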

    CMS Analysis and Data Reduction with Apache Spark

    No full text
    Experimental Particle Physics has been at the forefront of analyzing the world's largest datasets for decades. The HEP community was among the first to develop suitable software and computing tools for this task. In recent times, new toolkits and systems for distributed data processing, collectively called "Big Data" technologies, have emerged from industry and open source projects to support the analysis of Petabyte and Exabyte datasets. While the principles of data analysis in HEP have not changed (filtering and transforming experiment-specific data formats), these new technologies use different approaches and tools, promising a fresh look at the analysis of very large datasets that could potentially reduce the time-to-physics with increased interactivity. Moreover, these new tools are typically actively developed by large communities, often profiting from industry resources, and released under open source licenses. These factors boost the adoption and maturity of the tools and of the communities supporting them, while helping to reduce the cost of ownership for the end users. In this talk, we present studies of using Apache Spark for end-user data analysis. We study the HEP analysis workflow separated into two thrusts: the reduction of centrally produced experiment datasets and the end analysis up to the publication plot. For the first thrust, CMS is working together with CERN openlab and Intel on the CMS Big Data Reduction Facility; the goal is to reduce 1 PB of official CMS data to 1 TB of ntuple output for analysis, and we present the progress of this 2-year project with first results of scaling up Spark-based HEP analysis. For the second thrust, we present studies on using Apache Spark for a CMS dark matter physics search, investigating Spark's feasibility, usability and performance compared to the traditional ROOT-based analysis.
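    As a rough sketch of the reduction thrust (not the project's actual implementation), the PySpark snippet below skims a large columnar dataset down to the few columns an analysis needs and writes a compact output; paths, column names and cuts are hypothetical, and reading the official CMS data formats themselves is not shown.

        # Hypothetical reduction step: skim and thin a large columnar dataset
        # into a compact analysis ntuple.  Paths, columns and cuts are
        # illustrative only.
        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F

        spark = SparkSession.builder.appName("cms-reduction-sketch").getOrCreate()

        full = spark.read.parquet("hdfs:///store/full_dataset/")  # assumed input

        reduced = (
            full
            .where(F.col("trigger_passed") & (F.col("met_pt") > 100.0))  # skim events
            .select("run", "event", "met_pt", "jet_pt", "jet_eta")       # keep few columns
        )

        reduced.write.mode("overwrite").parquet("hdfs:///store/reduced_ntuple/")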

    Probing the accuracy and precision of Hirshfeld atom refinement with HARt interfaced with Olex2

    Get PDF
    Hirshfeld atom refinement (HAR) is a novel X-ray structure refinement technique that employs aspherical atomic scattering factors obtained from stockholder partitioning of a theoretically determined, tailor-made static electron density. HAR overcomes many of the known limitations of independent atom modelling (IAM), such as too short element–hydrogen distances, r(X—H), or too large atomic displacement parameters (ADPs). This study probes the accuracy and precision of anisotropic hydrogen and non-hydrogen ADPs and of r(X—H) values obtained from HAR. These quantities are compared and found to agree with those obtained from (i) accurate neutron diffraction data measured at the same temperatures as the X-ray data and (ii) multipole modelling (MM), an established alternative method for interpreting X-ray diffraction data with the help of aspherical atomic scattering factors. Results are presented for three chemically different systems: the aromatic hydrocarbon rubrene (orthorhombic 5,6,11,12-tetraphenyltetracene), a co-crystal of zwitterionic betaine, imidazolium cations and picrate anions (BIPa), and the salt potassium hydrogen oxalate (KHOx). The non-hydrogen HAR-ADPs are as accurate and precise as the MM-ADPs. Both show excellent agreement with the neutron-based values and are superior to IAM-ADPs. The anisotropic hydrogen HAR-ADPs show a somewhat larger deviation from neutron-based values than the hydrogen SHADE-ADPs used in MM. Element–hydrogen bond lengths from HAR are in excellent agreement with those obtained from neutron diffraction experiments, although they are somewhat less precise. The residual density contour maps after HAR show fewer features than those after MM. Calculating the static electron density with the def2-TZVP basis set instead of the simpler def2-SVP one does not improve the refinement results significantly. All HARs were performed with the recently introduced HARt option implemented in the Olex2 program; they are easily launched inside its graphical user interface following a conventional IAM refinement.
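    For context, the stockholder partitioning referred to above is the standard Hirshfeld scheme; in textbook form (not a formula quoted from this paper), atom A receives a share of the molecular electron density weighted by its free-atom (promolecule) density:

        % Standard Hirshfeld stockholder partitioning (textbook definition)
        \[
          w_A(\mathbf{r}) = \frac{\rho_A^{\mathrm{free}}(\mathbf{r})}{\sum_B \rho_B^{\mathrm{free}}(\mathbf{r})},
          \qquad
          \rho_A(\mathbf{r}) = w_A(\mathbf{r})\,\rho_{\mathrm{mol}}(\mathbf{r}).
        \]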