494 research outputs found

Using Big Data Technologies for HEP Analysis

Author: Bellini Claudio
Bian Bianny
Canali Luca
Cremonesi Matteo
Dimakopoulos Vasileios
Elmer Peter
Evangelos Evangelos
Fisk Ian
Girone Maria
Gutsche Oliver
Hoh Siew-Yan
Jayatilaka Bo
Khristenko Viktor
Luiselli Andrea
Melo Andrew
Olivito Dominick
Pazzini Jacopo
Pivarski Jim
Svyatkovskiy Alexey
Zanetti Marco
Publication venue: 'EDP Sciences'
Publication date: 01/01/2019

The HEP community is approaching an era in which the excellent performance of particle accelerators in delivering collisions at high rates will force the experiments to record a large amount of information. The growing size of the datasets could become a limiting factor in the ability to produce scientific results in a timely and efficient manner. Recently, new technologies and approaches have been developed in industry to meet the need to retrieve information as quickly as possible when analyzing PB and EB datasets. Providing scientists with these modern computing tools will lead to rethinking the principles of data analysis in HEP, making the overall scientific process faster and smoother. In this paper, we present the latest developments and the most recent results on the usage of Apache Spark for HEP analysis. The study aims to evaluate the efficiency of the new tools both quantitatively, by measuring performance, and qualitatively, by focusing on the user experience. The first goal is achieved by developing a data reduction facility: working together with CERN Openlab and Intel, CMS replicates a real physics search using Spark-based technologies, with the ambition of reducing 1 PB of public data collected by the CMS experiment to 1 TB of data, in a format suitable for physics analysis, within 5 hours. The second goal is achieved by implementing multiple physics use cases in Apache Spark, using as input preprocessed datasets derived from official CMS data and simulation. By performing different end-analyses up to the publication plots on different hardware, feasibility, usability and portability are compared to those of a traditional ROOT-based workflow.
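To put the stated ambition (1 PB reduced to 1 TB in 5 hours) in perspective, the average read rate it implies can be worked out directly. The sketch below is only a back-of-the-envelope calculation, assuming decimal units (1 PB = 10^15 bytes, 1 TB = 10^12 bytes) and a perfectly uniform rate over the 5 hours; the function names are illustrative and do not come from the paper.

```python
# Back-of-the-envelope arithmetic for the "1 PB to 1 TB in 5 hours" target.
# Assumption: decimal units and a constant rate over the whole run.
PB = 10**15  # bytes
TB = 10**12  # bytes
GB = 10**9   # bytes


def required_throughput_gb_per_s(input_bytes: int, hours: float) -> float:
    """Average sustained read rate (GB/s) needed to scan input_bytes in `hours`."""
    seconds = hours * 3600
    return input_bytes / seconds / GB


def reduction_factor(input_bytes: int, output_bytes: int) -> float:
    """How many times smaller the output is than the input."""
    return input_bytes / output_bytes


rate = required_throughput_gb_per_s(PB, 5)
factor = reduction_factor(PB, TB)
print(f"average read rate: {rate:.1f} GB/s, reduction factor: {factor:.0f}x")
```

Under these assumptions the target corresponds to an aggregate read rate of roughly 56 GB/s and a thousandfold reduction in data volume, which is why the work has to be spread over many nodes.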

arXiv.org e-Print Archive

EDP Sciences OAI-PMH repository (1.2.0)

CERN Document Server

Gaining insight from large data volumes with ease

Author: Kuznetsov Valentin
Publication venue: 'EDP Sciences'
Publication date: 18/09/2018

Efficient handling of large data volumes has become a necessity in today's world. It is driven by the desire to get more insight from the data and to gain a better understanding of user trends, which can be transformed into economic incentives (profits, cost reduction, and various optimizations of data workflows and pipelines). In this paper, we discuss how modern technologies are transforming well-established patterns in HEP communities. New data insight can be achieved by embracing Big Data tools for a variety of use cases, from analytics and monitoring to training Machine Learning models on a terabyte scale. We provide concrete examples within the context of the CMS experiment where Big Data tools are already playing, or would play, a significant role in daily operations.

arXiv.org e-Print Archive

EDP Sciences OAI-PMH repository (1.2.0)

Directory of Open Access Journals

CERN Document Server

Exploiting Big Data solutions for CMS computing operations analytics

Author: Daniele Bonacorsi
David Lange
Simone Gasperini
Simone Rossi Tisbeni
Publication date: 01/01/2022

Computing operations at the Large Hadron Collider (LHC) at CERN rely on the Worldwide LHC Computing Grid (WLCG) infrastructure, designed to allow efficient storage, access, and processing of data at the pre-exascale level. A close and detailed study of the computing systems exploited for the LHC physics mission is an increasingly crucial aspect of the roadmap of High Energy Physics (HEP) towards the exascale regime. In this context, the Compact Muon Solenoid (CMS) experiment has over the last few years been collecting and storing a large set of heterogeneous non-collision data (e.g. metadata about replica placement, transfer operations, and actual user access to physics datasets). All this data currently resides on a distributed Hadoop cluster, and it is organized so that running fast and arbitrary queries using the Spark analytics framework is a viable approach for Big Data mining efforts. Using a data-driven approach oriented to the analysis of this metadata, which derives from several CMS computing services such as DBS (Data Bookkeeping Service) and MCM (Monte Carlo Management system), we started to focus on data storage and data access over the WLCG infrastructure, and we drafted an embryonic software toolkit to investigate recurrent patterns and provide indicators of the popularity of physics datasets. As a long-term goal, this aims at contributing to the overall design of a predictive/adaptive system that would eventually reduce the costs and complexity of CMS computing operations, while taking into account the stringent requests of the physics analyst community.

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Prototyping a ROOT-based distributed analysis workflow for HL-LHC: the CMS use case

Author: Biasotto Massimo
Ciangottini Diego
Guiraud Enrico
Padulano Vincenzo Eduardo
Saavedra Enric Tejedor
Spiga Daniele
Tedeschi Tommaso
Tracolli Mirco
Publication date: 24/07/2023

The challenges expected for the next era of the Large Hadron Collider (LHC), both in terms of storage and computing resources, give LHC experiments a strong motivation to rethink their computing models at many levels. Great efforts have been put into optimizing the utilization of computing resources for data analysis, which leads both to lower hardware requirements and to faster turnaround for physics analyses. In this scenario, the Compact Muon Solenoid (CMS) collaboration is involved in several activities aimed at benchmarking different solutions for running High Energy Physics (HEP) analysis workflows. A promising solution is to evolve software towards more user-friendly approaches featuring a declarative programming model and interactive workflows. The computing infrastructure should keep up with this trend by offering modern interfaces on the one side and hiding the complexity of the underlying environment on the other, while efficiently leveraging the already deployed grid infrastructure and scaling toward opportunistic resources like public cloud or HPC centers. This article presents the first example of using the ROOT RDataFrame technology to exploit such next-generation approaches for a production-grade CMS physics analysis. A new analysis facility is created to offer users a modern interactive web interface based on JupyterLab that can leverage HTCondor-based grid resources at different geographical sites. The physics analysis is converted from a legacy iterative approach to the modern declarative approach offered by RDataFrame and distributed over multiple computing nodes. The new scenario offers not only an overall improved programming experience, but also an order-of-magnitude speedup with respect to the previous approach.
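The difference between an iterative event loop and a declarative analysis can be illustrated with a toy example. The sketch below is plain Python and is not the ROOT RDataFrame API: the `LazyFrame` class, its `filter`, `define` and `sum` methods, the `pt` column and the cut value are all hypothetical, chosen only to show how a declarative chain records what to compute and defers the event loop until a result is requested. It says nothing about the distributed execution or the speedup reported in the abstract.

```python
from typing import Callable, Dict, List, Optional, Tuple

Event = Dict[str, float]


def iterative_sum(events: List[Event]) -> float:
    """Legacy style: the user writes the event loop, the cut and the sum by hand."""
    total = 0.0
    for ev in events:
        if ev["pt"] > 20.0:
            total += ev["pt"] * 2.0
    return total


class LazyFrame:
    """Hypothetical toy frame: records steps lazily, loops once when asked."""

    def __init__(self, events: List[Event],
                 steps: Optional[List[Tuple[str, object]]] = None) -> None:
        self._events = events
        self._steps = steps or []

    def filter(self, pred: Callable[[Event], bool]) -> "LazyFrame":
        # Record the cut; no event is read yet.
        return LazyFrame(self._events, self._steps + [("filter", pred)])

    def define(self, name: str, func: Callable[[Event], float]) -> "LazyFrame":
        # Record a derived column; no event is read yet.
        return LazyFrame(self._events, self._steps + [("define", (name, func))])

    def sum(self, column: str) -> float:
        # The single event loop runs here, applying the recorded steps in order.
        total = 0.0
        for ev in self._events:
            row = dict(ev)
            keep = True
            for kind, payload in self._steps:
                if kind == "filter":
                    if not payload(row):
                        keep = False
                        break
                else:
                    name, func = payload
                    row[name] = func(row)
            if keep:
                total += row[column]
        return total


events = [{"pt": 10.0}, {"pt": 25.0}, {"pt": 40.0}]
declarative = (
    LazyFrame(events)
    .filter(lambda ev: ev["pt"] > 20.0)
    .define("pt_scaled", lambda ev: ev["pt"] * 2.0)
    .sum("pt_scaled")
)
print(iterative_sum(events), declarative)
```

Because the declarative chain only describes the computation, a backend is free to decide how and where the loop runs, which is the property that makes this style convenient to distribute over several nodes.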

arXiv.org e-Print Archive

Orchestration of machine learning workflows on Internet of Things data

Author: Alves Jose Miguel
Publication venue: Scholarship@Western
Publication date: 24/04/2019

Applications empowered by machine learning (ML) and the Internet of Things (IoT) are changing the way people live and impacting a broad range of industries. However, creating and automating ML workflows at scale using real-world IoT data often leads to complex systems integration and production issues. Examples of challenges faced during the development of these ML applications include glue code, hidden dependencies, and data pipeline jungles. This research proposes the Machine Learning Framework for IoT data (ML4IoT), which is designed to orchestrate ML workflows to perform training and enable inference by ML models on IoT data. In the proposed framework, containerized microservices are used to automate the execution of tasks specified in ML workflows, which are defined through REST APIs. To address the problem of integrating big data tools and machine learning into a unified platform, the proposed framework enables the definition and execution of end-to-end ML workflows on large volumes of IoT data. In addition, to address the challenges of running multiple ML workflows in parallel, ML4IoT has been designed to use container-based components that provide a convenient mechanism to enable the training and deployment of numerous ML models in parallel. Finally, to address the common production issues faced during the development of ML applications, the proposed framework uses a microservices architecture to bring flexibility, reusability, and extensibility to the framework. Through the experiments, we demonstrated the feasibility of ML4IoT, which managed to train and deploy predictive ML models on two types of IoT data. The results obtained suggest that the proposed framework can manage real-world IoT data by providing the elasticity to execute 32 ML workflows in parallel, which were used to train 128 ML models simultaneously. The results also demonstrated that, in ML4IoT, the performance of rendering online predictions is not affected when 64 ML models are deployed concurrently to infer new information using online IoT data.

Scholarship@Western

Storage Solutions for Big Data Systems: A Qualitative Study and Comparison

Author: Alam Mansaf
Ali Syed Arshad
Khan Samiya
Liu Xiufeng
Publication date: 01/01/2019

Big data systems development is full of challenges in view of the variety of application areas and domains that this technology promises to serve. Typically, fundamental design decisions in big data systems design include choosing appropriate storage and computing infrastructures. In this age of heterogeneous systems that integrate different technologies into an optimized solution for a specific real-world problem, big data systems are no exception. As far as the storage aspect of any big data system is concerned, the primary facet is the storage infrastructure, and NoSQL seems to be the right technology to fulfill its requirements. However, every big data application has different data characteristics, and thus the corresponding data fits a different data model. This paper presents a feature and use case analysis and comparison of the four main data models, namely document-oriented, key-value, graph and wide-column. Moreover, a feature analysis of 80 NoSQL solutions is provided, elaborating on the criteria and points that a developer must consider when making a choice. Typically, big data storage needs to communicate with the execution engine and other processing and visualization technologies to create a comprehensive solution. This brings the second facet of big data storage, big data file formats, into the picture. The second half of the paper compares the advantages, shortcomings and possible use cases of the available big data file formats for Hadoop, which is the foundation for most big data computing technologies. Decentralized storage and blockchain are seen as the next generation of big data storage, and their challenges and future prospects are also discussed.

arXiv.org e-Print Archive

Online Research Database In Technology

Computing models in high energy physics

Author: Tommaso Boccali
Publication date: 01/11/2019

High Energy Physics experiments (HEP experiments in the following) have been at the forefront of technology for at least the last three decades, in aspects like detector design and construction, number of collaborators, and complexity of data analyses. Unlike in earlier particle physics experiments, the computing and data handling aspects have not been marginal in their design and operations; the cost of the IT-related components, from software development to storage systems and distributed complex e-Infrastructures, has risen to a level that requires proper understanding and planning from the first moments in the lifetime of an experiment. In the following sections we first explore the computing and software solutions developed and operated in the most relevant past and present experiments, with a focus on the technologies deployed; a technology tracking section is then presented in order to pave the way to possible solutions for the experiments of the next decade and beyond. While the focus of this review is on the offline computing model, the distinction is a blurry one, and some experiments have already experienced cross-contamination between trigger selection and offline workflows; it is anticipated that the trend will continue in the future.

Open Access Repository