708 research outputs found

    Prototype of machine learning “as a service” for CMS physics in signal vs background discrimination

    Get PDF
    Big volumes of data are collected and analysed by LHC experiments at CERN. The success of this scientific challenges is ensured by a great amount of computing power and storage capacity, operated over high performance networks, in very complex LHC computing models on the LHC Computing Grid infrastructure. Now in Run-2 data taking, LHC has an ambitious and broad experimental programme for the coming decades: it includes large investments in detector hardware, and similarly it requires commensurate investment in the R&D in software and com- puting to acquire, manage, process, and analyse the shear amounts of data to be recorded in the High-Luminosity LHC (HL-LHC) era. The new rise of Artificial Intelligence - related to the current Big Data era, to the technological progress and to a bump in resources democratization and efficient allocation at affordable costs through cloud solutions - is posing new challenges but also offering extremely promising techniques, not only for the commercial world but also for scientific enterprises such as HEP experiments. Machine Learning and Deep Learning are rapidly evolving approaches to characterising and describing data with the potential to radically change how data is reduced and analysed, also at LHC. This thesis aims at contributing to the construction of a Machine Learning “as a service” solution for CMS Physics needs, namely an end-to-end data-service to serve Machine Learning trained model to the CMS software framework. To this ambitious goal, this thesis work contributes firstly with a proof of concept of a first prototype of such infrastructure, and secondly with a specific physics use-case: the Signal versus Background discrimination in the study of CMS all-hadronic top quark decays, done with scalable Machine Learning techniques

    Predicting CMS datasets popularity with machine learning

    Get PDF
    In CMS è stato lanciato un progetto di Data Analytics e, all’interno di esso, un’attività specifica pilota che mira a sfruttare tecniche di Machine Learning per predire la popolarità dei dataset di CMS. Si tratta di un’osservabile molto delicata, la cui eventuale predizione premetterebbe a CMS di costruire modelli di data placement più intelligenti, ampie ottimizzazioni nell’uso dello storage a tutti i livelli Tiers, e formerebbe la base per l’introduzione di un solito sistema di data management dinamico e adattivo. Questa tesi descrive il lavoro fatto sfruttando un nuovo prototipo pilota chiamato DCAFPilot, interamente scritto in python, per affrontare questa sfida

    Predicting dataset popularity for the CMS experiment

    Full text link
    The CMS experiment at the LHC accelerator at CERN relies on its computing infrastructure to stay at the frontier of High Energy Physics, searching for new phenomena and making discoveries. Even though computing plays a significant role in physics analysis we rarely use its data to predict the system behavior itself. A basic information about computing resources, user activities and site utilization can be really useful for improving the throughput of the system and its management. In this paper, we discuss a first CMS analysis of dataset popularity based on CMS meta-data which can be used as a model for dynamic data placement and provide the foundation of data-driven approach for the CMS computing infrastructure.Comment: Submitted to proceedings of 17th International workshop on Advanced Computing and Analysis Techniques in physics research (ACAT

    MLaaS4HEP: Machine Learning as a Service for HEP

    Full text link
    Machine Learning (ML) will play a significant role in the success of the upcoming High-Luminosity LHC (HL-LHC) program at CERN. An unprecedented amount of data at the exascale will be collected by LHC experiments in the next decade, and this effort will require novel approaches to train and use ML models. In this paper, we discuss a Machine Learning as a Service pipeline for HEP (MLaaS4HEP) which provides three independent layers: a data streaming layer to read High-Energy Physics (HEP) data in their native ROOT data format; a data training layer to train ML models using distributed ROOT files; a data inference layer to serve predictions using pre-trained ML models via HTTP protocol. Such modular design opens up the possibility to train data at large scale by reading ROOT files from remote storage facilities, e.g. World-Wide LHC Computing Grid (WLCG) infrastructure, and feed the data to the user's favorite ML framework. The inference layer implemented as TensorFlow as a Service (TFaaS) may provide an easy access to pre-trained ML models in existing infrastructure and applications inside or outside of the HEP domain. In particular, we demonstrate the usage of the MLaaS4HEP architecture for a physics use-case, namely the ttˉt\bar{t} Higgs analysis in CMS originally performed using custom made Ntuples. We provide details on the training of the ML model using distributed ROOT files, discuss the performance of the MLaaS and TFaaS approaches for the selected physics analysis, and compare the results with traditional methods.Comment: 16 pages, 10 figures, 2 tables, submitted to Computing and Software for Big Science. arXiv admin note: text overlap with arXiv:1811.0449

    Machine learning as a service for high energy physics (MLaaS4HEP): a service for ML-based data analyses

    Get PDF
    With the CERN LHC program underway, there has been an acceleration of data growth in the High Energy Physics (HEP) field and the usage of Machine Learning (ML) in HEP will be critical during the HL-LHC program when the data that will be produced will reach the exascale. ML techniques have been successfully used in many areas of HEP nevertheless, the development of a ML project and its implementation for production use is a highly time-consuming task and requires specific skills. Complicating this scenario is the fact that HEP data is stored in ROOT data format, which is mostly unknown outside of the HEP community. The work presented in this thesis is focused on the development of a ML as a Service (MLaaS) solution for HEP, aiming to provide a cloud service that allows HEP users to run ML pipelines via HTTP calls. These pipelines are executed by using the MLaaS4HEP framework, which allows reading data, processing data, and training ML models directly using ROOT files of arbitrary size from local or distributed data sources. Such a solution provides HEP users non-expert in ML with a tool that allows them to apply ML techniques in their analyses in a streamlined manner. Over the years the MLaaS4HEP framework has been developed, validated, and tested and new features have been added. A first MLaaS solution has been developed by automatizing the deployment of a platform equipped with the MLaaS4HEP framework. Then, a service with APIs has been developed, so that a user after being authenticated and authorized can submit MLaaS4HEP workflows producing trained ML models ready for the inference phase. A working prototype of this service is currently running on a virtual machine of INFN-Cloud and is compliant to be added to the INFN Cloud portfolio of services

    Prototype of a cloud native solution of Machine Learning as Service for HEP

    Get PDF
    To favor the usage of Machine Learning (ML) techniques in High-Energy Physics (HEP) analyses it would be useful to have a service allowing to perform the entire ML pipeline (in terms of reading the data, training a ML model, and serving predictions) directly using ROOT files of arbitrary size from local or remote distributed data sources. The MLaaS4HEP framework aims to provide such kind of solution. It was successfully validated with a CMS physics use case which gave important feedback about the needs of analysts. For instance, we introduced the possibility for the user to provide pre-processing operations, such as defining new branches and applying cuts. To provide a real service for the user and to integrate it into the INFN Cloud, we started working on MLaaS4HEP cloudification. This would allow to use cloud resources and to work in a distributed environment. In this work, we provide updates on this topic, and in particular, we discuss our first working prototype of the service. It includes an OAuth2 proxy server as authentication/authorization layer, a MLaaS4HEP server, an XRootD proxy server for enabling access to remote ROOT data, and the TensorFlow as a Service (TFaaS) service in charge of the inference phase. With this architecture the user is able to submit ML pipelines, after being authenticated and authorized, using local or remote ROOT files simply using HTTP call

    Machine Learning as a Service for High Energy Physics on heterogeneous computing resources

    Get PDF
    Machine Learning (ML) techniques in the High-Energy Physics (HEP) domain are ubiquitous and will play a significant role also in the upcoming High-Luminosity LHC (HL-LHC) upgrade foreseen at CERN: a huge amount of data will be produced by LHC and collected by the ex- periments, facing challenges at the exascale. Despite ML models are successfully applied in many use-cases (online and offline reconstruction, particle identification, detector simulation, Monte Carlo generation, just to name a few) there is a constant seek for scalable, performant, and production-quality operations of ML-enabled workflows. In addition, the scenario is complicated by the gap among HEP physicists and ML experts, caused by the specificity of some parts of the HEP typical workflows and solutions, and by the difficulty to formulate HEP problems in a way that match the skills of the Computer Science (CS) and ML community and hence its potential ability to step in and help. Among other factors, one of the technical obstacles resides in the difference of data-formats used by ML-practitioners and physicists, where the former use mostly flat-format data representations while the latter use to store data in tree-based objects via the ROOT data format. Another obstacle to further development of ML techniques in HEP resides in the difficulty to secure the adequate computing resources for training and inference of ML models, in a scalable and transparent way in terms of CPU vs GPU vs TPU vs other resources, as well as local vs cloud resources. This yields a technical barrier that prevents a relatively large portion of HEP physicists from fully accessing the potential of ML-enabled systems for scientific research. In order to close this gap, a Machine Learning as a Service for HEP (MLaaS4HEP) solution is presented as a product of R&D activities within the CMS experiment. It offers a service that is capable to directly read ROOT-based data, use the ML solution provided by the user, and ultimately serve predictions by pre-trained ML models “as a service” accessible via HTTP protocol. This solution can be used by physicists or experts outside of HEP domain and it provides access to local or remote data storage without requiring any modification or integration with the experiment specific framework. Moreover, MLaaS4HEP is built with a modular design allowing independent resource allocation that opens up a possibility to train ML models on PB-size datasets remotely accessible from the WLCG sites without physically downloading data into local storage. To prove the feasibility and utility of the MLaaS4HEP service with large datasets and thus be ready for the next future when an increase of data produced is expected, an exploration of different hardware resources is required. In particular, this work aims to provide the MLaaS4HEP service transparent access to heterogeneous resources, which opens up the usage of more powerful resources without requiring any effort from the user side during the access and use phase

    Cloud native approach for Machine Learning as a Service for High Energy Physics

    Get PDF
    Nowadays Machine Learning (ML) techniques are widely adopted in many areas of High Energy Physics (HEP) and certainly will play a significant role also in the upcoming High-Luminosity LHC (HL-LHC) upgrade foreseen at CERN. A huge amount of data will be produced by LHC and collected by the experiments, facing challenges at the exascale. Here, we present Machine Learning as a Service solution for HEP (MLaaS4HEP) to perform an entire ML pipeline (in terms of reading data, processing data, training ML models, serving predictions) in a completely model-agnostic fashion, directly using ROOT files of arbitrary size from local or distributed data sources. With the new version of MLaaS4HEP code based on uproot4, we provide new features to improve users’ experience with the framework and their workflows, e.g. users can provide some preprocessing operations to be applied to ROOT data before starting the ML pipeline. Then our approach is extended to use local and cloud resources via HTTP proxy which allows physicists to submit their workflows using the HTTP protocol. We discuss how this pipeline could be enabled in the INFN Cloud Provider and what could be the final architecture

    A multiwavelength study on the high-energy behaviour of Fermi/LAT pulsars

    Full text link
    Using archival as well as freshly acquired data, we assess the X-ray behaviour of the Fermi/LAT gamma-ray pulsars listed in the First Fermi source catalog. After revisiting the relationships between the pulsars' rotational energy losses and their X and gamma-ray luminosities, we focus on the distance-indipendent gamma to X-ray flux ratios. When plotting our Fgamma/Fx values as a function of the pulsars' rotational energy losses, one immediately sees that pulsars with similar energetics have Fgamma/Fx spanning 3 decades. Such spread, most probably stemming from vastly different geometrical configurations of the X and gamma-ray emitting regions, defies any straightforward interpretation of the plot. Indeed, while energetic pulsars do have low Fgamma/Fx values, little can be said for the bulk of the Fermi neutron stars. Dividing our pulsar sample into radio-loud and radio-quiet subsamples, we find that, on average, radio-quiet pulsars do have higher values of Fgamma/Fx, implying an intrinsec faintness of their X-ray emission and/or a different geometrical configuration. Moreover, despite the large spread mentioned above, statistical tests show a lower scatter in the radio-quiet dataset with respect to the radio-loud one, pointing to a somewhat more constrained geometry for the radio-quiet objects with respect to the radio-loud ones.Comment: 39 pages, 5 figures, 3 tables. To be published in Astrophysical Journa
    • …