Disk storage management for LHCb based on Data Popularity estimator
This paper presents an algorithm providing recommendations for optimizing the
LHCb data storage. The LHCb data storage system is a hybrid system. All
datasets are kept as archives on magnetic tapes. The most popular datasets are
kept on disks. The algorithm takes the dataset usage history and metadata
(size, type, configuration, etc.) to generate a recommendation report. This
article presents how we use machine learning algorithms to predict future data
popularity. Using these predictions it is possible to estimate which datasets
should be removed from disk. We use regression algorithms and time series
analysis to find the optimal number of replicas for datasets that are kept on
disk. Based on the data popularity and the number of replicas optimization, the
algorithm minimizes a loss function to find the optimal data distribution. The
loss function represents all requirements for data distribution in the data
storage system. We demonstrate how our algorithm helps to save disk space and
to reduce waiting times for jobs using these data.
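The two-stage procedure described above (predict future popularity, then pick the replica count that minimizes a loss balancing job waiting time against disk cost) can be sketched as follows. This is a minimal toy illustration, not the paper's actual models: the linear trend fit stands in for the regression/time-series predictors, and the loss weights `w_latency` and `w_disk` are invented parameters.

```python
import numpy as np

def predict_popularity(access_history):
    """Predict next-period accesses from past access counts.
    A least-squares linear trend is a stand-in for the paper's
    regression and time-series models."""
    t = np.arange(len(access_history))
    slope, intercept = np.polyfit(t, access_history, 1)
    return max(0.0, slope * len(access_history) + intercept)

def optimal_replicas(predicted_accesses, size_gb, max_replicas=4,
                     w_latency=1.0, w_disk=0.1):
    """Minimize a toy loss over integer replica counts:
    waiting-time cost falls with more replicas, disk cost grows
    linearly. Weights are illustrative assumptions."""
    best_r, best_loss = 1, float("inf")
    for r in range(1, max_replicas + 1):
        loss = w_latency * predicted_accesses / r + w_disk * size_gb * r
        if loss < best_loss:
            best_r, best_loss = r, loss
    return best_r
```

A dataset with a rising access trend would get more replicas, while one predicted to be unused falls back to a single disk replica (the tape archive copy always remains, per the storage model above).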
GRID Storage Optimization in Transparent and User-Friendly Way for LHCb Datasets
The LHCb collaboration is one of the four major experiments at the Large
Hadron Collider at CERN. Many petabytes of data are produced by the detectors
and Monte-Carlo simulations. The LHCb Grid interware LHCbDIRAC is used to make
data available to all collaboration members around the world. The data is
replicated to the Grid sites in different locations. However, the Grid disk
storage is limited and does not allow keeping replicas of each file at all
sites. Thus it is essential to optimize the number of replicas to achieve
better Grid performance.
In this study, we present a new approach to data replication and distribution
strategy based on data popularity prediction. The popularity prediction is
based on the data access history and metadata, and uses machine learning
techniques and time series analysis methods.
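The inputs named above (per-dataset access history plus metadata) have to be turned into fixed-length feature vectors before any machine learning model can consume them. The sketch below shows one plausible encoding; the specific features (exponentially decayed access counts, weeks since last access, total accesses, size, type) are illustrative assumptions, not the feature set used by LHCbDIRAC.

```python
import numpy as np

def popularity_features(weekly_accesses, size_gb, file_type_id,
                        half_life_weeks=4.0):
    """Build one feature vector per dataset from its access history
    and metadata. Recent weeks are weighted more heavily via an
    exponential decay with the given half-life (an assumed choice)."""
    a = np.asarray(weekly_accesses, dtype=float)
    weeks_ago = np.arange(len(a) - 1, -1, -1)
    decay = 0.5 ** (weeks_ago / half_life_weeks)  # recent weeks count more
    weighted = float(np.dot(a, decay))
    nonzero = np.nonzero(a)[0]
    # Weeks since the last access; full window length if never accessed.
    recency = len(a) - 1 - nonzero[-1] if nonzero.size else len(a)
    return np.array([weighted, recency, a.sum(), size_gb, file_type_id])
```

Vectors like these could then feed any standard classifier or regressor to score how likely each dataset is to be accessed again, which is the quantity the replication strategy needs.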
Machine learning at the energy and intensity frontiers of particle physics
Our knowledge of the fundamental particles of nature and their interactions is summarized by the standard model of particle physics. Advancing our understanding in this field has required experiments that operate at ever higher energies and intensities, which produce extremely large and information-rich data samples. The use of machine-learning techniques is revolutionizing how we interpret these data samples, greatly increasing the discovery potential of present and future experiments. Here we summarize the challenges and opportunities that come with the use of machine learning at the frontiers of particle physics.