Disk storage management for LHCb based on Data Popularity estimator
This paper presents an algorithm providing recommendations for optimizing the
LHCb data storage. The LHCb data storage system is a hybrid system. All
datasets are kept as archives on magnetic tapes. The most popular datasets are
kept on disks. The algorithm takes the dataset usage history and metadata
(size, type, configuration, etc.) to generate a recommendation report. This
article presents how we use machine learning algorithms to predict future data
popularity. Using these predictions it is possible to estimate which datasets
should be removed from disk. We use regression algorithms and time series
analysis to find the optimal number of replicas for datasets that are kept on
disk. Based on the data popularity and the number of replicas optimization, the
algorithm minimizes a loss function to find the optimal data distribution. The
loss function represents all requirements for data distribution in the data
storage system. We demonstrate how our algorithm helps to save disk space and
to reduce waiting times for jobs using these data.
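The two-stage procedure described above (predict future popularity, then pick the replica count that minimizes a loss balancing job waiting time against disk cost) can be sketched as follows. This is a minimal toy illustration, not the paper's actual models: the linear trend fit stands in for the regression/time-series predictors, and the loss weights `w_latency` and `w_disk` are invented parameters.

```python
import numpy as np

def predict_popularity(access_history):
    """Predict next-period accesses from past access counts.
    A least-squares linear trend is a stand-in for the paper's
    regression and time-series models."""
    t = np.arange(len(access_history))
    slope, intercept = np.polyfit(t, access_history, 1)
    return max(0.0, slope * len(access_history) + intercept)

def optimal_replicas(predicted_accesses, size_gb, max_replicas=4,
                     w_latency=1.0, w_disk=0.1):
    """Minimize a toy loss over integer replica counts:
    waiting-time cost falls with more replicas, disk cost grows
    linearly. Weights are illustrative assumptions."""
    best_r, best_loss = 1, float("inf")
    for r in range(1, max_replicas + 1):
        loss = w_latency * predicted_accesses / r + w_disk * size_gb * r
        if loss < best_loss:
            best_r, best_loss = r, loss
    return best_r
```

A dataset with a rising access trend would get more replicas, while one predicted to be unused falls back to a single disk replica (the tape archive copy always remains, per the storage model above).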
GRID Storage Optimization in Transparent and User-Friendly Way for LHCb Datasets
The LHCb collaboration is one of the four major experiments at the Large
Hadron Collider at CERN. Many petabytes of data are produced by the detectors
and Monte-Carlo simulations. The LHCb Grid interware LHCbDIRAC is used to make
data available to all collaboration members around the world. The data is
replicated to the Grid sites in different locations. However, the Grid disk
storage is limited and does not allow keeping replicas of each file at all
sites. Thus it is essential to optimize the number of replicas to achieve
better Grid performance.
In this study, we present a new approach to data replication and distribution
strategy based on data popularity prediction. The popularity prediction is
based on the data access history and metadata, and uses machine learning
techniques and time series analysis methods.
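The inputs named above (per-dataset access history plus metadata) have to be turned into fixed-length feature vectors before any machine learning model can consume them. The sketch below shows one plausible encoding; the specific features (exponentially decayed access counts, weeks since last access, total accesses, size, type) are illustrative assumptions, not the feature set used by LHCbDIRAC.

```python
import numpy as np

def popularity_features(weekly_accesses, size_gb, file_type_id,
                        half_life_weeks=4.0):
    """Build one feature vector per dataset from its access history
    and metadata. Recent weeks are weighted more heavily via an
    exponential decay with the given half-life (an assumed choice)."""
    a = np.asarray(weekly_accesses, dtype=float)
    weeks_ago = np.arange(len(a) - 1, -1, -1)
    decay = 0.5 ** (weeks_ago / half_life_weeks)  # recent weeks count more
    weighted = float(np.dot(a, decay))
    nonzero = np.nonzero(a)[0]
    # Weeks since the last access; full window length if never accessed.
    recency = len(a) - 1 - nonzero[-1] if nonzero.size else len(a)
    return np.array([weighted, recency, a.sum(), size_gb, file_type_id])
```

Vectors like these could then feed any standard classifier or regressor to score how likely each dataset is to be accessed again, which is the quantity the replication strategy needs.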
Machine learning at the energy and intensity frontiers of particle physics
Our knowledge of the fundamental particles of nature and their interactions is summarized by the standard model of particle physics. Advancing our understanding in this field has required experiments that operate at ever higher energies and intensities, which produce extremely large and information-rich data samples. The use of machine-learning techniques is revolutionizing how we interpret these data samples, greatly increasing the discovery potential of present and future experiments. Here we summarize the challenges and opportunities that come with the use of machine learning at the frontiers of particle physics.