Extending DIRAC File Management with Erasure-Coding for efficient storage
The state of the art in Grid style data management is to achieve increased
resilience of data via multiple complete replicas of data files across multiple
storage endpoints. While this is effective, it is not the most space-efficient
approach to resilience, especially when the reliability of individual storage
endpoints is sufficiently high that only a few will be inactive at any point in
time. We report on work performed as part of GridPP\cite{GridPP}, extending the
Dirac File Catalogue and file management interface to allow the placement of
erasure-coded files: each file distributed as N identically-sized chunks of
data striped across a vector of storage endpoints, encoded such that any M
chunks can be lost and the original file can be reconstructed. The tools
developed are transparent to the user and, as well as allowing upload and
download of data to and from Grid storage, also provide the possibility of
parallelising access across all of the distributed chunks at once, improving
data transfer and I/O performance. We expect this approach to be of most
interest to smaller VOs, who have tighter bounds on the storage available to
them, but larger (WLCG) VOs may be interested as their total data increases
during Run 2. We provide an analysis of the costs and benefits of the approach,
along with future development and implementation plans in this area. In
general, overheads for multiple file transfers provide the largest issue for
competitiveness of this approach at present.
Comment: 21st International Conference on Computing for High Energy and
Nuclear Physics (CHEP2015)
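To make the chunking-and-parity idea above concrete, the sketch below implements the simplest erasure code: N data chunks plus a single XOR parity chunk, i.e. the M = 1 case. The DIRAC extension described in the paper would use a more general code so that any M of the chunks can be lost; every name here is illustrative and not taken from the DIRAC tools.

```python
# Minimal single-parity erasure coding (M = 1): any one of the N+1 chunks can
# be lost and rebuilt by XOR-ing the survivors.  Purely illustrative; the real
# system would use a Reed-Solomon-style code for general M.

def encode_single_parity(data: bytes, n_data: int):
    """Split `data` into n_data equal chunks plus one XOR parity chunk."""
    chunk_len = -(-len(data) // n_data)                  # ceiling division
    padded = data.ljust(n_data * chunk_len, b"\0")       # pad to a multiple
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(n_data)]
    parity = chunks[0]
    for c in chunks[1:]:
        parity = bytes(x ^ y for x, y in zip(parity, c))
    return chunks + [parity]                             # N data + 1 parity

def reconstruct(chunks, missing_index: int):
    """Rebuild the one missing chunk by XOR-ing all surviving chunks."""
    survivors = [c for i, c in enumerate(chunks) if i != missing_index]
    rebuilt = survivors[0]
    for c in survivors[1:]:
        rebuilt = bytes(x ^ y for x, y in zip(rebuilt, c))
    return rebuilt

# Example: 4 data chunks striped across 5 endpoints; endpoint 2 is lost.
original = b"example HEP dataset payload"
striped = encode_single_parity(original, n_data=4)
striped[2] = reconstruct(striped, missing_index=2)       # recover lost chunk
assert b"".join(striped[:4]).rstrip(b"\0") == original
```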
MLaaS4HEP: Machine Learning as a Service for HEP
Machine Learning (ML) will play a significant role in the success of the
upcoming High-Luminosity LHC (HL-LHC) program at CERN. An unprecedented amount
of data at the exascale will be collected by LHC experiments in the next
decade, and this effort will require novel approaches to train and use ML
models. In this paper, we discuss a Machine Learning as a Service pipeline for
HEP (MLaaS4HEP) which provides three independent layers: a data streaming layer
to read High-Energy Physics (HEP) data in their native ROOT data format; a data
training layer to train ML models using distributed ROOT files; a data
inference layer to serve predictions using pre-trained ML models via the HTTP
protocol. Such a modular design opens up the possibility of training ML models at large
scale by reading ROOT files from remote storage facilities, e.g. the Worldwide LHC
Computing Grid (WLCG) infrastructure, and feeding the data to the user's favorite
ML framework. The inference layer, implemented as TensorFlow as a Service
(TFaaS), may provide easy access to pre-trained ML models in existing
infrastructure and applications inside or outside of the HEP domain. In
particular, we demonstrate the usage of the MLaaS4HEP architecture for a
physics use-case, namely the Higgs analysis in CMS originally
performed using custom-made Ntuples. We provide details on the training of the
ML model using distributed ROOT files, discuss the performance of the MLaaS and
TFaaS approaches for the selected physics analysis, and compare the results
with traditional methods.
Comment: 16 pages, 10 figures, 2 tables, submitted to Computing and Software
for Big Science. arXiv admin note: text overlap with arXiv:1811.0449
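As a rough illustration of the three layers described above, the sketch below streams branches from a (possibly remote) ROOT file with uproot, builds a numpy batch, and posts it to an HTTP model server. The file URL, tree and branch names, and the /predict endpoint are hypothetical and do not necessarily reflect the actual MLaaS4HEP or TFaaS interfaces.

```python
# Sketch of the streaming -> training/inference flow; all names below are
# placeholders, not the real MLaaS4HEP/TFaaS API.
import numpy as np
import requests
import uproot  # reads ROOT files natively; remote xrootd URLs need an xrootd backend

# 1) Data streaming layer: read a few branches from a ROOT file.
events = uproot.open("root://eospublic.cern.ch//eos/example/higgs.root")["Events"]
batch = events.arrays(["lepton_pt", "lepton_eta", "missing_et"], library="np")
features = np.stack([batch[name] for name in batch], axis=1)

# 2) Data training layer: chunks like `features` would be fed to the user's
#    favourite ML framework; only one batch is prepared here.

# 3) Data inference layer: POST the batch to an HTTP model server and read
#    back per-event predictions.
response = requests.post("http://tfaas.example.org/predict",
                         json={"inputs": features.tolist()})
predictions = np.asarray(response.json()["predictions"])
print(predictions.shape)
```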
Any Data, Any Time, Anywhere: Global Data Access for Science
Data access is key to science driven by distributed high-throughput computing
(DHTC), an essential technology for many major research projects such as High
Energy Physics (HEP) experiments. However, achieving efficient data access
becomes quite difficult when many independent storage sites are involved
because users are burdened with learning the intricacies of accessing each
system and keeping careful track of data location. We present an alternate
approach: the Any Data, Any Time, Anywhere (AAA) infrastructure. Combining several
existing software products, AAA presents a global, unified view of storage
systems: a "data federation," a global filesystem for software delivery, and a
workflow management system. We present how one HEP experiment, the Compact Muon
Solenoid (CMS), is utilizing the AAA infrastructure and some simple performance
metrics.
Comment: 9 pages, 6 figures, submitted to 2nd IEEE/ACM International Symposium
on Big Data Computing (BDC) 201
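The sketch below shows what location-transparent access through such a data federation looks like from the client side, assuming the XRootD Python bindings are available: the client asks a global redirector for a logical file name and is redirected to whichever site holds a replica. The redirector hostname and file path are placeholders, not values from the paper.

```python
# Location-transparent read through a federation redirector (hostname and
# logical file name are placeholders).
from XRootD import client

f = client.File()
status, _ = f.open("root://xrootd-redirector.example.org//store/data/example.root")
if not status.ok:
    raise RuntimeError(f"open failed: {status.message}")

# Read the first kilobyte; the redirection to the hosting site is invisible
# to the application, which only ever sees the logical name.
status, data = f.read(offset=0, size=1024)
print(len(data), "bytes read")
f.close()
```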
MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface
Application development for distributed computing "Grids" can benefit from
tools that variously hide or enable application-level management of critical
aspects of the heterogeneous environment. As part of an investigation of these
issues, we have developed MPICH-G2, a Grid-enabled implementation of the
Message Passing Interface (MPI) that allows a user to run MPI programs across
multiple computers, at the same or different sites, using the same commands
that would be used on a parallel computer. This library extends the Argonne
MPICH implementation of MPI to use services provided by the Globus Toolkit for
authentication, authorization, resource allocation, executable staging, and
I/O, as well as for process creation, monitoring, and control. Various
performance-critical operations, including startup and collective operations,
are configured to exploit network topology information. The library also
exploits MPI constructs for performance management; for example, the MPI
communicator construct is used for application-level discovery of, and
adaptation to, both network topology and network quality-of-service mechanisms.
We describe the MPICH-G2 design and implementation, present performance
results, and review application experiences, including record-setting
distributed simulations.
Comment: 20 pages, 8 figures
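A minimal sketch of the communicator-based topology adaptation mentioned above, written with mpi4py for brevity: ranks that share a site form a sub-communicator, so the first stage of a collective stays on the fast local network before a second stage runs across site leaders. The SITE_NAME environment variable is a stand-in for whatever topology information the library actually exposes; it is not part of MPICH-G2 itself.

```python
# Two-stage reduction using communicator splits keyed on a per-site label.
import os
import zlib
from mpi4py import MPI

world = MPI.COMM_WORLD
site = os.environ.get("SITE_NAME", "unknown-site")  # hypothetical topology hint

# All ranks with the same site label end up in the same sub-communicator.
# crc32 gives every process the same deterministic, non-negative color.
site_comm = world.Split(color=zlib.crc32(site.encode()) % (2**30),
                        key=world.Get_rank())

# Stage 1: reduce inside each site over the local network.
local_sum = site_comm.allreduce(world.Get_rank(), op=MPI.SUM)

# Stage 2: one leader per site reduces across the wide-area network.
leaders = world.Split(color=0 if site_comm.Get_rank() == 0 else MPI.UNDEFINED,
                      key=world.Get_rank())
if leaders != MPI.COMM_NULL:
    global_sum = leaders.allreduce(local_sum, op=MPI.SUM)
    print(f"site {site}: global sum of ranks = {global_sum}")
```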
CMS Monte Carlo production in the WLCG computing Grid
Monte Carlo production in CMS has received a major boost in performance and
scale since the CHEP06 conference. The production system has been re-engineered in order
to incorporate the experience gained in running the previous system and to integrate production
with the new CMS event data model, data management system and data processing framework.
The system is interfaced to the two major computing Grids used by CMS, the LHC Computing
Grid (LCG) and the Open Science Grid (OSG).
Operational experience and integration aspects of the new CMS Monte Carlo production
system are presented together with an analysis of production statistics. The new system
automatically handles job submission, resource monitoring, job queuing, job distribution
according to the available resources, data merging, registration of data into the data
bookkeeping, data location, data transfer and placement systems. Compared to the previous
production system, automation, reliability and performance have been considerably improved. A
more efficient use of computing resources and better handling of the inherent Grid unreliability
have resulted in an increase in production scale by about an order of magnitude: the system is
capable of running on the order of ten thousand jobs in parallel and yielding more than two
million events per day.
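Purely as a schematic sketch (not the actual CMS production code), the snippet below illustrates the kind of control loop the abstract describes: jobs are distributed according to the slots currently free at each site, with merging and registration following once outputs accumulate. Every class and method name here is hypothetical.

```python
# Hypothetical production-cycle loop: distribute jobs to the sites with the
# most free slots, then hand outputs to merging/registration steps.
from dataclasses import dataclass, field

@dataclass
class GridSite:
    name: str
    free_slots: int

@dataclass
class ProductionRequest:
    dataset: str
    jobs_remaining: int
    outputs: list = field(default_factory=list)

def run_production_cycle(request: ProductionRequest, sites: list) -> ProductionRequest:
    # Distribute jobs according to the resources currently available.
    for site in sorted(sites, key=lambda s: s.free_slots, reverse=True):
        n = min(site.free_slots, request.jobs_remaining)
        request.outputs.extend(f"{request.dataset}@{site.name}" for _ in range(n))
        site.free_slots -= n
        request.jobs_remaining -= n
        if request.jobs_remaining == 0:
            break
    # Merging and registration into the bookkeeping / data-location systems
    # would follow here once enough output files have accumulated.
    return request

req = run_production_cycle(ProductionRequest("MinBias", jobs_remaining=5),
                           [GridSite("T2_LCG_A", 3), GridSite("T2_OSG_B", 4)])
print(req.outputs)
```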