Gaining insight from large data volumes with ease
Efficient handling of large data volumes has become a necessity in today's world, driven by the desire to extract more insight from data and to gain a better understanding of user trends that can be translated into economic benefits (profits, cost reduction, and various optimizations of data workflows and pipelines). In this paper, we discuss how modern technologies are transforming well-established patterns in HEP communities. New data insight can be achieved by embracing Big Data tools for a variety of use cases, from analytics and monitoring to training Machine Learning models at the terabyte scale. We provide concrete examples within the context of the CMS experiment, where Big Data tools already play, or soon will play, a significant role in daily operations.
A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing
Data Grids have been adopted as the platform for scientific communities that
need to share, access, transport, process and manage large data collections
distributed worldwide. They combine high-end computing technologies with
high-performance networking and wide-area storage management techniques. In
this paper, we discuss the key concepts behind Data Grids and compare them with
other data sharing and distribution paradigms such as content delivery
networks, peer-to-peer networks and distributed databases. We then provide
comprehensive taxonomies that cover various aspects of architecture, data
transportation, data replication and resource allocation and scheduling.
Finally, we map the proposed taxonomy to various Data Grid systems not only to
validate the taxonomy but also to identify areas for future exploration.
Through this taxonomy, we aim to categorise existing systems to better
understand their goals and their methodology. This would help evaluate their
applicability for solving similar problems. This taxonomy also provides a "gap
analysis" of this area through which researchers can potentially identify new
issues for investigation. We also hope that the proposed taxonomy and mapping provide an easy way for new practitioners to understand this complex area of research.
Comment: 46 pages, 16 figures, Technical Report
Storageless and caching Tier-2 models in the UK context
Operational and other pressures have led to WLCG experiments moving increasingly to a stratified model for Tier-2 resources, where ``fat" Tier-2s (``T2Ds") and ``thin" Tier-2s (``T2Cs") provide different levels of service.
In the UK, this distinction is also encouraged by the terms of the current GridPP5 funding model. In anticipation of this, testing has been performed on the implications, and potential implementation, of such a distinction in our resources.
In particular, we present the results of tests of T2C storage models, where the ``thin" nature is expressed by the site having either no local data storage or only a thin caching layer; data is streamed or copied from a ``nearby" T2D when needed by jobs.
In OSG, this model has been adopted successfully for CMS AAA sites, but the network topology and capacity in the USA are significantly different from those in the UK (and much of Europe).
We present the results of several operational tests: the in-production University College London (UCL) site, which runs ATLAS workloads using storage at the Queen Mary University of London (QMUL) site; the Oxford site, which has had scaling tests performed against T2Ds in various locations in the UK (to test network effects); and the Durham site, which has been testing the specific ATLAS caching solution of ``Rucio Cache" integration with ARC's caching layer.
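The cache-or-stream behaviour of a ``thin" T2C can be sketched as a simple decision: serve a job's input from the local cache if present, otherwise pull it from a nearby T2D and populate the cache. This is a minimal illustrative model only; the function and parameter names are hypothetical, and in production the transfer would go through an XRootD/Rucio stack rather than a callable.

```python
import os
import shutil

def fetch_for_job(lfn, cache_dir, stream_from_t2d):
    """Return a local path for `lfn`, using the thin cache when possible.

    `stream_from_t2d` is a callable that copies the file from a nearby
    T2D site and returns a local path -- a stand-in for a real remote read.
    """
    cached = os.path.join(cache_dir, os.path.basename(lfn))
    if os.path.exists(cached):           # cache hit: serve the local copy
        return cached
    fetched = stream_from_t2d(lfn)       # cache miss: pull from the T2D
    os.makedirs(cache_dir, exist_ok=True)
    shutil.copy(fetched, cached)         # populate the cache for later jobs
    return cached
```

A fully storageless site is the degenerate case where the cache is bypassed and every job streams its input remotely.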
Exploiting Big Data solutions for CMS computing operations analytics
Computing operations at the Large Hadron Collider (LHC) at CERN rely on the Worldwide LHC Computing Grid (WLCG) infrastructure, designed to efficiently allow storage, access, and processing of data at the pre-exascale level. A close and detailed study of the computing systems exploited for the LHC physics mission represents an increasingly crucial aspect of the roadmap of High Energy Physics (HEP) towards the exascale regime. In this context, the Compact Muon Solenoid (CMS) experiment has, over the last few years, collected and stored a large set of heterogeneous non-collision data (e.g. meta-data about replica placement, transfer operations, and actual user access to physics datasets). All of this data currently resides on a distributed Hadoop cluster, organized so that running fast, arbitrary queries with the Spark analytics framework is a viable approach for Big Data mining efforts. Using a data-driven approach oriented to the analysis of this meta-data deriving from several CMS computing services, such as DBS (Data Bookkeeping Service) and MCM (Monte Carlo Management system), we started to focus on data storage and data access over the WLCG infrastructure, and we drafted an embryonic software toolkit to investigate recurrent patterns and provide indicators about physics dataset popularity. As a long-term goal, this aims at contributing to the overall design of a predictive/adaptive system that would eventually reduce the cost and complexity of CMS computing operations, while taking into account the stringent requirements of the physics analysis community.
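The dataset-popularity indicator described above reduces, at its core, to counting accesses per dataset over the collected meta-data. A toy version of that aggregation, in plain Python, might look as follows; in production this would be a Spark query over the Hadoop-resident records, and the record layout here (a `dataset` field per access record) is an assumption for illustration.

```python
from collections import Counter

def dataset_popularity(access_records, top_n=3):
    """Rank datasets by number of user accesses.

    `access_records` is an iterable of dicts with a 'dataset' field,
    a simplified stand-in for the access meta-data collected from
    CMS computing services such as DBS.
    """
    counts = Counter(rec["dataset"] for rec in access_records)
    return counts.most_common(top_n)
```

A predictive placement system could consume such rankings to decide which datasets deserve more replicas and which can be retired to tape.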
Named Data Networking in Climate Research and HEP Applications
The Computing Models of the LHC experiments continue to evolve from the simple hierarchical MONARC[2] model towards more agile models where data is exchanged among many Tier2 and Tier3 sites, relying on both large-scale file transfers with strategic data placement and an increased use of remote access to object collections with caching, through CMS's AAA, ATLAS' FAX and ALICE's AliEn projects, for example. The challenges presented by expanding needs for CPU, storage and network capacity, as well as rapid handling of large datasets of file and object collections, have pointed the way towards future, more agile, pervasive models that make best use of highly distributed heterogeneous resources. In this paper, we explore the use of Named Data Networking (NDN), a new Internet architecture focusing on content rather than the location of the data collections. As NDN has shown considerable promise in another data-intensive field, Climate Science, we discuss the similarities and differences between the Climate and HEP use cases, along with specific issues HEP faces and will face during LHC Run2 and beyond, which NDN could address.
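The content-centric idea behind NDN is that requests (Interests) carry hierarchical names, and routers forward them by longest-prefix match on those names rather than on host addresses. A minimal sketch of that lookup, assuming a toy forwarding table rather than a real NDN stack, is:

```python
def longest_prefix_match(fib, name):
    """Forward an Interest by longest-prefix match on its hierarchical name.

    `fib` maps name prefixes (tuples of components) to next hops; this is
    a toy model of an NDN Forwarding Information Base, not a real router.
    """
    components = tuple(name.strip("/").split("/"))
    for i in range(len(components), 0, -1):   # try the longest prefix first
        hop = fib.get(components[:i])
        if hop is not None:
            return hop
    return None                               # no route for this name
```

Because the match is on the data's name, any replica or in-network cache along the path can satisfy the request, which is what makes the model attractive for widely replicated HEP and climate datasets.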
Designing Computing System Architecture and Models for the HL-LHC era
This paper describes a programme to study the computing model in CMS after the next long shutdown near the end of the decade.
Comment: Submitted to proceedings of the 21st International Conference on Computing in High Energy and Nuclear Physics (CHEP2015), Okinawa, Japan
Coordinated Caching for High Performance Calibration using Z -> µµ Events of the CMS Experiment
Calibration of the detectors is a prerequisite for almost all physics analyses conducted at the LHC experiments. As such, both speed and precision are critical. As part of this thesis, a high-performance analysis infrastructure using coordinated caching has been developed. It has been used to conduct the first calibration of jets using Z -> µµ events recorded during the second LHC run at the CMS experiment.
The Future of Networking is the Future of Big Data
Summer 2019. Includes bibliographical references.
Scientific domains such as Climate Science, High Energy Particle Physics (HEP), Genomics, Biology, and many others are increasingly moving towards data-oriented workflows, where each of these communities generates, stores and uses massive datasets that reach into terabytes and petabytes, and are projected soon to reach exabytes. These communities are also increasingly moving towards a global collaborative model in which scientists routinely exchange significant amounts of data. The sheer volume of data, and the complexities associated with maintaining, transferring, and using it, continue to push the limits of current technologies in multiple dimensions: storage, analysis, networking, and security. This thesis tackles the networking aspect of big-data science. Networking is the glue that binds all the components of modern scientific workflows, and these communities are becoming increasingly dependent on high-speed, highly reliable networks. The network, as the common layer across big-science communities, provides an ideal place for implementing common services. Big-science applications also need to work closely with the network to ensure optimal usage of resources and intelligent routing of requests and data. Finally, as more communities move towards data-intensive, connected workflows, adopting a service model in which the network provides some of the common services reduces not only application complexity but also the need for duplicate implementations. Named Data Networking (NDN) is a new network architecture whose service model aligns better with the needs of these data-oriented applications. NDN's name-based paradigm makes it easier to provide intelligent features at the network layer rather than at the application layer. This thesis shows that NDN can push several standard features to the network.
This work is the first attempt to apply NDN in the context of large scientific data; in the process, this thesis touches upon scientific data naming, name discovery, real-world deployment of NDN for scientific data, feasibility studies, and the design of in-network protocols for big-data science.
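Scientific data naming, the first item the thesis touches on, amounts to mapping a dataset's catalog meta-data onto a hierarchical NDN name and back. The sketch below shows one such reversible mapping; the field names and component order are illustrative assumptions, not the thesis's actual naming convention.

```python
def to_ndn_name(fields, order):
    """Map dataset meta-data to a hierarchical NDN name.

    `order` fixes the component hierarchy (e.g. domain/experiment/tier);
    each meta-data field becomes one name component, most general first.
    """
    return "/" + "/".join(str(fields[key]) for key in order)

def parse_ndn_name(name, order):
    """Invert `to_ndn_name`: recover the meta-data fields from the name."""
    parts = name.strip("/").split("/")
    return dict(zip(order, parts))
```

Fixing the component order is what makes name discovery possible: a consumer that knows the schema can enumerate or prefix-query names without a separate catalog lookup.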