Gaining insight from large data volumes with ease
Efficient handling of large data volumes has become a necessity in today's world, driven by the desire to extract more insight from data and to gain a better understanding of user trends that can be translated into economic benefits (profits, cost reduction, and various optimizations of data workflows and pipelines). In this paper, we discuss how modern technologies are transforming well-established patterns in HEP communities. New data insight can be achieved by embracing Big Data tools for a variety of use cases, from analytics and monitoring to training Machine Learning models at the terabyte scale. We provide concrete examples within the context of the CMS experiment, where Big Data tools already play, or soon will play, a significant role in daily operations.
A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing
Data Grids have been adopted as the platform for scientific communities that
need to share, access, transport, process and manage large data collections
distributed worldwide. They combine high-end computing technologies with
high-performance networking and wide-area storage management techniques. In
this paper, we discuss the key concepts behind Data Grids and compare them with
other data sharing and distribution paradigms such as content delivery
networks, peer-to-peer networks and distributed databases. We then provide
comprehensive taxonomies that cover various aspects of architecture, data
transportation, data replication and resource allocation and scheduling.
Finally, we map the proposed taxonomy to various Data Grid systems not only to
validate the taxonomy but also to identify areas for future exploration.
Through this taxonomy, we aim to categorise existing systems to better
understand their goals and their methodology. This would help evaluate their
applicability for solving similar problems. This taxonomy also provides a "gap
analysis" of this area through which researchers can potentially identify new
issues for investigation. We also hope that the proposed taxonomy and mapping provide an easy way for new practitioners to understand this complex area of research.
Comment: 46 pages, 16 figures, Technical Report
Storageless and caching Tier-2 models in the UK context
Operational and other pressures have led to WLCG experiments moving increasingly to a stratified model for Tier-2 resources, where ``fat" Tier-2s (``T2Ds") and ``thin" Tier-2s (``T2Cs") provide different levels of service.
In the UK, this distinction is also encouraged by the terms of the current GridPP5 funding model. In anticipation of this, testing has been performed on the implications, and potential implementation, of such a distinction in our resources.
In particular, we present the results of tests of T2C storage models, where the ``thin" nature is expressed by the site having either no local data storage or only a thin caching layer; data is streamed or copied from a ``nearby" T2D when needed by jobs.
In OSG, this model has been adopted successfully for CMS AAA sites, but the network topology and capacity in the USA are significantly different from those in the UK (and much of Europe).
We present the results of several operational tests: the in-production University College London (UCL) site, which runs ATLAS workloads using storage at the Queen Mary University of London (QMUL) site; the Oxford site, which has had scaling tests performed against T2Ds in various locations in the UK (to test network effects); and the Durham site, which has been testing the specific ATLAS caching solution of ``Rucio Cache" integration with ARC's caching layer.
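The cache-or-stream behaviour of a ``thin" T2C can be sketched as a simple decision: serve a job's input from the local cache if present, otherwise pull it from a nearby T2D and populate the cache. This is a minimal illustrative model only; the function and parameter names are hypothetical, and in production the transfer would go through an XRootD/Rucio stack rather than a callable.

```python
import os
import shutil

def fetch_for_job(lfn, cache_dir, stream_from_t2d):
    """Return a local path for `lfn`, using the thin cache when possible.

    `stream_from_t2d` is a callable that copies the file from a nearby
    T2D site and returns a local path -- a stand-in for a real remote read.
    """
    cached = os.path.join(cache_dir, os.path.basename(lfn))
    if os.path.exists(cached):           # cache hit: serve the local copy
        return cached
    fetched = stream_from_t2d(lfn)       # cache miss: pull from the T2D
    os.makedirs(cache_dir, exist_ok=True)
    shutil.copy(fetched, cached)         # populate the cache for later jobs
    return cached
```

A fully storageless site is the degenerate case where the cache is bypassed and every job streams its input remotely.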
Exploiting Big Data solutions for CMS computing operations analytics
Computing operations at the Large Hadron Collider (LHC) at CERN rely on the Worldwide LHC Computing Grid (WLCG) infrastructure, designed to efficiently allow storage, access, and processing of data at the pre-exascale level. A close and detailed study of the computing systems exploited for the LHC physics mission represents an increasingly crucial aspect of the roadmap of High Energy Physics (HEP) towards the exascale regime. In this context, the Compact Muon Solenoid (CMS) experiment has, over the last few years, collected and stored a large set of heterogeneous non-collision data (e.g. meta-data about replica placement, transfer operations, and actual user access to physics datasets). All of this data currently resides on a distributed Hadoop cluster, organized so that running fast, arbitrary queries with the Spark analytics framework is a viable approach for Big Data mining efforts. Using a data-driven approach oriented to the analysis of this meta-data deriving from several CMS computing services, such as DBS (Data Bookkeeping Service) and MCM (Monte Carlo Management system), we started to focus on data storage and data access over the WLCG infrastructure, and we drafted an embryonic software toolkit to investigate recurrent patterns and provide indicators about physics dataset popularity. As a long-term goal, this aims at contributing to the overall design of a predictive/adaptive system that would eventually reduce the cost and complexity of CMS computing operations, while taking into account the stringent requirements of the physics analysis community.
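The dataset-popularity indicator described above reduces, at its core, to counting accesses per dataset over the collected meta-data. A toy version of that aggregation, in plain Python, might look as follows; in production this would be a Spark query over the Hadoop-resident records, and the record layout here (a `dataset` field per access record) is an assumption for illustration.

```python
from collections import Counter

def dataset_popularity(access_records, top_n=3):
    """Rank datasets by number of user accesses.

    `access_records` is an iterable of dicts with a 'dataset' field,
    a simplified stand-in for the access meta-data collected from
    CMS computing services such as DBS.
    """
    counts = Counter(rec["dataset"] for rec in access_records)
    return counts.most_common(top_n)
```

A predictive placement system could consume such rankings to decide which datasets deserve more replicas and which can be retired to tape.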
Named Data Networking in Climate Research and HEP Applications
The Computing Models of the LHC experiments continue to evolve from the simple hierarchical MONARC[2] model towards more agile models where data is exchanged among many Tier2 and Tier3 sites, relying on both large-scale file transfers with strategic data placement and an increased use of remote access to object collections with caching, through CMS's AAA, ATLAS' FAX and ALICE's AliEn projects, for example. The challenges presented by expanding needs for CPU, storage and network capacity, as well as rapid handling of large datasets of file and object collections, have pointed the way towards future, more agile, pervasive models that make best use of highly distributed heterogeneous resources. In this paper, we explore the use of Named Data Networking (NDN), a new Internet architecture focusing on content rather than the location of the data collections. As NDN has shown considerable promise in another data-intensive field, Climate Science, we discuss the similarities and differences between the Climate and HEP use cases, along with specific issues HEP faces and will face during LHC Run2 and beyond, which NDN could address.
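The content-centric idea behind NDN is that requests (Interests) carry hierarchical names, and routers forward them by longest-prefix match on those names rather than on host addresses. A minimal sketch of that lookup, assuming a toy forwarding table rather than a real NDN stack, is:

```python
def longest_prefix_match(fib, name):
    """Forward an Interest by longest-prefix match on its hierarchical name.

    `fib` maps name prefixes (tuples of components) to next hops; this is
    a toy model of an NDN Forwarding Information Base, not a real router.
    """
    components = tuple(name.strip("/").split("/"))
    for i in range(len(components), 0, -1):   # try the longest prefix first
        hop = fib.get(components[:i])
        if hop is not None:
            return hop
    return None                               # no route for this name
```

Because the match is on the data's name, any replica or in-network cache along the path can satisfy the request, which is what makes the model attractive for widely replicated HEP and climate datasets.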
Designing Computing System Architecture and Models for the HL-LHC era
This paper describes a programme to study the computing model in CMS after the next long shutdown near the end of the decade.
Comment: Submitted to proceedings of the 21st International Conference on Computing in High Energy and Nuclear Physics (CHEP2015), Okinawa, Japan
Coordinated Caching for High Performance Calibration using Z -> µµ Events of the CMS Experiment
Calibration of the detectors is a prerequisite for almost all physics analyses conducted at the LHC experiments. As such, both speed and precision are critical. As part of this thesis, a high-performance analysis infrastructure using coordinated caching has been developed. It has been used to conduct the first calibration of jets using Z -> µµ events recorded during the second LHC run at the CMS experiment.
The Future of Networking is the Future of Big Data
Summer 2019. Includes bibliographical references.
Scientific domains such as Climate Science, High Energy Particle Physics (HEP), Genomics, Biology, and many others are increasingly moving towards data-oriented workflows, where each of these communities generates, stores and uses massive datasets that reach into terabytes and petabytes, and are projected soon to reach exabytes. These communities are also increasingly moving towards a global collaborative model in which scientists routinely exchange significant amounts of data. The sheer volume of data, and the complexities associated with maintaining, transferring, and using it, continue to push the limits of current technologies in multiple dimensions: storage, analysis, networking, and security. This thesis tackles the networking aspect of big-data science. Networking is the glue that binds all the components of modern scientific workflows, and these communities are becoming increasingly dependent on high-speed, highly reliable networks. The network, as the common layer across big-science communities, provides an ideal place for implementing common services. Big-science applications also need to work closely with the network to ensure optimal usage of resources and intelligent routing of requests and data. Finally, as more communities move towards data-intensive, connected workflows, adopting a service model in which the network provides some of the common services reduces not only application complexity but also the need for duplicate implementations. Named Data Networking (NDN) is a new network architecture whose service model aligns better with the needs of these data-oriented applications. NDN's name-based paradigm makes it easier to provide intelligent features at the network layer rather than at the application layer. This thesis shows that NDN can push several standard features to the network.
This work is the first attempt to apply NDN in the context of large scientific data; in the process, this thesis touches upon scientific data naming, name discovery, real-world deployment of NDN for scientific data, feasibility studies, and the design of in-network protocols for big-data science.
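Scientific data naming, the first item the thesis touches on, amounts to mapping a dataset's catalog meta-data onto a hierarchical NDN name and back. The sketch below shows one such reversible mapping; the field names and component order are illustrative assumptions, not the thesis's actual naming convention.

```python
def to_ndn_name(fields, order):
    """Map dataset meta-data to a hierarchical NDN name.

    `order` fixes the component hierarchy (e.g. domain/experiment/tier);
    each meta-data field becomes one name component, most general first.
    """
    return "/" + "/".join(str(fields[key]) for key in order)

def parse_ndn_name(name, order):
    """Invert `to_ndn_name`: recover the meta-data fields from the name."""
    parts = name.strip("/").split("/")
    return dict(zip(order, parts))
```

Fixing the component order is what makes name discovery possible: a consumer that knows the schema can enumerate or prefix-query names without a separate catalog lookup.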