Using Big Data Technologies for HEP Analysis
The HEP community is approaching an era where the excellent performance of the particle accelerators in delivering collisions at high rates will force the experiments to record very large amounts of information. The growing size of the datasets could become a limiting factor in the ability to produce scientific results in a timely and efficient manner. Recently, new technologies and new approaches have been developed in industry to answer the need to retrieve information as quickly as possible when analyzing petabyte- and exabyte-scale datasets. Providing scientists with these modern computing tools will lead to a rethinking of the principles of data analysis in HEP, making the overall scientific process faster and smoother.
In this paper, we present the latest developments and the most recent results on the use of Apache Spark for HEP analysis. The study aims at evaluating the efficiency of the new tools both quantitatively, by measuring performance, and qualitatively, by focusing on the user experience. The first goal is achieved by developing a data reduction facility: working together with CERN openlab and Intel, CMS replicates a real physics search using Spark-based technologies, with the ambition of reducing 1 PB of public data collected by the CMS experiment to 1 TB of data in a format suitable for physics analysis, within 5 hours.
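To make the described reduction step concrete, here is a minimal PySpark sketch of such a skim. It is an illustration only: the input path, column names, and selection cuts are assumptions, not the actual CMS configuration.

```python
# Hypothetical sketch of a Spark-based data reduction step: read a large
# event dataset, apply a loose physics selection, keep only the columns
# needed downstream, and write a compact columnar output.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hep-reduction-sketch").getOrCreate()

events = spark.read.parquet("/data/cms/open/events.parquet")  # assumed input

reduced = (
    events
    .filter(F.col("nMuons") >= 2)      # keep dimuon candidates only (assumed cut)
    .filter(F.col("MET_pt") > 50.0)    # example missing-energy requirement
    .select("run", "event", "Muon_pt", "Muon_eta", "Muon_phi", "MET_pt")
)

# The skimmed dataset is small enough for interactive physics analysis.
reduced.write.mode("overwrite").parquet("/data/cms/reduced/skim.parquet")
```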
The second goal is achieved by implementing multiple physics use cases in Apache Spark, using as input preprocessed datasets derived from official CMS data and simulation. By performing the different end-analyses up to the publication plots on different hardware, the feasibility, usability, and portability of the approach are compared to those of a traditional ROOT-based workflow.
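As a sketch of how such an end-analysis might produce histogram inputs for publication plots with a Spark aggregation, the snippet below bins an assumed precomputed dimuon_mass column from the skim above; the binning and schema are illustrative assumptions.

```python
# Hypothetical sketch: bin a per-event quantity into a histogram with a
# DataFrame aggregation, then inspect the (small) result on the driver.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hep-histogram-sketch").getOrCreate()
skim = spark.read.parquet("/data/cms/reduced/skim.parquet")  # output of the reduction step

bin_width = 2.0  # GeV, assumed binning
hist = (
    skim
    .withColumn("mass_bin", F.floor(F.col("dimuon_mass") / bin_width) * bin_width)
    .groupBy("mass_bin")
    .count()
    .orderBy("mass_bin")
)
hist.show()  # the handful of resulting rows feed a conventional plotting tool
```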
Big Data in HEP: A comprehensive use case study
Experimental Particle Physics has been at the forefront of analyzing the
world's largest datasets for decades. The HEP community was the first to develop
suitable software and computing tools for this task. In recent times, new
toolkits and systems collectively called Big Data technologies have emerged to
support the analysis of Petabyte and Exabyte datasets in industry. While the
principles of data analysis in HEP have not changed (filtering and transforming
experiment-specific data formats), these new technologies use different
approaches and promise a fresh look at analysis of very large datasets and
could potentially reduce the time-to-physics with increased interactivity. In
this talk, we present an active LHC Run 2 analysis, searching for dark matter
with the CMS detector, as a testbed for Big Data technologies. We directly
compare the traditional NTuple-based analysis with an equivalent analysis using
Apache Spark on the Hadoop ecosystem and beyond. In both cases, we start the
analysis with the official experiment data formats and produce publication
physics plots. We will discuss advantages and disadvantages of each approach
and give an outlook on further studies needed.
Comment: Proceedings for the 22nd International Conference on Computing in High Energy and Nuclear Physics (CHEP 2016).
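For reading the official experiment formats directly into Spark, connectors exposing ROOT TTrees as DataFrames exist (e.g. the spark-root package from the DIANA-HEP effort). The following is a hedged sketch under that assumption; the package coordinates, file path, and branch names are illustrative.

```python
# Hedged sketch: load a ROOT TTree as a Spark DataFrame via the spark-root
# data source, then apply the same kind of selection used in an NTuple
# analysis. Path and branch names are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ntuple-vs-spark-sketch")
    # spark-root must be on the classpath, e.g. via
    # --packages org.diana-hep:spark-root_2.11:<version>
    .getOrCreate()
)

ntuple = spark.read.format("org.dianahep.sparkroot").load("/data/cms/ntuple.root")
selected = ntuple.filter("MET_pt > 200.0")  # example dark-matter-style MET cut
print(selected.count())
```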
Gaining insight from large data volumes with ease
Efficient handling of large data volumes has become a necessity in today's world. It is driven by the desire to get more insight from the data and to gain a better understanding of user trends, which can be transformed into economic incentives (profits, cost reduction, and various optimizations of data workflows and pipelines). In this paper, we discuss how modern technologies are transforming well-established patterns in HEP communities. New data insight can be achieved by embracing Big Data tools for a variety of use cases, from analytics and monitoring to training Machine Learning models at the terabyte scale. We provide concrete examples within the context of the CMS experiment, where Big Data tools already play, or will soon play, a significant role in daily operations.
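As an illustration of the monitoring and Machine Learning use cases mentioned above, here is a minimal hedged sketch using Spark MLlib; the input path, feature columns, and failure label are assumptions about the monitoring data, not the CMS schema.

```python
# Hypothetical sketch: train a classifier on job-monitoring records at scale
# with Spark MLlib. Column names and the input path are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("monitoring-ml-sketch").getOrCreate()

jobs = spark.read.parquet("/data/monitoring/jobs.parquet")  # assumed schema

assembler = VectorAssembler(
    inputCols=["cpu_seconds", "read_bytes", "write_bytes", "wall_seconds"],
    outputCol="features",
)
train = assembler.transform(jobs).select("features", "label")  # label: did the job fail?

model = LogisticRegression(labelCol="label", featuresCol="features").fit(train)
print(model.summary.areaUnderROC)  # quick sanity check of the fit
```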
CERN openlab Whitepaper on Future IT Challenges in Scientific Research
This whitepaper describes the major IT challenges in scientific research at CERN and several other European and international research laboratories and projects. Each challenge is exemplified through a set of concrete use cases drawn from the requirements of large-scale scientific programs. The paper is based on contributions from many researchers and IT experts of the participating laboratories, as well as input from the existing CERN openlab industrial sponsors. The views expressed in this document are those of the individual contributors and do not necessarily reflect the views of their organisations and/or affiliates.
A Security Monitoring Framework For Virtualization Based HEP Infrastructures
High Energy Physics (HEP) distributed computing infrastructures require
automatic tools to monitor, analyze and react to potential security incidents.
These tools should collect and inspect data such as resource consumption, logs
and sequence of system calls for detecting anomalies that indicate the presence
of a malicious agent. They should also be able to perform automated reactions
to attacks without administrator intervention. We describe a novel framework that fulfils these requirements, with a proof-of-concept implementation for the ALICE experiment at CERN. We show how we achieve a fully virtualized environment that improves security by isolating services and Jobs without a
significant performance impact. We also describe a collected dataset for
Machine Learning based Intrusion Prevention and Detection Systems on Grid
computing. This dataset is composed of resource consumption measurements (such
as CPU, RAM and network traffic), logfiles from operating system services, and
system call data collected from production Jobs running in an ALICE Grid test
site, together with a large set of malware collected from security research sites. Based on this dataset, we will proceed to develop Machine Learning algorithms able to detect malicious Jobs.
Comment: Proceedings of the 22nd International Conference on Computing in High Energy and Nuclear Physics, CHEP 2016, 10-14 October 2016, San Francisco. Submitted to Journal of Physics: Conference Series (JPCS).
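Since the dataset described above is intended for intrusion detection, an unsupervised anomaly detector is a natural first baseline. Below is a hedged sketch using scikit-learn's IsolationForest; the feature names and CSV layout are assumptions about the dataset, not its published format.

```python
# Hedged sketch: flag anomalous Grid jobs from resource-consumption features
# with an unsupervised detector. Feature names and file layout are assumed.
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("grid_job_metrics.csv")  # assumed export of the dataset
features = df[["cpu_percent", "ram_mb", "net_rx_bytes", "net_tx_bytes"]]

detector = IsolationForest(contamination=0.01, random_state=42)
df["anomaly"] = detector.fit_predict(features)  # -1 marks suspected outliers

print(df[df["anomaly"] == -1].head())  # candidate malicious Jobs for review
```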
The Future of Networking is the Future of Big Data
Scientific domains such as Climate Science, High Energy Particle Physics (HEP), Genomics, Biology, and many others are increasingly moving towards data-oriented workflows in which each of these communities generates, stores, and uses massive datasets that reach into terabytes and petabytes, and are projected soon to reach exabytes. These communities are also increasingly moving towards a global collaborative model in which scientists routinely exchange significant amounts of data. The sheer volume of data, and the complexities of maintaining, transferring, and using it, continue to push the limits of current technologies in multiple dimensions: storage, analysis, networking, and security.

This thesis tackles the networking aspect of big-data science. Networking is the glue that binds all the components of modern scientific workflows, and these communities are becoming increasingly dependent on high-speed, highly reliable networks. The network, as the common layer across big-science communities, provides an ideal place for implementing common services. Big-science applications also need to work closely with the network to ensure optimal usage of resources and intelligent routing of requests and data. Finally, as more communities move towards data-intensive, connected workflows, adopting a service model in which the network provides some of the common services reduces not only application complexity but also the need for duplicate implementations.

Named Data Networking (NDN) is a new network architecture whose service model aligns better with the needs of these data-oriented applications. NDN's name-based paradigm makes it easier to provide intelligent features at the network layer rather than at the application layer. This thesis shows that NDN can push several standard features to the network. This work is the first attempt to apply NDN in the context of large scientific data; in the process, it touches upon scientific data naming, name discovery, real-world deployment of NDN for scientific data, feasibility studies, and the design of in-network protocols for big-data science.
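To make the name-based paradigm concrete, here is a small hedged sketch of hierarchical naming for scientific datasets, where data is addressed by what it is rather than where it lives. The component order and fields are illustrative assumptions, not the thesis's actual naming design.

```python
# Hypothetical sketch of hierarchical, NDN-style names for scientific data.
# Any replica holding the named data can satisfy a request for it.
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetName:
    domain: str      # e.g. "climate" or "hep"
    experiment: str  # e.g. "cmip5" or "cms"
    variable: str    # e.g. "tasmax"
    time_range: str  # e.g. "2000-2010"

    def to_ndn(self) -> str:
        # Render the components as a slash-delimited hierarchical name.
        return f"/{self.domain}/{self.experiment}/{self.variable}/{self.time_range}"

name = DatasetName("climate", "cmip5", "tasmax", "2000-2010")
print(name.to_ndn())  # /climate/cmip5/tasmax/2000-2010
```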