The Grid-distributed data analysis in CMS
The CMS experiment will soon produce a huge amount of data (a few PBytes per year) that will be distributed and stored in many computing centres spread across the countries participating in the collaboration. Data will be available to all CMS physicists: this will be possible thanks to the services provided by the supported Grids. CRAB is the CMS collaboration tool developed to allow physicists to access and analyze data stored at sites world-wide. It aims to simplify the data discovery process and the job creation, execution and monitoring tasks, hiding the details of both the Grid infrastructures and the CMS analysis framework. We discuss the recent evolution of this tool from its standalone version up to the client-server architecture adopted for particularly challenging workload volumes, and we report the usage statistics collected from the CRAB community, which so far involves almost 600 distinct users.
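The task-to-jobs workflow described above (dataset selection, job splitting, submission, monitoring) can be illustrated with a short sketch; the class, field and function names below are purely illustrative and are not the actual CRAB interface.

```python
# Hypothetical illustration of the task -> jobs splitting step that a tool
# like CRAB automates; names and fields are illustrative, not the CRAB API.
from dataclasses import dataclass


@dataclass
class AnalysisTask:
    dataset: str          # published CMS dataset the user wants to analyze
    pset: str             # CMSSW configuration file to run in each job
    total_events: int     # how many events to process in total
    events_per_job: int   # splitting granularity


def split_into_jobs(task: AnalysisTask) -> list[dict]:
    """Turn one user task into a list of independent Grid jobs."""
    jobs = []
    first = 0
    while first < task.total_events:
        n = min(task.events_per_job, task.total_events - first)
        jobs.append({"pset": task.pset, "first_event": first, "max_events": n})
        first += n
    return jobs


task = AnalysisTask("/SomeDataset/Run/RECO", "analysis_cfg.py", 100000, 5000)
print(len(split_into_jobs(task)), "jobs created")  # -> 20 jobs created
```

The point of the tool is that the user only supplies the task description; the splitting, Grid submission and tracking happen behind the scenes.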
CMS@home: Integrating the Volunteer Cloud and High-Throughput Computing
Volunteer computing has the potential to provide significant additional computing capacity for the LHC experiments. Initiatives such as the CMS@home project aim to integrate volunteer computing resources into the experiment's computational frameworks to support their scientific workloads. This is especially important, as over the next few years the demands on computing capacity will increase beyond what can be supported by general technology trends. This paper describes how a volunteer computing project that uses virtualization to run high energy physics simulations can integrate those resources into its computing infrastructure. The concept of the volunteer cloud is introduced, and how this model can simplify the integration is described. An architecture for implementing the volunteer cloud model is presented, along with an implementation for the CMS@home project. Finally, the submission of real CMS workloads to this volunteer cloud is compared to identical workloads submitted to the grid.
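The defining feature of the volunteer cloud model is that volunteer virtual machines pull work from a central queue rather than having jobs pushed to them. A minimal sketch of such a pull loop is shown below; the queue URL, endpoints and message format are hypothetical, not the actual CMS@home services.

```python
# Minimal sketch of the "volunteer cloud" idea: a VM started on a volunteer's
# machine repeatedly pulls work from a central queue instead of having work
# pushed to it. The queue URL and message format here are hypothetical.
import time

import requests

QUEUE_URL = "https://example.org/cms-at-home/queue"  # placeholder, not a real endpoint


def run_payload(job: dict) -> None:
    # In the real project the payload is a CMS simulation run inside the VM;
    # here we only print it to keep the sketch self-contained.
    print("running job", job.get("id"))


while True:
    resp = requests.get(QUEUE_URL + "/next", timeout=30)
    if resp.status_code == 204:       # no work currently available
        time.sleep(60)
        continue
    job = resp.json()
    run_payload(job)
    requests.post(QUEUE_URL + f"/{job['id']}/done", timeout=30)
```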
Use of the gLite-WMS in CMS for production and analysis
The CMS experiment at the LHC started using the Resource Broker (from the EDG and LCG projects) to submit Monte Carlo production and analysis jobs to distributed computing resources of the WLCG infrastructure over six years ago. Since 2006 the gLite Workload Management System (WMS) and Logging & Bookkeeping (LB) service have been used. The interaction with the gLite-WMS/LB happens through the CMS production and analysis frameworks, respectively ProdAgent and CRAB, via a common component, BOSSLite. The important improvements recently made in the gLite-WMS/LB, as well as in the CMS tools, and the intrinsic independence of different WMS/LB instances allow CMS to reach the stability and scalability needed for LHC operations. In particular, the use of a multi-threaded approach in BOSSLite significantly increased the scalability of the system. In this work we present the operational setup of CMS production and analysis based on the gLite-WMS, and the performance obtained in past data challenges as well as in the experiment's daily Monte Carlo production and user analysis activity.
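The multi-threaded submission pattern credited above for the scalability gain can be sketched as follows; submit_to_wms is a stand-in placeholder, not the actual BOSSLite or WMProxy interface.

```python
# Illustration of multi-threaded bulk submission: jobs are submitted through
# several worker threads instead of sequentially.
from concurrent.futures import ThreadPoolExecutor


def submit_to_wms(job_id: int) -> str:
    # Placeholder for a call to the gLite-WMS; real code would talk to the
    # WMS service and return the Grid job identifier it assigns.
    return f"https://wms.example.org/job/{job_id}"


jobs = range(1000)
with ThreadPoolExecutor(max_workers=10) as pool:
    grid_ids = list(pool.map(submit_to_wms, jobs))

print(f"submitted {len(grid_ids)} jobs")
```

Because each submission spends most of its time waiting on the remote service, a modest number of threads is enough to multiply the effective submission rate.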
Distributed analysis with CRAB: The client-server architecture evolution and commissioning
CRAB (CMS Remote Analysis Builder) is the tool used by CMS to enable running physics analysis in a transparent manner over data distributed across many sites. It abstracts out the interaction with the underlying batch farms, grid infrastructure and CMS workload management tools, such that it is easily usable by non-experts. CRAB can be used as a direct interface to the computing system or can delegate the user task to a server. Major efforts have been dedicated to the development of the client-server system, allowing the user to deal only with a simple and intuitive interface and to delegate all the work to a server. The server takes care of handling the user's jobs during the whole lifetime of the user's task. In particular, it takes care of data and resource discovery, process tracking and output handling. It also provides services such as automatic resubmission in case of failures, notification to the user of the task status, and automatic blacklisting of sites showing evident problems, beyond what is provided by the existing grid infrastructure. The CRAB server architecture and its deployment will be presented, as well as the current status and future development. In addition, the experience in using the system for initial detector commissioning activities and data analysis will be summarized.
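The server-side resubmission and site-blacklisting behaviour mentioned above can be sketched as a simple retry loop; thresholds, data structures and the job interface are illustrative assumptions, not the CRAB server implementation.

```python
# Sketch of retry-and-blacklist logic: failed jobs are resubmitted, and sites
# accumulating failures are skipped for subsequent attempts.
from collections import Counter

MAX_RETRIES = 3
BLACKLIST_THRESHOLD = 5

site_failures: Counter = Counter()


def pick_site(candidate_sites: list[str]) -> str:
    """Prefer sites that have not yet accumulated too many failures."""
    usable = [s for s in candidate_sites if site_failures[s] < BLACKLIST_THRESHOLD]
    return usable[0] if usable else candidate_sites[0]


def run_with_resubmission(job, candidate_sites: list[str]) -> bool:
    for _attempt in range(MAX_RETRIES):
        site = pick_site(candidate_sites)
        ok = job.run_at(site)          # hypothetical job interface
        if ok:
            return True
        site_failures[site] += 1       # count the failure towards the blacklist
    return False
```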
CHEP2012
The CMS distributed data analysis workflow assumes that jobs run in a different location from where their results are finally stored. Typically the user output must be transferred across the network from one site to another, possibly on a different continent or over links not necessarily validated for high-bandwidth, high-reliability transfer. This step is named stage-out, and in CMS it was originally implemented as a synchronous step of the analysis job execution. However, our experience showed the weakness of this approach, both in terms of low total job execution efficiency and of failure rates, wasting precious CPU resources. The nature of analysis data makes it inappropriate to use PhEDEx, the core data placement system for CMS. As part of the new generation of CMS Workload Management tools, the Asynchronous Stage-Out system (AsyncStageOut) has been developed to enable third-party copy of the user output. The AsyncStageOut component manages gLite FTS transfers of data from the temporary store at the site where the job ran to the final location of the data, on behalf of the data owner. The tool uses Python daemons, built using the WMCore framework, and CouchDB to manage the queue of work and the FTS transfers. CouchDB also provides the platform for a dedicated operations monitoring system. In this paper, we present the motivations for the asynchronous stage-out system. We give an insight into the design and the implementation of key features, describing how it is coupled with the CMS workload management system. Finally, we show the results and the commissioning experience.
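The daemon-plus-CouchDB queue pattern described above can be sketched as a polling loop over a view of pending transfers; the database name, view path and submit_fts_transfer helper are assumptions for illustration (the view is assumed to emit the full document, including _id and _rev), not the production AsyncStageOut code.

```python
# Sketch of an asynchronous stage-out daemon: poll a CouchDB view listing
# files in state "new" and hand them to FTS for third-party copy.
import time

import requests

COUCH = "http://localhost:5984/asyncstageout"               # assumed CouchDB instance
PENDING_VIEW = COUCH + "/_design/transfers/_view/pending"   # assumed view


def submit_fts_transfer(source: str, destination: str) -> None:
    # Placeholder for a submission to the FTS service.
    print("FTS copy", source, "->", destination)


while True:
    rows = requests.get(PENDING_VIEW, timeout=30).json().get("rows", [])
    for row in rows:
        doc = row["value"]                       # full document emitted by the view
        submit_fts_transfer(doc["source_lfn"], doc["destination_lfn"])
        doc["state"] = "acquired"                # mark it so it is not picked up twice
        requests.put(f"{COUCH}/{doc['_id']}", json=doc, timeout=30)
    time.sleep(60)
```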
CMS users data management service integration and first experiences with its NoSQL data storage
The distributed data analysis workflow in CMS assumes that jobs run in a different location to where their results are finally stored. Typically the user outputs must be transferred from one site to another by a dedicated CMS service, AsyncStageOut. This service was originally developed to address the inefficiency in using CMS computing resources when analysis job outputs were transferred synchronously to the remote site as soon as they were produced on the job execution node. The AsyncStageOut is designed as a thin application relying only on a NoSQL database (CouchDB) for input and data storage. It has progressed from a limited prototype to a highly adaptable service which manages and monitors all the user file handling steps, namely file transfer and publication. The AsyncStageOut is integrated with the Common CMS/ATLAS Analysis Framework. It is foreseen to manage nearly 200k user files per day from close to 1000 individual users per month with minimal delays, providing real-time monitoring and reports to users and service operators while being highly available. The associated data volume represents a new set of challenges in the areas of database scalability and service performance and efficiency. In this paper, we present an overview of the AsyncStageOut model and the integration strategy with the Common Analysis Framework. The motivations for using the NoSQL technology are also presented, as well as the data design and the techniques used for efficient indexing and monitoring of the data. We describe the deployment model for the high availability and scalability of the service. We also discuss the hardware requirements and the results achieved, as they were determined by testing with actual data and realistic loads during the commissioning and the initial production phase with the Common Analysis Framework.
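The "data design and efficient indexing" mentioned above typically means one document per user file plus a map view indexed by transfer state. The sketch below shows that shape using CouchDB's HTTP API; the database name and field names are illustrative assumptions, not the production schema.

```python
# Sketch of a per-file document and a map view indexed by transfer state.
import requests

COUCH = "http://localhost:5984/user_files"   # assumed CouchDB instance

file_doc = {
    "_id": "user:alice:file:0001",
    "user": "alice",
    "source_lfn": "/store/temp/user/alice/output_1.root",
    "destination": "T2_IT_Pisa",
    "state": "new",            # new -> acquired -> done / failed
    "publish": True,           # whether the file should also be published
}
requests.put(COUCH + "/" + file_doc["_id"], json=file_doc, timeout=30)

# A design document with a map function (JavaScript stored as a string)
# builds the index that transfer daemons query by state.
design = {
    "views": {
        "by_state": {
            "map": "function(doc) { emit(doc.state, doc.destination); }"
        }
    }
}
requests.put(COUCH + "/_design/files", json=design, timeout=30)
```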
CHEP2012
In CMS Computing the highest priorities for analysis tools are improving the end users' ability to produce and publish reliable samples and analysis results, as well as a transition to a sustainable development and operations model. To achieve these goals CMS decided to incorporate analysis processing into the same framework as data and simulation processing. This strategy foresees that all workload tools (Tier0, Tier1, production, analysis) share a common core, with long-term maintainability as well as standardized operator interfaces. The re-engineered analysis workload manager, called CRAB3, makes use of newer technologies, such as RESTful web services and NoSQL databases, aiming to increase the scalability and reliability of the system. As opposed to CRAB2, in CRAB3 all work is centrally injected and managed in a global queue. A pool of agents, which can be geographically distributed, consumes work from the central services and serves the user tasks. The new architecture of CRAB substantially changes the deployment model and operations activities. In this paper we present the implementation of CRAB3, emphasizing how the new architecture improves workflow automation and simplifies maintainability. In particular, we highlight the impact of the new design on daily operations.
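In the centrally injected model sketched above, the client only talks to a REST interface and a distributed pool of agents later picks the task up from the central queue. The snippet below illustrates that submission step; the endpoint URL and payload fields are hypothetical, not the actual CRAB3 REST schema.

```python
# Sketch of task injection through a REST interface; the agents that consume
# the queue are not shown.
import requests

REST_URL = "https://cmsweb.example.org/crabserver/task"   # placeholder URL

task = {
    "workflow": "alice_analysis_2012",
    "dataset": "/SomeDataset/Run/AOD",
    "pset": "analysis_cfg.py",
    "splitting": "FileBased",
    "units_per_job": 5,
}

resp = requests.post(REST_URL, json=task, timeout=30)
resp.raise_for_status()
print("task queued:", resp.json().get("workflow"))
```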
A comparison of different database technologies for the CMS AsyncStageOut transfer database
AsyncStageOut (ASO) is the component of the CMS distributed data analysis system (CRAB) that manages user transfers in a centrally controlled way using the File Transfer System (FTS3) at CERN. It addresses a major weakness of the previous, decentralized model, namely that the transfer of the user's output data to a single remote site was part of the job execution, resulting in inefficient use of job slots and an unacceptable failure rate. Currently ASO manages up to 600k files of various sizes per day from more than 500 users per month, spread over more than 100 sites. ASO uses a NoSQL database (CouchDB) for internal bookkeeping and as a way to communicate with the other CRAB components. Since ASO/CRAB were put into production in 2014, the number of transfers has constantly increased, up to a point where the pressure on the central CouchDB instance became critical, creating new challenges for the system's scalability, performance, and monitoring. This forced a re-engineering of the ASO application to increase its scalability and lower its operational effort. In this contribution we present a comparison of the performance of the current NoSQL implementation and a new SQL implementation, and how their different strengths and features influenced the design choices and operational experience. We also discuss other architectural changes introduced in the system to handle the increasing load and the latency in delivering output to the user.
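To make the NoSQL-versus-SQL comparison concrete, the sketch below keeps the same "one record per file transfer" information in a relational table, where the per-state query that needed a dedicated CouchDB view becomes a plain indexed SELECT. The schema and sqlite3 backend are illustrative assumptions, not the database layout evaluated in the paper.

```python
# Sketch of the relational alternative to the CouchDB bookkeeping shown earlier.
import sqlite3

conn = sqlite3.connect(":memory:")     # stand-in for a production SQL backend
conn.execute("""
    CREATE TABLE transfers (
        id          TEXT PRIMARY KEY,
        username    TEXT,
        source_lfn  TEXT,
        destination TEXT,
        state       TEXT,
        retry_count INTEGER DEFAULT 0
    )
""")
conn.execute(
    "INSERT INTO transfers VALUES (?, ?, ?, ?, ?, ?)",
    ("alice:0001", "alice", "/store/temp/output_1.root", "T2_IT_Pisa", "new", 0),
)

# Fetch the pending transfers, the query a transfer daemon would run.
pending = conn.execute(
    "SELECT id, source_lfn FROM transfers WHERE state = 'new'"
).fetchall()
print(pending)
```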