32 research outputs found

    CMS dashboard task monitoring: A user-centric monitoring view

    We are now at a turning point in the CMS experiment, with people moving away from construction and focusing more intensely on physics analysis. This shift raises many challenging issues for the monitoring of user analysis. Physicists must be able to monitor the execution status, application-level and grid-level messages of their tasks, which may run at any site within the CMS Virtual Organisation. The CMS Dashboard Task Monitoring project provides this information to individual analysis users by collecting and exposing a user-centric set of information about submitted tasks, including the reason of failure, distribution by site and over time, consumed time and efficiency. The development was user-driven: physicists were invited to test the prototype in order to gather further requirements and identify weaknesses in the application.
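
    The user-centric aggregation described above (failures broken down by reason, distribution of jobs by site, consumed time and efficiency) can be sketched in a few lines. The record fields and the function name below are hypothetical illustrations under assumed inputs, not taken from the Dashboard code.

```python
from collections import defaultdict

def summarise_task(jobs):
    """Aggregate per-job records of one analysis task into a user-centric summary.

    Each record is a dict with hypothetical keys: 'site', 'status'
    ('success' or 'failed'), 'failure_reason', 'cpu_time' and 'wall_time'.
    """
    per_site = defaultdict(lambda: {"success": 0, "failed": 0})
    failure_reasons = defaultdict(int)
    cpu, wall = 0.0, 0.0

    for job in jobs:
        per_site[job["site"]][job["status"]] += 1
        if job["status"] == "failed":
            failure_reasons[job.get("failure_reason", "unknown")] += 1
        cpu += job.get("cpu_time", 0.0)
        wall += job.get("wall_time", 0.0)

    return {
        "jobs_by_site": dict(per_site),          # distribution by site
        "failure_reasons": dict(failure_reasons),  # reason of failure
        "cpu_efficiency": cpu / wall if wall else None,  # consumed time / efficiency
    }
```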

    High performance event-building in linux for LHCb

    Equivalent differentiated services for AODVng

    Evaluation of subfarm controllers candidates with an implementation of LHCb event-building

    This report summarises experimental results obtained by running an implementation of LHCb event-building on various candidates for the subfarm controllers (SFCs) of the LHCb data-acquisition network. We first describe the event-building implementation and then present the experimental results.

    Association rule mining on grid monitoring data to detect error sources

    Error handling is a crucial task in an infrastructure as complex as a grid. Several monitoring tools are in place that report failing grid jobs, including their exit codes. However, the exit codes do not always denote the actual fault that caused the job failure, and human time and knowledge are required to manually trace an error back to its underlying fault. We perform association rule mining on grid job monitoring data to automatically retrieve knowledge about the behaviour of grid components, taking dependencies between grid job characteristics into account. In this way, problematic grid components are located automatically and this information, expressed as association rules, is visualised in a web interface. This work reduces the time needed for fault recovery and improves a grid's reliability.
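
    A minimal sketch of this kind of association rule mining, assuming job records are encoded as sets of 'key=value' items and looking only for rules whose consequent is an exit code. The item encoding, thresholds and function name are assumptions for illustration, not the actual implementation behind the paper.

```python
from collections import Counter
from itertools import combinations

def mine_rules(jobs, min_support=0.05, min_confidence=0.8):
    """Illustrative association-rule mining over grid job monitoring records.

    Each job is a set of items such as {'site=SITE_A', 'ce=ce01', 'exit_code=1'}.
    Returns rules (antecedent -> 'exit_code=...') whose support and confidence
    exceed the thresholds, pointing at components that co-occur with errors.
    """
    if not jobs:
        return []

    n = len(jobs)
    itemset_counts = Counter()
    for job in jobs:
        items = frozenset(job)
        # Count all small itemsets (up to size 3) appearing in each job record.
        for size in (1, 2, 3):
            for subset in combinations(sorted(items), size):
                itemset_counts[frozenset(subset)] += 1

    rules = []
    for itemset, count in itemset_counts.items():
        support = count / n
        if support < min_support:
            continue
        errors = {i for i in itemset if i.startswith("exit_code=")}
        if len(errors) != 1:
            continue
        antecedent = itemset - errors
        if not antecedent:
            continue
        # confidence = support(antecedent ∪ consequent) / support(antecedent)
        confidence = count / itemset_counts[antecedent]
        if confidence >= min_confidence:
            rules.append((antecedent, next(iter(errors)), support, confidence))
    return rules
```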

    Grid reliability

    We offer a system to track the efficiency of different components of the Grid; it allows us to study the performance of both the WMS and the data transfers. At the moment, we have set up the different parts of the system for ALICE, ATLAS, CMS and LHCb. None of the components that we have developed are VO specific, so it would be very easy to deploy them for any other VO. Our main goal is to improve the reliability of the Grid. The main idea is to discover the different problems that have happened as soon as possible and to inform the responsible parties. Since we study the jobs and transfers issued by real users, we see the same problems that users see. In fact, we see even more problems than the end user does, since we also follow up the errors that Grid components can overcome by themselves (for instance, resubmitting a failed job to a different site). This kind of information is very useful to site and VO administrators: they can find out the efficiency of their sites and, in case of failures, the problems that they have to solve. The reports that we provide are also interesting for the COD, since the errors might not be VO specific. The whole system is based on studying the different actions that users perform, so the first and most important dependency is on monitoring systems. We interface with the Dashboard, which hides the differences between the heterogeneous sources of data (such as RGMA, ICXML or MonALISA). Another service that is very important for the effectiveness of Grid reliability is the submission and tracking of tickets, GGUS. This has already been tested with a manual procedure; since the result was very encouraging, we are working on ways of automating this interaction. The main problem that we have found so far is the lack of communication between the new gLite RB and RGMA: jobs that went through these resource brokers do not publish their status, making our task impossible. Another possible problem is the confidentiality of the data. To address this, we anonymise the jobs and transfers, since we are only interested in the different statuses that a job or transfer goes through.
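
    A rough sketch of the per-site reliability accounting described above, including the anonymisation of user identities and the tracking of errors that the Grid overcomes by itself through resubmission. The record fields and the function name are hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict
import hashlib

def site_report(attempts):
    """Illustrative per-site reliability report from anonymised job attempts.

    Each attempt is a dict with hypothetical keys: 'site', 'user', 'final_status'
    ('done' or 'aborted') and 'resubmitted' (True when a Grid component retried
    the job at another site, an error the end user never sees).
    """
    report = defaultdict(lambda: {"done": 0, "aborted": 0, "hidden_retries": 0})
    for a in attempts:
        entry = report[a["site"]]
        entry[a["final_status"]] += 1
        if a.get("resubmitted"):
            entry["hidden_retries"] += 1
        # Anonymise the user identity before the record leaves the monitoring system.
        a["user"] = hashlib.sha256(a["user"].encode()).hexdigest()[:12]

    for site, entry in report.items():
        total = entry["done"] + entry["aborted"]
        entry["efficiency"] = entry["done"] / total if total else None
    return dict(report)
```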

    ARDA Dashboard Data Management Monitoring

    The ATLAS DDM (Distributed Data Management) system is responsible for the management and distribution of data across the different grid sites. The data is generated at CERN and has to be made available as fast as possible in a large number of centres for production purposes, and later in many other sites for end-user analysis. Monitoring data transfer activity and availability is an essential task both for site administrators and for end users doing analysis in their local centres. Data management on the grid depends on a complex set of services: file catalogues for file and file-location bookkeeping, transfer services for file movement, storage managers and others. In addition, there are several flavours of each of these components, tens of sites each managing a distinct installation (over 100 at the present time), and in some organisations data is seen and moved at a larger granularity than files, usually called datasets, which makes the successful use of the standard grid monitoring tools far from straightforward. The dashboard provides a unified view of the whole data management infrastructure, relying mostly on the ATLAS data management (DDM) system to collect the relevant information about dataset and file movement among the different sites, but also retrieving information from the grid fabric services where appropriate. This last point makes it an interesting tool for other communities that rely on the same lower-level grid services. Since the focus is on data management on the grid, the most relevant services for this area of the dashboard are the transfer services and the storage managers. It is essential that all information can be propagated easily and quickly to the dashboard service, either directly or via the DDM services, so that end users have an almost real-time view of their activities and production systems can rely on the system views provided by the monitoring. File transfer information is transient in most cases and is taken from the main transfer tool in use, the File Transfer Service (FTS). Storage and storage-space information lies in the Storage Resource Managers (SRM), which should provide a unique, implementation-independent view of the physical data and the available space. Information regarding file and system metadata is expected to be kept consistent everywhere, and any changes are expected to be propagated to the interested services such as the dashboard. We plan to extend the handling of errors coming from the different Grid services used by the ATLAS DDM system.
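
    The roll-up from file-level transfer states to a dataset-level, per-destination-site view could look roughly like the sketch below. The record fields and state names are assumptions for illustration and do not reflect the actual dashboard schema or the FTS/DDM message formats.

```python
from collections import defaultdict

def dataset_completion(transfers):
    """Illustrative roll-up of file-level transfer states into dataset-level views.

    Each transfer record carries hypothetical keys: 'dataset', 'dest_site' and
    'state' (e.g. 'done', 'active', 'failed'), as a dashboard might receive them
    from DDM callbacks or from the transfer service.
    """
    summary = defaultdict(lambda: defaultdict(int))
    for t in transfers:
        summary[(t["dataset"], t["dest_site"])][t["state"]] += 1

    view = {}
    for (dataset, site), states in summary.items():
        total = sum(states.values())
        view[(dataset, site)] = {
            "files_done": states.get("done", 0),
            "files_failed": states.get("failed", 0),
            "completion": states.get("done", 0) / total if total else None,
        }
    return view
```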