    A Mediated Definite Delegation Model allowing for Certified Grid Job Submission

    Grid computing infrastructures need to provide traceability and accounting of their users' activity and protection against misuse and privilege escalation. A central aspect of multi-user Grid job environments is the necessary delegation of privileges in the course of a job submission. With respect to these generic requirements this document describes an improved handling of multi-user Grid jobs in the ALICE ("A Large Ion Collider Experiment") Grid Services. A security analysis of the ALICE Grid job model is presented with derived security objectives, followed by a discussion of existing approaches of unrestricted delegation based on X.509 proxy certificates and the Grid middleware gLExec. Unrestricted delegation has severe security consequences and limitations, most importantly allowing for identity theft and forgery of delegated assignments. These limitations are discussed and formulated, both in general and with respect to an adoption in line with multi-user Grid jobs. Based on the architecture of the ALICE Grid Services, a new general model of mediated definite delegation is developed and formulated, allowing a broker to assign context-sensitive user privileges to agents. The model provides strong accountability and long-term traceability. A prototype implementation allowing for certified Grid jobs is presented including a potential interaction with gLExec. The achieved improvements regarding system security, malicious job exploitation, identity protection, and accountability are emphasized, followed by a discussion of non-repudiation in the face of malicious Grid jobs

    Database Replication for Disconnected Operations with Quasi Real-Time Synchronization

    Database replication is a way to improve system throughput or achieve high availability. In most cases, using an active-active replica architecture is efficient and easy to deploy. Such a system has CP properties (from the CAP theorem: Consistency, Availability and network Partition tolerance). Creating an AP (available and partition tolerant) system requires using multi-primary replication. This approach, because of many difficulties in implementation, is not widely used. However, deployment of CCDB (experiment conditions and calibration database) needs to be an AP system in two locations. This necessity became an inspiration to examine the state-of-the-art in this field and to test the available solutions. The tests performed evaluate the performance of the chosen replication tools: Bucardo and EDB Replication Server. They show that the tested tools can be successfully used for continuous synchronization of two independent database instances

    MonALISA : A Distributed Service System for Monitoring, Control and Global Optimization

    The MonALISA (Monitoring Agents in A Large Integrated Services Architecture) framework provides a set of distributed services for monitoring, control, management and global optimization for large-scale distributed systems. It is based on an ensemble of autonomous, multi-threaded, agent-based subsystems which are registered as dynamic services. They can be automatically discovered and used by other services or clients. The distributed agents can collaborate and cooperate in performing a wide range of management, control and global optimization tasks using real-time monitoring information

    Dynamic scheduling using CPU oversubscription in the ALICE Grid

    The ALICE Grid is designed to perform comprehensive real-time monitoring of both jobs and execution nodes in order to maintain a continuous and consistent status of the Grid infrastructure. An extensive database of historical data is available and is periodically analyzed to tune the workflows and data management to optimal performance levels. This data, when evaluated in real time, has the power to trigger decisions for efficient resource management of the currently running payloads, for example to enable the execution of a higher volume of work per unit of time. In this article, we consider scenarios in which, through constant interaction with the monitoring agents, a dynamic adaptation of the running workflows is performed. The target resources are memory and CPU with the objective of using them in their entirety and ensuring optimal utilization fairness between executing jobs. Grid resources are heterogeneous and of different generations, which means that some of them have better hardware characteristics than the minimum required to execute ALICE jobs. Our middleware, JAliEn, works on the basis of having at least 2 GB of RAM allocated per core (allowing up to 8 GB of virtual memory when including swap). Many of the worker nodes have higher memory-per-core ratios than these basic limits and in terms of available memory they therefore have free resources to accommodate extra jobs. The running jobs may have different behaviors and unequal resource usages depending on their nature. For example, analysis tasks are I/O bound while Monte Carlo tasks are CPU intensive. Running additional jobs with complementary resource usage patterns on a worker node has great potential to increase its total efficiency. This paper presents the methodology to exploit the different resource usage profiles by oversubscribing the worker nodes with extra jobs, taking into account their CPU resource usage levels and memory capacity
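
The memory and CPU headroom argument in the abstract above can be illustrated with a little arithmetic. The following is only a rough sketch and not the JAliEn implementation: the function name `extra_job_slots`, the `cpu_threshold` parameter and its default value are assumptions made for illustration. Only the baseline of 2 GB of RAM per core comes from the abstract.

```python
import math

GB_PER_CORE = 2.0  # baseline stated in the abstract: at least 2 GB of RAM per core


def extra_job_slots(cores: int, ram_gb: float, cpu_utilization: float,
                    cpu_threshold: float = 0.85) -> int:
    """Estimate how many extra single-core jobs a worker node could host.

    cpu_utilization is the measured fraction (0.0 to 1.0) of the node's
    cores that are busy. cpu_threshold is a hypothetical cut-off above
    which no extra job is admitted.
    """
    if cpu_utilization >= cpu_threshold:
        return 0  # node is already CPU-saturated
    # Memory left after every core received its 2 GB baseline share.
    spare_ram_gb = ram_gb - cores * GB_PER_CORE
    by_memory = max(0, math.floor(spare_ram_gb / GB_PER_CORE))
    # Whole cores' worth of CPU time that is currently idle.
    by_cpu = math.floor(cores * (1.0 - cpu_utilization))
    # Both resources must be available for each extra job.
    return min(by_memory, by_cpu)
```

Under these assumptions, a node with 8 cores and 32 GB of RAM running at 50% CPU utilization would have memory for eight extra jobs but idle CPU for only four, so the estimate is limited by CPU.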

    Multicore workflow characterisation methodology for payloads running in the ALICE Grid

    For LHC Run 3 the ALICE experiment software stack has been completely refactored, incorporating support for multicore job execution. Whereas in both LHC Run 1 and 2 the Grid jobs were single-process and made use of a single CPU core, the new multicore jobs spawn multiple processes and threads within the payload. Some of these multicore jobs deploy a high number of short-lived processes, on the order of more than a dozen per second. The overhead of starting so many processes impacts the overall CPU utilization of the payloads, in particular its System component. Furthermore, the short-lived processes were not correctly accounted for by the monitoring system of the experiment. This paper presents the new methodology developed for supervising the payload execution. We also present a black-box analysis of the new multicore experiment software framework, tracing the used resources and system function calls issued by Monte Carlo simulation jobs. Multiple sources of overhead in the lifecycle of processes and threads have thus been identified. This paper describes how the source of each was traced and what solutions were implemented to address them. These improvements have impacted the resource consumption and the overall turnaround time of these payloads, with a notable 35% reduction in execution time for a reference production job. We also introduce how this methodology will be used to further improve the efficiency of our experiment software and what other optimization avenues are currently under research

    The Dynamics of Network Topology

    Network monitoring is vital to ensure proper network operation over time, and is tightly integrated with all the data-intensive processing tasks used by the LHC experiments. In order to build a coherent set of network management services it is very important to collect in near real-time information about the network topology, the main data flows, traffic volume and the quality of connectivity. A set of dedicated modules was developed in the MonALISA framework to periodically perform network measurement tests between all sites. We developed global services to present in near real-time the entire network topology used by a community. For any LHC experiment such a network topology includes several hundred routers and tens of Autonomous Systems. Any changes in the global topology are recorded and this information can be easily correlated with traffic patterns. The evolution in time of the global network topology is shown in a dedicated GUI. Changes in the global topology at this level occur quite frequently and even small modifications in the connectivity map may significantly affect the network performance. The global topology graphs are correlated with active end-to-end network performance measurements, done with the Fast Data Transfer application, between all sites. Access to both real-time and historical data, as provided by MonALISA, is also important for developing services able to predict the usage pattern, to aid in efficiently allocating resources globally

    Job splitting on the ALICE grid, introducing the new job optimizer for the ALICE grid middleware

    This contribution introduces the job optimizer service for the next-generation ALICE Grid middleware, JAliEn (Java Alice Environment). It is a continuous service running on central machines and is essentially responsible for splitting jobs into subjobs, to then be distributed and executed on the ALICE grid. There are several ways of creating subjobs based on various strategies relevant to the aim of any particular grid job. Therefore a user has to explicitly declare that a job is to be split, and also define the strategy to be used. The new job optimizer service aims to retain the old ALICE grid middleware functionalities from the user’s point of view while increasing the performance and throughput. One aspect of increasing performance is looking at how the job optimizer interacts with the job queue database. A different way of describing subjobs in the database is presented, to minimize resource usage. There is also a focus on limiting communications with the database, as this is already a congested area. Furthermore, a new solution to splitting based on the locality of job input data will be presented, aiming to split into subjobs more efficiently, therefore making better use of resources on the grid to further increase throughput. Added options for the user regarding splitting by locality, such as setting a minimum limit for a subjob size, will also be explored
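
To make the idea of splitting by input-data locality with a minimum subjob size more concrete, here is a minimal sketch. It is not the JAliEn job optimizer: the function `split_by_locality`, its parameters, the `(file name, site)` input shape and the fallback of merging undersized chunks into location-agnostic subjobs are all assumptions chosen for illustration.

```python
from collections import defaultdict


def split_by_locality(files, max_per_subjob, min_per_subjob=1):
    """Group input files by storage site and cut each group into subjobs.

    files: iterable of (file_name, site) pairs (hypothetical input shape).
    Returns a list of (site, [file_name, ...]) tuples. Chunks smaller than
    min_per_subjob are pooled and emitted with site None, meaning "no
    preferred location" (an assumed fallback, not taken from the paper).
    """
    by_site = defaultdict(list)
    for name, site in files:
        by_site[site].append(name)

    subjobs, leftovers = [], []
    for site in sorted(by_site):  # sorted only to make the output deterministic
        names = by_site[site]
        for i in range(0, len(names), max_per_subjob):
            chunk = names[i:i + max_per_subjob]
            if len(chunk) >= min_per_subjob:
                subjobs.append((site, chunk))
            else:
                leftovers.extend(chunk)

    # Pool undersized chunks so that no file is dropped.
    for i in range(0, len(leftovers), max_per_subjob):
        subjobs.append((None, leftovers[i:i + max_per_subjob]))
    return subjobs
```

With five files, three of them at one site and one at each of two others, `max_per_subjob=2` and `min_per_subjob=2`, this sketch yields one site-local subjob of two files and pools the remaining three files into two location-agnostic subjobs.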

    Site Sonar - A Flexible and Extensible Infrastructure Monitoring Tool for ALICE Computing Grid

    The ALICE experiment at the CERN Large Hadron Collider relies on a massive, distributed Computing Grid for its data processing. The ALICE Computing Grid is built by combining a large number of individual computing sites distributed globally. These Grid sites are maintained by different institutions across the world and contribute thousands of worker nodes possessing different capabilities and configurations. Developing software for Grid operations that works on all nodes while harnessing the maximum capabilities offered by any given Grid site is challenging without advance knowledge of what capabilities each site offers. Site Sonar is an architecture-independent Grid infrastructure monitoring framework developed by the ALICE Grid team to monitor the infrastructure capabilities and configurations of worker nodes at sites across the ALICE Grid without the need to contact local site administrators. Site Sonar is a highly flexible and extensible framework that offers infrastructure metric collection without local agent installations at Grid sites. This paper introduces the Site Sonar Grid infrastructure monitoring framework and reports significant findings acquired about the ALICE Computing Grid using Site Sonar