
    New solutions in IT Monitoring: cAdvisor and Collectd

    The IT Monitoring team constantly improves its solutions to provide reliable monitoring and computation platforms for the whole organization. There is an increasing trend of using containers for the different computation platforms within CERN's IT department, so the IT Monitoring team has to be ready to provide solutions for container monitoring. The challenge of the student project was therefore to find solutions for container monitoring that could be integrated into the existing IT Monitoring systems for easy adoption. Once a solution was chosen, an initial prototype of the system had to be implemented as well.

    Testing of complex, large-scale distributed storage systems: a CERN disk storage case study

    Complex, large-scale distributed systems are frequently used to solve extraordinary computing, storage and other problems. However, the development of these systems usually requires working with several software components, maintaining and improving a large codebase, and providing a collaborative environment for many developers working together. The central role that such complex systems play in mission-critical tasks, and in the daily activity of their users, means that any software bug affecting the availability of the service has far-reaching effects. Providing an easily extensible testing framework is a prerequisite for building confidence both in the system and among the developers who contribute to the code. The testing framework can address concrete bugs found in the codebase, thus avoiding future regressions, and also provides a high degree of confidence for the people contributing new code. Easily incorporating other people's work into the project greatly helps scale out manpower, so that having more developers contributing to the project can actually result in more work being done rather than more bugs added. In this paper we go through the case study of EOS, the CERN disk storage system, and introduce the methods and mechanisms used to achieve fully automatic regression and robustness testing, along with continuous integration, for such a large-scale, complex and critical system using a container-based environment.
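    The regression-testing idea described in the abstract can be illustrated with a minimal sketch: once a concrete bug is found and fixed, a test pins the expected behaviour so the bug cannot silently reappear. The bug, class and method names below are hypothetical, not taken from the EOS codebase.

    ```java
    // Hypothetical regression test sketch (illustrative names, not the
    // EOS test framework). The "bug" being pinned: path normalization
    // once collapsed duplicate slashes incorrectly.
    public class RegressionSketch {

        // Fixed implementation: collapse runs of two or more slashes into one.
        static String normalize(String path) {
            return path.replaceAll("/{2,}", "/");
        }

        public static void main(String[] args) {
            // Regression check for the concrete bug once found in the codebase:
            // doubled slashes must collapse without losing path components.
            String out = normalize("/eos//user//alice");
            if (!out.equals("/eos/user/alice")) {
                throw new AssertionError("regression reintroduced: got " + out);
            }
            System.out.println("ok");
        }
    }
    ```

    In a real setup such tests run automatically in the container-based continuous-integration environment on every contribution, which is what gives new contributors the confidence the abstract describes.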

    Phase Advance Interlocking Throughout the Whole LHC Cycle

    Each beam of CERN's Large Hadron Collider (LHC) stores 360 MJ at design energy and design intensity. In the unlikely event of an asynchronous beam dump, not all particles would be extracted immediately. They would still take one turn around the ring, oscillating with potentially high amplitudes. If the beam were to hit one of the experimental detectors or the collimators close to the interaction points, severe damage could occur. In order to minimize the risk in such a scenario, a new interlock system was put in place in 2016. This system guarantees a phase advance of zero degrees (within tolerances) between the extraction kicker and the interaction point. This contribution describes the motivation for this new system as well as the technical implementation and the strategies used to derive appropriate tolerances that allow sufficient protection without risking false beam dump triggers.
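    The interlock condition stated above can be sketched numerically: the phase advance between extraction kicker and interaction point must be zero degrees modulo 360, within a tolerance. The code below is an illustrative check only, not the actual LHC interlock implementation, and the tolerance values used are hypothetical.

    ```java
    // Illustrative sketch of the interlock condition (not the LHC code):
    // accept a phase advance if its distance to 0 degrees (mod 360) is
    // within a given tolerance.
    public class PhaseAdvanceCheck {

        static boolean phaseAdvanceSafe(double phaseDeg, double tolDeg) {
            // Wrap the phase advance into [0, 360).
            double wrapped = ((phaseDeg % 360.0) + 360.0) % 360.0;
            // Distance to zero, measured around the circle.
            double distToZero = Math.min(wrapped, 360.0 - wrapped);
            return distToZero <= tolDeg;
        }

        public static void main(String[] args) {
            System.out.println(phaseAdvanceSafe(358.5, 2.0)); // 1.5 deg from zero -> true
            System.out.println(phaseAdvanceSafe(90.0, 2.0));  // far from zero -> false
        }
    }
    ```

    The paper's actual contribution lies in how the tolerance is derived so that protection is sufficient without triggering false beam dumps; the check itself is this simple comparison.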

    Streaming Pool - managing long-living reactive streams for Java

    A common use case in accelerator control systems is subscribing to many properties of multiple devices and combining the resulting data. A technology standardized in the software industry in recent years is so-called reactive streams. Libraries implementing this standard provide a rich set of operators to manipulate, combine and subscribe to streams of data. However, the usual focus of such streaming libraries is applications in which those streams complete within a limited amount of time or collapse due to errors. In the case of a control system, on the other hand, we want those streams to live for a very long time (ideally infinitely) and to handle errors gracefully. In this paper we describe an approach which allows two reactive stream styles: ephemeral and long-living. This lets developers benefit both from the extensive features of reactive stream libraries and from keeping the streams alive continuously. Further plans and ideas are also discussed.
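    The long-living style described above can be sketched with the JDK's built-in Flow API: a subscriber that requests unbounded demand and treats an error as something to log and recover from rather than as the end of the world. The class and field names below are illustrative, not the Streaming Pool API.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.Flow;
    import java.util.concurrent.SubmissionPublisher;

    // Hypothetical sketch of a long-living subscriber: it collects device
    // values and does not let an error silently kill the subscription logic
    // (in real code onError would log and trigger resubscription).
    public class LongLivingStreamSketch {

        static class ResilientSubscriber implements Flow.Subscriber<Integer> {
            final List<Integer> received = new ArrayList<>();
            final CountDownLatch done = new CountDownLatch(1);

            public void onSubscribe(Flow.Subscription s) {
                s.request(Long.MAX_VALUE); // unbounded demand: consume forever
            }
            public void onNext(Integer item)   { received.add(item); }
            public void onError(Throwable t)   { done.countDown(); } // log + recover here
            public void onComplete()           { done.countDown(); }
        }

        public static void main(String[] args) throws Exception {
            SubmissionPublisher<Integer> device = new SubmissionPublisher<>();
            ResilientSubscriber sub = new ResilientSubscriber();
            device.subscribe(sub);
            device.submit(1); // values from a subscribed device property
            device.submit(2);
            device.close();
            sub.done.await();
            System.out.println(sub.received); // prints [1, 2]
        }
    }
    ```

    An ephemeral stream would instead be allowed to complete or fail normally; the paper's point is that both styles can coexist on top of the same reactive-stream operators.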

    Second Generation LHC Analysis Framework: Workload-based and User-oriented Solution

    Consolidation and upgrades of accelerator equipment during the first long LHC shutdown period enabled particle collisions at energy levels almost twice as high as in the first operational phase. Consequently, the software infrastructure providing vital information for machine operation and its optimisation needs to be updated to keep up with the challenges imposed by the increasing amount of collected data and the complexity of analysis. Current tools, designed more than a decade ago, have proven their reliability by significantly outperforming the initially provisioned workloads, but are unable to scale efficiently to satisfy the growing needs of operators and hardware experts. In this paper we present our progress towards the development of a new workload-driven solution for LHC transient data analysis, based on identified user requirements. An initial setup and study of modern data storage and processing engines appropriate for accelerator data analysis was conducted. First simulations of the proposed novel partitioning and replication approach, targeting a highly efficient service for heterogeneous analysis requests, were designed and performed.
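    A workload-driven partitioning scheme of the kind mentioned above can be sketched very simply, under the assumption (not stated in the abstract) that a typical analysis request selects one device signal over a time window: keying each record by device and time bucket means such a query touches only a handful of partitions. The key format and bucket size below are hypothetical.

    ```java
    // Hypothetical partitioning sketch (illustrative, not the paper's scheme):
    // records of transient accelerator data are routed to a partition keyed
    // by device name and a coarse time bucket.
    public class PartitionSketch {

        // bucketMillis controls the partition granularity, e.g. one hour.
        static String partitionKey(String device, long tsMillis, long bucketMillis) {
            long timeBucket = tsMillis / bucketMillis;
            return device + "/" + timeBucket;
        }

        public static void main(String[] args) {
            long hour = 3_600_000L;
            // Two samples of the same (hypothetical) device signal, in
            // adjacent hours, land in adjacent partitions.
            System.out.println(partitionKey("RB.A12", 7_200_500L, hour)); // RB.A12/2
            System.out.println(partitionKey("RB.A12", 7_199_999L, hour)); // RB.A12/1
        }
    }
    ```

    Replicating hot partitions then lets heterogeneous analysis requests be served in parallel, which is the direction the simulations in the paper explore.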