20 research outputs found

    Design and implementation of experimental data access security policy for HEPS container computing platform

    China’s High-Energy Photon Source (HEPS), the first national high-energy synchrotron radiation light source, is under design and construction. At the first stage of HEPS, it is predicted that 24 PB of raw experimental data will be produced per month from 14 beamlines. Faced with such a huge scale of scientific data and the diverse data analysis environments across light source disciplines, the HEPS scientific computing platform was designed and implemented based on container images and dynamic orchestration technology to provide HEPS users with a data analysis environment. In this article, a data access security strategy is designed and evaluated for the scientific computing platform to ensure the security and efficiency of data access throughout the entire data analysis process. First, the general situation of HEPS is introduced. Second, the challenges faced by the HEPS scientific computing system are analyzed. Third, the architecture and service process of the scientific computing platform are described from an IT perspective, and some key technical implementations are introduced in detail. Finally, the application effect of the data access security policy on the computing platform is demonstrated
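
    The abstract does not spell out the policy mechanism, so the following is only a minimal sketch, assuming a per-beamline access control list consulted before experiment data is mounted into a user's analysis container; the /heps/data/<beamline>/<run> layout and all names (BEAMLINE_ACL, authorized_mounts) are hypothetical.

        # Hypothetical sketch, not the HEPS implementation: filter the
        # data paths a user may see before they are bind-mounted into
        # an analysis container.
        import os

        # Assumed ACL: beamline id -> set of authorized user names.
        BEAMLINE_ACL = {
            "B1": {"alice", "bob"},
            "B2": {"carol"},
        }

        def authorized_mounts(user, requested_paths):
            """Keep only the paths the user may access, assuming an
            assumed /heps/data/<beamline>/<run> layout."""
            allowed = []
            for path in requested_paths:
                parts = os.path.normpath(path).split(os.sep)
                beamline = parts[3] if len(parts) > 3 else None
                if beamline and user in BEAMLINE_ACL.get(beamline, set()):
                    allowed.append(path)
            return allowed

        print(authorized_mounts("alice", ["/heps/data/B1/run001",
                                          "/heps/data/B2/run007"]))
        # -> ['/heps/data/B1/run001']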

    CernVM Workshop 2019

    IHEP has been using CVMFS since 2017, and the cvmfs-stratum-one.ihep.ac.cn server provides software repository replica services for cern.ch, opensciencegrid.org, egi.eu, and ihep.ac.cn. Part of this report will introduce the status and next plans of cvmfs-stratum-one.ihep.ac.cn. China's Large High Altitude Air Shower Observatory (LHAASO) is a cosmic ray detection facility located in the high mountains of Sichuan province. The experimental data of LHAASO captured by the detectors will be processed in a computer room located at high altitude in a harsh natural environment, and then transmitted to IHEP. The other part of this report will introduce the application of CVMFS in the LHAASO experiment and share some problems encountered in the use of CVMFS. The expected talk duration is 10 minutes
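
    The report itself is not reproduced here; as a hedged illustration of how a Stratum-1 keeps its replicas fresh, the sketch below drives the standard cvmfs_server snapshot command from Python, the way a periodic maintenance job might. The wrapper and its error handling are assumptions, not IHEP's actual tooling.

        # Hypothetical wrapper around `cvmfs_server snapshot -a`, which
        # updates all replicas configured on a Stratum-1; a cron-style
        # job might run something like this periodically.
        import subprocess

        # Repository domains replicated at IHEP, per the abstract.
        REPLICATED_DOMAINS = ["cern.ch", "opensciencegrid.org",
                              "egi.eu", "ihep.ac.cn"]

        def snapshot_all():
            print("refreshing replicas for:", ", ".join(REPLICATED_DOMAINS))
            result = subprocess.run(["cvmfs_server", "snapshot", "-a"],
                                    capture_output=True, text=True)
            if result.returncode != 0:
                # Surface the failure so an operator can intervene.
                print("snapshot failed:", result.stderr)
            return result.returncode

        if __name__ == "__main__":
            snapshot_all()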

    Machine Learning-based Anomaly Detection of Ganglia Monitoring Data in HEP Data Center

    This paper introduces a generic and scalable anomaly detection framework. Anomaly detection can improve operation and maintenance efficiency and ensure that experiments can be carried out effectively. The framework facilitates common tasks for machine learning-based anomaly detection methods, such as data sample building, retagging and visualization, deviation measurement, and performance measurement. The samples we used are sourced from Ganglia monitoring data. Several anomaly detection methods within the framework handle spatial and temporal anomalies. Finally, we show a preliminary application of the framework on the Lustre distributed file system in daily operation and maintenance
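
    The abstract does not name a specific algorithm; as an illustrative stand-in for the spatial-anomaly case, the sketch below applies scikit-learn's IsolationForest to one sampling interval of Ganglia-style host metrics. The metric columns and contamination value are assumptions.

        # Illustrative stand-in, not the paper's framework: flag outlier
        # hosts in one interval of Ganglia-style metrics.
        import numpy as np
        from sklearn.ensemble import IsolationForest

        # Rows are hosts; columns are assumed metrics:
        # cpu_load, mem_used_pct, bytes_in_per_s.
        samples = np.array([
            [0.42, 55.0, 1.2e6],
            [0.40, 57.0, 1.1e6],
            [0.45, 54.0, 1.3e6],
            [7.90, 98.0, 9.8e7],   # a clearly misbehaving host
        ])

        model = IsolationForest(contamination=0.25, random_state=0)
        labels = model.fit_predict(samples)   # -1 = anomaly, 1 = normal
        print([i for i, y in enumerate(labels) if y == -1])   # -> [3]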

    An automatic solution to make HTCondor more stable and easier

    HTCondor has been widely adopted by HEP clusters to provide high-level scheduling performance. Unlike other schedulers, HTCondor manages worker nodes loosely. We developed a maintenance automation tool called “HTCondor MAT” that focuses on dynamic resource management and automatic error handling. A central database records all worker node information, which is sent to each worker node for its startd configuration. If an error occurs on a worker node, the node information stored in the database is updated and the worker node is reconfigured with the new information. The new configuration stops the startd from accepting jobs affected by the error until the worker node recovers. MAT has been deployed in the IHEP HTC cluster to provide a central way to manage worker nodes and to remove the impact of worker node errors automatically
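
    The abstract describes the reconfiguration step but not its format; the sketch below is one hedged guess at its shape: a worker-node record from the central database is rendered into a startd config fragment whose START expression rejects job types affected by the current error. The record fields and the Requires<Error> job-attribute convention are assumptions; START, $(START), and TARGET are standard HTCondor config syntax.

        # Hypothetical sketch of MAT's reconfiguration step: render a
        # node record into a startd config fragment that stops the node
        # from accepting jobs affected by its current errors.
        def render_startd_config(node):
            lines = [f'NODE_GROUP = "{node["group"]}"']
            errors = node.get("errors", [])
            if errors:
                # Assumed convention: jobs advertise RequiresCVMFS etc.,
                # and each known error maps to one such flag.
                conds = " && ".join(f"(TARGET.Requires{e} =!= True)"
                                    for e in errors)
                lines.append(f"START = ($(START)) && {conds}")
            return "\n".join(lines)

        record = {"group": "htc", "errors": ["CVMFS"]}
        print(render_startd_config(record))
        # START = ($(START)) && (TARGET.RequiresCVMFS =!= True)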

    A Lightweight Submission Frontend Toolkit HepJob

    A typical HEP computing center normally runs at least one batch system. As an example, at IHEP (Institute of High Energy Physics, Chinese Academy of Sciences) we have used three batch systems: PBS, HTCondor, and SLURM. After running PBS as the local batch system for 10 years, we replaced it with HTCondor (for HTC) and SLURM (for HPC). During that transition, problems came up on both the user and admin sides. The introduction of the new batch systems requires users to acquire additional knowledge specific to each batch system, in particular its batch commands. In some cases, users have to use both HTCondor and SLURM in parallel. Furthermore, HTCondor and SLURM provide more functionality than the simple PBS commands, which means a more complicated usage mode. On the admin side, HTCondor gives more freedom to users, which brings an additional challenge for site administrators, who have to find solutions for many problems: preventing users from requesting resources they are not allowed to use, checking whether the required attributes are correct, and deciding where requested resources are located (the SLURM cluster, the virtual machine cluster, remote sites, etc.). To meet these requirements, HepJob was designed and developed. HepJob provides a set of simple user commands, for example hep_sub, hep_q, and hep_rm. In the submission process, HepJob checks all attributes and ensures they are correct, assigns proper resources to users (user and group information is obtained from the management database), routes jobs to the target site, and performs other steps as required. Users can start with HepJob very easily, and administrators can take the necessary management actions in HepJob
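
    The abstract lists the checks hep_sub performs without showing them; the sketch below is a hedged, simplified illustration of that attribute-validation step, producing a plain HTCondor submit description. The group names, limits, and the +Experiment routing attribute are assumptions, not HepJob's actual interface.

        # Hedged sketch of a hep_sub-style frontend: validate the
        # request against site policy, then emit an HTCondor submit
        # description the backend can route.
        ALLOWED_GROUPS = {"juno", "lhaaso", "bes3"}   # assumed

        def hep_sub(script, group, cores=1):
            # Reject resources the user may not use, as the abstract
            # describes.
            if group not in ALLOWED_GROUPS:
                raise ValueError(f"unknown experiment group: {group}")
            if not 1 <= cores <= 64:
                raise ValueError("cores must be between 1 and 64")
            return "\n".join([
                f"executable = {script}",
                f"request_cpus = {cores}",
                f'+Experiment = "{group}"',   # assumed routing hint
                "queue",
            ])

        print(hep_sub("job.sh", "juno", cores=4))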

    Cyber security detection and monitoring at IHEP private cloud for web services

    To improve hardware utilization and save manpower in system maintenance, most of the web services at IHEP have been migrated to a private cloud built upon OpenStack. However, cyber security attacks have progressively become a serious threat to the cloud. Therefore, a cyber security detection and monitoring system has been deployed for this cloud platform. The system collects various security-related logs as data sources and processes them in a framework composed of open-source data store, analysis, and visualization tools. With this system, security incidents and events can be handled in time, and rapid responses can be taken to protect the cloud platform against cyber security threats
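
    The abstract does not name the detection rules; purely as an illustration of the kind of check such a pipeline could run over collected web access logs, the sketch below flags common injection and path-traversal probes with standard-library regular expressions. The patterns and log format are assumptions.

        # Illustrative only: flag obviously suspicious web requests in
        # collected access logs before they reach the dashboards.
        import re

        SUSPICIOUS = [
            re.compile(r"\.\./"),                 # path traversal
            re.compile(r"(?i)union\s+select"),    # SQL injection probe
            re.compile(r"(?i)<script"),           # reflected XSS probe
        ]

        def scan(log_lines):
            return [line for line in log_lines
                    if any(p.search(line) for p in SUSPICIOUS)]

        logs = ['10.0.0.5 "GET /p.php?id=1 UNION SELECT user" 200',
                '10.0.0.6 "GET /docs/ HTTP/1.1" 200']
        print(scan(logs))   # flags only the first line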

    Evolution of the LHAASO Distributed Computing System based Cloud

    In this paper we will describe the LHAASO distributed computing system based on virtualization and cloud computing technologies. In particular, we discuss the key points of integrating distributed resources. A solution for integrating cross-domain resources is proposed, which adopts OpenStack+HTCondor to make the distributed resources work as a single resource pool. A flexible resource scheduling strategy and a job scheduling policy are presented to realize resource expansion on demand and efficient, transparent job scheduling to remote sites, so as to improve overall resource utilization. We will also introduce the deployment of the computing system located in Daocheng, the LHAASO observation base, using a cloud-based architecture, which greatly helps to reduce operation and maintenance costs and to ensure system availability and stability. Finally, we will show the running status of the system
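
    The paper's scheduling strategy is only summarized above; the sketch below is a hedged illustration of the on-demand expansion idea: when the idle-job backlog crosses a threshold, extra worker VMs are requested from the remote OpenStack site. count_idle_jobs and boot_worker_vm are hypothetical placeholders for the real HTCondor query and OpenStack server-creation calls.

        # Hedged sketch of elastic expansion, not the LHAASO system:
        # grow the pool when the idle-job backlog is too deep.
        IDLE_THRESHOLD = 100   # assumed policy values
        VMS_PER_ROUND = 10

        def count_idle_jobs():
            # Placeholder: an HTCondor queue query would go here.
            return 250

        def boot_worker_vm(site):
            # Placeholder: an OpenStack create-server call would go here.
            print(f"requesting one worker VM at {site}")

        def scale_if_needed(site="daocheng"):
            if count_idle_jobs() > IDLE_THRESHOLD:
                for _ in range(VMS_PER_ROUND):
                    boot_worker_vm(site)

        scale_if_needed()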