18 research outputs found

    Performance of an Operating High Energy Physics Data Grid: D0SAR-Grid

    Full text link
    The D0 experiment at Fermilab's Tevatron will record several petabytes of data over the next five years in pursuing the goals of understanding nature and searching for the origin of mass. Computing resources required to analyze these data far exceed capabilities of any one institution. Moreover, the widely scattered geographical distribution of D0 collaborators poses further serious difficulties for optimal use of human and computing resources. These difficulties will exacerbate in future high energy physics experiments, like the LHC. The computing grid has long been recognized as a solution to these problems. This technology is being made a more immediate reality to end users in D0 by developing a grid in the D0 Southern Analysis Region (D0SAR), D0SAR-Grid, using all available resources within it and a home-grown local task manager, McFarm. We will present the architecture in which the D0SAR-Grid is implemented, the use of technology and the functionality of the grid, and the experience from operating the grid in simulation, reprocessing and data analyses for a currently running HEP experiment.Comment: 3 pages, no figures, conference proceedings of DPF04 tal

    Towards high availability for high-performance computing system services: Accomplishments and limitations

    No full text
    During the last several years, our teams at Oak Ridge National Laboratory, Louisiana Tech University, and Tennessee Technological University focused on efficient redundancy strategies for head and service nodes of high-performance computing (HPC) systems in order to pave the way for high availability (HA) in HPC. These nodes typically run critical HPC system services, like job and resource management, and represent single points of failure and control for an entire HPC system. The overarching goal of our research is to provide high-level reliability, availability, and serviceability (RAS) for HPC systems by combining HA and HPC technology. This paper summarizes our accomplishments, such as developed concepts and implemented proof-of-concept prototypes, and describes existing limitations, such as performance issues, which need to be dealt with for production-type deployment

    Asymmetric Active-Active High Availability for High-end

    No full text
    Linux clusters have become very popular for scientific computing at research institutions world-wide, because they can be easily deployed at a fairly low cost. However, the most pressing issues of today`s cluster solutions are availability and serviceability. The conventional Beowulf cluster architecture has a single head node connected to a group of compute nodes. This head node is a typical single point of failure and control, which severely limits availability and serviceability by e#ectively cutting o# healthy compute nodes from the outside world upon overload or failure. In this paper, we describe a paradigm that addresses this issue using asymmetric active-active high availability. Our framework comprises of n + 1 head nodes, where n head nodes are active in the sense that they provide services to simultaneously incoming user requests. One standby server monitors all active servers and performs a fail-over in case of a detected outage. We present a prototype implementation based on a 2 + 1 solution and discuss initial results

    On programming models for service-level high availability

    No full text
    This paper provides an overview of existing programming models for service-level high availability and investigates their differences, similarities, advantages, and disadvantages. Its goal is to help to improve reuse of code and to allow adaptation to quality of service requirements. It further aims at encouraging a discussion about these programming models and their provided quality of service, such as availability, performance, serviceability, usability and applicability. Within this context, the presented research focuses on providing high availability for services running on head and service nodes of high-performance computing systems. The proposed conceptual service model and the discussed service-level high availability programming models are applicable to many parallel and distributed computing scenarios as a networked system‘s availability can invariably be improved by increasing the availability of its most critical services

    Active/active replication for highly available HPC system services

    No full text
    Today‘s high performance computing systems have several reliability deficiencies resulting in availability and serviceability issues. Head and service nodes represent a single point of failure and control for an entire system as they render it inaccessible and unmanageable in case of a failure until repair, causing a significant downtime. This paper introduces two distinct replication methods (internal and external) for providing symmetric active/active high availability for multiple head and service nodes running in virtual synchrony. It presents a comparison of both methods in terms of expected correctness, ease-of-use and performance based on early results from ongoing work in providing symmetric active/active high availability for two HPC system services (TORQUE and PVFS metadata server). It continues with a short description of a distributed mutual exclusion algorithm and a brief statement regarding the handling of Byzantine failures. This paper concludes with an overview of past and ongoing work, and a short summary of the presented research

    Transparent symmetric active/active replication for service-level high availability

    No full text
    As service-oriented architectures become more important in parallel and distributed computing systems, individual service instance reliability as well as appropriate service redundancy becomes an essential necessity in order to increase overall system availability. This paper focuses on providing redundancy strategies using service-level replication techniques. Based on previous research using symmetric active/active replication, this paper proposes a transparent symmetric active/active replication approach that allows for more reuse of code between individual service-level replication implementations by using a virtual communication layer. Serviceand client-side interceptors are utilized in order to provide total transparency. Clients and servers are unaware of the replication infrastructure as it provides all necessary mechanisms internally. 1
    corecore