556 research outputs found

    A Pattern Language for High-Performance Computing Resilience

    Full text link
    High-performance computing systems (HPC) provide powerful capabilities for modeling, simulation, and data analytics for a broad class of computational problems. They enable extreme performance of the order of quadrillion floating-point arithmetic calculations per second by aggregating the power of millions of compute, memory, networking and storage components. With the rapidly growing scale and complexity of HPC systems for achieving even greater performance, ensuring their reliable operation in the face of system degradations and failures is a critical challenge. System fault events often lead the scientific applications to produce incorrect results, or may even cause their untimely termination. The sheer number of components in modern extreme-scale HPC systems and the complex interactions and dependencies among the hardware and software components, the applications, and the physical environment makes the design of practical solutions that support fault resilience a complex undertaking. To manage this complexity, we developed a methodology for designing HPC resilience solutions using design patterns. We codified the well-known techniques for handling faults, errors and failures that have been devised, applied and improved upon over the past three decades in the form of design patterns. In this paper, we present a pattern language to enable a structured approach to the development of HPC resilience solutions. The pattern language reveals the relations among the resilience patterns and provides the means to explore alternative techniques for handling a specific fault model that may have different efficiency and complexity characteristics. Using the pattern language enables the design and implementation of comprehensive resilience solutions as a set of interconnected resilience patterns that can be instantiated across layers of the system stack.Comment: Proceedings of the 22nd European Conference on Pattern Languages of Program

    A review of experiences with reliable multicast

    Get PDF

    PROPOSED MIDDLEWARE SOLUTION FOR RESOURCE-CONSTRAINED DISTRIBUTED EMBEDDED NETWORKS

    Get PDF
    The explosion in processing power of embedded systems has enabled distributed embedded networks to perform more complicated tasks. Middleware are sets of encapsulations of common and network/operating system-specific functionality into generic, reusable frameworks to manage such distributed networks. This thesis will survey and categorize popular middleware implementations into three adapted layers: host-infrastructure, distribution, and common services. This thesis will then apply a quantitative approach to grading and proposing a single middleware solution from all layers for two target platforms: CubeSats and autonomous unmanned aerial vehicles (UAVs). CubeSats are 10x10x10cm nanosatellites that are popular university-level space missions, and impose power and volume constraints. Autonomous UAVs are similarly-popular hobbyist-level vehicles that exhibit similar power and volume constraints. The MAVLink middleware from the host-infrastructure layer is proposed as the middleware to manage the distributed embedded networks powering these platforms in future projects. Finally, this thesis presents a performance analysis on MAVLink managing the ARM Cortex-M 32-bit processors that power the target platforms

    Multicast-Based Interactive-Group Object-Replication For Fault Tolerance

    Get PDF
    Distributed systems are clusters of computers working together on one task. The sharing of information across different architectures, and the timely and efficient use of the network resources for communication among computers are some of the problems involved in the implementation of a distributed system. In the case of a low latency system, the network utilization and the responsiveness of the communication mechanism are even more critical. This thesis introduces a new approach for the distribution of messages to computers in the system, in which, the Common Object Request Broker Architecture (CORBA) is used in conjunction with IP multicast to implement a fault-tolerant, low latency distributed system. Fault tolerance is achieved by replication of the current state of the system across several hosts. An update of the current state is initiated by a client application that contacts one of the state object replicas. The new information needs to be distributed to all the members of the distributed system (the object replicas). This state update is accomplished by using a two-phase commit protocol, which is implemented using a binary tree structure along with IP multicast to reduce the amount of network utilization, distribute the computation load associated with state propagation, and to achieve faster communication among the members of the distributed system. The use of IP multicast enhances the speed of message distribution, while the two-phase commit protocol encapsulates IP multicast to produce a reliable multicast service that is suitable for fault tolerant, distributed low latency applications. The binary tree structure, finally, is essential for the load sharing of the state commit response collection processing

    Employing CORBA in the Mobile Telecommunications Sector and the Application of Component Oriented Programming Techniques

    Get PDF
    This paper focuses on the implications of employing CORBA Middleware in today’s wireless telecommunications services and applications developments. It seeks to understand what current opportunities exist to support the use of this technology and assesses some of the performance implications that might be encountered. Discussion also revolves around the use of Component Oriented Development techniques in this same vertical and observations are made as to the benefits and practical problems associated with pursuit of this methodology in wireless telecommunications development programmes. Interest in this topic was aroused after observing the continued development of wireless telecommunications technology (Third Generation - 3G) that allows users to maintain permanent but connectionless contact with more traditional Wide Area Network (WAN) & Local Area Network (LAN) based services. As the boundaries between cellular, wireless LAN and traditional fixed network technologies become increasingly blurred, trends are emerging that seek to extend and migrate the technologies that are currently employed on cabled and wireless LAN networks over to the cellular networks and mass market user equipment

    Security issues in adaptive distributed systems

    Get PDF
    Adaptive Distributed Systems (ADSs) are distributed systems that can evolve their behaviors based on changes in their environments. In this work, we discuss security and propose security metrics issues in the context of ADSs. A key premise with adaptation of distributed systems is that in order to detect changes, information must be collected by monitoring the system and its environment. How monitoring should be done, what should be monitored, and the impact monitoring may have on the security mechanism of the target system need to be carefully considered. Conversely, the impact of implementation of security mechanism on the adaptation of distributed system is also assessed. We propose security metrics that can be used to quantify the impact of monitoring on the security mechanism of the target distributed system

    Generic Distribution Support for Programming Systems

    Get PDF
    This dissertation provides constructive proof, through the implementation of a middleware, that distribution transparency is practical, generic, and extensible. Fault tolerant distributed services can be developed by using the failure detection abilities of the middleware. By generic we mean that the middleware can be used for many different programming languages and paradigms. Distribution for each kind of language entity is done in terms of consistency protocols, which guarantee that the semantics of the entities are preserved in a distributed setting. The middleware allows new consistency protocols to be added easily. The efficiency of the middleware and the ease of integration are shown by coupling the middleware to a programming system, which encompasses the object oriented, the functional, and the concurrent-declarative programming paradigms. Our measurements show that the distribution middleware is competitive with the most popular distributed programming systems (JavaRMI, .NET, IBM CORBA)
    • …
    corecore