465 research outputs found

    A Pattern Language for High-Performance Computing Resilience

    Full text link
    High-performance computing systems (HPC) provide powerful capabilities for modeling, simulation, and data analytics for a broad class of computational problems. They enable extreme performance of the order of quadrillion floating-point arithmetic calculations per second by aggregating the power of millions of compute, memory, networking and storage components. With the rapidly growing scale and complexity of HPC systems for achieving even greater performance, ensuring their reliable operation in the face of system degradations and failures is a critical challenge. System fault events often lead the scientific applications to produce incorrect results, or may even cause their untimely termination. The sheer number of components in modern extreme-scale HPC systems and the complex interactions and dependencies among the hardware and software components, the applications, and the physical environment makes the design of practical solutions that support fault resilience a complex undertaking. To manage this complexity, we developed a methodology for designing HPC resilience solutions using design patterns. We codified the well-known techniques for handling faults, errors and failures that have been devised, applied and improved upon over the past three decades in the form of design patterns. In this paper, we present a pattern language to enable a structured approach to the development of HPC resilience solutions. The pattern language reveals the relations among the resilience patterns and provides the means to explore alternative techniques for handling a specific fault model that may have different efficiency and complexity characteristics. Using the pattern language enables the design and implementation of comprehensive resilience solutions as a set of interconnected resilience patterns that can be instantiated across layers of the system stack.Comment: Proceedings of the 22nd European Conference on Pattern Languages of Program

    Towards Middleware for Fault-tolerance in Distributed Real-time and Embedded Systems

    Get PDF
    Abstract. Distributed real-time and embedded (DRE) systems often require support for multiple simultaneous quality of service (QoS) properties, such as real-timeliness and fault tolerance, that operate within resource constrained environments. These resource constraints motivate the need for a lightweight middleware infrastructure, while the need for simultaneous QoS properties require the middleware to provide fault tolerance capabilities that respect time-critical needs of DRE systems. Conventional middleware solutions, such as Fault-tolerant CORBA (FT-CORBA) and Continuous Availability API for J2EE, have limited utility for DRE systems because they are heavyweight (e.g., the complexity of their feature-rich fault tolerance capabilities consumes excessive runtime resources), yet incomplete (e.g., they lack mechanisms that enable fault tolerance while maintaining real-time predictability). This paper provides three contributions to the development and standardization of lightweight real-time and fault-tolerant middleware for DRE systems. First, we discuss the challenges in realizing real-time faulttolerant solutions for DRE systems using contemporary middleware. Second, we describe recent progress towards standardizing a CORBA lightweight fault-tolerance specification for DRE systems. Third, we present the architecture of FLARe, which is a prototype based on the OMG real-time fault-tolerant CORBA middleware standardization efforts that is lightweight (e.g., leverages only those server-and client-side mechanisms required for real-time systems) and predictable (e.g., provides fault-tolerant mechanisms that respect time-critical performance needs of DRE systems)

    Generic Distribution Support for Programming Systems

    Get PDF
    This dissertation provides constructive proof, through the implementation of a middleware, that distribution transparency is practical, generic, and extensible. Fault tolerant distributed services can be developed by using the failure detection abilities of the middleware. By generic we mean that the middleware can be used for many different programming languages and paradigms. Distribution for each kind of language entity is done in terms of consistency protocols, which guarantee that the semantics of the entities are preserved in a distributed setting. The middleware allows new consistency protocols to be added easily. The efficiency of the middleware and the ease of integration are shown by coupling the middleware to a programming system, which encompasses the object oriented, the functional, and the concurrent-declarative programming paradigms. Our measurements show that the distribution middleware is competitive with the most popular distributed programming systems (JavaRMI, .NET, IBM CORBA)

    Integrating legacy mainframe systems: architectural issues and solutions

    Get PDF
    For more than 30 years, mainframe computers have been the backbone of computing systems throughout the world. Even today it is estimated that some 80% of the worlds' data is held on such machines. However, new business requirements and pressure from evolving technologies, such as the Internet is pushing these existing systems to their limits and they are reaching breaking point. The Banking and Financial Sectors in particular have been relying on mainframes for the longest time to do their business and as a result it is they that feel these pressures the most. In recent years there have been various solutions for enabling a re-engineering of these legacy systems. It quickly became clear that to completely rewrite them was not possible so various integration strategies emerged. Out of these new integration strategies, the CORBA standard by the Object Management Group emerged as the strongest, providing a standards based solution that enabled the mainframe applications become a peer in a distributed computing environment. However, the requirements did not stop there. The mainframe systems were reliable, secure, scalable and fast, so any integration strategy had to ensure that the new distributed systems did not lose any of these benefits. Various patterns or general solutions to the problem of meeting these requirements have arisen and this research looks at applying some of these patterns to mainframe based CORBA applications. The purpose of this research is to examine some of the issues involved with making mainframebased legacy applications inter-operate with newer Object Oriented Technologies

    Efficient Customizable Middleware

    Get PDF
    The rather large feature set of current Distributed Object Computing (DOC) middleware can be a liability for certain applications which have a need for only a certain subset of these features but have to suffer performance degradation and code bloat due to all the present features. To address this concern, a unique approach to building fully customizable middleware was undertaken in FACET, a CORBA event channel written using AspectJ. FACET consists of a small, essential core that represents the basic structure and functionality of an event channel into which additional features are woven using aspects so that the resulting event channel supports all of the features needed by a given embedded application. However, the use of CORBA as the underlying transport mechanism may make FACET unsuitable for use in small-scale embedded systems because of the considerable footprint of many ORBs. In this thesis, we describe how the use of CORBA in the event channel can be made an optional feature in building highly efficient middle-ware. We look at the challenges that arise in abstracting the method invocation layer, document design patterns discovered and present quantitative footprint, throughput performance data and analysis. We also examine the problem of integrating FACET, written in Java, into the Boeing Open Experimental Platform (OEP), written in C++, in order to serve as a replacement for the TAO Real-Time Event Channel (RTEC). We evaluate the available alternatives in building such an implementation for efficiency, describe our use of a native-code compiler for Java, gcj, and present data on the efficacy of this approach. Finally, we take preliminary look into the problem of efficiently testing middleware with a large number of highly granular features. Since the number of possible combinations grow exponentially, building and testing all possible combinations quickly becomes impractical. To address this, we examine the conditions under which features are non-interfering. Non-interfering features will only need to be tested in isolation removing the need to test features in combination thus reducing the intractability of the problem

    MDDPro: Model-Driven Dependability Provisioning in Enterprise Distributed Real-Time and Embedded Systems

    Get PDF
    Abstract Service oriented architecture (SOA) design principles are increasingly being adopted to develop distributed real-time and embedded (DRE) systems, such as avionics mission computing, due to the availability of real-time component middleware platforms. Traditional approaches to fault tolerance that rely on replication and recovery of a single server or a single host do not work in this paradigm since the fault management schemes must now account for the timely and simultaneous failover of groups of entities while improving system availability by minimizing the risk of simultaneous failures of replicated entities. This paper describes MDDPro, a model-driven dependability provisioning tool for DRE systems. MDDPro provides intuitive modeling abstractions to specify failover requirements of DRE systems at different granularities. MDDPro enables plugging in different replica placement algorithms to improve system availability. Finally, its generative capabilities automate the deployment and configuration of the DRE system on the underlying platforms

    PROPOSED MIDDLEWARE SOLUTION FOR RESOURCE-CONSTRAINED DISTRIBUTED EMBEDDED NETWORKS

    Get PDF
    The explosion in processing power of embedded systems has enabled distributed embedded networks to perform more complicated tasks. Middleware are sets of encapsulations of common and network/operating system-specific functionality into generic, reusable frameworks to manage such distributed networks. This thesis will survey and categorize popular middleware implementations into three adapted layers: host-infrastructure, distribution, and common services. This thesis will then apply a quantitative approach to grading and proposing a single middleware solution from all layers for two target platforms: CubeSats and autonomous unmanned aerial vehicles (UAVs). CubeSats are 10x10x10cm nanosatellites that are popular university-level space missions, and impose power and volume constraints. Autonomous UAVs are similarly-popular hobbyist-level vehicles that exhibit similar power and volume constraints. The MAVLink middleware from the host-infrastructure layer is proposed as the middleware to manage the distributed embedded networks powering these platforms in future projects. Finally, this thesis presents a performance analysis on MAVLink managing the ARM Cortex-M 32-bit processors that power the target platforms
    corecore