332 research outputs found

    Space station payload operations scheduling with ESP2

    Get PDF
    The Mission Analysis Division of the Systems Analysis and Integration Laboratory at the Marshall Space Flight Center is developing a system of programs to handle all aspects of scheduling payload operations for Space Station. The Expert Scheduling Program (ESP2) is the heart of this system. The task of payload operations scheduling can be simply stated as positioning the payload activities in a mission so that they collect their desired data without interfering with other activities or violating mission constraints. ESP2 is an advanced version of the Experiment Scheduling Program (ESP) which was developed by the Mission Integration Branch beginning in 1979 to schedule Spacelab payload activities. The automatic scheduler in ESP2 is an expert system that embodies the rules that expert planners would use to schedule payload operations by hand. This scheduler uses depth-first searching, backtracking, and forward chaining techniques to place an activity so that constraints (such as crew, resources, and orbit opportunities) are not violated. It has an explanation facility to show why an activity was or was not scheduled at a certain time. The ESP2 user can also place the activities in the schedule manually. The program offers graphical assistance to the user and will advise when constraints are being violated. ESP2 also has an option to identify conflict introduced into an existing schedule by changes to payload requirements, mission constraints, and orbit opportunities

    A Survey on Transactional Stream Processing

    Full text link
    Transactional stream processing (TSP) strives to create a cohesive model that merges the advantages of both transactional and stream-oriented guarantees. Over the past decade, numerous endeavors have contributed to the evolution of TSP solutions, uncovering similarities and distinctions among them. Despite these advances, a universally accepted standard approach for integrating transactional functionality with stream processing remains to be established. Existing TSP solutions predominantly concentrate on specific application characteristics and involve complex design trade-offs. This survey intends to introduce TSP and present our perspective on its future progression. Our primary goals are twofold: to provide insights into the diverse TSP requirements and methodologies, and to inspire the design and development of groundbreaking TSP systems

    An Optimization Based Design for Integrated Dependable Real-Time Embedded Systems

    Get PDF
    Moving from the traditional federated design paradigm, integration of mixedcriticality software components onto common computing platforms is increasingly being adopted by automotive, avionics and the control industry. This method faces new challenges such as the integration of varied functionalities (dependability, responsiveness, power consumption, etc.) under platform resource constraints and the prevention of error propagation. Based on model driven architecture and platform based design’s principles, we present a systematic mapping process for such integration adhering a transformation based design methodology. Our aim is to convert/transform initial platform independent application specifications into post integration platform specific models. In this paper, a heuristic based resource allocation approach is depicted for the consolidated mapping of safety critical and non-safety critical applications onto a common computing platform meeting particularly dependability/fault-tolerance and real-time requirements. We develop a supporting tool suite for the proposed framework, where VIATRA (VIsual Automated model TRAnsformations) is used as a transformation tool at different design steps. We validate the process and provide experimental results to show the effectiveness, performance and robustness of the approach

    Adjoint-Based Uncertainty Quantification and Sensitivity Analysis for Reactor Depletion Calculations

    Get PDF
    Depletion calculations for nuclear reactors model the dynamic coupling between the material composition and neutron flux and help predict reactor performance and safety characteristics. In order to be trusted as reliable predictive tools and inputs to licensing and operational decisions, the simulations must include an accurate and holistic quantification of errors and uncertainties in its outputs. Uncertainty quantification is a formidable challenge in large, realistic reactor models because of the large number of unknowns and myriad sources of uncertainty and error. We present a framework for performing efficient uncertainty quantification in depletion problems using an adjoint approach, with emphasis on high-fidelity calculations using advanced massively parallel computing architectures. This approach calls for a solution to two systems of equations: (a) the forward, engineering system that models the reactor, and (b) the adjoint system, which is mathematically related to but different from the forward system. We use the solutions of these systems to produce sensitivity and error estimates at a cost that does not grow rapidly with the number of uncertain inputs. We present the framework in a general fashion and apply it to both the source-driven and k-eigenvalue forms of the depletion equations. We describe the implementation and verification of solvers for the forward and ad- joint equations in the PDT code, and we test the algorithms on realistic reactor analysis problems. We demonstrate a new approach for reducing the memory and I/O demands on the host machine, which can be overwhelming for typical adjoint algorithms. Our conclusion is that adjoint depletion calculations using full transport solutions are not only computationally tractable, they are the most attractive option for performing uncertainty quantification on high-fidelity reactor analysis problems

    Locality-driven checkpoint and recovery

    Get PDF
    Checkpoint and recovery are important fault-tolerance techniques for distributed systems. The two categories of existing strategies incur unacceptable performance cost either at run time or upon failure recovery, when applied to large-scale distributed systems. In particular, the large number of messages and processes in these systems causes either considerable checkpoint as well as logging overhead, or catastrophic global-wise recovery effect. This thesis proposes a locality-driven strategy for efficiently checkpointing and recovering such systems with both affordable runtime cost and controllable failure recoverability. Messages establish dependencies between distributed processes, which can be either preserved by coordinated checkpoints or removed via logging. Existing strategies enforce a uniform handling policy for all message dependencies, and hence gains advantage at one end but bears disadvantage at the other. In this thesis, a generic theory of Quasi-Atomic Recovery has been formulated to accommodate message handling requirements of both kinds, and to allow using different message handling methods together. Quasi-atomicity of recovery blocks implies proper confinement of recoveries, and thus enables localization of checkpointing and recovery around such a block and consequently a hybrid strategy with combined advantages from both ends. A strategy of group checkpointing with selective logging has been proposed, based on the observation of message localization around 'locality regions' in distributed systems. In essence, a group-wise coordinated checkpoint is created around such a region and only the few inter-region messages are logged subsequently. Runtime overhead is optimized due to largely reduced logging efforts, and recovery spread is as localized as region-wise. Various protocols have been developed to provide trade-offs between flexibility and performance. Also proposed is the idea of process clone that can be used to effectively remove program-order recovery dependencies among successive group checkpoints and thus to stop inter-group recovery spread. Distributed executions exhibit locality of message interactions. Such locality originates from resolving distributed dependency localization via message passing, and appears as a hierarchical 'region-transition' pattern. A bottom-up approach has been proposed to identify those regions, by detecting popular recurrence patterns from individual processes as 'locality intervals', and then composing them into 'locality regions' based on their tight message coupling relations between each other. Experiments conducted on real-life applications have shown the existence of hierarchical locality regions and have justified the feasibility of this approach. Performance optimization of group checkpoint strategies has to do with their uses of locality. An abstract performance measure has been-proposed to properly integrate both runtime overhead and failure recoverability in a region-wise marner. Taking this measure as the optimization objective, a greedy heuristic has been introduced to decompose a given distributed execution into optimized regions. Analysis implies that an execution pattern with good locality leads to good optimized performance, and the locality pattern itself can serve as a good candidate for the optimal decomposition. Consequently, checkpoint protocols have been developed to efficiently identify optimized regions in such an execution, with assistance of either design-time or runtime knowledge

    A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud

    Get PDF
    High Performance Computing (HPC) systems have been widely used by scientists and researchers in both industry and university laboratories to solve advanced computation problems. Most advanced computation problems are either data-intensive or computation-intensive. They may take hours, days or even weeks to complete execution. For example, some of the traditional HPC systems computations run on 100,000 processors for weeks. Consequently traditional HPC systems often require huge capital investments. As a result, scientists and researchers sometimes have to wait in long queues to access shared, expensive HPC systems. Cloud computing, on the other hand, offers new computing paradigms, capacity, and flexible solutions for both business and HPC applications. Some of the computation-intensive applications that are usually executed in traditional HPC systems can now be executed in the cloud. Cloud computing price model eliminates huge capital investments. However, even for cloud-based HPC systems, fault tolerance is still an issue of growing concern. The large number of virtual machines and electronic components, as well as software complexity and overall system reliability, availability and serviceability (RAS), are factors with which HPC systems in the cloud must contend. The reactive fault tolerance approach of checkpoint/restart, which is commonly used in HPC systems, does not scale well in the cloud due to resource sharing and distributed systems networks. Hence, the need for reliable fault tolerant HPC systems is even greater in a cloud environment. In this thesis we present a proactive fault tolerance approach to HPC systems in the cloud to reduce the wall-clock execution time, as well as dollar cost, in the presence of hardware failure. We have developed a generic fault tolerance algorithm for HPC systems in the cloud. We have further developed a cost model for executing computation-intensive applications on HPC systems in the cloud. Our experimental results obtained from a real cloud execution environment show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be considerably reduced compared to checkpoint and redundancy techniques used in traditional HPC systems
    • …
    corecore