1,179 research outputs found

    Parallel Architectures for Planetary Exploration Requirements (PAPER)

    Get PDF
    The Parallel Architectures for Planetary Exploration Requirements (PAPER) project is essentially research oriented towards technology insertion issues for NASA's unmanned planetary probes. It was initiated to complement and augment the long-term efforts for space exploration with particular reference to NASA/LaRC's (NASA Langley Research Center) research needs for planetary exploration missions of the mid and late 1990s. The requirements for space missions as given in the somewhat dated Advanced Information Processing Systems (AIPS) requirements document are contrasted with the new requirements from JPL/Caltech involving sensor data capture and scene analysis. It is shown that more stringent requirements have arisen as a result of technological advancements. Two possible architectures, the AIPS Proof of Concept (POC) configuration and the MAX Fault-tolerant dataflow multiprocessor, were evaluated. The main observation was that the AIPS design is biased towards fault tolerance and may not be an ideal architecture for planetary and deep space probes due to high cost and complexity. The MAX concepts appears to be a promising candidate, except that more detailed information is required. The feasibility for adding neural computation capability to this architecture needs to be studied. Key impact issues for architectural design of computing systems meant for planetary missions were also identified

    A runtime heuristic to selectively replicate tasks for application-specific reliability targets

    Get PDF
    In this paper we propose a runtime-based selective task replication technique for task-parallel high performance computing applications. Our selective task replication technique is automatic and does not require modification/recompilation of OS, compiler or application code. Our heuristic, we call App_FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that App FIT selective replication heuristic is low-overhead and highly scalable. In addition, results indicate that complete task replication is overkill for achieving reliability targets. We show that with App FIT, we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated.This work was supported by FI-DGR 2013 scholarship and the European Community’s Seventh Framework Programme [FP7/2007-2013] under the Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402 and in part by the European Union (FEDER funds) under contract TIN2015-65316-P.Peer ReviewedPostprint (author's final draft

    Predictable migration and communication in the Quest-V multikernal

    Full text link
    Quest-V is a system we have been developing from the ground up, with objectives focusing on safety, predictability and efficiency. It is designed to work on emerging multicore processors with hardware virtualization support. Quest-V is implemented as a ``distributed system on a chip'' and comprises multiple sandbox kernels. Sandbox kernels are isolated from one another in separate regions of physical memory, having access to a subset of processing cores and I/O devices. This partitioning prevents system failures in one sandbox affecting the operation of other sandboxes. Shared memory channels managed by system monitors enable inter-sandbox communication. The distributed nature of Quest-V means each sandbox has a separate physical clock, with all event timings being managed by per-core local timers. Each sandbox is responsible for its own scheduling and I/O management, without requiring intervention of a hypervisor. In this paper, we formulate bounds on inter-sandbox communication in the absence of a global scheduler or global system clock. We also describe how address space migration between sandboxes can be guaranteed without violating service constraints. Experimental results on a working system show the conditions under which Quest-V performs real-time communication and migration.National Science Foundation (1117025

    Extending an Application-Level Checkpointing Tool to Provide Fault Tolerance Support to OpenMP Applications

    Get PDF
    [Abstract] Despite the increasing popularity of shared-memory systems, there is a lack of tools for providing fault tolerance support to shared-memory applications. CPPC (ComPiler for Portable Checkpointing) is an application-level checkpointing tool focused on the insertion of fault tolerance into long-running MPI applications. This paper presents an extension to CPPC to allow the checkpointing of OpenMP applications. The proposed solution maintains the main characteristics of CPPC: portability and reduced checkpoint file size. The performance of the proposal is evaluated using the OpenMP NAS Parallel Benchmarks showing that most of the applications present small checkpoint overheads.Ministerio de Economía y Competitividad; TIN2013-42148-

    Approaches to multiprocessor error recovery using an on-chip interconnect subsystem

    Get PDF
    For future multicores, a dedicated interconnect subsystem for on-chip monitors was found to be highly beneficial in terms of scalability, performance and area. In this thesis, such a monitor network (MNoC) is used for multicores to support selective error identification and recovery and maintain target chip reliability in the context of dynamic voltage and frequency scaling (DVFS). A selective shared memory multiprocessor recovery is performed using MNoC in which, when an error is detected, only the group of processors sharing an application with the affected processors are recovered. Although the use of DVFS in contemporary multicores provides significant protection from unpredictable thermal events, a potential side effect can be an increased processor exposure to soft errors. To address this issue, a flexible fault prevention and recovery mechanism has been developed to selectively enable a small amount of per-core dual modular redundancy (DMR) in response to increased vulnerability, as measured by the processor architectural vulnerability factor (AVF). Our new algorithm for DMR deployment aims to provide a stable effective soft error rate (SER) by using DMR in response to DVFS caused by thermal events. The algorithm is implemented in real-time on the multicore using MNoC and controller which evaluates thermal information and multicore performance statistics in addition to error information. DVFS experiments with a multicore simulator using standard benchmarks show an average 6% improvement in overall power consumption and a stable SER by using selective DMR versus continuous DMR deployment
    • …
    corecore