405 research outputs found

    Fine-Grain Checkpointing with In-Cache-Line Logging

    Full text link
    Non-Volatile Memory offers the possibility of implementing high-performance, durable data structures. However, achieving performance comparable to well-designed data structures in non-persistent (transient) memory is difficult, primarily because of the cost of ensuring the order in which memory writes reach NVM. Often, this requires flushing data to NVM and waiting a full memory round-trip time. In this paper, we introduce two new techniques: Fine-Grained Checkpointing, which ensures a consistent, quickly recoverable data structure in NVM after a system failure, and In-Cache-Line Logging, an undo-logging technique that enables recovery of earlier state without requiring cache-line flushes in the normal case. We implemented these techniques in the Masstree data structure, making it persistent and demonstrating the ease of applying them to a highly optimized system and their low (5.9-15.4\%) runtime overhead cost.Comment: In 2019 Architectural Support for Programming Languages and Operating Systems (ASPLOS 19), April 13, 2019, Providence, RI, US

    Towards Model Checking of Network Applications for IoT System Development

    Get PDF
    With the expansion of the Internet, Internet of Things (IoT) gains lots of interest from industries and academia. IoT applications enable human-to-device and device-to-device interactions. For a successful deployment of IoT systems and services, software reliability is a very important requirement for IoT to ensure that data/messages have been received and performed properly in a timely manner. The concurrent connections of embedded sensors and actuators are nondeterministic in nature which makes testing insufficient to guarantee program correctness. In contrast, model checking techniques explore the entire behavior of a system under test (SUT) in brute-force and systematic manner. It investigates each reachable state for different thread schedules. Recent model checking techniques have been applied directly to networked programs. This paper reviews model checking techniques for networked applications and presents their strengths and limitations. A preliminary proposal for model checking of networked applications that addresses the identified gap is presented

    Enabling Distributed Applications Optimization in Cloud Environment

    Get PDF
    The past few years have seen dramatic growth in the popularity of public clouds, such as Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Container-as-a-Service (CaaS). In both commercial and scientific fields, quick environment setup and application deployment become a mandatory requirement. As a result, more and more organizations choose cloud environments instead of setting up the environment by themselves from scratch. The cloud computing resources such as server engines, orchestration, and the underlying server resources are served to the users as a service from a cloud provider. Most of the applications that run in public clouds are the distributed applications, also called multi-tier applications, which require a set of servers, a service ensemble, that cooperate and communicate to jointly provide a certain service or accomplish a task. Moreover, a few research efforts are conducting in providing an overall solution for distributed applications optimization in the public cloud. In this dissertation, we present three systems that enable distributed applications optimization: (1) the first part introduces DocMan, a toolset for detecting containerized application’s dependencies in CaaS clouds, (2) the second part introduces a system to deal with hot/cold blocks in distributed applications, (3) the third part introduces a system named FP4S, a novel fragment-based parallel state recovery mechanism that can handle many simultaneous failures for a large number of concurrently running stream applications

    Hard and Soft Error Resilience for One-sided Dense Linear Algebra Algorithms

    Get PDF
    Dense matrix factorizations, such as LU, Cholesky and QR, are widely used by scientific applications that require solving systems of linear equations, eigenvalues and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This dissertation develops fault tolerance algorithms for one-sided dense matrix factorizations, which handles Both hard and soft errors. For hard errors, we propose methods based on diskless checkpointing and Algorithm Based Fault Tolerance (ABFT) to provide full matrix protection, including the left and right factor that are normally seen in dense matrix factorizations. A horizontal parallel diskless checkpointing scheme is devised to maintain the checkpoint data with scalable performance and low space overhead, while the ABFT checksum that is generated before the factorization constantly updates itself by the factorization operations to protect the right factor. In addition, without an available fault tolerant MPI supporting environment, we have also integrated the Checkpoint-on-Failure(CoF) mechanism into one-sided dense linear operations such as QR factorization to recover the running stack of the failed MPI process. Soft error is more challenging because of the silent data corruption, which leads to a large area of erroneous data due to error propagation. Full matrix protection is developed where the left factor is protected by column-wise local diskless checkpointing, and the right factor is protected by a combination of a floating point weighted checksum scheme and soft error modeling technique. To allow practical use on large scale system, we have also developed a complexity reduction scheme such that correct computing results can be recovered with low performance overhead. Experiment results on large scale cluster system and multicore+GPGPU hybrid system have confirmed that our hard and soft error fault tolerance algorithms exhibit the expected error correcting capability, low space and performance overhead and compatibility with double precision floating point operation

    NearPM: A Near-Data Processing System for Storage-Class Applications

    Full text link
    Persistent Memory (PM) technologies enable program recovery to a consistent state in a case of failure. To ensure this crash-consistent behavior, programs need to enforce persist ordering by employing mechanisms, such as logging and checkpointing, which introduce additional data movement. The emerging near-data processing (NDP) architec-tures can effectively reduce this data movement overhead. In this work we propose NearPM, a near data processor that supports accelerable primitives in crash consistent programs. Using these primitives NearPM accelerate commonly used crash consistency mechanisms logging, checkpointing, and shadow-paging. NearPM further reduces the synchronization overheads between the NDP and the CPU to guarantee persistent ordering by moving ordering handling near memory. We ensures a correct persist ordering between CPU and NDP devices, as well as among multiple NDP devices with Partitioned Persist Ordering (PPO). We prototype NearPM on an FPGA platform.1 NearPM executes data-intensive operations in crash consistency mechanisms with correct ordering guarantees while the rest of the program runs on the CPU. We evaluate nine PM workloads, where each work load supports three crash consistency mechanisms -logging, checkpointing, and shadow paging. Overall, NearPM achieves 4.3-9.8X speedup in the NDP-offloaded operations and 1.22-1.35X speedup in end-to-end execution

    JOLTS : checkpointing and coordination in grid systems

    Get PDF
    The need for increased computational power is growing faster than our ability to produce faster computers. Already researchers are proposing systems that require peta-flop capable super computers, a far cry from what is currently capable. To meet such high computational requirements, networks of computers will be required. While it is possible to network together computers to achieve a single task, making that network more flexible to handle a multitude of different tasks is the promise of grid computing. Grid systems are slowly appearing that are designed to run many independent tasks, and provide the ability for programs to migrate between machines before completion. However, these systems lack coordination capabilities. Many grid systems/environments allow multiple tasks to communicate/coordinate with each other based on various paradigms, but don't provide migration capabilities. This thesis proposes a system, called JOLTS, that attempts to fill a gap by providing both checkpointing and coordination capabilities. The coordination model offered by JOLTS is based on the Objective Linda coordination language, with some additions. This thesis will show that the object space model is an effective form of coordination and communication, and can effectively be combined with checkpointing capabilities inside the same grid system
    • …
    corecore