Fine-Grain Checkpointing with In-Cache-Line Logging
Non-Volatile Memory offers the possibility of implementing high-performance,
durable data structures. However, achieving performance comparable to
well-designed data structures in non-persistent (transient) memory is
difficult, primarily because of the cost of ensuring the order in which memory
writes reach NVM. Often, this requires flushing data to NVM and waiting a full
memory round-trip time.
In this paper, we introduce two new techniques: Fine-Grained Checkpointing,
which ensures a consistent, quickly recoverable data structure in NVM after a
system failure, and In-Cache-Line Logging, an undo-logging technique that
enables recovery of earlier state without requiring cache-line flushes in the
normal case. We implemented these techniques in the Masstree data structure,
making it persistent and demonstrating the ease of applying them to a highly
optimized system and their low (5.9-15.4%) runtime overhead cost.
Comment: In 2019 Architectural Support for Programming Languages and Operating
Systems (ASPLOS 19), April 13, 2019, Providence, RI, US
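To illustrate the general idea of undo logging combined with checkpoints, here is a minimal sketch. It is not the paper's In-Cache-Line Logging: it does not model cache lines, flushes, or NVM, and the class and method names (`UndoLoggedStore`, `write`, `checkpoint`, `recover`) are hypothetical. It only shows how recording old values lets a store roll back to its last checkpoint.

```python
class UndoLoggedStore:
    """Toy key-value store with undo logging and checkpoints (illustrative only)."""

    def __init__(self):
        self.data = {}      # stands in for the durable data structure
        self.undo_log = []  # (key, had_key, old_value) entries since the last checkpoint

    def write(self, key, value):
        # Record the old state before overwriting it.
        had_key = key in self.data
        self.undo_log.append((key, had_key, self.data.get(key)))
        self.data[key] = value

    def checkpoint(self):
        # The current state becomes the recovery point, so older undo
        # entries are no longer needed.
        self.undo_log.clear()

    def recover(self):
        # Undo writes newest-first until the last checkpoint is restored.
        while self.undo_log:
            key, had_key, old_value = self.undo_log.pop()
            if had_key:
                self.data[key] = old_value
            else:
                self.data.pop(key, None)


store = UndoLoggedStore()
store.write("a", 1)
store.checkpoint()
store.write("a", 2)   # not yet checkpointed
store.write("b", 3)   # not yet checkpointed
store.recover()       # simulate recovery after a failure
```

After `recover()`, the store holds only the state captured at the checkpoint.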
Towards Model Checking of Network Applications for IoT System Development
With the expansion of the Internet, the Internet of Things (IoT) has gained considerable interest from industry and academia. IoT applications enable human-to-device and device-to-device interactions. For the successful deployment of IoT systems and services, software reliability is a key requirement: it ensures that data and messages are received and processed properly and in a timely manner. The concurrent connections of embedded sensors and actuators are nondeterministic in nature, which makes testing insufficient to guarantee program correctness. In contrast, model checking techniques explore the entire behavior of a system under test (SUT) in a brute-force, systematic manner, investigating each reachable state under different thread schedules. Recent model checking techniques have been applied directly to networked programs. This paper reviews model checking techniques for networked applications and presents their strengths and limitations. A preliminary proposal for model checking of networked applications that addresses the identified gap is presented.
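The contrast between testing and exhaustive schedule exploration can be sketched with a toy example. The code below is an assumption-laden illustration, not a technique from the reviewed paper: two hypothetical threads each perform a non-atomic read-then-write increment of a shared counter, and the helper functions (`interleavings`, `run`, `check_all`) enumerate every thread schedule and check the final value against the expected result.

```python
def interleavings(a, b):
    """Yield every merge of step lists a and b that preserves each list's order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one schedule of (thread, op) steps on a shared counter."""
    shared = 0
    local = {}
    for thread, op in schedule:
        if op == "read":
            local[thread] = shared        # thread copies the shared value
        else:                             # "write"
            shared = local[thread] + 1    # thread writes back its stale copy + 1
    return shared


def check_all():
    """Explore all schedules and collect those that violate 'final counter == 2'."""
    t1 = [("T1", "read"), ("T1", "write")]
    t2 = [("T2", "read"), ("T2", "write")]
    schedules = list(interleavings(t1, t2))
    violations = [s for s in schedules if run(s) != 2]
    return len(schedules), violations


total, violations = check_all()
print(f"{len(violations)} of {total} schedules lose an update")
```

A single test run exercises only one of these schedules, so it may never hit a failing one; enumerating all of them finds every violation, at the cost of a state space that grows quickly with the number of steps.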
Enabling Distributed Applications Optimization in Cloud Environment
The past few years have seen dramatic growth in the popularity of public clouds, such as Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Container-as-a-Service (CaaS). In both commercial and scientific fields, quick environment setup and application deployment have become mandatory requirements. As a result, more and more organizations choose cloud environments instead of setting up the environment themselves from scratch. Cloud computing resources, such as server engines, orchestration, and the underlying server resources, are delivered to users as a service by a cloud provider. Most applications that run in public clouds are distributed applications, also called multi-tier applications, which require a set of servers (a service ensemble) that cooperate and communicate to jointly provide a certain service or accomplish a task. However, few research efforts have aimed to provide an overall solution for optimizing distributed applications in the public cloud.
In this dissertation, we present three systems that enable the optimization of distributed applications: (1) the first part introduces DocMan, a toolset for detecting the dependencies of containerized applications in CaaS clouds; (2) the second part introduces a system to deal with hot/cold blocks in distributed applications; (3) the third part introduces FP4S, a novel fragment-based parallel state recovery mechanism that can handle many simultaneous failures for a large number of concurrently running stream applications.
Hard and Soft Error Resilience for One-sided Dense Linear Algebra Algorithms
Dense matrix factorizations, such as LU, Cholesky and QR, are widely used by scientific applications that require solving systems of linear equations, eigenvalue problems and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This dissertation develops fault tolerance algorithms for one-sided dense matrix factorizations, which handle both hard and soft errors.
For hard errors, we propose methods based on diskless checkpointing and Algorithm Based Fault Tolerance (ABFT) to provide full matrix protection, including the left and right factors that are normally seen in dense matrix factorizations. A horizontal parallel diskless checkpointing scheme is devised to maintain the checkpoint data with scalable performance and low space overhead, while the ABFT checksum, generated before the factorization, is continually updated by the factorization operations to protect the right factor. In addition, for the case where no fault-tolerant MPI environment is available, we have also integrated the Checkpoint-on-Failure (CoF) mechanism into one-sided dense linear operations such as QR factorization to recover the running stack of the failed MPI process.
Soft errors are more challenging because of silent data corruption, which leads to a large area of erroneous data due to error propagation. Full matrix protection is developed where the left factor is protected by column-wise local diskless checkpointing, and the right factor is protected by a combination of a floating point weighted checksum scheme and a soft error modeling technique. To allow practical use on large-scale systems, we have also developed a complexity reduction scheme such that correct computing results can be recovered with low performance overhead.
Experimental results on a large-scale cluster system and a multicore+GPGPU hybrid system have confirmed that our hard and soft error fault tolerance algorithms exhibit the expected error correcting capability, low space and performance overhead, and compatibility with double precision floating point operations.
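The checksum idea behind ABFT can be sketched on a simpler operation than factorization. The example below is an illustration under stated assumptions, not the dissertation's algorithm: it uses plain matrix multiplication with integer entries (so comparisons are exact), appends a column-checksum row to A and a row-checksum column to B, and uses the resulting checksums on the product to locate and correct a single corrupted data element. All function names are hypothetical.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of b a final entry holding that row's sum."""
    return [row + [sum(row)] for row in b]


def correct_single_error(c_full):
    """Find and fix one corrupted data element; return its (row, col) or None.

    Assumes at most one data element is wrong and the checksums are intact.
    """
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        return None
    i, j = bad_rows[0], bad_cols[0]
    others = sum(c_full[i][k] for k in range(m) if k != j)
    c_full[i][j] = c_full[i][m] - others   # rebuild the element from the row checksum
    return i, j


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
c_full[0][1] = 99                          # inject a silent corruption
fixed_at = correct_single_error(c_full)
```

The product of the checksum-extended matrices carries valid checksums for C = AB, which is why the encoding survives the computation. With floating point data, the exact comparisons here would need a tolerance.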
NearPM: A Near-Data Processing System for Storage-Class Applications
Persistent Memory (PM) technologies enable program recovery to a consistent
state in case of failure. To ensure this crash-consistent behavior, programs
need to enforce persist ordering by employing mechanisms, such as logging and
checkpointing, which introduce additional data movement. The emerging near-data
processing (NDP) architectures can effectively reduce this data movement
overhead. In this work, we propose NearPM, a near-data processor that supports
accelerable primitives in crash-consistent programs. Using these primitives,
NearPM accelerates commonly used crash consistency mechanisms: logging,
checkpointing, and shadow-paging. NearPM further reduces the synchronization
overheads between the NDP and the CPU to guarantee persistent ordering by
moving ordering handling near memory. We ensure correct persist ordering
between CPU and NDP devices, as well as among multiple NDP devices, with
Partitioned Persist Ordering (PPO). We prototype NearPM on an FPGA platform.
NearPM executes data-intensive operations in crash consistency mechanisms with
correct ordering guarantees while the rest of the program runs on the CPU. We
evaluate nine PM workloads, where each workload supports three crash
consistency mechanisms: logging, checkpointing, and shadow paging. Overall,
NearPM achieves 4.3-9.8X speedup in the NDP-offloaded operations and 1.22-1.35X
speedup in end-to-end execution.
JOLTS: checkpointing and coordination in grid systems
The need for increased computational power is growing faster than our ability to produce faster computers. Researchers are already proposing systems that require petaflop-capable supercomputers, a far cry from what is currently available. To meet such high computational requirements, networks of computers will be required. While it is possible to network computers together to accomplish a single task, making that network flexible enough to handle a multitude of different tasks is the promise of grid computing.
Grid systems are slowly appearing that are designed to run many independent tasks, and provide the ability for programs to migrate between machines before completion. However, these systems lack coordination capabilities. Many grid systems/environments allow multiple tasks to communicate/coordinate with each other based on various paradigms, but don't provide migration capabilities.
This thesis proposes a system, called JOLTS, that attempts to fill this gap by providing both checkpointing and coordination capabilities. The coordination model offered by JOLTS is based on the Objective Linda coordination language, with some additions. This thesis will show that the object space model is an effective form of coordination and communication, and can effectively be combined with checkpointing capabilities inside the same grid system.
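For readers unfamiliar with Linda-style coordination, here is a minimal sketch of a shared space with the classic `out`, `rd`, and `in` operations. It is an assumption-based illustration of the original tuple-space model, not Objective Linda's object matching and not the JOLTS API: the `TupleSpace` class is hypothetical, matching uses `None` as a wildcard, and the operations return `None` instead of blocking when nothing matches.

```python
class TupleSpace:
    """Toy, single-process, non-blocking tuple space (illustrative only)."""

    def __init__(self):
        self._tuples = []

    def out(self, tup):
        # Publish a tuple into the shared space.
        self._tuples.append(tuple(tup))

    @staticmethod
    def _matches(pattern, tup):
        # A pattern matches when lengths agree and every non-None field is equal.
        return len(pattern) == len(tup) and all(
            p is None or p == t for p, t in zip(pattern, tup))

    def rd(self, pattern):
        # Read the first matching tuple without removing it.
        for tup in self._tuples:
            if self._matches(pattern, tup):
                return tup
        return None

    def in_(self, pattern):
        # Take (remove and return) the first matching tuple.
        # Named in_ because "in" is a Python keyword.
        for index, tup in enumerate(self._tuples):
            if self._matches(pattern, tup):
                return self._tuples.pop(index)
        return None


space = TupleSpace()
space.out(("task", 1, "render"))
space.out(("task", 2, "encode"))
peeked = space.rd(("task", None, "encode"))   # leaves the tuple in place
taken = space.in_(("task", None, None))       # removes the first task
remaining = space.rd(("task", None, None))
```

Because producers and consumers interact only through the space, they do not need to know about each other, which is what makes this style of coordination attractive alongside task migration.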