Efficient Logging in Non-Volatile Memory by Exploiting Coherency Protocols
Non-volatile memory (NVM) technologies such as PCM, ReRAM and STT-RAM allow
processors to directly write values to persistent storage at speeds that are
significantly faster than previous durable media such as hard drives or SSDs.
Many applications of NVM are constructed on a logging subsystem, which enables
operations to appear to execute atomically and facilitates recovery from
failures. Writes to NVM, however, pass through a processor's memory system,
which can delay and reorder them, impairing the correctness and increasing the
cost of logging algorithms.
Reordering arises because of out-of-order execution in a CPU and the
inter-processor cache coherence protocol. By carefully considering the
properties of these reorderings, this paper develops a logging protocol that
requires only one round trip to non-volatile memory while avoiding expensive
computations. We show how to extend the logging protocol to build a persistent
set (hash map) that also requires only a single round trip to non-volatile
memory for insertion, update, or deletion.
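The abstract does not spell out how the coherence-order reasoning removes work,
so the following is only a minimal sketch of the baseline idiom such a protocol
improves on: appending one undo-log entry so that a single flush-and-fence
round trip makes it durable. The x86 CLWB/SFENCE intrinsics and the entry
layout are illustrative assumptions, not the paper's protocol.

    #include <immintrin.h>   // _mm_clwb / _mm_sfence (build with CLWB support, e.g. -mclwb)
    #include <cstddef>
    #include <cstdint>

    struct alignas(64) LogEntry {   // one entry per 64-byte cache line
        uint64_t txid;              // transaction id
        uint64_t addr;              // address whose old value is logged
        uint64_t old_value;         // undo value
        uint64_t valid;             // written last; real protocols also guard
                                    // against torn lines (e.g., with a checksum)
        uint8_t  pad[32];
    };

    // 'log' points into a persistent-memory mapping (e.g., a DAX-mapped file).
    void append_undo(LogEntry* log, size_t slot,
                     uint64_t txid, uint64_t addr, uint64_t old_value) {
        LogEntry& e = log[slot];
        e.txid      = txid;
        e.addr      = addr;
        e.old_value = old_value;
        e.valid     = 1;            // completes the entry within the same line
        _mm_clwb(&e);               // write the line back toward NVM
        _mm_sfence();               // single ordering point: one round trip on the critical path
    }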
Fine-Grain Checkpointing with In-Cache-Line Logging
Non-Volatile Memory offers the possibility of implementing high-performance,
durable data structures. However, achieving performance comparable to
well-designed data structures in non-persistent (transient) memory is
difficult, primarily because of the cost of ensuring the order in which memory
writes reach NVM. Often, this requires flushing data to NVM and waiting for a
full memory round-trip time.
In this paper, we introduce two new techniques: Fine-Grained Checkpointing,
which ensures a consistent, quickly recoverable data structure in NVM after a
system failure, and In-Cache-Line Logging, an undo-logging technique that
enables recovery of earlier state without requiring cache-line flushes in the
normal case. We implemented these techniques in the Masstree data structure,
making it persistent and demonstrating the ease of applying them to a highly
optimized system and their low (5.9-15.4%) runtime overhead.
Published in 2019 Architectural Support for Programming Languages and Operating
Systems (ASPLOS '19), April 13, 2019, Providence, RI, USA.
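The abstract does not give the node layout, so this is only a hypothetical
sketch of the in-cache-line logging idea: a slice of each 64-byte line is
reserved for an undo record of that line's own data, so an update and its undo
information land in the same cache line and no flush is needed on the fast
path. The field names and the epoch tag are assumptions, not the paper's
Masstree-based implementation.

    #include <cstdint>

    struct alignas(64) InClLine {
        uint64_t data[6];        // 48 bytes of payload
        uint64_t undo_value;     // previous value of the slot being updated
        uint8_t  undo_slot;      // which payload slot the undo value belongs to
        uint8_t  epoch;          // checkpoint epoch; stale undo records are ignored
        uint8_t  pad[6];
    };

    // Update one word in the line, recording the undo information in-line first.
    void update(InClLine& line, unsigned slot, uint64_t new_value, uint8_t cur_epoch) {
        line.undo_value = line.data[slot];          // old value stays in the same cache line
        line.undo_slot  = static_cast<uint8_t>(slot);
        line.epoch      = cur_epoch;                // tags the undo record with this epoch
        line.data[slot] = new_value;                // in-place update; no flush on the fast path
    }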
Parendi: Thousand-Way Parallel RTL Simulation
Hardware development relies on simulations, particularly cycle-accurate RTL
(Register Transfer Level) simulations, which consume significant time. As
single-processor performance grows only slowly, conventional, single-threaded
RTL simulation is becoming less practical for increasingly complex chips and
systems. A solution is parallel RTL simulation, where ideally, simulators could
run on thousands of parallel cores. However, existing simulators can only
exploit tens of cores.
This paper studies the challenges inherent in running parallel RTL simulation
on a multi-thousand-core machine (the Graphcore IPU, a 1472-core machine).
Simulation performance requires balancing three factors: synchronization,
communication, and computation. We experimentally evaluate each factor and
analyze how it affects parallel simulation speed, drawing on contrasts between
the large-scale IPU and smaller but faster x86 systems.
Using this analysis, we build Parendi, an RTL simulator for the IPU. It
distributes RTL simulation across 5888 cores on 4 IPU sockets. Parendi runs
large RTL designs up to 4x faster than a powerful, state-of-the-art x86
multicore system.
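Parendi's IPU-specific machinery is much more involved; the sketch below only
makes the three factors concrete with a generic, barrier-synchronized per-cycle
loop (computation, then communication, with synchronization at cycle
boundaries). The Partition type and the two-barrier structure are illustrative,
not Parendi's code.

    #include <barrier>
    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    // One partition of the elaborated design, owned by one worker thread.
    struct Partition {
        std::function<void()> evaluate;   // computation: evaluate this slice for the current cycle
        std::function<void()> exchange;   // communication: publish boundary signals to neighbors
    };

    // Requires C++20 for std::barrier and std::jthread.
    void simulate(std::vector<Partition>& parts, long cycles) {
        std::barrier sync(static_cast<std::ptrdiff_t>(parts.size()));
        std::vector<std::jthread> workers;
        for (auto& part : parts) {
            Partition* p = &part;
            workers.emplace_back([p, &sync, cycles] {
                for (long c = 0; c < cycles; ++c) {
                    p->evaluate();             // computation
                    sync.arrive_and_wait();    // synchronization: all slices finished cycle c
                    p->exchange();             // communication
                    sync.arrive_and_wait();    // synchronization: boundary values visible for c+1
                }
            });
        }
    }   // std::jthread joins all workers here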
Object-Oriented Recovery for Non-volatile Memory
New non-volatile memory (NVM) technologies enable direct, durable storage of
data in an application's heap. Durable, randomly accessible memory facilitates
the construction of applications that do not lose data at system shutdown or
power failure. Existing NVM programming frameworks provide mechanisms to
consistently capture a running application's state. They do not, however, fully
support object-oriented languages or ensure that the persistent heap is
consistent with the environment when the application is restarted.
In this paper, we propose a new NVM language extension and runtime system that
supports object-oriented NVM programming and avoids the pitfalls of prior
approaches. At the heart of our technique is object reconstruction, which
transparently restores and reconstructs a persistent object's state during
program restart. It is implemented in NVMReconstruction, a Clang/LLVM extension
and runtime library that provides: (i) transient fields in persistent objects,
(ii) support for virtual functions and function pointers, (iii) direct
representation of persistent pointers as virtual addresses, and (iv)
type-specific reconstruction of a persistent object during program restart. In
addition, NVMReconstruction supports updating an application's code, even if
this causes objects to expand, by providing object migration. NVMReconstruction
can also compact the persistent heap to reduce fragmentation. In experiments,
we demonstrate the versatility and usability of object reconstruction and its
low runtime performance cost.
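As a rough illustration of object reconstruction (the actual NVMReconstruction
annotations and runtime hooks are not shown in the abstract), the hypothetical
class below mixes persistent fields with a transient one and supplies a
type-specific hook that the runtime would call at restart to rebuild run-local
state.

    #include <cstddef>
    #include <cstdio>

    class PersistentLogFile {
    public:
        // Persistent state: lives in the NVM heap across restarts.
        char        path[256];
        std::size_t bytes_written;

        // Transient state: meaningless after a restart, must be rebuilt.
        std::FILE*  handle;                   // hypothetical "transient" field

        // Type-specific reconstruction, invoked (in this sketch) by the runtime
        // for every persistent object of this type when the program restarts.
        void reconstruct() {
            handle = std::fopen(path, "ab");  // reopen run-local resource
        }
    };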
Manticore: Hardware-Accelerated RTL Simulation with Static Bulk-Synchronous Parallelism
The demise of Moore's Law and Dennard Scaling has revived interest in
specialized computer architectures and accelerators. Verification and testing
of this hardware heavily uses cycle-accurate simulation of
register-transfer-level (RTL) designs. The best software RTL simulators can
simulate designs at 1-1000 kHz, i.e., more than three orders of magnitude
slower than hardware. Faster simulation can increase productivity by speeding
design iterations and permitting more exhaustive exploration.
One possibility is to use parallelism as RTL exposes considerable fine-grain
concurrency. However, state-of-the-art RTL simulators generally perform best
when single-threaded since modern processors cannot effectively exploit
fine-grain parallelism.
This work presents Manticore: a parallel computer designed to accelerate RTL
simulation. Manticore uses a static bulk-synchronous parallel (BSP) execution
model to eliminate runtime synchronization barriers among many simple
processors. Manticore relies entirely on its compiler to schedule resources and
communication. Because RTL code is practically free of long divergent execution
paths, static scheduling is feasible. Communication and synchronization no
longer incur runtime overhead, enabling efficient fine-grain parallelism.
Moreover, static scheduling dramatically simplifies the physical
implementation, significantly increasing the potential parallelism on a chip.
Our 225-core FPGA prototype running at 475 MHz outperforms a state-of-the-art
RTL simulator on an Intel Xeon processor running at 3.3 GHz by up to
27.9x (geomean 5.3x) on nine Verilog benchmarks.
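The sketch below is a conceptual model of static bulk-synchronous scheduling,
not Manticore's ISA or compiler: every core receives a fixed slot sequence in
which compute, send, and receive positions were chosen at compile time, so the
run-time loop simply steps all cores in lock step with no barriers or
handshakes. The toy operations and names are illustrative only.

    #include <cstdint>
    #include <vector>

    enum class Op : uint8_t { Add, Send, Recv, Nop };

    struct Slot { Op op; int a, b, dst; };        // operands chosen by the compiler

    struct Core {
        std::vector<Slot>    program;             // fixed schedule for this core
        std::vector<int64_t> regs = std::vector<int64_t>(16, 0);
    };

    // One lock-step "cycle": every core executes the slot at position `step`.
    // Because the compiler placed each Recv after the matching Send's delivery
    // step, no core ever has to wait or check for data at run time.
    void step_all(std::vector<Core>& cores, std::vector<int64_t>& network, int step) {
        for (auto& c : cores) {
            const Slot& s = c.program[step];
            switch (s.op) {
                case Op::Add:  c.regs[s.dst] = c.regs[s.a] + c.regs[s.b]; break;
                case Op::Send: network[s.dst] = c.regs[s.a];              break;
                case Op::Recv: c.regs[s.dst] = network[s.a];              break;
                case Op::Nop:  break;
            }
        }
    }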
Jiagu: Optimizing Serverless Computing Resource Utilization with Harmonized Efficiency and Practicability
Current serverless platforms struggle to optimize resource utilization due to
their dynamic and fine-grained nature. Conventional techniques like
overcommitment and autoscaling fall short, often sacrificing utilization for
practicability or incurring performance trade-offs. Overcommitment requires
predicting performance to prevent QoS violations, which introduces a trade-off
between prediction accuracy and prediction overhead. Autoscaling requires
scaling instances quickly in response to load fluctuations to reduce resource
waste, but more frequent scaling also incurs more cold-start overhead. This
paper introduces
Jiagu, which harmonizes efficiency with practicability through two novel
techniques. First, pre-decision scheduling achieves accurate prediction while
eliminating its overhead by decoupling prediction from scheduling. Second,
dual-staged scaling adjusts instances frequently with minimal overhead. We have
implemented a prototype and evaluated it using real-world
applications and traces from a public cloud platform. Our evaluation shows a
54.8% improvement in deployment density over commercial clouds (with
Kubernetes) while maintaining QoS, 81.0%-93.7% lower scheduling costs, and a
57.4%-69.3% reduction in cold-start latency compared to existing QoS-aware
schedulers from research.
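A simplified sketch of the pre-decision idea as stated in the abstract (not
Jiagu's implementation): the interference prediction for a function on each
candidate node is computed off the critical path and cached, so the scheduler's
hot path is a table lookup rather than a predictor invocation. All names below
are hypothetical.

    #include <optional>
    #include <string>
    #include <unordered_map>

    struct Placement { std::string node; double predicted_qos; };

    class PreDecisionScheduler {
    public:
        // Slow path, run ahead of time (e.g., when node load changes): evaluate the
        // predictor for every candidate node and cache the best viable placement.
        void precompute(const std::string& fn,
                        const std::unordered_map<std::string, double>& predicted_qos_by_node,
                        double qos_target) {
            std::optional<Placement> best;
            for (const auto& [node, qos] : predicted_qos_by_node) {
                if (qos >= qos_target && (!best || qos > best->predicted_qos))
                    best = Placement{node, qos};
            }
            if (best) decisions_[fn] = *best;
        }

        // Fast path, run on each scheduling request: no prediction, just a lookup.
        std::optional<Placement> schedule(const std::string& fn) const {
            auto it = decisions_.find(fn);
            if (it == decisions_.end()) return std::nullopt;
            return it->second;
        }

    private:
        std::unordered_map<std::string, Placement> decisions_;
    };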