Algorithm-Directed Crash Consistence in Non-Volatile Memory for HPC
Fault tolerance is one of the major design goals for HPC. The emergence of
non-volatile memory (NVM) provides a way to build fault-tolerant HPC systems.
Data in NVM-based main memory are not lost when the system crashes, because of
the non-volatile nature of NVM. However, because of volatile caches, data
must be logged and explicitly flushed from caches into NVM to ensure
consistency and correctness before a crash, which can cause large runtime
overhead.
In this paper, we introduce an algorithm-based method to establish crash
consistency in NVM for HPC applications. We slightly extend application data
structures or sparsely flush cache blocks, which introduces negligible runtime
overhead. Such extension or cache flushing allows us to use algorithm knowledge
to reason about data consistency or to correct inconsistent data when the
application crashes. We demonstrate the effectiveness of our method for three
algorithms: an iterative solver, dense matrix multiplication, and Monte Carlo
simulation. Based on a comprehensive performance evaluation across a variety of
test environments, we show that our approach has very small runtime overhead
(at most 8.2% and less than 3% in most cases), much smaller than that of
traditional checkpointing, while having the same or lower recomputation cost.
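To make the cache-flushing side of the method concrete, the following minimal C sketch shows the standard x86 pattern for making a single value durable in NVM-backed memory: store, flush the cache line, then fence. The helper name persist_double is an illustrative assumption, not code from the paper; the paper's contribution is deciding, from algorithm knowledge, how few of these flushes (or data-structure extensions) are actually needed.

#include <immintrin.h>   /* x86 intrinsics: _mm_clflush, _mm_sfence */

/* Illustrative helper (not from the paper): write one value to an
 * NVM-backed address and make it durable before execution continues. */
static inline void persist_double(double *addr, double val)
{
    *addr = val;          /* ordinary store; may linger in a volatile cache */
    _mm_clflush(addr);    /* evict the cache line so the data reaches NVM   */
    _mm_sfence();         /* order the flush before any subsequent stores   */
}

On recent processors the flush would typically use clwb or clflushopt instead of clflush, but the store-flush-fence ordering that guarantees durability is the same.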
μ-DDRL: A QoS-Aware Distributed Deep Reinforcement Learning Technique for Service Offloading in Fog Computing Environments
Fog and Edge computing extend cloud services to the proximity of end users,
enabling many Internet of Things (IoT) use cases, particularly latency-critical
applications. Smart devices, such as traffic and surveillance cameras, often do
not have sufficient resources to process computation-intensive and
latency-critical services. Hence, the constituent parts of services can be
offloaded to nearby Edge/Fog resources for processing and storage. However,
making offloading decisions for complex services in highly stochastic and
dynamic environments is an important yet difficult task. Recently, Deep
Reinforcement Learning (DRL) has been applied to many complex service offloading
problems; however, existing techniques are mostly suited to centralized
environments, and their convergence to the most suitable solutions is slow. In
addition, the constituent parts of services often have predefined data dependencies
and quality-of-service constraints, which further intensify the complexity of
service offloading. To address these issues, we propose a distributed DRL
technique that follows the actor-critic architecture and builds on Asynchronous
Proximal Policy Optimization (APPO) to achieve efficient and diverse
distributed generation of experience trajectories. We also employ PPO clipping and
V-trace for off-policy correction, yielding faster convergence to the most
suitable service offloading solutions. The results demonstrate that
our technique converges quickly, offers high scalability and adaptability, and
outperforms its counterparts by improving the execution time of heterogeneous
services.
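For reference, the two off-policy corrections named in the abstract have standard published forms, restated below in generic notation (importance ratio r_t(\theta), advantage estimate \hat{A}_t, behaviour policy \mu, thresholds \epsilon, \bar{\rho}, \bar{c}); this notation is an assumption of this sketch and is not taken from the μ-DDRL paper.

PPO clipped surrogate objective (Schulman et al., 2017):
\[ L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}. \]

V-trace target for off-policy value correction (Espeholt et al., 2018):
\[ v_s = V(x_s) + \sum_{t=s}^{s+n-1}\gamma^{\,t-s}\left(\prod_{i=s}^{t-1}c_i\right)\delta_t V, \qquad \delta_t V=\rho_t\big(r_t+\gamma V(x_{t+1})-V(x_t)\big), \]
\[ \rho_t=\min\!\left(\bar{\rho},\,\frac{\pi(a_t\mid x_t)}{\mu(a_t\mid x_t)}\right), \qquad c_i=\min\!\left(\bar{c},\,\frac{\pi(a_i\mid x_i)}{\mu(a_i\mid x_i)}\right). \]

Truncating the importance weights with \bar{\rho} and \bar{c} bounds the variance caused by the gap between the behaviour policy \mu (the distributed actors) and the learner policy \pi, which is what allows asynchronously generated trajectories to be reused without destabilising training.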