29,480 research outputs found
Securing Real-Time Internet-of-Things
Modern embedded and cyber-physical systems are ubiquitous. A large number of
critical cyber-physical systems have real-time requirements (e.g., avionics,
automobiles, power grids, manufacturing systems, industrial control systems,
etc.). Recent developments and new functionality requires real-time embedded
devices to be connected to the Internet. This gives rise to the real-time
Internet-of-things (RT-IoT) that promises a better user experience through
stronger connectivity and efficient use of next-generation embedded devices.
However RT- IoT are also increasingly becoming targets for cyber-attacks which
is exacerbated by this increased connectivity. This paper gives an introduction
to RT-IoT systems, an outlook of current approaches and possible research
challenges towards secure RT- IoT frameworks
Checkpointing as a Service in Heterogeneous Cloud Environments
A non-invasive, cloud-agnostic approach is demonstrated for extending
existing cloud platforms to include checkpoint-restart capability. Most cloud
platforms currently rely on each application to provide its own fault
tolerance. A uniform mechanism within the cloud itself serves two purposes: (a)
direct support for long-running jobs, which would otherwise require a custom
fault-tolerant mechanism for each application; and (b) the administrative
capability to manage an over-subscribed cloud by temporarily swapping out jobs
when higher priority jobs arrive. An advantage of this uniform approach is that
it also supports parallel and distributed computations, over both TCP and
InfiniBand, thus allowing traditional HPC applications to take advantage of an
existing cloud infrastructure. Additionally, an integrated health-monitoring
mechanism detects when long-running jobs either fail or incur exceptionally low
performance, perhaps due to resource starvation, and proactively suspends the
job. The cloud-agnostic feature is demonstrated by applying the implementation
to two very different cloud platforms: Snooze and OpenStack. The use of a
cloud-agnostic architecture also enables, for the first time, migration of
applications from one cloud platform to another.Comment: 20 pages, 11 figures, appears in CCGrid, 201
Integrated analysis of error detection and recovery
An integrated modeling and analysis of error detection and recovery is presented. When fault latency and/or error latency exist, the system may suffer from multiple faults or error propagations which seriously deteriorate the fault-tolerant capability. Several detection models that enable analysis of the effect of detection mechanisms on the subsequent error handling operations and the overall system reliability were developed. Following detection of the faulty unit and reconfiguration of the system, the contaminated processes or tasks have to be recovered. The strategies of error recovery employed depend on the detection mechanisms and the available redundancy. Several recovery methods including the rollback recovery are considered. The recovery overhead is evaluated as an index of the capabilities of the detection and reconfiguration mechanisms
Algorithm-Directed Crash Consistence in Non-Volatile Memory for HPC
Fault tolerance is one of the major design goals for HPC. The emergence of
non-volatile memories (NVM) provides a solution to build fault tolerant HPC.
Data in NVM-based main memory are not lost when the system crashes because of
the non-volatility nature of NVM. However, because of volatile caches, data
must be logged and explicitly flushed from caches into NVM to ensure
consistence and correctness before crashes, which can cause large runtime
overhead.
In this paper, we introduce an algorithm-based method to establish crash
consistence in NVM for HPC applications. We slightly extend application data
structures or sparsely flush cache blocks, which introduce ignorable runtime
overhead. Such extension or cache flushing allows us to use algorithm knowledge
to \textit{reason} data consistence or correct inconsistent data when the
application crashes. We demonstrate the effectiveness of our method for three
algorithms, including an iterative solver, dense matrix multiplication, and
Monte-Carlo simulation. Based on comprehensive performance evaluation on a
variety of test environments, we demonstrate that our approach has very small
runtime overhead (at most 8.2\% and less than 3\% in most cases), much smaller
than that of traditional checkpoint, while having the same or less
recomputation cost.Comment: 12 page
- âŠ