Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication
This paper presents FT-GAIA, a software-based fault-tolerant parallel and
distributed simulation middleware. FT-GAIA has been designed to reliably
handle Parallel And Distributed Simulation (PADS) models, which are needed to
properly simulate and analyze complex systems arising in any kind of scientific
or engineering field. PADS takes advantage of multiple execution units found in
multicore processors, clusters of workstations or HPC systems. However, large
computing systems, such as HPC systems that include hundreds of thousands of
computing nodes, have to handle frequent failures of some components. To cope
with this issue, FT-GAIA transparently replicates simulation entities and
distributes them on multiple execution nodes. This allows the simulation to
tolerate crash-failures of computing nodes. Moreover, FT-GAIA offers some
protection against Byzantine failures, since interaction messages among the
simulated entities are replicated as well, so that the receiving entity can
identify and discard corrupted messages. Results from an analytical model and
from an experimental evaluation show that FT-GAIA provides a high degree of
fault tolerance, at the cost of a moderate increase in the computational load
of the execution units.
Comment: arXiv admin note: substantial text overlap with arXiv:1606.0731
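The abstract above states that replicated interaction messages let the receiving entity identify and discard corrupted copies. A minimal sketch of one way a receiver could do this follows, assuming a strict-majority vote over the copies it receives; the voting rule and the name `accept_message` are illustrative assumptions, not details taken from FT-GAIA.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def accept_message(copies: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the payload carried by a strict majority of the copies.

    Hypothetical helper: each element of `copies` is the payload received
    from one replica of the sending entity. Returns None when no payload
    reaches a strict majority, so the receiver can drop the message.
    """
    if not copies:
        return None
    payload, votes = Counter(copies).most_common(1)[0]
    # Strict majority: more than half of the received copies must agree.
    return payload if votes > len(copies) // 2 else None


if __name__ == "__main__":
    # Three replicas, one of which delivers a corrupted payload.
    print(accept_message(["move:3", "move:3", "move:9"]))
```

Under this assumed rule, three replicas mask one corrupted copy, while crash failures alone only require that at least one replica survives.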
Fault-Tolerant Adaptive Parallel and Distributed Simulation
Discrete Event Simulation is a widely used technique to model
and analyze complex systems in many fields of science and engineering. The
increasingly large size of simulation models poses a serious computational
challenge, since the time needed to run a simulation can be prohibitively
large. For this reason, Parallel and Distributed Simulation techniques have
been proposed to take advantage of multiple execution units which are found in
multicore processors, clusters of workstations or HPC systems. The current
generation of HPC systems includes hundreds of thousands of computing nodes and
a vast amount of ancillary components. Despite improvements in manufacturing
processes, failures of some components are frequent, and the situation will get
worse as larger systems are built. In this paper we describe FT-GAIA, a
software-based fault-tolerant extension of the GAIA/ARTÌS parallel simulation
middleware. FT-GAIA transparently replicates simulation entities and
distributes them on multiple execution nodes. This allows the simulation to
tolerate crash-failures of computing nodes; furthermore, FT-GAIA offers some
protection against Byzantine failures since synchronization messages are
replicated as well, so that the receiving entity can identify and discard
corrupted messages. We provide an experimental evaluation of FT-GAIA on a
running prototype. Results show that a high degree of fault tolerance can be
achieved, at the cost of a moderate increase in the computational load of the
execution units.
Comment: Proceedings of the IEEE/ACM International Symposium on Distributed
Simulation and Real Time Applications (DS-RT 2016)
Distributed simulation and the grid: Position statements
The Grid provides a new and unrivaled technology for large-scale distributed simulation, as it enables collaboration and the use of distributed computing resources. This panel paper presents the views of four researchers in the area of Distributed Simulation and the Grid. Together we try to identify the main research issues involved in applying Grid technology to distributed simulation and the key future challenges that need to be solved to achieve this goal. These challenges are not only technical but also political, such as management methodology for the Grid and the development of standards. The benefits of the Grid to end-user simulation modelers are also discussed.
Cache Equalizer: A Cache Pressure Aware Block Placement Scheme for Large-Scale Chip Multiprocessors
This paper describes Cache Equalizer (CE), a novel distributed cache management scheme for large-scale chip multiprocessors (CMPs). Our work is motivated by the large asymmetry in cache set usage. CE decouples the physical locations of cache blocks from their addresses in order to reduce misses caused by destructive interference. Temporal pressure at the on-chip last-level cache is continuously collected at the granularity of a group (a collection of cache sets) and periodically recorded at the memory controller to guide the placement process. An incoming block is then placed in the cache group that exhibits the minimum pressure. CE provides Quality of Service (QoS) by robustly offering better performance than the baseline shared NUCA cache. Simulation results using a full-system simulator demonstrate that CE outperforms shared NUCA caches by an average of 15.5% and by as much as 28.5% for the benchmark programs we examined. Furthermore, the evaluations show that CE outperforms related CMP cache designs.
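The placement rule described above (an incoming block goes to the group with the minimum recorded pressure) can be sketched in a few lines. This is only an illustration under stated assumptions: pressure is modeled as one integer per group, ties go to the lowest-numbered group, and the name `pick_group` is hypothetical. The abstract does not specify how pressure is measured or how ties are broken.

```python
from typing import Sequence


def pick_group(pressure: Sequence[int]) -> int:
    """Return the index of the cache group with the lowest recorded pressure.

    `pressure[i]` is an assumed per-group pressure reading (for example, a
    miss count over the last sampling period). Ties resolve to the lowest
    index because min() keeps the first minimum it encounters.
    """
    if not pressure:
        raise ValueError("at least one cache group is required")
    return min(range(len(pressure)), key=pressure.__getitem__)


if __name__ == "__main__":
    # Four groups; groups 1 and 3 share the lowest pressure.
    print(pick_group([7, 3, 9, 3]))
```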
TOFEC: Achieving Optimal Throughput-Delay Trade-off of Cloud Storage Using Erasure Codes
Our paper presents solutions using erasure coding, parallel connections to
storage cloud and limited chunking (i.e., dividing the object into a few
smaller segments) together to significantly improve the delay performance of
uploading and downloading data in and out of cloud storage.
TOFEC is a strategy that helps a front-end proxy adapt to the level of workload by
treating scalable cloud storage (e.g. Amazon S3) as a shared resource requiring
admission control. Under light workloads, TOFEC creates more, smaller chunks and
uses more parallel connections per file, minimizing service delay. Under heavy
workloads, TOFEC automatically reduces the level of chunking (fewer chunks with
increased size) and uses fewer parallel connections to reduce overhead,
resulting in higher throughput and preventing queueing delay. Our trace-driven
simulation results show that TOFEC's adaptation mechanism converges to an
appropriate code that provides the optimal delay-throughput trade-off without
reducing system capacity. Compared to a non-adaptive strategy optimized for
throughput, TOFEC delivers 2.5x lower latency under light workloads; compared
to a non-adaptive strategy optimized for latency, TOFEC can scale to support
over 3x as many requests.
DOH: A Content Delivery Peer-to-Peer Network
Many SMEs and non-profit organizations suffer when their Web
servers become unavailable due to flash crowd effects when their web site
becomes popular. One of the solutions to the flash-crowd problem is to place
the web site on a scalable CDN (Content Delivery Network) that replicates
the content and distributes the load in order to improve its response time.
In this paper, we present our approach to building a scalable Web Hosting
environment as a CDN on top of a structured peer-to-peer system of collaborative
web-servers integrated to share the load and to improve the overall
system performance, scalability, availability and robustness. Unlike cluster-based
solutions, it can run on heterogeneous hardware, over geographically
dispersed areas. To validate and evaluate our approach, we have developed a
system prototype called DOH (DKS Organized Hosting) that is a CDN implemented
on top of the DKS (Distributed K-nary Search) structured P2P
system with DHT (Distributed Hash Table) functionality [9]. The prototype
is implemented in Java, using the DKS middleware, the Jetty web-server, and
a modified JavaFTP server. The proposed design of CDN has been evaluated
by simulation and by evaluation experiments on the prototype.
Broadcasting in Prefix Space: P2P Data Dissemination with Predictable Performance
A broadcast mode may augment peer-to-peer overlay networks with an efficient,
scalable data replication function, but may also give rise to a virtual link
layer in VPN-type solutions. We introduce a simple broadcasting mechanism that
operates in the prefix space of distributed hash tables without signaling. This
paper concentrates on the performance analysis of the prefix flooding scheme.
Starting from simple models of recursive -ary trees, we analytically derive
distributions of hop counts and the replication load. Extensive simulation
results are presented further on, based on an implementation within the OverSim
framework. Comparisons are drawn to Scribe, taken as a general reference model
for group communication according to the shared, rendezvous-point-centered
distribution paradigm. The prefix flooding scheme thereby confirmed its widely
predictable performance and consistently outperformed Scribe in all metrics.
Reverse path selection in overlays is identified as a major cause of
performance degradation.
Comment: final version for ICIW'0
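To illustrate how broadcasting in a prefix space can avoid signaling, the sketch below floods a message over a fully populated binary ID space. It is a simplified model under stated assumptions (binary digits, every ID occupied, one representative per sibling subtree chosen by flipping a single bit) and does not reproduce the scheme analyzed in the paper. The function name `prefix_broadcast` is hypothetical.

```python
from typing import Dict, List, Tuple


def prefix_broadcast(source: int, bits: int) -> Dict[int, int]:
    """Flood from `source` over all 2**bits IDs and return node -> hop count.

    A node that is responsible for the subtree sharing its first `level`
    bits forwards one copy into each sibling subtree at longer prefixes.
    The receiver of the copy sent at bit position i then becomes
    responsible for the subtree sharing its first i + 1 bits.
    """
    hops: Dict[int, int] = {source: 0}
    pending: List[Tuple[int, int]] = [(source, 0)]  # (node, covered prefix length)
    while pending:
        node, level = pending.pop()
        for i in range(level, bits):
            # Flip bit i, counted from the most significant bit, to reach
            # the sibling subtree that differs from `node` at that position.
            peer = node ^ (1 << (bits - 1 - i))
            hops[peer] = hops[node] + 1
            pending.append((peer, i + 1))
    return hops


if __name__ == "__main__":
    reached = prefix_broadcast(source=0b101, bits=3)
    print(len(reached), max(reached.values()))
```

In this model each node is reached along exactly one path, so a broadcast sends 2**bits - 1 messages in total, and the hop count of a node equals the number of bit positions in which its ID differs from the source.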
Distributed Selfish Caching
Although cooperation generally increases the amount of resources available to a community of nodes, thus improving individual and collective performance, it also allows for the appearance of potential mistreatment problems through the exposure of one node's resources to others. We study such concerns by considering a group of independent, rational, self-aware nodes that cooperate using on-line caching algorithms, where the exposed resource is the storage at each node. Motivated by content networking applications -- including web caching, CDNs, and P2P -- this paper extends our previous work on the off-line version of the problem, which was conducted under a game-theoretic framework and limited to object replication. We identify and investigate two causes of mistreatment: (1) cache state interactions (due to the cooperative servicing of requests) and (2) the adoption of a common scheme for cache management policies. Using analytic models, numerical solutions of these models, and simulation experiments, we show that on-line cooperation schemes using caching are fairly robust to mistreatment caused by state interactions. To appear in a substantial manner, the interaction through the exchange of miss-streams has to be very intense, making it feasible for the mistreated nodes to detect and react to exploitation. This robustness ceases to exist when nodes fetch and store objects in response to remote requests, i.e., when they operate as Level-2 caches (or proxies) for other nodes. Regarding mistreatment due to a common scheme, we show that this can easily take place when the "outlier" characteristics of some of the nodes get overlooked. This finding underscores the importance of allowing cooperative caching nodes the flexibility of choosing from a diverse set of schemes to fit the peculiarities of individual nodes. To that end, we outline an emulation-based framework for the development of mistreatment-resilient distributed selfish caching schemes.
Our framework utilizes a simple control-theoretic approach to dynamically parameterize the cache management scheme. We show performance evaluation results that quantify the benefits from instantiating such a framework, which could be substantial under skewed demand profiles.
National Science Foundation (CNS Cybertrust 0524477, CNS NeTS 0520166, CNS ITR 0205294, EIA RI 0202067); EU IST (CASCADAS and E-NEXT); Marie Curie Outgoing International Fellowship of the EU (MOIF-CT-2005-007230)