41,048 research outputs found
Stream Distributed Coded Computing
The emerging large-scale and data-hungry algorithms require the computations
to be delegated from a central server to several worker nodes. One major
challenge in the distributed computations is to tackle delays and failures
caused by the stragglers. To address this challenge, introducing efficient
amount of redundant computations via distributed coded computation has received
significant attention. Recent approaches in this area have mainly focused on
introducing minimum computational redundancies to tolerate certain number of
stragglers. To the best of our knowledge, the current literature lacks a
unified end-to-end design in a heterogeneous setting where the workers can vary
in their computation and communication capabilities. The contribution of this
paper is to devise a novel framework for joint scheduling-coding, in a setting
where the workers and the arrival of stream computational jobs are based on
stochastic models. In our initial joint scheme, we propose a systematic
framework that illustrates how to select a set of workers and how to split the
computational load among the selected workers based on their differences in
order to minimize the average in-order job execution delay. Through
simulations, we demonstrate that the performance of our framework is
dramatically better than the performance of naive method that splits the
computational load uniformly among the workers, and it is close to the ideal
performance
Improving Performance of Iterative Methods by Lossy Checkponting
Iterative methods are commonly used approaches to solve large, sparse linear
systems, which are fundamental operations for many modern scientific
simulations. When the large-scale iterative methods are running with a large
number of ranks in parallel, they have to checkpoint the dynamic variables
periodically in case of unavoidable fail-stop errors, requiring fast I/O
systems and large storage space. To this end, significantly reducing the
checkpointing overhead is critical to improving the overall performance of
iterative methods. Our contribution is fourfold. (1) We propose a novel lossy
checkpointing scheme that can significantly improve the checkpointing
performance of iterative methods by leveraging lossy compressors. (2) We
formulate a lossy checkpointing performance model and derive theoretically an
upper bound for the extra number of iterations caused by the distortion of data
in lossy checkpoints, in order to guarantee the performance improvement under
the lossy checkpointing scheme. (3) We analyze the impact of lossy
checkpointing (i.e., extra number of iterations caused by lossy checkpointing
files) for multiple types of iterative methods. (4)We evaluate the lossy
checkpointing scheme with optimal checkpointing intervals on a high-performance
computing environment with 2,048 cores, using a well-known scientific
computation package PETSc and a state-of-the-art checkpoint/restart toolkit.
Experiments show that our optimized lossy checkpointing scheme can
significantly reduce the fault tolerance overhead for iterative methods by
23%~70% compared with traditional checkpointing and 20%~58% compared with
lossless-compressed checkpointing, in the presence of system failures.Comment: 14 pages, 10 figures, HPDC'1
A spatially distributed model for the dynamic prediction of sediment erosion and transport in mountainous forested watersheds
Erosion and sediment transport in a temperate forested watershed are predicted with a new sediment model that represents the main sources of sediment generation in forested environments (mass wasting, hillslope erosion, and road surface erosion) within the distributed hydrology-soil-vegetation model (DHSVM) environment. The model produces slope failures on the basis of a factor-of-safety analysis with the infinite slope model through use of stochastically generated soil and vegetation parameters. Failed material is routed downslope with a rule-based scheme that determines sediment delivery to streams. Sediment from hillslopes and road surfaces is also transported to the channel network. A simple channel routing scheme is implemented to predict basin sediment yield. We demonstrate through an initial application of this model to the Rainy Creek catchment, a tributary of the Wenatchee River, which drains the east slopes of the Cascade Mountains, that the model produces plausible sediment yield and ratios of landsliding and surface erosion when compared to published rates for similar catchments in the Pacific Northwest. A road removal scenario and a basin-wide fire scenario are both evaluated with the model
Evaluation of fault-tolerant parallel-processor architectures over long space missions
The impact of a five year space mission environment on fault-tolerant parallel processor architectures is examined. The target application is a Strategic Defense Initiative (SDI) satellite requiring 256 parallel processors to provide the computation throughput. The reliability requirements are that the system still be operational after five years with .99 probability and that the probability of system failure during one-half hour of full operation be less than 10(-7). The fault tolerance features an architecture must possess to meet these reliability requirements are presented, many potential architectures are briefly evaluated, and one candidate architecture, the Charles Stark Draper Laboratory's Fault-Tolerant Parallel Processor (FTPP) is evaluated in detail. A methodology for designing a preliminary system configuration to meet the reliability and performance requirements of the mission is then presented and demonstrated by designing an FTPP configuration
Genet: A Quickly Scalable Fat-Tree Overlay for Personal Volunteer Computing using WebRTC
WebRTC enables browsers to exchange data directly but the number of possible
concurrent connections to a single source is limited. We overcome the
limitation by organizing participants in a fat-tree overlay: when the maximum
number of connections of a tree node is reached, the new participants connect
to the node's children. Our design quickly scales when a large number of
participants join in a short amount of time, by relying on a novel scheme that
only requires local information to route connection messages: the destination
is derived from the hash value of the combined identifiers of the message's
source and of the node that is holding the message. The scheme provides
deterministic routing of a sequence of connection messages from a single source
and probabilistic balancing of newer connections among the leaves. We show that
this design puts at least 83% of nodes at the same depth as a deterministic
algorithm, can connect a thousand browser windows in 21-55 seconds in a local
network, and can be deployed for volunteer computing to tap into 320 cores in
less than 30 seconds on a local network to increase the total throughput on the
Collatz application by two orders of magnitude compared to a single core
Significance Driven Hybrid 8T-6T SRAM for Energy-Efficient Synaptic Storage in Artificial Neural Networks
Multilayered artificial neural networks (ANN) have found widespread utility
in classification and recognition applications. The scale and complexity of
such networks together with the inadequacies of general purpose computing
platforms have led to a significant interest in the development of efficient
hardware implementations. In this work, we focus on designing energy efficient
on-chip storage for the synaptic weights. In order to minimize the power
consumption of typical digital CMOS implementations of such large-scale
networks, the digital neurons could be operated reliably at scaled voltages by
reducing the clock frequency. On the contrary, the on-chip synaptic storage
designed using a conventional 6T SRAM is susceptible to bitcell failures at
reduced voltages. However, the intrinsic error resiliency of NNs to small
synaptic weight perturbations enables us to scale the operating voltage of the
6TSRAM. Our analysis on a widely used digit recognition dataset indicates that
the voltage can be scaled by 200mV from the nominal operating voltage (950mV)
for practically no loss (less than 0.5%) in accuracy (22nm predictive
technology). Scaling beyond that causes substantial performance degradation
owing to increased probability of failures in the MSBs of the synaptic weights.
We, therefore propose a significance driven hybrid 8T-6T SRAM, wherein the
sensitive MSBs are stored in 8T bitcells that are robust at scaled voltages due
to decoupled read and write paths. In an effort to further minimize the area
penalty, we present a synaptic-sensitivity driven hybrid memory architecture
consisting of multiple 8T-6T SRAM banks. Our circuit to system-level simulation
framework shows that the proposed synaptic-sensitivity driven architecture
provides a 30.91% reduction in the memory access power with a 10.41% area
overhead, for less than 1% loss in the classification accuracy.Comment: Accepted in Design, Automation and Test in Europe 2016 conference
(DATE-2016
- …