    Sparse Allreduce: Efficient Scalable Communication for Power-Law Data

    Many large datasets exhibit power-law statistics: the web graph, social networks, text data, click-through data, etc. Their adjacency graphs are termed natural graphs, and are known to be difficult to partition. As a consequence, most distributed algorithms on these graphs are communication-intensive. Many algorithms on natural graphs involve an Allreduce: a sum or average of partitioned data which is then shared back to the cluster nodes. Examples include PageRank, spectral partitioning, and many machine learning algorithms, including regression, factor (topic) models, and clustering. In this paper we describe an efficient and scalable Allreduce primitive for power-law data. We point out scaling problems with existing butterfly and round-robin networks for Sparse Allreduce, and show that a hybrid approach improves on both. Furthermore, we show that Sparse Allreduce stages should be nested instead of cascaded (as in the dense case), and that the optimum-throughput Allreduce network is a butterfly of heterogeneous degree, where degree decreases with depth into the network. Finally, a simple replication scheme is introduced to deal with node failures. We present experiments showing significant improvements over existing systems such as PowerGraph and Hadoop.
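
    A minimal sketch of the operation itself, for concreteness: each node holds a sparse partial vector (index -> value), the Allreduce sums the partials, and every node receives the summed result. This illustrates only the primitive; it is not the paper's hybrid butterfly/round-robin network or its heterogeneous-degree optimization, and the Python below is purely illustrative.

        from collections import defaultdict

        def sparse_allreduce(partials):
            """partials: one dict per node, mapping index -> partial value."""
            total = defaultdict(float)
            for part in partials:                    # reduce: sum contributions per index
                for idx, val in part.items():
                    total[idx] += val
            return [dict(total) for _ in partials]   # share the sum back to every node

        # Example: three nodes holding sparse partial contributions
        # (values chosen to be exact in binary floating point).
        node_partials = [
            {0: 0.25, 3: 0.125},
            {0: 0.25, 2: 0.5},
            {2: 0.125, 3: 0.25},
        ]
        reduced = sparse_allreduce(node_partials)
        assert reduced[0] == {0: 0.5, 3: 0.375, 2: 0.625}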

    Exploiting Data Representation for Fault Tolerance

    We explore the link between data representation and soft errors in dot products. We present an analytic model for the absolute error introduced should a soft error corrupt a bit in an IEEE-754 floating-point number. We show how this finding relates to the fundamental linear algebra concepts of normalization and matrix equilibration. We present a case study illustrating that the probability of experiencing a large error in a dot product is minimized when both vectors are normalized. Furthermore, when data is normalized, we show that the absolute error is either less than one or very large, which allows us to detect large errors. We demonstrate how this finding can be used by instrumenting the GMRES iterative solver. We count all possible errors that can be introduced through faults in arithmetic in the computationally intensive orthogonalization phase, and show that when scaling is used the absolute error can be bounded above by one.
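
    A hedged sketch of the kind of experiment the abstract describes: flip one bit of an IEEE-754 double that participates in a dot product and measure the absolute error in the result, with both vectors normalized. The vector size, random seed, and bit position below are illustrative choices, not the paper's setup or its analytic model.

        import struct
        import numpy as np

        def flip_bit(x, bit):
            """Flip bit `bit` (0 = least significant) of a 64-bit IEEE-754 double."""
            (as_int,) = struct.unpack('<Q', struct.pack('<d', x))
            (corrupted,) = struct.unpack('<d', struct.pack('<Q', as_int ^ (1 << bit)))
            return corrupted

        rng = np.random.default_rng(0)
        x = rng.standard_normal(100)
        y = rng.standard_normal(100)
        x /= np.linalg.norm(x)              # normalize both vectors, so |x_i|, |y_i| <= 1
        y /= np.linalg.norm(y)

        clean = np.dot(x, y)
        x_bad = x.copy()
        x_bad[0] = flip_bit(x[0], 62)       # flip a high exponent bit of one entry
        print("clean:", clean,
              "corrupted:", np.dot(x_bad, y),
              "abs error:", abs(np.dot(x_bad, y) - clean))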

    A Unified Coded Deep Neural Network Training Strategy Based on Generalized PolyDot Codes for Matrix Multiplication

    This paper has two contributions. First, we propose a novel coded matrix multiplication technique called Generalized PolyDot codes that advances on existing methods for coded matrix multiplication under storage and communication constraints. This technique uses "garbage alignment," i.e., aligning computations in coded computing that are not a part of the desired output. Generalized PolyDot codes bridge between Polynomial codes and MatDot codes, trading off between recovery threshold and communication costs. Second, we demonstrate that Generalized PolyDot can be used for training large Deep Neural Networks (DNNs) on unreliable nodes prone to soft errors. This requires us to address three additional challenges: (i) the prohibitively large overhead of coding the weight matrices in each layer of the DNN at each iteration; (ii) nonlinear operations during training, which are incompatible with linear coding; and (iii) not assuming the presence of an error-free master node, requiring us to architect a fully decentralized implementation without any "single point of failure." We allow all primary DNN training steps, namely matrix multiplication, nonlinear activation, Hadamard product, and update steps, as well as the encoding/decoding, to be error-prone. We consider the case of mini-batch size B=1 as well as B>1, leveraging coded matrix-vector and matrix-matrix products, respectively. The problem of DNN training under soft errors also motivates an interesting probabilistic error model under which a real-number (P,Q) MDS code is shown to correct P-Q-1 errors with probability 1, as compared to ⌊(P-Q)/2⌋ for the more conventional, adversarial error model. We also demonstrate that our proposed strategy can provide unbounded gains in error tolerance over a competing replication strategy and a preliminary MDS-code-based strategy for both these error models.
    Comment: Presented in part at the IEEE International Symposium on Information Theory 2018 (submission date: Jan 12, 2018); currently under review at the IEEE Transactions on Information Theory.
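
    For intuition, here is a minimal polynomial-coded matrix-vector product in the spirit of the Polynomial/MatDot codes that the abstract bridges between, worked over the reals with a fault-free decoder for simplicity. It is not the Generalized PolyDot construction or the decentralized DNN training strategy itself; the block counts, evaluation points, and matrices are illustrative assumptions.

        import numpy as np

        def encoded_worker_results(A, x, m, eval_points):
            """Split A row-wise into m blocks A_0..A_{m-1}; for each evaluation point z,
            return the product (sum_i A_i * z**i) @ x that the corresponding worker computes."""
            blocks = np.split(A, m, axis=0)
            return [sum(Ai * z**i for i, Ai in enumerate(blocks)) @ x for z in eval_points]

        def decode(results, eval_points, survivors, m):
            """Recover A @ x from any m surviving worker results by polynomial interpolation."""
            zs = np.array([eval_points[j] for j in survivors[:m]])
            Y = np.stack([results[j] for j in survivors[:m]])        # m x (rows per block)
            C = np.linalg.solve(np.vander(zs, m, increasing=True), Y)
            return np.concatenate(C)                                 # row i of C is A_i @ x

        m, P = 3, 5                                    # any 3 of 5 workers suffice
        A, x = np.arange(36.0).reshape(6, 6), np.ones(6)
        zs = np.linspace(1.0, 2.0, P)                  # distinct evaluation points
        results = encoded_worker_results(A, x, m, zs)
        recovered = decode(results, zs, survivors=[0, 2, 4], m=m)    # two workers "failed"
        assert np.allclose(recovered, A @ x)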

    Improving Performance of Iterative Methods by Lossy Checkpointing

    Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fundamental operations in many modern scientific simulations. When large-scale iterative methods are run with a large number of ranks in parallel, they have to checkpoint their dynamic variables periodically to guard against unavoidable fail-stop errors, requiring fast I/O systems and large storage space. To this end, significantly reducing the checkpointing overhead is critical to improving the overall performance of iterative methods. Our contribution is fourfold. (1) We propose a novel lossy checkpointing scheme that can significantly improve the checkpointing performance of iterative methods by leveraging lossy compressors. (2) We formulate a lossy checkpointing performance model and theoretically derive an upper bound on the extra number of iterations caused by the distortion of data in lossy checkpoints, in order to guarantee the performance improvement under the lossy checkpointing scheme. (3) We analyze the impact of lossy checkpointing (i.e., the extra iterations caused by lossy checkpoint files) for multiple types of iterative methods. (4) We evaluate the lossy checkpointing scheme with optimal checkpointing intervals on a high-performance computing environment with 2,048 cores, using the well-known scientific computation package PETSc and a state-of-the-art checkpoint/restart toolkit. Experiments show that our optimized lossy checkpointing scheme can significantly reduce the fault tolerance overhead for iterative methods by 23%-70% compared with traditional checkpointing and 20%-58% compared with lossless-compressed checkpointing, in the presence of system failures.
    Comment: 14 pages, 10 figures, HPDC'18.
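
    A toy illustration of the trade-off the abstract studies: checkpoint the dynamic solution vector of an iterative solver through a lossy round-trip (a crude float16 cast standing in for a real lossy compressor such as SZ or ZFP), restart from the distorted checkpoint after a simulated fail-stop failure, and count the extra iterations needed to reach the same tolerance. The Jacobi solver, matrix, and "compressor" here are illustrative assumptions, not the paper's scheme, performance model, or PETSc setup.

        import numpy as np

        def jacobi(A, b, x0, tol=1e-8, max_iter=10000):
            """Plain Jacobi iteration; returns the solution and the iteration count."""
            D = np.diag(A)
            R = A - np.diag(D)
            x, it = x0.copy(), 0
            while np.linalg.norm(A @ x - b) > tol and it < max_iter:
                x = (b - R @ x) / D
                it += 1
            return x, it

        n = 50
        A = np.eye(n) * 4 + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
        b = np.ones(n)

        # Baseline run with no failure.
        _, base_iters = jacobi(A, b, np.zeros(n))

        # Run 200 iterations, take a lossy checkpoint, then "fail" and restart from it.
        x_partial, _ = jacobi(A, b, np.zeros(n), tol=0.0, max_iter=200)
        lossy_ckpt = x_partial.astype(np.float16).astype(np.float64)   # lossy round-trip
        _, restart_iters = jacobi(A, b, lossy_ckpt)
        print("baseline iterations:", base_iters,
              "iterations including lossy restart:", 200 + restart_iters)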