2,599 research outputs found

    Dynamic Parameter Allocation in Parameter Servers

    Full text link
    To keep up with increasing dataset sizes and model complexity, distributed training has become a necessity for large machine learning tasks. Parameter servers ease the implementation of distributed parameter management---a key concern in distributed training---, but can induce severe communication overhead. To reduce communication overhead, distributed machine learning algorithms use techniques to increase parameter access locality (PAL), achieving up to linear speed-ups. We found that existing parameter servers provide only limited support for PAL techniques, however, and therefore prevent efficient training. In this paper, we explore whether and to what extent PAL techniques can be supported, and whether such support is beneficial. We propose to integrate dynamic parameter allocation into parameter servers, describe an efficient implementation of such a parameter server called Lapse, and experimentally compare its performance to existing parameter servers across a number of machine learning tasks. We found that Lapse provides near-linear scaling and can be orders of magnitude faster than existing parameter servers

    Parallel Algorithms for Constrained Tensor Factorization via the Alternating Direction Method of Multipliers

    Full text link
    Tensor factorization has proven useful in a wide range of applications, from sensor array processing to communications, speech and audio signal processing, and machine learning. With few recent exceptions, all tensor factorization algorithms were originally developed for centralized, in-memory computation on a single machine; and the few that break away from this mold do not easily incorporate practically important constraints, such as nonnegativity. A new constrained tensor factorization framework is proposed in this paper, building upon the Alternating Direction method of Multipliers (ADMoM). It is shown that this simplifies computations, bypassing the need to solve constrained optimization problems in each iteration; and it naturally leads to distributed algorithms suitable for parallel implementation on regular high-performance computing (e.g., mesh) architectures. This opens the door for many emerging big data-enabled applications. The methodology is exemplified using nonnegativity as a baseline constraint, but the proposed framework can more-or-less readily incorporate many other types of constraints. Numerical experiments are very encouraging, indicating that the ADMoM-based nonnegative tensor factorization (NTF) has high potential as an alternative to state-of-the-art approaches.Comment: Submitted to the IEEE Transactions on Signal Processin

    A Parallel Solver for Graph Laplacians

    Full text link
    Problems from graph drawing, spectral clustering, network flow and graph partitioning can all be expressed in terms of graph Laplacian matrices. There are a variety of practical approaches to solving these problems in serial. However, as problem sizes increase and single core speeds stagnate, parallelism is essential to solve such problems quickly. We present an unsmoothed aggregation multigrid method for solving graph Laplacians in a distributed memory setting. We introduce new parallel aggregation and low degree elimination algorithms targeted specifically at irregular degree graphs. These algorithms are expressed in terms of sparse matrix-vector products using generalized sum and product operations. This formulation is amenable to linear algebra using arbitrary distributions and allows us to operate on a 2D sparse matrix distribution, which is necessary for parallel scalability. Our solver outperforms the natural parallel extension of the current state of the art in an algorithmic comparison. We demonstrate scalability to 576 processes and graphs with up to 1.7 billion edges.Comment: PASC '18, Code: https://github.com/ligmg/ligm

    A Shift Selection Strategy for Parallel Shift-invert Spectrum Slicing in Symmetric Self-consistent Eigenvalue Computation

    Get PDF
    © 2020 ACM. The central importance of large-scale eigenvalue problems in scientific computation necessitates the development of massively parallel algorithms for their solution. Recent advances in dense numerical linear algebra have enabled the routine treatment of eigenvalue problems with dimensions on the order of hundreds of thousands on the world's largest supercomputers. In cases where dense treatments are not feasible, Krylov subspace methods offer an attractive alternative due to the fact that they do not require storage of the problem matrices. However, demonstration of scalability of either of these classes of eigenvalue algorithms on computing architectures capable of expressing massive parallelism is non-trivial due to communication requirements and serial bottlenecks, respectively. In this work, we introduce the SISLICE method: a parallel shift-invert algorithm for the solution of the symmetric self-consistent field (SCF) eigenvalue problem. The SISLICE method drastically reduces the communication requirement of current parallel shift-invert eigenvalue algorithms through various shift selection and migration techniques based on density of states estimation and k-means clustering, respectively. This work demonstrates the robustness and parallel performance of the SISLICE method on a representative set of SCF eigenvalue problems and outlines research directions that will be explored in future work
    • …
    corecore