Parallelized preconditioned model building algorithm for matrix factorization
Matrix factorization is a common task underlying several machine learning applications such as recommender systems, topic modeling, and compressed sensing. Given a large and possibly sparse matrix A, we seek two smaller matrices W and H such that their product is as close to A as possible. The objective is to minimize the sum of squared errors in the approximation. Typically such problems involve hundreds of thousands of unknowns, so an optimizer must be exceptionally efficient. In this study, a new algorithm, Preconditioned Model Building (PMB), is adapted to factorize matrices composed of movie ratings in the MovieLens data sets with 1, 10, and 20 million entries. We present experiments that compare the sequential MATLAB implementation of the PMB algorithm with other algorithms in the minFunc package. We also employ a lock-free sparse matrix factorization algorithm and provide a scalable shared-memory parallel implementation. We show that (a) the optimization performance of the PMB algorithm is comparable to the best algorithms in common use, and (b) the computational performance can be significantly increased with parallelization.
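To make the objective concrete: a minimal pure-Python sketch of rank-constrained factorization by stochastic gradient descent, minimizing the sum of squared errors between A and W·H. This is only an illustration of the problem the abstract states, not the PMB algorithm itself; the rank, learning rate, and iteration count are arbitrary choices.

```python
# Toy low-rank factorization A ~ W @ H by gradient descent on the
# squared error. Illustrative only; not the PMB method from the paper.
import random

def factorize(A, rank=2, lr=0.01, steps=2000, seed=0):
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(m)]
    H = [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(rank)]
    for _ in range(steps):
        for i in range(m):
            for j in range(n):
                pred = sum(W[i][k] * H[k][j] for k in range(rank))
                err = A[i][j] - pred
                for k in range(rank):
                    w, h = W[i][k], H[k][j]
                    W[i][k] += lr * err * h  # gradient step on W entry
                    H[k][j] += lr * err * w  # gradient step on H entry
    return W, H

def sse(A, W, H):
    """Sum of squared errors of the approximation W @ H against A."""
    rank = len(H)
    return sum((A[i][j] - sum(W[i][k] * H[k][j] for k in range(rank))) ** 2
               for i in range(len(A)) for j in range(len(A[0])))
```

For a ratings matrix that actually is low-rank, the residual drives toward zero; on real MovieLens-scale data one would instead sweep only the observed entries, which is what makes lock-free parallel updates attractive.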
Efficient parallelization strategy for real-time FE simulations
This paper introduces an efficient and generic framework for finite-element simulations under an implicit time integration scheme. Being compatible with generic constitutive models, a fast matrix assembly method exploits the fact that system matrices are created in a deterministic way as long as the mesh topology remains constant. Using the sparsity pattern of the assembled system brings significant optimizations to the assembly stage. As a result, GPU-based parallelization techniques can be applied directly to the assembled system. Moreover, an asynchronous Cholesky preconditioning scheme is used to improve the convergence of the system solver. On this basis, a GPU-based Cholesky preconditioner is developed, significantly reducing data transfer between the CPU and GPU during the solving stage. We evaluate the performance of our method with different mesh elements and hyperelastic models and compare it with typical approaches on the CPU and the GPU.
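The fixed-topology assembly idea can be sketched as follows: because the sparsity pattern never changes, the map from each element's local stiffness entry to its slot in the global value array is built once, and every subsequent reassembly is a plain scatter-add. This is a hypothetical minimal illustration, not the paper's GPU implementation; the element matrices below are placeholders for a 1D two-node element.

```python
# Sketch: precompute the local-to-global entry map once (constant mesh
# topology), then reassemble by scatter-adding into a fixed value array.

def build_pattern(elements):
    """One-time pass: enumerate global (row, col) pairs and index them."""
    pairs = sorted({(r, c) for elem in elements for r in elem for c in elem})
    slot = {p: k for k, p in enumerate(pairs)}
    # Per-element flat list of slots, in local (i, j) order.
    elem_slots = [[slot[(r, c)] for r in elem for c in elem]
                  for elem in elements]
    return pairs, elem_slots

def assemble(elem_slots, elem_matrices, n_entries):
    """Fast path: scatter-add element matrices into the fixed value array."""
    values = [0.0] * n_entries
    for slots, ke in zip(elem_slots, elem_matrices):
        for s, v in zip(slots, ke):
            values[s] += v
    return values
```

On a GPU, `assemble` is what parallelizes well: the index arithmetic is precomputed, so each thread only performs atomic adds into known slots of the value array.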
Domain decomposition methods for the parallel computation of reacting flows
Domain decomposition is a natural route to parallel computing for partial differential equation solvers. Subdomains of which the original domain of definition is comprised are assigned to independent processors, at the price of periodic coordination between processors to compute global parameters and maintain the requisite degree of continuity of the solution at the subdomain interfaces. In the domain-decomposed solution of steady multidimensional systems of PDEs by finite difference methods using a pseudo-transient version of Newton iteration, the only portion of the computation that generally stands in the way of efficient parallelization is the solution of the large, sparse linear systems arising at each Newton step. For some Jacobian matrices drawn from an actual two-dimensional reacting flow problem, comparisons are made between relaxation-based linear solvers and preconditioned iterative methods of conjugate gradient and Chebyshev type, focusing attention on both iteration count and global inner product count. The generalized minimum residual method (GMRES) with block-ILU preconditioning is judged the best serial method among those considered, and parallel numerical experiments on the Encore Multimax demonstrate approximately 10-fold speedup for it on 16 processors.
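The subdomain idea can be illustrated with a toy block-Jacobi preconditioner inside conjugate gradients: each non-overlapping subdomain block of the matrix is solved exactly when the preconditioner is applied. This is a simplified stand-in for the block-ILU/GMRES setup the abstract describes, with a 1D Laplacian and arbitrary subdomain sizes; it shows the structure, not the paper's solver.

```python
# Toy domain-decomposition-style preconditioning: block-Jacobi (exact
# per-subdomain solves) inside preconditioned conjugate gradients.

def gauss_solve(A, b):
    """Dense Gaussian elimination with partial pivoting, for small blocks."""
    n = len(b)
    A = [row[:] for row in A]
    b = b[:]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            f = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= f * A[k][j]
            b[i] -= f * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
    return x

def pcg(A, b, blocks, tol=1e-10, maxit=200):
    """Preconditioned CG; M^-1 solves each diagonal subdomain block exactly."""
    n = len(b)

    def apply_Minv(r):
        z = [0.0] * n
        for lo, hi in blocks:  # each (lo, hi) is one subdomain's index range
            sub = [[A[i][j] for j in range(lo, hi)] for i in range(lo, hi)]
            z[lo:hi] = gauss_solve(sub, r[lo:hi])
        return z

    x = [0.0] * n
    r = b[:]
    z = apply_Minv(r)
    p = z[:]
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for it in range(maxit):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            break
        z = apply_Minv(r)
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, it + 1
```

In a parallel setting, each subdomain solve is independent; the global couplings that the abstract flags as the coordination cost are the matrix-vector product and the inner products inside the iteration.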