7,303 research outputs found

    Reproducibility, accuracy and performance of the Feltor code and library on parallel computer architectures

    Get PDF
    Feltor is a modular and free scientific software package. It allows developing platform independent code that runs on a variety of parallel computer architectures ranging from laptop CPUs to multi-GPU distributed memory systems. Feltor consists of both a numerical library and a collection of application codes built on top of the library. Its main target are two- and three-dimensional drift- and gyro-fluid simulations with discontinuous Galerkin methods as the main numerical discretization technique. We observe that numerical simulations of a recently developed gyro-fluid model produce non-deterministic results in parallel computations. First, we show how we restore accuracy and bitwise reproducibility algorithmically and programmatically. In particular, we adopt an implementation of the exactly rounded dot product based on long accumulators, which avoids accuracy losses especially in parallel applications. However, reproducibility and accuracy alone fail to indicate correct simulation behaviour. In fact, in the physical model slightly different initial conditions lead to vastly different end states. This behaviour translates to its numerical representation. Pointwise convergence, even in principle, becomes impossible for long simulation times. In a second part, we explore important performance tuning considerations. We identify latency and memory bandwidth as the main performance indicators of our routines. Based on these, we propose a parallel performance model that predicts the execution time of algorithms implemented in Feltor and test our model on a selection of parallel hardware architectures. We are able to predict the execution time with a relative error of less than 25% for problem sizes between 0.1 and 1000 MB. Finally, we find that the product of latency and bandwidth gives a minimum array size per compute node to achieve a scaling efficiency above 50% (both strong and weak)

    Optimization of micropillar sequences for fluid flow sculpting

    Get PDF
    Inertial fluid flow deformation around pillars in a microchannel is a new method for controlling fluid flow. Sequences of pillars have been shown to produce a rich phase space with a wide variety of flow transformations. Previous work has successfully demonstrated manual design of pillar sequences to achieve desired transformations of the flow cross-section, with experimental validation. However, such a method is not ideal for seeking out complex sculpted shapes as the search space quickly becomes too large for efficient manual discovery. We explore fast, automated optimization methods to solve this problem. We formulate the inertial flow physics in microchannels with different micropillar configurations as a set of state transition matrix operations. These state transition matrices are constructed from experimentally validated streamtraces. This facilitates modeling the effect of a sequence of micropillars as nested matrix-matrix products, which have very efficient numerical implementations. With this new forward model, arbitrary micropillar sequences can be rapidly simulated with various inlet configurations, allowing optimization routines quick access to a large search space. We integrate this framework with the genetic algorithm and showcase its applicability by designing micropillar sequences for various useful transformations. We computationally discover micropillar sequences for complex transformations that are substantially shorter than manually designed sequences. We also determine sequences for novel transformations that were difficult to manually design. Finally, we experimentally validate these computational designs by fabricating devices and comparing predictions with the results from confocal microscopy

    Minimizing Communication for Eigenproblems and the Singular Value Decomposition

    Full text link
    Algorithms have two costs: arithmetic and communication. The latter represents the cost of moving data, either between levels of a memory hierarchy, or between processors over a network. Communication often dominates arithmetic and represents a rapidly increasing proportion of the total cost, so we seek algorithms that minimize communication. In \cite{BDHS10} lower bounds were presented on the amount of communication required for essentially all O(n3)O(n^3)-like algorithms for linear algebra, including eigenvalue problems and the SVD. Conventional algorithms, including those currently implemented in (Sca)LAPACK, perform asymptotically more communication than these lower bounds require. In this paper we present parallel and sequential eigenvalue algorithms (for pencils, nonsymmetric matrices, and symmetric matrices) and SVD algorithms that do attain these lower bounds, and analyze their convergence and communication costs.Comment: 43 pages, 11 figure

    Reproducibility Strategies for Parallel Preconditioned Conjugate Gradient

    Get PDF
    The Preconditioned Conjugate Gradient method is often used in numerical simulations. While being widely used, the solver is also known for its lack of accuracy while computing the residual. In this article, we aim at a twofold goal: enhance the accuracy of the solver but also ensure its reproducibility in a message-passing implementation. We design and employ various strategies starting from the ExBLAS approach (through preserving every bit of information until final rounding) to its more lightweight performance-oriented variant (through expanding the intermediate precision). These algorithmic strategies are reinforced with programmability suggestions to assure deterministic executions. Finally, we verify these strategies on modern HPC systems: both versions deliver reproducible number of iterations, residuals, direct errors, and vector-solutions for the overhead of only 29 % (ExBLAS) and 4 % (lightweight) on 768 processes

    Reproducibility strategies for parallel preconditioned Conjugate Gradient

    Get PDF
    [EN] The Preconditioned Conjugate Gradient method is often used in numerical simulations. While being widely used, the solver is also known for its lack of accuracy while computing the residual. In this article, we aim at a twofold goal: enhance the accuracy of the solver but also ensure its reproducibility in a message-passing implementation. We design and employ various strategies starting from the ExBLAS approach (through preserving every bit of information until final rounding) to its more lightweight performance-oriented variant (through expanding the intermediate precision). These algorithmic strategies are reinforced with programmability suggestions to assure deterministic executions. Finally, we verify these strategies on modern HPC systems: both versions deliver reproducible number of iterations, residuals, direct errors, and vector-solutions for the overhead of only 29% (ExBLAS) and 4% (lightweight) on 768 processes.To begin with, we would like to thank the reviewers for their thorough reading of the article as well as their valuable comments and suggestions. This research was partially supported by the European Union's Horizon 2020 research, innovation programme under the Marie Sklodowska-Curie grant agreement via the Robust project No. 842528 as well as the Project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the H2020 EC RIA Programme; in particular, the author gratefully acknowledges the support of Vicenc Beltran and the computer resources and technical support provided by BSC. The researchers from Universitat Jaume I (UJI) and Universidad Politecnica de Valencia (UPV) were supported by MINECO, Spain project TIN2017-82972-R. Maria Barreda was also supported by the POSDOC-A/2017/11 project from the Universitat Jaume I, Spain.Iakymchuk, R.; Barreda, M.; Wiesenberger, M.; Aliaga, JI.; Quintana Ortí, ES. (2020). Reproducibility strategies for parallel preconditioned Conjugate Gradient. Journal of Computational and Applied Mathematics. 371:1-13. https://doi.org/10.1016/j.cam.2019.112697S113371Lawson, C. L., Hanson, R. J., Kincaid, D. R., & Krogh, F. T. (1979). Basic Linear Algebra Subprograms for Fortran Usage. ACM Transactions on Mathematical Software, 5(3), 308-323. doi:10.1145/355841.355847Dongarra, J. J., Du Croz, J., Hammarling, S., & Duff, I. S. (1990). A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1), 1-17. doi:10.1145/77626.79170Demmel, J., & Nguyen, H. D. (2015). Parallel Reproducible Summation. IEEE Transactions on Computers, 64(7), 2060-2070. doi:10.1109/tc.2014.2345391Iakymchuk, R., Graillat, S., Defour, D., & Quintana-Ortí, E. S. (2019). Hierarchical approach for deriving a reproducible unblocked LU factorization. The International Journal of High Performance Computing Applications, 33(5), 791-803. doi:10.1177/1094342019832968Iakymchuk, R., Defour, D., Collange, S., & Graillat, S. (2016). Reproducible and Accurate Matrix Multiplication. Lecture Notes in Computer Science, 126-137. doi:10.1007/978-3-319-31769-4_11Rump, S. M., Ogita, T., & Oishi, S. (2009). Accurate Floating-Point Summation Part II: Sign, K-Fold Faithful and Rounding to Nearest. SIAM Journal on Scientific Computing, 31(2), 1269-1302. doi:10.1137/07068816xBurgess, N., Goodyer, C., Hinds, C. N., & Lutz, D. R. (2019). High-Precision Anchored Accumulators for Reproducible Floating-Point Summation. IEEE Transactions on Computers, 68(7), 967-978. doi:10.1109/tc.2018.2855729D. Mukunoki, T. Ogita, K. Ozaki, Accurate and reproducible BLAS routines with Ozaki scheme for many-core architectures, in: Proc. International Conference on Parallel Processing and Applied Mathematics, PPAM2019, 2019, accepted.Ogita, T., Rump, S. M., & Oishi, S. (2005). Accurate Sum and Dot Product. SIAM Journal on Scientific Computing, 26(6), 1955-1988. doi:10.1137/030601818Kulisch, U., & Snyder, V. (2010). The exact dot product as basic tool for long interval arithmetic. Computing, 91(3), 307-313. doi:10.1007/s00607-010-0127-7Boldo, S., & Melquiond, G. (2008). Emulation of a FMA and Correctly Rounded Sums: Proved Algorithms Using Rounding to Odd. IEEE Transactions on Computers, 57(4), 462-471. doi:10.1109/tc.2007.70819Wiesenberger, M., Einkemmer, L., Held, M., Gutierrez-Milla, A., Sáez, X., & Iakymchuk, R. (2019). Reproducibility, accuracy and performance of the Feltor code and library on parallel computer architectures. Computer Physics Communications, 238, 145-156. doi:10.1016/j.cpc.2018.12.006Fousse, L., Hanrot, G., Lefèvre, V., Pélissier, P., & Zimmermann, P. (2007). MPFR. ACM Transactions on Mathematical Software, 33(2), 13. doi:10.1145/1236463.1236468J. Demmel, H.D. Nguyen, Fast reproducible floating-point summation, in: Proceedings of ARITH-21, 2013, pp. 163–172.Ozaki, K., Ogita, T., Oishi, S., & Rump, S. M. (2011). Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications. Numerical Algorithms, 59(1), 95-118. doi:10.1007/s11075-011-9478-1Carson, E., & Higham, N. J. (2018). Accelerating the Solution of Linear Systems by Iterative Refinement in Three Precisions. SIAM Journal on Scientific Computing, 40(2), A817-A847. doi:10.1137/17m114081

    Online Tensor Methods for Learning Latent Variable Models

    Get PDF
    We introduce an online tensor decomposition based approach for two latent variable modeling problems namely, (1) community detection, in which we learn the latent communities that the social actors in social networks belong to, and (2) topic modeling, in which we infer hidden topics of text articles. We consider decomposition of moment tensors using stochastic gradient descent. We conduct optimization of multilinear operations in SGD and avoid directly forming the tensors, to save computational and storage costs. We present optimized algorithm in two platforms. Our GPU-based implementation exploits the parallelism of SIMD architectures to allow for maximum speed-up by a careful optimization of storage and data transfer, whereas our CPU-based implementation uses efficient sparse matrix computations and is suitable for large sparse datasets. For the community detection problem, we demonstrate accuracy and computational efficiency on Facebook, Yelp and DBLP datasets, and for the topic modeling problem, we also demonstrate good performance on the New York Times dataset. We compare our results to the state-of-the-art algorithms such as the variational method, and report a gain of accuracy and a gain of several orders of magnitude in the execution time.Comment: JMLR 201

    The Tensor Networks Anthology: Simulation techniques for many-body quantum lattice systems

    Full text link
    We present a compendium of numerical simulation techniques, based on tensor network methods, aiming to address problems of many-body quantum mechanics on a classical computer. The core setting of this anthology are lattice problems in low spatial dimension at finite size, a physical scenario where tensor network methods, both Density Matrix Renormalization Group and beyond, have long proven to be winning strategies. Here we explore in detail the numerical frameworks and methods employed to deal with low-dimension physical setups, from a computational physics perspective. We focus on symmetries and closed-system simulations in arbitrary boundary conditions, while discussing the numerical data structures and linear algebra manipulation routines involved, which form the core libraries of any tensor network code. At a higher level, we put the spotlight on loop-free network geometries, discussing their advantages, and presenting in detail algorithms to simulate low-energy equilibrium states. Accompanied by discussions of data structures, numerical techniques and performance, this anthology serves as a programmer's companion, as well as a self-contained introduction and review of the basic and selected advanced concepts in tensor networks, including examples of their applications.Comment: 115 pages, 56 figure
    corecore