3,873 research outputs found

    A low-cost parallel implementation of direct numerical simulation of wall turbulence

    Full text link
    A numerical method for the direct numerical simulation of incompressible wall turbulence in rectangular and cylindrical geometries is presented. The distinctive feature resides in its design being targeted towards an efficient distributed-memory parallel computing on commodity hardware. The adopted discretization is spectral in the two homogeneous directions; fourth-order accurate, compact finite-difference schemes over a variable-spacing mesh in the wall-normal direction are key to our parallel implementation. The parallel algorithm is designed in such a way as to minimize data exchange among the computing machines, and in particular to avoid taking a global transpose of the data during the pseudo-spectral evaluation of the non-linear terms. The computing machines can then be connected to each other through low-cost network devices. The code is optimized for memory requirements, which can moreover be subdivided among the computing nodes. The layout of a simple, dedicated and optimized computing system based on commodity hardware is described. The performance of the numerical method on this computing system is evaluated and compared with that of other codes described in the literature, as well as with that of the same code implementing a commonly employed strategy for the pseudo-spectral calculation.Comment: To be published in J. Comp. Physic

    Recent development and perspectives of machines for lattice QCD

    Full text link
    I highlight recent progress in cluster computer technology and assess status and prospects of cluster computers for lattice QCD with respect to the development of QCDOC and apeNEXT. Taking the LatFor test case, I specify a 512-processor QCD-cluster better than 1$/Mflops.Comment: 14 pages, 17 figures, Lattice2003(plenary

    Balancing the Communication Load of Asynchronously Parallelized Machine Learning Algorithms

    Full text link
    Stochastic Gradient Descent (SGD) is the standard numerical method used to solve the core optimization problem for the vast majority of machine learning (ML) algorithms. In the context of large scale learning, as utilized by many Big Data applications, efficient parallelization of SGD is in the focus of active research. Recently, we were able to show that the asynchronous communication paradigm can be applied to achieve a fast and scalable parallelization of SGD. Asynchronous Stochastic Gradient Descent (ASGD) outperforms other, mostly MapReduce based, parallel algorithms solving large scale machine learning problems. In this paper, we investigate the impact of asynchronous communication frequency and message size on the performance of ASGD applied to large scale ML on HTC cluster and cloud environments. We introduce a novel algorithm for the automatic balancing of the asynchronous communication load, which allows to adapt ASGD to changing network bandwidths and latencies.Comment: arXiv admin note: substantial text overlap with arXiv:1505.0495

    Lattice QCD Production on Commodity Clusters at Fermilab

    Full text link
    We describe the construction and results to date of Fermilab's three Myrinet-networked lattice QCD production clusters (an 80-node dual Pentium III cluster, a 48-node dual Xeon cluster, and a 128-node dual Xeon cluster). We examine a number of aspects of performance of the MILC lattice QCD code running on these clusters.Comment: Talk from the 2003 Computing in High Energy and Nuclear Physics (CHEP03), La Jolla, Ca, USA, March 2003, 6 pages, LaTeX, 8 eps figures. PSN TUIT00

    Better than $1/Mflops sustained: a scalable PC-based parallel computer for lattice QCD

    Full text link
    We study the feasibility of a PC-based parallel computer for medium to large scale lattice QCD simulations. The E\"otv\"os Univ., Inst. Theor. Phys. cluster consists of 137 Intel P4-1.7GHz nodes with 512 MB RDRAM. The 32-bit, single precision sustained performance for dynamical QCD without communication is 1510 Mflops/node with Wilson and 970 Mflops/node with staggered fermions. This gives a total performance of 208 Gflops for Wilson and 133 Gflops for staggered QCD, respectively (for 64-bit applications the performance is approximately halved). The novel feature of our system is its communication architecture. In order to have a scalable, cost-effective machine we use Gigabit Ethernet cards for nearest-neighbor communications in a two-dimensional mesh. This type of communication is cost effective (only 30% of the hardware costs is spent on the communication). According to our benchmark measurements this type of communication results in around 40% communication time fraction for lattices upto 48^3\cdot96 in full QCD simulations. The price/sustained-performance ratio for full QCD is better than 1/MflopsforWilson(andaround1/Mflops for Wilson (and around 1.5/Mflops for staggered) quarks for practically any lattice size, which can fit in our parallel computer. The communication software is freely available upon request for non-profit organizations.Comment: 14 pages, 3 figures, final version to appear in Comp.Phys.Com
    corecore