3,873 research outputs found

A low-cost parallel implementation of direct numerical simulation of wall turbulence

Author: Bertolotti
del Álamo
Dmitruk
Günther
Iovieno
Jiménez
Kim
Kim
Kwok
Lele
Mahesh
Maurizio Quadrio
Moin
Moser
Na
Paolo Luchini
Pelz
Pozzi
Quadrio
Quadrio
Spotz
Thomas
Publication venue: 'Elsevier BV'
Publication date: 18/06/2005

A numerical method for the direct numerical simulation of incompressible wall turbulence in rectangular and cylindrical geometries is presented. Its distinctive feature is a design targeted at efficient distributed-memory parallel computing on commodity hardware. The adopted discretization is spectral in the two homogeneous directions; fourth-order accurate, compact finite-difference schemes over a variable-spacing mesh in the wall-normal direction are key to our parallel implementation. The parallel algorithm is designed to minimize data exchange among the computing machines, and in particular to avoid taking a global transpose of the data during the pseudo-spectral evaluation of the non-linear terms. The computing machines can then be connected to each other through low-cost network devices. The code is optimized for memory requirements, which can moreover be subdivided among the computing nodes. The layout of a simple, dedicated and optimized computing system based on commodity hardware is described. The performance of the numerical method on this computing system is evaluated and compared with that of other codes described in the literature, as well as with that of the same code implementing a commonly employed strategy for the pseudo-spectral calculation.

Comment: To be published in J. Comp. Physics

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Politecnico di Milano

Crossref

Archivio della Ricerca - Università di Salerno

CERN Document Server

Recent development and perspectives of machines for lattice QCD

Author: Aglietti
Ammendola
Aoki
Aoki
APE
Arndt
Bartoloni
Bhanot
Bodin
Bodin
Boyle
Boyle
Boyle
Brickner
Chen
Chiu
Christ
Christ
Christ
CP-PACS
Csikor
Fischer
Fodor
Fodor
Gellrich
Gottlieb
Gottlieb
Hasenbusch
Holmgren
Iwasaki
Iwasaki
Lindahl
Luo
Luscher
Marinari
Marinari
Mawhinney
Meuer
Negrassus
Ridge
Sexton
Singh
Sroczynski
Sroczynski
Th Lippert
Watson
Watson
Weingarten
Weingarten
Publication venue: 'Elsevier BV'
Publication date: 10/11/2003

I highlight recent progress in cluster computer technology and assess the status and prospects of cluster computers for lattice QCD with respect to the development of QCDOC and apeNEXT. Taking the LatFor test case, I specify a 512-processor QCD cluster at better than $1/Mflops.

Comment: 14 pages, 17 figures, Lattice2003(plenary)

arXiv.org e-Print Archive

Crossref

Juelich Shared Electronic Resources

CERN Document Server

Balancing the Communication Load of Asynchronously Parallelized Machine Learning Algorithms

Author: Keuper Janis
Pfreundt Franz-Josef
Publication date: 05/10/2015

Stochastic Gradient Descent (SGD) is the standard numerical method used to solve the core optimization problem for the vast majority of machine learning (ML) algorithms. In the context of large scale learning, as utilized by many Big Data applications, efficient parallelization of SGD is the focus of active research. Recently, we were able to show that the asynchronous communication paradigm can be applied to achieve a fast and scalable parallelization of SGD. Asynchronous Stochastic Gradient Descent (ASGD) outperforms other, mostly MapReduce based, parallel algorithms solving large scale machine learning problems. In this paper, we investigate the impact of asynchronous communication frequency and message size on the performance of ASGD applied to large scale ML on HTC cluster and cloud environments. We introduce a novel algorithm for the automatic balancing of the asynchronous communication load, which makes it possible to adapt ASGD to changing network bandwidths and latencies.

Comment: arXiv admin note: substantial text overlap with arXiv:1505.0495

arXiv.org e-Print Archive

Fraunhofer-ePrints

Lattice QCD Production on Commodity Clusters at Fermilab

Author: Gottlieb Steven
Holmgren D.
Mackenzie P.
Simone J.
Singh A.
Publication date: 08/07/2003

We describe the construction and results to date of Fermilab's three Myrinet-networked lattice QCD production clusters (an 80-node dual Pentium III cluster, a 48-node dual Xeon cluster, and a 128-node dual Xeon cluster). We examine a number of aspects of performance of the MILC lattice QCD code running on these clusters.

Comment: Talk from the 2003 Computing in High Energy and Nuclear Physics (CHEP03), La Jolla, CA, USA, March 2003, 6 pages, LaTeX, 8 eps figures. PSN TUIT00

arXiv.org e-Print Archive

UNT Digital Library

Better than $1/Mflops sustained: a scalable PC-based parallel computer for lattice QCD

Author: Aglietti
Alfieri
Alfieri
Aoki
Arsenin
Avico
Bartoloni
Bartonoli
Bodin
Boyle
Chen
Christ
Csikor
Csikor
Di Pierro
Eicker
Fodor
Fodor
Gottlieb
Gottlieb
Gottlieb
Gábor Papp
Iwasaki
Iwasaki
Lüscher
Sándor D. Katz
Tripiccione
Ukawa
Yoshie
Zoltán Fodor
Publication venue: 'Elsevier BV'
Publication date: 21/05/2003

We study the feasibility of a PC-based parallel computer for medium to large scale lattice QCD simulations. The Eötvös Univ., Inst. Theor. Phys. cluster consists of 137 Intel P4-1.7GHz nodes with 512 MB RDRAM. The 32-bit, single precision sustained performance for dynamical QCD without communication is 1510 Mflops/node with Wilson and 970 Mflops/node with staggered fermions. This gives a total performance of 208 Gflops for Wilson and 133 Gflops for staggered QCD, respectively (for 64-bit applications the performance is approximately halved). The novel feature of our system is its communication architecture. In order to have a scalable, cost-effective machine we use Gigabit Ethernet cards for nearest-neighbor communications in a two-dimensional mesh. This type of communication is cost effective (only 30% of the hardware cost is spent on the communication). According to our benchmark measurements this type of communication results in around 40% communication time fraction for lattices up to 48^3 × 96 in full QCD simulations. The price/sustained-performance ratio for full QCD is better than $1/Mflops for Wilson (and around $1.5/Mflops for staggered) quarks for practically any lattice size that fits in our parallel computer. The communication software is freely available upon request for non-profit organizations.

Comment: 14 pages, 3 figures, final version to appear in Comp.Phys.Com
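As a rough check on the aggregate figures quoted in the abstract above, the per-node rates can be multiplied by the node count. This is only a minimal sketch: the names `NODES`, `MFLOPS_PER_NODE`, and `total_gflops` are illustrative and do not come from the paper, and the calculation assumes all 137 nodes contribute equally with no communication overhead.

```python
# Figures quoted in the abstract (single precision, without communication).
NODES = 137
MFLOPS_PER_NODE = {"wilson": 1510, "staggered": 970}


def total_gflops(nodes: int, mflops_per_node: float) -> float:
    """Aggregate rate in Gflops, assuming every node runs at the same rate."""
    return nodes * mflops_per_node / 1000.0


# Hypothetical helper usage: one aggregate figure per fermion formulation.
totals = {name: total_gflops(NODES, rate) for name, rate in MFLOPS_PER_NODE.items()}

for name, gflops in totals.items():
    print(f"{name}: {gflops:.1f} Gflops")
```

The products come out at about 206.9 Gflops for Wilson and 132.9 Gflops for staggered fermions, within roughly 1% of the 208 and 133 Gflops stated in the abstract.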

arXiv.org e-Print Archive

Crossref