3,873 research outputs found
A low-cost parallel implementation of direct numerical simulation of wall turbulence
A numerical method for the direct numerical simulation of incompressible wall
turbulence in rectangular and cylindrical geometries is presented. The
distinctive feature resides in its design being targeted towards an efficient
distributed-memory parallel computing on commodity hardware. The adopted
discretization is spectral in the two homogeneous directions; fourth-order
accurate, compact finite-difference schemes over a variable-spacing mesh in the
wall-normal direction are key to our parallel implementation. The parallel
algorithm is designed in such a way as to minimize data exchange among the
computing machines, and in particular to avoid taking a global transpose of the
data during the pseudo-spectral evaluation of the non-linear terms. The
computing machines can then be connected to each other through low-cost network
devices. The code is optimized for memory requirements, which can moreover be
subdivided among the computing nodes. The layout of a simple, dedicated and
optimized computing system based on commodity hardware is described. The
performance of the numerical method on this computing system is evaluated and
compared with that of other codes described in the literature, as well as with
that of the same code implementing a commonly employed strategy for the
pseudo-spectral calculation.Comment: To be published in J. Comp. Physic
Recent development and perspectives of machines for lattice QCD
I highlight recent progress in cluster computer technology and assess status
and prospects of cluster computers for lattice QCD with respect to the
development of QCDOC and apeNEXT. Taking the LatFor test case, I specify a
512-processor QCD-cluster better than 1$/Mflops.Comment: 14 pages, 17 figures, Lattice2003(plenary
Balancing the Communication Load of Asynchronously Parallelized Machine Learning Algorithms
Stochastic Gradient Descent (SGD) is the standard numerical method used to
solve the core optimization problem for the vast majority of machine learning
(ML) algorithms. In the context of large scale learning, as utilized by many
Big Data applications, efficient parallelization of SGD is in the focus of
active research. Recently, we were able to show that the asynchronous
communication paradigm can be applied to achieve a fast and scalable
parallelization of SGD. Asynchronous Stochastic Gradient Descent (ASGD)
outperforms other, mostly MapReduce based, parallel algorithms solving large
scale machine learning problems. In this paper, we investigate the impact of
asynchronous communication frequency and message size on the performance of
ASGD applied to large scale ML on HTC cluster and cloud environments. We
introduce a novel algorithm for the automatic balancing of the asynchronous
communication load, which allows to adapt ASGD to changing network bandwidths
and latencies.Comment: arXiv admin note: substantial text overlap with arXiv:1505.0495
Lattice QCD Production on Commodity Clusters at Fermilab
We describe the construction and results to date of Fermilab's three
Myrinet-networked lattice QCD production clusters (an 80-node dual Pentium III
cluster, a 48-node dual Xeon cluster, and a 128-node dual Xeon cluster). We
examine a number of aspects of performance of the MILC lattice QCD code running
on these clusters.Comment: Talk from the 2003 Computing in High Energy and Nuclear Physics
(CHEP03), La Jolla, Ca, USA, March 2003, 6 pages, LaTeX, 8 eps figures. PSN
TUIT00
Better than $1/Mflops sustained: a scalable PC-based parallel computer for lattice QCD
We study the feasibility of a PC-based parallel computer for medium to large
scale lattice QCD simulations. The E\"otv\"os Univ., Inst. Theor. Phys. cluster
consists of 137 Intel P4-1.7GHz nodes with 512 MB RDRAM. The 32-bit, single
precision sustained performance for dynamical QCD without communication is 1510
Mflops/node with Wilson and 970 Mflops/node with staggered fermions. This gives
a total performance of 208 Gflops for Wilson and 133 Gflops for staggered QCD,
respectively (for 64-bit applications the performance is approximately halved).
The novel feature of our system is its communication architecture. In order to
have a scalable, cost-effective machine we use Gigabit Ethernet cards for
nearest-neighbor communications in a two-dimensional mesh. This type of
communication is cost effective (only 30% of the hardware costs is spent on the
communication). According to our benchmark measurements this type of
communication results in around 40% communication time fraction for lattices
upto 48^3\cdot96 in full QCD simulations. The price/sustained-performance ratio
for full QCD is better than 1.5/Mflops for
staggered) quarks for practically any lattice size, which can fit in our
parallel computer. The communication software is freely available upon request
for non-profit organizations.Comment: 14 pages, 3 figures, final version to appear in Comp.Phys.Com
- …