1,508 research outputs found
An OpenSHMEM Implementation for the Adapteva Epiphany Coprocessor
This paper reports the implementation and performance evaluation of the
OpenSHMEM 1.3 specification for the Adapteva Epiphany architecture within the
Parallella single-board computer. The Epiphany architecture exhibits massive
many-core scalability with a physically compact 2D array of RISC CPU cores and
a fast network-on-chip (NoC). While fully capable of MPMD execution, the
physical topology and memory-mapped capabilities of the core and network
translate well to Partitioned Global Address Space (PGAS) programming models
and SPMD execution with SHMEM.Comment: 14 pages, 9 figures, OpenSHMEM 2016: Third workshop on OpenSHMEM and
Related Technologie
Using XDAQ in Application Scenarios of the CMS Experiment
XDAQ is a generic data acquisition software environment that emerged from a
rich set of of use-cases encountered in the CMS experiment. They cover not the
deployment for multiple sub-detectors and the operation of different processing
and networking equipment as well as a distributed collaboration of users with
different needs. The use of the software in various application scenarios
demonstrated the viability of the approach. We discuss two applications, the
tracker local DAQ system for front-end commissioning and the muon chamber
validation system. The description is completed by a brief overview of XDAQ.Comment: Conference CHEP 2003 (Computing in High Energy and Nuclear Physics,
La Jolla, CA
APENet: LQCD clusters a la APE
Developed by the APE group, APENet is a new high speed, low latency,
3-dimensional interconnect architecture optimized for PC clusters running
LQCD-like numerical applications. The hardware implementation is based on a
single PCI-X 133MHz network interface card hosting six indipendent
bi-directional channels with a peak bandwidth of 676 MB/s each direction. We
discuss preliminary benchmark results showing exciting performances similar or
better than those found in high-end commercial network systems.Comment: Lattice2004(machines), 3 pages, 4 figure
Learning from the Success of MPI
The Message Passing Interface (MPI) has been extremely successful as a
portable way to program high-performance parallel computers. This success has
occurred in spite of the view of many that message passing is difficult and
that other approaches, including automatic parallelization and directive-based
parallelism, are easier to use. This paper argues that MPI has succeeded
because it addresses all of the important issues in providing a parallel
programming model.Comment: 12 pages, 1 figur
Lattice QCD Production on Commodity Clusters at Fermilab
We describe the construction and results to date of Fermilab's three
Myrinet-networked lattice QCD production clusters (an 80-node dual Pentium III
cluster, a 48-node dual Xeon cluster, and a 128-node dual Xeon cluster). We
examine a number of aspects of performance of the MILC lattice QCD code running
on these clusters.Comment: Talk from the 2003 Computing in High Energy and Nuclear Physics
(CHEP03), La Jolla, Ca, USA, March 2003, 6 pages, LaTeX, 8 eps figures. PSN
TUIT00
On the acceleration of wavefront applications using distributed many-core architectures
In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applicationsâa ubiquitous class of parallel algorithms used for the solution of a number of scientific and engineering applications. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P). Benchmark results are presented for problem classes A to C and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions will far exceed those of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithm on GPU-based architectures
- âŠ