Results and Frontiers in Lattice Baryon Spectroscopy

Author: Adam C. Lichtl
Chaden Djalali
Colin Morningstar
David Richards
Fernando Umeres
George Fleming
John Bulava
K. Jimmy Juge
Nilmani Mathur
Philip L. Cole
Ricardo Alarcon
Robert Edwards
Stephen J. Wallace
Publication venue: 'AIP Publishing'
Publication date: 01/01/2007

The Lattice Hadron Physics Collaboration (LHPC) baryon spectroscopy effort is reviewed. To date the LHPC has performed exploratory Lattice QCD calculations of the low-lying spectrum of Nucleon and Delta baryons. These calculations demonstrate the effectiveness of our method by obtaining the masses of an unprecedented number of excited states with definite quantum numbers. Future work of the project is outlined. Comment: To appear in the proceedings for the VII Latin American Symposium of Nuclear Physics and Application

arXiv.org e-Print Archive

Crossref

Automating Topology Aware Mapping for Supercomputers

Author: Bhatele Abhinav
Publication date: 01/01/2010

Petascale machines with hundreds of thousands of cores are being built. These machines have varying interconnect topologies and large network diameters. Computation is cheap and communication on the network is becoming the bottleneck for scaling of parallel applications. Network contention, specifically, is becoming an increasingly important factor affecting overall performance. The broad goal of this dissertation is performance optimization of parallel applications through reduction of network contention. Most parallel applications have a certain communication topology. Mapping the tasks of a parallel application, based on their communication graph, to the physical processors of a machine can potentially lead to performance improvements. Mapping the communication graph of an application onto the interconnect topology of a machine while trying to localize communication is the research problem under consideration. The farther messages travel on the network, the greater the chance of resource sharing between messages. This can create contention on the networks commonly used today. Evaluative studies in this dissertation show that on IBM Blue Gene and Cray XT machines, message latencies can be severely affected under contention. Realizing this fact, application developers have started paying attention to the mapping of tasks to physical processors to minimize contention. Placement of communicating tasks on nearby physical processors can minimize the distance traveled by messages and reduce the chances of contention. Performance improvements through topology aware placement for applications such as NAMD and OpenAtom are used to motivate this work. Building on these ideas, the dissertation proposes algorithms and techniques for automatic mapping of parallel applications to relieve the application developers of this burden. The effect of contention on message latencies is studied in depth to guide the design of mapping algorithms.
The hop-bytes metric is proposed for the evaluation of mapping algorithms as a better metric than the previously used maximum dilation metric. The main focus of this dissertation is on developing topology aware mapping algorithms for parallel applications with regular and irregular communication patterns. The automatic mapping framework is a suite of such algorithms with capabilities to choose the best mapping for a problem with a given communication graph. The dissertation also briefly discusses completely distributed mapping techniques which will be imperative for machines of the future.
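
The hop-bytes metric named in the abstract above can be illustrated with a minimal sketch. It assumes the common reading of the metric (message bytes multiplied by the number of network hops, summed over all communicating task pairs) and a torus with wraparound links. The function names, the 4x4 torus, the ring communication graph, and both mappings are hypothetical and are not taken from the dissertation.

```python
def torus_hops(a, b, dims):
    """Minimal hop count between coordinates a and b on a torus with wraparound links."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def hop_bytes(comm_graph, mapping, dims):
    """Sum of (bytes sent) * (hops travelled) over all communicating task pairs."""
    return sum(
        nbytes * torus_hops(mapping[src], mapping[dst], dims)
        for (src, dst), nbytes in comm_graph.items()
    )


# Hypothetical example: four tasks exchanging 100 bytes around a ring,
# placed on a 4x4 torus in two different ways.
dims = (4, 4)
ring = {(0, 1): 100, (1, 2): 100, (2, 3): 100, (3, 0): 100}
compact = {0: (0, 0), 1: (0, 1), 2: (1, 1), 3: (1, 0)}    # neighbours stay adjacent
scattered = {0: (0, 0), 1: (2, 2), 2: (0, 1), 3: (2, 3)}  # neighbours far apart

print(hop_bytes(ring, compact, dims))    # lower is better
print(hop_bytes(ring, scattered, dims))
```

Under this reading, a mapping algorithm would prefer the placement with the smaller hop-bytes total, since fewer bytes cross fewer links.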

CiteSeerX

Illinois Digital Environment for Access to Learning and Scholarship Repository

Performance Measurement and Analysis of Large-Scale Parallel Applications on Leadership Computing Systems

Publication venue: 'Hindawi Limited'
Publication date: 01/01/2008

Crossref

Beam Dynamics in High Intensity Cyclotrons Including Neighboring Bunch Effects: Model, Implementation and Application

Author: A. Adelmann
F. Iselin
J. Galambos
J. J. Yang
M. Humbel
M. Seidel
M. M. Gordon
P. Bertrand
R. Baartman
R. W. Hockney
S. Koscielniak
T. J. Zhang
W. Joho
Publication venue: 'American Physical Society (APS)'
Publication date: 01/03/2010

Space charge effects, being one of the most significant collective effects, play an important role in high intensity cyclotrons. However, for cyclotrons with small turn separation, other existing effects are of equal importance. Interactions of radially neighboring bunches are also present, but their combined effects have not yet been investigated in any great detail. In this paper, a new particle-in-cell based self-consistent numerical simulation model is presented for the first time. The model covers neighboring bunch effects and is implemented in the three-dimensional object-oriented parallel code OPAL-cycl, a flavor of the OPAL framework. We discuss this model together with its implementation and validation. Simulation results are presented from the PSI 590 MeV Ring Cyclotron in the context of the ongoing high intensity upgrade program, which aims to provide a beam power of 1.8 MW (CW) at the target destination.

arXiv.org e-Print Archive

Crossref

Directory of Open Access Journals

Predictive analysis and optimisation of pipelined wavefront applications using reusable analytic models

Author: Mudalige Gihan R.

Pipelined wavefront computations are a ubiquitous class of high performance parallel algorithms used for the solution of many scientific and engineering applications. In order to aid the design and optimisation of these applications, and to ensure that during procurement platforms are chosen best suited to these codes, there has been considerable research in analysing and evaluating their operational performance. Wavefront codes exhibit complex computation, communication, and synchronisation patterns, and as a result there exist a large variety of such codes and possible optimisations. The problem is compounded by each new generation of high performance computing system, which has often introduced a previously unexplored architectural trait, requiring previous performance models to be rewritten and reevaluated. In this thesis, we address the performance modelling and optimisation of this class of application as a whole. This differs from previous studies in which bespoke models are applied to specific applications. The analytic performance models are generalised and reusable, and we demonstrate their application to the predictive analysis and optimisation of pipelined wavefront computations running on modern high performance computing systems. The performance model is based on the LogGP parameterisation, and uses a small number of input parameters to specify the particular behaviour of most wavefront codes. The new parameters and model equations capture the key structural and behavioural differences among different wavefront application codes, providing a succinct summary of the operations for each application and insights into alternative wavefront application design. The models are applied to three industry-strength wavefront codes and are validated on several systems including a Cray XT3/XT4 and an InfiniBand commodity cluster.
Model predictions show high quantitative accuracy (less than 20% error) for all high performance configurations and excellent qualitative accuracy. The thesis presents applications, projections and insights for optimisations using the model, which show the utility of reusable analytic models for performance engineering of high performance computing codes. In particular, we demonstrate the use of the model for: (1) evaluating application configuration and resulting performance; (2) evaluating hardware platform issues including platform sizing and configuration; (3) exploring hardware platform design alternatives and system procurement; and (4) considering possible code and algorithmic optimisations.
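
For readers unfamiliar with the LogGP parameterisation mentioned in the abstract above, the following is a minimal sketch of the textbook LogGP cost of a single point-to-point message. It is not the thesis's wavefront model, whose equations are not given in this listing. The class name, the function name, the parameter values, and the choice of microseconds as the unit are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogGP:
    L: float  # network latency
    o: float  # CPU overhead paid once by the sender and once by the receiver
    g: float  # minimum gap between consecutive messages
    G: float  # gap per byte for long messages (inverse bandwidth)
    P: int    # number of processors


def message_time(params, nbytes):
    """Textbook LogGP end-to-end time for one message: o + (k - 1) * G + L + o."""
    if nbytes < 1:
        raise ValueError("nbytes must be at least 1")
    return params.o + (nbytes - 1) * params.G + params.L + params.o


# Hypothetical parameter values, in microseconds.
net = LogGP(L=5.0, o=1.0, g=2.0, G=0.01, P=64)
print(message_time(net, 1))     # a one-byte message costs 2 * o + L
print(message_time(net, 1001))  # larger messages add (k - 1) * G
```

A full application model would combine terms like this with computation time and the pipeline structure of the code; that composition is what the thesis describes and is not reproduced here.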

Warwick Research Archives Portal Repository

HPCC Update and Analysis

Author: Jeffery A Kuehn
Nathan L Wichmann
Publication date: 11/04/2020

Abstract: The last year has seen significant updates in the programming environment and operating systems on the Cray X1E and Cray XT3, as well as the much anticipated release of version 1.0 of the HPCC Benchmark. This paper will provide an update and analysis of the HPCC Benchmark results for the Cray XT3 and X1E, as well as a comparison against historical results.

CiteSeerX

OutFlank Routing: Increasing Throughput in Toroidal Interconnection Networks

Author: Versaci Francesco
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 28/10/2013

We present a new, deadlock-free routing scheme for toroidal interconnection networks, called OutFlank Routing (OFR). OFR is an adaptive strategy which exploits non-minimal links, both in the source and in the destination nodes. When minimal links are congested, OFR deroutes packets to carefully chosen intermediate destinations, in order to obtain travel paths which are only an additive constant longer than the shortest ones. Since routing performance is very sensitive to changes in the traffic model or in the router parameters, an accurate discrete-event simulator of the toroidal network has been developed to empirically validate OFR, by comparing it against other relevant routing strategies, over a range of typical real-world traffic patterns. On the 16x16x16 (4096 nodes) simulated network, OFR exhibits improvements of the maximum sustained throughput between 14% and 114% with respect to Adaptive Bubble Routing. Comment: 9 pages, 5 figures, to be presented at ICPADS 201

arXiv.org e-Print Archive

Crossref

Predictive analysis and optimisation of pipelined wavefront applications using reusable analytic models

Author: Mudalige Gihan Ravideva
Publication date: 01/01/2009

OpenGrey Repository

Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation

Author: Awan Ammar Ahmad
Bedorf Jeroen
Chu Ching-Hsiang
Panda Dhabaleswar K.
Subramoni Hari
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 25/10/2018

TensorFlow has been the most widely adopted Machine/Deep Learning framework. However, little exists in the literature that provides a thorough understanding of the capabilities which TensorFlow offers for the distributed training of large ML/DL models that need computation and communication at scale. Most commonly used distributed training approaches for TF can be categorized as follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this paper, we provide an in-depth performance characterization and analysis of these distributed training approaches on various GPU clusters including the Piz Daint system (#6 on Top500). We perform experiments to gain novel insights along the following vectors: 1) Application-level scalability of DNN training, 2) Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on these experiments, we present two key insights: 1) Overall, No-gRPC designs achieve better performance compared to gRPC-based approaches for most configurations, and 2) The performance of No-gRPC is heavily influenced by the gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware MPI Allreduce design that exploits CUDA kernels and pointer caching to perform large reductions efficiently. Our proposed designs offer 5-17X better performance than NCCL2 for small and medium messages, and reduce latency by 29% for large messages. The proposed optimizations help Horovod-MPI to achieve approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs. Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint cluster. Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer-revie
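
The scaling efficiency figure quoted in the abstract above can be read through the following minimal sketch. It assumes the usual definition, measured aggregate throughput divided by ideal linear scaling of the single-GPU throughput; the abstract does not state which definition the paper uses. The function name and the throughput numbers are hypothetical and are not results from the paper.

```python
def scaling_efficiency(single_gpu_throughput, multi_gpu_throughput, num_gpus):
    """Measured throughput as a fraction of ideal linear scaling from one GPU."""
    if num_gpus < 1 or single_gpu_throughput <= 0:
        raise ValueError("need at least one GPU and a positive baseline throughput")
    return multi_gpu_throughput / (num_gpus * single_gpu_throughput)


# Hypothetical numbers in images per second: one GPU, then 64 GPUs together.
baseline = 100.0
aggregate = 5760.0
print(f"{scaling_efficiency(baseline, aggregate, 64):.0%}")
```

Under this definition, an efficiency below 100% reflects time lost to communication such as gradient aggregation, which is the cost the Allreduce design in the abstract targets.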

arXiv.org e-Print Archive

Crossref