Search CORE

953 research outputs found

Parallel implementation of the TRANSIMS micro-simulation

Author: Nagel Kai
Rickert Marcus
Publication venue
Publication date: 01/01/2000
Field of study

This paper describes the parallel implementation of the TRANSIMS traffic micro-simulation. The parallelization method is domain decomposition, which means that each CPU of the parallel computer is responsible for a different geographical area of the simulated region. We describe how information between domains is exchanged, and how the transportation network graph is partitioned. An adaptive scheme is used to optimize load balancing. We then demonstrate how computing speeds of our parallel micro-simulations can be systematically predicted once the scenario and the computer architecture are known. This makes it possible, for example, to decide if a certain study is feasible with a certain computing budget, and how to invest that budget. The main ingredients of the prediction are knowledge about the parallel implementation of the micro-simulation, knowledge about the characteristics of the partitioning of the transportation network graph, and knowledge about the interaction of these quantities with the computer system. In particular, we investigate the differences between switched and non-switched topologies, and the effects of 10 Mbit, 100 Mbit, and Gbit Ethernet. keywords: Traffic simulation, parallel computing, transportation planning, TRANSIM

arXiv.org e-Print Archive

CiteSeerX

Repository for Publications and Research Data

RIACS

Author: Oliger Joseph
Publication venue
Publication date
Field of study

Topics considered include: high-performance computing; cognitive and perceptual prostheses (computational aids designed to leverage human abilities); autonomous systems. Also included: development of a 3D unstructured grid code based on a finite volume formulation and applied to the Navier-stokes equations; Cartesian grid methods for complex geometry; multigrid methods for solving elliptic problems on unstructured grids; algebraic non-overlapping domain decomposition methods for compressible fluid flow problems on unstructured meshes; numerical methods for the compressible navier-stokes equations with application to aerodynamic flows; research in aerodynamic shape optimization; S-HARP: a parallel dynamic spectral partitioner; numerical schemes for the Hamilton-Jacobi and level set equations on triangulated domains; application of high-order shock capturing schemes to direct simulation of turbulence; multicast technology; network testbeds; supercomputer consolidation project

NASA Technical Reports Server

A visual Analytics System for Optimizing Communications in Massively Parallel Applications

Author: Fujiwara Takanori
Ma Kwan-Liu
Malakar Preeti
Papka Michael E.
Reda Khairi
Vishwanath Venkatram
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017
Field of study

Current and future supercomputers have tens of thousands of compute nodes interconnected with high-dimensional networks and complex network topologies for improved performance. Application developers are required to write scalable parallel programs in order to achieve high throughput on these machines. Application performance is largely determined by efficient inter-process communication. A common way to analyze and optimize performance is through profiling parallel codes to identify communication bottlenecks. However, understanding gigabytes of profile data is not a trivial task. In this paper, we present a visual analytics system for identifying the scalability bottlenecks and improving the communication efficiency of massively parallel applications. Visualization methods used in this system are designed to comprehend large-scale and varied communication patterns on thousands of nodes in complex networks such as the 5D torus and the dragonfly. We also present efficient rerouting and remapping algorithms that can be coupled with our interactive visual analytics design for performance optimization. We demonstrate the utility of our system with several case studies using three benchmark applications on two leading supercomputers. The mapping suggestion from our system led to 38% improvement in hop-bytes for MiniAMR application on 4,096 MPI processes.This research has been sponsored in part by the U.S. National Science Foundation through grant IIS-1320229, and the U.S. Department of Energy through grants DE-SC0012610 and DE-SC0014917. This research has been funded in part and used resources of the Argonne Leadership Computing Facility at Argonne National Lab- oratory, which is supported by the Office of Science of the U.S. Department of Energy under contract no. DE-AC02-06CH11357. This work was supported in part by the DOE Office of Science, ASCR, under award numbers 57L38, 57L32, 57L11, 57K50, and 508050

Crossref

IUPUIScholarWorks

Automating Topology Aware Mapping for Supercomputers

Author: Bhatele Abhinav
Publication venue
Publication date: 01/01/2010
Field of study

Petascale machines with hundreds of thousands of cores are being built. These machines have varying interconnect topologies and large network diameters. Computation is cheap and communication on the network is becoming the bottleneck for scaling of parallel applications. Network contention, specifically, is becoming an increasingly important factor affecting overall performance. The broad goal of this dissertation is performance optimization of parallel applications through reduction of network contention. Most parallel applications have a certain communication topology. Mapping of tasks in a parallel application based on their communication graph, to the physical processors on a machine can potentially lead to performance improvements. Mapping of the communication graph for an application on to the interconnect topology of a machine while trying to localize communication is the research problem under consideration. The farther different messages travel on the network, greater is the chance of resource sharing between messages. This can create contention on the network for networks commonly used today. Evaluative studies in this dissertation show that on IBM Blue Gene and Cray XT machines, message latencies can be severely affected under contention. Realizing this fact, application developers have started paying attention to the mapping of tasks to physical processors to minimize contention. Placement of communicating tasks on nearby physical processors can minimize the distance traveled by messages and reduce the chances of contention. Performance improvements through topology aware placement for applications such as NAMD and OpenAtom are used to motivate this work. Building on these ideas, the dissertation proposes algorithms and techniques for automatic mapping of parallel applications to relieve the application developers of this burden. The effect of contention on message latencies is studied in depth to guide the design of mapping algorithms. The hop-bytes metric is proposed for the evaluation of mapping algorithms as a better metric than the previously used maximum dilation metric. The main focus of this dissertation is on developing topology aware mapping algorithms for parallel applications with regular and irregular communication patterns. The automatic mapping framework is a suite of such algorithms with capabilities to choose the best mapping for a problem with a given communication graph. The dissertation also briefly discusses completely distributed mapping techniques which will be imperative for machines of the future.published or submitted for publicationnot peer reviewe

CiteSeerX

Illinois Digital Environment for Access to Learning and Scholarship Repository

Project Final Report: HPC-Colony II

Author
Publication venue: 'Office of Scientific and Technical Information (OSTI)'
Publication date
Field of study

Crossref

A Case for Cooperative and Incentive-Based Coupling of Distributed Clusters

Author: Buyya Rajkumar
Harwood Aaron
Ranjan Rajiv
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2005
Field of study

Research interest in Grid computing has grown significantly over the past five years. Management of distributed resources is one of the key issues in Grid computing. Central to management of resources is the effectiveness of resource allocation as it determines the overall utility of the system. The current approaches to superscheduling in a grid environment are non-coordinated since application level schedulers or brokers make scheduling decisions independently of the others in the system. Clearly, this can exacerbate the load sharing and utilization problems of distributed resources due to suboptimal schedules that are likely to occur. To overcome these limitations, we propose a mechanism for coordinated sharing of distributed clusters based on computational economy. The resulting environment, called \emph{Grid-Federation}, allows the transparent use of resources from the federation when local resources are insufficient to meet its users' requirements. The use of computational economy methodology in coordinating resource allocation not only facilitates the QoS based scheduling, but also enhances utility delivered by resources.Comment: 22 pages, extended version of the conference paper published at IEEE Cluster'05, Boston, M

arXiv.org e-Print Archive

CiteSeerX

Crossref

Parallel Computers and Complex Systems

Author: G.C. Fox
P.D. Coddington
Publication venue: University Press
Publication date
Field of study

We present an overview of the state of the art and future trends in high performance parallel and distributed computing, and discuss techniques for using such computers in the simulation of complex problems in computational science. The use of high performance parallel computers can help improve our understanding of complex systems, and the converse is also true --- we can apply techniques used for the study of complex systems to improve our understanding of parallel computing. We consider parallel computing as the mapping of one complex system --- typically a model of the world --- into another complex system --- the parallel computer. We study static, dynamic, spatial and temporal properties of both the complex systems and the map between them. The result is a better understanding of which computer architectures are good for which problems, and of software structure, automatic partitioning of data, and the performance of parallel machines

CiteSeerX

Large scale transportation simulations on Beowulf clusters

Author: Cetin Nurhan
Nagel Kai
Publication venue: ETH, Eidgenössische Technische Hochschule, Departement of Computer Science
Publication date: 01/01/2001
Field of study

Repository for Publications and Research Data

Recommended from our members

Computer-aided programming for multiprocessing systems

Author: Gajski Daniel D.
Wu Min-You
Publication venue: eScholarship, University of California
Publication date: 30/06/1988
Field of study

As both the number of processors and the complexity of problems to be solved increase, programming multiprocessing systems becomes more difficult and error-prone. This report discusses parallel models of computation and tools for computer-aided programming (CAP). Program development tools are necessary since programmers are not able to develop complex parallel programs efficiently. In particular, a CAP tool, named Hypertool, is described here. It performs scheduling and handles the communication primitive insertion automatically so that many errors are eliminated. It also generates the performance estimates and other program quality measures to help programmers in improving their algorithms and programs. Experiments have shown that up to a 300% performance improvement can be achieved by computer-aided programming

eScholarship - University of California

An efficient MPI/OpenMP parallelization of the Hartree-Fock method for the second generation of Intel Xeon Phi processor

Author: Alexeev Yuri
D'mello Michael
Gordon Mark S.
Keipert Kristopher
Mironov Vladimir
Moskovsky Alexander
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 14/08/2017
Field of study

Modern OpenMP threading techniques are used to convert the MPI-only Hartree-Fock code in the GAMESS program to a hybrid MPI/OpenMP algorithm. Two separate implementations that differ by the sharing or replication of key data structures among threads are considered, density and Fock matrices. All implementations are benchmarked on a super-computer of 3,000 Intel Xeon Phi processors. With 64 cores per processor, scaling numbers are reported on up to 192,000 cores. The hybrid MPI/OpenMP implementation reduces the memory footprint by approximately 200 times compared to the legacy code. The MPI/OpenMP code was shown to run up to six times faster than the original for a range of molecular system sizes.Comment: SC17 conference paper, 12 pages, 7 figure

arXiv.org e-Print Archive

Crossref