Search CORE

391 research outputs found

Scalability of Incompressible Flow Computations on Multi-GPU Clusters Using Dual-Level and Tri-Level Parallelism

Author: Balaji P.
Bova S. W.
Cappello F.
Cappello F.
Cappello F.
Cwire
Cwire
Cwire
Dong S.
Elsen E.
Goglin B.
Griebel M.
Gropp W.
Guermond J.L.L.
Göddeke D.
Hager G.
Hempel R.
Henty D. S.
Kindratenko V.
Luong P.
Lusk E.
Nakajima K.
Nakajima K.
Owens J.D.
Rabenseifner R.
Schive H.
Showerman M.
Simon H.
Thibault J. C.
Wan D.C.
Publication venue: 'IUScholarWorks'
Publication date: 04/01/2011
Field of study

High performance computing using graphics processing units (GPUs) is gaining popularity in the scientific computing field, with many large compute clusters being augmented with multiple GPUs in each node. We investigate hybrid tri-level (MPI-OpenMP-CUDA) parallel implementations to explore the efficiency and scalability of incompressible flow computations on GPU clusters up to 128 GPUS. This work details some of the unique issues faced when merging fine-grain parallelism on the GPU using CUDA with coarse-grain parallelism using OpenMP for intra-node and MPI for inter-node communication. Comparisons between the tri-level MPI-OpenMP-CUDA and dual-level MPI-CUDA implementations are shown using computationally large computational fluid dynamics (CFD) simulations. Our results demonstrate that a tri-level parallel implementation does not provide a significant advantage in performance over the dual-level implementation, however further research is needed to justify our conclusion for a cluster with a high GPU per node density or when using software that can utilize OpenMP’s fine-grain parallelism more effectively

Crossref

Boise State University - ScholarWorks

Requirements and Problems in Parallel Model Development at DWD

Author
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2000
Field of study

Crossref

Multi-Level Parallelism for Incompressible Flow Computations on GPU Clusters

Author: Jacobsen Dana A.
Senocak Inanc
Publication venue: 'IUScholarWorks'
Publication date: 01/01/2013
Field of study

We investigate multi-level parallelism on GPU clusters with MPI-CUDA and hybrid MPI-OpenMP-CUDA parallel implementations, in which all computations are done on the GPU using CUDA. We explore efficiency and scalability of incompressible flow computations using up to 256 GPUs on a problem with approximately 17.2 billion cells. Our work addresses some of the unique issues faced when merging fine-grain parallelism on the GPU using CUDA with coarse-grain parallelism that use either MPI or MPI-OpenMP for communications. We present three different strategies to overlap computations with communications, and systematically assess their impact on parallel performance on two different GPU clusters. Our results for strong and weak scaling analysis of incompressible flow computations demonstrate that GPU clusters offer significant benefits for large data sets, and a dual-level MPI-CUDA implementation with maximum overlapping of computation and communication provides substantial benefits in performance. We also find that our tri-level MPI-OpenMP-CUDA parallel implementation does not offer a significant advantage in performance over the dual-level implementation on GPU clusters with two GPUs per node, but on clusters with higher GPU counts per node or with different domain decomposition strategies a tri-level implementation may exhibit higher efficiency than a dual-level implementation and needs to be investigated further

Boise State University - ScholarWorks

Parallel Image Reconstruction Systems for Magnetic Induction Tomography

Author: Yasheng Maimaitijiang
Publication venue
Publication date: 02/03/2009
Field of study

University of South Wales Research Explorer

A survey of techniques and technologies for web-based real-time interactive rendering

Author: Pacheco Filipe
Tovar Eduardo
Publication venue: IPP-Hurray Group
Publication date: 01/01/2001
Field of study

When exploring a virtual environment, realism depends mainly on two factors: realistic images and real-time feedback (motions, behaviour etc.). In this context, photo realism and physical validity of computer generated images required by emerging applications, such as advanced e-commerce, still impose major challenges in the area of rendering research whereas the complexity of lighting phenomena further requires powerful and predictable computing if time constraints must be attained. In this technical report we address the state-of-the-art on rendering, trying to put the focus on approaches, techniques and technologies that might enable real-time interactive web-based clientserver rendering systems. The focus is on the end-systems and not the networking technologies used to interconnect client(s) and server(s).Siemens; Bertelsmann mediaSystems GmbH; Eptron Multimedia; Instituto Politécnico do Porto - ISEP-IPP; Institute Laboratory for Mixed Realities at the Academy of Media Arts Cologne, LMR; Mälardalen Real-Time Research Centre (MRTC) at Mälardalen University in Västerås; Q-Systems

Repositório Científico do Instituto Politécnico do Porto

ATCOM: Automatically tuned collective communication system for SMP clusters.

Author: Wu Meng-Shiou
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2005
Field of study

Conventional implementations of collective communications are based on point-to-point communications, and their optimizations have been focused on efficiency of those communication algorithms. However, point-to-point communications are not the optimal choice for modern computing clusters of SMPs due to their two-level communication structure. In recent years, a few research efforts have investigated efficient collective communications for SMP clusters. This dissertation is focused on platform-independent algorithms and implementations in this area;There are two main approaches to implementing efficient collective communications for clusters of SMPs: using shared memory operations for intra-node communications, and over-lapping inter-node/intra-node communications. The former fully utilizes the hardware based shared memory of an SMP, and the latter takes advantage of the inherent hierarchy of the communications within a cluster of SMPs. Previous studies focused on clusters of SMP from certain vendors. However, the previously proposed methods are not portable to other systems. Because the performance optimization issue is very complicated and the developing process is very time consuming, it is highly desired to have self-tuning, platform-independent implementations. As proven in this dissertation, such an implementation can significantly outperform the other point-to-point based portable implementations and some platform-specific implementations;The dissertation describes in detail the architecture of the platform-independent implementation. There are four system components: shared memory-based collective communications, overlapping mechanisms for inter-node and intra-node communications, a prediction-based tuning module and a micro-benchmark based tuning module. Each component is carefully designed with the goal of automatic tuning in mind

Digital Repository @ Iowa State University (ISU)

UNT Digital Library

Architecture aware parallel programming in Glasgow parallel Haskell (GPH)

Author: Aswad Mustafa K.H.
Publication venue: Mathematical and Computer Sciences
Publication date: 01/08/2012
Field of study

General purpose computing architectures are evolving quickly to become manycore and hierarchical: i.e. a core can communicate more quickly locally than globally. To be effective on such architectures, programming models must be aware of the communications hierarchy. This thesis investigates a programming model that aims to share the responsibility of task placement, load balance, thread creation, and synchronisation between the application developer and the runtime system. The main contribution of this thesis is the development of four new architectureaware constructs for Glasgow parallel Haskell that exploit information about task size and aim to reduce communication for small tasks, preserve data locality, or to distribute large units of work. We define a semantics for the constructs that specifies the sets of PEs that each construct identifies, and we check four properties of the semantics using QuickCheck. We report a preliminary investigation of architecture aware programming models that abstract over the new constructs. In particular, we propose architecture aware evaluation strategies and skeletons. We investigate three common paradigms, such as data parallelism, divide-and-conquer and nested parallelism, on hierarchical architectures with up to 224 cores. The results show that the architecture-aware programming model consistently delivers better speedup and scalability than existing constructs, together with a dramatic reduction in the execution time variability. We present a comparison of functional multicore technologies and it reports some of the first ever multicore results for the Feedback Directed Implicit Parallelism (FDIP) and the semi-explicit parallelism (GpH and Eden) languages. The comparison reflects the growing maturity of the field by systematically evaluating four parallel Haskell implementations on a common multicore architecture. The comparison contrasts the programming effort each language requires with the parallel performance delivered. We investigate the minimum thread granularity required to achieve satisfactory performance for three implementations parallel functional language on a multicore platform. The results show that GHC-GUM requires a larger thread granularity than Eden and GHC-SMP. The thread granularity rises as the number of cores rises

ROS: The Research Output Service. Heriot-Watt University Edinburgh

Requirements and problems in parallel model development at

Author: Günther Doms
Jürgen Steppeler
Ulrich Schättler
Publication venue
Publication date: 03/04/2020
Field of study

Nearly 30 years after introducing the first computer model for weather forecasting, the Deutscher Wetterdienst (DWD) is developing the 4th generation of its numerical weather prediction (NWP) system. It consists of a global grid point model (GME) based on a triangular grid and a non-hydrostatic Lokal Modell (LM). The operational demand for running this new system is immense and can only be met by parallel computers. From the experience gained in developing earlier NWP models, several new problems had to be taken into account during the design phase of the system. Most important were portability (including efficieny of the programs on several computer architectures) and ease of code maintainability. Also the organization and administration of the work done by developers from different teams and institutions is more complex than it used to be. This paper describes the models and gives some performance results. The modular approach used for the design of the LM is explained and the effects on the development are discussed

CiteSeerX

GridSim: A Toolkit for the Modeling and Simulation of Distributed Resource Management and Scheduling for Grid Computing

Author: Buyya Rajkumar
Murshed Manzur
Publication venue
Publication date: 01/01/2002
Field of study

Clusters, grids, and peer-to-peer (P2P) networks have emerged as popular paradigms for next generation parallel and distributed computing. The management of resources and scheduling of applications in such large-scale distributed systems is a complex undertaking. In order to prove the effectiveness of resource brokers and associated scheduling algorithms, their performance needs to be evaluated under different scenarios such as varying number of resources and users with different requirements. In a grid environment, it is hard and even impossible to perform scheduler performance evaluation in a repeatable and controllable manner as resources and users are distributed across multiple organizations with their own policies. To overcome this limitation, we have developed a Java-based discrete-event grid simulation toolkit called GridSim. The toolkit supports modeling and simulation of heterogeneous grid resources (both time- and space-shared), users and application models. It provides primitives for creation of application tasks, mapping of tasks to resources, and their management. To demonstrate suitability of the GridSim toolkit, we have simulated a Nimrod-G like grid resource broker and evaluated the performance of deadline and budget constrained cost- and time-minimization scheduling algorithms

arXiv.org e-Print Archive

CiteSeerX