370,055 research outputs found

    A Parallel Dual Fast Gradient Method for MPC Applications

    We propose a parallel adaptive constraint-tightening approach to solve a linear model predictive control problem for discrete-time systems, based on inexact numerical optimization algorithms and operator splitting methods. The underlying algorithm first splits the original problem into as many independent subproblems as the length of the prediction horizon. Then, our algorithm computes a solution for these subproblems in parallel by exploiting auxiliary tightened subproblems in order to certify the control law in terms of suboptimality and recursive feasibility, along with closed-loop stability of the controlled system. Compared to prior approaches based on constraint tightening, our algorithm computes the tightening parameter for each subproblem to handle the propagation of errors introduced by the parallelization of the original problem. Our simulations show the computational benefits of the parallelization, with positive impacts on performance and numerical conditioning, when compared with a recent nonparallel adaptive tightening scheme. Comment: This technical report is an extended version of the paper "A Parallel Dual Fast Gradient Method for MPC Applications" by the same authors, submitted to the 54th IEEE Conference on Decision and Control.
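    To make the dual step concrete, the following is a minimal sketch of a dual fast (accelerated projected) gradient method for a generic inequality-constrained QP, using hypothetical toy data; the paper's horizon splitting, parallel subproblems, and adaptive tightening are not reproduced here.

```python
import numpy as np

# Minimal sketch of a dual fast (accelerated projected) gradient method
# for the QP  min 0.5 x'Hx + g'x  s.t.  Ax <= b,  on hypothetical toy
# data. The paper's horizon splitting and adaptive constraint tightening
# are intentionally omitted.

rng = np.random.default_rng(0)
n, m = 4, 6
M = rng.standard_normal((n, n))
H = M @ M.T + n * np.eye(n)            # strongly convex Hessian
g = rng.standard_normal(n)
A = rng.standard_normal((m, n))
b = rng.random(m)                      # b > 0, so x = 0 is strictly feasible

Hinv = np.linalg.inv(H)
L = np.linalg.norm(A @ Hinv @ A.T, 2)  # Lipschitz constant of the dual gradient

lam = np.zeros(m)                      # dual multipliers
y = lam.copy()                         # extrapolation point
t = 1.0
for _ in range(500):
    x = -Hinv @ (g + A.T @ y)          # minimizer of the Lagrangian at y
    lam_next = np.maximum(y + (A @ x - b) / L, 0.0)         # projected ascent
    t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
    y = lam_next + ((t - 1.0) / t_next) * (lam_next - lam)  # Nesterov momentum
    lam, t = lam_next, t_next

x = -Hinv @ (g + A.T @ lam)            # primal solution recovered from duals
print("max constraint violation:", (A @ x - b).max())
```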

    Machine Learning for Run-Time Energy Optimisation in Many-Core Systems

    In recent years, the focus of computing has moved away from performance-centric serial computation to energy-efficient parallel computation. This necessitates run-time optimisation techniques to address the dynamic resource requirements of different applications on many-core architectures. In this paper, we report on intelligent run-time algorithms which have been experimentally validated for managing energy and application performance in many-core embedded systems. The algorithms are underpinned by a cross-layer system approach where the hardware, system software and application layers work together to optimise the energy-performance trade-off. Algorithm development is motivated by the biological process of how a human brain (acting as an agent) interacts with the external environment (system), changing their respective states over time. Each action taken yields a pay-off, and the agent eventually learns to take the best decisions in the future. In particular, our online approach uses a model-free reinforcement learning algorithm that selects the appropriate voltage-frequency scaling based on workload prediction to meet the applications' performance requirements, achieving energy savings of up to 16% in comparison to state-of-the-art techniques when tested on four ARM A15 cores of an ODROID-XU3 platform.
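    The abstract does not spell out the state/action encoding, so the following is a generic tabular Q-learning sketch over hypothetical workload states and voltage-frequency levels, with a toy energy/performance reward; it illustrates model-free learning of a DVFS policy rather than the paper's exact algorithm.

```python
import random

# Generic tabular Q-learning sketch for run-time DVFS selection.
# States, frequency levels, and the reward shape are hypothetical.

STATES = ["low", "medium", "high"]       # predicted workload classes
ACTIONS = [0.6, 1.0, 1.4, 2.0]           # hypothetical frequency levels (GHz)
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def reward(state, freq):
    # Toy trade-off: penalize energy (~freq^2) and missed performance targets.
    demand = {"low": 0.6, "medium": 1.0, "high": 1.6}[state]
    miss_penalty = 10.0 if freq < demand else 0.0
    return -(freq ** 2) - miss_penalty

state = random.choice(STATES)
for _ in range(20000):
    if random.random() < EPS:                      # epsilon-greedy exploration
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    r = reward(state, action)
    nxt = random.choice(STATES)                    # next predicted workload
    best_next = max(Q[(nxt, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (r + GAMMA * best_next - Q[(state, action)])
    state = nxt

for s in STATES:
    print(s, "->", max(ACTIONS, key=lambda a: Q[(s, a)]), "GHz")
```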

    Performance predictability of divide and conquer skeletons

    Parallel divide and conquer computations, encompassing a wide variety of applications, can be modeled and encapsulated as a high-level primitive called a skeleton. The paper deals with a skeleton designed for parallel divide and conquer algorithms that provides hypercubical communications among processes. The paper also introduces an accurate timing model designed for performance prediction of the proposed primitive. The timing analysis model presented here still characterizes the communication time through architecture parameters, but introduces a few novelties. The proposal is to introduce different kinds of components into the analytical model by associating a performance constant with each specific conceptual block of the skeleton. The trace files obtained from executing the resulting code with the skeleton are processed with linear regression techniques, yielding, among other information, the values of the parameters of those blocks. An extended example showing the relative accuracy of the proposed approach concludes the paper. Workshop de Procesamiento Distribuido y Paralelo (WPDP). Red de Universidades con Carreras en Informática (RedUNCI).
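    The regression idea can be sketched as follows: collect per-block counts from the trace files, then fit one performance constant per conceptual block with least squares. The block names, counts, and timings below are hypothetical.

```python
import numpy as np

# Sketch of the per-block regression idea: fit one performance constant
# per conceptual block of the divide-and-conquer skeleton (hypothetical
# blocks: split, base-case compute, hypercube exchange, combine), so that
# T ~ c_split*x1 + c_comp*x2 + c_comm*x3 + c_comb*x4.

# Each row: counts/volumes observed per block in one traced run
# (e.g., split operations, base-case work, bytes exchanged, merges).
X = np.array([
    [ 7,  8_000,  4_096,  7],
    [15, 16_000,  8_192, 15],
    [31, 32_000, 16_384, 31],
    [63, 64_000, 32_768, 63],
], dtype=float)
T = np.array([0.021, 0.042, 0.085, 0.171])   # measured run times (s)

coeffs, *_ = np.linalg.lstsq(X, T, rcond=None)
for name, c in zip(["split", "compute", "comm", "combine"], coeffs):
    print(f"cost per {name} unit: {c:.3e} s")

# Predict a run twice as large from the fitted constants.
print("predicted:", (2 * X[-1]) @ coeffs, "s")
```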

    TokenTLB+CUP: A Token-Based Page Classification with Cooperative Usage Prediction

    [EN] Discerning the private or shared condition of the data accessed by the applications is an increasingly decisive approach to achieving efficiency and scalability in multi- and many-core systems. Since most memory accesses in both sequential and parallel applications are to either private (accessed only by one core) or read-only (not written) data, devoting the full cost of coherence to every memory access results in sub-optimal performance and limits the scalability and efficiency of the multiprocessor. This paper introduces TokenTLB, a TLB-based page classification approach based on the exchange and counting of tokens. Token counting on TLBs is a natural and efficient way to classify memory pages, and it does not require the use of complex and undesirable persistent requests or arbitration. In addition, classification is extended with the Cooperative Usage Predictor (CUP), a token-based system-wide page usage predictor retrieved through TLB cooperation, in order to perform a classification unaffected by TLB size. Through cycle-accurate simulation we observed that TokenTLB spends 43.6% of cycles as private per page on average, and CUP further increases the time spent as private by 22.0%. CUP avoids 4 out of 5 TLB invalidations when compared to state-of-the-art predictors, thus providing far better prediction accuracy and making usage prediction an attractive mechanism for the first time. This work has been jointly supported by the MINECO and European Commission (FEDER funds) under projects TIN2015-66972-C5-1-R and TIN2015-66972-C5-3-R, and by the Fundación Séneca-Agencia de Ciencia y Tecnología de la Región de Murcia under project Jóvenes Líderes en Investigación 18956/JLI/13. Esteve Garcia, A.; Ros Bardisa, A.; Robles Martínez, A.; Gómez Requena, M. E. (2018). TokenTLB+CUP: A Token-Based Page Classification with Cooperative Usage Prediction. IEEE Transactions on Parallel and Distributed Systems, 29(5), 1188-1201. https://doi.org/10.1109/TPDS.2017.2782808
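    A much-simplified model of the token-counting idea (ignoring token-exchange messages, TLB eviction, and the CUP predictor): each page carries as many tokens as cores, and a core may classify a page as private only while its TLB holds all of that page's tokens.

```python
# Simplified model of token-based page classification (after TokenTLB):
# each page has a fixed number of tokens equal to the core count; a core
# may treat a page as private only while it holds *all* of its tokens.
# Token-exchange protocol details and CUP prediction are omitted.

NUM_CORES = 4

class PageTokens:
    def __init__(self):
        self.holders = {}                       # core_id -> token count

    def access(self, core):
        if not self.holders:                    # first touch: take all tokens
            self.holders[core] = NUM_CORES
        elif core not in self.holders:          # new sharer: request one token
            donor = max(self.holders, key=self.holders.get)
            self.holders[donor] -= 1
            self.holders[core] = 1

    def classification(self, core):
        # Private to `core` only if it holds every token for the page.
        if self.holders.get(core, 0) == NUM_CORES:
            return "private"
        return "shared"

page = PageTokens()
page.access(0)
print(page.classification(0))   # private: core 0 holds all 4 tokens
page.access(2)
print(page.classification(0))   # shared: a token migrated to core 2
```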

    EXPLORING MULTIPLE LEVELS OF PERFORMANCE MODELING FOR HETEROGENEOUS SYSTEMS

    The current trend in High-Performance Computing (HPC) is to extract concurrency from clusters that include heterogeneous resources such as General Purpose Graphical Processing Units (GPGPUs) and Field Programmable Gate Arrays (FPGAs). Although these heterogeneous systems can provide substantial performance for massively parallel applications, many of the available computing resources are often under-utilized due to inefficient application mapping, load balancing, and tuning. While several performance prediction models exist to efficiently tune applications, they often require significant computing architecture knowledge for reliable prediction. In addition, they do not address multiple levels of design space abstraction, and it is often difficult to choose a reliable prediction model for a given design. In this research, we develop a multi-level suite of performance prediction models for heterogeneous systems that primarily targets Synchronous Iterative Algorithms (SIAs). The modeling suite aims to produce accurate and straightforward application runtime prediction prior to the actual large-scale implementation. This suite addresses two levels of system abstraction: 1) low-level, where partial knowledge of the application implementation is present along with the system specifications, and 2) high-level, where the implementation details are minimal and only high-level computing system specifications are given. The performance prediction modeling suite is developed using our proposed Synchronous Iterative GPGPU Execution (SIGE) model for GPGPU clusters, motivated by the RC Amenability Test for Scalable Systems (RATSS) model for FPGA clusters. The low-level abstraction for GPGPU clusters consists of a regression-based performance prediction framework that statistically abstracts system architecture characteristics, enabling performance prediction without detailed architecture knowledge. In this framework, the overall execution time of an application is predicted using regression models developed for host-device computations and network-level communications performed in the algorithm. We have used a family of Spiking Neural Network (SNN) models and an Anisotropic Diffusion Filter (ADF) algorithm as SIA case studies for verification of the regression-based framework and achieved over 90% prediction accuracy compared to the actual implementations for several GPGPU cluster configurations tested. The results establish the adequacy of the low-level abstraction model for advanced, fine-grained performance prediction and design space exploration (DSE). The high-level abstraction consists of the following two primary modeling approaches: qualitative modeling that uses existing subjective-analytical models for computation and communication; and quantitative modeling that predicts computation and communication performance by measuring hardware events associated with objective-analytical models using micro-benchmarks. The performance prediction provided by the high-level abstraction approaches, albeit coarse-grained, delivers useful insight into application performance on the chosen heterogeneous system. A blend of the two high-level modeling approaches, labeled as hybrid modeling, is explored for insightful preliminary performance prediction. The performance prediction models in the multi-level suite are verified and compared for their accuracy and ease-of-use, allowing developers to choose a model that best satisfies their design space abstraction.
We also construct a roadmap that guides users from optimal Application-to-Accelerator (A2A) mapping to fine-grained performance prediction, thereby providing a hierarchical approach to optimal application porting on the target heterogeneous system. The end goal of this dissertation research is to offer the HPC community a thorough, non-architecture-specific performance prediction framework in the form of a hierarchical modeling suite that enables them to optimally utilize the heterogeneous resources.
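    As a sketch of the low-level regression abstraction, one can predict an application's iteration time as a regressed combination of computation and communication features; the features, measurements, and fitted model below are hypothetical.

```python
import numpy as np

# Sketch of the low-level regression framework: predict runtime as the
# sum of a host-device computation term and a network communication term,
# fit by least squares on profiling data (hypothetical features/values).

# Features per profiled configuration: [work items per GPU, bytes moved
# host<->device, bytes exchanged over the network per iteration].
X = np.array([
    [1e6, 4e6, 1e5],
    [2e6, 8e6, 2e5],
    [4e6, 1.6e7, 4e5],
    [8e6, 3.2e7, 8e5],
])
T = np.array([0.9, 1.7, 3.4, 6.7])             # measured iteration times (s)

# Fit with an intercept for fixed launch/synchronization overhead.
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, T, rcond=None)

new_cfg = np.array([6e6, 2.4e7, 6e5, 1.0])
print("predicted iteration time:", new_cfg @ w, "s")
```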

    Parallelism Management for Concurrently Executed Parallel Applications

    Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2020. Advisor: Bernhard Egger. Running multiple parallel jobs on the same multicore machine is becoming more important to improve utilization of the given hardware resources. While co-location of parallel jobs is common practice, it still remains a challenge for current parallel runtime systems to efficiently execute multiple parallel applications simultaneously. Conventional parallelization runtimes such as OpenMP generate a fixed number of worker threads, typically as many as there are cores in the system, to utilize all physical core resources. On such runtime systems, applications may not achieve their peak performance even when given full use of all physical core resources. Moreover, the OS kernel needs to manage all worker threads generated by all running parallel applications, which may incur huge management costs as the number of co-located applications increases. In this thesis, we focus on improving runtime performance for co-located parallel applications. To achieve this goal, the first idea of this work is to use spatial scheduling to execute multiple co-located parallel applications simultaneously. Spatial scheduling, which provides distinct core resources to applications, is considered a promising and scalable approach for executing co-located applications. Despite the growing importance of spatial scheduling, there are still two fundamental research issues with this approach. First, spatial scheduling requires runtime support for parallel applications to run efficiently under spatial core allocations that can change at runtime. Second, the scheduler needs to assign the proper number of core resources to applications, depending on the applications' performance characteristics, for better runtime performance. To this end, in this thesis, we present three novel runtime-level techniques to efficiently execute co-located parallel applications with spatial scheduling. First, we present a cooperative runtime technique that provides malleable parallel execution for OpenMP parallel applications. Malleable execution means that applications can dynamically adapt their degree of parallelism to the varying core resource availability. It allows parallel applications to run efficiently at changing core resource availability, compared to conventional runtime systems that do not adjust the degree of parallelism of the application. Second, this thesis introduces an analytical performance model that can estimate resource utilization and the performance of parallel programs as a function of the provided core resources. We observe that the performance of parallel loops is typically limited by memory performance, and employ queueing theory to model the memory performance. The queueing-system-based approach allows us to estimate the performance by using closed-form equations and hardware performance counters. Third, we present a core allocation framework to manage core resources between co-located parallel applications. With analytical modeling, we observe that maximizing both CPU utilization and memory bandwidth usage can generally lead to better performance compared to conventional core allocation policies that maximize only CPU usage.
The presented core allocation framework optimizes the utilization of the multi-dimensional resources of CPU cores and memory bandwidth on multi-socket multicore systems, based on the cooperative parallel runtime support and the analytical model.
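    As an illustration of the queueing-theoretic idea, the following sketch models the memory system as an M/M/1 queue fed by n cores and derives a closed-form speedup estimate for a memory-bound loop; all parameter values are hypothetical stand-ins for what the thesis obtains from hardware performance counters.

```python
# Sketch of a queueing-based speedup estimate for a memory-bound parallel
# loop: the memory system is modeled as an M/M/1 queue fed by n cores.
# In the thesis the parameters come from hardware performance counters;
# the values below are hypothetical.

MU = 1.0 / 80e-9        # memory service rate: one request per 80 ns
LAM_CORE = 1.0e6        # memory request rate issued by a single core (req/s)
CPU_WORK = 1.0          # pure compute time of the whole loop (s)
REQS = 5.0e6            # total memory requests issued by the loop

def loop_time(n):
    lam = n * LAM_CORE                      # aggregate arrival rate
    if lam >= MU:
        return float("inf")                 # memory controller saturated
    wait = 1.0 / (MU - lam)                 # M/M/1 mean response time
    return CPU_WORK / n + (REQS / n) * wait # per-core compute + memory stalls

t1 = loop_time(1)
for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} cores: predicted speedup {t1 / loop_time(n):5.2f}")
```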

    Optimization of communication intensive applications on HPC networks

    Communication is a necessary but overhead-inducing component of parallel programming. Its impact on application design and performance is due to several related aspects of a parallel job execution: network topology, routing protocol, suitability of the algorithm being used to the network, job placement, etc. This thesis is aimed at developing an understanding of how communication plays out on networks of high performance computing systems and exploring methods that can be used to improve the communication performance of large scale applications. Broadly speaking, three topics have been studied in detail in this thesis. The first of these topics is task mapping and job placement on practical installations of torus and dragonfly networks. Next, the use of supervised learning algorithms for conducting diagnostic studies of how communication evolves on networks is explored. Finally, the efficacy of packet-level simulations for prediction-based studies of communication performance on different networks using different network parameters is analyzed. The primary contribution of this thesis is the development of scalable diagnostic and prediction methods that can assist in the process of designing networks, adapting applications to future systems, and optimizing the execution of applications on existing systems. These methods include a supervised learning approach, a functional modeling tool (called Damselfly), and a PDES-based packet-level simulator (called TraceR), all of which are described in this thesis.
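    A small sketch of the task-mapping topic: a common quality metric for a mapping is weighted hop-bytes, the sum over communicating task pairs of bytes exchanged times the hop distance on the torus. The topology size and communication pattern below are hypothetical.

```python
import itertools

# Sketch of a task-mapping quality metric on a 3-D torus: weighted
# hop-bytes = sum over communicating task pairs of (bytes exchanged)
# x (torus hop distance between the nodes they are mapped to).
# Topology size and the communication matrix are hypothetical.

DIMS = (4, 4, 4)                    # 4x4x4 torus

def torus_hops(a, b):
    # Shortest wraparound distance summed over the three dimensions.
    return sum(min(abs(x - y), d - abs(x - y))
               for x, y, d in zip(a, b, DIMS))

nodes = list(itertools.product(*(range(d) for d in DIMS)))

# comm[i][j] = bytes task i sends to task j (toy 1-D ring pattern).
ntasks = 8
comm = [[0] * ntasks for _ in range(ntasks)]
for i in range(ntasks):
    comm[i][(i + 1) % ntasks] = 1024

def hop_bytes(mapping):
    return sum(comm[i][j] * torus_hops(nodes[mapping[i]], nodes[mapping[j]])
               for i in range(ntasks) for j in range(ntasks) if comm[i][j])

naive = list(range(ntasks))         # map tasks to the first 8 nodes in order
print("naive mapping hop-bytes:", hop_bytes(naive))
```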

    Big Data Application and System Co-optimization in Cloud and HPC Environment

    The emergence of big data requires powerful computational resources and memory subsystems that can be scaled efficiently to accommodate its demands. The cloud is a well-established computing paradigm that can offer customized computing and memory resources to meet the scalable demands of big data applications. In addition, the flexible pay-as-you-go pricing model offers opportunities for using resources at large scale with low cost and no infrastructure maintenance burdens. High performance computing (HPC), on the other hand, also has powerful infrastructure with the potential to support big data applications. In this dissertation, we explore application and system co-optimization opportunities to support big data in both cloud and HPC environments. Specifically, we explore the unique features of both application and system to seek overlooked optimization opportunities or tackle challenges that are difficult to address by looking at the application or system individually. Based on the characteristics of the workloads and their underlying systems, from which we derive the optimized deployment and runtime schemes, we divide the workloads into four categories: 1) memory intensive applications; 2) compute intensive applications; 3) both memory and compute intensive applications; 4) I/O intensive applications. When deploying memory intensive big data applications to the public clouds, one important yet challenging problem is selecting a specific instance type whose memory capacity is large enough to prevent out-of-memory errors while the cost is minimized without violating performance requirements. In this dissertation, we propose two techniques for efficient deployment of big data applications with dynamic and intensive memory footprints in the cloud. The first approach builds a performance-cost model that can accurately predict how, and by how much, virtual memory size would slow down the application and, consequently, impact the overall monetary cost. The second approach employs a lightweight memory usage prediction methodology based on dynamic meta-models adjusted by the application's own traits. The key idea is to eliminate periodical checkpointing and migrate the application only when the predicted memory usage exceeds the physical allocation. When moving compute intensive applications to the clouds, it is critical to make the applications scalable so that they can benefit from the massive cloud resources. In this dissertation, we first use Kirchhoff's law, one of the most widely used physical laws across engineering disciplines, as an example workload for our study. The key challenge of applying Kirchhoff's law to real-world applications at scale lies in the high, if not prohibitive, computational cost of solving a large number of nonlinear equations. In this dissertation, we propose a high-performance deep-learning-based approach for Kirchhoff analysis, namely HDK. HDK employs two techniques to improve performance: (i) early pruning of unqualified input candidates, which simplifies the equations and selects a meaningful input data range; (ii) parallelization of forward labelling, which executes the steps of the problem in parallel. When it comes to applications that are both memory and compute intensive in clouds, we use a blockchain system as a benchmark. Existing blockchain frameworks present a technical barrier for many users who want to modify or test out new research ideas in blockchains.
To make matters worse, many advantages of blockchain systems can be demonstrated only at large scales, which are not always available to researchers. In this dissertation, we develop an accurate and efficient emulation system to replay the execution of large-scale blockchain systems on tens of thousands of nodes in the cloud. For I/O intensive applications, we observe one important yet often neglected side effect of lossy scientific data compression. Lossy compression techniques have demonstrated promising results in significantly reducing scientific data sizes while guaranteeing the compression error bounds, but the compressed data sizes are often highly skewed and thus impact the performance of parallel I/O. Therefore, we believe it is critical to pay more attention to the unbalanced parallel I/O caused by lossy scientific data compression.
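    The memory-prediction deployment idea can be sketched as follows: fit a simple meta-model to early memory-usage samples, extrapolate the peak, and pick the cheapest instance type that fits; the meta-model form, samples, and instance catalog below are hypothetical.

```python
import numpy as np

# Sketch of the deployment idea for memory-intensive jobs: predict peak
# memory from early samples with a simple meta-model, then pick the
# cheapest instance whose memory fits. The quadratic model, samples, and
# instance catalog are hypothetical, not a real provider's offering.

samples_t = np.array([60.0, 120.0, 180.0, 240.0])   # seconds elapsed
samples_gb = np.array([3.1, 5.8, 9.2, 13.5])        # observed resident memory

model = np.polynomial.Polynomial.fit(samples_t, samples_gb, deg=2)
predicted_peak = float(model(600.0)) * 1.15         # extrapolate + 15% margin

catalog = [                     # (name, memory GiB, $/hour) -- hypothetical
    ("small",   16, 0.10),
    ("medium",  32, 0.19),
    ("large",   64, 0.37),
    ("xlarge", 128, 0.74),
]
fitting = [c for c in catalog if c[1] >= predicted_peak]
name, mem_gib, price = min(fitting, key=lambda c: c[2])
print(f"predicted peak ~{predicted_peak:.0f} GiB -> choose {name} (${price}/h)")
```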

    Analysis and design development of parallel 3-D mesh refinement algorithms for finite element electromagnetics with tetrahedra

    Optimal partitioning of three-dimensional (3-D) mesh applications necessitates dynamically determining and optimizing for the most time-inhibiting factors, such as load imbalance and communication volume. One challenge is to create an analytical model with which the programmer can focus on optimizing load imbalance or communication volume to reduce execution time. Another challenge is that the best individual performance of a specific mesh refinement demands precise study and the selection of a suitable computation strategy. Very-large-scale finite element method (FEM) applications require sophisticated capabilities for using the underlying parallel computer's resources in the most efficient way. Thus, classifying these requirements in a manner that is accessible to the programmer is crucial. This thesis contributes a simulation-based approach to the algorithm analysis and design of parallel 3-D FEM mesh refinement that utilizes Petri Nets (PN) as the modeling and simulation tool. PN models are implemented based on detailed software prototypes and system architectures, which imitate the behaviour of the parallel meshing process. Subsequently, estimates for performance measures are derived from discrete event simulations. New communication strategies are contributed in the thesis for parallel mesh refinement that pipeline the computation and communication time by means of a workload prediction approach and a task breaking point approach. To examine the performance of these new designs, PN models are created for modeling and simulating each of them, and their efficiency is justified by the simulation results. Also based on the PN modeling approach, the performance of a Random Polling Dynamic Load Balancing protocol has been examined. Finally, the PN models are validated against an MPI benchmarking program running on a real multiprocessor system. The advantages of the new pipelined communication designs, as well as the benefits of the PN approach for evaluating and developing high performance parallel mesh refinement algorithms, are demonstrated.
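    As a toy illustration of the pipelined communication strategies (the thesis evaluates them with Petri-net discrete-event simulation, not this closed-form estimate), the sketch below compares a refinement pass that communicates only after all computation finishes against one that overlaps each chunk's send with the next chunk's computation.

```python
# Toy comparison of sequential vs pipelined compute/communication for a
# refinement pass split into chunks at "task breaking points". All times
# are hypothetical; the thesis derives such estimates from Petri-net
# discrete-event simulation rather than this closed-form toy.

chunks = [(12.0, 5.0), (9.0, 7.0), (14.0, 4.0), (10.0, 6.0)]  # (compute ms, send ms)

# Sequential: refine everything, then communicate everything.
sequential = sum(c for c, _ in chunks) + sum(s for _, s in chunks)

# Pipelined: chunk i's send overlaps chunk i+1's compute.
t_compute = t_network = 0.0
for comp, send in chunks:
    t_compute += comp                             # CPU finishes this chunk
    t_network = max(t_network, t_compute) + send  # NIC starts once chunk is ready
pipelined = max(t_compute, t_network)

print(f"sequential: {sequential:.1f} ms, pipelined: {pipelined:.1f} ms")
```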