166 research outputs found

    "Virtual malleability" applied to MPI jobs to improve their execution in a multiprogrammed environment"

    Get PDF
    This work focuses on scheduling of MPI jobs when executing in shared-memory multiprocessors (SMPs). The objective was to obtain the best performance in response time in multiprogrammed multiprocessors systems using batch systems, assuming all the jobs have the same priority. To achieve that purpose, the benefits of supporting malleability on MPI jobs to reduce fragmentation and consequently improve the performance of the system were studied. The contributions made in this work can be summarized as follows:· Virtual malleability: A mechanism where a job is assigned a dynamic processor partition, where the number of processes is greater than the number of processors. The partition size is modified at runtime, according to external requirements such as the load of the system, by varying the multiprogramming level, making the job contend for resources with itself. In addition to this, a mechanism which decides at runtime if applying local or global process queues to an application depending on the load balancing between processes of it. · A job scheduling policy, that takes decisions such as how many processes to start with and the maximum multiprogramming degree based on the type and number of applications running and queued. Moreover, as soon as a job finishes execution and where there are queued jobs, this algorithm analyzes whether it is better to start execution of another job immediately or just wait until there are more resources available. · A new alternative to backfilling strategies for the problema of window execution time expiring. Virtual malleability is applied to the backfilled job, reducing its partition size but without aborting or suspending it as in traditional backfilling. The evaluation of this thesis has been done using a practical approach. All the proposals were implemented, modifying the three scheduling levels: queuing system, processor scheduler and runtime library. The impact of the contributions were studied under several types of workloads, varying machine utilization, communication and, balance degree of the applications, multiprogramming level, and job size. Results showed that it is possible to offer malleability over MPI jobs. An application obtained better performance when contending for the resources with itself than with other applications, especially in workloads with high machine utilization. Load imbalance was taken into account obtaining better performance if applying the right queue type to each application independently.The job scheduling policy proposed exploited virtual malleability by choosing at the beginning of execution some parameters like the number of processes and maximum multiprogramming level. It performed well under bursty workloads with low to medium machine utilizations. However as the load increases, virtual malleability was not enough. That is because, when the machine is heavily loaded, the jobs, once shrunk are not able to expand, so they must be executed all the time with a partition smaller than the job size, thus degrading performance. Thus, at this point the job scheduling policy concentrated just in moldability.Fragmentation was alleviated also by applying backfilling techniques to the job scheduling algorithm. Virtual malleability showed to be an interesting improvement in the window expiring problem. Backfilled jobs even on a smaller partition, can continue execution reducing memory swapping generated by aborts/suspensions In this way the queueing system is prevented from reinserting the backfilled job in the queue and re-executing it in the future.Postprint (published version

    Survey on Parallel Computing and Performance Modelling in High Performance Computing

    Get PDF
    The parallel programming come a long way with the advances in the HPC. The high performance computing landscape is shifting from collections of homogeneous nodes towards heterogeneous systems, in which nodes consist of a combination of traditional out-of-order execution cores and accelerator devices. Accelerators, built around GPUs, many-core chips, FPGAs or DSPs, are used to offload compute-intensive tasks. Large-scale GPU clusters are gaining popularity in the scientific computing community and having massive range of applications. However, their deployment and production use are associated with a number of new challenges including CUDA. In this paper, we present our efforts to address some of the issues related to HPC and also introduced some performance modelling techniques along with GPU clustering. DOI: 10.17762/ijritcc2321-8169.15029

    Improving the scalability of parallel N-body applications with an event driven constraint based execution model

    Full text link
    The scalability and efficiency of graph applications are significantly constrained by conventional systems and their supporting programming models. Technology trends like multicore, manycore, and heterogeneous system architectures are introducing further challenges and possibilities for emerging application domains such as graph applications. This paper explores the space of effective parallel execution of ephemeral graphs that are dynamically generated using the Barnes-Hut algorithm to exemplify dynamic workloads. The workloads are expressed using the semantics of an Exascale computing execution model called ParalleX. For comparison, results using conventional execution model semantics are also presented. We find improved load balancing during runtime and automatic parallelism discovery improving efficiency using the advanced semantics for Exascale computing.Comment: 11 figure

    "Virtual malleability" applied to MPI jobs to improve their execution in a multiprogrammed environment"

    Get PDF
    This work focuses on scheduling of MPI jobs when executing in shared-memory multiprocessors (SMPs). The objective was to obtain the best performance in response time in multiprogrammed multiprocessors systems using batch systems, assuming all the jobs have the same priority. To achieve that purpose, the benefits of supporting malleability on MPI jobs to reduce fragmentation and consequently improve the performance of the system were studied. The contributions made in this work can be summarized as follows:· Virtual malleability: A mechanism where a job is assigned a dynamic processor partition, where the number of processes is greater than the number of processors. The partition size is modified at runtime, according to external requirements such as the load of the system, by varying the multiprogramming level, making the job contend for resources with itself. In addition to this, a mechanism which decides at runtime if applying local or global process queues to an application depending on the load balancing between processes of it. · A job scheduling policy, that takes decisions such as how many processes to start with and the maximum multiprogramming degree based on the type and number of applications running and queued. Moreover, as soon as a job finishes execution and where there are queued jobs, this algorithm analyzes whether it is better to start execution of another job immediately or just wait until there are more resources available. · A new alternative to backfilling strategies for the problema of window execution time expiring. Virtual malleability is applied to the backfilled job, reducing its partition size but without aborting or suspending it as in traditional backfilling. The evaluation of this thesis has been done using a practical approach. All the proposals were implemented, modifying the three scheduling levels: queuing system, processor scheduler and runtime library. The impact of the contributions were studied under several types of workloads, varying machine utilization, communication and, balance degree of the applications, multiprogramming level, and job size. Results showed that it is possible to offer malleability over MPI jobs. An application obtained better performance when contending for the resources with itself than with other applications, especially in workloads with high machine utilization. Load imbalance was taken into account obtaining better performance if applying the right queue type to each application independently.The job scheduling policy proposed exploited virtual malleability by choosing at the beginning of execution some parameters like the number of processes and maximum multiprogramming level. It performed well under bursty workloads with low to medium machine utilizations. However as the load increases, virtual malleability was not enough. That is because, when the machine is heavily loaded, the jobs, once shrunk are not able to expand, so they must be executed all the time with a partition smaller than the job size, thus degrading performance. Thus, at this point the job scheduling policy concentrated just in moldability.Fragmentation was alleviated also by applying backfilling techniques to the job scheduling algorithm. Virtual malleability showed to be an interesting improvement in the window expiring problem. Backfilled jobs even on a smaller partition, can continue execution reducing memory swapping generated by aborts/suspensions In this way the queueing system is prevented from reinserting the backfilled job in the queue and re-executing it in the future

    A hardware runtime for task-based programming models

    Get PDF
    © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Task-based programming models such as OpenMP 5.0 and OmpSs are simple to use and powerful enough to exploit task parallelism of applications over multicore, manycore and heterogeneous systems. However, their software-only runtimes introduce relevant overhead when targeting fine-grained tasks, resulting in performance losses. To overcome this drawback, we present a hardware runtime Picos++ that accelerates critical runtime functions such as task dependence analysis, nested task support, and heterogeneous task scheduling. As a proof-of-concept, the Picos++ hardware runtime has been integrated with a compiler infrastructure that supports parallel task-based programming models. A FPGA SoC running Linux OS has been used to implement the hardware accelerated part of Picos++, integrated with a heterogeneous system composed of 4 symmetric multiprocessor (SMP) cores and several hardware functional accelerators (HwAccs) for task execution. Results show significant improvements on energy and performance compared to state-of-the-art parallel software-only runtimes. With Picos++, applications can achieve up to 7.6x speedup and save up to 90 percent of energy, when using 4 threads and up to 4 HwAccs, and even reach a speedup of 16x over the software alternative when using 12 HwAccs and small tasks.Peer ReviewedPostprint (author's final draft

    Exploiting Graphics Processing Units for Massively Parallel Multi-Dimensional Indexing

    Get PDF
    Department of Computer EngineeringScientific applications process truly large amounts of multi-dimensional datasets. To efficiently navigate such datasets, various multi-dimensional indexing structures, such as the R-tree, have been extensively studied for the past couple of decades. Since the GPU has emerged as a new cost-effective performance accelerator, now it is common to leverage the massive parallelism of the GPU in various applications such as medical image processing, computational chemistry, and particle physics. However, hierarchical multi-dimensional indexing structures are inherently not well suited for parallel processing because their irregular memory access patterns make it difficult to exploit massive parallelism. Moreover, recursive tree traversal often fails due to the small run-time stack and cache memory in the GPU. First, we propose Massively Parallel Three-phase Scanning (MPTS) R-tree traversal algorithm to avoid the irregular memory access patterns and recursive tree traversal so that the GPU can access tree nodes in a sequential manner. The experimental study shows that MPTS R-tree traversal algorithm consistently outperforms traditional recursive R-Tree search algorithm for multi-dimensional range query processing. Next, we focus on reducing the query response time and extending n-ary multi-dimensional indexing structures - R-tree, so that a large number of GPU threads cooperate to process a single query in parallel. Because the number of submitted concurrent queries in scientific data analysis applications is relatively smaller than that of enterprise database systems and ray tracing in computer graphics. Hence, we propose a novel variant of R-trees Massively Parallel Hilbert R-Tree (MPHR-Tree), which is designed for a novel parallel tree traversal algorithm Massively Parallel Restart Scanning (MPRS). The MPRS algorithm traverses the MPHR-Tree in mostly contiguous memory access patterns without recursion, which offers more chances to optimize the parallel SIMD algorithm. Our extensive experimental results show that the MPRS algorithm outperforms the other stackless tree traversal algorithms, which are designed for efficient ray tracing in computer graphics community. Furthermore, we develop query co-processing scheme that makes use of both the CPU and GPU. In this approach, we store the internal and leaf nodes of upper tree in CPU host memory and GPU device memory, respectively. We let the CPU traverse internal nodes because the conditional branches in hierarchical tree structures often cause a serious warp divergence problem in the GPU. For leaf nodes, the GPU scans a large number of leaf nodes in parallel based on the selection ratio of a given range query. It is well known that the GPU is superior to the CPU for parallel scanning. The experimental results show that our proposed multi-dimensional range query co-processing scheme improves the query response time by up to 12x and query throughput by up to 4x compared to the state-of-the-art GPU tree traversal algorithm.ope

    Simulation Modelling of Distributed-Shared Memory Multiprocessors

    Get PDF
    Institute for Computing Systems ArchitectureDistributed shared memory (DSM) systems have been recognised as a compelling platform for parallel computing due to the programming advantages and scalability. DSM systems allow applications to access data in a logically shared address space by abstracting away the distinction of physical memory location. As the location of data is transparent, the sources of overhead caused by accessing the distant memories are difficult to analyse. This memory locality problem has been identified as crucial to DSM performance. Many researchers have investigated the problem using simulation as a tool for conducting experiments resulting in the progressive evolution of DSM systems. Nevertheless, both the diversity of architectural configurations and the rapid advance of DSM implementations impose constraints on simulation model designs in two issues: the limitation of the simulation framework on model extensibility and the lack of verification applicability during a simulation run causing the delay in verification process. This thesis studies simulation modelling techniques for memory locality analysis of various DSM systems implemented on top of a cluster of symmetric multiprocessors. The thesis presents a simulation technique to promote model extensibility and proposes a technique for verification applicability, called a Specification-based Parameter Model Interaction (SPMI). The proposed techniques have been implemented in a new interpretation-driven simulation called DSiMCLUSTER on top of a discrete event simulation (DES) engine known as HASE. Experiments have been conducted to determine which factors are most influential on the degree of locality and to determine the possibility to maximise the stability of performance. DSiMCLUSTER has been validated against a SunFire 15K server and has achieved similarity of cache miss results, an average of +-6% with the worst case less than 15% of difference. These results confirm that the techniques used in developing the DSiMCLUSTER can contribute ways to achieve both (a) a highly extensible simulation framework to keep up with the ongoing innovation of the DSM architecture, and (b) the verification applicability resulting in an efficient framework for memory analysis experiments on DSM architecture
    corecore