41,715 research outputs found

    Improving the scalability of parallel N-body applications with an event driven constraint based execution model

    Full text link
    The scalability and efficiency of graph applications are significantly constrained by conventional systems and their supporting programming models. Technology trends like multicore, manycore, and heterogeneous system architectures are introducing further challenges and possibilities for emerging application domains such as graph applications. This paper explores the space of effective parallel execution of ephemeral graphs that are dynamically generated using the Barnes-Hut algorithm to exemplify dynamic workloads. The workloads are expressed using the semantics of an Exascale computing execution model called ParalleX. For comparison, results using conventional execution model semantics are also presented. We find improved load balancing during runtime and automatic parallelism discovery improving efficiency using the advanced semantics for Exascale computing.Comment: 11 figure

    Application of Micro-Genetic Algorithm for Task Based Computing

    Get PDF
    Abstract — Pervasive computing calls for applications which are often composed from independent and distributed components using facilities from the environment. This paradigm has evolved into task based computing where the application composition relies on explicit user task descriptions. The composition of applications has to be performed at run-time as the environment is dynamic and heterogeneous due to e.g., mobility of the user. An algorithm that decides on a component set and allocates it onto hosts accordingly to user task preferences and the platform constraints plays a central role in the application composition process. In this paper we will describe an algorithm for task-based application allocation. The algorithm uses micro-genetic approach and is characterized by a very low computational load and good convergence properties. We will compare the performance and the scalability of our algorithm with a straightforward evolutionary algorithm. Besides, we will outline a system for task-based computing where our algorithm is used. I

    Efficient Generation of Parallel Spin-images Using Dynamic Loop Scheduling

    Get PDF
    High performance computing (HPC) systems underwent a significant increase in their processing capabilities. Modern HPC systems combine large numbers of homogeneous and heterogeneous computing resources. Scalability is, therefore, an essential aspect of scientific applications to efficiently exploit the massive parallelism of modern HPC systems. This work introduces an efficient version of the parallel spin-image algorithm (PSIA), called EPSIA. The PSIA is a parallel version of the spin-image algorithm (SIA). The (P)SIA is used in various domains, such as 3D object recognition, categorization, and 3D face recognition. EPSIA refers to the extended version of the PSIA that integrates various well-known dynamic loop scheduling (DLS) techniques. The present work: (1) Proposes EPSIA, a novel flexible version of PSIA; (2) Showcases the benefits of applying DLS techniques for optimizing the performance of the PSIA; (3) Assesses the performance of the proposed EPSIA by conducting several scalability experiments. The performance results are promising and show that using well-known DLS techniques, the performance of the EPSIA outperforms the performance of the PSIA by a factor of 1.2 and 2 for homogeneous and heterogeneous computing resources, respectively

    A Survey on Load Balancing Algorithms for VM Placement in Cloud Computing

    Get PDF
    The emergence of cloud computing based on virtualization technologies brings huge opportunities to host virtual resource at low cost without the need of owning any infrastructure. Virtualization technologies enable users to acquire, configure and be charged on pay-per-use basis. However, Cloud data centers mostly comprise heterogeneous commodity servers hosting multiple virtual machines (VMs) with potential various specifications and fluctuating resource usages, which may cause imbalanced resource utilization within servers that may lead to performance degradation and service level agreements (SLAs) violations. To achieve efficient scheduling, these challenges should be addressed and solved by using load balancing strategies, which have been proved to be NP-hard problem. From multiple perspectives, this work identifies the challenges and analyzes existing algorithms for allocating VMs to PMs in infrastructure Clouds, especially focuses on load balancing. A detailed classification targeting load balancing algorithms for VM placement in cloud data centers is investigated and the surveyed algorithms are classified according to the classification. The goal of this paper is to provide a comprehensive and comparative understanding of existing literature and aid researchers by providing an insight for potential future enhancements.Comment: 22 Pages, 4 Figures, 4 Tables, in pres

    A Tuned and Scalable Fast Multipole Method as a Preeminent Algorithm for Exascale Systems

    Full text link
    Among the algorithms that are likely to play a major role in future exascale computing, the fast multipole method (FMM) appears as a rising star. Our previous recent work showed scaling of an FMM on GPU clusters, with problem sizes in the order of billions of unknowns. That work led to an extremely parallel FMM, scaling to thousands of GPUs or tens of thousands of CPUs. This paper reports on a a campaign of performance tuning and scalability studies using multi-core CPUs, on the Kraken supercomputer. All kernels in the FMM were parallelized using OpenMP, and a test using 10^7 particles randomly distributed in a cube showed 78% efficiency on 8 threads. Tuning of the particle-to-particle kernel using SIMD instructions resulted in 4x speed-up of the overall algorithm on single-core tests with 10^3 - 10^7 particles. Parallel scalability was studied in both strong and weak scaling. The strong scaling test used 10^8 particles and resulted in 93% parallel efficiency on 2048 processes for the non-SIMD code and 54% for the SIMD-optimized code (which was still 2x faster). The weak scaling test used 10^6 particles per process, and resulted in 72% efficiency on 32,768 processes, with the largest calculation taking about 40 seconds to evaluate more than 32 billion unknowns. This work builds up evidence for our view that FMM is poised to play a leading role in exascale computing, and we end the paper with a discussion of the features that make it a particularly favorable algorithm for the emerging heterogeneous and massively parallel architectural landscape
    • …
    corecore