Dynamic Configuration of CUDA Runtime Variables for CDP-based Divide-and-Conquer Algorithms
CUDA Dynamic Parallelism (CDP) is an extension of the GPGPU programming model proposed to better address irregular applications and recursive patterns of computation. However, processing memory-demanding problems with CDP is not straightforward, because of its particular memory organization. This work presents an algorithm that deals with this issue: it dynamically calculates and configures the CDP runtime variables and the GPU heap based on an analysis of the partial backtracking tree. The proposed algorithm was implemented for solving permutation combinatorial problems and evaluated on two test cases: N-Queens and the Asymmetric Travelling Salesman Problem. It allows different CDP-based backtracking algorithms from the literature to solve memory-demanding problems adaptively, with respect to the number of recursive kernel generations and the presence of dynamic allocations on the GPU.
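As a rough illustration of what "configuring the CDP runtime variables and the GPU heap" involves, the following host-side C++ sketch sets the CUDA device limits that govern dynamic parallelism and in-kernel allocation before launching a CDP kernel. The sizing functions (estimatePendingLaunches, estimateHeapBytes) are hypothetical placeholders standing in for the paper's analysis of the partial backtracking tree, not the authors' actual heuristics.

// Sketch (C++): host-side configuration of CDP runtime limits and the device
// heap via the CUDA runtime API. The estimates below are illustrative only.

#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical estimates derived from a partial backtracking tree.
std::size_t estimatePendingLaunches(int activeNodes) { return 2048 + 4 * (std::size_t)activeNodes; }
std::size_t estimateHeapBytes(int activeNodes, int n) { return (std::size_t)activeNodes * n * sizeof(int) * 8; }

bool configureCdpRuntime(int activeNodes, int permutationSize, int nestingDepth) {
    // Depth of nested kernel generations that may synchronize on one another.
    if (cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, nestingDepth) != cudaSuccess) return false;
    // Number of device-side kernel launches that may be queued at once.
    if (cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount,
                           estimatePendingLaunches(activeNodes)) != cudaSuccess) return false;
    // Device heap used by in-kernel malloc/free during node expansion.
    if (cudaDeviceSetLimit(cudaLimitMallocHeapSize,
                           estimateHeapBytes(activeNodes, permutationSize)) != cudaSuccess) return false;
    return true;
}

These limits must be set before the first launch of a kernel that uses the device runtime; the point of the paper's algorithm is to choose such values adaptively rather than fixing them once.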
Building and Combining Matching Algorithms
The concept of matching is ubiquitous in declarative programming and in automated reasoning. For instance, it is a key mechanism for running rule-based programs and for simplifying clauses generated by theorem provers. A matching problem can be seen as a particular conjunction of equations in which each equation has a ground side. We give an overview of techniques that can be applied to build and combine matching algorithms. First, we survey mutation-based techniques as a way to build a generic matching algorithm for a large class of equational theories. Second, combination techniques are introduced to obtain combined matching algorithms for disjoint unions of theories. Then we show how these combination algorithms can be extended to handle non-disjoint unions of theories sharing only constructors. These extensions are possible if an appropriate notion of normal form is computable.
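To make the notion of "an equation with a ground side" concrete, here is a minimal C++ sketch of syntactic matching in the empty equational theory: it tries to extend a substitution so that the pattern instantiates to a given ground term. The term representation and names are assumptions for illustration; the mutation-based and combination techniques surveyed in the paper are layered on top of procedures of this kind.

// Sketch (C++): syntactic matching of a pattern against a ground term.

#include <map>
#include <string>
#include <vector>

struct Term {
    bool isVar = false;
    std::string symbol;          // variable name or function symbol
    std::vector<Term> args;      // empty for variables and constants
};

bool equalTerms(const Term& a, const Term& b) {
    if (a.isVar != b.isVar || a.symbol != b.symbol || a.args.size() != b.args.size()) return false;
    for (std::size_t i = 0; i < a.args.size(); ++i)
        if (!equalTerms(a.args[i], b.args[i])) return false;
    return true;
}

// Try to extend 'subst' so that pattern[subst] equals ground; false if no match.
bool match(const Term& pattern, const Term& ground, std::map<std::string, Term>& subst) {
    if (pattern.isVar) {
        auto it = subst.find(pattern.symbol);
        if (it == subst.end()) { subst[pattern.symbol] = ground; return true; }
        return equalTerms(it->second, ground);   // consistent with an earlier binding?
    }
    if (ground.isVar || pattern.symbol != ground.symbol ||
        pattern.args.size() != ground.args.size()) return false;
    for (std::size_t i = 0; i < pattern.args.size(); ++i)
        if (!match(pattern.args[i], ground.args[i], subst)) return false;
    return true;
}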
Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap
In earlier work, we showed that the one-sided communication model found in PGAS languages (such as UPC) offers significant advantages in communication efficiency by decoupling data transfer from processor synchronization. We explore the use of the PGAS model on IBM BlueGene/P, an architecture that combines low-power, quad-core processors with extreme scalability. We demonstrate that the PGAS model, using a new port of the Berkeley UPC compiler and the GASNet one-sided communication layer, outperforms two-sided (MPI) communication in both microbenchmarks and a case study of the communication-limited benchmark NAS FT. We scale the benchmark up to 16,384 cores of the BlueGene/P and demonstrate that UPC consistently outperforms MPI, by as much as 66% for some processor configurations and by an average of 32%. In addition, the results demonstrate the scalability of the PGAS model and the Berkeley implementation of UPC, the viability of using it on machines with multicore nodes, and the effectiveness of the BG/P communication layer for supporting one-sided communication and PGAS languages. © 2009 IEEE.
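The key idea, decoupling the data transfer from the target's participation, can be illustrated outside UPC/GASNet. The C++ sketch below uses MPI's RMA interface as an analogous one-sided API: a rank writes directly into another rank's exposed window with MPI_Put, and no matching receive is ever posted. This only illustrates the model; it is not the Berkeley UPC/GASNet implementation evaluated in the paper.

// Sketch (C++ with MPI RMA): one-sided put into a remote memory window.

#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<double> window(1024, 0.0);           // memory exposed to remote puts
    MPI_Win win;
    MPI_Win_create(window.data(), window.size() * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                           // open an access epoch
    if (rank == 0 && size > 1) {
        std::vector<double> payload(1024, 3.14);
        // Data transfer with no receive call on the target rank.
        MPI_Put(payload.data(), 1024, MPI_DOUBLE, 1, 0, 1024, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);                           // synchronization, separate from the transfer
    if (rank == 1) std::printf("window[0] = %f\n", window[0]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}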
Tuning collective communication for Partitioned Global Address Space programming models
Partitioned Global Address Space (PGAS) languages offer programmers the convenience of a shared-memory programming style combined with the locality control necessary to run on large-scale distributed-memory systems. Even within a PGAS language, programmers often need to perform global communication operations such as broadcasts or reductions, which are best performed as collective operations in which a group of threads works together to perform the operation. In this paper we consider the problem of implementing collective communication within PGAS languages and explore some of the design trade-offs in both the interface and the implementation. In particular, PGAS collectives raise semantic issues that differ from those in send-receive-style message-passing programs, and they admit different implementation approaches that take advantage of the one-sided communication style of these languages. We present an implementation framework for PGAS collectives as part of the GASNet communication layer, which supports shared-memory, distributed-memory and hybrid systems. The framework supports a broad set of algorithms for each collective, over which the implementation may be automatically tuned. Finally, we demonstrate the benefit of optimized GASNet collectives using application benchmarks written in UPC, and show that the GASNet collectives can deliver scalable performance on a variety of state-of-the-art parallel machines, including a Cray XT4, an IBM BlueGene/P, and a Sun Constellation system with an InfiniBand interconnect. © 2011 Elsevier B.V. All rights reserved.
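The "broad set of algorithms per collective, over which the implementation may be automatically tuned" amounts to a dispatch over candidate algorithms keyed by message size and geometry. The C++ sketch below shows that dispatch schematically; the algorithm names, thresholds and structure are illustrative assumptions, not GASNet's actual interface, which measures candidates and records the best choice per machine and configuration.

// Sketch (C++): schematic selection among broadcast algorithms, as an
// auto-tuned collectives framework might perform it.

#include <cstddef>

using BcastFn = void(*)(void* buf, std::size_t bytes, int root);

void bcastFlatTree(void*, std::size_t, int)         { /* root sends directly to every thread */ }
void bcastBinomialTree(void*, std::size_t, int)     { /* log(P) rounds of forwarding */ }
void bcastScatterAllgather(void*, std::size_t, int) { /* scatter chunks, then re-assemble */ }

struct TunedBroadcast {
    std::size_t smallCutoff  = 2048;       // illustrative thresholds; in practice
    std::size_t mediumCutoff = 1 << 20;    // these would be measured per machine
    int threads = 1;

    void run(void* buf, std::size_t bytes, int root) const {
        BcastFn chosen =
            (bytes <= smallCutoff || threads <= 8) ? bcastFlatTree
          : (bytes <= mediumCutoff)                ? bcastBinomialTree
                                                   : bcastScatterAllgather;
        chosen(buf, bytes, root);     // dispatch to the tuned choice
    }
};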
A GPU-Based Backtracking Algorithm for Permutation Combinatorial Problems
This work presents a GPU-based backtracking algorithm for permutation combinatorial problems built on the Integer-Vector-Matrix (IVM) data structure, a data structure dedicated to permutation combinatorial optimization problems. In this algorithm, load balancing is performed without intervention of the CPU, inside a work-stealing phase invoked after each node-expansion phase. The proposed work-stealing approach uses a virtual n-dimensional hypercube topology and a triggering mechanism to reduce the overhead incurred by dynamic load balancing. We implemented this algorithm for solving instances of the Asymmetric Travelling Salesman Problem by implicit enumeration, a scenario in which the cost of node evaluation is low compared to the overall search procedure. Experimental results show that the dynamically load-balanced IVM algorithm reaches speed-ups of up to 17X over a serial implementation using a bitset data structure and up to 2X over its GPU counterpart.
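The hypercube topology used by the work-stealing phase has a simple structure: with 2^d workers, a worker's neighbours are the ids that differ from its own in exactly one bit. The C++ sketch below shows this victim-selection idea host-side, with a simplified triggering condition; in the actual algorithm the equivalent logic runs on the GPU between node-expansion phases, and the names and policy here are assumptions for illustration.

// Sketch (C++): victim selection on a virtual d-dimensional hypercube.

#include <cstdint>
#include <vector>

// Workers are numbered 0 .. 2^d - 1; neighbours differ in exactly one bit.
std::vector<uint32_t> hypercubeNeighbours(uint32_t worker, unsigned d) {
    std::vector<uint32_t> neighbours;
    for (unsigned k = 0; k < d; ++k)
        neighbours.push_back(worker ^ (1u << k));
    return neighbours;
}

// Simplified trigger: a worker only attempts to steal when its own share of
// the search space is exhausted, limiting load-balancing overhead.
int chooseVictim(uint32_t worker, unsigned d, const std::vector<std::size_t>& workLeft) {
    if (workLeft[worker] > 0) return -1;              // still busy: no steal attempt
    int best = -1;
    for (uint32_t n : hypercubeNeighbours(worker, d))
        if (best < 0 || workLeft[n] > workLeft[(uint32_t)best]) best = (int)n;
    return (best >= 0 && workLeft[(uint32_t)best] > 0) ? best : -1;
}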
Linguistic primitives for replica-aware coordination offer suitable solutions to the challenging problems of data distribution and locality in large-scale high-performance computing. The data-replication mechanisms previously designed to extend Klaim with replicated tuples are used here to experiment with X10, a parallel programming language primarily targeting clusters of multi-core processors linked into a large-scale system via high-performance networks. Our approach aims at allowing the programmer to specify and coordinate the replication of shared data items while taking into account the desired consistency properties. The programmer can thus exploit these flexible mechanisms to adapt data distribution and locality to the needs of the application, in order to improve performance in terms of concurrency and data access. We investigate issues related to replica consistency and provide a performance analysis, which includes scenarios where replica-based specifications and relaxed consistency provide significant performance gains.
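The trade-off between consistency and performance that the analysis explores can be pictured with a toy example. The C++ sketch below contrasts a "strong" write, which updates every replica before returning, with a "relaxed" write, which updates the local copy and defers propagation. The class and names are purely illustrative assumptions; the actual mechanisms are Klaim-inspired replicated-tuple primitives expressed in X10 over places, not this code.

// Sketch (C++): a toy replicated register showing strong vs. relaxed writes.

#include <cstddef>
#include <functional>
#include <vector>

enum class Consistency { Strong, Relaxed };

struct ReplicatedRegister {
    std::vector<int> replicas;                        // one copy per "place"
    std::vector<std::function<void()>> pending;       // deferred propagations

    explicit ReplicatedRegister(std::size_t places) : replicas(places, 0) {}

    void write(std::size_t origin, int value, Consistency mode) {
        replicas[origin] = value;
        if (mode == Consistency::Strong) {
            for (int& r : replicas) r = value;        // all replicas before returning
        } else {
            pending.push_back([this, value] {         // propagate lazily later
                for (int& r : replicas) r = value;
            });
        }
    }

    void flush() {                                    // relaxed replicas converge here
        for (auto& f : pending) f();
        pending.clear();
    }
};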