14 research outputs found

    A Grid-enabled Branch and Bound Algorithm for Solving Challenging Combinatorial Optimization Problems

    Solving large-scale instances of combinatorial optimization problems exactly requires a huge amount of computational resources. In this paper, we propose an adaptation of the parallel Branch and Bound algorithm for computational grids. We consequently propose new ways to deal efficiently with some crucial issues, mainly dynamic adaptive load balancing, fault tolerance, global information sharing, and termination detection of the algorithm. A special coding of the work units, which are distributed and folded/unfolded during the exploration of the search tree, makes it possible to optimize the communications involved. The algorithm has been implemented following a large-scale idle-time stealing paradigm. It has been tested on a Flow-Shop problem instance (Ta056) that had never been solved exactly. The optimal solution was found, with proof of optimality, within 25 days using about 1900 processors belonging to 9 nationwide distinct clusters (administration domains). During the resolution, the worker processors were exploited 97% of the time on average, while the farmer processor was exploited only 1.7% of the time. These two rates are good indicators of the parallel efficiency of the proposed approach and of its scalability.
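The abstract above does not spell out how work units are coded. As a rough illustration only, the sketch below assumes one plausible scheme: each leaf of a permutation tree is numbered by its lexicographic rank, so a whole subtree folds into a pair of integers and can be split in two when an idle worker asks for work. The function names (`permutation_rank`, `subtree_interval`, `split_interval`) are hypothetical and are not taken from the paper.

```python
from math import factorial


def permutation_rank(perm):
    """Lexicographic rank of `perm` among all permutations of its elements."""
    remaining = sorted(perm)
    rank = 0
    for i, job in enumerate(perm):
        idx = remaining.index(job)
        # Each smaller unused element in this position skips a full block
        # of (len(perm) - 1 - i)! permutations.
        rank += idx * factorial(len(perm) - 1 - i)
        remaining.pop(idx)
    return rank


def subtree_interval(prefix, n):
    """Fold the subtree rooted at `prefix` into a half-open interval [lo, hi)
    of leaf ranks, for permutations of the jobs 0..n-1."""
    remaining = [j for j in range(n) if j not in prefix]
    lo = permutation_rank(list(prefix) + remaining)
    hi = lo + factorial(len(remaining))
    return lo, hi


def split_interval(lo, hi):
    """Split a work unit in two halves, e.g. to hand one half to an idle worker."""
    mid = (lo + hi) // 2
    return (lo, mid), (mid, hi)
```

Under this assumed coding, a work unit costs two integers to transmit regardless of how many nodes the subtree contains, which is one way communication could be kept small.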

    A GPU-accelerated Branch-and-Bound Algorithm for the Flow-Shop Scheduling Problem

    Branch-and-Bound (B&B) algorithms are time-intensive tree-based exploration methods for solving combinatorial optimization problems to optimality. In this paper, we investigate the use of GPU computing as a major complementary way to speed up those methods. The focus is on the bounding mechanism of B&B algorithms, which is the most time-consuming part of their exploration process. We propose a parallel B&B algorithm based on a GPU-accelerated bounding model. The proposed approach concentrates on optimizing data access management to further improve the performance of the bounding mechanism, which uses large and intermediate data sets that do not completely fit in GPU memory. Extensive experiments have been carried out on well-known FSP benchmarks using an Nvidia Tesla C2050 GPU card. We compared the obtained performance to single-threaded and multithreaded CPU-based executions. Speedups of up to ×100 are achieved for large problem instances.
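To make the role of the bounding step concrete, here is a minimal serial sketch of B&B for the permutation flow-shop problem. It is not the GPU algorithm of the paper: the abstract does not state which lower bound is used, so a simple one-machine bound is assumed here, and all function names are hypothetical. `p[j][k]` is the processing time of job `j` on machine `k`.

```python
def completion_times(schedule, p):
    """Completion time of the last scheduled job on every machine."""
    done = [0] * len(p[0])
    for j in schedule:
        for k in range(len(done)):
            # Job j starts on machine k once machine k is free and
            # job j has finished on machine k-1.
            start = max(done[k], done[k - 1] if k else 0)
            done[k] = start + p[j][k]
    return done


def makespan(schedule, p):
    return completion_times(schedule, p)[-1]


def lower_bound(prefix, remaining, p):
    """One-machine bound (an assumption, not the paper's bound)."""
    done = completion_times(prefix, p)
    if not remaining:
        return done[-1]
    bounds = []
    for k in range(len(done)):
        load = sum(p[j][k] for j in remaining)         # work left on machine k
        tail = min(sum(p[j][k + 1:]) for j in remaining)  # shortest finish after k
        bounds.append(done[k] + load + tail)
    return max(bounds)


def branch_and_bound(p):
    """Return (optimal makespan, one optimal permutation)."""
    best = {"cost": float("inf"), "perm": None}

    def explore(prefix, remaining):
        if not remaining:
            cost = makespan(prefix, p)
            if cost < best["cost"]:
                best["cost"], best["perm"] = cost, tuple(prefix)
            return
        for j in sorted(remaining):
            child, rest = prefix + [j], remaining - {j}
            # Bounding: prune children that cannot beat the incumbent.
            if lower_bound(child, rest, p) < best["cost"]:
                explore(child, rest)

    explore([], frozenset(range(len(p))))
    return best["cost"], best["perm"]
```

In this sketch the calls to `lower_bound` dominate the running time, which is the part a GPU-accelerated bounding model would evaluate in parallel over a pool of subproblems.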

    An Adaptative Multi-GPU based Branch-and-Bound. A Case Study: the Flow-Shop Scheduling Problem

    Solving Combinatorial Optimization Problems (COPs) exactly using a Branch-and-Bound (B&B) algorithm requires a huge amount of computational resources. Therefore, we recently investigated designing B&B algorithms on top of graphics processing units (GPUs) using a parallel bounding model. The proposed model parallelizes the evaluation of the lower bounds on pools of sub-problems. The results demonstrated that the size of the evaluated pool has a significant impact on the performance of B&B and that it depends strongly on the problem instance being solved. In this paper, we design an adaptive parallel B&B algorithm for solving permutation-based combinatorial optimization problems such as the Flow-shop Scheduling Problem (FSP) on GPU accelerators. To do so, we propose a dynamic heuristic for parameter auto-tuning at runtime. Another challenge of this work is to exploit larger degrees of parallelism by using the combined computational power of multiple GPU devices. The approach has been applied to the permutation flow-shop problem. Extensive experiments have been carried out on well-known FSP benchmarks using an Nvidia Tesla S1070 Computing System equipped with two Tesla T10 GPUs. Compared to a CPU-based execution, speedups of up to ×105 are achieved for large problem instances. Comment: 14th IEEE International Conference on High Performance Computing and Communications, HPCC 2012 (2012).

    An Incremental Parallel PGAS-based Tree Search Algorithm

    In this work, we show that the Chapel high-productivity language is suitable for the design and implementation of all aspects involved in the conception of parallel tree search algorithms for solving combinatorial problems. First, it is possible to hand-optimize the data structures involved in the search process in a way equivalent to C. As a consequence, the single-threaded search in Chapel is on average only 7% slower than its counterpart written in C. While programming a multicore tree search in Chapel is equivalent to C-OpenMP in terms of performance and programmability, its productivity-aware features for distributed programming stand out. It is possible to incrementally build a distributed tree search algorithm from its multicore counterpart by adding a few lines of code. The distributed implementation performs load balancing among different compute nodes and also exploits all CPU cores of the system. Chapel presents an interesting trade-off between programmability and performance despite the high level of its features. The distributed tree search in Chapel is on average 16% slower and reaches up to 80% of the scalability achieved by its C-MPI+OpenMP counterpart.

    Towards Chapel-based Exascale Tree Search Algorithms: dealing with multiple GPU accelerators

    Tree-based search algorithms applied to combinatorial optimization problems are highly irregular and time-consuming when it comes to solving big instances. Solving such instances efficiently requires the use of massively parallel distributed-memory supercomputers. According to recent Top 500 trends, the degree of parallelism in these supercomputers continues to increase in size and complexity, with millions of heterogeneous (mainly CPU-GPU) cores. Harnessing this scale of computing resources raises at least three challenging issues, which are described and addressed in this paper. As a step towards exascale computing, we revisit the design and implementation of tree search algorithms dealing with multiple GPUs, in addition to scalability and productivity-awareness using Chapel. The proposed algorithm exploits Chapel's distributed iterators by combining a partial search strategy with pre-compiled CUDA kernels for more efficient exploitation of the intra-node parallelism. Extensive experimentation on big N-Queens problem instances using 24 GPUs shows that up to 90% of the linear speedup can be achieved.
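The partial search idea mentioned above can be illustrated on N-Queens with a small serial sketch. It assumes the usual two-phase pattern: an initial search enumerates all valid placements down to a cutoff depth, and each resulting prefix is then explored independently. In the paper the second phase is distributed with Chapel iterators and CUDA kernels; here it is a plain Python loop, and the function names and the default `depth` are hypothetical.

```python
def safe(cols, col):
    """True if a queen can go in column `col` of the next row,
    given `cols[r]` = column of the queen already placed in row r."""
    row = len(cols)
    return all(c != col and abs(c - col) != row - r
               for r, c in enumerate(cols))


def prefixes(n, depth):
    """Partial search: every valid placement of the first `depth` rows."""
    frontier = [()]
    for _ in range(depth):
        frontier = [p + (c,) for p in frontier
                    for c in range(n) if safe(p, c)]
    return frontier


def count_from(prefix, n):
    """Count complete solutions below one prefix (independent unit of work)."""
    if len(prefix) == n:
        return 1
    return sum(count_from(prefix + (c,), n)
               for c in range(n) if safe(prefix, c))


def count_solutions(n, depth=2):
    """Sum the counts of all prefixes; this loop is where a parallel or
    distributed implementation would hand prefixes to workers."""
    depth = min(depth, n)
    return sum(count_from(p, n) for p in prefixes(n, depth))
```

A deeper cutoff yields more, smaller work units, which helps balance an irregular tree at the cost of a longer sequential first phase.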

    Many-core Branch-and-Bound for GPU accelerators and MIC coprocessors

    Coprocessors are increasingly becoming key building blocks of High Performance Computing platforms. These many-core energy-efficient devices boost the performance of traditional processors. Branch-and-Bound (B&B) algorithms, for their part, are tree-based exact methods for solving combinatorial optimization problems (COPs) to optimality. Solving large COPs results in the generation of a very large pool of subproblems and the evaluation of their associated lower bounds. Generating and evaluating those subproblems on coprocessors raises several issues, including processor-coprocessor data transfer optimization, vectorization, and thread divergence. In this paper, we investigate the offload-based parallel design and implementation of B&B algorithms for coprocessors, addressing these issues. Two major many-core architectures are considered and compared: Nvidia GPU and Intel MIC. The proposed approaches have been evaluated using the Flow-Shop scheduling problem and two hardware configurations equivalent in terms of energy consumption: Nvidia Tesla K40 and Intel Xeon Phi 5110P. The reported results show that the GPU-accelerated approach outperforms the MIC offload-based one, even in its vectorized version. Moreover, vectorization improves the efficiency of the MIC offload-based approach by a factor of two.

    A grid-enabled branch and bound algorithm for solving challenging combinatorial optimization problems

    Solving large instances of combinatorial optimization problems to optimality requires a huge amount of computational resources. In this paper, we propose an adaptation of the parallel Branch and Bound algorithm for computational grids. Such gridification is based on new ways to deal efficiently with some crucial issues, mainly dynamic adaptive load balancing, fault tolerance, global information sharing, and termination detection of the algorithm. A new efficient coding of the work units (search sub-trees) distributed during the exploration of the search tree is proposed to optimize the communications involved. The algorithm has been implemented following a large-scale idle-time stealing paradigm (Farmer-Worker). It has been tested on a Flow-Shop problem instance (Ta056) that had never been optimally solved. The new algorithm produced a success story: the optimal solution was found, with proof of optimality, within 25 days using about 1900 processors belonging to 9 nationwide distinct clusters (administration domains). During the resolution, the worker processors were exploited 97% of the time on average, while the farmer processor was exploited only 1.7% of the time. These two rates are good indicators of the efficiency of the proposed approach and of its scalability.

    Towards ultra-scale Branch-and-Bound using a high-productivity language

    Due to the highly irregular nature and prohibitive execution times of Branch-and-Bound (B&B) algorithms applied to combinatorial optimization problems (COPs), their parallelization has received great attention over the last two decades. Significant efforts have been made to revisit the parallelization of B&B following the rapid evolution of high-performance computing technologies and the scientific and technical challenges associated with them. However, these parallelization efforts have always been guided by the performance objective, setting aside programming productivity. Yet productivity is crucial for designing ultra-scale algorithms able to harness modern supercomputers, which are increasingly complex, including millions of processing cores and heterogeneous building-block devices. In this paper, we investigate the partitioned global address space (PGAS)-based approach using Chapel for the productivity-aware design and implementation of distributed B&B for solving large COPs. The proposed algorithms are extensively evaluated using the Flow-shop scheduling problem as a test case. The Chapel-based implementation is compared to its traditionally used MPI+X-based counterpart in terms of performance, scalability, and productivity. The results show that Chapel is much more expressive and up to 7.8× more productive than MPI+Pthreads. In addition, the Chapel-based search presents performance equivalent to MPI+Pthreads for its best results on 1024 cores and reaches up to 84% of the linear speedup. However, there are cases where the built-in load balancing provided by Chapel cannot produce a regular load among compute nodes. In such cases, the MPI-based search can be up to 4.2× faster and reaches speedups up to 3× higher than its Chapel-based counterpart. Thorough feedback on the experience is given, pointing out the strengths and limitations of the two opposite approaches (Chapel vs. MPI+X). To the best of our knowledge, the present study is pioneering within the context of exact parallel optimization.