3 research outputs found

    A Lightweight Island Model for the Genetic Algorithm over GPGPU

    This paper presents a parallel implementation of the genetic algorithm (GA) on the Graphics Processing Unit (GPU) to solve the Traveling Salesman Problem (TSP). Since earlier studies did not focus on implementing the island model in a persistent way, this paper introduces an approach, named Lightweight Island Model (LIM), that applies the concept of persistent threads to the island model of the genetic algorithm. To that end, we present the implementation details for converting the traditional island model, which is split across multiple kernels, into a computing paradigm based on a single persistent kernel. Several synchronization techniques, including cooperative groups and implicit synchronization, are discussed as ways to reduce the CPU-GPU interaction present in the traditional island model. A new parallelization strategy is presented for distributing the work among live threads during the selection and crossover steps. The GPU configurations that lead to the best possible performance are also determined. The introduced approach is compared, in terms of speedup and solution quality, with the traditional island model (TIM) as well as with related works that proposed lighter versions of the master-slave model, including the switching among kernels (SAK) and scheduled light kernel (SLK) approaches. The results show that the new approach achieves a speedup of up to 27x over a serial CPU implementation, 4.5x over the traditional island model, and 1.5x to 2x over the SAK and SLK approaches.

    Novel Methodologies for Predictable CPU-To-GPU Command Offloading

    There is increasing industrial and academic interest in a more predictable characterization of real-time tasks on high-performance heterogeneous embedded platforms, where a host system offloads parallel workloads to an integrated accelerator, such as a General-Purpose Graphics Processing Unit (GP-GPU). In this paper, we analyze an important aspect that has not yet been considered in the real-time literature, and that may significantly affect real-time performance if not properly treated: the time spent by the CPU submitting GP-GPU operations. We show that the impact of CPU-to-GPU kernel submissions may indeed be relevant for typical real-time workloads, and that it should be properly factored in when deriving an integrated schedulability analysis for the considered platforms. This is the case when an application is composed of many small, consecutive GPU compute/copy operations. While existing techniques mitigate this issue by batching kernel calls into a reduced number of persistent kernel invocations, in this work we present and evaluate three other approaches that are made possible by recently released versions of the NVIDIA CUDA GP-GPU API, and by Vulkan, a novel open standard GPU API that allows improved control of GPU command submissions. We show that this added control may significantly improve application performance and predictability, owing to a substantial reduction in CPU-to-GPU driver interactions, making Vulkan an interesting candidate for becoming the state-of-the-art API for heterogeneous real-time systems. Our findings are evaluated on a latest-generation NVIDIA Jetson AGX Xavier embedded board, executing typical workloads involving Deep Neural Networks of parameterized complexity.

    Efficient Implementation of Genetic Algorithms on GP-GPU with Scheduled Persistent CUDA Threads

    In this paper we present a heavily exploration-oriented implementation of genetic algorithms for graphics processing units (GPUs), optimized with our novel mechanism for scheduling GPU-side synchronized jobs, which takes inspiration from the concept of persistent threads. Persistent threads allow an efficient distribution of workloads throughout the GPU so as to fully exploit the CUDA (NVIDIA's proprietary Compute Unified Device Architecture) platform. Our approach (named Scheduled Light Kernel, SLK) uses a specifically designed data structure for issuing sequences of commands from the CPU to the GPU; it is able to minimize CPU-GPU communications, exploit streams of concurrent execution of different device-side functions within different Streaming Multiprocessors, and minimize kernel launch overhead. Results obtained on two completely different experimental settings show that our approach dramatically increases the performance of the tested genetic algorithms compared to a baseline implementation that, while still running on a GPU, does not exploit our proposed approach. Our proposed SLK approach does not require substantial code rewriting and is also compared to features newly introduced in the latest CUDA development toolkit, such as nested kernel invocations for dynamic parallelism.