137 research outputs found
Accelerating ant colony optimization-based edge detection on the GPU using CUDA
Ant Colony Optimization (ACO) is a nature-inspired metaheuristic that can be applied to a wide range of optimization problems. In this paper we present the first parallel implementation of an ACO-based (image processing) edge detection algorithm on the Graphics Processing Unit (GPU) using NVIDIA CUDA. We extend recent work so that we are able to implement a novel data-parallel approach that maps individual ants to thread warps. By exploiting the massively parallel nature of the GPU, we are able to execute significantly more ants per ACO iteration, allowing us to reduce the total number of iterations required to create an edge map. We hope that reducing the execution time of an ACO-based implementation of edge detection will increase its viability in image processing and computer vision.
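To give a feel for the algorithm being parallelised, the sketch below shows one sequential ACO edge-detection iteration: ants wander over the image biased by pheromone and a local intensity-variation heuristic, depositing pheromone that accumulates on edges. This is an illustrative Python sketch, not the paper's warp-mapped CUDA implementation; the function name and parameters are assumptions.

```python
import numpy as np

def aco_edge_step(image, pheromone, n_ants=64, walk_len=10, rng=None):
    """One ACO iteration: each ant takes a short walk, preferring pixels
    with high pheromone and high local intensity variation, and deposits
    pheromone as it goes. Edges emerge where pheromone accumulates."""
    rng = rng or np.random.default_rng(0)
    h, w = image.shape
    gy, gx = np.gradient(image.astype(float))   # heuristic: gradient magnitude proxy
    heuristic = np.abs(gx) + np.abs(gy)
    for _ in range(n_ants):
        y, x = int(rng.integers(h)), int(rng.integers(w))
        for _ in range(walk_len):
            ys = slice(max(y - 1, 0), min(y + 2, h))
            xs = slice(max(x - 1, 0), min(x + 2, w))
            # move probability ~ pheromone * heuristic over the 3x3 neighbourhood
            neigh = (pheromone[ys, xs] * (heuristic[ys, xs] + 1e-6)).ravel()
            idx = rng.choice(len(neigh), p=neigh / neigh.sum())
            ncols = xs.stop - xs.start
            y = ys.start + idx // ncols
            x = xs.start + idx % ncols
            pheromone[y, x] += heuristic[y, x]  # deposit on the visited pixel
    pheromone *= 0.95                           # evaporation
    return pheromone

# usage: threshold the accumulated pheromone to obtain an edge map
img = np.zeros((16, 16)); img[:, 8:] = 1.0     # synthetic step edge
ph = np.ones_like(img)
for _ in range(5):
    ph = aco_edge_step(img, ph, n_ants=32, rng=np.random.default_rng(1))
edges = ph > ph.mean()
```

In the GPU version described above, one warp evaluates the neighbourhood probabilities of one ant in parallel, which is what allows far more ants per iteration.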
Generic Techniques in General Purpose GPU Programming with Applications to Ant Colony and Image Processing Algorithms
In 2006 NVIDIA introduced a new unified GPU architecture facilitating general-purpose computation on the GPU. The following year NVIDIA introduced CUDA, a parallel programming architecture for developing general-purpose applications for direct execution on the new unified GPU. CUDA exposes the massively parallel architecture of the GPU so that parallel code can be written to execute much faster than its sequential counterpart. Although CUDA abstracts the underlying architecture, fully utilising and scheduling the GPU is non-trivial and has given rise to a new active area of research. Due to the inherent complexities of GPU development, in this thesis we explore and find efficient parallel mappings of existing and new parallel algorithms on the GPU using NVIDIA CUDA. We place particular emphasis on metaheuristics, image processing and designing reusable techniques and mappings that can be applied to other problems and domains.
We begin by focusing on Ant Colony Optimisation (ACO), a nature-inspired heuristic approach for solving optimisation problems. We present a versatile improved data-parallel approach for solving the Travelling Salesman Problem using ACO, resulting in significant speedups. By extending our initial work, we show how existing mappings of ACO on the GPU are unable to compete against their sequential counterpart when common CPU optimisation strategies are employed, and detail three distinct candidate set parallelisation strategies for execution on the GPU. By further extending our data-parallel approach we present the first implementation of an ACO-based edge detection algorithm on the GPU, reducing the execution time and improving the viability of ACO-based edge detection. We finish by presenting a new colour edge detection technique using the volume of a pixel in the HSI colour space, along with a parallel GPU implementation that is able to withstand greater levels of noise than existing algorithms.
Accelerating supply chains with Ant Colony Optimization across a range of hardware solutions
This pre-print, arXiv:2001.08102v1 [cs.NE], was subsequently published by Elsevier in Computers and Industrial Engineering, vol. 147, 106610, pp. 1-14, on 29 Jun 2020 and is available at https://doi.org/10.1016/j.cie.2020.106610
The Ant Colony algorithm has been applied to various optimization problems; however, most previous work on scaling and parallelism focuses on the Travelling Salesman Problem (TSP). Although useful for benchmarking and comparing new ideas, its algorithmic dynamics do not always transfer to complex real-life problems, where additional meta-data is required during solution construction. This paper examines a real-life outbound supply chain problem using Ant Colony Optimization (ACO) and its scaling dynamics with two parallel ACO architectures: Independent Ant Colonies (IAC) and Parallel Ants (PA). Results showed that PA reached a higher solution quality in fewer iterations as the number of parallel instances increased. Furthermore, speed performance was measured across three different hardware solutions: a 16-core CPU, a 68-core Xeon Phi and up to 4 GeForce GPUs. State-of-the-art ACO vectorization techniques such as SS-Roulette were implemented using C++ and CUDA. Although excellent for TSP, it was concluded that GPUs are not suitable for the given supply chain problem due to the meta-data access footprint required. Furthermore, compared to their sequential counterpart, the vectorized CPU AVX2 implementation achieved a 25.4x speedup, while the Xeon Phi with its AVX512 instruction set reached 148x on Parallel Ants with Vectorization (PAwV). PAwV is therefore able to scale to at least 1024 parallel instances on the supply chain network problem solved.
GPU parallelization strategies for metaheuristics: a survey
Metaheuristics have shown promising results in solving hard optimization problems. However, they become limited in terms of effectiveness and runtime for high-dimensional problems. Thanks to the independence of metaheuristic components, parallel computing appears an attractive choice to reduce execution time and improve solution quality. By exploiting the increasing performance and programmability of graphics processing units (GPUs) to this aim, GPU-based parallel metaheuristics have been implemented using different designs. Recent results in this area show that GPUs tend to be effective co-processors for tackling complex optimization problems. In this survey, the mechanisms involved in GPU programming for implementing parallel metaheuristics are presented and discussed through a study of relevant research papers.
Metaheuristics can obtain satisfying results when solving optimization problems in a reasonable time. However, they suffer from a lack of scalability and become limited when faced with complex high-dimensional optimization problems. To overcome this limitation, GPU-based parallel computing appears as a strong alternative. Thanks to GPUs, parallel metaheuristics achieve better results in terms of computation time, and even solution quality.
OptPlatform: metaheuristic optimisation framework for solving complex real-world problems
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London.
We optimise daily, whether that is planning a round trip that visits the most attractions within a given holiday budget or just taking a train instead of driving a car in rush hour. Many problems, just like these, are solved by individuals as part of our daily schedule, and they are effortless and straightforward. If we scale that up to many individuals with many different schedules, like a school timetable, we reach a point where it is simply not feasible or practical to solve by hand. In such instances, optimisation methods are used to obtain an optimal solution. In this thesis, a practical approach to optimisation has been taken by developing an optimisation platform with all the necessary tools to be used by practitioners who are not necessarily familiar with the subject of optimisation. First, a high-performance metaheuristic optimisation framework (MOF) called OptPlatform is implemented, and its versatility and performance are evaluated across multiple benchmarks and real-world optimisation problems. Results show that OptPlatform outperforms competing MOFs in both solution quality and computation time. Second, the most suitable hardware platform for OptPlatform is determined by an in-depth analysis of Ant Colony Optimisation scaling across CPU, GPU and enterprise Xeon Phi. Contrary to the common benchmark problems used in the literature, the supply chain problem solved could not scale on GPUs. Third, a variety of metaheuristics are implemented in OptPlatform, including a new metaheuristic based on the Imperialist Competitive Algorithm (ICA), called ICA with Independence and Constrained Assimilation (ICAwICA). ICAwICA was compared against two different types of benchmark problems, and results show the versatile application of the algorithm, matching and in some cases outperforming custom-tuned approaches.
Finally, essential MOF features lacking in existing frameworks, such as automatic algorithm selection and tuning, are implemented in OptPlatform. Two novel approaches are proposed and compared to existing methods. Results indicate the superiority of the implemented tuning algorithms within a constrained tuning-budget environment.
Analysis of Independent Roulette Selection in Parallel Ant Colony Optimization
The increased availability of high-performance parallel architectures such as the Graphics Processing Unit (GPU) has led to significant interest in modified versions of metaheuristics that take advantage of their capabilities. Parallel Ant Colony Optimization (ACO) algorithms are now widely used, but these often present a challenge in terms of maximizing the potential for parallelism. One common bottleneck for parallelization of ACO occurs during the tour construction phase, when edges are probabilistically selected. Independent Roulette (I-Roulette) is an alternative to the standard Roulette Selection method used during this phase, and it achieves significant performance improvements on the GPU. In this paper we provide the first in-depth study of how I-Roulette works. We establish that, even though I-Roulette works in a qualitatively different way to Roulette Wheel selection, its use in two popular ACO variants does not affect the quality of the solutions obtained. However, I-Roulette significantly accelerates convergence to a solution. Our theoretical analysis shows that I-Roulette possesses several interesting and non-obvious features, and is capable of a form of dynamical adaptation during the tour construction process.
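The contrast between the two selection schemes is easy to state in code. Standard Roulette Wheel selection needs a shared normalising sum, which serialises GPU threads; I-Roulette instead scales each candidate's weight by an independent uniform draw and takes the argmax, so every candidate can be scored independently. A minimal Python sketch (illustrative only; function names are assumptions, not the paper's code):

```python
import numpy as np

def roulette(weights, rng):
    """Standard Roulette Wheel: pick index i with probability w_i / sum(w).
    Requires a shared sum (or cumulative sum), a bottleneck on the GPU."""
    w = np.asarray(weights, dtype=float)
    return int(rng.choice(len(w), p=w / w.sum()))

def i_roulette(weights, rng):
    """Independent Roulette: scale each weight by its own uniform draw and
    take the argmax. No shared state, so each candidate edge can be scored
    by a separate thread; larger weights still win more often."""
    w = np.asarray(weights, dtype=float)
    return int(np.argmax(w * rng.random(len(w))))
```

Note that the two schemes induce different selection distributions (for weights [1, 10], I-Roulette picks the heavy candidate with probability 0.95 versus 10/11 for the roulette wheel), which is consistent with the paper's finding that I-Roulette behaves qualitatively differently yet still favours high-pheromone edges.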
A membrane parallel rapidly-exploring random tree algorithm for robotic motion planning
© 2020-IOS Press and the authors. All rights reserved. In recent years, incremental sampling-based motion planning algorithms have been widely used to solve robot motion planning problems in high-dimensional configuration spaces. In particular, the Rapidly-exploring Random Tree (RRT) algorithm and its asymptotically-optimal counterpart called RRT* are popular algorithms used in real-life applications due to their desirable properties. Such algorithms are inherently iterative, but certain modules such as the collision-checking procedure can be parallelized, providing significant speedup with respect to sequential implementations. In this paper, the RRT and RRT* algorithms have been adapted to a bioinspired computational framework called Membrane Computing whose models of computation, a.k.a. P systems, run in a non-deterministic and massively parallel way. A large number of robotic applications are currently using a variant of P systems called Enzymatic Numerical P systems (ENPS) for reactive control, but there is a lack of solutions for motion planning in the framework. The novel models in this work have been designed using the ENPS framework. In order to test and validate the ENPS models for RRT and RRT*, we present two ad-hoc implementations able to emulate the computation of the models using OpenMP and CUDA. Finally, we show the speedup of our solutions with respect to sequential baseline implementations. The results show a speedup of up to 6x using OpenMP with 8 cores against the sequential implementation, and up to 24x using CUDA against the best multi-threading configuration.
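For reference, the iterative structure being parallelised looks like the following minimal 2-D RRT sketch, where the collision check (`is_free`) is the module that parallel variants such as the ENPS models offload. This is an illustrative Python sketch under assumed names and a fixed 10x10 workspace, not the paper's implementation:

```python
import math
import random

def rrt(start, goal, is_free, n_iter=500, step=0.5, goal_tol=0.5, seed=0):
    """Minimal 2-D RRT: grow a tree from `start` toward uniform random
    samples in a 10x10 workspace; return a start-to-goal path once a new
    node lands within `goal_tol` of `goal`, else None."""
    rng = random.Random(seed)
    nodes = [start]
    parent = {0: None}
    for _ in range(n_iter):
        sample = (rng.uniform(0, 10), rng.uniform(0, 10))
        # nearest-neighbour search (RRT* additionally rewires near nodes)
        i = min(range(len(nodes)), key=lambda k: math.dist(nodes[k], sample))
        nx, ny = nodes[i]
        d = math.dist((nx, ny), sample) or 1e-9
        new = (nx + step * (sample[0] - nx) / d,
               ny + step * (sample[1] - ny) / d)
        if not is_free(new):          # collision check: the parallelizable module
            continue
        parent[len(nodes)] = i
        nodes.append(new)
        if math.dist(new, goal) < goal_tol:
            path, k = [], len(nodes) - 1
            while k is not None:      # backtrack to the root
                path.append(nodes[k])
                k = parent[k]
            return path[::-1]
    return None
```

In the parallel versions described above, many candidate configurations are collision-checked simultaneously (OpenMP threads or CUDA blocks), while the tree growth itself remains iterative.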
Optimization of scientific algorithms in heterogeneous systems and accelerators for high performance computing
General-purpose computing on GPUs is currently one of the basic pillars of high-performance computing. Although hundreds of GPU-accelerated applications exist, some scientific algorithms remain little studied. The motivation of this thesis has therefore been to investigate the possibility of significantly accelerating a set of algorithms from this group on the GPU.
First, an optimized implementation of the CAVLC (Context-Adaptive Variable-Length Coding) video and image compression algorithm was obtained; CAVLC is the most widely used entropy method in the H.264 video coding standard. The speedup over the best previous implementation is between 2.5x and 5.4x. This solution can serve as the entropy component of software H.264 encoders, and can be used in video and image compression systems in formats other than H.264, such as medical imaging.
Second, GUD-Canny, an unsupervised and distributed Canny edge detector, was developed. The system resolves the main limitations of Canny implementations: the bottleneck caused by the hysteresis process and the use of fixed hysteresis thresholds. A given image is divided into a set of sub-images and, for each of them, a pair of hysteresis thresholds is computed in an unsupervised way using the Medina-Carnicer method. The detector satisfies the real-time requirement, taking on average 0.35 ms to detect the edges of a 512x512 image.
Third, an optimized implementation of the VLE (Variable-Length Encoding) data compression method was produced, which is on average 2.6x faster than the best previous implementation. This solution also includes a new inter-block scan method that can be used to accelerate the scan operation itself and other algorithms, such as stream compaction. For the scan operation, a 1.62x speedup is achieved when the proposed method is used instead of the one employed in the best previous VLE implementation.
This doctoral thesis concludes with a chapter on future lines of research arising from its contributions.
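The scan primitive central to the VLE work above is the exclusive prefix sum: scanning the per-element code lengths gives each element's write offset in the compressed bit stream, and the same primitive drives stream compaction. A minimal sequential sketch (illustrative Python; the thesis implementation is a parallel inter-block GPU scan):

```python
def exclusive_scan(xs):
    """Exclusive prefix sum: out[i] = sum(xs[:i]). In GPU variable-length
    encoding, scanning the per-element code lengths yields each element's
    bit offset in the output stream; the returned total is the stream size."""
    total, out = 0, []
    for x in xs:
        out.append(total)  # offset before adding this element's length
        total += x
    return out, total

# e.g. per-element code lengths in bits -> write offsets and total size
offsets, total_bits = exclusive_scan([3, 5, 2, 7])
# offsets == [0, 3, 8, 10], total_bits == 17
```

The parallel version computes this in O(log n) steps per block and then combines per-block totals, which is where the thesis's inter-block scan method applies.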
Programming Heterogeneous Parallel Machines Using Refactoring and Monte-Carlo Tree Search
Funding: This work was supported by the EU Horizon 2020 project TeamPlay (Grant Number 779882) and UK EPSRC Discovery (Grant Number EP/P020631/1). This paper presents a new technique for introducing and tuning parallelism for heterogeneous shared-memory systems (comprising a mixture of CPUs and GPUs), using a combination of algorithmic skeletons (such as farms and pipelines), Monte Carlo tree search for deriving mappings of tasks to available hardware resources, and refactoring tool support for applying the patterns and mappings in an easy and effective way. Using our approach, we demonstrate easily obtainable, significant and scalable speedups of up to 41x over the sequential code on a 24-core machine with one GPU. We also demonstrate that the speedups obtained by mappings derived by the MCTS algorithm are within 5-15% of the best manual parallelisation.
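The mapping search described above can be pictured, in its simplest form, as a bandit-style search over candidate task-to-hardware mappings scored by UCB1, the selection rule at the heart of Monte Carlo tree search. The sketch below is a deliberately flattened one-level illustration in Python (the paper's tool searches a tree of skeleton configurations; all names and the reward model here are assumptions):

```python
import math
import random

def mcts_pick(mappings, reward, n_sim=200, c=1.4, seed=0):
    """One-level MCTS (a UCB1 bandit): repeatedly select the mapping with
    the highest UCB1 score, sample its (noisy) measured speedup, and
    finally return the mapping with the best mean reward."""
    rng = random.Random(seed)
    visits = [0] * len(mappings)
    value = [0.0] * len(mappings)
    for t in range(1, n_sim + 1):
        # UCB1: mean reward + exploration bonus; unvisited arms first
        scores = [float("inf") if visits[i] == 0
                  else value[i] / visits[i] + c * math.sqrt(math.log(t) / visits[i])
                  for i in range(len(mappings))]
        i = scores.index(max(scores))
        r = reward(mappings[i], rng)   # e.g. a benchmarked speedup
        visits[i] += 1
        value[i] += r
    best = max(range(len(mappings)), key=lambda i: value[i] / max(visits[i], 1))
    return mappings[best]

# hypothetical reward: the GPU mapping yields roughly twice the speedup
def measured_speedup(mapping, rng):
    base = 1.0 if mapping == "gpu-farm" else 0.5
    return base + rng.uniform(-0.1, 0.1)
```

In the full technique, each tree node is a partial assignment of tasks to CPU cores or the GPU, and the rollout reward is the measured or estimated runtime of the completed mapping.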
Real-Time Data Mining for Process Operations Using Graphics Processing Unit (GPU)-Based High Performance Computing
Ph.D., Doctor of Philosophy