79 research outputs found
Re-engineering the ant colony optimization for CMP architectures
[EN] The ant colony optimization (ACO) is inspired by the behavior of real ants, and as a bioinspired method, its underlying computation is massively parallel by definition. This paper shows re-engineering strategies to migrate the ACO algorithm applied to the Traveling Salesman Problem to modern Intel-based multi- and many-core architectures in a step-by-step methodology. The paper provides detailed guidelines on how to optimize the algorithm for the intra-node (thread and vector) parallelization, showing the performance scalability along with the number of cores on different Intel architectures, reporting up to 5.5x speedup factor between the Intel Xeon Phi Knights Landing and Intel Xeon v2. Moreover, parallel efficiency is provided for all targeted architectures, finding that core load imbalance, memory bandwidth limitations, and NUMA effects on data placement are some of the key factors limiting performance. Finally, a distributed implementation is also presented, reaching up to 2.96x speedup factor when running the code on 3 nodes over the single-node counterpart version. In the latter case, the parallel efficiency is affected by the synchronization frequency, which also affects the quality of the solution found by the distributed implementation.This work was partially supported by the Fundación Séneca, Agencia de Ciencia y Tecnología de la Región de Murcia under Project 20813/PI/18, and by Spanish Ministry of Science, Innovation and Universities as well as European Commission FEDER funds under Grants TIN2015-66972-C5-3-R, RTI2018-098156-B-C53, TIN2016-78799-P (AEI/FEDER, UE), and RTC-2017-6389-5. We acknowledge the excellent work done by Victor Montesinos while he was doing a research internship supported by the University of Murcia.Cecilia-Canales, JM.; García Carrasco, JM. (2020). Re-engineering the ant colony optimization for CMP architectures. The Journal of Supercomputing (Online). 76(6):4581-4602. https://doi.org/10.1007/s11227-019-02869-8S45814602766Yang XS (2010) Nature-inspired metaheuristic algorithms. Luniver Press, LebanonAkila M, Anusha P, Sindhu M, Selvan Krishnasamy T (2017) Examination of PSO, GA-PSO and ACO algorithms for the design optimization of printed antennas. In: IEEE Applied Electromagnetics Conference (AEMC)Dorigo M, Stützle T (2004) Ant colony optimization. A bradford book. The MIT Press, CambridgeCecilia JM, García JM, Nisbet A, Amos M, Ujaldón M (2013) Enhancing data parallelism for ant colony optimization on GPUs. J Parallel Distrib Comput 73(1):42–51Dawson L, Stewart I (2013) Improving ant colony optimization performance on the GPU using CUDA. In: IEEE Conference on Evolutionary Computation, pp 1901–1908Llanes A, Cecilia JM, Sánchez A, García JM, Amos M, Ujaldón M (2016) Dynamic load balancing on heterogeneous clusters for parallel ant colony optimization. Cluster Comput 19(1):1–11Cecilia JM, Llanes A, Abellán JL, Gómez-Luna J, Chang L, Hwu WW (2018) High-throughput ant colony optimization on graphics processing units. J Parallel Distrib Comput 113:261–274Lloyd H, Amos M (2016) A Highly Parallelized and Vectorized Implementation of Max–Min Ant System on Intel Xeon Phi. In: IEEE computational intelligenceTirado F, Barrientos RJ, González P, Mora M (2017) Efficient exploitation of the Xeon Phi architecture for the ant colony optimization (ACO) metaheuristic. J Supercomput 73(11):5053–5070Montesinos V, García JM (2018) Vectorization strategies for ant colony optimization on intel architectures. Parallel Computing is Everywhere. IOS Press, Amsterdam, pp 400–409Lawler E, Lenstra J, Kan A, Shmoys D (1987) The Traveling salesman problem. Wiley, New YorkMontesinos V (June 2018) Performance analysis of ant colony optimization on intel architectures. Master’s Thesis, University of Murcia (Spain)Lloyd H, Amos M (2017) Analysis of independent roulette selection in parallel ant colony optimization. In: Genetic and Evolutionary Computation Conference, ACM, pp 19–26Dorigo M (1992) Optimization, learning and natural algorithms. Ph.D. Thesis, Politecnico di Milano, ItalyDuran A, Klemm M (2012) The intel many integrated core architecture. In: Internal Conference on High Performance Computing and Simulation (HPCS), pp 365–366The OpenMP API specification for parallel programming. URL: https://www.openmp.org . [Last accessed 14 June 2018]The Message Passing Interface (MPI) standard. URL: http://www.mcs.anl.gov/research/projects/mpi/ . [Last accessed 15 June 2018]Vladimirov A, Asai R (2016) Clustering modes in Knights landing processors: developer’s guide. Colfax international. URL: https://colfaxresearch.com/knl-numa/ . [Last accessed: 16 June 2018]Intel Developer Zone. URL: https://software.intel.com/en-us/modern-code . [Last accessed 02 Oct 2018]Pearce M (2018) What is code modernization? Intel developer zone. URL: http://software.intel.com/en-us/articles/what-is-code-modernization . [Last accessed 15 Feb 2018]Stützle T ACOTSP v1.03. Last accessed 15 Feb 2018. URL: http://iridia.ulb.ac.be/~mdorigo/ACO/downloads/ACOTSP-1.03.tgzReinelt G (1991) TSPLIB—a traveling salesman problem library. ORSA J Comput 3:376–384Crainic TG, Toulouse M (2003) Parallel strategies for meta-heuristics. State-of-the-art handbook in metaheuristics. Kluwer Academic Publishers, Dordrecht, pp 475–513Delévacq A, Delisle P, Gravel M, Krajecki M (2013) Parallel ant colony optimization on graphics processing units. J Parallel Distrib Comput 73(1):52–61Skinderowicz R (2016) The GPU-based parallel ant colony system. J Parallel Distrib Comput 98:48–60Zhou Y, He F, Hou N, Qiu Y (2018) Parallel ant colony optimization on multi-core SIMD CPUs. Future Gener Comput Syst 79:473–487Peake J, Amos M, Yiapanis P, Lloyd H (2018) Vectorized candidate set selection for parallel ant colony optimization. In: Genetic and Evolutionary Computation Conference, ACM, pp 1300–1306Stützle T (1998) Parallelization strategies for ant colony optimization. In: Eiben AE, Bäck T, Schoenauer M, Schwefel HP (eds) Parallel problem solving from nature—PPSN V. PPSN. Lecture Notes in Computer Science, vol 1498. Springer, Berlin, HeidelbergAbdelkafi O, Lepagnot J, Idoumghar L (2014) Multi-level parallelization for hybrid ACO. In: Siarry P, Idoumghar L, Lepagnot J (eds) Swarm Intelligence Based Optimization. ICSIBO 2014. Lecture Notes in Computer Science, vol 8472. Springer, ChamMichel R, Middendorf M (1998) An island model based ant system with lookahead for the shortest super sequence problem. In: Eiben AE, Bäck T, Schoenauer M, Schwefel HP (eds) Parallel problem solving from nature— PPSN V. PPSN. Lecture Notes in Computer Science, vol 1498. Springer, Berlin, HeidelbergChen L, Sun H, Wang S (2008) Parallel implementation of ant colony optimization on MPP. In: International Conference on Machine Learning and CyberneticsLin Y, Cai H, Xiao J, Zhang J (2007) Pseudo parallel ant colony optimization for continuous functions. In: International Conference on Natural Computatio
Accelerating supply chains with Ant Colony Optimization across range of hardware solutions
This pre-print, arXiv:2001.08102v1 [cs.NE], was published subsequently by Elsevier in Computers and Industrial Engineering, vol. 147, 106610, pp. 1-14 on 29 Jun 2020 and is available at https://doi.org/10.1016/j.cie.2020.106610Ant Colony algorithm has been applied to various optimization problems, however most of the previous work on scaling and parallelism focuses on Travelling Salesman Problems (TSPs). Although, useful for benchmarks and new idea comparison, the algorithmic dynamics does not always transfer to complex real-life problems, where additional meta-data is required during solution construction. This paper looks at real-life outbound supply chain problem using Ant Colony Optimization (ACO) and its scaling dynamics with two parallel ACO architectures - Independent Ant Colonies (IAC) and Parallel Ants (PA). Results showed that PA was able to reach a higher solution quality in fewer iterations as the number of parallel instances increased. Furthermore, speed performance was measured across three different hardware solutions - 16 core CPU, 68 core Xeon Phi and up to 4 Geforce GPUs. State of the art, ACO vectorization techniques such as SS-Roulette were implemented using C++ and CUDA. Although excellent for TSP, it was concluded that for the given supply chain problem GPUs are not suitable due to meta-data access footprint required. Furthermore, compared to their sequential counterpart, vectorized CPU AVX2 implementation achieved 25.4x speedup on CPU while Xeon Phi with its AVX512 instruction set reached 148x on PA with Vectorized (PAwV). PAwV is therefore able to scale at least up to 1024 parallel instances on the supply chain network problem solved
High-Order Epistasis Detection in High Performance Computing Systems
Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01[Resumo]
Nos últimos anos, os estudos de asociación do xenoma completo (Genome-Wide
Association Studies, GWAS) están a gañar moita popularidade de cara a buscar unha
explicación xenética á presenza ou ausencia de certas enfermidades nos humanos.Hai
un consenso nestes estudos sobre a existencia de interaccións xenéticas que condicionan
a expresión de enfermidades complexas, un fenómeno coñecido como epistasia.
Esta tese céntrase no estudo deste fenómeno empregando a computación de altas
prestacións (High-Performance Computing, HPC) e dende a súa perspectiva estadística:
a desviación da expresión dun fenotipo como a suma dos efectos individuais de
múltiples variantes xenéticas. Con este obxectivo desenvolvemos unha primeira ferramenta,
chamada MPI3SNP, que identifica interaccións de tres variantes a partir dun
conxunto de datos de entrada. MPI3SNP implementa unha busca exhaustiva empregando
un test de asociación baseado na Información Mutua, e explota os recursos de
clústeres de CPUs ou GPUs para acelerar a busca. Coa axuda desta ferramenta avaliamos
o estado da arte da detección de epistasia a través dun estudo que compara o rendemento
de vintesete ferramentas. A conclusión máis importante desta comparativa
é a incapacidade dos métodos non exhaustivos de atopar interacción ante a ausencia
de efectos marxinais (pequenos efectos de asociación das variantes individuais que
participan na epistasia). Por isto, esta tese continuou centrándose na optimización da
busca exhaustiva de epistasia. Por unha parte, mellorouse a eficiencia do test de asociación
a través dunha implantación vectorial do mesmo. Por outro lado, creouse un
algoritmo distribuído que implementa unha busca exhaustiva capaz de atopar epistasia
de calquera orden. Estes dous fitos lógranse en Fiuncho, unha ferramenta que integra
toda a investigación realizada, obtendo un rendemento en clústeres de CPUs que
supera a todas as súas alternativas no estado da arte. Adicionalmente, desenvolveuse
unha libraría para simular escenarios biolóxicos con epistasia chamada Toxo. Esta
libraría permite a simulación de epistasia seguindo modelos de interacción xenética
existentes para orde alto.[Resumen]
En los últimos años, los estudios de asociación del genoma completo (Genome-
Wide Association Studies, GWAS) están ganando mucha popularidad de cara a buscar
una explicación genética a la presencia o ausencia de ciertas enfermedades en los seres
humanos. Existe un consenso entre estos estudios acerca de que muchas enfermedades
complejas presentan interacciones entre los diferentes genes que intervienen en su
expresión, un fenómeno conocido como epistasia. Esta tesis se centra en el estudio de
este fenómeno empleando la computación de altas prestaciones (High-Performance
Computing, HPC) y desde su perspectiva estadística: la desviación de la expresión de
un fenotipo como suma de los efectos de múltiples variantes genéticas. Para ello se
ha desarrollado una primera herramienta, MPI3SNP, que identifica interacciones de
tres variantes a partir de un conjunto de datos de entrada. MPI3SNP implementa una
búsqueda exhaustiva empleando un test de asociación basado en la Información Mutua,
y explota los recursos de clústeres de CPUs o GPUs para acelerar la búsqueda.
Con la ayuda de esta herramienta, hemos evaluado el estado del arte de la detección
de epistasia a través de un estudio que compara el rendimiento de veintisiete herramientas.
La conclusión más importante de esta comparativa es la incapacidad de los
métodos no exhaustivos de localizar interacciones ante la ausencia de efectos marginales
(pequeños efectos de asociación de variantes individuales pertenecientes a una
relación epistática). Por ello, esta tesis continuó centrándose en la optimización de la
búsqueda exhaustiva. Por un lado, se mejoró la eficiencia del test de asociación a través
de una implementación vectorial del mismo. Por otra parte, se diseñó un algoritmo
distribuido que implementa una búsqueda exhaustiva capaz de encontrar relaciones
epistáticas de cualquier tamaño. Estos dos hitos se logran en Fiuncho, una herramienta
que integra toda la investigación realizada, obteniendo un rendimiento en clústeres
de CPUs que supera a todas sus alternativas del estado del arte. A mayores, también se
ha desarrollado una librería para simular escenarios biológicos con epistasia llamada
Toxo. Esta librería permite la simulación de epistasia siguiendomodelos de interacción
existentes para orden alto.[Abstract]
In recent years, Genome-Wide Association Studies (GWAS) have become more and
more popular with the intent of finding a genetic explanation for the presence or absence
of particular diseases in human studies. There is consensus about the presence
of genetic interactions during the expression of complex diseases, a phenomenon
called epistasis. This thesis focuses on the study of this phenomenon, employingHigh-
Performance Computing (HPC) for this purpose and from a statistical definition of the
problem: the deviation of the expression of a phenotype from the addition of the individual
contributions of genetic variants. For this purpose, we first developedMPI3SNP,
a programthat identifies interactions of three variants froman input dataset. MPI3SNP
implements an exhaustive search of epistasis using an association test based on the
Mutual Information and exploits the resources of clusters of CPUs or GPUs to speed up
the search. Then, we evaluated the state-of-the-art methods with the help of MPI3SNP
in a study that compares the performance of twenty-seven tools. The most important
conclusion of this study is the inability of non-exhaustive approaches to locate epistasis
in the absence of marginal effects (small association effects of individual variants
that partake in an epistasis interaction). For this reason, this thesis continued focusing
on the optimization of the exhaustive search. First, we improved the efficiency of
the association test through a vector implementation of this procedure. Then, we developed
a distributed algorithm capable of locating epistasis interactions of any order.
These two milestones were achieved in Fiuncho, a program that incorporates all the
research carried out, obtaining the best performance in CPU clusters out of all the alternatives
of the state-of-the-art. In addition, we also developed a library to simulate
particular scenarios with epistasis called Toxo. This library allows for the simulation of
epistasis that follows existing interaction models for high-order interactions
Scaling Techniques for Parallel Ant Colony Optimization on Large Problem Instances
Ant Colony Optimization (ACO) is a nature-inspired optimization metaheuristic which has been successfully applied to a wide range of different problems. However, a significant limiting factor in terms of its scalability is memory complexity; in many problems, the pheromone matrix which encodes trails left by ants grows quadratically with the instance size. For very large instances, this memory requirement is a limiting factor, making ACO an impractical technique. In this paper we propose a restricted variant of the pheromone matrix with linear memory complexity, which stores pheromone values only for members of a candidate set of next moves. We also evaluate two selection methods for moves outside the candidate set. Using a combination of these techniques we achieve, in a reasonable time, the best solution qualities recorded by ACO on the Art TSP Traveling Salesman Problem instances, and the first evaluation of a parallel implementation of MAX-MIN Ant System on instances of this scale (≥ 105 vertices). We find that, although ACO cannot yet achieve the solutions found by state-of-the-art genetic algorithms, we rapidly find approximate solutions within 1 -- 2% of the best known
A Highly Parallelized and Vectorized Implementation of Max-Min Ant System on Intel Xeon Phi
The increasing trend in processor design towards many-core architectures with wide vector processing units is largely motivated by the fact that single core performance has hit a ‘power wall’, meaning that performance gains are currently achievable only through increasingly parallel and vectorized execution models. Consequently, applications can only exploit the full performance of modern processors if they achieve high parallel and vector efficiencies. In this paper, we illustrate how this might be achieved for the well-established Ant Colony Optimization metaheuristic. We describe a highly parallel and vectorized variant of the Max-Min Ant System algorithm applied to the Traveling Salesman Problem, and present two novel vectorized algorithms for selecting cities during the tour construction phase. We present experimental results from an implementation on the Intel R Xeon PhiTM platform, which show that very high parallel and vector efficiencies are achieved, and significant speedups are obtained compared to both the reference serial implementation and the previous best Xeon Phi implementation available in the literature
Vectorized Candidate Set Selection for Parallel Ant Colony Optimization
Ant Colony Optimization (ACO) is a well-established nature-inspired heuristic, and parallel versions of the algorithm now exist to take advantage of emerging high-performance computing processors. However, careful attention must be paid to parallel components of such implementations if the full benefit of these platforms is to be obtained. One such component of the ACO algorithm is next node selection, which presents unique challenges in a parallel setting. In this paper, we present a new node selection method for ACO, Vectorized Candidate Set Selection (VCSS), which achieves significant speedup over existing selection methods on a test set of Traveling Salesman Problem instances
Soft Computing Techiniques for the Protein Folding Problem on High Performance Computing Architectures
The protein-folding problem has been extensively studied during the last
fifty years. The understanding of the dynamics of global shape of a protein and the influence
on its biological function can help us to discover new and more effective
drugs to deal with diseases of pharmacological relevance. Different computational approaches
have been developed by different researchers in order to foresee the threedimensional
arrangement of atoms of proteins from their sequences. However, the
computational complexity of this problem makes mandatory the search for new models,
novel algorithmic strategies and hardware platforms that provide solutions in a
reasonable time frame. We present in this revision work the past and last tendencies
regarding protein folding simulations from both perspectives; hardware and software.
Of particular interest to us are both the use of inexact solutions to this computationally hard problem as
well as which hardware platforms have been used for running this kind of Soft Computing techniques.This work is jointly supported by the FundaciónSéneca (Agencia Regional de Ciencia y Tecnología, Región de Murcia) under grants 15290/PI/2010 and 18946/JLI/13, by the Spanish MEC and European Commission FEDER under grant with reference TEC2012-37945-C02-02 and TIN2012-31345, by the Nils Coordinated Mobility under grant 012-ABEL-CM-2014A, in part financed by the European Regional Development Fund (ERDF). We also thank NVIDIA for hardware donation within UCAM GPU educational and research centers.Ingeniería, Industria y Construcció
A SIMD Algorithm for the Detection of Epistatic Interactions of Any Order
Financiado para publicación en acceso aberto: Universidade da Coruña/CISUG[Abstract] Epistasis is a phenomenon in which a phenotype outcome is determined by the interaction of genetic variation at two or more loci and it cannot be attributed to the additive combination of effects corresponding to the individual loci. Although it has been more than 100 years since William Bateson introduced this concept, it still is a topic under active research. Locating epistatic interactions is a computationally expensive challenge that involves analyzing an exponentially growing number of combinations. Authors in this field have resorted to a multitude of hardware architectures in order to speed up the search, but little to no attention has been paid to the vector instructions that current CPUs include in their instruction sets. This work extends an existing third-order exhaustive algorithm to support the search of epistasis interactions of any order and discusses multiple SIMD implementations of the different functions that compose the search using Intel AVX Intrinsics. Results using the GCC and the Intel compiler show that the 512-bit explicit vector implementation proposed here performs the best out of all of the other implementations evaluated. The proposed 512-bit vectorization accelerates the original implementation of the algorithm by an average factor of 7 and 12, for GCC and the Intel Compiler, respectively, in the scenarios tested.This work is supported by the Ministry of Science and Innovation of Spain (PID2019-104184RB-I00/AEI/10.13039/501100011033), the Xunta de Galicia and FEDER funds of the EU (Centro de Investigación de Galicia accreditation 2019-2022, grant no. ED431G2019/01), Consolidation Program of Competitive Research (grant no. ED431C 2021/30), the FPU Program of the Ministry of Education of Spain (grant no. FPU16/01333), and the Universidade da Coruña/CISUG for funding the open access chargeXunta de Galicia; ED431G2019/01Xunta de Galicia; ED431C2021/3
Generating and auto-tuning parallel stencil codes
In this thesis, we present a software framework, Patus, which generates high performance stencil codes for different types of hardware platforms, including current multicore CPU and graphics processing unit architectures. The ultimate goals of the framework are productivity, portability (of both the code and performance), and achieving a high performance on the target platform.
A stencil computation updates every grid point in a structured grid based on the values of its neighboring points. This class of computations occurs frequently in scientific and general purpose computing (e.g., in partial differential equation solvers or in image processing), justifying the focus on this kind of computation.
The proposed key ingredients to achieve the goals of productivity, portability, and performance are domain specific languages (DSLs) and the auto-tuning methodology.
The Patus stencil specification DSL allows the programmer to express a stencil computation in a concise way independently of hardware architecture-specific details. Thus, it increases the programmer productivity by disburdening her or him of low level programming model issues and of manually applying hardware platform-specific
code optimization techniques. The use of domain specific languages also implies code reusability: once implemented, the same stencil specification can be reused on different
hardware platforms, i.e., the specification code is portable across hardware architectures. Constructing the language to be geared towards a special purpose makes it amenable to more aggressive optimizations and therefore to potentially higher performance.
Auto-tuning provides performance and performance portability by automated adaptation of implementation-specific parameters to the characteristics of the hardware on which the code will run. By automating the process of parameter tuning — which essentially amounts to solving an integer programming problem in which the objective function is the number representing the code's performance as a function of the parameter configuration, — the system can also be used more productively than if the programmer had to fine-tune the code manually.
We show performance results for a variety of stencils, for which Patus was used to generate the corresponding implementations. The selection includes stencils taken from two real-world applications: a simulation of the temperature within the human body during hyperthermia cancer treatment and a seismic application. These examples demonstrate the framework's flexibility and ability to produce high performance code
Acceleration of particle swarm optimization with AVX instructions
Parallel implementations of algorithms are usually compared with single-core CPU performance. The advantage of multicore vector processors decreases the performance gap between GPU and CPU computation, as shown in many recent pieces of research. With the AVX-512 instruction set, there will be another performance boost for CPU computations. The availability of parallel code running on CPUs made them much easier and more accessible than GPUs. This article compares the performances of parallel implementations of the particle swarm optimization algorithm. The code was written in C++, and we used various techniques to obtain parallel execution through Advanced Vector Extensions. We present the performance on various benchmark functions and different problem configurations. The article describes and compares the performance boost gained from parallel execution on CPU, along with advantages and disadvantages of parallelization techniques.Web of Science132art. no. 73
- …