131 research outputs found

    A General Framework for Accelerating Swarm Intelligence Algorithms on FPGAs, GPUs and Multi-core CPUs

    Swarm intelligence algorithms (SIAs) have demonstrated excellent performance on optimization problems, including many real-world problems. However, because they are computationally expensive on some complex problems, SIAs need to be accelerated effectively. This paper presents a high-performance general framework for accelerating SIAs (FASI). Unlike previous work, which accelerates SIAs only by enhancing parallelization, FASI considers both the memory architectures of hardware platforms and the dataflow of SIAs, and it reschedules the framework of SIAs as a converged dataflow to improve memory access efficiency. FASI achieves higher acceleration by matching the algorithm framework to the hardware architecture. We also design deeply optimized structures for the parallelization and convergence of FASI based on the characteristics of specific hardware platforms. We take the quantum-behaved particle swarm optimization algorithm (QPSO) as a case study to evaluate FASI. The results show that FASI improves the throughput of SIAs and provides better performance by optimizing the hardware implementations. In our experiments, FASI achieves a maximum throughput of 290.7 Mbit/s, which is higher than several existing systems, and FASI on FPGAs achieves a better speedup than on GPUs and multi-core CPUs. In terms of optimization time, FASI on a Xilinx Kintex Ultrascale xcku040 is up to 123 times and no less than 1.45 times faster than on an Intel Core i7-6700 CPU or an NVIDIA GTX1080 GPU. Finally, we compare the differences in deploying FASI on these hardware platforms and provide some guidelines for improving acceleration performance according to the hardware architecture.

    Fine-grained parallelization of fitness functions in bioinformatics optimization problems: gene selection for cancer classification and biclustering of gene expression data

    BACKGROUND: Metaheuristics are widely used to solve large combinatorial optimization problems in bioinformatics because of the huge set of possible solutions. Two representative problems are gene selection for cancer classification and biclustering of gene expression data. In most cases, these metaheuristics, as well as other non-linear techniques, apply a fitness function to each possible solution in a size-limited population. That step involves higher latencies than other parts of the algorithms, so the execution time of the applications depends mainly on the execution time of the fitness function. In addition, fitness functions are usually formulated in floating-point arithmetic. A careful parallelization of these functions using reconfigurable hardware technology will therefore accelerate the computation, especially if they are applied in parallel to several solutions of the population. RESULTS: A fine-grained parallelization of two floating-point fitness functions of different complexities and features, involved in biclustering of gene expression data and gene selection for cancer classification, yielded higher speedups and lower-power computation than conventional microprocessors. CONCLUSIONS: The results show better performance with reconfigurable hardware technology than with conventional microprocessors, in terms of both computing time and power consumption, not only because of the parallelization of the arithmetic operations but also thanks to the concurrent fitness evaluation of several individuals of the population in the metaheuristic. This is a good basis for building accelerated and low-energy solutions for intensive computing scenarios.
    Funding: Ministerio de Economía y Competitividad and FEDER funds, contract TIN2012-30685 (R&D&i); Gobierno de Extremadura, grant GR15011 for research groups (TIC015); CONICYT/FONDECYT/REGULAR/1160455, grant awarded to Ricardo Soto Guzmán; CONICYT/FONDECYT/REGULAR/1140897, grant awarded to Broderick Crawford. Peer reviewed.
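    The concurrent fitness evaluation described in this abstract can be illustrated in software. The following is only a minimal sketch under stated assumptions, not the authors' reconfigurable-hardware design: `fitness` is a hypothetical toy floating-point function (a sum of squares), and the population is evaluated concurrently with Python's standard `concurrent.futures` module.

```python
from concurrent.futures import ThreadPoolExecutor


def fitness(individual):
    """Hypothetical floating-point fitness: sum of squares (lower is better)."""
    return sum(x * x for x in individual)


def evaluate_population(population, workers=4):
    """Evaluate every individual concurrently; results keep population order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fitness, population))


# Toy population (made up for illustration).
population = [[1.0, 2.0], [0.5, 0.5], [3.0, 0.0]]
scores = evaluate_population(population)
best = population[scores.index(min(scores))]
```

    In CPython, threads show the structure only; `ProcessPoolExecutor` offers the same `map` interface and would be the usual choice for CPU-bound fitness functions.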

    Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis

    Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge, and techniques range from distributed algorithms to low-level circuit design. In this survey, we describe the problem from a theoretical perspective, followed by approaches for its parallelization. We present trends in DNN architectures and the resulting implications for parallelization strategies. We then review and model the different types of concurrency in DNNs: from the single operator, through parallelism in network inference and training, to distributed deep learning. We discuss asynchronous stochastic optimization, distributed system architectures, communication schemes, and neural architecture search. Based on those approaches, we extrapolate potential directions for parallelism in deep learning.

    Accelerating supply chains with Ant Colony Optimization across range of hardware solutions

    This pre-print, arXiv:2001.08102v1 [cs.NE], was subsequently published by Elsevier in Computers and Industrial Engineering, vol. 147, 106610, pp. 1-14, on 29 Jun 2020 and is available at https://doi.org/10.1016/j.cie.2020.106610. The Ant Colony algorithm has been applied to various optimization problems; however, most previous work on scaling and parallelism focuses on Travelling Salesman Problems (TSPs). Although useful for benchmarks and for comparing new ideas, the algorithmic dynamics do not always transfer to complex real-life problems, where additional meta-data is required during solution construction. This paper looks at a real-life outbound supply chain problem using Ant Colony Optimization (ACO) and its scaling dynamics with two parallel ACO architectures: Independent Ant Colonies (IAC) and Parallel Ants (PA). Results showed that PA was able to reach a higher solution quality in fewer iterations as the number of parallel instances increased. Furthermore, speed performance was measured across three different hardware solutions: a 16-core CPU, a 68-core Xeon Phi and up to 4 Geforce GPUs. State-of-the-art ACO vectorization techniques such as SS-Roulette were implemented using C++ and CUDA. Although GPUs are excellent for TSP, it was concluded that they are not suitable for the given supply chain problem because of the meta-data access footprint required. Furthermore, compared to their sequential counterparts, the vectorized CPU AVX2 implementation achieved a 25.4x speedup on CPU, while Xeon Phi with its AVX512 instruction set reached 148x on PA with Vectorized (PAwV). PAwV is therefore able to scale at least up to 1024 parallel instances on the supply chain network problem solved.
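    For readers unfamiliar with the selection step that vectorized variants such as SS-Roulette speed up, the following is a minimal sketch of plain (scalar) roulette-wheel selection in ACO. It is not SS-Roulette and not the paper's C++/CUDA code; the function names, the `alpha`/`beta` defaults and the toy pheromone and heuristic values are assumptions made for illustration.

```python
import random
from itertools import accumulate


def edge_weights(pheromone, heuristic, alpha=1.0, beta=2.0):
    """Standard ACO desirability per candidate: pheromone^alpha * heuristic^beta."""
    return [(t ** alpha) * (h ** beta) for t, h in zip(pheromone, heuristic)]


def roulette_select(weights, rng):
    """Pick index i with probability weights[i] / sum(weights)."""
    cumulative = list(accumulate(weights))
    threshold = rng.random() * cumulative[-1]
    for index, bound in enumerate(cumulative):
        if threshold < bound:
            return index
    return len(weights) - 1  # guard against floating-point round-off


rng = random.Random(0)  # seeded so the run is repeatable
# Toy values (made up): the third candidate has no pheromone, so weight 0.
weights = edge_weights([1.0, 1.0, 0.0], [1.0, 2.0, 5.0])
picks = [roulette_select(weights, rng) for _ in range(1000)]
```

    Each ant repeats this selection for every step of its tour, which is why the selection routine is a natural target for vectorization.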

    Optimization of Deep Neural Networks Using SoCs with OpenCL

    In the optimization of deep neural networks (DNNs) via evolutionary algorithms (EAs) and the implementation of the training necessary for the creation of the objective function, there is often a trade-off between efficiency and flexibility. Pure software solutions implemented on general-purpose processors tend to be slow because they do not take advantage of the inherent parallelism of these devices, whereas hardware realizations based on heterogeneous platforms (combining central processing units (CPUs), graphics processing units (GPUs) and/or field-programmable gate arrays (FPGAs)) are designed with different solutions, using methodologies supported by different languages and very different implementation criteria. This paper first presents a study that demonstrates the need for a heterogeneous (CPU-GPU-FPGA) platform to accelerate the optimization of artificial neural networks (ANNs) using genetic algorithms. Second, the paper presents implementations of the calculations related to the individuals evaluated in such an algorithm on different (CPU- and FPGA-based) platforms, but with the same source files written in OpenCL. Implementing the individuals on remote, low-cost FPGA systems on a chip (SoCs) is found to achieve good efficiency in terms of performance per watt.
    This research was funded by Spanish Agency of Research grant number FPA2016-78595-C3-3-R.
    Gadea Gironés, R.; Colom Palero, RJ.; Herrero Bosch, V. (2018). Optimization of Deep Neural Networks Using SoCs with OpenCL. Sensors. 18(5). https://doi.org/10.3390/s18051384

    Parallel Acceleration and Improvement of Gravitational Field Optimization Algorithm

    The Gravitational Field Algorithm (GFA), a modern optimization algorithm, mainly simulates celestial mechanics and is derived from the Solar Nebular Disk Model (SNDM). It simulates the process of planetary formation to search for the optimal solution. Although this algorithm has advantages over other optimization algorithms in multi-peak optimization problems, it still suffers from long computation times when dealing with large-scale datasets or solving complex problems. Therefore, it is necessary to improve the efficiency of the GFA. In this paper, an optimization method based on multi-population parallelism is proposed to accelerate the GFA. With the help of the parallel mechanism in MATLAB, the execution speed of the algorithm is improved by using the parallel computing mode of a multi-core CPU. In addition, this paper also improves the absorption operation strategy. Experimental results on eight classical unconstrained optimization problems show that the computational efficiency of this method is improved compared with the original GFA, and that the accuracy of the algorithm is also slightly improved.

    Architectures and GPU-Based Parallelization for Online Bayesian Computational Statistics and Dynamic Modeling

    Recent work demonstrates that coupling Bayesian computational statistics methods with dynamic models can facilitate the analysis of complex systems associated with diverse time series, including those involving social and behavioural dynamics. Particle Markov Chain Monte Carlo (PMCMC) methods constitute a particularly powerful class of Bayesian methods combining aspects of batch Markov Chain Monte Carlo (MCMC) and the sequential Monte Carlo method of Particle Filtering (PF). PMCMC can flexibly combine theory-capturing dynamic models with diverse empirical data. Online machine learning is a subcategory of machine learning algorithms characterized by sequential, incremental execution as new data arrives, which can give updated results and predictions with growing sequences of available incoming data. While many machine learning and statistical methods are adapted to online algorithms, PMCMC is one example of the many methods whose compatibility with and adaption to online learning remains unclear. In this thesis, I proposed a data-streaming solution supporting PF and PMCMC methods with dynamic epidemiological models and demonstrated several successful applications. By constructing an automated, easy-to-use streaming system, analytic applications and simulation models gain access to arriving real-time data to shorten the time gap between data and resulting model-supported insight. The well-defined architecture design emerging from the thesis would substantially expand traditional simulation models' potential by allowing such models to be offered as continually updated services. Contingent on sufficiently fast execution time, simulation models within this framework can consume the incoming empirical data in real-time and generate informative predictions on an ongoing basis as new data points arrive. 
In a second line of work, I investigated the platform's flexibility and capability by extending this system to support a powerful class of PMCMC algorithms with dynamic models while ameliorating such algorithms' traditionally stiff performance limitations. Specifically, this work designed and implemented a GPU-enabled parallel version of a PMCMC method with dynamic simulation models. The resulting codebase has readily enabled researchers to adapt their models to state-of-the-art statistical inference methods and ensures that the computation-heavy PMCMC method can perform significant sampling between the arrival of successive data points. An investigation of this method's impact with several realistic PMCMC application examples showed that GPU-based acceleration allows for up to 160x speedup compared to a corresponding CPU-based version not exploiting parallelism. The GPU-accelerated PMCMC and the stream processing system can complement each other, jointly providing researchers with a powerful toolset to greatly accelerate learning and secure additional insight from the high-velocity data increasingly prevalent within social and behavioural spheres. The design philosophy applied supported a platform with broad generalizability and potential for ready future extensions. The thesis discusses common barriers and difficulties in designing and implementing such systems and offers solutions to solve or mitigate them.
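    As an illustration of the Particle Filtering building block named in this abstract, the following is a minimal sketch of one bootstrap particle filter step. It is not the thesis's GPU implementation or its epidemiological models: the 1-D random-walk state model, the Gaussian observation model, and every name and parameter (`particle_filter_step`, `process_sd`, `obs_sd`) are assumptions chosen to keep the example small.

```python
import math
import random


def particle_filter_step(particles, observation, rng, process_sd=1.0, obs_sd=1.0):
    """One bootstrap-filter step for a toy 1-D random-walk model (assumed)."""
    # Propagate: each particle moves independently of the others.
    moved = [p + rng.gauss(0.0, process_sd) for p in particles]
    # Weight: Gaussian likelihood of the observation given each particle.
    weights = [math.exp(-0.5 * ((observation - p) / obs_sd) ** 2) for p in moved]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Resample: draw an equally weighted population in proportion to the weights.
    resampled = rng.choices(moved, weights=weights, k=len(moved))
    return resampled, weights


rng = random.Random(42)  # seeded so the run is repeatable
particles = [rng.uniform(-5.0, 5.0) for _ in range(500)]
particles, weights = particle_filter_step(particles, observation=2.0, rng=rng)
estimate = sum(particles) / len(particles)
```

    The propagate and weight loops touch each particle independently, which is the kind of per-particle work that a parallel implementation can distribute across many threads.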

    High-Order Epistasis Detection in High Performance Computing Systems

    Official Doctoral Programme in Information Technology Research (Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información). 524V01
    [Abstract] In recent years, Genome-Wide Association Studies (GWAS) have become increasingly popular as a way of finding a genetic explanation for the presence or absence of particular diseases in humans. There is consensus in these studies about the presence of genetic interactions in the expression of complex diseases, a phenomenon called epistasis. This thesis focuses on the study of this phenomenon, employing High-Performance Computing (HPC) and starting from a statistical definition of the problem: the deviation of the expression of a phenotype from the sum of the individual contributions of multiple genetic variants. For this purpose, we first developed MPI3SNP, a program that identifies interactions of three variants from an input dataset. MPI3SNP implements an exhaustive search for epistasis using an association test based on Mutual Information and exploits the resources of clusters of CPUs or GPUs to speed up the search. Then, with the help of MPI3SNP, we evaluated the state-of-the-art methods in a study that compares the performance of twenty-seven tools. The most important conclusion of this study is the inability of non-exhaustive approaches to locate epistasis in the absence of marginal effects (small association effects of individual variants that take part in an epistatic interaction). For this reason, the thesis continued to focus on optimizing the exhaustive search. First, we improved the efficiency of the association test through a vector implementation of the procedure. Then, we developed a distributed algorithm that implements an exhaustive search capable of locating epistatic interactions of any order. These two milestones were achieved in Fiuncho, a program that incorporates all the research carried out and obtains better performance on CPU clusters than all the state-of-the-art alternatives. In addition, we developed Toxo, a library for simulating biological scenarios with epistasis. This library allows for the simulation of epistasis that follows existing interaction models for high-order interactions.
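    To make the Mutual Information association test and the role of marginal effects concrete, the following is a minimal sketch under stated assumptions. It is not the MPI3SNP or Fiuncho implementation: the function names and the eight-sample toy dataset are made up, with a phenotype built as the XOR of three variants so that the interaction is purely three-way.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of hashable labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(snp_a, snp_b, snp_c, phenotype):
    """MI between the joint genotype of three SNPs and the phenotype."""
    genotype = list(zip(snp_a, snp_b, snp_c))
    joint = list(zip(genotype, phenotype))
    return entropy(genotype) + entropy(phenotype) - entropy(joint)


# Toy data (made up): genotypes coded 0/1, phenotype 0 = control, 1 = case.
snp_a = [0, 0, 1, 1, 0, 0, 1, 1]
snp_b = [0, 1, 0, 1, 0, 1, 0, 1]
snp_c = [0, 0, 0, 0, 1, 1, 1, 1]
phenotype = [a ^ b ^ c for a, b, c in zip(snp_a, snp_b, snp_c)]

mi_triple = mutual_information(snp_a, snp_b, snp_c, phenotype)
# Marginal test for a single variant: MI between snp_a alone and the phenotype.
mi_single = entropy(snp_a) + entropy(phenotype) - entropy(list(zip(snp_a, phenotype)))
```

    In this toy dataset the three-variant test recovers the full 1 bit of phenotype information while the single-variant test scores 0, which mirrors the case without marginal effects discussed in the abstract.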