11 research outputs found

    Multi-GPU Implementation of Machine Learning Algorithm using CUDA and OpenCL


    Performance analysis of an HPC cluster and manycore and multicore architectures

    The current trend for obtaining large amounts of computing power is parallel computing; a clear example is the fact that the fastest computers in the world are high-performance computing clusters built from accelerators and multiprocessors. HPC clusters offer the computational capacity needed by demanding workloads in areas such as weather prediction and machine learning. However, to achieve optimal performance in applications that exploit parallelism, it is necessary to analyse requirements such as communication load and the parallelization platform. The objective of this study is to analyse the scalability of the WRF climate prediction model on HPC clusters as a function of inter-node communication and MPI processes, and the performance of the Horizontal Diffusion algorithm on Xeon Phi and Tesla Kepler accelerators using OpenCL and CUDA. The results show that WRF scalability depends on the interconnect: the maximum speedup over sequential execution was 25.9x with InfiniBand FDR, 21.42x with InfiniBand QDR, and 6.4x with Ethernet. The Tesla K40m accelerator delivers roughly 6 times the performance of the Xeon Phi, because the algorithm does not use vectorization efficiently and Intel's OpenCL drivers for its manycore architecture are deprecated. OpenCL has a steeper learning curve than CUDA owing to its multiplatform nature; in terms of performance, CUDA is about 6% faster, since it targets NVIDIA GPUs and offers configurations that improve performance. OpenCL, on the other hand, is not greatly penalized by being multiplatform and is an option to consider when the requirements are oriented towards portability.
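    As a rough illustration of the kind of kernel this study benchmarks, horizontal diffusion is essentially a 2D Laplacian stencil applied to a scalar field. The following CPU reference sketch in Python/NumPy is illustrative only; the field, grid size and diffusion coefficient are assumptions, not values from the study:

```python
import numpy as np

def horizontal_diffusion(phi, k=0.1):
    """One explicit step of 2D horizontal diffusion on the interior of a
    field: phi += k * (4-point Laplacian). A minimal example of the kind
    of stencil offloaded to GPUs and accelerators in the study."""
    lap = (phi[:-2, 1:-1] + phi[2:, 1:-1] +
           phi[1:-1, :-2] + phi[1:-1, 2:] -
           4.0 * phi[1:-1, 1:-1])
    out = phi.copy()
    out[1:-1, 1:-1] += k * lap
    return out

# Toy usage: diffuse a random field for a few steps.
field = np.random.rand(128, 128)
for _ in range(10):
    field = horizontal_diffusion(field)
```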

    Use of GPU for time series analysis

    The Global Navigation Satellite System (GNSS) provides the position of stations with millimetre accuracy. These positions are not constant over time due to the motion of the tectonic plates on which the stations are installed. The motion of tectonic plates is not the same all over the world, causing stress to build up in some areas and thus increasing the occurrence of earthquakes. Knowledge of such movements is of extreme importance, as is the determination of their uncertainty. Tectonic motion is a very slow process that is constant over thousands of years; as a result, it can be represented by a linear trend. Programs such as Hector can estimate linear trends in time series with temporally correlated noise. Correlated noise means that GNSS observations made today are similar to those of the preceding days, which implies that one actually has less information than if all observations were independent. This is the reason why the real uncertainty of the estimated tectonic motion is 5 to 11 times larger than if this temporal correlation were not considered. Unfortunately, taking this temporal correlation into account slows down the computations considerably. With the ever-growing number of GNSS stations and the increasing length of the existing time series, it is necessary to speed up these computations. The same behaviour is also found in other geodetic time series, such as sea level observed at tide gauges and surface temperature records. For climate change studies it is important that, besides estimating sea level rise or an increase in temperature, the associated uncertainties are also realistic. The Hector software package was developed also with the aim of reducing the computation time as much as possible. In this thesis we investigate whether the use of a powerful Graphics Processing Unit (GPU) can further reduce this computation time. So far, no Hector-like software uses GPUs to perform data processing; this dissertation is therefore the first attempt in that direction.
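    To make the effect of correlated noise concrete, the sketch below estimates a linear trend by generalized least squares under two noise models and compares the formal trend uncertainties. It is not Hector itself; the AR(1) covariance, series length and trend value are assumptions chosen purely for illustration:

```python
import numpy as np

def trend_uncertainty(t, y, cov):
    """Estimate a linear trend y = a + b*t by generalized least squares
    with noise covariance `cov`; return the trend b and its formal
    1-sigma uncertainty."""
    A = np.column_stack([np.ones_like(t), t])   # design matrix
    Ci = np.linalg.inv(cov)
    N = A.T @ Ci @ A                            # normal matrix
    est = np.linalg.solve(N, A.T @ Ci @ y)
    return est[1], np.sqrt(np.linalg.inv(N)[1, 1])

# Toy daily series with AR(1) noise (phi assumed 0.9, unit variance).
n, phi = 500, 0.9
t = np.arange(n, dtype=float)
cov = phi ** np.abs(np.subtract.outer(t, t))    # AR(1): phi^|i-j|
rng = np.random.default_rng(0)
y = 0.01 * t + rng.multivariate_normal(np.zeros(n), cov)

b_w, s_w = trend_uncertainty(t, y, np.eye(n))   # white-noise assumption
b_c, s_c = trend_uncertainty(t, y, cov)         # correlated-noise model
print(f"white: {b_w:.4f} +/- {s_w:.4f}   AR(1): {b_c:.4f} +/- {s_c:.4f}")
```

Running this shows the correlated-noise uncertainty is several times the white-noise one, which is the inflation effect (5 to 11 times for real GNSS series) described above.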

    Simulations of fluids using the Monte Carlo method on graphics processing units

    Molecular simulations are a set of methods for performing computer experiments on models of molecular systems. They act as a bridge between theoretical predictions and experimental results. The need for greater computational power grows with the complexity and size of the simulation model. Graphics processing units are increasingly being used for general-purpose computing due to their favourable ratio of computing capacity to power consumption and price. In our work, we focus on the Monte Carlo method for the simulation of fluids. We have successfully adapted it for execution on graphics processing units using the CUDA platform and the energy-decomposition principle. Throughout the simulation, the system energy and the radial distribution function are calculated. Interatomic interactions are modelled using the Lennard-Jones potential. We have also implemented support for molecules composed of several different atoms. We have analysed the performance of our parallel implementation against a sequential implementation, achieving speedups of up to 172-fold when using double-precision floating-point numbers and up to nearly 640-fold when using single precision.
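    A minimal sketch of the Metropolis Monte Carlo step with the energy-decomposition idea: when one particle is displaced, only that particle's pair energies are recomputed rather than the full system energy. All parameters (particle count, box size, temperature, step size) are illustrative assumptions, and this serial CPU version only hints at what a CUDA implementation would parallelize:

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, T = 64, 6.0, 1.5           # particles, box length, reduced temperature
pos = rng.random((N, 3)) * L

def pair_energy(i, pos):
    """Lennard-Jones energy between particle i and all others, with
    minimum-image periodic boundaries (eps = sigma = 1, no cutoff)."""
    d = pos - pos[i]
    d -= L * np.round(d / L)                # minimum image convention
    r2 = np.einsum('ij,ij->i', d, d)
    r2[i] = np.inf                          # exclude self-interaction
    inv6 = 1.0 / r2**3
    return np.sum(4.0 * (inv6**2 - inv6))

def mc_step(pos, delta=0.2):
    """One Metropolis move: displace a random particle, accept with
    probability min(1, exp(-dE/T)). Energy decomposition: only the
    moved particle's pair energies are recomputed."""
    i = rng.integers(N)
    old = pair_energy(i, pos)
    trial = pos[i] + rng.uniform(-delta, delta, 3)
    saved, pos[i] = pos[i].copy(), trial % L
    dE = pair_energy(i, pos) - old
    if dE > 0 and rng.random() >= np.exp(-dE / T):
        pos[i] = saved                      # reject: restore old position
    return pos

for _ in range(1000):
    pos = mc_step(pos)
```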

    A model-based design flow for embedded vision applications on heterogeneous architectures

    The ability to gather information from images comes naturally to humans and is one of our principal means of understanding the external world. Computer vision (CV) is the process of extracting such knowledge from the visual domain in an algorithmic fashion. The computational power required to process this information is very high. Until recently, the only feasible way to meet non-functional requirements such as performance was to develop custom hardware, which is costly, time-consuming and cannot be reused for general-purpose computing. The recent introduction of low-power, low-cost heterogeneous embedded boards, in which CPUs are combined with heterogeneous accelerators such as GPUs, DSPs and FPGAs, combines the hardware efficiency needed for non-functional requirements with the flexibility of software development. Embedded vision is the term used for the application of the aforementioned CV algorithms in the embedded field, which usually must satisfy, in addition to functional requirements, non-functional requirements such as real-time performance, power, and energy efficiency. Rapid prototyping, early algorithm parametrization, testing, and validation of complex embedded video applications for such heterogeneous architectures is a very challenging task. This thesis presents a comprehensive framework that: 1) is based on a model-based paradigm. Unlike the standard state-of-the-art approaches, which require designers to manually model the algorithm in a programming language, the proposed approach allows for rapid prototyping, algorithm validation and parametrization in a model-based design environment (i.e., Matlab/Simulink). The framework relies on a multi-level design and verification flow by which the high-level model is semi-automatically refined towards the final automatic synthesis onto the target hardware device. 2) Relies on a polyglot parallel programming model. The proposed model combines different programming languages and environments, such as C/C++, OpenMP, PThreads, OpenVX, OpenCV, and CUDA, to best exploit different levels of parallelism while guaranteeing semi-automatic customization. 3) Optimizes application performance and energy efficiency through a novel algorithm for mapping and scheduling the application tasks onto the heterogeneous computing elements of the device. This algorithm, called exclusive earliest finish time (XEFT), takes into consideration the possible multiple implementations of tasks for different computing elements (e.g., a task primitive for the CPU and an equivalent parallel implementation for the GPU). It introduces and exploits the notion of exclusive overlap between primitives to improve load balancing. This thesis is the result of three years of research activity, during which all the incremental steps made to compose the framework were tested on real case studies.
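    XEFT itself is a contribution of the thesis and is not reproduced here; the toy mapper below only sketches the classic greedy earliest-finish-time idea it builds on, where each task has a different cost on each computing element and is placed where it finishes soonest. Tasks, dependencies and costs are invented for illustration:

```python
# Toy earliest-finish-time mapping of tasks onto heterogeneous processing
# elements (PEs). This is NOT the XEFT algorithm from the thesis -- just
# the greedy EFT idea it extends, with each task having a different cost
# per PE (e.g., a CPU primitive vs. an equivalent GPU implementation).
costs = {                  # hypothetical per-PE execution times
    'grayscale': {'cpu': 4, 'gpu': 1},
    'blur':      {'cpu': 6, 'gpu': 2},
    'edges':     {'cpu': 5, 'gpu': 2},
}
deps = {'grayscale': [], 'blur': ['grayscale'], 'edges': ['blur']}

ready_at = {'cpu': 0, 'gpu': 0}   # time at which each PE becomes free
finish = {}                       # finish time of each scheduled task

for task in ['grayscale', 'blur', 'edges']:        # topological order
    deps_done = max((finish[d] for d in deps[task]), default=0)
    # Pick the PE that gives the earliest finish time for this task.
    pe = min(ready_at,
             key=lambda p: max(ready_at[p], deps_done) + costs[task][p])
    start = max(ready_at[pe], deps_done)
    finish[task] = ready_at[pe] = start + costs[task][pe]
    print(f"{task:9s} -> {pe}  [{start}, {finish[task]}]")
```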

    Scalable Parallel Optimization of Digital Signal Processing in the Fourier Domain

    The aim of the research presented in this thesis is to study different approaches to the parallel optimization of digital signal processing algorithms and optical coherence tomography methods. The parallel approaches are based on multithreading for multi-core and many-core architectures. The thesis follows the process of designing and implementing the parallel algorithms and programs and their integration into optical coherence tomography systems. Evaluations of the performance and scalability of the proposed parallel solutions are presented. The digital signal processing considered in this thesis is divided into two groups. The first comprises commonly employed algorithms operating on digital signals in the Fourier domain, including the forward and inverse Fourier transforms, cross-correlation, convolution and others. The second involves optical coherence tomography methods, which incorporate the aforementioned algorithms and are used to generate cross-sectional, en-face and confocal images. Identifying the optimal parallel approaches to these methods allows improvements in the generated imagery in terms of both performance and content. The proposed parallel accelerations lead to the generation of comprehensive imagery in real time; providing detailed visual information in real time improves the utilization of optical coherence tomography systems, especially in areas such as ophthalmology.
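    One of the Fourier-domain building blocks mentioned above, cross-correlation, shows why these algorithms reward optimization: the convolution theorem turns an O(n^2) sum into O(n log n) transforms. Below is a minimal NumPy sketch (sizes and data are illustrative) that checks the FFT route against the direct definition:

```python
import numpy as np

def xcorr_fft(a, b):
    """Circular cross-correlation via the convolution theorem:
    corr(a, b) = IFFT(conj(FFT(a)) * FFT(b)). O(n log n) instead of
    the O(n^2) direct sum."""
    return np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)).real

# Verify against the direct O(n^2) definition on random data.
rng = np.random.default_rng(0)
a, b = rng.random(256), rng.random(256)
direct = np.array([np.sum(a * np.roll(b, -k)) for k in range(256)])
assert np.allclose(xcorr_fft(a, b), direct)
```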

    Overview and comparison of OpenCL and CUDA technology for GPGPU
