171 research outputs found

    A Comparative study and evaluation of parallel programming models for shared-memory parallel architectures

    Shared-memory parallel architectures have evolved, and new programming frameworks have appeared to exploit them: OpenMP, TBB, Cilk Plus, ArBB and OpenCL. This article focuses on the most widespread of these frameworks in commercial and scientific areas and presents a comparative study and an evaluation of them. The study covers several capabilities, such as task deployment, scheduling techniques and programming language abstractions. The evaluation measures three dimensions: code development complexity, performance, and efficiency, measured as speedup per watt. For this evaluation, several parallel benchmarks have been implemented with each framework. These benchmarks are designed to cover specific scenarios, such as regular memory access or irregular computation. The conclusions highlight that some frameworks (OpenMP, Cilk Plus) are better for quickly transforming sequential code, others (TBB) have a small footprint that is ideal for small problems, and others (OpenCL) suit heterogeneous architectures but require a very complex development process. The conclusions also show that vectorization support is more critical than multitasking for achieving efficiency in those problems where this approach fits. This work has been partially funded by the project “Input/Output Scalable Techniques for distributed and high-performance computing environments” of MINISTERIO DE CIENCIA E INNOVACIÓN, TIN2010-16497. The work of J. Daniel García has been funded by "FUNDACIÓN CAJAMADRID" through a grant for Mobility of Madrid Public Universities Professors.
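The efficiency metric named in this abstract, speedup per watt, can be illustrated with a minimal sketch. The abstract does not say how power was measured, so the use of an average power draw in watts is an assumption, as are the function names and the sample numbers.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Classic speedup: sequential run time divided by parallel run time."""
    if t_sequential <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_sequential / t_parallel


def speedup_per_watt(t_sequential: float, t_parallel: float,
                     avg_power_watts: float) -> float:
    """Speedup normalised by average power draw (an assumed definition)."""
    if avg_power_watts <= 0:
        raise ValueError("power must be positive")
    return speedup(t_sequential, t_parallel) / avg_power_watts


# Hypothetical numbers for illustration only, not taken from the paper.
example = speedup_per_watt(t_sequential=12.0, t_parallel=3.0,
                           avg_power_watts=80.0)
```

Under this definition, a framework that runs slightly slower but draws noticeably less power can still rank higher than a faster, more power-hungry one.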

    Applications in GNSS water vapor tomography

    Algebraic reconstruction algorithms are iterative algorithms used in many areas, including medicine, seismology and meteorology. These algorithms are known to be highly computationally intensive. This can be especially troublesome for real-time applications or when they run on conventional low-cost personal computers. One of these real-time applications is the reconstruction of water vapor images from Global Navigation Satellite System (GNSS) observations. Parallelizing algebraic reconstruction algorithms has the potential to reduce the required resources significantly, making it possible to obtain valid solutions in time for nowcasting and forecasting weather models. The main objective of this dissertation was to present and analyse diverse shared-memory libraries and techniques on CPU and GPU for algebraic reconstruction algorithms. It was concluded that parallelization pays off compared with sequential implementations. Overall, the GPU implementations were found to be only slightly faster than the CPU implementations, depending on the size of the problem being studied. A secondary objective was to develop software that performs the GNSS water vapor reconstruction using the implemented parallel algorithms. This software was developed successfully and tested with both synthetic and real data; the preliminary results were satisfactory. This dissertation was written in the Space & Earth Geodetic Analysis Laboratory (SEGAL) and was carried out in the framework of the Structure of Moist convection in high-resolution GNSS observations and models (SMOG) (PTDC/CTE-ATM/119922/2010) project funded by FCT. Fundação para a Ciência e a Tecnologia (FCT)
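For readers unfamiliar with algebraic reconstruction, the sketch below shows the textbook Kaczmarz row-projection iteration, the basic form of ART. It is a generic illustration in pure Python, not the dissertation's implementation; the function name, the relaxation parameter and the tiny test system are assumptions made for the example.

```python
def kaczmarz(A, b, sweeps=50, relaxation=1.0):
    """Approximately solve A x = b by cyclic row projections (Kaczmarz / ART)."""
    n_cols = len(A[0])
    x = [0.0] * n_cols
    for _ in range(sweeps):
        for row, b_i in zip(A, b):
            norm_sq = sum(a * a for a in row)
            if norm_sq == 0.0:
                continue  # an all-zero row carries no information
            # Residual of the current estimate against this one equation.
            residual = b_i - sum(a * x_j for a, x_j in zip(row, x))
            step = relaxation * residual / norm_sq
            # Project x onto the hyperplane defined by this row.
            x = [x_j + step * a for x_j, a in zip(x, row)]
    return x


# Tiny consistent system whose exact solution is x = [1, 2].
A = [[2.0, 1.0], [1.0, 3.0]]
b = [4.0, 7.0]
x = kaczmarz(A, b)
```

In this form the loop over rows is inherently sequential, while the dot product and the vector update inside it are data-parallel. Simultaneous variants such as SIRT instead combine the corrections from all rows in one step, which is one common way to expose more parallelism.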

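    Análisis del desempeño de aplicaciones paralelas con openMP en dispositivos móviles multicore, caso de estudio: multiplicación de matrices (Performance analysis of parallel applications with OpenMP on multicore mobile devices, case study: matrix multiplication)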

    This work analyses the performance of multicore processors in mobile devices when running a parallel application implemented with OpenMP and C. The multicore architecture has been microprocessor manufacturers' response to the energy-efficiency problems that arise when raising the clock frequency to increase the performance of single-core processors. This architecture combines several energy-efficient processing units in a single microprocessor. However, to exploit the potential of the set of cores, applications must be designed under the parallel computing paradigm. A multithreaded programming methodology proposed by Intel was applied to implement an application that multiplies matrices in parallel. This application was run on three different mobile devices. The results show that application performance increases as the number of cores taking part in the execution increases, with a system efficiency of at least 88% on a quad-core processor. Keywords: Android, NDK, OpenMP, Parallel Programming
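The row-wise decomposition that a parallel matrix multiplication typically relies on, and the efficiency figure quoted in this abstract, can be sketched as follows. This is an illustration in Python rather than the OpenMP and C code the work describes; the function names and the choice of a thread pool are assumptions. CPython threads do not speed up CPU-bound pure-Python code, so the sketch shows how the work is divided, not a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_rows(A, B, row_start, row_end):
    """Compute rows [row_start, row_end) of the product A * B."""
    cols_b = list(zip(*B))  # columns of B, for row-by-column dot products
    return [
        [sum(a * b for a, b in zip(A[i], col)) for col in cols_b]
        for i in range(row_start, row_end)
    ]


def parallel_matmul(A, B, workers=4):
    """Split the rows of A into contiguous chunks, one task per chunk."""
    n = len(A)
    chunk = max(1, (n + workers - 1) // workers)  # ceiling division
    ranges = [(s, min(s + chunk, n)) for s in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves task order, so the blocks come back in row order.
        blocks = list(pool.map(lambda r: multiply_rows(A, B, *r), ranges))
    return [row for block in blocks for row in block]


def parallel_efficiency(t_sequential, t_parallel, cores):
    """Efficiency = speedup / number of cores (1.0 means ideal scaling)."""
    return (t_sequential / t_parallel) / cores


A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8], [9, 10]]
C = parallel_matmul(A, B, workers=2)
```

With this definition of efficiency, the reported 88% on a quad-core processor corresponds to a speedup of about 0.88 × 4 = 3.52 over the single-core run.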

    Performance improvements of an atmospheric radiative transfer model on GPU-based platform using CUDA

    Classical applications of an Atmospheric Radiative Transfer Model (ARTM) for line-by-line modelling of the absorption coefficient in the atmosphere consume considerable computational time, from seconds up to a few minutes depending on the atmospheric characterization chosen. ARTM is used together with ground-based or satellite measurements to retrieve atmospheric parameters such as ozone, water vapour and temperature profiles. A Spectral Millimeter Wave Radiometer belonging to Nagoya University (Japan) has been deployed at the Atmospheric Observatory of Southern Patagonia (OAPA), in the Patagonian city of Río Gallegos, with the aim of retrieving stratospheric ozone profiles between 20 and 80 km. The instrument records around 2 GBytes of data per day, and the ozone profiles are retrieved using one-hour integrated spectral data, resulting in 24 profiles per day. Currently the data reduction is performed by the Laser and Application Research Center (CEILAP) group using the Matlab package ARTS/QPACK2. With the classical data reduction procedure, the estimated computational time per profile is 4 to 5 minutes, determined mainly by the computational time of the ARTM and matrix operations. In this work we propose a novel scheme to accelerate the ARTM using the multi-threading capabilities of GPGPU based on the Compute Unified Device Architecture (CUDA), and compare it with the existing schemes. The performance of the ARTM has been measured using various settings on an NVIDIA GeForce GTX 560 graphics card (Compute Capability 2.1). The execution times of the sequential, OpenMP and CUDA modes are compared in this paper. XV Workshop de Procesamiento Distribuido y Paralelo (WPDP). Red de Universidades con Carreras en Informática (RedUNCI)

    RetScan: efficient fovea and optic disc detection in retinographies

    Master's dissertation in Informatics Engineering. The fovea and optic disc are anatomical eye structures relevant to diagnosing various diseases. Their automatic detection can both reduce the cost of screening large populations and improve the effectiveness of ophthalmologists and optometrists. This dissertation describes a methodology to detect these structures automatically and analyses a CPU-only MATLAB implementation of it. RetScan is a port of this methodology to an environment built on free (open source) tools; its functionality and performance are evaluated and compared with the original. The results of both evaluations lead to a discussion of possible improvements to the methodology that affect functionality and performance. The resulting improvements are implemented and integrated in RetScan. To further improve performance, a parallelization of RetScan that takes advantage of a multi-core architecture or a CUDA-enabled accelerator was designed, coded and evaluated. This evaluation reveals that RetScan achieves its best throughput using a multi-core architecture only and analysing several images at once. For a single image, using multi-core only is also the best solution, but with a small speed-up. CUDA-enabled accelerators are not recommended for this scope, as the images are small and the cost of transferring data to and from the accelerator has a severe impact on performance.

    Efficient Computation of K-Nearest Neighbor Graphs for Large High-Dimensional Data Sets on GPU Clusters

    The k-Nearest Neighbor Graph (k-NNG) and the related k-Nearest Neighbor (k-NN) methods have a wide variety of applications in areas such as bioinformatics, machine learning, data mining, clustering analysis, and pattern recognition. Our application of interest is manifold embedding. Due to the large dimensionality of the input data (<15k), spatial-subdivision techniques such as OBBs, k-d trees and BSP are not viable. The only alternative is brute-force search, which has two distinct parts. The first computes distances between individual vectors in the corpus based on a pre-defined metric. Given the distance matrix, the second selects the k nearest neighbors for each member of the query data set. This thesis presents the development and implementation of a distributed exact k-NNG construction method. The proposed method uses Graphics Processing Units (GPUs) and exploits multiple levels of parallelism in distributed computational systems. It scales across cluster sizes, with each compute node in the cluster containing multiple GPUs. The distance computation is formulated as a basic matrix multiplication and reduction operation, for which the optimized CUBLAS matrix multiplication library is used. Various distance metrics, such as Euclidean, cosine, and Pearson, are supported. For k-NNG construction, two different methods are presented. The first is based on an approach called batch index sorting, which builds the k-NNG with three sorting operations and uses the optimized radix sort implementation in the Thrust library for GPU. The second is an efficient implementation, using the latest GPU functionalities, of a variant of the quick select algorithm. Overall, the batch index sorting based k-NNG method is approximately 13x faster than a distributed MATLAB implementation. The quick select algorithm itself has a 5x speedup over state-of-the-art GPU methods. This has enabled k-NNG construction on a data set containing 20 million image vectors, each with dimension 15,000, as part of a manifold embedding technique for analyzing the conformations of biomolecules.
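The two-step brute-force search described in this abstract can be sketched on the CPU in pure Python. The first step uses the standard identity ||q − c||² = ||q||² + ||c||² − 2 q·c, which turns the distance computation into a matrix product plus a reduction. The second step here uses `heapq.nsmallest` as a simple stand-in for selection; it is not the thesis's batch index sorting or quick select variant, and all names below are hypothetical.

```python
import heapq


def squared_distances(queries, corpus):
    """Squared Euclidean distances via ||q||^2 + ||c||^2 - 2 q.c."""
    q_norms = [sum(v * v for v in q) for q in queries]
    c_norms = [sum(v * v for v in c) for c in corpus]
    # The Gram matrix queries * corpus^T is the matrix-multiplication step.
    gram = [[sum(a * b for a, b in zip(q, c)) for c in corpus]
            for q in queries]
    # Clamp at zero to absorb small negative values from rounding.
    return [
        [max(q_norms[i] + c_norms[j] - 2.0 * gram[i][j], 0.0)
         for j in range(len(corpus))]
        for i in range(len(queries))
    ]


def knn_indices(dist_rows, k):
    """For each row, indices of the k smallest distances, nearest first."""
    return [
        heapq.nsmallest(k, range(len(row)), key=row.__getitem__)
        for row in dist_rows
    ]


corpus = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
dists = squared_distances(corpus, corpus)
neighbours = knn_indices(dists, k=2)
```

Because the corpus is queried against itself in this example, each point's nearest neighbour is itself at distance zero; a graph builder would normally skip that entry or request k + 1 neighbours.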

    Parallelization Strategies for Modern Computing Platforms: Application to Illustrative Image Processing and Computer Vision Applications

    With the improvement of both hardware and software technology, high-performance parallel platforms have started a new computing era, and they are here to stay. Parallel platforms are found in High Performance Computing (HPC) systems as well as in embedded computers. Recently, both HPC and embedded computers have been moving toward heterogeneous computing platforms that employ both Central Processing Units (CPUs) and Graphics Processing Units (GPUs) to achieve the highest performance. Programming efficiently for parallel platforms brings new opportunities but also several challenges. Therefore, industry needs help from the research community to succeed in its recent dramatic shift to parallel computing. Parallel programming presents several major challenges, which are equally present whether one programs on a many-core GPU or on a multi-core CPU. Three of the main challenges are: (1) finding the best parallel platform for a given application, (2) selecting the best parallelization strategy, and (3) tuning performance to exploit the parallel platforms efficiently. In this context, the overall objective of our research is to propose new solutions that help designers to program complex applications efficiently on modern parallel architectures. The contributions of this thesis are: 1. The evaluation of the efficiency of several target parallel platforms in speeding up compute-intensive applications. 2. The quantitative analysis of parallelization and implementation strategies on multi-core CPUs and many-core GPUs. 3. The definition and implementation of a new performance tuning framework for heterogeneous parallel platforms. The contributions were validated using real, illustrative compute-intensive applications and a varied set of modern parallel platforms based on multi-core CPUs and many-core GPUs.