57 research outputs found

    Parallelization Strategies for Modern Computing Platforms: Application to Illustrative Image Processing and Computer Vision Applications

    With improvements in both hardware and software technology, parallel platforms have started a new computing era, and they are here to stay. They are found in both high-performance computing (HPC) and embedded systems. Recently, both domains have been moving toward heterogeneous computing platforms, which employ several types of processors, most commonly Central Processing Units (CPUs) and Graphics Processing Units (GPUs), to achieve the highest performance. Programming efficiently for parallel platforms brings new opportunities but also several challenges. Therefore, industry needs help from the research community to succeed in its recent dramatic shift to parallel computing. Parallel programming presents several major challenges, which are equally present whether one programs a many-core GPU or a multi-core CPU. Three of the main challenges are: (1) finding the best platform to provide the required acceleration, (2) selecting the best parallelization strategy, and (3) tuning performance to leverage the parallel platforms efficiently. In this context, the overall objective of our research is to propose new solutions that help designers efficiently program complex applications on modern parallel architectures. The contributions of this thesis are: 1. The evaluation of the efficiency of several target parallel platforms in speeding up compute-intensive applications. 2. A quantitative analysis of parallelization and implementation strategies on multi-core CPUs and many-core GPUs. 3. The definition and implementation of a new performance tuning framework for heterogeneous parallel platforms. The contributions were validated using real compute-intensive applications and a varied set of modern parallel platforms based on multi-core CPUs and many-core GPUs.

    Generic Techniques in General Purpose GPU Programming with Applications to Ant Colony and Image Processing Algorithms

    In 2006 NVIDIA introduced a new unified GPU architecture facilitating general-purpose computation on the GPU. The following year NVIDIA introduced CUDA, a parallel programming architecture for developing general-purpose applications for direct execution on the new unified GPU. CUDA exposes the GPU's massively parallel architecture so that parallel code can be written to execute much faster than its sequential counterpart. Although CUDA abstracts the underlying architecture, fully utilising and scheduling the GPU is non-trivial and has given rise to a new active area of research. Because of the inherent complexities of GPU development, in this thesis we explore and find efficient parallel mappings of existing and new parallel algorithms on the GPU using NVIDIA CUDA. We place particular emphasis on metaheuristics, image processing, and designing reusable techniques and mappings that can be applied to other problems and domains. We begin by focusing on Ant Colony Optimisation (ACO), a nature-inspired heuristic approach for solving optimisation problems. We present a versatile, improved data-parallel approach for solving the Travelling Salesman Problem using ACO, resulting in significant speedups. Extending our initial work, we show that existing mappings of ACO on the GPU are unable to compete against their sequential counterpart when common CPU optimisation strategies are employed, and we detail three distinct candidate set parallelisation strategies for execution on the GPU. By further extending our data-parallel approach, we present the first implementation of an ACO-based edge detection algorithm on the GPU, which reduces the execution time and improves the viability of ACO-based edge detection. We finish by presenting a new color edge detection technique using the volume of a pixel in the HSI color space, along with a parallel GPU implementation that is able to withstand greater levels of noise than existing algorithms.

    A model-based design flow for embedded vision applications on heterogeneous architectures

    The ability to gather information from images is straightforward for humans and is one of the principal inputs for understanding the external world. Computer vision (CV) is the process of extracting such knowledge from the visual domain in an algorithmic fashion. The computational power required to process this information is very high. Until recently, the only feasible way to meet non-functional requirements such as performance was to develop custom hardware, which is costly, time-consuming, and cannot be reused for general purposes. The recent introduction of low-power, low-cost heterogeneous embedded boards, in which CPUs are combined with heterogeneous accelerators such as GPUs, DSPs, and FPGAs, can combine the hardware efficiency needed for non-functional requirements with the flexibility of software development. Embedded vision is the term for the application of CV algorithms in the embedded field, which usually must satisfy not only functional requirements but also non-functional requirements such as real-time performance, power, and energy efficiency. Rapid prototyping, early algorithm parametrization, testing, and validation of complex embedded video applications for such heterogeneous architectures is a very challenging task. This thesis presents a comprehensive framework that: 1) Is based on a model-based paradigm. Unlike standard state-of-the-art approaches, which require designers to manually model the algorithm in a programming language, the proposed approach allows rapid prototyping, algorithm validation, and parametrization in a model-based design environment (i.e., Matlab/Simulink). The framework relies on a multi-level design and verification flow in which the high-level model is semi-automatically refined towards the final automatic synthesis into the target hardware device. 2) Relies on a polyglot parallel programming model. The proposed model combines different programming languages and environments such as C/C++, OpenMP, PThreads, OpenVX, OpenCV, and CUDA to best exploit different levels of parallelism while guaranteeing a semi-automatic customization. 3) Optimizes the application performance and energy efficiency through a novel algorithm for the mapping and scheduling of the application tasks on the heterogeneous computing elements of the device. This algorithm, called exclusive earliest finish time (XEFT), takes into consideration the possible multiple implementations of tasks for different computing elements (e.g., a task primitive for the CPU and an equivalent parallel implementation for the GPU). It introduces and takes advantage of the notion of exclusive overlap between primitives to improve load balancing. This thesis is the result of three years of research activity, during which all the incremental steps made to compose the framework have been tested on real case studies.
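
The mapping step mentioned in the abstract above can be illustrated with a minimal sketch. It is a plain greedy earliest-finish-time heuristic, not the XEFT algorithm itself: the exclusive-overlap notion and data-transfer costs are left out, and the function name, task names, device names, and execution times are all hypothetical.

```python
from typing import Dict, List, Tuple

Placement = Tuple[str, float, float]  # (device, start time, finish time)


def eft_schedule(
    tasks: List[str],
    deps: Dict[str, List[str]],
    cost: Dict[str, Dict[str, float]],
) -> Dict[str, Placement]:
    """Greedy earliest-finish-time mapping (illustrative only).

    tasks: task names, assumed to be given in a valid topological order.
    deps:  task -> list of predecessor tasks.
    cost:  task -> {device: execution time}; a task may have an
           implementation for only some of the devices.
    """
    device_free: Dict[str, float] = {}   # time at which each device is idle
    placed: Dict[str, Placement] = {}
    for task in tasks:
        # A task may start only after all of its predecessors have finished.
        ready = max((placed[p][2] for p in deps.get(task, [])), default=0.0)
        best = None
        for device, duration in cost[task].items():
            start = max(ready, device_free.get(device, 0.0))
            finish = start + duration
            if best is None or finish < best[2]:
                best = (device, start, finish)
        placed[task] = best
        device_free[best[0]] = best[2]
    return placed


# Hypothetical four-task vision pipeline with CPU and GPU implementations.
tasks = ["load", "blur", "edges", "merge"]
deps = {"blur": ["load"], "edges": ["load"], "merge": ["blur", "edges"]}
cost = {
    "load": {"cpu": 2.0},
    "blur": {"cpu": 6.0, "gpu": 2.0},
    "edges": {"cpu": 3.0, "gpu": 2.0},
    "merge": {"cpu": 1.0},
}
schedule = eft_schedule(tasks, deps, cost)
for name in tasks:
    print(name, schedule[name])
```

In this toy input, "edges" has a faster GPU implementation, yet the heuristic places it on the CPU because the GPU is still busy with "blur". Choosing among several implementations of the same task is the situation the abstract describes.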

    Radial Basis Functions: Biomedical Applications and Parallelization

    A radial basis function (RBF) is a real-valued function whose values depend only on the distances between an interpolation point and a set of user-specified points called centers. RBF interpolation is one of the primary methods for reconstructing functions from multi-dimensional scattered data. Its ability to generalize to arbitrary space dimensions and to provide spectral accuracy has made it particularly popular in different application areas, including but not limited to numerical solution of partial differential equations (PDEs), image processing, computer vision and graphics, and deep learning and neural networks. The present thesis discusses three applications of RBF interpolation in biomedical engineering: (1) calcium dynamics modeling, in which we numerically solve a set of PDEs by using meshless numerical methods and RBF-based interpolation techniques; (2) image restoration and transformation, where an image is restored from its triangular mesh representation or transformed from its original form under translation, rotation, scaling, and similar operations; (3) porous structure design, in which RBF interpolation is used to reconstruct a 3D volume containing porous structures from a set of regularly or randomly placed points inside a user-provided surface shape. All three applications have been investigated, and their effectiveness is supported by numerous experimental results. In particular, we use anisotropic distance metrics to define the distance in RBF interpolation and apply them to the second and third applications, where they show significant improvement in preserving image features and capturing connected porous structures over the isotropic distance-based RBF method. Besides the algorithm designs and their applications in biomedical areas, we also explore several common parallelization techniques (including OpenMP and CUDA-based GPU programming) to accelerate the present algorithms. In particular, we analyze how parallel programming can help RBF interpolation speed up the meshless PDE solver as well as image processing. While RBFs have been widely used in various science and engineering fields, this thesis is expected to attract more interest from computational scientists and students to this fast-growing area, and specifically to the application of these techniques to biomedical problems such as the ones investigated in the present work.
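
The definition of RBF interpolation given in the abstract above can be made concrete with a small sketch. It assumes a Gaussian kernel with an isotropic Euclidean distance and a hypothetical shape parameter `eps`; the thesis's anisotropic metrics and parallel implementations are not reproduced, and all function names are illustrative.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def gaussian(r: float, eps: float = 1.0) -> float:
    """Gaussian radial kernel; its value depends only on the distance r."""
    return math.exp(-((eps * r) ** 2))


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def rbf_fit(centers: List[Point], values: List[float], eps: float = 1.0) -> List[float]:
    """Find weights w such that sum_j w_j * phi(|x_i - x_j|) = values[i]."""
    a = [[gaussian(math.dist(p, q), eps) for q in centers] for p in centers]
    return solve(a, values)


def rbf_eval(x: Point, centers: List[Point], weights: List[float], eps: float = 1.0) -> float:
    """Evaluate the interpolant at an arbitrary point x."""
    return sum(w * gaussian(math.dist(x, c), eps) for w, c in zip(weights, centers))


# Scattered samples of f(x, y) = x + y on the corners of the unit square.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
values = [0.0, 1.0, 1.0, 2.0]
weights = rbf_fit(centers, values)
print(rbf_eval((0.5, 0.5), centers, weights))
```

The interpolant reproduces the sampled values at the centers. The dense linear solve and the per-point evaluation loop are the parts that grow costly with many centers, which is the kind of work the abstract proposes to parallelize.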

    Extending OpenVX for Model-based Design of Embedded Vision Applications

    Developing computer vision applications for low-power heterogeneous systems is gaining increasing interest in the embedded systems community. Even more interesting is the tuning of such embedded software for the target architecture when this is driven by multiple constraints (e.g., performance, peak power, energy consumption). Indeed, developers frequently run into system-level inefficiencies and bottlenecks that cannot be quickly addressed by traditional methods. In this context, OpenVX has been proposed as the standard platform for developing portable, optimized, and power-efficient vision applications targeting embedded systems. Nevertheless, adopting OpenVX for rapid prototyping, early algorithm parametrization, and validation of complex embedded applications is a very challenging task. This paper presents a methodology for integrating a model-based design environment with OpenVX. The methodology allows Matlab/Simulink to be applied to the model-based design, parametrization, and validation of computer vision applications. It then allows the automatic synthesis of the application model into an OpenVX description for hardware- and constraint-aware application tuning. Experiments were conducted with a digital image stabilization application developed in Simulink and then automatically synthesized into OpenVX-VisionWorks code for an NVIDIA Jetson TX1 board.

    Efficient Algorithms for Large-Scale Image Analysis

    This work develops highly efficient algorithms for analyzing large images. Applications include object-based change detection and screening. The algorithms are 10-100 times as fast as existing software, sometimes even outperforming FPGA/GPU hardware, because they are designed to suit the computer architecture. This thesis describes the implementation details and the underlying algorithm engineering methodology, so that both may also be applied to other applications.

    Programming issues for video analysis on Graphics Processing Units

    Video processing is the part of signal processing in which the input and/or output signals are video sequences. It covers a wide variety of applications that are, in general, compute-intensive because of their algorithmic complexity. Moreover, many of these applications require real-time operation. Meeting these requirements makes it necessary to use hardware accelerators such as Graphics Processing Units (GPUs). General-purpose processing on GPUs has been a successful trend in high-performance computing since the launch of the NVIDIA CUDA architecture and programming model. This doctoral thesis deals with the efficient parallelization of video processing applications on GPUs. This goal is approached from two angles: on the one hand, programming the GPU appropriately for video applications; on the other hand, treating the GPU as part of a heterogeneous system. Since video sequences consist of frames, which are regular data structures, many components of video applications are inherently parallelizable. However, other components are irregular in the sense that they perform workload-dependent computations, suffer from write contention, or contain inherently sequential or load-imbalanced parts. This thesis proposes strategies to address these aspects through several case studies. It also describes an optimized approach to histogram computation based on a memory performance model. Video sequences are continuous streams that must be transferred from the host (CPU) to the device (GPU), and the results from the device back to the host. This doctoral thesis proposes using CUDA streams to implement the stream processing paradigm on the GPU, in order to control the concurrent execution of data transfers and computation. It also proposes performance models that enable optimal execution.

    Integrating Simulink, OpenVX, and ROS for Model-Based Design of Embedded Vision Applications

    OpenVX is increasingly gaining consensus as the standard platform for developing portable, optimized, and power-efficient embedded vision applications. Nevertheless, adopting OpenVX for rapid prototyping, early algorithm parametrization, and validation of complex embedded applications is a very challenging task. This paper presents a comprehensive framework that integrates Simulink, OpenVX, and ROS for model-based design of embedded vision applications. The framework allows Matlab/Simulink to be applied to the model-based design, parametrization, and validation of computer vision applications. It then allows the automatic synthesis of the application model into an OpenVX description for hardware- and constraint-aware application tuning. Finally, the methodology allows the OpenVX application to be integrated with the Robot Operating System (ROS), the de facto reference standard for developing robotic software applications. The OpenVX-ROS interface allows the application to be co-simulated and parametrized in the actual robotic environment and reused in any ROS-compliant system. Experiments were conducted with two real case studies: an application for digital image stabilization and the ORB descriptor for simultaneous localization and mapping (SLAM), both developed in Simulink and then automatically synthesized into OpenVX-VisionWorks code for an NVIDIA Jetson TX2 board.