20 research outputs found

    Scheduling Dynamic OpenMP Applications over Multicore Architectures

    Approaching the theoretical performance of hierarchical multicore machines requires a very careful distribution of threads and data among the underlying non-uniform architecture in order to minimize cache misses and NUMA penalties. While it is acknowledged that OpenMP can enhance the quality of thread scheduling on such architectures in a portable way, by transmitting precious information about the affinities between threads and data to the underlying runtime system, most OpenMP runtime systems are actually unable to efficiently support highly irregular, massively parallel applications on NUMA machines. In this paper, we present a thread scheduling policy suited to the execution of OpenMP programs featuring irregular and massive nested parallelism over hierarchical architectures. Our policy enforces a distribution of threads that maximizes the proximity of threads belonging to the same parallel section, and uses a NUMA-aware work stealing strategy when load balancing is needed. It has been developed as a plug-in to the ForestGOMP OpenMP platform. We demonstrate the efficiency of our approach with a highly irregular recursive OpenMP program resulting from the generic parallelization of a surface reconstruction application. We achieve a speedup of 14 on a 16-core machine with no application-level optimization.
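
    For readers unfamiliar with the pattern, the following sketch (a generic illustration, not taken from the paper or from ForestGOMP) shows irregular nested parallelism expressed with nested OpenMP parallel regions; it is this kind of data-dependent team structure that the proposed scheduling policy has to place carefully on the machine hierarchy.

        #include <omp.h>
        #include <stdio.h>

        /* Minimal sketch of irregular nested parallelism: an outer team whose
         * members each open an inner parallel region of data-dependent size.
         * A hierarchy-aware runtime should keep each inner team's threads close
         * together (same socket / NUMA node) and steal work NUMA-aware otherwise. */
        int main(void)
        {
            omp_set_nested(1);   /* deprecated since OpenMP 5.0 in favor of omp_set_max_active_levels */

            #pragma omp parallel num_threads(4)
            {
                int outer = omp_get_thread_num();
                int inner_size = outer + 1;          /* irregular fan-out: 1, 2, 3, 4 threads */

                #pragma omp parallel num_threads(inner_size)
                printf("outer %d -> inner %d of %d\n",
                       outer, omp_get_thread_num(), inner_size);
            }
            return 0;
        }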

    Learning from the Success of MPI

    The Message Passing Interface (MPI) has been extremely successful as a portable way to program high-performance parallel computers. This success has occurred in spite of the view of many that message passing is difficult and that other approaches, including automatic parallelization and directive-based parallelism, are easier to use. This paper argues that MPI has succeeded because it addresses all of the important issues in providing a parallel programming model.
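
    As a reminder of what the message-passing model asks of the programmer, here is a minimal point-to-point example in C (illustrative only, not taken from the paper); run it with two processes, e.g. mpirun -np 2.

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);

            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            if (rank == 0) {
                int payload = 42;
                /* Explicit send: the programmer states what moves and to whom. */
                MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                int payload;
                MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("rank 1 received %d from rank 0 (world size %d)\n", payload, size);
            }

            MPI_Finalize();
            return 0;
        }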

    On the adequacy of lightweight thread approaches for high-level parallel programming models

    High-level parallel programming models (PMs) are becoming crucial in order to extract the computational power of current on-node multi-threaded parallelism. The most popular PMs, such as OpenMP or OmpSs, are directive-based: the complexity of the hardware is hidden by the underlying runtime system, improving coding productivity. Implementations of OpenMP usually rely on POSIX threads (pthreads), offering excellent performance for coarse-grained parallelism and a perfect match with current hardware. OmpSs is a task-oriented PM based on an ad hoc runtime solution called Nanos++; it is the precursor of the tasking parallelism in the OpenMP specification. A recent trend in runtimes and applications points to leveraging massive on-node parallelism in conjunction with fine-grained and dynamic scheduling paradigms. In this paper we analyze the behavior of the OpenMP and OmpSs PMs on top of the recently emerged Generic Lightweight Threads (GLT) API. GLT exposes a common API for lightweight thread (LWT) libraries that makes it possible to run the same application over different native LWT solutions. We describe the design details of these high-level PMs implemented on top of GLT and analyze different scenarios in order to assess where the use of LWTs may benefit application performance. Our work reveals the scenarios in which LWTs outperform pthread-based solutions and compares the performance of an ad hoc solution against a generic implementation.

    The researchers from the Universitat Jaume I de Castelló were supported by project TIN2014-53495-R of MINECO and FEDER (Spain) and by the Generalitat Valenciana fellowship programme Vali+d 2015. Antonio J. Peña is co-financed by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva fellowship number IJCI-2015-23266. This work was partially supported by the U.S. Dept. of Energy, Office of Science, Office of Advanced Scientific Computing Research (SC-21), under contract DE-AC02-06CH11357. We gratefully acknowledge Enrique S. Quintana-Ortí (Universitat Jaume I) and Sangmin Seo (Samsung Corp.) for their advice in this work, and the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory.
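
    The granularity argument can be made concrete with a small pthreads sketch (hypothetical, unrelated to the GLT API itself): creating one OS thread per tiny task, as below, is exactly the regime in which user-level LWT libraries are expected to pay off, since they multiplex many cheap user-level threads over a few OS threads.

        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define NTASKS 10000

        /* Each "task" does almost no work, so the cost is dominated by thread
         * creation and join, overhead that LWT libraries largely avoid. */
        static void *tiny_task(void *arg)
        {
            long *counter = arg;
            (*counter)++;
            return NULL;
        }

        int main(void)
        {
            static long counters[NTASKS];
            pthread_t tid;

            for (long i = 0; i < NTASKS; i++) {
                if (pthread_create(&tid, NULL, tiny_task, &counters[i]) != 0) {
                    perror("pthread_create");
                    return EXIT_FAILURE;
                }
                pthread_join(tid, NULL);   /* serial join keeps the example simple */
            }

            printf("ran %d one-shot pthreads for trivially small tasks\n", NTASKS);
            return EXIT_SUCCESS;
        }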

    Object tracking using a many-core embedded system

    Object localization and tracking is essential for many practical applications, such as human-computer interaction, security and surveillance, robot competitions, and Industry 4.0. Because of the large amount of data present in an image, and the algorithmic complexity involved, this task can be computationally demanding, mainly for traditional embedded systems, due to their processing and storage limitations. This calls for investigation and experimentation with new approaches, such as emerging heterogeneous embedded systems, which promise higher performance without compromising energy efficiency. This work explores several real-time color-based object tracking techniques, applied to images supplied by an RGB-D sensor attached to different embedded platforms. The main motivation was to explore a heterogeneous Parallella board with a 16-core Epiphany coprocessor in order to reduce image processing time. Another goal was to compare this platform with more conventional embedded systems, namely the popular Raspberry Pi family. To this end, several processing options were pursued, from low-level implementations specially tailored to the Parallella to higher-level multi-platform approaches. The results allow us to conclude that the programming effort required to use the Epiphany coprocessor efficiently is considerable. Also, for the selected case study, the performance attained was below that offered by simpler approaches running on quad-core Raspberry Pi boards.
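
    The core of color-based tracking can be sketched independently of any platform (a hypothetical kernel, not the Parallella/Epiphany or Raspberry Pi implementations used in the work): threshold the pixels of a frame against a target color range and track the centroid of the matching region. The per-pixel loop below, here annotated with a plain OpenMP reduction, is the part each platform would parallelize in its own way.

        #include <stdio.h>
        #include <stdint.h>

        typedef struct { uint8_t r, g, b; } pixel_t;

        /* Count pixels whose RGB values fall inside a target range and return
         * their centroid.  Real trackers usually work in HSV and add filtering,
         * but the data-parallel structure is the same. */
        static int track_centroid(const pixel_t *img, int width, int height,
                                  pixel_t lo, pixel_t hi, float *cx, float *cy)
        {
            long sum_x = 0, sum_y = 0, hits = 0;

            #pragma omp parallel for reduction(+:sum_x, sum_y, hits)
            for (int y = 0; y < height; y++) {
                for (int x = 0; x < width; x++) {
                    pixel_t p = img[y * width + x];
                    if (p.r >= lo.r && p.r <= hi.r &&
                        p.g >= lo.g && p.g <= hi.g &&
                        p.b >= lo.b && p.b <= hi.b) {
                        sum_x += x; sum_y += y; hits++;
                    }
                }
            }
            if (hits == 0)
                return 0;                      /* object not visible in this frame */
            *cx = (float)sum_x / hits;
            *cy = (float)sum_y / hits;
            return 1;
        }

        int main(void)
        {
            enum { W = 64, H = 48 };
            static pixel_t img[W * H];         /* black frame with one small red blob */
            for (int i = 0; i < 10; i++)
                img[10 * W + 20 + i] = (pixel_t){ 200, 10, 10 };

            pixel_t lo = { 150, 0, 0 }, hi = { 255, 60, 60 };
            float cx, cy;
            if (track_centroid(img, W, H, lo, hi, &cx, &cy))
                printf("object centroid: (%.1f, %.1f)\n", cx, cy);
            return 0;
        }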

    Advanced synchronization techniques for task-based runtime systems

    Task-based programming models like OmpSs-2 and OpenMP provide a flexible data-flow execution model to exploit dynamic, irregular and nested parallelism. Providing an efficient implementation that scales well with small-granularity tasks remains a challenge, and bottlenecks can manifest in several runtime components. In this paper, we analyze the limiting factors in the scalability of a task-based runtime system and propose individual solutions for each of the challenges, including a wait-free dependency system and a novel scalable scheduler design based on delegation. We evaluate how the optimizations impact the overall performance of the runtime, both individually and in combination. We also compare the resulting runtime against state-of-the-art OpenMP implementations, showing equivalent or better performance, especially for fine-grained tasks.

    This project is supported by the European Union's Horizon 2020 Research and Innovation programme under grant agreement No. 754304 (DEEP-EST), by the Spanish Ministry of Science and Innovation (contracts PID2019-107255GB and TIN2015-65316P) and by the Generalitat de Catalunya (2017-SGR-1414).
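
    The data-flow execution model discussed above can be illustrated with standard OpenMP task dependences (a generic sketch, not OmpSs-2 syntax or the paper's benchmarks); when many such fine-grained tasks are created, the dependency system and the scheduler are precisely the runtime components whose scalability the paper addresses.

        #include <stdio.h>

        #define N 1000

        int main(void)
        {
            double a[N], b[N], c[N];

            #pragma omp parallel
            #pragma omp single                  /* one thread creates all tasks */
            for (int i = 0; i < N; i++) {
                /* Producer task: writes a[i] and b[i]. */
                #pragma omp task depend(out: a[i], b[i]) firstprivate(i)
                {
                    a[i] = i;
                    b[i] = 2.0 * i;
                }
                /* Consumer task: the dependency system defers it until the
                 * producer for the same i has finished. */
                #pragma omp task depend(in: a[i], b[i]) depend(out: c[i]) firstprivate(i)
                c[i] = a[i] + b[i];
            }
            /* The implicit barrier at the end of the parallel region waits for all tasks. */

            printf("c[%d] = %.1f\n", N - 1, c[N - 1]);
            return 0;
        }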

    Factores de rendimiento en entornos multicore

    This work reflects a research study on the detection of factors that affect performance in multicore environments. Due to the wide variety of multicore architectures, we defined a bounded framework consisting of a specific architecture, a programming model based on data parallelism, and Single Program Multiple Data applications. Having defined the framework, we evaluated the performance factors, paying special attention to the programming model. To this end, we analyzed the threads library and the OpenMP API to detect those functions that are candidates for tuning, in that they allow the application to adapt its behavior to the computing environment and, when used appropriately, improve application performance.
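
    A generic example of the kind of tunable knob the study looks for (illustrative only, not taken from the study) is the loop schedule of a data-parallel SPMD kernel: OpenMP lets the schedule and thread count be chosen at run time, so they can be adapted to the target multicore without modifying the kernel.

        #include <omp.h>
        #include <stdio.h>

        #define N (1 << 20)

        int main(void)
        {
            static double x[N];

            /* Tunable knobs: the number of threads and the loop schedule can be
             * changed without touching the kernel, either from the environment
             * (OMP_NUM_THREADS, OMP_SCHEDULE) or through the API calls below. */
            omp_set_num_threads(4);
            omp_set_schedule(omp_sched_dynamic, 1024);   /* dynamic schedule, chunk size 1024 */

            #pragma omp parallel for schedule(runtime)
            for (int i = 0; i < N; i++)
                x[i] = (double)i * 0.5;                  /* data-parallel SPMD work */

            printf("x[%d] = %.1f\n", N - 1, x[N - 1]);
            return 0;
        }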