93 research outputs found

    OpenCL + OpenSHMEM Hybrid Programming Model for the Adapteva Epiphany Architecture

    Full text link
    There is interest in exploring hybrid OpenSHMEM + X programming models to extend the applicability of the OpenSHMEM interface to more hardware architectures. We present a hybrid OpenCL + OpenSHMEM programming model for device-level programming for architectures like the Adapteva Epiphany many-core RISC array processor. The Epiphany architecture comprises a 2D array of low-power RISC cores with minimal uncore functionality connected by a 2D mesh Network-on-Chip (NoC). The Epiphany architecture offers high computational energy efficiency for integer and floating point calculations as well as parallel scalability. The Epiphany-III is available as a coprocessor in platforms that also utilize an ARM CPU host. OpenCL provides good functionality for supporting a co-design programming model in which the host CPU offloads parallel work to a coprocessor. However, the OpenCL memory model is inconsistent with the Epiphany memory architecture and lacks support for inter-core communication. We propose a hybrid programming model in which OpenSHMEM provides a better solution by replacing the non-standard OpenCL extensions introduced to achieve high performance with the Epiphany architecture. We demonstrate the proposed programming model for matrix-matrix multiplication based on Cannon's algorithm showing that the hybrid model addresses the deficiencies of using OpenCL alone to achieve good benchmark performance.Comment: 12 pages, 5 figures, OpenSHMEM 2016: Third workshop on OpenSHMEM and Related Technologie

    EMVS: Embedded Multi Vector-core System

    Get PDF
    With the increase in the density and performance of digital electronics, the demand for a power-efficient high-performance computing (HPC) system has been increased for embedded applications. The existing embedded HPC systems suffer from issues like programmability, scalability, and portability. Therefore, a parameterizable and programmable high-performance processor system architecture is required to execute the embedded HPC applications. In this work, we proposed an Embedded Multi Vector-core System (EMVS) which executes the embedded application by managing the multiple vectorized tasks and their memory operations. The system is designed and ported on an Altera DE4 FPGA development board. The performance of EMVS is compared with the Heterogeneous Multi-Processing Odroid XU3, Parallela and GPU Jetson TK1 embedded systems. In contrast to the embedded systems, the results show that EMVS improves 19.28 and 10.22 times of the application and system performance respectively and consumes 10.6 times less energy.Peer ReviewedPostprint (author's final draft

    High level programming abstractions for leveraging hierarchical memories with micro-core architectures

    Get PDF
    Micro-core architectures combine many low memory, low power computing cores together in a single package. These are attractive for use as accelerators but due to limited on-chip memory and multiple levels of memory hierarchy, the way in which programmers offload kernels needs to be carefully considered. In this paper we use Python as a vehicle for exploring the semantics and abstractions of higher level programming languages to support the offloading of computational kernels to these devices. By moving to a pass by reference model, along with leveraging memory kinds, we demonstrate the ability to easily and efficiently take advantage of multiple levels in the memory hierarchy, even ones that are not directly accessible to the micro-cores. Using a machine learning benchmark, we perform experiments on both Epiphany-III and MicroBlaze based micro-cores, demonstrating the ability to compute with data sets of arbitrarily large size. To provide context of our results, we explore the performance and power efficiency of these technologies, demonstrating that whilst these two micro-core technologies are competitive within their own embedded class of hardware, there is still a way to go to reach HPC class GPUs.Comment: Accepted manuscript of paper in Journal of Parallel and Distributed Computing 13

    Compact Native Code Generation for Dynamic Languages on Micro-core Architectures

    Get PDF
    Micro-core architectures combine many simple, low memory, low power-consuming CPU cores onto a single chip. Potentially providing significant performance and low power consumption, this technology is not only of great interest in embedded, edge, and IoT uses, but also potentially as accelerators for data-center workloads. Due to the restricted nature of such CPUs, these architectures have traditionally been challenging to program, not least due to the very constrained amounts of memory (often around 32KB) and idiosyncrasies of the technology. However, more recently, dynamic languages such as Python have been ported to a number of micro-cores, but these are often delivered as interpreters which have an associated performance limitation. Targeting the four objectives of performance, unlimited code-size, portability between architectures, and maintaining the programmer productivity benefits of dynamic languages, the limited memory available means that classic techniques employed by dynamic language compilers, such as just-in-time (JIT), are simply not feasible. In this paper we describe the construction of a compilation approach for dynamic languages on micro-core architectures which aims to meet these four objectives, and use Python as a vehicle for exploring the application of this in replacing the existing micro-core interpreter. Our experiments focus on the metrics of performance, architecture portability, minimum memory size, and programmer productivity, comparing our approach against that of writing native C code. The outcome of this work is the identification of a series of techniques that are not only suitable for compiling Python code, but also applicable to a wide variety of dynamic languages on micro-cores.Comment: Preprint of paper accepted to ACM SIGPLAN 2021 International Conference on Compiler Construction (CC 2021

    Object tracking using a many-core embedded system

    Get PDF
    Object localization and tracking is essential for many practical applications, such as mancomputer interaction, security and surveillance, robot competitions, and Industry 4.0. Because of the large amount of data present in an image, and the algorithmic complexity involved, this task can be computationally demanding, mainly for traditional embedded systems, due to their processing and storage limitations. This calls for investigation and experimentation with new approaches, as emergent heterogeneous embedded systems, that promise higher performance, without compromising energy e ciency. This work explores several real-time color-based object tracking techniques, applied to images supplied by a RGB-D sensor attached to di erent embedded platforms. The main motivation was to explore an heterogeneous Parallella board with a 16-core Epiphany coprocessor, to reduce image processing time. Another goal was to confront this platform with more conventional embedded systems, namely the popular Raspberry Pi family. In this regard, several processing options were pursued, from low-level implementations specially tailored to the Parallella, to higher-level multi-platform approaches. The results achieved allow to conclude that the programming e ort required to e - ciently use the Epiphany co-processor is considerable. Also, for the selected case study, the performance attained was bellow the one o ered by simpler approaches running on quad-core Raspberry Pi boards.A localização e o seguimento de objetos são essenciais para muitas aplicações práticas, como interação homem-computador, segurança e vigilância, competições de robôs e Industria 4.0. Devido `a grande quantidade de dados presentes numa imagem, e a` complexidade algorítmica envolvida, esta tarefa pode ser computacionalmente exigente, principalmente para os sistemas embebidos tradicionais, devido às suas limitações de processamento e armazenamento. Desta forma, ´e importante a investigação e experimentação com novas abordagens, tais como sistemas embebidos heterogéneos emergentes, que trazem consigo a promessa de melhor desempenho, sem comprometer a eficiência energética. Este trabalho explora várias t´técnicas de seguimento de objetos em tempo real baseado em imagens a cores adquiridas por um sensor RBD-D, conectado a diferentes sistemas em- bebidos. A motivação principal foi a exploração de uma placa heterogénea Parallella com um co-processador Epiphany de 16 núcleos, a fim de reduzir o tempo de processamento das imagens. Outro objetivo era confrontar esta plataforma com sistemas embebidos mais convencionais, nomeadamente a popular família Raspberry Pi. Nesse sentido, foram prosseguidas diversas opções de processamento, desde implementações de baixo nível, específicas da placa Parallella, até abordagens multi-plataforma de mais alto nível. Os resultados alcançados permitem concluir que o esforço de programação necessário para utilizar eficientemente o co-processador Epiphany é considerável. Adicionalmente, para o caso de estudo deste trabalho, o desempenho alcançado fica aquém do conseguido por abordagens mais simples executando em sistemas Raspberry Pi com quatro núcleos

    Low-level trace correlation on heterogeneous embedded systems

    Get PDF
    Tracing is a common method used to debug, analyze, and monitor various systems. Even though standard tools and tracing methodologies exist for standard and distributed environments, it is not the case for heterogeneous embedded systems. This paper proposes to fill this gap and discusses how efficient tracing can be achieved without having common system tools, such as the Linux Trace Toolkit (LTTng), at hand on every core. We propose a generic solution to trace embedded heterogeneous systems and overcome the challenges brought by their peculiar architectures (little available memory, bare-metal CPUs, or exotic components for instance). The solution described in this paper focuses on a generic way of correlating traces among different kinds of processors through traces synchronization, to analyze the global state of the system as a whole. The proposed solution was first tested on the Adapteva Parallella board. It was then improved and thoroughly validated on TI’s Keystone 2 System-on-Chip (SoC)
    corecore