93 research outputs found
OpenCL + OpenSHMEM Hybrid Programming Model for the Adapteva Epiphany Architecture
There is interest in exploring hybrid OpenSHMEM + X programming models to
extend the applicability of the OpenSHMEM interface to more hardware
architectures. We present a hybrid OpenCL + OpenSHMEM programming model for
device-level programming for architectures like the Adapteva Epiphany many-core
RISC array processor. The Epiphany architecture comprises a 2D array of
low-power RISC cores with minimal uncore functionality connected by a 2D mesh
Network-on-Chip (NoC). The Epiphany architecture offers high computational
energy efficiency for integer and floating point calculations as well as
parallel scalability. The Epiphany-III is available as a coprocessor in
platforms that also utilize an ARM CPU host. OpenCL provides good functionality
for supporting a co-design programming model in which the host CPU offloads
parallel work to a coprocessor. However, the OpenCL memory model is
inconsistent with the Epiphany memory architecture and lacks support for
inter-core communication. We propose a hybrid programming model in which
OpenSHMEM provides a better solution by replacing the non-standard OpenCL
extensions introduced to achieve high performance with the Epiphany
architecture. We demonstrate the proposed programming model for matrix-matrix
multiplication based on Cannon's algorithm showing that the hybrid model
addresses the deficiencies of using OpenCL alone to achieve good benchmark
performance.
Comment: 12 pages, 5 figures, OpenSHMEM 2016: Third workshop on OpenSHMEM and
Related Technologies
EMVS: Embedded Multi Vector-core System
With the increase in the density and performance of digital electronics, the demand for power-efficient high-performance computing (HPC) systems for embedded applications has increased. Existing embedded HPC systems suffer from issues such as programmability, scalability, and portability. Therefore, a parameterizable and programmable high-performance processor system architecture is required to execute embedded HPC applications. In this work, we propose an Embedded Multi Vector-core System (EMVS), which executes an embedded application by managing multiple vectorized tasks and their memory operations. The system is designed and ported to an Altera DE4 FPGA development board. The performance of EMVS is compared with the Heterogeneous Multi-Processing Odroid XU3, Parallella, and GPU Jetson TK1 embedded systems. The results show that, compared with these embedded systems, EMVS improves application and system performance by 19.28 and 10.22 times, respectively, and consumes 10.6 times less energy.
Peer Reviewed. Postprint (author's final draft)
High level programming abstractions for leveraging hierarchical memories with micro-core architectures
Micro-core architectures combine many low memory, low power computing cores
together in a single package. These are attractive for use as accelerators but
due to limited on-chip memory and multiple levels of memory hierarchy, the way
in which programmers offload kernels needs to be carefully considered. In this
paper we use Python as a vehicle for exploring the semantics and abstractions
of higher level programming languages to support the offloading of
computational kernels to these devices. By moving to a pass by reference model,
along with leveraging memory kinds, we demonstrate the ability to easily and
efficiently take advantage of multiple levels in the memory hierarchy, even
ones that are not directly accessible to the micro-cores. Using a machine
learning benchmark, we perform experiments on both Epiphany-III and MicroBlaze
based micro-cores, demonstrating the ability to compute with data sets of
arbitrarily large size. To provide context for our results, we explore the
performance and power efficiency of these technologies, demonstrating that
whilst these two micro-core technologies are competitive within their own
embedded class of hardware, there is still a way to go to reach HPC class GPUs.
Comment: Accepted manuscript of paper in Journal of Parallel and Distributed
Computing 13
Compact Native Code Generation for Dynamic Languages on Micro-core Architectures
Micro-core architectures combine many simple, low memory, low power-consuming
CPU cores onto a single chip. Potentially providing significant performance and
low power consumption, this technology is not only of great interest in
embedded, edge, and IoT uses, but also potentially as accelerators for
data-center workloads. Due to the restricted nature of such CPUs, these
architectures have traditionally been challenging to program, not least due to
the very constrained amounts of memory (often around 32KB) and idiosyncrasies
of the technology. However, more recently, dynamic languages such as Python
have been ported to a number of micro-cores, but these are often delivered as
interpreters which have an associated performance limitation.
Targeting the four objectives of performance, unlimited code-size,
portability between architectures, and maintaining the programmer productivity
benefits of dynamic languages, the limited memory available means that classic
techniques employed by dynamic language compilers, such as just-in-time (JIT),
are simply not feasible. In this paper we describe the construction of a
compilation approach for dynamic languages on micro-core architectures which
aims to meet these four objectives, and use Python as a vehicle for exploring
the application of this in replacing the existing micro-core interpreter. Our
experiments focus on the metrics of performance, architecture portability,
minimum memory size, and programmer productivity, comparing our approach
against that of writing native C code. The outcome of this work is the
identification of a series of techniques that are not only suitable for
compiling Python code, but also applicable to a wide variety of dynamic
languages on micro-cores.
Comment: Preprint of paper accepted to ACM SIGPLAN 2021 International
Conference on Compiler Construction (CC 2021)
Object tracking using a many-core embedded system
Object localization and tracking is essential for many practical applications, such as
man-computer interaction, security and surveillance, robot competitions, and Industry 4.0.
Because of the large amount of data present in an image, and the algorithmic complexity
involved, this task can be computationally demanding, mainly for traditional embedded
systems, due to their processing and storage limitations. This calls for investigation and
experimentation with new approaches, such as emergent heterogeneous embedded systems,
that promise higher performance without compromising energy efficiency.
This work explores several real-time color-based object tracking techniques, applied to
images supplied by an RGB-D sensor attached to different embedded platforms. The main
motivation was to explore a heterogeneous Parallella board with a 16-core Epiphany coprocessor,
to reduce image processing time. Another goal was to compare this platform
with more conventional embedded systems, namely the popular Raspberry Pi family.
In this regard, several processing options were pursued, from low-level implementations
specially tailored to the Parallella, to higher-level multi-platform approaches.
The results allow us to conclude that the programming effort required to efficiently
use the Epiphany coprocessor is considerable. Also, for the selected case study,
the performance attained was below that offered by simpler approaches running on
quad-core Raspberry Pi boards.
Low-level trace correlation on heterogeneous embedded systems
Tracing is a common method used to debug, analyze, and monitor various systems. Even though standard tools and tracing methodologies exist for standard and distributed environments, this is not the case for heterogeneous embedded systems. This paper proposes to fill this gap and discusses how efficient tracing can be achieved without having common system tools, such as the Linux Trace Toolkit (LTTng), at hand on every core. We propose a generic solution to trace embedded heterogeneous systems and overcome the challenges brought by their peculiar architectures (for instance, little available memory, bare-metal CPUs, or exotic components). The solution described in this paper focuses on a generic way of correlating traces among different kinds of processors through trace synchronization, to analyze the global state of the system as a whole. The proposed solution was first tested on the Adapteva Parallella board. It was then improved and thoroughly validated on TI's Keystone 2 System-on-Chip (SoC).