41 research outputs found
Pipeline template for streaming applications on heterogeneous chips
We address the problem of providing support for executing single
streaming applications implemented as a pipeline of stages that run
on heterogeneous chips comprised of several cores and one on-chip
GPU. In this paper, we mainly focus on the API that allows the user
to specify the type of parallelism exploited by each pipeline stage
running on the multicore CPU, the mapping of the pipeline stages to
the devices (GPU or CPU), and the number of active threads. We use
a real streaming application as a case study to illustrate the
experimental results that can be obtained with this API. With this
example, we evaluate how the different parameter values affect the
performance and energy efficiency of a heterogeneous on-chip
processor (Exynos 5 Octa) that has three different computational
cores: a GPU, an ARM Cortex-A15 quad-core, and an ARM Cortex-A7
quad-core.
Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech. Proyecto de Excelencia de la Junta de Andalucía P11-TIC-08144
Solving Large-Scale Markov Decision Processes on Low-Power Heterogeneous Platforms
Markov Decision Processes (MDPs) provide a framework for a machine to act autonomously and intelligently in environments where the effects of its actions are not deterministic. MDPs have numerous applications. We focus on practical applications for decision making, such as autonomous driving and service robotics, that have to run on mobile platforms with scarce computing and power resources. In our study, we use Value Iteration to solve MDPs, a core method of the paradigm to find optimal sequences of actions, which is well known for its high computational cost.
In order to solve these computationally complex problems efficiently on platforms with stringent power consumption constraints, high-performance accelerator hardware and parallelised software come to the rescue. We introduce a generalisable approach to implementing practical decision-making applications, such as autonomous driving, on mobile and embedded low-power heterogeneous SoC platforms that integrate an accelerator (GPU) with a multicore CPU. We evaluate three scheduling strategies, namely Oracle, Dynamic, and LogFit, that enable concurrent execution and efficient use of resources on a variety of SoCs embedding a multicore CPU and an integrated GPU. We compare these strategies for solving an MDP that models autonomous robot navigation in indoor environments on four representative platforms for mobile decision-making applications, with power use ranging from 4 to 65 watts. We provide a rigorous analysis of the results to better understand their behaviour depending on the MDP size and the computing platform. Our experimental results show that CPU-GPU heterogeneous strategies considerably reduce the computation time and energy required with respect to a multicore implementation, regardless of the computing platform.
Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech.
This work was partially supported by the Spanish project TIN 2016-80920-R
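As an illustration of the Value Iteration method named in the abstract above, here is a minimal serial sketch in Python. It is not the authors' implementation and involves no CPU-GPU scheduling; the function name, the data layout (`transitions`, `rewards`), the discount factor, and the toy two-state MDP are all assumptions made for this example.

```python
def value_iteration(transitions, rewards, gamma=0.9, tol=1e-6, max_iters=10_000):
    """Solve a small finite MDP by Value Iteration.

    transitions[s][a] is a list of (probability, next_state) pairs and
    rewards[s][a] is the immediate reward for taking action a in state s.
    Returns the state values and a greedy policy.
    """
    n_states = len(transitions)

    def q_value(s, a, values):
        # Expected return of action a in state s under the current estimate.
        return rewards[s][a] + gamma * sum(p * values[s2] for p, s2 in transitions[s][a])

    values = [0.0] * n_states
    for _ in range(max_iters):
        # Bellman update: every state takes the value of its best action.
        new_values = [
            max(q_value(s, a, values) for a in range(len(transitions[s])))
            for s in range(n_states)
        ]
        delta = max(abs(n - o) for n, o in zip(new_values, values))
        values = new_values
        if delta < tol:
            break

    # Greedy policy with respect to the converged values.
    policy = [
        max(range(len(transitions[s])), key=lambda a: q_value(s, a, values))
        for s in range(n_states)
    ]
    return values, policy


# Toy two-state MDP with made-up numbers.
transitions = [
    [[(1.0, 0)], [(0.8, 1), (0.2, 0)]],  # state 0: action 0 stays, action 1 may fail
    [[(1.0, 1)]],                        # state 1: single action, stay
]
rewards = [
    [0.0, 0.0],  # no reward in state 0
    [1.0],       # reward 1 per step in state 1
]
values, policy = value_iteration(transitions, rewards)
```

The update for each state is independent of the others within a sweep, which is what makes the method a natural candidate for the kind of parallel loop partitioning the abstract describes.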
Adaptive Partition Strategies for Loop Parallelism in Heterogeneous Architectures
This work describes our contribution to the execution of parallel loops on multi-core/multi-GPU architectures so that the computational load is distributed evenly among all compute units.
This paper explores the possibility of efficiently using multicores
in conjunction with multiple GPU accelerators under a parallel task
programming paradigm. In particular, we address the challenge of
extending a parallel_for template to allow its
exploitation on heterogeneous systems. The extension is based on a
two-stage pipeline engine which is responsible for partitioning and
scheduling the chunks onto the computational resources. Under this
engine, we propose a dynamic scheduling strategy coupled with an
adaptive partitioning heuristic that resizes chunks to prevent
underutilization and load imbalance of CPUs and GPUs. In this paper
we introduce the adaptive
partitioning heuristic, which is derived from an analytical model that
minimizes the load imbalance while maximizing the throughput of the
system. Using two benchmarks, we evaluate the
overhead introduced by our template extensions, finding that it is
negligible. We also evaluate the efficiency of our adaptive
partitioning strategies and compare them with related work.
Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech. TIN2010-16144, P08-TIC-3500, P11-TIC-0814
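One simple way to picture chunk resizing is to size each CPU chunk so that it takes roughly as long as a GPU chunk, based on measured throughputs. The Python sketch below shows that idea only; the abstract does not give the analytical model, so the proportional rule, the function name `cpu_chunk_size`, and its parameters are assumptions, not the paper's heuristic.

```python
def cpu_chunk_size(gpu_chunk, gpu_throughput, cpu_throughput, remaining):
    """Pick a CPU chunk that takes about as long as one GPU chunk.

    Throughputs are in iterations per second, e.g. measured on earlier
    chunks. This proportional rule is an assumption made for illustration.
    """
    if gpu_throughput <= 0 or cpu_throughput <= 0:
        raise ValueError("throughputs must be positive")
    if remaining <= 0:
        return 0  # nothing left to schedule
    proportional = round(gpu_chunk * cpu_throughput / gpu_throughput)
    # Never hand out an empty chunk, and never exceed what is left to do.
    return max(1, min(proportional, remaining))


# A CPU core measured at 1/8 of the GPU's throughput gets 1/8 of the GPU chunk.
mid_loop = cpu_chunk_size(1024, 8000.0, 1000.0, remaining=100_000)
# Near the end of the iteration space the chunk is capped by what remains.
end_of_loop = cpu_chunk_size(1024, 8000.0, 1000.0, remaining=50)
```

Keeping the per-chunk times similar is one way to limit how long any device waits idle at the end of the loop, which is the imbalance the abstract refers to.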
Prácticas de ensamblador basadas en Raspberry Pi
This work is part of the Educational Innovation Project PIE13-082, “Motivando al alumno de ingeniería mediante la plataforma Raspberry Pi” (Motivating engineering students through the Raspberry Pi platform), whose main objective is to increase the motivation of students taking courses taught by the Departamento de Arquitectura de Computadores. The proposed strategy rests on the fact that many engineering students perceive their degree courses as far removed from their everyday reality, and that the courses lose some appeal as a result. However, quite a few of these students have bought or plan to buy a Raspberry Pi minicomputer, which offers extensive functionality because it is based on a processor and operating system that are standard in mobile devices. In this project we propose to take advantage of the interest that students already show in the Raspberry Pi platform and put it to work towards the following teaching goal: facilitating the study of concepts and techniques taught in several of the Department's courses. More specifically, the main objective of this work is to create a set of lab exercises focused on learning assembly programming, specifically for ARMv6, the processor architecture of the platform used for the exercises, as well as the low-level handling of interrupts and input/output on that processor. This report is divided into five chapters and four appendices. Of the five chapters, the first is introductory. The next two focus on programming executables under Linux, covering control structures in chapter 2 and subroutines (functions) in chapter 3. The last two chapters cover bare-metal programming, explaining the input/output subsystem (input/output ports and timers) of the Raspberry Pi platform and its low-level handling in chapter 4, and interrupts in chapter 5.
The appendices cover side topics that are nonetheless relevant enough to be included in the report: appendix A explains how the ADDEXC macro works, appendix B gives full details of the auxiliary board, appendix C shows how to speed up the loading of bare-metal programs, and appendix D goes deeper into aspects of the GPIO such as the programmable resistors.
Reducing overheads of dynamic scheduling on heterogeneous chips
In recent processor development, we have witnessed the integration of GPUs and CPUs into a single chip. This integration reduces data communication overheads, which enables efficient collaboration between both devices in the execution of parallel workloads.
In this work, we focus on the problem of efficiently scheduling chunks of iterations of parallel loops among the computing devices on the chip (the GPU and the CPU cores) in the context of irregular applications. In particular, we analyze the sources of overhead that the host thread experiences when a chunk of iterations is offloaded to the GPU while other threads concurrently execute other chunks on the CPU cores. We carefully study these overheads on different processor architectures and operating systems, using Barnes-Hut as a case study representative of irregular applications. We also propose a set of optimizations that mitigate the overheads arising in the presence of oversubscription and take advantage of the different features of the heterogeneous architectures. Thanks to these optimizations, we reduce the Energy-Delay Product (EDP) by 18% and 84% on Intel Ivy Bridge and Haswell architectures, respectively, and by 57% on the Exynos big.LITTLE.
Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
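For readers unfamiliar with the metric, the Energy-Delay Product is the energy consumed multiplied by the execution time, so it rewards savings in either quantity. The short Python sketch below computes an EDP reduction; the function names and the measurements are hypothetical and are not taken from the paper.

```python
def energy_delay_product(energy_joules, time_seconds):
    """EDP: energy consumed multiplied by execution time (lower is better)."""
    return energy_joules * time_seconds


def relative_reduction(baseline, optimized):
    """Fraction by which `optimized` improves on `baseline`."""
    return (baseline - optimized) / baseline


# Hypothetical measurements: 10% less energy and 10% less time.
baseline_edp = energy_delay_product(energy_joules=50.0, time_seconds=10.0)
optimized_edp = energy_delay_product(energy_joules=45.0, time_seconds=9.0)
reduction = relative_reduction(baseline_edp, optimized_edp)
```

With these made-up numbers the two 10% savings compound into a 19% EDP reduction, which is why the metric is often used when both performance and energy matter.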
Efficient Floating-Point Representation for Balanced Codes for FPGA Devices
Work awarded the Best Paper Award.
We propose a floating–point representation to deal
efficiently with arithmetic operations in codes with a balanced
number of additions and multiplications for FPGA devices. The
variable shift operation is very slow in these devices. We propose
a format that reduces the variable shifter penalty. It is based on
a radix–64 representation such that the number of the possible
shifts is considerably reduced. Thus, the execution time of the
floating–point addition is highly optimized when it is performed
in an FPGA device, which compensates for the multiplication
penalty when a high radix is used, as experimental results have
shown. Consequently, the main problem of previous specific high-radix
FPGA designs (no speedup for codes with a balanced
number of multiplications and additions) is overcome with our
proposal. The inherent architecture supporting the new format
works with greater bit precision than the corresponding single
precision (SP) IEEE–754 standard.
Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech. IEEE, IEEE Computer Societ
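The claim that a radix-64 format reduces the number of possible shifts can be pictured with a small count. In a radix-2^k format, an exponent difference of d corresponds to an alignment shift of d·k bits, so only multiples of k occur. The Python sketch below counts those amounts under simplifying assumptions that are not from the paper: a 24-bit significand, no guard bits, and a hypothetical helper name.

```python
def alignment_shift_amounts(significand_bits, radix_log2):
    """Bit-shift amounts an adder's alignment shifter must support.

    Simplified model (an assumption for illustration): shifts range from 0
    up to the significand width, in steps of log2(radix) bits.
    """
    return list(range(0, significand_bits + 1, radix_log2))


binary_shifts = alignment_shift_amounts(24, radix_log2=1)   # radix 2
radix64_shifts = alignment_shift_amounts(24, radix_log2=6)  # radix 64 = 2**6
```

Under this model the radix-2 shifter has to select among 25 shift amounts, while the radix-64 one selects among 5, which gives a rough sense of why the variable shifter becomes cheaper.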
Exploring heterogeneous scheduling for edge computing with CPU and FPGA MPSoCs
This paper presents a framework targeted to low-cost and low-power heterogeneous MultiProcessors that exploits FPGAs and multicore CPUs, with the overarching goal of providing developers with a productive programming model and runtime support to fully use all the processing resources available. FPGA productivity is achieved using a high-level programming model based on OpenCL, the standard for cross-platform parallel heterogeneous programming. In this work, we focus on the parallel for pattern, and as part of the runtime support for this pattern, we leverage a new scheduler that strives to maximize the number of iterations per joule by dynamically and adaptively partitioning the iteration space between the multicore and the accelerator when they work simultaneously. A total of 7 benchmarks are ported and optimized for a low-cost DE1 board. The results show that the heterogeneous solution can improve performance by up to 2.9x and energy efficiency by up to 2.7x compared to the traditional approach of keeping all the CPU cores idle while the accelerator computes the workload. Our results also yield two interesting insights: first, an adaptive scheduler able to find at runtime the right chunk size for each type of application and device configuration is an essential component for these kinds of heterogeneous platforms; and second, device configurations that provide higher throughput do not always achieve better energy efficiency when only the running power (excluding the idle power component) is considered.
An Experience of e-assessment in an Introductory Course on Computer Organization
This work describes how the CTPracticals Moodle module can be used for e-assessment in an introductory course on computer organization, where the practical content consists of the design and simulation of a basic CPU implemented using Logisim, a schematic-based educational tool for the design and simulation of digital circuits. A previous work extended this module to support the verification of code written in Matlab. This work shows how this feature can be exploited to invoke external tools, i.e. Logisim, not only for the automatic verification of student submissions, but also for a detailed analysis of the results that improves the assessment. The paper also presents some customizations of Logisim that were necessary to improve its batch-mode simulation, adding useful options that make it generate a richer output format.
Campus de Excelencia Internacional Andalucía Tech. Proyectos de Innovación Educativa PIE08-062 y PIE10-140. Universidad de Málaga