19 research outputs found
Accelerating multi-channel filtering of audio signal on ARM processors
The researchers from Universitat Jaume I are supported by the CICYT projects
TIN2014-53495-R and TIN2011-23283 of the Ministerio de Economía y Competitividad and FEDER. The
authors from the Universitat Politècnica de València are supported by projects TEC2015-67387-C4-1-R and
PROMETEOII/2014/003. This work was also supported from the European Union FEDER (CAPAP-H5
network TIN2014-53522-REDT)
Evaluation of the Suitability of NEON SIMD Microprocessor Extensions Under Proton Irradiation
This paper analyzes the suitability of single-instruction multiple data (SIMD) extensions of current microprocessors under radiation environments. SIMD extensions are intended for software acceleration, focusing mostly in applications that require high computational effort, which are common in many fields such as computer vision. SIMD extensions use a dedicated coprocessor that makes possible packing several instructions in one single extended instruction. Applications that require high performance could benefit from the use of SIMD coprocessors, but their reliability needs to be studied. In this paper, NEON, the SIMD coprocessor of ARM microprocessors, has been selected as a case study to explore the behavior of SIMD extensions under radiation. Radiation experiments of ARM CORTEX-A9 microprocessors have been accomplished with the objective of determining how the use of this kind of coprocessor can affect the system reliability
Optimized Fundamental Signal Processing Operations for Energy Minimization on Heterogeneous Mobile Devices
[EN] Numerous signal processing applications are emerging on both mobile and high-performance computing systems. These applications are subject to responsiveness constraints for user interactivity and, at the same time, must be optimized for energy efficiency. The increasingly heterogeneous power-versus-performance profile of modern hardware introduces new opportunities for energy savings as well as challenges. In this line, recent systems-on-chip (SoC) composed of low-power multicore processors, combined with a small graphics accelerator (or GPU), yield a notable increment of the computational capacity while partially retaining the appealing low power consumption of embedded systems. This paper analyzes the potential of these new hardware systems to accelerate applications that involve a large number of floating-point arithmetic operations mainly in the form of convolutions. To assess the performance, a headphone-based spatial audio application for mobile devices based on a Samsung Exynos 5422 SoC has been developed. We discuss different implementations and analyze the tradeoffs between performance and energy efficiency for different scenarios and configurations. Our experimental results reveal that we can extend the battery lifetime of a device featuring such an architecture by a 238% by properly configuring and leveraging the computational resources.This work was supported by the Spanish Ministerio de Economia y Competitividad projects under Grant TIN2014-53495-R and Grant TEC2015-67387-C4-1-R, in part by the University Project UJI-B2016-20, in part by the Project PROMETEOII/2014/003. The work of J. A. Belloch was supported by the GVA Post-Doctoral Contract under Grant APOSTD/2016/069. This paper was recommended by Associate Editor Y. Ha.Belloch Rodríguez, JA.; Badia Contelles, JM.; Igual Peña, FD.; Gonzalez, A.; Quintana Ortí, ES. (2017). Optimized Fundamental Signal Processing Operations for Energy Minimization on Heterogeneous Mobile Devices. IEEE Transactions on Circuits and Systems I Regular Papers. 65(5):1614-1627. https://doi.org/10.1109/TCSI.2017.2761909S1614162765
An efficient multi-core SIMD implementation for H.264/AVC encoder
The optimization process of a H.264/AVC encoder on three different architectures is presented. The architectures are multi- and singlecore and SIMD instruction sets have different vector registers size. The need of code optimization is fundamental when addressing HD resolutions with real-time constraints. The encoder is subdivided in functional modules in order to better understand where the optimization is a key factor and to evaluate in details the performance improvement. Common issues in both partitioning a video encoder into parallel architectures and SIMD optimization are described, and author solutions are presented for all the architectures. Besides showing efficient video encoder implementations, one of the main purposes of this paper is to discuss how the characteristics of different architectures and different set of SIMD instructions can impact on the target application performance. Results about the achieved speedup are provided in order to compare the different implementations and evaluate the more suitable solutions for present and next generation video-coding algorithms
Exploring Processor and Memory Architectures for Multimedia
Multimedia has become one of the cornerstones of our 21st century society and, when combined with mobility, has enabled a tremendous evolution of our society. However, joining these two concepts introduces many technical challenges. These range from having sufficient performance for handling multimedia content to having the battery stamina for acceptable mobile usage. When taking a projection of where we are heading, we see these issues becoming ever more challenging by increased mobility as well as advancements in multimedia content, such as introduction of stereoscopic 3D and augmented reality. The increased performance needs for handling multimedia come not only from an ongoing step-up in resolution going from QVGA (320x240) to Full HD (1920x1080) a 27x increase in less than half a decade. On top of this, there is also codec evolution (MPEG-2 to H.264 AVC) that adds to the computational load increase. To meet these performance challenges there has been processing and memory architecture advances (SIMD, out-of-order superscalarity, multicore processing and heterogeneous multilevel memories) in the mobile domain, in conjunction with ever increasing operating frequencies (200MHz to 2GHz) and on-chip memory sizes (128KB to 2-3MB). At the same time there is an increase in requirements for mobility, placing higher demands on battery-powered systems despite the steady increase in battery capacity (500 to 2000mAh). This leaves negative net result in-terms of battery capacity versus performance advances. In order to make optimal use of these architectural advances and to meet the power limitations in mobile systems, there is a need for taking an overall approach on how to best utilize these systems. The right trade-off between performance and power is crucial. On top of these constraints, the flexibility aspects of the system need to be addressed. All this makes it very important to reach the right architectural balance in the system. The first goal for this thesis is to examine multimedia applications and propose a flexible solution that can meet the architectural requirements in a mobile system. Secondly, propose an automated methodology of optimally mapping multimedia data and instructions to a heterogeneous multilevel memory subsystem. The proposed methodology uses constraint programming for solving a multidimensional optimization problem. Results from this work indicate that using today’s most advanced mobile processor technology together with a multi-level heterogeneous on-chip memory subsystem can meet the performance requirements for handling multimedia. By utilizing the automated optimal memory mapping method presented in this thesis lower total power consumption can be achieved, whilst performance for multimedia applications is improved, by employing enhanced memory management. This is achieved through reduced external accesses and better reuse of memory objects. This automatic method shows high accuracy, up to 90%, for predicting multimedia memory accesses for a given architecture
IMPLEMENTASI HEVC CODEC PADA PLATFORM BERBASIS FPGA
High Efficiency Video Coding (HEVC) telah di desain sebagai standar
baru untuk beberapa aplikasi video dan memiliki peningkatan performa dibanding
dengan standar sebelumnya. Meskipun HEVC mencapai efisiensi coding yang
tinggi, namun HEVC memiliki kekurangan pada beban pemrosesan tinggi dan
loading yang berat ketika melakukan proses encoding video. Untuk meningkatkan
performa encoder, kami bertujuan untuk mengimplementasikan HEVC codec
pada Zynq 7000 AP SoC.
Kami mencoba mengimplementasikan HEVC menggunakan tiga desain
sistem. Pertama, HEVC codec di implementasikan pada Zynq PS. Kedua, encoder
HEVC di implementasikan dengan hardware/software co-design. Ketiga,
mengimplementasikan sebagian dari encoder HEVC pada Zynq PL. Pada
implementasi kami menggunakan Xilinx Vivado HLS untuk mengembangkan
codec.
Hasil menunjukkan bahwa HEVC codec dapat di implementasikan pada
Zynq PS. Codec dapat mengurangi ukuran video dibanding ukuran asli video pada
format H.264. Kualitas video hampir sama dengan format H.264. Sayangnya,
kami tidak dapat menyelesaikan desain dengan hardware/software co-design
karena kompleksitas coding untuk validasi kode C pada Vivado HLS. Hasil lain,
sebagian dari encoder HEVC dapat di implementasikan pada Zynq PL, yaitu
HEVC 2D IDCT. Dari implementasi kami dapat mengoptimalkan fungsi loop
pada HEVC 2D dan 1D IDCT menggunakan pipelining. Perbandingan hasil
antara pipelining inner-loop dan outer-loop menunjukkan bahwa pipelining di
outer-loop dapat meningkatkan performa dilihat dari nilai latency
Low-power System-on-Chip Processors for Energy Efficient High Performance Computing: The Texas Instruments Keystone II
The High Performance Computing (HPC) community recognizes energy
consumption as a major problem. Extensive research is underway to
identify means to increase energy efficiency of HPC systems
including consideration of alternative
building blocks for future systems. This thesis considers one
such system, the Texas Instruments Keystone II, a heterogeneous
Low-Power System-on-Chip (LPSoC) processor that combines a quad
core ARM CPU with an octa-core Digital Signal Processor (DSP). It
was first released in 2012.
Four issues are considered: i) maximizing the Keystone II ARM CPU
performance; ii) implementation and extension of the OpenMP
programming model for the Keystone II; iii) simultaneous use of
ARM and DSP cores across multiple Keystone SoCs; and iv) an
energy model for applications running on LPSoCs like the Keystone
II and heterogeneous systems in general.
Maximizing the performance of the ARM CPU on the Keystone II
system is fundamental to adoption of this system by the HPC
community and, of the ARM architecture more broadly. Key to
achieving good performance is exploitation of the ARM vector
instructions. This thesis presents the first detailed comparison
of the use of ARM compiler intrinsic functions with automatic
compiler vectorization across four generations of ARM processors.
Comparisons are also made with x86 based platforms and the use of
equivalent Intel vector instructions.
Implementation of the OpenMP programming model on the Keystone II
system presents both challenges and opportunities. Challenges in
that the OpenMP model was originally developed for a homogeneous
programming environment with a common instruction set
architecture, and in 2012 work had only just begun to consider
how OpenMP might work with accelerators. Opportunities in that
shared memory is accessible to all processing elements on the
LPSoC, offering performance advantages over what typically exists
with attached accelerators. This thesis presents an analysis of a
prototype version of OpenMP implemented as a bare-metal runtime
on the DSP of a Keystone I system. An implementation for the
Keystone II that maps OpenMP 4.0 accelerator directives to OpenCL
runtime library operations is presented and evaluated.
Exploitation of some of the underlying hardware features of the
Keystone II is also discussed.
Simultaneous use of the ARM and DSP cores across multiple
Keystone II boards is fundamental to the creation of commercially
viable HPC offerings based on Keystone technology. The nCore
BrownDwarf and HPE Moonshot systems represent two such systems.
This thesis presents a proof-of-concept implementation of matrix
multiplication (GEMM) for the BrownDwarf system. The BrownDwarf
utilizes both Keystone II and Keystone I SoCs through a
point-to-point interconnect called Hyperlink. Details of how a
novel message passing communication framework across Hyperlink
was implemented to support this complex environment are
provided.
An energy model that can be used to predict energy usage as a
function of what fraction of a particular computation is
performed on each of the available compute devices offers the
opportunity for making runtime decisions on how best to minimize
energy usage. This thesis presents a basic energy usage model
that considers rates of executions on each device and their
active and idle power usages. Using this model, it is shown that
only under certain conditions does there exist an energy-optimal
work partition that uses multiple compute devices. To validate
the model a high resolution energy measurement environment is
developed and used to gather energy measurements for a matrix
multiplication benchmark running on a variety of systems. Results
presented support the model.
Drawing on the four issues noted above and other developments
that have occurred since the Keystone II system was first
announced, the thesis concludes by making comments regarding the
future of LPSoCs as building blocks for HPC systems
Compressão de imagem e vídeo com controlador ARM
Este trabalho foi desenvolvido num estágio na empresa ABS GmbH sucursal em Portugal, e teve como foco a compressão de imagem e vídeo com os padrões JPEG e H.264, respetivamente. Foi utilizada a plataforma LeopardBoard DM368, com um controlador ARM9.
A análise do desempenho de compressão de ambos os padrões foi realizada através de programas em linguagem C, para execução no processador DM368. O programa para compressão de imagem recebe como parâmetros de entrada o nome e a resolução da imagem a comprimir, e comprime-a com 10 níveis de quantização diferentes. Os resultados mostram que é possível obter uma velocidade de compressão até 73 fps (frames per second) para a resolução 1280x720, e que imagens de boa qualidade podem ser obtidas com rácios de compressão até cerca de 22:1.
No programa para compressão de vídeo, o codificador está configurado de acordo com as recomendações para as seguintes aplicações: videoconferência, videovigilância, armazenamento e broadcasting/streaming. As configurações em cada processo de codificação, o nome do ficheiro, o número de frames e a resolução do mesmo representam os parâmetros de entrada. Para a resolução 1280x720, foram obtidas velocidades de compressão até cerca de 68 fps, enquanto para a resolução 1920x1088 esse valor foi cerca de 30 fps.
Foi ainda desenvolvida uma aplicação com capacidades para capturar imagens ou vídeos, aplicar processamento de imagem, compressão, armazenamento e transmissão para uma saída DVI (Digital Visual Interface). O processamento de imagem em software permite melhorar dinamicamente as imagens, e a taxa média de captura, compressão e armazenamento é cerca de 5 fps para a resolução 1280x720, adequando-se à captura de imagens individuais. Sem processamento em software, a taxa sobe para cerca de 23 fps para a resolução 1280x720, sendo cerca de 28 fps para a resolução 1280x1088, o que é favorável à captura de vídeo
High definition IEEE AVS decoder on ARM NEON platform
Nowadays, mobile devices are capable of displaying video up to HD resolution. In this paper, we propose two acceleration strategies for Audio Video coding Standard (AVS) software decoder on multi-core ARM NEON platform. Firstly, data level parallelism is utilized to effectively use the SIMD capability of NEON and key modules are redesigned to make them SIMD friendly. Secondly, a macroblock level wavefront parallelism is designed based on the decoding dependencies among macroblocks to utilize the processing capability of multiple cores. Experiment results show that AVS (IEEE 1857) HD video stream can be decoded in real-time by applying the proposed two acceleration strategies.Imaging Science & Photographic TechnologyEICPCI-S(ISTP)