Search CORE

180 research outputs found

Data Flow Algorithms for Processors with Vector Extensions

Author: Lee Barford
Shuvra S. Bhattacharyya
Yanzhou Liu
Publication venue: Springer Nature
Publication date: 01/01/2015
Field of study

ACOTES project: Advanced compiler technologies for embedded streaming

Author: Albert Cohen
Alex Ramírez
Andrea Ornstein
Antoniu Pop
Ayal Zaks
Cupertino Miranda
Cédric Bastoul
David Ródenas
Dorit Nuzman
E. Blossom
E.A. Lee
Eduard Ayguadé
Erven Rohou
Harm Munk
Ira Rosen
J. Hoogerbrugge
Konrad Trifunović
Louis-Noël Pouchet
M. Gschwind
M. Wolfe
Marc Duranton
Marco Cornero
Menno Lindwer
Mohammed Fellahi
Paul Carpenter
Philippe Dumont
R. Allen
R.G. Scarborough
Razya Ladelsky
Roger Ferrer
S. Campanoni
Sebastian Pop
Uzi Shvadron
Xavier Martorell
Zbigniew Chamski
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011
Field of study

Streaming applications are built of data-driven, computational components, consuming and producing unbounded data streams. Streaming oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task, having to carefully partition the computation and map it to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions, whose goal is to improve the programmer’s productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on the quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of Advanced Compiler Technologies that we developed to support Embedded Streaming.Peer ReviewedPostprint (published version

HAL-CentraleSupelec

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

UPCommons. Portal del coneixement obert de la UPC

INRIA a CCSD electronic archive server

HAL-MINES ParisTech

The University of Manchester - Institutional Repository

HAL-Rennes 1

A Study of the use of SIMD instructions for two image processing algorithms

Author: Welch Eric
Publication venue: RIT Scholar Works
Publication date: 01/12/2012
Field of study

Many media processing algorithms suffer from long execution times, which are most often not acceptable from an end user point of view. Recently, this problem has been exacerbated because media has higher resolution. One possible solution is through the use of Single Instruction Multiple Data (SIMD) architectures, such as ARM\u27s NEON. These architectures take advantage of the parallelism in media processing algorithms by operating on multiple pieces of data with just one instruction. SIMD instructions can significantly decrease the execution time of the algorithm, but require more time to implement. This thesis studies the use of SIMD instructions on a Cortex-A8 processor with NEON SIMD coprocessor. Both image processing algorithms, bilinear interpolation and distortion, are altered to process multiple pixels or colors simultaneously using the NEON coprocessor\u27s instruction set. The distortion algorithm is also altered at the assembly level through the removal of memory accesses and branches, adding data prefetch instructions, and interlacing ARM and NEON instructions. Altering the assembly code requires a deeper understanding of the code and more time, but allows for more control and higher speedups. The theoretical speedup for the bilinear interpolation and distortion algorithms is three and four times respectively. The actual measured speedup for the bilinear interpolation algorithm is more than two times, and for the distortion algorithm is more than three times. The results show that SIMD instructions can provide a speedup to image processing algorithms following a correct sequence of modifications of the code

RIT Scholar Works

Optimizing SIMD execution in HW/SW co-designed processors

Author: Kumar Rakesh
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2014
Field of study

SIMD accelerators are ubiquitous in microprocessors from different computing domains. Their high compute power and hardware simplicity improve overall performance in an energy efficient manner. Moreover, their replicated functional units and simple control mechanism make them amenable to scaling to higher vector lengths. However, code generation for these accelerators has been a challenge from the days of their inception. Compilers generate vector code conservatively to ensure correctness. As a result they lose significant vectorization opportunities and fail to extract maximum benefits out of SIMD accelerators. This thesis proposes to vectorize the program binary at runtime in a speculative manner, in addition to the compile time static vectorization. There are different environments that support runtime profiling and optimization support required for dynamic vectorization, one of most prominent ones being: 1) Dynamic Binary Translators and Optimizers (DBTO) and 2) Hardware/Software (HW/SW) Co-designed Processors. HW/SW co-designed environment provides several advantages over DBTOs like transparent incorporations of new hardware features, binary compatibility, etc. Therefore, we use HW/SW co-designed environment to assess the potential of speculative dynamic vectorization. Furthermore, we analyze vector code generation for wider vector units and find out that even though SIMD accelerators are amenable to scaling from the hardware point of view, vector code generation at higher vector length is even more challenging. The two major factors impeding vectorization for wider SIMD units are: 1) Reduced dynamic instruction stream coverage for vectorization and 2) Large number of permutation instructions. To solve the first problem we propose Variable Length Vectorization that iteratively vectorizes for multiple vector lengths to improve dynamic instruction stream coverage. Secondly, to reduce the number of permutation instructions we propose Selective Writing that selectively writes to different parts of a vector register and avoids permutations. Finally, we tackle the problem of leakage energy in SIMD accelerators. Since SIMD accelerators consume significant amount of real estate on the chip, they become the principle source of leakage if not utilized judiciously. Power gating is one of the most widely used techniques to reduce leakage energy of functional units. However, power gating has its own energy and performance overhead associated with it. We propose to selectively devectorize the vector code when higher SIMD lanes are used intermittently. This selective devectorization keeps the higher SIMD lanes idle and power gated for maximum duration. Therefore, resulting in overall leakage energy reduction.Postprint (published version

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa

A Real-Time Compressed Sensing-Based Personal Electrocardiogram Monitoring System

Author: Atienza Alonso David
Kanoun Karim
Khaled Nadia
Mamaghanian Hossein
Publication venue: New York, ACM and IEEE Press
Publication date: 04/11/2010
Field of study

Wireless body sensor networks (WBSN) hold the promise to enable next-generation patient-centric tele-cardiology systems. A WBSN-enabled electrocardiogram (ECG) monitor consists of wearable, miniaturized and wireless sensors able to measure and wirelessly report cardiac signals to a WBSN coordinator, which is responsible for reporting them to the tele-health provider. However, state-of-the-art WBSN-enabled ECG monitors still fall short of the required functionality, miniaturization and energy efficiency. Among others, energy efficiency can be significantly improved through embedded ECG compression, which reduces airtime over energy-hungry wireless links. In this paper, we propose a novel real-time energy-aware ECG monitoring system based on the emerging compressed sensing (CS) signal acquisition/compression paradigm for WBSN applications. For the first time, CS is demonstrated as an advantageous real-time and energy-efficient ECG compression technique, with a computationally light ECG encoder on the state-of-the-art ShimmerTM wearable sensor node and a realtime decoder running on an iPhone (acting as a WBSN coordinator). Interestingly, our results show an average CPU usage of less than 5% on the node, and of less than 30% on the iPhone

Infoscience - École polytechnique fédérale de Lausanne

Crossref

Meta-Programming and Policy-Based Design as a Technique of Architecting Modular and Efficient DSP Algorithm Implementations

Author: Gawlik Ireneusz
Pałka Szymon
Pędzimąż Tomasz
Ziółko Bartosz
Publication venue: Institute of Informatics, Slovak Academy of Sciences
Publication date: 03/07/2018
Field of study

Meta-programming paradigm and policy-based design are less known programming techniques in Digital Signal Processing (DSP) community, used to coding in pure C or assembly language. Major software components, like C++ STL, have proven usefulness of such paradigms in providing top performance of highly optimised native code, along with abstraction and modularity necessary in complex software projects. This paper describes composition of DSP code using these techniques, bringing as an example implementation of Feedback Delay Network (FDN) artificial reverberation algorithm. The proposed approach was proven to be practical, especially in case of prototyping computationally intense algorithms. To provide further performance insight, we discuss the techniques in context of other optimisation methods, like Single Instruction Multiple Data (SIMD) instruction sets usage and exploitation of superscalar architecture capabilities

Computing and Informatics (E-Journal - Institute of Informatics, SAS, Bratislava)

Instruction set extensions for software defined radio on a multithreaded processor

Author: Daniel Iancu
Emily R. Blem
John Glossner
Mayan Moudgill
Michael J. Schulte
Sanjay Jinturkar
Suman Mamidi
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2005
Field of study

Software dened radios, which provide a programmable solu-tion for implementing the physical layer processing of multi-ple communication standards, are widely recognized as one of the most important new technologies for wireless com-munication systems. Emerging communication standards, however, require tremendous processing capabilities to per-form high-bandwidth physical-layer processing in real time. In this paper, we present instruction set extensions for sev-eral important communication algorithms including convo-lutional encoding, Viterbi decoding, turbo decoding, and Reed-Solomon encoding and decoding. The performance bene ts of these extensions are evaluated using a supercom-puter class vectorizing compiler and the Sandblaster low-power multithreaded processor for software dened radio. The proposed instruction set extensions provide signicant performance improvements, while maintaining a high degree of programmability. Categories and Subject Descriptors C.3 [Computer Systems Organization]: Special-purpose and Application-based Systems|Real-time and embedded sys

CiteSeerX

Crossref

State of the art baseband DSP platforms for Software Defined Radio: A survey

Author: Claudio Brunelli
Fabio Garzia
Heikki Berg
Jari Nurmi
Omer Anjum
Tapani Ahonen
Publication venue: Springer Nature
Publication date: 01/01/2011
Field of study

Software Defined Radio (SDR) is an innovative approach which is becoming a more and more promising technology for future mobile handsets. Several proposals in the field of embedded systems have been introduced by different universities and industries to support SDR applications. This article presents an overview of current platforms and analyzes the related architectural choices, the current issues in SDR, as well as potential future trends.Peer reviewe

Springer - Publisher Connector

Trepo - Institutional Repository of Tampere University

Computing the fast Fourier transform on SIMD microprocessors

Author: Blake Anthony Martin
Publication venue: 'University of Waikato'
Publication date: 18/06/2012
Field of study

This thesis describes how to compute the fast Fourier transform (FFT) of a power-of-two length signal on single-instruction, multiple-data (SIMD) microprocessors faster than or very close to the speed of state of the art libraries such as FFTW (“Fastest Fourier Transform in the West”), SPIRAL and Intel Integrated Performance Primitives (IPP). The conjugate-pair algorithm has advantages in terms of memory bandwidth, and three implementations of this algorithm, which incorporate latency and spatial locality optimizations, are automatically vectorized at the algorithm level of abstraction. Performance results on 2- way, 4-way and 8-way SIMD machines show that the performance scales much better than FFTW or SPIRAL. The implementations presented in this thesis are compiled into a high-performance FFT library called SFFT (“Streaming Fast Fourier Trans- form”), and benchmarked against FFTW, SPIRAL, Intel IPP and Apple Accelerate on sixteen x86 machines and two ARM NEON machines, and shown to be, in many cases, faster than these state of the art libraries, but without having to perform extensive machine specific calibration, thus demonstrating that there are good heuristics for predicting the performance of the FFT on SIMD microprocessors (i.e., the need for empirical optimization may be overstated)

Research Commons@Waikato

Recommended from our members

Automatic data/program partitioning using the single assignment principle

Author: Bic Lubomir
Nagel Mark D.
Roy John M.A.
Publication venue: eScholarship, University of California
Publication date: 01/01/1989
Field of study

Loosely-coupled MIMD architectures do not suffer from memory contention; hence large numbers of processors may be utilized. The main problem, however, is how to partition data and programs in order to exploit the available parallelism. In this paper we show that efficient schemes for automatic data/program partitioning and synchronization may be employed if single assignment is used. Using simulations of program loops common to scientific computations (the Livermore Loops), we demonstrate that only a small fraction of data accesses are remote and thus the degradation in network performance due to multiprocessing is minimal

eScholarship - University of California