1,133 research outputs found

Low-power Programmable Processor for Fast Fourier Transform Based on Transport Triggered Architecture

Author: Takala Jarmo; Žádník Jakub
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 17/04/2019

This paper describes a low-power processor tailored for fast Fourier transform (FFT) computations that exploits the transport-triggered architecture template. The processor is software-programmable while retaining energy efficiency comparable to existing fixed-function implementations. The power savings are achieved by compressing the computation kernel into one instruction word. The word is stored in an instruction loop buffer, which is more power-efficient than regular instruction memory storage. The processor supports all power-of-two FFT sizes from 64 to 16384, and given 1 mJ of energy, it can compute 20916 transforms of size 1024.

Comment: 5 pages, 4 figures, 1 table, ICASSP 2019 conference
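As a rough reading of the energy figure quoted in this abstract, the sketch below converts "20916 transforms of size 1024 per 1 mJ" into energy per transform and per output point. The function and constant names are hypothetical, and the per-point figure simply assumes the energy is spread evenly over the 1024 points; none of this is taken from the paper itself.

```python
def energy_per_transform_joules(energy_budget_j: float, transforms: int) -> float:
    """Average energy per transform, given a total budget and a transform count."""
    if transforms <= 0:
        raise ValueError("transforms must be positive")
    return energy_budget_j / transforms


# Figures quoted in the abstract.
ENERGY_BUDGET_J = 1e-3   # 1 mJ
TRANSFORMS = 20916       # 1024-point transforms computed within that budget
FFT_SIZE = 1024

per_transform_j = energy_per_transform_joules(ENERGY_BUDGET_J, TRANSFORMS)
# Assumption: energy is divided evenly across the output points.
per_point_j = per_transform_j / FFT_SIZE

print(f"{per_transform_j * 1e9:.1f} nJ per {FFT_SIZE}-point transform")
print(f"{per_point_j * 1e12:.1f} pJ per output point")
```

Under these assumptions, the quoted figure corresponds to roughly 47.8 nJ per 1024-point transform.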

arXiv.org e-Print Archive

Crossref

Trepo - Institutional Repository of Tampere University

Design and Test Space Exploration of Transport-Triggered Architectures

Author: Kerkhoff H.G.; Tangelder R.J.W.T.; Zivkovic V.A.
Publication venue: IEEE
Publication date: 01/01/2000

This paper describes a new approach to the high-level design and test of transport-triggered architectures (TTA), a special type of application-specific instruction processor (ASIP). The proposed method introduces test as an additional constraint alongside throughput and circuit area. The method, which calculates the testability of the system, helps the designer assess the obtained architectures with respect to test, area and throughput in the early phase of the design and select the most suitable one. The "MOVE" framework has been used to create the templated TTA. The approach is validated with respect to the "Crypt" Unix application.

CiteSeerX

University of Twente Research Information

pocl: A Performance-Portable OpenCL Implementation

Author: Berg Heikki; de La Lama Carlos Sánchez; Jääskeläinen Pekka; Raiskila Kalle; Schnetter Erik; Takala Jarmo
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2015

OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear: multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse by multiple proprietary vendor implementations with different characteristics and, thus, different required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions, which enables support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths and static multi-issue. Unlike previous similar techniques that work on the source level, the parallel region formation retains the information of the data parallelism using the LLVM IR and its metadata infrastructure. This data can be exploited by the later generic compiler passes for efficient parallelization. The proposed open source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both those already commercialized and those still under research. The paper describes how the portability of the implementation is achieved. Our results show that most of the benchmarked applications, when compiled using pocl, were faster than or close to as fast as the best proprietary OpenCL implementation for the platform at hand.

Comment: This article was published in 2015; it is now openly accessible via arXiv.

arXiv.org e-Print Archive

Trepo - Institutional Repository of Tampere University

Low power architectures for streaming applications

Author: He Y.
Publication venue: Technische Universiteit Eindhoven
Publication date: 01/01/2013

Repository TU/e

Pure OAI Repository

An adaptive detector implementation for MIMO-OFDM downlink

Author: Janhunen Janne; Juntti Markku; Shahabuddin Shahriar; Steendam Heidi; Suikkanen Essi
Publication venue: European Alliance for Innovation n.o.
Publication date: 01/01/2014

Cognitive radio (CR) systems require flexible and adaptive implementations of signal processing algorithms. An adaptive symbol detector is needed in the baseband receiver chain to achieve the desired flexibility of a CR system. This paper presents a novel design of an adaptive detector as an application-specific instruction-set processor (ASIP). The ASIP template is based on transport triggered architecture (TTA). The processor architecture is designed so that a single TTA processor can be programmed to support different suboptimal multiple-input multiple-output (MIMO) detection algorithms. The linear minimum mean-square error (LMMSE) and three variants of the selective spanning for fast enumeration (SSFE) detection algorithms are considered. The detection algorithm can be switched between the LMMSE and SSFE in the TTA processor according to the bit error rate (BER) performance requirement. The design can be scaled for different antenna configurations and different modulations. Some of the algorithm-architecture co-optimization techniques used here are also presented. Unlike most other detector ASIPs, the processor is programmed in a high-level language to meet time-to-market requirements. The adaptive detector delivers 4.88 to 49.48 Mbps throughput at a clock frequency of 200 MHz on 90 nm technology.
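To put the throughput range quoted in this abstract in perspective, the sketch below relates it to the 200 MHz clock as an average number of clock cycles per delivered bit. The helper name is hypothetical, and the calculation assumes the quoted throughput is sustained at the stated clock frequency; it says nothing about how the detector actually schedules its work.

```python
def cycles_per_bit(clock_hz: float, throughput_bps: float) -> float:
    """Average clock cycles spent per delivered bit (illustrative helper)."""
    if throughput_bps <= 0:
        raise ValueError("throughput_bps must be positive")
    return clock_hz / throughput_bps


# Figures quoted in the abstract.
CLOCK_HZ = 200e6            # 200 MHz
LOW_THROUGHPUT_BPS = 4.88e6   # 4.88 Mbps
HIGH_THROUGHPUT_BPS = 49.48e6  # 49.48 Mbps

slowest = cycles_per_bit(CLOCK_HZ, LOW_THROUGHPUT_BPS)
fastest = cycles_per_bit(CLOCK_HZ, HIGH_THROUGHPUT_BPS)

print(f"{slowest:.1f} cycles per bit at the low end of the range")
print(f"{fastest:.1f} cycles per bit at the high end of the range")
```

Under that assumption, the range spans roughly 41 cycles per bit at the low end down to about 4 cycles per bit at the high end.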

CiteSeerX

Ghent University Academic Bibliography

Synthetic Aperture Radar Algorithms on Transport Triggered Architecture Processors using OpenCL

Author: Blume Holger; Jääskeläinen Pekka; Leppänen Topi; Mätzner Leonard; Rother Niklas; Schleusner Jens
Publication venue: Piscataway, NJ : IEEE
Publication date: 01/01/2024

Live SAR imaging from small UAVs is an emerging field. On-board processing of the radar data requires high-performance and energy-efficient platforms. One candidate is the Transport Triggered Architecture (TTA) processor. We implement Backprojection and Backprojection Autofocus using OpenCL on a TTA processor specially adapted for this task. The resulting implementation is compared to other platforms in terms of energy efficiency. We find that the TTA is on par with embedded GPUs and surpasses other OpenCL-based platforms. It is outperformed only by a dedicated FPGA implementation.

Institutionelles Repositorium der Leibniz Universität Hannover

SynZEN: a hybrid TTA/VLIW architecture with a distributed register file

Author: Hauser Stefan; Juurlink Ben; Moser Nico
Publication venue
Publication date: 01/01/2012

The quest for higher performance within a certain power budget in the field of embedded computing demands unconventional architectural approaches. To this end, we present synZEN (sZ): a (micro-)architecture that combines features of very long instruction word (VLIW) and transport triggered architectures (TTAs) to cover the needs of different applications. SynZEN features a distributed register file (RF), i.e., each functional unit (FU) has its own RF, and a wide memory connection to exploit spatial data locality. FPGA synthesis results demonstrate that, due to the distributed RF, the sZ design can be implemented in less area (in terms of FPGA slices) than existing TTA and VLIW designs. Furthermore, using two micro-benchmarks, we show that because of the wide memory connection, sZ outperforms both the TTA and the VLIW designs.

DepositOnce

Crossref