Search CORE

848 research outputs found

Non-power-of-Two FFTs: Exploring the Flexibility of the Montium TP

Author: Hauck S.A.
Smit Gerardus Johannes Maria
van de Burgwal M.D.
Wolkotte P.T.
Publication venue: Hindawi Publishing Corporation
Publication date: 01/01/2009
Field of study

Coarse-grain reconfigurable architectures, like the Montium TP, have proven to be a very successful approach for low-power and high-performance computation of regular digital signal processing algorithms. This paper presents the implementation of a class of non-power-of-two FFTs to discover the limitations and Flexibility of the Montium TP for less regular algorithms. A non-power-of-two FFT is less regular compared to a traditional power-of-two FFT. The results of the implementation show the processing time, accuracy, energy consumption and Flexibility of the implementation

CiteSeerX

Directory of Open Access Journals

University of Twente Research Information

On the implementation of 128-pt FFT/IFFT for high-performance WPAN

Author: Huggett Clare
Maharatna Koushik
Paul Kolin
Publication venue
Publication date: 01/01/2005
Field of study

Southampton (e-Prints Soton)

Elements of Design for Containers and Solutions in the LinBox Library

Author: B. Boyer
B. Boyer
D. Harvey
J. Gathen von zur
J.-G. Dumas
J.-G. Dumas
Publication venue
Publication date: 01/01/2014
Field of study

We describe in this paper new design techniques used in the \cpp exact linear algebra library \linbox, intended to make the library safer and easier to use, while keeping it generic and efficient. First, we review the new simplified structure for containers, based on our \emph{founding scope allocation} model. We explain design choices and their impact on coding: unification of our matrix classes, clearer model for matrices and submatrices, \etc Then we present a variation of the \emph{strategy} design pattern that is comprised of a controller--plugin system: the controller (solution) chooses among plug-ins (algorithms) that always call back the controllers for subtasks. We give examples using the solution \mul. Finally we present a benchmark architecture that serves two purposes: Providing the user with easier ways to produce graphs; Creating a framework for automatically tuning the library and supporting regression testing.Comment: 8 pages, 4th International Congress on Mathematical Software, Seoul : Korea, Republic Of (2014

arXiv.org e-Print Archive

HAL-ENS-LYON

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

Hal-Diderot

VLSI Architecture for Polar Codes Using Fast Fourier Transform-Like Design

Author: Tan Weihang
Publication venue: Clemson University Libraries
Publication date: 01/08/2020
Field of study

Polar code is a novel and high-performance communication algorithm with the ability to theoretically achieving the Shannon limit, which has attracted increasing attention recently due to its low encoding and decoding complexity. Hardware optimization further reduces the cost and achieves better timing performance enabling real-time applications on resource-constrained devices. This thesis presents an area-efficient architecture for a successive cancellation (SC) polar decoder. Our design applies high-level transformations to reduce the number of Processing Elements (PEs), i.e., only log2 N pre-computed PEs are required in our architecture for an N-bit code. We also propose a customized loop-based shifting register to reduce the consumption of the delay elements further. Our experimental results demonstrate that our architecture reduces 98.90% and 93.38% in the area and area-time product, respectively, compared to prior works

Reducing Memory Requirements for the IPU using Butterfly Factorizations

Author: Alles Christian
Fröning Holger
Shekofteh S. -Kazem
Publication venue
Publication date: 16/09/2023
Field of study

High Performance Computing (HPC) benefits from different improvements during last decades, specially in terms of hardware platforms to provide more processing power while maintaining the power consumption at a reasonable level. The Intelligence Processing Unit (IPU) is a new type of massively parallel processor, designed to speedup parallel computations with huge number of processing cores and on-chip memory components connected with high-speed fabrics. IPUs mainly target machine learning applications, however, due to the architectural differences between GPUs and IPUs, especially significantly less memory capacity on an IPU, methods for reducing model size by sparsification have to be considered. Butterfly factorizations are well-known replacements for fully-connected and convolutional layers. In this paper, we examine how butterfly structures can be implemented on an IPU and study their behavior and performance compared to a GPU. Experimental results indicate that these methods can provide 98.5% compression ratio to decrease the immense need for memory, the IPU implementation can benefit from 1.3x and 1.6x performance improvement for butterfly and pixelated butterfly, respectively. We also reach to 1.62x training time speedup on a real-word dataset such as CIFAR10

arXiv.org e-Print Archive