91 research outputs found
Fast algorithm for the 3-D DCT-II
Recently, many applications for three-dimensional (3-D) image and video compression have been proposed using 3-D discrete cosine transforms (3-D DCTs). Among the different types of DCT, the type-II DCT (DCT-II) is the most widely used. To use 3-D DCTs in practical applications, fast 3-D algorithms are essential. Therefore, this paper introduces the 3-D vector-radix decimation-in-frequency (3-D VR DIF) algorithm, which calculates the 3-D DCT-II directly. The mathematical analysis and the implementation of the developed algorithm are presented, showing that this algorithm possesses a regular structure, can be implemented in-place for efficient use of memory, and is faster than the conventional row-column-frame (RCF) approach. Furthermore, a 3-D DCT-II-based video compression application is implemented using the new 3-D algorithm. This has led to a substantial speed improvement for 3-D DCT-II-based compression systems and proved the validity of the developed algorithm.
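For orientation, the conventional row-column-frame (RCF) baseline mentioned in the abstract can be sketched as three passes of a 1-D DCT-II, one along each axis. This is a minimal illustration, not the paper's 3-D VR DIF algorithm; it assumes the unnormalized DCT-II definition, and the function names `dct2_1d`, `dct2_3d_rcf`, and `dct2_3d_direct` are hypothetical.

```python
import math


def dct2_1d(x):
    """Unnormalized 1-D DCT-II: X[k] = sum_n x[n] * cos(pi * (2n + 1) * k / (2N))."""
    size = len(x)
    return [
        sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * size)) for n in range(size))
        for k in range(size)
    ]


def dct2_3d_rcf(cube):
    """Row-column-frame 3-D DCT-II: apply the 1-D transform along each axis in turn."""
    n1, n2, n3 = len(cube), len(cube[0]), len(cube[0][0])
    out = [[list(row) for row in plane] for plane in cube]  # work on a copy
    # Pass 1: rows (last axis).
    for i in range(n1):
        for j in range(n2):
            out[i][j] = dct2_1d(out[i][j])
    # Pass 2: columns (middle axis).
    for i in range(n1):
        for k in range(n3):
            col = dct2_1d([out[i][j][k] for j in range(n2)])
            for j in range(n2):
                out[i][j][k] = col[j]
    # Pass 3: frames (first axis).
    for j in range(n2):
        for k in range(n3):
            frame = dct2_1d([out[i][j][k] for i in range(n1)])
            for i in range(n1):
                out[i][j][k] = frame[i]
    return out


def dct2_3d_direct(cube):
    """Reference: evaluate the separable triple sum directly (slow, for checking only)."""
    n1, n2, n3 = len(cube), len(cube[0]), len(cube[0][0])

    def c(n, k, size):
        return math.cos(math.pi * (2 * n + 1) * k / (2 * size))

    return [
        [
            [
                sum(
                    cube[a][b][d] * c(a, k1, n1) * c(b, k2, n2) * c(d, k3, n3)
                    for a in range(n1)
                    for b in range(n2)
                    for d in range(n3)
                )
                for k3 in range(n3)
            ]
            for k2 in range(n2)
        ]
        for k1 in range(n1)
    ]
```

Because the DCT-II kernel is a product of one cosine per axis, the three 1-D passes reproduce the direct triple sum; a direct 3-D algorithm such as the one the paper proposes aims to compute the same result with fewer arithmetic operations.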
Radix-2 × 2 × 2 algorithm for the 3-D discrete Hartley transform
The discrete Hartley transform (DHT) has proved to be a valuable tool in digital signal/image processing and communications and has also attracted research interest in many multidimensional applications. Although many fast algorithms have been developed for calculating the one- and two-dimensional (1-D and 2-D) DHT, algorithms for three or more dimensions have received far less attention and remain largely unexplored; hence, the multidimensional Hartley transform is usually calculated through the row-column approach. However, proper multidimensional algorithms can be more efficient than the row-column method and need to be developed. Therefore, the aim of this paper is to introduce the concept and derivation of the three-dimensional (3-D) radix-2 × 2 × 2 algorithm for fast calculation of the 3-D discrete Hartley transform. The proposed algorithm is based on the principles of the divide-and-conquer approach applied directly in 3-D. It has a simple butterfly structure and has been found to offer significant savings in arithmetic operations compared with the row-column approach based on similar algorithms.
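To make concrete what such a fast algorithm computes, the sketch below evaluates the 3-D DHT straight from its definition, using the kernel cas(θ) = cos(θ) + sin(θ) applied to the sum of the three phase terms. It is a slow reference under that assumed definition, not the radix-2 × 2 × 2 algorithm of the paper, and the name `dht_3d_direct` is hypothetical.

```python
import math


def cas(theta):
    """Hartley kernel: cas(theta) = cos(theta) + sin(theta)."""
    return math.cos(theta) + math.sin(theta)


def dht_3d_direct(x):
    """Direct 3-D DHT (no normalization), O((N1*N2*N3)^2); for checking only."""
    n1, n2, n3 = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * n3 for _ in range(n2)] for _ in range(n1)]
    for k1 in range(n1):
        for k2 in range(n2):
            for k3 in range(n3):
                acc = 0.0
                for a in range(n1):
                    for b in range(n2):
                        for c in range(n3):
                            theta = 2 * math.pi * (a * k1 / n1 + b * k2 / n2 + c * k3 / n3)
                            acc += x[a][b][c] * cas(theta)
                out[k1][k2][k3] = acc
    return out
```

A convenient sanity check is that this transform is its own inverse up to scale: applying it twice returns the input multiplied by N1·N2·N3. Note that the cas kernel of a sum is not a product of per-axis cas kernels, which is one reason the row-column route needs extra work for the DHT.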
Towards efficient exploitation of GPUs: a methodology for mapping index-digit algorithms
GPU computing was a major step forward, bringing high-performance computing to commodity hardware. Feature-rich parallel languages like CUDA and OpenCL reduced the programming complexity. However, to take full advantage of GPU computing power, specialized parallel algorithms are required. Moreover, the complex GPU memory hierarchy and highly threaded architecture make programming a difficult task even for experienced programmers. Due to the novelty of GPU programming, common general-purpose libraries are scarce and parallel versions of the algorithms are not always readily available.
Instead of focusing on the parallelization of particular algorithms, in this thesis we propose a general methodology applicable to most divide-and-conquer problems with a butterfly structure that can be formulated through the Index-Digit representation. First, we analyze the different performance factors of the GPU architecture. Next, we study several optimization techniques and design a series of modular and reusable building blocks, which are used to create the different algorithms. Finally, we study the optimal resource balance and, through a mapping vector representation and operator algebra, tune the algorithms for the desired configurations. Despite the focus on programmability and flexibility, the resulting implementations offer very competitive performance and can surpass other well-known state-of-the-art libraries.
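As a small illustration of what a butterfly-structured divide-and-conquer algorithm looks like when indexed digit by digit, the sketch below implements the fast Walsh-Hadamard transform. At each stage, an element is paired with the one whose index differs in exactly one binary digit. This example is chosen only because it is the simplest butterfly algorithm; it is sequential Python, not the GPU mapping methodology of the thesis, and the name `fwht` is hypothetical.

```python
def fwht(values):
    """Fast Walsh-Hadamard transform (unnormalized) of a power-of-two-length list."""
    data = list(values)
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    stage_bit = 1  # the binary digit of the index handled by the current stage
    while stage_bit < n:
        for i in range(n):
            if i & stage_bit == 0:
                j = i | stage_bit  # partner index: same digits, current digit set to 1
                a, b = data[i], data[j]
                data[i], data[j] = a + b, a - b  # the butterfly
        stage_bit <<= 1
    return data
```

All butterflies within one stage touch disjoint pairs, so they are independent of each other; that independence is what makes this family of algorithms a natural fit for massively parallel hardware.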
Computing the fast Fourier transform on SIMD microprocessors
This thesis describes how to compute the fast Fourier transform (FFT) of a power-of-two length signal on single-instruction, multiple-data (SIMD) microprocessors faster than, or very close to the speed of, state-of-the-art libraries such as FFTW (“Fastest Fourier Transform in the West”), SPIRAL and Intel Integrated Performance Primitives (IPP).
The conjugate-pair algorithm has advantages in terms of memory bandwidth, and three implementations of this algorithm, which incorporate latency and spatial locality optimizations, are automatically vectorized at the algorithm level of abstraction. Performance results on 2-way, 4-way and 8-way SIMD machines show that the performance scales much better than that of FFTW or SPIRAL.
The implementations presented in this thesis are compiled into a high-performance FFT library called SFFT (“Streaming Fast Fourier Transform”) and benchmarked against FFTW, SPIRAL, Intel IPP and Apple Accelerate on sixteen x86 machines and two ARM NEON machines. They are shown to be, in many cases, faster than these state-of-the-art libraries without requiring extensive machine-specific calibration, thus demonstrating that there are good heuristics for predicting the performance of the FFT on SIMD microprocessors (i.e., the need for empirical optimization may be overstated).
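The conjugate-pair algorithm named in the abstract is a split-radix variant in which the two quarter-length sub-transforms use the samples x[4n+1] and x[4n-1] (indices taken cyclically), so that their twiddle factors are complex conjugates of each other. The recursive sketch below follows that textbook formulation; it is an assumption-laden illustration with none of the vectorization or locality optimizations described in the thesis, and the name `fft_conjugate_pair` is hypothetical.

```python
import cmath


def fft_conjugate_pair(x):
    """Recursive conjugate-pair split-radix FFT of a power-of-two-length sequence."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    if n == 2:
        return [complex(x[0] + x[1]), complex(x[0] - x[1])]
    quarter = n // 4
    u = fft_conjugate_pair(x[0::2])  # even samples, length n/2
    z = fft_conjugate_pair(x[1::4])  # samples x[4m + 1], length n/4
    zc = fft_conjugate_pair([x[(4 * m - 1) % n] for m in range(quarter)])  # x[4m - 1], cyclic
    out = [0j] * n
    for k in range(quarter):
        w = cmath.exp(-2j * cmath.pi * k / n)
        a = w * z[k]               # twiddle w^k
        b = w.conjugate() * zc[k]  # conjugate twiddle w^(-k)
        out[k] = u[k] + (a + b)
        out[k + 2 * quarter] = u[k] - (a + b)
        out[k + quarter] = u[k + quarter] - 1j * (a - b)
        out[k + 3 * quarter] = u[k + quarter] + 1j * (a - b)
    return out
```

Each butterfly needs only one twiddle factor and its conjugate, so a single value is loaded per butterfly; this is consistent with the memory-bandwidth advantage the abstract attributes to the algorithm.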