1,249 research outputs found
Design And Implementation Of Radix-4 Fast Fourier Transform In Asia Chip With 0.18 M Standard CMOS Technology [TK5102.9. S624 2008 f rb].
Jelmaan Fourier pantas (FFT) merupakan blok yang penting dan digunakan secara meluas dalam algoritma pemprosesan isyarat digital.
The Fast Fourier Transform (FFT) is a critical block and widely used in digital signal processing algorithm
Implementation of a Combined OFDM-Demodulation and WCDMA-Equalization Module
For a dual-mode baseband receiver for the OFDMWireless LAN andWCDMA standards, integration of the demodulation and equalization tasks on a dedicated hardware module has been investigated. For OFDM demodulation, an FFT algorithm based on cascaded twiddle factor decomposition has been selected. This type of algorithm combines high spatial and temporal regularity in the FFT data-flow graphs with a minimal number of computations. A frequency-domain algorithm based on a circulant channel approximation has been selected for WCDMA equalization. It has good performance, low hardware complexity and a low number of computations. Its main advantage is the reuse of the FFT kernel, which contributes to the integration of both tasks. The demodulation and equalization module has been described at the register transfer level with the in-house developed Arx language. The core of the module is a pipelined radix-23 butterfly combined with a complex multiplier and complex divider. The module has an area of 0.447 mm2 in 0.18 ¿m technology and a power consumption of 10.6 mW. The proposed module compares favorably with solutions reported in literature
An equalization technique for high rate OFDM systems
In a typical orthogonal frequency division multiplexing (OFDM) broadband wireless communication system, a guard interval using cyclic prefix is inserted to avoid the inter-symbol interference and the inter-carrier interference. This guard interval is required to be at least equal to, or longer than the maximum channel delay spread. This method is very simple, but it reduces the transmission efficiency. This efficiency is very low in the communication systems, which inhibit a long channel delay spread with a small number of sub-carriers such as the IEEE 802.11a wireless LAN (WLAN).
To increase the transmission efficiency, it is usual that a time domain equalizer (TEQ) is included in an OFDM system to shorten the effective channel impulse response within the guard interval. There are many TEQ algorithms developed for the low rate OFDM applications such as asymmetrical digital subscriber line (ADSL). The drawback of these algorithms is a high computational load. Most of the popular TEQ algorithms are not suitable for the IEEE 802.11a system, a high data rate wireless LAN based on the OFDM technique. In this thesis, a TEQ algorithm based on the minimum mean square error criterion is investigated for the high rate IEEE 802.11a system. This algorithm has a comparatively reduced computational complexity for practical use in the high data rate OFDM systems. In forming the model to design the TEQ, a reduced convolution matrix is exploited to lower the computational complexity. Mathematical analysis and simulation results are provided to show the validity and the advantages of the algorithm. In particular, it is shown that a high performance gain at a data rate of 54Mbps can be obtained with a moderate order of TEQ finite impulse response (FIR) filter. The algorithm is implemented in a field programmable gate array (FPGA). The characteristics and regularities between the elements in matrices are further exploited to reduce the hardware complexity in the matrix multiplication implementation. The optimum TEQ coefficients can be found in less than 4µs for the 7th order of the TEQ FIR filter. This time is the interval of an OFDM symbol in the IEEE 802.11a system. To compensate for the effective channel impulse response, a function block of 64-point radix-4 pipeline fast Fourier transform is implemented in FPGA to perform zero forcing equalization in frequency domain. The offsets between the hardware implementations and the mathematical calculations are provided and analyzed. The system performance loss introduced by the hardware implementation is also tested. Hardware implementation output and simulation results verify that the chips function properly and satisfy the requirements of the system running at a data rate of 54 Mbps
Parallel prefix operations on heterogeneous platforms
Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01[Resumo]
As tarxetas gráficas, coñecidas como GPUs, aportan grandes vantaxes no rendemento
computacional e na eficiencia enerxética, sendo un piar clave para a computación
de altas prestacións (HPC). Sen embargo, esta tecnoloxía tamén é custosa
de programar, e ten certos problemas asociados á portabilidade entre as diferentes
tarxetas. Por autra banda, os algoritmos de prefixo paralelo son un conxunto de
algoritmos paralelos regulares e moi empregados nas ciencias compuacionais, cuxa
eficiencia é esencial en moita."3 aplicacións. Neste eiclo, aínda que as GPUs poden
acelerar a computación destes algoritmos, tamén poden ser unha limitación cando
non explotan axeitadamente o paralelismo da arquitectura CPU.
Esta Tese presenta dúas perspectivas. Dunha parte, deséñanse novos algoritmos
de prefixo paralelo para calquera paradigma de programación paralela. Pola outra
banda, tamén se propón unha metodoloxÍa xeral que implementa eficientemente
algoritmos de prefixo paralelos, de xeito doado e portable, sobre arquitecturas GPU
CUDA, mais que se centrar nun algoritmo particular ou nun modelo concreto de
tarxeta. Para isto, a metodoloxía identifica os paramétros da GPU que inflúen no
rendemento e, despois, seguindo unha serie de premisas teóricas, obtéñense os valores
óptimos destes parámetros dependendo do algoritmo, do tamaño do problema e
da arquitectura GPU empregada. Ademais, esta Tese tamén prové unha serie de
fUllciólls GPU compostas de bloques de código CUDA modulares e reutilizables, o
que permite a implementación de calquera algoritmo de xeito sinxelo. Segundo o
tamaño do problema, propóñense tres aproximacións. As dúas primeiras resolven
problemas pequenos, medios e grandes nunha única GPU) mentras que a terceira
trata con tamaños extremad8.1nente grandes, usando varias GPUs.
As nosas propostas proporcionan uns resultados moi competitivos a nivel de
rendemento, mellorando as propostas existentes na bibliografía para as operacións
probadas: a primitiva sean, ordenación e a resolución de sistemas tridiagonais.[Resumen]
Las tarjetas gráficas (GPUs) han demostrado gmndes ventajas en el rendimiento
computacional y en la eficiencia energética, siendo una tecnología clave para la
computación de altas prestaciones (HPC). Sin embargo, esta tecnología también es
costosa de progTamar, y tiene ciertos problemas asociados a la portabilidad de sus
códigos entre diferentes generaciones de tarjetas. Por otra parte, los algoritmos de
prefijo paralelo son un conjunto de algoritmos regulares y muy utilizados en las
ciencias computacionales, cuya eficiencia es crucial en muchas aplicaciones. Aunque
las GPUs puedan acelerar la computación de estos algoritmos, también pueden ser
una limitación si no explotan correctamente el paralelismo de la arquitectura CPU.
Esta Tesis presenta dos perspectivas. De un lado, se han diseñado nuevos algoritmos
de prefijo paralelo que pueden ser implementados en cualquier paradigma de
programación paralela. Por otra parte, se propone una metodología general que implementa
eficientemente algoritmos de prefijo paralelo, de forma sencilla y portable,
sobre cualquier arquitectura GPU CUDA, sin centrarse en un algoritmo particular o
en un modelo de tarjeta. Para ello, la metodología identifica los parámetros GPU que
influyen en el rendimiento y, siguiendo un conjunto de premisas teóricas, obtiene los
valores óptimos para cada algoritmo, tamaño de problema y arquitectura. Además,
las funciones GPU proporcionadas están compuestas de bloques de código CUDA
reutilizable y modular, lo que permite la implementación de cualquier algoritmo de
prefijo paralelo sencillamente. Dependiendo del tamaño del problema, se proponen
tres aproximaciones. Las dos primeras resuelven tamaños pequeños, medios y grandes,
utilizando para ello una única GPU i mientras que la tercera aproximación trata
con tamaños extremadamente grandes, usando varias GPUs.
Nuestras propuestas proporcionan resultados muy competitivos, mejorando el
rendimiento de las propuestas existentes en la bibliografía para las operaciones probadas:
la primitiva sean, ordenación y la resolución de sistemas tridiagonales.[Abstract]
Craphics Processing Units (CPUs) have shown remarkable advantages in computing
performance and energy efficiency, representing oue of the most promising
trends fúr the near-fnture of high perfonnance computing. However, these devices
also bring sorne programming complexities, and many efforts are required tú provide
portability between different generations. Additionally, parallel prefix algorithms
are a 8et of regular and highly-used parallel algorithms, whose efficiency is crutial
in roany computer sCience applications. Although GPUs can accelerate the computation
of such algorithms, they can also be a limitation when they do not match
correctly to the CPU architecture or do not exploit the CPU parallelism properly.
This dissertation presents two different perspectives. Gn the Oile hand, new
parallel prefix algorithms have been algorithmicany designed for any paranel progrannning
paradigm. On the other hand, a general tuning CPU methodology is
proposed to provide an easy and portable mechanism tú efficiently implement paranel
prefix algorithms on any CUDA CPU architecture, rather than focusing on a
particular algorithm or a CPU mode!. To accomplish this goal, the methodology
identifies the GPU parameters which influence on the performance and, following a
set oí performance premises, obtains the cOllvillient values oí these parameters depending
on the algorithm, the problem size and the CPU architecture. Additionally,
the provided CPU functions are composed of modular and reusable CUDA blocks
of code, which allow the easy implementation of any paranel prefix algorithm. Depending
on the size of the dataset, three different approaches are proposed. The first
two approaches solve small and medium-large datasets on a single GPU; whereas the
third approach deals with extremely large datasets on a Multiple-CPU environment.
OUT proposals provide very competitive performance, outperforming the stateof-
the-art for many parallel prefix operatiOllS, such as the sean primitive, sorting and solving tridiagonal systems
CCL: a portable and tunable collective communication library for scalable parallel computers
A collective communication library for parallel computers includes frequently used operations such as broadcast, reduce, scatter, gather, concatenate, synchronize, and shift. Such a library provides users with a convenient programming interface, efficient communication operations, and the advantage of portability. A library of this nature, the Collective Communication Library (CCL), intended for the line of scalable parallel computer products by IBM, has been designed. CCL is part of the parallel application programming interface of the recently announced IBM 9076 Scalable POWERparallel System 1 (SP1). In this paper, we examine several issues related to the functionality, correctness, and performance of a portable collective communication library while focusing on three novel aspects in the design and implementation of CCL: 1) the introduction of process groups, 2) the definition of semantics that ensures correctness, and 3) the design of new and tunable algorithms based on a realistic point-to-point communication model
- …