
    DCT Implementation on GPU

    There has been great progress in the field of graphics processors. Since the speed of conventional single-core CPUs is no longer rising, designers are turning to multi-core, parallel processors. Because of their strength in parallel processing, GPUs are becoming increasingly attractive for many applications. With the increasing demand for GPU computing, there is a great need to develop operating systems that drive the GPU to full capacity. GPUs offer a very efficient environment for many image processing applications. This thesis explores the processing power of GPUs for digital image compression using the Discrete Cosine Transform (DCT).
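    As background to the kind of kernel such a thesis maps onto the GPU, here is a minimal NumPy sketch of the orthonormal 2-D DCT-II on a square block, the transform at the heart of JPEG-style compression. The function names are illustrative; the thesis's GPU implementation is not reproduced here.

```python
import numpy as np

def dct_matrix(N=8):
    """Orthonormal DCT-II basis: C[k, n] = s(k) * cos(pi * (2n + 1) * k / (2N))."""
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    C = np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    C[0, :] *= 1 / np.sqrt(2)          # s(0) scaling for the DC row
    return C * np.sqrt(2 / N)

def dct2(block):
    """2-D DCT of a square block via two separable 1-D transforms."""
    C = dct_matrix(block.shape[0])
    return C @ block @ C.T

# Illustrative compression step: quantize the coefficients so small
# high-frequency terms round to zero.
# coeffs = dct2(block); compressed = np.round(coeffs / q) * q
```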

    Design for scalability in 3D computer graphics architectures


    Design of a High-Speed Architecture for Stabilization of Video Captured Under Non-Uniform Lighting Conditions

    Video captured under shaky conditions suffers from unwanted vibrations. A robust algorithm that stabilizes the video by compensating for vibrations arising from the physical setting of the camera is presented in this dissertation, along with a very high performance hardware architecture on Field Programmable Gate Array (FPGA) technology for implementing the stabilization system.

    Stabilization of video sequences captured under non-uniform lighting conditions begins with a nonlinear enhancement process. This improves the visibility of scenes captured with physical sensing devices that have limited dynamic range, a limitation that causes the saturated region of the image to shadow out the rest of the scene. It is therefore desirable to recover a more uniform scene that eliminates the shadows to a certain extent. Stabilization also requires the estimation of global motion parameters: by obtaining reliable background motion, the video can be spatially transformed to the reference sequence, thereby eliminating the unintended motion of the camera. A reflectance-illuminance model is used in this research work to improve the visibility and quality of the scene, and with fast color space conversion the computational complexity is kept to a minimum.

    The basic video stabilization model is formulated and configured for hardware implementation. The model involves evaluation of reliable features for tracking, motion estimation, and an affine transformation to map the display coordinates of the stabilized sequence. Multiplications, divisions and exponentiations are replaced by simple arithmetic and logic operations using improved log-domain computations in the hardware modules. On Xilinx's Virtex II 2V8000-5 FPGA platform, the prototype system consumes 59% of logic slices, 30% of flip-flops, 34% of lookup tables, 35% of embedded RAMs and two ZBT frame buffers. The system renders 180.9 million pixels per second (mpps) and consumes approximately 30.6 watts at 1.5 volts; with a 1024×1024 frame, this throughput is equivalent to 172 frames per second (fps).

    Future work will optimize the performance-resource trade-off to meet the specific needs of applications, and will extend the model to extraction and tracking of moving objects, since the model inherently encapsulates the attributes of spatial distortion and motion prediction needed to reduce complexity. With these parameters to narrow down the processing range, it is possible to achieve a minimum of 20 fps on desktop computers with Intel Core 2 Duo or Quad Core CPUs and 2 GB of DDR2 memory, without dedicated hardware.
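    As an illustration of the global-motion step described above, the sketch below fits a 2×3 affine transform to matched feature points by least squares, the role the affine mapping plays in such a stabilization pipeline. This is a plain NumPy sketch under illustrative assumptions, not the dissertation's log-domain hardware formulation.

```python
import numpy as np

def estimate_affine(src, dst):
    """Least-squares 2x3 affine M mapping src points to dst points.

    src, dst: (N, 2) arrays of matched feature coordinates between
    the current frame and the reference frame.
    """
    A = np.hstack([src, np.ones((len(src), 1))])   # N x 3 design matrix
    # Solve A @ X = dst in the least-squares sense; X is 3 x 2.
    X, *_ = np.linalg.lstsq(A, dst, rcond=None)
    return X.T                                      # 2 x 3 affine matrix

def warp_point(M, p):
    """Apply the compensating affine to one pixel coordinate."""
    return M[:, :2] @ p + M[:, 2]
```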

    Accuracy-guaranteed bit-width optimization


    Multiplierless CSD techniques for high performance FPGA implementation of digital filters.

    Implementation of digital signal processing (DSP) algorithms in hardware, such as field programmable gate arrays (FPGAs), requires a large number of multipliers. Fast, low-area multiply-adds have become critical in modern commercial and military DSP applications, and in many contemporary real-time DSP and multimedia applications system performance is severely impacted by the speed, energy efficiency, and area of currently available silicon multipliers.

    My research focuses on two key ideas for improving DSP performance: (1) developing new high performance, efficient shift-add ("multiplierless") techniques that implement multiply-add operations without a traditional multiplier structure; and (2) leveraging the growing trend toward design prototyping, and even production, in FPGAs as opposed to dedicated DSP processors or ASICs, synergistically with the new multiplierless structures. My work is based on a dramatic new technique for converting between 2's complement and CSD number systems, and results in high-performance structures that are particularly effective for implementing adaptive systems in reconfigurable logic.

    Adaptive system implementations require real-time conversion of coefficients to Canonical Signed Digit (CSD) or similar representations to benefit from multiplierless filter implementation techniques. This dissertation introduces the first non-iterative hardware algorithm for converting 2's complement numbers to their CSD representations (FastCSD), using a fixed number of shift and logic operations. As a result, the power consumption and area required for hardware implementation of DSP algorithms whose coefficients are not known a priori can be greatly reduced. Because all CSD digits are produced simultaneously, the conversion speed, and thus the throughput, is improved compared to overlap-and-scan techniques such as Booth's recoding.

    I leverage FastCSD to develop a new, high performance iterative multiplierless structure based on novel real-time CSD recoding, so that more zero partial products are introduced: up to 66.7% of partial products are zero, compared to 50% with traditional modified Booth's recoding. The structure also reduces the non-zero partial products to a minimum. As a result, the number of arithmetic operations in the carry-save structure is reduced, yielding an overall speed-up as well as lower power consumption. Furthermore, because the structure performs real-time CSD recoding and does not require a fixed multiplier input to be known a priori, the multiplier can be applied to digital filters with non-fixed coefficients, such as adaptive filters.

    I also introduce a new multi-input Canonical Signed Digit (CSD) multiplier unit, which requires fewer shift/add/subtract operations and less CSD number conversion overhead than existing techniques, reducing power consumption and area in the hardware implementation of DSP algorithms. Because all the products are produced simultaneously, the multiplication speed, and thus the throughput, is improved. The multi-input multiplier unit is applied to digital filters with non-fixed coefficients, such as adaptive filters; the implementation cost of these filters can be further reduced by limiting the wordlength of the input signal with little or no sacrifice in filter performance, which my simulation results confirm. The proposed multiplier unit can also be applied to other DSP algorithms, such as digital filter banks or matrix and vector multiplications.

    Finally, the tradeoff between filter order and coefficient length in the design and implementation of high-performance filters in FPGAs is discussed. Non-minimum order FIR filters are designed for implementation using CSD multiplierless techniques: by increasing the filter order, the length of the coefficients can be decreased without reducing filter performance, so an overall hardware saving can be achieved.
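    For concreteness, here is a minimal Python sketch of CSD recoding and of a shift-add ("multiplierless") multiply built on it. This is the textbook iterative conversion, not the non-iterative FastCSD hardware algorithm the dissertation introduces; all names are illustrative.

```python
def to_csd(n, bits):
    """Convert a signed integer to Canonical Signed Digit form.

    Returns digits in {-1, 0, +1}, least-significant first, with no two
    adjacent non-zero digits (the CSD property).
    """
    digits = []
    while n != 0:
        if n % 2:                   # odd: emit a non-zero digit
            d = 2 - (n % 4)         # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits + [0] * (bits - len(digits))

def csd_multiply(x, coeff, bits=16):
    """Multiply x by coeff using only shifts, adds and subtracts."""
    acc = 0
    for i, d in enumerate(to_csd(coeff, bits)):
        if d:
            acc += d * (x << i)     # d = +/-1: one shift plus one add/subtract
    return acc

# Example: 7 = +00- in CSD (8 - 1), so csd_multiply(x, 7) costs one shift
# and one subtract instead of three partial-product additions.
assert csd_multiply(13, 7) == 91
```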

    Approximate Inference for Constructing Astronomical Catalogs from Images

    We present a new, fully generative model for constructing astronomical catalogs from optical telescope image sets. Each pixel intensity is treated as a random variable with parameters that depend on the latent properties of stars and galaxies, and these latent properties are themselves modeled as random. We compare two procedures for posterior inference: one based on Markov chain Monte Carlo (MCMC) and one based on variational inference (VI). The MCMC procedure excels at quantifying uncertainty, while the VI procedure is 1000 times faster. On a supercomputer, the VI procedure efficiently uses 665,000 CPU cores to construct an astronomical catalog from 50 terabytes of images in 14.6 minutes, demonstrating the scaling characteristics necessary to construct catalogs for upcoming astronomical surveys.
    Comment: accepted to the Annals of Applied Statistics
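    As a toy illustration of the structure of such a generative model, the sketch below renders expected pixel intensities from latent point sources under a Gaussian point-spread function and scores an observed image with a Poisson log-likelihood. The simplified model and all constants are assumptions for illustration only, not the paper's actual model or fitted values.

```python
import numpy as np
from scipy.stats import poisson

def render(shape, sources, background=100.0, sigma=1.5):
    """Expected pixel intensities: background plus Gaussian-PSF sources.

    `sources` is a list of latent (row, col, flux) triples; the background
    level and PSF width are illustrative placeholders.
    """
    rr, cc = np.mgrid[0:shape[0], 0:shape[1]]
    rate = np.full(shape, background)
    for r0, c0, flux in sources:
        rate += flux * np.exp(-((rr - r0) ** 2 + (cc - c0) ** 2)
                              / (2 * sigma ** 2))
    return rate

def log_likelihood(image, sources):
    """Poisson log-likelihood of observed photon counts given the sources."""
    return poisson.logpmf(image, render(image.shape, sources)).sum()
```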

    An on-line approach for evaluating trigonometric functions

    This thesis investigates the evaluation of trigonometric functions using an on-line arithmetic approach. On-line algorithms have been developed to evaluate the sine and cosine functions. Error analysis and heuristics are carried out to arrive at a minimal-error algorithm based on the series expansions of the sine and cosine functions. A logical design based on the algorithm is presented, in which the unit is organized as a set of basic modules, and a detailed bit-slice design of each module is also given. A simulator was designed as an experimental tool for synthesis of the on-line algorithms and for performance evaluation.
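    The sketch below shows the Maclaurin series on which such an evaluation is based, accumulated term by term, roughly as an on-line unit refines its result one digit at a time. The digit-serial on-line arithmetic itself is not reproduced here.

```python
def sin_cos_series(x, terms=10):
    """Evaluate sine and cosine together via their Maclaurin series.

    Each iteration adds one series term to both accumulators, with the
    next term derived from the previous by a single multiply-divide.
    """
    s, c = 0.0, 0.0
    term_s, term_c = x, 1.0        # x^1/1! and x^0/0!
    for k in range(terms):
        s += term_s
        c += term_c
        term_s *= -x * x / ((2 * k + 2) * (2 * k + 3))  # next odd-power term
        term_c *= -x * x / ((2 * k + 1) * (2 * k + 2))  # next even-power term
    return s, c
```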

    Realtime image noise reduction FPGA implementation with edge detection

    The purpose of this dissertation was to develop and implement, in a Field Programmable Gate Array (FPGA), a noise reduction algorithm for real-time sensor-acquired images. A Moving Average filter was chosen for its low computational cost, speed, good precision and low-to-medium hardware resource utilization. The technique is simple to implement; however, if all pixels are indiscriminately filtered, the result is a blurry image, which is undesirable. Since the human eye is more sensitive to contrast, a technique was introduced to preserve sharp contour transitions, which, in the author's opinion, is the contribution of this dissertation. Synthetic and real images were tested. Synthetic images, composed of both sharp and soft tone transitions, were generated with a purpose-built algorithm, while real images were captured with an 8-kbit (8192 shades) high resolution sensor and scaled up to 10 × 10³ shades. A least-squares polynomial data-smoothing filter, Savitzky-Golay, was used for comparison. It can be adjusted using three degrees of freedom: the window frame length, which varies the size of the filtering neighborhood between pixels; the derivative order, which varies the curviness; and the polynomial coefficients, which change the adaptability of the curve. The Moving Average filter permits only one degree of freedom, the window frame length. Tests revealed promising results with 2nd and 4th polynomial orders. Higher qualitative results were achieved with Savitzky-Golay, owing to its better preservation of signal characteristics, especially at high frequencies. The FPGA algorithms were implemented in 64-bit integer registers, serving two purposes: increasing precision, thus reducing the error relative to a floating-point implementation, and accommodating the registers' growing cumulative multiplications. Results were then compared with MATLAB's double-precision 64-bit floating-point computations to verify the error difference between the two. The comparison parameters used were Mean Squared Error, Signal-to-Noise Ratio and the Similarity coefficient.
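    A minimal sketch of the idea, under illustrative assumptions: smooth with a moving average only where the local gradient is below a threshold, so sharp contours pass through unfiltered, and compare against SciPy's Savitzky-Golay filter. The window and threshold values are placeholders, not the dissertation's parameters.

```python
import numpy as np
from scipy.signal import savgol_filter

def edge_preserving_moving_average(signal, window=5, threshold=200.0):
    """Moving average that skips samples near sharp transitions.

    Samples whose local gradient magnitude exceeds `threshold` are passed
    through unfiltered, preserving contours; flat regions are smoothed.
    """
    out = signal.astype(float).copy()
    half = window // 2
    grad = np.abs(np.gradient(out))
    for i in range(half, len(out) - half):
        if grad[i] < threshold:                 # smooth only flat regions
            out[i] = out[i - half:i + half + 1].mean()
    return out

# Savitzky-Golay comparison, as used in the dissertation (SciPy version):
# smoothed = savgol_filter(signal, window_length=5, polyorder=2)
```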

    FPGA BASED PARALLEL IMPLEMENTATION OF STACKED ERROR DIFFUSION ALGORITHM

    Digital halftoning is a crucial technique used in digital printers to convert a continuous-tone image into a pattern of black and white dots. Halftoning is needed because printers have a limited set of inks and cannot reproduce all the color intensities of a continuous image. Error diffusion is a halftoning algorithm that iteratively quantizes pixels in a neighborhood-dependent fashion. This thesis focuses on the development and design of a parallel, scalable hardware architecture for high-performance implementation of the high-quality Stacked Error Diffusion algorithm. The algorithm is described in ‘C’ and requires significant processing time when implemented on a conventional CPU. Thus, a new hardware processor architecture is developed to implement the algorithm, and is implemented and tested on a Xilinx Virtex 5 FPGA chip. There is an extraordinary decrease in the run time of the algorithm when run on the proposed parallel architecture on FPGA technology compared to execution on a single CPU. The new parallel architecture is described using the Verilog Hardware Description Language, and post-synthesis and post-implementation performance-based HDL simulation of the new parallel architecture is validated using the ModelSim CAD simulation tool.
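    For reference, the sketch below implements the classic serial Floyd-Steinberg error diffusion that such architectures accelerate. It illustrates the neighborhood-dependent quantization described above, not the Stacked Error Diffusion variant or its parallel hardware mapping.

```python
import numpy as np

def floyd_steinberg(image):
    """Serial error diffusion with the classic Floyd-Steinberg weights.

    Each pixel is thresholded to black or white and its quantization
    error is diffused to the unprocessed right/lower neighbors, which
    creates the serial dependency that makes parallelization hard.
    """
    img = image.astype(float).copy()
    h, w = img.shape
    out = np.zeros_like(img)
    for y in range(h):
        for x in range(w):
            old = img[y, x]
            new = 255.0 if old >= 128 else 0.0
            out[y, x] = new
            err = old - new
            if x + 1 < w:
                img[y, x + 1] += err * 7 / 16
            if y + 1 < h and x > 0:
                img[y + 1, x - 1] += err * 3 / 16
            if y + 1 < h:
                img[y + 1, x] += err * 5 / 16
            if y + 1 < h and x + 1 < w:
                img[y + 1, x + 1] += err * 1 / 16
    return out
```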