
    Implementation of JPEG compression and motion estimation on FPGA hardware

    A hardware implementation of JPEG allows for real-time compression in data-intensive applications, such as high-speed scanning, medical imaging and satellite image transmission. Implementation options include dedicated DSP or media processors, FPGA boards, and ASICs. Factors that affect the choice of platform include cost, speed, memory, size, power consumption, and ease of reconfiguration. The proposed hardware solution is based on a Very high speed integrated circuit Hardware Description Language (VHDL) implementation of the codec, with the preferred realization being an FPGA board due to speed, cost and flexibility factors. The VHDL language is commonly used to model hardware implementations from a top-down perspective. The VHDL code may be simulated to correct mistakes and subsequently synthesized into hardware using a synthesis tool, such as the Xilinx ISE suite. The same VHDL code may be synthesized into a number of different hardware architectures depending on the constraints given. For example, speed was the major constraint when synthesizing the JPEG encoding and decoding pipeline, while chip area and power consumption were the primary constraints when synthesizing the on-die memory because of its large area. Thus, there is a trade-off between area and speed in logic synthesis.
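
    As a rough illustration of the kind of block such a codec pipelines, the following Python/NumPy sketch (our example, not code from the paper; function names are illustrative) computes the 8x8 forward DCT at the heart of JPEG encoding. In VHDL, each matrix multiply would typically become a pipelined multiply-accumulate stage synthesized under a speed constraint.

        import numpy as np

        def dct_matrix(n=8):
            # Orthonormal DCT-II basis: D[u, x] = c(u) * cos((2x + 1) * u * pi / (2n))
            d = np.array([[np.cos((2 * x + 1) * u * np.pi / (2 * n))
                           for x in range(n)] for u in range(n)])
            d *= np.sqrt(2.0 / n)
            d[0] /= np.sqrt(2.0)
            return d

        def dct_8x8(block):
            # Separable 2-D DCT as two matrix multiplies (rows, then columns);
            # in hardware each multiply would map to a pipelined MAC array.
            d = dct_matrix()
            return d @ block @ d.T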

    VIABILITY OF TIME-MEMORY TRADE-OFFS IN LARGE DATA SETS

    The central hypothesis of this paper is that compression performance – in both hardware and software – is at, approaching, or will eventually reach a point where real-time compression of cached data in large data sets becomes viable for improving hit ratios and overall throughput. The problem identified is that storage access is unable to keep up with application and user demands, while cache (RAM) is too small to contain full data sets. A literature review of several existing techniques discusses how storage IO is reduced or optimized to maximize the available performance of the storage medium; however, none of the techniques discovered preclude, or are mutually exclusive with, the hypothesis proposed herein. The methodology benchmarks three popular compressors which meet the criteria for viability: zlib, lz4, and zstd. Common storage devices are also benchmarked to establish costs for both IO and compression operations, in order to build charts and discover break-even points under various circumstances. The results indicate that modern CISC processors and compressors are already approaching trade-off viability, and that FPGAs and ASICs could potentially reduce all overhead by pipelining compression, nearly eliminating the cost portion of the trade-off and leaving mostly benefit.
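
    A minimal Python sketch of the break-even comparison described here, using the standard-library zlib as one of the gauged compressors; the 500 MB/s sequential-read figure is an assumed placeholder, not a number from the paper.

        import time
        import zlib

        def breakeven(data, disk_mb_s=500.0, level=1):
            # Measure compression ratio and (de)compression time for one buffer
            t0 = time.perf_counter()
            comp = zlib.compress(data, level)
            t_compress = time.perf_counter() - t0
            t0 = time.perf_counter()
            zlib.decompress(comp)
            t_decompress = time.perf_counter() - t0
            ratio = len(data) / len(comp)
            # Cost model: reading raw bytes vs. reading fewer compressed
            # bytes plus paying the decompression overhead
            mb = len(data) / 1e6
            t_raw = mb / disk_mb_s
            t_compressed = (mb / ratio) / disk_mb_s + t_decompress
            return ratio, t_raw, t_compressed

    Caching compressed data is past the break-even point whenever t_compressed is below t_raw; the slower the storage and the better the ratio, the larger the win.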

    Gbit/second lossless data compression hardware

    This thesis investigates how to improve the performance of lossless data compression hardware as a tool to reduce the cost per bit stored in a computer system or transmitted over a communication network. Lossless data compression allows the exact reconstruction of the original data after decompression. Its deployment in some high-bandwidth applications has been hampered by performance limitations in the compression hardware, which needs to match the performance of the original system to avoid becoming a bottleneck. Advancing the area of lossless data compression hardware hence offers a valid motivation, with the potential of doubling the performance of the system that incorporates it with minimal investment. This work starts by presenting an analysis of current compression methods with the objective of identifying the factors that limit performance and also the factors that increase it. [Continues.]

    Hardware Acceleration of the Robust Header Compression (RoHC) Algorithm

    With the proliferation of Long Term Evolution (LTE) networks, many cellular carriers are embracing the emerging field of mobile Voice over Internet Protocol (VoIP). The robust header compression (RoHC) framework was introduced as part of the LTE Layer 2 stack to compress the large headers of VoIP packets before they are transmitted over LTE IP-based architectures. The headers, which form an encapsulated Real-time Transport Protocol (RTP)/User Datagram Protocol (UDP)/Internet Protocol (IP) stack, are large compared to the small payload. This header-compression scheme is especially useful for efficient utilization of the radio bandwidth and network resources. In an LTE base-station implementation, RoHC is a processing-intensive algorithm that may be the bottleneck of the system, and thus the limiting factor in the number of users served. In this thesis, a hardware-software and a full-hardware solution are proposed, targeting LTE base-stations, to accelerate this computationally intensive algorithm and enhance the throughput and capacity of the system. The results of both solutions are discussed and compared with respect to design metrics such as throughput, capacity, power consumption, chip area and flexibility. This comparison is instrumental in taking architectural-level trade-off decisions in order to meet present-day requirements and also be ready to support future evolution. In terms of throughput, a gain of 20% (6250 packets/sec processed at a frequency of 150 MHz) is achieved in the HW-SW solution compared to the SW-only solution by implementing the Cyclic Redundancy Check (CRC) and the Least Significant Bit (LSB) encoding blocks as hardware accelerators, whereas the full-HW implementation reaches 45 times the throughput of the SW-only solution (244000 packets/sec processed at a frequency of 100 MHz). However, the full-HW solution consumes more Lookup Tables (LUTs) when synthesized on a Field-Programmable Gate Array (FPGA) platform than the HW-SW solution: on an Arria II GX, the HW-SW and full-HW solutions use 2578 and 7477 LUTs and consume 1.5 and 0.9 Watts, respectively. Finally, both solutions are synthesized and verified on Altera's Arria II GX FPGA.
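
    To give a flavour of one of the accelerated blocks, here is a simplified Python model of least-significant-bit encoding in the RoHC style (our sketch, not the thesis's hardware design): the compressor sends only the k low-order bits of a field, and the decompressor recovers the full value using a reference it already holds. The interpretation-interval offset p is fixed to 0 by default for brevity.

        def lsb_encode(value, k):
            # Transmit only the k least significant bits of the field
            return value & ((1 << k) - 1)

        def lsb_decode(bits, k, v_ref, p=0):
            # Pick the unique value carrying these k LSBs inside the
            # interpretation interval [v_ref - p, v_ref + 2**k - 1 - p]
            base = v_ref - p
            candidate = (base & ~((1 << k) - 1)) | bits
            if candidate < base:
                candidate += 1 << k
            return candidate

    For example, with a reference of 1004 and k = 4, a sequence number of 1005 round-trips in four bits: lsb_decode(lsb_encode(1005, 4), 4, v_ref=1004) returns 1005.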

    Efficient Algorithms for Large-Scale Image Analysis

    This work develops highly efficient algorithms for analyzing large images. Applications include object-based change detection and screening. The algorithms are 10-100 times as fast as existing software, sometimes even outperforming FPGA/GPU hardware, because they are designed to suit the computer architecture. This thesis describes the implementation details and the underlying algorithm-engineering methodology, so that both may also be applied to other applications.

    A Low Power Application-Specific Integrated Circuit (ASIC) Implementation of Wavelet Transform/Inverse Transform

    A unique ASIC was designed implementing the Haar wavelet transform for image compression/decompression. ASIC operations include performing the Haar wavelet transform on a 512 by 512 pixel image, preparing the image for transmission by quantizing and thresholding the transformed data, and performing the inverse Haar wavelet transform, returning the original image with only minor degradation. The ASIC is based on an existing four-chip FPGA implementation. Implementing the design as a dedicated ASIC enhances the speed, decreases the chip count to a single die, and uses significantly less power compared to the FPGA implementation. A reduction of RAM accesses was realized, and a trade-off between states and duplication of components for parallel operation was key to the performance gains. Almost half of the external RAM accesses were removed from the FPGA design by incorporating an internal register file. This cut the number of states needed to process an image, increasing the image frame rate by 13%, and decreased I/O traffic on the bus by 47%. Adding control lines to the ALU components, thus eliminating unnecessary switching of combinational logic blocks, further reduced power requirements. The 22 mm2 ASIC consumes an estimated 430 mW when operating at its maximum frequency of 17 MHz.
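
    For context, the following Python/NumPy sketch shows one level of the 2-D Haar transform and its inverse for an even-sized image such as 512 by 512. It mirrors the averaging and differencing the chip performs, though the ASIC naturally works in fixed-point hardware rather than floating-point software.

        import numpy as np

        def haar2d(img):
            # One level of the 2-D Haar transform: averages and details
            lo = (img[:, 0::2] + img[:, 1::2]) / 2.0   # horizontal average
            hi = (img[:, 0::2] - img[:, 1::2]) / 2.0   # horizontal detail
            ll = (lo[0::2] + lo[1::2]) / 2.0           # approximation subband
            lh = (lo[0::2] - lo[1::2]) / 2.0           # vertical detail
            hl = (hi[0::2] + hi[1::2]) / 2.0           # horizontal detail
            hh = (hi[0::2] - hi[1::2]) / 2.0           # diagonal detail
            return ll, lh, hl, hh

        def ihaar2d(ll, lh, hl, hh):
            # Exact inverse of haar2d (up to floating-point rounding)
            lo = np.empty((ll.shape[0] * 2, ll.shape[1]))
            hi = np.empty_like(lo)
            lo[0::2], lo[1::2] = ll + lh, ll - lh
            hi[0::2], hi[1::2] = hl + hh, hl - hh
            img = np.empty((lo.shape[0], lo.shape[1] * 2))
            img[:, 0::2], img[:, 1::2] = lo + hi, lo - hi
            return img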

    Automated Debugging Methodology for FPGA-based Systems

    Electronic devices make up a vital part of our lives, from mobile phones, laptops and computers to home automation systems. Modern designs comprise billions of transistors. With this evolution, however, ensuring that devices fulfil the designer's expectations under variable conditions has become a great challenge, demanding considerable design time and effort. Whenever an error is encountered, the process is restarted; hence, it is desirable to minimize the number of spins required to achieve an error-free product, as each spin results in a loss of time and effort. Software-based simulation is the main technique for verifying a design before fabrication. However, a few design errors (bugs) are likely to escape the simulation process and subsequently appear during the post-silicon phase. Finding such bugs is time-consuming due to the limited internal visibility of hardware. Instead of software simulation of the design in the pre-silicon phase, post-silicon techniques permit designers to verify functionality through physical implementations of the design. The main benefit of this methodology is that the implemented design runs many orders of magnitude faster than its pre-silicon counterpart, allowing designers to validate their design more exhaustively.

    This thesis presents five main contributions to enable a fast and automated debugging solution for reconfigurable hardware; an obstacle avoidance system for robotic vehicles serves as a use case to illustrate how to apply the proposed debugging solution in practical environments. The first contribution presents a debugging system capable of providing a lossless trace of debugging data, permitting a cycle-accurate replay. This methodology ensures capturing permanent as well as intermittent errors in the implemented design. The contribution also describes a solution to enhance hardware observability: it is proposed to utilize processor-configurable concentration networks, employ debug data compression to transmit the data more efficiently, and partially reconfigure the debugging system at run-time to save the time required for design re-compilation as well as to preserve timing closure. The second contribution presents a solution for communication-centric designs, and also discusses solutions for designs with multiple clock domains. The third contribution presents a priority-based signal selection methodology to identify the signals that are most helpful during the debugging process, together with a connectivity generation tool that maps the identified signals to the debugging system. The fourth contribution presents an automated error detection solution which can capture permanent as well as intermittent errors without continuous monitoring of debugging data; the proposed solution works even in the absence of a golden reference. The fifth contribution proposes the use of artificial intelligence for post-silicon debugging: we present a novel idea of using a recurrent neural network for debugging when a golden reference is available for training the network, and extend the idea to designs where no golden reference is present.
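
    As a toy illustration of golden-reference checking (our example, not the thesis's implementation), a lossless, cycle-accurate trace reduces error detection to locating the first cycle where the captured data deviates from the reference:

        def first_divergence(trace, golden):
            # Compare a captured cycle-accurate signal trace against a
            # golden reference and report the first mismatching cycle.
            for cycle, (seen, expected) in enumerate(zip(trace, golden)):
                if seen != expected:
                    return cycle, seen, expected
            return None  # no error observed in this replay

    Intermittent errors are one reason lossless capture matters: a mismatch that appears in only one replay out of many would likely be missed by sampling-based approaches.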

    Efficient query processing for scalable web search

    Search engines are exceptionally important tools for accessing information in today's world. In satisfying the information needs of millions of users, the effectiveness (the quality of the search results) and the efficiency (the speed at which the results are returned to the users) of a search engine are two goals that form a natural trade-off, as techniques that improve the effectiveness of the search engine can also make it less efficient. Meanwhile, search engines continue to rapidly evolve, with larger indexes, more complex retrieval strategies and growing query volumes. Hence, there is a need for the development of efficient query processing infrastructures that make appropriate sacrifices in effectiveness in order to make gains in efficiency. This survey comprehensively reviews the foundations of search engines, from index layouts to basic term-at-a-time (TAAT) and document-at-a-time (DAAT) query processing strategies, while also covering the latest trends in the literature on efficient query processing, including coherent and systematic reviews of techniques such as dynamic pruning and impact-sorted posting lists, as well as their variants and optimisations. Our explanations of query processing strategies, for instance the WAND and BMW dynamic pruning algorithms, are presented with illustrative figures showing how the processing state changes as the algorithms progress. Moreover, acknowledging the recent trends in applying a cascading infrastructure within search systems, this survey describes techniques for efficiently integrating effective learned models, such as those obtained from learning-to-rank techniques. The survey also covers the selective application of query processing techniques, often achieved by predicting the response times of the search engine (known as query efficiency prediction), and making per-query tradeoffs between efficiency and effectiveness to ensure that the required retrieval speed targets can be met. Finally, the survey concludes with a summary of open directions in efficient search infrastructures, namely the use of signatures, real-time, energy-efficient and modern hardware and software architectures.
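
    To make the dynamic pruning idea concrete, here is a compact Python sketch of WAND-style top-k retrieval over posting lists of (docid, score) pairs. It is a simplification for illustration (per-term upper bounds, pivot selection, and a threshold taken from the current top-k heap), not code from the survey.

        import heapq

        class Cursor:
            # Cursor over one term's posting list, sorted by docid
            def __init__(self, postings, max_score):
                self.postings = postings      # list of (docid, score)
                self.max_score = max_score    # term upper bound for pruning
                self.pos = 0

            def docid(self):
                if self.pos < len(self.postings):
                    return self.postings[self.pos][0]
                return float("inf")

            def score(self):
                return self.postings[self.pos][1]

            def next_geq(self, target):
                while self.docid() < target:
                    self.pos += 1

        def wand_top_k(cursors, k):
            heap, threshold = [], 0.0   # min-heap of the current top-k
            while True:
                cursors.sort(key=lambda c: c.docid())
                acc, pivot = 0.0, None
                for i, c in enumerate(cursors):   # find the pivot term
                    if c.docid() == float("inf"):
                        break
                    acc += c.max_score
                    if acc > threshold:
                        pivot = i
                        break
                if pivot is None:                 # no doc can beat the threshold
                    return sorted(heap, reverse=True)
                doc = cursors[pivot].docid()
                if cursors[0].docid() == doc:
                    # all lists up to the pivot are aligned: score fully
                    s = sum(c.score() for c in cursors if c.docid() == doc)
                    for c in cursors:
                        if c.docid() == doc:
                            c.next_geq(doc + 1)
                    if len(heap) < k:
                        heapq.heappush(heap, (s, doc))
                    elif s > heap[0][0]:
                        heapq.heapreplace(heap, (s, doc))
                    if len(heap) == k:
                        threshold = heap[0][0]
                else:
                    # documents before the pivot cannot enter the top-k
                    cursors[0].next_geq(doc)

    Block-Max WAND (BMW) refines this scheme by maintaining per-block rather than per-list upper bounds, allowing whole blocks of postings to be skipped.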

    Design, implementation and optimization of the image compression system on the on-board computer of the Eye-Sat nanosatellite project

    Eye-Sat is a nanosatellite project, led by CNES (Centre National d'Etudes Spatiales) and developed mainly by students from several engineering schools across France. The purpose of this small telescope lies not only in the opportunity to demonstrate various technological devices, but also in its mission to acquire colour and infrared photographs of the Milky Way and to study the intensity and polarization of the Zodiacal light. The mission requirements call for the development of a lossless compression algorithm for the "Color Filter Array" CFA (Bayer) and infrared images acquired by the satellite. As a member of the Consultative Committee for Space Data Systems, CNES has selected the CCSDS-123.0-B standard as the baseline algorithm to meet the mission requirements. Modifications and improvements, tailored to the target image types, will be added to this algorithm in order to improve its compression performance and complexity. The implementation and optimization of the algorithm will be carried out on the Xilinx Zynq® All Programmable SoC platform, which includes an FPGA and a dual-core ARM® Cortex™-A9 processor with NEON™ DSP/FPU engine.
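
    The flavour of such predictive lossless coding can be sketched in a few lines of Python/NumPy. This toy predictor uses only the left neighbour within a single colour plane (far simpler than the adaptive predictor of CCSDS-123.0-B, and assuming integer samples) and maps the signed residuals to non-negative integers ready for entropy coding:

        import numpy as np

        def mapped_residuals(plane):
            # Left-neighbour prediction inside one CFA colour plane
            pred = np.empty_like(plane)
            pred[:, 0] = 0
            pred[:, 1:] = plane[:, :-1]
            delta = plane.astype(np.int64) - pred.astype(np.int64)
            # Zig-zag map signed residuals to unsigned integers:
            # 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
            return np.where(delta >= 0, 2 * delta, -2 * delta - 1)

    For Bayer data, predicting within the same colour plane matters because adjacent sensor pixels belong to different colour channels and correlate poorly.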

    Analytical Query Processing Using Heterogeneous SIMD Instruction Sets

    Numerous applications gather increasing amounts of data, which have to be managed and queried. Different hardware developments help to meet this challenge. The growing capacity of main memory enables database systems to keep all their data in memory. Additionally, the hardware landscape is becoming more diverse: a plethora of homogeneous and heterogeneous co-processors is available, where heterogeneity refers not only to different computing power, but also to different instruction set architectures. For instance, modern Intel® CPUs offer different instruction sets supporting the Single Instruction Multiple Data (SIMD) paradigm, e.g. SSE, AVX, and AVX512. Database systems have started to exploit SIMD to increase performance. However, this is still a challenging task, because existing algorithms were mainly developed for scalar processing and because there is a huge variety of different instruction sets, which were never standardized and have no unified interface. This requires completely rewriting the source code when porting a system to another hardware architecture, even if the architectures are not fundamentally different and are designed by the same company. Moreover, operations on large registers, which are the core principle of SIMD processing, behave counter-intuitively in several cases. This is especially true for analytical query processing, where different memory access patterns and the data dependencies caused by data compression challenge the limits of the SIMD principle. Finally, there are physical constraints on the use of such instructions affecting CPU frequency scaling, which is further influenced by the use of multiple cores: because the supply power of a CPU is limited, not all transistors can be powered at the same time. Hence, there is a complex relationship between performance and power, and therefore also between performance and energy consumption. This thesis addresses the specific challenges introduced by the application of SIMD in general, and by the heterogeneity of SIMD ISAs in particular. The goal of this thesis is to exploit the potential of heterogeneous SIMD ISAs for increasing both performance and energy efficiency.
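
    As a conceptual illustration only, the data-parallel core of an analytical scan looks like the following; NumPy's vectorized kernels stand in for explicit SSE/AVX intrinsics, which Python cannot express, and a real engine would issue the same predicate on whole SIMD registers through an ISA-specific or unified wrapper.

        import numpy as np

        def filtered_sum(keys, values, lo, hi):
            # Evaluate the predicate on all elements at once, producing a
            # mask (the SIMD analogue is a comparison yielding a bitmask),
            # then aggregate the selected values (masked gather + add).
            mask = (keys >= lo) & (keys < hi)
            return values[mask].sum()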