387 research outputs found

    Energy-efficient acceleration of MPEG-4 compression tools

    We propose novel hardware accelerator architectures for the most computationally demanding algorithms of the MPEG-4 video compression standard: motion estimation, binary motion estimation (for shape coding), and the forward/inverse discrete cosine transforms (incorporating shape adaptive modes). These accelerators have been designed using general low-energy design philosophies at the algorithmic/architectural abstraction levels. The themes of these philosophies are avoiding waste and trading area/performance for power and energy gains. Each core has been synthesised targeting TSMC 0.09 μm TCBN90LP technology, and the experimental results presented in this paper show that the proposed cores improve upon the prior art.

    A distributed arithmetic based CORDIC algorithm and its use in the FPGA implementation of the 2-D IDCT

    Discrete cosine transform (DCT) based image compression techniques play an important role in today's digital applications. A video codec chip requires the integration of high-speed DCT and inverse DCT (IDCT) hardware units in a limited silicon space. This thesis presents a distributed arithmetic based CORDIC algorithm for the computation of the 1-D IDCT and an FPGA implementation of a cost-effective architecture for a 2-D IDCT processor using the proposed algorithm. The processor, consisting of two 1-D IDCT cores, a transpose memory and a control logic block, performs the 2-D IDCT computation by using the row-column decomposition approach. The basis of the proposed scheme is a combined use of distributed arithmetic and the CORDIC algorithm in order to provide a small access time to the lookup tables and a reduced complexity for its architecture. In the proposed design, the deep pipeline structure of an existing CORDIC based architecture is replaced by much smaller DA-based ROM accumulators, and a bit-level digit-serial structure based on the redundant number system using an on-line algorithm is employed.
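
    The row-column decomposition mentioned in this abstract can be sketched in a few lines of Python. This is only an illustrative software model, assuming an orthonormal floating-point 8-point IDCT computed directly from its definition; it does not reproduce the thesis's distributed arithmetic or CORDIC datapath, and the names `idct_1d`, `transpose` and `idct_2d` are hypothetical.

```python
import math


def idct_1d(coeffs):
    """Direct orthonormal 1-D inverse DCT (inverse of DCT-II) of a sequence."""
    n = len(coeffs)
    out = []
    for x in range(n):
        total = 0.0
        for u in range(n):
            # The DC term uses a different normalisation factor than the AC terms.
            scale = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
            total += scale * coeffs[u] * math.cos((2 * x + 1) * u * math.pi / (2 * n))
        out.append(total)
    return out


def transpose(block):
    """Swap rows and columns (the role played by the transpose memory)."""
    return [list(row) for row in zip(*block)]


def idct_2d(block):
    """2-D IDCT by row-column decomposition: rows, transpose, rows again, transpose back."""
    rows_done = [idct_1d(row) for row in block]
    cols_done = [idct_1d(col) for col in transpose(rows_done)]
    return transpose(cols_done)


# A block holding only a DC coefficient of 8 decodes to a flat block of ones.
dc_only = [[0.0] * 8 for _ in range(8)]
dc_only[0][0] = 8.0
pixels = idct_2d(dc_only)
```

    The two `idct_1d` passes correspond loosely to the two 1-D IDCT cores, and `transpose` to the transpose memory between them.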

    A New RTL Design Approach for a DCT/IDCT-Based Image Compression Architecture using the mCBE Algorithm

    In the literature, several approaches to designing a DCT/IDCT-based image compression system have been proposed. In this paper, we present a new RTL design approach whose main focus is developing a DCT/IDCT-based image compression architecture using a self-created algorithm. This algorithm can efficiently minimize the number of shifter-adders needed to substitute for multipliers. We call this new algorithm the multiplication from Common Binary Expression (mCBE) Algorithm. Besides this algorithm, we propose alternative quantization numbers, which can be implemented simply as shifters in digital hardware. In most cases, these numbers retain good compressed-image quality compared to the JPEG recommendations. These ideas lead to a design that is small in circuit area, multiplierless, and low in complexity. The proposed 8-point 1D-DCT design has only six stages, while the 8-point 1D-IDCT design has only seven stages (one stage being defined as the delay of one shifter or 2-input adder). By using pipelining, we can achieve a high-speed architecture, with latency as a trade-off consideration. The design has been synthesized and achieves a critical path delay of 1.41 ns (709.22 MHz).
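
    The idea of replacing a multiplier with shifters and adders can be illustrated with a plain binary decomposition of a constant. This is a minimal sketch and not the mCBE algorithm itself, which additionally shares common binary expressions between constants. The helper names and the constant 181 (a common fixed-point approximation of 256·cos(π/4)) are assumptions chosen for illustration.

```python
def shift_add_terms(constant):
    """Return the shift amounts s such that sum(x << s) equals x * constant."""
    if constant < 0:
        raise ValueError("this sketch only handles non-negative constants")
    # One shifted copy of x is needed for every set bit of the constant.
    return [bit for bit in range(constant.bit_length()) if (constant >> bit) & 1]


def multiply_by_constant(x, constant):
    """Multiply x by a constant using only shifts and additions."""
    return sum(x << s for s in shift_add_terms(constant))


# 181 = 0b10110101, so x * 181 = (x<<7) + (x<<5) + (x<<4) + (x<<2) + x.
terms = shift_add_terms(181)
product = multiply_by_constant(7, 181)
```

    Each set bit costs one shifted term, so a constant with k set bits needs k - 1 adders in this naive form; reducing that count is the kind of saving the abstract describes.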

    Towards adaptive balanced computing (ABC) using reconfigurable functional caches (RFCs)

    The general-purpose computing processor performs a wide range of functions. Although the performance of general-purpose processors has been steadily increasing, certain software technologies like multimedia and digital signal processing applications demand ever more computing power. Reconfigurable computing has emerged to combine the versatility of general-purpose processors with the customization ability of ASICs. The basic premise of reconfigurability is to provide better performance and higher computing density than fixed configuration processors. Most of the research in reconfigurable computing is dedicated to on-chip functional logic. If computing resources are adaptable to the computing requirement, the maximum performance can be achieved. To overcome the gap between processor and memory technology, the size of on-chip cache memory has been consistently increasing. The larger cache memory capacity, though beneficial in general, does not guarantee higher performance for all applications, as they may not utilize all of the cache efficiently. To utilize on-chip resources effectively and to accelerate the performance of multimedia applications specifically, we propose a new architecture, Adaptive Balanced Computing (ABC). ABC uses dynamic resource configuration of on-chip cache memory by integrating Reconfigurable Functional Caches (RFCs). An RFC can work as a conventional cache or as a specialized computing unit when necessary. In order to convert a cache memory to a computing unit, we include additional logic to embed multi-bit output LUTs into the cache structure. We add the reconfigurability of cache memory to a conventional processor with minimal modification to the load/store microarchitecture and with minimal compiler assistance. The ABC architecture utilizes resources more efficiently by reconfiguring the cache memory to computing units dynamically. The area penalty for this reconfiguration is about 50-60% of the memory cell cache array-only area, with faster cache access time. In a base array cache (parallel decoding caches), the area penalty is 10-20% of the data array, with a 1-2% increase in the cache access time. However, compared with implementing all these units separately (for example, as ASICs), we save 27% in area for FIR and 44% for DCT/IDCT with respect to the memory cell array cache, and about 80% for both applications with respect to the base array cache. The simulations with multimedia and DSP applications (DCT/IDCT and FIR/IIR) show that resource configuration with the RFC yields speedups ranging from 1.04X to 3.94X for overall applications and from 2.61X to 27.4X for the core computations. The simulations with various parameters indicate that the impact of reconfiguration can be minimized if an appropriate cache organization is selected.
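
    To give a feel for how memory configured as multi-bit output LUTs can act as a computing unit, the sketch below multiplies two 8-bit values using only table lookups, shifts and additions. It is a generic illustration under assumed parameters (a 4-bit by 4-bit product table, unsigned operands) and is not taken from the RFC design; `LUT_4X4` and `lut_multiply_8x8` are hypothetical names.

```python
# 16 x 16 table of 4-bit by 4-bit products; each entry fits in 8 bits.
LUT_4X4 = [[a * b for b in range(16)] for a in range(16)]


def lut_multiply_8x8(a, b):
    """Multiply two unsigned 8-bit values using lookups, shifts and adds only."""
    if not (0 <= a < 256 and 0 <= b < 256):
        raise ValueError("operands must be unsigned 8-bit values")
    # Split each operand into a high and a low 4-bit nibble.
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    # Combine the four partial products with the appropriate weights.
    return (
        (LUT_4X4[a_hi][b_hi] << 8)
        + ((LUT_4X4[a_hi][b_lo] + LUT_4X4[a_lo][b_hi]) << 4)
        + LUT_4X4[a_lo][b_lo]
    )


result = lut_multiply_8x8(200, 123)
```

    The same storage could equally hold ordinary data, which is the trade-off a reconfigurable cache exploits: table contents decide whether the array behaves as memory or as an arithmetic unit.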

    Low power techniques for video compression

    This paper gives an overview of low-power techniques proposed in the literature for mobile multimedia and Internet applications. Exploitable aspects of the behavior of different video compression tools are discussed. These power-efficient solutions are then classified by synthesis domain and level of abstraction. As this paper is meant to be a starting point for further research in the area, a low-power hardware and software co-design methodology is outlined at the end as a possible scenario for video-codec-on-a-chip implementations on future mobile multimedia platforms.

    A Comparative Study of Scheduling Techniques for Multimedia Applications on SIMD Pipelines

    Parallel architectures are essential in order to take advantage of the parallelism inherent in streaming applications. One particular branch of these employs hardware SIMD pipelines. In this paper, we analyse several scheduling techniques, namely ad hoc overlapped execution, modulo scheduling and modulo scheduling with unrolling, all of which aim to efficiently utilize the special architecture design. Our investigation focuses on improving throughput while analysing other metrics that are important for streaming applications, such as register pressure, buffer sizes and code size. Through experiments conducted on several media benchmarks, we present and discuss the trade-offs involved when selecting any one of these scheduling techniques. Comment: Presented at DATE Friday Workshop on Heterogeneous Architectures and Design Methods for Embedded Image Systems (HIS 2015) (arXiv:1502.07241)

    Synthesis of application specific processor architectures for ultra-low energy consumption

    In this paper we suggest that further energy savings can be achieved by a new approach to the synthesis of embedded processor cores, where the architecture is tailored to the algorithms that the core executes. In the context of embedded processor synthesis, both single-core and many-core, the types of algorithms and the demands on execution efficiency are usually known at chip design time. This knowledge can be utilised at the design stage to synthesise architectures optimised for energy consumption. Firstly, we present an overview of both traditional energy saving techniques and new developments in architectural approaches to energy-efficient processing. Secondly, we propose a picoMIPS architecture that serves as an architectural template for energy-efficient synthesis. As a case study, we show how the picoMIPS architecture can be tailored to an energy-efficient execution of the DCT algorithm.