509 research outputs found
Simple Signal Extension Method for Discrete Wavelet Transform
Discrete wavelet transform of finite-length signals must necessarily handle
the signal boundaries. The state-of-the-art approaches treat such boundaries in
a complicated and inflexible way, using special prolog or epilog phases. This
holds true in particular for images decomposed into a number of scales,
exemplary in JPEG 2000 coding system. In this paper, the state-of-the-art
approaches are extended to perform the treatment using a compact streaming
core, possibly in multi-scale fashion. We present the core focused on CDF 5/3
wavelet and the symmetric border extension method, both employed in the JPEG
2000. As a result of our work, every input sample is visited only once, while
the results are produced immediately, i.e. without buffering.Comment: preprint; presented on ICSIP 201
A C++-embedded Domain-Specific Language for programming the MORA soft processor array
MORA is a novel platform for high-level FPGA programming of streaming vector and matrix operations, aimed at multimedia applications. It consists of soft array of pipelined low-complexity SIMD processors-in-memory (PIM). We present a Domain-Specific Language (DSL) for high-level programming of the MORA soft processor array. The DSL is embedded in C++, providing designers with a familiar language framework and the ability to compile designs using a standard compiler for functional testing before generating the FPGA bitstream using the MORA toolchain. The paper discusses the MORA-C++ DSL and the compilation route into the assembly for the MORA machine and provides examples to illustrate the programming model and performance
Bitplane image coding with parallel coefficient processing
Image coding systems have been traditionally tailored for multiple instruction, multiple data (MIMD) computing. In general, they partition the (transformed) image in codeblocks that can be coded in the cores of MIMD-based processors. Each core executes a sequential flow of instructions to process the coefficients in the codeblock, independently and asynchronously from the others cores. Bitplane coding is a common strategy to code such data. Most of its mechanisms require sequential processing of the coefficients. The last years have seen the upraising of processing accelerators with enhanced computational performance and power efficiency whose architecture is mainly based on the single instruction, multiple data (SIMD) principle. SIMD computing refers to the execution of the same instruction to multiple data in a lockstep synchronous way. Unfortunately, current bitplane coding strategies cannot fully profit from such processors due to inherently sequential coding task. This paper presents bitplane image coding with parallel coefficient (BPC-PaCo) processing, a coding method that can process many coefficients within a codeblock in parallel and synchronously. To this end, the scanning order, the context formation, the probability model, and the arithmetic coder of the coding engine have been re-formulated. The experimental results suggest that the penalization in coding performance of BPC-PaCo with respect to the traditional strategies is almost negligible
Parallel 3D Fast Wavelet Transform comparison on CPUs and GPUs
We present in this paper several implementations of the 3D Fast Wavelet Transform (3D-FWT) on multicore CPUs and manycore GPUs. On the GPU side, we focus on CUDA and OpenCL programming to develop methods for an efficient mapping on manycores. On multicore CPUs, OpenMP and Pthreads are used as counterparts to maximize parallelism, and renowned techniques like tiling and blocking are exploited to optimize the use of memory. We evaluate these proposals and make a comparison between a new Fermi Tesla C2050 and an Intel Core 2 QuadQ6700. Speedups of the CUDA version are the best results, improving the execution times on CPU, ranging from 5.3x to 7.4x for different image sizes, and up to 81 times faster when communications are neglected. Meanwhile, OpenCL obtains solid gains which range from 2x factors on small frame sizes to 3x factors on larger ones
Fully Scalable Video Coding Using Redundant-Wavelet Multihypothesis and Motion-Compensated Temporal Filtering
In this dissertation, a fully scalable video coding system is proposed. This system achieves full temporal, resolution, and fidelity scalability by combining mesh-based motion-compensated temporal filtering, multihypothesis motion compensation, and an embedded 3D wavelet-coefficient coder. The first major contribution of this work is the introduction of the redundant-wavelet multihypothesis paradigm into motion-compensated temporal filtering, which is achieved by deploying temporal filtering in the domain of a spatially redundant wavelet transform. A regular triangle mesh is used to track motion between frames, and an affine transform between mesh triangles implements motion compensation within a lifting-based temporal transform. Experimental results reveal that the incorporation of redundant-wavelet multihypothesis into mesh-based motion-compensated temporal filtering significantly improves the rate-distortion performance of the scalable coder. The second major contribution is the introduction of a sliding-window implementation of motion-compensated temporal filtering such that video sequences of arbitrarily length may be temporally filtered using a finite-length frame buffer without suffering from severe degradation at buffer boundaries. Finally, as a third major contribution, a novel 3D coder is designed for the coding of the 3D volume of coefficients resulting from the redundant-wavelet based temporal filtering. This coder employs an explicit estimate of the probability of coefficient significance to drive a nonadaptive arithmetic coder, resulting in a simple software implementation. Additionally, the coder offers the possibility of a high degree of vectorization particularly well suited to the data-parallel capabilities of modern general-purpose processors or customized hardware. Results show that the proposed coder yields nearly the same rate-distortion performance as a more complicated coefficient coder considered to be state of the art
Evaluation of GPU/CPU Co-Processing Models for JPEG 2000 Packetization
With the bottom-line goal of increasing the
throughput of a GPU-accelerated JPEG 2000 encoder, this paper
evaluates whether the post-compression rate control and
packetization routines should be carried out on the CPU or on
the GPU. Three co-processing models that differ in how the
workload is split among the CPU and GPU are introduced. Both
routines are discussed and algorithms for executing them in
parallel are presented. Experimental results for compressing a
detail-rich UHD sequence to 4 bits/sample indicate speed-ups of
200x for the rate control and 100x for the packetization
compared to the single-threaded implementation in the
commercial Kakadu library. These two routines executed on the
CPU take 4x as long as all remaining coding steps on the GPU
and therefore present a bottleneck. Even if the CPU bottleneck
could be avoided with multi-threading, it is still beneficial to
execute all coding steps on the GPU as this minimizes the
required device-to-host transfer and thereby speeds up the
critical path from 17.2 fps to 19.5 fps for 4 bits/sample and to
22.4 fps for 0.16 bits/sample
GPU-oriented architecture for an end-to-end image/video codec based on JPEG2000
Modern image and video compression standards employ computationally intensive algorithms that provide advanced features to the coding system. Current standards often need to be implemented in hardware or using expensive solutions to meet the real-time requirements of some environments. Contrarily to this trend, this paper proposes an end-to-end codec architecture running on inexpensive Graphics Processing Units (GPUs) that is based on, though not compatible with, the JPEG2000 international standard for image and video compression. When executed in a commodity Nvidia GPU, it achieves real time processing of 12K video. The proposed S/W architecture utilizes four CUDA kernels that minimize memory transfers, use registers instead of shared memory, and employ a double-buffer strategy to optimize the streaming of data. The analysis of throughput indicates that the proposed codec yields results at least 10× superior on average to those achieved with JPEG2000 implementations devised for CPUs, and approximately 4× superior to those achieved with hardwired solutions of the HEVC/H.265 video compression standard
Performance Analysis of Modified Lifting Based DWT Architecture and FPGA Implementation for Speed and Power
Demand for high speed and low power architecture for DWT computation have led to design of novel algorithms and architecture In this paper we design model and implement a hardware efficient high speed and power efficient DWT architecture based on modified lifting scheme algorithm The design is interfaced with SIPO and PISO to reduce the number of I O lines on the FPGA The design is implemented on Spartan III device and is compared with lifting scheme logic The proposed design operates at frequency of 280 MHz and consumes power less than 42 mW The presynthesis and post-synthesis results are verified and suitable test vectors are used in verifying the functionality of the design The design is suitable for real time data processin
- …