18 research outputs found

    The Parallel Algorithm for the 2-D Discrete Wavelet Transform

    The discrete wavelet transform lies at the heart of many image-processing algorithms. Until now, the transform on general-purpose processors (CPUs) has mostly been computed using a separable lifting scheme. Because the lifting scheme consists of a small number of operations, it is preferred on single-core CPUs. For parallel processing on multi-core processors, however, this scheme is inappropriate because of its large number of steps. On such architectures, the number of steps corresponds to the number of points at which data must be exchanged, and these points often form a performance bottleneck. Our approach rearranges the calculations inside the transform and thereby reduces the number of steps. In other words, we propose a new scheme that is friendly to parallel environments. When evaluated on multi-core CPUs, it consistently outperforms the original lifting scheme. The evaluation was performed on 61-core Intel Xeon Phi and 8-core Intel Xeon processors. Comment: accepted for publication at ICGIP 201
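
To make the notion of lifting "steps" concrete, here is a minimal sketch of one level of a 1-D lifting transform. It assumes the reversible 5/3 wavelet with symmetric boundary extension; the abstract does not name a wavelet, and the function names are hypothetical. This illustrates the conventional lifting scheme only, not the rearranged scheme the authors propose. Each loop is one lifting step, and in a parallel setting each step needs neighbouring values produced by the previous one.

```python
def lifting_53_forward(x):
    """One level of the reversible 5/3 lifting transform (assumed wavelet).

    x must be an even-length list of integers. Returns (approx, detail).
    """
    if len(x) < 2 or len(x) % 2:
        raise ValueError("expected an even-length signal with at least 2 samples")
    s = list(x[0::2])  # even samples become the approximation band
    d = list(x[1::2])  # odd samples become the detail band
    n = len(d)
    # Lifting step 1 (predict): subtract the average of the two even neighbours.
    for i in range(n):
        right = s[i + 1] if i + 1 < n else s[i]  # symmetric extension at the end
        d[i] -= (s[i] + right) >> 1
    # Lifting step 2 (update): add a rounded quarter of the neighbouring details.
    for i in range(n):
        left = d[i - 1] if i > 0 else d[0]  # symmetric extension at the start
        s[i] += (left + d[i] + 2) >> 2
    return s, d


def lifting_53_inverse(s, d):
    """Undo lifting_53_forward by running the steps in reverse order."""
    s, d = list(s), list(d)
    n = len(d)
    for i in range(n):  # undo the update step
        left = d[i - 1] if i > 0 else d[0]
        s[i] -= (left + d[i] + 2) >> 2
    for i in range(n):  # undo the predict step
        right = s[i + 1] if i + 1 < n else s[i]
        d[i] += (s[i] + right) >> 1
    x = [0] * (2 * n)
    x[0::2] = s
    x[1::2] = d
    return x
```

Because every step is undone exactly, `lifting_53_inverse(*lifting_53_forward(x))` returns `x` for any even-length integer list. The second loop cannot start for a given index until its neighbouring outputs from the first loop exist, which is the kind of data-exchange point the abstract describes.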

    Discrete Wavelet Transformation Implementation in GPU through Register Based Strategy

    The significant architectural changes that Nvidia introduced with the Kepler architecture in 2012 gave its GPUs more register memory and a richer instruction set that lets threads exchange register contents. This created the potential for a new programming approach that uses registers for sharing and reusing data in the context of shared memory. Such an approach can considerably improve the performance of applications that reuse data heavily. This work presents a register-based implementation of the Discrete Wavelet Transform (DWT) using CUDA and OpenCV. The DWT is a data decorrelation technique used in video and image coding. The results indicate that this technique performs at least four times better than the best previous GPU implementation of the DWT. Experimental tests also show that the approach achieves performance close to the GPU's performance limits.

    Use of CUDA for the Continuous Space Language Model

    The training phase of the Continuous Space Language Model (CSLM) was implemented on the NVIDIA hardware/software architecture Compute Unified Device Architecture (CUDA). The implementation combined CUBLAS library routines and CUDA kernel calls and was run on three different CUDA-enabled devices of varying compute capability. A time savings over the traditional CPU approach was demonstrated.

    Two-Dimensional Discrete Wavelet Transform on Large Images for Hybrid Computing Architectures: GPU and CELL

    Parallel 3D Fast Wavelet Transform comparison on CPUs and GPUs

    In this paper we present several implementations of the 3D Fast Wavelet Transform (3D-FWT) on multicore CPUs and manycore GPUs. On the GPU side, we focus on CUDA and OpenCL programming to develop methods for an efficient mapping onto manycores. On multicore CPUs, OpenMP and Pthreads are used as counterparts to maximize parallelism, and well-known techniques such as tiling and blocking are exploited to optimize memory use. We evaluate these proposals and compare a new Fermi Tesla C2050 with an Intel Core 2 Quad Q6700. The CUDA version achieves the best speedups over the CPU execution times, ranging from 5.3x to 7.4x for different image sizes, and up to 81x when communications are neglected. Meanwhile, OpenCL obtains solid gains, ranging from 2x on small frame sizes to 3x on larger ones.

    Finding faint HI structure in and around galaxies: scraping the barrel

    Soon-to-be-operational HI survey instruments such as APERTIF and ASKAP will produce large datasets. These surveys will provide information about the HI in and around hundreds of galaxies with a typical signal-to-noise ratio of ∼10 in the inner regions and ∼1 in the outer regions. In addition, such surveys will make it possible to probe faint HI structures, typically located in the vicinity of galaxies, such as extra-planar gas, tails and filaments. These structures are crucial for understanding galaxy evolution, particularly when they are studied in relation to the local environment. Our aim is to find optimized kernels for the discovery of faint and morphologically complex HI structures. Therefore, using HI data from a variety of galaxies, we explore state-of-the-art filtering algorithms. We show that the intensity-driven gradient filter, due to its adaptive characteristics, is the optimal choice. In fact, this filter requires only minimal tuning of the input parameters to enhance the signal-to-noise ratio of faint components. In addition, it does not degrade the resolution of the high signal-to-noise component of a source. The filtering process must be fast and be embedded in an interactive visualization tool in order to support fast inspection of a large number of sources. To achieve such interactive exploration, we implemented a multi-core CPU (OpenMP) and a GPU (OpenGL) version of this filter in a 3D visualization environment (SlicerAstro). Comment: 17 pages, 9 figures, 4 tables. Astronomy and Computing, accepted

    Accelerating wavelet-based video coding on graphics hardware using CUDA

    Bitplane image coding with parallel coefficient processing

    Image coding systems have traditionally been tailored for multiple instruction, multiple data (MIMD) computing. In general, they partition the (transformed) image into codeblocks that can be coded in the cores of MIMD-based processors. Each core executes a sequential flow of instructions to process the coefficients in the codeblock, independently and asynchronously from the other cores. Bitplane coding is a common strategy to code such data. Most of its mechanisms require sequential processing of the coefficients. Recent years have seen the rise of processing accelerators with enhanced computational performance and power efficiency whose architecture is mainly based on the single instruction, multiple data (SIMD) principle. SIMD computing refers to the execution of the same instruction on multiple data in a lockstep, synchronous way. Unfortunately, current bitplane coding strategies cannot fully profit from such processors because their coding tasks are inherently sequential. This paper presents bitplane image coding with parallel coefficient (BPC-PaCo) processing, a coding method that can process many coefficients within a codeblock in parallel and synchronously. To this end, the scanning order, the context formation, the probability model, and the arithmetic coder of the coding engine have been reformulated. The experimental results suggest that the penalty in coding performance of BPC-PaCo with respect to the traditional strategies is almost negligible.
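
As a rough illustration of what a bitplane is, the sketch below splits a codeblock's integer coefficients into sign bits and magnitude bitplanes, most significant plane first, and rebuilds them. This is not BPC-PaCo: it omits the scanning order, context formation, probability model, and arithmetic coder, and the function names are hypothetical. It shows only that, within one plane, the same bit-extraction operation is applied to every coefficient, which is the kind of uniform work a SIMD processor handles well.

```python
def to_bitplanes(coeffs):
    """Split integer coefficients into sign flags and magnitude bitplanes.

    Returns (signs, planes); planes[0] is the most significant bitplane.
    """
    mags = [abs(c) for c in coeffs]
    signs = [c < 0 for c in coeffs]
    nplanes = max(mags, default=0).bit_length()
    # The same shift-and-mask is applied to every coefficient in a plane.
    planes = [[(m >> p) & 1 for m in mags] for p in range(nplanes - 1, -1, -1)]
    return signs, planes


def from_bitplanes(signs, planes):
    """Rebuild the coefficients by accumulating planes from most to least significant."""
    mags = [0] * len(signs)
    for plane in planes:
        mags = [(m << 1) | bit for m, bit in zip(mags, plane)]
    return [-m if negative else m for m, negative in zip(mags, signs)]
```

For the coefficients `[5, -3, 0, 12]` the largest magnitude is 12, so four planes are produced, and `from_bitplanes(*to_bitplanes(coeffs))` returns the original list. Stopping the accumulation after the first few planes gives a coarser approximation of each magnitude, which is what makes bitplane coding progressive.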