Search CORE

18 research outputs found

The Parallel Algorithm for the 2-D Discrete Wavelet Transform

Author: Barina David
Kleparnik Petr
Kula Michal
Najman Pavel
Zemcik Pavel
Publication venue
Publication date: 26/09/2017
Field of study

The discrete wavelet transform can be found at the heart of many image-processing algorithms. Until now, the transform on general-purpose processors (CPUs) was mostly computed using a separable lifting scheme. As the lifting scheme consists of a small number of operations, it is preferred for processing using single-core CPUs. However, considering a parallel processing using multi-core processors, this scheme is inappropriate due to a large number of steps. On such architectures, the number of steps corresponds to the number of points that represent the exchange of data. Consequently, these points often form a performance bottleneck. Our approach appropriately rearranges calculations inside the transform, and thereby reduces the number of steps. In other words, we propose a new scheme that is friendly to parallel environments. When evaluating on multi-core CPUs, we consistently overcome the original lifting scheme. The evaluation was performed on 61-core Intel Xeon Phi and 8-core Intel Xeon processors.Comment: accepted for publication at ICGIP 201

arXiv.org e-Print Archive

Crossref

Discrete Wavelet Transformation Implementation in GPU through Register Based Strategy

Author: Hemkant Balasaheb Gangurde, M. U. Kharat
Publication venue: 'Auricle Technologies, Pvt., Ltd.'
Publication date: 30/06/2017
Field of study

The significant architectural changes made by Nvidia during the launch of Kepler architecture in 2012, upgraded its GPUs with greater register memory and rich instructions set to have communication between registers through available threads. This created a potential for new programming approach which uses registers for sharing and reusing of data in the context of the shared memory. This kind of approach can considerably improve the performance of applications which reuses implied data heavily. This work is based upon of register-based implementation of the Discrete Wavelet Transform (DWT) with the help of CUDA and openCV. DWT is the data decorrelation approach in the area of video and image coding. Results of this particular approach indicate that this technique performs at least four times better than the best GPU implementation of the DWT in past. Experimental tests also prove that this approach shows the performance close to the GPUs performance limits

International Journal on Recent and Innovation Trends in Computing and Communication

Use of CUDA for the Continuous Space Language Model

Author: Anderson Timothy
Thompson Elizabeth A.
Publication venue: Opus: Research \u26 Creativity at IPFW
Publication date: 10/09/2012
Field of study

The training phase of the Continuous Space Language Model (CSLM) was implemented in the NVIDIA hardware/software architecture Compute Unified Device Architecture (CUDA). Implementation was accomplished using a combination of CUBLAS library routines and CUDA kernel calls on three different CUDA enabled devices of varying compute capability and a time savings over the traditional CPU approach demonstrated

Crossref

Opus: Research and Creativity at IPFW

Two-Dimensional Discrete Wavelet Transform on Large Images for Hybrid Computing Architectures: GPU and CELL

Author
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012
Field of study

Crossref

Parallel 3D Fast Wavelet Transform comparison on CPUs and GPUs

Author: Bernabé Gregorio
Publication venue: University of Granada-University of Cadiz
Publication date: 01/01/2015
Field of study

We present in this paper several implementations of the 3D Fast Wavelet Transform (3D-FWT) on multicore CPUs and manycore GPUs. On the GPU side, we focus on CUDA and OpenCL programming to develop methods for an efficient mapping on manycores. On multicore CPUs, OpenMP and Pthreads are used as counterparts to maximize parallelism, and renowned techniques like tiling and blocking are exploited to optimize the use of memory. We evaluate these proposals and make a comparison between a new Fermi Tesla C2050 and an Intel Core 2 QuadQ6700. Speedups of the CUDA version are the best results, improving the execution times on CPU, ranging from 5.3x to 7.4x for different image sizes, and up to 81 times faster when communications are neglected. Meanwhile, OpenCL obtains solid gains which range from 2x factors on small frame sizes to 3x factors on larger ones

Portal de revistas de la Universidad de Granada

DIALNET

Finding faint HI structure in and around galaxies: scraping the barrel

Author: Punzo D.
Roerdink J. B. T. M.
van der Hulst J. M.
Publication venue
Publication date: 13/09/2016
Field of study

Soon to be operational HI survey instruments such as APERTIF and ASKAP will produce large datasets. These surveys will provide information about the HI in and around hundreds of galaxies with a typical signal-to-noise ratio of

\sim

10 in the inner regions and

\sim

1 in the outer regions. In addition, such surveys will make it possible to probe faint HI structures, typically located in the vicinity of galaxies, such as extra-planar-gas, tails and filaments. These structures are crucial for understanding galaxy evolution, particularly when they are studied in relation to the local environment. Our aim is to find optimized kernels for the discovery of faint and morphologically complex HI structures. Therefore, using HI data from a variety of galaxies, we explore state-of-the-art filtering algorithms. We show that the intensity-driven gradient filter, due to its adaptive characteristics, is the optimal choice. In fact, this filter requires only minimal tuning of the input parameters to enhance the signal-to-noise ratio of faint components. In addition, it does not degrade the resolution of the high signal-to-noise component of a source. The filtering process must be fast and be embedded in an interactive visualization tool in order to support fast inspection of a large number of sources. To achieve such interactive exploration, we implemented a multi-core CPU (OpenMP) and a GPU (OpenGL) version of this filter in a 3D visualization environment (

\tt{SlicerAstro}

).Comment: 17 pages, 9 figures, 4 tables. Astronomy and Computing, accepte

arXiv.org e-Print Archive

Proceedings - University of Groningen

Crossref

University of Groningen

ARTS repository - University of Groningen

Dissertations of the University of Groningen

Accelerating wavelet-based video coding on graphics hardware using CUDA

Author
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date
Field of study

Crossref

Bitplane image coding with parallel coefficient processing

Author: Auli-Llinas Francesc
Enfedaque Pablo
Moure Juan C.
Sanchez Silva Victor
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016
Field of study

Image coding systems have been traditionally tailored for multiple instruction, multiple data (MIMD) computing. In general, they partition the (transformed) image in codeblocks that can be coded in the cores of MIMD-based processors. Each core executes a sequential flow of instructions to process the coefficients in the codeblock, independently and asynchronously from the others cores. Bitplane coding is a common strategy to code such data. Most of its mechanisms require sequential processing of the coefficients. The last years have seen the upraising of processing accelerators with enhanced computational performance and power efficiency whose architecture is mainly based on the single instruction, multiple data (SIMD) principle. SIMD computing refers to the execution of the same instruction to multiple data in a lockstep synchronous way. Unfortunately, current bitplane coding strategies cannot fully profit from such processors due to inherently sequential coding task. This paper presents bitplane image coding with parallel coefficient (BPC-PaCo) processing, a coding method that can process many coefficients within a codeblock in parallel and synchronously. To this end, the scanning order, the context formation, the probability model, and the arithmetic coder of the coding engine have been re-formulated. The experimental results suggest that the penalization in coding performance of BPC-PaCo with respect to the traditional strategies is almost negligible

CiteSeerX

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Warwick Research Archives Portal Repository

Diposit Digital de Documents de la UAB

Massively parallel non-stationary EEG data processing on GPGPU platforms with Morlet continuous wavelet transform

Author: A Klein
C Tenllado
C Torrence
J Franco
J Nickolls
J Polygiannakis
JP Lachaux
K Gurley
M Fligge
MP SouzaEcher
MT Akhtar
P Kumar
P Pioft
RW Johnson
S Erol
SG Park
TT Wong
WJ Laan van der
X Li
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

Crossref