613 research outputs found
The Parallel Algorithm for the 2-D Discrete Wavelet Transform
The discrete wavelet transform can be found at the heart of many
image-processing algorithms. Until now, the transform on general-purpose
processors (CPUs) was mostly computed using a separable lifting scheme. As the
lifting scheme consists of a small number of operations, it is preferred for
processing using single-core CPUs. However, considering a parallel processing
using multi-core processors, this scheme is inappropriate due to a large number
of steps. On such architectures, the number of steps corresponds to the number
of points that represent the exchange of data. Consequently, these points often
form a performance bottleneck. Our approach appropriately rearranges
calculations inside the transform, and thereby reduces the number of steps. In
other words, we propose a new scheme that is friendly to parallel environments.
When evaluating on multi-core CPUs, we consistently overcome the original
lifting scheme. The evaluation was performed on 61-core Intel Xeon Phi and
8-core Intel Xeon processors.Comment: accepted for publication at ICGIP 201
Parallel 3D Fast Wavelet Transform comparison on CPUs and GPUs
We present in this paper several implementations of the 3D Fast Wavelet Transform (3D-FWT) on multicore CPUs and manycore GPUs. On the GPU side, we focus on CUDA and OpenCL programming to develop methods for an efficient mapping on manycores. On multicore CPUs, OpenMP and Pthreads are used as counterparts to maximize parallelism, and renowned techniques like tiling and blocking are exploited to optimize the use of memory. We evaluate these proposals and make a comparison between a new Fermi Tesla C2050 and an Intel Core 2 QuadQ6700. Speedups of the CUDA version are the best results, improving the execution times on CPU, ranging from 5.3x to 7.4x for different image sizes, and up to 81 times faster when communications are neglected. Meanwhile, OpenCL obtains solid gains which range from 2x factors on small frame sizes to 3x factors on larger ones
Mengenal pasti tahap pengetahuan pelajar tahun akhir Ijazah Sarjana Muda Kejuruteraan di KUiTTHO dalam bidang keusahawanan dari aspek pengurusan modal
Malaysia ialah sebuah negara membangun di dunia. Dalam proses pembangunan
ini, hasrat negara untuk melahirkan bakal usahawan beijaya tidak boleh dipandang
ringan. Oleh itu, pengetahuan dalam bidang keusahawanan perlu diberi perhatian
dengan sewajarnya; antara aspek utama dalam keusahawanan ialah modal. Pengurusan
modal yang tidak cekap menjadi punca utama kegagalan usahawan. Menyedari hakikat
ini, kajian berkaitan Pengurusan Modal dijalankan ke atas 100 orang pelajar Tahun
Akhir Kejuruteraan di KUiTTHO. Sampel ini dipilih kerana pelajar-pelajar ini akan
menempuhi alam pekeijaan di mana mereka boleh memilih keusahawanan sebagai satu
keijaya. Walau pun mereka bukanlah pelajar dari jurusan perniagaan, namun mereka
mempunyai kemahiran dalam mereka cipta produk yang boleh dikomersialkan. Hasil
dapatan kajian membuktikan bahawa pelajar-pelajar ini berminat dalam bidang
keusahawanan namun masih kurang pengetahuan tentang pengurusan modal
terutamanya dalam menentukan modal permulaan, pengurusan modal keija dan caracara
menentukan pembiayaan kewangan menggunakan kaedah jualan harian. Oleh itu,
satu garis panduan Pengurusan Modal dibina untuk memberi pendedahan kepada
mereka
DCT Implementation on GPU
There has been a great progress in the field of graphics processors. Since, there is no rise in the speed of the normal CPU processors; Designers are coming up with multi-core, parallel processors. Because of their popularity in parallel processing, GPUs are becoming more and more attractive for many applications. With the increasing demand in utilizing GPUs, there is a great need to develop operating systems that handle the GPU to full capacity. GPUs offer a very efficient environment for many image processing applications. This thesis explores the processing power of GPUs for digital image compression using Discrete cosine transform
Using hybrid GPU/CPU kernel splitting to accelerate spherical convolutions
We present a general method for accelerating by more than an order of
magnitude the convolution of pixelated functions on the sphere with a
radially-symmetric kernel. Our method splits the kernel into a compact
real-space component and a compact spherical harmonic space component. These
components can then be convolved in parallel using an inexpensive commodity GPU
and a CPU. We provide models for the computational cost of both real-space and
Fourier space convolutions and an estimate for the approximation error. Using
these models we can determine the optimum split that minimizes the wall clock
time for the convolution while satisfying the desired error bounds. We apply
this technique to the problem of simulating a cosmic microwave background (CMB)
anisotropy sky map at the resolution typical of the high resolution maps
produced by the Planck mission. For the main Planck CMB science channels we
achieve a speedup of over a factor of ten, assuming an acceptable fractional
rms error of order 1.e-5 in the power spectrum of the output map.Comment: 9 pages, 11 figures, 1 table, accepted by Astronomy & Computing w/
minor revisions. arXiv admin note: substantial text overlap with
arXiv:1211.355
Evaluation of GPU/CPU Co-Processing Models for JPEG 2000 Packetization
With the bottom-line goal of increasing the
throughput of a GPU-accelerated JPEG 2000 encoder, this paper
evaluates whether the post-compression rate control and
packetization routines should be carried out on the CPU or on
the GPU. Three co-processing models that differ in how the
workload is split among the CPU and GPU are introduced. Both
routines are discussed and algorithms for executing them in
parallel are presented. Experimental results for compressing a
detail-rich UHD sequence to 4 bits/sample indicate speed-ups of
200x for the rate control and 100x for the packetization
compared to the single-threaded implementation in the
commercial Kakadu library. These two routines executed on the
CPU take 4x as long as all remaining coding steps on the GPU
and therefore present a bottleneck. Even if the CPU bottleneck
could be avoided with multi-threading, it is still beneficial to
execute all coding steps on the GPU as this minimizes the
required device-to-host transfer and thereby speeds up the
critical path from 17.2 fps to 19.5 fps for 4 bits/sample and to
22.4 fps for 0.16 bits/sample
Computational Physics on Graphics Processing Units
The use of graphics processing units for scientific computations is an
emerging strategy that can significantly speed up various different algorithms.
In this review, we discuss advances made in the field of computational physics,
focusing on classical molecular dynamics, and on quantum simulations for
electronic structure calculations using the density functional theory, wave
function techniques, and quantum field theory.Comment: Proceedings of the 11th International Conference, PARA 2012,
Helsinki, Finland, June 10-13, 201
- …