High-Efficient Parallel CAVLC Encoders on Heterogeneous Multicore Architectures
This article presents two highly efficient parallel realizations of context-based adaptive variable length coding (CAVLC) on heterogeneous multicore processors. By optimizing the architecture of the CAVLC encoder, three kinds of dependences are eliminated or weakened: the context-based data dependence, the memory access dependence, and the control dependence. The CAVLC pipeline is divided into three stages (two scans, coding, and lag packing) and is implemented on two typical heterogeneous multicore architectures. One is a block-based SIMD parallel CAVLC encoder on the multicore stream processor STORM. The other is a component-oriented SIMT parallel encoder on a massively parallel GPU architecture. Both exploit rich data-level parallelism. Experimental results show that, compared with the CPU version, a speedup of more than 70 times is obtained on STORM and over 50 times on the GPU. The STORM implementation achieves real-time processing of 1080p at 30 fps, and the GPU-based version satisfies the requirements of real-time 720p encoding. The throughput of the presented CAVLC encoders is more than 10 times higher than that of published software encoders on DSP and multicore platforms.
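As a rough illustration of the kind of per-block work a CAVLC scan stage performs, the sketch below zigzag-scans one 4x4 block of quantized coefficients and derives the statistics that the H.264 coding stage later consumes (total nonzero coefficients, trailing ±1 values, and zeros before the last nonzero coefficient). It is not the authors' STORM or GPU code; the function name `scan_block` and the flat 16-element raster input layout are assumptions made for this example. Because each block is processed independently here, this is the sort of step that lends itself to data-level parallelism.

```python
# Frame-mode zigzag order for a 4x4 block, as raster indices (row * 4 + col).
ZIGZAG_4X4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]


def scan_block(block):
    """Hypothetical helper: block is 16 ints in raster order.

    Returns (total_coeff, trailing_ones, total_zeros) for one 4x4 block.
    """
    coeffs = [block[i] for i in ZIGZAG_4X4]
    nonzero_pos = [i for i, c in enumerate(coeffs) if c != 0]
    total_coeff = len(nonzero_pos)

    # Count up to three consecutive +/-1 values, walking back from the
    # highest-frequency nonzero coefficient.
    trailing_ones = 0
    for i in reversed(nonzero_pos):
        if abs(coeffs[i]) == 1 and trailing_ones < 3:
            trailing_ones += 1
        else:
            break

    # Zeros located before the last nonzero coefficient in scan order.
    total_zeros = (nonzero_pos[-1] + 1 - total_coeff) if nonzero_pos else 0
    return total_coeff, trailing_ones, total_zeros


example = [0, 3, -1, 0,
           0, -1, 1, 0,
           1, 0, 0, 0,
           0, 0, 0, 0]
print(scan_block(example))  # (5, 3, 3)
```

On a SIMD or SIMT target, many such blocks would be scanned at once; the serial Python loop here only shows the per-block logic.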
EWT: Efficient Wavelet-Transformer for Single Image Denoising
Transformer-based image denoising methods have achieved encouraging results
in the past year. However, they must use linear operations to model long-range
dependencies, which greatly increases model inference time and consumes GPU
storage space. Compared with convolutional neural network-based methods,
current Transformer-based image denoising methods cannot achieve a balance
between performance improvement and resource consumption. In this paper, we
propose an Efficient Wavelet Transformer (EWT) for image denoising.
Specifically, we use Discrete Wavelet Transform (DWT) and Inverse Wavelet
Transform (IWT) for downsampling and upsampling, respectively. This method can
fully preserve the image features while reducing the image resolution, thereby
greatly reducing the device resource consumption of the Transformer model.
Furthermore, we propose a novel Dual-stream Feature Extraction Block (DFEB) to
extract image features at different levels, which can further reduce model
inference time and GPU memory usage. Experiments show that our method speeds up
the original Transformer by more than 80%, reduces GPU memory usage by more
than 60%, and achieves excellent denoising results. All code will be public. Comment: 12 pages, 11 figur
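The claim that wavelet downsampling reduces resolution without discarding information can be checked with a small sketch. The example below uses a one-level 2D Haar transform, which is an assumption: the abstract does not say which wavelet EWT uses, and the function names `haar_dwt2` and `haar_idwt2` are made up for illustration. The forward transform turns an H x W image into four H/2 x W/2 sub-bands, and the inverse recovers the input exactly.

```python
import numpy as np


def haar_dwt2(x):
    """One-level 2D Haar DWT: (H, W) array -> four (H/2, W/2) sub-bands.

    H and W are assumed to be even. Sub-band naming conventions vary
    between libraries; the labels here are only for this example.
    """
    a, b = x[0::2, 0::2], x[0::2, 1::2]  # top-left, top-right of each 2x2 patch
    c, d = x[1::2, 0::2], x[1::2, 1::2]  # bottom-left, bottom-right
    ll = (a + b + c + d) / 2  # local average (low-pass in both directions)
    lh = (a - b + c - d) / 2  # differences between columns
    hl = (a + b - c - d) / 2  # differences between rows
    hh = (a - b - c + d) / 2  # diagonal differences
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: four (H/2, W/2) sub-bands -> (H, W) array."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return x


image = np.random.default_rng(0).random((8, 8))
bands = haar_dwt2(image)
restored = haar_idwt2(*bands)
print(bands[0].shape, np.allclose(restored, image))
```

Stacking the four sub-bands along the channel axis gives a tensor with a quarter of the spatial positions, which is why attention over it is cheaper than attention over the full-resolution image.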
A Streaming Multi-GPU Implementation of Image Simulation Algorithms for Scanning Transmission Electron Microscopy
Simulation of atomic resolution image formation in scanning transmission
electron microscopy can require significant computation times using traditional
methods. A recently developed method, termed plane-wave reciprocal-space
interpolated scattering matrix (PRISM), demonstrates potential for significant
acceleration of such simulations with negligible loss of accuracy. Here we
present a software package called Prismatic for parallelized simulation of
image formation in scanning transmission electron microscopy (STEM) using both
the PRISM and multislice methods. By distributing the workload between multiple
CUDA-enabled GPUs and multicore processors, accelerations as high as 1000x for
PRISM and 30x for multislice are achieved relative to traditional multislice
implementations using a single 4-GPU machine. We demonstrate a potentially
important application of Prismatic, using it to compute images for atomic
electron tomography at sufficient speeds to include in the reconstruction
pipeline. Prismatic is freely available both as an open-source CUDA/C++ package
with a graphical user interface and as a Python package, PyPrismatic
3D high definition video coding on a GPU-based heterogeneous system
H.264/MVC is a standard for supporting the sensation of 3D, based on coding from 2 (stereo) to N views. H.264/MVC adopts many coding options inherited from single-view H.264/AVC, and its complexity is even higher, mainly because more views must be processed. In this manuscript, we aim at an efficient parallelization of the most computationally intensive video encoding module for stereo sequences, namely inter prediction, and at its collaborative execution on a heterogeneous platform. The proposal is based on an efficient dynamic load balancing algorithm and on breaking encoding dependencies. Experimental results demonstrate the proposed algorithm's ability to reduce the encoding time for different stereo high definition sequences. Speed-up values of up to 90× were obtained when compared with the reference encoder on the same platform. Moreover, the proposed algorithm also provides a more energy-efficient approach and hence requires less energy than the sequential reference algorith