Neighbor cache prefetching for multimedia image and video processing
Cache performance is strongly influenced by the type of locality embodied in programs. In particular, multimedia programs handling images and videos are characterized by a bidimensional spatial locality, which is not adequately exploited by standard caches. In this paper we propose novel cache prefetching techniques for image data, called neighbor prefetching, which improve the exploitation of bidimensional spatial locality. A performance comparison is provided against other established prefetching techniques on a multimedia workload (MPEG-2 and MPEG-4 decoding, image processing, and visual object segmentation), including a detailed evaluation of both the miss rate and the memory access time. Results show that neighbor prefetching achieves a significant reduction in the time due to delayed memory cycles (more than 97% on MPEG-4, compared with 75% for the second-best technique). This reduction leads to a substantial speedup in the overall memory access time (up to 140% for MPEG-4). Performance has been measured with the PRIMA trace-driven simulator, specifically devised to support cache prefetching.
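The abstract does not give the details of neighbor prefetching, so the following is only a minimal sketch of the general idea, under stated assumptions: a row-major 8-bit image, a fully associative LRU cache, and a prefetch of the cache line directly below the missing one. The names count_misses, block_scan and prefetch_offsets are hypothetical and are not taken from the paper or from the PRIMA simulator.

```python
from collections import OrderedDict


def count_misses(addresses, line_size, capacity_lines, prefetch_offsets=()):
    """Simulate a fully associative LRU cache and return the demand-miss count.

    prefetch_offsets (an assumption of this sketch) lists byte offsets,
    relative to the missing address, whose lines are fetched on every miss.
    """
    cache = OrderedDict()
    misses = 0

    def touch(line):
        # Return True on a hit; on a miss, insert the line and evict the LRU one.
        if line in cache:
            cache.move_to_end(line)
            return True
        cache[line] = True
        if len(cache) > capacity_lines:
            cache.popitem(last=False)
        return False

    for addr in addresses:
        if not touch(addr // line_size):
            misses += 1
            for off in prefetch_offsets:
                neighbor = addr + off
                if neighbor >= 0:
                    touch(neighbor // line_size)
    return misses


def block_scan(width, height, block):
    """Yield byte addresses of a row-major image visited block by block."""
    for by in range(0, height, block):
        for bx in range(0, width, block):
            for y in range(by, by + block):
                for x in range(bx, bx + block):
                    yield y * width + x


WIDTH, HEIGHT, BLOCK, LINE = 64, 64, 8, 8
trace = list(block_scan(WIDTH, HEIGHT, BLOCK))
baseline = count_misses(trace, LINE, capacity_lines=32)
# Prefetch the line one image row below the missing address (vertical neighbor).
with_neighbor = count_misses(trace, LINE, capacity_lines=32,
                             prefetch_offsets=(WIDTH,))
print(baseline, with_neighbor)
```

In this toy trace every second row of each 8x8 block is already cached when it is first touched, so the demand-miss count halves. The figures reported in the abstract come from real multimedia workloads and are not reproduced by this sketch.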
Parallel implementation of fractal image compression
Thesis (M.Sc.Eng.)-University of Natal, Durban, 2000.
Fractal image compression exploits the piecewise self-similarity present in real images
as a form of information redundancy that can be eliminated to achieve compression. This
theory based on Partitioned Iterated Function Systems is presented. As an alternative to the
established JPEG, it provides a similar compression-ratio to fidelity trade-off. Fractal
techniques promise faster decoding and potentially higher fidelity, but the computationally
intensive compression process has prevented commercial acceptance.
This thesis presents an algorithm mapping the problem onto a parallel processor
architecture, with the goal of reducing the encoding time. The experimental work involved
implementation of this approach on the Texas Instruments TMS320C80 parallel processor
system. Results indicate that the fractal compression process is unusually well suited to
parallelism with speed gains approximately linearly related to the number of processors used.
Parallel processing issues such as coherency, management and interfacing are discussed. The
code designed incorporates pipelining and parallelism on all conceptual and practical levels
ensuring that all resources are fully utilised, achieving close to optimal efficiency.
The computational intensity was reduced by several means, including conventional
classification of image sub-blocks by content with comparisons across class boundaries
prohibited. A faster approach adopted was to perform estimate comparisons between blocks
based on pixel value variance, identifying candidates for more time-consuming, accurate
RMS inter-block comparisons. These techniques, combined with the parallelism, allow
compression of 512x512 pixel x 8 bit images in under 20 seconds, while maintaining a 30dB
PSNR. This is up to an order of magnitude faster than reported for conventional sequential
processor implementations. Fractal based compression of colour images and video sequences
is also considered.
The work confirms the potential of fractal compression techniques, and demonstrates
that a parallel implementation is appropriate for addressing the compression time problem.
The processor system used in these investigations is faster than currently available PC
platforms, but the relevance lies in the anticipation that future generations of affordable
processors will exceed its performance. The advantages of fractal image compression may
then be accessible to the average computer user, leading to commercial acceptance.
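The variance-based pre-selection described in this abstract can be sketched as follows. This is an illustration under assumptions, not the thesis code: blocks are flat lists of pixel values, the hypothetical parameter keep sets how many candidates survive the cheap variance test, and the contrast and brightness fitting used in real fractal coders is omitted.

```python
from math import sqrt
from statistics import pvariance


def rms_error(a, b):
    """Root-mean-square difference between two equally sized pixel blocks."""
    return sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))


def best_match(range_block, domain_blocks, keep=2):
    """Return (index, rms) of the best domain block among `keep` candidates.

    Candidates are the domain blocks whose pixel variance is closest to that
    of the range block; only they receive the more expensive RMS comparison.
    """
    target = pvariance(range_block)
    ranked = sorted(
        range(len(domain_blocks)),
        key=lambda i: abs(pvariance(domain_blocks[i]) - target),
    )
    candidates = ranked[:keep]
    return min(
        ((i, rms_error(range_block, domain_blocks[i])) for i in candidates),
        key=lambda pair: pair[1],
    )


range_block = [10, 20, 30, 40]
domains = [
    [0, 0, 0, 0],        # flat block, rejected by the variance test
    [11, 19, 31, 39],
    [100, 0, 100, 0],    # high-variance block, rejected by the variance test
    [12, 22, 32, 42],
]
index, error = best_match(range_block, domains, keep=2)
print(index, error)
```

Only two of the four domain blocks reach the RMS stage here. A larger keep value trades encoding time for a better chance of finding the true best match.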
A framework for efficient execution of matrix computations
Matrix computations lie at the heart of most scientific computational tasks. The solution of linear systems of equations is a very frequent operation in many fields in science, engineering, surveying, physics and others. Other matrix operations occur frequently in many other fields such as pattern recognition and classification, or multimedia applications. Therefore, it is important to perform matrix operations efficiently. The work in this thesis focuses on the efficient execution on commodity processors of matrix operations which arise frequently in different fields.
We study some important operations which appear in the solution of real world problems: some sparse and dense linear algebra codes and a classification algorithm. In particular, we focus our attention on the efficient execution of the following operations: sparse Cholesky factorization; dense matrix multiplication; dense Cholesky factorization; and Nearest Neighbor Classification.
A lot of research has been conducted on the efficient parallelization of numerical algorithms. However, the efficiency of a parallel algorithm depends ultimately on the performance obtained from the computations performed on each node. The work presented in this thesis focuses on the sequential execution on a single processor.
There exist a number of data structures for sparse computations which can be used in order to avoid the storage of and computation on zero elements. We work with a hierarchical data structure known as hypermatrix. A matrix is subdivided recursively an arbitrary number of times. Several pointer matrices are used to store the location of submatrices at each level. The last level consists of data submatrices which are dealt with as dense submatrices. When the block size of these dense submatrices is small, the number of zeros can be greatly reduced. However, the performance obtained from BLAS3 routines drops heavily. Consequently, there is a trade-off in the size of data submatrices used for a sparse Cholesky factorization with the hypermatrix scheme. Our goal is that of reducing the overhead introduced by the unnecessary operation on zeros when a hypermatrix data structure is used to produce a sparse Cholesky factorization. In this work we study several techniques for reducing such overhead in order to obtain high performance.
One of our goals is the creation of codes which work efficiently on different platforms when operating on dense matrices. To obtain high performance, the resources offered by the CPU must be properly utilized. At the same time, the memory hierarchy must be exploited to tolerate increasing memory latencies. To achieve the former, we produce inner kernels which use the CPU very efficiently. To achieve the latter, we investigate nonlinear data layouts. Such data formats can contribute to the effective use of the memory system.
The use of highly optimized inner kernels is of paramount importance for obtaining efficient numerical algorithms. Often, such kernels are created by hand. However, we want to create efficient inner kernels for a variety of processors using a general approach and avoiding hand coding in assembly language. In this work, we present an alternative way to produce efficient kernels automatically, based on a set of simple codes written in a high level language, which can be parameterized at compilation time. The advantage of our method lies in the ability to generate very efficient inner kernels by means of a good compiler. Working on regular codes for small matrices, most of the compilers we used on different platforms created very efficient inner kernels for matrix multiplication. Using the resulting kernels we have been able to produce high performance sparse and dense linear algebra codes on a variety of platforms.
In this work we also show that techniques used in linear algebra codes can be useful in other fields. We present the work we have done in the optimization of the Nearest Neighbor classification, focusing on the speed of the classification process.
Tuning several codes for different problems and machines can become a heavy and unbearable task. For this reason we have developed an environment for development and automatic benchmarking of codes which is presented in this thesis.
As a practical result of this work, we have been able to create efficient codes for several matrix operations on a variety of platforms. Our codes are highly competitive with other state-of-the-art codes for some problems.
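The block-size trade-off of the hypermatrix scheme can be illustrated with a small sketch. It assumes a single level of subdivision (the thesis describes recursive subdivision), a square matrix whose order is a multiple of the block size, and plain Python lists; the function names to_hypermatrix and stored_zeros are hypothetical.

```python
def to_hypermatrix(dense, block):
    """Split a square dense matrix into a pointer matrix of dense submatrices.

    Submatrices that contain only zeros are not stored; their pointer is None.
    Assumes len(dense) is a multiple of `block` (an assumption of this sketch).
    """
    nb = len(dense) // block
    pointers = [[None] * nb for _ in range(nb)]
    for bi in range(nb):
        for bj in range(nb):
            sub = [
                row[bj * block:(bj + 1) * block]
                for row in dense[bi * block:(bi + 1) * block]
            ]
            if any(v != 0 for row in sub for v in row):
                pointers[bi][bj] = sub
    return pointers


def stored_zeros(pointers):
    """Count the zero entries that are still stored inside data submatrices."""
    return sum(
        v == 0
        for pointer_row in pointers
        for sub in pointer_row
        if sub is not None
        for row in sub
        for v in row
    )


A = [
    [4, 0, 0, 0],
    [1, 3, 0, 0],
    [0, 0, 5, 0],
    [0, 0, 2, 6],
]
for block in (4, 2, 1):
    print(block, stored_zeros(to_hypermatrix(A, block)))
```

Smaller data submatrices store fewer explicit zeros, but, as the abstract notes, dense kernels such as BLAS3 routines run less efficiently on small blocks, which is the source of the trade-off.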
Image Processing on High Performance RISC Systems
Baglietto; Maresca, Massimo; Migliardi, Mauro; Zingirian, N.