9 research outputs found

    Energy efficient address assignment through minimized memory row switching


    Neighbor cache prefetching for multimedia image and video processing

    Cache performance is strongly influenced by the type of locality embodied in programs. In particular, multimedia programs handling images and videos are characterized by a bidimensional spatial locality, which is not adequately exploited by standard caches. In this paper we propose novel cache prefetching techniques for image data, called neighbor prefetching, which improve the exploitation of bidimensional spatial locality. A performance comparison is provided against other established prefetching techniques on a multimedia workload (with MPEG-2 and MPEG-4 decoding, image processing, and visual object segmentation), including a detailed evaluation of both the miss rate and the memory access time. Results show that neighbor prefetching achieves a significant reduction in the time due to delayed memory cycles (more than 97% on MPEG-4, compared with 75% for the second-best technique). This reduction leads to a substantial speedup on the overall memory access time (up to 140% for MPEG-4). Performance has been measured with the PRIMA trace-driven simulator, specifically devised to support cache prefetching.

    Parallel implementation of fractal image compression

    Thesis (M.Sc.Eng.)-University of Natal, Durban, 2000. Fractal image compression exploits the piecewise self-similarity present in real images as a form of information redundancy that can be eliminated to achieve compression. The underlying theory, based on Partitioned Iterated Function Systems, is presented. As an alternative to the established JPEG, it provides a similar trade-off between compression ratio and fidelity. Fractal techniques promise faster decoding and potentially higher fidelity, but the computationally intensive compression process has prevented commercial acceptance. This thesis presents an algorithm mapping the problem onto a parallel processor architecture, with the goal of reducing the encoding time. The experimental work involved implementation of this approach on the Texas Instruments TMS320C80 parallel processor system. Results indicate that the fractal compression process is unusually well suited to parallelism, with speed gains approximately linearly related to the number of processors used. Parallel processing issues such as coherency, management and interfacing are discussed. The code designed incorporates pipelining and parallelism on all conceptual and practical levels, ensuring that all resources are fully utilised and achieving close to optimal efficiency. The computational intensity was reduced by several means, including conventional classification of image sub-blocks by content, with comparisons across class boundaries prohibited. A faster approach adopted was to perform estimate comparisons between blocks based on pixel value variance, identifying candidates for more time-consuming, accurate RMS inter-block comparisons. These techniques, combined with the parallelism, allow compression of 512x512-pixel, 8-bit images in under 20 seconds, while maintaining a 30 dB PSNR. This is up to an order of magnitude faster than reported for conventional sequential processor implementations.
Fractal-based compression of colour images and video sequences is also considered. The work confirms the potential of fractal compression techniques, and demonstrates that a parallel implementation is appropriate for addressing the compression time problem. The processor system used in these investigations is faster than currently available PC platforms, but the relevance lies in the anticipation that future generations of affordable processors will exceed its performance. The advantages of fractal image compression may then be accessible to the average computer user, leading to commercial acceptance.
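
The two-stage block comparison mentioned in this abstract (a cheap variance-based estimate followed by an accurate RMS comparison on the surviving candidates) might look roughly like the sketch below. This is an illustration under stated assumptions, not the thesis code: the names `rms_error`, `best_match` and `shortlist` are hypothetical, blocks are flat lists of pixel values of equal size, and the contrast and brightness fitting that a real fractal encoder performs is left out.

```python
import math
from statistics import pvariance


def rms_error(a, b):
    """Root-mean-square difference between two equally sized pixel blocks."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def best_match(range_block, domain_blocks, shortlist=2):
    """Return (index, rms) of the best domain block after a variance prefilter.

    Assumption: closeness in pixel variance is used as the cheap estimate;
    the abstract does not specify the exact selection rule.
    """
    target = pvariance(range_block)
    # Cheap pass: rank candidates by how close their variance is to the target.
    ranked = sorted(
        range(len(domain_blocks)),
        key=lambda i: abs(pvariance(domain_blocks[i]) - target),
    )
    # Expensive pass: exact RMS comparison on the shortlist only.
    scored = [(rms_error(range_block, domain_blocks[i]), i) for i in ranked[:shortlist]]
    err, idx = min(scored)
    return idx, err


range_block = [10, 12, 11, 13]
domain_blocks = [
    [0, 0, 0, 0],      # flat block, variance 0
    [10, 12, 11, 14],  # close in both variance and content
    [0, 100, 0, 100],  # high-variance block, filtered out early
    [20, 22, 21, 23],  # same variance as the range block, but brighter
]
idx, err = best_match(range_block, domain_blocks)
print(idx, err)
```

Only the shortlisted candidates pay for the RMS computation, which is where the saving comes from; the shortlist size trades speed against the risk of discarding the true best match.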

    A framework for efficient execution of matrix computations

    Matrix computations lie at the heart of most scientific computational tasks. The solution of linear systems of equations is a very frequent operation in science, engineering, surveying, physics and other fields. Other matrix operations occur frequently in many other fields such as pattern recognition and classification, or multimedia applications. Therefore, it is important to perform matrix operations efficiently. The work in this thesis focuses on the efficient execution on commodity processors of matrix operations which arise frequently in different fields. We study some important operations which appear in the solution of real-world problems: some sparse and dense linear algebra codes and a classification algorithm. In particular, we focus our attention on the efficient execution of the following operations: sparse Cholesky factorization; dense matrix multiplication; dense Cholesky factorization; and Nearest Neighbor Classification. A lot of research has been conducted on the efficient parallelization of numerical algorithms. However, the efficiency of a parallel algorithm depends ultimately on the performance obtained from the computations performed on each node. The work presented in this thesis focuses on the sequential execution on a single processor. There exist a number of data structures for sparse computations which can be used in order to avoid the storage of and computation on zero elements. We work with a hierarchical data structure known as hypermatrix. A matrix is subdivided recursively an arbitrary number of times. Several pointer matrices are used to store the location of submatrices at each level. The last level consists of data submatrices which are dealt with as dense submatrices. When the block size of these dense submatrices is small, the number of zeros can be greatly reduced. However, the performance obtained from BLAS3 routines drops sharply.
Consequently, there is a trade-off in the size of data submatrices used for a sparse Cholesky factorization with the hypermatrix scheme. Our goal is to reduce the overhead introduced by unnecessary operations on zeros when a hypermatrix data structure is used to produce a sparse Cholesky factorization. In this work we study several techniques for reducing such overhead in order to obtain high performance. One of our goals is the creation of codes which work efficiently on different platforms when operating on dense matrices. To obtain high performance, the resources offered by the CPU must be properly utilized. At the same time, the memory hierarchy must be exploited to tolerate increasing memory latencies. To achieve the former, we produce inner kernels which use the CPU very efficiently. To achieve the latter, we investigate nonlinear data layouts. Such data formats can contribute to the effective use of the memory system. The use of highly optimized inner kernels is of paramount importance for obtaining efficient numerical algorithms. Often, such kernels are created by hand. However, we want to create efficient inner kernels for a variety of processors using a general approach and avoiding hand coding in assembly language. In this work, we present an alternative way to produce efficient kernels automatically, based on a set of simple codes written in a high-level language, which can be parameterized at compilation time. The advantage of our method lies in the ability to generate very efficient inner kernels by means of a good compiler. Working on regular codes for small matrices, most of the compilers we used on different platforms created very efficient inner kernels for matrix multiplication. Using the resulting kernels we have been able to produce high-performance sparse and dense linear algebra codes on a variety of platforms. In this work we also show that techniques used in linear algebra codes can be useful in other fields.
We present the work we have done in the optimization of the Nearest Neighbor classification, focusing on the speed of the classification process. Tuning several codes for different problems and machines can become a heavy and burdensome task. For this reason we have developed an environment for development and automatic benchmarking of codes, which is presented in this thesis. As a practical result of this work, we have been able to create efficient codes for several matrix operations on a variety of platforms. Our codes are highly competitive with other state-of-the-art codes for some problems.
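
The block-size trade-off described for the hypermatrix scheme can be illustrated with a small sketch. It assumes a single pointer level over square data submatrices (the abstract describes an arbitrary number of recursive levels), and the function names `build_hypermatrix` and `stored_zeros` are hypothetical, not taken from the thesis.

```python
def build_hypermatrix(dense, block):
    """Split a square list-of-lists matrix into block x block data submatrices.

    Returns a pointer matrix whose entries are either None (an all-zero
    block that is not stored) or a dense block stored as a list of rows.
    """
    n = len(dense)
    nb = (n + block - 1) // block  # number of blocks per dimension
    pointers = [[None] * nb for _ in range(nb)]
    for bi in range(nb):
        for bj in range(nb):
            rows = [
                row[bj * block:(bj + 1) * block]
                for row in dense[bi * block:(bi + 1) * block]
            ]
            if any(v != 0 for r in rows for v in r):
                pointers[bi][bj] = rows
    return pointers


def stored_zeros(pointers):
    """Count explicit zeros kept inside the stored data submatrices."""
    return sum(
        v == 0
        for prow in pointers
        for blk in prow
        if blk is not None
        for r in blk
        for v in r
    )


# Small sparse lower-triangular example matrix (illustrative values only).
A = [
    [4, 0, 0, 0],
    [1, 5, 0, 0],
    [0, 0, 6, 0],
    [0, 2, 3, 7],
]
for block in (4, 2, 1):
    print(block, stored_zeros(build_hypermatrix(A, block)))
```

Smaller data submatrices store fewer explicit zeros, but each dense kernel call then works on less data, which is the tension with BLAS3 performance that the abstract points out.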

    Image Processing on High Performance RISC Systems

    Baglietto; Maresca, Massimo; Migliardi, Mauro; Zingirian, Nicol

    Image processing on high-performance RISC systems
