170 research outputs found

    A metadata-enhanced framework for high performance visual effects

    This thesis is devoted to reducing the interactive latency of image processing computations in visual effects. Film and television graphic artists depend upon low-latency feedback to receive a visual response to changes in effect parameters. We tackle latency with a domain-specific optimising compiler which leverages high-level program metadata to guide key computational and memory hierarchy optimisations. This metadata encodes static and dynamic information about data dependence and patterns of memory access in the algorithms constituting a visual effect (features that are typically difficult to extract through program analysis) and presents it to the compiler in an explicit form. By using domain-specific information as a substitute for program analysis, our compiler is able to target a set of complex source-level optimisations that a vendor compiler does not attempt, before passing the optimised source to the vendor compiler for lower-level optimisation. Three key metadata-supported optimisations are presented. The first is an adaptation of space and schedule optimisation, based upon well-known compositions of the loop fusion and array contraction transformations, to the dynamic working sets and schedules of a runtime-parameterised visual effect. This adaptation sidesteps the costly solution of runtime code generation by specialising static parameters in an offline process and exploiting dynamic metadata to adapt the schedule and contracted working sets at runtime to user-tunable parameters. The second optimisation comprises a set of transformations to generate SIMD ISA-augmented source code. Our approach differs from autovectorisation by using static metadata to identify parallelism, in place of data dependence analysis, and runtime metadata to tune the data layout to user-tunable parameters for optimal aligned memory access. The third optimisation comprises a related set of transformations to generate code for SIMT architectures, such as GPUs. Static dependence metadata is exploited to guide large-scale parallelisation for tens of thousands of in-flight threads. Optimal use of the alignment-sensitive, explicitly managed memory hierarchy is achieved by identifying inter-thread and intra-core data sharing opportunities in memory access metadata. A detailed performance analysis of these optimisations is presented for two industrially developed visual effects. In our evaluation we demonstrate up to 8.1x speed-ups on Intel and AMD multicore CPUs and up to 6.6x speed-ups on NVIDIA GPUs over our best hand-written implementations of these two effects. Programmability is enhanced by automating the generation of SIMD and SIMT implementations from a single programmer-managed scalar representation.
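
The loop fusion and array contraction transformations named in this abstract can be illustrated with a small sketch. It is not taken from the thesis: the two-stage effect (a 3-tap blur followed by a gain), the function names and the 1-D signal are all assumptions chosen for brevity. The unfused version materialises a full-size temporary; the fused version merges the two loops and contracts the temporary to a single scalar.

```python
def blur_then_scale_unfused(src, gain):
    """Two separate loops with a full-size temporary array between them."""
    n = len(src)
    tmp = [0.0] * n                      # working set: one extra array of size n
    for i in range(n):
        left = src[max(i - 1, 0)]        # clamp at the borders
        right = src[min(i + 1, n - 1)]
        tmp[i] = (left + src[i] + right) / 3.0
    out = [0.0] * n
    for i in range(n):
        out[i] = tmp[i] * gain
    return out


def blur_then_scale_fused(src, gain):
    """Loop fusion: one loop. Array contraction: tmp shrinks to a scalar."""
    n = len(src)
    out = [0.0] * n
    for i in range(n):
        left = src[max(i - 1, 0)]
        right = src[min(i + 1, n - 1)]
        t = (left + src[i] + right) / 3.0  # contracted temporary
        out[i] = t * gain
    return out
```

Both functions perform the same arithmetic in the same order, so they return identical results; the fused form simply touches less memory, which is the point of the transformation.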

    A survey of parallel algorithms for fractal image compression

    This paper presents a short survey of the key research work that has been undertaken in the application of parallel algorithms for fractal image compression. The interest in fractal image compression techniques stems from their ability to achieve high compression ratios whilst maintaining a very high quality in the reconstructed image. The main drawback of this compression method is the very high computational cost that is associated with the encoding phase. Consequently, there has been significant interest in exploiting parallel computing architectures in order to speed up this phase, whilst still maintaining the advantageous features of the approach. This paper presents a brief introduction to fractal image compression, including the iterated function system theory upon which it is based, and then reviews the different techniques that have been, and can be, applied in order to parallelize the compression algorithm.
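
To show where the encoding cost comes from, here is a minimal sketch of the usual range/domain block search, written for a 1-D signal rather than an image. It is an illustration under assumptions, not the survey's algorithm: the function names, the block size and the pair-averaging downsample are hypothetical, and the constraint that keeps the scale factor contractive is omitted.

```python
def best_affine_fit(domain, rng):
    """Least-squares s, o minimising sum((s*d + o - r)**2) over the block."""
    n = len(rng)
    sd, sr = sum(domain), sum(rng)
    sdd = sum(d * d for d in domain)
    sdr = sum(d * r for d, r in zip(domain, rng))
    denom = n * sdd - sd * sd
    s = 0.0 if denom == 0 else (n * sdr - sd * sr) / denom
    o = (sr - s * sd) / n
    err = sum((s * d + o - r) ** 2 for d, r in zip(domain, rng))
    return s, o, err


def encode(signal, block=4):
    """For every range block, search all domain blocks for the best match."""
    # Domain blocks are twice as long and are shrunk by averaging pairs.
    domains = []
    for start in range(0, len(signal) - 2 * block + 1, block):
        d = signal[start:start + 2 * block]
        domains.append([(d[2 * k] + d[2 * k + 1]) / 2.0 for k in range(block)])
    code = []
    for r0 in range(0, len(signal) - block + 1, block):
        rng = signal[r0:r0 + block]
        # Exhaustive search: every range block is compared with every domain.
        best = min(
            ((idx,) + best_affine_fit(dom, rng) for idx, dom in enumerate(domains)),
            key=lambda t: t[3],
        )
        code.append(best)  # (domain index, scale, offset, squared error)
    return code
```

Each range block is matched independently of the others, so the outer loop in `encode` is the natural unit to distribute across processors.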

    Fast data-parallel rendering of digital volume images.

    by Song Zou. Year shown on spine: 1997. Thesis (M.Phil.)--Chinese University of Hong Kong, 1995. Includes bibliographical references (leaves 69-[72]).
    Chapter 1. Introduction
    Chapter 2. Related works
        2.1 Spatial domain methods
        2.2 Transformation based methods
        2.3 Parallel Implementation
    Chapter 3. Parallel computation model
        3.1 Introduction
        3.2 Classifications of Parallel Computers
        3.3 The SIMD machine architectures
        3.4 The communication within the parallel processors
        3.5 The parallel display mechanisms
    Chapter 4. Data preparation
        4.1 Introduction
        4.2 Original data layout in the processor array
        4.3 Shading
        4.4 Classification
    Chapter 5. Fast data parallel rotation and resampling algorithms
        5.1 Introduction
        5.2 Affine Transformation
        5.3 Related works
            5.3.1 Resampling in ray tracing
            5.3.2 Direct Rotation
            5.3.3 General resampling approaches
            5.3.4 Rotation by shear
        5.4 The minimum mismatch rotation
        5.5 Load balancing
        5.6 Resampling algorithm
            5.6.1 Nearest neighbor
            5.6.2 Linear Interpolation
            5.6.3 Aitken's Algorithm
            5.6.4 Polynomial resampling in 3D
        5.7 A comparison between the resampling algorithms
            5.7.1 The quality
            5.7.2 Implementation and cost
    Chapter 6. Data reordering using binary swap
        6.1 The sorting algorithm
        6.2 The communication cost
    Chapter 7. Ray composition
        7.1 Introduction
        7.2 Ray Composition by Monte Carlo Method
        7.3 The Associative Color Model
        7.4 Parallel Implementation
        7.5 Discussion and further improvement
    Chapter 8. Conclusion and further work
    Bibliography

    GPU-oriented architecture for an end-to-end image/video codec based on JPEG2000

    Modern image and video compression standards employ computationally intensive algorithms that provide advanced features to the coding system. Current standards often need to be implemented in hardware or using expensive solutions to meet the real-time requirements of some environments. Contrary to this trend, this paper proposes an end-to-end codec architecture running on inexpensive Graphics Processing Units (GPUs) that is based on, though not compatible with, the JPEG2000 international standard for image and video compression. When executed on a commodity Nvidia GPU, it achieves real-time processing of 12K video. The proposed software architecture utilizes four CUDA kernels that minimize memory transfers, use registers instead of shared memory, and employ a double-buffer strategy to optimize the streaming of data. The analysis of throughput indicates that the proposed codec yields results at least 10× superior on average to those achieved with JPEG2000 implementations devised for CPUs, and approximately 4× superior to those achieved with hardwired solutions of the HEVC/H.265 video compression standard.

    Parallelization of fast wavelet transform.

    Shum Yu Hing. Thesis (M.Phil.)--Chinese University of Hong Kong, 1994. Includes bibliographical references (leaves 140-143).
    ABSTRACT
    Chapter 1. INTRODUCTION
        1.1. Fourier Analysis
        1.2. Wavelet Analysis
        1.3. Parallelization
            1.3.1. Data Dependency Analysis
    Chapter 2. LITERATURE SURVEY
        2.1. One Dimensional Fast Wavelet Transform (Discrete)
        2.2. Shared Memory Architecture : Parallel Virtual Machine (PVM)
        2.3. Distributed Memory Architecture : Massively Parallel Machine (DECmpp)
    Chapter 3. THEORY
        3.1. Parallel Processing
            3.1.1. Amdahl's Law
            3.1.2. Quality Factor
        3.2. Parallel Architecture
            3.2.1. Pipelining
            3.2.2. Vector Processors
            3.2.3. Multiprocessor
            3.2.4. Array Processors
            3.2.5. Systolic Array Processing
            3.2.6. Granularity
            3.2.7. Load Balancing & Throughput
        3.3. Parallel Programming
        3.4. Parallel Numerical Algorithm
            3.4.1. Parallelism Within a Statement
            3.4.2. Parallelism Between Statements
    Chapter 4. IMPLEMENTATION
        4.1. Sequential Version
        4.2. Parallel Version
            4.2.1. Matrix Representation of Wavelet Transform
                4.2.1.1. Decomposition
                4.2.1.2. Reconstruction
            4.2.2. Parallel Virtual Machine (PVM)
                4.2.2.1. Parallel Algorithm
                    (a) HOST
                    (b) NODE
                4.2.2.2. Flowcharts
                4.2.2.3. Timing Model Analysis
                4.2.2.4. Quality Factor
                    (a) Decomposition
                    (b) Reconstruction
            4.2.3. Massively Parallel Machine - DECmpp
                4.2.3.1. Parallel Algorithm for ACU & PEs
                4.2.3.2. Flowcharts
                4.2.3.3. Timing Model Analysis
                    (a) Communication Strategy
                    (b) Decomposition
                    (c) Reconstruction
                4.2.3.4. Quality Factor
                    (a) Decomposition
                    (b) Reconstruction
                4.2.3.5. Mapping
    Chapter 5. RESULT
        5.1. Parallel Virtual Machine (PVM)
            5.1.1. Sequential Version
            5.1.2. Parallel Version
        5.2. Massively Parallel Machine - DECmpp
            5.2.1. Sequential Version
            5.2.2. Parallel Version
        5.3. Output File Generated from both machines
    Chapter 6. DISCUSSION
        6.1. Application on real time situation
        6.2. Two dimensional or Multidimensional case
        6.3. Block Algorithm Approach
            6.3.1. Blocked
            6.3.2. Row Wrapped
        6.4. Memory Requirement
        6.5. Signal Size Prediction
            6.5.1. Method A
            6.5.2. Method B
    Chapter 7. CONCLUSION
    Chapter 8. FUTURE MODIFICATION
    REFERENCE
    LISTING
    APPENDIX I - Technical Information of PVM
    APPENDIX II - Technical Information of DECmpp
    APPENDIX III - Some Tips/Guide

    Local Binary Patterns in Focal-Plane Processing. Analysis and Applications

    Feature extraction is the part of pattern recognition in which the sensor data is transformed into a form more suitable for the machine to interpret. The purpose of this step is also to reduce the amount of information passed to the next stages of the system, and to preserve the information essential for discriminating the data into different classes. For instance, in image analysis the actual image intensities are vulnerable to various environmental effects, such as lighting changes, and feature extraction can be used as a means of detecting features that are invariant to certain types of illumination changes. Finally, classification tries to make decisions based on the previously transformed data. The main focus of this thesis is on developing new methods for embedded feature extraction based on local non-parametric image descriptors. Feature analysis is also carried out for the selected image features. Low-level Local Binary Pattern (LBP) based features play a main role in the analysis. In the embedded domain, the pattern recognition system must usually meet strict performance constraints, such as high speed, compact size and low power consumption. The characteristics of the final system can be seen as a trade-off between these metrics, which is largely affected by the decisions made during the implementation phase. The implementation alternatives of LBP-based feature extraction are explored in the embedded domain in the context of focal-plane vision processors. In particular, the thesis demonstrates LBP extraction with the MIPA4k massively parallel focal-plane processor IC. Higher-level processing is also incorporated into this framework, by means of a framework for implementing a single-chip face recognition system. Furthermore, a new method for determining optical flow based on LBPs, designed in particular for the embedded domain, is presented. Inspired by some of the principles observed through the feature analysis of the Local Binary Patterns, an extension to the well-known non-parametric rank transform is proposed, and its performance is evaluated in face recognition experiments with a standard dataset. Finally, an a priori model in which the LBPs are seen as combinations of n-tuples is also presented.
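
For readers unfamiliar with the descriptor, the following sketch computes the basic 3×3 Local Binary Pattern on a small grey-level image held as nested lists. It assumes the plain 8-neighbour form with a `>=` comparison against the centre pixel; the abstract does not say which LBP variant the thesis uses, and the neighbour ordering and function names here are arbitrary choices.

```python
# Clockwise neighbour offsets, starting at the top-left pixel (assumed order).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, r, c):
    """8-bit code: bit k is set when neighbour k is >= the centre pixel."""
    center = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_image(img):
    """LBP codes for all interior pixels (the one-pixel border is skipped)."""
    h, w = len(img), len(img[0])
    return [[lbp_code(img, r, c) for c in range(1, w - 1)] for r in range(1, h - 1)]
```

Because only the ordering of neighbour and centre intensities matters, adding a constant to every pixel leaves the codes unchanged, which is one form of the illumination invariance the abstract mentions. Each pixel's code depends only on its 3×3 neighbourhood, so the computation maps naturally onto per-pixel parallel hardware.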

    Implementing textural features on GPUs for improved real-time pavement distress detection

    The condition of municipal roads has deteriorated considerably in recent years, leading to large-scale pavement distress such as cracks or potholes. In order to enable road maintenance, pavement distress should be detected in a timely manner. However, manual investigation, which is still the most widely applied approach toward pavement assessment, puts maintenance personnel at risk and is time-consuming. During the last decade, several efforts have been made to automatically assess the condition of municipal roads without any human intervention. Vehicles are equipped with sensors and cameras in order to collect data related to pavement distress and record videos of the pavement surface. Yet these data are usually not processed while driving, but are instead recorded and later analyzed off-line. As a result, a vast amount of memory is required to store the data and the available memory may not be sufficient. To reduce the amount of saved data, the authors have previously proposed a graphics processing unit (GPU)-enabled pavement distress detection approach based on the wavelet transform of pavement images. The GPU implementation enables pavement distress detection in real time. Although the method used in the approach provides very good results, it can still be improved by incorporating pavement surface texture characteristics. This paper presents an implementation of textural features on GPUs for pavement distress detection. Textural features are based on gray-tone spatial dependencies in an image and characterize the image texture. To evaluate the computational efficiency of the GPU implementation, performance tests are carried out. The results show that the speedup achieved by implementing the textural features on the GPU is sufficient to enable real-time detection of pavement distress. In addition, classification results obtained by applying the approach to 16,601 pavement images are compared to the results without integrating textural features. The results demonstrate that an improvement of 27% is achieved by incorporating pavement surface texture characteristics.
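
The "gray-tone spatial dependencies" mentioned in this abstract are commonly captured in a co-occurrence matrix. The sketch below is a plain CPU version under assumptions: the abstract does not list which features the paper computes, so contrast and energy are shown only as two common examples, and the function names and the default offset of one pixel to the right are hypothetical.

```python
def glcm(img, levels, dr=0, dc=1):
    """Normalised gray-tone co-occurrence matrix for the offset (dr, dc)."""
    counts = [[0] * levels for _ in range(levels)]
    h, w = len(img), len(img[0])
    total = 0
    for r in range(h):
        for c in range(w):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < h and 0 <= c2 < w:
                counts[img[r][c]][img[r2][c2]] += 1
                total += 1
    if total == 0:  # image too small for this offset
        return [[0.0] * levels for _ in range(levels)]
    return [[v / total for v in row] for row in counts]


def contrast(p):
    """Weights each co-occurrence by the squared gray-level difference."""
    return sum((i - j) ** 2 * v for i, row in enumerate(p) for j, v in enumerate(row))


def energy(p):
    """Sum of squared entries; equals 1.0 for a perfectly uniform image."""
    return sum(v * v for row in p for v in row)
```

Pixel values must be integers in `range(levels)`. The counting loop treats every pixel pair independently, which is the kind of work a GPU implementation can spread across many threads.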