111 research outputs found

    Enhanced applicability of loop transformations


    Research and developments of distributed video coding

    This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. The recently developed Distributed Video Coding (DVC) is well suited to applications such as wireless/wired video sensor networks and mobile cameras, where traditional video coding standards are not feasible because of the constrained computation available at the encoder. With DVC, the computational burden is moved from the encoder to the decoder, and compression efficiency is achieved via joint decoding at the decoder. The practical realisation of DVC is Wyner-Ziv (WZ) video coding, in which side information available at the decoder is used to perform joint decoding. This joint decoding inevitably results in a very complex decoder. Much current work on WZ video coding emphasises improving coding performance but neglects the large complexity incurred at the decoder, which directly influences the system output. The first stage of this research targets optimising the decoder in pixel domain WZ video coding (PDWZ) while still achieving similar compression performance. More specifically, four issues are addressed: the input block size, side information generation, the side information refinement process and the feedback channel. Transform domain WZ video coding (TDWZ) has distinctly superior performance to PDWZ because spatial redundancy is exploited during encoding. However, since there is no motion estimation at the encoder in WZ video coding, temporal correlation is not exploited at the encoder in current WZ schemes. In the middle stage of this research, the 3D DCT is adopted in TDWZ to remove redundancy in both the spatial and temporal directions and thus provide even higher coding performance. Next, the performance of transform domain Distributed Multiview Video Coding (DMVC) is investigated. In particular, three transform domain DMVC frameworks are studied: transform domain DMVC using TDWZ based on the 2D DCT, transform domain DMVC using TDWZ based on the 3D DCT, and transform domain residual DMVC using TDWZ based on the 3D DCT. One important application of the WZ coding principle is error resilience, and there have been several attempts to apply WZ error-resilient coding to current video coding standards, e.g. H.264/AVC or MPEG-2. The final stage of this research is the design of a WZ error-resilient scheme for a wavelet-based video codec. To balance the trade-off between error resilience and bandwidth consumption, the proposed scheme emphasises protection of the Region of Interest (ROI); bandwidth is used efficiently through the combination of WZ coding and sacrificing the quality of unimportant areas. In summary, this research contributes several advances in WZ video coding: an efficient PDWZ with an optimised decoder; an advanced TDWZ based on the 3D DCT, which is then applied to multiview video coding to realise advanced transform domain DMVC; and an efficient error-resilient scheme for a wavelet video codec, with which the trade-off between bandwidth consumption and error resilience can be better balanced.
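
    As a rough illustration of the 3D DCT idea above, and not the thesis's actual TDWZ codec, the following sketch applies a 3D DCT across a whole group of frames so that temporal as well as spatial redundancy is concentrated into a small number of coefficients. The frame size, group length and quantisation step are arbitrary assumptions, and SciPy's dctn stands in for whatever transform a real codec would implement.

import numpy as np
from scipy.fft import dctn, idctn

def encode_group(frames, q_step=16.0):
    """frames: (T, H, W) array of luma samples for one group of pictures."""
    coeffs = dctn(frames.astype(np.float64), norm='ortho')  # 3D DCT over (t, y, x)
    return np.round(coeffs / q_step)                        # coarse uniform quantisation

def decode_group(q_coeffs, q_step=16.0):
    return idctn(q_coeffs * q_step, norm='ortho')           # inverse 3D DCT

# A static group of pictures: the temporal redundancy collapses into the
# temporal-DC plane, so only a fraction of the coefficients survive.
frames = np.tile(np.random.rand(64, 64) * 255, (8, 1, 1))
q = encode_group(frames)
print("non-zero coefficients:", np.count_nonzero(q), "of", q.size)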

    Energy efficient enabling technologies for semantic video processing on mobile devices

    Semantic object-based processing will play an increasingly important role in future multimedia systems due to the ubiquity of digital multimedia capture/playback technologies and increasing storage capacity. Although the object-based paradigm has many undeniable benefits, numerous technical challenges remain before such applications become pervasive, particularly on computationally constrained mobile devices. A fundamental issue is the ill-posed problem of semantic object segmentation. Furthermore, on battery-powered mobile computing devices, the additional algorithmic complexity of semantic object-based processing compared to conventional video processing is highly undesirable from both a real-time operation and a battery life perspective. This thesis tackles these issues by first constraining the solution space and focusing on the human face as a primary semantic concept of use to users of mobile devices. A novel face detection algorithm is proposed, which from the outset was designed to be amenable to offloading from the host microprocessor to dedicated hardware, thereby providing real-time performance and reducing power consumption. The algorithm uses an Artificial Neural Network (ANN), whose topology and weights are evolved via a genetic algorithm (GA). The computational burden of the ANN evaluation is offloaded to a dedicated hardware accelerator, which is capable of processing any evolved network topology. Efficient arithmetic circuitry, which leverages modified Booth recoding, column compressors and carry-save adders, is adopted throughout the design. To tackle the increased computational costs associated with object tracking or object-based shape encoding, a novel energy-efficient binary motion estimation architecture is proposed. Energy is reduced in the proposed motion estimation architecture by minimising the redundant operations inherent in the binary data. Both architectures are shown to compare favourably with the relevant prior art.
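
    The binary motion estimation mentioned above lends itself to a compact software analogy: when blocks are 1-bit images, the matching cost reduces to an XOR followed by a population count rather than per-pixel absolute differences. The sketch below is an illustrative full-search matcher under that assumption, not the proposed hardware architecture; the block size, search radius and function names are invented.

import numpy as np

def binary_sad(block, candidate):
    """Number of differing pixels between two binary blocks (XOR + popcount)."""
    return int(np.count_nonzero(np.logical_xor(block, candidate)))

def full_search(cur_block, ref, top, left, radius=8):
    """Exhaustive search over a (2*radius+1)^2 window in a binary reference frame."""
    h, w = cur_block.shape
    best = (0, 0, h * w + 1)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if 0 <= y and y + h <= ref.shape[0] and 0 <= x and x + w <= ref.shape[1]:
                cost = binary_sad(cur_block, ref[y:y + h, x:x + w])
                if cost < best[2]:
                    best = (dy, dx, cost)
    return best  # (dy, dx, matching cost)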

    Design-time performance analysis of component-based real-time systems

    In current real-time systems, performance metrics are among the most challenging properties to specify, predict and measure. Performance properties depend on various factors, such as the environmental context, load profile, middleware, operating system, hardware platform and sharing of internal resources. Performance failures and unmet performance requirements cause delays, cost overruns, and even abandonment of projects. In order to avoid these performance-related project failures, the performance properties should be obtained and analyzed already at the early design phase of a project. In this thesis we employ principles of component-based software engineering (CBSE), which enable building software systems from individual components. The advantage of CBSE is that individual components can be modeled, reused and traded. The main objective of this thesis is to develop a method that enables prediction of the performance properties of a system, based on the performance properties of the involved individual components. The prediction method serves rapid prototyping and performance analysis of the architecture or related alternatives, without performing the usual testing and implementation stages. The involved research questions are as follows. How should the behaviour and performance properties of individual components be specified in order to enable automated composition of these properties into an analyzable model of a complete system? How can the models of individual components be synthesized into a model of a complete system in an automated way, such that the resulting system model can be analyzed against the performance properties? The thesis presents a new framework called DeepCompass, which realizes the concept of predictable assembly throughout all phases of the system design. The cornerstones of the framework are the composable models of individual software components and hardware blocks. The models are specified at component development time and shipped in a component package. At the component composition phase, the models of the constituent components are synthesized into an executable system model. Since the thesis focuses on performance properties, we introduce performance-related types of component models, such as behaviour, performance and resource models. The dynamics of the system execution are captured in scenario models. The essential advantage of the introduced models is that, through the behaviour of individual components and the scenario models, the behaviour of the complete system is synthesized in the executable system model. Further simulation-based analysis of the obtained executable system model provides application-specific and system-specific performance property values. To support the performance analysis, we have developed the CARAT software toolkit, which provides and automates the algorithms for model synthesis and simulation. Besides this, the toolkit provides graphical tools for designing alternative architectures and for visualization of the obtained performance properties. We have conducted an empirical case study on the use of scenarios in industry to analyze system performance at the early design phase. It was found that industrial architects make extensive use of scenarios for performance evaluation. Based on the inputs of the architects, we have provided a set of guidelines for the identification and use of performance-critical scenarios.
    At the end of this thesis, we have validated the DeepCompass framework by performing three case studies on performance prediction of real-time systems: an MPEG-4 video decoder, a Car Radio Navigation system and a JPEG application. For each case study, we constructed models of the individual components, defined the SW/HW architecture, and used the CARAT toolkit to synthesize and simulate the executable system model. The simulation provided the predicted performance properties, which we later compared with the actual performance properties of the realized systems. With respect to resource usage properties and average task latencies, the prediction error stayed within 30% of the actual performance. Concerning the peak loads on the processor nodes, the actual values were sometimes three times larger than the predicted values. In conclusion, the framework has proven effective for rapid architecture prototyping and performance analysis of a complete system: in the case studies we spent no more than 4-5 days on average for a complete iteration cycle, including the design of several architecture alternatives. The framework can handle different architectural styles, which makes it widely applicable. A conceptual limitation of the framework is that it assumes that the models of individual components are already available at the design phase.
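
    To make the composition idea concrete, here is a deliberately minimal sketch, not the CARAT toolkit or the DeepCompass models themselves: each component contributes a resource model (worst-case cycles per invocation), a scenario lists how often each component is called in one system run, and a predicted latency follows from summing the cycles and dividing by an assumed clock rate. All component names and numbers below are invented.

from dataclasses import dataclass

@dataclass
class Component:
    name: str
    cycles_per_call: int  # resource model: worst-case cycles for one invocation

def predict_latency_ms(scenario, components, clock_hz):
    """scenario: ordered list of (component name, number of calls) for one run."""
    total_cycles = sum(components[name].cycles_per_call * calls
                       for name, calls in scenario)
    return 1000.0 * total_cycles / clock_hz

components = {
    "parser":   Component("parser",   40_000),
    "decoder":  Component("decoder",  900_000),
    "renderer": Component("renderer", 300_000),
}
scenario = [("parser", 1), ("decoder", 25), ("renderer", 25)]  # e.g. one second of video
print(f"predicted latency: {predict_latency_ms(scenario, components, 200e6):.1f} ms")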

    Parallel implementation of fractal image compression

    Thesis (M.Sc.Eng.)-University of Natal, Durban, 2000. Fractal image compression exploits the piecewise self-similarity present in real images as a form of information redundancy that can be eliminated to achieve compression. The theory, based on Partitioned Iterated Function Systems, is presented. As an alternative to the established JPEG, it provides a similar compression-ratio versus fidelity trade-off. Fractal techniques promise faster decoding and potentially higher fidelity, but the computationally intensive compression process has prevented commercial acceptance. This thesis presents an algorithm mapping the problem onto a parallel processor architecture, with the goal of reducing the encoding time. The experimental work involved implementation of this approach on the Texas Instruments TMS320C80 parallel processor system. Results indicate that the fractal compression process is unusually well suited to parallelism, with speed gains approximately linearly related to the number of processors used. Parallel processing issues such as coherency, management and interfacing are discussed. The code incorporates pipelining and parallelism on all conceptual and practical levels, ensuring that all resources are fully utilised and achieving close to optimal efficiency. The computational intensity was reduced by several means, including conventional classification of image sub-blocks by content, with comparisons across class boundaries prohibited. A faster approach was to perform estimate comparisons between blocks based on pixel-value variance, identifying candidates for the more time-consuming, accurate RMS inter-block comparisons. These techniques, combined with the parallelism, allow compression of 512x512-pixel, 8-bit images in under 20 seconds while maintaining a 30 dB PSNR. This is up to an order of magnitude faster than reported for conventional sequential processor implementations. Fractal-based compression of colour images and video sequences is also considered. The work confirms the potential of fractal compression techniques and demonstrates that a parallel implementation is appropriate for addressing the compression time problem. The processor system used in these investigations is faster than currently available PC platforms, but the relevance lies in the anticipation that future generations of affordable processors will exceed its performance. The advantages of fractal image compression may then become accessible to the average computer user, leading to commercial acceptance.
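
    The variance-based screening described above can be conveyed in a few lines: a cheap comparison of pixel-value variance filters the candidate domain blocks, and only the survivors receive the expensive RMS comparison. This is an illustrative Python sketch, not the TMS320C80 implementation; the variance tolerance is an arbitrary assumption.

import numpy as np

def rms_error(a, b):
    return float(np.sqrt(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)))

def best_match(range_block, domain_blocks, var_tol=50.0):
    """Return the index and RMS error of the best-matching domain block."""
    rv = np.var(range_block)
    best_idx, best_err = -1, float("inf")
    for i, d in enumerate(domain_blocks):
        if abs(np.var(d) - rv) > var_tol:  # cheap estimate comparison on variance
            continue                       # skip unlikely candidates
        err = rms_error(range_block, d)    # time-consuming, accurate comparison
        if err < best_err:
            best_idx, best_err = i, err
    return best_idx, best_err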

    Parallelism and the software-hardware interface in embedded systems

    This thesis by publications addresses issues in the architecture and microarchitecture of next-generation, high-performance streaming Systems-on-Chip by quantifying the most important forms of parallelism in current and emerging embedded system workloads. The work consists of three major research tracks, relating to data-level parallelism, thread-level parallelism and the software-hardware interface, which together reflect the research interests of the author as they have been formed over the last nine years. Published works confirm that parallelism at the data level is widely accepted as the most important performance leverage for the efficient execution of embedded media and telecom applications and has been exploited via a number of approaches, the most efficient being vector/SIMD architectures. A further, complementary and substantial form of parallelism exists at the thread level, but this has not been researched to the same extent in the context of embedded workloads. For the efficient execution of such applications, exploitation of both forms of parallelism is of paramount importance. This calls for a new architectural approach in the software-hardware interface, as its rigidity, manifested in all desktop-based and the majority of embedded CPUs, directly affects the performance of vectorized, threaded codes. The author advocates a holistic, mature approach where parallelism is extracted via automatic means while, at the same time, the traditionally rigid hardware-software interface is optimized to match the temporal and spatial behaviour of the embedded workload. This ultimate goal calls for the precise study of these forms of parallelism for a number of applications executing on theoretical models such as instruction set simulators and parallel RAM machines, as well as the development of highly parametric microarchitectural frameworks to encapsulate that functionality.
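
    As a hedged illustration of the two forms of parallelism discussed above, and not code from the thesis, the same frame-brightening workload can be expressed with data-level parallelism (one vectorised operation over a whole frame) and with thread-level parallelism (independent frames handed to a pool of workers); the frame sizes and worker count are arbitrary.

from concurrent.futures import ThreadPoolExecutor
import numpy as np

def brighten(frame, gain=1.2):
    # Data-level parallelism: one vectorised (SIMD-like) operation per frame.
    return np.clip(frame * gain, 0, 255).astype(np.uint8)

def brighten_stream(frames, workers=4):
    # Thread-level parallelism: frames are independent, so they can be processed concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(brighten, frames))

frames = [np.random.randint(0, 256, (480, 640), dtype=np.uint8) for _ in range(8)]
out = brighten_stream(frames)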

    Power aware data and memory management for dynamic applications

    In recent years, the semiconductor industry has turned its focus towards heterogeneous multiprocessor platforms. They are an economically viable solution for coping with the growing setup and manufacturing cost of silicon systems. Furthermore, their inherent flexibility perfectly supports the emerging market of interactive, mobile data and content services. The platform’s performance and energy consumption depend largely on how well the data-dominated services are mapped onto the memory subsystem. A crucial aspect is how efficiently data is transferred between the different memory layers. Several compilation techniques have been developed to make optimal use of the available bandwidth. Unfortunately, they do not take the interaction between multiple threads into account and do not deal with the dynamic behaviour of these novel applications. The main limitations of current techniques are outlined and an approach for dealing with them is introduced.
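
    A generic sketch of the kind of transfer efficiency at stake, not the approach proposed above: fetching a block of samples into a small local buffer in one bulk transfer and computing out of that buffer, instead of touching the large background memory for every access. The buffer size and the per-sample work are placeholder assumptions.

import numpy as np

BLOCK = 256  # assumed capacity of the fast local buffer, in samples

def process(samples):
    return samples * 2 + 1  # placeholder per-sample work

def run_blocked(stream):
    """stream: 1-D numpy array living in the large, slow background memory."""
    out = np.empty_like(stream)
    for start in range(0, len(stream), BLOCK):
        local = np.array(stream[start:start + BLOCK])   # one bulk (DMA-like) transfer
        out[start:start + len(local)] = process(local)  # compute from the local copy
    return out

out = run_blocked(np.arange(10_000))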

    Compilation Techniques for High-Performance Embedded Systems with Multiple Processors

    Institute for Computing Systems Architecture. Despite the progress made in developing more advanced compilers for embedded systems, programming of embedded high-performance computing systems based on Digital Signal Processors (DSPs) is still a highly skilled manual task. This is true for single-processor systems, and even more so for embedded systems based on multiple DSPs. Compilers often fail to optimise existing DSP codes written in C due to the employed programming style. Parallelisation is hampered by the complex multiple-address-space memory architecture found in most commercial multi-DSP configurations. This thesis develops an integrated optimisation and parallelisation strategy that can deal with low-level C codes and produces optimised parallel code for a homogeneous multi-DSP architecture with distributed physical memory and multiple logical address spaces. In a first step, low-level programming idioms are identified and recovered. This enables the application of high-level code and data transformations well known in the field of scientific computing. Iterative, feedback-driven search for “good” transformation sequences is investigated. A novel approach to parallelisation based on a unified data and loop transformation framework is presented and evaluated. Performance optimisation is achieved through exploitation of data locality on the one hand, and utilisation of DSP-specific architectural features such as Direct Memory Access (DMA) transfers on the other. The proposed methodology is evaluated against two benchmark suites (DSPstone & UTDSP) and four different high-performance DSPs, one of which is part of a commercial four-processor multi-DSP board also used for evaluation. Experiments confirm the effectiveness of the program recovery techniques as enablers of high-level transformations and automatic parallelisation. Source-to-source transformations of DSP codes yield an average speedup of 2.21 across four different DSP architectures. The parallelisation scheme is, in conjunction with a set of locality optimisations, able to produce linear and even super-linear speedups on a number of relevant DSP kernels and applications.
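
    One of the high-level transformations mentioned above, loop tiling for data locality, can be sketched in a few lines; this is only an illustration of the transformation itself, not the thesis's framework or its DSP code generation. The tiled version walks a small block that fits in fast local memory before moving on, instead of streaming whole rows and columns through it.

import numpy as np

def transpose_naive(a):
    h, w = a.shape
    out = np.empty((w, h), dtype=a.dtype)
    for i in range(h):
        for j in range(w):
            out[j, i] = a[i, j]  # column-wise writes: poor locality on large arrays
    return out

def transpose_tiled(a, tile=32):
    h, w = a.shape
    out = np.empty((w, h), dtype=a.dtype)
    for ii in range(0, h, tile):
        for jj in range(0, w, tile):
            for i in range(ii, min(ii + tile, h)):      # reads and writes now stay
                for j in range(jj, min(jj + tile, w)):  # inside one small tile
                    out[j, i] = a[i, j]
    return out

a = np.arange(256 * 256, dtype=np.float64).reshape(256, 256)
assert np.allclose(transpose_naive(a), transpose_tiled(a))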

    Memory optimisation and architectural exploration of multimedia applications on a network-on-chip

    Thesis digitized by the Direction des bibliothèques de l'Université de Montréal.