    Analysis of runtime re-configuration systems

    In recent years, Programmable Logic Devices (PLDs), and in particular Field Programmable Gate Arrays (FPGAs), have seen a tremendous increase in sales and in embedded-system applications. The main advantage of FPGAs is the flexibility they offer a designer in reconfiguring the hardware, although this flexibility usually incurs an overhead in execution time, data memory and power dissipation. FPGAs provide an ideal template for run-time reconfigurable (RTR) designs. Only recently have RTR-enabling design tools that bypass the traditional synthesis and bitstream generation process for FPGAs become available; JBits is one of them. Run-time reconfiguration of FPGAs permits partial reconfiguration, in which one part of an FPGA is reconfigured while the other part continues executing functional computation. The partial reconfiguration of a function can also be performed earlier than the time at which the function is actually needed; such configuration pre-fetch hides the reconfiguration overhead more effectively. This thesis implements a reconfigurable system and studies the effect of run-time reconfiguration using Verilog and the new Java-based tool JBits. The work also provides pointers to high-level synthesis tools targeting run-time reconfiguration.
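
    The configuration pre-fetch idea summarised above can be illustrated with a small timeline sketch (Python; the task set, reconfiguration delay and single-loader model are illustrative assumptions, not taken from the thesis):

        # Illustrative timeline model: each task must have its partial FPGA
        # configuration loaded before it can execute. Without prefetch, loading
        # starts only when the task is needed; with prefetch, loading overlaps
        # the execution of the previous task and hides the overhead.

        RECONFIG_TIME = 4   # assumed time units to load a partial bitstream
        tasks = [("fir_filter", 10), ("fft", 8), ("crc", 5)]  # (name, exec time)

        def schedule(prefetch: bool):
            t = 0
            plan = []
            config_ready = RECONFIG_TIME  # the first configuration is loaded up front
            for name, exec_time in tasks:
                start = max(t, config_ready)   # wait for free fabric and loaded config
                end = start + exec_time
                # Prefetch starts loading the next configuration while this task
                # runs; without prefetch, loading begins once the task finishes.
                config_ready = (start if prefetch else end) + RECONFIG_TIME
                plan.append((name, start, end))
                t = end
            return plan

        for mode in (False, True):
            plan = schedule(prefetch=mode)
            label = "with prefetch" if mode else "without prefetch"
            print(label, plan, "makespan:", plan[-1][2])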

    CABAC accelerator architectures for video compression in future multimedia: a survey

    The demands for high quality, real-time performance and multi-format video support in consumer multimedia products are ever increasing. In particular, future multimedia systems require efficient video coding algorithms and corresponding adaptive high-performance computational platforms. The H.264/AVC video coding algorithms provide compression efficiency high enough to be used in these systems, and multimedia processors can provide the required adaptability, but the complexity of the algorithms demands more efficient computing platforms. Heterogeneous (re-)configurable systems composed of multimedia processors and hardware accelerators constitute the main part of such platforms. In this paper, we survey hardware accelerator architectures for Context-based Adaptive Binary Arithmetic Coding (CABAC) in the Main and High profiles of H.264/AVC. The purpose of the survey is to deliver critical insight into the proposed solutions and thereby facilitate further research on accelerator architectures, architecture development methods and supporting EDA tools. The architectures are analyzed, classified and compared based on the core hardware acceleration concepts, algorithmic characteristics, video resolution support and performance parameters, and some promising design directions are discussed. The comparative analysis shows that the parallel pipeline accelerator architecture appears to be the most promising.
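
    As background to why CABAC is hard to accelerate, the sketch below (Python) shows the context-based adaptive probability modelling that must be evaluated serially, bin by bin; the context rule and the counting estimator are deliberate simplifications, not the H.264/AVC probability state machine:

        import math

        # Minimal illustration of context-based adaptive binary modelling, the
        # serial per-bin loop that CABAC accelerators try to pipeline. The
        # context rule (value of the previous bin) and the counting estimator
        # are simplified stand-ins for the standard, not the standard itself.

        class AdaptiveBinaryModel:
            """Estimates P(bin = 1) from the bins seen so far in one context."""
            def __init__(self):
                self.ones, self.total = 1, 2   # Laplace-smoothed counts

            def p_one(self) -> float:
                return self.ones / self.total

            def update(self, bin_value: int) -> None:
                self.ones += bin_value
                self.total += 1

        def ideal_code_length(bins, num_contexts=2):
            """Bits an ideal arithmetic coder would spend when each bin is coded
            with the model selected by its left neighbour (the 'context')."""
            models = [AdaptiveBinaryModel() for _ in range(num_contexts)]
            prev, bits = 0, 0.0
            for b in bins:
                model = models[prev]                       # context selection
                p = model.p_one() if b else 1.0 - model.p_one()
                bits += -math.log2(p)                      # entropy-coding bound
                model.update(b)                            # adaptive update (serial)
                prev = b
            return bits

        bins = [0, 0, 0, 1, 1, 1, 1, 0] * 32
        print(f"{ideal_code_length(bins):.1f} bits for {len(bins)} uncompressed bins")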

    Hardware Implementation of a Novel Image Compression Algorithm

    Image-related communications form an increasingly large part of modern communications, bringing the need for efficient and effective compression. Image compression is important for effective storage and transmission of images, and many techniques have been developed in the past, including transform coding, vector quantization and neural networks. In this thesis, a novel compression technique is introduced that uses adaptive rather than fixed transforms for image compression. The proposed technique is similar to Neural Network (NN)-based image compression, and its superiority over other techniques is demonstrated: the proposed algorithm yields higher image quality for a given compression ratio than existing neural network algorithms, and its training is significantly faster than that of NN-based algorithms. The algorithm is also compared to JPEG in terms of Peak Signal-to-Noise Ratio (PSNR) for a given compression ratio and computational complexity, and its advantages over JPEG are presented in this thesis.
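
    Since the comparison above is made in terms of PSNR at a given compression ratio, a short sketch of the standard PSNR computation for 8-bit images may help (generic formula; the random test data is purely illustrative):

        import numpy as np

        def psnr(original: np.ndarray, reconstructed: np.ndarray, peak: float = 255.0) -> float:
            """Peak Signal-to-Noise Ratio in dB for images with the given peak value."""
            diff = original.astype(np.float64) - reconstructed.astype(np.float64)
            mse = np.mean(diff ** 2)
            if mse == 0:
                return float("inf")            # identical images
            return 10.0 * np.log10(peak ** 2 / mse)

        rng = np.random.default_rng(0)
        img = rng.integers(0, 256, size=(64, 64))
        rec = np.clip(img + rng.integers(-3, 4, size=img.shape), 0, 255)  # small coding error
        print(f"PSNR = {psnr(img, rec):.2f} dB")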

    Algorithms for compression of high dynamic range images and video

    Recent advances in sensor and display technologies have brought about High Dynamic Range (HDR) imaging capability. Modern multiple-exposure HDR sensors can achieve a dynamic range of 100-120 dB, and LED and OLED display devices have contrast ratios of 10^5:1 to 10^6:1. Despite these advances, image/video compression algorithms and the associated hardware are still based on Standard Dynamic Range (SDR) technology, i.e. they operate within an effective dynamic range of up to 70 dB for 8-bit gamma-corrected images. Furthermore, the existing infrastructure for content distribution is also designed for SDR, which creates interoperability problems with true HDR capture and display equipment. Current solutions to this problem include tone mapping the HDR content to fit SDR; however, this approach leads to image-quality problems when strong dynamic range compression is applied. Although some HDR-only solutions have been proposed in the literature, they are not interoperable with the current SDR infrastructure and are thus typically used in closed systems. Given these observations, a research gap was identified: the need for efficient algorithms for the compression of still images and video that can store the full dynamic range and colour gamut of HDR images while remaining backward compatible with the existing SDR infrastructure. To improve the usability of the SDR content, it is vital that such algorithms accommodate different tone mapping operators, including those that are spatially non-uniform. In the course of the research presented in this thesis, a novel two-layer CODEC architecture is introduced for both HDR image and video coding. Furthermore, a universal and computationally efficient approximation of the tone mapping operator is developed and presented. It is shown that the use of perceptually uniform colourspaces for the internal representation of pixel data improves the compression efficiency of the algorithms, and that the proposed novel approaches to the compression of the tone mapping operator's metadata improve compression performance for low-bitrate video content. Multiple compression algorithms are designed, implemented and compared, and quality-complexity trade-offs are identified. Finally, practical aspects of implementing the developed algorithms are explored by automating the design space exploration flow and integrating the high-level systems design framework with domain-specific tools for synthesis and simulation of multiprocessor systems. Directions for further work are also presented.
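
    To make the two-layer idea concrete, the sketch below applies a simple global tone mapping operator and reconstructs the HDR signal from the SDR base layer; the Reinhard-style curve is only an example operator, not the approximation developed in the thesis:

        import numpy as np

        def tone_map(luminance: np.ndarray, exposure: float = 1.0) -> np.ndarray:
            """Global operator mapping HDR luminance to [0, 1) (Reinhard-style curve)."""
            scaled = exposure * luminance
            return scaled / (1.0 + scaled)

        def inverse_tone_map(sdr: np.ndarray, exposure: float = 1.0) -> np.ndarray:
            """Inverse used to predict the HDR layer from the SDR base layer."""
            sdr = np.clip(sdr, 0.0, 1.0 - 1e-9)
            return sdr / (exposure * (1.0 - sdr))

        # Luminance samples spanning 100 dB of dynamic range (ratio of 1e5).
        hdr = np.array([1e-2, 1e-1, 1.0, 1e1, 1e2, 1e3])
        sdr = tone_map(hdr)                       # backward-compatible base layer
        residual = hdr - inverse_tone_map(sdr)    # what an enhancement layer would carry
        print("SDR base layer:", np.round(sdr, 4))
        print("max round-trip residual:", np.max(np.abs(residual)))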

    KAVUAKA: a low-power application-specific processor architecture for digital hearing aids

    The power consumption of digital hearing aids is severely restricted by their small physical size, and the hardware resources available for signal processing are limited. However, there is a demand for more processing performance to make future hearing aids more useful and smarter. Future hearing aids should be able to detect, localize, and recognize target speakers in complex acoustic environments to further improve the speech intelligibility of the individual hearing aid user. Computationally intensive algorithms are required for this task. To maintain acceptable battery life, the hearing aid processing architecture must be highly optimized for extremely low power consumption and high processing performance. The integration of application-specific instruction-set processors (ASIPs) into hearing aids enables a wide range of architectural customizations to meet the stringent power consumption and performance requirements. In this thesis, the application-specific hearing aid processor KAVUAKA is presented, which is customized and optimized for state-of-the-art hearing aid algorithms such as speaker localization, noise reduction, beamforming, and speech recognition. Specialized, application-specific instructions are designed and added to the baseline instruction set architecture (ISA). Among the major contributions are a multiply-accumulate (MAC) unit for real- and complex-valued numbers, architectures for power reduction during register accesses, co-processors and a low-latency audio interface. With the proposed MAC architecture, the KAVUAKA processor requires 16% fewer cycles for the computation of a 128-point fast Fourier transform (FFT) compared to related programmable digital signal processors. The power consumption during register file accesses is decreased by 6% to 17% with isolation and bypass techniques. The hardware-induced audio latency is 34% lower compared to related audio interfaces for a frame size of 64 samples. The final hearing aid system-on-chip (SoC) with four KAVUAKA processor cores and ten co-processors is integrated as an application-specific integrated circuit (ASIC) using a 40 nm low-power technology. The die size is 3.6 mm2. Each of the processors and co-processors contains individual customizations and hardware features, with datapath widths varying between 24-bit and 64-bit. The core area of the 64-bit processor configuration is 0.134 mm2. The processors are organized in two clusters that share memory, an audio interface, co-processors and serial interfaces. The average power consumption at a clock speed of 10 MHz is 2.4 mW for the SoC and 0.6 mW for the 64-bit processor. Case studies with four reference hearing aid algorithms are used to present and evaluate the proposed hardware architectures and optimizations. The program code for each processor and co-processor is generated and optimized with evolutionary algorithms for operation merging, instruction scheduling and register allocation. The KAVUAKA processor architecture is compared to related processor architectures in terms of processing performance, average power consumption, and silicon area requirements.
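
    The real- and complex-valued MAC operation highlighted above is the inner-loop workhorse of FFT and beamforming kernels; the sketch below shows what a fused complex MAC computes per iteration (plain Python floats; the actual KAVUAKA datapath widths and rounding are not modelled):

        # One complex multiply-accumulate: acc += a * b, spelled out as the four
        # real multiplies and two additions a fused MAC instruction would perform
        # in a single issue slot.

        def complex_mac(acc, a, b):
            acc_re, acc_im = acc
            a_re, a_im = a
            b_re, b_im = b
            acc_re += a_re * b_re - a_im * b_im
            acc_im += a_re * b_im + a_im * b_re
            return acc_re, acc_im

        # FFT butterflies, FIR filtering and beamforming all reduce to this
        # pattern in their inner loops, which is why a dedicated MAC unit pays off.
        acc = (0.0, 0.0)
        samples = [(1.0, 0.5), (0.25, -0.75), (-0.5, 0.125)]
        coeffs = [(0.7071, -0.7071), (1.0, 0.0), (0.0, 1.0)]
        for s, c in zip(samples, coeffs):
            acc = complex_mac(acc, s, c)
        print("accumulated value:", acc)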

    Low-complexity, content-based lossy and lossless image coding

    This doctoral research project aims at designing an improved version of the still-image codec LAR (Locally Adaptive Resolution), in terms of both compression performance and complexity. Several image compression standards have been proposed and are used in multimedia applications, but research continues towards higher coding quality and/or lower coding cost. JPEG was standardized twenty years ago, yet it is still the most widely used compression format today. Although JPEG 2000 offers better coding efficiency, its adoption is limited by a computational cost larger than that of JPEG. In 2008, the JPEG Committee announced a Call for Advanced Image Coding (AIC), which aims to standardize potential technologies going beyond the existing JPEG standards. The LAR codec was proposed as one response to this call. The LAR framework seeks to combine compression efficiency with a content-based representation, and it supports both lossy and lossless coding within the same structure. However, at the beginning of this study, the LAR codec did not implement rate-distortion optimization (RDO), a shortcoming that was detrimental to LAR during the AIC evaluation step. Thus, in this work, the impact of the main parameters of the codec on compression efficiency is first characterized, and RDO models are then constructed to configure the LAR parameters for optimal or near-optimal coding efficiency. Further, based on the RDO models, a "quality constraint" method is introduced to encode an image at a given target MSE/PSNR. The accuracy of the proposed technique, estimated by the ratio between the error variance and the setpoint, is about 10%. In addition, subjective quality measurement is taken into consideration, and the RDO models are applied locally in the image rather than globally; the perceptual quality is improved, with a significant gain measured by the objective quality metric SSIM (structural similarity). Aiming at a low-complexity and efficient image codec, a new coding scheme is also proposed in lossless mode under the LAR framework. In this context, all the coding steps are modified to obtain a better final compression ratio, and a new classification module is introduced to decrease the entropy of the prediction errors. Experiments show that this lossless codec achieves a compression ratio equivalent to that of JPEG 2000, while saving 76% of encoding and decoding time on average.
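
    The RDO and quality-constraint mechanisms described above reduce to selecting, among candidate parameter settings, either the one minimising a Lagrangian cost or the cheapest one meeting a target MSE; the sketch below shows that selection step with invented (rate, MSE) points rather than the real LAR parameter models:

        # Rate-distortion optimisation: pick the parameter setting minimising
        # J = D + lambda * R. The candidate (rate, MSE) points below are invented
        # purely to show the mechanics; real RDO models map LAR parameters to them.

        candidates = {
            # parameter setting: (rate in bits per pixel, distortion as MSE)
            "quantizer=coarse": (0.25, 95.0),
            "quantizer=medium": (0.50, 40.0),
            "quantizer=fine":   (1.00, 12.0),
        }

        def rdo_select(lmbda: float):
            """Rate-constrained choice: minimise D + lambda * R."""
            return min(candidates.items(), key=lambda kv: kv[1][1] + lmbda * kv[1][0])

        def quality_constrained_select(target_mse: float):
            """'Quality constraint' flavour: cheapest setting meeting a target MSE."""
            feasible = [(r, d, name) for name, (r, d) in candidates.items() if d <= target_mse]
            return min(feasible) if feasible else None

        print(rdo_select(lmbda=50.0))            # rate-constrained choice
        print(quality_constrained_select(45.0))  # e.g. a target MSE of 45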

    Discrete Wavelet Transforms

    Discrete wavelet transform (DWT) algorithms have a firm position in signal processing across several areas of research and industry. Because the DWT provides both octave-scale frequency and spatial timing of the analyzed signal, it is constantly applied to ever more advanced problems. The present book, Discrete Wavelet Transforms: Algorithms and Applications, reviews recent progress in DWT algorithms and applications. The book covers a wide range of methods (e.g. lifting, shift invariance, multi-scale analysis) for constructing DWTs, and its chapters are organized into four major parts. Part I describes progress in hardware implementations of DWT algorithms; applications include multitone modulation for ADSL and equalization techniques, a scalable architecture for FPGA implementation, a lifting-based algorithm for VLSI implementation, a comparison between DWT- and FFT-based OFDM, and a modified SPIHT codec. Part II addresses image processing algorithms such as a multiresolution approach for edge detection, low-bit-rate image compression, low-complexity implementation of CQF wavelets and compression of multi-component images. Part III focuses on watermarking DWT algorithms. Finally, Part IV describes shift-invariant DWTs, the DC lossless property, DWT-based analysis and estimation of colored noise, and an application of the wavelet Galerkin method. The chapters consist of both tutorial and highly advanced material; the book is therefore intended as a reference text for graduate students and researchers seeking state-of-the-art knowledge of specific applications.
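
    The lifting construction mentioned above can be demonstrated in a few lines; the sketch below uses the integer Haar (S-) transform, the simplest predict/update lifting pair, rather than any scheme from a particular chapter:

        # One level of the integer Haar wavelet transform written as lifting steps:
        # a predict step (detail d) followed by an update step (approximation s).
        # Integer lifting is exactly reversible, which is why it suits lossless coding.

        def haar_lifting_forward(x):
            assert len(x) % 2 == 0
            d = [x[2*i + 1] - x[2*i] for i in range(len(x) // 2)]    # predict
            s = [x[2*i] + (d[i] >> 1) for i in range(len(x) // 2)]   # update
            return s, d

        def haar_lifting_inverse(s, d):
            x = []
            for si, di in zip(s, d):
                even = si - (di >> 1)    # undo update
                odd = even + di          # undo predict
                x.extend([even, odd])
            return x

        signal = [12, 14, 15, 13, 40, 42, 7, 5]
        s, d = haar_lifting_forward(signal)
        print("approximation:", s, "detail:", d)
        assert haar_lifting_inverse(s, d) == signal  # perfect reconstruction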

    Efficient software development for microprocessor based embedded system.

    Tang Tze Yeung Eric. Thesis submitted in July 2003. Thesis (M.Phil.)--Chinese University of Hong Kong, 2004. Includes bibliographical references (leaves 69-75). Abstracts in English and Chinese.
    Contents: Abstract; Acknowledgment;
    Chapter 1 --- Introduction: 1.1 Embedded System; 1.2 Embedded Processor; 1.3 Embedded System Design (1.3.1 Current Embedded System Design Challenges; 1.3.2 Embedded System Design Trend); 1.4 Efficient Software Development for Microprocessor (1.4.1 Efficient Software Development Methodology); 1.5 Thesis Organization.
    Chapter 2 --- Source Code Optimization: 2.1 Source Code Optimization Strategy; 2.2 Source Code Transformations (2.2.1 Strength Reduction; 2.2.2 Function Inlining; 2.2.3 Table Lookup; 2.2.4 Loop Transformations; 2.2.5 Software Pipelining; 2.2.6 Register Allocation); 2.3 Case Study: Source Code Optimization on the StrongARM (SA1110) Platform (2.3.1 StrongARM architecture; 2.3.2 StrongARM pipeline hazard illustration; 2.3.3 Source Code Optimization on StrongARM; 2.3.4 Instruction Set Optimization of StrongARM); 2.4 Conclusion.
    Chapter 3 --- Float-to-Fixed Optimization: 3.1 Introduction to Fixed-point (3.1.1 Fixed-point representation; 3.1.2 Fixed-point implementation; 3.1.3 Mathematical functions implementation); 3.2 Case Study: Fingerprint Minutiae Extraction Algorithms on the StrongARM platform (3.2.1 Fingerprint Verification Overview; 3.2.2 Fixed-point Implementation of Fingerprint Minutiae Extraction Algorithm; 3.2.3 Experimental Results); 3.3 Conclusion.
    Chapter 4 --- Domain Specific Optimization: 4.1 Case Study: Font Rasterization on the StrongARM platform (4.1.1 Outline Font; 4.1.2 Font Rasterization; 4.1.3 Experiments); 4.2 Conclusion.
    Chapter 5 --- Conclusion.
    Bibliography.
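
    The float-to-fixed conversion covered in Chapter 3 follows the generic Q-format recipe sketched below (Python; the Q15 format and example values are illustrative, not the scaling choices made in the thesis):

        # Generic float-to-fixed conversion in Qm.n format: a real value v is stored
        # as the integer round(v * 2**n), and products need one re-scaling shift.

        FRAC_BITS = 15  # Q15: 1 sign bit, 15 fractional bits (values in [-1, 1))

        def to_fixed(v: float) -> int:
            return int(round(v * (1 << FRAC_BITS)))

        def to_float(q: int) -> float:
            return q / (1 << FRAC_BITS)

        def fixed_mul(a: int, b: int) -> int:
            # The full product carries 2*FRAC_BITS fractional bits; shift back once.
            return (a * b) >> FRAC_BITS

        a, b = 0.6, -0.25
        qa, qb = to_fixed(a), to_fixed(b)
        print("fixed-point product:   ", to_float(fixed_mul(qa, qb)))  # small quantization error
        print("floating-point product:", a * b)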

    High Performance Multiview Video Coding

    Following the standardization of the latest video coding standard, High Efficiency Video Coding (HEVC), in 2013, its multiview extension (MV-HEVC) was published in 2014, bringing around 50% better compression performance for multiview and 3D video compared to multiple independent single-view HEVC encodings. However, the extremely high computational complexity of MV-HEVC demands significant optimization of the encoder. To tackle this problem, this work investigates the possibilities of using modern parallel computing platforms and tools, such as single-instruction-multiple-data (SIMD) instructions, multi-core CPUs, massively parallel GPUs, and computer clusters, to significantly enhance the performance of the multiview video coding (MVC) encoder. These computing tools have very different characteristics, and misusing them may result in poor performance improvement or even degradation. To obtain the best possible encoding performance from modern computing tools, different levels of parallelism inside a typical MVC encoder are identified and analyzed, and novel optimization techniques at various levels of abstraction are proposed: non-aggregation massively parallel motion estimation (ME) and disparity estimation (DE) at the prediction unit (PU) level, fractional and bi-directional ME/DE acceleration through SIMD, quantization parameter (QP)-based early termination for coding tree units (CTUs), optimized resource-scheduled wave-front parallel processing for CTUs, and workload-balanced, cluster-based multiple-view parallel processing. The results show that the proposed parallel optimization techniques significantly improve execution time with insignificant loss of coding efficiency. This, in turn, demonstrates that modern parallel computing platforms, with appropriate platform-specific algorithm design, are valuable tools for improving the performance of computationally intensive applications.
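
    The wave-front parallel processing of CTUs mentioned above exploits the fact that each CTU depends only on its left and top-right neighbours; the sketch below derives the resulting wavefront schedule (generic WPP dependency rule; the grid size is arbitrary and the thesis's resource-scheduling refinements are not modelled):

        # Wavefront parallel processing: CTU (row r, column c) can be encoded once
        # its left neighbour (r, c-1) and top-right neighbour (r-1, c+1) are done.
        # The earliest time step of each CTU forms diagonal "waves"; all CTUs that
        # share a time step can be processed in parallel.

        ROWS, COLS = 4, 6

        def wavefront_schedule(rows: int, cols: int):
            step = [[0] * cols for _ in range(rows)]
            for r in range(rows):
                for c in range(cols):
                    deps = []
                    if c > 0:
                        deps.append(step[r][c - 1])         # left neighbour
                    if r > 0 and c + 1 < cols:
                        deps.append(step[r - 1][c + 1])     # top-right neighbour
                    step[r][c] = (max(deps) + 1) if deps else 0
            return step

        for row in wavefront_schedule(ROWS, COLS):
            print(row)
        # At any time step, at most min(ROWS, (COLS + 1) // 2) CTUs share the same
        # step value, which is the parallelism ceiling of WPP (3 for this 4x6 grid).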

    FDSOI Design using Automated Standard-Cell-Grained Body Biasing

    With the introduction of FDSOI processes at competitive technology nodes, body biasing on an unprecedented scale has become possible. Body biasing influences one of the central transistor characteristics, the threshold voltage. By raising or lowering the threshold voltage by more than 100 mV, the very physics of transistor switching can be manipulated at run time. Furthermore, since body biasing does not change signal levels, it can be applied at a much finer granularity than, e.g., DVFS. The state of the art, however, has focused mainly on combining body biasing with DVFS and has thus ignored granularities that are infeasible for DVFS. This thesis fills this gap by proposing body bias domain partitioning techniques and, for the partitionings thereby generated, algorithms that search for body bias assignments. Several granularities, ranging from entire cores to small groups of standard cells, were examined using two principal approaches: designer-aided, pre-partitioning-based determination of body bias domains, and a fully automated, netlist-based approach introduced here for the first time, called domain candidate exploration. Both approaches operate along the lines of activation and timing of standard cell groups. They were evaluated using the example of a Dynamically Reconfigurable Processor (DRP), a highly efficient category of reconfigurable architectures consisting of an array of processing elements, which offers many opportunities for generalization towards many-core architectures. Finally, the proposed methods were validated by manufacturing a test chip. Extensive simulation runs as well as the test-chip evaluation showed the validity of the proposed methods and indicated substantial improvements in energy efficiency compared to the state of the art. These improvements were accomplished by the fine-grained partitioning of the DRP design: dynamic power is reduced via the supply voltage, since forward body biasing yields higher clock frequencies at a given supply level, while static power consumption is simultaneously reduced in unused parts.
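
    The activation- and timing-driven grouping described above can be caricatured with a toy slack-based partitioner (Python; cell names, slack values and the two thresholds are invented for illustration; the netlist-based domain candidate exploration of the thesis is far more involved):

        # Toy body-bias domain partitioning: group standard cells by timing slack and
        # give timing-critical groups forward body bias (faster, leakier) while
        # slack-rich groups get reverse body bias (slower, lower leakage).

        cells = {
            # cell name: worst timing slack in ns (invented values)
            "alu_u1":   0.02,
            "alu_u2":   0.30,
            "mult_u1": -0.10,   # violating timing, needs forward bias
            "cfg_reg":  0.90,
            "dbg_unit": 1.20,
        }

        def partition(cells, critical_slack=0.1, relaxed_slack=0.8):
            domains = {"forward_bias": [], "zero_bias": [], "reverse_bias": []}
            for name, slack in cells.items():
                if slack < critical_slack:
                    domains["forward_bias"].append(name)   # speed up critical cells
                elif slack > relaxed_slack:
                    domains["reverse_bias"].append(name)   # cut leakage where slack allows
                else:
                    domains["zero_bias"].append(name)
            return domains

        for domain, members in partition(cells).items():
            print(f"{domain}: {members}")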