12 research outputs found

    A DSP Based H.264 Decoder for a Multi-Format IP Set-Top Box

    Get PDF
    In this paper, the implementation of a digital signal processor (DSP) based H.264 decoder for a multi-format set-top box is described. Baseline and main profiles are supported. Using several software optimization techniques, the decoder has been fitted into a low-cost DSP. The decoder alone has been tested in simulation, achieving real-time performance with a 600 MHz system clock. Moreover, it has been integrated in a multi-format IP set-top box allowing the implementation of actual environment tests with excellent results. Finally, the decoder has been ported to a latest generation DSP

    Network-on-Chip Based H.264 Video Decoder on a Field Programmable Gate Array

    Get PDF
    This thesis develops the first fully network-on-chip (NoC) based h.264 video decoder implemented in real hardware on a field programmable gate array (FPGA). This thesis starts with an overview of the h.264 video coding standard and an introduction to the NoC communication paradigm. Following this, a series of processing elements (PEs) are developed which implement the component algorithms making up the h.264 video decoder. These PEs, described primarily in VHDL with some Verilog and C, are then mapped to an NoC which is generated using the CONNECT NoC generation tool. To demonstrate the scalability of the proposed NoC based design, a second NoC based video decoder is implemented on a smaller FPGA using the same PEs on a more compact NoC topology. The performance of both decoders, as well as their component PEs, is evaluated on real hardware. An analysis of the performance results is conducted and recommendations for future work are made based on the results of this analysis. Aside from the development of the proposed decoder, a major contribution of this thesis is the release of all source materials for this design as open source hardware and software. The release of these materials will allow other researchers to more easily replicate this work, as well as create derivative works in the areas of NoC based designs for FPGA, video coding and decoding, and related areas

    재구성형 구조에서의 효율적인 조건실행 기법

    Get PDF
    학위논문 (박사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2013. 8. 최기영.재구성형 구조는 연산량이 많은 프로그램을 내장형 시스템에서 가속시키는 데 적합한 방법 중 하나이다. 이는 일반적으로 많은 연산유닛들과 하나의 컨트롤러로 구성되어 고성능, 유연성, 저전력을 동시에 달성할 수 있도록 해준다. 많은 연산유닛을 바탕으로 한 병렬처리는 응용프로그램의 실행속도를 빠르게 하며, 재구성 기능은 다양한 응용프로그램에의 활용을 가능하게 해준다. 또한, 명령어와 데이터에 대한 스케쥴을 미리 정해놓음으로써 제어구조를 단순화시킬 수 있으며 이는 연산량 대비 전력소모를 최소한으 로 줄여준다. 하지만 응용프로그램이 복잡해짐에 따라 연산량이 많은 부분들에 분기문이 생기게 되었으며 이는 재구성형 구조를 사용함에 있어 큰 위협이 되고 있다. 분기문을 다룰 수 있는 컨트롤러가 하나이기 때문에 컨트롤러에 병목현상이 발생하거나 동시에 서로 다른 제어를 요구하게 되면 해당 프로그램은 가속이 불가능해진다. 조건실행이라는 기술을 사용할 경우 이를 부분적으로 해소할 수 있지만 기존에 개발되어 있는 조건실행 기술들은 재구성형 구조에 성능 및 전력소모 면에서 부정적인 영향을 끼친다. 따라서 본 논문에서는 연산량이 많지만 분기문을 가진 응용프로그램에서 조건실행이 성능과 전력 면에서 어떠한 영향을 미치는지 밝히며 이를 바탕으로 고성능과 저전력을 가진 조건실행 방법을 제안한다. 실험 결과에 따르면 제안한 방식은 기존의 세가지 방식보다 성능과 전력소모를 곱으로 표현한 수치에 있어서 11.9%, 14.7%, 23.8% 만큼의 이득을 보였다. 또한, 제안한 조건실행 방법에 적합한 컴파일 체계도 제안하였다. 제안한 조건실행은 절전모드를 사용함에 따라 전력을 아낄 수 있지만 기존의 컴파일방식으로는 여러 조건문을 병렬적으로 수행하도록 컴파일할 수 없는 문제가 생긴다. 따라서 본 논문에서는 이런 문제를 밝히고 조건문들을 서로 다른 연산유닛에 할당함으로써 문제를 해결하는 방식을 제안하고 있다. 제안한 방식을 사용할 경우 단순하고 직관적인 방법에 비하여 평균적으로 2.21배의 높은 성능을 얻을 수 있었다.Coarse-Grained Reconfigurable Architecture (CGRA) is one of viable solutions in embedded systems to accelerate data-intensive applications. It typically consists of an array of processing elements (PEs) and a centralized controller, which can provide high performance, flexibility, and low power. Parallel array processing reduces execution time of applications, reconfigurability of PEs allows changing its functionality, and simplified control structure with static scheduling for instruction fetching and data communication minimizes power consumption. However, as applications become complex so that data-intensive parts are having control flows in them, CGRAs face a challenge for its effectiveness. Since the entire PEs are controlled by a centralized unit, it is impossible to execute programs having control divergence among PEs. To overcome the problem, we can adopt the technique called predicated execution, which is the unique solution known so far, but conventional predication techniques have a negative impact on both performance and power consumption due to longer instruction words and unnecessary instruction-fetching/decoding/nullifying steps. Thus, this thesis reveals performance and power issues in predicated execution when a CGRA executes both data- and control-intensive applications, which have not been well-addressed yet. Then it proposes high-performance and low-power predication mechanisms. Experiments conducted through gate-level simulation show that the proposed mechanism improves energy-delay product by 11.9%, 14.7%, and 23.8% compared to three conventional techniques. In addition, this thesis also reveals mapping issues when mapping applications on CGRAs using the proposed predication. A power-saving mode introduced into PEs prohibits multiple conditionals from being parallelized if conventional mapping algorithms are used. Thus, this thesis proposes the framework to release this problem by mapping conditionals to different PEs. Experiments show that mapping results from the proposed approach lead to 2.21 times higher performance than those of the naïve approach.Abstract i Chapter 1 Introduction 1 Chapter 2 Background and Related Work 5 2.1 Coarse-Grained Reconfigurable Architecture . . . . . . . . . . . . 5 2.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.2 Target Domain . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.3 Comparison with Other Architectures . . . . . . . . . . . 6 2.1.4 Application Mapping . . . . . . . . . . . . . . . . . . . . . 8 2.1.5 Target CGRA . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Predicated Execution Technique . . . . . . . . . . . . . . . . . . 11 2.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.3 Different Roles in ILP and DLP processors . . . . . . . . 13 2.2.4 Predication Support on CGRAs . . . . . . . . . . . . . . . 14 Chapter 3 Conventional Predicated Execution Techniques 15 3.1 Partial Predication (Partial) . . . . . . . . . . . . . . . . . . . . 16 3.2 Condition-Based Full Predication (CondFull) . . . . . . . . . . 18 Chapter 4 State-Based Full Predication 23 4.1 Previous Approach (PseudoBranch) . . . . . . . . . . . . . . . 24 4.2 Counter-Based Approach (StateFull) . . . . . . . . . . . . . . 25 4.3 Dual-Issue-Single-Execution (DISE) . . . . . . . . . . . . . . . . 28 4.4 Hybrid Predication . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.4.2 StateFull+Partial . . . . . . . . . . . . . . . . . . . . 34 4.4.3 StateFull+Partial+DISE . . . . . . . . . . . . . . . . 35 Chapter 5 Evaluation 39 5.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 5.1.1 Conventional Techniques . . . . . . . . . . . . . . . . . . . 39 5.1.2 Proposed Techniques . . . . . . . . . . . . . . . . . . . . . 40 5.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.3.1 Effect of Predication Mechanism on Power Consumption of a PE . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5.3.2 Quantitative Definitions of short-if and long-if . . . . . . 48 5.3.3 Compilation Strategy in StateFull+Partial . . . . . . 48 5.3.4 Conventional Techniques (Partial, CondFull, and PseudoBranch) vs. Proposed StateFull Technique . . . . . 49 5.3.5 Proposed Hybrid Predication Techniques . . . . . . . . . 53 5.3.6 Putting Together . . . . . . . . . . . . . . . . . . . . . . . 54 5.3.7 Speedup of Applications . . . . . . . . . . . . . . . . . . . 57 Chapter 6 Mapping Framework 61 6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.2 Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 63 6.2.1 Overall Flow . . . . . . . . . . . . . . . . . . . . . . . . . 63 6.2.2 From IR to CDFG . . . . . . . . . . . . . . . . . . . . . . 64 6.2.3 Separation . . . . . . . . . . . . . . . . . . . . . . . . . . 65 6.2.4 CDFG Mapping . . . . . . . . . . . . . . . . . . . . . . . 68 6.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 6.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . 69 6.4.2 Verification of Mapping Framework . . . . . . . . . . . . . 70 6.4.3 Quality of Mapping Results . . . . . . . . . . . . . . . . . 70 Chapter 7 Conclusion 73 7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 7.2 Applicable Scope and Future Work . . . . . . . . . . . . . . . . . 75 Appendix 77 국문초록 93 감사의 글 95Docto

    SIMD based multicore processor for image and video processing

    Get PDF
    制度:新 ; 報告番号:甲3602号 ; 学位の種類:博士(工学) ; 授与年月日:2012/3/15 ; 早大学位記番号:新595

    Image and Video Coding Techniques for Ultra-low Latency

    Get PDF
    The next generation of wireless networks fosters the adoption of latency-critical applications such as XR, connected industry, or autonomous driving. This survey gathers implementation aspects of different image and video coding schemes and discusses their tradeoffs. Standardized video coding technologies such as HEVC or VVC provide a high compression ratio, but their enormous complexity sets the scene for alternative approaches like still image, mezzanine, or texture compression in scenarios with tight resource or latency constraints. Regardless of the coding scheme, we found inter-device memory transfers and the lack of sub-frame coding as limitations of current full-system and software-programmable implementations.publishedVersionPeer reviewe

    Polymorphic Pipeline Array: A Flexible Multicore Accelerator for Mobile Multimedia Applications.

    Full text link
    Mobile computing in the form of smart phones, netbooks, and PDAs has become an integral part of our everyday lives. Moving ahead to the next generation of mobile devices, we believe that multimedia will become a more critical and product-differentiating feature. High definition audio and video as well as 3D graphics provide richer interfaces and compelling capabilities. However, these algorithms also bring different computational challenges than wireless signal processing. Multimedia algorithms are more complex featuring more control flow and variable computational requirements where execution time is not dominated by innermost vector loops. Further, data access is more complex where media applications typically operate on multi-dimensional vectors of data rather than single-dimensional vectors with simple strides. Thus, the design of current mobile platforms requires re-examination to account for these new application domains. In this dissertation, we focus on the design of a programmable, low-power accelerator for multimedia algorithms referred to as a Polymorphic Pipeline Array (PPA). The PPA design is inspired by coarse-grain reconfigurable architectures (CGRAs) that consist of an array of function units interconnected by a mesh style interconnect. The PPA improves upon CGRAs by attacking two major limitations: scalability and acceleration limited to innermost loops. The large number of resources are fully utilized by exploiting both Lne-grain instruction-level and coarse-grain pipeline parallelism, and the acceleration is extended beyond innermost loops to encompass the whole region of applications. Various compiler and architectural optimizations are presented for CGRAs that form the basic building blocks of PPA. Two compiler techniques are presented that systematically construct the schedule with intelligent heuristics. Modulo graph embedding leverages graph embedding technique for scheduling in CGRAs and edgecentric modulo scheduling provides a communication-oriented way to address the scheduling problem. For architectural improvement, a novel control path design is presented that leverages the token network of dataflow machines to reduce the instructionmemory power. The PPA is designed with flexibility and programmability as first-order requirements to enable the hardware to be dynamically customizable to the application. A PPA exploit pipeline parallelism found in streaming applications to create a coarsegrain hardware pipeline to execute streaming media applications.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/64732/1/parkhc_1.pd

    Dynamically Reconfigurable Architectures and Systems for Time-varying Image Constraints (DRASTIC) for Image and Video Compression

    Get PDF
    In the current information booming era, image and video consumption is ubiquitous. The associated image and video coding operations require significant computing resources for both small-scale computing systems as well as over larger network systems. For different scenarios, power, bitrate and image quality can impose significant time-varying constraints. For example, mobile devices (e.g., phones, tablets, laptops, UAVs) come with significant constraints on energy and power. Similarly, computer networks provide time-varying bandwidth that can depend on signal strength (e.g., wireless networks) or network traffic conditions. Alternatively, the users can impose different constraints on image quality based on their interests. Traditional image and video coding systems have focused on rate-distortion optimization. More recently, distortion measures (e.g., PSNR) are being replaced by more sophisticated image quality metrics. However, these systems are based on fixed hardware configurations that provide limited options over power consumption. The use of dynamic partial reconfiguration with Field Programmable Gate Arrays (FPGAs) provides an opportunity to effectively control dynamic power consumption by jointly considering software-hardware configurations. This dissertation extends traditional rate-distortion optimization to rate-quality-power/energy optimization and demonstrates a wide variety of applications in both image and video compression. In each application, a family of Pareto-optimal configurations are developed that allow fine control in the rate-quality-power/energy optimization space. The term Dynamically Reconfiguration Architecture Systems for Time-varying Image Constraints (DRASTIC) is used to describe the derived systems. DRASTIC covers both software-only as well as software-hardware configurations to achieve fine optimization over a set of general modes that include: (i) maximum image quality, (ii) minimum dynamic power/energy, (iii) minimum bitrate, and (iv) typical mode over a set of opposing constraints to guarantee satisfactory performance. In joint software-hardware configurations, DRASTIC provides an effective approach for dynamic power optimization. For software configurations, DRASTIC provides an effective method for energy consumption optimization by controlling processing times. The dissertation provides several applications. First, stochastic methods are given for computing quantization tables that are optimal in the rate-quality space and demonstrated on standard JPEG compression. Second, a DRASTIC implementation of the DCT is used to demonstrate the effectiveness of the approach on motion JPEG. Third, a reconfigurable deblocking filter system is investigated for use in the current H.264/AVC systems. Fourth, the dissertation develops DRASTIC for all 35 intra-prediction modes as well as intra-encoding for the emerging High Efficiency Video Coding standard (HEVC)

    The Telecommunications and Data Acquisition Report

    Get PDF
    Tracking and ground-based navigation; communications, spacecraft-ground; station control and system technology; capabilities for new projects; networks consolidation program; and network sustaining are described

    Running stream-like programs on heterogeneous multi-core systems

    Get PDF
    All major semiconductor companies are now shipping multi-cores. Phones, PCs, laptops, and mobile internet devices will all require software that can make effective use of these cores. Writing high-performance parallel software is difficult, time-consuming and error prone, increasing both time-to-market and cost. Software outlives hardware; it typically takes longer to develop new software than hardware, and legacy software tends to survive for a long time, during which the number of cores per system will increase. Development and maintenance productivity will be improved if parallelism and technical details are managed by the machine, while the programmer reasons about the application as a whole. Parallel software should be written using domain-specific high-level languages or extensions. These languages reveal implicit parallelism, which would be obscured by a sequential language such as C. When memory allocation and program control are managed by the compiler, the program's structure and data layout can be safely and reliably modified by high-level compiler transformations. One important application domain contains so-called stream programs, which are structured as independent kernels interacting only through one-way channels, called streams. Stream programming is not applicable to all programs, but it arises naturally in audio and video encode and decode, 3D graphics, and digital signal processing. This representation enables high-level transformations, including kernel unrolling and kernel fusion. This thesis develops new compiler and run-time techniques for stream programming. The first part of the thesis is concerned with a statically scheduled stream compiler. It introduces a new static partitioning algorithm, which determines which kernels should be fused, in order to balance the loads on the processors and interconnects. A good partitioning algorithm is crucial if the compiler is to produce efficient code. The algorithm also takes account of downstream compiler passes---specifically software pipelining and buffer allocation---and it models the compiler's ability to fuse kernels. The latter is important because the compiler may not be able to fuse arbitrary collections of kernels. This thesis also introduces a static queue sizing algorithm. This algorithm is important when memory is distributed, especially when local stores are small. The algorithm takes account of latencies and variations in computation time, and is constrained by the sizes of the local memories. The second part of this thesis is concerned with dynamic scheduling of stream programs. First, it investigates the performance of known online, non-preemptive, non-clairvoyant dynamic schedulers. Second, it proposes two dynamic schedulers for stream programs. The first is specifically for one-dimensional stream programs. The second is more general: it does not need to be told the stream graph, but it has slightly larger overhead. This thesis also introduces some support tools related to stream programming. StarssCheck is a debugging tool, based on Valgrind, for the StarSs task-parallel programming language. It generates a warning whenever the program's behaviour contradicts a pragma annotation. Such behaviour could otherwise lead to exceptions or race conditions. StreamIt to OmpSs is a tool to convert a streaming program in the StreamIt language into a dynamically scheduled task based program using StarSs.Totes les empreses de semiconductors produeixen actualment multi-cores. Mòbils,PCs, portàtils, i dispositius mòbils d’Internet necessitaran programari quefaci servir eficientment aquests cores. Escriure programari paral·lel d’altrendiment és difícil, laboriós i propens a errors, incrementant tant el tempsde llançament al mercat com el cost. El programari té una vida més llarga queel maquinari; típicament pren més temps desenvolupar nou programi que noumaquinari, i el programari ja existent pot perdurar molt temps, durant el qualel nombre de cores dels sistemes incrementarà. La productivitat dedesenvolupament i manteniment millorarà si el paral·lelisme i els detallstècnics són gestionats per la màquina, mentre el programador raona sobre elconjunt de l’aplicació.El programari paral·lel hauria de ser escrit en llenguatges específics deldomini. Aquests llenguatges extrauen paral·lelisme implícit, el qual és ocultatper un llenguatge seqüencial com C. Quan l’assignació de memòria i lesestructures de control són gestionades pel compilador, l’estructura iorganització de dades del programi poden ser modificades de manera segura ifiable per les transformacions d’alt nivell del compilador.Un dels dominis de l’aplicació importants és el que consta dels programes destream; aquest programes són estructurats com a nuclis independents queinteractuen només a través de canals d’un sol sentit, anomenats streams. Laprogramació de streams no és aplicable a tots els programes, però sorgeix deforma natural en la codificació i descodificació d’àudio i vídeo, gràfics 3D, iprocessament de senyals digitals. Aquesta representació permet transformacionsd’alt nivell, fins i tot descomposició i fusió de nucli.Aquesta tesi desenvolupa noves tècniques de compilació i sistemes en tempsd’execució per a programació de streams. La primera part d’aquesta tesi esfocalitza amb un compilador de streams de planificació estàtica. Presenta unnou algorisme de partició estàtica, que determina quins nuclis han de serfusionats, per tal d’equilibrar la càrrega en els processadors i en lesinterconnexions. Un bon algorisme de particionat és fonamental per tal de queel compilador produeixi codi eficient. L’algorisme també té en compte elspassos de compilació subseqüents---específicament software pipelining il’arranjament de buffers---i modela la capacitat del compilador per fusionarnuclis. Aquesta tesi també presenta un algorisme estàtic de redimensionament de cues.Aquest algorisme és important quan la memòria és distribuïda, especialment quanles memòries locals són petites. L’algorisme té en compte latències ivariacions en els temps de càlcul, i considera el límit imposat per la mida deles memòries locals.La segona part d’aquesta tesi es centralitza en la planificació dinàmica deprogrames de streams. En primer lloc, investiga el rendiment dels planificadorsdinàmics online, non-preemptive i non-clairvoyant. En segon lloc, proposa dosplanificadors dinàmics per programes de stream. El primer és específicament pera programes de streams unidimensionals. El segon és més general: no necessitael graf de streams, però els overheads són una mica més grans.Aquesta tesi també presenta un conjunt d’eines de suport relacionades amb laprogramació de streams. StarssCheck és una eina de depuració, que és basa enValgrind, per StarSs, un llenguatge de programació paral·lela basat en tasques.Aquesta eina genera un avís cada vegada que el comportament del programa estàen contradicció amb una anotació pragma. Aquest comportament d’una altra manerapodria causar excepcions o situacions de competició. StreamIt to OmpSs és unaeina per convertir un programa de streams codificat en el llenguatge StreamIt aun programa de tasques en StarSs planificat de forma dinàmica.Postprint (published version

    Systematic Design Space Exploration of Dynamic Dataflow Programs for Multi-core Platforms

    Get PDF
    The limitations of clock frequency and power dissipation of deep sub-micron CMOS technology have led to the development of massively parallel computing platforms. They consist of dozens or hundreds of processing units and offer a high degree of parallelism. Taking advantage of that parallelism and transforming it into high program performances requires the usage of appropriate parallel programming models and paradigms. Currently, a common practice is to develop parallel applications using methods evolving directly from sequential programming models. However, they lack the abstractions to properly express the concurrency of the processes. An alternative approach is to implement dataflow applications, where the algorithms are described in terms of streams and operators thus their parallelism is directly exposed. Since algorithms are described in an abstract way, they can be easily ported to different types of platforms. Several dataflow models of computation (MoCs) have been formalized so far. They differ in terms of their expressiveness (ability to handle dynamic behavior) and complexity of analysis. So far, most of the research efforts have focused on the simpler cases of static dataflow MoCs, where many analyses are possible at compile-time and several optimization problems are greatly simplified. At the same time, for the most expressive and the most difficult to analyze dynamic dataflow (DDF), there is still a dearth of tools supporting a systematic and automated analysis minimizing the programming efforts of the designer. The objective of this Thesis is to provide a complete framework to analyze, evaluate and refactor DDF applications expressed using the RVC-CAL language. The methodology relies on a systematic design space exploration (DSE) examining different design alternatives in order to optimize the chosen objective function while satisfying the constraints. The research contributions start from a rigorous DSE problem formulation. This provides a basis for the definition of a complete and novel analysis methodology enabling systematic performance improvements of DDF applications. Different stages of the methodology include exploration heuristics, performance estimation and identification of refactoring directions. All of the stages are implemented as appropriate software tools. The contributions are substantiated by several experiments performed with complex dynamic applications on different types of physical platforms
    corecore