68 research outputs found

    A Comparative Study of Scheduling Techniques for Multimedia Applications on SIMD Pipelines

    Full text link
    Parallel architectures are essential in order to take advantage of the parallelism inherent in streaming applications. One particular branch of these employ hardware SIMD pipelines. In this paper, we analyse several scheduling techniques, namely ad hoc overlapped execution, modulo scheduling and modulo scheduling with unrolling, all of which aim to efficiently utilize the special architecture design. Our investigation focuses on improving throughput while analysing other metrics that are important for streaming applications, such as register pressure, buffer sizes and code size. Through experiments conducted on several media benchmarks, we present and discuss trade-offs involved when selecting any one of these scheduling techniques.Comment: Presented at DATE Friday Workshop on Heterogeneous Architectures and Design Methods for Embedded Image Systems (HIS 2015) (arXiv:1502.07241

    A General Hardware/Software Co-design Methodology for Embedded Signal Processing and Multimedia Workloads

    Get PDF
    This paper presents a hardware/software co-design methodology for partitioning real-time embedded multimedia applications between software programmable DSPs and hardware based FPGA coprocessors. By following a strict set of guidelines, the input application is partitioned between software executing on a programmable DSP and hardware based FPGA implementation to alleviate computational bottlenecks in modern VLIW style DSP architectures used in embedded systems. This methodology is applied to channel estimation firmware in 3.5G wireless receivers, as well as software based H.263 video decoders. As much as an 11x improvement in runtime performance can be achieved by partitioning performance critical software kernels in these workloads into a hardware based FPGA implementation executing in tandem with the existing host DSP.Nokia Inc.Texas InstrumentsNational Science Foundatio

    An efficient multi-core SIMD implementation for H.264/AVC encoder

    Get PDF
    The optimization process of a H.264/AVC encoder on three different architectures is presented. The architectures are multi- and singlecore and SIMD instruction sets have different vector registers size. The need of code optimization is fundamental when addressing HD resolutions with real-time constraints. The encoder is subdivided in functional modules in order to better understand where the optimization is a key factor and to evaluate in details the performance improvement. Common issues in both partitioning a video encoder into parallel architectures and SIMD optimization are described, and author solutions are presented for all the architectures. Besides showing efficient video encoder implementations, one of the main purposes of this paper is to discuss how the characteristics of different architectures and different set of SIMD instructions can impact on the target application performance. Results about the achieved speedup are provided in order to compare the different implementations and evaluate the more suitable solutions for present and next generation video-coding algorithms

    A Micro Power Hardware Fabric for Embedded Computing

    Get PDF
    Field Programmable Gate Arrays (FPGAs) mitigate many of the problemsencountered with the development of ASICs by offering flexibility, faster time-to-market, and amortized NRE costs, among other benefits. While FPGAs are increasingly being used for complex computational applications such as signal and image processing, networking, and cryptology, they are far from ideal for these tasks due to relatively high power consumption and silicon usage overheads compared to direct ASIC implementation. A reconfigurable device that exhibits ASIC-like power characteristics and FPGA-like costs and tool support is desirable to fill this void. In this research, a parameterized, reconfigurable fabric model named as domain specific fabric (DSF) is developed that exhibits ASIC-like power characteristics for Digital Signal Processing (DSP) style applications. Using this model, the impact of varying different design parameters on power and performance has been studied. Different optimization techniques like local search and simulated annealing are used to determine the appropriate interconnect for a specific set of applications. A design space exploration tool has been developed to automate and generate a tailored architectural instance of the fabric.The fabric has been synthesized on 160 nm cell-based ASIC fabrication process from OKI and 130 nm from IBM. A detailed power-performance analysis has been completed using signal and image processing benchmarks from the MediaBench benchmark suite and elsewhere with comparisons to other hardware and software implementations. The optimized fabric implemented using the 130 nm process yields energy within 3X of a direct ASIC implementation, 330X better than a Virtex-II Pro FPGA and 2016X better than an Intel XScale processor

    Performance evaluation of MPEG-4 Video encoder on Adres

    Get PDF
    Curs 2006-2007Actualment un típic embedded system (ex. telèfon mòbil) requereix alta qualitat per portar a terme tasques com codificar/descodificar a temps real; han de consumir poc energia per funcionar hores o dies utilitzant bateries lleugeres; han de ser el suficientment flexibles per integrar múltiples aplicacions i estàndards en un sol aparell; han de ser dissenyats i verificats en un període de temps curt tot i l’augment de la complexitat. Els dissenyadors lluiten contra aquestes adversitats, que demanen noves innovacions en arquitectures i metodologies de disseny. Coarse-grained reconfigurable architectures (CGRAs) estan emergent com a candidats potencials per superar totes aquestes dificultats. Diferents tipus d’arquitectures han estat presentades en els últims anys. L’alta granularitat redueix molt el retard, l’àrea, el consum i el temps de configuració comparant amb les FPGAs. D’altra banda, en comparació amb els tradicionals processadors coarse-grained programables, els alts recursos computacionals els permet d’assolir un alt nivell de paral•lelisme i eficiència. No obstant, els CGRAs existents no estant sent aplicats principalment per les grans dificultats en la programació per arquitectures complexes. ADRES és una nova CGRA dissenyada per I’Interuniversity Micro-Electronics Center (IMEC). Combina un processador very-long instruction word (VLIW) i un coarse-grained array per tenir dues opcions diferents en un mateix dispositiu físic. Entre els seus avantatges destaquen l’alta qualitat, poca redundància en les comunicacions i la facilitat de programació. Finalment ADRES és un patró enlloc d’una arquitectura concreta. Amb l’ajuda del compilador DRESC (Dynamically Reconfigurable Embedded System Compile), és possible trobar millors arquitectures o arquitectures específiques segons l’aplicació. Aquest treball presenta la implementació d’un codificador MPEG-4 per l’ADRES. Mostra l’evolució del codi per obtenir una bona implementació per una arquitectura donada. També es presenten les característiques principals d’ADRES i el seu compilador (DRESC). Els objectius són de reduir al màxim el nombre de cicles (temps) per implementar el codificador de MPEG-4 i veure les diferents dificultats de treballar en l’entorn ADRES. Els resultats mostren que els cícles es redueixen en un 67% comparant el codi inicial i final en el mode VLIW i un 84% comparant el codi inicial en VLIW i el final en mode CGA.Nowadays, a typical embedded system requires high performance to perform tasks such as video encoding/decoding at run-time. It should consume little energy to work hours or even days using a lightweight battery. It should be flexible enough to integrate multiple applications and standards in one single device. It has to be designed and verified in short time-to-market despite substantially increased complexity. The designers are struggling to meet these huge challenges, which call for innovations of both architectures and design methodology. Coarse-grained reconfigurable architectures (CGRAs) are emerging as potential candidates to meet the above challenges. Many of them were proposed in recent years. This coarse granularity greatly reduces delay, area, power and configuration time compared with FPGAs. On the other hand, compared with traditional "coarse-grained" programmable processors, their massive computational resources enable them to achieve high parallelism and efficiency. However, existing CGRAs have yet been widely adopted mainly because of programming difficulty for such complex architecture. ADRES is a novel CGRA designed by Interuniversity Micro-Electronics Center (IMEC). It tightly couples a very-long instruction word (VLIW) processor and a coarse-grained array by providing two functional views on the same physical resources. It brings advantages such as high performance, low communication overhead and easiness of programming. Finally, ADRES is a template instead of a concrete architecture. With the retargetable compilation support from DRESC (Dynamically Reconfigurable Embedded System Compile), architectural exploration becomes possible to discover better architectures or design domain-specific architectures. In this thesis, a performance of an MPEG-4 encoder in ADRES is presented. The thesis shows the code evolution to obtain a good implementation for a given architecture. The main features of ADRES and its compiler (DRESC) are presented. The objectives are to reduce as much as possible the amount of cycles (time) spent to encode video in MPEG-4 and test different issues working with ADRES environment. The cycles decrease a 67% comparing initial and final code in VLIW and 84% between initial VLIW and CGA mode.Director/a: Moisès Serra i SerraSupervisor: Eric Delfoss

    A Hybrid Hardware/Software Architecture That Combines a 4-wide Very Long Instruction Word Software Processor (VLIW) with Application-specific Super-complex Instruction Set Hardware Functions

    Get PDF
    Application-driven processor designs are becoming increasingly feasible. Today, advances in field-programmable gate array (FPGA) technology are opening the doors to fast and highly-feasible hardware/software co-designed architectures. Over 100,000 FPGA logic array blocks and nearly 100 ASIC multiply-accumulate cores combine with extensible CPU cores to foster the design of configurable, application-driven hybrid processors.This thesis proposes a hardware/software co-designed architecture targeted to an FPGA. The architecture is a very-long instruction-word (VLIW) processor coupled with super-complex instruction set (SuperCISC) hardware co-processors. Results of the VLIW/SuperCISC show performance speedups over a single-issue processor of 9x to 332x, and entire application speedups from 4x to 127x. Contributions of this research include a 4-way VLIW designed from the ground up, a zero-overhead implementation of a hardware/software interface, evaluation of the scalability of shared data stores, examples of application-specific hardware accelerants, a SystemC simulator, and an evaluation of shared memory configurations

    Hybrid multithreading for VLIW processors

    Get PDF
    © ACM, 2009. This is the author's version of the work: http://doi.acm.org/10.1145/1629395.1629403Several multithreading techniques have been proposed to reduce resource underutilization in Very Long Instruction Word (VLIW) processors. Simultaneous MultiThreading (SMT) is a popular technique that improves processor performance by issuing multiple instructions from di erent threads. In VLIW processors, SMT requires extra hardware to merge instructions from di erent threads. The complexity of this hardware increases substantially with the number of threads. On the other hand, techniques like Interleaved MultiThreading (IMT) do not need any merging hardware, and support a larger number of threads at reasonable cost. In this paper, we propose Hybrid MultiThreading (HMT), a technique that at each cycle merges instructions from only a subset of threads. HMT supports a reasonable number of threads with a low merging hardware cost. For instance, it is possible to support 8 hardware threads with a merging hardware for only 2 threads. The experimental results show that using HMT improves the multithreading performance significantly. Further, supporting 8 hardware threads with HMT but using a 4-thread merging hardware achieves a performance similar to merging 8 threads simultaneously with a significantly lower merging hardware cost.Peer ReviewedPostprint (author’s final draft

    Exploring boundaries in game processing

    Get PDF
    • …
    corecore