Search CORE

187 research outputs found

Compiler optimization and ordering effects on VLIW code compression

Author: Ros Montserrat
Sutton Peter
Publication venue: 'Sociological Research Online'
Publication date: 01/01/2003
Field of study

Code size has always been an important issue for all embedded applications as well as larger systems. Code compression techniques have been devised as a way of battling bloated code; however, the impact of VLIW compiler methods and outputs on these compression schemes has not been thoroughly investigated. This paper describes the application of single- and multipleinstruction dictionary methods for code compression to decrease overall code size for the TI TMS320C6xxx DSP family. The compression scheme is applied to benchmarks taken from the Mediabench benchmark suite built with differing compiler optimization parameters. In the single instruction encoding scheme, it was found that compression ratios were not a useful indicator of the best overall code size – the best results (smallest overall code size) were obtained when the compression scheme was applied to sizeoptimized code. In the multiple instruction encoding scheme, changing parallel instruction order was found to only slightly improve compression in unoptimized code and does not affect the code compression when it is applied to builds already optimized for size

Crossref

Research Online

Compiler optimization and ordering effects on VLIW code compression

Author: Montserrat Ros
Peter Sutton
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2004
Field of study

Crossref

Huffman-based Code Compression Techniques for Embedded Systems

Author: Bonny Talal
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2009
Field of study

KITopen

Architectural and Complier Mechanisms for Accelerating Single Thread Applications on Mulitcore Processors.

Author: Zhong Hongtao
Publication venue
Publication date
Field of study

Multicore systems have become the dominant mainstream computing platform. One of the biggest challenges going forward is how to efficiently utilize the ever increasing computational power provided by multicore systems. Applications with large amounts of explicit thread-level parallelism naturally scale performance with the number of cores. However, single-thread applications realize little to no gains from multicore systems. This work investigates architectural and compiler mechanisms to automatically accelerate single thread applications on multicore processors by efficiently exploiting three types of parallelism across multiple cores: instruction level parallelism (ILP), fine-grain thread level parallelism (TLP), and speculative loop level parallelism (LLP). A multicore architecture called Voltron is proposed to exploit different types of parallelism. Voltron can organize the cores for execution in either coupled or decoupled mode. In coupled mode, several in-order cores are coalesced to emulate a wide-issue VLIW processor. In decoupled mode, the cores execute a set of fine-grain communicating threads extracted by the compiler. By executing fine-grain threads in parallel, Voltron provides coarse-grained out-of-order execution capability using in-order cores. Architectural mechanisms for speculative execution of loop iterations are also supported under the decoupled mode. Voltron can dynamically switch between two modes with low overhead to exploit the best form of available parallelism. This dissertation also investigates compiler techniques to exploit different types of parallelism on the proposed architecture. First, this work proposes compiler techniques to manage multiple instruction streams to collectively function as a single logical stream on a conventional VLIW to exploit ILP. Second, this work studies compiler algorithms to extract fine-grain threads. Third, this dissertation proposes a series of systematic compiler transformations and a general code generation framework to expose hidden speculative LLP hindered by register and memory dependences in the code. These transformations collectively remove inter-iteration dependences that are caused by subsets of isolatable instructions, are unwindable, or occur infrequently. Experimental results show that proposed mechanisms can achieve speedups of 1.33 and 1.14 on 4 core machines by exploiting ILP and TLP respectively. The proposed transformations increase the DOALL loop coverage in applications from 27% to 61%, resulting in a speedup of 1.84 on 4 core systems.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/58419/1/hongtaoz_1.pd

Deep Blue Documents at the University of Michigan

Code Generation and Global Optimization Techniques for a Reconfigurable PRAM-NUMA Multicore Architecture

Author
Publication venue: 'Linkoping University Electronic Press'
Publication date
Field of study

Crossref

Implementation And Optimizaton Of Real-time H.264 Baseline Encoder On Tms320dm642 Dsp

Author: Meriç Ender
Publication venue: 'Nara Institute of Science and Technology'
Publication date: 07/04/2015
Field of study

Tez (Yüksek Lisans) -- İstanbul Teknik Üniversitesi, Fen Bilimleri Enstitüsü, 2007Thesis (M.Sc.) -- İstanbul Technical University, Institute of Science and Technology, 2007Günümüzde sayısal video kodlama sayısal gözetim sistemleri, video konferans, mobil uygulamalar ve video yayını gibi bir çok uygulamada zorunlu hale gelmiştir. Uluslararası bir video sıkıştırma standardı olan H.264/MPEG-4 bölüm 10, daha önceki standartlara göre kodlama verimini iyileştirmek amacıyla geliştirilmiştir. Fakat, bu kodlama geliştirmesi beraberinde kodlama karmaşıklığının da artmasına yol açmaktadır. Bu tez çalışmasında Texas Instruments TMS320DM642 sayısal sinyal işleyici üzerinde H.264 temel profil kodlayıcı gerçeklenmiştir. DM642 DSP çekirdeği üzerindeki gerçek zamanlı H.264/AVC kodlayıcı uygulaması hata esnekliği araçları ve çeyrek piksel hareket dengeleme dışında standart tüm H.264/AVC temel profil kodlama araçlarını sunmaktadır. Çeyrek piksel hareket dengelem yerine, tüm parlaklılık ve renklik bileşenleri için tam sayı ve yarım piksel pozisyonlarında hareket kestirim ve dengeleme gerçeklenmiştir. Kullanılan DM642 DSP çekirdeği platformu, 2-seviyeli bellek/önbellek aşama düzenine sahip ve VLIW içeren yüksek performanslı sayısal işlemci olarak tasarlanmıştır. Sunulan H.264 temel kodlayıcı sistemin gerçeklenmesi ve eniyilemesi bu tezin konusudur. Üstelik, algoritma bazlı, mimari ve bellek stratejilerini içeren eniyileme çalışma fazları detaylarıyla açıklanmaktadır. H.264/AVC video kodlayıcının hem geliştirme ortamında hem de DM642 EVM donanım ortamında çalışması doğrulanmıştır. Kısaca, kodlayıcı sisteme giriş olan CIF çözünürlükte sıkıştırılmamış YUV video dizisi H.264 Annex-B dosya biçiminde ve de ekrana video çıktı verilerek sıkıştırılmaktadır. Ek olarak, kodlayıcı çıktısı H.264 referans yazılımla doğruluğu kontrol edilmiş ve uyumluluğu kanıtlanmıştır.Recently, digital video coding is mandatory in many applications such as digital surveillance systems, video conferencing, mobile applications as well as video broadcasts. The H.264/MPEG-4 Part 10, an international video compression standard, is developed for improving the coding efficiency compared to previous standards. However, the coding improvement comes with an increase in coding complexity. In this thesis, an H.264 baseline profile encoder is implemented on Texas Instruments TMS320DM642 digital signal processor. The real-time implementation of the H.264/AVC encoder on DM642 DSP core offers most of the standard H.264/AVC baseline profile coding tools except error resiliency tools and quarter-pel motion estimation. Instead of quarter-pel motion compensation, integer and half pixel position motion estimation and compensation for all luminance and chrominance components are implemented. The target platform, DM64 DSP core, is designed as a high-performance digital media processor with two-level memory/cache hierarchy and VLIW architecture. The subject of the thesis is H.264 baseline encoder system realization and optimization on the target platform. Moreover, the study of optimization phases covering algorithmic, architectural and memory strategies are clarified in details. The H.264/AVC encoder system is verified both to execute on the development workstation and DM642 EVM (Evaluation Module) hardware platform. Briefly, the uncompressed input of a YUV video sequence with CIF resolution to the encoder system is compressed to H.264 Annex-B file format and displayed on screen. Additionally, the encoder output is verified with H.264 reference software and the compliancy is proven.Yüksek LisansM.Sc

Ulusal Üniversitelerarası Açık Erişim Sistemi - İstanbul Teknik Üniversitesi

Improving Energy Efficiency of Application-Specific Instruction-Set Processors

Author: Guzma Vladimír
Publication venue: Tampere University of Technology
Publication date: 01/01/2017
Field of study

Present-day consumer mobile devices seem to challenge the concept of embedded computing by bringing the equivalent of supercomputing power from two decades ago into hand-held devices. This challenge, however, is well met by pushing the boundaries of embedded computing further into areas previously monopolised by Application-Speciﬁc Integrated Circuits (ASICs). Furthermore, in areas traditionally associated with embedded computing, an increase in the complexity of algorithms and applications requires a continuous rise in availability of computing power and energy efﬁciency in order to ﬁt within the same, or smaller, power budget. It is, ultimately, the amount of energy the application execution consumes that dictates the usefulness of a programmable embedded system, in comparison with implementation of an ASIC.This Thesis aimed to explore the energy efﬁciency overheads of Application-Speciﬁc InstructionSet Processors (ASIPs), a class of embedded processors aiming to compete with ASICs. While an ASIC can be designed to provide precise performance and energy efﬁciency required by a speciﬁc application without unnecessary overheads, the cost of design and veriﬁcation, as well as the inability to upgrade or modify, favour more ﬂexible programmable solutions. The ASIP designs can match the computing performance of the ASIC for speciﬁc applications. What is left, therefore, is achieving energy efﬁciency of a similar order of magnitude.In the past, one area of ASIP design that has been identiﬁed as a major consumer of energy is storage of temporal values produced during computation – the Register File (RF), with the associated interconnection network to transport those values between registers and computational Function Units (FUs). In this Thesis, the energy efﬁciency of RF and interconnection network is studied using the Transport Triggered Architectures (TTAs) template. Speciﬁcally, compiler optimisations aiming at reducing the trafﬁc of temporal values between RF and FUs are presented in this Thesis. Bypassing of the temporal value, from the output of the FU which produces it directly in the input ports of the FUs that require it to continue with the computation, saves multiple RF reads. In addition, if all the uses of such a temporal value can be bypassed, the RF write can be eliminated as well. Such optimisations result in a simpliﬁcation of the RF, via a reduction in the actual number of registers present or a reduction in the number of read and write ports in the RF and improved energy efﬁciency. In cases where the limited number of the simultaneous RF reads or writes cause a performance bottleneck, such optimisations result in performance improvements leading to faster execution times, therefore, allowing for execution at lower clock frequencies resulting in additional energy savings.Another area of the ASIP design consuming a signiﬁcant amount of energy is the instruction memory subsystem, which is the artefact required for the programmability of the embedded processor. As this subsystem is not present in ASIC, the energy consumed for storing an application program and reading it from the instruction memories to control processor execution is an overhead that needs to be minimised. In this Thesis, one particular tool to improve the energy efﬁciency of the instruction memory subsystem – instruction buffer – is examined. While not trivially obvious, the presence of buffers for storing loop bodies, or parts of them, results in a reduced number of reads from the instruction memories. As a result, memories can be put to lower power state leading to lower overall energy consumption, pending energy-efﬁcient buffer implementation. Speciﬁcally, an energy-efﬁcient implementation of the instruction buffer is presented in this Thesis, together with analysis tools to identify candidate loops and assess their suitability for storing in the instruction buffer.The studies presented in this Thesis show that the energy overheads associated with the use of embedded processors, in comparison to ad-hoc ASIC solutions, are manageable when carefully considered during the design of an embedded system for a particular application, or application domain. Finally, the methods presented in this Thesis do not restrict the reprogrammability of the embedded system

Trepo - Institutional Repository of Tampere University

Exploring Processor and Memory Architectures for Multimedia

Author: Iranpour Ali
Publication venue
Publication date: 01/01/2012
Field of study

Multimedia has become one of the cornerstones of our 21st century society and, when combined with mobility, has enabled a tremendous evolution of our society. However, joining these two concepts introduces many technical challenges. These range from having sufficient performance for handling multimedia content to having the battery stamina for acceptable mobile usage. When taking a projection of where we are heading, we see these issues becoming ever more challenging by increased mobility as well as advancements in multimedia content, such as introduction of stereoscopic 3D and augmented reality. The increased performance needs for handling multimedia come not only from an ongoing step-up in resolution going from QVGA (320x240) to Full HD (1920x1080) a 27x increase in less than half a decade. On top of this, there is also codec evolution (MPEG-2 to H.264 AVC) that adds to the computational load increase. To meet these performance challenges there has been processing and memory architecture advances (SIMD, out-of-order superscalarity, multicore processing and heterogeneous multilevel memories) in the mobile domain, in conjunction with ever increasing operating frequencies (200MHz to 2GHz) and on-chip memory sizes (128KB to 2-3MB). At the same time there is an increase in requirements for mobility, placing higher demands on battery-powered systems despite the steady increase in battery capacity (500 to 2000mAh). This leaves negative net result in-terms of battery capacity versus performance advances. In order to make optimal use of these architectural advances and to meet the power limitations in mobile systems, there is a need for taking an overall approach on how to best utilize these systems. The right trade-off between performance and power is crucial. On top of these constraints, the flexibility aspects of the system need to be addressed. All this makes it very important to reach the right architectural balance in the system. The first goal for this thesis is to examine multimedia applications and propose a flexible solution that can meet the architectural requirements in a mobile system. Secondly, propose an automated methodology of optimally mapping multimedia data and instructions to a heterogeneous multilevel memory subsystem. The proposed methodology uses constraint programming for solving a multidimensional optimization problem. Results from this work indicate that using today’s most advanced mobile processor technology together with a multi-level heterogeneous on-chip memory subsystem can meet the performance requirements for handling multimedia. By utilizing the automated optimal memory mapping method presented in this thesis lower total power consumption can be achieved, whilst performance for multimedia applications is improved, by employing enhanced memory management. This is achieved through reduced external accesses and better reuse of memory objects. This automatic method shows high accuracy, up to 90%, for predicting multimedia memory accesses for a given architecture

CiteSeerX

Lund University Publications

From Parallel Programs to Customized Parallel Processors

Author: Jääskeläinen Pekka
Publication venue: Tampere University of Technology
Publication date: 01/01/2012
Field of study

The need for fast time to market of new embedded processor-based designs calls for a rapid design methodology of the included processors. The call for such a methodology is even more emphasized in the context of so called soft cores targeted to reconfigurable fabrics where per-design processor customization is commonplace. The C language has been commonly used as an input to hardware/software co-design flows. However, as C is a sequential language, its potential to generate parallel operations to utilize naturally parallel hardware constructs is far from optimal, leading to a customized processor design space with limited parallel resource scalability. In contrast, when utilizing a parallel programming language as an input, a wider processor design space can be explored to produce customized processors with varying degrees of utilized parallelism. This Thesis proposes a novel Multicore Application-Specific Instruction Set Processor (MCASIP) co-design methodology that exploits parallel programming languages as the application input format. In the methodology, the designer can explicitly capture the parallelism of the algorithm and exploit specialized instructions using a parallel programming language in contrast to being on the mercy of the compiler or the hardware to extract the parallelism from a sequential input. The Thesis proposes a multicore processor template based on the Transport Triggered Architecture, compiler techniques involved in static parallelization of computation kernels with barriers and a datapath integrated hardware accelerator for low overhead software synchronization implementation. These contributions enable scaling the customized processors both at the instruction and task levels to efficiently exploit the parallelism in the input program up to the implementation constraints such as the memory bandwidth or the chip area. The different contributions are validated with case studies, comparisons and design examples

Trepo - Institutional Repository of Tampere University