9 research outputs found

    Loop transformations for clustered VLIW architectures

    Get PDF
    With increasing demands for performance by embedded systems, especially by digital signal processing (DSP) applications, embedded processors must increase available instructionlevel parallelism (ILP) within significant constraints on power consumption and chip cost. Unfortunately, supporting a large amount of ILP on a processor while maintaining a single register file increases chip cost and potentially decreases overall performance due to increased cycle time. To address this problem, some modern embedded processors partition the register file into multiple low-ported register files, each directly connected with one or more functional units. These functional unit/register file groups are called clusters. Clustered VLIW (very long instruction word) architectures need extra copy operations or delays to transfer values among clusters. To take advantage of clustered architectures, the compiler must expose parallelism for maximal functional-unit utilization, and schedule instructions to reduce intercluster communication overhead. High-level loop transformations offer an excellent opportunity to enhance the abilities of low-level optimizers to generate code for clustered architectures. This dissertation investigates the effects of three loop transformations, i.e., loop fusion, loop unrolling, and unroll-and-jam, on clustered VLIW architectures. The objective is to achieve high performance with low communication overhead. This dissertation discusses the following techniques: Loop Fusion This research examines the impact of loop fusion on clustered architectures. A metric based upon communication costs for guiding loop fusion is developed and tested on DSP benchmarks. Unroll-and-jam and Loop Unrolling A new method that integrates a communication cost model with an integer-optimization problem is developed to determine unroll amounts for loop unrolling and unroll-and-jam automatically for a specific loop on a specific architecture. These techniques have been implemented and tested using DSP benchmarks on simulated, clustered VLIW architectures and a real clustered, embedded processor, the TI TMS320C64X. The results show that the new techniques achieve an average speedup of 1.72-1.89 on five different clustered architectures. These techniques have been implemented and tested using DSP benchmarks on simulated, clustered VLIW architectures and a real clustered, embedded processor, the TI TMS320C64X. The results show that the new techniques achieve an average speedup of 1.72-1.89 on five different clustered architectures

    Optimizing Instruction Scheduling and Register Allocation for Register-File-Connected Clustered VLIW Architectures

    Get PDF
    Clustering has become a common trend in very long instruction words (VLIW) architecture to solve the problem of area, energy consumption, and design complexity. Register-file-connected clustered (RFCC) VLIW architecture uses the mechanism of global register file to accomplish the inter-cluster data communications, thus eliminating the performance and energy consumption penalty caused by explicit inter-cluster data move operations in traditional bus-connected clustered (BCC) VLIW architecture. However, the limit number of access ports to the global register file has become an issue which must be well addressed; otherwise the performance and energy consumption would be harmed. In this paper, we presented compiler optimization techniques for an RFCC VLIW architecture called Lily, which is designed for encryption systems. These techniques aim at optimizing performance and energy consumption for Lily architecture, through appropriate manipulation of the code generation process to maintain a better management of the accesses to the global register file. All the techniques have been implemented and evaluated. The result shows that our techniques can significantly reduce the penalty of performance and energy consumption due to access port limitation of global register file

    Enabling Technologies for Cognitive Optical Networks

    Get PDF

    Fundamentals

    Get PDF
    Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, their summary and clustering, to different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are inspected with respect to resource requirements and how to enhance scalability on diverse computing architectures ranging from embedded systems to large computing clusters

    Fundamentals

    Get PDF
    Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, their summary and clustering, to different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are inspected with respect to resource requirements and how to enhance scalability on diverse computing architectures ranging from embedded systems to large computing clusters

    Vector-thread architecture and implementation

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.Includes bibliographical references (p. 181-186).This thesis proposes vector-thread architectures as a performance-efficient solution for all-purpose computing. The VT architectural paradigm unifies the vector and multithreaded compute models. VT provides the programmer with a control processor and a vector of virtual processors. The control processor can use vector-fetch commands to broadcast instructions to all the VPs or each VP can use thread-fetches to direct its own control flow. A seamless intermixing of the vector and threaded control mechanisms allows a VT architecture to flexibly and compactly encode application parallelism and locality. VT architectures can efficiently exploit a wide variety of loop-level parallelism, including non-vectorizable loops with cross-iteration dependencies or internal control flow. The Scale VT architecture is an instantiation of the vector-thread paradigm designed for low-power and high-performance embedded systems. Scale includes a scalar RISC control processor and a four-lane vector-thread unit that can execute 16 operations per cycle and supports up to 128 simultaneously active virtual processor threads. Scale provides unit-stride and strided-segment vector loads and stores, and it implements cache refill/access decoupling. The Scale memory system includes a four-port, non-blocking, 32-way set-associative, 32 KB cache. A prototype Scale VT processor was implemented in 180 nm technology using an ASIC-style design flow. The chip has 7.1 million transistors and a core area of 16.6 mm2, and it runs at 260 MHz while consuming 0.4-1.1 W. This thesis evaluates Scale using a diverse selection of embedded benchmarks, including example kernels for image processing, audio processing, text and data processing, cryptography, network processing, and wireless communication.(cont.) Larger applications also include a JPEG image encoder and an IEEE 802.11 la wireless transmitter. Scale achieves high performance on a range of different types of codes, generally executing 3-11 compute operations per cycle. Unlike other architectures which improve performance at the expense of increased energy consumption, Scale is generally even more energy efficient than a scalar RISC processor.by Ronny Meir Krashinsky.Ph.D

    Intelligent Sensor Networks

    Get PDF
    In the last decade, wireless or wired sensor networks have attracted much attention. However, most designs target general sensor network issues including protocol stack (routing, MAC, etc.) and security issues. This book focuses on the close integration of sensing, networking, and smart signal processing via machine learning. Based on their world-class research, the authors present the fundamentals of intelligent sensor networks. They cover sensing and sampling, distributed signal processing, and intelligent signal learning. In addition, they present cutting-edge research results from leading experts

    Anales del XIII Congreso Argentino de Ciencias de la Computaci贸n (CACIC)

    Get PDF
    Contenido: Arquitecturas de computadoras Sistemas embebidos Arquitecturas orientadas a servicios (SOA) Redes de comunicaciones Redes heterog茅neas Redes de Avanzada Redes inal谩mbricas Redes m贸viles Redes activas Administraci贸n y monitoreo de redes y servicios Calidad de Servicio (QoS, SLAs) Seguridad inform谩tica y autenticaci贸n, privacidad Infraestructura para firma digital y certificados digitales An谩lisis y detecci贸n de vulnerabilidades Sistemas operativos Sistemas P2P Middleware Infraestructura para grid Servicios de integraci贸n (Web Services o .Net)Red de Universidades con Carreras en Inform谩tica (RedUNCI

    Anales del XIII Congreso Argentino de Ciencias de la Computaci贸n (CACIC)

    Get PDF
    Contenido: Arquitecturas de computadoras Sistemas embebidos Arquitecturas orientadas a servicios (SOA) Redes de comunicaciones Redes heterog茅neas Redes de Avanzada Redes inal谩mbricas Redes m贸viles Redes activas Administraci贸n y monitoreo de redes y servicios Calidad de Servicio (QoS, SLAs) Seguridad inform谩tica y autenticaci贸n, privacidad Infraestructura para firma digital y certificados digitales An谩lisis y detecci贸n de vulnerabilidades Sistemas operativos Sistemas P2P Middleware Infraestructura para grid Servicios de integraci贸n (Web Services o .Net)Red de Universidades con Carreras en Inform谩tica (RedUNCI
    corecore