
    An Interconnection Architecture for Seamless Inter and Intra-Chip Communication Using Wireless Links

    As semiconductor technologies continue to scale, more and more cores are being integrated on the same multicore chip. This increase in complexity poses the challenge of efficient data transfer between these cores. Several on-chip network architectures have been proposed to improve the design flexibility and communication efficiency of such multicore chips. However, in a larger system consisting of several multicore chips across a board or in a System-in-Package (SiP), performance is limited by the communication among and within these chips. Such systems, most commonly found within computing modules in typical data center nodes or server racks, are in dire need of an efficient interconnection architecture. Conventional interchip communication using wireline links involves routing the data from the internal cores to the peripheral I/O ports, travelling over the interchip channels to the destination chip, and finally getting routed from the I/O to the internal cores there. This multihop communication increases latency and energy consumption while decreasing data bandwidth in a multichip system. Furthermore, the intrachip and interchip communication architectures are designed separately to maximize design flexibility. Jointly designing them could, however, improve communication efficiency significantly and yield better solutions. Previous attempts at this include an all-photonic approach that provides a unified inter/intra-chip optical network based on recent progress in nano-photonic technologies. Works on wireless inter-chip interconnects have successfully yielded better results than their wired counterparts, but their scope was limited to establishing a single wireless connection between two chips rather than a communication architecture for the system as a whole. In this thesis, the design of a seamless hybrid wired and wireless interconnection network for multichip systems in a package is proposed. The design utilizes on-chip wireless transceivers to cover systems with dimensions spanning up to tens of centimeters. It seamlessly binds the intrachip and interchip communication architectures and enables direct chip-to-chip communication between the internal cores. Cycle-accurate simulations show that the proposed design increases bandwidth and reduces energy consumption compared to state-of-the-art wireline I/O based multichip communication.
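    To make the routing idea concrete, here is a minimal sketch in Python of how a hybrid wired/wireless NoC might choose the next hop for a flit: intrachip traffic stays on the wired mesh, while interchip traffic walks to a local wireless hub and crosses the package in a single wireless hop instead of detouring through peripheral I/O. All names (Node, route_hop, the one-hub-per-chip topology) are hypothetical illustrations, not details taken from the thesis.

        from dataclasses import dataclass

        @dataclass(frozen=True)
        class Node:
            chip: int   # which chip in the package
            core: int   # core index within that chip

        def route_hop(src: Node, dst: Node, hubs: dict[int, int]) -> str:
            """Pick the next hop type for a flit travelling src -> dst.

            hubs maps a chip id to the core hosting that chip's wireless
            transceiver. Interchip flits first walk to the local hub, then
            take one wireless hop, avoiding the multihop I/O path that the
            abstract describes.
            """
            if src.chip == dst.chip:
                return "wired"          # ordinary intrachip mesh hop
            if src.core != hubs[src.chip]:
                return "wired-to-hub"   # walk to the local wireless hub
            return "wireless"           # direct one-hop chip-to-chip transfer

        hubs = {0: 0, 1: 0}             # one hub per chip (assumed topology)
        print(route_hop(Node(0, 3), Node(1, 5), hubs))   # -> "wired-to-hub"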

    A unified modulo scheduling and register allocation technique for clustered processors

    This work presents a modulo scheduling framework for clustered ILP processors that integrates the cluster assignment, instruction scheduling and register allocation steps in a single phase. This unified approach is more effective than traditional approaches based on sequentially performing some (or all) of the three steps, since it allows optimizing the global code generation problem instead of searching for optimal solutions to each individual step. Besides, it avoids the iterative nature of traditional approaches, which require repeated applications of the three steps until a valid solution is found. The proposed framework includes a mechanism to insert spill code on the fly and heuristics to evaluate the quality of partial schedules while simultaneously considering inter-cluster communications, memory pressure and register pressure. Transformations that allow trading pressure on one type of resource for pressure on another are also included. We show that the proposed technique outperforms previously proposed techniques; for instance, the average speed-up over SPECfp95 is 36% for a 4-cluster configuration.
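    As a rough illustration of the unified approach, the sketch below scores candidate (cluster, cycle) slots for an instruction with a single cost that mixes inter-cluster communication, register pressure and memory pressure, rather than deciding each concern in a separate phase. The weights, data structures and names are illustrative assumptions, not the paper's actual heuristics.

        from dataclasses import dataclass, field

        @dataclass(frozen=True)
        class Slot:
            cluster: int
            cycle: int

        @dataclass
        class PartialSchedule:
            producer: dict = field(default_factory=dict)  # value -> producing cluster
            live: dict = field(default_factory=dict)      # (cluster, cycle) -> live regs
            spills: dict = field(default_factory=dict)    # cycle -> pending spill ops

        def cost(operands, slot, sched, w_comm=1.0, w_reg=0.5, w_mem=0.5):
            # Operands produced on another cluster need explicit communications.
            comm = sum(1 for v in operands
                       if sched.producer.get(v, slot.cluster) != slot.cluster)
            reg = sched.live.get((slot.cluster, slot.cycle), 0)
            mem = sched.spills.get(slot.cycle, 0)
            return w_comm * comm + w_reg * reg + w_mem * mem

        def best_slot(operands, slots, sched):
            # Greedy pick; a real modulo scheduler would also respect the
            # initiation interval and insert spill code on the fly.
            return min(slots, key=lambda s: cost(operands, s, sched))

        sched = PartialSchedule(producer={"a": 0, "b": 1})
        print(best_slot(["a", "b"], [Slot(0, 3), Slot(1, 3)], sched))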

    A Power-Aware Framework for Executing Streaming Programs on Networks-on-Chip

    Nilesh Karavadara, Simon Folie, Michael Zolda, Vu Thien Nga Nguyen, Raimund Kirner, 'A Power-Aware Framework for Executing Streaming Programs on Networks-on-Chip'. Paper presented at the Int'l Workshop on Performance, Power and Predictability of Many-Core Embedded Systems (3PMCES'14), Dresden, Germany, 24-28 March 2014.

    Software developers are discovering that practices which have successfully served single-core platforms for decades no longer work for multi-cores. Stream processing is a parallel execution model that is well suited for architectures with multiple computational elements connected by a network. We propose a power-aware streaming execution layer for network-on-chip architectures that addresses the energy constraints of embedded devices. Our proof-of-concept implementation targets the Intel SCC processor, which connects 48 cores via a network-on-chip. We motivate our design decisions and describe the status of our implementation.
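    One way to picture such a power-aware execution layer is a policy that scales a core's frequency to the demand of the stream stage it hosts. The sketch below is a hedged illustration only: the frequency steps and the cycles-per-item demand model are assumptions, not details of the paper's SCC implementation.

        FREQ_LEVELS_MHZ = [100, 200, 400, 800]    # assumed available steps

        def pick_frequency(items_per_sec: float, cycles_per_item: float) -> int:
            """Lowest frequency level that still meets a stage's demand."""
            required_hz = items_per_sec * cycles_per_item
            for f in FREQ_LEVELS_MHZ:
                if f * 1e6 >= required_hz:
                    return f
            return FREQ_LEVELS_MHZ[-1]            # saturate at the top level

        # A filter stage consuming 50k items/s at 4k cycles/item needs 200 MHz:
        print(pick_frequency(50_000, 4_000))      # -> 200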

    Near-Memory Address Translation

    Memory and logic integration on the same chip is becoming increasingly cost-effective, creating the opportunity to offload data-intensive functionality to processing units placed inside memory chips. The introduction of memory-side processing units (MPUs) into conventional systems faces virtual memory as the first big showstopper: without efficient hardware support for address translation, MPUs have highly limited applicability. Unfortunately, conventional translation mechanisms fall short of providing fast translations as contemporary memories exceed the reach of TLBs, making expensive page walks common. In this paper, we are the first to show that the historically important flexibility to map any virtual page to any page frame is unnecessary in today's servers. We find that while limiting the associativity of the virtual-to-physical mapping incurs no penalty, it can break the translate-then-fetch serialization if combined with careful data placement in the MPU's memory, allowing translation and data fetch to proceed independently and in parallel. We propose the Distributed Inverted Page Table (DIPTA), a near-memory structure in which the smallest memory partition keeps the translation information for its share of the data, ensuring that the translation completes together with the data fetch. DIPTA completely eliminates the performance overhead of translation, achieving speedups of up to 3.81x and 2.13x over conventional translation using 4KB and 1GB pages, respectively.
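    The parallelism argument can be made concrete with a small sketch: once the virtual-to-physical mapping is set-associative, the memory partition and set holding a page are a pure function of the virtual address, so the fetch of the few candidate frames can be issued in parallel with the translation lookup. The sizes and field layout below are illustrative assumptions, not DIPTA's actual parameters.

        PAGE_BITS = 12          # 4 KiB pages
        NUM_SETS = 1 << 10      # sets per memory partition (assumed)
        NUM_PARTITIONS = 8      # partitions, each holding its table slice
        WAYS = 4                # candidate frames fetched speculatively

        def locate(vaddr: int) -> tuple[int, int]:
            """Derive (partition, set) from the VA alone: the data fetch can
            start immediately, while the near-memory table resolves only
            which of the WAYS candidate frames is the right one."""
            vpn = vaddr >> PAGE_BITS
            return (vpn // NUM_SETS) % NUM_PARTITIONS, vpn % NUM_SETS

        print(locate(0x7F1234567000))   # -> (partition, set) for this page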