68 research outputs found

    Doctor of Philosophy

    Get PDF
    dissertationIn recent years, a number of trends have started to emerge, both in microprocessor and application characteristics. As per Moore's law, the number of cores on chip will keep doubling every 18-24 months. International Technology Roadmap for Semiconductors (ITRS) reports that wires will continue to scale poorly, exacerbating the cost of on-chip communication. Cores will have to navigate an on-chip network to access data that may be scattered across many cache banks. The number of pins on the package, and hence available off-chip bandwidth, will at best increase at sublinear rate and at worst, stagnate. A number of disruptive memory technologies, e.g., phase change memory (PCM) have begun to emerge and will be integrated into the memory hierarchy sooner than later, leading to non-uniform memory access (NUMA) hierarchies. This will make the cost of accessing main memory even higher. In previous years, most of the focus has been on deciding the memory hierarchy level where data must be placed (L1 or L2 caches, main memory, disk, etc.). However, in modern and future generations, each level is getting bigger and its design is being subjected to a number of constraints (wire delays, power budget, etc.). It is becoming very important to make an intelligent decision about where data must be placed within a level. For example, in a large non-uniform access cache (NUCA), we must figure out the optimal bank. Similarly, in a multi-dual inline memory module (DIMM) non uniform memory access (NUMA) main memory, we must figure out the DIMM that is the optimal home for every data page. Studies have indicated that heterogeneous main memory hierarchies that incorporate multiple memory technologies are on the horizon. We must develop solutions for data management that take heterogeneity into account. For these memory organizations, we must again identify the appropriate home for data. In this dissertation, we attempt to verify the following thesis statement: "Can low-complexity hardware and OS mechanisms manage data placement within each memory hierarchy level to optimize metrics such as performance and/or throughput?" In this dissertation we argue for a hardware-software codesign approach to tackle the above mentioned problems at different levels of the memory hierarchy. The proposed methods utilize techniques like page coloring and shadow addresses and are able to handle a large number of problems ranging from managing wire-delays in large, shared NUCA caches to distributing shared capacity among different cores. We then examine data-placement issues in NUMA main memory for a many-core processor with a moderate number of on-chip memory controllers. Using codesign approaches, we achieve efficient data placement by modifying the operating system's (OS) page allocation algorithm for a wide variety of main memory architectures

    Hybrid Caching for Chip Multiprocessors Using Compiler-Based Data Classification

    Get PDF
    The high performance delivered by modern computer system keeps scaling with an increasingnumber of processors connected using distributed network on-chip. As a result, memory accesslatency, largely dominated by remote data cache access and inter-processor communication, is becoming a critical performance bottleneck. To release this problem, it is necessary to localize data access as much as possible while keep efficient on-chip cache memory utilization. Achieving this however, is application dependent and needs a keen insight into the memory access characteristics of the applications. This thesis demonstrates how using fairly simple thus inexpensive compiler analysis memory accesses can be classified into private data access and shared data access. In addition, we introduce a third classification named probably private access and demonstrate the impact of this category compared to traditional private and shared memory classification. The memory access classification information from the compiler analysis is then provided to the runtime system through a modified memory allocator and page table to facilitate a hybrid private-shared caching technique. The hybrid cache mechanism is aware of different data access classification and adopts appropriate placement and search policies accordingly to improve performance. Our analysis demonstrates that many applications have a significant amount of both private and shared data and that compiler analysis can identify the private data effectively for many applications. Experimentsresults show that the implemented hybrid caching scheme achieves 4.03% performance improvement over state of the art NUCA-base caching

    Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures

    Get PDF
    The growing core counts and caches of modern processors result in data access latency becoming a function of the data's physical location in the cache. Thus, the placement of cache blocks determines the cache's performance. Reactive nonuniform cache architectures (R-NUCA) achieve near-optimal cache block placement by classifying blocks online and placing data close to the cores that use them

    GreenCool: An Energy-Efficient Liquid Cooling Design Technique for 3-D MPSoCs Via Channel Width Modulation

    Get PDF
    Liquid cooling using interlayer microchannels has appeared as a viable and scalable packaging technology for 3-D multiprocessor system-on-chips (MPSoCs). Microchannel-based liquid cooling, however, can substantially increase the on-chip thermal gradients, which are undesirable for reliability, performance, and cooling efficiency. In this paper, we present GreenCool, an optimal design methodology for liquid-cooled 3-D MPSoCs. GreenCool simultaneously minimizes the cooling energy for a given system while maintaining thermal gradients and peak temperatures under safe limits. This is accomplished by tuning the heat transfer characteristics of the microchannels using channel width modulation. Channel width modulation is compatible with the current process technologies and incurs minimal additional fabrication costs. Through an extensive set of experiments, we show that channel width modulation is capable of complementing and enhancing the benefits of temperature-aware floorplanning. We also experiment with a 16-core 3-D system with stacked dynamic random-access memory, for which GreenCool improves energy efficiency by up to 53% with respect to no channel modulation

    TD-NUCA: runtime driven management of NUCA caches in task dataflow programming models

    Get PDF
    In high performance processors, the design of on-chip memory hierarchies is crucial for performance and energy efficiency. Current processors rely on large shared Non-Uniform Cache Architectures (NUCA) to improve performance and reduce data movement. Multiple solutions exploit information available at the microarchitecture level or in the operating system to optimize NUCA performance. However, existing methods have not taken advantage of the information captured by task dataflow programming models to guide the management of NUCA caches. In this paper we propose TD-NUCA, a hardware/software co-designed approach that leverages information present in the runtime system of task dataflow programming models to efficiently manage NUCA caches. TD-NUCA identifies the data access and reuse patterns of parallel applications in the runtime system and guides the operation of the NUCA caches in the hardware. As a result, TD-NUCA achieves a 1.18x average speedup over the baseline S-NUCA while requiring only 0.62x the data movement.This work has been supported by the Spanish Ministry of Science and Technology (contract PID2019-107255GB-C21) and the Generalitat de Catalunya (contract 2017-SGR-1414). M. Casas has been partially supported by the Grant RYC- 2017-23269 funded by MCIN/AEI/10.13039/501100011033 and ESF ‘Investing in your future’. M. Moreto has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship No. RYC-2016-21104.Peer ReviewedPostprint (published version

    Power and Thermal Management of System-on-Chip

    Get PDF

    Design and Implementation of Benes/Clos On-Chip Interconnection Networks

    Full text link
    Networks-on-Chip (NoCs) have emerged as the key on-chip communication architecture for multiprocessor systems-on-chip and chip multiprocessors. Single-hop non-blocking networks have the advantage of providing uniform latency and throughput, which is important for cachecoherent NoC systems. Existing work shows that Benes networks have much lower transistor count and smaller circuit area but longer delay than crossbars. To reduce the delay, we propose to design the Clos network built with larger size switches. Using less than half number of stages than the Benes network, the Clos network with 4x4 switches can significantly reduce the delay. This dissertation focuses on designing high performance Benes/Clos on-chip interconnection networks and implementing the switch setting circuits for these networks. The major contributions are summarized below: The circuit designs of both Benes and Clos networks in different sizes are conducted considering two types of implementation of the configurable switch: with NMOS transistors only and full transmission gates (TGs). The layout and simulation results under 45nm technology show that TG-based Benes networks have much better delay and power performance than their NMOS-based counterparts, though more transistor resources are needed in TG-based designs. Clos networks achieve average 60% lower delay than Benes networks with even smaller area and power consumption. The Lee’s switch setting algorithm is fully implemented in RTL and synthesized. We have refined the algorithm in data structure and initialization/updating of relation values to make it suitable for hardware implementation. The simulation and synthesis results of the switching setting circuits for 4x4 to 64x64 Benes networks under 65nm technology confirm that the trend of delay and area results of the circuit is consistent with that of the Lee’s algorithm. To the best of our knowledge, this is the first complete hardware implementation of the parallel switch setting algorithm which can handle all types of permutations including partial ones. The results in this dissertation confirm that the Benes/Clos networks are promising solution to implement on-chip interconnection network
    • 

    corecore