
    Hardware/software approaches for reducing the process variation impact on instruction fetches

    As technology moves towards finer process geometries, it is becoming extremely difficult to control critical physical parameters such as channel length, gate oxide thickness, and dopant ion concentration. Variations in these parameters lead to dramatic variations in access latencies in Static Random Access Memory (SRAM) devices, which means that different lines of the same cache may have different access latencies. A simple solution to this problem is to adopt the worst-case latency paradigm. While this egalitarian cache management is simple, it may introduce significant performance overhead during instruction fetches, when both an address translation (an instruction Translation Lookaside Buffer (TLB) access) and an instruction cache access take place, making it infeasible for future high-performance processors. In this study, we first propose hardware and software enhancements and then, based on those, investigate several techniques to mitigate the effect of process variation on the instruction fetch pipeline stage in modern processors. For address translation, we study an approach that performs the virtual-to-physical page translation once, stores it in a special register, and reuses it as long as execution remains on the same instruction page. To handle varying access latencies across different instruction cache lines, we annotate instructions with the access latency of their cache lines, giving the fetch circuitry a hint about how long to wait for the next instruction to become available.
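
    The page-level translation reuse lends itself to a compact illustration. The following C sketch is hypothetical: the paper describes a hardware register, and the names (`xlat_reg_t`, `itlb_lookup`, `translate_fetch`) and the 4 KiB page size are assumptions of this sketch, not details from the paper. It shows the intended behavior: translate once per instruction page, then reuse the latched result on every fetch that stays on that page.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12  /* assumed 4 KiB instruction pages */
#define PAGE_MASK  ((UINT64_C(1) << PAGE_SHIFT) - 1)

/* Stand-in for the full iTLB/page-table walk (identity mapping here,
 * purely so the sketch is self-contained). */
static uint64_t itlb_lookup(uint64_t vpage) { return vpage; }

/* Hypothetical single-entry translation register: the translation is done
 * once and reused for every fetch on the same instruction page, keeping
 * the variation-afflicted iTLB off the common fetch path. */
typedef struct {
    uint64_t vpage;  /* virtual page number of the cached translation */
    uint64_t ppage;  /* corresponding physical page number */
    bool     valid;
} xlat_reg_t;

static uint64_t translate_fetch(xlat_reg_t *xr, uint64_t vaddr)
{
    uint64_t vpage = vaddr >> PAGE_SHIFT;

    if (!xr->valid || xr->vpage != vpage) {
        /* Page boundary crossed: perform the translation once and latch it. */
        xr->vpage = vpage;
        xr->ppage = itlb_lookup(vpage);
        xr->valid = true;
    }
    /* Common case: reuse the latched translation, skipping the iTLB. */
    return (xr->ppage << PAGE_SHIFT) | (vaddr & PAGE_MASK);
}
```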

    Boosting performance of directory-based cache coherence protocols with coherence bypass at subpage granularity and a novel on-chip page table

    Chip multiprocessors (CMPs) require effective cache coherence protocols as well as fast virtual-to-physical address translation mechanisms for high performance. Directory-based cache coherence protocols are the state-of-the-art approach in many-core CMPs for keeping the data blocks in the last-level private caches coherent. However, the area overhead and high associativity requirement of the directory structures may not scale well with an increasing number of cores. As shown in some prior studies, a significant percentage of data blocks are accessed by only one core; therefore, it is not necessary to keep track of these in the directory structure. In this study, we have two major contributions. First, we show that, compared to the classification of cache blocks at page granularity as done in some previous studies, data block classification at the subpage level helps to detect considerably more private data blocks. Consequently, it significantly reduces the percentage of blocks that must be tracked in the directory compared to similar page-level classification approaches. This, in turn, enables smaller directory caches with lower associativity to be used in CMPs without hurting performance, thereby helping the directory structure scale gracefully with the increasing number of cores. Memory block classification at the subpage level, however, may increase the frequency of the Operating System's (OS) involvement in updating the maintenance bits belonging to subpages stored in page table entries, nullifying some portion of the performance benefits of subpage-level data classification. To overcome this, we propose a distributed on-chip page table as our second contribution. © 2016 Copyright held by the owner/author(s)
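
    A minimal sketch of what subpage-level private/shared classification could look like, assuming a 4 KiB page divided into eight 512-byte subpages, with one shared bit and a first-toucher core ID per subpage; the names and layout are illustrative, not the paper's actual structures:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define SUBPAGES_PER_PAGE 8  /* assumed: 4 KiB page split into 512 B subpages */

/* Hypothetical per-page maintenance bits kept alongside the page table
 * entry: a private/shared bit plus the first-toucher core per subpage. */
typedef struct {
    uint8_t shared_bits;               /* bit i set => subpage i is shared */
    int16_t owner[SUBPAGES_PER_PAGE];  /* first core to touch subpage i, -1 if none */
} page_meta_t;

static void page_meta_init(page_meta_t *pm)
{
    pm->shared_bits = 0;
    memset(pm->owner, 0xFF, sizeof pm->owner);  /* all owners start at -1 */
}

/* Returns true if the accessed block may bypass the directory, i.e. its
 * subpage is still private to the requesting core. */
static bool may_bypass_directory(page_meta_t *pm, unsigned subpage, int16_t core)
{
    if (pm->shared_bits & (1u << subpage))
        return false;               /* already shared: directory tracks it */
    if (pm->owner[subpage] < 0)
        pm->owner[subpage] = core;  /* first touch: subpage becomes private */
    if (pm->owner[subpage] == core)
        return true;                /* same core: still private, bypass */

    /* A second core touched the subpage: promote the whole subpage to
     * shared, so its blocks are tracked by the directory from now on. */
    pm->shared_bits |= (uint8_t)(1u << subpage);
    return false;
}
```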

    Energy reduction in 3D NoCs through communication optimization

    Network-on-Chip (NoC) architectures and three-dimensional (3D) integrated circuits have been introduced as attractive options for overcoming the barriers in interconnect scaling while increasing the number of cores. Combining these two approaches is expected to yield better performance and higher scalability. This paper explores the possibility of combining these two techniques in a heterogeneity-aware fashion. Specifically, on a heterogeneous 3D NoC architecture, we explore how different types of processors can be optimally placed to minimize data access costs. Moreover, we select the optimal set of links with optimal voltage levels. The experimental results indicate significant savings in energy consumption across a wide range of values of our major simulation parameters. © 2013, Springer-Verlag Wien
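
    As a rough illustration of the optimization involved (not the paper's exact formulation), the joint processor placement and link voltage selection can be written as an energy minimization under a latency budget. Here \(\pi\) is the processor-to-node mapping, \(f_{ij}\) the communication volume between processors \(i\) and \(j\), \(E_l(V_l)\) the per-flit energy of link \(l\) at its assigned voltage \(V_l\), and \(L_{\max}\) an assumed latency bound; all symbols are assumptions of this sketch:

```latex
% Illustrative formulation; symbols are assumptions of this sketch.
\min_{\pi,\,V}\; \sum_{i,j} f_{ij} \sum_{l \in \mathrm{path}(\pi(i),\,\pi(j))} E_l(V_l)
\quad \text{s.t.} \quad \mathrm{latency}(\pi, V) \le L_{\max},
\qquad V_l \in \{V_1, \dots, V_k\} \;\; \forall l
```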

    Optimizing Inter-Nest Data Locality

    By examining the data reuse patterns of four array-intensive embedded applications, we found that these codes exhibit a significant amount of inter-nest reuse (i.e., data reuse that occurs between different nests). While traditional compiler techniques that target array-intensive applications can exploit intra-nest data reuse, there has not been much success in the past in taking advantage of inter-nest data reuse. In this paper, we present a compiler strategy that optimizes inter-nest reuse using loop (iteration space) transformations. Our approach captures the impact of executing a nest on cache contents using an abstraction called the footprint vector. Then, it transforms a given nest such that the new (transformed) access pattern reuses the data left in cache by the previous nest in the code. In optimizing inter-nest locality, our approach also tries to achieve good intra-nest locality. Our simulation results indicate large performance improvements. In particular, inter-nest loop optimization generates results competitive with intra-nest loop and data optimizations.
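
    A small C example of the kind of transformation this targets (the footprint-vector machinery itself is not reproduced; array names and sizes are illustrative). Nest 1 leaves the tail of `a[]` in cache; reversing nest 2's iteration space, which is legal here because its iterations are independent, touches that tail first and converts the inter-nest reuse into cache hits:

```c
#define N (1 << 20)
static double a[N], b[N];

/* Nest 1 sweeps a[] upward; when it finishes, only the tail of a[] (the
 * last cache-sized chunk) is still resident in the data cache. */
void nest1(void)
{
    for (int i = 0; i < N; i++)
        a[i] *= 2.0;
}

/* An untransformed nest 2 would sweep a[] upward again and miss on the
 * cached tail. Reversing its iteration space touches the tail first, so
 * the data left behind by nest 1 is reused before it is displaced. */
void nest2_transformed(void)
{
    for (int i = N - 1; i >= 0; i--)
        b[i] = a[i] + 1.0;
}
```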

    Classifying Data Blocks at Subpage Granularity with an On-Chip Page Table to Improve Coherence in Tiled CMPs

    As shown in some prior studies, a significant percentage of the data blocks accessed in parallel codes are private, and not keeping track of those blocks can improve the effectiveness of directory structures in CMPs. In this study, we have two major contributions. First, we show that, compared to the classification of cache blocks at page granularity, data block classification at the subpage level helps to detect considerably more private data blocks. Based on this idea, we propose two different approaches for enhancing the effectiveness of directory caches in tiled CMPs. In the first approach, called Quasi-Dynamic subpage level data Block Classification (QDBC), a data block is assumed to be private from the beginning of program execution and stays private as long as the corresponding subpage is accessed by only one core. Our second approach, called Dynamic subpage level data Block Classification (DBC), turns a data block private again after all blocks within the corresponding subpage are evicted from the private cache hierarchy. Memory block classification at the subpage level, however, may increase the frequency of operating system involvement in updating the maintenance bits in page table entries. To overcome this, we propose, as our second contribution, a distributed table called the on-chip page table, which stores recently accessed page translations in the system. Our simulation results show that, compared to page-level data classification, the QDBC and DBC approaches relying on the on-chip page table detect significantly more private data blocks and considerably improve system performance.
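
    The eviction-triggered re-privatization that distinguishes DBC can be sketched as a per-subpage count of blocks still resident in private caches; when the count reaches zero, the shared bit is cleared. Field names and widths are assumptions of this sketch, not the paper's structures:

```c
#include <stdint.h>

#define SUBPAGES_PER_PAGE 8  /* assumed subpage geometry, as in the sketch above */

/* Hypothetical per-subpage state for DBC: alongside the shared bit, count
 * how many blocks of each subpage are still held in any private cache. */
typedef struct {
    uint8_t  shared_bits;                      /* bit i set => subpage i shared */
    uint16_t cached_blocks[SUBPAGES_PER_PAGE]; /* blocks of subpage i still cached */
} dbc_meta_t;

/* Called when a block belonging to `subpage` leaves a private cache. Once
 * every block of the subpage has been evicted, DBC turns the subpage
 * private again, so later accesses may once more bypass the directory. */
void dbc_on_eviction(dbc_meta_t *m, unsigned subpage)
{
    if (m->cached_blocks[subpage] > 0 && --m->cached_blocks[subpage] == 0)
        m->shared_bits &= (uint8_t)~(1u << subpage);
}
```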

    An Energy Saving Strategy Based on Adaptive Loop Parallelization

    In this paper, we evaluate an adaptive loop parallelization strategy (i.e., a strategy that allows each loop nest to execute using a different number of processors if doing so is beneficial) and measure the potential energy savings when processors left unused during the execution of a nested loop in a multi-processor on-a-chip (MPoC) are shut down (i.e., placed into a power-down or sleep state). Our results show that shutting down unused processors can lead to as much as 67% energy savings with up to 17% performance loss in a set of array-intensive applications. We also discuss and evaluate a processor pre-activation strategy based on compile-time analysis of nested loops. Based on our experiments, we conclude that an adaptive loop parallelization strategy combined with idle processor shut-down and pre-activation can be very effective in reducing energy consumption without increasing execution time.
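
    In source terms, adaptive parallelization amounts to giving each nest its own processor count. A hedged OpenMP sketch follows; the per-nest thread counts (2 and 8) are invented for illustration rather than produced by the paper's compile-time analysis:

```c
#define N 2048
static double a[N][N], b[N][N];

/* Illustrative only: the paper's compiler picks a per-nest processor
 * count; cores not claimed by a nest are the ones the paper would power
 * down (or pre-activate shortly before the next nest needs them). */
void adaptive_nests(void)
{
    /* Memory-bound copy nest: assume analysis found 2 processors enough. */
    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = b[i][j];

    /* Compute-heavier nest: assume analysis chose all 8 processors. */
    #pragma omp parallel for num_threads(8)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = a[i][j] * b[j][i] + 0.5;
}
```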

    An Integer Linear Programming Based Approach for Parallelizing Applications in On-Chip Multiprocessors

    With energy consumption becoming one of the first-class optimization parameters in computer system design, compilation techniques that consider performance and energy simultaneously are expected to play a central role. In particular, compiling a given application code under performance and energy constraints is becoming an important problem. In this paper, we focus on an on-chip multiprocessor architecture and present a parallelization strategy based on integer linear programming. Given an array-intensive application, our optimization strategy determines the number of processors to be used in executing each nest, based on the objective function and additional compilation constraints provided by the user. Our initial experience with this strategy shows that it is very successful in optimizing array-intensive applications on on-chip multiprocessors under energy and performance constraints.
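
    An ILP in the spirit described might look as follows (an illustrative sketch, not the paper's exact formulation): binary variables pick one processor count per nest so as to minimize estimated energy under a total-time budget, where \(T(n,p)\) and \(E(n,p)\) are assumed compiler estimates of the time and energy of nest \(n\) on \(p\) processors:

```latex
% Illustrative ILP sketch; T(n,p) and E(n,p) are assumed compiler estimates.
\begin{aligned}
\min\;& \textstyle\sum_{n}\sum_{p} E(n,p)\, x_{n,p}
  && \text{(total energy)}\\
\text{s.t.}\;& \textstyle\sum_{p} x_{n,p} = 1 \quad \forall n
  && \text{(one processor count per nest)}\\
& \textstyle\sum_{n}\sum_{p} T(n,p)\, x_{n,p} \le T_{\max}
  && \text{(performance constraint)}\\
& x_{n,p} \in \{0,1\}
\end{aligned}
```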