254 research outputs found

    Reducing Waste in Memory Hierarchies

    Memory hierarchies play an important role in microarchitectural design, bridging the performance gap between modern microprocessors and main memory. However, memory hierarchies are inefficient because they store waste. This dissertation quantifies two types of waste, dead blocks and data redundancy, studies waste in diverse memory hierarchies, and proposes techniques to reduce it and improve performance with limited overhead. The dissertation observes that dead blocks in an inclusive last-level cache come in two kinds: blocks that are heavily accessed in the core caches (and therefore never re-referenced at the last level) and blocks that have low temporal locality in both the core caches and the last-level cache. Blindly replacing all dead blocks in an inclusive last-level cache may degrade performance, so the dissertation proposes temporal-based multilevel correlating cache replacement to improve the performance of inclusive cache hierarchies. It further observes that waste exists in the private caches of graphics processing units (GPUs) in the form of zero-reuse blocks, defined as blocks that are dead after being inserted into the cache, and proposes an adaptive GPU cache bypassing technique that improves performance and reduces power consumption by dynamically bypassing zero-reuse blocks. Finally, the dissertation examines data redundancy at block-level granularity and finds that conventional cache designs waste capacity by storing duplicate data; it quantifies the fraction of duplicated data, analyzes its causes, and proposes a practical cache deduplication technique that increases the effectiveness of the cache with limited area and power overhead.
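
    As a concrete illustration of the bypassing idea, the following minimal Python sketch models a toy fully associative LRU cache that remembers blocks evicted without any reuse and skips inserting them on later misses. The address-indexed predictor and its threshold are illustrative assumptions, not the dissertation's actual design.

        from collections import OrderedDict

        class BypassingCache:
            """Toy LRU cache that learns to bypass zero-reuse blocks."""

            def __init__(self, num_blocks, threshold=2):
                self.num_blocks = num_blocks
                self.threshold = threshold      # assumed confidence threshold
                self.blocks = OrderedDict()     # block address -> reused flag
                self.dead_evictions = {}        # block address -> dead evictions seen

            def access(self, addr):
                if addr in self.blocks:         # hit: the block proved its reuse
                    self.blocks.move_to_end(addr)
                    self.blocks[addr] = True
                    self.dead_evictions[addr] = 0
                    return "hit"
                if self.dead_evictions.get(addr, 0) >= self.threshold:
                    return "bypass"             # predicted zero-reuse: skip insertion
                if len(self.blocks) >= self.num_blocks:
                    victim, reused = self.blocks.popitem(last=False)
                    if not reused:              # evicted dead: strengthen prediction
                        self.dead_evictions[victim] = self.dead_evictions.get(victim, 0) + 1
                self.blocks[addr] = False       # inserted, not yet reused
                return "miss"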

    A Survey of Techniques for Architecting TLBs

    A translation lookaside buffer (TLB) caches virtual-to-physical address translations and is used in systems ranging from embedded devices to high-end servers. Since the TLB is accessed very frequently and a TLB miss is extremely costly, prudent TLB management is important for improving processor performance and energy efficiency. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects, and system engineers.
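
    To make the surveyed structure concrete, here is a minimal Python model of a TLB, assuming a fully associative organization with LRU replacement and 4 KiB pages; real TLBs are set-associative hardware structures, so this is only a sketch.

        from collections import OrderedDict

        PAGE_SIZE = 4096   # assumed 4 KiB pages

        class TLB:
            """Toy fully associative TLB with LRU replacement."""

            def __init__(self, entries=64):
                self.entries = entries
                self.map = OrderedDict()   # virtual page number -> physical frame number

            def translate(self, vaddr, page_table):
                vpn, offset = divmod(vaddr, PAGE_SIZE)
                if vpn in self.map:                    # TLB hit: no page walk needed
                    self.map.move_to_end(vpn)
                else:                                  # TLB miss: costly page-table walk
                    if len(self.map) >= self.entries:
                        self.map.popitem(last=False)   # evict the least recently used entry
                    self.map[vpn] = page_table[vpn]
                return self.map[vpn] * PAGE_SIZE + offset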

    Software-Oriented Data Access Characterization for Chip Multiprocessor Architecture Optimizations

    The integration of an increasing amount of on-chip hardware in chip multiprocessors (CMPs) poses the challenge of efficiently utilizing on-chip resources to maximize performance. Prior research proposals largely rely on additional hardware support to achieve desirable tradeoffs. However, these purely hardware-oriented mechanisms typically result in more generic but less efficient approaches. A newer trend is to design adaptive systems that exploit and leverage application-level information. In this work, a wide range of applications are analyzed, and data access behaviors and patterns useful for architectural and system optimizations are identified. In particular, this dissertation introduces software-based techniques that extract data access characteristics for cross-layer optimizations of performance and scalability. The collected information is used to guide cache data placement, network configuration, coherence operations, address translation, memory configuration, and related decisions. Specifically, an approach is proposed that classifies data blocks into different categories to optimize an on-chip coherent cache organization. For applications with compile-time-deterministic data access locality, a compiler technique is proposed that determines data partitions to guide last-level cache data placement and to derive communication patterns for network configuration. A page-level data classification is also demonstrated to improve address translation performance. The successful use of data access characteristics on traditional CMP architectures demonstrates that the proposed approach is promising and generic, and can potentially be applied to future CMP architectures with emerging technologies such as spin-transfer torque RAM (STT-RAM).
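
    As a hedged sketch of the kind of classification described above (the dissertation's compiler- and software-based analysis is considerably richer), the following Python function buckets data blocks as private or shared from an access trace; the trace format and the two-way categorization are illustrative assumptions.

        def classify_blocks(trace):
            """trace: iterable of (thread_id, block_address) accesses.
            Returns a mapping from block address to 'private' or 'shared'."""
            accessors = {}
            for tid, block in trace:
                accessors.setdefault(block, set()).add(tid)
            return {block: "private" if len(tids) == 1 else "shared"
                    for block, tids in accessors.items()}

        # Example: block 0x100 is touched only by thread 0, block 0x200 by both.
        print(classify_blocks([(0, 0x100), (0, 0x100), (1, 0x200), (0, 0x200)]))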

    DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

    Data movement between the CPU and main memory is a first-order obstacle to improving performance, scalability, and energy efficiency in modern systems. Computer systems employ a range of techniques to reduce overheads tied to data movement, spanning from traditional mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging techniques such as Near-Data Processing (NDP), where some computation is moved close to memory. Our goal is to methodically identify potential sources of data movement over a broad set of applications and to comprehensively compare traditional compute-centric data movement mitigation techniques to more memory-centric techniques, thereby developing a rigorous understanding of the best techniques to mitigate each source of data movement. With this goal in mind, we perform the first large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory. We develop the first systematic methodology to classify applications based on the sources contributing to data movement bottlenecks. From our large-scale characterization of 77K functions across 345 applications, we select 144 functions to form the first open-source benchmark suite (DAMOV) for main memory data movement studies. We select a diverse range of functions that (1) represent different types of data movement bottlenecks, and (2) come from a wide range of application domains. Using NDP as a case study, we identify new insights about the different data movement bottlenecks and use these insights to determine the most suitable data movement mitigation mechanism for a particular application. We open-source DAMOV and the complete source code for our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.
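
    A hedged sketch of the flavor of classification such a methodology enables follows; the two metrics (last-level-cache misses per kilo-instruction and arithmetic intensity) and the thresholds are illustrative assumptions, not DAMOV's actual decision process.

        def classify_function(llc_mpki, ops_per_byte):
            """llc_mpki: last-level-cache misses per kilo-instruction.
            ops_per_byte: arithmetic intensity of the function."""
            if llc_mpki < 1.0:
                return "compute-bound: the cache hierarchy already serves it well"
            if ops_per_byte < 0.25:
                return "memory-bound: candidate for near-data processing"
            return "mixed: evaluate both compute-centric and memory-centric techniques"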

    Software-Assisted Data Prefetching Algorithms

    by Ho Chi-sum. Thesis (M.Phil.), Chinese University of Hong Kong, 1995. Includes bibliographical references (leaves 110-113). Contents:
    Chapter 1, Introduction: Overview; Cache Memories; Improving Cache Performance; Improving System Performance; Organization of the dissertation.
    Chapter 2, Related Work: Cache Performance; Non-Blocking Cache; Cache Prefetching (Hardware Prefetching, Software-assisted Prefetching, Improving Cache Effectiveness); Other Techniques to Reduce and Hide Memory Latencies (Register Preloading, Write Policies, Small Specialized Cache, Program Transformation).
    Chapter 3, Stride CAM Prefetching: Introduction; Architectural Model (Compiler Support, Hardware Support, Model Details); Optimization Issues (Eliminating Redundant Prefetching, Code Motion, Burst Mode, Stride CAM Overflow, Effects of Loop Optimizations); Practicability (Evaluation Methodology, Prefetch Accuracy, Stride CAM Size, Software Overhead).
    Chapter 4, Stride Register Prefetching: Motivation; Architectural Model (Stride Register, Compiler Support, Prefetch Bits, Operation Details); Practicability and Optimizations (Practicability on NASA7 Benchmark Programs, Optimization Issues); Comparison Between Stride CAM and Stride Register Models.
    Chapter 5, Small Software-Driven Array Cache: Introduction; Cache Pollution in MXM; Architectural Model (Operation Details); Effectiveness of Array Cache.
    Chapter 6, Conclusion: Conclusion; Future Research, An Extension of the Stride CAM Model (Background, Reference Address Series, Extending the Stride CAM Model, Prefetch Overhead).
    Appendix A, Simulation Results for the Stride CAM Model: Execution Time, Memory Delay, Overhead, and Hit Ratio, each reported for BTRIX, CFFT2D, CHOLSKY, EMIT, GMTRY, MXM, and VPENTA.
    Appendix B, Simulation Results for the Array Cache.
    Appendix C, NASA7 Benchmark: BTRIX; CFFT2D (cfft2d1, cfft2d2); CHOLSKY; EMIT; GMTRY; MXM; VPENTA.
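
    Chapter 3's stride CAM detects constant-stride access streams and issues prefetches for them. A minimal Python sketch of stride detection, assuming one table entry per load instruction keyed by program counter (the thesis's compiler-managed CAM differs in detail):

        class StrideTable:
            """One entry per load instruction: (last address, last observed stride)."""

            def __init__(self):
                self.table = {}

            def observe(self, pc, addr):
                """Record a load; return an address worth prefetching, or None."""
                last, stride = self.table.get(pc, (None, 0))
                self.table[pc] = (addr, addr - last if last is not None else 0)
                if last is not None and stride != 0 and addr - last == stride:
                    return addr + stride       # same stride seen twice in a row
                return None

        # Example: a load sweeping an array with stride 8.
        pf = StrideTable()
        for a in (100, 108, 116, 124):
            print(pf.observe(pc=0x40, addr=a))   # None, None, 124, 132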

    Design of Efficient TLB-based Data Classification Mechanisms in Chip Multiprocessors

    Most of the data referenced by sequential and parallel applications running on current chip multiprocessors is referenced by a single thread, i.e., private. Recent proposals leverage this observation to improve many aspects of chip multiprocessors, such as reducing coherence overhead or the access latency to distributed caches. The effectiveness of those proposals depends to a large extent on the amount of private data they detect. However, the mechanisms proposed so far either do not consider thread migration or the private use of data within different application phases, or they entail high overhead. As a result, a considerable amount of private data goes undetected. To increase the detection of private data, this thesis proposes a TLB-based mechanism that accounts for both thread migration and private application phases with low overhead. In the proposed TLB-based classification mechanisms, a page's classification is determined by the presence of its translation in other cores' TLBs. The classification schemes are analyzed in multilevel TLB hierarchies, for systems with both private and distributed shared last-level TLBs. The thesis introduces a page classification approach based on inspecting other cores' TLBs upon every TLB miss; in particular, the approach is based on the exchange and counting of tokens. Token counting on TLBs is a natural and efficient way to classify memory pages: it requires no complex and undesirable persistent requests or arbitration, since when two or more TLBs race to access a page, the tokens are distributed appropriately and the page is classified as shared. However, the ability of TLB-based mechanisms to classify private pages depends strongly on TLB size, since they rely on the presence of a page translation in the system's TLBs. To overcome this, different TLB usage predictors (UP) are proposed, which allow page classification to be unaffected by TLB size. Specifically, this thesis introduces a predictor that obtains system-wide page usage information either by employing a shared last-level TLB structure (SUP) or by having cooperative TLBs work together (CUP).
    Esteve García, A. (2017). Design of Efficient TLB-based Data Classification Mechanisms in Chip Multiprocessors [unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/86136
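
    A minimal sketch of the token-counting idea, assuming a four-core system and omitting the protocol's handling of predictor interaction and recovery (the names and the exact token-movement policy below are illustrative):

        NUM_CORES = 4   # assumed core count

        class TokenClassifier:
            """Each page owns NUM_CORES tokens, held by the TLBs caching its
            translation; a core holding all of them may treat the page as private."""

            def __init__(self):
                self.tokens = {}   # page -> {core: tokens held in that core's TLB}

            def on_tlb_fill(self, core, page):
                holders = self.tokens.setdefault(page, {})
                if not holders:
                    holders[core] = NUM_CORES     # sole cached copy: take every token
                elif core not in holders:
                    donor = max(holders, key=holders.get)
                    holders[donor] -= 1           # racing TLBs split the tokens,
                    holders[core] = 1             # so the page becomes shared
                return "private" if holders[core] == NUM_CORES else "shared"

            def on_tlb_eviction(self, core, page):
                holders = self.tokens.get(page, {})
                holders.pop(core, None)
                if len(holders) == 1:             # a lone remaining holder regains all
                    (sole,) = holders             # tokens, so its next access can
                    holders[sole] = NUM_CORES     # reclassify the page as private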