254 research outputs found

    Reducing Waste in Memory Hierarchies

    Memory hierarchies play an important role in microarchitectural design, bridging the performance gap between modern microprocessors and main memory. However, memory hierarchies are inefficient because they store waste. This dissertation quantifies two types of waste, dead blocks and data redundancy, studies waste in diverse memory hierarchies, and proposes techniques to reduce it and improve performance with limited overhead. The dissertation observes that dead blocks in an inclusive last-level cache come in two kinds: blocks that are heavily accessed in the core caches (and therefore never re-referenced at the last level) and blocks that have low temporal locality in both the core caches and the last-level cache. Blindly replacing all dead blocks in an inclusive last-level cache may degrade performance, so the dissertation proposes temporal-based multilevel correlating cache replacement to improve the performance of inclusive cache hierarchies. It further observes that waste exists in the private caches of graphics processing units (GPUs) in the form of zero-reuse blocks, defined as blocks that are dead after being inserted into the cache, and proposes an adaptive GPU cache bypassing technique that improves performance and reduces power consumption by dynamically bypassing zero-reuse blocks. Finally, the dissertation examines data redundancy at block-level granularity and finds that conventional cache designs waste capacity by storing duplicate data; it quantifies the fraction of duplicated data, analyzes its causes, and proposes a practical cache deduplication technique that increases the effectiveness of the cache with limited area and power overhead.
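
    As a concrete illustration of the bypassing idea, the following minimal Python sketch models a toy fully associative LRU cache that remembers blocks evicted without any reuse and skips inserting them on later misses. The address-indexed predictor and its threshold are illustrative assumptions, not the dissertation's actual design.

        from collections import OrderedDict

        class BypassingCache:
            """Toy LRU cache that learns to bypass zero-reuse blocks."""

            def __init__(self, num_blocks, threshold=2):
                self.num_blocks = num_blocks
                self.threshold = threshold      # assumed confidence threshold
                self.blocks = OrderedDict()     # block address -> reused flag
                self.dead_evictions = {}        # block address -> dead evictions seen

            def access(self, addr):
                if addr in self.blocks:         # hit: the block proved its reuse
                    self.blocks.move_to_end(addr)
                    self.blocks[addr] = True
                    self.dead_evictions[addr] = 0
                    return "hit"
                if self.dead_evictions.get(addr, 0) >= self.threshold:
                    return "bypass"             # predicted zero-reuse: skip insertion
                if len(self.blocks) >= self.num_blocks:
                    victim, reused = self.blocks.popitem(last=False)
                    if not reused:              # evicted dead: strengthen prediction
                        self.dead_evictions[victim] = self.dead_evictions.get(victim, 0) + 1
                self.blocks[addr] = False       # inserted, not yet reused
                return "miss"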

    A Survey of Techniques for Architecting TLBs

    A translation lookaside buffer (TLB) caches virtual-to-physical address translations and is used in systems ranging from embedded devices to high-end servers. Since the TLB is accessed very frequently and a TLB miss is extremely costly, prudent TLB management is important for improving processor performance and energy efficiency. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects, and system engineers.
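
    To make the surveyed structure concrete, here is a minimal Python model of a TLB, assuming a fully associative organization with LRU replacement and 4 KiB pages; real TLBs are set-associative hardware structures, so this is only a sketch.

        from collections import OrderedDict

        PAGE_SIZE = 4096   # assumed 4 KiB pages

        class TLB:
            """Toy fully associative TLB with LRU replacement."""

            def __init__(self, entries=64):
                self.entries = entries
                self.map = OrderedDict()   # virtual page number -> physical frame number

            def translate(self, vaddr, page_table):
                vpn, offset = divmod(vaddr, PAGE_SIZE)
                if vpn in self.map:                    # TLB hit: no page walk needed
                    self.map.move_to_end(vpn)
                else:                                  # TLB miss: costly page-table walk
                    if len(self.map) >= self.entries:
                        self.map.popitem(last=False)   # evict the least recently used entry
                    self.map[vpn] = page_table[vpn]
                return self.map[vpn] * PAGE_SIZE + offset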

    Software-Oriented Data Access Characterization for Chip Multiprocessor Architecture Optimizations

    The integration of an increasing amount of on-chip hardware in chip multiprocessors (CMPs) poses the challenge of efficiently utilizing on-chip resources to maximize performance. Prior research proposals largely rely on additional hardware support to achieve desirable tradeoffs. However, these purely hardware-oriented mechanisms typically result in more generic but less efficient approaches. A newer trend is to design adaptive systems that exploit and leverage application-level information. In this work, a wide range of applications are analyzed, and data access behaviors and patterns useful for architectural and system optimizations are identified. In particular, this dissertation introduces software-based techniques that extract data access characteristics for cross-layer optimizations of performance and scalability. The collected information is used to guide cache data placement, network configuration, coherence operations, address translation, memory configuration, and related decisions. Specifically, an approach is proposed that classifies data blocks into different categories to optimize an on-chip coherent cache organization. For applications with compile-time-deterministic data access locality, a compiler technique is proposed that determines data partitions to guide last-level cache data placement and to derive communication patterns for network configuration. A page-level data classification is also demonstrated to improve address translation performance. The successful use of data access characteristics on traditional CMP architectures demonstrates that the proposed approach is promising and generic, and can potentially be applied to future CMP architectures with emerging technologies such as spin-transfer torque RAM (STT-RAM).
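
    As a hedged sketch of the kind of classification described above (the dissertation's compiler- and software-based analysis is considerably richer), the following Python function buckets data blocks as private or shared from an access trace; the trace format and the two-way categorization are illustrative assumptions.

        def classify_blocks(trace):
            """trace: iterable of (thread_id, block_address) accesses.
            Returns a mapping from block address to 'private' or 'shared'."""
            accessors = {}
            for tid, block in trace:
                accessors.setdefault(block, set()).add(tid)
            return {block: "private" if len(tids) == 1 else "shared"
                    for block, tids in accessors.items()}

        # Example: block 0x100 is touched only by thread 0, block 0x200 by both.
        print(classify_blocks([(0, 0x100), (0, 0x100), (1, 0x200), (0, 0x200)]))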

    DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

    Data movement between the CPU and main memory is a first-order obstacle to improving performance, scalability, and energy efficiency in modern systems. Computer systems employ a range of techniques to reduce overheads tied to data movement, spanning from traditional mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging techniques such as Near-Data Processing (NDP), where some computation is moved close to memory. Our goal is to methodically identify potential sources of data movement over a broad set of applications and to comprehensively compare traditional compute-centric data movement mitigation techniques to more memory-centric techniques, thereby developing a rigorous understanding of the best techniques to mitigate each source of data movement. With this goal in mind, we perform the first large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory. We develop the first systematic methodology to classify applications based on the sources contributing to data movement bottlenecks. From our large-scale characterization of 77K functions across 345 applications, we select 144 functions to form the first open-source benchmark suite (DAMOV) for main memory data movement studies. We select a diverse range of functions that (1) represent different types of data movement bottlenecks, and (2) come from a wide range of application domains. Using NDP as a case study, we identify new insights about the different data movement bottlenecks and use these insights to determine the most suitable data movement mitigation mechanism for a particular application. We open-source DAMOV and the complete source code for our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.
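
    A hedged sketch of the flavor of classification such a methodology enables follows; the two metrics (last-level-cache misses per kilo-instruction and arithmetic intensity) and the thresholds are illustrative assumptions, not DAMOV's actual decision process.

        def classify_function(llc_mpki, ops_per_byte):
            """llc_mpki: last-level-cache misses per kilo-instruction.
            ops_per_byte: arithmetic intensity of the function."""
            if llc_mpki < 1.0:
                return "compute-bound: the cache hierarchy already serves it well"
            if ops_per_byte < 0.25:
                return "memory-bound: candidate for near-data processing"
            return "mixed: evaluate both compute-centric and memory-centric techniques"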

    Software-Assisted Data Prefetching Algorithms

    by Ho Chi-sum. Thesis (M.Phil.), Chinese University of Hong Kong, 1995. Includes bibliographical references (leaves 110-113). Contents:
    Chapter 1, Introduction: Overview; Cache Memories; Improving Cache Performance; Improving System Performance; Organization of the dissertation.
    Chapter 2, Related Work: Cache Performance; Non-Blocking Cache; Cache Prefetching (Hardware Prefetching, Software-assisted Prefetching, Improving Cache Effectiveness); Other Techniques to Reduce and Hide Memory Latencies (Register Preloading, Write Policies, Small Specialized Cache, Program Transformation).
    Chapter 3, Stride CAM Prefetching: Introduction; Architectural Model (Compiler Support, Hardware Support, Model Details); Optimization Issues (Eliminating Redundant Prefetching, Code Motion, Burst Mode, Stride CAM Overflow, Effects of Loop Optimizations); Practicability (Evaluation Methodology, Prefetch Accuracy, Stride CAM Size, Software Overhead).
    Chapter 4, Stride Register Prefetching: Motivation; Architectural Model (Stride Register, Compiler Support, Prefetch Bits, Operation Details); Practicability and Optimizations (Practicability on NASA7 Benchmark Programs, Optimization Issues); Comparison Between Stride CAM and Stride Register Models.
    Chapter 5, Small Software-Driven Array Cache: Introduction; Cache Pollution in MXM; Architectural Model (Operation Details); Effectiveness of Array Cache.
    Chapter 6, Conclusion: Conclusion; Future Research, An Extension of the Stride CAM Model (Background, Reference Address Series, Extending the Stride CAM Model, Prefetch Overhead).
    Appendix A, Simulation Results for the Stride CAM Model: Execution Time, Memory Delay, Overhead, and Hit Ratio, each reported for BTRIX, CFFT2D, CHOLSKY, EMIT, GMTRY, MXM, and VPENTA.
    Appendix B, Simulation Results for the Array Cache.
    Appendix C, NASA7 Benchmark: BTRIX; CFFT2D (cfft2d1, cfft2d2); CHOLSKY; EMIT; GMTRY; MXM; VPENTA.
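
    Chapter 3's stride CAM detects constant-stride access streams and issues prefetches for them. A minimal Python sketch of stride detection, assuming one table entry per load instruction keyed by program counter (the thesis's compiler-managed CAM differs in detail):

        class StrideTable:
            """One entry per load instruction: (last address, last observed stride)."""

            def __init__(self):
                self.table = {}

            def observe(self, pc, addr):
                """Record a load; return an address worth prefetching, or None."""
                last, stride = self.table.get(pc, (None, 0))
                self.table[pc] = (addr, addr - last if last is not None else 0)
                if last is not None and stride != 0 and addr - last == stride:
                    return addr + stride       # same stride seen twice in a row
                return None

        # Example: a load sweeping an array with stride 8.
        pf = StrideTable()
        for a in (100, 108, 116, 124):
            print(pf.observe(pc=0x40, addr=a))   # None, None, 124, 132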

    Design of Efficient TLB-based Data Classification Mechanisms in Chip Multiprocessors

    Most of the data referenced by sequential and parallel applications running on current chip multiprocessors is referenced by a single thread, i.e., private. Recent proposals leverage this observation to improve many aspects of chip multiprocessors, such as reducing coherence overhead or the access latency to distributed caches. The effectiveness of those proposals depends to a large extent on the amount of private data they detect. However, the mechanisms proposed so far either do not consider thread migration or the private use of data within different application phases, or they entail high overhead. As a result, a considerable amount of private data goes undetected. To increase the detection of private data, this thesis proposes a TLB-based mechanism that accounts for both thread migration and private application phases with low overhead. In the proposed TLB-based classification mechanisms, a page's classification is determined by the presence of its translation in other cores' TLBs. The classification schemes are analyzed in multilevel TLB hierarchies, for systems with both private and distributed shared last-level TLBs. The thesis introduces a page classification approach based on inspecting other cores' TLBs upon every TLB miss; in particular, the approach is based on the exchange and counting of tokens. Token counting on TLBs is a natural and efficient way to classify memory pages: it requires no complex and undesirable persistent requests or arbitration, since when two or more TLBs race to access a page, the tokens are distributed appropriately and the page is classified as shared. However, the ability of TLB-based mechanisms to classify private pages depends strongly on TLB size, since they rely on the presence of a page translation in the system's TLBs. To overcome this, different TLB usage predictors (UP) are proposed, which allow page classification to be unaffected by TLB size. Specifically, this thesis introduces a predictor that obtains system-wide page usage information either by employing a shared last-level TLB structure (SUP) or by having cooperative TLBs work together (CUP).
    Esteve García, A. (2017). Design of Efficient TLB-based Data Classification Mechanisms in Chip Multiprocessors [unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/86136
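
    A minimal sketch of the token-counting idea, assuming a four-core system and omitting the protocol's handling of predictor interaction and recovery (the names and the exact token-movement policy below are illustrative):

        NUM_CORES = 4   # assumed core count

        class TokenClassifier:
            """Each page owns NUM_CORES tokens, held by the TLBs caching its
            translation; a core holding all of them may treat the page as private."""

            def __init__(self):
                self.tokens = {}   # page -> {core: tokens held in that core's TLB}

            def on_tlb_fill(self, core, page):
                holders = self.tokens.setdefault(page, {})
                if not holders:
                    holders[core] = NUM_CORES     # sole cached copy: take every token
                elif core not in holders:
                    donor = max(holders, key=holders.get)
                    holders[donor] -= 1           # racing TLBs split the tokens,
                    holders[core] = 1             # so the page becomes shared
                return "private" if holders[core] == NUM_CORES else "shared"

            def on_tlb_eviction(self, core, page):
                holders = self.tokens.get(page, {})
                holders.pop(core, None)
                if len(holders) == 1:             # a lone remaining holder regains all
                    (sole,) = holders             # tokens, so its next access can
                    holders[sole] = NUM_CORES     # reclassify the page as private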