462 research outputs found

    A Survey of Techniques for Architecting TLBs

    Get PDF
    “Translation lookaside buffer” (TLB) caches virtual to physical address translation information and is used in systems ranging from embedded devices to high-end servers. Since TLB is accessed very frequently and a TLB miss is extremely costly, prudent management of TLB is important for improving performance and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects and system engineers

    Microarchitectural techniques to reduce energy consumption in the memory hierarchy

    Get PDF
    This thesis states that dynamic profiling of the memory reference stream can improve energy and performance in the memory hierarchy. The research presented in this theses provides multiple instances of using lightweight hardware structures to profile the memory reference stream. The objective of this research is to develop microarchitectural techniques to reduce energy consumption at different levels of the memory hierarchy. Several simple and implementable techniques were developed as a part of this research. One of the techniques identifies and eliminates redundant refresh operations in DRAM and reduces DRAM refresh power. Another, reduces leakage energy in L2 and higher level caches for multiprocessor systems. The emphasis of this research has been to develop several techniques of obtaining energy savings in caches using a simple hardware structure called the counting Bloom filter (CBF). CBFs have been used to predict L2 cache misses and obtain energy savings by not accessing the L2 cache on a predicted miss. A simple extension of this technique allows CBFs to do way-estimation of set associative caches to reduce energy in cache lookups. Another technique using CBFs track addresses in a Virtual Cache and reduce false synonym lookups. Finally this thesis presents a technique to reduce dynamic power consumption in level one caches using significance compression. The significant energy and performance improvements demonstrated by the techniques presented in this thesis suggest that this work will be of great value for designing memory hierarchies of future computing platforms.Ph.D.Committee Chair: Lee, Hsien-Hsin S.; Committee Member: Cahtterjee,Abhijit; Committee Member: Mukhopadhyay, Saibal; Committee Member: Pande, Santosh; Committee Member: Yalamanchili, Sudhaka

    Energy-Ecient Physically Tagged Caches for Embedded Processors with Virtual Memory

    Get PDF
    ABSTRACT In this paper we present a low-power tag organization for physically tagged caches in embedded processor

    Design of Efficient TLB-based Data Classification Mechanisms in Chip Multiprocessors

    Full text link
    Most of the data referenced by sequential and parallel applications running in current chip multiprocessors are referenced by a single thread, i.e., private. Recent proposals leverage this observation to improve many aspects of chip multiprocessors, such as reducing coherence overhead or the access latency to distributed caches. The effectiveness of those proposals depends to a large extent on the amount of detected private data. However, the mechanisms proposed so far either do not consider either thread migration or the private use of data within different application phases, or do entail high overhead. As a result, a considerable amount of private data is not detected. In order to increase the detection of private data, this thesis proposes a TLB-based mechanism that is able to account for both thread migration and private application phases with low overhead. Classification status in the proposed TLB-based classification mechanisms is determined by the presence of the page translation stored in other core's TLBs. The classification schemes are analyzed in multilevel TLB hierarchies, for systems with both private and distributed shared last-level TLBs. This thesis introduces a page classification approach based on inspecting other core's TLBs upon every TLB miss. In particular, the proposed classification approach is based on exchange and count of tokens. Token counting on TLBs is a natural and efficient way for classifying memory pages. It does not require the use of complex and undesirable persistent requests or arbitration, since when two ormore TLBs race for accessing a page, tokens are appropriately distributed classifying the page as shared. However, TLB-based ability to classify private pages is strongly dependent on TLB size, as it relies on the presence of a page translation in the system TLBs. To overcome that, different TLB usage predictors (UP) have been proposed, which allow a page classification unaffected by TLB size. Specifically, this thesis introduces a predictor that obtains system-wide page usage information by either employing a shared last-level TLB structure (SUP) or cooperative TLBs working together (CUP).La mayor parte de los datos referenciados por aplicaciones paralelas y secuenciales que se ejecutan enCMPs actuales son referenciadas por un único hilo, es decir, son privados. Recientemente, algunas propuestas aprovechan esta observación para mejorar muchos aspectos de los CMPs, como por ejemplo reducir el sobrecoste de la coherencia o la latencia de los accesos a cachés distribuidas. La efectividad de estas propuestas depende en gran medida de la cantidad de datos que son considerados privados. Sin embargo, los mecanismos propuestos hasta la fecha no consideran la migración de hilos de ejecución ni las fases de una aplicación. Por tanto, una cantidad considerable de datos privados no se detecta apropiadamente. Con el fin de aumentar la detección de datos privados, proponemos un mecanismo basado en las TLBs, capaz de reclasificar los datos a privado, y que detecta la migración de los hilos de ejecución sin añadir complejidad al sistema. Los mecanismos de clasificación en las TLBs se han analizado en estructuras de varios niveles, incluyendo TLBs privadas y con un último nivel de TLB compartido y distribuido. Esta tesis también presenta un mecanismo de clasificación de páginas basado en la inspección de las TLBs de otros núcleos tras cada fallo de TLB. De forma particular, el mecanismo propuesto se basa en el intercambio y el cuenteo de tokens (testigos). Contar tokens en las TLBs supone una forma natural y eficiente para la clasificación de páginas de memoria. Además, evita el uso de solicitudes persistentes o arbitraje alguno, ya que si dos o más TLBs compiten para acceder a una página, los tokens se distribuyen apropiadamente y la clasifican como compartida. Sin embargo, la habilidad de los mecanismos basados en TLB para clasificar páginas privadas depende del tamaño de las TLBs. La clasificación basada en las TLBs se basa en la presencia de una traducción en las TLBs del sistema. Para evitarlo, se han propuesto diversos predictores de uso en las TLBs (UP), los cuales permiten una clasificación independiente del tamaño de las TLBs. En concreto, esta tesis presenta un sistema mediante el que se obtiene información de uso de página a nivel de sistema con la ayuda de un nivel de TLB compartida (SUP) o mediante TLBs cooperando juntas (CUP).La major part de les dades referenciades per aplicacions paral·leles i seqüencials que s'executen en CMPs actuals són referenciades per un sol fil, és a dir, són privades. Recentment, algunes propostes aprofiten aquesta observació per a millorar molts aspectes dels CMPs, com és reduir el sobrecost de la coherència o la latència d'accés a memòries cau distribuïdes. L'efectivitat d'aquestes propostes depen en gran mesura de la quantitat de dades detectades com a privades. No obstant això, els mecanismes proposats fins a la data no consideren la migració de fils d'execució ni les fases d'una aplicació. Per tant, una quantitat considerable de dades privades no es detecta apropiadament. A fi d'augmentar la detecció de dades privades, aquesta tesi proposa un mecanisme basat en les TLBs, capaç de reclassificar les dades com a privades, i que detecta la migració dels fils d'execució sense afegir complexitat al sistema. Els mecanismes de classificació en les TLBs s'han analitzat en estructures de diversos nivells, incloent-hi sistemes amb TLBs d'últimnivell compartides i distribuïdes. Aquesta tesi presenta un mecanisme de classificació de pàgines basat en inspeccionar les TLBs d'altres nuclis després de cada fallada de TLB. Concretament, el mecanisme proposat es basa en l'intercanvi i el compte de tokens. Comptar tokens en les TLBs suposa una forma natural i eficient per a la classificació de pàgines de memòria. A més, evita l'ús de sol·licituds persistents o arbitratge, ja que si dues o més TLBs competeixen per a accedir a una pàgina, els tokens es distribueixen apropiadament i la classifiquen com a compartida. No obstant això, l'habilitat dels mecanismes basats en TLB per a classificar pàgines privades depenen de la grandària de les TLBs. La classificació basada en les TLBs resta en la presència d'una traducció en les TLBs del sistema. Per a evitar-ho, s'han proposat diversos predictors d'ús en les TLBs (UP), els quals permeten una classificació independent de la grandària de les TLBs. Específicament, aquesta tesi introdueix un predictor que obté informació d'ús de la pàgina a escala de sistema mitjançant un nivell de TLB compartida (SUP) or mitjançant TLBs cooperant juntes (CUP).Esteve García, A. (2017). Design of Efficient TLB-based Data Classification Mechanisms in Chip Multiprocessors [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/86136TESI

    Optimizing high locality memory references in cache coherent shared memory multi-core processors

    Get PDF
    Optimizing memory references has been a primary research area of computer systems ever since the advent of the stored program computers. The objective of this thesis research is to identify and optimize two classes of high locality data memory reference streams in cache coherent shared memory multi-processors. More specifically, this thesis classifies such memory objects into spatial and temporal false shared memory objects. The underlying hypothesis is that the policy of treating all the memory objects as being permanently shared significantly hinders the optimization of high-locality memory objects in modern cache coherent shared memory multi-processor systems: the policy forces the systems to unconditionally prepare to incur shared-memory-related overheads for every memory reference. To verify the hypothesis, this thesis explores two different schemes to minimize the shared memory abstraction overheads associated with memory reference streams of spatial and temporal false shared memory objects, respectively. The schemes implement the exception rules which enable isolating false memory objects from the shared memory domain, in a spatial and temporal manner. However, the exception rules definitely require special consideration in cache coherent shared memory multi-processors, regarding the data consistency, cache coherence, and memory consistency model. Thus, this thesis not only implements the schemes based on such consideration, but also breaks the chain of the widespread faulty assumption of prior academic work. This high-level approach ultimately aims at upgrading scalability of large scale systems, such as multi-socket cache coherent shared memory multi-processors, throughout improving performance and reducing energy/power consumption. This thesis demonstrates the efficacy and efficiency of the schemes in terms of performance improvement and energy/power reduction.Ph.D

    CROSS-LAYER CUSTOMIZATION PLATFORM FOR LOW-POWER AND REAL-TIME EMBEDDED APPLICATIONS

    Get PDF
    Modern embedded applications have become increasingly complex and diverse in their functionalities and requirements. Data processing, communication and multimedia signal processing, real-time control and various other functionalities can often need to be implemented on the same System-on-Chip(SOC) platform. The significant power constraints and real-time guarantee requirements of these applications have become significant obstacles for the traditional embedded system design methodologies. The general-purpose computing microarchitectures of these platforms are designed to achieve good performance on average, which is far from optimal for any particular application. The system must always assume worst-case scenarios, which results in significant power inefficiencies and resource under-utilization. This dissertation introduces a cross-layer application-customizable embedded platform, which dynamically exploits application information and fine-tunes system components at system software and hardware layers. This is achieved with the close cooperation and seamless integration of the compiler, the operating system, and the hardware architecture. The compiler is responsible for extracting application regularities through static and profile-based analysis. The relevant application knowledge is propagated and utilized at run-time across the system layers through the judiciously introduced reconfigurability at both OS and hardware layers. The introduced framework comprehensively covers the fundamental subsystems of memory management and multi-tasking execution control

    Effective bootstrapping of Peer-to Peer networks over Mobile Ad-hoc networks

    Get PDF
    Mobile Ad-hoc Networks (MANETs) and Peer-to-Peer (P2P) networks are vigorous, revolutionary communication technologies in the 21st century. They lead the trend of decentralization. Decentralization will ultimately win clients over client/server model, because it gives ordinary network users more control, and stimulates their active participation. It is a determinant factor in shaping the future of networking. MANETs and P2P networks are very similar in nature. Both are dynamic, distributed. Both use multi-hop broadcast or multicast as major pattern of traffic. Both set up connection by self-organizing and maintain connection by self-healing. Embodying the slogan networking without networks, both abandoned traditional client/server model and disclaimed pre-existing infrastructure. However, their status quo levels of real world application are widely divergent. P2P networks are now accountable for about 50 ~ 70% internet traffic, while MANETs are still primarily in the laboratory. The interesting and confusing phenomenon has sparked considerable research effort to transplant successful approaches from P2P networks into MANETs. While most research in the synergy of P2P networks and MANETs focuses on routing, the network bootstrapping problem remains indispensable for any such transplantation to be realized. The most pivotal problems in bootstrapping are: (1) automatic configuration of nodes addresses and IDs, (2) topology discovery and transformation in different layers and name spaces. In this dissertation research, we have found novel solutions for these problems. The contributions of this dissertation are: (1) a non-IP, flat address automatic configuration scheme, which integrates lower layer addresses and P2P IDs in application layer and makes simple cryptographical assignment possible. A related paper entitled Pastry over Ad-Hoc Networks with Automatic Flat Address Configuration was submitted to Elsevier Journal of Ad Hoc Networks in May. (2) an effective ring topology construction algorithm which builds perfect ring in P2P ID space using only simplest multi-hop unicast or multicast. Upon this ring, popular structured P2P networks like Chord, Pastry could be built with great ease. A related paper entitled Chord Bootstrapping on MANETs - All Roads lead to Rome will be ready for submission after defense of the dissertation

    Acta Cybernetica : Volume 25. Number 2.

    Get PDF

    Programming Persistent Memory

    Get PDF
    Beginning and experienced programmers will use this comprehensive guide to persistent memory programming. You will understand how persistent memory brings together several new software/hardware requirements, and offers great promise for better performance and faster application startup times—a huge leap forward in byte-addressable capacity compared with current DRAM offerings. This revolutionary new technology gives applications significant performance and capacity improvements over existing technologies. It requires a new way of thinking and developing, which makes this highly disruptive to the IT/computing industry. The full spectrum of industry sectors that will benefit from this technology include, but are not limited to, in-memory and traditional databases, AI, analytics, HPC, virtualization, and big data. Programming Persistent Memory describes the technology and why it is exciting the industry. It covers the operating system and hardware requirements as well as how to create development environments using emulated or real persistent memory hardware. The book explains fundamental concepts; provides an introduction to persistent memory programming APIs for C, C++, JavaScript, and other languages; discusses RMDA with persistent memory; reviews security features; and presents many examples. Source code and examples that you can run on your own systems are included. What You’ll Learn Understand what persistent memory is, what it does, and the value it brings to the industry Become familiar with the operating system and hardware requirements to use persistent memory Know the fundamentals of persistent memory programming: why it is different from current programming methods, and what developers need to keep in mind when programming for persistence Look at persistent memory application development by example using the Persistent Memory Development Kit (PMDK) Design and optimize data structures for persistent memory Study how real-world applications are modified to leverage persistent memory Utilize the tools available for persistent memory programming, application performance profiling, and debugging Who This Book Is For C, C++, Java, and Python developers, but will also be useful to software, cloud, and hardware architects across a broad spectrum of sectors, including cloud service providers, independent software vendors, high performance compute, artificial intelligence, data analytics, big data, etc

    Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 2: Army fault tolerant architecture design and analysis

    Get PDF
    Described here is the Army Fault Tolerant Architecture (AFTA) hardware architecture and components and the operating system. The architectural and operational theory of the AFTA Fault Tolerant Data Bus is discussed. The test and maintenance strategy developed for use in fielded AFTA installations is presented. An approach to be used in reducing the probability of AFTA failure due to common mode faults is described. Analytical models for AFTA performance, reliability, availability, life cycle cost, weight, power, and volume are developed. An approach is presented for using VHSIC Hardware Description Language (VHDL) to describe and design AFTA's developmental hardware. A plan is described for verifying and validating key AFTA concepts during the Dem/Val phase. Analytical models and partial mission requirements are used to generate AFTA configurations for the TF/TA/NOE and Ground Vehicle missions
    corecore