32 research outputs found

    Utopia: Fast and Efficient Address Translation via Hybrid Restrictive & Flexible Virtual-to-Physical Address Mappings

    Full text link
    Conventional virtual memory (VM) frameworks enable a virtual address to flexibly map to any physical address. This flexibility necessitates large data structures to store virtual-to-physical mappings, which leads to high address translation latency and large translation-induced interference in the memory hierarchy. On the other hand, restricting the address mapping so that a virtual address can only map to a specific set of physical addresses can significantly reduce address translation overheads by using compact and efficient translation structures. However, restricting the address mapping flexibility across the entire main memory severely limits data sharing across different processes and increases data accesses to the swap space of the storage device, even in the presence of free memory. We propose Utopia, a new hybrid virtual-to-physical address mapping scheme that allows both flexible and restrictive hash-based address mapping schemes to harmoniously co-exist in the system. The key idea of Utopia is to manage physical memory using two types of physical memory segments: restrictive and flexible segments. A restrictive segment uses a restrictive, hash-based address mapping scheme that maps virtual addresses to only a specific set of physical addresses and enables faster address translation using compact translation structures. A flexible segment employs the conventional fully-flexible address mapping scheme. By mapping data to a restrictive segment, Utopia enables faster address translation with lower translation-induced interference. Utopia improves performance by 24% in a single-core system over the baseline system, whereas the best prior state-of-the-art contiguity-aware translation scheme improves performance by 13%.Comment: To appear in 56th IEEE/ACM International Symposium on Microarchitecture (MICRO), 202

    Bit-Flip Aware Data Structures for Phase Change Memory

    Get PDF
    Big, non-volatile, byte-addressable, low-cost, and fast non-volatile memories like Phase Change Memory are appearing in the marketplace. They have the capability to unify both memory and storage and allow us to rethink the present memory hierarchy. An important draw-back to Phase Change Memory is limited write-endurance. In addition, Phase Change Memory shares with other Non-Volatile Random Access Memories an asym- metry in the energy costs of writes and reads. Best use of Non-Volatile Random Access Memories limits the number of times a Non-Volatile Random Access Memory cell changes contents, called a bit-flip. While the future of main memory is still unknown, we should already start to create data structures for them in order to shape the future era. This thesis investigates the creation of bit-flip aware data structures.The thesis first considers general ways in which a data structure can save bit- flips by smart overwrites and by using the exclusive-or of pointers. It then shows how a simple content dependent encoding can reduce bit-flips for web corpora. It then shows how to build hash based dictionary structures for Linear Hashing and Spiral Storage. Finally, the thesis presents Gray counters, close to bit-flip optimal counters that even enable age- based wear leveling with counters managed by the Non-Volatile Random Access Memories themselves instead of by the Operating Systems

    TACKLING PERFORMANCE AND SECURITY ISSUES FOR CLOUD STORAGE SYSTEMS

    Get PDF
    Building data-intensive applications and emerging computing paradigm (e.g., Machine Learning (ML), Artificial Intelligence (AI), Internet of Things (IoT) in cloud computing environments is becoming a norm, given the many advantages in scalability, reliability, security and performance. However, under rapid changes in applications, system middleware and underlying storage device, service providers are facing new challenges to deliver performance and security isolation in the context of shared resources among multiple tenants. The gap between the decades-old storage abstraction and modern storage device keeps widening, calling for software/hardware co-designs to approach more effective performance and security protocols. This dissertation rethinks the storage subsystem from device-level to system-level and proposes new designs at different levels to tackle performance and security issues for cloud storage systems. In the first part, we present an event-based SSD (Solid State Drive) simulator that models modern protocols, firmware and storage backend in detail. The proposed simulator can capture the nuances of SSD internal states under various I/O workloads, which help researchers understand the impact of various SSD designs and workload characteristics on end-to-end performance. In the second part, we study the security challenges of shared in-storage computing infrastructures. Many cloud providers offer isolation at multiple levels to secure data and instance, however, security measures in emerging in-storage computing infrastructures are not studied. We first investigate the attacks that could be conducted by offloaded in-storage programs in a multi-tenancy cloud environment. To defend against these attacks, we build a lightweight Trusted Execution Environment, IceClave to enable security isolation between in-storage programs and internal flash management functions. We show that while enforcing security isolation in the SSD controller with minimal hardware cost, IceClave still keeps the performance benefit of in-storage computing by delivering up to 2.4x better performance than the conventional host-based trusted computing approach. In the third part, we investigate the performance interference problem caused by other tenants' I/O flows. We demonstrate that I/O resource sharing can often lead to performance degradation and instability. The block device abstraction fails to expose SSD parallelism and pass application requirements. To this end, we propose a software/hardware co-design to enforce performance isolation by bridging the semantic gap. Our design can significantly improve QoS (Quality of Service) by reducing throughput penalties and tail latency spikes. Lastly, we explore more effective I/O control to address contention in the storage software stack. We illustrate that the state-of-the-art resource control mechanism, Linux cgroups is insufficient for controlling I/O resources. Inappropriate cgroup configurations may even hurt the performance of co-located workloads under memory intensive scenarios. We add kernel support for limiting page cache usage per cgroup and achieving I/O proportionality

    Evaluating TLB (Translation Lookaside Buffer) Performance Overhead for NVM (non-volatile Memory) Hybrid System

    Get PDF
    As the non-volatile memory (NVM) technology offers near-DRAM performance and near-disk capacity, NVM has emerged as a new storage class. Conventional file systems, designed for hard disk drives or solid-state drives, need to be re-examined or even re-designed for NVM storage. For example, new file systems such as NOVA, HMFS, HMVFS and Ext4-DAX, have been developed and implemented to fully leverage NVM’s characteristics, such as fast fine-grained access. This thesis research uses a variety of I/O workloads to evaluate the performance overhead of the TLB (translation lookaside buffer) in various file systems on emulated NVM storage systems, in which NVM resides on the memory bus. As NVM’s capacity becomes much greater than DRAM and applications’ footprints continue to increase rapidly, the number of TLB entries scales up with the same pace, leading to a significant amount of TLB misses. The goal of this research is to gain insights into file system optimizations on storage-class memory. Experimental results show that NVM based file systems can have 50% more TLB overhead compare to with conventional file systems, under the same file operations. Profiling based on performance counters show that TLB-friendly journaling/logging should be taken into consideration into future file system design

    Methods for Robust and Energy-Efficient Microprocessor Architectures

    Get PDF
    Σήμερα, η εξέλιξη της τεχνολογίας επιτρέπει τη βελτίωση τριών βασικών στοιχείων της σχεδίασης των επεξεργαστών: αυξημένες επιδόσεις, χαμηλότερη κατανάλωση ισχύος και χαμηλότερο κόστος παραγωγής του τσιπ, ενώ οι σχεδιαστές επεξεργαστών έχουν επικεντρωθεί στην παραγωγή επεξεργαστών με περισσότερες λειτουργίες σε χαμηλότερο κόστος. Οι σημερινοί επεξεργαστές είναι πολύ ταχύτεροι και διαθέτουν εξελιγμένες λειτουργικές μονάδες συγκριτικά με τους προκατόχους τους, ωστόσο, καταναλώνουν αρκετά μεγάλη ενέργεια. Τα ποσά ηλεκτρικής ισχύος που καταναλώνονται, και η επακόλουθη έκλυση θερμότητας, αυξάνονται παρά τη μείωση του μεγέθους των τρανζίστορ. Αναπτύσσοντας όλο και πιο εξελιγμένους μηχανισμούς και λειτουργικές μονάδες για την αύξηση της απόδοσης και βελτίωση της ενέργειας, σε συνδυασμό με τη μείωση του μεγέθους των τρανζίστορ, οι επεξεργαστές έχουν γίνει εξαιρετικά πολύπλοκα συστήματα, καθιστώντας τη διαδικασία της επικύρωσής τους σημαντική πρόκληση για τη βιομηχανία ολοκληρωμένων κυκλωμάτων. Συνεπώς, οι κατασκευαστές επεξεργαστών αφιερώνουν επιπλέον χρόνο, προϋπολογισμό και χώρο στο τσιπ για να διασφαλίσουν ότι οι επεξεργαστές θα λειτουργούν σωστά κατά τη διάθεσή τους στη αγορά. Για τους λόγους αυτούς, η εργασία αυτή παρουσιάζει νέες μεθόδους για την επιτάχυνση και τη βελτίωση της φάσης της επικύρωσης, καθώς και για τη βελτίωση της ενεργειακής απόδοσης των σύγχρονων επεξεργαστών. Στο πρώτο μέρος της διατριβής προτείνονται δύο διαφορετικές μέθοδοι για την επικύρωση του επεξεργαστή, οι οποίες συμβάλλουν στην επιτάχυνση αυτής της διαδικασίας και στην αποκάλυψη σπάνιων σφαλμάτων στους μηχανισμούς μετάφρασης διευθύνσεων των σύγχρονων επεξεργαστών. Και οι δύο μέθοδοι καθιστούν ευκολότερη την ανίχνευση και τη διάγνωση σφαλμάτων, και επιταχύνουν την ανίχνευση του σφάλματος κατά τη φάση της επικύρωσης. Στο δεύτερο μέρος της διατριβής παρουσιάζεται μια λεπτομερής μελέτη χαρακτηρισμού των περιθωρίων τάσης σε επίπεδο συστήματος σε δύο σύγχρονους ARMv8 επεξεργαστές. Η μελέτη του χαρακτηρισμού προσδιορίζει τα αυξημένα περιθώρια τάσης που έχουν προκαθοριστεί κατά τη διάρκεια κατασκευής του κάθε μεμονωμένου πυρήνα του επεξεργαστή και αναλύει τυχόν απρόβλεπτες συμπεριφορές που μπορεί να προκύψουν σε συνθήκες μειωμένης τάσης. Για την μελέτη και καταγραφή της συμπεριφοράς του συστήματος υπό συνθήκες μειωμένης τάσης, παρουσιάζεται επίσης σε αυτή τη διατριβή μια απλή και ενοποιημένη συνάρτηση: η συνάρτηση πυκνότητας-σοβαρότητας. Στη συνέχεια, παρουσιάζεται αναλυτικά η ανάπτυξη ειδικά σχεδιασμένων προγραμμάτων (micro-viruses) τα οποία υποβάλουν της θεμελιώδεις δομές του επεξεργαστή σε μεγάλο φορτίο εργασίας. Αυτά τα προγράμματα στοχεύουν στην γρήγορη αναγνώριση των ασφαλών περιθωρίων τάσης. Τέλος, πραγματοποιείται ο χαρακτηρισμός των περιθωρίων τάσης σε εκτελέσεις πολλαπλών πυρήνων, καθώς επίσης και σε διαφορετικές συχνότητες, και προτείνεται ένα πρόγραμμα το οποίο εκμεταλλεύεται όλες τις διαφορετικές πτυχές του προβλήματος της κατανάλωσης ενέργειας και παρέχει μεγάλη εξοικονόμηση ενέργειας διατηρώντας παράλληλα υψηλά επίπεδα απόδοσης. Αυτή η μελέτη έχει ως στόχο τον εντοπισμό και την ανάλυση της σχέσης μεταξύ ενέργειας και απόδοσης σε διαφορετικούς συνδυασμούς τάσης και συχνότητας, καθώς και σε διαφορετικό αριθμό νημάτων/διεργασιών που εκτελούνται στο σύστημα, αλλά και κατανομής των προγραμμάτων στους διαθέσιμους πυρήνες.Technology scaling has enabled improvements in the three major design optimization objectives: increased performance, lower power consumption, and lower die cost, while system design has focused on bringing more functionality into products at lower cost. While today's microprocessors, are much faster and much more versatile than their predecessors, they also consume much power. As operating frequency and integration density increase, the total chip power dissipation increases. This is evident from the fact that due to the demand for increased functionality on a single chip, more and more transistors are being packed on a single die and hence, the switching frequency increases in every technology generation. However, by developing aggressive and sophisticated mechanisms to boost performance and to enhance the energy efficiency in conjunction with the decrease of the size of transistors, microprocessors have become extremely complex systems, making the microprocessor verification and manufacturing testing a major challenge for the semiconductor industry. Manufacturers, therefore, choose to spend extra effort, time, budget and chip area to ensure that the delivered products are operating correctly. To meet high-dependability requirements, manufacturers apply a sequence of verification tasks throughout the entire life-cycle of the microprocessor to ensure the correct functionality of the microprocessor chips from the various types of errors that may occur after the products are released to the market. To this end, this work presents novel methods for ensuring the correctness of the microprocessor during the post-silicon validation phase and for improving the energy efficiency requirements of modern microprocessors. These methods can be applied during the prototyping phase of the microprocessors or after their release to the market. More specifically, in the first part of the thesis, we present and describe two different ISA-independent software-based post-silicon validation methods, which contribute to formalization and modeling as well as the acceleration of the post-silicon validation process and expose difficult-to-find bugs in the address translation mechanisms (ATM) of modern microprocessors. Both methods improve the detection and diagnosis of a hardware design bug in the ATM structures and significantly accelerate the bug detection during the post-silicon validation phase. In the second part of the thesis we present a detailed system-level voltage scaling characterization study for two state-of-the-art ARMv8-based multicore CPUs. We present an extensive characterization study which identifies the pessimistic voltage guardbands (the increased voltage margins set by the manufacturer) of each individual microprocessor core and analyze any abnormal behavior that may occur in off-nominal voltage conditions. Towards the formalization of the any abnormal behavior we also present a simple consolidated function; the Severity function, which aggregates the effects of reduced voltage operation. We then introduce the development of dedicated programs (diagnostic micro-viruses) that aim to accelerate the time-consuming voltage margins characterization studies by stressing the fundamental hardware components. Finally, we present a comprehensive exploration of how two server-grade systems behave in different frequency and core allocation configurations beyond nominal voltage operation in multicore executions. This analysis aims (1) to identify the best performance per watt operation points, (2) to reveal how and why the different core allocation options affect the energy consumption, and (3) to enhance the default Linux scheduler to take task allocation decisions for balanced performance and energy efficiency

    Near-Memory Address Translation

    Full text link
    Memory and logic integration on the same chip is becoming increasingly cost effective, creating the opportunity to offload data-intensive functionality to processing units placed inside memory chips. The introduction of memory-side processing units (MPUs) into conventional systems faces virtual memory as the first big showstopper: without efficient hardware support for address translation MPUs have highly limited applicability. Unfortunately, conventional translation mechanisms fall short of providing fast translations as contemporary memories exceed the reach of TLBs, making expensive page walks common. In this paper, we are the first to show that the historically important flexibility to map any virtual page to any page frame is unnecessary in today's servers. We find that while limiting the associativity of the virtual-to-physical mapping incurs no penalty, it can break the translate-then-fetch serialization if combined with careful data placement in the MPU's memory, allowing for translation and data fetch to proceed independently and in parallel. We propose the Distributed Inverted Page Table (DIPTA), a near-memory structure in which the smallest memory partition keeps the translation information for its data share, ensuring that the translation completes together with the data fetch. DIPTA completely eliminates the performance overhead of translation, achieving speedups of up to 3.81x and 2.13x over conventional translation using 4KB and 1GB pages respectively.Comment: 15 pages, 9 figure

    Paving the Path for Heterogeneous Memory Adoption in Production Systems

    Full text link
    Systems from smartphones to data-centers to supercomputers are increasingly heterogeneous, comprising various memory technologies and core types. Heterogeneous memory systems provide an opportunity to suitably match varying memory access pat- terns in applications, reducing CPU time thus increasing performance per dollar resulting in aggregate savings of millions of dollars in large-scale systems. However, with increased provisioning of main memory capacity per machine and differences in memory characteristics (for example, bandwidth, latency, cost, and density), memory management in such heterogeneous memory systems poses multi-fold challenges on system programmability and design. In this thesis, we tackle memory management of two heterogeneous memory systems: (a) CPU-GPU systems with a unified virtual address space, and (b) Cloud computing platforms that can deploy cheaper but slower memory technologies alongside DRAMs to reduce cost of memory in data-centers. First, we show that operating systems do not have sufficient information to optimally manage pages in bandwidth-asymmetric systems and thus fail to maximize bandwidth to massively-threaded GPU applications sacrificing GPU throughput. We present BW-AWARE placement/migration policies to support OS to make optimal data management decisions. Second, we present a CPU-GPU cache coherence design where CPU and GPU need not implement same cache coherence protocol but provide cache-coherent memory interface to the programmer. Our proposal is first practical approach to provide a unified, coherent CPU–GPU address space without requiring hardware cache coherence, with a potential to enable an explosion in algorithms that leverage tightly coupled CPU–GPU coordination. Finally, to reduce the cost of memory in cloud platforms where the trend has been to map datasets in memory, we make a case for a two-tiered memory system where cheaper (per bit) memories, such as Intel/Microns 3D XPoint, will be deployed alongside DRAM. We present Thermostat, an application-transparent huge-page-aware software mechanism to place pages in a dual-technology hybrid memory system while achieving both the cost advantages of two-tiered memory and performance advantages of transparent huge pages. With Thermostat’s capability to control the application slowdown on a per application basis, cloud providers can realize cost savings from upcoming cheaper memory technologies by shifting infrequently accessed cold data to slow memory, while satisfying throughput demand of the customers.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/137052/1/nehaag_1.pd

    A Survey of Techniques for Architecting TLBs

    Get PDF
    “Translation lookaside buffer” (TLB) caches virtual to physical address translation information and is used in systems ranging from embedded devices to high-end servers. Since TLB is accessed very frequently and a TLB miss is extremely costly, prudent management of TLB is important for improving performance and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects and system engineers
    corecore