
    High Performance Hybrid Memory Systems with 3D-stacked DRAM

    The bandwidth of traditional DRAM is pin-limited and so does not scale well with the increasing demands of data-intensive workloads. 3D-stacked DRAM can alleviate this problem by providing substantially higher bandwidth to a processor chip. However, the capacity of 3D-stacked DRAM is not enough to replace the bulk of the memory, and it is therefore used together with off-chip DRAM in a hybrid memory system, either as a DRAM cache or as part of a flat address space with support for data migration. The performance of both designs is limited by their respective overheads. This thesis proposes new designs that improve the performance of hybrid memory systems, first by alleviating the overheads of current approaches and second by proposing a new design that combines the best attributes of DRAM caching and data migration while addressing their respective weaknesses.

    The first part of this thesis focuses on improving the performance of DRAM caches. Besides the unavoidable DRAM access to fetch the requested data, tag access is on the critical path, adding significant latency and energy costs. Existing approaches are not able to remove these overheads and in some cases limit DRAM cache design options. To alleviate the tag access overheads of DRAM caches, this thesis proposes the Decoupled Fused Cache (DFC), a DRAM cache design that fuses DRAM cache tags with the tags of the on-chip Last Level Cache (LLC) to access the DRAM cache data directly on LLC misses. Compared to current state-of-the-art DRAM caches, DFC improves system performance by 11% on average. Finally, DFC reduces DRAM cache traffic by 25% and DRAM cache energy consumption by 24.5%.

    The second part of this thesis focuses on improving the performance of data migration. Data migration has significant performance potential, but it also entails overheads which may diminish its benefits or even degrade performance. These overheads are mainly due to the high cost of swapping data between memories, which also makes selecting which data to migrate critical to performance. To address these challenges, this thesis proposes LLC-guided Data Migration (LGM). LGM uses the LLC to predict future reuse and select memory segments for migration. Furthermore, LGM reduces data migration traffic by not migrating the cache lines of memory segments which are present in the LLC. LGM outperforms current state-of-the-art data migration schemes, improving system performance by 12.1% and reducing memory system dynamic energy by 13.2%.

    DRAM caches and data migration offer different tradeoffs for the utilization of 3D-stacked DRAM, but they also share similar challenges. The third part of this thesis provides an alternative approach that combines the strengths of both DRAM caches and data migration while eliminating their weaknesses. To that end, this thesis proposes Hybrid2, a hybrid memory system design which uses only a small fraction of the 3D-stacked DRAM as a cache and thus does not deny valuable capacity to the memory system. It further leverages the DRAM cache as a staging area to select the data most suitable for migration. Finally, Hybrid2 alleviates the metadata overheads of both DRAM caches and migration using a common mechanism. Depending on the system configuration, Hybrid2 outperforms state-of-the-art migration schemes by 6.4% to 9.1% on average. Compared to DRAM caches, it gives away on average only 0.3% to 5.3% of performance while offering up to 24.6% more main memory capacity.
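    To make the tag-fusion idea behind DFC concrete, the following minimal Python sketch models an LLC whose lookup also returns the DRAM cache location of a line, so an LLC miss can proceed directly to the DRAM cache data with no separate DRAM tag probe. All structures, sizes and the FIFO eviction policy are illustrative assumptions, not the thesis's actual DFC design.

    LINE = 64  # cache line size in bytes (illustrative)

    class FusedHybridCache:
        """Toy model: DRAM cache tags are fused into the on-chip lookup."""
        def __init__(self, llc_lines=8, drc_lines=64):
            self.llc = {}                  # line addr -> data (toy LLC)
            self.llc_lines = llc_lines
            self.fused = {}                # line addr -> DRAM cache slot,
                                           # read during the LLC lookup itself
            self.drc = [None] * drc_lines  # DRAM cache data array
            self.free = list(range(drc_lines))

        def access(self, addr):
            line = addr - addr % LINE
            if line in self.llc:
                return "LLC hit"
            slot = self.fused.get(line)    # fused tags: no DRAM tag probe
            if slot is not None:
                self._fill_llc(line, self.drc[slot])
                return "DRAM cache hit: direct data access"
            data = "data@%#x" % line       # fetch from off-chip DRAM
            if self.free:                  # also install in the DRAM cache
                s = self.free.pop()
                self.fused[line], self.drc[s] = s, data
            self._fill_llc(line, data)
            return "miss: served by off-chip DRAM"

        def _fill_llc(self, line, data):
            if len(self.llc) >= self.llc_lines:
                self.llc.pop(next(iter(self.llc)))   # evict oldest (toy FIFO)
            self.llc[line] = data

    c = FusedHybridCache(llc_lines=2)
    c.access(0x0); c.access(0x40); c.access(0x80)    # 0x0 evicted from LLC
    print(c.access(0x0))    # -> DRAM cache hit: direct data access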

    High Performance Hybrid Memory Systems with 3D-stacked DRAM

    The bandwidth of traditional DRAM is pin-limited and so does not scale well with the increasing demands of data-intensive workloads, limiting performance. 3D-stacked DRAM can alleviate this problem by providing substantially higher bandwidth to a processor chip. However, the capacity of 3D-stacked DRAM is not enough to replace the bulk of the memory and therefore it is used either as a DRAM cache or as part of a flat address space with support for data migration. The performance of both designs is limited by their particular overheads. In this thesis we propose designs that improve the performance of hybrid memory systems in which 3D-stacked DRAM is used either as a cache or as part of a flat address space with data migration. DRAM caches have shown excellent potential in capturing the spatial and temporal data locality of applications, yet they are still far from their ideal performance. Besides the unavoidable DRAM access to fetch the requested data, tag access is on the critical path, adding significant latency and energy costs. Existing approaches are not able to remove these overheads and in some cases limit DRAM cache design options. To alleviate the tag access overheads of DRAM caches, this thesis proposes the Decoupled Fused Cache (DFC), a DRAM cache design that fuses DRAM cache tags with the tags of the on-chip Last Level Cache (LLC) to access the DRAM cache data directly on LLC misses. Compared to current state-of-the-art DRAM caches, DFC improves system performance by 6% on average and by 16-18% for large cacheline sizes. Finally, DFC reduces DRAM cache traffic by 18% and DRAM cache energy consumption by 7%. Data migration schemes have significant performance potential, but also entail overheads which may diminish migration benefits or even lead to performance degradation. These overheads are mainly due to the high cost of swapping data between memories, which also makes selecting which data to migrate critical to performance. To address these challenges, this thesis proposes LLC-guided Data Migration (LGM). LGM uses the LLC to predict future reuse and select memory segments for migration. Furthermore, LGM reduces data migration traffic by not migrating the cache lines of memory segments which are present in the LLC. LGM outperforms current state-of-the-art migration designs, improving system performance by 12.1% and reducing memory system dynamic energy by 13.2%.
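    A minimal sketch of the LGM selection heuristic follows, under assumed granularities and thresholds (the real design's parameters and hardware mechanisms differ): LLC residency of a segment's lines stands in as the reuse predictor, and lines already in the LLC are excluded from the migration transfer.

    SEG_LINES = 32            # cache lines per memory segment (assumed)
    MIGRATE_THRESHOLD = 8     # LLC-resident lines that signal a hot segment

    def segment_of(line):
        return line // SEG_LINES

    def pick_segments(llc_lines):
        """Use LLC contents as a cheap predictor of future reuse: segments
        with many lines currently in the LLC are candidates for migration."""
        counts = {}
        for line in llc_lines:
            seg = segment_of(line)
            counts[seg] = counts.get(seg, 0) + 1
        return sorted(seg for seg, n in counts.items() if n >= MIGRATE_THRESHOLD)

    def lines_to_transfer(segment, llc_lines):
        """Lines of the segment already present in the LLC are skipped,
        trimming the swap traffic between the two memories."""
        cached = {l for l in llc_lines if segment_of(l) == segment}
        return SEG_LINES - len(cached)

    llc = list(range(0, 10)) + [40, 41]          # ten lines of segment 0
    print(pick_segments(llc))                    # -> [0]
    print(lines_to_transfer(0, llc))             # -> 22 of 32 lines moved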

    Hybrid2: Combining Caching and Migration in Hybrid Memory Systems

    This paper considers a hybrid memory system composed of memory technologies with different characteristics; in particular a small, near memory exhibiting high bandwidth, i.e., 3D-stacked DRAM, and a larger, far memory offering capacity at lower bandwidth, i.e., off-chip DRAM. In the past, the near memory of such a system has been used either as a DRAM cache or as part of a flat address space combined with a migration mechanism. Caches and migration offer different tradeoffs (between performance, main memory capacity, data transfer costs, etc.) and share similar challenges related to data-transfer granularity and metadata management. This paper proposes Hybrid2, a new hybrid memory system architecture that combines a DRAM cache with a migration scheme. Hybrid2 does not deny valuable capacity from the memory system because it uses only a small fraction of the near memory as a DRAM cache; 64MB in our experiments. It further leverages the DRAM cache as a staging area to select the data most suitable for migration. Finally, Hybrid2 alleviates the metadata overheads of both DRAM caches and migration using a common mechanism. Using near-to-far memory ratios of 1:16, 1:8 and 1:4 in our experiments, Hybrid2 on average outperforms current state-of-the-art migration schemes by 7.9%, 9.1% and 6.4%, respectively. In the same system configurations, compared to DRAM caches, Hybrid2 gives away on average only 0.3%, 1.2%, and 5.3% of performance while offering 5.9%, 12.1%, and 24.6% more main memory capacity, respectively.
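    The staging-area idea can be sketched as follows; the promotion threshold, capacities and eviction policy are invented for illustration and are not the paper's actual mechanisms. Blocks first land in a small cache carved out of near memory, and only blocks that prove their reuse there are migrated into the flat near-memory address space.

    PROMOTE_AFTER = 4          # staging-cache hits before migration (assumed)

    class Hybrid2Sketch:
        def __init__(self, cache_blocks=4, near_blocks=16):
            self.cache = {}              # staging DRAM cache: block -> hit count
            self.cache_blocks = cache_blocks
            self.near = set()            # blocks migrated into near memory
            self.near_blocks = near_blocks

        def access(self, block):
            if block in self.near:
                return "near-memory hit (migrated)"
            if block in self.cache:
                self.cache[block] += 1
                # The small cache is a staging area: it filters out cold
                # blocks so only proven-hot data pays the migration cost.
                if self.cache[block] >= PROMOTE_AFTER and len(self.near) < self.near_blocks:
                    del self.cache[block]
                    self.near.add(block)
                    return "promoted: migrated to near memory"
                return "staging-cache hit"
            if len(self.cache) >= self.cache_blocks:
                del self.cache[min(self.cache, key=self.cache.get)]  # coldest
            self.cache[block] = 1
            return "far-memory access, staged in cache"

    h = Hybrid2Sketch()
    for _ in range(4):
        print(h.access("B0"))   # staged, two cache hits, then promoted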

    Exploiting heterogeneity in Chip-Multiprocessor Design

    In the past decade, semiconductor manufacturers have persisted in building faster and smaller transistors in order to boost processor performance as projected by Moore’s Law. Recently, as we enter the deep submicron regime, continuing the same pace of processor development has become increasingly difficult due to constraints on power, temperature, and the scalability of transistors. To overcome these challenges, researchers have proposed several innovations at both the architecture and device levels that partially solve these problems. This diversity in processor architectures and manufacturing materials provides a path to continuing Moore’s Law by effectively exploiting heterogeneity; however, it also introduces a set of unprecedented challenges that have rarely been addressed in prior work. In this dissertation, we present a series of in-depth studies to comprehensively investigate the design and optimization of future multi-core and many-core platforms through exploiting heterogeneities. First, we explore a large design space of heterogeneous chip multiprocessors by exploiting architectural- and device-level heterogeneities, aiming to identify the optimal design patterns that lead to attractive energy- and cost-efficiencies at the pre-silicon stage. After this high-level study, we pay specific attention to architectural asymmetry, developing a heterogeneity-aware task scheduler to optimize energy efficiency on a given single-ISA heterogeneous multiprocessor. An advanced statistical tool is employed to facilitate the algorithm development. In the third study, we shift our focus to device-level heterogeneity and propose to effectively leverage the advantages provided by different materials to solve the increasingly important reliability issues of future processors.
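    As a rough illustration of the heterogeneity-aware scheduling problem studied here, the sketch below greedily maps each task to the core type minimizing its energy under a toy power and speedup model. The core parameters, speedup factors and the greedy policy are invented; the dissertation's actual scheduler is built with statistical modeling.

    CORE_POWER = {"big": 4.0, "little": 1.0}     # hypothetical core types

    def energy(task, core):
        # energy = power x time; the little core runs at speed 1.0, while
        # the big core's benefit depends on the task (compute-bound tasks
        # speed up far more than memory-bound ones).
        speed = task["big_speedup"] if core == "big" else 1.0
        return CORE_POWER[core] * task["work"] / speed

    def schedule(tasks):
        """Greedy heterogeneity-aware mapping: per task, pick the core
        type with the lowest estimated energy."""
        return {name: min(CORE_POWER, key=lambda c: energy(t, c))
                for name, t in tasks.items()}

    tasks = {
        "compute_kernel": {"work": 10.0, "big_speedup": 5.0},
        "pointer_chase":  {"work": 10.0, "big_speedup": 1.2},
    }
    print(schedule(tasks))   # {'compute_kernel': 'big', 'pointer_chase': 'little'}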

    TD-NUCA: runtime driven management of NUCA caches in task dataflow programming models

    In high performance processors, the design of on-chip memory hierarchies is crucial for performance and energy efficiency. Current processors rely on large shared Non-Uniform Cache Architectures (NUCA) to improve performance and reduce data movement. Multiple solutions exploit information available at the microarchitecture level or in the operating system to optimize NUCA performance. However, existing methods have not taken advantage of the information captured by task dataflow programming models to guide the management of NUCA caches. In this paper we propose TD-NUCA, a hardware/software co-designed approach that leverages information present in the runtime system of task dataflow programming models to efficiently manage NUCA caches. TD-NUCA identifies the data access and reuse patterns of parallel applications in the runtime system and guides the operation of the NUCA caches in the hardware. As a result, TD-NUCA achieves a 1.18x average speedup over the baseline S-NUCA while requiring only 0.62x the data movement. This work has been supported by the Spanish Ministry of Science and Technology (contract PID2019-107255GB-C21) and the Generalitat de Catalunya (contract 2017-SGR-1414). M. Casas has been partially supported by the Grant RYC-2017-23269 funded by MCIN/AEI/10.13039/501100011033 and ESF ‘Investing in your future’. M. Moreto has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship No. RYC-2016-21104.
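    The runtime-to-hardware guidance idea can be sketched as below: a task-dataflow runtime knows each task's inputs and outputs before the task runs, so it can steer where that data is homed in the NUCA. The bank topology, proximity rule and interface names are all illustrative assumptions, not TD-NUCA's actual design.

    class NUCAHints:
        """Toy model of runtime-driven NUCA guidance."""
        def __init__(self, n_banks):
            self.n_banks = n_banks
            self.preferred = {}            # data region -> preferred bank

        def on_task_scheduled(self, task, core):
            # The runtime sees the task's dependences ahead of execution,
            # so it can home that data in the bank closest to the core
            # that will run the task.
            bank = core % self.n_banks     # toy core-to-bank proximity
            for region in task["inputs"] + task["outputs"]:
                self.preferred[region] = bank

        def home_bank(self, region):
            # Consulted by the (modeled) cache hardware on a fill.
            return self.preferred.get(region, hash(region) % self.n_banks)

    hints = NUCAHints(n_banks=4)
    hints.on_task_scheduled({"inputs": ["A"], "outputs": ["B"]}, core=2)
    print(hints.home_bank("A"), hints.home_bank("B"))   # both homed at bank 2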

    ecoHMEM: Improving object placement methodology for hybrid memory systems in HPC

    Recent byte-addressable persistent memory (PMEM) technology offers capacities comparable to storage devices and access times much closer to DRAM than other non-volatile memory technologies. To palliate the large gap with DRAM performance, DRAM and PMEM are usually combined. Users have the choice to either manage the placement of data in the different memory spaces in software or leverage the DRAM as a cache for the virtual address space of the PMEM. We present a novel methodology for automatic object-level placement, including efficient runtime object matching and bandwidth-aware placement. Our experiments leveraging Intel® Optane™ Persistent Memory show performance ranging from matching to greatly exceeding state-of-the-art software and hardware solutions, attaining over 2x runtime improvement in mini-applications and over 6% in OpenFOAM, a complex production application. This paper received funding from the Intel-BSC Exascale Laboratory SoW 5.1, the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement No. 749516, the EPEEC project from the European Union’s Horizon 2020 research and innovation program under grant agreement No. 801051, the DEEP-SEA project from the European Commission’s EuroHPC program under grant agreement 955606, and the Ministerio de Ciencia e Innovacion, Agencia Estatal de Investigación (PID2019-107255GB-C21/AEI/10.13039/501100011033).
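    A bandwidth-aware, profile-driven object placement can be sketched as a greedy knapsack over profiled objects; the profile numbers, ranking metric and policy below are illustrative assumptions, not the paper's actual methodology or tooling.

    def place_objects(profile, dram_bytes):
        """profile: {object_name: (size_bytes, accesses)}.
        Rank objects by access density (accesses per byte) and fill the
        scarce DRAM with the densest ones; everything else goes to PMEM."""
        ranked = sorted(profile.items(),
                        key=lambda kv: kv[1][1] / kv[1][0], reverse=True)
        placement, free = {}, dram_bytes
        for name, (size, _) in ranked:
            if size <= free:
                placement[name], free = "DRAM", free - size
            else:
                placement[name] = "PMEM"
        return placement

    profile = {
        "hash_index":  (64 << 20,  9_000_000),   # hot and small -> DRAM
        "record_heap": (8 << 30,   2_000_000),   # huge and cooler -> PMEM
        "scratch":     (128 << 20, 3_000_000),
    }
    print(place_objects(profile, dram_bytes=1 << 30))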

    Master of Science

    Efficient movement of massive amounts of data over high-speed networks at high throughput is essential for a modern-day in-memory storage system. In response to the growing needs of throughput and latency demands at scale, a new class of database systems has been developed in recent years. The development of these systems was guided by increased access to high-throughput, low-latency network fabrics and the declining cost of Dynamic Random Access Memory (DRAM). These systems were designed with On-Line Transactional Processing (OLTP) workloads in mind and, as a result, are optimized for fast dispatch and perform well under small request-response scenarios. However, massive server responses, such as those for range queries and data migration for load balancing, pose challenges for this design. This thesis analyzes the effects of large transfers on scale-out systems through the lens of a modern Network Interface Card (NIC). The present-day NIC offers new and exciting opportunities and challenges for large transfers, but using them efficiently requires smart data layout and concurrency control. We evaluated the impact of modern NICs on data layout design by measuring transmit performance and full-system impact, observing the effects of Direct Memory Access (DMA), Remote Direct Memory Access (RDMA), and caching improvements such as Intel® Data Direct I/O (DDIO). We discovered that the use of techniques such as zero copy yields around 25% savings in CPU cycles and a 50% reduction in memory bandwidth utilization on a server, by using a client-assisted design with records that are not updated in place. We also set up experiments that underlined the bottlenecks in the current approach to data migration in RAMCloud and propose guidelines for a fast and efficient migration protocol for RAMCloud.
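    The zero-copy, client-assisted transmit idea can be sketched with POSIX scatter-gather I/O, where os.writev stands in for a NIC's DMA gather list; the record layout and helper names are assumptions, and partial writes are ignored for brevity.

    import os, socket

    def send_copy(sock, header, records):
        """Copying design: assemble one staging buffer (extra CPU cycles
        and memory bandwidth spent on the per-record copies)."""
        staging = bytearray(header)
        for r in records:
            staging += r
        sock.sendall(bytes(staging))

    def send_gather(sock, header, records):
        """Zero-copy-style design: transmit records from where they live.
        As in a client-assisted scheme, records must not be updated in
        place while a transfer that references them is in flight."""
        os.writev(sock.fileno(), [header] + records)

    a, b = socket.socketpair()
    send_gather(a, b"HDR:", [b"rec1", b"rec2"])
    print(b.recv(64))   # b'HDR:rec1rec2'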

    Learning to Rank Graph-based Application Objects on Heterogeneous Memories

    Persistent Memory (PMEM), also known as Non-Volatile Memory (NVM), can deliver higher density and lower cost per bit when compared with DRAM. Its main drawback is that it is typically slower than DRAM. On the other hand, DRAM has scalability problems due to its cost and energy consumption. PMEM will likely soon coexist with DRAM in computer systems, but the biggest challenge is knowing which data to allocate to each type of memory. This paper describes a methodology for identifying and characterizing application objects that have the most influence on the application's performance using Intel Optane DC Persistent Memory. In the first part of our work, we built a tool that automates the profiling and analysis of application objects. In the second part, we built a machine learning model to predict the most critical objects within large-scale graph-based applications. Our results show that using isolated features does not bring the same benefit as using a carefully chosen set of features. By performing data placement using our predictive model, we can reduce execution time degradation by 12% (average) and 30% (max) when compared to a baseline approach based on an LLC-misses indicator.
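    The ranking step can be illustrated with a linear scoring sketch over a small set of normalized features; the feature names, weights and data are invented, and the paper's actual model and feature set differ.

    WEIGHTS = {"llc_misses": 0.5, "bandwidth": 0.3, "lifetime": 0.2}

    def score(features):
        """Combine several normalized features; a single feature such as
        LLC misses alone is a weaker signal than the combination."""
        return sum(WEIGHTS[k] * features[k] for k in WEIGHTS)

    def rank_objects(objects):
        """Highest score first: best candidates for the fast memory (DRAM)."""
        return sorted(objects, key=lambda o: score(o["features"]), reverse=True)

    objects = [
        {"name": "edge_array",
         "features": {"llc_misses": 0.9, "bandwidth": 0.8, "lifetime": 0.7}},
        {"name": "visit_flags",
         "features": {"llc_misses": 0.9, "bandwidth": 0.1, "lifetime": 0.2}},
    ]
    for o in rank_objects(objects):
        print(o["name"], round(score(o["features"]), 2))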

    Silsesquioxane polymer as a potential scaffold for laryngeal reconstruction

    Cancer, disease and trauma to the larynx, and their treatment, can lead to permanent loss of structures critical to voice, breathing and swallowing. Engineered partial or total laryngeal replacements would need to match the ambitious specifications of replicating functionality, outer biocompatibility, and permissiveness for an inner mucosal lining. Here we present porous polyhedral oligomeric silsesquioxane-poly(carbonate urea) urethane (POSS-PCUU) as a potential scaffold for engineering laryngeal tissue. Specifically, we employ a precipitation and porogen-leaching technique for manufacturing the polymer. The polymer is chemically consistent across all sample types and produces a foam-like scaffold with two distinct topographies and an internal structure composed of nano- and micro-pores. Whilst the highly porous internal structure of the scaffold contributes to the complex tensile behaviour of the polymer, the surface of the scaffold remains largely non-porous. The low number of surface pores minimises access for cells, although primary fibroblasts and epithelial cells do attach and proliferate on the polymer surface. Our data show that, with a change in manufacturing protocol to produce porous polymer surfaces, POSS-PCUU may be a potential candidate for overcoming some of the limitations associated with laryngeal reconstruction and regeneration.