15 research outputs found

    High Performance Hybrid Memory Systems with 3D-stacked DRAM

    Get PDF
    The bandwidth of traditional DRAM is pin-limited and so does not scale well with the increasing demand of data-intensive workloads. 3D-stacked DRAM can alleviate this problem by providing substantially higher bandwidth to a processor chip. However, the capacity of 3D-stacked DRAM is not enough to replace the bulk of the memory, and therefore it is used together with off-chip DRAM in a hybrid memory system, either as a DRAM cache or as part of a flat address space with support for data migration. The performance of both of these designs is limited by their particular overheads. This thesis proposes new designs that improve the performance of hybrid memory systems, first by alleviating the overheads of current approaches and second by proposing a new design that combines the best attributes of DRAM caching and data migration while addressing their respective weaknesses.
    The first part of this thesis focuses on improving the performance of DRAM caches. Besides the unavoidable DRAM access to fetch the requested data, tag access lies on the critical path, adding significant latency and energy costs. Existing approaches are not able to remove these overheads and in some cases limit DRAM cache design options. To alleviate the tag access overheads of DRAM caches, this thesis proposes Decoupled Fused Cache (DFC), a DRAM cache design that fuses DRAM cache tags with the tags of the on-chip Last Level Cache (LLC) to access the DRAM cache data directly on LLC misses. Compared to current state-of-the-art DRAM caches, DFC improves system performance by 11% on average. Finally, DFC reduces DRAM cache traffic by 25% and DRAM cache energy consumption by 24.5%.
    The second part of this thesis focuses on improving the performance of data migration. Data migration has significant performance potential, but also entails overheads which may diminish its benefits or even degrade performance. These overheads are mainly due to the high cost of swapping data between memories, which also makes selecting which data to migrate critical to performance. To address these challenges, this thesis proposes LLC-guided Data Migration (LGM). LGM uses the LLC to predict future reuse and select memory segments for migration. Furthermore, LGM reduces the data migration traffic overheads by not migrating the cache lines of memory segments which are present in the LLC. LGM outperforms current state-of-the-art data migration, improving system performance by 12.1% and reducing memory system dynamic energy by 13.2%.
    DRAM caches and data migration offer different tradeoffs for the utilization of 3D-stacked DRAM, but also share some similar challenges. The third part of this thesis aims to provide an alternative approach to the utilization of 3D-stacked DRAM, combining the strengths of both DRAM caches and data migration while eliminating their weaknesses. To that end, this thesis proposes Hybrid2, a hybrid memory system design which uses only a small fraction of the 3D-stacked DRAM as a cache and thus does not deny valuable capacity from the memory system. It further leverages the DRAM cache as a staging area to select the data most suitable for migration. Finally, Hybrid2 alleviates the metadata overheads of both DRAM caches and migration using a common mechanism. Depending on the system configuration, Hybrid2 outperforms state-of-the-art migration schemes by 6.4% to 9.1% on average; compared to DRAM caches, it gives away only 0.3% to 5.3% of performance while offering up to 24.6% more main memory capacity.

    High Performance Hybrid Memory Systems with 3D-stacked DRAM

    Get PDF
    The bandwidth of traditional DRAM is pin-limited and so does not scale well with the increasing demand of data-intensive workloads, limiting performance. 3D-stacked DRAM can alleviate this problem by providing substantially higher bandwidth to a processor chip. However, the capacity of 3D-stacked DRAM is not enough to replace the bulk of the memory, and therefore it is used either as a DRAM cache or as part of a flat address space with support for data migration. The performance of both of these designs is limited by their particular overheads. In this thesis we propose designs that improve the performance of hybrid memory systems in which 3D-stacked DRAM is used either as a cache or as part of a flat address space with data migration. DRAM caches have shown excellent potential in capturing the spatial and temporal data locality of applications; however, they are still far from their ideal performance. Besides the unavoidable DRAM access to fetch the requested data, tag access lies on the critical path, adding significant latency and energy costs. Existing approaches are not able to remove these overheads and in some cases limit DRAM cache design options. To alleviate the tag access overheads of DRAM caches, this thesis proposes Decoupled Fused Cache (DFC), a DRAM cache design that fuses DRAM cache tags with the tags of the on-chip Last Level Cache (LLC) to access the DRAM cache data directly on LLC misses. Compared to current state-of-the-art DRAM caches, DFC improves system performance by 6% on average and by 16-18% for large cacheline sizes. Finally, DFC reduces DRAM cache traffic by 18% and DRAM cache energy consumption by 7%. Data migration schemes have significant performance potential, but also entail overheads which may diminish migration benefits or even lead to performance degradation. These overheads are mainly due to the high cost of swapping data between memories, which also makes selecting which data to migrate critical to performance. To address these challenges of data migration, this thesis proposes LLC-guided Data Migration (LGM). LGM uses the LLC to predict future reuse and select memory segments for migration. Furthermore, LGM reduces the data migration traffic overheads by not migrating the cache lines of memory segments which are present in the LLC. LGM outperforms current state-of-the-art migration designs, improving system performance by 12.1% and reducing memory system dynamic energy by 13.2%.

    Hybrid2: Combining Caching and Migration in Hybrid Memory Systems

    Get PDF
    This paper considers a hybrid memory system composed of memory technologies with different characteristics; in particular, a small near memory exhibiting high bandwidth, i.e., 3D-stacked DRAM, and a larger far memory offering capacity at lower bandwidth, i.e., off-chip DRAM. In the past, the near memory of such a system has been used either as a DRAM cache or as part of a flat address space combined with a migration mechanism. Caches and migration offer different tradeoffs (between performance, main memory capacity, data transfer costs, etc.) and share similar challenges related to data-transfer granularity and metadata management. This paper proposes Hybrid2, a new hybrid memory system architecture that combines a DRAM cache with a migration scheme. Hybrid2 does not deny valuable capacity from the memory system because it uses only a small fraction of the near memory as a DRAM cache; 64MB in our experiments. It further leverages the DRAM cache as a staging area to select the data most suitable for migration. Finally, Hybrid2 alleviates the metadata overheads of both DRAM caches and migration using a common mechanism. Using near-to-far memory ratios of 1:16, 1:8 and 1:4 in our experiments, Hybrid2 on average outperforms current state-of-the-art migration schemes by 7.9%, 9.1% and 6.4%, respectively. In the same system configurations, compared to DRAM caches, Hybrid2 gives away on average only 0.3%, 1.2%, and 5.3% of performance, offering 5.9%, 12.1%, and 24.6% more main memory capacity, respectively.
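    The staging-area idea above can be sketched as a small cache that counts accesses per block and hands proven-hot blocks to the migration engine. This is a minimal illustration, not the paper's mechanism: the class name, capacity, threshold, and coldest-block eviction policy are all illustrative assumptions.

```python
class StagingCache:
    """Toy model of a DRAM cache used as a migration staging area:
    count accesses per block; blocks that prove hot become migration
    candidates. Capacity and threshold are illustrative assumptions."""

    def __init__(self, capacity=16, promote_thresh=4):
        self.capacity = capacity
        self.promote_thresh = promote_thresh
        self.hits = {}  # block id -> access count

    def access(self, block):
        # On insertion into a full staging cache, evict the coldest block.
        if block not in self.hits and len(self.hits) >= self.capacity:
            coldest = min(self.hits, key=self.hits.get)
            del self.hits[coldest]
        self.hits[block] = self.hits.get(block, 0) + 1

    def migration_candidates(self):
        # Blocks hot enough to be worth migrating into near memory.
        return [b for b, h in self.hits.items() if h >= self.promote_thresh]
```

    Because only blocks that survive the staging cache long enough to accumulate hits are migrated, cold data never pays the swap cost, which is the intuition the abstract describes.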

    Modelling supporting truss structure for wind turbine blade

    Get PDF
    The abstract in French was not provided by the author. The abstract in English was not provided by the author.

    Lattice Boltzmann Method with Adaptive Mesh Refinement strategy to solve the transport equation

    No full text
    Introduction
    The Lattice Boltzmann Method (LBM) is a widely used method for solving the transport equation in media with complex geometry. This popularity stems both from its simplicity of implementation and from its intrinsically parallelizable algorithm, which makes it a highly efficient High-Performance Computing (HPC) numerical method. The main drawback of this method, in its basic form, is the use of a lattice similar to a regular Cartesian mesh, which can lead to the use of a large number of sites with a level of discretization that is much too high in areas of low interest. We propose here to use an Adaptive Mesh Refinement method for the LBM, with the strong objective of not reducing the HPC efficiency of the LBM.
    HPC strategy
    Our main objective is to develop a portable, high-level-language simulation tool for different parallel CPU and GPU architectures, without having to rewrite it with each new processor technology advance. To do this, we have developed our code in C++ coupled with the Kokkos library, which allows us to obtain an executable that runs on several kinds of architectures (such as x86 multicores, NVIDIA® GPUs, AMD GPUs or ARM processors) [1,2].
    LBM on non-conformal grids
    In order to adapt an LB numerical scheme to a non-conforming lattice, we choose to use a Lax-Wendroff discretization approach to replace the streaming step. The LB algorithm then reads (1) for the collision step, where \Omega_i is the transport collision operator, and (2) for the streaming step:
    f^*_i(x,t) = f_i(x,t) - \Omega_i    (1)
    f_i(x, t+\Delta t) = f^*_i(x,t) - \chi (f^*_i(x,t) - f^*_i(x - e_i \Delta x, t)) - 0.5 \chi (1-\chi) (f^*_i(x + e_i \Delta x, t) - 2 f^*_i(x,t) + f^*_i(x - e_i \Delta x, t))    (2)
    This scheme is stable for \chi < 1, which we impose by the choice of \Delta t on the finest grid, thus guaranteeing stability on all refinement levels. The mesh is organised in blocks of a given number of cells in an octree structure. The communication between neighboring blocks is done via ghost cells. For neighboring blocks of different refinement levels, the ghost layers are filled using quadratic interpolation.
    Adaptive Mesh Refinement criteria
    In order to compute the transport as accurately as possible, we have chosen to refine the high concentration gradient zones as much as possible, and have therefore chosen to use a criterion based on the gradient to refine or coarsen each block of the lattice. When the mesh adaptation criterion requires refining a given block, the values to be assigned to the refined lattice are obtained by projection; when the criterion asks for coarsening, the value to be assigned is obtained by averaging.
    References
    [1] Compatibilities of the Kokkos library (2020). https://github.com/kokkos/kokkos/wiki/Compiling (web link accessible on March 8, 2022).
    [2] Verdier, W., Kestener, P., & Cartalade, A. (2020). Performance portability of lattice Boltzmann methods for two-phase flows with phase change. Computer Methods in Applied Mechanics and Engineering, 370, 113266.
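    The streaming step (2) is straightforward to prototype. Below is a minimal 1-D sketch in NumPy, assuming periodic boundaries and lattice directions e_i = ±1; the function name and array formulation are ours, not from the paper. A convenient sanity check: with χ = 1 the update collapses to the classic exact streaming step f_i(x, t+Δt) = f*_i(x - e_i Δx, t).

```python
import numpy as np

def lax_wendroff_stream(f_star, e_i, chi):
    """Lax-Wendroff streaming step (Eq. 2) on a 1-D periodic lattice.

    f_star : post-collision populations f*_i at every lattice site
    e_i    : lattice direction, +1 or -1
    chi    : Courant-like number; chi < 1 is required for stability
    """
    # np.roll(f, e_i)[x] == f[x - e_i], i.e. the upwind neighbour f*_i(x - e_i dx)
    upwind = np.roll(f_star, e_i)
    downwind = np.roll(f_star, -e_i)   # f*_i(x + e_i dx)
    return (f_star
            - chi * (f_star - upwind)
            - 0.5 * chi * (1.0 - chi) * (downwind - 2.0 * f_star + upwind))
```

    Both correction terms sum to zero over a periodic domain, so the total of f is conserved exactly, consistent with the conservative character of the scheme.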

    Multi-architecture implementation of Adaptive Mesh Refinement for Lattice-Boltzmann Method

    No full text
    The research work presented herein proposes an implementation of the Lattice Boltzmann Method [1] (LBM), coupled with an Adaptive Mesh Refinement (AMR) algorithm, with the main focus on the portability and the optimisation of the code on different high-performance computing (HPC) architectures. To preserve the efficiency of LBM for HPC when using the adaptive grid, as well as to optimally exploit the available HPC resources and keep up to date with their progress, the developed computational tool is built atop the Kokkos [2] C++ library for scientific computing. Kokkos automatically handles the adaptation and optimisation of a single piece of software on different computer architectures, such as CPUs, GPUs, and shared and distributed memory systems alike. The proposed method uses the BGK collision operator but alters the streaming step. Instead of using multiple time-steps [3], a single time-step with a Lax-Wendroff [4] spatial discretisation scheme is employed, which accommodates computational cells of different sizes, while sub-iterations per computation step and variable scaling between different grids are avoided, and data sweeps and exchanges are minimised. The computational domain is discretised by a cell-centred mesh, which is organised in a block-based octree structure. Computations, as well as refinement and coarsening operations, are performed on each block separately. Block communication and boundary condition imposition are realised through layers of ghost cells filled by quadratic polynomial interpolations. Preliminary assessment and validation tests, on transport problems of a Gaussian distribution profile for which analytical solutions exist, show that the AMR approach, with respect to a fully refined uniform mesh simulation, can reduce the total number of computational cells, and therefore the mean time of a single computational iteration, by a factor of 5, without loss of accuracy. In addition, hard disk I/O processes are accelerated. The normalised gradient of the concentration was used as a refinement criterion, and coarsening occurred automatically on neighbouring blocks that did not require refinement. These encouraging results indicate the great potential of the method's application to more complex physical problems, such as porous media or multiphase flows and dissolution modelling, coupled with the Navier-Stokes equations.
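    The gradient-based refine/coarsen decision described above can be sketched as a per-block classification. The thresholds and the simple 1-D cell-to-cell jump used as a gradient proxy below are illustrative assumptions, not values or formulas from this work.

```python
import numpy as np

def amr_decision(block, refine_thresh=0.1, coarsen_thresh=0.01):
    """Classify one block as 'refine', 'coarsen' or 'keep' from the
    normalised concentration gradient (thresholds are illustrative)."""
    c = np.asarray(block, dtype=float)
    scale = np.max(np.abs(c)) or 1.0           # normalisation; guard against all-zero blocks
    grad = np.max(np.abs(np.diff(c))) / scale  # steepest cell-to-cell jump as a gradient proxy
    if grad > refine_thresh:
        return "refine"
    if grad < coarsen_thresh:
        return "coarsen"
    return "keep"
```

    A sharp front inside a block triggers refinement, a flat block is coarsened, and anything in between keeps its current level, mirroring the automatic coarsening of quiet neighbouring blocks mentioned above.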

    FusionCache: Using LLC tags for DRAM cache

    No full text
    DRAM caches have been shown to be an effective way to utilize the bandwidth and capacity of 3D-stacked DRAM. Although they can capture the spatial and temporal data locality of applications, their access latency is still substantially higher than that of conventional on-chip SRAM caches. Moreover, their tag access latency and storage overheads are excessive: storing tags for a large DRAM cache in SRAM is impractical, as it would occupy a significant fraction of the processor chip; storing them in the DRAM itself incurs high access overheads; and attempting to cache the DRAM tags on the processor adds a constant delay to the access time. In this paper, we introduce FusionCache, a DRAM cache that offers more efficient tag accesses by fusing DRAM cache tags with the tags of the on-chip Last Level Cache (LLC). We observe that, in an inclusive cache model where the DRAM cachelines are multiples of on-chip SRAM cachelines, LLC tags can be re-purposed to access a large part of the DRAM cache contents. Then, accessing DRAM cache tags incurs zero additional latency in the common case.
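    The fused-tag idea can be illustrated with a toy lookup model: tags for resident DRAM-cache blocks are tracked alongside the LLC tags, so an LLC miss that hits a fused tag can go straight to the DRAM-cache data with no separate DRAM tag access. The class, fields, and the 4:1 ratio of DRAM to SRAM cacheline sizes below are illustrative assumptions, not the paper's actual organization.

```python
DRAM_BLOCK = 4  # SRAM cachelines per DRAM cacheline (assumed ratio)

class FusedTags:
    """Toy model of fused tag lookup on the LLC access path."""

    def __init__(self):
        self.llc = {}    # SRAM line address -> cached data
        self.fused = {}  # DRAM block address -> location in the DRAM cache

    def lookup(self, line_addr):
        if line_addr in self.llc:
            return ("llc_hit", self.llc[line_addr])
        block = line_addr // DRAM_BLOCK
        if block in self.fused:
            # Fused tag hit: the DRAM-cache data can be fetched directly,
            # with no separate DRAM tag access on the critical path.
            return ("dram_cache_hit", self.fused[block])
        return ("miss", None)
```

    The key property is that the tag check for the DRAM cache rides along with the LLC tag check the processor performs anyway, which is why the common-case added latency is zero.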

    LLC-guided data migration in hybrid memory systems

    No full text
    Although 3D-stacked DRAM offers substantially higher bandwidth than commodity DDR DIMMs, it cannot yet provide the necessary capacity to replace the bulk of the memory. A promising alternative is to use flat-address-space hybrid memory systems of two or more levels, each exhibiting different performance characteristics. One such existing approach employs a near, high-bandwidth 3D-stacked memory, placed on top of the processor die, combined with a far, commodity DDR memory, placed off-chip. Migrating data from the far to the near memory has significant performance potential, but also entails overheads which may diminish migration benefits or even lead to performance degradation. This paper describes a new data migration scheme for hybrid memory systems that takes these overheads into account and improves migration efficiency and effectiveness. It is based on the observation that migrating memory segments which are (partly) present in the Last-Level Cache (LLC) introduces lower migration traffic. Our approach relies on the state of the LLC cachelines to predict future reuse and select memory segments for migration. Thereby, segments are migrated while present (at least partly) in the LLC, incurring lower cost. Our experiments confirm that our approach outperforms current state-of-the-art migration designs, improving system performance by 12.1% and reducing memory system dynamic energy by 13.2%.
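    The selection policy can be sketched as follows, treating LLC residency as the reuse predictor and skipping LLC-resident lines when computing migration traffic. The segment size, threshold, and function name are illustrative assumptions, not values from the paper.

```python
LINES_PER_SEG = 8  # cache lines per memory segment (illustrative)

def select_and_migrate(segment, llc_lines, reuse_thresh=2):
    """Sketch of an LLC-guided policy: a segment is a migration candidate
    when enough of its lines sit in the LLC (a proxy for future reuse),
    and lines already cached in the LLC are excluded from the swap traffic.
    Returns the list of lines that must actually move, or None."""
    seg_lines = [segment * LINES_PER_SEG + i for i in range(LINES_PER_SEG)]
    cached = [l for l in seg_lines if l in llc_lines]
    if len(cached) < reuse_thresh:
        return None  # not predicted to be reused: do not migrate
    return [l for l in seg_lines if l not in llc_lines]
```

    The two effects the abstract claims fall out directly: hot segments are identified without extra metadata traffic, and every LLC-resident line shaves one line transfer off the swap cost.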