
    Paving the Path for Heterogeneous Memory Adoption in Production Systems

    Systems from smartphones to data centers to supercomputers are increasingly heterogeneous, comprising various memory technologies and core types. Heterogeneous memory systems provide an opportunity to match varying memory access patterns in applications, reducing CPU time and thus increasing performance per dollar, which yields aggregate savings of millions of dollars in large-scale systems. However, with increased provisioning of main memory capacity per machine and differences in memory characteristics (for example, bandwidth, latency, cost, and density), memory management in such heterogeneous memory systems poses manifold challenges for system programmability and design. In this thesis, we tackle memory management in two heterogeneous memory systems: (a) CPU-GPU systems with a unified virtual address space, and (b) cloud computing platforms that can deploy cheaper but slower memory technologies alongside DRAM to reduce the cost of memory in data centers. First, we show that operating systems do not have sufficient information to optimally manage pages in bandwidth-asymmetric systems and thus fail to maximize bandwidth to massively threaded GPU applications, sacrificing GPU throughput. We present BW-AWARE placement/migration policies that help the OS make optimal data-management decisions. Second, we present a CPU-GPU cache coherence design in which the CPU and GPU need not implement the same cache coherence protocol yet still present a cache-coherent memory interface to the programmer. Our proposal is the first practical approach to provide a unified, coherent CPU–GPU address space without requiring hardware cache coherence, with the potential to enable an explosion of algorithms that leverage tightly coupled CPU–GPU coordination. Finally, to reduce the cost of memory in cloud platforms, where the trend has been to keep datasets in memory, we make a case for a two-tiered memory system in which cheaper (per bit) memories, such as Intel/Micron's 3D XPoint, are deployed alongside DRAM. We present Thermostat, an application-transparent, huge-page-aware software mechanism that places pages in a dual-technology hybrid memory system while achieving both the cost advantages of two-tiered memory and the performance advantages of transparent huge pages. With Thermostat's ability to control application slowdown on a per-application basis, cloud providers can realize cost savings from upcoming cheaper memory technologies by shifting infrequently accessed cold data to slow memory while satisfying customers' throughput demands.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/137052/1/nehaag_1.pd
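    As a purely illustrative sketch of the two-tier placement idea (the page size, capacity budget, and ranking heuristic below are assumptions, not Thermostat's actual mechanism), pages can be ranked by sampled access counts and only the hottest kept in the fast tier:

        # Hypothetical sketch of hot/cold page placement in a two-tiered
        # memory; page size, capacity, and the ranking heuristic are
        # assumptions for illustration, not Thermostat's implementation.
        from collections import Counter

        HUGE_PAGE_MB = 2  # 2 MiB transparent huge pages (assumed)

        def place_pages(access_counts, fast_capacity_mb):
            """Return (fast, slow) page-id sets given sampled access counts."""
            # Hottest pages first, so the fast tier absorbs most traffic
            # and the cold tail is shifted to the cheap, slow tier.
            ranked = sorted(access_counts, key=access_counts.get, reverse=True)
            budget = fast_capacity_mb // HUGE_PAGE_MB
            return set(ranked[:budget]), set(ranked[budget:])

        # Synthetic skewed workload: a few pages receive most accesses.
        counts = Counter({f"page{i}": 1000 // (i + 1) for i in range(8)})
        fast, slow = place_pages(counts, fast_capacity_mb=8)  # fits 4 pages
        print("fast tier:", sorted(fast))
        print("slow tier:", sorted(slow))

    Thermostat itself operates transparently on huge pages inside the OS and bounds per-application slowdown; the sketch only shows the hot/cold split under a fast-tier capacity budget.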

    Simulation Modelling of Distributed-Shared Memory Multiprocessors

    Institute for Computing Systems Architecture
    Distributed shared memory (DSM) systems have been recognised as a compelling platform for parallel computing due to their programming advantages and scalability. DSM systems allow applications to access data in a logically shared address space by abstracting away the distinction between physical memory locations. Because the location of data is transparent, the sources of overhead caused by accessing distant memories are difficult to analyse. This memory locality problem has been identified as crucial to DSM performance. Many researchers have investigated the problem using simulation as an experimental tool, resulting in the progressive evolution of DSM systems. Nevertheless, both the diversity of architectural configurations and the rapid advance of DSM implementations constrain simulation model design in two respects: the simulation framework limits model extensibility, and the lack of verification applicability during a simulation run delays the verification process. This thesis studies simulation modelling techniques for memory locality analysis of various DSM systems implemented on top of a cluster of symmetric multiprocessors. It presents a simulation technique to promote model extensibility and proposes a technique for verification applicability, called Specification-based Parameter Model Interaction (SPMI). The proposed techniques have been implemented in a new interpretation-driven simulator, DSiMCLUSTER, on top of a discrete-event simulation (DES) engine known as HASE. Experiments have been conducted to determine which factors most influence the degree of locality and whether the stability of performance can be maximised. DSiMCLUSTER has been validated against a SunFire 15K server, matching cache-miss results to within ±6% on average, with a worst-case difference below 15%. These results confirm that the techniques used in developing DSiMCLUSTER contribute to achieving both (a) a highly extensible simulation framework that can keep up with ongoing innovation in DSM architectures, and (b) verification applicability, resulting in an efficient framework for memory-analysis experiments on DSM architectures.
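    To make the modelling approach concrete, here is a minimal discrete-event sketch (the latencies, node counts, and uniform access pattern are assumptions for illustration, not DSiMCLUSTER or HASE): accesses are popped in timestamp order and charged a higher latency when the accessed page's home node is remote, yielding the locality statistics such a simulator reports.

        # Minimal discrete-event sketch of DSM memory locality; all
        # parameters are assumed for illustration (not DSiMCLUSTER/HASE).
        import heapq
        import random

        LOCAL_NS, REMOTE_NS = 100, 1000   # assumed access latencies
        NODES, PAGES, ACCESSES = 4, 64, 10_000

        rng = random.Random(42)
        home = {p: rng.randrange(NODES) for p in range(PAGES)}  # page -> home
        events = [(rng.random(), rng.randrange(NODES), rng.randrange(PAGES))
                  for _ in range(ACCESSES)]
        heapq.heapify(events)  # a DES engine pops events in time order

        total_ns = remote = 0
        while events:
            _, node, page = heapq.heappop(events)
            is_remote = home[page] != node
            total_ns += REMOTE_NS if is_remote else LOCAL_NS
            remote += is_remote
        print(f"remote fraction: {remote / ACCESSES:.2%}, "
              f"mean latency: {total_ns / ACCESSES:.0f} ns")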

    Analysis of opportunities for cache coherence in heterogeneous embedded systems

    New needs and challenges arise in the context of heterogeneous embedded systems. This work focuses on coherence in these systems, analyzing the possibility of applying techniques that better fit those needs. Before the analysis, we explain what the coherence problem is and what solutions are currently proposed for it.
    Esteve García, A. (2012). Analysis of opportunities for cache coherence in heterogeneous embedded systems. http://hdl.handle.net/10251/29846
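    As background for the coherence problem the abstract refers to, the textbook MSI invalidation protocol can be sketched in a few lines (this is the generic protocol, not a design from the thesis):

        # Textbook MSI snooping sketch: each cached line is Modified,
        # Shared, or Invalid. Generic illustration, not the thesis's design.
        from enum import Enum

        class St(Enum):
            M = "Modified"
            S = "Shared"
            I = "Invalid"

        class Cache:
            def __init__(self):
                self.lines = {}  # addr -> St

            def state(self, addr):
                return self.lines.get(addr, St.I)

            def read(self, addr, others):
                if self.state(addr) is St.I:       # read miss
                    for c in others:               # a Modified copy must
                        if c.state(addr) is St.M:  # write back and downgrade
                            c.lines[addr] = St.S
                    self.lines[addr] = St.S

            def write(self, addr, others):
                for c in others:                   # invalidate other copies
                    c.lines.pop(addr, None)
                self.lines[addr] = St.M

        c0, c1 = Cache(), Cache()
        c0.read(0x40, [c1])           # c0: Shared
        c1.write(0x40, [c0])          # c1: Modified, c0 invalidated
        print(c0.state(0x40).name, c1.state(0x40).name)  # prints: I M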

    Parallel Processes in HPX: Designing an Infrastructure for Adaptive Resource Management

    Advances in cutting-edge technologies have enabled better energy efficiency and greater computational power in the latest High Performance Computing (HPC) systems. However, the complexity introduced by hybrid architectures and by emerging classes of applications leads to poor computational scalability under conventional execution models. Alternative means of computation that address these bottlenecks are therefore warranted. More precisely, dynamic adaptive resource management, from both the system's and the application's perspective, is essential for better computational scalability and efficiency. This research presents and expands the notion of Parallel Processes as a placeholder for procedure definitions targeted at one or more synchronous domains, metadata for computation and resource management, and infrastructure for dynamic policy deployment. In addition, the research presents guidelines for a resource-management framework in the HPX runtime system, and lists design principles for scaling the Active Global Address Space (AGAS), a feature Parallel Processes require. To verify the usefulness of Parallel Processes, a preliminary performance evaluation of different task-scheduling policies is carried out using two applications: Unbalanced Tree Search, a reference dynamic graph application implemented in HPX as part of this research, and MiniGhost, a reference stencil application using the bulk-synchronous parallel model. The results show that different scheduling policies perform better for different classes of applications, and that even within a single application class one policy fares better in some instances and worse in others, supporting the hypothesis that a dynamic adaptive resource-management infrastructure, capable of deploying different policies and task granularities, is needed for scalable distributed computing.
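    A toy illustration of why no single policy wins (the tree shape and the two policies below are assumptions for illustration, not the HPX scheduler or the actual UTS benchmark): on an unbalanced tree, LIFO scheduling explores depth-first and keeps the pending-task queue small, while FIFO exposes more parallelism at the cost of a much larger queue.

        # Toy comparison of task-scheduling policies on an unbalanced tree,
        # loosely inspired by Unbalanced Tree Search; tree shape and
        # policies are assumptions, not the HPX implementation.
        from collections import deque

        MAX_DEPTH = 12

        def children(node, depth):
            # Deterministic pseudo-random branching (0-3 children per node),
            # so both policies traverse the identical tree.
            if depth >= MAX_DEPTH:
                return []
            n = (node * 1103515245 + 12345) & 0x7FFFFFFF
            return [(n + i, depth + 1) for i in range(n % 4)]

        def run(policy):
            pending = deque([(1, 0)])  # (node id, depth)
            visited = peak = 0
            while pending:
                peak = max(peak, len(pending))
                node, depth = (pending.pop() if policy == "LIFO"
                               else pending.popleft())
                visited += 1
                pending.extend(children(node, depth))
            return visited, peak

        for policy in ("LIFO", "FIFO"):
            visited, peak = run(policy)
            print(f"{policy}: visited={visited}, peak pending tasks={peak}")

    Both runs visit the same nodes, but the peak number of pending tasks differs sharply, the kind of trade-off that motivates pluggable scheduling policies.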

    Efficient Algorithms for Large-Scale Image Analysis

    This work develops highly efficient algorithms for analyzing large images. Applications include object-based change detection and screening. The algorithms are 10-100 times as fast as existing software, sometimes even outperforming FPGA/GPU implementations, because they are designed to suit the computer architecture. This thesis describes the implementation details and the underlying algorithm-engineering methodology, so that both may also be applied to other applications.
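    The kind of architecture-conscious design the abstract alludes to can be hinted at with a small self-contained example (an illustration of the principle, not the thesis's algorithms): traversing an image in the order it is laid out in memory is far faster than striding across rows. In compiled code the gap comes mainly from cache locality; in CPython it comes mainly from the contiguous pass using a bulk operation.

        # Illustration of the principle only: traversal order that matches
        # the memory layout wins (cache locality in compiled code; bulk
        # operations in CPython).
        import time

        W = H = 2000
        image = [[(x ^ y) & 0xFF for x in range(W)] for y in range(H)]

        def bench(order):
            t0 = time.perf_counter()
            total = 0
            if order == "row-major":        # follows the layout
                for row in image:
                    total += sum(row)
            else:                           # strides across rows
                for x in range(W):
                    for y in range(H):
                        total += image[y][x]
            return total, time.perf_counter() - t0

        for order in ("row-major", "column-major"):
            total, dt = bench(order)
            print(f"{order:>12}: sum={total}, {dt:.3f}s")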

    Improving energy efficiency of virtualized datacenters

    Nowadays, many organizations are increasingly adopting cloud computing. More specifically, as customers, these organizations outsource the management of their physical infrastructure to data centers (cloud computing platforms). Energy consumption is a primary concern in datacenter (DC) management. Its cost represents about 80% of the total cost of ownership, and it is estimated that in 2020 US DCs alone will spend about $13 billion on energy bills. Datacenter servers are generally built to achieve high energy efficiency at high utilization, so for a low cost per computation all servers should be driven to as high a utilization as possible. To fight historically low utilization, cloud computing adopted server virtualization, which allows a physical server to execute multiple virtual servers (called virtual machines) in an isolated way. With virtualization, the cloud provider can pack (consolidate) the entire set of virtual machines (VMs) onto a small set of physical servers and thereby reduce the number of active servers. Even so, datacenter servers rarely exceed 50% utilization, which means they operate with long-term unused resources (called 'holes'). My first contribution is a cloud management system that dynamically splits and merges VMs so that they better fill these holes. This solution is effective only for elastic applications, i.e. applications that can be executed and reconfigured over an arbitrary number of VMs. However, datacenter resource fragmentation stems from a more fundamental problem: over time, cloud applications demand more and more memory, while physical servers provide more and more CPU. In today's datacenters the two resources are strongly coupled, since both are bound to the same physical server. My second contribution is a practical way to decouple the CPU-memory tuple that can be applied to a commodity server, so that the two resources can vary independently according to demand. My third and fourth contributions present practical systems that exploit the second. The underutilization observed on physical servers also holds for virtual machines: VMs have been shown to consume only a small fraction of their allocated resources, because cloud customers cannot accurately estimate the resources their applications need. My third contribution is a system that estimates the memory consumption (i.e. the working set size) of a VM with low overhead and high accuracy, so that VMs can be consolidated based on their working set size rather than their booked memory. The drawback of this approach is the risk of memory starvation: if one or more VMs see a sharp increase in memory demand, the physical server may run out of memory, leaving the cloud platform unable to provide clients with the memory they booked. My fourth contribution is therefore a system that allows a VM to use remote memory provided by a different server in the rack, so that under peak memory demand the VM can allocate memory on a remote physical server.
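    The general idea behind the third contribution, estimating a VM's working set by periodically sampling page access bits, can be sketched as follows (the workload shape, interval count, and all parameters are assumptions for illustration, not the thesis's system):

        # Sketch of access-bit sampling for working-set-size estimation;
        # workload shape and all parameters are assumed for illustration.
        import random

        PAGES = 1024  # pages booked by the VM

        def touched_pages(rng, touches=2000, hot_fraction=0.1):
            """One interval of simulated VM activity: 90% of accesses
            hit the hot 10% of pages (a typical skew assumption)."""
            hot = int(PAGES * hot_fraction)
            return {rng.randrange(hot) if rng.random() < 0.9
                    else rng.randrange(PAGES) for _ in range(touches)}

        def estimate_wss(intervals=10, seed=7):
            rng = random.Random(seed)
            # Each interval: clear access bits, let the VM run, then
            # count the pages whose access bit was set again.
            sizes = [len(touched_pages(rng)) for _ in range(intervals)]
            return sum(sizes) / len(sizes)

        wss = estimate_wss()
        print(f"booked: {PAGES} pages, estimated WSS: {wss:.0f} pages "
              f"({wss / PAGES:.0%} of booked)")

    The gap between booked pages and the estimated working set is exactly the consolidation opportunity the abstract describes, and also why the fourth contribution's remote-memory safety valve is needed when demand spikes.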

    Energy Efficient Servers

    Computer science