29 research outputs found

    mPart: Miss Ratio Curve Guided Partitioning in Key-Value Stores

    Get PDF
    Web applications employ key-value stores to cache the data that is most commonly accessed. The cache improves an web application’s performance by serving its requests from memory, avoiding fetching them from the backend database. Since the memory space is limited, maximizing the memory utilization is a key to delivering the best performance possible. This has lead to the use of multi-tenant systems, allowing applications to share cache space. In addition, application data access patterns change over time, so the system should be adaptive in its memory allocation. In this thesis, we address both multi-tenancy (where a single cache is used for mul- tiple applications) and dynamic workloads (changing access patterns) using a model that relates the cache size to the application miss ratio, known as a miss ratio curve. Intuitively, the larger the cache, the less likely the system will need to fetch the data from the database. Our efficient, online construction of the miss ratio curve allows us to determine a near optimal memory allocation given the available system memory, while adapting to changing data access patterns. We show that our model outper- forms an existing state-of-the-art sharing model, Memshare, in terms of cache hit ratio and does so at a lower time cost. We show that average hit ratio is consistently 1 percentage point greater and 99.9th percentile latency is reduced by as much as 2.9% under standard web application workloads containing millions of requests

    Scaling Hierarchical N-body Simulations on GPU Clusters

    Full text link
    Abstract — This paper focuses on the use of GPGPU-based clus-ters for hierarchical N-body simulations. Whereas the behavior of these hierarchical methods has been studied in the past on CPU-based architectures, we investigate key performance issues in the context of clusters of GPUs. These include kernel orga-nization and efficiency, the balance between tree traversal and force computation work, grain size selection through the tuning of offloaded work request sizes, and the reduction of sequential bottlenecks. The effects of various application parameters are studied and experiments done to quantify gains in performance. Our studies are carried out in the context of a production-quality parallel cosmological simulator called ChaNGa. We highlight the re-engineering of the application to make it more suitable for GPU-based environments. Finally, we present performance results from experiments on the NCSA Lincoln GPU cluster, including a note on GPU use in multistepped simulations

    Designing and Handling Failure issues in a Structured Overlay Network Based Grid

    Get PDF
    Grid computing is the computing paradigm that is concerned with coordinated resource sharing and problem solving in dynamic, autonomous multi-institutional virtual organizations. Data exchange and service allocation between virtual organizations are challenging problems in the field of Grid computing, due to the decentralization of Grid systems. The resource management in a Grid system ensures efficiency and usability. The required efficiency and usability of Grid systems can be achieved by building a decentralized multi-virtual Grid system. In this thesis we present a decentralized multi-virtual resource management framework in which the system is divided into virtual organizations, each controlled by a broker. An overlay network of brokers is responsible for global resource management and managing the allocation of services. We address two main issues for both local and global resource management: 1) decentralized allocation of tasks to suitable nodes to achieve both local and global load balancing; and 2) handling of both regular and broker failures. Experimental results verify that the system achieves dependable performance with various loads of services and broker failures

    Real-Time Scheduling for GPUs with Applications in Advanced Automotive Systems

    Get PDF
    Self-driving cars, once constrained to closed test tracks, are beginning to drive alongside human drivers on public roads. Loss of life or property may result if the computing systems of automated vehicles fail to respond to events at the right moment. We call such systems that must satisfy precise timing constraints “real-time systems.” Since the 1960s, researchers have developed algorithms and analytical techniques used in the development of real-time systems; however, this body of knowledge primarily applies to traditional CPU-based platforms. Unfortunately, traditional platforms cannot meet the computational requirements of self-driving cars without exceeding the power and cost constraints of commercially viable vehicles. We argue that modern graphics processing units, or GPUs, represent a feasible alternative, but new algorithms and analytical techniques must be developed in order to integrate these uniquely constrained processors into a real-time system. The goal of the research presented in this dissertation is to discover and remedy the issues that prevent the use of GPUs in real-time systems. To overcome these issues, we design and implement a real-time multi-GPU scheduler, called GPUSync. GPUSync tightly controls access to a GPU’s computational and DMA processors, enabling simultaneous use despite potential limitations in GPU hardware. GPUSync enables tasks to migrate among GPUs, allowing new classes of real-time multi-GPU computing platforms. GPUSync employs heuristics to guide scheduling decisions to improve system efficiency without risking violations in real-time constraints. GPUSync may be paired with a wide variety of common real-time CPU schedulers. GPUSync supports closed-source GPU runtimes and drivers without loss in functionality. We evaluate GPUSync with both analytical and runtime experiments. In our analytical experiments, we model and evaluate over fifty configurations of GPUSync. We determine which configurations support the greatest computational capacity while maintaining real-time constraints. In our runtime experiments, we execute computer vision programs similar to those found in automated vehicles, with and without GPUSync. Our results demonstrate that GPUSync greatly reduces jitter in video processing. Research into real-time systems with GPUs is a new area of study. Although there is prior work on such systems, no other GPU scheduling framework is as comprehensive and flexible as GPUSync.Doctor of Philosoph

    Novel techniques in large scaleable ATM switches

    Get PDF
    Bibliography: p. 172-178.This dissertation explores the research area of large scale ATM switches. The requirements for an ATM switch are determined by overviewing the ATM network architecture. These requirements lead to the discussion of an abstract ATM switch which illustrates the components of an ATM switch that automatically scale with increasing switch size (the Input Modules and Output Modules) and those that do not (the Connection Admission Control and Switch Management systems as well as the Cell Switch Fabric). An architecture is suggested which may result in a scalable Switch Management and Connection Admission Control function. However, the main thrust of the dissertation is confined to the cell switch fabric. The fundamental mathematical limits of ATM switches and buffer placement is presented next emphasising the desirability of output buffering. This is followed by an overview of the possible routing strategies in a multi-stage interconnection network. A variety of space division switches are then considered which leads to a discussion of the hypercube fabric, (a novel switching technique). The hypercube fabric achieves good performance with an O(N.log₂N)ÂČ) scaling. The output module, resequencing, cell scheduling and output buffering technique is presented leading to a complete description of the proposed ATM switch. Various traffic models are used to quantify the switch's performance. These include a simple exponential inter-arrival time model, a locality of reference model and a self-similar, bursty, multiplexed Variable Bit Rate (VBR) model. FIFO queueing is simple to implement in an ATNI switch, however, more responsive queueing strategies can result in an improved performance. An associative memory is presented which allows the separate queues in the ATM switch to be effectively logically combined into a single FIFO queue. The associative memory is described in detail and its feasibility is shown by laying out the Integrated Circuit masks and performing an analogue simulation of the IC's performance is SPICE3. Although optimisations were required to the original design, the feasibility of the approach is shown with a 15È s write time and a 160È s read time for a 32 row, 8 priority bit, 10 routing bit version of the memory. This is achieved with 2”m technology, more advanced technologies may result in even better performance. The various traffic models and switch models are simulated in a number of runs. This shows the performance of the hypercube which outperforms a Clos network of equivalent technology and approaches the performance of an ideal reference fabric. The associative memory leverages a significant performance advantage in the hypercube network and a modest advantage in the Clos network. The performance of the switches is shown to degrade with increasing traffic density, increasing locality of reference, increasing variance in the cell rate and increasing burst length. Interestingly, the fabrics show no real degradation in response to increasing self similarity in the fabric. Lastly, the appendices present suggestions on how redundancy, reliability and multicasting can be achieved in the hypercube fabric. An overview of integrated circuits is provided. A brief description of commercial ATM switching products is given. Lastly, a road map to the simulation code is provided in the form of descriptions of the functionality found in all of the files within the source tree. This is intended to provide the starting ground for anyone wishing to modify or extend the simulation system developed for this thesis

    Co-designing reliability and performance for datacenter memory

    Get PDF
    Memory is one of the key components that affects reliability and performance of datacenter servers. Memory in today’s servers is organized and shared in several ways to provide the most performant and efficient access to data. For example, cache hierarchy in multi-core chips to reduce access latency, non-uniform memory access (NUMA) in multi-socket servers to improve scalability, disaggregation to increase memory capacity. In all these organizations, hardware coherence protocols are used to maintain memory consistency of this shared memory and implicitly move data to the requesting cores. This thesis aims to provide fault-tolerance against newer models of failure in the organization of memory in datacenter servers. While designing for improved reliability, this thesis explores solutions that can also enhance performance of applications. The solutions build over modern coherence protocols to achieve these properties. First, we observe that DRAM memory system failure rates have increased, demanding stronger forms of memory reliability. To combat this, the thesis proposes DvĂ©, a hardware driven replication mechanism where data blocks are replicated across two different memory controllers in a cache-coherent NUMA system. Data blocks are accompanied by a code with strong error detection capabilities so that when an error is detected, correction is performed using the replica. Dvé’s organization offers two independent points of access to data which enables: (a) strong error correction that can recover from a range of faults affecting any of the components in the memory and (b) higher performance by providing another nearer point of memory access. Dvé’s coherent replication keeps the replicas in sync for reliability and also provides coherent access to read replicas during fault-free operation for improved performance. DvĂ© can flexibly provide these benefits on-demand at runtime. Next, we observe that the coherence protocol itself requires to be hardened against failures. Memory in datacenter servers is being disaggregated from the compute servers into dedicated memory servers, driven by standards like CXL. CXL specifies the coherence protocol semantics for compute servers to access and cache data from a shared region in the disaggregated memory. However, the CXL specification lacks the requisite level of fault-tolerance necessary to operate at an inter-server scale within the datacenter. Compute servers can fail or be unresponsive in the datacenter and therefore, it is important that the coherence protocol remain available in the presence of such failures. The thesis proposes Āpta, a CXL-based, shared disaggregated memory system for keeping the cached data consistent without compromising availability in the face of compute server failures. Āpta architects a high-performance fault-tolerant object-granular memory server that significantly improves performance for stateless function-as-a-service (FaaS) datacenter applications

    Cost functions in optical burst-switched networks

    Get PDF
    Optical Burst Switching (OBS) is a new paradigm for an all-optical Internet. It combines the best features of Optical Circuit Switching (OCS) and Optical Packet Switching (OPS) while avoidmg the mam problems associated with those networks .Namely, it offers good granularity, but its hardware requirements are lower than those of OPS. In a backbone network, low loss ratio is of particular importance. Also, to meet varying user requirements, it should support multiple classes of service. In Optical Burst-Switched networks both these goals are closely related to the way bursts are arranged in channels. Unlike the case of circuit switching, scheduling decisions affect the loss probability of future burst This thesis proposes the idea of a cost function. The cost function is used to judge the quality of a burst arrangement and estimate the probability that this burst will interfere with future bursts. Two applications of the cost functio n are proposed. A scheduling algorithm uses the value of the cost function to optimize the alignment of the new burst with other bursts in a channel, thus minimising the loss ratio. A cost-based burst droppmg algorithm, that can be used as a part of a Quality of Service scheme, drops only those bursts, for which the cost function value indicates that are most likely to cause a contention. Simulation results, performed using a custom-made OBS extension to the ns-2 simulator, show that the cost-based algorithms improve network performanc

    Communication Architectures for Scalable GPU-centric Computing Systems

    Get PDF
    In recent years, power consumption has become the main concern in High Performance Computing (HPC). This has lead to heterogeneous computing systems in which Central Processing Units (CPUs) are supported by accelerators, such as Graphics Processing Units (GPUs). While GPUs used to be seen as slave devices to which the main processor offloads computation, today’s systems tend to deploy more GPUs than CPUs. Eventually, the GPU will become a first-class processor, bearing increasing responsibilities. Promoting the GPU to a first-class processor comes with many challenges, such as progress guarantees, dynamic memory management, and scheduling. However, one of the main challenges is the GPU’s inability to orchestrate communication, which is currently entirely handled by the CPU. This work addresses that issue and presents solutions to allow GPUs to source and sink network traffic independently. Many important aspects are addressed, ranging from the application level to how networking hardware is accessed. First, important and large scale exascale applications are studied to further understand their communication behavior and applications’ requirements. Several metrics are presented, including time spent for communication, message sizes, and the length of queues that are required to match messages with receive requests. One aspect the analysis revealed is that messages are becoming smaller at scale, which renders the matching of messages and receive requests an important problem to address. The next part analyzes how the GPU can directly access the network with various communication models being presented and benchmarked. It is shown that a flat address space of distributed GPU memories shows superior bandwidth than put/get communication or CPU-controlled message passing, but less communication can be overlapped with computation. Overall, GPU-controlled communication is always superior, both in terms of time-to-solution and energy spending. The final part addresses communication management on GPUs, which is required to provide high-level communication abstractions. Besides other fundamental building blocks, an algorithm for the message matching is presented that yields similar performance as CPUs. However, it is also shown that the messaging protocol can be relaxed to improve performance significantly, leveraging the massive amount of parallelism provided by the GPU’s architecture

    Actor-Oriented Programming for Resource Constrained Multiprocessor Networks on Chip

    Get PDF
    Multiprocessor Networks on Chip (MPNoCs) are an attractive architecture for integrated circuits as they can benefit from the improved performance of ever smaller transistors but are not severely constrained by the poor performance of global on-chip wires. As the number of processors increases it becomes ever more expensive to provide coherent shared memory but this is a foundational assumption of thread-level parallelism. Threaded models of concurrency cannot efficiently address architectures where shared memory is not coherent or does not exist. In this thesis an extended actor oriented programming model is proposed to enable the design of complex and general purpose software for highly parallel and decentralised multiprocessor architectures. This model requires the encapsulation of an execution context and state into isolated Machines which may only initiate communication with one another via explicitly named channels. An emphasis on message passing and strong isolation of computation encourages application structures that are congruent with the nature of non-shared memory multiprocessors, and the model also avoids creating dependences on specific hardware topologies. A realisation of the model called Machine Java is presented to demonstrate the applicability of the model to a general purpose programming language. Applications designed with this framework are shown to be capable of scaling to large numbers of processors and remain independent of the hardware targets. Through the use of an efficient compilation technique, Machine Java is demonstrated to be portable across several architectures and viable even in the highly constrained context of an FPGA hosted MPNoC
    corecore