21 research outputs found

    Scale-out NUMA

    Get PDF
    Emerging datacenter applications operate on vast datasets that are kept in DRAM to minimize latency. The large number of servers needed to accommodate this massive memory footprint requires frequent server-to-server communication in applications such as key-value stores and graph-based applications that rely on large irregular data structures. The fine-grained nature of the accesses is a poor match to commodity networking technologies, including RDMA, which incur delays of 10-1000x over local DRAM operations. We introduce Scale-Out NUMA (soNUMA) – an architecture, programming model, and communication protocol for low-latency, distributed in-memory processing. soNUMA layers an RDMA-inspired programming model directly on top of a NUMA memory fabric via a stateless messaging protocol. To facilitate interactions between the application, OS, and the fabric, soNUMA relies on the remote memory controller – a new architecturally-exposed hardware block integrated into the node’s local coherence hierarchy. Our results based on cycle-accurate full-system simulation show that soNUMA performs remote reads at latencies that are within 4x of local DRAM, can fully utilize the available memory bandwidth, and can issue up to 10M remote memory operations per second per core

    Rack-Scale Memory Pooling for Datacenters

    Get PDF
    The rise of web-scale services has led to a staggering growth in user data on the Internet. To transform such a vast raw data into valuable information for the user and provide quality assurances, it is important to minimize access latency and enable in-memory processing. For more than a decade, the only practical way to accommodate for ever-growing data in memory has been to scale out server resources, which has led to the emergence of large-scale datacenters and distributed non-relational databases (NoSQL). Such horizontal scaling of resources translates to an increasing number of servers that participate in processing individual user requests. Typically, each user request results in hundreds of independent queries targeting different NoSQL nodes - servers, and the larger the number of servers involved, the higher the fan-out. To complete a single user request, all of the queries associated with that request have to complete first, and thus, the slowest query determines the completion time. Because of skewed popularity distributions and resource contention, the more servers we have, the harder it is to achieve high throughput and facilitate server utilization, without violating service level objectives. This thesis proposes rack-scale memory pooling (RSMP), a new scaling technique for future datacenters that reduces networking overheads and improves the performance of core datacenter software. RSMP is an approach to building larger, rack-scale capacity units for datacenters through specialized fabric interconnects with support for one-sided operations, and using them, in lieu of conventional servers (e.g. 1U), to scale out. We define an RSMP unit to be a server rack connecting 10s to 100s of servers to a secondary network enabling direct, low-latency access to the global memory of the rack. We, then, propose a new RSMP design - Scale-Out NUMA that leverages integration and a NUMA fabric to bridge the gap between local and remote memory to only 5× difference in access latency. Finally, we show how RSMP impacts NoSQL data serving, a key datacenter service used by most web-scale applications today. We show that using fewer larger data shards leads to less load imbalance and higher effective throughput, without violating applications¿ service level objectives. For example, by using Scale-Out NUMA, RSMP improves the throughput of a key-value store up to 8.2× over a traditional scale-out deployment

    Overlapping of Communication and Computation and Early Binding: Fundamental Mechanisms for Improving Parallel Performance on Clusters of Workstations

    Get PDF
    This study considers software techniques for improving performance on clusters of workstations and approaches for designing message-passing middleware that facilitate scalable, parallel processing. Early binding and overlapping of communication and computation are identified as fundamental approaches for improving parallel performance and scalability on clusters. Currently, cluster computers using the Message-Passing Interface for interprocess communication are the predominant choice for building high-performance computing facilities, which makes the findings of this work relevant to a wide audience from the areas of high-performance computing and parallel processing. The performance-enhancing techniques studied in this work are presently underutilized in practice because of the lack of adequate support by existing message-passing libraries and are also rarely considered by parallel algorithm designers. Furthermore, commonly accepted methods for performance analysis and evaluation of parallel systems omit these techniques and focus primarily on more obvious communication characteristics such as latency and bandwidth. This study provides a theoretical framework for describing early binding and overlapping of communication and computation in models for parallel programming. This framework defines four new performance metrics that facilitate new approaches for performance analysis of parallel systems and algorithms. This dissertation provides experimental data that validate the correctness and accuracy of the performance analysis based on the new framework. The theoretical results of this performance analysis can be used by designers of parallel system and application software for assessing the quality of their implementations and for predicting the effective performance benefits of early binding and overlapping. This work presents MPI/Pro, a new MPI implementation that is specifically optimized for clusters of workstations interconnected with high-speed networks. This MPI implementation emphasizes features such as persistent communication, asynchronous processing, low processor overhead, and independent message progress. These features are identified as critical for delivering maximum performance to applications. The experimental section of this dissertation demonstrates the capability of MPI/Pro to facilitate software techniques that result in significant application performance improvements. Specific demonstrations with Virtual Interface Architecture and TCP/IP over Ethernet are offered

    Techniques for optimizing time-stepped simulations

    Get PDF
    Master'sMASTER OF SCIENC

    A framework for evaluating the impact of communication on performance in large-scale distributed urban simulations

    Get PDF
    A primary motivation for employing distributed simulation is to enable the execution of large-scale simulation workloads that cannot be handled by the resources of a single stand-alone computing node. To make execution possible, the workload is distributed among multiple computing nodes connected to one another via a communication network. The execution of a distributed simulation involves alternating phases of computation and communication to coordinate the co-operating nodes and ensure correctness of the resulting simulation outputs. Reliably estimating the execution performance of a distributed simulation can be difficult due to non-deterministic execution paths involved in alternating computation and communication operations. However, performance estimates are useful as a guide for the simulation time that can be expected when using a given set of computing resources. Performance estimates can support decisions to commit time and resources to running distributed simulations, especially where significant amounts of funds or computing resources are necessary. Various performance estimation approaches are employed in the distributed computing literature, including the influential Bulk Synchronous Parallel (BSP) and LogP models. Different approaches make various assumptions that render them more suitable for some applications than for others. Actual performance depends on characteristics inherent to each distributed simulation application. An important aspect of these individual characteristics is the dynamic relationship between the communication and computation phases of the distributed simulation application. This work develops a framework for estimating the performance of distributed simulation applications, focusing mainly on aspects relevant to the dynamic relationship between communication and computation during distributed simulation execution. The framework proposes a meta-simulation approach based on the Multi-Agent Simulation (MAS) paradigm. Using the approach proposed by the framework, meta-simulations can be developed to investigate the performance of specific distributed simulation applications. The proposed approach enables the ability to compare various what-if scenarios. This ability is useful for comparing the effects of various parameters and strategies such as the number of computing nodes, the communication strategy, and the workload-distribution strategy. The proposed meta-simulation approach can also aid a search for optimal parameters and strategies for specific distributed simulation applications. The framework is demonstrated by implementing a meta-simulation which is based on case studies from the Urban Simulation domain

    A framework for evaluating the impact of communication on performance in large-scale distributed urban simulations

    Get PDF
    A primary motivation for employing distributed simulation is to enable the execution of large-scale simulation workloads that cannot be handled by the resources of a single stand-alone computing node. To make execution possible, the workload is distributed among multiple computing nodes connected to one another via a communication network. The execution of a distributed simulation involves alternating phases of computation and communication to coordinate the co-operating nodes and ensure correctness of the resulting simulation outputs. Reliably estimating the execution performance of a distributed simulation can be difficult due to non-deterministic execution paths involved in alternating computation and communication operations. However, performance estimates are useful as a guide for the simulation time that can be expected when using a given set of computing resources. Performance estimates can support decisions to commit time and resources to running distributed simulations, especially where significant amounts of funds or computing resources are necessary. Various performance estimation approaches are employed in the distributed computing literature, including the influential Bulk Synchronous Parallel (BSP) and LogP models. Different approaches make various assumptions that render them more suitable for some applications than for others. Actual performance depends on characteristics inherent to each distributed simulation application. An important aspect of these individual characteristics is the dynamic relationship between the communication and computation phases of the distributed simulation application. This work develops a framework for estimating the performance of distributed simulation applications, focusing mainly on aspects relevant to the dynamic relationship between communication and computation during distributed simulation execution. The framework proposes a meta-simulation approach based on the Multi-Agent Simulation (MAS) paradigm. Using the approach proposed by the framework, meta-simulations can be developed to investigate the performance of specific distributed simulation applications. The proposed approach enables the ability to compare various what-if scenarios. This ability is useful for comparing the effects of various parameters and strategies such as the number of computing nodes, the communication strategy, and the workload-distribution strategy. The proposed meta-simulation approach can also aid a search for optimal parameters and strategies for specific distributed simulation applications. The framework is demonstrated by implementing a meta-simulation which is based on case studies from the Urban Simulation domain

    Optimizations and Cost Models for multi-core architectures: an approach based on parallel paradigms

    Get PDF
    The trend in modern microprocessor architectures is clear: multi-core chips are here to stay, and researchers expect multiprocessors with 128 to 1024 cores on a chip in some years. Yet the software community is slowly taking the path towards parallel programming: while some works target multi-cores, these are usually inherited from the previous tools for SMP architectures, and rarely exploit specific characteristics of multi-cores. But most important, current tools have no facilities to guarantee performance or portability among architectures. Our research group was one of the first to propose the structured parallel programming approach to solve the problem of performance portability and predictability. This has been successfully demonstrated years ago for distributed and shared memory multiprocessors, and we strongly believe that the same should be applied to multi-core architectures. The main problem with performance portability is that optimizations are effective only under specific conditions, making them dependent on both the specific program and the target architecture. For this reason in current parallel programming (in general, but especially with multi-cores) optimizations usually follows a try-and-decide approach: each one must be implemented and tested on the specific parallel program to understand its benefits. If we want to make a step forward and really achieve some form of performance portability, we require some kind of prediction of the expected performance of a program. The concept of performance modeling is quite old in the world of parallel programming; yet, in the last years, this kind of research saw small improvements: cost models to describe multi-cores are missing, mainly because of the increasing complexity of microarchitectures and the poor knowledge of specific implementation details of current processors. In the first part of this thesis we prove that the way of performance modeling is still feasible, by studying the Tilera TilePro64. The high number of cores on-chip in this processor (64) required the use of several innovative solutions, such as a complex interconnection network and the use of multiple memory interfaces per chip. For these features the TilePro64 can be considered an insight of what to expect in future multi-core processors. The availability of a cycle-accurate simulator and an extensive documentation allowed us to model the architecture, and in particular its memory subsystem, at the accuracy level required to compare optimizations In the second part, focused on optimizations, we cover one of the most important issue of multi-core architectures: the memory subsystem. In this area multi-core strongly differs in their structure w.r.t off-chip parallel architectures, both SMP and NUMA, thus opening new opportunities. In detail, we investigate the problem of data distribution over the memory controllers in several commercial multi-cores, and the efficient use of the cache coherency mechanisms offered by the TilePro64 processor. Finally, by using the performance model, we study different implementations, derived from the previous optimizations, of a simple test-case application. We are able to predict the best version using only profiled data from a sequential execution. The accuracy of the model has been verified by experimentally comparing the implementations on the real architecture, giving results within 1 − 2% of accuracy

    Simulation of the UKQCD computer

    Get PDF

    Reducing synchronization in distributed parallel programs

    Get PDF
    Developers of scalable libraries and applications for distributed-memory parallel systems face many challenges to attaining high performance. These challenges include communication latency, critical path delay, suboptimal scheduling, load imbalance, and system noise. These challenges are often defined and measured relative to points of broad synchronization in the program’s execution. Given the way in which many algorithms are defined and systems are implemented, gauging the above challenges at synchronization points is not unreasonable. In this thesis, I attempt to demonstrate that in many cases, those synchronization points are themselves the core issue behind these challenges. In some cases, the synchronizing operations cause a program to incur the costs from these challenges. In other cases, the presence of synchronization potentially exacerbates these problems. Through a simple performance model, I demonstrate that making synchronization less frequent can greatly mitigate performance issues. My work and several results in the literature show that many motifs and whole applications can be successfully redesigned to operate with asymptotically less synchronization than their naïve starting points. In exploring these issues, I have identified recurrent patterns across many applications and multiple environments that can guide future efforts more directly toward synchronization-avoiding designs. Thus, I attempt to offer developers the beginnings of a high-level play-book to follow rather than having to rediscover application-specific instances of the patterns
    corecore