3 research outputs found

    Measuring and Understanding Throughput of Network Topologies

    Full text link
    High throughput is of particular interest in data center and HPC networks. Although myriad network topologies have been proposed, a broad head-to-head comparison across topologies and across traffic patterns is absent, and the right way to compare worst-case throughput performance is a subtle problem. In this paper, we develop a framework to benchmark the throughput of network topologies, using a two-pronged approach. First, we study performance on a variety of synthetic and experimentally-measured traffic matrices (TMs). Second, we show how to measure worst-case throughput by generating a near-worst-case TM for any given topology. We apply the framework to study the performance of these TMs in a wide range of network topologies, revealing insights into the performance of topologies with scaling, robustness of performance across TMs, and the effect of scattered workload placement. Our evaluation code is freely available

    Resource management for extreme scale high performance computing systems in the presence of failures

    Get PDF
    2018 Summer.Includes bibliographical references.High performance computing (HPC) systems, such as data centers and supercomputers, coordinate the execution of large-scale computation of applications over tens or hundreds of thousands of multicore processors. Unfortunately, as the size of HPC systems continues to grow towards exascale complexities, these systems experience an exponential growth in the number of failures occurring in the system. These failures reduce performance and increase energy use, reducing the efficiency and effectiveness of emerging extreme-scale HPC systems. Applications executing in parallel on individual multicore processors also suffer from decreased performance and increased energy use as a result of applications being forced to share resources, in particular, the contention from multiple application threads sharing the last-level cache causes performance degradation. These challenges make it increasingly important to characterize and optimize the performance and behavior of applications that execute in these systems. To address these challenges, in this dissertation we propose a framework for intelligently characterizing and managing extreme-scale HPC system resources. We devise various techniques to mitigate the negative effects of failures and resource contention in HPC systems. In particular, we develop new HPC resource management techniques for intelligently utilizing system resources through the (a) optimal scheduling of applications to HPC nodes and (b) the optimal configuration of fault resilience protocols. These resource management techniques employ information obtained from historical analysis as well as theoretical and machine learning methods for predictions. We use these data to characterize system performance, energy use, and application behavior when operating under the uncertainty of performance degradation from both system failures and resource contention. We investigate how to better characterize and model the negative effects from system failures as well as application co-location on large-scale HPC computing systems. Our analysis of application and system behavior also investigates: the interrelated effects of network usage of applications and fault resilience protocols; checkpoint interval selection and its sensitivity to system parameters for various checkpoint-based fault resilience protocols; and performance comparisons of various promising strategies for fault resilience in exascale-sized systems

    Expression and Composition of Optimization-Based Applications for Software-Defined Networking

    Get PDF
    Motivated by the adoption of the Software Defined Networking and its increasing focus on applications for resource management, we propose a novel framework for expressing network optimization applications. Named the SDN Optimization Layer (SOL), the framework and its extensions alleviate the burden of constructing optimization applications by abstracting the low-level details of mathematical optimization techniques such as linear programming. SOL utilizes the path abstraction to express a wide variety of network constraints and resource-management logic. We show that the framework is general and efficient enough to support various classes of applications. We extend SOL to support composition of multiple applications in a fair and resource-efficient way. We demonstrate that SOL’s composition produces better resource efficiency than previously available composition approaches and is tolerant to network variations. Finally, as a case study, we develop a new application for load balancing network intrusion prevention systems, called SNIPS. We highlight the challenges in developing the SNIPS optimization from the ground up, show SOL’s (conceptually) simplified version, and verify that both produce nearly identical solutions.Doctor of Philosoph
    corecore