
    An In-Depth Analysis of the Slingshot Interconnect

    The interconnect is one of the most critical components in large-scale computing systems, and its impact on application performance will increase with system size. In this paper, we describe Slingshot, an interconnection network for large-scale computing systems. Slingshot is based on high-radix switches, which allow building exascale and hyperscale datacenter networks with at most three switch-to-switch hops. Moreover, Slingshot provides efficient adaptive routing and congestion control algorithms, as well as highly tunable traffic classes. Slingshot uses an optimized Ethernet protocol, which allows it to be interoperable with standard Ethernet devices while providing high performance to HPC applications. We analyze the extent to which Slingshot provides these features, evaluating it on microbenchmarks and on several applications from the datacenter and AI worlds, as well as on HPC applications. We find that applications running on Slingshot are less affected by congestion compared to previous-generation networks. Comment: To be published in Proceedings of The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '20), 2020.
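
    As a rough illustration of the adaptive-routing idea mentioned in the abstract (not Slingshot's actual, proprietary algorithm), the sketch below shows a UGAL-style choice between a minimal and several non-minimal output ports; the function names, the queue-depth congestion estimate, and the bias constant are assumptions.

        # Minimal sketch of adaptive routing: prefer the minimal path unless a
        # non-minimal path looks sufficiently less congested. All names and the
        # bias constant are illustrative, not taken from Slingshot.
        def choose_output_port(minimal_port, nonminimal_ports, queue_depth, bias=2.0):
            best, best_cost = minimal_port, queue_depth(minimal_port)
            for port in nonminimal_ports:
                cost = bias * queue_depth(port)   # longer routes pay a path-length penalty
                if cost < best_cost:
                    best, best_cost = port, cost
            return best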

    Secure Multi-Path Selection with Optimal Controller Placement Using Hybrid Software-Defined Networks with Optimization Algorithm

    The Internet's growth in popularity requires computer networks to provide both agility and resilience, and traditional networking systems have recently been unable to satisfy these needs. Software Defined Networking (SDN) is known as a paradigm shift in the networking industry, and many organizations have adopted it because of its transmission efficiency. Striking the right balance between SDN and legacy switching capabilities enables successful network scenarios in hybrid architectures. Therefore, this work targets a hybrid network scenario in which the external perimeter transport device is replaced with an SDN device in the service provider network. As networks move away from older technologies, hybrid SDN includes both legacy and SDN switches. Existing SDN models have limitations such as overfitting, local optimal trapping, and poor path selection efficiency. This paper proposes a Deep Kronecker Neural Network (DKNN), combined with a moderate optimization method, to improve the efficiency of multipath selection in SDN. Dynamic resource scheduling is used for the reward function, and learning performance is improved by a deep reinforcement learning (DRL) technique. The centralized SDN controller acts as the network brain in the control plane, and selecting the best SDN controller is among the most important duties in the network; the controller is vulnerable to intrusions and can become a network bottleneck. This study therefore presents an intrusion detection system (IDS) based on the SDN model that runs as an application module within the controller, using a contractive auto-encoder for feature extraction and a triple-attention-based classifier for classification. Additionally, this study leverages OpenDayLight (ODL), one of the best-performing SDN controllers and the basis for many others, which provides an open northbound API and supports multiple southbound protocols. One of the main issues to be addressed in this setting is the multi-controller placement problem (CPP), specifically when aspects such as interruption, capacity, authenticity, and load distribution are considered. Introducing the scenario concept, the CPP is formulated as a robust optimization problem that considers changes in network status due to power outages, controller capacity, load fluctuations, and changes in switch demand. Therefore, to improve network performance, the number and placement of controllers are optimized across different topologies using simulated annealing and the modified Dragonfly optimization algorithm (MDOA).
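
    The controller placement part of the abstract lends itself to a small sketch: a plain simulated-annealing loop over candidate controller sites. The cost() callback, the parameters, and the data layout below are assumptions and do not reproduce the paper's MDOA variant or its robust, scenario-based objective.

        import math, random

        # Illustrative sketch of controller placement via simulated annealing:
        # repeatedly move one controller and accept worse placements with a
        # temperature-dependent probability. cost() stands in for the
        # scenario-based objective (latency, load, failures).
        def anneal_placement(switches, k, cost, iters=10000, t0=1.0, alpha=0.999):
            placement = random.sample(switches, k)        # initial controller locations
            cur_cost = cost(placement)
            best, best_cost, t = list(placement), cur_cost, t0
            for _ in range(iters):
                cand = list(placement)
                cand[random.randrange(k)] = random.choice(switches)   # perturb one site
                cand_cost = cost(cand)
                if cand_cost < cur_cost or random.random() < math.exp((cur_cost - cand_cost) / t):
                    placement, cur_cost = cand, cand_cost
                    if cur_cost < best_cost:
                        best, best_cost = list(placement), cur_cost
                t *= alpha                                # cool the temperature
            return best, best_cost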

    Flare: Flexible In-Network Allreduce

    The allreduce operation is one of the most commonly used communication routines in distributed applications. To improve its bandwidth and to reduce network traffic, this operation can be accelerated by offloading it to network switches, which aggregate the data received from the hosts and send the aggregated result back to them. However, existing solutions provide limited customization opportunities and might deliver suboptimal performance when dealing with custom operators and data types, with sparse data, or when reproducibility of the aggregation is a concern. To deal with these problems, in this work we design a flexible programmable switch using PsPIN, a RISC-V architecture implementing the sPIN programming model, as a building block. We then design, model, and analyze different algorithms for executing the aggregation on this architecture, showing performance improvements compared to state-of-the-art approaches.
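
    The core switch-side mechanism, aggregating contributions from every host before forwarding a single result, can be sketched as follows. This is a host-language illustration only; Flare's actual handlers are sPIN kernels running on the PsPIN RISC-V cores, and the class, field names, and sum operator here are assumptions.

        # Minimal sketch of switch-side allreduce aggregation: accumulate each
        # host's contribution and release the result once all expected hosts
        # have been seen.
        class AggregationSlot:
            def __init__(self, num_hosts, vec_len):
                self.remaining = num_hosts
                self.acc = [0] * vec_len

            def on_packet(self, payload):
                for i, value in enumerate(payload):   # element-wise reduction (sum)
                    self.acc[i] += value
                self.remaining -= 1
                # full result -> ready to be sent back to the hosts
                return self.acc if self.remaining == 0 else None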

    Capturing the impact of external interference on HPC application performance

    HPC applications are large software packages with high computation and storage requirements. To meet these requirements, the architectures of supercomputers are continuously evolving and their capabilities are continuously increasing. Present-day supercomputers have achieved petaflops of computational power by utilizing thousands to millions of compute cores, connected through specialized communication networks, and are equipped with petabytes of storage using a centralized I/O subsystem. While fulfilling the high resource demands of HPC applications, such a design also entails its own challenges. Applications running on these systems own the computation resources exclusively, but share the communication interconnect and the I/O subsystem with other concurrently running applications. Simultaneous access to these shared resources causes contention and inter-application interference, leading to degraded application performance. Inter-application interference is one of the sources of run-to-run variation. While other sources of variation, such as operating system jitter, have been investigated before, this doctoral thesis specifically focuses on inter-application interference and studies it from the perspective of an application. Variation in execution time not only causes uncertainty and affects user expectations (especially during performance analysis), but also causes suboptimal usage of HPC resources. Therefore, this thesis aims to evaluate inter-application interference, establish trends among applications under contention, and approximate the impact of external influences on the runtime of an application. To this end, this thesis first presents a method to correlate the performance of applications running side-by-side. The method divides the runtime of a system into globally synchronized, fine-grained time slices for which application performance data is recorded separately. The evaluation of the method demonstrates that correlating application performance data can identify inter-application interference. The thesis further uses the method to study I/O interference and shows that file access patterns are a significant factor in determining the interference potential of an application. This thesis also presents a technique to estimate the impact of external influences on an application run. The technique introduces the concept of intrinsic performance characteristics to cluster similar application execution segments. Anomalies in the cluster are the result of external interference. An evaluation with several benchmarks shows high accuracy in estimating the impact of interference from a single application run. The contributions of this thesis will help establish interference trends and devise interference mitigation techniques. Similarly, estimating the impact of external interference will restore user expectations and help performance analysts separate application performance from external influence
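
    A minimal sketch of the time-slice correlation idea described above, assuming per-slice metric samples (for example, achieved I/O bandwidth) recorded for two co-running applications over globally synchronized slices; the metric choice and function names are assumptions, not the thesis' implementation.

        import numpy as np

        # Correlate per-slice metric series from two co-scheduled applications.
        # A strong negative correlation hints at contention on a shared resource.
        def slice_correlation(metric_app_a, metric_app_b):
            a, b = np.asarray(metric_app_a), np.asarray(metric_app_b)
            return np.corrcoef(a, b)[0, 1]   # Pearson correlation across time slices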

    Intelligent middleware for HPC systems to improve performance and energy cost efficiency

    High-performance computing (HPC) systems play an essential role in large-scale scientific computations. As the number of nodes in HPC systems continues to increase, their power consumption leads to larger energy costs. The energy costs pose a financial burden on maintaining HPC systems, which will be more challenging on future extreme-scale systems where the number of nodes and power consumption are expected to further grow. To support this growth, higher degrees of network and memory resource sharing are implemented, causing a substantial increase in performance variation and degradation. These challenges call for innovations in HPC system middleware that reduce energy cost without trading off performance. By taking the performance of an HPC system as a first-order constraint, this thesis establishes that HPC systems can participate in demand response programs while providing performance guarantees through a novel design of the middleware. Well-designed middleware also enables enhanced performance by mitigating resource contention induced by energy or cost restrictions. This thesis aims to realize these goals through two complementary approaches. First, this thesis proposes novel policies for HPC systems to enable their participation in emerging power markets, where participants reduce their energy costs by following market requirements. Our policies guarantee that the Quality-of-Service (QoS) of jobs does not drop below given constraints and systematically optimize cost reduction based on large deviation analysis in queueing theory. Through experiments on a real-world cluster whose power consumption is regulated to follow a dynamically changing power target, this thesis shows that HPC systems can participate in emerging power programs without violating the QoS constraints of jobs. Second, this thesis proposes novel resource management strategies to improve the performance of HPC systems. Better resource management can mitigate contention that causes performance degradation and poor system utilization. To resolve network contention, we design an intelligent job allocation policy for HPC systems that incorporate the state-of-the-art dragonfly network topology. Our allocation policy mitigates network contention, reduces network communication latency, and consequently improves the performance of the systems. As some of the latest HPC systems support the collection of high-granularity network performance metrics at runtime, we also propose a method to quantify the impact of network congestion and demonstrate that a network-data-driven job allocation policy improves HPC performance by avoiding network traffic hot spots.
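
    The network-data-driven allocation idea at the end of the abstract can be sketched as a simple ranking of candidate node groups by a congestion score built from runtime network counters; the counter names and the scoring formula below are assumptions rather than the thesis' actual policy.

        # Rank free node groups by an average congestion estimate derived from
        # per-node router stall counters, and place the job on the least
        # congested group.
        def pick_node_group(free_groups, stall_counters):
            def congestion(group):
                return sum(stall_counters[node] for node in group) / len(group)
            return min(free_groups, key=congestion)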

    Canary: Congestion-Aware In-Network Allreduce Using Dynamic Trees

    The allreduce operation is an essential building block for many distributed applications, ranging from the training of deep learning models to scientific computing. In an allreduce operation, data from multiple hosts is aggregated together and then broadcasted to each host participating in the operation. Allreduce performance can be improved by a factor of two by aggregating the data directly in the network. Switches aggregate data coming from multiple ports before forwarding the partially aggregated result to the next hop. In all existing solutions, each switch needs to know the ports from which it will receive the data to aggregate. However, this forces packets to traverse a predefined set of switches, making these solutions prone to congestion. For this reason, we design Canary, the first congestion-aware in-network allreduce algorithm. Canary uses load balancing algorithms to forward packets on the least congested paths. Because switches do not know from which ports they will receive the data to aggregate, they use timeouts to aggregate the data in a best-effort way. We develop a P4 Canary prototype and evaluate it on a Tofino switch. We then validate Canary through simulations on large networks, showing performance improvements up to 40% compared to the state-of-the-art
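
    Canary's timeout-based, best-effort aggregation can be sketched as follows; the packet interface, the sum operator, and the timeout value are assumptions, and the real data plane is a P4 program on a Tofino switch rather than host code.

        import time

        # Aggregate whatever arrives for one allreduce chunk until a timeout
        # expires, then forward the (possibly partial) result upstream so that
        # downstream switches or hosts can finish the reduction.
        def aggregate_best_effort(incoming, timeout_s=0.001):
            acc, deadline = None, time.monotonic() + timeout_s
            while time.monotonic() < deadline:
                pkt = incoming.poll()              # hypothetical non-blocking receive
                if pkt is None:
                    continue
                acc = pkt if acc is None else [a + b for a, b in zip(acc, pkt)]
            return acc                             # partial aggregate is still useful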

    Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging

    Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD, and empirically show that it retains convergence rates similar to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput (e.g., 2.1x on 1,024 GPUs for reinforcement learning), and achieves the fastest time-to-solution (e.g., the highest score using the shortest training time for Transformer). Comment: Published in IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS), vol. 32, no. 7, pp. 1725-1739, 1 July 2021.
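
    A single group-averaging step of the kind the abstract describes can be sketched with the standard torch.distributed API: each rank sums and averages its parameters only within its current subgroup rather than globally. Group construction, scheduling, and the wait-avoidance mechanism are omitted, so this is an assumption-laden illustration rather than WAGMA-SGD itself.

        import torch
        import torch.distributed as dist

        # Average model parameters within one subgroup only; the subgroup is a
        # torch.distributed process group created elsewhere (assumed).
        def group_average(model, group, group_size):
            for p in model.parameters():
                dist.all_reduce(p.data, op=dist.ReduceOp.SUM, group=group)
                p.data /= group_size               # average over the subgroup, not all ranks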

    Enhancement of Metaheuristic Algorithm for Scheduling Workflows in Multi-fog Environments

    Whether in computer science, engineering, or economics, optimization lies at the heart of any challenge involving decision-making. Choosing between several options is part of the decision-making process, and our decisions are driven by the desire to pick the "better" option. An objective function or performance index describes how the goodness of an alternative is assessed, and the theory and methods of optimization are concerned with picking the best option. There are two types of optimization methods: deterministic and stochastic. The first is a traditional approach that works well for small and linear problems, but it struggles to address most real-world problems, which are high-dimensional, nonlinear, and complex in nature. As an alternative, stochastic optimization algorithms are specifically designed to tackle these types of challenges and are more common nowadays. This study proposes two stochastic, robust swarm-based metaheuristic optimization methods. Both are hybrid algorithms, formulated by combining the Particle Swarm Optimization and Salp Swarm Optimization algorithms, and they are applied to an important and thought-provoking problem: scientific workflow scheduling in multiple fog environments. Many computing environments, such as fog computing, are plagued by security attacks that must be handled. DDoS attacks are particularly harmful to fog computing environments, as they occupy the fog's resources and keep them busy. Thus, fog environments generally have fewer resources available during these attacks, which affects the scheduling of submitted Internet of Things (IoT) workflows. Nevertheless, current systems disregard the impact of DDoS attacks in their scheduling process, increasing the number of workflows that miss deadlines as well as the number of tasks that are offloaded to the cloud. Hence, this study proposes a hybrid optimization algorithm as a solution for the workflow scheduling issue across various fog computing locations. The proposed algorithm combines the Salp Swarm Algorithm (SSA) and Particle Swarm Optimization (PSO). To deal with the effects of DDoS attacks on fog computing locations, two discrete-time Markov-chain schemes are used: one calculates the average network bandwidth available in each fog, while the other determines the average number of virtual machines available in each fog. DDoS attacks are addressed at various levels, and the approach predicts their influence on fog environments. Based on the simulation results, the proposed method can significantly reduce the number of offloaded tasks transferred to cloud data centers and decrease the number of workflows with missed deadlines. Moreover, the significance of green fog computing is growing, as energy consumption plays an essential role in determining maintenance expenses and carbon dioxide emissions. Efficient scheduling methods can mitigate energy usage by allocating tasks to the most appropriate resources, considering the energy efficiency of each individual resource. To address this challenge, the proposed algorithm integrates the Dynamic Voltage and Frequency Scaling (DVFS) technique, which is commonly employed to enhance the energy efficiency of processors.
    The experimental findings demonstrate that combining the proposed method with the DVFS technique yields improved outcomes, including a reduction in energy consumption. Consequently, this approach emerges as a more environmentally friendly and sustainable solution for fog computing environments.
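
    The DVFS trade-off that such a scheduler exploits can be illustrated with the classic dynamic-power model: lowering the frequency reduces power but stretches runtime, so the cheapest operating point that still meets the deadline is chosen per task. The constants, the cycle-based runtime model, and the function names below are assumptions, not the thesis' exact formulation.

        # Energy and runtime of a task at a given voltage/frequency pair,
        # using the classic dynamic-power model P ~ C * V^2 * f.
        def energy_at(cycles, voltage, freq_hz, capacitance=1e-9):
            runtime = cycles / freq_hz                     # longer at lower frequency
            power = capacitance * voltage ** 2 * freq_hz   # dynamic power estimate
            return power * runtime, runtime

        # Choose the lowest-energy (voltage, frequency) pair that meets the
        # deadline; fall back to the fastest point if none is feasible.
        def pick_operating_point(cycles, points, deadline):
            feasible = [(energy_at(cycles, v, f)[0], v, f)
                        for v, f in points
                        if energy_at(cycles, v, f)[1] <= deadline]
            return min(feasible)[1:] if feasible else max(points, key=lambda vf: vf[1])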