215 research outputs found

    A Framework for Cyber Vulnerability Assessments of InfiniBand Networks

    InfiniBand is a popular Input/Output interconnect technology used in High Performance Computing clusters. It is employed in over a quarter of the world’s 500 fastest computer systems. Although it was created to provide extremely low network latency with a high Quality of Service, the cybersecurity aspects of InfiniBand have yet to be thoroughly investigated. The InfiniBand Architecture was designed as a data center technology, logically separated from the Internet, so defensive mechanisms such as packet encryption were not implemented. Cyber communities do not appear to have taken an interest in InfiniBand, but that is likely to change as attackers branch out from traditional computing devices. This thesis considers the security implications of InfiniBand features and constructs a framework for conducting Cyber Vulnerability Assessments. Several attack primitives are tested and analyzed. Finally, new cyber tools and security devices for InfiniBand are proposed, and changes to existing products are recommended.

    Routing on the Channel Dependency Graph: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks

    In the pursuit of ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing started to scale out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms are becoming increasingly inapplicable due to irregular topologies, which are either irregular by design or, more often, the result of hardware failures. Exchanging faulty network components potentially requires whole-system downtime, further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, both in terms of hardware and software management, are necessary to mitigate negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables. The fail-in-place strategy, a well-established method in storage systems of repairing only critical component failures, is a feasible solution for current and future HPC interconnects as well as other large-scale installations such as data center networks. However, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system. This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while it is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property-preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to the routing-deadlock problem. Therefore, this thesis further advances the state of the art by introducing a novel concept of routing on the channel dependency graph, which allows the design of a universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, which are a common hardware limitation. This disruptive innovation enables implicit deadlock avoidance during path calculation, instead of solving both problems separately, as all previous solutions do.
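    For readers unfamiliar with the underlying deadlock criterion, a destination-based routing function is deadlock-free when its channel dependency graph (CDG) is acyclic. The Python sketch below is a minimal illustration of that check only, not the thesis's routing algorithm; the toy ring topology and channel names are assumptions.

    ```python
    # Illustrative sketch (not the thesis's algorithm): a routing function is
    # deadlock-free if its channel dependency graph (CDG) is acyclic.
    from collections import defaultdict

    def channel_dependency_graph(routes):
        """Each route is a list of channels (here, link tuples). Consecutive
        channels on a route create a dependency edge in the CDG."""
        cdg = defaultdict(set)
        for route in routes:
            for c_in, c_out in zip(route, route[1:]):
                cdg[c_in].add(c_out)
        return cdg

    def has_cycle(cdg):
        """Iterative three-colour DFS; a back edge means the CDG is cyclic."""
        WHITE, GREY, BLACK = 0, 1, 2
        colour = defaultdict(int)
        for start in list(cdg):
            if colour[start] != WHITE:
                continue
            colour[start] = GREY
            stack = [(start, iter(cdg[start]))]
            while stack:
                node, it = stack[-1]
                for nxt in it:
                    if colour[nxt] == GREY:
                        return True  # back edge: cyclic CDG, deadlock possible
                    if colour[nxt] == WHITE:
                        colour[nxt] = GREY
                        stack.append((nxt, iter(cdg[nxt])))
                        break
                else:
                    colour[node] = BLACK
                    stack.pop()
        return False

    # Toy 3-node ring with purely clockwise routes: the CDG forms a cycle.
    routes = [[('A', 'B'), ('B', 'C')],
              [('B', 'C'), ('C', 'A')],
              [('C', 'A'), ('A', 'B')]]
    print(has_cycle(channel_dependency_graph(routes)))  # True
    ```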

    Composable architecture for rack scale big data computing

    The rapid growth of cloud computing, both in terms of the spectrum and volume of cloud workloads, necessitates revisiting the traditional datacenter design based on rack-mountable servers. Next-generation datacenters need to offer enhanced support for: (i) fast-changing system configuration requirements due to workload constraints, (ii) timely adoption of emerging hardware technologies, and (iii) maximal sharing of systems and subsystems in order to lower costs. Disaggregated datacenters, constructed as a collection of individual resources such as CPU, memory, disks etc., and composed into workload execution units on demand, are an interesting new trend that can address the above challenges. In this paper, we demonstrate the feasibility of composable systems by building a rack-scale composable system prototype using a PCIe switch. Through empirical approaches, we develop an assessment of the opportunities and challenges of leveraging the composable architecture for rack-scale cloud datacenters, with a focus on big data and NoSQL workloads. In particular, we compare and contrast the programming models that can be used to access the composable resources, and develop the implications for network and resource provisioning and management for the rack-scale architecture.
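    As a rough illustration of the on-demand composition idea (this is not the paper's prototype or its PCIe programming model; the resource types and quantities are assumptions), an execution unit can be thought of as an allocation drawn from shared rack-level pools:

    ```python
    # Toy sketch of on-demand composition (not the paper's system): draw CPU,
    # memory and disk from disaggregated rack pools to form an execution unit.
    from dataclasses import dataclass

    @dataclass
    class ResourcePool:
        cpus: int
        memory_gb: int
        disks: int

    @dataclass
    class ExecutionUnit:
        cpus: int
        memory_gb: int
        disks: int

    def compose(pool: ResourcePool, req: ExecutionUnit) -> ExecutionUnit:
        """Allocate a workload execution unit from the shared pool, or raise."""
        if (req.cpus > pool.cpus or req.memory_gb > pool.memory_gb
                or req.disks > pool.disks):
            raise RuntimeError("insufficient disaggregated resources in this rack")
        pool.cpus -= req.cpus
        pool.memory_gb -= req.memory_gb
        pool.disks -= req.disks
        return req

    rack = ResourcePool(cpus=256, memory_gb=4096, disks=48)   # hypothetical capacity
    nosql_node = compose(rack, ExecutionUnit(cpus=16, memory_gb=128, disks=4))
    print(rack)  # remaining capacity after composing one NoSQL worker
    ```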

    Management, Optimization and Evolution of the LHCb Online Network

    The LHCb experiment is one of the four large particle detectors running at the Large Hadron Collider (LHC) at CERN. It is a forward single-arm spectrometer dedicated to testing the Standard Model through precision measurements of Charge-Parity (CP) violation and rare decays in the b quark sector. The LHCb experiment will operate at a luminosity of 2x10^32 cm^-2 s^-1; the proton-proton bunch crossing rate will be approximately 10 MHz. To select the interesting events, a two-level trigger scheme is applied: the first level trigger (L0) and the high level trigger (HLT). The L0 trigger is implemented in custom hardware, while the HLT is implemented in software running on the CPUs of the Event Filter Farm (EFF). The L0 trigger rate is defined at about 1 MHz, and the size of each event is about 35 kByte. It is a serious challenge to handle the resulting data rate (35 GByte/s). The Online system is a key part of the LHCb experiment, providing all the IT services. It consists of three major components: the Data Acquisition (DAQ) system, the Timing and Fast Control (TFC) system and the Experiment Control System (ECS). To provide the services, two large dedicated networks based on Gigabit Ethernet are deployed: one for DAQ and another one for ECS, which are referred to as the Online network in general. A large network needs sophisticated monitoring for its successful operation. Commercial network management systems are quite expensive and difficult to integrate into the LHCb ECS. A custom network monitoring system has been implemented based on a Supervisory Control And Data Acquisition (SCADA) system called PVSS, which is used by the LHCb ECS. It is a homogeneous part of the LHCb ECS. In this thesis, it is demonstrated how a large scale network can be monitored and managed using tools originally made for industrial supervisory control. The thesis is organized as follows: Chapter 1 gives a brief introduction to the LHC and B physics at the LHC, then describes all sub-detectors and the trigger and DAQ system of LHCb from structure to performance. Chapter 2 first introduces the LHCb Online system and the dataflow, then focuses on the Online network design and its optimization. In Chapter 3, the SCADA system PVSS is introduced briefly, then the architecture and implementation of the network monitoring system are described in detail, including the front-end processes, the data communication and the supervisory layer. Chapter 4 first discusses packet sampling theory and one of the packet sampling mechanisms, sFlow, then demonstrates the application of sFlow to network troubleshooting, traffic monitoring and anomaly detection. In Chapter 5, the upgrade of the LHC and LHCb is introduced, the possible architecture of the DAQ is discussed, and two candidate internetworking technologies (high-speed Ethernet and InfiniBand) are compared in different aspects for the DAQ. Three schemes based on 10 Gigabit Ethernet are presented and studied. Chapter 6 is a general summary of the thesis.
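    Two of the quantitative points above can be made concrete. The 35 GByte/s figure follows directly from the 1 MHz L0 accept rate times the ~35 kByte event size. The sFlow monitoring discussed in Chapter 4 rests on a simple sampling estimate: with 1-in-N packet sampling, scaling the sampled byte count by N estimates the total traffic on a link. The sketch below illustrates that estimate only; the sampling ratio and frame sizes are assumed values, not figures from the thesis.

    ```python
    # Minimal sketch of sFlow-style traffic estimation (illustrative values,
    # not the thesis's monitoring system): with 1-in-N packet sampling, scaling
    # the sampled byte count by N estimates the total traffic on a link.
    def estimate_traffic(sampled_frame_sizes, sampling_ratio_n, interval_s):
        """Return estimated bits per second from sampled frame sizes (bytes)."""
        sampled_bytes = sum(sampled_frame_sizes)
        estimated_bytes = sampled_bytes * sampling_ratio_n
        return estimated_bytes * 8 / interval_s

    # e.g. 500 sampled frames of ~1000 bytes at 1-in-2048 sampling over 10 s
    samples = [1000] * 500
    print(f"{estimate_traffic(samples, 2048, 10) / 1e9:.2f} Gbit/s")  # ~0.82 Gbit/s
    ```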

    High Performance Computing using Infiniband-based clusters

    L'abstract è presente nell'allegato / the abstract is in the attachmen

    Fault-tolerant routing in SCI networks

    Fault-tolerant routing has been a hot topic in the academic community for quite some time now, and several different approaches have been suggested. In the interconnect industry, however, fault-tolerant routing has not been implemented to the same extent. In this thesis we have adapted and implemented a local fault-tolerant routing approach in the SCI interconnect technology produced by Dolphin Interconnect Solutions. The existing technology used in SCI is based on a static reconfiguration approach, where traffic is disabled while the new routing is calculated by a central front-end and distributed to the nodes. Our algorithm builds upon the principle of enabling the nodes to make routing decisions from the information that is available to them locally, and having the rest of the nodes in the cluster prepared for this unexpected traffic. The algorithm has been tested on real hardware, and we have shown that it can handle several levels of traffic in the network. The tests have also shown that our method gives the same performance before and after the error occurs, provided the packets face the same conditions, such as competing traffic and link length. Our routing algorithm is currently integrated as a part of the Dolphin Interconnect Solutions driver in the latest official release.
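    A minimal sketch of the local decision principle described above (illustrative only; the port names and table layout are assumptions, not the Dolphin/SCI implementation):

    ```python
    # Illustrative sketch of a local fault-tolerant routing decision: when the
    # preferred output link is down, the node falls back to an alternative it
    # knows about locally, without waiting for a central front-end to recompute
    # and redistribute routing tables.
    def select_output(dest, routing_table, link_up):
        """routing_table maps destination -> ordered list of candidate output
        links; link_up maps link -> bool, reflecting only locally visible state."""
        for link in routing_table[dest]:
            if link_up.get(link, False):
                return link
        raise RuntimeError(f"no live path towards node {dest}")

    routing_table = {"node7": ["ring0_out", "ring1_out"]}   # hypothetical names
    link_up = {"ring0_out": False, "ring1_out": True}       # ring0 link has failed
    print(select_output("node7", routing_table, link_up))   # -> ring1_out
    ```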

    The global unified parallel file system (GUPFS) project: FY 2002 activities and results


    SDN-based control and orchestration of optical data centre networks

    The use of the Internet is linked with the constant technological change the world is undergoing, which drives the pressing need to update the infrastructure of current data centers. The amount of traffic moving within data centers has increased significantly in the past few years, so better alternatives should be studied, as the use of Ethernet or InfiniBand is no longer adequate in terms of scalability and flexibility. Optical technology is one possible solution, as it provides high bandwidth, low latency and overall better performance. However, the physical resources that form a data center should be managed efficiently. To make optimal use of them, the concept of the virtual data center appeared, where resources are orchestrated with the aim of offering a cloud infrastructure to a third party. In this context, OpenStack has become one of the most popular open-source platforms for building public or private clouds, based on three important aspects: compute, storage and network. But the flexibility of these cloud infrastructures depends on their being scalable and dynamic. Here, Software Defined Networking (SDN) and Network Function Virtualization (NFV) play an important role in data centers, as they allow complex network capabilities to be built on demand. In this project, we experimentally demonstrate the programmable OPsquare data center network empowered by an SDN control plane. The implementation is based on monitoring real-time network statistics, so that actions such as network slice provisioning and reconfiguration, packet priority class assignment or dynamic load balancing can be performed in order to achieve the required Quality of Service level. This project is a cooperation between TU/e (Eindhoven University of Technology, The Netherlands) and UPC (Universitat Politècnica de Catalunya, Barcelona).
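    As a rough sketch of the monitoring-driven control loop described above (the statistic names, thresholds and actions are assumptions, not the OPsquare controller's actual API):

    ```python
    # Hedged sketch of an SDN control step driven by real-time statistics:
    # poll per-slice measurements and decide on reconfiguration actions when
    # QoS targets are missed. Names and thresholds are illustrative only.
    from dataclasses import dataclass

    @dataclass
    class SliceStats:
        slice_id: str
        utilization: float   # fraction of slice capacity in use
        latency_us: float    # observed latency in microseconds

    def control_step(stats, util_threshold=0.8, latency_budget_us=10.0):
        """Return a list of (slice_id, action) decisions for one control interval."""
        actions = []
        for s in stats:
            if s.latency_us > latency_budget_us:
                actions.append((s.slice_id, "raise_priority_class"))
            if s.utilization > util_threshold:
                actions.append((s.slice_id, "rebalance_or_expand_slice"))
        return actions

    stats = [SliceStats("tenant-A", 0.92, 7.5), SliceStats("tenant-B", 0.40, 14.2)]
    print(control_step(stats))
    # [('tenant-A', 'rebalance_or_expand_slice'), ('tenant-B', 'raise_priority_class')]
    ```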