
    Up-Down Routing Based Deadlock-Free Dynamic Reconfiguration in High Speed Local Area Networks

    Dynamic reconfiguration of a high-speed switched network is the process of changing from one routing function to another while the network remains in operation. Current distributed switch-based interconnected systems require high performance, reliability, and availability. These systems change their topologies due to hot expansion of components and link or node activation and deactivation. Therefore, in order to support hard real-time and distributed multimedia applications over a high-speed network, we need to avoid discarding packets when the topology changes. Thus, a dynamic reconfiguration algorithm updates the routing tables of the interconnected switches according to the new topology without stopping the traffic. Here we propose an improved deadlock-free partial progressive reconfiguration (PPR) technique based on the UP/DOWN routing algorithm, which assigns directions to the links of high-speed switched networks based on a pre-order traversal of the computed spanning tree. This improved technique gives better performance than traditional PPR by minimizing the path length of the packets to be transmitted. Moreover, the proposed reconfiguration strategy makes optimal use of all operational links and reduces traffic congestion in the network. The simulated results are compared with those of traditional PPR.
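    As a minimal sketch of the link-direction assignment the abstract describes (the graph representation, function names, and legality check are illustrative assumptions, not taken from the paper), the Python snippet below builds a spanning tree, numbers the switches by pre-order traversal, and marks each link "up" toward the switch with the smaller pre-order number; a route is then deadlock-free as long as it never follows a "down" link with an "up" link.

```python
from collections import deque

def updown_directions(adjacency, root):
    """Assign up/down directions to the links of a switched network.

    adjacency: dict mapping each switch to a list of neighbouring switches.
    Returns (preorder, direction) where direction[(a, b)] == "up" means that
    traversing the link from a to b moves toward the root side of the tree.
    """
    # 1. Build a BFS spanning tree rooted at `root`.
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nbr in adjacency[node]:
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)

    # 2. Number the switches by pre-order (depth-first) traversal of the tree.
    children = {n: [] for n in adjacency}
    for n, p in parent.items():
        if p is not None:
            children[p].append(n)
    preorder, stack = {}, [root]
    while stack:
        node = stack.pop()
        preorder[node] = len(preorder)
        stack.extend(reversed(children[node]))

    # 3. A link is "up" when it leads to the switch with the smaller pre-order
    #    number; the opposite traversal of the same link is "down".
    direction = {}
    for a in adjacency:
        for b in adjacency[a]:
            direction[(a, b)] = "up" if preorder[b] < preorder[a] else "down"
    return preorder, direction

def route_is_legal(path, direction):
    """Deadlock freedom: a legal route never takes an up link after a down link."""
    seen_down = False
    for a, b in zip(path, path[1:]):
        if direction[(a, b)] == "down":
            seen_down = True
        elif seen_down:
            return False
    return True
```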

    Kernel-assisted and Topology-aware MPI Collective Communication among Multicore or Many-core Clusters

    Multicore or many-core clusters have become the most prominent form of High Performance Computing (HPC) systems. Hardware complexity and hierarchies exist not only in the inter-node layer, i.e., hierarchical networks, but also inside multicore compute nodes, e.g., Non-Uniform Memory Access (NUMA), network-style interconnects, and memory and shared-cache hierarchies. The Message Passing Interface (MPI), the most widely adopted programming model in the HPC community, suffers from decreased performance and portability due to this multi-level hardware complexity. We identified three critical issues specific to collective communication: first, a gap between logical collective topologies and the underlying hardware topologies; second, current MPI communications lack efficient shared-memory message delivery approaches; last, on distributed-memory machines such as multicore clusters, a single approach cannot encompass the extreme variations not only in bandwidth and latency but also in features such as the ability to run multiple concurrent copies simultaneously. To bridge the gap between logical collective topologies and hardware topologies, we developed a distance-aware framework that integrates knowledge of hardware distance into collective algorithms in order to dynamically reshape communication patterns to suit the hardware capabilities. Based on process-distance information, we used graph-partitioning techniques to organize the MPI processes into a multi-level hierarchy mapped onto the hardware characteristics. Meanwhile, we adopted the kernel-assisted one-sided single-copy approach (KNEM) as the default shared-memory delivery method. Via kernel-assisted memory copy, the collective algorithms offload copy tasks onto non-leader/non-root processes to evenly distribute copy workloads among the available cores. Finally, on distributed-memory machines, we developed a technique to compose multi-layered collective algorithms into a single multi-level algorithm with tight interoperability between the levels. This tight collaboration results in more overlap between inter- and intra-node communication. Experimental results confirm that, by leveraging several technologies together, such as kernel-assisted memory copy, the distance-aware framework, and collective-algorithm composition, MPI collectives not only reach the maximum potential performance on a wide variety of platforms, but also deliver a level of performance that is immune to changes of the underlying process-core binding.
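    The two-level, topology-aware decomposition outlined above can be illustrated with a short mpi4py sketch (a hedged example: the communicator split by shared-memory node and the node-leader broadcast are generic idioms assumed here, not the authors' actual framework, and the KNEM copy path is not modeled):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Level 1: group ranks that share a node (shared-memory domain).
node_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)
node_rank = node_comm.Get_rank()

# Level 2: one leader per node (local rank 0) forms the inter-node communicator.
color = 0 if node_rank == 0 else MPI.UNDEFINED
leader_comm = comm.Split(color, key=rank)

def hierarchical_bcast(obj):
    """Two-level broadcast rooted at global rank 0 (a node leader by construction)."""
    # Stage 1: broadcast between node leaders across the network.
    if leader_comm != MPI.COMM_NULL:
        obj = leader_comm.bcast(obj, root=0)
    # Stage 2: each leader re-broadcasts inside its own node.
    return node_comm.bcast(obj, root=0)

if __name__ == "__main__":
    data = {"payload": list(range(4))} if rank == 0 else None
    data = hierarchical_bcast(data)
    assert data["payload"] == [0, 1, 2, 3]
```

Run with, for example, mpiexec -n 8 python hier_bcast.py; stage 1 crosses the network once per node, while stage 2 stays inside each node's shared-memory domain.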

    Exploring InfiniBand Congestion Control

    Congestion control (CC) is used to achieve high performance and good utilization of network resources during high load in lossless interconnection networks. Without CC, congestion that starts at a single node can grow, spread, and degrade the performance of the network, affecting both the contributors to the congestion and other traffic flows. InfiniBand (IB) is one of the communication standards that provides support for congestion control, and the IB standard describes the CC functionality for detecting and resolving congestion. The behaviour of the IB CC mechanism depends on the values of its CC parameters. These values determine characteristics such as how aggressive congestion detection should be, the rate of feedback from the forwarding node that detects congestion to the contributors of the congestion, and how much and for how long the contributors should lower their injection rates. However, there are very few guidelines on how to set the values of the CC parameters for IB CC to be efficient. In this thesis, experiments on a mesh network topology are conducted using OMNeT++ as the simulation platform. A large amount of traffic is generated and fed into the network until congestion arises, and performance is measured with InfiniBand congestion control disabled and enabled; the results from these simulations are compared and analysed. The topology's host-to-switch link capacities are then increased, a search for suitable IB CC parameter values is carried out, and, finally, we learn more about how the IB CC parameters influence performance by focusing on some of them.
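    As a purely illustrative, toy model of the parameter-driven throttling that IB CC performs (the rate table, recovery period, and update rule below are simplified stand-ins chosen for this sketch, not actual InfiniBand CCA fields), the Python snippet lowers a flow's injection rate on each congestion notification and restores it step by step on a recovery timer:

```python
# Toy model of notification-driven injection throttling, loosely inspired by
# IB CC: each congestion notification pushes the flow one step down a rate
# table, and a recovery timer pops it back up one step at a time.

RATE_TABLE = [1.0, 0.8, 0.6, 0.4, 0.2, 0.1]   # fraction of link rate (hypothetical)
RECOVERY_PERIOD = 50                          # time steps between recoveries (hypothetical)

class Flow:
    def __init__(self):
        self.index = 0              # current position in the rate table
        self.last_recovery = 0

    def on_congestion_notification(self, now):
        """React to a notification from a switch that marked this flow's packets."""
        self.index = min(self.index + 1, len(RATE_TABLE) - 1)
        self.last_recovery = now    # restart the recovery timer

    def tick(self, now):
        """Periodically recover toward the full injection rate."""
        if self.index > 0 and now - self.last_recovery >= RECOVERY_PERIOD:
            self.index -= 1
            self.last_recovery = now

    @property
    def injection_rate(self):
        return RATE_TABLE[self.index]

# Example: two notifications early on, then a quiet network.
flow = Flow()
for t in range(200):
    if t in (10, 20):
        flow.on_congestion_notification(t)
    flow.tick(t)
print(flow.injection_rate)   # back to 1.0 once the quiet period is long enough
```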

    Memory Footprint of Locality Information on Many-Core Platforms

    Exploiting the power of HPC platforms requires knowledge of their increasingly complex hardware topologies. Multiple components of the software stack, for instance MPI implementations or OpenMP runtimes, now perform their own topology discovery to find out the available cores and memory, and to better place tasks based on their affinities. We study in this article the impact of this topology discovery in terms of memory footprint. Storing locality information wastes an amount of physical memory that is becoming an issue on many-core platforms on the road to exascale. We demonstrate that this information may be factorized between processes by using a shared-memory region. Our analysis of the physical and virtual memories in supercomputing architectures shows that this shared region can be mapped at the same virtual address in all processes, hence dramatically simplifying the software implementation. Our implementation in hwloc and Open MPI shows a memory footprint that no longer increases with the number of MPI ranks per node. Moreover, the job launch time is decreased by more than a factor of 2 on an Intel Knights Landing Xeon Phi and on a 96-core NUMA platform.
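    The factorization idea can be sketched with Python's standard shared-memory support (a conceptual illustration only: hwloc actually serializes its topology into a shared region mapped at the same virtual address in every process, which this sketch does not reproduce; the segment name and the toy topology dictionary are hypothetical):

```python
# Conceptual sketch: one process on a node publishes the (serialized) locality
# information once; the other processes attach to the same region instead of
# each keeping a private copy, so the per-node footprint stays constant.
import json
from multiprocessing import shared_memory

SHM_NAME = "node_topology"          # hypothetical per-node identifier

def publish_topology(topology: dict) -> shared_memory.SharedMemory:
    """Called by one process per node (e.g. local rank 0)."""
    blob = json.dumps(topology).encode()
    shm = shared_memory.SharedMemory(name=SHM_NAME, create=True, size=len(blob))
    shm.buf[:len(blob)] = blob
    return shm

def attach_topology() -> dict:
    """Called by every other process on the node; reuses the published region."""
    shm = shared_memory.SharedMemory(name=SHM_NAME)
    topology = json.loads(bytes(shm.buf).rstrip(b"\x00").decode())
    shm.close()
    return topology

if __name__ == "__main__":
    owner = publish_topology({"packages": 2, "cores": 96, "numa_nodes": 4})
    print(attach_topology())
    owner.close()
    owner.unlink()
```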