8 research outputs found

    Hierarchical Parallel Matrix Multiplication on Large-Scale Distributed Memory Platforms

    Matrix multiplication is a very important computation kernel, both in its own right as a building block of many scientific applications and as a popular representative for other scientific applications. Cannon's algorithm, which dates back to 1969, was the first efficient algorithm for parallel matrix multiplication providing theoretically optimal communication cost. However, this algorithm requires a square number of processors. In the mid-1990s, the SUMMA algorithm was introduced. SUMMA overcomes this shortcoming of Cannon's algorithm, as it can also be used on a non-square number of processors. Since then, the number of processors in HPC platforms has increased by two orders of magnitude, making the contribution of communication to the overall execution time more significant. Therefore, state-of-the-art parallel matrix multiplication algorithms should be revisited to reduce the communication cost further. This paper introduces a new parallel matrix multiplication algorithm, Hierarchical SUMMA (HSUMMA), which is a redesign of SUMMA. Our algorithm reduces the communication cost of SUMMA by introducing a two-level virtual hierarchy into the two-dimensional arrangement of processors. Experiments on an IBM BlueGene-P demonstrate a reduction of communication cost of up to 2.08 times on 2048 cores and up to 5.89 times on 16384 cores. Comment: 9 pages.
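
    As a rough illustration of the baseline that the abstract builds on, the sketch below simulates SUMMA serially in plain Python: the product is accumulated panel by panel, and each simulated processor on a (possibly non-square) grid updates only its own block of C. The function names `summa` and `matmul`, the `panel` parameter, and the list-of-rows storage are assumptions made for this example. The code does not model real message passing or the two-level hierarchy of HSUMMA, and it is not the authors' implementation.

```python
def matmul(A, B):
    """Reference product of two dense matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def summa(A, B, grid_rows, grid_cols, panel):
    """Serial simulation of SUMMA on a grid_rows x grid_cols processor grid.

    Hypothetical helper for illustration: real SUMMA broadcasts each panel
    along processor rows and columns; here the panels are simply sliced out
    of the full matrices.
    """
    n, inner, m = len(A), len(B), len(B[0])
    assert n % grid_rows == 0 and m % grid_cols == 0 and inner % panel == 0
    br, bc = n // grid_rows, m // grid_cols  # block size owned by each processor

    # Each simulated processor (r, c) owns one br x bc block of C.
    C_local = {(r, c): [[0] * bc for _ in range(br)]
               for r in range(grid_rows) for c in range(grid_cols)}

    for k0 in range(0, inner, panel):
        for (r, c), C_blk in C_local.items():
            # Panel of A that processor row r would receive via a row broadcast.
            a_panel = [row[k0:k0 + panel] for row in A[r * br:(r + 1) * br]]
            # Panel of B that processor column c would receive via a column broadcast.
            b_panel = [row[c * bc:(c + 1) * bc] for row in B[k0:k0 + panel]]
            update = matmul(a_panel, b_panel)
            for i in range(br):
                for j in range(bc):
                    C_blk[i][j] += update[i][j]

    # Gather the distributed blocks back into one matrix.
    return [[C_local[(i // br, j // bc)][i % br][j % bc] for j in range(m)]
            for i in range(n)]
```

    Calling `summa(A, B, 2, 3, 2)` on a 4x6 matrix `A` and a 6x6 matrix `B` uses six simulated processors, a non-square count that Cannon's algorithm would not accept.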

    Node-Type-Based Load-Balancing Routing for Parallel Generalized Fat-Trees

    High-Performance Computing (HPC) clusters are made up of a variety of node types (usually compute, I/O, service, and GPGPU nodes), and applications do not use nodes of different types in the same way. The resulting communication patterns reflect the organization of groups of nodes, and current routing algorithms that are optimal for all-to-all patterns will not always maximize performance for group-specific communications. Since application communication patterns are rarely available beforehand, we choose to rely on node types as a good guess for node usage. We provide a description of node type heterogeneity and analyse the performance degradation caused by an unlucky distribution of nodes of the same type. We provide an extension to routing algorithms for Parallel Generalized Fat-Tree topologies (PGFTs) which balances load amongst groups of nodes of the same type. We show how it removes these performance issues by comparing results in a variety of situations against the corresponding classical algorithms.

    Hierarchical Work-Stealing

    In this paper, we study the problem of dynamic load-balancing on heterogeneous hierarchical platforms. In particular, we consider applications involving heavy communications on a distributed platform. The work-stealing algorithm introduced by Blumofe and Leiserson is a commonly used technique to distribute load in a distributed environment, but it suffers from poor performance in some cases of communication-intensive applications. We present several variants of this algorithm found in the literature and in different grid middlewares such as Satin and Kaapi. In addition, we propose two new variations of the work-stealing algorithm: HWS and PWS. These algorithms improve performance by taking the network structure into account within the scheduling. We conduct a theoretical analysis of HWS in the case of fork-join task graphs and present experimental results comparing the most relevant algorithms. Experiments on Grid'5000 show that HWS and PWS allow us to obtain performance gains of up to twenty per cent when compared to the standard algorithm. Moreover, in some cases, the standard algorithm performs worse on the distributed platform than on a single machine, while PWS and HWS achieve some speedup.
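
    The abstract does not spell out how HWS or PWS choose their victims, so the following is only a minimal sketch of the general idea of topology-aware work-stealing, under stated assumptions: workers are grouped into clusters, an idle worker first tries to steal from a random victim in its own cluster, and it steals from another cluster only when no local victim has work. The function name `run_work_stealing`, the steal-half policy, and the round-based simulation loop are all hypothetical and are not taken from the paper.

```python
import random
from collections import deque


def run_work_stealing(num_clusters, workers_per_cluster, tasks, seed=0):
    """Round-based simulation of work-stealing with cluster-local victims first.

    Assumption for illustration only: all tasks start on worker 0, and a
    thief takes half of the victim's queue and runs one stolen task at once.
    """
    rng = random.Random(seed)
    n = num_clusters * workers_per_cluster
    queues = [deque() for _ in range(n)]
    queues[0].extend(tasks)
    executed = []
    stats = {"local_steals": 0, "remote_steals": 0}

    def cluster_of(worker):
        return worker // workers_per_cluster

    while any(queues):
        for w in range(n):
            if queues[w]:
                # The owner works on the newest task in its own deque.
                executed.append(queues[w].pop())
                continue
            # Idle worker: look for victims, same cluster first.
            local = [v for v in range(n)
                     if v != w and cluster_of(v) == cluster_of(w) and queues[v]]
            remote = [v for v in range(n)
                      if cluster_of(v) != cluster_of(w) and queues[v]]
            if local:
                victim = rng.choice(local)
                stats["local_steals"] += 1
            elif remote:
                victim = rng.choice(remote)
                stats["remote_steals"] += 1
            else:
                continue
            # Steal the oldest half of the victim's tasks (at least one).
            take = max(1, len(queues[victim]) // 2)
            for _ in range(take):
                queues[w].append(queues[victim].popleft())
            # Run one stolen task immediately so every steal makes progress.
            executed.append(queues[w].pop())
    return executed, stats
```

    Comparing `stats["remote_steals"]` with `stats["local_steals"]` for different cluster layouts gives a rough feel for how often work crosses the slower inter-cluster links in this toy model.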

    WSCOM: Online task scheduling with data transfers

    In this paper, we consider the on-line problem of task scheduling with communication. No information on tasks and communication is available in advance except the DAG of the task topology. We take a novel approach by considering the bi-objective problem in which we try to minimize both the completion time and the number of data transmissions. We propose a new variation of the work-stealing algorithm: WSCOM. We try to take advantage of the information in the DAG topology and improve locality by clustering tasks together. We propose several variants designed to overlap communication or optimize the graph decomposition. Performance is evaluated by simulation, and we compare our algorithms with off-line list-scheduling algorithms from the literature. These experiments validate the different design choices taken. In particular, we show that WSCOM is able to achieve performance close to that of off-line algorithms in most cases, and is even able to achieve better performance in the event of congestion, owing to fewer data transfers.

    Contrast enhancement in 1p/19q-codeleted anaplastic oligodendrogliomas is associated with 9p loss, genomic instability, and angiogenic gene expression


    Rare predicted loss-of-function variants of type I IFN immunity genes are associated with life-threatening COVID-19

    Background: We previously reported that impaired type I IFN activity, due to inborn errors of TLR3- and TLR7-dependent type I interferon (IFN) immunity or to autoantibodies against type I IFN, accounts for 15-20% of cases of life-threatening COVID-19 in unvaccinated patients. Therefore, the determinants of life-threatening COVID-19 remain to be identified in approximately 80% of cases. Methods: We report here a genome-wide rare variant burden association analysis in 3269 unvaccinated patients with life-threatening COVID-19 and 1373 unvaccinated SARS-CoV-2-infected individuals without pneumonia. Among the 928 patients tested for autoantibodies against type I IFN, a quarter (234) were positive and were excluded. Results: No gene reached genome-wide significance. Under a recessive model, the most significant gene with at-risk variants was TLR7, with an OR of 27.68 (95%CI 1.5-528.7, P=1.1×10⁻⁴) for biochemically loss-of-function (bLOF) variants. We replicated the enrichment in rare predicted LOF (pLOF) variants at 13 influenza susceptibility loci involved in TLR3-dependent type I IFN immunity (OR=3.70 [95%CI 1.3-8.2], P=2.1×10⁻⁴). This enrichment was further strengthened by (1) adding the recently reported TYK2 and TLR7 COVID-19 loci, particularly under a recessive model (OR=19.65 [95%CI 2.1-2635.4], P=3.4×10⁻³), and (2) considering as pLOF branchpoint variants with potentially strong impacts on splicing among the 15 loci (OR=4.40 [95%CI 2.3-8.4], P=7.7×10⁻⁸). Finally, the patients with pLOF/bLOF variants at these 15 loci were significantly younger (mean age [SD]=43.3 [20.3] years) than the other patients (56.0 [17.3] years; P=1.68×10⁻⁵). Conclusions: Rare variants of TLR3- and TLR7-dependent type I IFN immunity genes can underlie life-threatening COVID-19, particularly with recessive inheritance, in patients under 60 years old.