233 research outputs found

    Performance Improvements of Common Sparse Numerical Linear Algebra Computations

    Manufacturers of computer hardware continue to sustain an unprecedented pace of progress in the computing speed of their products, partly through increased clock rates but also through ever more complicated chip designs. With new processor families appearing every few years, it is increasingly hard to achieve high performance rates in sparse matrix computations. This research proposes new methods for sparse matrix factorizations and applies, in an iterative code, generalizations of known concepts from related disciplines. The proposed solutions and extensions are implemented in ways that tend to deliver efficiency while retaining the ease of use of existing solutions. The implementations are thoroughly timed and analyzed using a commonly accepted set of test matrices. The tests were conducted on modern processors that have gained an appreciable level of popularity and are fairly representative of a wider range of processor types available on the market now or in the near future. The new factorization technique, formally introduced in the early chapters, is later shown to be quite competitive with state-of-the-art software currently available. Although not superior in all cases (as probably no single approach could be), the new factorization algorithm exhibits a few promising features. In addition, a comprehensive optimization effort is applied to an iterative algorithm that stands out for its robustness. This also gives satisfactory performance improvements on the tested computing platforms. The same set of test matrices is used to enable an easy comparison between the two investigated techniques, even though they are customarily treated separately in the literature. Possible extensions of the presented work are discussed. They range from easily conceivable merging with existing solutions to more involved schemes that depend on hard-to-predict progress in theoretical and algorithmic research.

    GPU-Accelerated Asynchronous Error Correction for Mixed Precision Iterative Refinement

    In hardware-aware high performance computing, block-asynchronous iteration and mixed precision iterative refinement are two techniques that may be used to leverage the computing power of SIMD accelerators like GPUs in the iterative solution of linear equation systems. Although they take very different approaches, they share the basic idea of compensating for the convergence properties of an inferior numerical algorithm through more efficient use of the available computing power. In this paper, we analyze the potential of combining both techniques. To this end, we derive a mixed precision iterative refinement algorithm that uses a block-asynchronous iteration as the error correction solver, and compare its performance with a pure implementation of a block-asynchronous iteration and with an iterative refinement method that uses double precision for the error correction solver. For matrices from the University of Florida Matrix Collection, we report the convergence behaviour and provide the total solver runtime on different GPU architectures.
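
The structure of mixed precision iterative refinement described in this abstract can be illustrated with a minimal CPU sketch. It is not the authors' GPU implementation: an ordinary synchronous Jacobi iteration in single precision stands in for the block-asynchronous error correction solver, and the function names, sweep count, tolerance, and test matrix are all assumptions chosen for illustration.

```python
import numpy as np

def jacobi_fp32(A32, r32, sweeps=20):
    """Approximately solve A32 @ c = r32 with single-precision Jacobi sweeps.

    Stand-in (assumption) for the block-asynchronous error correction solver.
    """
    d = np.diag(A32)                 # diagonal entries
    off = A32 - np.diag(d)           # off-diagonal part
    c = np.zeros_like(r32)
    for _ in range(sweeps):
        c = (r32 - off @ c) / d
    return c

def mixed_precision_refinement(A, b, tol=1e-12, max_outer=50):
    """Iterative refinement: float64 residuals, float32 error correction."""
    A32 = A.astype(np.float32)
    x = np.zeros_like(b, dtype=np.float64)
    for outer in range(max_outer):
        r = b - A @ x                                # residual in double precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, outer
        c = jacobi_fp32(A32, r.astype(np.float32))   # correction in single precision
        x += c.astype(np.float64)                    # update in double precision
    return x, max_outer

# Hypothetical test problem: a strictly diagonally dominant matrix,
# for which Jacobi is known to converge.
rng = np.random.default_rng(0)
n = 50
A = rng.uniform(-1.0, 1.0, (n, n))
A += np.diag(np.abs(A).sum(axis=1) + 1.0)
b = rng.standard_normal(n)

x, outer_iterations = mixed_precision_refinement(A, b)
relative_residual = np.linalg.norm(b - A @ x) / np.linalg.norm(b)
```

The inner solver only needs to reduce the error by a modest factor per call; the double-precision residual in the outer loop is what restores full accuracy.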

    Benthic Biomonitoring in Arctic Tundra Streams: A Community-Based Approach in Iqaluit, Nunavut, Canada

    Recent residential, commercial, and industrial development in the catchments of several Arctic streams has heightened the need to assess these freshwater systems accurately. It was imperative to develop methods that would be both effective at judging the ecological condition of tundra streams and suitable for use by local groups. An investigation of two streams influenced by urbanization in Iqaluit, Nunavut, was carried out between July and August of each year from 2007 to 2009. Simple summary metrics (e.g., Shannon Index) and multivariate analysis (DCA, RDA) both demonstrated biological impairment in the benthic community at sites downstream of urbanized portions of a local stream. This impairment was characterized by a loss of diversity and a dramatic shift of the benthic community to one dominated by chironomids from the subfamily Orthocladiinae. Elevated levels of total nitrogen (TN), total phosphorus (TP), and several metals (Zn, Sr, Rb, Al, Co, Fe) were also found to be significantly related to benthic assemblages within these disturbed areas. This investigation also addressed taxonomic sufficiency, indicating that while family-level taxonomic identifications were sensitive enough to differentiate between pristine and impacted stream sites, a more precise taxonomic identification of the dominant benthos taxa (Insecta: Diptera: Chironomidae) to subfamily/tribe level identified a significant shift towards pollution-tolerant taxa. This higher taxonomic resolution will allow for the adaptation of protocols and the use of simple summary metrics to be effective for a community-based biomonitoring program in Arctic tundra streams.
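
As an illustration of the simple summary metric named in this abstract, the Shannon Index H = -Σ p_i ln p_i can be computed from per-taxon counts as in the sketch below. The counts are hypothetical and are not data from the study; they only show how an even community scores higher than one dominated by a single taxon.

```python
import math

def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one individual")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Hypothetical counts per taxon (illustration only, not study data).
even_community = [25, 25, 25, 25]      # four equally abundant taxa
dominated_community = [97, 1, 1, 1]    # one taxon dominates

h_even = shannon_index(even_community)           # equals ln(4) for four equal taxa
h_dominated = shannon_index(dominated_community)
```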

    Regulatory Immunotherapy in Bone Marrow Transplantation

    Every year individuals receive hematopoietic stem cell transplantation (HSCT) to eradicate malignant and nonmalignant disease. The immunobiology of allotransplantation is an area of ongoing discovery, from the recipient's conditioning treatment prior to the transplant to the donor cell populations responsible for engraftment, graft-versus-host disease, and graft-versus-tumor effect. In this review, we focus on donor-type immunoregulatory T cells, namely natural killer T cells (NKT) and regulatory T cells (Treg), and their current and potential roles in tolerance induction after allogeneic HSCT.

    Performance of random sampling for computing low-rank approximations of a dense matrix on GPUs

    A low-rank approximation of a dense matrix plays an important role in many applications. To compute such an approximation, a common approach uses the QR factorization with column pivoting (QRCP). Though the reliability and efficiency of QRCP have been demonstrated, this deterministic approach requires costly communication at each step of the factorization. Since such communication is becoming increasingly expensive on modern computers, an alternative approach based on random sampling, which can be implemented using communication-optimal kernels, is becoming attractive. To study its potential, in this paper we compare the performance of random sampling with that of QRCP on an NVIDIA Kepler GPU. Our performance results demonstrate that random sampling can be up to 12.8× faster than the deterministic approach for computing an approximation of the same accuracy. We also present the parallel scaling of random sampling over multiple GPUs on a single compute node, showing a speedup of 3.8× over three Kepler GPUs. These results demonstrate the potential of random sampling as an excellent computational tool for many applications, and its potential is likely to grow on emerging computers with increasing communication costs.
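
The random sampling idea can be sketched on the CPU with NumPy. This is a generic Gaussian range-finder scheme, assumed here for illustration and not necessarily the exact algorithm or kernels used in the paper; the function name, oversampling parameter, and test matrix are hypothetical.

```python
import numpy as np

def random_sampling_low_rank(A, rank, oversample=10, seed=0):
    """Return Q (orthonormal columns) and B such that A is approximately Q @ B.

    Generic Gaussian random sampling; an assumption, not the paper's code.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    k = min(rank + oversample, n)          # number of random samples
    Omega = rng.standard_normal((n, k))    # Gaussian test matrix
    Y = A @ Omega                          # sample the column space of A
    Q, _ = np.linalg.qr(Y)                 # orthonormal basis for the samples
    B = Q.T @ A                            # project A onto that basis
    return Q, B

# Hypothetical test matrix of exact rank 5.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 100))

Q, B = random_sampling_low_rank(A, rank=5)
relative_error = np.linalg.norm(A - Q @ B) / np.linalg.norm(A)
```

The work is dominated by matrix-matrix products and an unpivoted QR, which is why this style of algorithm avoids the per-step communication that column pivoting requires.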

    HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi

    This paper presents the design and implementation of several fundamental dense linear algebra (DLA) algorithms for multicore systems with Intel Xeon Phi coprocessors. In particular, we consider algorithms for solving linear systems. Further, we give an overview of the MAGMA MIC library, an open source, high performance library that incorporates the developments presented here and, more broadly, provides DLA functionality equivalent to that of the popular LAPACK library while targeting heterogeneous architectures that feature a mix of multicore CPUs and coprocessors. LAPACK compliance simplifies the use of the MAGMA MIC library in applications, while providing them with portably performant DLA. High performance is obtained through the use of high-performance BLAS, hardware-specific tuning, and a hybridization methodology whereby we split the algorithm into computational tasks of various granularities. Execution of those tasks is scheduled over the heterogeneous hardware by minimizing data movements and mapping algorithmic requirements to the architectural strengths of the various hardware components. Our methodology and programming techniques are incorporated into the MAGMA MIC API, which abstracts the application developer from the specifics of the Xeon Phi architecture and is therefore applicable to algorithms beyond the scope of DLA.

    LU Factorization with Partial Pivoting for a Multicore System with Accelerators


    Proposed Consistent Exception Handling for the BLAS and LAPACK

    Numerical exceptions, which may be caused by overflow, operations like division by 0 or sqrt(-1), or convergence failures, are unavoidable in many cases, in particular when software is used on unforeseen and difficult inputs. As more aspects of society become automated, e.g., self-driving cars, health monitors, and cyber-physical systems more generally, it is becoming increasingly important to design software that is resilient to exceptions and that responds to them in a consistent way. Consistency is needed to allow users to build higher-level software that is also resilient and consistent (and so on recursively). In this paper we explore the design space of consistent exception handling for the widely used BLAS and LAPACK linear algebra libraries, pointing out a variety of instances of inconsistent exception handling in the current versions, and propose a new design that balances consistency, complexity, ease of use, and performance. Some compromises are needed, because there are preexisting inconsistencies that are outside our control, including in or between existing vendor BLAS implementations, different programming languages, and even compilers for the same programming language. User requests from our surveys are also quite diverse. We also propose our design as a possible model for other numerical software, and welcome comments on our design choices.