
    Evaluation of Directive-Based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices

    Achieving high performance and performance portability for large-scale scientific applications is a major challenge on heterogeneous computing systems such as many-core CPUs and accelerators like GPUs. In this work, we implement a widely used block eigensolver, Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), using two popular directive-based programming models (OpenMP and OpenACC) for GPU-accelerated systems. Our work differs from existing work in that it adopts a holistic approach that optimizes the full solver performance rather than narrowing the problem to small kernels (e.g., SpMM, SpMV). Our LOBPCG GPU implementation achieves a 2.8×–4.3× speedup over an optimized CPU implementation when tested with four different input matrices; the evaluated configuration pairs one Skylake CPU with one NVIDIA V100 GPU, measured against one Skylake CPU alone. Our OpenMP and OpenACC LOBPCG GPU implementations give nearly identical performance. We also consider how to create an efficient LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create microbenchmarks representing the two dominant kernels in LOBPCG (the inner-product and SpMM kernels) and evaluate two programming approaches: tiling the kernels, and using Unified Memory with the original kernels. Our tiled SpMM implementation achieves a 2.9× and 48.2× speedup over the Unified Memory implementation on supercomputers with PCIe Gen3 and NVLink 2.0 CPU-to-GPU interconnects, respectively.
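
    The tiling approach for the inner-product kernel can be sketched independently of the paper (whose code is not reproduced here). Below is a minimal OpenMP target-offload sketch, assuming a row-major, tall-skinny block vector X of size n × k: the small Gram matrix G = XᵀX is accumulated tile by tile, so only one row tile of X needs to reside in GPU memory at a time. The function name tiled_gram and the tile size are illustrative, not from the paper.

        #include <cstddef>

        // Accumulate G = X^T X (G is k x k, row-major) by streaming row tiles
        // of X to the device, so X never has to fit in GPU memory at once.
        void tiled_gram(const double* X, double* G, std::size_t n,
                        std::size_t k, std::size_t tile_rows) {
          for (std::size_t i = 0; i < k * k; ++i) G[i] = 0.0;

          for (std::size_t r0 = 0; r0 < n; r0 += tile_rows) {
            const std::size_t rows = (r0 + tile_rows < n) ? tile_rows : n - r0;
            const double* Xt = X + r0 * k;  // current row tile, rows x k

            // Per tile, only rows*k values of X and the small k x k
            // accumulator cross the CPU-to-GPU interconnect.
            #pragma omp target teams distribute parallel for collapse(2) \
                map(to: Xt[0:rows*k]) map(tofrom: G[0:k*k])
            for (std::size_t i = 0; i < k; ++i)
              for (std::size_t j = 0; j < k; ++j) {
                double s = 0.0;
                for (std::size_t r = 0; r < rows; ++r)
                  s += Xt[r * k + i] * Xt[r * k + j];
                G[i * k + j] += s;
              }
          }
        }

    The same streaming pattern extends to the SpMM kernel, with the sparse matrix partitioned into row panels; compiled without offload support, the pragma is ignored and the code still runs correctly on the host.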

    Solving Large Dense Symmetric Eigenproblem on Hybrid Architectures

    The dense symmetric eigenproblem is one of the most significant problems in numerical linear algebra, arising in numerous research fields such as bioinformatics, computational chemistry, and meteorology. In recent years, the problems arising in these fields have grown larger than ever, increasing demands on both computational power and storage capacity. In such problems, the eigenproblem becomes the main computational bottleneck, and its solution requires extremely high computational power. Modern computing architectures that can meet these growing demands combine traditional multi-core processors with general-purpose GPUs and are called hybrid systems. These systems exhibit very high performance when the data fit into GPU memory; however, if the volume of data exceeds the total GPU memory, i.e., the data are out-of-core from the GPU perspective, performance decreases rapidly. This dissertation focuses on developing algorithms that solve dense symmetric eigenproblems on hybrid GPU-based architectures. In particular, it aims at eigensolvers that sustain very high performance even when a problem is out-of-core for the GPU. The developed out-of-core eigensolvers are evaluated and compared on real problems arising in the simulation of molecular motions. In such problems the data, usually too large to fit into GPU memory, are stored in main memory and copied to the GPU in pieces; this degrades performance due to the slow interconnect and high memory latency. To overcome this problem, an approach is presented that applies a blocking strategy and redesigns the existing eigensolvers in order to decrease both the volume of data transferred and the number of memory transfers. This approach designs and implements a set of block-oriented, communication-avoiding BLAS routines that overlap data transfers with computation. These routines are then applied to speed up the following eigensolvers: the solver based on multi-stage reduction to tridiagonal form, the Krylov subspace-based method, and the spectral divide-and-conquer method. Although the out-of-core BLAS routines significantly improve the performance of these three eigensolvers, careful redesign is required to tackle large eigenproblems on hybrid CPU-GPU systems. In the out-of-core multi-stage reduction approach, the factor that most influences performance is the bandwidth of the intermediate band matrix. The Krylov subspace-based method, although built on memory-bound BLAS-2 operations, is the fastest method if only a small subset of the eigenpairs is required. Finally, the spectral divide-and-conquer algorithm, which has a significantly higher arithmetic cost than the other two eigensolvers, achieves extremely high performance since it can be expressed entirely in terms of compute-bound BLAS-3 operations; its high arithmetic cost is further reduced by exploiting the special structure of the matrix. The results presented in the dissertation show that, for a set of specific macromolecular problems, the three out-of-core eigensolvers significantly outperform their multi-core variants and attain high flop rates even when the data do not fit into GPU memory. This demonstrates that it is possible to solve large eigenproblems on modest computing systems equipped with a single GPU.
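
    The communication-avoiding blocking idea can be illustrated with a short sketch (the dissertation's actual routines are not reproduced here). Below is a minimal OpenMP target-offload version of an out-of-core matrix product C = A·B, assuming row-major storage: the small factor B is transferred once and kept resident on the device, while panels of the large A are streamed through GPU memory. The names oocore_gemm and panel_rows are illustrative.

        #include <cstddef>

        // Out-of-core GEMM sketch: C = A * B, where A (m x n) is too large
        // for GPU memory. B (n x k) is copied once and kept device-resident;
        // panels of A are streamed through the device one at a time.
        void oocore_gemm(const double* A, const double* B, double* C,
                         std::size_t m, std::size_t n, std::size_t k,
                         std::size_t panel_rows) {
          #pragma omp target data map(to: B[0:n*k])  // B crosses the bus once
          {
            for (std::size_t p0 = 0; p0 < m; p0 += panel_rows) {
              const std::size_t rows =
                  (p0 + panel_rows < m) ? panel_rows : m - p0;
              const double* Ap = A + p0 * n;  // current panel of A, rows x n
              double* Cp = C + p0 * k;        // matching panel of C

              // The dissertation's routines additionally overlap the next
              // panel transfer with this computation (double buffering),
              // which is omitted here for brevity.
              #pragma omp target teams distribute parallel for collapse(2) \
                  map(to: Ap[0:rows*n]) map(from: Cp[0:rows*k])
              for (std::size_t i = 0; i < rows; ++i)
                for (std::size_t j = 0; j < k; ++j) {
                  double s = 0.0;
                  for (std::size_t l = 0; l < n; ++l)
                    s += Ap[i * n + l] * B[l * k + j];
                  Cp[i * k + j] = s;
                }
            }
          }
        }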

    Context adaptivity for selected computational kernels with applications in optoelectronics and in phylogenetics

    Computational kernels are the crucial part of computationally intensive software, where most of the computing time is spent; hence, their design and implementation have to be accomplished carefully. Two scientific application problems, from optoelectronics and from phylogenetics, and their corresponding computational kernels motivate this thesis. In the first application problem, components for the computational solution of complex symmetric eigenvalue problems (EVPs) are discussed, arising in the simulation of waveguides in optoelectronics. LAPACK and ScaLAPACK contain highly effective reference implementations for certain numerical problems in linear algebra; with respect to EVPs, however, only real symmetric and complex Hermitian codes are available, so efficient codes for complex symmetric (non-Hermitian) EVPs are highly desirable. In the second application problem, a parallel scientific workflow for computing phylogenies is designed, implemented, and evaluated. The reconstruction of phylogenetic trees is an NP-hard problem that demands huge computing capabilities, and therefore a parallel approach is necessary. One idea underlying this thesis is to investigate the interaction between the context of the kernels considered and their efficiency. The context of a computational kernel comprises model aspects (for instance, the structure of the input data), software aspects (for instance, computational libraries), hardware aspects (for instance, available RAM and supported precision), and certain requirements or constraints. Constraints may exist with respect to runtime, memory usage, required accuracy, etc. The concept of context adaptivity is demonstrated for selected problems in computational science. The method proposed here is a meta-algorithm that utilizes aspects of the context to achieve optimal performance with respect to the applied metric. It is important to consider the context because requirements may be traded off against each other, resulting in higher performance; for instance, when only low accuracy is required, a faster algorithmic approach may be favored over an established but slower method. With respect to EVPs, prototypical codes targeted at complex symmetric EVPs aim at trading accuracy for speed. The innovation is evidenced by the implementation of new algorithmic approaches that exploit algebraic structure. Concerning the computation of phylogenetic trees, the mapping of a scientific workflow onto a campus grid system is demonstrated. The adaptive implementation of the workflow features concurrent instances of a computational kernel on a distributed system; here, adaptivity refers to the ability of the workflow to vary the computational load in terms of available computing resources, available time, and the quality of the reconstructed phylogenetic trees.
    Context adaptivity is demonstrated on computational problems from optoelectronics and from phylogenetics. For the field of optoelectronics, a family of implemented algorithms aims at solving generalized complex symmetric EVPs. Our alternative approach exploits the structural symmetry and trades accuracy for speed: it is faster but (usually) less accurate than the conventional approach. In addition to a complete sequential solver, a parallel variant is discussed and partly evaluated on a cluster utilizing up to 1024 CPU cores. The achieved runtimes evidence the superiority of our approach; however, further investigations on improving accuracy are suggested. For the field of phylogenetics, we show that phylogenetic tree reconstruction can be efficiently parallelized on a Condor-based campus grid. This parallel scientific workflow features a moderate parallel overhead, resulting in excellent efficiency.
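
    The LAPACK gap named above is easy to make concrete. The conventional, structure-ignoring path for a complex symmetric (non-Hermitian) EVP is the general nonsymmetric driver zgeev; a minimal C++ sketch via LAPACKE follows. The thesis's structure-exploiting solvers are not reproduced here, and eig_complex_symmetric is an illustrative wrapper name.

        #define LAPACK_COMPLEX_CPP  // use std::complex in lapacke.h
        #include <complex>
        #include <lapacke.h>
        #include <vector>

        // Conventional fallback: LAPACK has no complex symmetric driver,
        // so the A = A^T structure is ignored and zgeev is used instead.
        std::vector<std::complex<double>>
        eig_complex_symmetric(std::vector<std::complex<double>> A, int n) {
          std::vector<std::complex<double>> w(n);       // eigenvalues
          std::vector<std::complex<double>> vr(n * n);  // right eigenvectors
          lapack_int info = LAPACKE_zgeev(
              LAPACK_ROW_MAJOR, 'N', 'V', n, A.data(), n,
              w.data(), nullptr, 1, vr.data(), n);
          if (info != 0) w.clear();  // signal failure to converge
          return w;
        }

    Structure-exploiting alternatives, as pursued in the thesis, aim to beat this general path in runtime at the price of (usually) lower accuracy.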

    Design and optimisation of scientific programs in a categorical language

    This thesis presents an investigation into the use of advanced computer languages for scientific computing, an examination of the performance issues that arise from using such languages for such a task, and a step toward achieving portable performance from compilers by attacking these problems in a way that compensates for the complexity of, and differences between, modern computer architectures. The language employed is Aldor, a functional language from computer algebra, and the scientific computing area is a subset of the family of iterative linear equation solvers applied to sparse systems. The linear equation solvers considered share much common structure, which is factored out and represented explicitly in the language as a framework, by means of categories and domains. The flexibility introduced by decomposing the algorithms and the objects they act on into separate modules has a strong performance impact due to its negative effect on temporal locality. This necessitates breaking the barriers between modules to perform cross-component optimisation; in this instance the task reduces to one of collective loop fusion and array contraction, as sketched below.
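
    Loop fusion and array contraction can be illustrated outside Aldor (the thesis's compiler transformations are not reproduced here). A minimal C++ sketch, assuming two modular vector kernels whose composition materializes a temporary: fusing the loops lets the temporary array contract to a scalar, restoring temporal locality. All names are illustrative.

        #include <vector>

        // Modular form: two separate kernels; composing them materializes
        // the temporary vector t and walks memory twice.
        double composed(const std::vector<double>& x,
                        const std::vector<double>& y, double a) {
          std::vector<double> t(x.size());
          for (std::size_t i = 0; i < x.size(); ++i)  // kernel 1: t = a*x + y
            t[i] = a * x[i] + y[i];
          double s = 0.0;
          for (std::size_t i = 0; i < t.size(); ++i)  // kernel 2: s = <t, t>
            s += t[i] * t[i];
          return s;
        }

        // After collective loop fusion and array contraction: one pass over
        // the data, and the whole array t shrinks to a register scalar.
        double fused(const std::vector<double>& x,
                     const std::vector<double>& y, double a) {
          double s = 0.0;
          for (std::size_t i = 0; i < x.size(); ++i) {
            const double ti = a * x[i] + y[i];
            s += ti * ti;
          }
          return s;
        }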