142 research outputs found

    Using reconfigurable computing technology to accelerate matrix decomposition and applications

    Get PDF
    Matrix decomposition plays an increasingly significant role in many scientific and engineering applications. Among the numerous techniques, Singular Value Decomposition (SVD) and Eigenvalue Decomposition (EVD) are widely used factorization tools for Principal Component Analysis, enabling dimensionality reduction and pattern recognition in image processing, text mining and wireless communications, while QR Decomposition (QRD) and sparse LU Decomposition (LUD) are employed to solve dense or sparse linear systems of equations in bioinformatics, power systems and computer vision. Matrix decompositions are computationally expensive, and their sequential implementations often fail to meet the requirements of many time-sensitive applications. The emergence of reconfigurable computing has provided a flexible and low-cost opportunity to pursue high-performance parallel designs, and the use of FPGAs has shown promise in accelerating this class of computation. In this research, we have proposed and implemented several highly parallel FPGA-based architectures to accelerate matrix decompositions and their applications in data mining and signal processing. Specifically, this dissertation makes the following contributions:
    • We propose an efficient FPGA-based double-precision floating-point architecture for EVD, which can efficiently analyze large-scale matrices.
    • We implement a floating-point Hestenes-Jacobi architecture for SVD, which is capable of analyzing arbitrarily sized matrices.
    • We introduce a novel deeply pipelined reconfigurable architecture for QRD, which can be dynamically configured to perform either Householder transformations or Givens rotations in a manner that takes advantage of the strengths of each.
    • We design a configurable architecture for sparse LUD that supports both symmetric and asymmetric sparse matrices with arbitrary sparsity patterns.
    • By further extending the proposed hardware solution for SVD, we parallelize a popular text mining tool, Latent Semantic Indexing, with an FPGA-based architecture.
    • We present a configurable architecture to accelerate Homotopy l1-minimization, in which a modification of the proposed FPGA architecture for sparse LUD is used at its core to parallelize both Cholesky decomposition and rank-1 update.
    Our experimental results using an FPGA-based acceleration system indicate the efficiency of our proposed novel architectures, with application- and dimension-dependent speedups over an optimized software implementation that range from 1.5× to 43.6× in terms of computation time.
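    As a software-only illustration of the numerical method behind the SVD architecture named above, the following minimal Python/NumPy sketch runs the one-sided Hestenes-Jacobi iteration (column pairs are orthogonalized by plane rotations until all columns are mutually orthogonal). It is only a sketch of the underlying algorithm under a full-column-rank assumption; the FPGA pipeline, precision handling and problem sizes reported in the dissertation are not modelled here.

        import numpy as np

        def hestenes_jacobi_svd(A, tol=1e-12, max_sweeps=30):
            """One-sided (Hestenes) Jacobi SVD: orthogonalize column pairs with plane rotations."""
            U = np.array(A, dtype=float)
            n = U.shape[1]
            V = np.eye(n)
            for _ in range(max_sweeps):
                off = 0.0
                for p in range(n - 1):
                    for q in range(p + 1, n):
                        alpha = U[:, p] @ U[:, p]
                        beta = U[:, q] @ U[:, q]
                        gamma = U[:, p] @ U[:, q]
                        off = max(off, abs(gamma))
                        if abs(gamma) < tol:
                            continue
                        # rotation angle that zeroes the (p, q) column inner product
                        zeta = (beta - alpha) / (2.0 * gamma)
                        t = 1.0 if zeta == 0.0 else np.sign(zeta) / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                        c = 1.0 / np.sqrt(1.0 + t * t)
                        s = c * t
                        R = np.array([[c, s], [-s, c]])
                        U[:, [p, q]] = U[:, [p, q]] @ R
                        V[:, [p, q]] = V[:, [p, q]] @ R
                if off < tol:                       # all column pairs (nearly) orthogonal
                    break
            sigma = np.linalg.norm(U, axis=0)       # singular values (full-rank assumption)
            return U / sigma, sigma, V              # left vectors, singular values, right vectors

        A = np.random.rand(6, 4)
        U, s, V = hestenes_jacobi_svd(A)
        print(np.allclose((U * s) @ V.T, A))        # reconstruction check: True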

    Batched Linear Algebra Problems on GPU Accelerators

    Get PDF
    The emergence of multicore and heterogeneous architectures requires many linear algebra algorithms to be redesigned to take advantage of accelerators such as GPUs. A particularly challenging class of problems, arising in numerous applications, involves linear algebra operations on many small matrices. These matrices are usually all the same size, up to a few hundred rows and columns, and there may be thousands or even millions of them. Compared to large-matrix problems, whose abundant data-parallel computation is well suited to GPUs, the challenges of small-matrix problems lie in the low computational intensity, the large fraction of sequential operations, and the high PCI-E transfer overhead. These challenges call for redesigning the algorithms rather than merely porting the existing LAPACK algorithms. We consider two classes of problems. The first is linear systems with one-sided factorizations (LU, QR, and Cholesky) and the associated triangular solves (forward and backward substitution). The second is two-sided Householder bi-diagonalization. Both are challenging to develop and in high demand in applications. Our main efforts focus on same-sized problems; variable-sized problems are also considered, though to a lesser extent. Our contributions can be summarized as follows. First, we formulated a batched linear algebra framework to solve many data-parallel, small-sized problems/tasks. Second, we redesigned a set of fundamental linear algebra algorithms for high-performance, batched execution on GPU accelerators. Third, we designed batched BLAS (Basic Linear Algebra Subprograms) routines and proposed innovative optimization techniques for high-performance computation. Fourth, we illustrated the batched methodology on real-world applications, as in the case of scaling a CFD application up to 4096 nodes on the Titan supercomputer at Oak Ridge National Laboratory (ORNL). Finally, we demonstrated the power, energy and time efficiency of using accelerators compared to CPUs. Our solutions achieved large speedups and high energy efficiency compared to related routines in CUBLAS on NVIDIA GPUs and MKL on Intel Sandy Bridge multicore CPUs. Modern accelerators are all Single-Instruction Multiple-Thread (SIMT) architectures; our solutions and methods are based on NVIDIA GPUs and can be extended to other accelerators, such as the Intel Xeon Phi and AMD GPUs based on OpenCL.
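    As a rough illustration of the batched idea (one call operating on many same-sized small problems instead of a per-matrix loop), here is a minimal CPU-side sketch in which NumPy's stacked linear-algebra routines stand in for batched BLAS/LAPACK; the GPU kernels, the CUBLAS/MKL comparisons and the CFD runs from the abstract are not reproduced.

        import numpy as np

        rng = np.random.default_rng(0)
        batch, n = 10_000, 16                          # thousands of small, same-sized systems
        A = rng.standard_normal((batch, n, n)) + n * np.eye(n)   # diagonal shift keeps them well conditioned
        b = rng.standard_normal((batch, n, 1))

        x_batched = np.linalg.solve(A, b)              # one "batched" call over the whole stack
        x_loop = np.stack([np.linalg.solve(A[i], b[i]) for i in range(batch)])
        print(np.allclose(x_batched, x_loop))          # True: same results, far fewer call overheads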

    Scientific kernels on VIRAM and Imagine media processors

    Full text link

    Context adaptivity for selected computational kernels with applications in optoelectronics and in phylogenetics

    Get PDF
    Computational kernels are the crucial part of computationally intensive software, where most of the computing time is spent; hence, their design and implementation have to be accomplished carefully. Two scientific application problems, from optoelectronics and from phylogenetics, and the corresponding computational kernels motivate this thesis. In the first application problem, components for the computational solution of complex symmetric eigenvalue problems (EVPs) are discussed, arising in the simulation of waveguides in optoelectronics. LAPACK and ScaLAPACK contain highly effective reference implementations for certain numerical problems in linear algebra. With respect to EVPs, however, only real symmetric and complex Hermitian codes are available; efficient codes for complex symmetric (non-Hermitian) EVPs are therefore highly desirable. In the second application problem, a parallel scientific workflow for computing phylogenies is designed, implemented, and evaluated. The reconstruction of phylogenetic trees is an NP-hard problem that demands huge computing capabilities, and therefore a parallel approach is necessary. One idea underlying this thesis is to investigate the interaction between the context of the kernels considered and their efficiency. The context of a computational kernel comprises model aspects (for instance, the structure of the input data), software aspects (for instance, computational libraries), hardware aspects (for instance, available RAM and supported precision), and certain requirements or constraints. Constraints may exist with respect to runtime, memory usage, required accuracy, etc. The concept of context adaptivity is demonstrated for selected problems in computational science. The method proposed here is a meta-algorithm that utilizes aspects of the context to achieve optimal performance with respect to the applied metric. It is important to consider the context because requirements may be traded against each other, resulting in higher performance. For instance, when only low accuracy is required, a faster algorithmic approach may be favored over an established but slower method. With respect to EVPs, prototypical codes specifically targeted at complex symmetric EVPs aim at trading accuracy for speed. The innovation is evidenced by the implementation of new algorithmic approaches that exploit the algebraic structure. Concerning the computation of phylogenetic trees, the mapping of a scientific workflow onto a campus grid system is demonstrated. The adaptive implementation of the workflow features concurrent instances of a computational kernel on a distributed system. Here, adaptivity refers to the ability of the workflow to vary the computational load in terms of available computing resources, available time, and quality of the reconstructed phylogenetic trees.
    Context adaptivity is discussed by means of computational problems from optoelectronics and from phylogenetics. For the field of optoelectronics, a family of implemented algorithms aims at solving generalized complex symmetric EVPs. Our alternative approach, which exploits the structural symmetry, trades accuracy for a lower runtime; hence it is faster but (usually) less accurate than the conventional approach. In addition to a complete sequential solver, a parallel variant is discussed and partly evaluated on a cluster utilizing up to 1024 CPU cores. The achieved runtimes evidence the superiority of our approach; however, further investigations into improving accuracy are suggested. For the field of phylogenetics, we show that phylogenetic tree reconstruction can be efficiently parallelized on a Condor-based campus grid infrastructure. The parallel scientific workflow features a low parallel overhead, resulting in excellent efficiency.
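    As a small illustration of the gap identified above, the following Python/SciPy sketch builds a complex symmetric (non-Hermitian) matrix: the LAPACK-backed specialized drivers (e.g. eigh) cover only real symmetric and complex Hermitian input, so without a tailored structure-exploiting code such a problem falls back to the general nonsymmetric solver. The structure-exploiting algorithms and the generalized EVPs from the thesis are not reproduced here.

        import numpy as np
        from scipy import linalg

        rng = np.random.default_rng(1)
        n = 200
        B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
        A = B + B.T                        # complex symmetric: A == A.T, but A != A.conj().T

        # linalg.eigh is restricted to Hermitian/real-symmetric matrices, so the
        # standard route for a complex symmetric matrix is the general solver:
        w, v = linalg.eig(A)               # no symmetry exploited
        print(np.allclose(A @ v, v * w))   # eigenpair residual check: True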

    Task-based multifrontal QR solver for heterogeneous architectures

    Get PDF
    To face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism have regained popularity in the high-performance scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. In this study we investigate the design of task-based sparse direct solvers, which constitute extremely irregular workloads, with tasks of different granularities and characteristics and variable memory consumption, on top of runtime systems. In the context of the qr mumps solver, we prove the usability and effectiveness of our approach with the implementation of a sparse multifrontal QR factorization based on a Sequential Task Flow parallel programming model. Using this programming model, we developed features such as the integration of dense 2D Communication Avoiding algorithms in the multifrontal method, allowing for better scalability compared to the original approach used in qr mumps. In addition, we introduced a memory-aware algorithm to control the memory behaviour of our solver and show, in the context of multicore architectures, an important reduction of the memory footprint of the multifrontal QR factorization with a small impact on performance. Following this approach, we move to heterogeneous architectures, where task granularity and scheduling strategies are critical to achieving performance. We present, for the multifrontal method, a hierarchical strategy for data partitioning and a scheduling algorithm capable of handling the heterogeneity of resources. Finally, we present a study on the reproducibility of executions and the use of alternative programming models for the implementation of the multifrontal method. All the experimental results presented in this study are evaluated with a detailed performance analysis measuring the impact of several identified effects on performance and scalability. Thanks to this analysis, presented in the first part of this study, we are able to fully understand the results obtained with our solver.
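    The dependency pattern at the heart of the multifrontal method can be sketched in a few lines of Python: a toy elimination tree whose small dense fronts are factorized by tasks submitted in postorder, with each parent assembling its children's results, in the spirit of a sequential-task-flow submission. The tree, the front sizes and the ThreadPoolExecutor scheduler below are illustrative stand-ins only, not the qr mumps solver or its runtime system.

        import numpy as np
        from concurrent.futures import ThreadPoolExecutor

        # Toy elimination tree (child -> parent); node 0 is the root.  A parent front
        # can only be factorized after its children, which is exactly the dependency
        # pattern a task-based runtime has to respect.
        parent = {1: 0, 2: 0, 3: 1, 4: 1}
        children = {n: [c for c, p in parent.items() if p == n] for n in range(5)}
        rng = np.random.default_rng(0)
        fronts = {n: rng.standard_normal((8, 4)) for n in range(5)}

        def factorize(node, child_futures):
            # wait on dependencies: the children's contribution blocks
            blocks = [fronts[node]] + [f.result() for f in child_futures]
            _, R = np.linalg.qr(np.vstack(blocks))   # dense QR kernel on the assembled front
            return R                                  # contribution block passed to the parent

        futures = {}
        with ThreadPoolExecutor() as pool:
            # postorder submission: every child task is submitted before its parent,
            # mirroring a sequential-task-flow style submission order
            for node in [3, 4, 2, 1, 0]:
                futures[node] = pool.submit(factorize, node, [futures[c] for c in children[node]])
            R_root = futures[0].result()
        print(R_root.shape)                           # (4, 4)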
