57 research outputs found

    High-performance computing with PetaBricks and Julia

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Mathematics, 2011. Cataloged from the PDF version of the thesis. Includes bibliographical references (p. 163-170).
    We present two recent parallel programming languages, PetaBricks and Julia, and demonstrate how they can be used to re-examine classic numerical algorithms with new approaches to high-performance computing. PetaBricks is an implicitly parallel language that allows programmers to express algorithmic choice explicitly and naturally at the language level. The PetaBricks compiler and autotuner are not only able to compose a complex program from fine-grained algorithmic choices but can also find the right settings for many other parameters, including data distribution, parallelization, and blocking. We re-examine classic numerical algorithms with PetaBricks and show that the PetaBricks autotuner produces nontrivial optimal algorithms that are difficult to reproduce otherwise. We also introduce the notion of variable-accuracy algorithms, in which accuracy measures and requirements are supplied by the programmer and incorporated by the PetaBricks compiler and autotuner in the search for optimal algorithms. We demonstrate the accuracy/performance trade-offs on benchmark problems and show how nontrivial algorithmic choices can change with different user accuracy requirements. Julia is a new high-level programming language that aims to achieve performance comparable to traditional compiled languages while remaining easy to program and offering flexible parallelism without extensive effort. We describe a problem in large-scale terrain data analysis that motivates the use of Julia. We apply classical filtering techniques to study terrain profiles and propose a measure based on the Singular Value Decomposition (SVD) to quantify terrain surface roughness. We then give a brief tutorial of Julia and present results of our serial blocked SVD algorithm implementation in Julia. We also describe the parallel implementation of our SVD algorithm and discuss how flexible parallelism can be further explored using Julia.
    by Yee Lok Wong. Ph.D.
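    The abstract does not spell out the SVD-based roughness measure, so the following is only a rough sketch of the general idea under an assumed definition (the fraction of a detrended terrain patch's energy not captured by its leading singular values). The function name and parameters are hypothetical, and Python/NumPy stands in for the thesis's Julia code.

```python
import numpy as np

def svd_roughness(patch, k=1):
    """Hypothetical SVD-based roughness proxy for a 2-D terrain patch.

    Returns the fraction of the (mean-removed) patch's Frobenius energy that is
    NOT captured by its k leading singular values: near 0 for a smooth,
    low-rank surface, closer to 1 for a rough, high-rank one. Illustrative
    only; the thesis defines its own measure.
    """
    s = np.linalg.svd(patch - patch.mean(), compute_uv=False)
    energy = np.sum(s ** 2)
    return 0.0 if energy == 0.0 else 1.0 - np.sum(s[:k] ** 2) / energy
```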

    Evaluation of Distributed Programming Models and Extensions to Task-based Runtime Systems

    High Performance Computing (HPC) has always been a key foundation for scientific simulation and discovery, and more recently the training of deep learning models has further accelerated the demand for computational power and lower-precision arithmetic. In this era following the end of Dennard scaling, when Moore's Law seemingly still holds true to a lesser extent, it is no coincidence that HPC systems are equipped with multi-core CPUs and a variety of hardware accelerators that are all massively parallel. Coupled with interconnect speeds improving more slowly than computational power, the current state of HPC systems is heterogeneous and extremely complex. This has been heralded as a great challenge for software stacks and their ability to extract performance from these systems, but also as a great opportunity to innovate at the programming-model level, to explore different approaches and propose new solutions. With usability, portability, and performance as the main factors to consider, this dissertation first evaluates the ability of some widely used parallel programming models (MPI, MPI+OpenMP, and task-based runtime systems) to manage the load imbalance among processes computing the LU factorization of a large dense matrix stored in the Block Low-Rank (BLR) format. Next, I propose a number of optimizations and implement them in PaRSEC's Dynamic Task Discovery (DTD) model, including user-level graph trimming and direct Application Programming Interface (API) calls for data broadcast, to further extend the limits of the Sequential Task Flow (STF) model. The Parameterized Task Graph (PTG) approach in PaRSEC, on the other hand, is the most scalable approach for many different applications; I therefore explore combining the algorithmic benefits of Communication-Avoiding (CA) techniques with the communication-computation overlap provided by runtime systems, using a 2D five-point stencil as the test case. This broad evaluation and extension of programming models highlights the ability of task-based runtime systems to achieve scalable performance and portability on contemporary heterogeneous HPC systems. Finally, I summarize the profiling capability of the PaRSEC runtime system and demonstrate with a use case its important role in identifying the performance bottlenecks that lead to optimizations.
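    As an illustration of the test case only (not PaRSEC's DTD or PTG implementation), here is a minimal 2D five-point stencil sweep in Python/NumPy; the Jacobi-style averaging update is an assumption, since the abstract does not fix the stencil coefficients.

```python
import numpy as np

def five_point_stencil(u, n_steps):
    """Apply n_steps Jacobi-style sweeps of a 2D five-point stencil to grid u.

    Each interior point is replaced by the average of its four neighbours;
    boundary values are held fixed. Purely illustrative of the computational
    pattern referred to in the abstract.
    """
    u = u.astype(float).copy()
    for _ in range(n_steps):
        nxt = u.copy()
        nxt[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                  u[1:-1, :-2] + u[1:-1, 2:])
        u = nxt
    return u
```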

    Quantitative performance modeling of scientific computations and creating locality in numerical algorithms

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1995. Includes bibliographical references (p. 141-150) and index.
    by Sivan Avraham Toledo. Ph.D.

    Implementation Techniques and Performance Analysis of Jacobi SVD Methods for High-Performance Computers

    The massive parallelism of recent high-performance computers (HPC) requires all algorithms running on them, including singular value decomposition (SVD) algorithms, to have a high level of scalability. To achieve this goal, we focus on the one-sided Jacobi SVD (OSJ) algorithm. The OSJ, with extension techniques such as blocking and parallelization, has the potential for high parallel efficiency owing to its large-grained parallelism, but some of its theoretical properties are unknown. Furthermore, implementation techniques of the OSJ for HPC have not been studied well. This thesis analyzes the blocking and parallelization techniques of the OSJ both theoretically and experimentally and provides new parallelization techniques for HPC. It consists of seven chapters. In the first chapter, the motivation and contributions of the thesis are described. In the second and third chapters, as background, the current trends in HPC and the applications of the SVD are described, together with the idea and detailed algorithm of the OSJ and the existing extension techniques. In the fourth chapter, a new bound on the orthogonality error of Hari's V2 method is provided. The V2 method is a blocking method for the OSJ suitable for HPC. The bound is tighter than the existing one due to Drmač, thanks to the exploitation of the diagonally scaled structure of the matrix. In the fifth chapter, a new implementation method for HPC based on a 2D blocked data distribution and all-reduce-type communication is described. Theoretical analysis shows that the number of communications is reduced compared with existing data distributions, and experimental results on highly parallel machines support the theoretical prediction and show good strong scalability of the method. Bečka's dynamic ordering method, which can reduce the number of iterations of the OSJ, is also analyzed, and a new bound on its global convergence rate is provided. In the sixth chapter, implementation techniques for DSYRK on a many-core CPU, a new class of CPU architecture used in HPC, are considered. DSYRK is a variant of matrix-matrix multiplication used in the V2 method. The new parallelization technique for DSYRK, which utilizes all three dimensions of parallelism in the matrix-matrix multiplication, accelerates DSYRK on Xeon Phi, Intel's many-core CPU, to up to 75% of the theoretical peak performance. The last chapter concludes the thesis and outlines future work.
    Modern high-performance computers are highly parallel, and fundamental matrix computations used in scientific computing, such as the singular value decomposition (SVD), must cope with this high degree of parallelism. The one-sided Jacobi method is one of the SVD algorithms implemented in the widely used matrix computation library LAPACK and combines high accuracy with practical speed. Combined with extension techniques such as blocking and parallelization, it can attain high computational efficiency and coarse-grained parallelism. The one-sided Jacobi method is therefore considered suitable for high-performance computers, but its blocking and parallelization had not been sufficiently verified theoretically or experimentally, and implementations targeting modern HPC systems had not been well studied. This thesis provides theoretical and experimental analysis of extension techniques for the one-sided Jacobi method, together with the development and analysis of implementations with high parallel performance; it consists of seven chapters. Chapter 1 explains the purpose of the research by showing, with examples, the accuracy advantages and the performance problems of the current one-sided Jacobi method, and outlines the contributions and structure of the thesis. Chapter 2 summarizes the background: the characteristics of modern high-performance computers, showing that coping with massive parallelism is essential for SVD algorithms; the SVD and its applications in scientific computing, together with recently clarified properties of SVD errors; and an overview of SVD algorithms, including the ideas underlying the Jacobi method and its one-sided variant for the SVD. Chapter 3 describes the one-sided Jacobi method and existing extension techniques in detail: it presents the detailed procedure of the method, points out its problems to motivate the extensions, and then explains the cyclic orderings of the Jacobi method (which are closely related to parallelization), the blocking techniques needed for high performance, and the recently developed preconditioning techniques that led to dramatic performance improvements of the Jacobi method.
    Chapter 4 presents the first contribution, an error analysis of the blocked one-sided Jacobi method. Among the possible blocking methods, the thesis targets Hari's V2 method, which is well suited to high-performance computing, and studies the orthogonality error, which affects the convergence of the one-sided Jacobi method. The analysis extends prior work by Drmač and obtains a better bound by exploiting the diagonally scaled structure of the matrix; experiments with a large number of matrices show that the matrix-dependent factor appearing in the bound is small in practice. Chapter 5 first discusses the second contribution, a distributed-memory parallelization based on a two-dimensional block data distribution and AllReduce-type computation. Almost all modern high-performance computers are distributed-memory parallel machines, and on such machines the data distribution determines the characteristics of computation and communication. Exploiting the computational pattern of the V2 method, the thesis presents a new parallelization that combines the 2D block distribution with AllReduce-type computation, analyzes its computation and communication costs theoretically, and shows that it reduces the number of communications, in the asymptotic sense, compared with conventional data distributions. Experiments on highly parallel machines confirm that the performance follows the theoretical prediction, and a large-scale evaluation on the K computer shows high strong scalability, with performance improving up to about 25,000 cores when computing the SVD of a 10,000-dimensional matrix. Second, as the third contribution, the chapter analyzes the parallelization method known as Bečka's dynamic ordering and derives a new upper bound on its linear convergence rate; unlike previous bounds, this bound is inversely proportional to the number of parallel processes, reflecting the suitability of dynamic ordering for parallel computation. Chapter 6 discusses the fourth contribution, a high-performance implementation technique for DSYRK. DSYRK is a symmetric matrix-matrix product defined in BLAS and one of the main computations in the V2 method. The thesis examines the problems that arise on Xeon Phi, a many-core CPU architecture expected to see future use: existing DSYRK implementations could not exploit Xeon Phi's high performance because of DSYRK's symmetric structure, so the thesis presents a new parallelization algorithm that uses all three dimensions of parallelism in matrix-matrix multiplication to cope with the high parallelism of many-core CPUs. Performance evaluation on Xeon Phi reaches about 76% of the theoretical peak performance. Chapter 7 concludes the thesis and discusses future prospects. The University of Electro-Communications, 201
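    For readers unfamiliar with the underlying algorithm, here is a minimal, unblocked one-sided Jacobi SVD in Python/NumPy with the classic cyclic-by-rows ordering. It is a textbook sketch of the method the thesis builds on, not the blocked V2 method or the parallel variants analyzed above.

```python
import numpy as np

def one_sided_jacobi_svd(A, tol=1e-12, max_sweeps=30):
    """Unblocked one-sided Jacobi SVD of an m x n matrix A (m >= n, full column rank).

    Columns of the working matrix are orthogonalized pairwise with Jacobi
    rotations; on convergence the column norms are the singular values and
    A = U @ np.diag(sigma) @ V.T.
    """
    U = np.array(A, dtype=float, copy=True)
    n = U.shape[1]
    V = np.eye(n)
    for _ in range(max_sweeps):
        converged = True
        for p in range(n - 1):            # cyclic-by-rows ordering of column pairs
            for q in range(p + 1, n):
                alpha = U[:, p] @ U[:, p]
                beta = U[:, q] @ U[:, q]
                gamma = U[:, p] @ U[:, q]
                if abs(gamma) <= tol * np.sqrt(alpha * beta):
                    continue              # pair already numerically orthogonal
                converged = False
                zeta = (beta - alpha) / (2.0 * gamma)
                if zeta == 0.0:
                    t = 1.0               # equal column norms: 45-degree rotation
                else:
                    t = np.sign(zeta) / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = c * t
                Up, Uq = U[:, p].copy(), U[:, q].copy()
                U[:, p], U[:, q] = c * Up - s * Uq, s * Up + c * Uq
                Vp, Vq = V[:, p].copy(), V[:, q].copy()
                V[:, p], V[:, q] = c * Vp - s * Vq, s * Vp + c * Vq
        if converged:
            break
    sigma = np.linalg.norm(U, axis=0)
    order = np.argsort(sigma)[::-1]       # descending singular values
    return (U / sigma)[:, order], sigma[order], V[:, order]
```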

    Connected Attribute Filtering Based on Contour Smoothness


    Control and Estimation Oriented Model Order Reduction for Linear and Nonlinear Systems

    Optimization-based controls are advantageous in meeting stringent performance requirements and accommodating constraints. Although computers are becoming more powerful, solving optimization problems in real time remains an obstacle because of the associated computational complexity. Research efforts to address real-time optimization with limited computational power have intensified over the last decade, and one direction that has shown some success is model order reduction. This dissertation contains a collection of results relating to open- and closed-loop reduction techniques for large-scale unconstrained linear descriptor systems, constrained linear systems, and nonlinear systems.
    For unconstrained linear descriptor systems, this dissertation develops novel gramian and Riccati solution approximation techniques. The gramian approximation is used for an open-loop reduction technique following the balanced truncation proposed by (Moore, 1981) for ordinary linear systems and (Stykel, 2004) for linear descriptor systems. The Riccati solution is used to generalize the Linear Quadratic Gaussian balanced truncation (LQGBT) of (Verriest, 1981) and (Jonckheere and Silverman, 1983). These are applied to an electric machine model to reduce the number of states from well over 100,000 to 8, while improving accuracy over the state-of-the-art modal truncation of (Zhou, 2015), for the purpose of condition monitoring. Furthermore, a link between unconstrained model predictive control (MPC) with a terminal penalty and LQG control of a linear system is noted, suggesting an LQGBT reduced model as a natural model for reduced MPC design. The efficacy of such a reduced controller is demonstrated by the real-time control of a diesel airpath.
    Model reduction generally introduces modeling errors, and controlling a constrained plant subject to modeling errors falls squarely into robust control. A standard assumption of robust control is that inputs/states/outputs are constrained by convex sets, and these sets are "tightened" for robust constraint satisfaction. However, robust control is often overly conservative, and the resulting control strategies cannot take advantage of the true admissible sets. A new reduction problem is proposed that considers both reduced-order model accuracy and constraint conservativeness. A constant-tube methodology for reduced-order constrained MPC is presented, and the proposed reduced-order model is found to decrease the constraint conservativeness of the reduced-order MPC law compared with reduced-order models obtained by gramian and LQG reductions.
    For nonlinear systems, a reformulation of the empirical gramians of (Lall et al., 1999) and (Hahn et al., 2003) into simpler, yet more general, forms is provided. The modified definitions are used in the balanced truncation of a nonlinear diesel airpath model, and the reduced-order model is used to design a reduced MPC law for tracking control. Further exploiting the link between the gramian and the Riccati solution for linear systems, the new empirical gramian formulation is extended to obtain empirical Riccati covariance matrices used for closed-loop model order reduction of a nonlinear system. Balanced truncation using the empirical Riccati covariance matrices is demonstrated to result in a closer-to-optimal nonlinear compensator than the previous balanced truncation techniques discussed in the dissertation.
    Ph.D., Naval Architecture & Marine Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/140839/1/riboch_1.pd
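    As context for the reduction techniques above, here is a minimal textbook sketch of square-root balanced truncation for an ordinary, stable LTI system (A, B, C) in Python/SciPy. It illustrates only the gramian-based idea; it is not the descriptor-system, LQG, or empirical-gramian variants developed in the dissertation.

```python
import numpy as np
from scipy.linalg import cholesky, solve_continuous_lyapunov, svd

def balanced_truncation(A, B, C, r):
    """Reduce a stable LTI system  dx/dt = A x + B u,  y = C x  to order r.

    Textbook square-root balanced truncation: solve the two Lyapunov equations
    for the controllability/observability gramians, balance via an SVD of the
    product of their Cholesky factors, and keep the r largest Hankel singular
    values.
    """
    Wc = solve_continuous_lyapunov(A, -B @ B.T)      # A Wc + Wc A^T + B B^T = 0
    Wo = solve_continuous_lyapunov(A.T, -C.T @ C)    # A^T Wo + Wo A + C^T C = 0
    Lc = cholesky(Wc, lower=True)
    Lo = cholesky(Wo, lower=True)
    U, hsv, Vt = svd(Lo.T @ Lc)                      # hsv: Hankel singular values
    S = np.diag(1.0 / np.sqrt(hsv[:r]))
    T = Lc @ Vt[:r].T @ S                            # maps reduced state -> full state
    Ti = S @ U[:, :r].T @ Lo.T                       # maps full state -> reduced state
    return Ti @ A @ T, Ti @ B, C @ T, hsv
```

    The returned Hankel singular values indicate how much each retained state contributes to the input-output behaviour, which is the usual guide for choosing the reduced order r.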

    ParallelLCA: a foreground-aware parallel calculator for life cycle assessment

    Life Cycle Assessment (LCA), which aims to assess the environmental impacts over the life cycle of a product system (S) (e.g., the production of aluminum in Quebec), can be used to compare systems built from different types of materials and determine which is the least harmful to the environment. The calculation in LCA represents a computational challenge, as it depends on the size of the system, the number of iterations in the Monte Carlo simulation, and the number of uncertain variables in the system. First, the base case requires solving a linear system on the order of 10,000 equations in 10,000 unknowns. Second, a graph that is iterative in nature, with a minimum of 10,000 vertices, must be built. Third, a Monte Carlo simulation requiring several thousand iterations to converge must be computed. Finally, a sensitivity analysis requires computing millions of correlations between vectors, each with a dimension proportional to the number of Monte Carlo iterations. To address these computational challenges, this research benefits from well-established libraries that solve large sparse linear systems and perform large sparse matrix computations. The thesis also adopts mathematical optimizations that remove the very expensive matrix-inverse step from the contribution analysis module, as well as algorithmic optimizations that remove the large, variable part of the LCA supply chain from the matrix component of the various calculation phases. Furthermore, this research experiments with libraries such as OpenMP, MPI, and Apache Spark to parallelize the computation. The thesis first discusses the literature regarding these computational opportunities, then presents a proposed LCA calculator for implementing an efficient LCA computation, and finally presents the performance of computing the different phases of LCA for various dimensions of the system (S), concluding with suggestions for improvement and future development.
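    To make the computational pattern concrete, here is a hedged Python/SciPy sketch of the base LCA calculation inside a Monte Carlo loop, using the conventional matrix formulation (technology matrix A, intervention matrix B, characterization matrix Q, final demand f). The `sample_A` callable and all names are hypothetical stand-ins for the thesis's uncertainty model; this is not the ParallelLCA code.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

def lca_monte_carlo(A, B, Q, f, sample_A, n_iter=1000, seed=0):
    """Monte Carlo LCA scores using the conventional matrix formulation.

    A: sparse technology matrix (n x n), B: intervention matrix, Q:
    characterization matrix, f: final-demand vector. `sample_A` is a
    user-supplied callable (hypothetical) returning one uncertain
    realization of A per draw.
    """
    rng = np.random.default_rng(seed)
    scores = np.empty((n_iter, Q.shape[0]))
    for k in range(n_iter):
        Ak = csc_matrix(sample_A(A, rng))   # perturbed technology matrix
        s = splu(Ak).solve(f)               # scaling vector: Ak s = f
        g = B @ s                           # life-cycle inventory
        scores[k] = Q @ g                   # characterized impact scores
    return scores
```

    Note that the sparse LU factorization replaces any explicit matrix inverse, in the spirit of the optimization described above.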

    Real-time stress analysis of three-dimensional boundary element problems with continuously updating geometry

    Computational design of mechanical components is an iterative process that involves multiple stress analysis runs; this can be time-consuming and expensive. Significant improvements in the efficiency of this process can be made by increasing the level of interactivity. One approach is real-time re-analysis of models with continuously updating geometry. In this work the boundary element method is used to realise this vision. Three primary areas need to be considered to accelerate the re-solution of boundary element problems: re-meshing the model, updating the boundary element system of equations, and re-solving the system. Once the initial model has been constructed and solved, the user may apply geometric perturbations to parts of the model. A new re-meshing algorithm accommodates these changes in geometry while retaining as much of the existing mesh as possible. This allows the majority of the previous boundary element system of equations to be re-used for the new analysis. Efficiency is achieved during re-integration by applying a reusable intrinsic sample point (RISP) integration scheme with a 64-bit single-precision code. Parts of the boundary element system that have not been updated are retained by the re-analysis, and integrals that multiply zero boundary conditions are suppressed. For models with fewer than 10,000 degrees of freedom, the re-integration algorithm performs up to five times faster than a standard integration scheme with less than 0.15% reduction in the L_2-norm accuracy of the solution vector. The method parallelises easily, and an additional six-times speed-up can be achieved on eight processors over the serial implementation. The performance of a range of direct, iterative, and reduction-based linear solvers has been compared for solving the boundary element system, with the iterative generalised minimal residual (GMRES) solver providing the fastest convergence rate and the most accurate result. Further time savings are made by preconditioning the updated system with the LU decomposition of the original system. Using these techniques, near real-time analysis can be achieved for three-dimensional simulations; for two-dimensional models such real-time performance has already been demonstrated.
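    The re-solution strategy described above (GMRES on the updated system, preconditioned with the LU factors of the original system) can be sketched in a few lines. The following Python/SciPy fragment is a generic illustration of that idea under the assumption of dense system matrices; it is not the thesis's BEM code, and the function name is hypothetical.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, gmres

def resolve_updated_system(A_old, A_new, b_new):
    """Solve the perturbed system A_new x = b_new with GMRES, reusing the LU
    factorization of the original matrix A_old as a preconditioner."""
    lu_piv = lu_factor(A_old)                 # factor the original system once
    M = LinearOperator(A_new.shape,
                       matvec=lambda v: lu_solve(lu_piv, v))
    x, info = gmres(A_new, b_new, M=M)        # info == 0 indicates convergence
    return x, info
```

    Because the geometric perturbation changes only part of the system, A_old remains a good approximation of A_new, so the reused LU factors act as an effective preconditioner and GMRES converges in few iterations.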