
    New Sequential and Scalable Parallel Algorithms for Incomplete Factor Preconditioning

    Get PDF
    The solution of large, sparse, linear systems of equations Ax = b is an important kernel, and the dominant term with regard to execution time, in many applications in scientific computing. The large size of the systems of equations being solved currently (millions of unknowns and equations) requires iterative solvers on parallel computers. Preconditioning, which is the process of translating a linear system into a related system that is easier to solve, is widely used to reduce solution time and is sometimes required to ensure convergence. Level-based preconditioning (ILU(ℓ)) has long been used in serial contexts and is widely recognized as robust and effective for a wide range of problems. However, the method has long been regarded as an inherently sequential technique; parallelism, it has been thought, can be achieved primarily at the expense of increased iterations. We dispute these claims. The first half of this dissertation takes an in-depth look at structurally based ILU(ℓ) symbolic factorization. Two definitions of fill level, “sum” and “max,” have been proposed. Hitherto, these definitions have been cast in terms of matrix terminology. We develop a sequence of lemmas and theorems that provide graph-theoretic characterizations of both definitions; these characterizations are based on the static graph of a matrix, G(A). Our Incomplete Fill Path Theorem characterizes fill levels per the sum definition; this is the definition used in most library implementations of the “classic” ILU(ℓ) factorization algorithm. Our theorem leads to several new graph-search algorithms that compute factors identical, or nearly identical, to those computed by the “classic” algorithm. Our analyses show that the new algorithms have lower run-time complexity than the previously existing algorithms for certain classes of matrices that are commonly encountered in scientific applications. The second half of this dissertation presents a Parallel ILU algorithmic framework (PILU). This framework enables scalable parallel ILU preconditioning by combining concepts from domain decomposition and graph ordering, and can accommodate ILU(ℓ) factorization as well as threshold-based ILUT methods. A model implementation of the framework, the Euclid library, was developed as part of this dissertation and was used to obtain experimental results for Poisson's equation, the convection-diffusion equation, and a nonlinear radiative transfer problem. The experiments, which were conducted on a variety of platforms with up to 400 CPUs, demonstrate that our approach is highly scalable for arbitrary ILU(ℓ) fill levels.
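    As an illustration of the “sum” fill-level rule discussed above, the sketch below implements a plain version of the “classic” ILU(ℓ) symbolic factorization: original nonzeros carry level 0, a fill entry (i, j) created through pivot k gets level lev(i,k) + lev(k,j) + 1, and entries above the threshold are dropped. This is a minimal Python sketch of the textbook algorithm, not the dissertation's graph-search algorithms or the Euclid library.

```python
# Minimal "sum"-rule ILU(p) symbolic factorization sketch.
# pattern: {row: {col: 0}} giving the nonzero structure of A.
# Returns {row: {col: level}} with all retained levels <= p.
def ilu_symbolic(pattern, p):
    n = len(pattern)
    lev = {i: dict(cols) for i, cols in pattern.items()}
    for i in range(n):
        k = -1
        while True:
            # next pivot column k < i in row i; fill created during this
            # loop can itself become a pivot, hence the re-scan
            nxt = [c for c in lev[i] if k < c < i]
            if not nxt:
                break
            k = min(nxt)
            base = lev[i][k]
            for j, lkj in lev[k].items():
                # sum rule: candidate level = lev(i,k) + lev(k,j) + 1
                if j > k and base + lkj + 1 <= p:
                    lev[i][j] = min(lev[i].get(j, p + 1), base + lkj + 1)
    return lev

# 5-point Laplacian pattern on a 3x3 grid (natural ordering)
nx = 3
idx = lambda r, c: r * nx + c
pat = {}
for r in range(nx):
    for c in range(nx):
        nbrs = [(r, c), (r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        pat[idx(r, c)] = {idx(a, b): 0 for a, b in nbrs
                          if 0 <= a < nx and 0 <= b < nx}
lev = ilu_symbolic(pat, p=1)
print(sum(v == 1 for row in lev.values() for v in row.values()),
      "level-1 fill entries")
```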

    Distributed Maple: parallel computer algebra in networked environments

    Get PDF
    We describe the design and use of Distributed Maple, an environment for executing parallel computer algebra programs on multiprocessors and heterogeneous clusters. The system embeds kernels of the computer algebra system Maple as computational engines into a networked coordination layer implemented in the programming language Java. On the basis of a comparatively high-level programming model, one may write parallel Maple programs that show good speedups in medium-scaled environments. We report on the use of the system for the parallelization of various functions of the algebraic geometry library CASA and demonstrate how design decisions affect the dynamic behaviour and performance of a parallel application. Numerous experimental results allow comparison of Distributed Maple with other systems for parallel computer algebra.
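    The coordination model described above can be illustrated with a toy task farm: a scheduler hands independent algebraic tasks to engine processes and collects the results as futures. The sketch below uses Python's process pool and sympy as stand-ins for Distributed Maple's Java scheduler and Maple kernels; the task itself is a hypothetical example, not a CASA function.

```python
# Toy task-farm illustration of the coordination-layer idea: algebraic
# tasks run in separate engine processes and results arrive as futures.
# Python/sympy stand in for the Java scheduler and Maple kernels.
from concurrent.futures import ProcessPoolExecutor
import sympy as sp

x = sp.symbols("x")

def factor_poly(coeffs):
    # one "engine" task: factor a univariate polynomial
    return sp.factor(sp.Poly(coeffs, x).as_expr())

if __name__ == "__main__":
    jobs = [[1, 0, -1], [1, -2, 1], [1, 0, 0, -8]]
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(factor_poly, c) for c in jobs]  # non-blocking
        for f in futures:
            print(f.result())  # blocks until the engine returns
```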

    Investigating Real-Time Sonar Performance Predictions Using Beowulf Clustering

    Get PDF
    Predicting sonar performance, critical to using any sonar to its maximum effectiveness, is computationally intensive, and typically the results are based on past data that may not be applicable to current water conditions. This paper discusses how Beowulf clustering techniques were investigated and applied to achieve real-time sonar performance prediction capabilities based on commercial off-the-shelf (COTS) hardware and software. A sonar system measures ambient noise in real time. Depending on the active sonar range scale, new ambient measurements can be available every 1 to 24 seconds. Traditional sonar performance prediction techniques operated serially and often took approximately 120 seconds of computing time per prediction; these predictions were outdated by potentially several sonar measurements. Using Beowulf clustering techniques, the same prediction now takes approximately 2 seconds. Analysis of measured data using a sonar hardware suite reveals that there is a set of sonar system parameters for which a serial approach to sonar performance prediction is more efficient than Beowulf clustering. Using these parameters, a sonar engineer can make the best decision for system prediction capability based on the number of sonar beams and the expected operational range. The paper includes a discussion of the taxonomies of parallel computing, the historical developments leading to measuring the speed of sound, and how those measurements enable acoustic paths to be computed in ocean environments.
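    The break-even point mentioned above can be sketched with a crude cost model: clustering wins when per-beam compute time dominates, while serial execution wins when the fixed cost of dispatching work to the cluster exceeds the compute saved. All constants below other than the quoted 120-second serial and 2-second clustered figures are hypothetical placeholders.

```python
# Crude cost model for serial vs. clustered prediction: compute time
# divides across nodes, dispatch overhead does not. Constants other
# than the abstract's 120 s / 2 s figures are hypothetical.
def predict_time(n_beams, per_beam_s, n_nodes=1, dispatch_s=0.0):
    return dispatch_s + (n_beams * per_beam_s) / n_nodes

serial = predict_time(n_beams=96, per_beam_s=1.25)                   # ~120 s
cluster = predict_time(n_beams=96, per_beam_s=1.25,
                       n_nodes=64, dispatch_s=0.2)                   # ~2 s
print(serial, cluster)   # many beams, long ranges: the cluster wins

# few beams and fast per-beam work: dispatch overhead dominates,
# so the serial approach is more efficient
print(predict_time(4, 0.01),
      predict_time(4, 0.01, n_nodes=64, dispatch_s=0.2))
```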

    2HOT: An Improved Parallel Hashed Oct-Tree N-Body Algorithm for Cosmological Simulation

    Full text link
    We report on improvements made over the past two decades to our adaptive treecode N-body method (HOT). A mathematical and computational approach to the cosmological N-body problem is described, with performance and scalability measured up to 256k (2^18) processors. We present error analysis and scientific application results from a series of more than ten 69-billion-particle (4096^3) cosmological simulations, accounting for 4 × 10^20 floating-point operations. These results include the first simulations using the new constraints on the standard model of cosmology from the Planck satellite. Our simulations set a new standard for accuracy and scientific throughput, while meeting or exceeding the computational efficiency of the latest generation of hybrid TreePM N-body methods. Comment: 12 pages, 8 figures, 77 references; to appear in Proceedings of SC '13.
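    The “hashed” part of the method refers to addressing oct-tree nodes by keys kept in a hash table. The sketch below shows the usual construction: the bits of quantized coordinates are interleaved into a Morton-style key with a leading placeholder bit, so a parent key is obtained by dropping three bits. This is illustrative only; HOT's actual key layout and hash function differ in detail.

```python
# Morton-style key construction for a hashed oct-tree: interleave the
# bits of quantized (x, y, z) into one integer key. A leading 1 bit
# marks the key length, so parent/child moves are 3-bit shifts.
def morton_key(x, y, z, bits=10):
    """Interleave `bits` bits of each coordinate in [0, 1)."""
    xi, yi, zi = (int(c * (1 << bits)) for c in (x, y, z))
    key = 1  # placeholder bit
    for b in reversed(range(bits)):
        key = (key << 3) | (((xi >> b) & 1) << 2) \
                         | (((yi >> b) & 1) << 1) \
                         | ((zi >> b) & 1)
    return key

def parent(key):
    return key >> 3   # drop the last octant choice

k = morton_key(0.3, 0.6, 0.9)
print(bin(k), bin(parent(k)))
```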

    Parallelizing the Cluster Rank Analysis application

    Get PDF
    A wide range of researchers is beginning to utilize customized statistical methods for analyzing data as hardware and software become cheaper and more widely available. Cluster Rank Analysis (CRA) is a multivariate statistical algorithm that previously existed only as an inefficient service-oriented application. Here we describe how CRA was optimized and parallelized using an available computing cluster and both open-source and custom software. This was followed by the development of a command-line submission system for CRA jobs, as well as a Web retrieval system for the results of analyses. A subsequent timing study revealed a speedup that quickly rose to 15 with 35 processors and should approach a projected maximum of 19 given more than 100 processors. This speedup was found to be limited primarily by the serial portion of the code; the Ethernet communication network was sufficient for this application. With even 10 processors involved in parallel runs, the average runtime had dropped from over 100 minutes to approximately 15 minutes, before being reduced to 6 minutes with 80 processors. The locations of bottlenecks suggest that further performance increases are possible through additional parallelization. This work with CRA illustrates (1) the speed with which high-performance in-house applications can be developed and (2) the speed and efficiency with which statistical analyses of complex data structures can be carried out given commodity hardware and software resources.
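    The observation that speedup is limited by the serial portion of the code is Amdahl's law: with serial fraction f, speedup on p processors is 1/(f + (1 − f)/p), with ceiling 1/f. The sketch below uses f ≈ 1/19, inferred from the reported ceiling of 19; that fraction is an assumption, and the simple model ignores parallel overheads, so it will not reproduce the measured figures exactly.

```python
# Amdahl's law: speedup on p processors given serial fraction f.
# f = 1/19 is inferred from the reported speedup ceiling of ~19;
# it is not a figure stated in the paper.
def amdahl(p, f):
    return 1.0 / (f + (1.0 - f) / p)

f = 1.0 / 19.0
for p in (10, 35, 80, 1000):
    print(p, round(amdahl(p, f), 1))   # approaches the ceiling 1/f = 19
```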

    Comparison of numerical solution strategies for gravity field recovery from GOCE SGG observations implemented on a parallel platform

    Get PDF
    The recovery of a full set of gravity field parameters from satellite gravity gradiometry (SGG) is a huge numerical and computational task. In practice, parallel computing has to be applied to estimate the more than 90 000 harmonic coefficients parameterizing the Earth's gravity field up to a maximum spherical harmonic degree of 300. Three independent solution strategies, i.e. two iterative methods (the preconditioned conjugate gradient method and the semi-analytic approach) and a strict solver (Distributed Non-approximative Adjustment), all operational on a parallel platform (the “Graz Beowulf Cluster”), are assessed and compared both theoretically and on the basis of an as-realistic-as-possible numerical simulation, regarding the accuracy of the results as well as the computational effort. Special concern is given to the correct treatment of the coloured noise characteristics of the gradiometer. The numerical simulations show that there are no significant discrepancies among the solutions of the three methods. The newly proposed Distributed Non-approximative Adjustment approach, which is the only one of the three methods that solves the inverse problem in a strict sense, also turns out to be a feasible method for practical applications. Key words: spherical harmonics, satellite gravity gradiometry, GOCE, parallel computing, Beowulf cluster.
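    Of the three strategies, the preconditioned conjugate gradient method is the most standard; a minimal sketch for a generic symmetric positive-definite system N x = b is given below (in gravity field recovery, N would be the normal-equation matrix of the adjustment). Jacobi preconditioning stands in here for whatever preconditioner the paper actually uses, which the abstract does not specify.

```python
# Minimal preconditioned conjugate gradient (PCG) for an SPD system
# N x = b. Jacobi (diagonal) preconditioning is an assumption, used
# only to keep the sketch self-contained.
import numpy as np

def pcg(N, b, tol=1e-10, max_iter=500):
    x = np.zeros_like(b)
    r = b - N @ x
    M_inv = 1.0 / np.diag(N)          # Jacobi: invert the diagonal
    z = M_inv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Np = N @ p
        alpha = rz / (p @ Np)
        x += alpha * p
        r -= alpha * Np
        if np.linalg.norm(r) < tol:
            break
        z = M_inv * r                 # apply the preconditioner
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# small SPD test system
A = np.array([[4.0, 1.0], [1.0, 3.0]])
print(pcg(A, np.array([1.0, 2.0])))
```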

    HPCCP/CAS Workshop Proceedings 1998

    Get PDF
    This publication is a collection of extended abstracts of presentations given at the HPCCP/CAS (High Performance Computing and Communications Program/Computational Aerosciences Project) Workshop held on August 24-26, 1998, at NASA Ames Research Center, Moffett Field, California. The objective of the Workshop was to bring together the aerospace high performance computing community, consisting of airframe and propulsion companies, independent software vendors, university researchers, and government scientists and engineers. The Workshop was sponsored by the HPCCP Office at NASA Ames Research Center. The Workshop consisted of over 40 presentations, including an overview of NASA's High Performance Computing and Communications Program and the Computational Aerosciences Project; ten sessions of papers representative of the high performance computing research conducted within the Program by the aerospace industry, academia, NASA, and other government laboratories; two panel sessions; and a special presentation by Mr. James Bailey.

    Reliable massively parallel symbolic computing : fault tolerance for a distributed Haskell

    Get PDF
    As the number of cores in manycore systems grows exponentially, the number of failures is also predicted to grow exponentially. Hence massively parallel computations must be able to tolerate faults. Moreover, new approaches to language design and system architecture are needed to address the resilience of massively parallel heterogeneous architectures. Symbolic computation has underpinned key advances in mathematics and computer science, for example in number theory, cryptography, and coding theory. Computer algebra software systems facilitate symbolic mathematics. Developing these at scale has its own distinctive set of challenges, as symbolic algorithms tend to employ complex irregular data and control structures. SymGridParII is a middleware for parallel symbolic computing on massively parallel High Performance Computing platforms. A key element of SymGridParII is a domain-specific language (DSL) called Haskell Distributed Parallel Haskell (HdpH). It is explicitly designed for scalable distributed-memory parallelism and employs work stealing to load-balance dynamically generated irregular task sizes. To investigate providing scalable fault-tolerant symbolic computation, we design, implement and evaluate a reliable version of HdpH, HdpH-RS. Its reliable scheduler detects and handles faults, using task replication as a key recovery strategy. The scheduler supports load balancing with a fault-tolerant work stealing protocol. The reliable scheduler is invoked with two fault tolerance primitives for implicit and explicit work placement, and 10 fault-tolerant parallel skeletons that encapsulate common parallel programming patterns. The user is oblivious to many failures; they are instead handled by the scheduler. An operational semantics describes small-step reductions on states. A simple abstract machine for scheduling transitions and task evaluation is presented. It defines the semantics of supervised futures and the transition rules for recovering tasks in the presence of failure. The transition rules are demonstrated with a fault-free execution and three executions that recover from faults. The fault-tolerant work stealing protocol has been abstracted into a Promela model, and the SPIN model checker is used to exhaustively search the states of this automaton to validate a key resiliency property of the protocol: an initially empty supervised future on the supervisor node will eventually be full in the presence of all possible combinations of failures. The performance of HdpH-RS is measured using five benchmarks. Supervised scheduling achieves a speedup of 757 with explicit task placement and 340 with lazy work stealing when executing Summatory Liouville on up to 1400 cores of an HPC architecture. Moreover, supervision overheads are consistently low when scaling up to 1400 cores. Low recovery overheads are observed in the presence of frequent failure when lazy on-demand work stealing is used. A Chaos Monkey mechanism has been developed for stress-testing resiliency with random failure combinations. All unit tests pass in the presence of random failure, terminating with the expected results.
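    The supervised-future recovery strategy can be illustrated with a toy simulation: a supervisor records each task it places and resubmits the task if the chosen worker fails before the future is filled, so every future is eventually full, which is the property the Promela/SPIN model validates. The sketch below is a Python simulation of this idea, not HdpH-RS's Haskell API.

```python
# Toy simulation of supervised futures with task replication: a failed
# placement is detected by the supervisor and the task resubmitted, so
# each future is eventually filled despite random worker failures.
import random

class SupervisedFuture:
    def __init__(self, task):
        self.task, self.value, self.filled = task, None, False

def run(tasks, workers, fail_prob=0.3):
    futures = [SupervisedFuture(t) for t in tasks]
    pending = list(futures)
    while pending:
        fut = pending.pop(0)
        _worker = random.choice(workers)   # stand-in for work placement
        if random.random() < fail_prob:
            # worker "died" before writing the result: the supervisor
            # replicates the task, so the future is eventually full
            pending.append(fut)
            continue
        fut.value, fut.filled = fut.task(), True
    return [f.value for f in futures]

print(run([lambda i=i: i * i for i in range(5)], workers=["w1", "w2"]))
```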