
    Developing numerical libraries in Java

    The rapid and widespread adoption of Java has created a demand for reliable and reusable mathematical software components to support the growing number of compute-intensive applications now under development, particularly in science and engineering. In this paper we address practical issues of the Java language and environment which have an effect on numerical library design and development. Benchmarks which illustrate the current levels of performance of key numerical kernels on a variety of Java platforms are presented. Finally, a strategy for the development of a fundamental numerical toolkit for Java is proposed and its current status is described. Comment: 11 pages. Revised version of paper presented to the 1998 ACM Conference on Java for High Performance Network Computing. To appear in Concurrency: Practice and Experience.

    Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion

    We describe a new data format for storing triangular, symmetric, and Hermitian matrices called RFPF (Rectangular Full Packed Format). The standard two-dimensional arrays of Fortran and C (also known as full format) that are used to represent triangular and symmetric matrices waste nearly half of the storage space but provide high performance via the use of Level 3 BLAS. Standard packed format arrays fully utilize storage (array space) but provide low performance, as there is no Level 3 packed BLAS. RFPF combines the good features of packed and full storage: it obtains high performance from Level 3 BLAS because RFPF is a standard full format representation, and it requires exactly the same minimal storage as packed format. Each LAPACK full and/or packed triangular, symmetric, and Hermitian routine becomes a single new RFPF routine based on eight possible data layouts of RFPF. This new RFPF routine usually consists of two calls to the corresponding LAPACK full format routine and two calls to Level 3 BLAS routines. This means no new software is required. As examples, we present LAPACK routines for Cholesky factorization, Cholesky solution and Cholesky inverse computation in RFPF to illustrate this new work and to describe its performance on several commonly used computer platforms. For both serial and SMP parallel processing, LAPACK full routines using RFPF perform about the same as LAPACK full routines using standard format while using half the storage. Vendor LAPACK full routines using RFPF run roughly 1 to 43 times faster in serial, and 1 to 97 times faster in SMP parallel, than vendor and/or reference packed routines.
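
The storage trade-off described in this abstract can be sketched in a few lines of Python. The function names below are hypothetical helpers, not LAPACK routines. `packed_index` follows the usual column-major lower-triangular packed convention, and `rfpf_shape` assumes the non-transposed rectangular layout, (n+1) by n/2 for even n and n by (n+1)/2 for odd n. The eight RFPF data layouts themselves are not reproduced here.

```python
def full_storage(n):
    """Elements allocated by a full n-by-n array."""
    return n * n


def packed_storage(n):
    """Elements needed to hold one triangle of an n-by-n matrix."""
    return n * (n + 1) // 2


def packed_index(i, j, n):
    """0-based offset of lower-triangle element (i, j), i >= j,
    in column-major packed storage."""
    if not 0 <= j <= i < n:
        raise ValueError("expected 0 <= j <= i < n")
    return i + j * (2 * n - j - 1) // 2


def rfpf_shape(n):
    """Assumed (rows, cols) of the rectangular array holding the
    triangle in the non-transposed layout."""
    if n % 2 == 0:
        return n + 1, n // 2
    return n, (n + 1) // 2


n = 1000
wasted = 1 - packed_storage(n) / full_storage(n)   # fraction of full format left unused
rows, cols = rfpf_shape(n)
```

For n = 1000 the full format leaves just under half of its entries unused, and `rows * cols` equals `packed_storage(n)`, which is the "same minimal storage as packed format" property the abstract states.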

    Benchmarking: More Aspects of High Performance Computing

    The original HPL algorithm assumes that all data fit entirely in main memory. This assumption obviously gives good performance because there is no disk I/O. However, not all applications can fit their entire data in memory. Applications that require a fair amount of I/O to move data between main memory and secondary storage are more indicative of how a Massively Parallel Processor (MPP) system is used. In this scenario a well-designed I/O architecture plays a significant part in the performance of the MPP system on regular jobs, and this is not represented in the current benchmark. The modified HPL algorithm is hoped to be a step toward filling this void. The most important factor in the performance of out-of-core algorithms is the actual I/O operations performed and their efficiency in transferring data between main memory and disk. Various methods for performing I/O operations were introduced in the report. The I/O method to use depends on the design of the out-of-core algorithm. Conversely, the performance of the out-of-core algorithm is affected by the choice of I/O operations. This implies that good performance is achieved when I/O efficiency is closely tied to the out-of-core algorithm, so out-of-core algorithms must be designed as such from the start. The timings in the various plots show that I/O plays a significant part in the overall execution time. This leads to an important conclusion: retrofitting an existing code may not be the best choice. The right-looking algorithm selected for the LU factorization is a recursive algorithm and performs well when the entire dataset is in memory. At each stage of the loop the entire trailing submatrix is read into memory panel by panel. This gives a polynomial number of I/O reads and writes. If the left-looking algorithm were selected for the main loop, the number of I/O operations involved would be linear in the number of columns. This is due to the data access pattern of the left-looking factorization. The right-looking algorithm performs better for in-core data, but the left-looking algorithm will perform better for out-of-core data because of the reduced I/O operations. Hence the conclusion that out-of-core algorithms perform better when designed as such from the start. The out-of-core and thread-based computations do not interact in this case, since I/O is not done by the threads. The performance of the thread-based computation does not depend on I/O because the threaded work is in the BLAS routines, which assume all the data to be in memory. This is the reason the out-of-core results and the OpenMP thread results were presented separately and no attempt was made to combine them. In general, the modified HPL performs better with larger block sizes, because of less I/O in the out-of-core part and better cache utilization in the thread-based computation.
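
The difference in data access pattern between the two LU variants can be illustrated with a toy panel-transfer count. This is a simplified model and not the report's code or measurements. It assumes the matrix is split into `num_panels` column panels on disk, that only the current panel plus one other panel fit in memory, and that each panel transfer counts as one I/O operation. The function names are hypothetical.

```python
def right_looking_io(num_panels):
    """Panel reads and writes when each step updates the whole
    trailing submatrix and stores it back to disk."""
    reads = writes = 0
    for k in range(num_panels):
        reads += 1                      # load panel k to factor it
        writes += 1                     # store the factored panel k
        trailing = num_panels - k - 1
        reads += trailing               # load every trailing panel for the update
        writes += trailing              # store every updated trailing panel
    return reads, writes


def left_looking_io(num_panels):
    """Panel reads and writes when each step pulls in the already
    factored panels to its left and writes only its own panel."""
    reads = writes = 0
    for k in range(num_panels):
        reads += k                      # re-read the k factored panels on the left (read-only)
        reads += 1                      # load panel k itself
        writes += 1                     # store panel k once, after updating and factoring
    return reads, writes


for p in (4, 16, 64):
    print(p, right_looking_io(p), left_looking_io(p))
```

Under these assumptions the right-looking variant performs p(p+1)/2 panel writes while the left-looking variant performs only p, one per panel. In this particular model the saving shows up in the writes, since panels to the left are only ever read.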

    Overlapping communication and computation by using a hybrid MPI/SMPSs approach

    A previous version of this document was submitted for publication by October 2008. Communication overhead is one of the dominant factors that affect performance in high-performance computing systems. To reduce the negative impact of communication, programmers overlap communication and computation by using asynchronous communication primitives. This increases code complexity, requiring more effort to write parallel code and making the code less readable. This paper presents the hybrid use of MPI and SMPSs (SMP superscalar), a task-based shared-memory programming model, enhanced with a restart mechanism that allows the programmer to introduce the asynchronism necessary for effective communication/computation overlap in a productive way. We demonstrate the hybrid use of MPI/SMPSs with the high-performance LINPACK benchmark, which uses the look-ahead technique to overlap communication and computation. MPI/SMPSs improves the performance of pure MPI with look-ahead by 7.6% on a 1024-processor machine. In addition to better performance, hybrid MPI/SMPSs substantially reduces code complexity, is less sensitive to network bandwidth and operating system noise, and improves the use of main memory. Postprint (published version)

    ScALPEL: A Scalable Adaptive Lightweight Performance Evaluation Library for application performance monitoring

    As supercomputers continue to grow in scale and capabilities, it is becoming increasingly difficult to isolate processor and system level causes of performance degradation. Over the last several years, a significant number of performance analysis and monitoring tools have been built or proposed. However, these tools suffer from several important shortcomings, particularly in distributed environments. In this paper we present ScALPEL, a Scalable Adaptive Lightweight Performance Evaluation Library for application performance monitoring at the functional level. Our approach provides several distinct advantages. First, ScALPEL is portable across a wide variety of architectures, and its ability to selectively monitor functions presents low run-time overhead, enabling its use for large-scale production applications. Second, it is run-time configurable, enabling both dynamic selection of functions to profile as well as events of interest on a per function basis. Third, our approach is transparent in that it requires no source code modifications. Finally, ScALPEL is implemented as a pluggable unit by reusing existing performance monitoring frameworks such as Perfmon and PAPI and extending them to support both sequential and MPI applications. Comment: 10 pages, 4 figures, 2 tables

    Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1

    With the rising costs of large-scale distributed systems, many researchers have begun looking at low-power architectures for clusters. In this paper, we describe our Astro cluster, which consists of 46 NVIDIA Jetson TK1 nodes, each equipped with an ARM Cortex A15 CPU, a 192-core Kepler GPU, 2 GB of RAM, and 16 GB of flash storage. The cluster has a number of advantages over conventional clusters, including lower power usage, ambient cooling, shared memory between the CPU and GPU, and affordability. The cluster is built from commodity hardware and can be set up at relatively low cost while providing up to 190 single precision GFLOPS of computing power per node thanks to its combined GPU/CPU architecture. The cluster currently uses one 48-port Gigabit Ethernet switch and runs Linux for Tegra, a modified version of Ubuntu provided by NVIDIA, as its operating system. Common file systems such as PVFS, Ceph, and NFS are supported by the cluster, and benchmarks such as HPL, LAPACK, and LAMMPS are used to evaluate the system. At peak performance, the cluster delivers 328 GFLOPS in double precision with a peak power draw of 810 W on the LINPACK benchmark, placing the cluster at 324th place on the Green500. Single precision benchmarks result in a peak performance of 6800 GFLOPS. The Astro cluster aims to be a proof of concept for future low-power clusters that use a similar architecture. The cluster is installed with many of the same applications used by top supercomputers and is validated using several standard supercomputing benchmarks. We show that with the rise of low-power CPUs and GPUs, and the need for lower server costs, this cluster provides insight into how ARM and CPU-GPU hybrid chips will perform in high-performance computing.
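
The figures quoted in this abstract can be combined in a short worked calculation. This is a back-of-the-envelope sketch that uses only numbers from the abstract. It assumes that the 810 W figure is the power drawn during the double precision LINPACK run and that all 46 nodes took part in the single precision benchmark. Neither assumption is confirmed by the abstract, and the variable names are illustrative.

```python
NODES = 46
PEAK_SP_GFLOPS_PER_NODE = 190      # "up to 190 single precision GFLOPS ... per node"
MEASURED_SP_GFLOPS = 6800          # reported single precision peak
MEASURED_DP_GFLOPS = 328           # reported double precision LINPACK result
PEAK_POWER_WATTS = 810             # assumed to be the draw during the LINPACK run

# Aggregate single precision ceiling if every node reached its quoted peak.
aggregate_sp_peak = NODES * PEAK_SP_GFLOPS_PER_NODE

# Fraction of that ceiling reached by the reported single precision result.
sp_fraction = MEASURED_SP_GFLOPS / aggregate_sp_peak

# Energy efficiency in MFLOPS per watt, the unit the Green500 list ranks by.
dp_mflops_per_watt = MEASURED_DP_GFLOPS * 1000 / PEAK_POWER_WATTS

print(f"aggregate SP peak: {aggregate_sp_peak} GFLOPS")
print(f"SP fraction of peak: {sp_fraction:.1%}")
print(f"DP efficiency: {dp_mflops_per_watt:.1f} MFLOPS/W")
```

Under these assumptions the reported single precision result is a little under four fifths of the aggregate per-node peak, and the double precision run works out to roughly 405 MFLOPS per watt.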

    Review and analysis of dense linear system solver package for distributed memory machines

    A dense linear system solver package recently developed at the University of Texas at Austin for distributed memory machines (e.g. Intel Paragon) has been reviewed and analyzed. The package contains about 45 software routines, some written in FORTRAN and some in C, and forms the basis for parallel/distributed solutions of systems of linear equations encountered in many problems of a scientific and engineering nature. The package, being studied by the Computer Applications Branch of the Analysis and Computation Division, may provide a significant computational resource for NASA scientists and engineers in parallel/distributed computing. Since the package is new and not well tested or documented, many of its underlying concepts and implementations were unclear; our task was to review, analyze, and critique the package as a step in the process that will enable scientists and engineers to apply it to the solution of their problems. All routines in the package were reviewed and analyzed. Underlying theory or concepts that exist in the form of published papers, technical reports, or memos were obtained either from the author or from the scientific literature, and general algorithms, explanations, examples, and critiques have been provided to explain the workings of these programs. Wherever things were still unclear, the developer (author) was contacted by telephone or electronic mail to clarify the workings of the routines. Whenever possible, tests were made to verify the concepts and logic employed in their implementations. A detailed report is being separately documented to explain the workings of these routines.

    Real-Time, Dynamic Hardware Accelerators for BLAS Computation

    This paper presents an approach to increasing the capability of scientific computing through the use of real-time, partially reconfigurable hardware accelerators that implement basic linear algebra subprograms (BLAS). The use of reconfigurable hardware accelerators for computing linear algebra functions has the potential to increase floating point computation while at the same time providing an architecture that minimizes data movement latency and increases power efficiency. While there has been significant work by the computing community to optimize BLAS routines at the software level, optimizing these routines in hardware using reconfigurable fabrics is in its infancy. This paper begins with a comprehensive overview of the history and evolution of BLAS for use in scientific computing. It then reviews current successes in using reconfigurable computing architectures to achieve acceleration. It then presents an investigation of an accelerator approach with a granularity at the logic circuit level through real-time, partial reconfiguration of a programmable fabric with static accelerator cache memory to minimize data movement. Empirical data is presented for a study on a single-FPGA

    Basic linear algebra subprograms for FORTRAN usage

    A package of 38 low-level subprograms for many of the basic operations of numerical linear algebra is presented. The package is intended to be used with FORTRAN. The operations in the package are dot products, elementary vector operations, Givens transformations, vector copy and swap, vector norms, vector scaling, and the indices of components of largest magnitude. The subprograms and a test driver are available in portable FORTRAN. Versions of the subprograms are also provided in assembly language for the IBM 360/67, the CDC 6600 and CDC 7600, and the Univac 1108.
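
The operation classes listed in this abstract are simple enough to sketch directly. The Python functions below illustrate what each class of operation computes. They are not the FORTRAN subprograms, their names and signatures are hypothetical, and they omit details a production version needs, such as strides, scaling against overflow in the norm, and the sign conventions of the package's Givens routine.

```python
import math


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def axpy(alpha, x, y):
    """Elementary vector operation: alpha * x + y."""
    return [alpha * a + b for a, b in zip(x, y)]


def scal(alpha, x):
    """Vector scaling: alpha * x."""
    return [alpha * a for a in x]


def nrm2(x):
    """Euclidean norm (no protection against overflow)."""
    return math.sqrt(sum(a * a for a in x))


def iamax(x):
    """0-based index of the first component of largest magnitude."""
    return max(range(len(x)), key=lambda i: abs(x[i]))


def givens(a, b):
    """Return (c, s, r) such that the rotation maps (a, b) to (r, 0)."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0, 0.0
    return a / r, b / r, r


def rot(x, y, c, s):
    """Apply the plane rotation (c, s) to the vector pair (x, y)."""
    new_x = [c * a + s * b for a, b in zip(x, y)]
    new_y = [c * b - s * a for a, b in zip(x, y)]
    return new_x, new_y


c, s, r = givens(3.0, 4.0)
rx, ry = rot([3.0], [4.0], c, s)   # rotates (3, 4) onto the first axis
```

Applying `rot` with the coefficients from `givens(3.0, 4.0)` sends the pair (3, 4) to approximately (5, 0), which is the zeroing step that Givens transformations provide.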