2,910 research outputs found

    Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion

    Get PDF
    We describe a new data format for storing triangular, symmetric, and Hermitian matrices called RFPF (Rectangular Full Packed Format). The standard two dimensional arrays of Fortran and C (also known as full format) that are used to represent triangular and symmetric matrices waste nearly half of the storage space but provide high performance via the use of Level 3 BLAS. Standard packed format arrays fully utilize storage (array space) but provide low performance as there is no Level 3 packed BLAS. We combine the good features of packed and full storage using RFPF to obtain high performance via using Level 3 BLAS as RFPF is a standard full format representation. Also, RFPF requires exactly the same minimal storage as packed format. Each LAPACK full and/or packed triangular, symmetric, and Hermitian routine becomes a single new RFPF routine based on eight possible data layouts of RFPF. This new RFPF routine usually consists of two calls to the corresponding LAPACK full format routine and two calls to Level 3 BLAS routines. This means {\it no} new software is required. As examples, we present LAPACK routines for Cholesky factorization, Cholesky solution and Cholesky inverse computation in RFPF to illustrate this new work and to describe its performance on several commonly used computer platforms. Performance of LAPACK full routines using RFPF versus LAPACK full routines using standard format for both serial and SMP parallel processing is about the same while using half the storage. Performance gains are roughly one to a factor of 43 for serial and one to a factor of 97 for SMP parallel times faster using vendor LAPACK full routines with RFPF than with using vendor and/or reference packed routines

    Performance Evaluation of Supercomputers using HPCC and IMB Benchmarks

    Get PDF
    The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of five leading supercomputers - SGI Altix BX2, Cray XI, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8. These five systems use five different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, and NEC IXS). The complete set of HPCC benchmarks are run on each of these systems. Additionally, we present Intel MPI Benchmarks (IMB) results to study the performance of 11 MPI communication functions on these systems

    The Evolution of the Mexican-Born Workforce in the United States

    Get PDF
    This paper examines the evolution of the Mexican-born workforce in the United States using data drawn from the decennial U.S. Census throughout the entire 20th century. It is well known that there has been a rapid rise in Mexican immigration to the United States in recent years. Interestingly, the share of Mexican immigrants in the U.S. workforce declined steadily beginning in the 1920s before beginning to rise in the 1960s. It was not until 1980 that the relative number of Mexican immigrants in the U.S. workforce was at the 1920 level. The paper examines the trends in the relative skills and economic performance of Mexican immigrants, and contrasts this evolution with that experienced by other immigrants arriving in the United States during the period. The paper also examines the costs and benefits of this influx by examining how the Mexican influx has altered economic opportunities in the most affected labor markets and by discussing how the relative prices of goods and services produced by Mexican immigrants may have changed over time.

    Fast Low Fidelity Microsimulation of Vehicle Traffic on Supercomputers

    Full text link
    A set of very simple rules for driving behavior used to simulate roadway traffic gives realistic results. Because of its simplicity, it is easy to implement the model on supercomputers (vectorizing and parallel), where we have achieved real time limits of more than 4~million~kilometers (or more than 53~million vehicle sec/sec). The model can be used for applications where both high simulation speed and individual vehicle resolution are needed. We use the model for extended statistical analysis to gain insight into traffic phenomena near capacity, and we discuss that this model is a good candidate for network routing applications. (Submitted to Transportation Research Board Meeting, Jan. 1994, Washington D.C.)Comment: 11 pages, latex, figs. available upon request, Cologne-WP 93.14

    Early benchmark results on the NEC SX-4

    Get PDF

    Optimizing sparse matrix-vector multiplication in NEC SX-Aurora vector engine

    Get PDF
    Sparse Matrix-Vector multiplication (SpMV) is an essential piece of code used in many High Performance Computing (HPC) applications. As previous literature shows, achieving efficient vectorization and performance in modern multi-core systems is nothing straightforward. It is important then to revisit the current stateof-the-art matrix formats and optimizations to be able to deliver deliver high performance in long vector architectures. In this tech-report, we describe how to develop an efficient implementation that achieves high throughput in the NEC Vector Engine: a 256 element-long vector architecture. Combining several pre-processing and kernel optimizations we obtain an average 12% improvement over a base SELLC-s implementation on a heterogeneous set of 24 matrices.Preprin

    Implementing and evaluating graph algorithms for long vector architectures

    Get PDF
    High-Performance Computing can be accelerated using long-vector architectures. However, creating efficient coding implementations for these architectures can be challenging. This Master's thesis focuses on implementing four well-known and widely-used graph processing algorithms using the RISC-V Vector Extension, leveraging an experimental system in an FPGA. I present a graph storage format that benefits from long vectors and describe how these four algorithms can be rewritten to utilize it. This thesis also introduces an instrumentation tool for FPGA that I developed to link the output of electrical engineering software with performance analysis tools for HPC. This tool allows users to visualize information coming from the logic analyzer internal to the FPGA with powerful visualization tools, permitting fine-grain analysis of the FPGA signals correlated with the code running on it. This tool has been integrated into the experimental performance analysis tools of BSC. In this thesis I leverage this tool to analyze and improve my implementations of graph algorithms for long-vector architectures, collecting the process and thoughts behind each optimization. Finally, I compare the performance of my vector implementations with other machines, such as the NEC SX-Aurora, a commercial RISC-V board, and an Intel chip

    Network Middleware for enterprise enhanced operation

    Get PDF

    Efficient direct convolution using long SIMD instructions

    Get PDF
    This paper demonstrates that state-of-the-art proposals to compute convolutions on architectures with CPUs supporting SIMD instructions deliver poor performance for long SIMD lengths due to frequent cache conflict misses. We first discuss how to adapt the state-of-the-art SIMD direct convolution to architectures using long SIMD instructions and analyze the implications of increasing the SIMD length on the algorithm formulation. Next, we propose two new algorithmic approaches: the Bounded Direct Convolution (BDC), which adapts the amount of computation exposed to mitigate cache misses, and the Multi-Block Direct Convolution (MBDC), which redefines the activation memory layout to improve the memory access pattern. We evaluate BDC, MBDC, the state-of-the-art technique, and a proprietary library on an architecture featuring CPUs with 16,384-bit SIMD registers using ResNet convolutions. Our results show that BDC and MBDC achieve respective speed-ups of 1.44× and 1.28× compared to the state-of-the-art technique for ResNet-101, and 1.83× and 1.63× compared to the proprietary library.This work receives EuroHPC-JU funding under grant no. 101034126, with support from the Horizon2020 program. Adrià Armejach is a Serra Hunter Fellow and has been partially supported by the Grant IJCI-2017-33945 funded by MCIN/AEI/10.13039/501100011033. Marc Casas has been par-tially supported by the Grant RYC-2017-23269 funded by MCIN/AEI/10.13039/501100011033 and ESF Investing in your future. This work is supported by the Spanish Ministry of Science and Technology through the PID2019-107255GB project and the Generalitat de Catalunya (contract 2017-SGR-1414).Peer ReviewedPostprint (author's final draft
    corecore