    Emulating Digital Logic using Transputer Networks (Very High Parallelism = Simplicity = Performance)

    Modern VLSI technology has changed the economic rules by which the balance between processing power, memory and communications is decided in computing systems. This will have a profound impact on the design rules for the controlling software. In particular, the criteria for judging efficiency of the algorithms will be somewhat different. This paper explores some of these implications through the development of highly parallel and highly distributable algorithms based on occam and transputer networks. The major results reported are a new simplicity for software designs, a corresponding ability to reason (formally and informally) about their properties, the reusability of their components and some real performance figures which demonstrate their practicality. Some guidelines to assist in these designs are also given. As a vehicle for discussion, an interactive simulator is developed for checking the functional and timing characteristics of digital logic circuits of arbitrary complexity.

    Ab initio computations of molecular systems by the auxiliary-field quantum Monte Carlo method

    The auxiliary-field quantum Monte Carlo (AFQMC) method provides a computational framework for solving the time-independent Schroedinger equation in atoms, molecules, solids, and a variety of model systems. AFQMC has recently witnessed remarkable growth, especially as a tool for electronic structure computations in real materials. The method has demonstrated excellent accuracy across a variety of correlated electron systems. Taking the form of stochastic evolution in a manifold of non-orthogonal Slater determinants, the method resembles an ensemble of density-functional theory (DFT) calculations in the presence of fluctuating external potentials. Its computational cost scales as a low power of system size, similar to the corresponding independent-electron calculations. Highly efficient and intrinsically parallel, AFQMC is able to take full advantage of contemporary high-performance computing platforms and numerical libraries. In this review, we provide a self-contained introduction to the exact and constrained variants of AFQMC, with emphasis on its applications to the electronic structure in molecular systems. Representative results are presented, and theoretical foundations and implementation details of the method are discussed. Comment: 22 pages, 11 figures.

    TARUC: A Topology-Aware Resource Usability and Contention Benchmark

    Computer architects have increased hardware parallelism and power efficiency by integrating massively parallel hardware accelerators (coprocessors) into compute systems. Many modern HPC clusters now consist of multi-CPU nodes along with additional hardware accelerators in the form of graphics processing units (GPUs). Each CPU and GPU is integrated with system memory via communication links (QPI and PCIe) and multi-channel memory controllers. The increasing density of these heterogeneous computing systems has resulted in complex performance phenomena including nonuniform memory access (NUMA) and resource contention that make application performance hard to predict and tune. This paper presents the Topology-Aware Resource Usability and Contention (TARUC) benchmark. TARUC is a modular, open-source, and highly configurable benchmark useful for profiling dense heterogeneous systems to provide insight for developers who wish to tune application codes for specific systems. Analysis of TARUC performance profiles from a multi-CPU, multi-GPU system is also presented.

    A New Parareal Algorithm for Time-Periodic Problems with Discontinuous Inputs

    The Parareal algorithm, which is related to multiple shooting, was introduced for solving evolution problems in a time-parallel manner. The algorithm was then extended to solve time-periodic problems. We are interested here in time-periodic systems which are forced with quickly-switching discontinuous inputs. Such situations occur, e.g., in power engineering when electric devices are excited with a pulse-width-modulated signal. In order to solve those problems numerically we consider a recently introduced modified Parareal method with reduced coarse dynamics. Its main idea is to use a low-frequency smooth input for the coarse problem, which can be obtained, e.g., from Fourier analysis. Based on this approach, we present and analyze a new Parareal algorithm for time-periodic problems with highly-oscillatory discontinuous sources. We illustrate the performance of the method via its application to the simulation of an induction machine.

    THE NAS PARALLEL BENCHMARKS

    The Numerical Aerodynamic Simulation (NAS) Program, which is based at NASA Ames Research Center, is a large-scale effort to advance the state of computational aerodynamics. Specifically, the NAS organization aims “to provide the Nation’s aerospace research and development community by the year 2000 a high-performance, operational computing system capable of simulating an entire aerospace vehicle system within a computing time of one to several hours” (NAS Systems Division, 1988, p. 3). The successful solution of this “grand challenge” problem will require the development of computer systems that can perform the required complex scientific computations at a sustained rate nearly 1,000 times greater than current generation supercomputers can achieve. The architecture of computer systems able to achieve this level of performance will likely be dissimilar to the shared memory multiprocessing supercomputers of today. While no consensus yet exists on what the design will be, it is likely that the system will consist of at least 1,000 processors computing in parallel. Highly parallel systems with computing power roughly equivalent to that of traditional shared memory multiprocessors exist today. Unfortunately, for various reasons, the performance evaluation of these systems on comparable types of scientific computations is very difficult. Relevant data for the performance of algorithms of interest to the computational aerophysics community on many currently available parallel systems are limited. Benchmarking and performance evaluation of such systems have not kept pace with advances in hardware, software, and algorithms. In particular, there is as yet no generally accepted benchmark program or even a benchmark strategy for these systems.

    Minimal-density, RAID-6 Codes: An Approach for w = 9

    RAID-6 erasure codes provide vital data integrity in modern storage systems. There is a class of RAID-6 codes called “Minimal Density Codes,” which have desirable performance properties. These codes are parameterized by a “word size,” w, and constructions of these codes are known when either w or w + 1 is a prime number. However, there are obvious gaps for which there is no theory. An exhaustive search was used to fill in the important gap when w = 8, which is highly applicable to real-world systems, since it is a power of 2. This paper extends that approach to address the next theoretical hole at w = 9 by expanding upon the techniques used for w = 8 and adding customizations to allow for parallel processing.

    Energy-Efficient Hardware-Accelerated Synchronization for Shared-L1-Memory Multiprocessor Clusters

    The steeply growing performance demands for highly power- and energy-constrained processing systems such as end-nodes of the Internet-of-Things (IoT) have led to parallel near-threshold computing (NTC), joining the energy-efficiency benefits of low-voltage operation with the performance typical of parallel systems. Shared-L1-memory multiprocessor clusters are a promising architecture, delivering performance in the order of GOPS and over 100 GOPS/W of energy-efficiency. However, this level of computational efficiency can only be reached by maximizing the effective utilization of the processing elements (PEs) available in the clusters. Along with this effort, the optimization of PE-to-PE synchronization and communication is a critical factor for performance. In this article, we describe a light-weight hardware-accelerated synchronization and communication unit (SCU) for tightly-coupled clusters of processors. We detail the architecture, which enables fine-grain per-PE power management, and its integration into an eight-core cluster of RISC-V processors. To validate the effectiveness of the proposed solution, we implemented the eight-core cluster in advanced 22 nm FDX technology and evaluated performance and energy-efficiency with tunable microbenchmarks and a set of real-life applications and kernels. The proposed solution allows synchronization-free regions as small as 42 cycles, over 41× smaller than the baseline implementation based on fast test-and-set access to L1 memory when constraining the microbenchmarks to 10 percent synchronization overhead. When evaluated on the real-life DSP applications, the proposed SCU improves performance by up to 92 percent (23 percent on average) and energy efficiency by up to 98 percent (39 percent on average).

    Improving the performance of MIMO Relay Networks using Parallel Relays

    In this thesis, we focus on improving the performance of MIMO relay communication systems using parallel relays and successive interference cancellation. We aim at minimising the mean-squared error of the signal waveform estimation at the destination node subject to transmit power constraints at the source and relay nodes. The joint source and relay matrices optimisation problems for parallel MIMO relay systems are highly nonconvex. We transform the problems into suitable forms which can be efficiently solved using standard convex optimization techniques. Both linear and nonlinear transceivers have been considered. The proposed algorithms have a significant performance improvement over the earlier approaches.