8,302 research outputs found

    Matrix-free GPU implementation of a preconditioned conjugate gradient solver for anisotropic elliptic PDEs

    Many problems in geophysical and atmospheric modelling require the fast solution of elliptic partial differential equations (PDEs) in "flat" three-dimensional geometries. In particular, an anisotropic elliptic PDE for the pressure correction has to be solved at every time step in the dynamical core of many numerical weather prediction (NWP) models, and equations of a very similar structure arise in global ocean models, subsurface flow simulations and gas and oil reservoir modelling. The elliptic solve is often the bottleneck of the forecast, and an algorithmically optimal method has to be used and implemented efficiently. Graphics Processing Units (GPUs) have been shown to be highly efficient for a wide range of applications in scientific computing, and recently iterative solvers have been parallelised on these architectures. We describe the GPU implementation and optimisation of a Preconditioned Conjugate Gradient (PCG) algorithm for the solution of a three-dimensional anisotropic elliptic PDE for the pressure correction in NWP. Our implementation exploits the strong vertical anisotropy of the elliptic operator in the construction of a suitable preconditioner. As the algorithm is memory bound, performance can be improved significantly by reducing the amount of global memory access. We achieve this by using a matrix-free implementation which does not require explicit storage of the matrix and instead recalculates the local stencil. Global memory access can also be reduced by rewriting the algorithm using loop fusion, and we show that this further reduces the runtime on the GPU. We demonstrate the performance of our matrix-free GPU code by comparing it to a sequential CPU implementation and to a matrix-explicit GPU code which uses existing libraries. The absolute performance of the algorithm for different problem sizes is quantified in terms of floating-point throughput and global memory bandwidth. Comment: 18 pages, 7 figures
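
The matrix-free PCG idea in the abstract above can be sketched on the CPU with NumPy. This is a minimal illustration under stated assumptions, not the authors' GPU code: the model operator (a constant-coefficient 7-point stencil with zero Dirichlet boundaries), the coefficients `ax`, `ay`, `az`, and all function names are hypothetical. The stencil is recomputed on every application instead of being stored as a matrix, and the preconditioner solves only the strongly coupled vertical part with a tridiagonal (Thomas) sweep per column. Loop fusion is not shown.

```python
import numpy as np

def apply_operator(u, ax, ay, az):
    """Matrix-free stencil: (1 + 2ax + 2ay + 2az) * u minus weighted neighbours.
    Values outside the grid are treated as zero (assumed Dirichlet boundary)."""
    out = (1.0 + 2.0 * (ax + ay + az)) * u
    out[1:, :, :] -= ax * u[:-1, :, :]
    out[:-1, :, :] -= ax * u[1:, :, :]
    out[:, 1:, :] -= ay * u[:, :-1, :]
    out[:, :-1, :] -= ay * u[:, 1:, :]
    out[:, :, 1:] -= az * u[:, :, :-1]
    out[:, :, :-1] -= az * u[:, :, 1:]
    return out

def vertical_line_solve(r, ax, ay, az):
    """Preconditioner: solve the vertical tridiagonal part of the operator
    for all columns at once with the Thomas algorithm (last axis = vertical)."""
    nz = r.shape[2]
    diag = 1.0 + 2.0 * (ax + ay + az)
    off = -az
    c = np.empty(nz)        # modified upper-diagonal coefficients
    d = np.empty_like(r)    # modified right-hand side
    c[0] = off / diag
    d[:, :, 0] = r[:, :, 0] / diag
    for k in range(1, nz):
        denom = diag - off * c[k - 1]
        c[k] = off / denom
        d[:, :, k] = (r[:, :, k] - off * d[:, :, k - 1]) / denom
    z = np.empty_like(r)
    z[:, :, -1] = d[:, :, -1]
    for k in range(nz - 2, -1, -1):
        z[:, :, k] = d[:, :, k] - c[k] * z[:, :, k + 1]
    return z

def pcg(b, ax, ay, az, tol=1e-8, max_iter=200):
    """Preconditioned conjugate gradient starting from a zero initial guess.
    Returns the solution and the number of iterations used."""
    u = np.zeros_like(b)
    r = b.copy()
    z = vertical_line_solve(r, ax, ay, az)
    p = z.copy()
    rz = np.vdot(r, z)
    b_norm = np.linalg.norm(b)
    for it in range(1, max_iter + 1):
        Ap = apply_operator(p, ax, ay, az)
        alpha = rz / np.vdot(p, Ap)
        u += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * b_norm:
            return u, it
        z = vertical_line_solve(r, ax, ay, az)
        rz_new = np.vdot(r, z)
        p = z + (rz_new / rz) * p
        rz = rz_new
    return u, max_iter
```

For example, `pcg(b, 1.0, 1.0, 100.0)` with `b` of shape `(8, 8, 16)` mimics a strongly anisotropic problem in which the vertical coupling dominates, which is the regime where a vertical line preconditioner is expected to help.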

    Generalized Wilson Chain for solving multichannel quantum impurity problems

    The Numerical Renormalization Group is used to solve quantum impurity problems, which describe magnetic impurities in metals, nanodevices, and correlated materials within DMFT. Here we present a simple generalization of the Wilson Chain, which improves the scaling of computational cost with the number of channels/bands, bringing new problems within reach. The method is applied to calculate the t-matrix of the three-channel Kondo model at T=0, which shows universal crossovers near non-Fermi liquid critical points. A non-integrable three-impurity problem with three bands is also studied, revealing a rich phase diagram and novel screening/overscreening mechanisms. Comment: 5 pages + 5 pages supplementary material

    A low-energy rate-adaptive bit-interleaved passive optical network

    Energy consumption of customer premises equipment (CPE) has become a serious issue in the new generations of time-division multiplexing passive optical networks, which operate at 10 Gb/s or higher. It is becoming a major factor in global network energy consumption, and it poses problems during emergencies when CPE is battery-operated. In this paper, a low-energy passive optical network (PON) that uses a novel bit-interleaving downstream protocol is proposed. The network architecture, the protocol, and the key enabling implementation aspects are described, including dynamic traffic interleaving, rate-adaptive descrambling of decimated traffic, and the design and implementation of a downsampling clock and data recovery circuit. The proposed concept is shown to reduce the energy consumption for protocol processing by a factor of 30. A detailed analysis of the energy consumption in the CPE shows that the interleaving protocol reduces the total energy consumption of the CPE significantly in comparison to the standard 10 Gb/s PON CPE. Experimental results obtained from measurements on the implemented CPE prototype confirm that the CPE consumes significantly less energy than the standard 10 Gb/s PON CPE.
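
The general idea of bit interleaving with receiver-side decimation can be shown with a toy example. The abstract does not give the actual frame format, so the fixed round-robin layout and the names `interleave` and `decimate` below are assumptions made purely for illustration. Each receiver keeps only every `step`-th bit from its own offset, so it handles 1/`step` of the line rate instead of the full downstream stream.

```python
def interleave(streams):
    """Round-robin bit interleaving (assumed layout): output position
    i * n + k carries bit i of stream k, for n equally long streams."""
    n = len(streams)
    length = len(streams[0])
    if any(len(s) != length for s in streams):
        raise ValueError("all streams must have the same length")
    return [streams[k][i] for i in range(length) for k in range(n)]

def decimate(frame, offset, step):
    """Receiver side: keep every `step`-th bit starting at `offset`
    and ignore all other bits of the downstream frame."""
    return frame[offset::step]

# Three hypothetical receivers sharing one downstream frame.
streams = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 0]]
frame = interleave(streams)
recovered = [decimate(frame, k, len(streams)) for k in range(len(streams))]
```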

    Access to vectors in multi-module memories

    The poor bandwidth obtained from memory when conflicts arise in the modules or in the interconnection network degrades the performance of computers. Address transformation schemes, such as interleaving, skewing and linear transformations, have been proposed to achieve conflict-free access for streams with constant stride. However, this is achieved only for some strides. In this paper, we summarize a mechanism that requests the elements out of order, which achieves conflict-free access for a larger number of strides. We study the cases of a single vector processor and of a vector multiprocessor system. For the latter case, we propose a synchronous mode of accessing memory that can be applied in SIMD machines or in MIMD systems with decoupled access and execution. Peer Reviewed. Postprint (published version)
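
Why plain interleaving is conflict-free only for some strides can be checked with a few lines of Python. The sketch assumes simple low-order interleaving, where address `a` lives in module `a % num_modules`; the function names are hypothetical, and the out-of-order request mechanism summarized in the abstract is not implemented here. Under this mapping a constant-stride stream visits only `num_modules / gcd(stride, num_modules)` distinct modules.

```python
from math import gcd

def modules_touched(base, stride, length, num_modules):
    """Module index of each element of a constant-stride stream under
    low-order interleaving (address modulo number of modules)."""
    return [(base + i * stride) % num_modules for i in range(length)]

def distinct_modules(stride, num_modules):
    """Number of different modules a constant-stride stream can reach."""
    return num_modules // gcd(stride, num_modules)

# With 8 modules, odd strides spread over all modules,
# while stride 8 sends every request to the same module.
summary = {s: distinct_modules(s, 8) for s in (1, 2, 3, 4, 8)}
```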

    Simulating the behavior of the human brain on GPUs

    The simulation of the behavior of the human brain is one of the most important challenges in computing today. The main problem consists of finding efficient ways to manipulate and compute the huge volume of data that this kind of simulation needs, using current technology. This work focuses on one of the main steps of such a simulation, which consists of computing the voltage on the neurons' morphology. This is carried out using the Hines algorithm and, although this algorithm is the optimum method in terms of number of operations, it needs non-trivial modifications to be efficiently parallelized on GPUs. We propose several optimizations to accelerate this algorithm on GPU-based architectures, exploring the limitations of both the method and the architecture, in order to efficiently solve a high number of Hines systems (neurons). Each of the optimizations is analyzed and described in depth. Two different approaches are studied, one for mono-morphology simulations (a batch of neurons with the same shape) and one for multi-morphology simulations (a batch of neurons where every neuron has a different shape). In mono-morphology simulations we obtain good performance using just a single kernel to compute all the neurons. However, this turns out to be inefficient for multi-morphology simulations, where a much more complex implementation is necessary to obtain good performance. In this case, we must execute more than one GPU kernel. In every execution (kernel call) one specific part of the batch of neurons is solved. These parts can be seen as multiple independent tridiagonal systems. Although the present paper focuses on the simulation of the behavior of the human brain, some of these techniques, in particular those related to the solving of tridiagonal systems, can also be used for multiple oil and gas simulations.
    Our studies show that the optimizations proposed in the present work can achieve high performance on computations with a high number of neurons, with our GPU implementations being about 4× and 8× faster than the OpenMP multicore implementation (16 cores), using one and two NVIDIA K80 GPUs respectively. It is also important to highlight that these optimizations can continue scaling, even when dealing with a very high number of neurons.
    This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No. 720270 (HBP SGA1), from the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII (TIN2015-65316-P), the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Programació i Entorns d’Execució Parallels (2014-SGR-1051). We thank the support of NVIDIA through the BSC/UPC NVIDIA GPU Center of Excellence, and the European Union’s Horizon 2020 Research and Innovation Program under the Marie Sklodowska-Curie Grant Agreement No. 749516. Peer Reviewed. Postprint (published version)
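
For orientation, the textbook sequential form of a Hines-type solve can be sketched in a few lines of Python. This is not the authors' GPU implementation: the storage layout and the names `parent`, `diag`, `lower` and `upper` are assumptions made for this example. Nodes are numbered so that every node's parent has a smaller index; when `parent[i] == i - 1` for all nodes, the system is an ordinary tridiagonal one and the sweeps reduce to the Thomas algorithm.

```python
def hines_solve(parent, diag, lower, upper, rhs):
    """Solve a tree-structured (Hines) linear system in O(n) operations.

    Assumed layout: node i > 0 has parent[i] < i; row i holds lower[i] in
    column parent[i], and row parent[i] holds upper[i] in column i.
    Index 0 is the root, so parent[0], lower[0] and upper[0] are unused.
    The input lists are not modified."""
    n = len(diag)
    d = list(diag)
    x = list(rhs)
    # Backward sweep: eliminate each node from its parent's row, leaves first.
    for i in range(n - 1, 0, -1):
        factor = upper[i] / d[i]
        d[parent[i]] -= factor * lower[i]
        x[parent[i]] -= factor * x[i]
    # Forward sweep: substitute from the root down to the leaves.
    x[0] /= d[0]
    for i in range(1, n):
        x[i] = (x[i] - lower[i] * x[parent[i]]) / d[i]
    return x

def hines_matvec(parent, diag, lower, upper, x):
    """Multiply the same tree-structured matrix by x (used to check a solution)."""
    y = [diag[i] * x[i] for i in range(len(x))]
    for i in range(1, len(x)):
        y[i] += lower[i] * x[parent[i]]
        y[parent[i]] += upper[i] * x[i]
    return y

# A small branching "neuron": node 1 has two children (2 and 3), node 0 has two (1 and 4).
parent = [-1, 0, 1, 1, 0, 4]
diag = [4.0] * 6
lower = [0.0] + [-1.0] * 5
upper = [0.0] + [-1.0] * 5
rhs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
voltage = hines_solve(parent, diag, lower, upper, rhs)
```

The backward sweep carries a dependency from every node to its parent, which is one reason a direct GPU port is not straightforward and why batching many independent systems is attractive.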

    From Theory to Practice: Sub-Nyquist Sampling of Sparse Wideband Analog Signals

    Conventional sub-Nyquist sampling methods for analog signals exploit prior information about the spectral support. In this paper, we consider the challenging problem of blind sub-Nyquist sampling of multiband signals, whose unknown frequency support occupies only a small portion of a wide spectrum. Our primary design goals are efficient hardware implementation and low computational load on the supporting digital processing. We propose a system, named the modulated wideband converter, which first multiplies the analog signal by a bank of periodic waveforms. The product is then lowpass filtered and sampled uniformly at a low rate, which is orders of magnitude smaller than Nyquist. Perfect recovery from the proposed samples is achieved under certain necessary and sufficient conditions. We also develop a digital architecture, which allows either reconstruction of the analog input, or processing of any band of interest at a low rate, that is, without interpolating to the high Nyquist rate. Numerical simulations demonstrate many engineering aspects: robustness to noise and mismodeling, potential hardware simplifications, real-time performance for signals with time-varying support and stability to quantization effects. We compare our system with two previous approaches: periodic nonuniform sampling, which is bandwidth limited by existing hardware devices, and the random demodulator, which is restricted to discrete multitone signals and has a high computational load. In the broader context of Nyquist sampling, our scheme has the potential to break through the bandwidth barrier of state-of-the-art analog conversion technologies such as interleaved converters. Comment: 17 pages, 12 figures, to appear in IEEE Journal of Selected Topics in Signal Processing, the special issue on Compressed Sensing

    Vector computer memory bank contention

    A number of vector supercomputers feature very large memories. Unfortunately, the large-capacity memory chips that are used in these computers are much slower than the fast central processing unit (CPU) circuitry. As a result, memory bank reservation times (in CPU ticks) are much longer than on previous generations of computers. A consequence of these long reservation times is that memory bank contention is sharply increased, resulting in significantly lowered performance rates. The phenomenon of memory bank contention in vector computers is analyzed using both a Markov chain model and a Monte Carlo simulation program. The results of this analysis indicate that future generations of supercomputers must either employ much faster memory chips or else feature very large numbers of independent memory banks.
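
The qualitative effect described in this abstract can be reproduced with a very small Monte Carlo experiment. The model below is an assumption made for illustration and is not the simulation program used in the paper: requests go to uniformly random banks, at most one request is issued per CPU tick, a bank stays reserved for `busy_ticks` after each access, and a request to a reserved bank stalls the whole stream until that bank is free. The function name and parameters are hypothetical.

```python
import random

def bank_efficiency(num_banks, busy_ticks, num_requests, seed=0):
    """Fraction of ticks in which a request is issued (1.0 = no stalls)
    under the toy contention model described in the lead-in."""
    rng = random.Random(seed)
    free_at = [0] * num_banks   # first tick at which each bank is free again
    tick = 0
    for _ in range(num_requests):
        bank = rng.randrange(num_banks)
        tick = max(tick, free_at[bank])      # stall while the bank is reserved
        free_at[bank] = tick + busy_ticks    # reserve the bank
        tick += 1                            # one request issued per tick
    return num_requests / tick

no_contention = bank_efficiency(16, 1, 10_000)
slow_memory = bank_efficiency(16, 8, 10_000)
many_banks = bank_efficiency(256, 8, 10_000)
```

In this toy model, longer reservation times lower the issue rate and adding independent banks recovers part of it, which is consistent with the two remedies named in the abstract.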