
    Correcting soft errors online in fast fourier transform

    While many algorithm-based fault tolerance (ABFT) schemes have been proposed to detect soft errors offline in the fast Fourier transform (FFT) after computation finishes, none of the existing ABFT schemes detects soft errors online, before the computation finishes. This paper presents an online ABFT scheme for FFT so that soft errors can be detected online and the corrupted computation can be terminated in a much more timely manner. We also extend our scheme to tolerate both arithmetic errors and memory errors, develop strategies to reduce its fault tolerance overhead and improve its numerical stability and fault coverage, and finally incorporate it into the widely used FFTW library, one of today's fastest FFT software implementations. Experimental results demonstrate that: (1) the proposed online ABFT scheme introduces much lower overhead than the existing offline ABFT schemes; (2) it detects errors in a much more timely manner; and (3) it also has higher numerical stability and better fault coverage.
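
To make the idea of an offline checksum test concrete, the following is a minimal sketch, not the scheme from the paper (the abstract does not describe it). It assumes a naive DFT and uses two standard DFT identities as checksums: X[0] equals the sum of the inputs, and the sum of all outputs equals N times x[0]. The names `dft` and `checksum_ok` and the tolerance value are illustrative assumptions.

```python
import cmath


def dft(x):
    """Naive O(N^2) discrete Fourier transform (stand-in for a real FFT)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def checksum_ok(x, X, tol=1e-9):
    """Offline check run after the transform has finished.

    Uses two DFT identities (hypothetical choice of checksums):
      sum_k X[k] == N * x[0]   and   X[0] == sum_j x[j]
    The tolerance is scaled by the input magnitude to absorb round-off.
    """
    n = len(x)
    scale = max(1.0, sum(abs(v) for v in x))
    output_sum_matches = abs(sum(X) - n * x[0]) <= tol * scale * n
    dc_term_matches = abs(X[0] - sum(x)) <= tol * scale
    return output_sum_matches and dc_term_matches


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    X = dft(x)
    print(checksum_ok(x, X))   # clean result passes the check

    X[3] += 1.0                # simulate a soft error in one output element
    print(checksum_ok(x, X))   # corrupted result fails the check
```

A check of this kind only runs once the whole transform is done, which is the limitation the abstract attributes to offline schemes; an online scheme would instead verify intermediate results so that a corrupted run can be stopped early.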

    GAMER: a GPU-Accelerated Adaptive Mesh Refinement Code for Astrophysics

    We present the newly developed code, GAMER (GPU-accelerated Adaptive MEsh Refinement code), which adopts a novel approach to improve the performance of adaptive mesh refinement (AMR) astrophysical simulations by a large factor with the use of the graphics processing unit (GPU). The AMR implementation is based on a hierarchy of grid patches with an oct-tree data structure. We adopt a three-dimensional relaxing TVD scheme for the hydrodynamic solver, and a multi-level relaxation scheme for the Poisson solver. Both solvers have been implemented on the GPU, so that hundreds of patches can be advanced in parallel. The computational overhead associated with the data transfer between CPU and GPU is carefully reduced by utilizing the GPU's capability for asynchronous memory copies, and the time spent computing the ghost-zone values for each patch is hidden by overlapping it with the GPU computations. We demonstrate the accuracy of the code by performing several standard test problems in astrophysics. GAMER is a parallel code that can be run on a multi-GPU cluster system. We measure the performance of the code by performing purely baryonic cosmological simulations on different hardware configurations, in which detailed timing analyses provide a comparison between the computations with and without GPU acceleration. Maximum speed-up factors of 12.19 and 10.47 are demonstrated using 1 GPU with 4096^3 effective resolution and 16 GPUs with 8192^3 effective resolution, respectively. Comment: 60 pages, 22 figures, 3 tables. More accuracy tests are included. Accepted for publication in ApJ

    Synthetic aperture radar signal processing on the MPP

    Satellite-borne Synthetic Aperture Radars (SAR) sense areas of several thousand square kilometers in seconds and transmit phase history signal data at several tens of megabits per second. The Shuttle Imaging Radar-B (SIR-B) has a variable swath of 20 to 50 km and acquired data over 100 km along track in about 13 seconds. Even with the simplification of separability of the reference function, the processing still requires considerable resources: high-speed I/O, large memory, and fast computation. Processing systems with regular hardware take hours to process one Seasat image and about one hour for a SIR-B image. Bringing this processing time closer to acquisition times requires an end-to-end system solution. For the purpose of demonstration, software was implemented on the present Massively Parallel Processor (MPP) configuration for processing Seasat and SIR-B data. The software takes advantage of the high processing speed offered by the MPP, the large Staging Buffer, and the high-speed I/O between the MPP array unit and the Staging Buffer. It was found that with unoptimized Parallel Pascal code, the processing time on the MPP for a 4096 x 4096 sample subset of signal data ranges between 18 and 30.2 seconds, depending on options.

    Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

    Deep learning frameworks have been widely deployed on GPU servers for deep learning applications in both academia and industry. Training deep neural networks (DNNs) involves many standard processes or algorithms, such as convolution and stochastic gradient descent (SGD), but the running performance of different frameworks can differ even when running the same deep model on the same GPU hardware. In this study, we evaluate the running performance of four state-of-the-art distributed deep learning frameworks (i.e., Caffe-MPI, CNTK, MXNet, and TensorFlow) over single-GPU, multi-GPU, and multi-node environments. We first build performance models of standard processes in training DNNs with SGD, and then we benchmark the running performance of these frameworks with three popular convolutional neural networks (i.e., AlexNet, GoogleNet and ResNet-50). After that, we analyze which factors result in the performance gap among these four frameworks. Through both analytical and experimental analysis, we identify bottlenecks and overheads which could be further optimized. The main contribution is that the proposed performance models and the analysis provide further optimization directions in both algorithmic design and system configuration. Comment: Published at DataCom'201

    A new data analysis framework for the search of continuous gravitational wave signals

    Continuous gravitational wave signals, like those expected from asymmetric spinning neutron stars, are among the most promising targets for LIGO and Virgo detectors. The development of fast and robust data analysis methods is crucial to increase the chances of a detection. We have developed a new and flexible general data analysis framework for the search of this kind of signal, which reduces the computational cost of the analysis by about two orders of magnitude with respect to current procedures. This can correspond, at fixed computing cost, to a sensitivity gain of up to 10%-20%, depending on the search parameter space. Some possible applications are discussed, with a particular focus on a directed search for sources in the Galactic center. Validation through the injection of artificial signals in the data of the Advanced LIGO first observational science run is also shown. Comment: 21 pages, 8 figures

    Block-iterative Richardson-Lucy methods for image deblurring
