
    Hardware Accelerated Scalable Parallel Random Number Generation

    The Scalable Parallel Random Number Generators library (SPRNG) is widely used due to its speed, quality, and scalability. Monte Carlo (MC) simulations often employ SPRNG to generate large quantities of random numbers. Taking advantage of rapid Field-Programmable Gate Array (FPGA) technology development, this thesis presents Hardware Accelerated SPRNG (HASPRNG) for the Virtex-II Pro XC2VP30 FPGAs. HASPRNG includes the full set of SPRNG generators and provides programming interfaces which hide detailed internal behavior from users. HASPRNG produces results identical to SPRNG, verified with over 1 million consecutive random numbers for each type of generator. The programming interface allows a developer to use HASPRNG the same way as SPRNG. HASPRNG achieves 4-70 times faster execution than the original SPRNG. This thesis describes the implementation of HASPRNG, the verification platform, the programming interface, and its performance.
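    As a rough illustration of that drop-in usage model, the sketch below shows the SPRNG 2.0-style default interface that HASPRNG is described as mirroring. The generator constant, stream numbering and seed follow standard SPRNG conventions as best recalled here; the exact HASPRNG headers and link flags are assumptions, not taken from the thesis.

        #include <stdio.h>
        #include "sprng.h"   /* HASPRNG is described as exposing the same calls */

        int main(void)
        {
            /* Stream 0 out of 1, seeded arbitrarily; SPRNG_LFG selects the
               lagged-Fibonacci generator family. */
            int *stream = init_sprng(SPRNG_LFG, 0, 1, 985456376, SPRNG_DEFAULT);

            for (int i = 0; i < 5; i++)
                printf("%f\n", sprng(stream));   /* uniform double in [0, 1) */

            free_sprng(stream);
            return 0;
        }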

    Mixing multi-core CPUs and GPUs for scientific simulation software

    Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some application paradigms. Software languages and systems such as NVIDIA's CUDA and the Khronos consortium's open compute language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very-many-core GPUs for data parallelism is necessary. We describe our use of hybrid applications using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data and discuss multi-threading software issues for the applications-level programmer, and offer some suggested areas for language development and integration between coarse-grained and fine-grained multi-thread systems. We discuss results from three common simulation algorithmic areas including: partial differential equations; graph cluster metric calculations; and random number generation. We report on programming experiences and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs; a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scientific applications developers.
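    The coarse-grained/fine-grained split described here is commonly realized by dedicating one host thread to each GPU. A minimal CUDA sketch of that pattern (the kernel and problem size are placeholders, not taken from the paper) might be:

        #include <stdio.h>
        #include <pthread.h>
        #include <cuda_runtime.h>

        __global__ void scale(float *x, int n)        /* placeholder kernel */
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) x[i] *= 2.0f;
        }

        static void *run_on_device(void *arg)
        {
            int dev = (int)(long)arg;
            int n = 1 << 20;
            cudaSetDevice(dev);            /* bind this host thread to one GPU */
            float *d;
            cudaMalloc((void **)&d, n * sizeof(float));
            scale<<<(n + 255) / 256, 256>>>(d, n);
            cudaDeviceSynchronize();       /* coarse-grained sync per device */
            cudaFree(d);
            printf("device %d done\n", dev);
            return NULL;
        }

        int main(void)
        {
            int ndev = 0;
            cudaGetDeviceCount(&ndev);
            if (ndev > 16) ndev = 16;
            pthread_t t[16];
            for (int dev = 0; dev < ndev; dev++)   /* one CPU thread per GPU */
                pthread_create(&t[dev], NULL, run_on_device, (void *)(long)dev);
            for (int dev = 0; dev < ndev; dev++)
                pthread_join(t[dev], NULL);
            return 0;
        }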

    Pseudo-random number generators for Monte Carlo simulations on Graphics Processing Units

    Basic uniform pseudo-random number generators are implemented on ATI Graphics Processing Units (GPUs). The performance results of the realized generators (multiplicative linear congruential (GGL), XOR-shift (XOR128), RANECU, RANMAR, RANLUX and Mersenne Twister (MT19937)) on CPU and GPU are discussed. The obtained speed-up factor reaches hundreds of times in comparison with the CPU. The RANLUX generator is found to be the most appropriate for use on GPUs in Monte Carlo simulations. A brief review of the pseudo-random number generators used in modern software packages for Monte Carlo simulations in high-energy physics is also presented.
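    Of the generators benchmarked, XOR128 is compact enough to reproduce in full. The following is Marsaglia's xor128 as published in his 2003 "Xorshift RNGs" paper, with his original seed values; the __host__ __device__ qualifiers assume compilation with nvcc and are an addition here.

        #include <stdio.h>

        /* Marsaglia's xor128: period 2^128 - 1; state must never be all zero. */
        typedef struct { unsigned int x, y, z, w; } xor128_t;

        __host__ __device__ static unsigned int xor128_next(xor128_t *s)
        {
            unsigned int t = s->x ^ (s->x << 11);
            s->x = s->y; s->y = s->z; s->z = s->w;
            s->w = s->w ^ (s->w >> 19) ^ (t ^ (t >> 8));
            return s->w;
        }

        int main(void)
        {
            /* Marsaglia's original seed values. */
            xor128_t s = { 123456789u, 362436069u, 521288629u, 88675123u };
            for (int i = 0; i < 4; i++)
                printf("%u\n", xor128_next(&s));
            return 0;
        }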

    GASPRNG: GPU accelerated scalable parallel random number generator library

    Graphics processors represent a promising technology for accelerating computational science applications. Many computational science applications require fast and scalable random number generation with good statistical properties, so they use the Scalable Parallel Random Number Generators library (SPRNG). We present the GPU Accelerated SPRNG library (GASPRNG) to accelerate SPRNG in GPU-based high performance computing systems. GASPRNG includes code for a host CPU and CUDA code for execution on NVIDIA graphics processing units (GPUs), along with a programming interface to support various usage models for pseudorandom numbers and computational science applications executing on the CPU, GPU, or both. This paper describes the implementation approach used to produce high performance and also describes how to use the programming interface. The programming interface allows a user to use GASPRNG the same way as SPRNG on traditional serial or parallel computers, as well as to develop tightly coupled programs executing primarily on the GPU. We also describe how to install and use GASPRNG. To help illustrate linking with GASPRNG, various demonstration codes are included for the different usage models. GASPRNG on a single GPU shows up to 280x speedup over SPRNG on a single CPU core and is able to scale for larger systems in the same manner as SPRNG. Because GASPRNG generates identical streams of pseudorandom numbers as SPRNG, users can be confident about the quality of GASPRNG for scalable computational science applications.
    Program summary
    Program title: GASPRNG
    Catalogue identifier: AEOI_v1_0
    Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEOI_v1_0.html
    Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
    Licensing provisions: UTK license
    No. of lines in distributed program, including test data, etc.: 167900
    No. of bytes in distributed program, including test data, etc.: 1422058
    Distribution format: tar.gz
    Programming language: C and CUDA
    Computer: Any PC or workstation with an NVIDIA GPU (tested on a Fermi GTX480, Tesla C1060, and Tesla M2070)
    Operating system: Linux with CUDA version 4.0 or later; should also run on MacOS, Windows, or UNIX
    Has the code been vectorized or parallelized?: Yes, parallelized using MPI directives.
    RAM: 512 MB to 732 MB of main memory on the host CPU, depending on the data type of the random numbers; 512 MB of GPU global memory
    Classification: 4.13, 6.5
    Nature of problem: Many computational science applications are able to consume large numbers of random numbers. For example, Monte Carlo simulations can consume limitless random numbers as long as computing resources permit. Moreover, parallel computational science applications require independent streams of random numbers to attain statistically significant results. The SPRNG library provides this capability, but at a significant computational cost. The GASPRNG library presented here accelerates the generators of independent streams of random numbers using graphical processing units (GPUs).
    Solution method: Multiple copies of random number generators on GPUs allow a computational science application to consume large numbers of random numbers from independent, parallel streams. GASPRNG is a random number generator library that allows a computational science application to employ multiple copies of random number generators to boost performance. Users can interface GASPRNG with software code executing on microprocessors and/or GPUs.
    Running time: The tests provided take a few minutes to run.
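    One of the usage models described, one independent stream per parallel process, can be illustrated with a short MPI sketch. The calls shown are the standard SPRNG 2.0 default interface, which the paper says GASPRNG mirrors; treat the exact GASPRNG header name and link flags as assumptions.

        #include <stdio.h>
        #include <mpi.h>
        #include "sprng.h"   /* GASPRNG is described as a drop-in for these calls */

        int main(int argc, char **argv)
        {
            int rank, size;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            /* Stream `rank` out of `size`: each rank gets an independent,
               non-overlapping stream derived from the same seed. */
            int *stream = init_sprng(SPRNG_LFG, rank, size, 42, SPRNG_DEFAULT);

            double local = 0.0;
            for (int i = 0; i < 1000000; i++)
                local += sprng(stream);        /* consume uniform variates */

            printf("rank %d: mean = %f\n", rank, local / 1000000.0);
            free_sprng(stream);
            MPI_Finalize();
            return 0;
        }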

    Highly optimized simulations on single- and multi-GPU systems of 3D Ising spin glass

    We present a highly optimized implementation of a Monte Carlo (MC) simulator for the three-dimensional Ising spin-glass model with bimodal disorder, i.e., the 3D Edwards-Anderson model, running on CUDA-enabled GPUs. Multi-GPU systems exchange data by means of the Message Passing Interface (MPI). The chosen MC dynamics is the classic Metropolis one, which is purely dissipative, since the aim was the study of the critical off-equilibrium relaxation of the system. We focused on the following issues: i) the implementation of efficient access patterns for nearest neighbours in a cubic stencil and for lagged-Fibonacci-like pseudo-random number generators (PRNGs); ii) a novel implementation of the asynchronous multispin-coding Metropolis MC step that allows one spin to be stored per bit; and iii) a multi-GPU version based on a combination of MPI and CUDA streams. We highlight how cubic stencils and PRNGs are two subjects of very general interest because of their widespread use in many simulation codes. Our code achieves best performances of ~3 and ~5 ps per spin flip on a GTX Titan with our implementations of MINSTD and MT19937, respectively.
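    The one-spin-per-bit storage in point ii) packs many spins into the bits of a machine word so that a single bitwise operation updates them all at once. The simplified sketch below uses the synchronous flavour that packs one spin from each of 32 independent replicas per word (the paper's asynchronous variant packs spins of one lattice instead); the thermal acceptance step, which needs per-bit random masks, is omitted.

        #include <stdio.h>

        typedef unsigned int word;  /* bit k = spin of replica k (0 = down, 1 = up) */

        /* A bond (s, n, J) is unsatisfied iff s XOR n XOR J = 1, where a J bit
           of 1 encodes an antiferromagnetic coupling. */
        __host__ __device__ static word unsat(word s, word n, word J)
        {
            return s ^ n ^ J;
        }

        /* Bitwise-accumulate one 1-bit addend into a 3-bit per-replica counter
           (c0 = LSB); six neighbours fit in 3 bits without overflow. */
        __host__ __device__ static void add_bit(word a, word *c0, word *c1, word *c2)
        {
            word carry0 = *c0 & a;       /* half adder on the low bit */
            *c0 ^= a;
            word carry1 = *c1 & carry0;  /* propagate the carry upward */
            *c1 ^= carry0;
            *c2 ^= carry1;
        }

        /* A flip is energetically favourable (deltaE <= 0) iff at least 3 of the
           6 bonds are unsatisfied. Returns the mask of such replicas; spins are
           then flipped with s ^= mask. Flips with deltaE > 0 would additionally
           be accepted per bit with probability exp(-deltaE/T) via random masks. */
        __host__ __device__ static word flip_mask(word s, const word nbr[6],
                                                  const word J[6])
        {
            word c0 = 0, c1 = 0, c2 = 0;
            for (int k = 0; k < 6; k++)
                add_bit(unsat(s, nbr[k], J[k]), &c0, &c1, &c2);
            return c2 | (c1 & c0);       /* count >= 3 */
        }

        int main(void)
        {
            /* All six neighbours anti-aligned across ferromagnetic bonds:
               every bond unsatisfied, so all 32 replicas flip. */
            word nbr[6] = { ~0u, ~0u, ~0u, ~0u, ~0u, ~0u }, J[6] = { 0 };
            printf("%08x\n", flip_mask(0u, nbr, J));   /* prints ffffffff */
            return 0;
        }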

    Accelerated Adjoint Algorithmic Differentiation with Applications in Finance

    Adjoint Differentiation's (AD) ability to calculate Greeks efficiently and to machine precision, while scaling in constant time with the number of input variables, is attractive for calibration and hedging, where frequent calculations are required. Algorithmic adjoint differentiation tools automatically generate derivative code and provide interesting challenges in both computer science and mathematics. In this dissertation we focus on a manual implementation, with particular emphasis on parallel processing using Graphics Processing Units (GPUs) to accelerate run times. Adjoint differentiation is applied to a Call on Max rainbow option with 3 underlying assets in a Monte Carlo environment. Assets are driven by the Heston stochastic volatility model and implemented using the Milstein discretisation scheme with truncation. The price is calculated along with Deltas and Vegas for each asset, for a total of 6 sensitivities. The application achieves favourable levels of parallelism on all three dimensions offered by the GPU: Instruction Level Parallelism (ILP), Thread Level Parallelism (TLP), and Single Instruction Multiple Data (SIMD). We estimate that the forward pass of the Milstein discretisation contains an ILP of 3.57, which is within the typical range of 2-4. Monte Carlo simulations are embarrassingly parallel and are capable of achieving a high level of concurrency. However, in this context a single kernel running at low occupancy can perform better with a combination of shared memory, vectorized data structures, and a high register count per thread. Run time on the Intel Xeon CPU with 501 760 paths and 360 time steps takes 48.801 seconds. The GT950 Maxwell GPU completed in 0.115 seconds, achieving a 422x speedup and a throughput of 13 million paths per second. The K40 is capable of achieving better performance.
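    As a hedged illustration of what a hand-written adjoint looks like, the pair below implements one Milstein step for plain geometric Brownian motion together with its exact reverse pass; this is a generic textbook example, not the dissertation's Heston code, and all names are illustrative.

        #include <math.h>
        #include <stdio.h>

        /* One Milstein step for dS = r*S*dt + sigma*S*dW:
           S1 = S + r*S*dt + sigma*S*sqrt(dt)*Z + 0.5*sigma^2*S*(Z*Z - 1)*dt */
        static double milstein_fwd(double S, double r, double sigma,
                                   double dt, double Z)
        {
            return S + r * S * dt
                     + sigma * S * sqrt(dt) * Z
                     + 0.5 * sigma * sigma * S * (Z * Z - 1.0) * dt;
        }

        /* Reverse pass: given S1_bar = dPayoff/dS1, push sensitivities back to
           S and sigma using the exact partial derivatives of the step above. */
        static void milstein_adj(double S, double r, double sigma, double dt,
                                 double Z, double S1_bar,
                                 double *S_bar, double *sigma_bar)
        {
            double dS1_dS   = 1.0 + r * dt + sigma * sqrt(dt) * Z
                                  + 0.5 * sigma * sigma * (Z * Z - 1.0) * dt;
            double dS1_dsig = S * sqrt(dt) * Z
                            + sigma * S * (Z * Z - 1.0) * dt;
            *S_bar      = S1_bar * dS1_dS;     /* chains toward Delta */
            *sigma_bar += S1_bar * dS1_dsig;   /* accumulates toward Vega */
        }

        int main(void)
        {
            double S = 100.0, r = 0.05, sigma = 0.2, dt = 1.0 / 360.0, Z = 0.3;
            double S1 = milstein_fwd(S, r, sigma, dt, Z);

            double S_bar = 0.0, sigma_bar = 0.0;
            milstein_adj(S, r, sigma, dt, Z, /*S1_bar=*/1.0, &S_bar, &sigma_bar);
            printf("S1 = %f  dS1/dS = %f  dS1/dsigma = %f\n", S1, S_bar, sigma_bar);
            return 0;
        }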

    Acceleration of GATE Monte Carlo simulations

    Positron Emission Tomography (PET) and Single Photon Emission Computed Tomography (SPECT) are forms of medical imaging that produce functional images reflecting biological processes. They are based on the tracer principle. A biologically active substance, a pharmaceutical, is selected so that its spatial and temporal distribution in the body reflects a certain body function or metabolism. In order to form images of the distribution, the pharmaceutical is labeled with gamma-ray-emitting or positron-emitting radionuclides (radiopharmaceuticals or tracers). After administration of the tracer to a patient, an external position-sensitive gamma-ray camera can detect the emitted radiation to form a stack of images of the radionuclide distribution after a reconstruction process. Monte Carlo methods are numerical methods that use random numbers to compute quantities of interest. This is normally done by creating a random variable whose expected value is the desired quantity. One then simulates and tabulates the random variable and uses its sample mean and variance to construct probabilistic estimates. It represents an attempt to model nature through direct simulation of the essential dynamics of the system in question. Monte Carlo modeling is the method of choice for all applications where measurements are not feasible or where analytic models are not available due to the complex nature of the problem. In addition, such modeling is a practical approach in nuclear medical imaging in several important application fields: detector design, quantification, correction methods for image degradations, detection tasks, etc. Several powerful dedicated Monte Carlo simulators for PET and/or SPECT are available. However, they are often neither detailed nor flexible enough to enable realistic simulations of emission tomography detector geometries while also modeling time-dependent processes such as decay, tracer kinetics, patient and bed motion, dead time or detector orbits. Our Monte Carlo simulator of choice, the GEANT4 Application for Tomographic Emission (GATE), was specifically designed to address all these issues. The flexibility of GATE comes at a price, however. The simulation of a simple prototype SPECT detector may be feasible within hours in GATE, but an acquisition with a realistic phantom may take years to complete on a single CPU. In this dissertation we therefore focus on the Achilles' heel of GATE: efficiency. Acceleration of GATE simulations can only be achieved through a combination of efficient data analysis, dedicated variance reduction techniques, fast navigation algorithms and parallelization. In the first part of this dissertation we consider the improvement of the analysis capabilities of GATE. The static analysis module in GATE is both inflexible and incapable of storing more detail without introducing a large computational overhead. However, the design and validation of the acceleration techniques in this dissertation require a flexible, detailed and computationally efficient analysis module. To this end, we develop a new analysis framework capable of analyzing any process, from the decay of isotopes to particle interactions and detections, in any detector element and for any type of phantom. The evaluation of our framework consists of the assessment of spurious activity in 124I-Bexxar PET and of contamination in 131I-Bexxar SPECT. In the case of PET we describe how our framework can detect spurious coincidences generated by non-pure isotopes, even with realistic phantoms.
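    The "sample mean and variance" construction described above is worth seeing once in miniature. The sketch below is a generic illustration, not from the dissertation: it estimates E[exp(U)] for U uniform on (0,1), whose exact value is e - 1, and attaches a standard error from the sample variance.

        #include <stdio.h>
        #include <stdlib.h>
        #include <math.h>

        int main(void)
        {
            const int N = 1000000;
            double sum = 0.0, sum2 = 0.0;
            srand(12345);
            for (int i = 0; i < N; i++) {
                double u = (rand() + 0.5) / ((double)RAND_MAX + 1.0);
                double x = exp(u);             /* the simulated random variable */
                sum += x;
                sum2 += x * x;
            }
            double mean = sum / N;
            double var  = (sum2 / N - mean * mean) * N / (N - 1.0);
            printf("estimate %.6f +/- %.6f (exact %.6f)\n",
                   mean, sqrt(var / N), exp(1.0) - 1.0);
            return 0;
        }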
    We show that optimized energy thresholds, which can readily be applied in the clinic, can now be derived in order to minimize the contamination. We also show that the spurious activity itself is not spatially uniform, so standard reconstruction and correction techniques are not adequate. In the case of SPECT we describe how it is now possible to classify detections into geometric detections, phantom scatter, penetration through the collimator, collimator scatter and backscatter in the end parts. We show that standard correction algorithms such as triple energy window correction cannot correct for septal penetration. We demonstrate that 124I PET with optimized energy thresholds offers better image quality than 131I SPECT when using standard reconstruction techniques. In the second part of this dissertation we focus on improving the efficiency of GATE with a variance reduction technique called Geometrical Importance Sampling (GIS). We describe how only 0.02% of all emitted photons can reach the crystal surface of a SPECT detector head with a low-energy high-resolution collimator. A lot of computing power is therefore wasted by tracking photons that will not contribute to the result. A twofold strategy is used to solve this problem: GIS employs Russian roulette to discard those photons that are unlikely to contribute to the result, while photons in more important regions are split into several photons with reduced weight to increase their survival chance. We show that this technique introduces branches into the particle history, and describe how this can be taken into account by a particle history tree that is used for the analysis of the results. The evaluation of GIS consists of energy spectra validation, spatial resolution and sensitivity for low and medium energy isotopes. We show that GIS reaches acceleration factors between 5 and 13 over analog GATE simulations for the isotopes in the study. It is a general acceleration technique that can be used for any isotope, phantom and detector combination. Although GIS is useful as a safe and accurate acceleration technique, it cannot deliver clinically acceptable simulation times. The main reason lies in its inability to force photons in a specific direction. In the third part of this dissertation we solve this problem for 99mTc SPECT simulations. Our approach is twofold. Firstly, we introduce two variance reduction techniques: forced detection (FD) and convolution-based forced detection (CFD) with multiple projection sampling (MPS). FD and CFD force copies of photons at decay and at every interaction point to be transported through the phantom in a direction sampled within a solid angle toward the SPECT detector head, at all SPECT angles simultaneously. We describe how a weight must be assigned to each photon in order to compensate for the forced direction and non-absorption at emission and scatter. We show how the weights are calculated from the total and differential Compton and Rayleigh cross sections per electron, with incorporation of Hubbell's atomic form factor. In the case of FD all detector interactions are modeled by Monte Carlo, while in the case of CFD the detector is modeled analytically. Secondly, we describe the design of an FD and CFD specialized navigator to accelerate the slow tracking algorithms in GEANT4. The validation study shows that both FD and CFD closely match the analog GATE simulations and that we can obtain an acceleration factor of between 3 (FD) and 6 (CFD) orders of magnitude over analog simulations.
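    The Russian roulette and splitting moves at the heart of GIS are standard weight-based variance reduction; the generic sketch below (thresholds and survival probability are illustrative choices, not GATE's) shows why both leave the estimator unbiased: every weight change is exactly compensated by the probability of the branch taken.

        #include <stdio.h>
        #include <stdlib.h>

        typedef struct { double weight; /* position, direction, energy ... */ } photon_t;

        static double uniform01(void)
        {
            return (rand() + 0.5) / ((double)RAND_MAX + 1.0);
        }

        /* Russian roulette: kill an unimportant photon with probability 1 - p,
           boosting a survivor's weight by 1/p. Returns 1 if it survives. */
        static int roulette(photon_t *ph, double p)
        {
            if (uniform01() < p) { ph->weight /= p; return 1; }
            return 0;
        }

        /* Splitting: replace one photon by n copies, each with weight w/n, to
           raise the survival chance in important regions. */
        static void split(const photon_t *ph, photon_t *out, int n)
        {
            for (int i = 0; i < n; i++) {
                out[i] = *ph;
                out[i].weight = ph->weight / n;
            }
        }

        int main(void)
        {
            const int trials = 1000000;
            double total = 0.0;
            for (int i = 0; i < trials; i++) {
                photon_t ph = { 1.0 };
                if (roulette(&ph, 0.25)) total += ph.weight;
            }
            /* Expected weight is conserved: the average stays ~1.0. */
            printf("mean surviving weight %.4f\n", total / trials);
            return 0;
        }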
    This allows for the simulation of a realistic acquisition with a torso phantom within 130 seconds. In the fourth part of this dissertation we exploit the intrinsically parallel nature of Monte Carlo simulations. We show how Monte Carlo simulations should scale linearly as a function of the number of processing nodes, but that this is usually not achieved due to job setup time, output handling and cluster overhead. We describe how our approach is based on two steps: job distribution and output data handling. The job distribution is based on a time-domain partitioning scheme that retains all experimental parameters and guarantees the statistical independence of each subsimulation, as sketched below. We also reduce the job setup time by the introduction of a parameterized collimator model for SPECT simulations, and we reduce the output data handling time with a chain-based output merger. The scalability study is based on a set of simulations on a 70-CPU cluster and shows an acceleration factor of approximately 66 on 70 CPUs for both PET and SPECT. We also show that our method of parallelization does not introduce any approximations and that it can be readily combined with any of the acceleration techniques described above.
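    The time-domain partitioning idea can be made concrete with a tiny sketch: the acquisition interval is cut into disjoint windows, one per job, so each subsimulation keeps every experimental parameter and simulates physically independent decays. The window arithmetic below is an illustrative reconstruction, not GATE's actual job scripts.

        #include <stdio.h>

        /* Split an acquisition [t_start, t_stop) into njobs disjoint windows.
           Each job runs the *same* macro, phantom and detector, restricted to
           its own window, so the merged output is statistically equivalent to
           one long run. */
        static void job_window(double t_start, double t_stop, int njobs,
                               int job, double *w0, double *w1)
        {
            double dt = (t_stop - t_start) / njobs;
            *w0 = t_start + job * dt;
            *w1 = (job == njobs - 1) ? t_stop : *w0 + dt;
        }

        int main(void)
        {
            double w0, w1;
            for (int j = 0; j < 4; j++) {
                job_window(0.0, 600.0, 4, j, &w0, &w1);   /* 10 min scan, 4 jobs */
                printf("job %d: [%g s, %g s)\n", j, w0, w1);
            }
            return 0;
        }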

    The theoretical development of a new high speed solution for Monte Carlo radiation transport computations

    Advancements in parallel and cluster computing have made many complex Monte Carlo simulations possible in the past several years. Unfortunately, cluster computers are large, expensive, and still not fast enough to make the Monte Carlo technique useful for calculations requiring a near real-time evaluation period. For Monte Carlo simulations, a small computational unit called a Field Programmable Gate Array (FPGA) is capable of bringing the power of a large cluster computer into any personal computer (PC). Because an FPGA is capable of executing Monte Carlo simulations with a high degree of parallelism, a simulation run on a large FPGA can be executed at a much higher rate than an equivalent simulation on a modern single-processor desktop PC. In this thesis, a simple radiation transport problem involving moderate energy photons incident on a three-dimensional target is discussed. By comparing the theoretical evaluation speed of this transport problem on a large FPGA to the evaluation speed of the same transport problem using standard computing techniques, it is shown that it is possible to accelerate Monte Carlo computations significantly using FPGAs. In fact, we have found that our simple photon transport test case can be evaluated in excess of 650 times faster on a large FPGA than on a 3.2 GHz Pentium-4 desktop PC running MCNP5, an acceleration factor that we predict will be largely preserved for most Monte Carlo simulations.
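    The inner loop that such a transport engine replicates in hardware is tiny, which is what makes FPGA parallelism attractive. A hedged software sketch of the most basic step, sampling a photon's exponentially distributed free path before its next interaction, is shown below; the attenuation coefficient is an illustrative value, not a number from the thesis.

        #include <stdio.h>
        #include <stdlib.h>
        #include <math.h>

        static double uniform01(void)
        {
            return (rand() + 0.5) / ((double)RAND_MAX + 1.0);
        }

        /* Distance to the next interaction in a homogeneous medium with total
           attenuation coefficient mu (1/cm): d = -ln(u)/mu, exponential law. */
        static double free_path(double mu)
        {
            return -log(uniform01()) / mu;
        }

        int main(void)
        {
            const double mu = 0.15;   /* illustrative, roughly soft tissue */
            const int N = 1000000;
            double sum = 0.0;
            for (int i = 0; i < N; i++)
                sum += free_path(mu);
            printf("mean free path %.4f cm (expect %.4f)\n", sum / N, 1.0 / mu);
            return 0;
        }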

    Stream ciphers for secure display

    In any situation where private, proprietary or highly confidential material is being dealt with, the need to consider aspects of data security has grown ever more important. It is usual to secure such data from its source, over networks and on to the intended recipient. However, data security considerations typically stop at the recipient's processor, leaving connections to a display transmitting raw data which is increasingly in a digital format and of value to an adversary. With a progression to wireless display technologies the prominence of this vulnerability is set to rise, making the implementation of 'secure display' increasingly desirable. Secure display takes aspects of data security right to the display panel itself, potentially minimising the cost, component count and thickness of the final product. Recent developments in display technologies should help make this integration possible. However, the processing of large quantities of time-sensitive data presents a significant challenge in such resource-constrained environments. Efficient high-throughput decryption is a crucial aspect of the implementation of secure display, and one for which the widely used and well understood block cipher may not be best suited. Stream ciphers present a promising alternative, and a number of strong candidate algorithms potentially offer the hardware speed and efficiency required. In the past, similar stream ciphers have suffered from algorithmic vulnerabilities. Although these new-generation designs have done much to respond to this concern, the relatively short 80-bit key lengths of some proposed hardware candidates, combined with ever-advancing computational power, lead this thesis to identify exhaustive search of the key space as a potential attack vector. To determine the value of protection afforded by such short key lengths, a unique hardware key search engine for stream ciphers is developed that makes use of an appropriate data element to improve search efficiency. The simulations from this system indicate that the proposed key lengths may be insufficient for applications where data is of long-term or high value. It is suggested that for the concept of secure display to be accepted, a longer key length should be used.
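    An exhaustive key-search engine of the kind evaluated here has a simple software shape. The skeleton below is generic: keystream_80 is a hypothetical stand-in for the target 80-bit-key cipher (a real engine would drop in, e.g., a Trivium or Grain core and replicate many pipelined instances in hardware), and the toy byte mixer defined for it exists only so the sketch runs; it is not cryptographic.

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        /* Placeholder "cipher": expand an 80-bit key into n keystream bytes
           (n <= 16 here). NOT cryptographic; replace with the real cipher. */
        static void keystream_80(const uint8_t key[10], uint8_t *out, size_t n)
        {
            uint32_t s = 0x12345678u;
            for (int i = 0; i < 10; i++) s = s * 31u + key[i];
            for (size_t j = 0; j < n; j++) {
                s = s * 1664525u + 1013904223u;   /* LCG step */
                out[j] = (uint8_t)(s >> 24);
            }
        }

        static int inc80(uint8_t key[10])          /* returns 1 on wrap-around */
        {
            for (int i = 9; i >= 0; i--)
                if (++key[i] != 0) return 0;
            return 1;
        }

        /* Exhaustive search: try keys in counter order until the keystream
           matches a known reference prefix. Matching on a short, distinctive
           data element keeps the per-key work (and hardware cost) low. */
        static int search(const uint8_t *ref, size_t nref, uint8_t found[10])
        {
            uint8_t key[10] = { 0 }, buf[16];
            do {
                keystream_80(key, buf, nref);
                if (memcmp(buf, ref, nref) == 0) {
                    memcpy(found, key, 10);
                    return 1;
                }
            } while (!inc80(key));     /* 2^80 keys: infeasible serially,  */
            return 0;                  /* hence the replicated hardware    */
        }

        int main(void)
        {
            uint8_t secret[10] = { 0 }, ref[8], found[10];
            secret[9] = 3;                     /* demo key, found in 4 tries */
            keystream_80(secret, ref, sizeof ref);
            if (search(ref, sizeof ref, found))
                printf("recovered key, last byte = %u\n", found[9]);
            return 0;
        }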