16 research outputs found

    Computational Physics on Graphics Processing Units

    The use of graphics processing units for scientific computations is an emerging strategy that can significantly speed up a wide variety of algorithms. In this review, we discuss advances made in the field of computational physics, focusing on classical molecular dynamics and on quantum simulations for electronic structure calculations using density functional theory, wave function techniques, and quantum field theory. Comment: Proceedings of the 11th International Conference, PARA 2012, Helsinki, Finland, June 10-13, 2012.

    Large-Scale Discrete Fourier Transform on TPUs

    In this work, we present two parallel algorithms for the large-scale discrete Fourier transform (DFT) on Tensor Processing Unit (TPU) clusters. The two algorithms correspond to two formulations of the DFT: one is based on the Kronecker product, specifically dense matrix multiplications between the input data and the Vandermonde matrix, denoted KDFT in this work; the other is based on the well-known Cooley-Tukey algorithm with phase adjustment, denoted FFT in this work. Both formulations take full advantage of the TPU's strength in matrix multiplication. The KDFT formulation additionally allows direct use of nonuniform inputs without an extra preprocessing step. Both parallel algorithms apply the same data-decomposition strategy to the input: the decomposition keeps the dense matrix multiplications in KDFT and FFT local to each TPU core, so they can be performed entirely in parallel. Communication among TPU cores is handled by a one-shuffle scheme in both algorithms, in which data is sent and received simultaneously between neighboring cores along the same direction of the interconnect network. The one-shuffle scheme is designed for the interconnect topology of TPU clusters and minimizes the time spent on communication among TPU cores. Both KDFT and FFT are implemented in TensorFlow. A three-dimensional complex DFT of size 8192 × 8192 × 8192 is performed on a full TPU Pod: the run time of KDFT is 12.66 seconds and that of FFT is 8.3 seconds. A scaling analysis demonstrates the high parallel efficiency of the two DFT implementations on TPUs.
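    The core of the KDFT formulation is that a DFT is simply a dense matrix product with the Vandermonde (DFT) matrix, which is what maps so well onto TPU matrix units and what admits nonuniform sample points for free. A minimal one-dimensional NumPy sketch of that formulation (the function name and sizes are illustrative, not from the paper's TensorFlow code):

```python
# Sketch of the KDFT idea: the DFT as a dense matrix multiplication.
import numpy as np

def kdft_1d(x, sample_points=None):
    """Length-n DFT as a Vandermonde-matrix product. If sample_points is
    given (nonuniform inputs), the same construction applies with no
    additional preprocessing step."""
    n = len(x)
    t = np.arange(n) if sample_points is None else np.asarray(sample_points)
    k = np.arange(n)
    # Vandermonde/DFT matrix: F[k, j] = exp(-2*pi*i * k * t_j / n)
    F = np.exp(-2j * np.pi * np.outer(k, t) / n)
    return F @ x

rng = np.random.default_rng(0)
x = rng.standard_normal(256) + 1j * rng.standard_normal(256)
# On uniform samples the matrix formulation matches the FFT exactly.
assert np.allclose(kdft_1d(x), np.fft.fft(x))
```

    For multidimensional inputs the same matrix is applied along each axis in turn, which is the Kronecker-product structure the abstract refers to.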

    The inherent overlapping in the parallel calculation of the Laplacian

    A new approach for the parallel computation of the Laplacian in the Fourier domain is presented. This numerical problem inherits the intrinsic sequencing involved in the calculation of any multidimensional fast Fourier transform (FFT), where blocking communications ensure that the computation is carried out strictly dimension by dimension. This data dependency vanishes when one considers the Laplacian as the sum of n independent one-dimensional kernels, so that computation and communication can be naturally overlapped using nonblocking communications. This overlap is shown to be responsible for the speedups we obtain when our approach is compared against state-of-the-art parallel multidimensional FFTs. Funding: Junta de Castilla y León (grant number VA296P18).
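    The identity behind the overlap is easy to state: because the multidimensional FFT factorizes into one-dimensional FFTs along each axis, the spectral Laplacian equals a sum of n one-dimensional kernels with no ordering between them. A serial NumPy sketch verifying just that identity (the parallel overlap with nonblocking communications is the paper's contribution and is not reproduced here):

```python
# Spectral Laplacian two ways: full 3D FFT vs. a sum of independent
# one-dimensional kernels, each needing FFTs along a single axis only.
import numpy as np

n, L = 64, 2 * np.pi
u = np.random.default_rng(1).standard_normal((n, n, n))
k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)       # wavenumbers per axis

# Reference: 3D FFT, multiply by -(kx^2 + ky^2 + kz^2), inverse 3D FFT.
kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
lap_3d = np.fft.ifftn(-(kx**2 + ky**2 + kz**2) * np.fft.fftn(u)).real

# n independent 1D kernels: one FFT/IFFT pair per axis, in any order.
def kernel_1d(u, axis):
    shape = [1, 1, 1]
    shape[axis] = -1                             # broadcast k^2 along axis
    return np.fft.ifft(-(k**2).reshape(shape) * np.fft.fft(u, axis=axis),
                       axis=axis).real

lap_sum = sum(kernel_1d(u, axis) for axis in range(3))
assert np.allclose(lap_3d, lap_sum)
```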

    Spectral turning bands for efficient Gaussian random fields generation on GPUs and accelerators

    A random field (RF) is a set of correlated random variables associated with different spatial locations. RF generation algorithms are of crucial importance for many scientific areas, such as astrophysics, geostatistics, computer graphics, and many others. Current approaches commonly make use of the 3D fast Fourier transform (FFT), which does not scale well for RFs larger than the available memory; they are also limited to regular rectilinear meshes. We introduce random field generation with the turning band method (RAFT), an RF generation algorithm based on the turning band method that is optimized for massively parallel hardware such as GPUs and accelerators. Our algorithm replaces the 3D FFT with a lower-order, one-dimensional FFT followed by a projection step, and is further optimized with loop unrolling and blocking. RAFT can easily generate RFs on non-regular (non-uniform) meshes and can efficiently produce fields with mesh sizes larger than the available device memory by using a streaming, out-of-core approach. The algorithm generates RFs with the correct statistical behavior and has been tested on a variety of modern hardware, such as NVIDIA Tesla, AMD FirePro, and Intel Xeon Phi. RAFT is faster than traditional methods on regular meshes and has been successfully applied to two real-case scenarios: planetary nebulae and cosmological simulations.
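    As a rough illustration of the turning-band idea (not RAFT's optimized pipeline, which uses a 1D FFT plus a projection step with blocking and out-of-core streaming), a Gaussian RF can be synthesized at arbitrary points as a sum of one-dimensional cosine processes carried on random line directions. A hedged NumPy sketch, assuming a Gaussian covariance model; all names and parameters are illustrative:

```python
# Spectral turning-bands sketch: sum many 1D cosine processes carried on
# random line directions. Works on arbitrary (non-rectilinear) point sets.
import numpy as np

def turning_bands_field(points, n_lines=2000, length_scale=0.1, seed=0):
    """Approximate a zero-mean, unit-variance Gaussian RF with covariance
    exp(-|h|^2 / (2 * length_scale**2)) at arbitrary 3D points (m, 3)."""
    rng = np.random.default_rng(seed)
    # Each 3D frequency sample defines a band: a line with direction
    # omega/|omega| carrying a 1D cosine process of frequency |omega|.
    omega = rng.standard_normal((n_lines, 3)) / length_scale
    phase = rng.uniform(0.0, 2.0 * np.pi, n_lines)
    proj = points @ omega.T                 # project points onto the lines
    return np.sqrt(2.0 / n_lines) * np.cos(proj + phase).sum(axis=1)

pts = np.random.default_rng(1).uniform(0.0, 1.0, (1000, 3))  # irregular mesh
f = turning_bands_field(pts)
print(round(f.mean(), 2), round(f.std(), 2))   # close to 0 and 1
```

    Because each band only needs projections of the evaluation points, nothing ties the method to a regular grid, which is the property RAFT exploits for non-uniform meshes.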

    Using GPUs to Compute Large Out-of-card FFTs

    The optimization of fast Fourier transform (FFT) problems that fit into GPU memory has been studied extensively. On-card FFT libraries such as CUFFT can generally achieve much better performance than their CPU counterparts, as the data transfer between CPU and GPU is usually not counted in their performance. This high performance, however, is limited by the GPU memory size. As the FFT problem size increases, the data transfer between system and GPU memory can comprise a substantial part of the overall execution time. Optimizations for FFT problems that outgrow the GPU memory therefore cannot bypass the tuning of data transfer between CPU and GPU, yet no prior study has addressed this problem. This paper is the first effort to use GPUs to efficiently compute large FFTs held in the CPU memory of a single compute node. We study the performance of the PCI bus during the transfer of a batch of FFT subarrays and propose a blocked buffer algorithm to improve the effective bandwidth. More importantly, we propose several FFT decomposition algorithms that increase data locality, further improve PCI bus efficiency, and balance computation between kernels. By integrating these two methods, we demonstrate an out-of-card FFT optimization strategy and develop an FFT library that efficiently computes large 1D, 2D, and 3D FFTs that cannot fit into GPU memory. On three of the latest GPUs, our library achieves much better double-precision performance than two of the most efficient CPU-based libraries, FFTW and Intel MKL. On average, our large FFTs on a single GeForce GTX 480 are 46% faster than FFTW and 57% faster than MKL with multiple threads running on a four-core Intel i7 CPU. The speedups on a Tesla C2070 are 1.93× over FFTW and 2.11× over MKL. A peak performance of 21 GFLOPS is achieved for a 2D FFT of size 2048 × 65536 on the C2070 in double precision.
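    Decompositions of this kind typically rest on the classic four-step (Cooley-Tukey) factorization: a length n1*n2 FFT becomes two batches of small FFTs plus a twiddle-factor multiplication and a transpose, so each batch can be staged through limited GPU memory one block at a time. A serial NumPy sketch of the factorization alone (the paper's contribution is tuning the block transfers over the PCI bus, which is not modeled here):

```python
# Four-step FFT factorization: the algebra behind out-of-card decomposition.
import numpy as np

def four_step_fft(x, n1, n2):
    """Compute a length n1*n2 FFT from n2 FFTs of size n1, a twiddle
    multiplication, and n1 FFTs of size n2. Each small-FFT batch touches
    only one block of the data, which is what an out-of-card scheme can
    stream through GPU memory."""
    a = x.reshape(n1, n2)                         # view input as n1 x n2
    b = np.fft.fft(a, axis=0)                     # step 1: column FFTs
    k1 = np.arange(n1)[:, None]
    m2 = np.arange(n2)[None, :]
    b = b * np.exp(-2j * np.pi * k1 * m2 / (n1 * n2))  # step 2: twiddles
    c = np.fft.fft(b, axis=1)                     # step 3: row FFTs
    return c.T.reshape(-1)                        # step 4: transpose, flatten

x = np.random.default_rng(2).standard_normal(4096)
assert np.allclose(four_step_fft(x, 64, 64), np.fft.fft(x))
```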

    ์ด์ข… ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ์œ„ํ•œ OpenCL ํ”„๋ ˆ์ž„์›Œํฌ

    Thesis (Ph.D.), Seoul National University, Department of Electrical and Computer Engineering, August 2013 (advisor: Jaejin Lee). OpenCL is a unified programming model for different types of computational units in a single heterogeneous computing system. OpenCL provides a common hardware abstraction layer across different computational units, so programmers can write OpenCL applications once and run them on any OpenCL-compliant hardware. However, current OpenCL is restricted to a programming model on a single operating system image: it does not work for a cluster of multiple nodes unless the programmer explicitly uses communication libraries, such as MPI. A heterogeneous cluster contains multiple general-purpose multicore CPUs and multiple accelerators to solve bigger problems within an acceptable time frame. As such clusters widen their user base, application developers are being forced to turn to an unattractive mix of programming models, such as MPI-OpenCL. This makes applications more complex, harder to maintain, and less portable.
    In this thesis, we propose SnuCL, an OpenCL framework for heterogeneous clusters. We show that the original OpenCL semantics naturally fits the heterogeneous cluster programming environment, and that the framework achieves both high performance and ease of programming. SnuCL provides the user with a single system image, as if the heterogeneous cluster ran a single operating system instance. It allows an application to utilize compute devices in any compute node as if they were in the host node, with no communication API, such as the MPI library, required in the application source. With SnuCL, an OpenCL application becomes portable not only between heterogeneous devices in a single node, but also between compute devices across the cluster. We implement SnuCL and evaluate its performance using eleven OpenCL benchmark applications.
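    The portability claim is that ordinary OpenCL host code needs no changes: under SnuCL, devices on remote nodes simply appear in the usual platform/device enumeration. A minimal host program following the standard OpenCL flow, written here with the pyopencl bindings rather than the C API; nothing in it is SnuCL-specific:

```python
# Standard OpenCL host flow: enumerate devices, build a kernel, run it.
# Under a cluster framework like SnuCL, get_devices() would also list
# devices that physically live on remote compute nodes.
import numpy as np
import pyopencl as cl

platform = cl.get_platforms()[0]
devices = platform.get_devices()
ctx = cl.Context(devices)
queue = cl.CommandQueue(ctx, devices[0])

src = """
__kernel void scale(__global float *buf, const float alpha) {
    int i = get_global_id(0);
    buf[i] *= alpha;
}
"""
prog = cl.Program(ctx, src).build()

host = np.arange(16, dtype=np.float32)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                hostbuf=host)
prog.scale(queue, host.shape, None, buf, np.float32(2.0))  # enqueue kernel
cl.enqueue_copy(queue, host, buf)                          # read back
print(host)   # values doubled on the device
```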

    Review: Deep learning in electron microscopy

    Deep learning is transforming most areas of science and technology, including electron microscopy. This review offers a practical perspective aimed at developers with limited familiarity with the field. For context, we review popular applications of deep learning in electron microscopy. Following that, we discuss the hardware and software needed to get started with deep learning and to interface with electron microscopes. We then review neural network components, popular architectures, and their optimization. Finally, we discuss future directions of deep learning in electron microscopy.