Computational Physics on Graphics Processing Units
The use of graphics processing units for scientific computations is an
emerging strategy that can significantly speed up many algorithms.
In this review, we discuss advances made in the field of computational physics,
focusing on classical molecular dynamics and on quantum simulations for
electronic structure calculations using density functional theory, wave
function techniques, and quantum field theory.
Comment: Proceedings of the 11th International Conference, PARA 2012,
Helsinki, Finland, June 10-13, 2012.
Large-Scale Discrete Fourier Transform on TPUs
In this work, we present two parallel algorithms for the large-scale discrete
Fourier transform (DFT) on Tensor Processing Unit (TPU) clusters. The two
parallel algorithms are associated with two formulations of DFT: one is based
on the Kronecker product, to be specific, dense matrix multiplications between
the input data and the Vandermonde matrix, denoted as KDFT in this work; the
other is based on the famous Cooley-Tukey algorithm and phase adjustment,
denoted as FFT in this work. Both the KDFT and FFT formulations take full
advantage of the TPU's strength in matrix multiplications. The KDFT formulation
allows direct use of nonuniform inputs without an additional step. In the two parallel
algorithms, the same strategy of data decomposition is applied to the input
data. Through the data decomposition, the dense matrix multiplications in KDFT
and FFT are kept local within TPU cores, which can be performed completely in
parallel. The communication among TPU cores is achieved through the one-shuffle
scheme in both parallel algorithms, with which sending and receiving data takes
place simultaneously between two neighboring cores and along the same direction
on the interconnect network. The one-shuffle scheme is designed for the
interconnect topology of TPU clusters, minimizing the time required by the
communication among TPU cores. Both KDFT and FFT are implemented in TensorFlow.
The three-dimensional complex DFT is performed on an example with a full TPU
Pod: the run time of KDFT is 12.66 seconds and that of FFT is 8.3 seconds.
Scaling analysis is provided to demonstrate the high parallel efficiency of the
two DFT implementations on TPUs.
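The KDFT formulation expresses the DFT as one dense multiplication between the Vandermonde (DFT) matrix and the input, which is exactly the operation TPU matrix units are built for. A minimal single-core NumPy sketch of that algebra (the paper's implementation is distributed TensorFlow on TPU clusters; the names here are illustrative):

```python
import numpy as np

def dft_via_matmul(x):
    """DFT as a single dense matrix multiplication with the Vandermonde
    (DFT) matrix: the algebra behind the KDFT formulation."""
    n = x.shape[0]
    k = np.arange(n)
    # Vandermonde matrix F[j, k] = exp(-2*pi*i*j*k/n)
    vandermonde = np.exp(-2j * np.pi * np.outer(k, k) / n)
    return vandermonde @ x

x = np.random.default_rng(0).standard_normal(16) + 0j
assert np.allclose(dft_via_matmul(x), np.fft.fft(x))
```

Because the transform is a plain matrix product, nonuniform sample locations only change the entries of the matrix, which is why the KDFT formulation handles nonuniform inputs without an extra step.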
The inherent overlapping in the parallel calculation of the Laplacian
A new approach for the parallel computation of the Laplacian in the Fourier domain is presented. This numerical problem inherits the intrinsic sequencing involved in the calculation of any multidimensional fast Fourier transform (FFT), where blocking communications ensure that the computation is carried out strictly dimension by dimension. Such data dependency vanishes when one considers the Laplacian as the sum of n independent one-dimensional kernels, so that computation and communication can be naturally overlapped with nonblocking communications. Overlapping is demonstrated to be responsible for the speedup figures we obtain when our approach is compared to state-of-the-art parallel multidimensional FFTs. Funded by the Junta de Castilla y León (grant VA296P18).
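The decomposition described above can be shown serially: the spectral Laplacian of a periodic field is the sum of d independent one-dimensional kernels, one per axis, each requiring only 1-D FFTs along its own dimension. A minimal NumPy illustration of that algebraic split (the paper's contribution is the parallel, overlapped version; this only demonstrates the decomposition):

```python
import numpy as np

def spectral_laplacian(u):
    """Laplacian of a periodic field on [0, 2*pi)^d, computed in the
    Fourier domain as a sum of d independent one-dimensional kernels.
    Each term needs only 1-D FFTs along its own axis, so in a parallel
    setting the terms' communication and computation can be overlapped."""
    result = np.zeros(u.shape)
    for axis in range(u.ndim):
        n = u.shape[axis]
        k = np.fft.fftfreq(n, d=1.0 / n)      # integer wavenumbers
        u_hat = np.fft.fft(u, axis=axis)      # 1-D FFT along this axis only
        shape = [1] * u.ndim
        shape[axis] = n
        u_hat *= -(k ** 2).reshape(shape)     # second derivative in Fourier space
        result += np.fft.ifft(u_hat, axis=axis).real
    return result

# sin(x) + sin(y) has Laplacian -(sin(x) + sin(y))
x = 2 * np.pi * np.arange(32) / 32
u = np.sin(x)[:, None] + np.sin(x)[None, :]
assert np.allclose(spectral_laplacian(u), -u)
```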
Spectral turning bands for efficient Gaussian random fields generation on GPUs and accelerators
A random field (RF) is a set of correlated random variables associated with different spatial locations. RF generation algorithms are of crucial importance for many scientific areas, such as astrophysics, geostatistics, computer graphics, and many others. Current approaches commonly make use of the 3D fast Fourier transform (FFT), which does not scale well for RFs bigger than the available memory; they are also limited to regular rectilinear meshes.
We introduce RAFT, an RF generation algorithm based on the turning band method that is optimized for massively parallel hardware such as GPUs and accelerators. Our algorithm replaces the 3D FFT with a lower-order, one-dimensional FFT followed by a projection step and is further optimized with loop unrolling and blocking. RAFT can easily generate RFs on non-regular (non-uniform) meshes and efficiently produce fields with mesh sizes bigger than the available device memory by using a streaming, out-of-core approach. Our algorithm generates RFs with the correct statistical behavior and is tested on a variety of modern hardware, such as NVIDIA Tesla, AMD FirePro, and Intel Xeon Phi. RAFT is faster than the traditional methods on regular meshes and has been successfully applied to two real case scenarios: planetary nebulae and cosmological simulations.
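The structure of a turning-band generator (many independent 1-D FFTs plus a projection onto random lines, instead of one 3-D FFT) can be sketched in a few lines. This is a minimal NumPy illustration under assumed choices: the line count, the 1-D spectral density, and the nearest-grid-point lookup are all illustrative assumptions, not RAFT's actual GPU kernels:

```python
import numpy as np

def turning_bands_2d(points, n_lines=64, n_grid=4096, rng=None):
    """Sketch of the turning-band idea for arbitrary 2-D points: the
    field is an average of independent 1-D Gaussian processes evaluated
    along random directions, so only one-dimensional FFTs are needed.
    The spectral density and normalization below are assumptions for
    illustration only."""
    rng = rng or np.random.default_rng(0)
    z = np.zeros(len(points))
    for _ in range(n_lines):
        theta = rng.uniform(0, np.pi)
        direction = np.array([np.cos(theta), np.sin(theta)])
        t = points @ direction                     # project points onto the line
        # 1-D Gaussian process on a regular grid via a 1-D FFT
        k = np.fft.rfftfreq(n_grid)
        spectrum = np.exp(-(20 * k) ** 2)          # assumed 1-D spectral density
        coeffs = np.sqrt(spectrum) * (rng.normal(size=k.size)
                                      + 1j * rng.normal(size=k.size))
        line = np.fft.irfft(coeffs, n=n_grid) * np.sqrt(n_grid)
        # nearest-grid-point lookup works for any (non-regular) point set
        idx = ((t - t.min()) / (np.ptp(t) + 1e-12) * (n_grid - 1)).astype(int)
        z += line[np.clip(idx, 0, n_grid - 1)]
    return z / np.sqrt(n_lines)

pts = np.random.default_rng(1).uniform(0, 1, size=(500, 2))
field = turning_bands_2d(pts)
```

Note how the point set never needs to lie on a mesh: the projection step reduces every evaluation to a 1-D lookup, which is what lets the method handle non-regular meshes and stream fields larger than device memory.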
Using GPUs to Compute Large Out-of-card FFTs
The optimization of fast Fourier transform (FFT) problems that fit into GPU memory has been studied extensively. On-card FFT libraries such as CUFFT can generally achieve much better performance than their CPU counterparts, as the data transfer between CPU and GPU is usually not counted in their performance. This high performance, however, is limited by the GPU memory size. When the FFT problem size increases, the data transfer between system and GPU memory can comprise a substantial part of the overall execution time. Therefore, optimizations for FFT problems that outgrow the GPU memory cannot bypass the tuning of data transfer between CPU and GPU; however, no prior study has attacked this problem. This paper is the first effort to use GPUs to efficiently compute large FFTs held in the CPU memory of a single compute node. We study the performance of the PCI bus during the transfer of a batch of FFT subarrays and propose a blocked buffer algorithm to improve the effective bandwidth. More importantly, we propose several FFT decomposition algorithms that increase data locality, further improve PCI bus efficiency, and balance computation between kernels. By integrating these two methods, we demonstrate an out-of-card FFT optimization strategy and develop an FFT library that efficiently computes large 1D, 2D, and 3D FFTs that cannot fit into the GPU's memory. On three of the latest GPUs, our large FFT library achieves much better double-precision performance than two of the most efficient CPU-based libraries, FFTW and Intel MKL. On average, our large FFTs on a single GeForce GTX480 are 46% faster than FFTW and 57% faster than MKL with multiple threads running on a four-core Intel i7 CPU. The speedup on a Tesla C2070 is 1.93× and 2.11× over FFTW and MKL, respectively. A peak performance of 21 GFLOPS is achieved for a 2D FFT of size 2048 × 65536 on the C2070 in double precision.
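One standard way to decompose a large 1-D FFT into batches of small, GPU-sized FFTs is the four-step scheme: two passes of batched short transforms separated by a twiddle-factor multiplication and a transpose. A minimal NumPy sketch of this decomposition (the paper's library adds PCI-transfer blocking and tuning on top of ideas like this; this is an illustration, not its actual code):

```python
import numpy as np

def four_step_fft(x, n1, n2):
    """1-D FFT of length n1*n2 decomposed into two passes of batched
    small FFTs plus a twiddle multiplication and a transpose. In an
    out-of-card setting each batch can be shipped to the GPU on its own."""
    a = x.reshape(n1, n2)
    b = np.fft.fft(a, axis=0)                        # n2 FFTs of size n1
    k1 = np.arange(n1)[:, None]
    m2 = np.arange(n2)[None, :]
    b = b * np.exp(-2j * np.pi * k1 * m2 / (n1 * n2))  # twiddle factors
    c = np.fft.fft(b, axis=1)                        # n1 FFTs of size n2
    return c.T.reshape(-1)                           # X[k1 + n1*k2] = c[k1, k2]

x = np.random.default_rng(0).standard_normal(1024) + 0j
assert np.allclose(four_step_fft(x, 32, 32), np.fft.fft(x))
```

Each pass touches the data in contiguous batches, which is the locality property the paper's decomposition algorithms exploit to keep the PCI bus busy with large, well-shaped transfers.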
An OpenCL Framework for Heterogeneous Clusters
Thesis (Ph.D.), Seoul National University Graduate School, Department of Electrical and Computer Engineering, August 2013. Advisor: Jaejin Lee.

OpenCL is a unified programming model for different types of computational units in a single heterogeneous computing system. OpenCL provides a common hardware abstraction layer across different computational units. Programmers can write OpenCL applications once and run them on any OpenCL-compliant hardware. However, current OpenCL is restricted to a programming model on a single operating system image: it does not work for a cluster of multiple nodes unless the programmer explicitly uses communication libraries such as MPI. A heterogeneous cluster contains multiple general-purpose multicore CPUs and multiple accelerators to solve bigger problems within an acceptable time frame. As such clusters widen their user base, application developers are being forced to turn to an unattractive mix of programming models, such as MPI-OpenCL. This makes the application more complex, hard to maintain, and less portable.
In this thesis, we propose SnuCL, an OpenCL framework for heterogeneous clusters. We show that the original OpenCL semantics naturally fits the heterogeneous cluster programming environment, and that the framework achieves both high performance and ease of programming. SnuCL provides the user with a system image running a single operating system instance for the heterogeneous cluster. It allows the application to utilize compute devices in a compute node as if they were in the host node. No communication API, such as the MPI library, is required in the application source. With SnuCL, an OpenCL application becomes portable not only between heterogeneous devices in a single node, but also between compute devices in the cluster environment. We implement SnuCL and evaluate its performance using eleven OpenCL benchmark applications.

Abstract
I. Introduction
I.1 Heterogeneous Computing
I.2 Motivation
I.3 Related Work
I.4 Contributions
I.5 Organization of this Thesis
II. The OpenCL Architecture
II.1 Platform Model
II.2 Execution Model
II.3 Memory Model
II.4 OpenCL Applications
III. The SnuCL Framework
III.1 The SnuCL Runtime
III.1.1 Mapping Components
III.1.2 Organization of the SnuCL Runtime
III.1.3 Processing Kernel-execution Commands
III.1.4 Processing Synchronization Commands
III.2 Memory Management
III.2.1 The OpenCL Memory Model
III.2.2 Space Allocation to Buffers
III.2.3 Minimizing Memory Copying Overhead
III.2.4 Processing Memory Commands
III.2.5 Consistency Management
III.2.6 Ease of Programming
III.3 Extensions to OpenCL
III.4 Code Transformations
III.4.1 Detecting Buffers Written by a Kernel
III.4.2 Emulating PEs for CPU Devices
III.4.3 Distributing the Kernel Code
IV. Distributed Execution Model for SnuCL
IV.1 Two Problems in SnuCL
IV.2 Remote Device Virtualization
IV.2.1 Exclusive Execution on the Host
IV.3 OpenCL Framework Integration
IV.3.1 OpenCL Installable Client Driver (ICD)
IV.3.2 Event Synchronization
IV.3.3 Memory Sharing
V. Experimental Results
V.1 SnuCL Evaluation
V.1.1 Methodology
V.1.2 Results
V.2 SnuCL-D Evaluation
V.2.1 Methodology
V.2.2 Results
VI. Conclusions and Future Directions
VI.1 Conclusions
VI.2 Future Directions
Bibliography
Korean Abstract
Review: Deep learning in electron microscopy
Deep learning is transforming most areas of science and technology, including electron microscopy. This review offers a practical perspective aimed at developers with limited familiarity with the field. For context, we review popular applications of deep learning in electron microscopy. Next, we discuss the hardware and software needed to get started with deep learning and to interface with electron microscopes. We then review neural network components, popular architectures, and their optimization. Finally, we discuss future directions of deep learning in electron microscopy.