Computational Physics on Graphics Processing Units
The use of graphics processing units for scientific computations is an
emerging strategy that can significantly speed up many algorithms.
In this review, we discuss advances made in the field of computational physics,
focusing on classical molecular dynamics and on quantum simulations for
electronic structure calculations using density functional theory, wave
function techniques, and quantum field theory.
Comment: Proceedings of the 11th International Conference, PARA 2012,
Helsinki, Finland, June 10-13, 2012.
Large-Scale Discrete Fourier Transform on TPUs
In this work, we present two parallel algorithms for the large-scale discrete
Fourier transform (DFT) on Tensor Processing Unit (TPU) clusters. The two
parallel algorithms are associated with two formulations of DFT: one is based
on the Kronecker product, to be specific, dense matrix multiplications between
the input data and the Vandermonde matrix, denoted as KDFT in this work; the
other is based on the famous Cooley-Tukey algorithm and phase adjustment,
denoted as FFT in this work. Both the KDFT and FFT formulations take full
advantage of the TPU's strength in matrix multiplications. The KDFT formulation
allows direct use of nonuniform inputs without an additional step. In the two parallel
algorithms, the same strategy of data decomposition is applied to the input
data. Through the data decomposition, the dense matrix multiplications in KDFT
and FFT are kept local within TPU cores, which can be performed completely in
parallel. The communication among TPU cores is achieved through the one-shuffle
scheme in both parallel algorithms, with which sending and receiving data takes
place simultaneously between two neighboring cores and along the same direction
on the interconnect network. The one-shuffle scheme is designed for the
interconnect topology of TPU clusters, minimizing the time required by the
communication among TPU cores. Both KDFT and FFT are implemented in TensorFlow.
The three-dimensional complex DFT is performed on an example with a full TPU
Pod: the run time of KDFT is 12.66 seconds and that of FFT is 8.3 seconds.
Scaling analysis is provided to demonstrate the high parallel efficiency of the
two DFT implementations on TPUs.
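The KDFT formulation expresses the DFT as one dense multiplication between the Vandermonde (DFT) matrix and the input, which is exactly the operation TPU matrix units are built for. A minimal single-core NumPy sketch of that algebra (the paper's implementation is distributed TensorFlow on TPU clusters; the names here are illustrative):

```python
import numpy as np

def dft_via_matmul(x):
    """DFT as a single dense matrix multiplication with the Vandermonde
    (DFT) matrix: the algebra behind the KDFT formulation."""
    n = x.shape[0]
    k = np.arange(n)
    # Vandermonde matrix F[j, k] = exp(-2*pi*i*j*k/n)
    vandermonde = np.exp(-2j * np.pi * np.outer(k, k) / n)
    return vandermonde @ x

x = np.random.default_rng(0).standard_normal(16) + 0j
assert np.allclose(dft_via_matmul(x), np.fft.fft(x))
```

Because the transform is a plain matrix product, nonuniform sample locations only change the entries of the matrix, which is why the KDFT formulation handles nonuniform inputs without an extra step.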
The inherent overlapping in the parallel calculation of the Laplacian
A new approach for the parallel computation of the Laplacian in the Fourier domain is presented. This numerical problem inherits the intrinsic sequencing involved in the calculation of any multidimensional fast Fourier transform (FFT), where blocking communications ensure that the computation is carried out strictly dimension by dimension. Such data dependency vanishes when one considers the Laplacian as the sum of n independent one-dimensional kernels, so that computation and communication can be naturally overlapped with nonblocking communications. Overlapping is demonstrated to be responsible for the speedup figures we obtain when our approach is compared to state-of-the-art parallel multidimensional FFTs. Funded by the Junta de Castilla y León (grant VA296P18).
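The decomposition described above can be shown serially: the spectral Laplacian of a periodic field is the sum of d independent one-dimensional kernels, one per axis, each requiring only 1-D FFTs along its own dimension. A minimal NumPy illustration of that algebraic split (the paper's contribution is the parallel, overlapped version; this only demonstrates the decomposition):

```python
import numpy as np

def spectral_laplacian(u):
    """Laplacian of a periodic field on [0, 2*pi)^d, computed in the
    Fourier domain as a sum of d independent one-dimensional kernels.
    Each term needs only 1-D FFTs along its own axis, so in a parallel
    setting the terms' communication and computation can be overlapped."""
    result = np.zeros(u.shape)
    for axis in range(u.ndim):
        n = u.shape[axis]
        k = np.fft.fftfreq(n, d=1.0 / n)      # integer wavenumbers
        u_hat = np.fft.fft(u, axis=axis)      # 1-D FFT along this axis only
        shape = [1] * u.ndim
        shape[axis] = n
        u_hat *= -(k ** 2).reshape(shape)     # second derivative in Fourier space
        result += np.fft.ifft(u_hat, axis=axis).real
    return result

# sin(x) + sin(y) has Laplacian -(sin(x) + sin(y))
x = 2 * np.pi * np.arange(32) / 32
u = np.sin(x)[:, None] + np.sin(x)[None, :]
assert np.allclose(spectral_laplacian(u), -u)
```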
Spectral turning bands for efficient Gaussian random fields generation on GPUs and accelerators
A random field (RF) is a set of correlated random variables associated with different spatial locations. RF generation algorithms are of crucial importance for many scientific areas, such as astrophysics, geostatistics, computer graphics, and many others. Current approaches commonly make use of the 3D fast Fourier transform (FFT), which does not scale well for RFs bigger than the available memory; they are also limited to regular rectilinear meshes.
We introduce RAFT, an RF generation algorithm based on the turning band method that is optimized for massively parallel hardware such as GPUs and accelerators. Our algorithm replaces the 3D FFT with a lower-order, one-dimensional FFT followed by a projection step and is further optimized with loop unrolling and blocking. RAFT can easily generate RFs on non-regular (non-uniform) meshes and efficiently produce fields with mesh sizes bigger than the available device memory by using a streaming, out-of-core approach. Our algorithm generates RFs with the correct statistical behavior and is tested on a variety of modern hardware, such as NVIDIA Tesla, AMD FirePro, and Intel Xeon Phi. RAFT is faster than the traditional methods on regular meshes and has been successfully applied to two real case scenarios: planetary nebulae and cosmological simulations.
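The structure of a turning-band generator (many independent 1-D FFTs plus a projection onto random lines, instead of one 3-D FFT) can be sketched in a few lines. This is a minimal NumPy illustration under assumed choices: the line count, the 1-D spectral density, and the nearest-grid-point lookup are all illustrative assumptions, not RAFT's actual GPU kernels:

```python
import numpy as np

def turning_bands_2d(points, n_lines=64, n_grid=4096, rng=None):
    """Sketch of the turning-band idea for arbitrary 2-D points: the
    field is an average of independent 1-D Gaussian processes evaluated
    along random directions, so only one-dimensional FFTs are needed.
    The spectral density and normalization below are assumptions for
    illustration only."""
    rng = rng or np.random.default_rng(0)
    z = np.zeros(len(points))
    for _ in range(n_lines):
        theta = rng.uniform(0, np.pi)
        direction = np.array([np.cos(theta), np.sin(theta)])
        t = points @ direction                     # project points onto the line
        # 1-D Gaussian process on a regular grid via a 1-D FFT
        k = np.fft.rfftfreq(n_grid)
        spectrum = np.exp(-(20 * k) ** 2)          # assumed 1-D spectral density
        coeffs = np.sqrt(spectrum) * (rng.normal(size=k.size)
                                      + 1j * rng.normal(size=k.size))
        line = np.fft.irfft(coeffs, n=n_grid) * np.sqrt(n_grid)
        # nearest-grid-point lookup works for any (non-regular) point set
        idx = ((t - t.min()) / (np.ptp(t) + 1e-12) * (n_grid - 1)).astype(int)
        z += line[np.clip(idx, 0, n_grid - 1)]
    return z / np.sqrt(n_lines)

pts = np.random.default_rng(1).uniform(0, 1, size=(500, 2))
field = turning_bands_2d(pts)
```

Note how the point set never needs to lie on a mesh: the projection step reduces every evaluation to a 1-D lookup, which is what lets the method handle non-regular meshes and stream fields larger than device memory.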
Using GPUs to Compute Large Out-of-card FFTs
The optimization of fast Fourier transform (FFT) problems that fit into GPU memory has been studied extensively. On-card FFT libraries such as CUFFT can generally achieve much better performance than their CPU counterparts, as the data transfer between CPU and GPU is usually not counted in their performance. This high performance, however, is limited by the GPU memory size. When the FFT problem size increases, the data transfer between system and GPU memory can comprise a substantial part of the overall execution time. Therefore, optimizations for FFT problems that outgrow the GPU memory cannot bypass the tuning of data transfer between CPU and GPU; however, no prior study has attacked this problem. This paper is the first effort to use GPUs to efficiently compute large FFTs held in the CPU memory of a single compute node. We study the performance of the PCI bus during the transfer of a batch of FFT subarrays and propose a blocked buffer algorithm to improve the effective bandwidth. More importantly, we propose several FFT decomposition algorithms that increase data locality, further improve PCI bus efficiency, and balance computation between kernels. By integrating these two methods, we demonstrate an out-of-card FFT optimization strategy and develop an FFT library that efficiently computes large 1D, 2D, and 3D FFTs that cannot fit into the GPU's memory. On three of the latest GPUs, our large FFT library achieves much better double-precision performance than two of the most efficient CPU-based libraries, FFTW and Intel MKL. On average, our large FFTs on a single GeForce GTX480 are 46% faster than FFTW and 57% faster than MKL with multiple threads running on a four-core Intel i7 CPU. The speedup on a Tesla C2070 is 1.93× and 2.11× over FFTW and MKL, respectively. A peak performance of 21 GFLOPS is achieved for a 2D FFT of size 2048 × 65536 on the C2070 in double precision.
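One standard way to decompose a large 1-D FFT into batches of small, GPU-sized FFTs is the four-step scheme: two passes of batched short transforms separated by a twiddle-factor multiplication and a transpose. A minimal NumPy sketch of this decomposition (the paper's library adds PCI-transfer blocking and tuning on top of ideas like this; this is an illustration, not its actual code):

```python
import numpy as np

def four_step_fft(x, n1, n2):
    """1-D FFT of length n1*n2 decomposed into two passes of batched
    small FFTs plus a twiddle multiplication and a transpose. In an
    out-of-card setting each batch can be shipped to the GPU on its own."""
    a = x.reshape(n1, n2)
    b = np.fft.fft(a, axis=0)                        # n2 FFTs of size n1
    k1 = np.arange(n1)[:, None]
    m2 = np.arange(n2)[None, :]
    b = b * np.exp(-2j * np.pi * k1 * m2 / (n1 * n2))  # twiddle factors
    c = np.fft.fft(b, axis=1)                        # n1 FFTs of size n2
    return c.T.reshape(-1)                           # X[k1 + n1*k2] = c[k1, k2]

x = np.random.default_rng(0).standard_normal(1024) + 0j
assert np.allclose(four_step_fft(x, 32, 32), np.fft.fft(x))
```

Each pass touches the data in contiguous batches, which is the locality property the paper's decomposition algorithms exploit to keep the PCI bus busy with large, well-shaped transfers.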
An OpenCL Framework for Heterogeneous Clusters
Thesis (Ph.D.), Seoul National University Graduate School, Department of Electrical and Computer Engineering, August 2013. Advisor: Jaejin Lee.

OpenCL is a unified programming model for different types of computational units in a single heterogeneous computing system. OpenCL provides a common hardware abstraction layer across different computational units. Programmers can write OpenCL applications once and run them on any OpenCL-compliant hardware. However, current OpenCL is restricted to a programming model on a single operating system image: it does not work for a cluster of multiple nodes unless the programmer explicitly uses communication libraries such as MPI. A heterogeneous cluster contains multiple general-purpose multicore CPUs and multiple accelerators to solve bigger problems within an acceptable time frame. As such clusters widen their user base, application developers are being forced to turn to an unattractive mix of programming models, such as MPI-OpenCL. This makes the application more complex, hard to maintain, and less portable.
In this thesis, we propose SnuCL, an OpenCL framework for heterogeneous clusters. We show that the original OpenCL semantics naturally fits the heterogeneous cluster programming environment, and that the framework achieves both high performance and ease of programming. SnuCL provides the user with a system image running a single operating system instance for the heterogeneous cluster. It allows the application to utilize compute devices in a compute node as if they were in the host node. No communication API, such as the MPI library, is required in the application source. With SnuCL, an OpenCL application becomes portable not only between heterogeneous devices in a single node, but also between compute devices in the cluster environment. We implement SnuCL and evaluate its performance using eleven OpenCL benchmark applications.

Abstract
I. Introduction
I.1 Heterogeneous Computing
I.2 Motivation
I.3 Related Work
I.4 Contributions
I.5 Organization of this Thesis
II. The OpenCL Architecture
II.1 Platform Model
II.2 Execution Model
II.3 Memory Model
II.4 OpenCL Applications
III. The SnuCL Framework
III.1 The SnuCL Runtime
III.1.1 Mapping Components
III.1.2 Organization of the SnuCL Runtime
III.1.3 Processing Kernel-execution Commands
III.1.4 Processing Synchronization Commands
III.2 Memory Management
III.2.1 The OpenCL Memory Model
III.2.2 Space Allocation to Buffers
III.2.3 Minimizing Memory Copying Overhead
III.2.4 Processing Memory Commands
III.2.5 Consistency Management
III.2.6 Ease of Programming
III.3 Extensions to OpenCL
III.4 Code Transformations
III.4.1 Detecting Buffers Written by a Kernel
III.4.2 Emulating PEs for CPU Devices
III.4.3 Distributing the Kernel Code
IV. Distributed Execution Model for SnuCL
IV.1 Two Problems in SnuCL
IV.2 Remote Device Virtualization
IV.2.1 Exclusive Execution on the Host
IV.3 OpenCL Framework Integration
IV.3.1 OpenCL Installable Client Driver (ICD)
IV.3.2 Event Synchronization
IV.3.3 Memory Sharing
V. Experimental Results
V.1 SnuCL Evaluation
V.1.1 Methodology
V.1.2 Results
V.2 SnuCL-D Evaluation
V.2.1 Methodology
V.2.2 Results
VI. Conclusions and Future Directions
VI.1 Conclusions
VI.2 Future Directions
Bibliography
Korean Abstract
Review: Deep learning in electron microscopy
Deep learning is transforming most areas of science and technology, including electron microscopy. This review offers a practical perspective aimed at developers with limited familiarity with the field. For context, we review popular applications of deep learning in electron microscopy. Next, we discuss the hardware and software needed to get started with deep learning and to interface with electron microscopes. We then review neural network components, popular architectures, and their optimization. Finally, we discuss future directions of deep learning in electron microscopy.