
358 research outputs found

GPU-based ultra fast dose calculation using a finite pencil beam model

Author: Amitava Majumdar
Chunhua Men
de la Zerda A
Dongju Choi
Dubash M
Fu W H
Hubert Pan
Jacques R
Taylor R
Wong J
McNutt T
Jelen U
Jiang S B
Lin H
Lu W G
Meihua L
NVIDIA
Riabkov D
Sharp G C
Steve B Jiang
Wu C
Wu Q J
Xu F
Xuejun Gu
Yan D
Publication venue: 'IOP Publishing'
Publication date: 30/08/2009

Online adaptive radiation therapy (ART) is an attractive concept that promises the ability to deliver an optimal treatment in response to the inter-fraction variability in patient anatomy. However, it has yet to be realized due to technical limitations. Fast calculation of dose deposition coefficients is a critical component of the online planning process and is required for plan optimization in intensity-modulated radiation therapy (IMRT). Graphics processing units (GPUs) are well suited to provide the requisite performance because dose calculation is data-parallel. In this work, we develop a dose calculation engine based on a finite-size pencil beam (FSPB) algorithm and a GPU parallel computing framework. The developed framework can accommodate any FSPB model. We test our implementation on a water phantom case and a prostate cancer patient case with varying beamlet and voxel sizes. All testing scenarios achieved speedups of 200 to 400 times when using an NVIDIA Tesla C1060 card in comparison with a 2.27 GHz Intel Xeon CPU. The computational time for calculating dose deposition coefficients for a 9-field prostate IMRT plan with this new framework is less than 1 second. This indicates that the GPU-based FSPB algorithm is well suited for online re-planning for adaptive radiotherapy. Comment: submitted to Physics in Medicine and Biology

arXiv.org e-Print Archive

Crossref
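
As a rough illustration of why dose deposition coefficients matter for plan optimization: once a coefficient matrix is available, the dose in each voxel is a weighted sum over beamlet intensities, and each voxel can be evaluated independently of the others. The sketch below is not the FSPB model from the paper. The function name `dose_from_coefficients` and the toy numbers are assumptions made purely for illustration.

```python
def dose_from_coefficients(coefficients, beamlet_weights):
    """Return per-voxel dose d[i] = sum_j D[i][j] * x[j].

    coefficients    -- list of rows, one row per voxel (hypothetical values)
    beamlet_weights -- list of beamlet intensities, one per column
    """
    doses = []
    for row in coefficients:
        if len(row) != len(beamlet_weights):
            raise ValueError("row length must match number of beamlets")
        # Each voxel's dose depends only on its own row, so this loop
        # is the part that maps naturally onto one GPU thread per voxel.
        doses.append(sum(d_ij * x_j for d_ij, x_j in zip(row, beamlet_weights)))
    return doses


# Toy example with made-up numbers: 3 voxels, 2 beamlets.
D = [[0.5, 0.0],
     [0.25, 0.25],
     [0.0, 0.5]]
x = [2.0, 4.0]
toy_doses = dose_from_coefficients(D, x)
```

An optimizer would call a routine like this repeatedly while adjusting the beamlet weights, which is why the cost of producing the coefficient matrix dominates online re-planning.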

Computational Physics on Graphics Processing Units

Author: A. Asadchev
A. Castro
A. Harju
A. McAdams
A.G. Anderson
A.P. Lyubartsev
A.W. Götz
B.L. Tembre
C. Bonati
C. McNeile
C.M. Isborn
D.J. Hardy
E. Darve
G. Bhanot
G. Egri
G. Kresse
H.J. Rothe
I. Montvay
I. Samish
I. Ufimtsev
I.S. Ufimtsev
J. Enkovaara
J. Gao
J. Hubbard
J.A. Anderson
J.A. McCammon
J.E. Stone
J.S. Meredith
K. Esler
K. Moreland
K. Yasuda
L. Genovese
L. Greengard
L. Gu
L. Ha
M. Bordag
M. Göckeler
M. Hasenbusch
M. Hutchinson
M. Macedonia
M.C. Gutzwiller
M.C. Payne
M.P. Allen
N. Cardoso
N. Goodnight
N. Luehr
N.A. Gumerov
P. Giannozzi
P. Kipfer
P. Petreczky
R. Parr
R.D. Mawhinney
R.D. Skeel
R.G. Belleman
S. Hakala
S. Ihnatsenka
S. Maintz
T. Shirakawa
T. Siro
T. Takahashi
T.W. Chiu
V. Rokhlin
V. Springel
W. Jia
W. Kohn
W.M.C. Foulkes
X. Andrade
Y. Aoki
Y. Chen
Z. Fodor
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

The use of graphics processing units for scientific computations is an emerging strategy that can significantly speed up a wide range of algorithms. In this review, we discuss advances made in the field of computational physics, focusing on classical molecular dynamics and on quantum simulations for electronic structure calculations using density functional theory, wave function techniques, and quantum field theory. Comment: Proceedings of the 11th International Conference, PARA 2012, Helsinki, Finland, June 10-13, 2012

arXiv.org e-Print Archive

Crossref

High performance bioinformatics and computational biology on general-purpose graphics processing units

Author: Ling Cheng
Publication venue: The University of Edinburgh
Publication date: 25/06/2012

Bioinformatics and Computational Biology (BCB) is a relatively new multidisciplinary field which brings together many aspects of the fields of biology, computer science, statistics, and engineering. Bioinformatics extracts useful information from biological data and makes these more intuitive and understandable by applying principles of information sciences, while computational biology harnesses computational approaches and technologies to answer biological questions conveniently. Recent years have seen an explosion of the size of biological data at a rate which outpaces the rate of increases in the computational power of mainstream computer technologies, namely general purpose processors (GPPs). The aim of this thesis is to explore the use of off-the-shelf Graphics Processing Unit (GPU) technology in the high performance and efficient implementation of BCB applications in order to meet the demands of biological data increases at affordable cost. The thesis presents detailed design and implementations of GPU solutions for a number of BCB algorithms in two widely used BCB applications, namely biological sequence alignment and phylogenetic analysis. Biological sequence alignment can be used to determine the potential information about a newly discovered biological sequence from other well-known sequences through similarity comparison. On the other hand, phylogenetic analysis is concerned with the investigation of the evolution and relationships among organisms, and has many uses in the fields of system biology and comparative genomics. In molecular-based phylogenetic analysis, the relationship between species is estimated by inferring the common history of their genes and then phylogenetic trees are constructed to illustrate evolutionary relationships among genes and organisms. 
However, both biological sequence alignment and phylogenetic analysis are computationally expensive, as their computing and memory requirements grow polynomially or worse with the size of sequence databases. The thesis first presents a multi-threaded parallel design of the Smith-Waterman (SW) algorithm alongside an implementation on NVIDIA GPUs. A novel technique is put forward to remove the restriction on query sequence length found in previous GPU-based implementations of the SW algorithm. Based on this implementation, the difference between the two main task parallelization approaches (inter-task and intra-task parallelization) is presented. The resulting GPU implementation matches the speed of existing GPU implementations while providing more flexibility, i.e. support for the variable sequence lengths found in real-world applications. It also outperforms an equivalent GPP-based implementation by 15x-20x. After this, the thesis presents the first reported multi-threaded design and GPU implementation of the Gapped BLAST with Two-Hit method algorithm, which is widely used for aligning biological sequences heuristically. This achieved up to 3x speedup compared to the most optimised GPP implementations. The thesis then presents a multi-threaded design and GPU implementation of a Neighbor-Joining (NJ)-based method for phylogenetic tree construction and multiple sequence alignment (MSA). This achieves 8x-20x speedup compared to an equivalent GPP implementation based on the widely used ClustalW software. The NJ method, however, gives only one possible tree, which strongly depends on the evolutionary model used. A more advanced method uses maximum likelihood (ML) for scoring phylogenies with Markov Chain Monte Carlo (MCMC)-based Bayesian inference. 
The latter was the subject of another multi-threaded design and GPU implementation presented in this thesis, which achieved 4x-8x speedup compared to an equivalent GPP implementation based on the widely used MrBayes software. Finally, the thesis presents a general evaluation of the designs and implementations achieved in this work as a step towards the evaluation of GPU technology in BCB computing, in the context of other computer technologies including GPPs and Field Programmable Gate Arrays (FPGAs).

Edinburgh Research Archive
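
For readers unfamiliar with the Smith-Waterman algorithm named in this abstract, a minimal CPU sketch of its scoring recurrence follows. It is not the thesis implementation: the function name `smith_waterman_score`, the linear gap penalty, and the default scores are assumptions chosen to keep the example short.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score (linear gap penalty assumed)."""
    cols = len(target) + 1
    prev = [0] * cols          # previous row of the scoring matrix
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0,
                          prev[j - 1] + sub,   # align the two characters
                          prev[j] + gap,       # gap in target
                          curr[j - 1] + gap)   # gap in query
            if curr[j] > best:
                best = curr[j]
        prev = curr
    return best
```

Each cell depends only on its left, upper, and upper-left neighbours, so cells on the same anti-diagonal are independent of one another. Intra-task parallelization typically exploits that independence within a single alignment, whereas inter-task parallelization assigns whole query-target pairs to separate threads.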

Irregular alignment of arbitrarily long DNA sequences on GPU

Author: Guil-Mata Nicolás
Perez-Wohlfeil Esteban
Trelles-Salazar Oswaldo Rogelio
Publication venue: Springer Nature
Publication date: 22/12/2022

The use of Graphics Processing Units to accelerate computational applications is increasingly being adopted due to their affordability, flexibility and performance. However, achieving top performance comes at the price of restricted data-parallelism models. In the case of sequence alignment, most GPU-based approaches focus on accelerating the Smith-Waterman dynamic programming algorithm due to its regularity. Nevertheless, because of its quadratic complexity, it becomes impractical when comparing long sequences, and therefore heuristic methods are required to reduce the search space. We present GPUGECKO, a CUDA implementation of the sequential, seed-and-extend sequence-comparison algorithm GECKO. Our proposal includes optimized kernels based on collective operations capable of producing arbitrarily long alignments while dealing with heterogeneous and unpredictable load. Contrary to other state-of-the-art methods, GPUGECKO employs a batching mechanism that prevents memory exhaustion because it does not require all alignments to fit into device memory at once. This makes it possible to run massive comparisons exhaustively with improved sensitivity while also providing up to 6x average speedup with respect to the CUDA acceleration of BLASTN. Funding for open access publishing: Universidad Málaga/CBUA. This work has been partially supported by the European project ELIXIR-EXCELERATE (grant no. 676559), the Spanish national project Plataforma de Recursos Biomoleculares y Bioinformáticos (ISCIII-PT13.0001.0012 and ISCIII-PT17.0009.0022), the Fondo Europeo de Desarrollo Regional (UMA18-FEDERJA-156, UMA20-FEDERJA-059), the Junta de Andalucía (P18-FR-3130), the Instituto de Investigación Biomédica de Málaga IBIMA and the University of Málaga.

Repositorio Institucional Universidad de Málaga

Reconfigurable computing for large-scale graph traversal algorithms

Author: Betkaoui Brahim
Publication venue: Computing, Imperial College London
Publication date: 01/09/2014

This thesis proposes a reconfigurable computing approach for supporting parallel processing in large-scale graph traversal algorithms. Our approach is based on a reconfigurable hardware architecture which exploits the capabilities of both FPGAs (Field-Programmable Gate Arrays) and a multi-bank parallel memory subsystem. The proposed methodology to accelerate graph traversal algorithms has been applied to three case studies, revealing that application-specific hardware customisations can benefit performance. A summary of our four contributions is as follows. First, a reconfigurable computing approach to accelerate large-scale graph traversal algorithms. We propose a reconfigurable hardware architecture which decouples computation and communication while keeping multiple memory requests in flight at any given time, taking advantage of the high bandwidth of multi-bank memory subsystems. Second, a demonstration of the effectiveness of our approach through two case studies: the breadth-first search algorithm, and a graphlet counting algorithm from bioinformatics. Both case studies involve graph traversal, but each of them adopts a different graph data representation. Third, a method for using on-chip memory resources in FPGAs to reduce off-chip memory accesses for accelerating graph traversal algorithms, through a case study of the All-Pairs Shortest-Paths algorithm. This case study has been applied to process human brain network data. Fourth, an evaluation of an approach based on instruction-set extension for FPGA design against many-core GPUs (Graphics Processing Units), based on a set of benchmarks with different memory access characteristics. It is shown that while GPUs excel at streaming applications, the proposed approach can outperform GPUs in applications with poor locality characteristics, such as graph traversal problems. Open Access

Spiral - Imperial College Digital Repository
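
The breadth-first search case study mentioned in this abstract can be pictured with a level-synchronous traversal, sketched below in plain Python. This is a generic textbook formulation, not the FPGA design from the thesis. The function name `bfs_levels` and the adjacency-dictionary representation are assumptions for illustration.

```python
def bfs_levels(adjacency, source):
    """Return {vertex: distance from source} using frontier-by-frontier BFS."""
    levels = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # All vertices in the current frontier can be expanded concurrently;
        # the neighbour lookups are the irregular memory accesses that make
        # graph traversal hard on cache-oriented hardware.
        for vertex in frontier:
            for neighbor in adjacency.get(vertex, ()):
                if neighbor not in levels:
                    levels[neighbor] = depth
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return levels


# Small hypothetical graph; vertex 5 is unreachable from vertex 0.
example_graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: [], 5: [0]}
example_levels = bfs_levels(example_graph, 0)
```

Because the neighbours of each frontier vertex can sit anywhere in memory, keeping many requests in flight at once, as the abstract describes, is one way to hide the resulting latency.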


Faster Than Real-Time GPGPU Radiation Pressure Modeling Methods

Author: Kenneally Patrick William
Publication venue: University of Colorado Boulder
Publication date: 23/06/2019

Solar radiation pressure (SRP) is a significant contributing dynamic force on spacecraft in all orbit regimes. Predicting, accommodating, and either leveraging or canceling its effect, is paramount to effective orbit determination, maneuver and mission design. As a result spacecraft numerical simulation requires computational models which provide the facility to model SRP with sufficient accuracy. However, typically the computationally intense nature of performing high-fidelity SRP evaluations has limited such evaluations to being an offline computation which generates lookup data. Precomputation limits the ability for a spacecraft dynamic simulation to accommodate the myriad time varying changes which occur to the spacecraft state during a mission. In the past decade the computer graphics industry has driven the development of highly parallel graphics processing units (GPU) capable of performing many thousands of floating point operations per second. General purpose GPU programming (GPGPU) has been leveraged particularly in Engineering and the Sciences where the high computational power of parallel GPU hardware presents the opportunity for significant increases in the size and dimension of computational problems now manageable on personal computers. This dissertation presents two modeling approaches which take advantage of the GPGPU aspect of commodity GPU hardware. The first contribution is a modeling approach which utilizes the vector graphics application programming interface (API) Open Graphics Library (OpenGL) and the GPGPU computing API Open Computing Language to develop a high geometric fidelity SRP modeling approach. The OpenGL-CL modeling approach computes SRP induced force and torque across a detailed spacecraft mesh model. The method utilizes the OpenGL-OpenCL shared context to facilitate modeling data between the two APIs. 
The OpenGL render pipeline is manipulated to render the sun-frame projected surface of the spacecraft into OpenGL Texture data objects. A custom OpenCL parallel reduction kernel is developed which subsequently computes the SRP force and torque across the spacecraft rendered into the OpenGL Textures. The method presents faster than real time computation speeds while accommodating spacecraft meshes with many thousands of vertices, arbitrary articulated components and detailed spacecraft material optical parameters. The second contribution is a GPU-based parallel ray tracing modeling approach which exhibits faster than real time evaluation speeds. Techniques and algorithms from the computer graphics discipline are used to develop and implement a method which computes SRP force and torque across a detailed spacecraft triangulated mesh model. Efficient data structures such as bounding volume hierarchy (BVH) acceleration minimize the computational burden by reducing the ray-surface intersection search space. Accurate ray reflections are computed for complex materials by applying a Quasi-Monte Carlo integration method and importance sampling. Complex material bidirectional reflectance distribution functions (BRDF) are implemented both as ideal mirror-like specular and Lambertian diffuse models and as microfacet BRDF models. Arbitrary spacecraft articulation is accommodated at run time with no appreciable reduction in computational speed. Both SRP models utilize the latent computing power of the GPU which exists in the large majority of consumer grade personal computing systems. Further access to latent computing power is enabled by the development of a software simulation communication middleware called Black Lion (BL). The third contribution of this thesis is the description of a novel software architecture and the design principles applied to the development of the BL software. 
Black Lion enables the integration of multiple local or distributed heterogeneous applications never intended to run in a cooperative setting. It is shown that BL enables access to more powerful latent personal computing resources by creating a means to transparently facilitate distributed simulation across multiple simulation nodes and computers. Finally, this dissertation demonstrates the utility of both modeling methods by their applications in two case studies. Firstly, the high-fidelity SRP effects are computed for an ongoing asteroid sample return mission. Agreement between the OpenGL-CL methods is demonstrated. Both SRP modeling approaches make significant use of pre- and post-launch engineering data. The utility of direct access to a model's physical parameters is demonstrated in an analysis of contributors to possible error between modeled and estimated SRP accelerations. Secondly, the capability of fast computational speed paired with high geometric resolution, of both the OpenGL-CL and ray tracing methods, is demonstrated. Each method is employed in the simulation and long-term propagation of realistic multi-layer insulation (MLI) debris object mesh models, and the effect of departing from the typical flat-plate MLI model is investigated.

CU Scholar Institutional Repository
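
The parallel reduction step described in this abstract, which combines many per-pixel contributions into one total, can be illustrated with a pairwise (tree) sum. The sketch below runs sequentially in Python and only mimics the reduction pattern. It is not the dissertation's OpenCL kernel, and the function name `tree_reduce_vectors` and the sample force vectors are hypothetical.

```python
def tree_reduce_vectors(vectors):
    """Sum 3-component vectors pairwise, halving the list at each pass."""
    if not vectors:
        return (0.0, 0.0, 0.0)
    level = [tuple(v) for v in vectors]
    while len(level) > 1:
        next_level = []
        # Every pair in a pass is independent, so on a GPU each pair
        # would be handled by its own work-item.
        for k in range(0, len(level) - 1, 2):
            a, b = level[k], level[k + 1]
            next_level.append((a[0] + b[0], a[1] + b[1], a[2] + b[2]))
        if len(level) % 2 == 1:
            next_level.append(level[-1])   # carry the unpaired element forward
        level = next_level
    return level[0]


# Made-up per-pixel force contributions, for illustration only.
pixel_forces = [(1.0, 0.0, 0.0), (2.0, 1.0, 0.0), (3.0, 0.0, 1.0),
                (4.0, 0.0, 0.0), (5.0, 2.0, 2.0)]
total_force = tree_reduce_vectors(pixel_forces)
```

With n inputs the reduction finishes in about log2(n) passes, which is what makes it attractive when the inputs are the pixels of a rendered texture.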

Exploring Computational Chemistry on Emerging Architectures

Author: Jenkins David Dewayne
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/12/2012

Emerging architectures, such as next generation microprocessors, graphics processing units, and Intel MIC cards, are increasingly used in high performance computing. Each of these architectures has advantages over previous generations of architectures, including performance, programmability, and power efficiency. With the ever-increasing performance of these architectures, scientific computing applications are able to attack larger, more complicated problems. However, since applications perform differently on each of the architectures, it is difficult to determine the best tool for the job. This dissertation makes the following contributions to computer engineering and computational science. First, this work implements the computational chemistry variational path integral application, QSATS, on various architectures, ranging from microprocessors to GPUs to Intel MICs. Second, this work explores the use of analytical performance modeling to predict the runtime and scalability of the application on the architectures. This allows for a comparison of the architectures when determining which to use for a set of program input parameters. The models presented in this dissertation are accurate within 6%. This work combines novel approaches to this algorithm and exploration of the various architectural features to develop the application to perform at its peak. In addition, this expands the understanding of computational science applications and their implementation on emerging architectures while providing insight into the performance, scalability, and programmer productivity.

University of Tennessee, Knoxville: Trace

Large Scale Computing for the Modelling of Whole Brain Connectivity

Author: Albers Kristoffer Jon
Publication venue: DTU Compute
Publication date: 01/01/2017

Online Research Database In Technology

Hardware Acceleration of Electronic Design Automation Algorithms

Author: Gulati Kanupriya

With the advances in very large scale integration (VLSI) technology, hardware is going parallel. Software, which was traditionally designed to execute on single core microprocessors, now faces the tough challenge of taking advantage of this parallelism, made available by the scaling of hardware. The work presented in this dissertation studies the acceleration of electronic design automation (EDA) software on several hardware platforms such as custom integrated circuits (ICs), field programmable gate arrays (FPGAs) and graphics processors. This dissertation concentrates on a subset of EDA algorithms which are heavily used in the VLSI design flow, and also have varying degrees of inherent parallelism in them. In particular, Boolean satisfiability, Monte Carlo based statistical static timing analysis, circuit simulation, fault simulation and fault table generation are explored. The architectural and performance tradeoffs of implementing the above applications on these alternative platforms (in comparison to their implementation on a single core microprocessor) are studied. In addition, this dissertation also presents an automated approach to accelerate uniprocessor code using a graphics processing unit (GPU). The key idea is to partition the software application into kernels in an automated fashion, such that multiple instances of these kernels, when executed in parallel on the GPU, can maximally benefit from the GPU's hardware resources. The work presented in this dissertation demonstrates that several EDA algorithms can be successfully rearchitected to maximally harness their performance on alternative platforms such as custom designed ICs, FPGAs and graphics processors, and obtain speedups of up to 800X. The approaches in this dissertation collectively aim to contribute towards enabling the computer aided design (CAD) community to accelerate EDA algorithms on arbitrary hardware platforms.

Texas A&M Repository

Architectures and GPU-Based Parallelization for Online Bayesian Computational Statistics and Dynamic Modeling

Author: Duan Lujie
Publication venue: 'University of Saskatchewan Library'
Publication date: 27/09/2021

Recent work demonstrates that coupling Bayesian computational statistics methods with dynamic models can facilitate the analysis of complex systems associated with diverse time series, including those involving social and behavioural dynamics. Particle Markov Chain Monte Carlo (PMCMC) methods constitute a particularly powerful class of Bayesian methods combining aspects of batch Markov Chain Monte Carlo (MCMC) and the sequential Monte Carlo method of Particle Filtering (PF). PMCMC can flexibly combine theory-capturing dynamic models with diverse empirical data. Online machine learning is a subcategory of machine learning algorithms characterized by sequential, incremental execution as new data arrives, which can give updated results and predictions with growing sequences of available incoming data. While many machine learning and statistical methods are adapted to online algorithms, PMCMC is one example of the many methods whose compatibility with and adaption to online learning remains unclear. In this thesis, I proposed a data-streaming solution supporting PF and PMCMC methods with dynamic epidemiological models and demonstrated several successful applications. By constructing an automated, easy-to-use streaming system, analytic applications and simulation models gain access to arriving real-time data to shorten the time gap between data and resulting model-supported insight. The well-defined architecture design emerging from the thesis would substantially expand traditional simulation models' potential by allowing such models to be offered as continually updated services. Contingent on sufficiently fast execution time, simulation models within this framework can consume the incoming empirical data in real-time and generate informative predictions on an ongoing basis as new data points arrive. 
In a second line of work, I investigated the platform's flexibility and capability by extending this system to support the use of a powerful class of PMCMC algorithms with dynamic models while ameliorating such algorithms' traditionally stiff performance limitations. Specifically, this work designed and implemented a GPU-enabled parallel version of a PMCMC method with dynamic simulation models. The resulting codebase has readily enabled researchers to adapt their models to state-of-the-art statistical inference methods, and ensures that the computation-heavy PMCMC method can perform significant sampling between the successive arrival of each new data point. Investigating this method's impact with several realistic PMCMC application examples showed that GPU-based acceleration allows for up to 160x speedup compared to a corresponding CPU-based version not exploiting parallelism. The GPU-accelerated PMCMC and the streaming processing system can complement each other, jointly providing researchers with a powerful toolset to greatly accelerate learning and secure additional insight from the high-velocity data increasingly prevalent within social and behavioural spheres. The design philosophy applied supported a platform with broad generalizability and potential for ready future extensions. The thesis discusses common barriers and difficulties in designing and implementing such systems and offers solutions to solve or mitigate them.

University of Saskatchewan Research Archive
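
To make the particle filtering (PF) component of this abstract concrete, here is a minimal bootstrap particle filter for a one-dimensional random-walk state observed with Gaussian noise. Everything about the model is an assumption made for illustration (the function name `bootstrap_particle_filter`, the uniform prior on [0, 10], the noise levels), and it is unrelated to the epidemiological models or GPU code in the thesis.

```python
import math
import random


def bootstrap_particle_filter(observations, num_particles=500,
                              process_sd=0.5, obs_sd=1.0, seed=0):
    """Return the filtered mean of a hypothetical random-walk state per step."""
    rng = random.Random(seed)
    # Assumed prior: state uniformly distributed on [0, 10].
    particles = [rng.uniform(0.0, 10.0) for _ in range(num_particles)]
    estimates = []
    for y in observations:
        # 1. Propagate each particle through the (assumed) random-walk model.
        particles = [x + rng.gauss(0.0, process_sd) for x in particles]
        # 2. Weight particles by the Gaussian likelihood of the observation.
        weights = [math.exp(-0.5 * ((y - x) / obs_sd) ** 2) for x in particles]
        total = sum(weights)
        if total == 0.0:
            weights = [1.0 / num_particles] * num_particles
        else:
            weights = [w / total for w in weights]
        # 3. Record the weighted mean, then resample in proportion to weight.
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
        particles = rng.choices(particles, weights=weights, k=num_particles)
    return estimates
```

The propagate and weight steps touch each particle independently, which is the part that parallelizes well. The loop processes one observation at a time, matching the online setting in which data points arrive sequentially.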