GPU-based ultra fast dose calculation using a finite pencil beam model
Online adaptive radiation therapy (ART) is an attractive concept that
promises the ability to deliver an optimal treatment in response to the
inter-fraction variability in patient anatomy. However, it has yet to be
realized due to technical limitations. Fast dose deposit coefficient
calculation is a critical component of the online planning process that is
required for plan optimization of intensity modulated radiation therapy (IMRT).
Computer graphics processing units (GPUs) are well-suited to provide the
requisite fast performance for the data-parallel nature of dose calculation. In
this work, we develop a dose calculation engine based on a finite-size pencil
beam (FSPB) algorithm and a GPU parallel computing framework. The developed
framework can accommodate any FSPB model. We test our implementation on a case
of a water phantom and a case of a prostate cancer patient with varying beamlet
and voxel sizes. All testing scenarios achieved speedups ranging from 200 to 400
times when using an NVIDIA Tesla C1060 card in comparison with a 2.27 GHz Intel
Xeon CPU. The computational time for calculating dose deposition coefficients
for a 9-field prostate IMRT plan with this new framework is less than 1 second.
This indicates that the GPU-based FSPB algorithm is well-suited for online
re-planning for adaptive radiotherapy.
Comment: submitted to Physics in Medicine and Biology
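To illustrate what a dose deposition coefficient is, the sketch below builds a small matrix D in which D[i][j] is the dose delivered to voxel i per unit intensity of beamlet j, then sums the contributions of all beamlets. The Gaussian lateral falloff, the one-dimensional geometry, and the function names are assumptions made for illustration only; they are not the FSPB model or the GPU implementation of the paper.

```python
import math


def dose_coefficients(voxel_positions, beamlet_centers, sigma=1.0):
    """Toy dose deposition matrix: D[i][j] is the dose to voxel i per unit
    intensity of beamlet j. The Gaussian falloff is a placeholder kernel,
    not a real finite-size pencil beam model."""
    return [
        [math.exp(-((v - b) ** 2) / (2.0 * sigma ** 2)) for b in beamlet_centers]
        for v in voxel_positions
    ]


def total_dose(D, intensities):
    """Dose per voxel: d_i = sum_j D[i][j] * x_j."""
    return [sum(d_ij * x_j for d_ij, x_j in zip(row, intensities)) for row in D]


# Hypothetical 1D geometry: four voxels and two beamlets.
voxels = [0.0, 1.0, 2.0, 3.0]
beamlets = [0.5, 2.5]
D = dose_coefficients(voxels, beamlets)
dose = total_dose(D, [1.0, 2.0])
```

Each entry of D depends only on one voxel and one beamlet, so the entries can be computed independently of one another. That independence is the data-parallel structure the abstract refers to.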
Computational Physics on Graphics Processing Units
The use of graphics processing units for scientific computations is an
emerging strategy that can significantly speed up a wide range of algorithms.
In this review, we discuss advances made in the field of computational physics,
focusing on classical molecular dynamics, and on quantum simulations for
electronic structure calculations using the density functional theory, wave
function techniques, and quantum field theory.
Comment: Proceedings of the 11th International Conference, PARA 2012,
Helsinki, Finland, June 10-13, 2012
High performance bioinformatics and computational biology on general-purpose graphics processing units
Bioinformatics and Computational Biology (BCB) is a relatively new
multidisciplinary field which brings together many aspects of the fields of
biology, computer science, statistics, and engineering. Bioinformatics extracts
useful information from biological data and makes these more intuitive and
understandable by applying principles of information sciences, while
computational biology harnesses computational approaches and technologies
to answer biological questions conveniently. Recent years have seen an
explosion of the size of biological data at a rate which outpaces the rate of
increases in the computational power of mainstream computer technologies,
namely general purpose processors (GPPs). The aim of this thesis is to explore
the use of off-the-shelf Graphics Processing Unit (GPU) technology in the high
performance and efficient implementation of BCB applications in order to meet
the demands of biological data increases at affordable cost.
The thesis presents detailed design and implementations of GPU solutions for
a number of BCB algorithms in two widely used BCB applications, namely
biological sequence alignment and phylogenetic analysis. Biological sequence
alignment can be used to determine the potential information about a newly
discovered biological sequence from other well-known sequences through
similarity comparison. On the other hand, phylogenetic analysis is concerned
with the investigation of the evolution and relationships among organisms,
and has many uses in the fields of system biology and comparative genomics.
In molecular-based phylogenetic analysis, the relationship between species is
estimated by inferring the common history of their genes and then
phylogenetic trees are constructed to illustrate evolutionary relationships
among genes and organisms. However, both biological sequence alignment
and phylogenetic analysis are computationally expensive applications as their
computing and memory requirements grow polynomially or even worse with
the size of sequence databases.
The thesis firstly presents a multi-threaded parallel design of the Smith-
Waterman (SW) algorithm alongside an implementation on NVIDIA GPUs. A
novel technique is put forward to solve the restriction on the length of the
query sequence in previous GPU-based implementations of the SW algorithm.
Based on this implementation, the difference between two main task
parallelization approaches (Inter-task and Intra-task parallelization) is
presented. The resulting GPU implementation matches the speed of existing
GPU implementations while providing more flexibility, i.e. flexible length of
sequences in real world applications. It also outperforms an equivalent GPP-based
implementation by 15x-20x. After this, the thesis presents the first
reported multi-threaded design and GPU implementation of the Gapped
BLAST with Two-Hit method algorithm, which is widely used for aligning
biological sequences heuristically. This achieved up to 3x speed-up
improvements compared to the most optimised GPP implementations.
The thesis then presents a multi-threaded design and GPU implementation of
a Neighbor-Joining (NJ)-based method for phylogenetic tree construction and
multiple sequence alignment (MSA). This achieves 8x-20x speed up compared
to an equivalent GPP implementation based on the widely used ClustalW
software. The NJ method however only gives one possible tree which strongly
depends on the evolutionary model used. A more advanced method uses
maximum likelihood (ML) for scoring phylogenies with Markov Chain Monte
Carlo (MCMC)-based Bayesian inference. The latter was the subject of another
multi-threaded design and GPU implementation presented in this thesis,
which achieved 4x-8x speed up compared to an equivalent GPP
implementation based on the widely used MrBayes software.
Finally, the thesis presents a general evaluation of the designs and
implementations achieved in this work as a step towards the evaluation of
GPU technology in BCB computing, in the context of other computer
technologies including GPPs and Field Programmable Gate Array (FPGA)
technology.
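As a reference point for the Smith-Waterman work described above, here is a minimal sequential sketch of the local-alignment score computation. The linear gap penalty and the scoring values are assumptions chosen for brevity; the thesis's GPU design and its scoring scheme are not reproduced here.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of strings a and b.

    Uses a linear gap penalty (an assumption; affine gaps are also common).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at a[i-1], b[j-1]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            H[i][j] = max(0, diag, up, left)  # 0 restarts the local alignment
            best = max(best, H[i][j])
    return best
```

Each cell depends only on its upper, left, and upper-left neighbours, so all cells on one anti-diagonal can be computed at the same time. This is the usual basis for parallelizing a single alignment (intra-task), whereas aligning many sequence pairs at once assigns one whole matrix to each worker (inter-task).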
Irregular alignment of arbitrarily long DNA sequences on GPU
The use of Graphics Processing Units to accelerate computational applications is increasingly being adopted due to its affordability, flexibility and performance. However, achieving top performance comes at the price of restricted data-parallelism models. In the case of sequence alignment, most GPU-based approaches focus on accelerating the Smith-Waterman dynamic programming algorithm due to its regularity. Nevertheless, because of its quadratic complexity, it becomes impractical when comparing long sequences, and therefore heuristic methods are required to reduce the search space. We present GPUGECKO, a CUDA implementation of the sequential, seed-and-extend sequence-comparison algorithm GECKO. Our proposal includes optimized kernels based on collective operations capable of producing arbitrarily long alignments while dealing with heterogeneous and unpredictable load. Contrary to other state-of-the-art methods, GPUGECKO employs a batching mechanism that prevents memory exhaustion because it does not require all alignments to fit into device memory at once. This enables massive comparisons to be run exhaustively with improved sensitivity while also providing up to 6x average speedup w.r.t. the CUDA acceleration of BLASTN.
Funding for open access publishing: Universidad Málaga/CBUA. This work has been partially supported by the European project ELIXIR-EXCELERATE (grant no. 676559), the Spanish national project Plataforma de Recursos Biomoleculares y Bioinformáticos (ISCIII-PT13.0001.0012 and ISCIII-PT17.0009.0022), the Fondo Europeo de Desarrollo Regional (UMA18-FEDERJA-156, UMA20-FEDERJA-059), the Junta de Andalucía (P18-FR-3130), the Instituto de Investigación Biomédica de Málaga IBIMA and the University of Málaga.
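The general seed-and-extend idea can be sketched as follows: index the k-mers of one sequence, look up the k-mers of the other to find exact seed matches, then grow each seed in both directions. This is a simplified illustration under stated assumptions (exact-match ungapped extension, hypothetical function names, a small k); it is not the GECKO or GPUGECKO algorithm, which uses its own seeding and scored extension.

```python
def find_seeds(query, ref, k):
    """Return (query_pos, ref_pos) pairs where a k-mer occurs in both sequences."""
    index = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(ref[i:i + k], []).append(i)
    seeds = []
    for q in range(len(query) - k + 1):
        for r in index.get(query[q:q + k], []):
            seeds.append((q, r))
    return seeds


def extend(query, ref, q, r, k):
    """Grow an exact seed left and right while characters keep matching.

    Returns (query_start, ref_start, length). Real tools score mismatches
    instead of stopping at the first one; that is omitted here.
    """
    start_q, start_r = q, r
    while start_q > 0 and start_r > 0 and query[start_q - 1] == ref[start_r - 1]:
        start_q -= 1
        start_r -= 1
    end_q, end_r = q + k, r + k
    while end_q < len(query) and end_r < len(ref) and query[end_q] == ref[end_r]:
        end_q += 1
        end_r += 1
    return (start_q, start_r, end_q - start_q)


def seed_and_extend(query, ref, k=4):
    """Deduplicated, sorted list of extended matches."""
    hits = {extend(query, ref, q, r, k) for q, r in find_seeds(query, ref, k)}
    return sorted(hits)
```

The number of seeds and the length of each extension vary from one region to the next, which is the kind of heterogeneous, unpredictable load the abstract mentions.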
Reconfigurable computing for large-scale graph traversal algorithms
This thesis proposes a reconfigurable computing approach for supporting parallel processing in large-scale graph traversal algorithms. Our approach is based on a reconfigurable hardware architecture which exploits the capabilities of both FPGAs (Field-Programmable Gate Arrays) and a multi-bank parallel memory subsystem.
The proposed methodology to accelerate graph traversal algorithms has been applied to three case studies, revealing that application-specific hardware customisations can benefit performance. A summary of our four contributions is as follows.
First, a reconfigurable computing approach to accelerate large-scale graph traversal algorithms. We propose a reconfigurable hardware architecture which decouples computation and communication while keeping multiple memory requests in flight at any given time, taking advantage of the high bandwidth of multi-bank memory subsystems.
Second, a demonstration of the effectiveness of our approach through two case studies: the breadth-first search algorithm, and a graphlet counting algorithm from bioinformatics. Both case studies involve graph traversal, but each of them adopts a different graph data representation.
Third, a method for using on-chip memory resources in FPGAs to reduce off-chip memory accesses for accelerating graph traversal algorithms, through a case-study of the All-Pairs Shortest-Paths algorithm. This case study has been applied to process human brain network data.
Fourth, an evaluation of an approach based on instruction-set extension for FPGA design against many-core GPUs (Graphics Processing Units), based on a set of benchmarks with different memory access characteristics. It is shown that while GPUs excel at streaming applications, the proposed approach can outperform GPUs in applications with poor locality characteristics, such as graph traversal problems.
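A level-by-level breadth-first search over a compressed sparse row (CSR) graph shows why traversal has poor memory locality. The sketch below is a plain sequential version with hypothetical helper names; the CSR layout is an assumption for illustration and is not necessarily the graph representation used in the thesis.

```python
def to_csr(num_vertices, edges):
    """Build CSR arrays (offsets, neighbors) from a list of directed edges."""
    offsets = [0] * (num_vertices + 1)
    for u, _ in edges:
        offsets[u + 1] += 1
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]          # prefix sum of out-degrees
    neighbors = [0] * len(edges)
    cursor = offsets[:-1].copy()              # next free slot per vertex
    for u, v in edges:
        neighbors[cursor[u]] = v
        cursor[u] += 1
    return offsets, neighbors


def bfs_levels(offsets, neighbors, source):
    """Return the BFS level of every vertex (-1 if unreachable)."""
    level = [-1] * (len(offsets) - 1)
    level[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:
            for idx in range(offsets[u], offsets[u + 1]):
                v = neighbors[idx]            # data-dependent, scattered access
                if level[v] == -1:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level
```

The reads of `level[v]` jump to addresses determined by the graph itself, so consecutive accesses rarely touch neighbouring memory. Keeping many such requests in flight at once is the motivation the abstract gives for a multi-bank memory subsystem.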
Faster Than Real-Time GPGPU Radiation Pressure Modeling Methods
Solar radiation pressure (SRP) is a significant contributing dynamic force on spacecraft in all orbit regimes. Predicting, accommodating, and either leveraging or canceling its effect is paramount to effective orbit determination, maneuver and mission design. As a result, spacecraft numerical simulation requires computational models which can model SRP with sufficient accuracy. However, the computationally intense nature of high-fidelity SRP evaluations has typically limited them to an offline computation which generates lookup data. Precomputation limits the ability of a spacecraft dynamic simulation to accommodate the myriad time-varying changes which occur to the spacecraft state during a mission.
In the past decade the computer graphics industry has driven the development of highly parallel graphics processing units (GPUs) capable of performing many thousands of floating point operations per second. General purpose GPU programming (GPGPU) has been leveraged particularly in engineering and the sciences, where the high computational power of parallel GPU hardware presents the opportunity for significant increases in the size and dimension of computational problems now manageable on personal computers.
This dissertation presents two modeling approaches which take advantage of the GPGPU aspect of commodity GPU hardware. The first contribution is a modeling approach which utilizes the vector graphics application programming interface (API) Open Graphics Library (OpenGL) and the GPGPU computing API Open Computing Language (OpenCL) to develop a high geometric fidelity SRP modeling approach. The OpenGL-CL modeling approach computes SRP induced force and torque across a detailed spacecraft mesh model. The method utilizes the OpenGL-OpenCL shared context to share modeling data between the two APIs. The OpenGL render pipeline is manipulated to render the sun-frame projected surface of the spacecraft into OpenGL Texture data objects. A custom OpenCL parallel reduction kernel is developed which subsequently computes the SRP force and torque across the spacecraft rendered into the OpenGL Textures. The method achieves faster than real time computation speeds while accommodating spacecraft meshes with many thousands of vertices, arbitrary articulated components and detailed spacecraft material optical parameters.
The second contribution is a GPU based parallel ray tracing modeling approach which exhibits faster than real time evaluation speeds. Techniques and algorithms from the computer graphics discipline are used to develop and implement a method which computes SRP force and torque across a detailed spacecraft triangulated mesh model. Efficient data structures such as bounding volume hierarchy (BVH) acceleration minimize the computational burden by reducing the ray-surface intersection search space. Accurate ray reflections are computed for complex materials by applying a Quasi-Monte Carlo integration method and importance sampling. Complex material bidirectional reflectance distribution functions (BRDF) are implemented both as ideal mirror-like specular and Lambertian diffuse models and as microfacet BRDF models. Arbitrary spacecraft articulations are accommodated at run time with no appreciable reduction in computational speed.
Both SRP models utilize the latent computing power of the GPU which exists in the large majority of consumer grade personal computing systems. Further access to latent computing power is enabled by the development of a software simulation communication middleware called Black Lion (BL). The third contribution of this thesis is the description of a novel software architecture and the design principles applied to the development of the BL software. Black Lion enables the integration of multiple local or distributed heterogeneous applications never intended to run in a cooperative setting. It is shown that BL enables access to more powerful latent personal computing resources by creating a means to transparently facilitate distributed simulation across multiple simulation nodes and computers.
Finally, this dissertation demonstrates the utility of both modeling methods by their applications in two case studies. Firstly, the high-fidelity SRP effects are computed for an ongoing asteroid sample return mission. Agreement between the OpenGL-CL methods is demonstrated. Both SRP modeling approaches make significant use of pre- and post-launch engineering data. The utility of direct access to a model’s physical parameters is demonstrated in an analysis of contributors to possible error between modeled and estimated SRP accelerations. Secondly, the capability of fast computational speed paired with high geometric resolution, of both the OpenGL-CL and ray tracing methods, is demonstrated. Each method is employed in the simulation and long-term propagation of realistic multi-layer insulation (MLI) debris object mesh models, and the effect of departing from the typical flat-plate MLI model is investigated.
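For orientation, the per-facet SRP force that such models sum over a mesh can be sketched with the common flat-plate formula combining absorbed, specularly reflected and Lambertian diffuse components. The sketch below makes several assumptions that are not taken from the dissertation: it ignores self-shadowing and multiple reflections (the effects the rendering and ray tracing methods are designed to capture), it uses an approximate solar pressure at 1 AU, and all names are hypothetical.

```python
import math

SOLAR_PRESSURE_1AU = 4.56e-6  # N/m^2, approximate value at 1 AU (assumption)


def facet_force(area, normal, sun_dir, rho_spec, rho_diff,
                pressure=SOLAR_PRESSURE_1AU):
    """SRP force on one flat facet.

    normal and sun_dir are unit vectors; sun_dir points from the facet to the
    Sun. rho_spec and rho_diff are specular and diffuse reflectivities.
    Facets facing away from the Sun contribute nothing.
    """
    cos_theta = sum(n * s for n, s in zip(normal, sun_dir))
    if cos_theta <= 0.0:
        return (0.0, 0.0, 0.0)
    scale = -pressure * area * cos_theta
    n_coeff = 2.0 * rho_spec * cos_theta + (2.0 / 3.0) * rho_diff
    return tuple(scale * ((1.0 - rho_spec) * s + n_coeff * n)
                 for s, n in zip(sun_dir, normal))


def total_force(facets, sun_dir):
    """Sum the facet forces; each facet is (area, normal, rho_spec, rho_diff)."""
    total = [0.0, 0.0, 0.0]
    for area, normal, rho_spec, rho_diff in facets:
        f = facet_force(area, normal, sun_dir, rho_spec, rho_diff)
        total = [t + c for t, c in zip(total, f)]
    return tuple(total)
```

The final summation over facets is a reduction, which is why a parallel reduction kernel is a natural fit once the per-facet (or per-pixel) contributions have been computed.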
Exploring Computational Chemistry on Emerging Architectures
Emerging architectures, such as next generation microprocessors, graphics processing units, and Intel MIC cards, are increasingly used in high performance computing. Each of these architectures has advantages over previous generations of architectures including performance, programmability, and power efficiency. With the ever-increasing performance of these architectures, scientific computing applications are able to attack larger, more complicated problems. However, since applications perform differently on each of the architectures, it is difficult to determine the best tool for the job. This dissertation makes the following contributions to computer engineering and computational science. First, this work implements the computational chemistry variational path integral application, QSATS, on various architectures, ranging from microprocessors to GPUs to Intel MICs. Second, this work explores the use of analytical performance modeling to predict the runtime and scalability of the application on the architectures. This allows for a comparison of the architectures when determining which to use for a set of program input parameters. The models presented in this dissertation are accurate within 6%. This work combines novel approaches to this algorithm and exploration of the various architectural features to develop the application to perform at its peak. In addition, this expands the understanding of computational science applications and their implementation on emerging architectures while providing insight into the performance, scalability, and programmer productivity.
Hardware Acceleration of Electronic Design Automation Algorithms
With the advances in very large scale integration (VLSI) technology, hardware is going
parallel. Software, which was traditionally designed to execute on single core microprocessors,
now faces the tough challenge of taking advantage of this parallelism, made available
by the scaling of hardware. The work presented in this dissertation studies the acceleration
of electronic design automation (EDA) software on several hardware platforms such
as custom integrated circuits (ICs), field programmable gate arrays (FPGAs) and graphics
processors. This dissertation concentrates on a subset of EDA algorithms which are heavily
used in the VLSI design flow, and also have varying degrees of inherent parallelism
in them. In particular, Boolean satisfiability, Monte Carlo based statistical static timing
analysis, circuit simulation, fault simulation and fault table generation are explored. The
architectural and performance tradeoffs of implementing the above applications on these
alternative platforms (in comparison to their implementation on a single core microprocessor)
are studied. In addition, this dissertation also presents an automated approach to
accelerate uniprocessor code using a graphics processing unit (GPU). The key idea is to
partition the software application into kernels in an automated fashion, such that multiple
instances of these kernels, when executed in parallel on the GPU, can maximally benefit
from the GPU's hardware resources.
The work presented in this dissertation demonstrates that several EDA algorithms can
be successfully rearchitected to maximally harness their performance on alternative platforms
such as custom designed ICs, FPGAs and graphics processors, and obtain speedups
up to 800X. The approaches in this dissertation collectively aim to contribute towards enabling
the computer aided design (CAD) community to accelerate EDA algorithms on arbitrary
hardware platforms.
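Of the algorithms listed, Monte Carlo based statistical static timing analysis shows the kind of inherent parallelism most directly: every sample is an independent longest-path computation over the same circuit. The sketch below is a minimal sequential illustration with a hypothetical four-gate circuit and a Gaussian delay model chosen as an assumption; it is not the dissertation's implementation.

```python
import random


def circuit_delay(gates, fanin, delays):
    """Longest-path arrival time. gates must be listed in topological order."""
    arrival = {}
    for g in gates:
        latest_input = max((arrival[p] for p in fanin.get(g, [])), default=0.0)
        arrival[g] = latest_input + delays[g]
    return max(arrival.values())


def monte_carlo_ssta(gates, fanin, nominal, sigma, samples, seed=0):
    """Sample gate delays and return the circuit delay of every sample.

    Each gate delay is drawn from a Gaussian around its nominal value and
    clamped at zero (an illustrative variation model).
    """
    rng = random.Random(seed)
    results = []
    for _ in range(samples):
        delays = {g: max(0.0, rng.gauss(nominal[g], sigma)) for g in gates}
        results.append(circuit_delay(gates, fanin, delays))
    return results


# Hypothetical circuit: gates a and b feed c, which feeds d.
gates = ["a", "b", "c", "d"]
fanin = {"c": ["a", "b"], "d": ["c"]}
nominal = {"a": 1.0, "b": 2.0, "c": 1.5, "d": 0.5}
delay_samples = monte_carlo_ssta(gates, fanin, nominal, sigma=0.1, samples=1000)
```

Because no sample reads another sample's result, the loop over samples could be mapped to one parallel worker per sample.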
Architectures and GPU-Based Parallelization for Online Bayesian Computational Statistics and Dynamic Modeling
Recent work demonstrates that coupling Bayesian computational statistics methods with dynamic models can facilitate the analysis of complex systems associated with diverse time series, including those involving social and behavioural dynamics. Particle Markov Chain Monte Carlo (PMCMC) methods constitute a particularly powerful class of Bayesian methods combining aspects of batch Markov Chain Monte Carlo (MCMC) and the sequential Monte Carlo method of Particle Filtering (PF). PMCMC can flexibly combine theory-capturing dynamic models with diverse empirical data. Online machine learning is a subcategory of machine learning algorithms characterized by sequential, incremental execution as new data arrives, which can give updated results and predictions with growing sequences of available incoming data. While many machine learning and statistical methods have been adapted to online algorithms, PMCMC is one example of the many methods whose compatibility with and adaptation to online learning remains unclear.
In this thesis, I proposed a data-streaming solution supporting PF and PMCMC methods with dynamic epidemiological models and demonstrated several successful applications.
By constructing an automated, easy-to-use streaming system, analytic applications and simulation models gain access to arriving real-time data to shorten the time gap between data and resulting model-supported insight. The well-defined architecture design emerging from the thesis would substantially expand traditional simulation models' potential by allowing such models to be offered as continually updated services.
Contingent on sufficiently fast execution time, simulation models within this framework can consume the incoming empirical data in real-time and generate informative predictions on an ongoing basis as new data points arrive.
In a second line of work, I investigated the platform's flexibility and capability by extending this system to support the use of a powerful class of PMCMC algorithms with dynamic models while ameliorating such algorithms' traditionally stiff performance limitations. Specifically, this work designed and implemented a GPU-enabled parallel version of a PMCMC method with dynamic simulation models. The resulting codebase has readily enabled researchers to adapt their models to state-of-the-art statistical inference methods, and ensures that the computation-heavy PMCMC method can perform significant sampling between the successive arrival of each new data point. Investigating this method's impact with several realistic PMCMC application examples showed that GPU-based acceleration allows for up to 160x speedup compared to a corresponding CPU-based version not exploiting parallelism. The GPU accelerated PMCMC and the streaming processing system can complement each other, jointly providing researchers with a powerful toolset to greatly accelerate learning and secure additional insight from the high-velocity data increasingly prevalent within social and behavioural spheres.
The design philosophy applied supported a platform with broad generalizability and potential for ready future extensions.
The thesis discusses common barriers and difficulties in designing and implementing such systems and offers solutions to solve or mitigate them.
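The particle filtering step that PMCMC builds on, and that suits online use because it consumes one observation at a time, can be sketched as a bootstrap filter. Everything model-specific below is an assumption for illustration: a one-dimensional random-walk state, Gaussian observation noise, and hypothetical function names. It is not the epidemiological model or GPU code of the thesis.

```python
import math
import random


def particle_filter_step(particles, observation, rng, process_sd=0.5, obs_sd=1.0):
    """One bootstrap-filter update: propagate, weight, resample."""
    # Propagate each particle through the (placeholder) random-walk model.
    predicted = [p + rng.gauss(0.0, process_sd) for p in particles]
    # Weight by the Gaussian likelihood of the new observation.
    weights = [math.exp(-0.5 * ((observation - p) / obs_sd) ** 2) for p in predicted]
    total = sum(weights)
    if total == 0.0:
        weights = [1.0] * len(predicted)  # fall back to uniform weights
    # Resample in proportion to the weights (choices accepts relative weights).
    return rng.choices(predicted, weights=weights, k=len(predicted))


def run_filter(observations, num_particles=500, seed=0):
    """Feed observations one at a time and return the running state estimates."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 2.0) for _ in range(num_particles)]
    estimates = []
    for y in observations:
        particles = particle_filter_step(particles, y, rng)
        estimates.append(sum(particles) / len(particles))
    return estimates
```

Propagation and weighting treat every particle independently, which is the part that maps naturally onto parallel hardware; resampling needs the weights of all particles and is the step that requires coordination.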