GPU Optimizing and Accelerating of Gibbs Ensemble on the CUDA Kepler Architecture
The main purpose of implementing the code on the Kepler architecture is to speed up our group's previous GPU code by using the new features of NVIDIA CUDA's Kepler architecture. This thesis therefore focuses specifically on this latest architecture.
To benefit from the Kepler architecture, the primary work is to convert the code and adapt it to two new features: Warp Shuffle and Dynamic Parallelism. The new code changes how data are transferred and generates new kernel functions. A further challenge is to trade off the use of resources on each thread to obtain the best performance.
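As a hedged illustration only (the thesis's actual Gibbs-ensemble kernels are not reproduced here, and Dynamic Parallelism is omitted), the sketch below shows the kind of register-level data exchange Warp Shuffle enables: a warp-wide sum that needs no shared-memory staging. Note that `__shfl_down_sync` is the CUDA 9+ spelling of the Kepler-era `__shfl_down` intrinsic.

    // Minimal sketch: warp-level sum reduction via warp shuffle.
    // Illustrative only; not the thesis's Gibbs-ensemble code.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void warpSumKernel(const float *in, float *out) {
        float v = in[threadIdx.x];
        // Each step halves the number of active summands; values move
        // directly between registers, with no shared-memory round trip.
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);  // __shfl_down() on Kepler-era toolkits
        if (threadIdx.x == 0) *out = v;                    // lane 0 holds the warp's sum
    }

    int main() {
        float h_in[32], h_out, *d_in, *d_out;
        for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;       // expected sum: 32
        cudaMalloc(&d_in, 32 * sizeof(float));
        cudaMalloc(&d_out, sizeof(float));
        cudaMemcpy(d_in, h_in, 32 * sizeof(float), cudaMemcpyHostToDevice);
        warpSumKernel<<<1, 32>>>(d_in, d_out);
        cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
        printf("warp sum = %f\n", h_out);
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }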
The new code performs differently at different problem sizes. In general, the speedup is between 17% and 33%, with larger systems achieving better performance. This is a reasonable gain for an improvement based on only two new features. The main contribution of this thesis is that the detailed evaluation of these two Kepler architectural features provides guidance to other researchers on the potential performance benefits of modifying their code, so they can make appropriate modifications and achieve reasonable speedups according to the structure of their codes.
Tiling Optimization for Nested Loops on GPUs
Optimizing nested loops has long been an important and widely studied topic in parallel programming. With the development of GPU architectures, the performance of these computations can be significantly boosted by massively parallel hardware. General matrix-matrix multiplication is a typical example where a GPU implementation outperforms multicore CPUs. However, achieving ideal performance on GPUs usually requires considerable human effort to manage the massively parallel computation resources, so the efficient optimization of nested loops on GPUs has become a popular topic in recent years. In this dissertation, we present work based on the tiling strategy to address three kinds of popular problems. Different kinds of computations raise different latency issues: dependencies in a computation may result in insufficient parallelism, while computations without dependencies may be degraded by intensive memory accesses. We tackle the challenges of each kind of problem and believe that other computations performed in nested loops can also benefit from the presented techniques.
We improve a parallel approximation algorithm for scheduling jobs on parallel identical machines to minimize makespan, using a high-dimensional tiling method. The algorithm is designed and optimized to solve this kind of problem efficiently on GPUs. Because the algorithm is based on a higher-dimensional dynamic programming approach, where dimensionality refers to the number of variables in the dynamic programming equation characterizing the problem, the existing implementation suffers from the curse of dimensionality and cannot fully utilize GPU resources. We design a novel data-partitioning technique to accelerate the higher-dimensional dynamic programming component of the algorithm. Our GPU solution addresses both load imbalance and the risk of exceeding memory capacity. We present performance results to demonstrate how the proposed design improves GPU utilization and makes it possible to solve large higher-dimensional dynamic programming problems within the limited GPU memory. Experimental results show that the GPU implementation achieves up to a 25X speedup over the best existing OpenMP implementation.
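The abstract does not spell out the data-partitioning scheme, so the following is a hedged sketch of the underlying memory idea only: when a layered dynamic program's layer i depends only on layer i-1, two device buffers suffice no matter how many layers the full table has. A simple subset-sum recurrence stands in for the scheduling DP, which is not reproduced here.

    // Hedged sketch: layer-by-layer DP with two ping-pong buffers.
    // Recurrence shown: dp[i][c] = dp[i-1][c] OR dp[i-1][c - w[i]].
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void dpLayer(const int *prev, int *cur, int cap, int w) {
        int c = blockIdx.x * blockDim.x + threadIdx.x;
        if (c > cap) return;
        int reachable = prev[c];                  // skip item i
        if (c >= w) reachable |= prev[c - w];     // take item i
        cur[c] = reachable;
    }

    int main() {
        const int cap = 1 << 20;                  // width of one DP layer
        const int weights[] = {3, 7, 11, 20};
        int *a, *b;
        cudaMalloc(&a, (cap + 1) * sizeof(int));
        cudaMalloc(&b, (cap + 1) * sizeof(int));
        cudaMemset(a, 0, (cap + 1) * sizeof(int));
        int one = 1;
        cudaMemcpy(a, &one, sizeof(int), cudaMemcpyHostToDevice);  // dp[0][0] = true
        dim3 grid((cap + 256) / 256), block(256);
        for (int w : weights) {                   // one launch per DP layer
            dpLayer<<<grid, block>>>(a, b, cap, w);
            int *t = a; a = b; b = t;             // ping-pong: previous <-> current
        }
        int hit;                                  // 41 = 3 + 7 + 11 + 20
        cudaMemcpy(&hit, a + 41, sizeof(int), cudaMemcpyDeviceToHost);
        printf("sum 41 reachable: %d\n", hit);
        cudaFree(a); cudaFree(b);
        return 0;
    }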
In addition, we focus on optimizing wavefront parallelism on GPUs. Wavefront parallelism is a well-known technique for exploiting the concurrency of applications that execute nested loops with uniform data dependencies. Recent research on such applications, which range from sequence alignment tools to partial differential equation solvers, has used GPUs to benefit from their massively parallel computing resources. Wavefront parallelism faces a load-imbalance issue because the available parallelism varies as computation proceeds along the diagonals. Tiling has been introduced as a popular solution to this issue; however, the use of hyperplane tiles increases the cost of synchronization and leads to poor data locality. In this work, we present a highly optimized implementation of the wavefront parallelism technique that harnesses the GPU architecture. A balanced workload and maximum resource utilization are achieved with extremely low synchronization overhead. We design the kernel configuration to significantly reduce the minimum number of synchronizations required and introduce an inter-block lock to minimize the overhead of each synchronization. We evaluate the proposed technique on four applications: Sequence Alignment, Edit Distance, Summed-Area Table, and 2DSOR. The results demonstrate speedups of up to six times over the previous best-known hyperplane-tiling-based GPU implementation.
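The paper's inter-block lock is not detailed in the abstract; the sketch below is one common handshake pattern, under the stated assumption that all blocks are simultaneously resident on the GPU (production code would instead use a cooperative launch or one kernel launch per diagonal). Each block owns a row of tiles and spins on the previous row's progress counter, so tiles complete in wavefront order.

    // Hedged sketch: wavefront ordering via per-row progress counters.
    // Assumes every block fits on the GPU at once; illustrative only.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void wavefrontRows(int *grid, int n, volatile int *progress) {
        int row = blockIdx.x;                       // one block per row of tiles
        for (int col = 0; col < n; ++col) {
            if (row > 0 && threadIdx.x == 0)
                while (progress[row - 1] <= col) ;  // wait for the north neighbor tile
            __syncthreads();                        // whole block observes the wait
            if (threadIdx.x == 0) {                 // trivial stand-in "tile" computation
                int north = row > 0 ? grid[(row - 1) * n + col] : 0;
                int west  = col > 0 ? grid[row * n + col - 1]   : 0;
                grid[row * n + col] = north + west + 1;
            }
            __syncthreads();
            if (threadIdx.x == 0) {
                __threadfence();                    // publish the tile before signaling
                progress[row] = col + 1;            // unblock the row below
            }
        }
    }

    int main() {
        const int n = 8;
        int *grid_d, *prog_d, host[n * n];
        cudaMalloc(&grid_d, n * n * sizeof(int));
        cudaMalloc(&prog_d, n * sizeof(int));
        cudaMemset(prog_d, 0, n * sizeof(int));
        wavefrontRows<<<n, 32>>>(grid_d, n, prog_d);
        cudaMemcpy(host, grid_d, n * n * sizeof(int), cudaMemcpyDeviceToHost);
        printf("grid[n-1][n-1] = %d\n", host[n * n - 1]);  // value built up through the wavefront
        cudaFree(grid_d); cudaFree(prog_d);
        return 0;
    }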
Finally, we extend hyperplane tiling to high-order 2D stencil computations. Unlike wavefront parallelism, which has dependences in the spatial dimensions, stencil computations carry dependences only across two adjacent time steps along the temporal dimension. Although this property significantly increases the parallelism available in the spatial dimensions, full parallelism may not be efficient on GPUs: due to the limited cache capacity of each streaming multiprocessor, full parallelism can be obtained only through global memory, which has high access latency. The tiling technique can therefore be applied to improve memory efficiency by caching small tiled blocks. Because widely studied tiling methods such as overlapped tiling and split tiling incur considerable computation overhead from load imbalance or extra operations, we propose a time-skewed tiling method designed around the GPU architecture. We work around the serialized-computation issue and coordinate intra-tile and inter-tile parallelism to minimize the load imbalance caused by pipelined processing. Moreover, our development addresses high-order stencil computations, which have not been comprehensively studied. The proposed method achieves up to a 3.5X performance improvement when the stencil computation is performed on a Moore neighborhood pattern.
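As a hedged baseline (the proposed time-skewed, high-order scheme is not reproduced here), the kernel below shows the spatial-tiling step the text motivates: a 5-point Jacobi stencil that stages a tile plus its halo in shared memory, so every neighbor read hits on-chip storage rather than global memory. For brevity it assumes n is a multiple of TILE.

    // Hedged baseline: shared-memory spatial tiling of a 5-point Jacobi stencil.
    #include <cuda_runtime.h>

    #define TILE 16

    __global__ void jacobiStep(const float *in, float *out, int n) {
        __shared__ float tile[TILE + 2][TILE + 2];       // interior plus halo ring
        int gx = blockIdx.x * TILE + threadIdx.x;        // global column
        int gy = blockIdx.y * TILE + threadIdx.y;        // global row
        int lx = threadIdx.x + 1, ly = threadIdx.y + 1;  // local coords, shifted past halo

        tile[ly][lx] = in[gy * n + gx];                  // interior point
        if (threadIdx.x == 0        && gx > 0)     tile[ly][0]        = in[gy * n + gx - 1];
        if (threadIdx.x == TILE - 1 && gx < n - 1) tile[ly][TILE + 1] = in[gy * n + gx + 1];
        if (threadIdx.y == 0        && gy > 0)     tile[0][lx]        = in[(gy - 1) * n + gx];
        if (threadIdx.y == TILE - 1 && gy < n - 1) tile[TILE + 1][lx] = in[(gy + 1) * n + gx];
        __syncthreads();

        // Interior update: every neighbor now comes from shared memory.
        if (gx > 0 && gx < n - 1 && gy > 0 && gy < n - 1)
            out[gy * n + gx] = 0.25f * (tile[ly - 1][lx] + tile[ly + 1][lx] +
                                        tile[ly][lx - 1] + tile[ly][lx + 1]);
    }

    int main() {
        const int n = 1024;                              // assumed multiple of TILE
        float *in, *out;
        cudaMalloc(&in,  n * n * sizeof(float));
        cudaMalloc(&out, n * n * sizeof(float));
        cudaMemset(in, 0, n * n * sizeof(float));
        dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
        jacobiStep<<<grid, block>>>(in, out, n);         // one time step
        cudaDeviceSynchronize();
        cudaFree(in); cudaFree(out);
        return 0;
    }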
Modeling Bioenergy Supply Chains: Feedstock Pretreatment and Integrated System Design Under Uncertainty
Biofuels have been promoted by governmental policies to reduce fossil fuel dependency and greenhouse gas emissions and to facilitate regional economic growth. Comprehensive model analysis is needed to assess the economic and environmental impacts of developing bioenergy production systems. For cellulosic biofuel production and supply in particular, existing studies have not accounted for the inter-dependencies between multiple participating decision makers while simultaneously incorporating the uncertainties and risks associated with the linked production systems.

This dissertation presents a methodology that adds uncertainty to an existing integrated modeling framework designed for advanced biofuel production systems that use dedicated energy crops as feedstock resources. The goal of the framework is to support infrastructure and supply chain development in the bioenergy industry. The framework is flexible enough to adapt to different topological network structures and decision scopes based on the modeling requirements, for example capturing the interactions between the agricultural production system and the multi-refinery bioenergy supply chain with regard to land allocation and crop adoption patterns, which is critical for estimating feedstock supply potentials for the bioenergy industry. The methodology is also specifically designed to incorporate system uncertainties by using stochastic programming models to improve the resilience of the optimized system design.

The framework is applied in two case studies. The California biomass supply model estimates that feedstock pretreatment via combined torrefaction and pelletization reduces delivered and transportation costs for long-distance biomass shipment by 5% and 15%, respectively. The Pacific Northwest hardwood biofuels application integrates full-scale supply chain infrastructure optimization with agricultural economic modeling; it estimates that bio-jet fuels can be produced at costs between 4 and 5 dollars per gallon and identifies areas suitable for simultaneously deploying a set of biorefineries that use adopted poplar as the dedicated energy crop for biomass feedstock. This application specifically incorporates system uncertainties in the crop market and provides an optimal system design with over 17% improvement in expected total profit compared to the corresponding deterministic model.
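In outline, the stochastic design problem described above fits the generic two-stage stochastic programming form; the sketch below uses generic symbols, not the dissertation's notation:

    \min_{x \in X} \; c^{\top} x + \mathbb{E}_{\xi}\!\left[\, Q(x, \xi) \,\right],
    \qquad
    Q(x, \xi) = \min_{y \ge 0} \left\{\, q(\xi)^{\top} y \;:\; W y = h(\xi) - T x \,\right\}

Here the first-stage variables x would correspond to here-and-now decisions such as biorefinery siting and capacity, the recourse variables y to feedstock and product flows chosen after a crop-market scenario \xi is realized, and the expectation is taken over the scenario distribution.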
Leakage discharge separation in multi-leak pipe networks based on an improved Independent Component Analysis with Reference (ICA-R) algorithm
Existing leakage assessment methods are neither accurate nor timely enough to meet the needs of water companies. In this paper, a methodology based on the Independent Component Analysis with Reference (ICA-R) algorithm is proposed to give a more accurate estimation of leakage discharge in multi-leak water distribution networks (WDNs) without considering the specific individuality of any single leak. The algorithm is improved to prevent erroneous convergence in multi-leak pipe networks. An example EPANET model and a physical experimental platform were built to simulate and evaluate the flow in multi-leak WDNs, and the leakage flow rate was calculated with both the improved ICA-R algorithm and the FastICA algorithm. The simulation results show that the improved ICA-R algorithm has better performance.
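For context, a hedged statement of the standard ICA-R formulation that such methods build on (the paper's specific improvement is not reproduced here): a weight vector w is sought that maximizes the negentropy of y = w^T x while staying close to a reference signal r encoding prior knowledge of the leak signature:

    \max_{w} \; J(y) \approx \rho \,\big[\, \mathbb{E}\{G(y)\} - \mathbb{E}\{G(\nu)\} \,\big]^{2}
    \quad \text{s.t.} \quad \varepsilon(y, r) - \xi_{0} \le 0, \qquad y = w^{\top} x

Here x is the observed (whitened) flow mixture, G a nonquadratic contrast function, \nu a zero-mean unit-variance Gaussian variable, \varepsilon(y, r) a closeness measure between the estimated component and the reference, and \xi_{0} a threshold selecting the desired source; the constrained problem is typically solved with a Newton-style augmented-Lagrangian update.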