
    A performance focused, development friendly and model aided parallelization strategy for scientific applications

    The evolution of high-performance computing platforms has provided unprecedented computing power through multi-core CPUs, massively parallel architectures such as general-purpose graphics processing units (GPGPUs), and Many Integrated Core (MIC) architectures such as Intel's Xeon Phi coprocessor. Leveraging the capabilities of such advanced supercomputing hardware remains a great challenge, however, because it requires efficient and effective parallelization of scientific applications. This task is difficult mainly due to the complexity of scientific algorithms coupled with the variety of available hardware and disparate programming models. To address these challenges, this thesis presents a parallelization strategy for accelerating scientific applications that maximizes the opportunity for speedup while minimizing development effort. Parallelization is treated as a three-step process: (1) choose a compatible combination of architecture and parallel programming language, (2) translate the base code/algorithm to the parallel language, and (3) optimize and tune the application. In this research, a quantitative comparison of run times for various implementations of the k-means algorithm is used to establish that native languages (OpenMP, MPI, CUDA) perform better on their respective architectures than vendor-neutral languages such as OpenCL. A qualitative model is then used to select an optimal architecture for a given application by aligning the capabilities of accelerators with the characteristics of the application; once the optimal architecture is chosen, the corresponding native language is employed. This approach delivers the best performance with reasonable accuracy (78%) in predicting a fitting combination, while eliminating the need to explore each architecture individually. It reduces the required development effort considerably, as the application need not be rewritten in multiple languages, and the focus can rest solely on optimization and tuning to achieve the best performance on the available architecture with minimal investment in cost and effort. To verify the prediction accuracy of the qualitative model, the OpenDwarfs benchmark suite, which implements the Berkeley dwarfs in OpenCL, is used. A dwarf is an algorithmic method that captures a pattern of computation and communication. This research focuses on nine applications from various algorithmic domains that cover the seven dwarfs of scientific computation identified by Phillip Colella as omnipresent in scientific and engineering applications. To validate the parallelization strategy as a whole, a case study is undertaken: the lower-upper (LU) decomposition of the Gaussian elimination algorithm, from the linear algebra domain, is parallelized using conventional trial-and-error methods as well as the proposed 'Architecture First, Language Later' strategy, and the development efforts incurred by the two methods are contrasted. The proposed strategy is observed to reduce development effort by an average of 50%.
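
    As a rough, hedged illustration of the case-study kernel (a sketch in Python/NumPy, not the thesis's implementation), the snippet below factors a matrix by Gaussian elimination without pivoting; the function name and the no-pivoting simplification are assumptions made for brevity. The trailing-row updates at each elimination step are mutually independent, which is exactly the loop a native-language port (OpenMP, MPI or CUDA, depending on the architecture selected by the qualitative model) would parallelize.

        import numpy as np

        def lu_decompose(a):
            # LU decomposition via Gaussian elimination (Doolittle, no pivoting).
            # The strict lower triangle of the result holds L (unit diagonal implied)
            # and the upper triangle holds U. The inner loop over i is the hotspot a
            # parallel port would target: at step k, rows k+1..n-1 update independently.
            a = np.array(a, dtype=float)            # work on a float copy
            n = a.shape[0]
            for k in range(n):                      # elimination step k
                for i in range(k + 1, n):           # rows below the pivot (parallelizable)
                    a[i, k] /= a[k, k]              # multiplier, stored where L lives
                    a[i, k + 1:] -= a[i, k] * a[k, k + 1:]   # update the trailing row
            return a

        # Usage: factor a small diagonally dominant matrix (so pivoting is unnecessary)
        A = np.random.rand(4, 4) + 4.0 * np.eye(4)
        LU = lu_decompose(A)
        L = np.tril(LU, -1) + np.eye(4)
        U = np.triu(LU)
        assert np.allclose(L @ U, A)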

    Performance Evaluation of Pseudospectral Ultrasound Simulations on a Cluster of Xeon Phi Accelerators

    The rapid development of novel procedures in medical ultrasonics, including treatment planning in therapeutic ultrasound and image reconstruction in photoacoustic tomography, leads to increasing demand for large-scale ultrasound simulations. However, routine execution of such simulations using traditional methods, e.g., finite-difference time domain, is expensive and often considered intractable due to the computational and memory requirements. The k-space corrected pseudospectral time domain method used by the k-Wave toolbox allows for significant reductions in spatial and temporal grid resolution. These savings are achieved at the cost of all-to-all communication, which is inherent to the multi-dimensional fast Fourier transforms. To improve data locality, reduce communication and allow efficient use of accelerators, we recently implemented a domain decomposition technique based on a local Fourier basis. In this paper, we investigate whether it is feasible to run the distributed k-Wave implementation on the Salomon cluster equipped with 864 Intel Xeon Phi (Knights Corner) accelerators. The results show the immaturity of the KNC platform, with issues ranging from limited support for InfiniBand and LustreFS in Intel MPI on this platform to the poor performance of 3D FFTs achieved by Intel MKL on the KNC architecture. Yet, we show that it is possible to achieve strong and weak scaling comparable to CPU-only platforms, albeit with runtimes 1.8× to 4.3× longer. However, the accounting policy for Salomon's accelerators is far more favorable, and thus their use reduces the computational cost significantly.
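
    For intuition on where the all-to-all communication comes from (a minimal sketch assuming a periodic cubic grid, not the k-Wave code), a pseudospectral operator such as the Laplacian below is evaluated with a forward 3-D FFT, a pointwise multiplication by wavenumbers, and an inverse FFT; on a distributed grid, the transposes inside those multi-dimensional FFTs become all-to-all exchanges, which the local Fourier basis decomposition is designed to keep local. The function name and grid sizes are illustrative assumptions.

        import numpy as np

        def spectral_laplacian(f, dx):
            # Laplacian of a periodic 3-D field computed spectrally:
            # forward FFT -> multiply by -(kx^2 + ky^2 + kz^2) -> inverse FFT.
            # In a distributed-memory setting, fftn/ifftn hide the transposes
            # (all-to-all exchanges) that dominate the communication cost.
            n = f.shape[0]                                  # assume a cubic n^3 grid
            k = 2.0 * np.pi * np.fft.fftfreq(n, d=dx)       # angular wavenumbers
            kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
            k2 = kx**2 + ky**2 + kz**2
            return np.real(np.fft.ifftn(-k2 * np.fft.fftn(f)))

        # Usage: the Laplacian of sin(x) on a periodic grid should be -sin(x)
        n, L = 64, 2.0 * np.pi
        x = np.linspace(0.0, L, n, endpoint=False)
        f = np.sin(x)[:, None, None] * np.ones((1, n, n))
        assert np.allclose(spectral_laplacian(f, dx=L / n), -f, atol=1e-10)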