23 research outputs found

    Accelerating the pace of protein functional annotation with intel xeon phi coprocessors

    Get PDF
    © 2002-2011 IEEE. Intel Xeon Phi is a new addition to the family of powerful parallel accelerators. The range of its potential applications in computationally driven research is broad; however, at present, the repository of scientific codes is still relatively limited. In this study, we describe the development and benchmarking of a parallel version of {\mmb e}FindSite, a structural bioinformatics algorithm for the prediction of ligand-binding sites in proteins. Implemented for the Intel Xeon Phi platform, the parallelization of the structure alignment portion of {\mmb e}FindSite using pragma-based OpenMP brings about the desired performance improvements, which scale well with the number of computing cores. Compared to a serial version, the parallel code runs 11.8 and 10.1 times faster on the CPU and the coprocessor, respectively; when both resources are utilized simultaneously, the speedup is 17.6. For example, ligand-binding predictions for 501 benchmarking proteins are completed in 2.1 hours on a single Stampede node equipped with the Intel Xeon Phi card compared to 3.1 hours without the accelerator and 36.8 hours required by a serial version. In addition to the satisfactory parallel performance, porting existing scientific codes to the Intel Xeon Phi architecture is relatively straightforward with a short development time due to the support of common parallel programming models by the coprocessor. The parallel version of {\mmb e}FindSite is freely available to the academic community at www.brylinski.org/efindsite

    Xeon Phi를 활용한 병렬 프로그래밍

    Get PDF
    학위논문 (석사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2015. 8. 윤성로.본 논문은 Intel의 병렬연산을 위한 보조 프로세싱 장치(Co-Processor)인 Xeon Phi를 이용한 가속화 알고리즘 설계 및 구현상 주요 이슈와 Xeon Phi로 구현 가능한 병렬화 모델을 제시한다. 효과적인 병렬프로그래밍을 위해서는 사용하고자하는 가속기의 구조적인 특징과 프로그래밍 모델에 대하여 개발자가 구체적인 이해를 할 필요가 있다. 그리고 가속기로 구현이 가능한 병렬화 모델을 파악하여 대상 어플리케이션에 효과적인 적용을 하여야한다. 따라서 본 논문은 먼저 효과적인 프로그래밍을 위해 알아두어야 할 Xeon Phi의 구조적 특징, 프로그래밍 모델을 설명한다. 더불어 프로그래머 관점에서 GPGPU (General-Purpose computing on Graphics Processing Units)와의 차이도 언급하여 이미 GPGPU에 익숙해진 개발자도 Xeon Phi에 쉽게 적응할 수 있도록 하였다. 다음으로 Xeon Phi로 구현할 수 있는 병렬화 모델을 정의하고 구체적인 사례와 함께 제시한다. 기존에 멀티코어 CPU (Central Processing Unit)를 위해 구현이 되었던 Strassen-Winograd 알고리즘과 SHRINK (SHaRed-memory SLINK)알고리즘에 Xeon Phi를 적용하여 구현한 뒤 실험을 통해 가속화 효과를 보였고, 이를 통해 Xeon Phi는 GPGPU와 동일한 형태의 병렬화 모델을 구현할 수 있을 뿐 아니라 GPGPU로는 효과적으로 구현하기 힘든 병렬화 모델도 구현할 수 있음을 확인하였다. 이러한 구체적인 사례와 함께 본 논문에 소개되지 않은 어플리케이션에도 가속화를 위해 적용할 수 있는 두 가지 병렬화 모델을 정의해 개발자가 주어진 어플리케이션의 성능 향상을 위해 효과적으로 Xeon Phi를 적용할 수 있도록 하였다.제 1 장 서 론 1 제 2 장 Xeon Phi 소개 3 제 1 절 개요 및 등장 배경 3 제 2 절 구조적 특징 5 제 3 절 프로그래밍 모델 9 제 4 절 GPGPU 대비 특징 11 제 3 장 Xeon Phi를 통한 가속화 15 제 1 절 Simple 모델 가속화 17 1. Strassen-Winograd 알고리즘 소개 18 2. 알고리즘 가속화 20 3. Xeon Phi의 적용 21 제 2 절 Complex 모델 가속화 22 1. SHRINK (SHaRed-memory SLINK) 알고리즘 소개 23 2. Xeon Phi의 적용 25 3. Hybrid SHRINK 26 제 4 장 실 험 29 제 1 절 Strassen-Winograd 알고리즘 가속화 29 1. 실험 환경 29 2. 실험 결과 30 제 2 절 SHRINK 알고리즘 가속화 33 1. 실험 환경 33 2. 실험 결과 34 제 4 장 결 론 38 참고문헌 40 Abstract 47Maste

    Optimisation of computational fluid dynamics applications on multicore and manycore architectures

    Get PDF
    This thesis presents a number of optimisations used for mapping the underlying computational patterns of finite volume CFD applications onto the architectural features of modern multicore and manycore processors. Their effectiveness and impact is demonstrated in a block-structured and an unstructured code of representative size to industrial applications and across a variety of processor architectures that make up contemporary high-performance computing systems. The importance of vectorization and the ways through which this can be achieved is demonstrated in both structured and unstructured solvers together with the impact that the underlying data layout can have on performance. The utility of auto-tuning for ensuring performance portability across multiple architectures is demonstrated and used for selecting optimal parameters such as prefetch distances for software prefetching or tile sizes for strip mining/loop tiling. On the manycore architectures, running more than one thread per physical core is found to be crucial for good performance on processors with in-order core designs but not required on out-of-order architectures. For architectures with high-bandwidth memory packages, their exploitation, whether explicitly or implicitly, is shown to be imperative for best performance. The implementation of all of these optimisations led to application speed-ups ranging between 2.7X and 3X on the multicore CPUs and 5.7X to 24X on the manycore processors.Open Acces

    EXPERIMENTAL EVALUATION OF PARALLEL PROGRAM SCALABILITY ON XEON PHI SMP

    Get PDF

    Towards the use of mini-applications in performance prediction and optimisation of production codes

    Get PDF
    Maintaining the performance of large scientific codes is a difficult task. To aid in this task a number of mini-applications have been developed that are more tract able to analyse than large-scale production codes, while retaining the performance characteristics of them. These “mini-apps” also enable faster hardware evaluation, and for sensitive commercial codes allow evaluation of code and system changes outside of access approval processes. Techniques for validating the representativeness of a mini-application to a target code are ultimately qualitative, requiring the researcher to decide whether the similarity is strong enough for the mini-application to be trusted to provide accurate predictions of the target performance. Little consideration is given to the sensitivity of those predictions to the few differences between the mini-application and its target, how those potentially-minor static differences may lead to each code responding very differently to a change in the computing environment. An existing mini-application, ‘Mini-HYDRA’, of a production CFD simulation code is reviewed. Arithmetic differences lead to divergence in intra-node performance scaling, so the developers had removed some arithmetic from Mini-HYDRA, but this breaks the simulation so limits numerical research. This work restores the arithmetic, repeating validation for similar performance scaling, achieving similar intra-node scaling performance whilst neither are memory-bound. MPI strong scaling functionality is also added, achieving very similar multi-node scaling performance. The arithmetic restoration inevitably leads to different memory-bounds, and also different and varied responses to changes in processor architecture or instruction set. A performance model is developed that predicts this difference in response, in terms of the arithmetic differences. It is supplemented by a new benchmark that measures the memory-bound of CFD loops. Together, they predict the strong scaling performance of a production ‘target’ code, with a mean error of 8.8% (s = 5.2%). Finally, the model is used to investigate limited speedup from vectorisation despite not being memory-bound. It identifies that instruction throughput is significantly reduced relative to serial counterparts, independent of data ordering in memory, indicating a bottleneck within the processor core

    Optimización del rendimiento y la eficiencia energética en sistemas masivamente paralelos

    Get PDF
    RESUMEN Los sistemas heterogéneos son cada vez más relevantes, debido a sus capacidades de rendimiento y eficiencia energética, estando presentes en todo tipo de plataformas de cómputo, desde dispositivos embebidos y servidores, hasta nodos HPC de grandes centros de datos. Su complejidad hace que sean habitualmente usados bajo el paradigma de tareas y el modelo de programación host-device. Esto penaliza fuertemente el aprovechamiento de los aceleradores y el consumo energético del sistema, además de dificultar la adaptación de las aplicaciones. La co-ejecución permite que todos los dispositivos cooperen para computar el mismo problema, consumiendo menos tiempo y energía. No obstante, los programadores deben encargarse de toda la gestión de los dispositivos, la distribución de la carga y la portabilidad del código entre sistemas, complicando notablemente su programación. Esta tesis ofrece contribuciones para mejorar el rendimiento y la eficiencia energética en estos sistemas masivamente paralelos. Se realizan propuestas que abordan objetivos generalmente contrapuestos: se mejora la usabilidad y la programabilidad, a la vez que se garantiza una mayor abstracción y extensibilidad del sistema, y al mismo tiempo se aumenta el rendimiento, la escalabilidad y la eficiencia energética. Para ello, se proponen dos motores de ejecución con enfoques completamente distintos. EngineCL, centrado en OpenCL y con una API de alto nivel, favorece la máxima compatibilidad entre todo tipo de dispositivos y proporciona un sistema modular extensible. Su versatilidad permite adaptarlo a entornos para los que no fue concebido, como aplicaciones con ejecuciones restringidas por tiempo o simuladores HPC de dinámica molecular, como el utilizado en un centro de investigación internacional. Considerando las tendencias industriales y enfatizando la aplicabilidad profesional, CoexecutorRuntime proporciona un sistema flexible centrado en C++/SYCL que dota de soporte a la co-ejecución a la tecnología oneAPI. Este runtime acerca a los programadores al dominio del problema, posibilitando la explotación de estrategias dinámicas adaptativas que mejoran la eficiencia en todo tipo de aplicaciones.ABSTRACT Heterogeneous systems are becoming increasingly relevant, due to their performance and energy efficiency capabilities, being present in all types of computing platforms, from embedded devices and servers to HPC nodes in large data centers. Their complexity implies that they are usually used under the task paradigm and the host-device programming model. This strongly penalizes accelerator utilization and system energy consumption, as well as making it difficult to adapt applications. Co-execution allows all devices to simultaneously compute the same problem, cooperating to consume less time and energy. However, programmers must handle all device management, workload distribution and code portability between systems, significantly complicating their programming. This thesis offers contributions to improve performance and energy efficiency in these massively parallel systems. The proposals address the following generally conflicting objectives: usability and programmability are improved, while ensuring enhanced system abstraction and extensibility, and at the same time performance, scalability and energy efficiency are increased. To achieve this, two runtime systems with completely different approaches are proposed. EngineCL, focused on OpenCL and with a high-level API, provides an extensible modular system and favors maximum compatibility between all types of devices. Its versatility allows it to be adapted to environments for which it was not originally designed, including applications with time-constrained executions or molecular dynamics HPC simulators, such as the one used in an international research center. Considering industrial trends and emphasizing professional applicability, CoexecutorRuntime provides a flexible C++/SYCL-based system that provides co-execution support for oneAPI technology. This runtime brings programmers closer to the problem domain, enabling the exploitation of dynamic adaptive strategies that improve efficiency in all types of applications.Funding: This PhD has been supported by the Spanish Ministry of Education (FPU16/03299 grant), the Spanish Science and Technology Commission under contracts TIN2016-76635-C2-2-R and PID2019-105660RB-C22. This work has also been partially supported by the Mont-Blanc 3: European Scalable and Power Efficient HPC Platform based on Low-Power Embedded Technology project (G.A. No. 671697) from the European Union’s Horizon 2020 Research and Innovation Programme (H2020 Programme). Some activities have also been funded by the Spanish Science and Technology Commission under contract TIN2016-81840-REDT (CAPAP-H6 network). The Integration II: Hybrid programming models of Chapter 4 has been partially performed under the Project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the EC Research Innovation Action under the H2020 Programme. In particular, the author gratefully acknowledges the support of the SPMT Department of the High Performance Computing Center Stuttgart (HLRS)
    corecore