19 research outputs found
Towards an algorithmic skeleton framework for programming the Intel® Xeon Phi™ processor
The Intel® Xeon Phi™ is the first processor based on Intel's MIC (Many Integrated Cores) architecture. It is a co-processor specially tailored for data-parallel computations, whose basic architectural design is similar to that of GPUs (Graphics Processing Units), leveraging many simple integrated cores to perform parallel
computations. The main novelty of the MIC architecture, relative to GPUs, is its
compatibility with the Intel x86 architecture. This enables the use of many of the tools commonly available for parallel programming of x86-based architectures, which may lead to a gentler learning curve. However, programming the Xeon Phi still entails aspects intrinsic to accelerator-based computing in general, and to the MIC architecture in particular.
In this thesis we advocate the use of algorithmic skeletons for programming the Xeon Phi. Algorithmic skeletons abstract the complexity inherent to parallel programming,
hiding details such as resource management, parallel decomposition, and inter-execution-flow
communication, thus removing these concerns from the programmer's mind. In
this context, the goal of the thesis is to lay the foundations for the development of a
simple but powerful and efficient skeleton framework for the programming of the Xeon
Phi processor. For this purpose we build upon Marrow, an existing framework for the
orchestration of OpenCL™ computations in multi-GPU and CPU environments. We extend
Marrow to execute both OpenCL and C++ parallel computations on the Xeon Phi.
To evaluate the newly developed framework, several well-known benchmarks, such as
Saxpy and N-Body, are used to compare not only its performance against the existing
framework when executing on the co-processor, but also its performance on the Xeon Phi versus a multi-GPU environment.
The authors thank projects PTDC/EIA-EIA/113613/2009 (Synergy-VM) and PTDC/EEI-CTP/1837/2012 (SwiftComp) for financing the purchase of the Intel® Xeon Phi™.
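The central idea of the thesis, that a skeleton hides decomposition and resource management behind a single reusable pattern, can be sketched independently of Marrow. The `MapSkeleton` class below is a hypothetical, minimal illustration (not Marrow's actual API): the caller supplies only the elemental function, shown here on the Saxpy benchmark mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor

class MapSkeleton:
    """Minimal 'map' algorithmic skeleton: the caller supplies only the
    elemental function; partitioning, scheduling, and worker management
    stay hidden. (Hypothetical illustration -- not Marrow's actual API.)"""
    def __init__(self, workers=4):
        self.workers = workers

    def __call__(self, fn, data):
        # The skeleton owns all resource management; user code never
        # touches threads or synchronization directly.
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(fn, data))

# Saxpy (y = a*x + y) expressed through the skeleton.
a = 2.0
x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
result = MapSkeleton()(lambda xy: a * xy[0] + xy[1], zip(x, y))
```

The same pattern generalizes to other skeletons (pipeline, stencil, map-reduce) by swapping the orchestration logic while keeping the user-facing interface a single call.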
Using the Xeon Phi platform to run speculatively-parallelized codes
Intel Xeon Phi accelerators are among the newest devices used in the field of parallel computing. However, there are comparatively few studies concerning their performance with most of the existing parallelization techniques. One of them is thread-level speculation, a technique that optimistically tries to extract parallelism from loops without the need for a compile-time analysis guaranteeing that the loop can be executed in parallel. In this article we evaluate the performance delivered by an Intel Xeon Phi coprocessor when using a state-of-the-art software thread-level speculative parallelization library in the execution of well-known benchmarks. We describe both the internal characteristics of the Xeon Phi platform and the particularities of the thread-level speculation library being used. Our results show that, although the Xeon Phi delivers a relatively good speedup in comparison with a shared-memory architecture in terms of scalability, the relatively low computing power of its computational units when specific vectorization and SIMD instructions are not fully exploited makes this first generation of Xeon Phi architectures not competitive (in terms of absolute performance) with conventional multicore systems for the execution of speculatively parallelized code.
2018-04-01
Castilla-Leon Regional Government (VA172A12-2); MICINN (Spain) and the European Union FEDER (MOGECOPP project TIN2011-25639, HomProg-HetSys project TIN2014-58876-P, CAPAP-H5 network TIN2014-53522-REDT)
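The mechanism this article evaluates can be mimicked in a few lines: iterations execute optimistically against a stale snapshot of shared state, and an in-order commit phase squashes and re-executes any iteration whose reads were invalidated by an earlier iteration's writes. This is a toy sequential model for illustration only, not the library used in the article.

```python
def run_speculative(n, body):
    """Toy thread-level speculation. body(i, view) -> (read_keys, write_dict).
    All iterations run against an initial snapshot (conceptually in
    parallel); the commit loop detects stale reads and rolls back."""
    state = {}
    snapshot = dict(state)
    logs = []
    for i in range(n):                      # optimistic pass
        reads, writes = body(i, dict(snapshot))
        logs.append((i, reads, writes))
    written = {}                            # key -> last committing iteration
    squashed = []
    for i, reads, writes in logs:           # in-order commit with conflict check
        if any(r in written and written[r] < i for r in reads):
            squashed.append(i)              # read a stale value -> rollback
            _, writes = body(i, dict(state))  # re-execute against committed state
        state.update(writes)
        for k in writes:
            written[k] = i
    return state, squashed

def body(i, view):
    # Iteration 3 depends on iteration 1's write; all others are independent.
    if i == 3:
        return {"x1"}, {"x3": view.get("x1", 0) + 1}
    return set(), {"x%d" % i: i * i}

state, squashed = run_speculative(5, body)  # only iteration 3 is squashed
```

The costs this model hides (per-thread version tracking, squash-and-restart overhead) are exactly what make absolute performance on the Xeon Phi's simple cores hard to win back, as the article concludes.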
Offloading strategies for Stencil kernels on the KNC Xeon Phi architecture: Accuracy versus performance
The ever-increasing computational requirements of HPC and service provider applications are becoming a great challenge
for hardware and software designers. These requirements are reaching levels where isolated development in either
computational field is not enough to meet the challenge. A holistic view of computational thinking is therefore the
only way to succeed in real scenarios. However, this is not a trivial task, as it requires, among other things, hardware-software
co-design. On the hardware side, most high-throughput computers are designed aiming for heterogeneity, where accelerators (e.g. Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs), etc.) are connected through
a high-bandwidth bus, such as PCI-Express, to the host CPUs. Applications, either via programmers, compilers, or runtime,
should orchestrate data movement, synchronization, and so on among devices with different compute and memory
capabilities. This increases programming complexity and may reduce overall application performance. This
article evaluates different offloading strategies to leverage heterogeneous systems based on several cards with first-generation Xeon Phi coprocessors (Knights Corner). We use an 11-point 3-D stencil kernel that models heat dissipation as
a case study. Our results reveal substantial performance improvements when using several accelerator cards. Additionally,
we show that computing an approximate result by reducing the communication overhead can yield 23% performance
gains for double-precision data sets.
The author(s) disclosed receipt of the following financial
support for the research, authorship, and/or publication of this article: this work is jointly supported by the Fundacion Seneca (Agencia Regional de Ciencia y Tecnologia, Region de Murcia) under grants 15290/PI/2010 and 18946/JLI/13 and by the Spanish MINECO, as well as European Commission FEDER funds, under grants TIN2015-66972-C5-3-R and TIN2016-78799-P (AEI/FEDER, UE). MH was supported by a research grant from PRODEP under the Professional Development Program for Teachers (UAGro-197), Mexico.
Hernández, M.; Cebrián, J. M.; Cecilia-Canales, J. M.; García, J. M. (2020). Offloading strategies for Stencil kernels on the KNC Xeon Phi architecture: Accuracy versus performance. International Journal of High Performance Computing Applications, 34(2), 199-297. https://doi.org/10.1177/1094342017738352
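The heat-dissipation kernel studied above can be made concrete with a serial sketch. The article's stencil is 11-point; for brevity this uses the classic 7-point variant (centre plus six face neighbours), so the shape and coefficients here are illustrative, not the paper's.

```python
import numpy as np

def heat_step(u, alpha=0.1):
    """One Jacobi update of a 3-D heat-dissipation stencil (7-point
    variant: centre plus six face neighbours; coefficients illustrative)."""
    v = u.copy()
    c = u[1:-1, 1:-1, 1:-1]
    v[1:-1, 1:-1, 1:-1] = c + alpha * (
        u[2:, 1:-1, 1:-1] + u[:-2, 1:-1, 1:-1] +
        u[1:-1, 2:, 1:-1] + u[1:-1, :-2, 1:-1] +
        u[1:-1, 1:-1, 2:] + u[1:-1, 1:-1, :-2] - 6.0 * c)
    return v

u = np.zeros((16, 16, 16))
u[8, 8, 8] = 100.0              # initial hot spot
for _ in range(10):
    u = heat_step(u)            # heat spreads; boundary cells are held fixed
```

For offloading, a domain like `u` is typically partitioned into slabs along one axis, with one slab per accelerator card and a one-cell halo exchanged between neighbours each step; the accuracy-versus-performance trade-off in the article comes from relaxing how often that exchange happens.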
Optimisation of a Molecular Dynamics Simulation of Chromosome Condensation
We present optimisations applied to a bespoke bio-physical molecular dynamics simulation designed to investigate chromosome condensation. Our primary focus is on domain-specific algorithmic improvements to determining short-range interaction forces between particles, as certain qualities of the simulation render traditional methods less effective. We implement tuned versions of the code for both traditional CPU architectures and the modern many-core architecture found in the Intel Xeon Phi coprocessor and compare their effectiveness. We achieve speed-ups starting at a factor of 10 over the original code, facilitating more detailed and larger-scale experiments.
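The standard baseline for short-range interaction determination is a cell list, which the abstract's domain-specific improvements build beyond. The sketch below is a hypothetical illustration of that baseline only: particles are binned into cells at least as wide as the cutoff, so distance tests are restricted to the 27 surrounding cells, giving roughly O(N) pair finding instead of O(N²).

```python
import numpy as np

def short_range_pairs(pos, box, rc):
    """Cell-list neighbour search (baseline sketch). The minimum-image
    correction for pairs straddling the periodic boundary is omitted,
    so the demo keeps all particles well inside the box."""
    ncell = max(1, int(box // rc))
    size = box / ncell
    cells = {}
    for idx, p in enumerate(pos):           # bin particles into cells
        key = tuple((p // size).astype(int) % ncell)
        cells.setdefault(key, []).append(idx)
    pairs = set()
    for (cx, cy, cz), members in cells.items():
        for dx in (-1, 0, 1):               # scan the 27 surrounding cells
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    nb = ((cx + dx) % ncell, (cy + dy) % ncell, (cz + dz) % ncell)
                    for i in members:
                        for j in cells.get(nb, []):
                            if i < j and np.linalg.norm(pos[i] - pos[j]) < rc:
                                pairs.add((i, j))
    return pairs

# Three particles well inside the box: only the first two are within cutoff.
pos = np.array([[1.0, 1.0, 1.0], [1.2, 1.0, 1.0], [5.0, 5.0, 5.0]])
pairs = short_range_pairs(pos, box=10.0, rc=1.0)
```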
Efficient computation of the matrix square root in heterogeneous platforms
Master's dissertation in Informatics Engineering
Matrix algorithms often deal with large amounts of data at a time, which impairs efficient
cache memory usage. Recent collaborative work between the Numerical Algorithms
Group and the University of Minho led to a blocked approach to the matrix square root algorithm
with significant efficiency improvements, particularly in a multicore shared memory
environment.
Distributed memory architectures were left unexplored. In these systems data is distributed
across multiple memory spaces, including those associated with specialized accelerator
devices, such as GPUs. Systems with these devices are known as heterogeneous
platforms.
This dissertation focuses on studying the blocked matrix square root algorithm, first
in a multicore environment, and then in heterogeneous platforms. Two types of hardware
accelerators are explored: Intel Xeon Phi coprocessors and NVIDIA CUDA-enabled GPUs.
The initial implementation confirmed the advantages of the blocked method and showed
excellent scalability in a multicore environment. The same implementation was also used on
the Intel Xeon Phi, but the obtained performance results lagged behind the expected behaviour
and the CPU-only alternative. Several optimization techniques were applied to the
common implementation, which managed to reduce the gap between the two environments.
The implementation for CUDA-enabled devices followed a different programming model
and was not able to benefit from any of the previous solutions. It also required the implementation
of BLAS and LAPACK routines, since no existing package fits the requirements of
this application. The measured performance also showed that the CPU-only implementation
is still the fastest.
Fundação para a Ciência e a Tecnologia (FCT) - UT Austin | Portugal Program
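The dissertation's blocked (Schur-based) algorithm is not reproduced here; as a compact way to make the goal X·X = A concrete, the sketch below uses the simpler Denman-Beavers iteration, a well-known alternative method for the principal matrix square root.

```python
import numpy as np

def sqrtm_db(A, iters=50):
    """Denman-Beavers iteration for the principal matrix square root.
    (Not the dissertation's blocked Schur method; shown only so the
    target relation X @ X = A is concrete.)"""
    X = A.astype(float)
    M = np.eye(len(A))
    for _ in range(iters):
        # Coupled iteration: X converges to A^(1/2), M to A^(-1/2).
        Xn = 0.5 * (X + np.linalg.inv(M))
        M = 0.5 * (M + np.linalg.inv(X))
        X = Xn
    return X

A = np.array([[4.0, 1.0], [0.0, 9.0]])
X = sqrtm_db(A)                  # X @ X should recover A
```

The blocked Schur approach studied in the dissertation instead reduces A to (quasi-)triangular form and solves for the square root block by block, which is what makes cache-friendly and distributed variants possible.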
Modeling Energy Consumption of High-Performance Applications on Heterogeneous Computing Platforms
Achieving Exascale computing is one of the current leading challenges in High Performance Computing (HPC). Reaching this next level of performance will allow more complex simulations to be run on larger datasets and will offer researchers better tools for data processing and analysis. At the dawn of Big Data, the need for supercomputers will only increase. However, these systems are costly to maintain because power is expensive. Thus, a better understanding of power and energy consumption is required so that future hardware can benefit.
Available power models accurately capture the relationship to the number of cores and clock rate; however, the relationship between workload and power is less well understood. Thus, investigation and analysis of power measurements has been a focal point of this work, with the aim of improving the general understanding of energy consumption in the context of HPC.
This dissertation investigates power and energy consumption of many different parallel applications on several hardware platforms while varying a number of execution characteristics. Multicore and manycore hardware devices are investigated in homogeneous and heterogeneous computing environments. Further, common techniques for reducing power and energy consumption are applied to each of these devices.
Well-known power and performance models have been combined to form the Execution-Phase model, which may be used to quantify energy contributions based on execution phase and has been used to predict energy consumption to within 10%. However, due to limitations in the measurement procedure, a less intrusive approach is required.
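The core idea behind the Execution-Phase model, that total energy decomposes into per-phase power-times-time contributions, fits in two lines. The phase names and numbers below are hypothetical stand-ins, not measurements from the dissertation.

```python
# Hypothetical phase table for one run: (phase, mean power in W, duration in s).
phases = [
    ("compute", 180.0, 12.0),
    ("memory-bound", 140.0, 5.0),
    ("idle", 60.0, 3.0),
]

# Execution-Phase idea in miniature: E_total = sum over phases of P_phase * t_phase.
energy = sum(power * seconds for _, power, seconds in phases)   # joules
```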
The Empirical Mode Decomposition (EMD) and Hilbert-Huang Transform analysis technique has been applied in innovative ways to model, analyze, and visualize power and energy measurements. EMD is widely used in other research areas, including earthquake, brain-wave, speech recognition, and sea-level rise analysis, and this is the first time it has been applied to power traces to analyze the complex interactions occurring within HPC systems.
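A bare-bones single-IMF sift conveys the EMD mechanism: linearly interpolated envelopes stand in for the cubic splines of real EMD, and a fixed pass count replaces the usual convergence criterion. The signal below is synthetic, not an actual power trace from the dissertation.

```python
import numpy as np

def sift(x, passes=8):
    """Extract a candidate first IMF by repeated sifting. Real EMD uses
    cubic-spline envelopes and a stopping test; this sketch uses linear
    interpolation and a fixed pass count to keep the mechanism visible."""
    h = x.copy()
    t = np.arange(len(x))
    for _ in range(passes):
        # Local extrema of the current component.
        mx = [i for i in range(1, len(h) - 1) if h[i - 1] < h[i] > h[i + 1]]
        mn = [i for i in range(1, len(h) - 1) if h[i - 1] > h[i] < h[i + 1]]
        if len(mx) < 2 or len(mn) < 2:
            break
        upper = np.interp(t, mx, h[mx])      # upper envelope
        lower = np.interp(t, mn, h[mn])      # lower envelope
        h = h - (upper + lower) / 2.0        # subtract the mean envelope
    return h

# Synthetic stand-in for a power trace: fast oscillation riding on a slow one.
t = np.linspace(0.0, 1.0, 400)
trace = np.sin(2 * np.pi * 40 * t) + 0.5 * np.sin(2 * np.pi * 3 * t)
imf = sift(trace)    # should roughly isolate the fast component
```

Subtracting each extracted IMF and repeating yields progressively slower components, which is how EMD separates the scales of activity overlapping in a measured power trace.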
Probability distributions may be used to represent power and energy traces, thereby providing an alternative means of predicting energy consumption while retaining the fact that power is not constant over time. Further, these distributions may be used to define the cost of a workload for a given computing platform.