18 research outputs found

    Comparative evaluation of bandwidth-bound applications on the Intel Xeon CPU MAX Series

    Full text link
    In this paper we explore the performance of the Intel Xeon CPU MAX Series, representing the most significant new variation upon the classical CPU architecture since the Intel Xeon Phi Processor. Given the availability of a large on-package high-bandwidth memory, the bandwidth-to-compute ratio has shifted significantly compared to other CPUs on the market. Since a large fraction of HPC workloads are sensitive to the available bandwidth, we explore how this architecture performs on a selection of HPC proxies and applications that are mostly bandwidth-sensitive, and how it compares to the previous 3rd generation Intel Xeon Scalable processors (codenamed Ice Lake) and an AMD EPYC 7003 Series Processor with 3D V-Cache Technology (codenamed Milan-X). We explore performance with different parallel implementations (MPI, MPI+OpenMP, MPI+SYCL), compiled with different compilers and flags, and executed with or without hyperthreading. We show how performance bottlenecks shift from bandwidth to communication latencies for some applications, and demonstrate speedups of 2.0x-4.3x over the previous generation.
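    As context for what "bandwidth-bound" means here, below is a minimal sketch of a STREAM-style triad kernel in a hybrid MPI+OpenMP setting; the array size, scalar and decomposition are illustrative assumptions, not taken from the paper.

```cpp
// Illustrative STREAM-style triad: performance is limited by memory bandwidth,
// which is why HBM-equipped CPUs such as the Xeon CPU MAX can shift the bottleneck.
// The per-rank measurement below is a generic hybrid pattern, not the paper's code.
#include <mpi.h>
#include <cstddef>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const std::size_t n = 1 << 26;           // per-rank array size (illustrative)
    const double scalar = 3.0;
    std::vector<double> a(n), b(n, 1.0), c(n, 2.0);

    double t0 = MPI_Wtime();
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        a[i] = b[i] + scalar * c[i];          // 2 loads + 1 store per iteration
    double t1 = MPI_Wtime();

    // Three double arrays are streamed, giving a rough per-rank bandwidth estimate.
    double gbytes = 3.0 * n * sizeof(double) / 1e9;
    if (rank == 0)
        std::printf("triad: %.2f GB/s per rank\n", gbytes / (t1 - t0));

    MPI_Finalize();
    return 0;
}
```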

    Multiple target task sharing support for the OpenMP accelerator model

    Get PDF
    The use of GPU accelerators is becoming common in HPC platforms due to their effective performance and energy efficiency. In addition, new generations of multicore processors are being designed with wider vector units and/or larger hardware thread counts, also contributing to the peak performance of the whole system. Although current directive-based paradigms, such as OpenMP or OpenACC, support both accelerators and multicore-based hosts, they do not provide an effective and efficient way to use them concurrently, usually resulting in accelerated programs in which the potential computational performance of the host is not exploited. In this paper we propose an extension to the OpenMP 4.5 directive-based programming model to support the specification and execution of multiple instances of task regions on different devices (i.e. accelerators in conjunction with the vector and heavily multithreaded capabilities of multicore processors). The compiler is responsible for generating device-specific code for each device kind, delegating to the runtime system the dynamic scheduling of the tasks onto the available devices. The newly proposed clause conveys useful insight to guide the scheduler while keeping a clean, abstract and machine-independent programmer interface. The potential of the proposal is analyzed in a prototype implementation in the OmpSs compiler and runtime infrastructure. Performance evaluation is done using three kernels (N-Body, tiled matrix multiply and Stream) on different GPU-capable systems based on ARM, Intel x86 and IBM Power8. From the evaluation we observe speed-ups in the 8-20% range compared to versions in which only the GPU is used, reaching 96% of the additional peak performance thanks to the reduction of data transfers and the benefits introduced by the OmpSs NUMA-aware scheduler. This work is partially supported by the IBM/BSC Deep Learning Center Initiative, by the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), by the Spanish Ministry of Science and Technology through the TIN2015-65316-P project and by the Generalitat de Catalunya (contract 2014-SGR-1051).
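    To make the addressed problem concrete, the following hedged sketch shows how host and GPU can be used concurrently by hand in standard OpenMP 4.5, with a deferred target task plus a host taskloop. The split fraction, kernel and names are illustrative assumptions; the paper's proposed clause (whose syntax is not reproduced here) would let the runtime make this split automatically.

```cpp
// Manual host+GPU co-execution with plain OpenMP 4.5 constructs. The static
// split fraction below is exactly the kind of decision the paper's proposed
// extension delegates to the runtime scheduler instead.
#include <cstddef>
#include <vector>

void saxpy_split(float alpha, const std::vector<float>& x, std::vector<float>& y) {
    const std::size_t n = y.size();
    const std::size_t cut = static_cast<std::size_t>(0.7 * n);  // illustrative: 70% to the GPU
    const float* xp = x.data();
    float* yp = y.data();

    #pragma omp parallel
    #pragma omp single
    {
        // Deferred target task: offload [0, cut) to the default device.
        #pragma omp target teams distribute parallel for nowait \
                map(to: xp[0:cut]) map(tofrom: yp[0:cut])
        for (std::size_t i = 0; i < cut; ++i)
            yp[i] += alpha * xp[i];

        // Host portion: taskloop spreads [cut, n) over the remaining CPU threads
        // while the target task executes on the accelerator.
        #pragma omp taskloop simd grainsize(65536)
        for (std::size_t i = cut; i < n; ++i)
            yp[i] += alpha * xp[i];

        #pragma omp taskwait   // also waits for the deferred target task
    }
}
```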

    The readying of applications for heterogeneous computing

    Get PDF
    High performance computing is approaching a potentially significant change in architectural design. With pressure on cost and the sheer amount of power consumed, additional architectural features are emerging which require a re-think of the programming models deployed over the last two decades. Today's emerging high performance computing (HPC) systems maximise performance per unit of power consumed, with the result that the constituent parts of the system are made up of a range of different specialised building blocks, each with their own purpose. This heterogeneity is not limited to the hardware components but extends to the mechanisms that exploit them. These multiple levels of parallelism, instruction sets and memory hierarchies result in truly heterogeneous computing in all aspects of the global system. These emerging architectural solutions will require the software to exploit tremendous amounts of on-node parallelism, and programming models to address this are indeed emerging. In theory, the application developer can design new software using these models to exploit emerging low-power architectures. However, in practice, real industrial-scale applications last the lifetimes of many architectural generations and therefore require a migration path to these next-generation supercomputing platforms. Identifying that migration path is non-trivial: with applications spanning many decades, consisting of many millions of lines of code and multiple scientific algorithms, any changes to the programming model will be extensive and invasive, and may turn out to be the incorrect model for the application in question. This makes exploration of these emerging architectures and programming models using the applications themselves problematic. Additionally, the source code of many industrial applications is not available, due to either commercial or security sensitivity constraints. This thesis highlights this problem by assessing current and emerging hardware with an industrial-strength code and demonstrating the issues described. In turn it examines the methodology of using proxy applications in place of real industry applications to assess their suitability on the next generation of low-power HPC offerings. It shows there are significant benefits to be realised in using proxy applications, in that fundamental issues inhibiting exploration of a particular architecture are easier to identify and hence address. Evaluations of maturity and performance portability are presented for a number of alternative programming methodologies, on a number of architectures, highlighting the broader adoption of these proxy applications both within the author's own organisation and across the industry as a whole.

    Large-scale performance of a DSL-based multi-block structured-mesh application for Direct Numerical Simulation

    Get PDF
    SBLI (Shock-wave/Boundary-layer Interaction) is a large-scale Computational Fluid Dynamics (CFD) application, developed over 20 years at the University of Southampton and extensively used within the UK Turbulence Consortium. It is capable of performing Direct Numerical Simulations (DNS) or Large Eddy Simulation (LES) of shock-wave/boundary-layer interaction problems over highly detailed multi-block structured mesh geometries. SBLI presents major challenges in data organization and movement that need to be overcome for continued high performance on emerging massively parallel hardware platforms. In this paper we present research in achieving this goal through the OPS embedded domain-specific language. OPS targets the domain of multi-block structured mesh applications. It provides an API embedded in C/C++ and Fortran and makes use of automatic code generation and compilation to produce executables capable of running on a range of parallel hardware systems. The core functionality of SBLI is captured using a new framework called OpenSBLI, which enables a developer to declare the partial differential equations using Einstein notation and then automatically carry out discretization and generation of OPS (C/C++) API code. OPS is then used to automatically generate a wide range of parallel implementations. Using this multi-layered abstraction approach we demonstrate how new opportunities for further optimizations, such as fine-tuning the computational intensity and reducing data movement, can be gained and applied automatically. Performance results demonstrate there is no performance loss due to the high-level development strategy with OPS and OpenSBLI, with performance matching or exceeding the hand-tuned original code on all CPU nodes tested. The data movement optimizations provide over 3× speedups on CPU nodes, while GPUs provide 5× speedups over the best performing CPU node. The OPS-generated parallel code also demonstrates excellent scalability on nearly 100K cores of a Cray XC30 (ARCHER at EPCC) and on over 4K GPUs of a Cray XK7 (Titan at ORNL).
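    As a rough illustration of the kind of loop nest involved (not the actual OPS or OpenSBLI API; the kernel, names and stencil are hypothetical), here is a hand-written second-order central-difference stencil of the sort that OpenSBLI derives from an Einstein-notation equation and that OPS then re-generates for each parallel backend.

```cpp
// Hypothetical hand-written form of one generated stencil: a 2D second-order
// central difference d2u/dx2 + d2u/dy2 over an interior block. OPS abstracts
// exactly this pattern (per-point kernel + iteration range + data accesses)
// so it can emit OpenMP, MPI, CUDA, ... variants automatically.
#include <vector>

void laplacian_2d(const std::vector<double>& u, std::vector<double>& out,
                  int nx, int ny, double dx, double dy) {
    auto idx = [nx](int i, int j) { return j * nx + i; };
    for (int j = 1; j < ny - 1; ++j) {
        for (int i = 1; i < nx - 1; ++i) {
            const double d2x = (u[idx(i - 1, j)] - 2.0 * u[idx(i, j)] + u[idx(i + 1, j)]) / (dx * dx);
            const double d2y = (u[idx(i, j - 1)] - 2.0 * u[idx(i, j)] + u[idx(i, j + 1)]) / (dy * dy);
            out[idx(i, j)] = d2x + d2y;
        }
    }
}
```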

    High performance computing systems. Performance modeling, benchmarking, and simulation : 5th International Workshop, PMBS 2014, New Orleans, LA, USA, November 16, 2014. Revised Selected Papers

    No full text
    This book constitutes the thoroughly refereed proceedings of the 5th International Workshop, PMBS 2014, held in New Orleans, LA, USA, in November 2014. The 12 full and 2 short papers presented in this volume were carefully reviewed and selected from 53 submissions. The papers cover topics on performance benchmarking and optimization; performance analysis and prediction; and power, energy and checkpointing.

    An abstract interpretation for SPMD divergence on reducible control flow graphs

    Get PDF
    Vectorizing compilers employ divergence analysis to detect at which program points a specific variable is uniform, i.e. has the same value on all SPMD threads that execute that program point. They exploit uniformity to retain branches (countering branch divergence) and to defer computations to scalar processor units. Divergence is a hyper-property and is closely related to non-interference and binding time. Several divergence, binding-time and non-interference analyses already exist, but they either sacrifice precision or impose significant restrictions on the syntactic structure of the program in order to achieve soundness. In this paper, we present the first abstract interpretation for uniformity that is general enough to be applicable to reducible CFGs and, at the same time, more precise than other analyses that achieve at least the same generality. Our analysis comes with a correctness proof that is to a large part mechanized in Coq. Our experimental evaluation shows that the compile time and the precision of our analysis are on par with LLVM's default divergence analysis, which is only sound on more restricted CFGs. At the same time, our analysis is faster and achieves better precision than a state-of-the-art non-interference analysis that is sound and at least as general as ours.
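    As a small concrete example of the property being analysed, the sketch below writes SPMD code as plain C++ with an explicit thread id; the function and variable names are illustrative, not taken from the paper.

```cpp
// Each SPMD thread executes this function with its own tid.
// A uniformity (divergence) analysis must prove which values are
// guaranteed to be identical on all threads.
void kernel(int tid, int n, const float* in, float* out, float scale) {
    int stride = n / 4;            // uniform: depends only on uniform inputs
    float s = scale * 2.0f;        // uniform
    int i = tid * stride;          // divergent: depends on tid
    if (n > 1024) {                // uniform branch: all threads take the same side,
        s += 1.0f;                 // so the vectorizer can keep it as a scalar branch
    }
    if (i < n) {                   // divergent branch: threads may disagree,
        out[i] = s * in[i];        // so it must be masked/predicated when vectorized
    }
}
```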

    Heterogeneous Acceleration for 5G New Radio Channel Modelling Using FPGAs and GPUs

    Get PDF
    The abstract is provided in the attachment.

    Optimizing the Performance of Directive-based Programming Model for GPGPUs

    Get PDF
    Accelerators have been deployed on most major HPC systems. They are considered to improve the performance of many applications. Accelerators such as GPUs have immense potential in terms of high compute capacity, but programming these devices is a challenge. OpenCL, CUDA and other vendor-specific models for accelerator programming certainly offer high performance, but these are low-level models that demand excellent programming skills; moreover, they are time consuming to write and debug. In order to simplify GPU programming, several directive-based programming models have been proposed, including HMPP, the PGI accelerator model and OpenACC. OpenACC has now become established as the de facto standard. We evaluate and compare these models using several scientific applications. To study the implementation challenges and the principles and techniques of directive-based models, we built an open-source OpenACC compiler on top of a mainstream compiler framework (OpenUH, a branch of Open64). In this dissertation, we present the techniques required to parallelize and optimize applications ported with the OpenACC programming model. We apply both user-level optimizations in the applications and compiler- and runtime-driven optimizations. The compiler optimization focuses on the parallelization of reduction operations inside nested parallel loops. To fully utilize all GPU resources, we also extend the OpenACC model to support multiple GPUs in a single node. Our application porting experience also revealed the challenge of choosing good loop schedules. The default loop schedule chosen by the compiler may not produce the best performance, so the user has to manually try different loop schedules to improve performance. To solve this issue, we developed a locality-aware auto-tuning framework, based on the proposed memory access cost model, to help the compiler choose optimal loop schedules and guide the user towards appropriate loop schedules.
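    For reference, a minimal OpenACC sketch of the pattern the dissertation's compiler work targets: a reduction inside nested parallel loops, with explicit gang/vector clauses of the kind a loop schedule or auto-tuner must choose. The clauses, sizes and names shown are illustrative assumptions, not the tuned values from the dissertation.

```cpp
// Row sums with an inner reduction, parallelized with OpenACC directives.
// The gang/vector clauses form one possible loop schedule; choosing such a
// schedule well is what the locality-aware auto-tuner described above is for.
#include <vector>

void row_sums(const std::vector<double>& a, std::vector<double>& sums,
              int rows, int cols) {
    const double* ap = a.data();
    double* sp = sums.data();
    #pragma acc parallel loop gang copyin(ap[0:rows*cols]) copyout(sp[0:rows])
    for (int i = 0; i < rows; ++i) {
        double s = 0.0;
        #pragma acc loop vector reduction(+:s)   // inner reduction across vector lanes
        for (int j = 0; j < cols; ++j)
            s += ap[i * cols + j];
        sp[i] = s;
    }
}
```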

    Routing on the Channel Dependency Graph: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks

    Get PDF
    In the pursuit of ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing has started to scale out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on the total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms become increasingly inapplicable due to irregular topologies, which are either irregular by design or, most often, the result of hardware failures. Exchanging faulty network components potentially requires whole-system downtime, further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, in terms of both hardware and software management, are necessary to mitigate the negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables. Using the fail-in-place strategy, a well-established method for storage systems of repairing only critical component failures, is a feasible solution for current and future HPC interconnects as well as for other large-scale installations such as data center networks. However, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system. This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while the system is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property-preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to the routing-deadlock problem. Therefore, this thesis further advances the state of the art by introducing a novel concept of routing on the channel dependency graph, which allows the design of a universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, which are a common hardware limitation. This disruptive innovation enables implicit deadlock avoidance during path calculation, instead of solving both problems separately as all previous solutions do.
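    To make the central concept concrete: a destination-based routing is deadlock-free when its channel dependency graph is acyclic, so any framework built on that graph needs, at minimum, a cycle check. The following is a hedged sketch of such a check (a plain depth-first search); the data structures and names are illustrative, not the thesis's implementation.

```cpp
// Channel dependency graph: node = network channel, edge u -> v if some route
// uses channel v immediately after channel u. A routing function is
// deadlock-free only if this graph (per virtual channel) is acyclic.
#include <vector>

namespace {
enum class State { Unvisited, InStack, Done };

bool dfsFindsCycle(int u, const std::vector<std::vector<int>>& adj,
                   std::vector<State>& state) {
    state[u] = State::InStack;
    for (int v : adj[u]) {
        if (state[v] == State::InStack) return true;   // back edge closes a cycle
        if (state[v] == State::Unvisited && dfsFindsCycle(v, adj, state)) return true;
    }
    state[u] = State::Done;
    return false;
}
}  // namespace

bool channelDependencyGraphHasCycle(const std::vector<std::vector<int>>& adj) {
    std::vector<State> state(adj.size(), State::Unvisited);
    for (int u = 0; u < static_cast<int>(adj.size()); ++u)
        if (state[u] == State::Unvisited && dfsFindsCycle(u, adj, state))
            return true;   // a cycle means the routing may deadlock
    return false;
}
```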

    Facing Evolution on Industry 4.0: Modular Monitoring and Adaptive & Adaptable Visualization for Industrial Cyber-Physical Systems

    Get PDF
    Industry 4.0 plays an important role in various industrial domains in which monitoring industrial cyber-physical systems (ICPSs) is becoming essential. This is due to the need to efficiently collect data from industrial processes and then make decisions that can impact the operation of the industrial systems. Typically, ICPSs are composed of heterogeneous, distributed and autonomous physical devices which can evolve over time, which makes it necessary to adapt the monitoring system to the physical devices. This dissertation proposes a monitoring system, designed for multiple domains, for ICPSs that evolve over time. The proposal covers data capture and storage as well as ICPS evolution detection and information visualization. The proposed solution is composed of two subsystems: (I) a Modular Monitoring System, based on the combination of different standards, able to capture data and store it in a structured manner; and (II) a Personal Visualization & Evolution Detection System, in which the user can customize the visualization and the system is able to trigger alerts on ICPS evolution. To validate the proposal, a prototype of each subsystem has been developed and evaluated.