
    Analysis and Characterization of Performance Variability for OpenMP Runtime

    In the high-performance computing (HPC) domain, performance variability is a major scalability issue for parallel applications with heavy synchronization and communication. In this paper, we present an experimental performance analysis of OpenMP benchmarks with respect to the variation of execution time, and we determine potential factors causing this variability. Our work offers some understanding of performance distributions and directions for future work on mitigating variability in OpenMP-based applications. Two representative OpenMP benchmarks from the EPCC OpenMP micro-benchmark suite and BabelStream are run across two x86 multicore platforms featuring up to 256 threads. From the obtained results, we characterize and explain the execution-time variability as a function of thread pinning, simultaneous multithreading (SMT) and core frequency variation.
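
    The measurement methodology can be illustrated with a minimal C/OpenMP sketch in the spirit of the EPCC synchronization benchmarks (our simplification, not the actual suite): time a synchronization-heavy parallel region repeatedly and report the spread, with pinning controlled externally, e.g. OMP_PROC_BIND=close OMP_PLACES=cores.

        /* variability probe: repeatedly time a barrier-only parallel region */
        #include <omp.h>
        #include <stdio.h>

        #define REPS 1000

        int main(void) {
            static double t[REPS];
            for (int r = 0; r < REPS; r++) {
                double t0 = omp_get_wtime();
                #pragma omp parallel
                {
                    #pragma omp barrier   /* synchronization cost under test */
                }
                t[r] = omp_get_wtime() - t0;
            }
            double min = t[0], max = t[0], sum = 0.0;
            for (int r = 0; r < REPS; r++) {
                if (t[r] < min) min = t[r];
                if (t[r] > max) max = t[r];
                sum += t[r];
            }
            printf("min %.3e s  mean %.3e s  max %.3e s\n", min, sum / REPS, max);
            return 0;
        }

    Running such a probe under different pinning and SMT settings exposes the full execution-time distribution rather than a single mean, which is the quantity the paper analyzes.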

    Challenges and Opportunities in the Co-design of Convolutions and RISC-V Vector Processors

    The RISC-V "V" extension introduces vector processing to the RISC-V architecture. Unlike most SIMD extensions, it supports long vectors, which can yield significant improvements for many applications. In this paper, we present our ongoing research to implement and optimize a vectorized Winograd algorithm used in convolutional layers on RISC-V Vector (RISC-VV) processors. Our study identifies effective techniques for optimizing the Winograd kernels on RISC-VV using the available intrinsic instructions, and showcases that certain instructions offer better performance to the vectorized algorithm. Our co-design findings suggest that the Winograd algorithm benefits from vector lengths up to 2048 bits and cache sizes up to 64 MB. We use our experience with Winograd to highlight potential enhancements for the standard that would simplify code generation and aid low-level programming. Finally, we share our experience from experimenting with forks of gem5 for RISC-VV and stress the importance of a mature software ecosystem for facilitating design space exploration and architectural optimization.
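
    As a hedged illustration of the kind of kernel code involved (not the paper's Winograd implementation; the function name and loop are ours), the following vector-length-agnostic C loop uses RVV v1.0 intrinsics for one elementwise fused multiply-add stage, the sort of building block that appears in Winograd's transform steps. Compile with e.g. -march=rv64gcv.

        /* dst[i] += a[i] * b[i], vector-length agnostic via vsetvl */
        #include <riscv_vector.h>
        #include <stddef.h>

        void fma_stage(float *dst, const float *a, const float *b, size_t n) {
            for (size_t i = 0; i < n;) {
                size_t vl = __riscv_vsetvl_e32m8(n - i);      /* take what fits */
                vfloat32m8_t va = __riscv_vle32_v_f32m8(a + i, vl);
                vfloat32m8_t vb = __riscv_vle32_v_f32m8(b + i, vl);
                vfloat32m8_t vd = __riscv_vle32_v_f32m8(dst + i, vl);
                vd = __riscv_vfmacc_vv_f32m8(vd, va, vb, vl); /* vd += va*vb */
                __riscv_vse32_v_f32m8(dst + i, vd, vl);
                i += vl;
            }
        }

    Because the loop asks the hardware for the vector length at runtime, the same binary exploits whatever vector length an implementation provides, which is what makes the co-design question of 512-bit versus 2048-bit vectors a hardware decision rather than a software rewrite.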

    CoCoPeLia: Communication-Computation Overlap Prediction for Efficient Linear Algebra on GPUs

    Graphics Processing Units (GPUs) are well established in HPC systems and frequently used to accelerate linear algebra routines. Since data transfers pose a severe bottleneck for GPU offloading, modern GPUs provide the ability to overlap communication with computation by splitting the problem into fine-grained sub-kernels that are executed in a pipelined manner. This optimization is currently underutilized by GPU BLAS libraries, since it requires selecting an efficient tiling size, a challenging problem that must consider routine-, system-, data- and problem-specific characteristics. In this work, we introduce an elaborate three-way concurrency model for GPU BLAS offload time that considers previously neglected features of data access and machine behavior. We then incorporate our model into an automated, end-to-end framework (called CoCoPeLia) that supports overlap prediction, tile selection and effective tile scheduling. We validate our model's efficacy for dgemm, sgemm and daxpy on two testbeds, with our experimental results showing that it achieves significantly lower prediction error than previous models and provides near-optimal tiling sizes for all problems. We also demonstrate that CoCoPeLia leads to considerable performance improvements compared to state-of-the-art BLAS routine implementations for GPUs.
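
    The overlap mechanism the model reasons about can be sketched in host-side C with the CUDA runtime and cuBLAS C APIs (an illustration under our own naming, not CoCoPeLia's code): the routine is split into tiles, and each tile's host-to-device transfer, computation and device-to-host transfer are issued in one of a rotating set of streams, so copies of one tile overlap with compute on another. Host buffers are assumed pinned (cudaMallocHost) so the copies are truly asynchronous.

        #include <cublas_v2.h>
        #include <cuda_runtime.h>

        /* Pipeline a daxpy offload across streams; within a stream the H2D
         * copies, the compute and the D2H copy of a tile stay ordered. */
        void tiled_daxpy(cublasHandle_t h, int n, double alpha,
                         const double *x_host, double *y_host,
                         double *x_dev, double *y_dev,
                         int tile, cudaStream_t *streams, int nstreams) {
            for (int off = 0, s = 0; off < n; off += tile, s = (s + 1) % nstreams) {
                int len = (n - off < tile) ? (n - off) : tile;
                cudaStream_t st = streams[s];
                cudaMemcpyAsync(x_dev + off, x_host + off, len * sizeof(double),
                                cudaMemcpyHostToDevice, st);
                cudaMemcpyAsync(y_dev + off, y_host + off, len * sizeof(double),
                                cudaMemcpyHostToDevice, st);
                cublasSetStream(h, st);          /* compute on the same stream */
                cublasDaxpy(h, len, &alpha, x_dev + off, 1, y_dev + off, 1);
                cudaMemcpyAsync(y_host + off, y_dev + off, len * sizeof(double),
                                cudaMemcpyDeviceToHost, st);
            }
            cudaDeviceSynchronize();             /* drain the whole pipeline */
        }

    The efficiency of this pipeline hinges on the tile size and stream count; predicting the offload time as a function of such parameters and choosing them automatically is precisely what the paper's model and framework provide.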

    Modeling the Scalability of the EuroExa Reconfigurable Accelerators - Preliminary Results

    Current technology and application trends push for both performance and power efficiency. EuroEXA is a project that pursues these goals, aiming to push its performance towards exascale. Towards this objective, EuroEXA nodes integrate reconfigurable (FPGA) accelerators to offload computationally intensive workloads. To fully utilize the FPGA's resource pool, multiple accelerators must be instantiated. System design and dimensioning require an early performance estimate to evaluate different design options, such as using larger FPGA devices or instantiating a larger number of accelerator instances. In this paper, we present preliminary results on modeling the scalability of the EuroEXA reconfigurable accelerators in the FPGA fabric. We start by using simple equations to bound the total number of kernels that can work in parallel, depending on the available memory channels and reconfigurable resources. Then, we use a second-degree polynomial model to predict the performance benefit of instantiating multiple replicated kernels in an FPGA. The model suggests whether switching to a larger FPGA is an advantageous choice in terms of performance. We verify our results using micro-benchmarks on two state-of-the-art FPGAs: the Alveo U50 and Alveo U280.
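
    A hedged LaTeX sketch of the two modeling stages described above (the symbol names are ours, not the paper's): a resource and memory-channel bound on the number of parallel kernel instances, followed by a second-degree polynomial fit of performance against the instance count.

        % bound on parallel instances from FPGA resources and memory channels
        N_{\max} \;=\; \min\!\Big( \min_{r \,\in\, \{\mathrm{LUT},\,\mathrm{FF},\,\mathrm{DSP},\,\mathrm{BRAM}\}} \Big\lfloor \tfrac{R_r^{\mathrm{avail}}}{R_r^{\mathrm{kernel}}} \Big\rfloor ,\; N_{\mathrm{mem\,channels}} \Big)

        % second-degree polynomial fit of performance vs. instance count n
        P(n) \;=\; a\,n^{2} + b\,n + c, \qquad 1 \le n \le N_{\max}

    Comparing the best achievable value of P(n) for n up to each device's N_max then indicates whether switching to a larger FPGA pays off.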

    PARALiA: a performance aware runtime for auto-tuning linear algebra on heterogeneous systems

    Dense linear algebra operations appear very frequently in high-performance computing (HPC) applications, rendering their performance crucial for achieving optimal scalability. As many modern HPC clusters contain multi-GPU nodes, BLAS operations are frequently offloaded to GPUs, necessitating the use of optimized libraries to ensure good performance. Unfortunately, multi-GPU systems come with two significant optimization challenges: data transfer bottlenecks, and problem splitting and scheduling across multiple workers (GPUs) with distinct memories. We demonstrate that current multi-GPU BLAS methods for tackling these challenges target very specific problem and data characteristics, resulting in serious performance degradation for any slightly deviating workload. Additionally, an even more critical decision is omitted because it cannot be addressed using current scheduler-based approaches: determining which devices should be used for a given routine invocation. To address these issues, we propose a model-based approach: using performance estimation to provide problem-specific autotuning at runtime. We integrate this autotuning into an end-to-end BLAS framework named PARALiA. This framework couples autotuning with an optimized task scheduler, leading to near-optimal data distribution and performance-aware resource utilization. We evaluate PARALiA on an HPC testbed with 8 NVIDIA V100 GPUs, improving the average performance of GEMM by 1.7× and energy efficiency by 2.5× over the state of the art on a large and diverse dataset, and demonstrating the adaptability of our performance-aware approach to future heterogeneous systems.
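
    The device-selection decision can be illustrated with a toy performance model in C (our simplification, not PARALiA's model; all names and the cost terms are assumptions): estimate each candidate worker count's GEMM time from a transfer term over a shared host link, where B must be replicated to every extra GPU, plus a compute term that scales with the number of workers, and keep the count that minimizes the estimate.

        typedef struct {
            double host_bw_gbs;   /* aggregate host-link bandwidth (GB/s)     */
            double gflops;        /* sustained per-GPU GEMM rate (GFLOP/s)    */
        } device_model_t;

        /* Toy estimate for an M x N x K double GEMM split row-wise over w
         * GPUs: A and C are partitioned, B is replicated to each GPU, so
         * transfers grow with w while compute shrinks with w. */
        static double est_time(const device_model_t *d,
                               long M, long N, long K, int w) {
            double bytes = 8.0 * ((double)M * K + (double)w * K * N + (double)M * N);
            double flops = 2.0 * (double)M * N * K;
            return bytes / (d->host_bw_gbs * 1e9)
                 + flops / ((double)w * d->gflops * 1e9);
        }

        /* Pick the worker count with the lowest predicted time. */
        int select_workers(const device_model_t *d, int ndev,
                           long M, long N, long K) {
            int best_w = 1;
            double best_t = est_time(d, M, N, K, 1);
            for (int w = 2; w <= ndev; w++) {
                double t = est_time(d, M, N, K, w);
                if (t < best_t) { best_t = t; best_w = w; }
            }
            return best_w;
        }

    Even this crude model captures the paper's point: for transfer-bound shapes, using every available GPU is not always the fastest or most energy-efficient choice, so the decision must be made per invocation.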

    SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures

    Near-Data-Processing (NDP) architectures present a promising way to alleviate data movement costs and can provide significant performance and energy benefits to parallel applications. Typically, NDP architectures support several NDP units, each including multiple simple cores placed close to memory. To fully leverage the benefits of NDP and achieve high performance for parallel workloads, efficient synchronization among the NDP cores of a system is necessary. However, supporting synchronization in many NDP systems is challenging, because they lack the shared caches and hardware cache coherence support commonly used for synchronization in multicore systems, and communication across different NDP units can be expensive. This paper comprehensively examines the synchronization problem in NDP systems and proposes SynCron, an end-to-end synchronization solution for NDP systems. SynCron adds low-cost hardware support near memory for synchronization acceleration and avoids the need for hardware cache coherence support. SynCron has three components: 1) a specialized cache memory structure to avoid memory accesses for synchronization and minimize latency overheads, 2) a hierarchical message-passing communication protocol to minimize expensive communication across NDP units of the system, and 3) a hardware-only overflow management scheme to avoid performance degradation when hardware resources for synchronization tracking are exceeded. We evaluate SynCron using a variety of parallel workloads, covering various contention scenarios. Compared to state-of-the-art approaches, SynCron improves performance by 1.27× on average (up to 1.78×) under high-contention scenarios, and by 1.35× on average (up to 2.29×) for low-contention real applications. SynCron reduces system energy consumption by 2.08× on average (up to 4.25×). (To appear in the 27th IEEE International Symposium on High-Performance Computer Architecture, HPCA-27.)
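
    The idea behind the hierarchical protocol can be sketched in software, purely as an illustration: cores first synchronize within their NDP unit, and only one representative per unit takes part in the global step, so expensive cross-unit traffic scales with the number of units rather than the number of cores. The C11-atomics barrier below is our shared-memory analogue of that structure; SynCron itself realizes it in hardware with message passing, not with this code.

        #include <stdatomic.h>

        typedef struct {
            atomic_int local_count;   /* arrivals within this NDP unit */
            atomic_int local_sense;
        } unit_t;

        atomic_int global_count;      /* arrivals of unit representatives */
        atomic_int global_sense;

        void hier_barrier(unit_t *u, int cores_per_unit, int nunits,
                          int *my_sense) {
            *my_sense = !*my_sense;   /* sense reversal per episode */
            if (atomic_fetch_add(&u->local_count, 1) == cores_per_unit - 1) {
                /* last arriver of the unit acts as its representative */
                atomic_store(&u->local_count, 0);
                if (atomic_fetch_add(&global_count, 1) == nunits - 1) {
                    atomic_store(&global_count, 0);
                    atomic_store(&global_sense, *my_sense); /* release units */
                } else {
                    while (atomic_load(&global_sense) != *my_sense) ; /* spin */
                }
                atomic_store(&u->local_sense, *my_sense);   /* release unit */
            } else {
                while (atomic_load(&u->local_sense) != *my_sense) ;   /* spin */
            }
        }

    With this structure, only one message per unit crosses the expensive inter-unit interconnect per barrier episode, which is the property SynCron's protocol exploits.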

    A flexible multi-temporal and multi-modal framework for Sentinel-1 and Sentinel-2 analysis ready data

    The rich, complementary data provided by the Sentinel-1 and Sentinel-2 satellite constellations hold considerable potential to transform Earth observation (EO) applications. However, a substantial amount of effort and infrastructure is still required to generate analysis-ready data (ARD) from the low-level products provided by the European Space Agency (ESA). Here, a flexible Python framework able to generate a range of consistent ARD aligned with the ESA-recommended processing pipeline is detailed. Sentinel-1 Synthetic Aperture Radar (SAR) data are radiometrically calibrated, speckle-filtered and terrain-corrected, and Sentinel-2 multi-spectral data are resampled in order to harmonise the spatial resolution between the two streams and to allow stacking with multiple scene classification masks. The global coverage and flexibility of the framework allow users to define a specific region of interest (ROI) and time window to create geo-referenced Sentinel-1 and Sentinel-2 images, or a combination of both with the closest temporal alignment. The framework can be applied to any location and is user-centric and versatile in generating multi-modal and multi-temporal ARD. Finally, the framework automatically handles the inherent challenges in processing Sentinel data, such as boundary regions with missing values within Sentinel-1 and the filtering of Sentinel-2 scenes based on ROI cloud coverage.

    EuroEXA - D2.6: Final ported application software

    This document describes the ported software of the EuroEXA applications on the single-CRDB testbed and discusses the experiences extracted from the porting and optimization activities, which should be actively taken into account in future redesign and optimization. This document accompanies the ported application software, found in the EuroEXA private repository (https://github.com/euroexa). In particular, this document describes the status of the software for each of the EuroEXA applications, sketches the redesign and optimization strategy for each application, and discusses the issues and difficulties faced during the porting activities and the lessons learned. A few preliminary evaluation results are presented; however, the full evaluation will be discussed in deliverable 2.8.

    Prediction of communication performance in large-scale systems

    On the path to exascale, supercomputers will grow to host hundreds of millions of cores and various complex heterogeneous processing elements, yet even today users fail to leverage the existing compute power of large-scale systems, as large classes of typical HPC applications are bound by non-scalable communication phases. The ability to predict the communication time of parallel applications can assist users, compilers, runtime systems and schedulers with decision-making for optimal resource utilization, performance optimization, power saving and resilience. This thesis presents a methodology for predictive communication modeling of HPC applications. Communication time depends on a complex set of parameters relevant to the application, the system architecture, the runtime configuration and the runtime conditions. To handle this complexity, we follow an empirical modeling approach. We define features that can be extracted from the application, the process mapping and the allocation shape ahead of execution, deploy a single benchmark to sweep over the parameter space, and develop predictive models for communication time on three large-scale computing systems, Vilje, Piz Daint and ARIS, using different subsets of our features, statistical and machine-learning methods, and training sets. We compare the predictive performance of our models on various communication patterns and applications, for multiple problem sizes, executions and runtime configurations, ranging from a few dozen to a few thousand cores. Our methodology is successful across all tested communication patterns on all systems and exhibits high prediction accuracy and goodness-of-fit. Our models are applicable just-in-time ahead of the execution of an HPC application and, as we demonstrate in this thesis, their high accuracy makes them suitable for communication-aware decision-making towards the optimization of resource utilization on large-scale systems.
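
    The empirical approach can be illustrated with a hedged C/MPI sketch (our toy probe, not the thesis benchmark): sweep one axis of the parameter space, here the message size of an MPI_Allreduce at a fixed process count, and log (features, time) records on which a statistical or machine-learning model can later be trained.

        #include <mpi.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define REPS 100

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank, nprocs;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

            for (long n = 1; n <= (1L << 20); n <<= 1) { /* message-size sweep */
                double *in  = malloc(n * sizeof(double));
                double *out = malloc(n * sizeof(double));
                for (long i = 0; i < n; i++) in[i] = (double)i;

                MPI_Barrier(MPI_COMM_WORLD);             /* common start line */
                double t0 = MPI_Wtime();
                for (int r = 0; r < REPS; r++)
                    MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
                double t = (MPI_Wtime() - t0) / REPS;

                if (rank == 0)    /* features: pattern, process count, size */
                    printf("allreduce,%d,%ld,%e\n", nprocs, n, t);
                free(in); free(out);
            }
            MPI_Finalize();
            return 0;
        }

    Repeating such sweeps across communication patterns, process mappings and allocation shapes yields the training sets from which the thesis builds its predictive models.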