Search CORE

38 research outputs found

A GPU performance estimation model based on micro-benchmarks and black-box kernel profiling

Author: Konstantinidis Elias
Κωνσταντινίδης Ηλίας
Publication venue
Publication date: 01/01/2017
Field of study

Κατά την τελευταία δεκαετία, οι επεξεργαστές γραφικών (GPUs) έχουν εδραιωθεί στον τομέα των υπολογιστικών συστημάτων υψηλής απόδοσης ως επιταχυντές υπολογισμών. Τα βασικά χαρακτηριστικά που δικαιολογούν αυτή τη σύγχρονη τάση είναι η εξαιρετικά υψηλή υπολογιστική απόδοση τους και η αξιοσημείωτη ενεργειακή αποδοτικότητα τους. Ωστόσο, η απόδοση τους είναι πολύ ευαίσθητη σε πολλούς παράγοντες, όπως π.χ. τον τύπο των μοτίβων πρόσβασης στη μνήμη (memory access patterns), την απόκλιση διακλαδώσεων (branch divergence), τον βαθμό παραλληλισμού και τις δυνητικές καθυστερήσεις (latencies). Συνεπώς, ο χρόνος εκτέλεσης ενός πυρήνα (kernel) σε ένα επεξεργαστή γραφικών είναι ένα δύσκολα προβλέψιμο μέγεθος. Στην περίπτωση που η απόδοση του πυρήνα δεν περιορίζεται από καθυστερήσεις, μπορεί να παρασχεθεί μια χονδρική εκτίμηση του χρόνου εκτέλεσης σε ένα συγκεκριμένο επεξεργαστή εφαρμόζοντας το μοντέλο γραμμής-οροφής (roofline), το οποίο χρησιμοποιείται για να αντιστοιχίσει την ένταση υπολογισμών του προγράμματος στην μέγιστη αναμενόμενη απόδοση για ένα συγκεκριμένο επεξεργαστή. Αν και αυτή η προσέγγιση είναι απλή, δεν μπορεί να παρέχει ακριβή αποτελέσματα πρόβλεψης. Σε αυτή τη διατριβή, μετά την επαλήθευση της αρχής του μοντέλου γραμμής-οροφής σε επεξεργαστές γραφικών με τη χρήση ενός μικρο-μετροπρογράμματος, προτείνεται ένα αναλυτικό μοντέλο απόδοσης. Συγκεκριμένα, βελτιώνεται το μοντέλο γραμμής-οροφής ακολουθώντας μια ποσοτική προσέγγιση και παρουσιάζεται μία πλήρως αυτοματοποιημένη μέθοδος πρόβλεψης απόδοσης σε επεξεργαστή γραφικών. Από αυτή την άποψη, το προτεινόμενο μοντέλο χρησιμοποιεί την αξιολόγηση μέσω μικρο-μετροπρογραμμάτων και την καταγραφή μετρικών με μέθοδο «μαύρου κουτιού», καθώς δεν απαιτείται διερεύνηση του πηγαίου/δυαδικού κώδικα. Το προτεινόμενο μοντέλο συνδυάζει τις παραμέτρους του επεξεργαστή γραφικών και του πυρήνα για να χαρακτηρίσει τον παράγοντα περιορισμού της απόδοσης και να προβλέψει το χρόνο εκτέλεσης στο στοχευόμενο υλικό, λαμβάνοντας υπόψη την αποδοτικότητα των ωφελίμων υπολογιστικών εντολών. Επιπλέον, προτείνεται η οπτική αναπαράσταση «διαμοιρασμού-τεταρτημορίου» (“quadrant-split”), η οποία αποδίδει τα χαρακτηριστικά πολλών επεξεργαστών σε σχέση με έναν συγκεκριμένο πυρήνα. Η πειραματική αξιολόγηση συνδυάζει δοκιμαστικές εκτελέσεις σε υπολογισμούς μορίων (κόκκινο/μαύρο SOR, LMSOR), πολλαπλασιασμό πινάκων (SGEMM) και ένα σύνολο 28 πυρήνων της σουίτας μετροπρογραμμάτων Rodinia, όλα εφαρμοσμένα σε έξι επεξεργαστές γραφικών CUDA. Το παρατηρηθέν απόλυτο σφάλμα στις προβλέψεις ήταν 27,66% στη μέση περίπτωση. Διερευνήθηκαν και αιτιολογήθηκαν ιδιαίτερες περιπτώσεις εσφαλμένων προβλέψεων. Επιπλέον, το προαναφερθέν μικρο-μετροπρόγραμμα χρησιμοποιήθηκε ως αντικείμενο για την πρόβλεψη απόδοσης και τα αποτελέσματα ήταν πολύ ακριβή. Προσθέτως, το μοντέλο απόδοσης εξετάστηκε σε σύνθετο περιβάλλον μεταξύ διαφορετικών κατασκευαστών, εφαρμόζοντας τη μέθοδο πρόβλεψης στους ίδιους πηγαίους κώδικες πυρήνων μέσω του περιβάλλοντος προγραμματισμού HIP που υποστηρίζεται από την πλατφόρμα AMD ROCm. Τα σφάλματα πρόβλεψης ήταν συγκρίσιμα αυτών των πειραμάτων του περιβάλλοντος CUDA, παρά τις σημαντικές διαφορές αρχιτεκτονικής που παρατηρούνται μεταξύ των διαφορετικών κατασκευαστών επεξεργαστών γραφικών.Over the last decade GPUs have been established in the High Performance Computing sector as compute accelerators. The primary characteristics that justify this modern trend are the exceptionally high compute throughput and the remarkable power efficiency of GPUs. However, GPU performance is highly sensitive to many factors, e.g. the type of memory access patterns, branch divergence, the degree of parallelism and potential latencies. Consequently, the execution time of a kernel on a GPU is a difficult to predict measure. Unless the kernel is latency bound, a rough estimate of the execution time on a particular GPU could be provided by applying the roofline model, which is used to map the program’s operation intensity to the peak expected performance on a particular processor. Though this approach is straightforward, it cannot not provide accurate prediction results. In this thesis, after validating the roofline principle on GPUs by employing a micro-benchmark, an analytical throughput oriented performance model is proposed. In particular, this improves on the roofline model following a quantitative approach and a completely automated GPU performance prediction technique is presented. In this respect, the proposed model utilizes micro-benchmarking and profiling in a “black-box” fashion as no inspection of source/binary code is required. The proposed model combines GPU and kernel parameters in order to characterize the performance limiting factor and to predict the execution time on target hardware, by taking into account the efficiency of beneficial computational instructions. In addition, the “quadrant-split” visual representation is proposed, which captures the characteristics of multiple processors in relation to a particular kernel. The experimental evaluation combines test executions on stencil computations (red/black SOR, LMSOR), matrix multiplication (SGEMM) and a total of 28 kernels of the Rodinia benchmark suite, all applied on six CUDA GPUs. The observed absolute error in predictions was 27.66% in the average case. Special cases of mispredicted results were investigated and justified. Moreover, the aforementioned micro-benchmark was used as a subject for performance prediction and the exhibited results were very accurate. Furthermore, the performance model was also examined in a cross vendor configuration by applying the prediction method on the same kernel source codes through the HIP programming environment supported on the AMD ROCm platform. Prediction errors were comparable to CUDA experiments despite the significant architectural differences evident between different vendor GPUs

Pergamos : Unified Institutional Repository / Digital Library Platform of the National and Kapodistrian University of Athens

Starlight: A kernel optimizer for GPU processing

Author: Alberto Zeni
Davide Conficconi
Eleonora D'Arnese
Emanuele del Sozzo
Marco Domenico Santambrogio
Publication venue
Publication date: 01/01/2024
Field of study

Archivio istituzionale della ricerca - Politecnico di Milano

Characterization and Acceleration of High Performance Compute Workloads

Author: Corda Stefano
Publication venue: Eindhoven University of Technology
Publication date: 19/12/2022
Field of study

Pure OAI Repository

Characterization and Acceleration of High Performance Compute Workloads

Author: Corda Stefano
Publication venue: Eindhoven University of Technology
Publication date: 19/12/2022
Field of study

Pure OAI Repository

Memory-Aware Latency Prediction Model for Concurrent Kernels in Partitionable GPUs: Simulations and Experiments

Author: Capodieci N.
Cavicchioli R.
Masola A.
Olmedo I. S.
Rouxel B.
Publication venue
Publication date: 01/01/2023
Field of study

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

Lessons learned in a decade of research software engineering gpu applications

Author: A Ilic
B van Werkhoven
B van Werkhoven
B van Werkhoven
B van Werkhoven
B van Werkhoven
BN Lawrence
C Goble
C Venters
E Konstantinidis
H Heydarian
M de Jong
P Lago
PR Gent
S Williams
SF Portegies Zwart
WJ Palenstijn
Y Yang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2020
Field of study

After years of using Graphics Processing Units (GPUs) to accelerate scientific applications in fields as varied as tomography, computer vision, climate modeling, digital forensics, geospatial databases, particle physics, radio astronomy, and localization microscopy, we noticed a number of technical, socio-technical, and non-technical challenges that Research Software Engineers (RSEs) may run into. While some of these challenges, such as managing different programming languages within a project, or having to deal with different memory spaces, are common to all software projects involving GPUs, others are more typical of scientific software projects. Among these challenges we include changing resolutions or scales, maintaining an application over time and making it sustainable, and evaluating both the obtained results and the achieved performance

arXiv.org e-Print Archive

Crossref

CWI's Institutional Repository

Multi-tasking scheduling for heterogeneous systems

Author: Wen Yuan
Publication venue: The University of Edinburgh
Publication date: 07/07/2017
Field of study

Heterogeneous platforms play an increasingly important role in modern computer systems. They combine high performance with low power consumption. From mobiles to supercomputers, we see an increasing number of computer systems that are heterogeneous. The most well-known heterogeneous system, CPU+GPU platforms have been widely used in recent years. As they become more mainstream, serving multiple tasks from multiple users is an emerging challenge. A good scheduler can greatly improve performance. However, indiscriminately allocating tasks based on availability leads to poor performance. As modern GPUs have a large number of hardware resources, most tasks cannot efficiently utilize all of them. Concurrent task execution on GPU is a promising solution, however, indiscriminately running tasks in parallel causes a slowdown. This thesis focuses on scheduling OpenCL kernels. A runtime framework is developed to determine where to schedule OpenCL kernels. It predicts the best-fit device by using a machine learning-based classifier, then schedules the kernels accordingly to either CPU or GPU. To improve GPU utilization, a kernel merging approach is proposed. Kernels are merged if their predicted co-execution can provide better performance than sequential execution. A machine learning based classifier is developed to find the best kernel pairs for co-execution on GPU. Finally, a runtime framework is developed to schedule kernels separately on either CPU or GPU, and run kernels in pairs if their co-execution can improve performance. The approaches developed in this thesis significantly improve system performance and outperform all existing techniques

Edinburgh Research Archive