30 research outputs found

    Doctor of Philosophy

    The internet-based information infrastructure that has powered the growth of modern personal/mobile computing is composed of powerful, warehouse-scale computers or datacenters. These heavily subscribed datacenters perform data-processing jobs under intense quality-of-service guarantees. Further, high-performance compute platforms are being used to model and analyze increasingly complex scientific problems and natural phenomena. To ensure that the high-performance needs of these machines are met, it is necessary to increase the efficiency of the memory system that supplies data to the processing cores. Many of the microarchitectural innovations that were designed to scale the memory wall (e.g., out-of-order instruction execution, on-chip caches) are being rendered less effective by several emerging trends (e.g., increased emphasis on energy consumption, limited access locality). This motivates the optimization of the main memory system itself. The key to an efficient main memory system is the memory controller. In particular, the scheduling algorithm in the memory controller greatly influences its performance. This dissertation explores this hypothesis in several contexts. It develops tools to better understand memory scheduling and develops scheduling innovations for CPUs and GPUs. We propose novel memory scheduling techniques that are strongly aware of the access patterns of the clients as well as the microarchitecture of the memory device. Based on these, we present (i) a Dynamic Random Access Memory (DRAM) chip microarchitecture optimized for reducing write-induced slowdown, (ii) a memory scheduling algorithm that exploits these features, (iii) several memory scheduling algorithms to reduce the memory-related stall experienced by irregular General Purpose Graphics Processing Unit (GPGPU) applications, and (iv) the Utah Simulated Memory Module (USIMM), a detailed, validated simulator for DRAM main memory that we use for analyzing and proposing scheduling algorithms.
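    The abstract above argues that the memory controller's scheduling algorithm is the key lever on main-memory efficiency. As a concrete point of reference, the sketch below shows a generic, simplified row-buffer-aware (FR-FCFS-style) scheduling step; it is a textbook baseline shown for illustration only, not the dissertation's own algorithm, and the request and bank structures are hypothetical.

```cuda
// Hypothetical, simplified FR-FCFS-style DRAM scheduling step:
// prefer requests that hit the currently open row of their bank,
// and break ties by age. Structures are illustrative only.
#include <cstdint>
#include <cstdio>
#include <vector>

struct Request { uint64_t arrival; int bank; uint64_t row; };
struct Bank    { bool rowOpen; uint64_t openRow; };

// Returns the index of the request to issue next, or -1 if the queue is empty.
int schedule(const std::vector<Request>& q, const std::vector<Bank>& banks) {
    int best = -1;
    bool bestHit = false;
    for (int i = 0; i < (int)q.size(); ++i) {
        const Bank& b = banks[q[i].bank];
        bool hit = b.rowOpen && b.openRow == q[i].row;            // row-buffer hit?
        if (best < 0 ||
            (hit && !bestHit) ||                                  // hits beat misses
            (hit == bestHit && q[i].arrival < q[best].arrival)) { // then oldest first
            best = i; bestHit = hit;
        }
    }
    return best;
}

int main() {
    std::vector<Bank> banks = {{true, 42}, {false, 0}};
    std::vector<Request> q = {{10, 0, 7}, {12, 0, 42}, {5, 1, 3}};
    printf("next request index: %d\n", schedule(q, banks));  // picks the row hit (index 1)
    return 0;
}
```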

    CUDA Unified Memory를 위한 데이터 관리 및 프리페칭 기법 (Data Management and Prefetching Techniques for CUDA Unified Memory)

    Ph.D. dissertation, Department of Computer Science and Engineering, College of Engineering, Seoul National University, August 2022 (advisor: Jaejin Lee). Unified Memory (UM) is a component of the CUDA programming model that provides a memory pool with a single address space accessible by both the host and the GPU. When UM is used, a CUDA program does not need to move data explicitly between the host and the device. UM also allows GPU memory oversubscription by using CPU memory as a backing store. It significantly lessens the burden on the programmer and provides great programmability. However, using UM by itself does not guarantee good performance. To fully exploit UM and improve performance, the programmer needs to add user hints to the source code to prefetch pages that will be accessed during CUDA kernel execution. In this thesis, we propose three frameworks that exploit UM to improve ease of programming while maximizing application performance.
    The first framework is HUM, which hides the host-to-device memory copy time of a traditional CUDA program without any code modification. It overlaps host-to-device memory copies with host computation or CUDA kernel computation by exploiting Unified Memory and its fault mechanism. The evaluation shows that executing applications under HUM is, on average, 1.21 times faster than executing them under original CUDA; this speedup is comparable to the average speedup of 1.22 achieved by hand-optimized Unified Memory implementations.
    The second framework is DeepUM, which exploits UM to allow GPU memory oversubscription for deep neural networks. While UM enables memory oversubscription through its page fault mechanism, page fault handling introduces enormous overhead. We use a correlation prefetching technique to hide this overhead. The evaluation shows that DeepUM achieves performance comparable to other state-of-the-art approaches while running larger batch sizes, or models with larger hyperparameters, that the other methods fail to run.
    The last framework is SnuRHAC, which provides the illusion of a single GPU for the multiple GPUs in a cluster. Under SnuRHAC, a CUDA program designed for a single GPU can utilize multiple GPUs in a cluster without any source code modification. SnuRHAC automatically distributes the workload to the GPUs in the cluster and manages data across the nodes. To manage data efficiently, SnuRHAC extends Unified Memory and exploits its page fault mechanism. We also propose two prefetching techniques to fully exploit UM and maximize performance. The evaluation shows that SnuRHAC significantly improves ease of programming while delivering scalable performance in the cluster environment, depending on the application characteristics.
    Contents:
    1 Introduction
    2 Related Work
    3 CUDA Unified Memory
    4 Framework for Maximizing the Performance of Traditional CUDA Program
        4.1 Overall Structure of HUM
        4.2 Overlapping H2Dmemcpy and Computation
        4.3 Data Consistency and Correctness
        4.4 HUM Driver
        4.5 HUM H2Dmemcpy Mechanism
        4.6 Parallelizing Memory Copy Commands
        4.7 Scheduling Memory Copy Commands
    5 Framework for Running Large-scale DNNs on a Single GPU
        5.1 Structure of DeepUM (5.1.1 DeepUM Runtime; 5.1.2 DeepUM Driver)
        5.2 Correlation Prefetching for GPU Pages (5.2.1 Pair-based Correlation Prefetching; 5.2.2 Correlation Prefetching in DeepUM)
        5.3 Optimizations for GPU Page Fault Handling (5.3.1 Page Pre-eviction; 5.3.2 Invalidating UM Blocks of Inactive PyTorch Blocks)
    6 Framework for Virtualizing a Single Device Image for a GPU Cluster
        6.1 Overall Structure of SnuRHAC
        6.2 Workload Distribution
        6.3 Cluster Unified Memory
        6.4 Additional Optimizations
        6.5 Prefetching (6.5.1 Static Prefetching; 6.5.2 Dynamic Prefetching)
    7 Evaluation
        7.1 Framework for Maximizing the Performance of Traditional CUDA Program (7.1.1 Methodology; 7.1.2 Results)
        7.2 Framework for Running Large-scale DNNs on a Single GPU (7.2.1 Methodology; 7.2.2 Comparison with Naive UM and IBM LMS; 7.2.3 Parameters of the UM Block Correlation Table; 7.2.4 Comparison with TensorFlow-based Approaches)
        7.3 Framework for Virtualizing Single Device Image for a GPU Cluster (7.3.1 Methodology; 7.3.2 Results)
    8 Discussions and Future Work
    9 Conclusion
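    For reference, the user hints mentioned in the abstract typically look like the sketch below, which uses the standard CUDA Unified Memory APIs (cudaMallocManaged, cudaMemAdvise, cudaMemPrefetchAsync). The kernel, buffer size, and advice choices are hypothetical; this is the kind of manual tuning the thesis's frameworks aim to avoid, not code from the thesis itself.

```cuda
// Minimal sketch of manual Unified Memory hints. The kernel `scale`
// and the buffer size are hypothetical.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 24;
    float* x = nullptr;
    cudaMallocManaged(&x, n * sizeof(float));        // single address space, host + GPU
    for (int i = 0; i < n; ++i) x[i] = 1.0f;         // pages are first touched on the host

    int dev = 0;
    cudaGetDevice(&dev);
    // User hints: declare the preferred location and prefetch the pages the
    // kernel will access, so the kernel does not stall on GPU page faults.
    cudaMemAdvise(x, n * sizeof(float), cudaMemAdviseSetPreferredLocation, dev);
    cudaMemPrefetchAsync(x, n * sizeof(float), dev);  // migrate to the GPU ahead of time

    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);
    cudaMemPrefetchAsync(x, n * sizeof(float), cudaCpuDeviceId);  // bring results back
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);
    cudaFree(x);
    return 0;
}
```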

    2016 eXtreme Science and Engineering Discovery Environment (XSEDE) Annual User Satisfaction Survey

    This report provides an analysis and evaluation of the 2016 eXtreme Science and Engineering Discovery Environment (XSEDE) Annual User Satisfaction Survey. Section C.2 describes the data collection methodology of the survey. The sample included 13 types of users, with 5,000 users sampled (out of 14,398) and 1,007 respondents. The survey consisted of quantitative and qualitative questions designed to measure user satisfaction with XSEDE services and resources.
    • The survey was available from February 11, 2016 through April 7, 2016. The overall response rate was 22.2%, down from the project high of 27.4% achieved in 2015.
    • Awareness remained nearly constant compared with 2015 results, with most areas trending slightly upward.
    • Only one area, Mission, experienced slightly lower awareness. Areas scoring less than 3.0 in awareness were the same as in 2015: TIS, ECSS, Mobile, Visualization Resources, and Science Gateways.
    • The data suggest that users are satisfied with XSEDE resources and services, with all mean satisfaction values significantly greater than 3.0 (on a 5.0 scale) and greater than 4 in most areas. Most areas trended slightly upward or remained the same.
    • Overall satisfaction with XSEDE remains high at 4.34 on a 5-point scale, on par with 2014's all-time high of 4.36.
    • Training preferences have remained constant over the 2013-2016 period. Data consistently show a preference for self-serve and "just-in-time" training options (i.e., Web documentation and online, self-paced tutorials).
    • Consistent with previous years, demographic analysis shows that a typical user is male, white, and a faculty member at a large, doctoral-granting, research-focused university. Chemistry, physics, and engineering were the primary fields of study for 52% of respondents.
    • Section D of this report includes all open-ended question responses. Responses are categorized.

    Transparent Memory Extension for Shared GPUs

    Graphics processing units (GPUs) play an important role in today's computing, since they provide large performance gains at high energy efficiency for certain kinds of applications. For this reason, all major cloud providers have integrated GPUs into their offerings in recent years. These providers' platforms typically use virtualization to share physical resources among multiple customers. Sharing increases resource utilization and thus gives the cloud provider a cost advantage over dedicated physical hardware. To increase utilization even further, today's cloud providers often rent out more resources than are physically available. If customers actually want to use the offered resources fully, however, the provider must be able to guarantee that customer applications keep working even if the customers' resource demand exceeds the capacity of the physical resources. The memory of current GPUs is comparatively easy to share among multiple customers, since these GPUs support virtual memory similar to that of the CPU. The provider can thus give each customer a large virtual address space while only having to provide as much physical memory as the customers actually use. If the provider wants to offer more memory than is physically available, it is in principle also possible to evict data to the system's main memory when GPU memory is overcommitted. This eviction, however, is complicated by the asynchronous operation of current GPUs: applications can submit GPU kernels for execution directly to the GPU without having to call into the operating system, so the operating system has no control over when GPU kernels execute. In addition, current GPUs assume that all GPU memory an application has once allocated is accessible at all times. If a kernel attempts to access an inaccessible address, the GPU treats this access as a fatal error and aborts the kernel's execution. Previous approaches work around this problem by employing a software scheduler for GPU kernels to regain control over when kernels execute: after each kernel finishes, the next kernel is selected in software on the CPU and submitted to the GPU. If data that the next kernel may access has been evicted from the GPU to main memory, the scheduler copies this data back to the GPU before the kernel is launched. The decisive drawback of this approach is that the software scheduler replaces the GPU's extremely efficient internal scheduling and context switching without reaching the same level of efficiency. Approaches based on software scheduling therefore cause high overhead, even when a sufficient amount of GPU memory is available. Moreover, since the scheduler has no way of determining which data a GPU kernel actually accesses, this approach frequently copies data that is not needed at all. In this thesis, we develop an alternative approach to enable eviction of GPU data.
    Our eviction mechanism, called GPUswap, maps all evicted data directly into the GPU address space of the respective application. Since all data thus remains accessible at all times, GPUswap can continue to allow applications to submit commands directly to the GPU. Because our approach requires no software scheduling, GPUswap causes no overhead at all as long as sufficient GPU memory is available. If data actually has to be evicted to main memory, GPUswap additionally eliminates unnecessary data transfers between main memory and the GPU, since only evicted data that the application actually accesses is transferred over the PCIe bus. Even though GPUswap causes considerably less overhead than previous approaches, the overhead of using main memory instead of GPU memory is still substantial: applications access evicted data over the PCIe bus, which has considerably lower bandwidth than GPU memory. To reduce this overhead, pages that are rarely accessed should be evicted preferentially. Identifying such pages is not straightforward on current GPUs, however, since the hardware features normally used for this purpose on the CPU, such as reference bits, are not available on current GPUs. In this thesis, we instead use profiling to identify rarely used memory pages. Previous approaches to profiling GPU memory relied on modified compilers that transparently instrument all memory accesses of the analyzed application. This approach has two drawbacks: first, only applications supported by the modified compiler can be analyzed, and second, all code of the analyzed application, including the libraries it uses, must be compiled with the modified compiler, since otherwise memory accesses from application parts compiled with a different compiler are invisible to the profiler. Our approach uses the GPU's performance counters instead of a modified compiler. Our profiler evicts individual pages from GPU memory to main memory and then uses the performance counters to count the application's main memory accesses. Repeating this procedure once for each page in the application's address space yields a complete access profile of the entire memory in that address space. In contrast to previous work, this approach works with arbitrary applications and automatically covers all libraries in the application's address space. Profiling several applications from the Rodinia benchmark suite shows that for most applications the number of accesses per page differs mainly between the application's different memory buffers, while pages within the same buffer usually exhibit a similar number of accesses. Based on the collected profiles, we examine several possible eviction policies and their applicability on current GPUs. Our prototype includes two of these policies: one selects pages to evict at random, while the other uses a priority-based approach.
    With the priority-based policy, the user assigns a priority to each of the application's buffers based on an access profile of the application. The eviction policy then preferentially selects pages from low-priority buffers. Experiments with both policies show that the priority-based approach not only reduces GPUswap's overhead considerably compared with random selection, but is even able to evict larger amounts of data without any overhead.
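    To make the policy described above concrete, here is a minimal host-side sketch of priority-based victim selection: the user assigns each buffer a priority derived from its access profile, and pages are taken from the lowest-priority buffers first. The Buffer bookkeeping structure and page-granularity accounting are hypothetical; this is not the actual GPUswap driver code.

```cuda
// Hypothetical sketch of GPUswap-style priority-based victim selection.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Buffer {
    int      priority;    // user-assigned from the access profile (lower = evict first)
    uint64_t base;        // first page number of the buffer
    uint64_t numPages;    // buffer length in pages
    uint64_t evicted;     // pages already moved to host memory
};

// Pick victim pages until `needed` pages have been selected.
std::vector<uint64_t> pickVictims(std::vector<Buffer>& bufs, uint64_t needed) {
    std::sort(bufs.begin(), bufs.end(),
              [](const Buffer& a, const Buffer& b) { return a.priority < b.priority; });
    std::vector<uint64_t> victims;
    for (Buffer& b : bufs) {
        while (needed > 0 && b.evicted < b.numPages) {
            victims.push_back(b.base + b.evicted);   // evict the next page of this buffer
            ++b.evicted;
            --needed;
        }
        if (needed == 0) break;
    }
    return victims;   // may be shorter than requested if everything is already evicted
}
```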

    Inter-workgroup barrier synchronisation on graphics processing units

    GPUs are parallel devices that are able to run thousands of independent threads concurrently. Traditional GPU programs are data-parallel, requiring little to no communication, i.e. synchronisation, between threads. However, classical concurrency in the context of CPUs often exploits synchronisation idioms that are not supported on GPUs. By studying such idioms on GPUs, with an aim to facilitate them in a portable way, a wider and more generic space of GPU applications can be made possible. While the breadth of this thesis extends to many aspects of GPU systems, the common thread throughout is the global barrier: an execution barrier that synchronises all threads executing a GPU application. The idea of such a barrier might seem straightforward, however this investigation reveals many challenges and insights. In particular, this thesis includes the following studies: Execution models: while a general global barrier can deadlock due to starvation on GPUs, it is shown that the scheduling guarantees of current GPUs can be used to dynamically create an execution environment that allows for a safe and portable global barrier across a subset of the GPU threads. Application optimisations: a set of GPU optimisations tailored for graph applications is examined, including one optimisation enabled by the global barrier. It is shown that these optimisations can provide substantial performance improvements, e.g. the barrier optimisation achieves over a 10X speedup on AMD and Intel GPUs. The performance portability of these optimisations is investigated, as their utility varies across input, application, and architecture. Multitasking: because many GPUs do not support preemption, long-running GPU compute tasks (e.g. applications that use the global barrier) may block other GPU functions, including graphics. A simple cooperative multitasking scheme is proposed that allows graphics tasks to meet their deadlines with reasonable overheads.
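    As a concrete illustration of the kind of global barrier discussed above, the following CUDA sketch implements a simple counter/phase inter-block barrier. It is an illustration only, not the thesis's protocol: it assumes 1-D thread blocks and, crucially, it deadlocks unless every block of the grid is simultaneously resident on the GPU, which is exactly the occupancy condition the thesis shows how to establish portably.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__device__ unsigned int g_arrived = 0;          // blocks that reached the barrier
__device__ volatile unsigned int g_phase = 0;   // barrier episode counter
__device__ int g_total = 0;

// Illustrative inter-block barrier (counter/phase style), assuming 1-D blocks.
__device__ void global_barrier(unsigned int num_blocks) {
    __syncthreads();                            // every thread of this block arrives
    if (threadIdx.x == 0) {
        __threadfence();                        // publish this block's prior writes
        unsigned int phase = g_phase;           // remember the current episode
        if (atomicAdd(&g_arrived, 1u) == num_blocks - 1) {
            g_arrived = 0;                      // last block resets the counter...
            __threadfence();                    // ...and publishes it before releasing
            g_phase = phase + 1;                // release every spinning block
        } else {
            while (g_phase == phase) { }        // spin until the episode advances
        }
    }
    __syncthreads();                            // rest of the block waits on thread 0
}

__global__ void two_phase(void) {
    if (threadIdx.x == 0) atomicAdd(&g_total, 1);   // phase 1: every block contributes
    global_barrier(gridDim.x);
    if (blockIdx.x == 0 && threadIdx.x == 0)        // phase 2: block 0 sees all contributions,
        printf("blocks counted after barrier: %d\n",// which is only safe because of the barrier
               atomicAdd(&g_total, 0));
}

int main() {
    two_phase<<<4, 64>>>();    // 4 blocks: small enough to be co-resident on any GPU
    cudaDeviceSynchronize();
    return 0;
}
```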

    Perfect Hash Function Generation on the GPU with RecSplit

    Minimal perfect hash functions (MPHFs) map a static set S of arbitrary keys bijectively onto the set of the first |S| natural numbers, i.e., every hash value is used exactly once. They are useful in many applications, for example for implementing hash tables with guaranteed constant access time. MPHFs can be very compact: less than 2 bits per key are possible. On the other hand, MPHFs cannot decide whether a given key belongs to S. Currently, RecSplit is the most space-efficient MPHF. RecSplit offers various trade-offs between space consumption, construction time, and query time. For example, RecSplit can construct an MPHF with 1.56 bits per key in less than 2 ms per key. However, this is too slow for large inputs. This work presents new RecSplit implementations that use multithreading, SIMD, and the power of GPUs to improve construction time. Together with our new bijection-rotation method, we achieve speedups of up to 333 for our SIMD implementation on an 8-core CPU and up to 1873 for our GPU implementation, compared with the original sequential RecSplit implementation. This allows us to construct MPHFs with 1.56 bits per key in less than 1.5 μs per key.
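    For context only, the 1.56 bits per key figure can be compared with the standard information-theoretic lower bound for minimal perfect hashing (a well-known fact, not a result of this work):

```latex
% A uniformly random function from the n keys of S into [n] is a bijection
% with probability n!/n^n, so any encoding of an MPHF needs at least
\[
  \frac{1}{n}\log_2\frac{n^n}{n!} \;\approx\; \log_2 e \;\approx\; 1.4427
  \ \text{bits per key (by Stirling's formula),}
\]
% so 1.56 bits per key is within roughly 8% of optimal.
```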

    Developing Real-Time GPU-Sharing Platforms for Artificial-Intelligence Applications

    In modern autonomous systems such as self-driving cars, sustained safe operation requires running complex software at rates possible only with the help of specialized computational accelerators. Graphics processing units (GPUs) remain a foremost example of such accelerators, due to their relative ease of use and the proficiency with which they can accelerate neural-network computations underlying modern computer-vision and artificial-intelligence algorithms. This means that ensuring GPU processing completes in a timely manner is essential---but doing so is not necessarily simple, especially when a single GPU is concurrently shared by many applications. Existing real-time research includes several techniques for improving timing characteristics of shared-GPU workloads, each with varying tradeoffs and practical limitations. In the world of timing correctness, however, one problem stands above all others: the lack of detailed information about how GPU hardware and software behaves. GPU manufacturers are usually willing to publish documentation sufficient for producing logically correct software, or guidance on tuning software to achieve "real-fast," high-throughput performance, but the same manufacturers neglect to provide the details needed to establish temporal predictability. Techniques for improving the reliability of GPU software's temporal performance are only as good as the information upon which they are based, incentivising researchers to spend inordinate amounts of time learning foundational facts about existing hardware---facts that chip manufacturers must know, but are not willing to publish. This is both a continual inconvenience in established GPU research, and a high barrier to entry for newcomers. This dissertation addresses the "information problem" hindering real-time GPU research in several ways. First, it seeks to fight back against the monoculture that has arisen with respect to platform choice. Virtually all prior real-time GPU research is developed for and evaluated using GPUs manufactured by NVIDIA, but this dissertation provides details about an alternate platform: AMD GPUs. Second, this dissertation works towards establishing a model with which GPU performance can be predicted or controlled. To this end, it uses a series of experiments to discern the policy that governs the queuing behavior of concurrent GPU-sharing processes, on both NVIDIA and AMD GPUs. Finally, this dissertation addresses the novel problems for safety-critical systems caused by the changing landscape of the applications that run on GPUs. In particular, the advent of neural-network-based artificial intelligence has catapulted GPU usage into safety-critical domains that are not prepared for the complexity of the new software or the fact that it cannot guarantee logical correctness. The lack of logical guarantees is unlikely to be "solved" in the near future, motivating a focus on increased throughput. Higher throughput increases the probability of producing a correct result within a fixed amount of time, but GPU-management efforts typically focus on worst-case performance, often at the expense of throughput. This dissertation's final chapter therefore evaluates existing GPU-management techniques' efficacy at managing neural-network applications, from both a throughput and a worst-case perspective.
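    As an illustration of the black-box style of experiment described above for discerning GPU queuing behavior, the sketch below launches a long-running kernel on two independent CUDA streams and times each with events; whether the measured spans overlap or serialize hints at the device's scheduling policy. The spin kernel, duration, and stream count are hypothetical, and this is not the dissertation's benchmarking framework.

```cuda
// Hypothetical black-box timing experiment: launch a long-running kernel on
// two independent streams and measure each one's span with CUDA events to see
// whether the GPU queues them serially or runs them concurrently.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }   // busy-wait for a known duration
}

int main() {
    cudaStream_t s[2];
    cudaEvent_t  start[2], stop[2];
    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&s[i]);
        cudaEventCreate(&start[i]);
        cudaEventCreate(&stop[i]);
    }
    for (int i = 0; i < 2; ++i) {
        cudaEventRecord(start[i], s[i]);
        spin<<<1, 32, 0, s[i]>>>(200000000LL);   // long enough to observe overlap
        cudaEventRecord(stop[i], s[i]);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < 2; ++i) {
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start[i], stop[i]);
        printf("stream %d kernel span: %.1f ms\n", i, ms);  // similar spans suggest concurrency
    }
    return 0;
}
```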

    Traçage et profilage de systèmes hétérogènes (Tracing and Profiling of Heterogeneous Systems)

    RÉSUMÉ: Heterogeneous systems are increasingly present in all computers. Indeed, many tasks require the use of specialized coprocessors. These coprocessors have enabled very large performance gains that have led to scientific breakthroughs, notably deep learning, which only reemerged with the arrival of general-purpose programming of graphics processors. These coprocessors are becoming more and more complex. The collaboration and cohabitation of these chips within a single system lead to behaviors that cannot be predicted through static analysis. Moreover, the use of parallel systems with thousands of execution threads, and of specialized programming models, makes such systems very difficult to understand. These comprehension problems not only make programming slower and more expensive, but also prevent the diagnosis of performance problems.
    ABSTRACT: Heterogeneous systems are becoming increasingly relevant and important with the emergence of powerful specialized coprocessors. Because of the nature of certain problems, like graphics display, deep learning and physics simulation, these devices have become a necessity. The power derived from their highly parallel or very specialized architecture is essential to meet the demands of these problems. Because these use cases are common on everyday devices like cellphones and computers, highly parallel coprocessors are added to these devices and collaborate with standard CPUs. The cooperation between these different coprocessors makes the system very difficult to analyze and understand. The highly parallel workload and specialized programming models make programming applications very difficult. Troubleshooting performance issues is even more complex. Since these systems communicate through many layers, the abstractions hide many performance defects.
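    As a minimal illustration of host-side tracing in a heterogeneous system (a generic sketch, not the tooling developed in this thesis), the wrapper below timestamps each GPU runtime call so that its begin and end times can later be correlated with CPU-side events; the macro name and output format are hypothetical.

```cuda
// Generic sketch: wrap GPU API calls so they emit begin/end timestamps
// that a trace analysis can correlate with CPU activity.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

static double now_us() {
    using namespace std::chrono;
    return duration<double, std::micro>(steady_clock::now().time_since_epoch()).count();
}

#define TRACE_CALL(call)                                              \
    do {                                                              \
        double t0 = now_us();                                         \
        cudaError_t err = (call);                                     \
        double t1 = now_us();                                         \
        fprintf(stderr, "[trace] %s: %.1f us (%s)\n", #call, t1 - t0, \
                cudaGetErrorString(err));                             \
    } while (0)

int main() {
    float* d = nullptr;
    TRACE_CALL(cudaMalloc(&d, 1 << 20));        // timestamped allocation
    TRACE_CALL(cudaMemset(d, 0, 1 << 20));      // timestamped memset
    TRACE_CALL(cudaDeviceSynchronize());        // also captures queued GPU work
    TRACE_CALL(cudaFree(d));
    return 0;
}
```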