Search CORE

349 research outputs found

Library Cache Coherence

Author: Cho Myong Hyon
Devadas Srinivas
Khan Omer
Lis Mieszko
Shim Keun Sup
Publication venue
Publication date: 02/05/2011
Field of study

Directory-based cache coherence is a popular mechanism for chip multiprocessors and multicores. The directory protocol, however, requires multicast for invalidation messages and the collection of acknowledgement messages, which can be expensive in terms of latency and network traffic. Furthermore, the size of the directory increases with the number of cores. We present Library Cache Coherence (LCC), which requires neither broadcast/multicast for invalidations nor waiting for invalidation acknowledgements. A library is a set of timestamps that are used to auto-invalidate shared cache lines, and delay writes on the lines until all shared copies expire. The size of library is independent of the number of cores. By removing the complex invalidation process of directory-based cache coherence protocols, LCC generates fewer network messages. At the same time, LCC also allows reads on a cache block to take place while a write to the block is being delayed, without breaking sequential consistency. As a result, LCC has 1.85X less average memory latency than a MESI directory-based protocol on our set of benchmarks, even with a simple timestamp choosing algorithm; moreover, our experimental results on LCC with an ideal timestamp scheme (though not implementable) show the potential of further improvement for LCC with more sophisticated timestamp schemes

DSpace@MIT

Memory consistency directed cache coherence protocols for scalable multiprocessors

Author: Elver Marco
Nagarajan Vijay
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/02/2014
Field of study

The memory consistency model, which formally specifies the behavior of the memory system, is used by programmers to reason about parallel programs. From a hardware design perspective, weaker consistency models permit various optimizations in a multiprocessor system: this thesis focuses on designing and optimizing the cache coherence protocol for a given target memory consistency model. Traditional directory coherence protocols are designed to be compatible with the strictest memory consistency model, sequential consistency (SC). When they are used for chip multiprocessors (CMPs) that provide more relaxed memory consistency models, such protocols turn out to be unnecessarily strict. Usually, this comes at the cost of scalability, in terms of per-core storage due to sharer tracking, which poses a problem with increasing number of cores in today’s CMPs, most of which no longer are sequentially consistent. The recent convergence towards programming language based relaxed memory consistency models has sparked renewed interest in lazy cache coherence protocols. These protocols exploit synchronization information by enforcing coherence only at synchronization boundaries via self-invalidation. As a result, such protocols do not require sharer tracking which benefits scalability. On the downside, such protocols are only readily applicable to a restricted set of consistency models, such as Release Consistency (RC), which expose synchronization information explicitly. In particular, existing architectures with stricter consistency models (such as x86) cannot readily make use of lazy coherence protocols without either: adapting the protocol to satisfy the stricter consistency model; or changing the architecture’s consistency model to (a variant of) RC, typically at the expense of backward compatibility. The first part of this thesis explores both these options, with a focus on a practical approach satisfying backward compatibility. Because of the wide adoption of Total Store Order (TSO) and its variants in x86 and SPARC processors, and existing parallel programs written for these architectures, we first propose TSO-CC, a lazy cache coherence protocol for the TSO memory consistency model. TSO-CC does not track sharers and instead relies on self-invalidation and detection of potential acquires (in the absence of explicit synchronization) using per cache line timestamps to efficiently and lazily satisfy the TSO memory consistency model. Our results show that TSO-CC achieves, on average, performance comparable to a MESI directory protocol, while TSO-CC’s storage overhead per cache line scales logarithmically with increasing core count. Next, we propose an approach for the x86-64 architecture, which is a compromise between retaining the original consistency model and using a more storage efficient lazy coherence protocol. First, we propose a mechanism to convey synchronization information via a simple ISA extension, while retaining backward compatibility with legacy codes and older microarchitectures. Second, we propose RC3 (based on TSOCC), a scalable cache coherence protocol for RCtso, the resulting memory consistency model. RC3 does not track sharers and relies on self-invalidation on acquires. To satisfy RCtso efficiently, the protocol reduces self-invalidations transitively using per-L1 timestamps only. RC3 outperforms a conventional lazy RC protocol by 12%, achieving performance comparable to a MESI directory protocol for RC optimized programs. RC3’s storage overhead per cache line scales logarithmically with increasing core count and reduces on-chip coherence storage overheads by 45% compared to TSO-CC. Finally, it is imperative that hardware adheres to the promised memory consistency model. Indeed, consistency directed coherence protocols cannot use conventional coherence definitions (e.g. SWMR) to be verified against, and few existing verification methodologies apply. Furthermore, as the full consistency model is used as a specification, their interaction with other components (e.g. pipeline) of a system must not be neglected in the verification process. Therefore, verifying a system with such protocols in the context of interacting components is even more important than before. One common way to do this is via executing tests, where specific threads of instruction sequences are generated and their executions are checked for adherence to the consistency model. It would be extremely beneficial to execute such tests under simulation, i.e. when the functional design implementation of the hardware is being prototyped. Most prior verification methodologies, however, target post-silicon environments, which when used for simulation-based memory consistency verification would be too slow. We propose McVerSi, a test generation framework for fast memory consistency verification of a full-system design implementation under simulation. Our primary contribution is a Genetic Programming (GP) based approach to memory consistency test generation, which relies on a novel crossover function that prioritizes memory operations contributing to non-determinism, thereby increasing the probability of uncovering memory consistency bugs. To guide tests towards exercising as much logic as possible, the simulator’s reported coverage is used as the fitness function. Furthermore, we increase test throughput by making the test workload simulation-aware. We evaluate our proposed framework using the Gem5 cycle accurate simulator in full-system mode with Ruby (with configurations that use Gem5’s MESI protocol, and our proposed TSO-CC together with an out-of-order pipeline). We discover 2 new bugs in the MESI protocol due to the faulty interaction of the pipeline and the cache coherence protocol, highlighting that even conventional protocols should be verified rigorously in the context of a full-system. Crucially, these bugs would not have been discovered through individual verification of the pipeline or the coherence protocol. We study 11 bugs in total. Our GP-based test generation approach finds all bugs consistently, therefore providing much higher guarantees compared to alternative approaches (pseudo-random test generation and litmus tests)

Crossref

Edinburgh Research Archive

Edinburgh Research Explorer

Design of Efficient TLB-based Data Classification Mechanisms in Chip Multiprocessors

Author: Esteve García Albert
Publication venue: 'Universitat Politecnica de Valencia'
Publication date: 01/09/2017
Field of study

Most of the data referenced by sequential and parallel applications running in current chip multiprocessors are referenced by a single thread, i.e., private. Recent proposals leverage this observation to improve many aspects of chip multiprocessors, such as reducing coherence overhead or the access latency to distributed caches. The effectiveness of those proposals depends to a large extent on the amount of detected private data. However, the mechanisms proposed so far either do not consider either thread migration or the private use of data within different application phases, or do entail high overhead. As a result, a considerable amount of private data is not detected. In order to increase the detection of private data, this thesis proposes a TLB-based mechanism that is able to account for both thread migration and private application phases with low overhead. Classification status in the proposed TLB-based classification mechanisms is determined by the presence of the page translation stored in other core's TLBs. The classification schemes are analyzed in multilevel TLB hierarchies, for systems with both private and distributed shared last-level TLBs. This thesis introduces a page classification approach based on inspecting other core's TLBs upon every TLB miss. In particular, the proposed classification approach is based on exchange and count of tokens. Token counting on TLBs is a natural and efficient way for classifying memory pages. It does not require the use of complex and undesirable persistent requests or arbitration, since when two ormore TLBs race for accessing a page, tokens are appropriately distributed classifying the page as shared. However, TLB-based ability to classify private pages is strongly dependent on TLB size, as it relies on the presence of a page translation in the system TLBs. To overcome that, different TLB usage predictors (UP) have been proposed, which allow a page classification unaffected by TLB size. Specifically, this thesis introduces a predictor that obtains system-wide page usage information by either employing a shared last-level TLB structure (SUP) or cooperative TLBs working together (CUP).La mayor parte de los datos referenciados por aplicaciones paralelas y secuenciales que se ejecutan enCMPs actuales son referenciadas por un único hilo, es decir, son privados. Recientemente, algunas propuestas aprovechan esta observación para mejorar muchos aspectos de los CMPs, como por ejemplo reducir el sobrecoste de la coherencia o la latencia de los accesos a cachés distribuidas. La efectividad de estas propuestas depende en gran medida de la cantidad de datos que son considerados privados. Sin embargo, los mecanismos propuestos hasta la fecha no consideran la migración de hilos de ejecución ni las fases de una aplicación. Por tanto, una cantidad considerable de datos privados no se detecta apropiadamente. Con el fin de aumentar la detección de datos privados, proponemos un mecanismo basado en las TLBs, capaz de reclasificar los datos a privado, y que detecta la migración de los hilos de ejecución sin añadir complejidad al sistema. Los mecanismos de clasificación en las TLBs se han analizado en estructuras de varios niveles, incluyendo TLBs privadas y con un último nivel de TLB compartido y distribuido. Esta tesis también presenta un mecanismo de clasificación de páginas basado en la inspección de las TLBs de otros núcleos tras cada fallo de TLB. De forma particular, el mecanismo propuesto se basa en el intercambio y el cuenteo de tokens (testigos). Contar tokens en las TLBs supone una forma natural y eficiente para la clasificación de páginas de memoria. Además, evita el uso de solicitudes persistentes o arbitraje alguno, ya que si dos o más TLBs compiten para acceder a una página, los tokens se distribuyen apropiadamente y la clasifican como compartida. Sin embargo, la habilidad de los mecanismos basados en TLB para clasificar páginas privadas depende del tamaño de las TLBs. La clasificación basada en las TLBs se basa en la presencia de una traducción en las TLBs del sistema. Para evitarlo, se han propuesto diversos predictores de uso en las TLBs (UP), los cuales permiten una clasificación independiente del tamaño de las TLBs. En concreto, esta tesis presenta un sistema mediante el que se obtiene información de uso de página a nivel de sistema con la ayuda de un nivel de TLB compartida (SUP) o mediante TLBs cooperando juntas (CUP).La major part de les dades referenciades per aplicacions paral·leles i seqüencials que s'executen en CMPs actuals són referenciades per un sol fil, és a dir, són privades. Recentment, algunes propostes aprofiten aquesta observació per a millorar molts aspectes dels CMPs, com és reduir el sobrecost de la coherència o la latència d'accés a memòries cau distribuïdes. L'efectivitat d'aquestes propostes depen en gran mesura de la quantitat de dades detectades com a privades. No obstant això, els mecanismes proposats fins a la data no consideren la migració de fils d'execució ni les fases d'una aplicació. Per tant, una quantitat considerable de dades privades no es detecta apropiadament. A fi d'augmentar la detecció de dades privades, aquesta tesi proposa un mecanisme basat en les TLBs, capaç de reclassificar les dades com a privades, i que detecta la migració dels fils d'execució sense afegir complexitat al sistema. Els mecanismes de classificació en les TLBs s'han analitzat en estructures de diversos nivells, incloent-hi sistemes amb TLBs d'últimnivell compartides i distribuïdes. Aquesta tesi presenta un mecanisme de classificació de pàgines basat en inspeccionar les TLBs d'altres nuclis després de cada fallada de TLB. Concretament, el mecanisme proposat es basa en l'intercanvi i el compte de tokens. Comptar tokens en les TLBs suposa una forma natural i eficient per a la classificació de pàgines de memòria. A més, evita l'ús de sol·licituds persistents o arbitratge, ja que si dues o més TLBs competeixen per a accedir a una pàgina, els tokens es distribueixen apropiadament i la classifiquen com a compartida. No obstant això, l'habilitat dels mecanismes basats en TLB per a classificar pàgines privades depenen de la grandària de les TLBs. La classificació basada en les TLBs resta en la presència d'una traducció en les TLBs del sistema. Per a evitar-ho, s'han proposat diversos predictors d'ús en les TLBs (UP), els quals permeten una classificació independent de la grandària de les TLBs. Específicament, aquesta tesi introdueix un predictor que obté informació d'ús de la pàgina a escala de sistema mitjançant un nivell de TLB compartida (SUP) or mitjançant TLBs cooperant juntes (CUP).Esteve García, A. (2017). Design of Efficient TLB-based Data Classification Mechanisms in Chip Multiprocessors [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/86136TESI

Crossref

RiuNet

Transactional memory for high-performance embedded systems

Author: Piatka Christian
Publication venue
Publication date: 07/08/2023
Field of study

The increasing demand for computational power in embedded systems, which is required for various tasks, such as autonomous driving, can only be achieved by exploiting the resources offered by modern hardware. Due to physical limitations, hardware manufacturers have moved to increase the number of cores per processor instead of further increasing clock rates. Therefore, in our view, the additionally required computing power can only be achieved by exploiting parallelism. Unfortunately writing parallel code is considered a difficult and complex task. Hardware Transactional Memories (HTMs) are a suitable tool to write sophisticated parallel software. However, HTMs were not specifically developed for embedded systems and therefore cannot be used without consideration. The use of conventional HTMs increases complexity and makes it more difficult to foresee implications with other important properties of embedded systems. This thesis therefore describes how an HTM for embedded systems could be implemented. The HTM was designed to allow the parallel execution of software and to offer functionality which is useful for embedded systems. Hereby the focus lay on: elimination of the typical limitations of conventional HTMs, several conflict resolution mechanisms, investigation of real time behavior, and a feature to conserve energy. To enable the desired functionalities, the structure of the HTM described in this work strongly differs from a conventional HTM. In comparison to the baseline HTM, which was also designed and implemented in this thesis, the biggest adaptation concerns the conflict detection. It was modified so that conflicts can be detected and resolved centrally. For this, the cache hierarchy as well as the cache coherence had to be adapted and partially extended. The system was implemented in the cycle-accurate gem5 simulator. The eight benchmarks of the STAMP benchmark suite were used for evaluation. The evaluation of the various functionalities shows that the mechanisms work and add value for the operation in embedded systems.Der immer größer werdende Bedarf an Rechenleistung in eingebetteten Systemen, der für verschiedene Aufgaben wie z. B. dem autonomen Fahren benötigt wird, kann nur durch die effiziente Nutzung der zur Verfügung stehenden Ressourcen erreicht werden. Durch physikalische Grenzen sind Prozessorhersteller dazu übergegangen, Prozessoren mit mehreren Prozessorkernen auszustatten, statt die Taktraten weiter anzuheben. Daher kann die zusätzlich benötigte Rechenleistung aus unserer Sicht nur durch eine Steigerung der Parallelität gelingen. Hardwaretransaktionsspeicher (HTS) erlauben es ihren Nutzern schnell und einfach parallele Programme zu schreiben. Allerdings wurden HTS nicht speziell für eingebettete Systeme entwickelt und sind daher nur eingeschränkt für diese nutzbar. Durch den Einsatz herkömmlicher HTS steigt die Komplexität und es wird somit schwieriger abzusehen, ob andere wichtige Eigenschaften erreicht werden können. Um den Einsatz von HTS in eingebettete Systeme besser zu ermöglichen, beschreibt diese Arbeit einen konkreten Ansatz. Der HTS wurde hierzu so entwickelt, dass er eine parallele Ausführung von Programmen ermöglicht und Eigenschaften besitzt, welche für eingebettete Systeme nützlich sind. Dazu gehören unter anderem: Wegfall der typischen Limitierungen herkömmlicher HTS, Einflussnahme auf den Konfliktauflösungsmechanismus, Unterstützung einer abschätzbaren Ausführung und eine Funktion, um Energie einzusparen. Um die gewünschten Funktionalitäten zu ermöglichen, unterscheidet sich der Aufbau des in dieser Arbeit beschriebenen HTS stark von einem klassischen HTS. Im Vergleich zu dem Referenz HTS, der ebenfalls im Rahmen dieser Arbeit entworfen und implementiert wurde, betrifft die größte Anpassung die Konflikterkennung. Sie wurde derart verändert, dass die Konflikte zentral erkannt und aufgelöst werden können. Hierfür mussten die Cache-Hierarchie und Cache-Kohärenz stark angepasst und teilweise erweitert werden. Das System wurde in einem taktgenauen Simulator, dem gem5-Simulator, umgesetzt. Zur Evaluation wurden die acht Benchmarks der STAMP-Benchmark-Suite eingesetzt. Die Evaluation der verschiedenen Funktionen zeigt, dass die Mechanismen funktionieren und somit einen Mehrwert für eingebettete Systeme bieten

OPUS Augsburg

Recommended from our members

Architectural support for message queue task parallelism

Author: Wu Qinzhe
Publication venue
Publication date: 04/01/2024
Field of study

The scaling of threads is an attractive way to exploit task-level parallelism and boost performance. From the perspective of software programming, many applications (e.g., network package processing, SQL queries) could be composite of a set of small tasks. Those tasks are arranged in a data flow graph and each task is undertaken by some threads. Message queues are often used to coordinate the tasks among the threads. On the other side, thread scaling is in favor of the hardware advancing trend that there are more Processing Elements (PE) in modern Chip Multiprocessors (CMP) than ever before. This is because single PE cannot simply run faster due to power and thermal limitations; instead architects have to use more transistors for increasing number of PEs, in order to improve the overall computing power of a processor. Unfortunately, this paradigm using message queues to drive parallel tasks sometime leads to diminishing performance returns due to issues lying in the architecture and system design. Particularly, the conventional coherent shared-memory architectures let task-parallel workloads suffer from unnecessary synchronization overhead and load-to-use latency. For instance, when passing messages through queues, multiple threads could contend for the exclusivity of the cacheline where the shared queue data structure stays. The more threads, the more severe the contention is, because every transition upgrading a cacheline from shared to exclusive state needs to invalidate more copies in the private caches of other cores, and waits for the acknowledgements from more cores. Such a overhead hurts the scalability of threads synchronizing via message queues. Adding to the coherence overhead, the load-to-use latency (from a consumer requesting data until the data being moved to the consumer to use) is often on the critical path, slowing down the computation. This is because the cache hierarchy in modern processors creates some layers of local storage to buffer data separately for different cores. Therefore, serving message queue data in an ondemand manner incurs longer load-to-use latency. It is also challenging to schedule message-driven tasks to use cores efficiently when arrival rate and service rate mismatch. It wastes CPU cycles if a runtime system leaves tasks blocked on full/empty message queues, while switching tasks has additional scheduling overheads. Diverse system topologies further complicate the problem, as the scheduling also needs to take data locality into consideration. This dissertation explores architectural supports for enhancing the scalability of message queue task parallelism, reducing the load-to-use latency, as well as avoiding blocking. Specifically, this dissertation designs and evaluates a message queue architecture that lowers the overhead of synchronization on shared queue states, a speculation technique to hide the load-to-use latency, as well as a locality-aware message queue runtime system with low overhead on scheduling and buffer resizing. The first contribution of the dissertation is Virtual-Link scalable message queue architecture (VL). Instead of having threads access the shared queue state variables (i.e., head, tail, or lock) atomically, VL provides configurable hardware support, providing both data transfer and synchronization. Unlike other hardware queue architectures with dedicated network, VL reuses the existing cache coherence network and delivers a virtualized channel as if there were a direct link (or route) between two arbitrary PEs. VL facilitates efficient synchronized data movement between M:N producers and consumers with several benefits: (i) the number of sharers on synchronization primitives is reduced to zero, eliminating a primary bottleneck of traditional lock-free queues, (ii) memory spills, snoops, and invalidations are reduced, (iii) data stays on the fast path (inside the interconnect) a majority of the time. Another contribution of the dissertation is SPAMeR speculation mechanism. SPAMeR has the capability to speculatively push messages in anticipation of consumer message requests. With the speculation, the latency of moving data from the source to the consumer that needs the data could be partially or fully overlapped with the message processing time. Unlike pre-fetch approaches which predict what addresses to fetch next, with a queue we know exactly what data is needed next but not when it is needed; SPAMeR proposes algorithms to learn from queue operation history in order to predict this. Finally the dissertation contributes ARMQ locality-aware runtime. ARMQ collects a set of approaches that avoids message queue blocking, ranging from the most general yielding, to dynamically resizing the buffer, and to spawning helper tasks. On one hand, ARMQ minimizes the overheads (e.g., wasteful polling, context switch, memory allocation and copying etc.) with a few techniques (e.g., userspace threading, chunk-based ringbuffer etc.) On the other hand, ARMQ schedules the message-driven tasks precisely and opportunely, in order to maximize the data locality preserved (in favor of cache) and balance the resource allocation.Electrical and Computer Engineerin

Texas ScholarWorks

Eureka: a distributed shared memory system based on the Lazy Data Merging consistency model

Author: Tavares Joao Alberto Vianna
Publication venue: Monterey, California. Naval Postgraduate School
Publication date: 01/12/1995
Field of study

Distributed Shared Memory (DSM) provides an abstraction of shared memory on a network of workstations. Problems with existing DSM systems are lack of portability due to compiler and/or operating system modification requirements, and reduced performance due to significant synchronization and communication costs when compared to their message passing counterparts (e.g., PVM and MPI). Our approach was to introduce a new DSM consistency model, Lazy Data Merging (LDM), which extends Data Merging (DM). LDM is optimized for software runtime implementations and differs from DM by 'lazily' placing data updates across the communication network only when they are required. It is our belief that LDM can significantly reduce communication costs, particularly for applications that make extensive use of locks. We have completed the design of "Eureka", a prototype DSM system that provides a software implementation of the LDM consistency model. To ensure portability and efficiency we use only standard UniXTM system calls and a publicly available software thread package, Cthreads, from the University of Utah. Furthermore, we have implemented and tested some of Eureka's core components, specifically, the set of communication and hybrid (Invalidate/Update) coherence primitives, which are essential for follow on work in building the complete DSM system. The question of efficiency is still an open problem, because we did not compare Eureka with other DSM implementations.http://archive.org/details/eurekadistribute1094535209NANABrazilian Navy author

Calhoun, Institutional Archive of the Naval Postgraduate School

Clock Synchronization for Moderm Multiprocessors

Author: André dos Santos Oliveira
Publication venue
Publication date: 21/07/2015
Field of study

Repositório Aberto da Universidade do Porto

Parallel and Distributed Computing

Author
Publication venue: 'IntechOpen'
Publication date: 20/04/2021
Field of study

The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. Particularly, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peertopeer networks, largescale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing

Directory of Open Access Books (DOAB)

True shared memory architecture for next-generation multi-GPU systems

Author: Mojumder Md Saiful Arefin
Publication venue
Publication date: 15/05/2021
Field of study

Machine learning (ML) is now omnipresent in all spheres of life. The use of deep neural networks (DNNs) for ML has gained popularity over the past few years. This is because DNNs are capable of efficiently solving complex problems such as image processing, object detection, language processing, etc. To train these DNN workloads, graphics process- ing units (GPUs) have become the most widely used platform. A GPU can support a large number of parallel threads that execute simultaneously to achieve a very high throughput. However, as the sizes of the DNN workloads grow, a single GPU is no longer adequate to provide fast training, and developers resort to using multi-GPU (MGPU) systems that can reduce the training time significantly. Consequently, to keep pace with the growth of DNN applications, GPU vendors are actively developing novel and efficient MGPU systems. To better understand the challenges associated with designing MGPU systems for DNN workloads, in this thesis, we first present our efforts to understand the behavior of the DNN workloads, in particular, the training of DNN workloads on MGPU systems. Using the DNN workloads as benchmarks, we observe the evolution of MGPU system architecture. Based on our profiling and characterization of DNN workloads on existing high-performance MGPU systems, we identify the computation- and communication- intensiveness of the DNN workloads and the hardware- and software-level inefficiencies present in the existing MGPU systems. We find that the data movement across multiple GPUs and high remote data access cost leading to NUMA effects, data duplication, and inefficient use of GPU memory leading to memory capacity issues, and the complexity in programming MGPUs pose serious limitations in the execution of ever-scaling DNN workloads on MGPU systems. To overcome the limitations of existing MGPU systems, we propose to unify the main memory of GPUs to design an MGPU system with true shared memory (MGPU-TSM). Our proposed MGPU-TSM system demonstrates a significant performance boost (3.8× for a 4 GPU system) over the best-performing existing MGPU system. This is because MGPU-TSM system eliminates the NUMA effects and the necessity for data duplication. To provide seamless data sharing across multiple GPUs and ease programming of MGPU- TSM, we propose a light-weight coherence protocol called MGCC. MGCC is a timestamp- based protocol that provides both intra- and inter-GPU coherence. We implement a number of hardware features including unified memory controller, request tracker and timestamp storage unit to support MGCC. Using both standard and synthetic stress benchmarks, we evaluate the MGPU-TSM system with MGCC leveraging sequential as well as relaxed consistency. Our evaluation of a 4-GPU system using MGPUSim simulator suggests that our proposed coherent MGPU system achieves up to 3.8× improved performance than current best-performing MGPU system while the stress tests performed using synthetic benchmarks suggests that MGCC leads to up to 46.1% performance overhead

Boston University Institutional Repository (OpenBU)

Distributed Implementation of eXtended Reality Technologies over 5G Networks

Author: González Morín Diego
Publication venue
Publication date: 15/06/2023
Field of study

Mención Internacional en el título de doctorThe revolution of Extended Reality (XR) has already started and is rapidly expanding as technology advances. Announcements such as Meta’s Metaverse have boosted the general interest in XR technologies, producing novel use cases. With the advent of the fifth generation of cellular networks (5G), XR technologies are expected to improve significantly by offloading heavy computational processes from the XR Head Mounted Display (HMD) to an edge server. XR offloading can rapidly boost XR technologies by considerably reducing the burden on the XR hardware, while improving the overall user experience by enabling smoother graphics and more realistic interactions. Overall, the combination of XR and 5G has the potential to revolutionize the way we interact with technology and experience the world around us. However, XR offloading is a complex task that requires state-of-the-art tools and solutions, as well as an advanced wireless network that can meet the demanding throughput, latency, and reliability requirements of XR. The definition of these requirements strongly depends on the use case and particular XR offloading implementations. Therefore, it is crucial to perform a thorough Key Performance Indicators (KPIs) analysis to ensure a successful design of any XR offloading solution. Additionally, distributed XR implementations can be intrincated systems with multiple processes running on different devices or virtual instances. All these agents must be well-handled and synchronized to achieve XR real-time requirements and ensure the expected user experience, guaranteeing a low processing overhead. XR offloading requires a carefully designed architecture which complies with the required KPIs while efficiently synchronizing and handling multiple heterogeneous devices. Offloading XR has become an essential use case for 5G and beyond 5G technologies. However, testing distributed XR implementations requires access to advanced 5G deployments that are often unavailable to most XR application developers. Conversely, the development of 5G technologies requires constant feedback from potential applications and use cases. Unfortunately, most 5G providers, engineers, or researchers lack access to cutting-edge XR hardware or applications, which can hinder the fast implementation and improvement of 5G’s most advanced features. Both technology fields require ongoing input and continuous development from each other to fully realize their potential. As a result, XR and 5G researchers and developers must have access to the necessary tools and knowledge to ensure the rapid and satisfactory development of both technology fields. In this thesis, we focus on these challenges providing knowledge, tools and solutiond towards the implementation of advanced offloading technologies, opening the door to more immersive, comfortable and accessible XR technologies. Our contributions to the field of XR offloading include a detailed study and description of the necessary network throughput and latency KPIs for XR offloading, an architecture for low latency XR offloading and our full end to end XR offloading implementation ready for a commercial XR HMD. Besides, we also present a set of tools which can facilitate the joint development of 5G networks and XR offloading technologies: our 5G RAN real-time emulator and a multi-scenario XR IP traffic dataset. Firstly, in this thesis, we thoroughly examine and explain the KPIs that are required to achieve the expected Quality of Experience (QoE) and enhanced immersiveness in XR offloading solutions. Our analysis focuses on individual XR algorithms, rather than potential use cases. Additionally, we provide an initial description of feasible 5G deployments that could fulfill some of the proposed KPIs for different offloading scenarios. We also present our low latency muti-modal XR offloading architecture, which has already been tested on a commercial XR device and advanced 5G deployments, such as millimeter-wave (mmW) technologies. Besides, we describe our full endto- end complex XR offloading system which relies on our offloading architecture to provide low latency communication between a commercial XR device and a server running a Machine Learning (ML) algorithm. To the best of our knowledge, this is one of the first successful XR offloading implementations for complex ML algorithms in a commercial device. With the goal of providing XR developers and researchers access to complex 5G deployments and accelerating the development of future XR technologies, we present FikoRE, our 5G RAN real-time emulator. FikoRE has been specifically designed not only to model the network with sufficient accuracy but also to support the emulation of a massive number of users and actual IP throughput. As FikoRE can handle actual IP traffic above 1 Gbps, it can directly be used to test distributed XR solutions. As we describe in the thesis, its emulation capabilities make FikoRE a potential candidate to become a reference testbed for distributed XR developers and researchers. Finally, we used our XR offloading tools to generate an XR IP traffic dataset which can accelerate the development of 5G technologies by providing a straightforward manner for testing novel 5G solutions using realistic XR data. This dataset is generated for two relevant XR offloading scenarios: split rendering, in which the rendering step is moved to an edge server, and heavy ML algorithm offloading. Besides, we derive the corresponding IP traffic models from the captured data, which can be used to generate realistic XR IP traffic. We also present the validation experiments performed on the derived models and their results.This work has received funding from the European Union (EU) Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie ETN TeamUp5G, grant agreement No. 813391.Programa de Doctorado en Multimedia y Comunicaciones por la Universidad Carlos III de Madrid y la Universidad Rey Juan CarlosPresidente: Narciso García Santos.- Secretario: Fernando Díaz de María.- Vocal: Aryan Kaushi

Universidad Carlos III de Madrid e-Archivo