4 research outputs found
Recommended from our members
Achieving Accurate Predictions of Future Events Under Hardware Heterogeneity
Heterogeneous hardware is becoming increasingly available in modern hardware, while research breakthroughs enforce the expectation that heterogeneity will keep increasing in the future. Significant gains can be achieved via appropriate utilization of heterogeneity, in terms of performance and power consumption, however, poor utilization can have a detrimental effect. Intelligent scheduling and resource management is a crucial challenge we need to overcome in order to harvest the full potential of heterogeneous hardware. As systems become larger and include greater levels of hardware diversity, the importance of intelligent scheduling and resource management is further accentuated.This dissertation presents techniques that aid the process of scheduling and resource management in the presence of heterogeneous hardware, via accurately predicting upcoming runtime events. With a proactive and accurate view of the near future, schedulers can utilize the underlying hardware more efficiently, and fully take advantage of the available benefits.By adapting a majority element heuristic, this dissertation significantly improves the accuracy of predicting memory addresses about to be accessed, while reducing prediction-related costs by a factor of ten thousand compared to previously proposed predictive approaches. Coupled with novel microarchitectural modifications, accurate address predictions are shown to improve the performance of heterogeneous memory architectures.Machine learning-based performance predictors are further presented, capable of predicting a program's performance when executed on a given general-purpose core. Trained to model the subtleties of the interaction between hardware and software, these predictors are capable of generating highly accurate predictions even for cores with varied Instruction Set Architectures. Utilizing these performance predictions for job scheduling, is shown to improve overall system performance. The trained predictors are further examined and interpreted in order to visualize the correlations between features picked up and amplified during training.Finally, this dissertation demonstrates that scheduling algorithms cannot guarantee deriving an optimal schedule during realistic execution scenarios due to the underlying hardware heterogeneity, the wide range of runtime requirements of software, as well as prediction error from performance predictors. In response, deep neural networks are trained to select one scheduling approach from a list of options with varied overheads and correctness guarantees. The scheduling approach chosen, is the one which will most likely return the highest-performance schedule with the lowest overhead, given a particular instance of the job-to-core assignment problem
Effizientes Programmiermodell fĂĽr OpenMP auf einem Cluster-basierten Many-Core-System
Da die Komplexität „System-on-Chip“ (SoC) auch weiterhin zunimmt, wird man die Herausforderungen aufgrund der Konvergenz der Software- und Hardwareentwicklung nicht ignorieren können. Dies gilt auch für den Umgang mit dem hierarchischen Design, in dem die Prozessorkerne in Clustern oder sogenannten „Tiles“ angeordnet werden, um mittels eines schnellen lokalen Speicherzugriffs eine geringe Latenz und eine hohe Bandbreite der lokalen Kommunikation zu gewährleisten. Aus der Sicht eines Programmierers ist es wünschenswert, sich diese Eigenheiten der Hardware zunutze zu machen und sie bei der Ausgestaltung der abstrakten Parallel-Programmierung gewissenhaft und zielführend zu berücksichtigen.
Diese Dissertation überwindet viele Engpässe in Bezug auf die Skalierbarkeit Cluster-basierter Many-Core-Systeme und führt das Programmiermodell OpenMP zur Vereinfachung der Anwendungsentwicklung ein. OpenMP abstrahiert von der Sichtweise des Programmierers – und es werden Richtlinien eingeführt, mit denen Schleifen in Programmsequenzen eingeteilt werden, als Basis für die parallele Programmierung.
In dieser Arbeit wird das OpenMP-Modell bespielhaft in einem konkreten Cluster-basierten Many-Core-System umgesetzt; dem Intel Single-Chip Cloud Computer (SCC). Es wird eine schlanke und hoch-optimierte Laufzeitschicht fĂĽr die AusfĂĽhrung von OpenMP sowie ein Speichermodell vorgestellt. Auf Basis dieser Laufzeitschicht wird der parallele Code automatisch von einem nativen Backend-Compiler (GCC 4.6) erzeugt, der mit der Laufzeitbibliothek verknĂĽpft ist.
Im Rahmen der Arbeit wird auf einen effizienten Designansatz für die OpenMP-Programmierung eingegangen, wobei der Intel SCC als Beispiel für Cluster-basierte Systeme zum Einsatz kommt. In nicht-Cache-kohärenten Systemen dient die SCC OpenMP Laufzeitbibliothek primär dazu, die folgenden Herausforderungen zu bewältigen:
1. Die AusfĂĽhrung von unmodifizierten, bestehenden OpenMP Programmen auf solchen Systemen.
2. Die Portierung des OpenMP-Speichermodells auf den SCC.
3. Die Synchronisation der parallelen Threads, auf die ein beträchtlicher Anteil der Ausführungszeit einer Anwendung entfällt.
Eine Reihe weiterer Beispiele, basierend auf verschiedenen gebräuchlichen Kernen und realen Anwendungen, untermauert die Tauglichkeit von OpenMP – und eine Reihe von Experimenten zeigt, wie dieses Modell zu einer deutlichen Beschleunigung (bis zu 48-fach) in verschiedenen parallelen Anwendungen führt.As the complexity of systems-on-chip (SoCs) continues to increase, it is no longer possible to ignore the challenges caused by the convergence of software and hardware development. This involves attempts to deal with the hierarchical design – in which several cores are grouped in clusters or tiles – to ensure low-latency, high-bandwidth local communication by relying on fast local memories. From a programmer’s perspec- tive, it is desirable to make use of these peculiarities of the hardware, which must be clearly and carefully taken into account when designing the support for high-level parallel programming models.
This dissertation overcomes many scalability bottlenecks in cluster-based many-core systems and introduces the OpenMP programming model as a means of simplifying application development. OpenMP represents an abstraction of the programmer’s view by providing abundant directives that decompose loops in sequential programs and lead to parallel programs.
In this work, the full OpenMP model is implemented on a specific instance of a cluster-based many-core system: the Intel Single-chip Cloud Computer (SCC). In this thesis, a lightweight and highly optimized runtime layer for OpenMP execution and memory model by generating the parallel code that is automatically compiled by native back-end compiler (GCC 4.6) that linked with the runtime library.
In this dissertation, I will address an efficient design approach of the OpenMP pro- gramming model for the Intel SCC as an example for cluster-based systems. The SCC OpenMP runtime library is designed to cope with three main challenges in a non-cache coherent system:
1. Executing unmodified legacy OpenMP programs on such system.
2. Landing OpenMP memory model on the SCC.
3. Synchronization in the work of parallel threads accounts for a sizeable fraction of an application’s execution time.
Furthermore, the effectiveness of OpenMP is demonstrated on a set of widely used kernels and real-world applications. An extensive set of experiments shows how this model achieves significant parallel speedups up to 48x in several applications
Towards Scalable OLTP Over Fast Networks
Online Transaction Processing (OLTP) underpins real-time data processing in many mission-critical applications, from banking to e-commerce.
These applications typically issue short-duration, latency-sensitive transactions that demand immediate processing.
High-volume applications, such as Alibaba's e-commerce platform, achieve peak transaction rates as high as 70 million transactions per second, exceeding the capacity of a single machine.
Instead, distributed OLTP database management systems (DBMS) are deployed across multiple powerful machines.
Historically, such distributed OLTP DBMSs have been primarily designed to avoid network communication, a paradigm largely unchanged since the 1980s.
However, fast networks challenge the conventional belief that network communication is the main bottleneck.
In particular, emerging network technologies, like Remote Direct Memory Access (RDMA), radically alter how data can be accessed over a network.
RDMA's primitives allow direct access to the memory of a remote machine within an order of magnitude of local memory access.
This development invalidates the notion that network communication is the primary bottleneck.
Given that traditional distributed database systems have been designed with the premise that the network is slow, they cannot efficiently exploit these fast network primitives, which requires us to reconsider how we design distributed OLTP systems.
This thesis focuses on the challenges RDMA presents and its implications on the design of distributed OLTP systems.
First, we examine distributed architectures to understand data access patterns and scalability in modern OLTP systems.
Drawing on these insights, we advocate a distributed storage engine optimized for high-speed networks.
The storage engine serves as the foundation of a database, ensuring efficient data access through three central components: indexes, synchronization primitives, and buffer management (caching).
With the introduction of RDMA, the landscape of data access has undergone a significant transformation.
This requires a comprehensive redesign of the storage engine components to exploit the potential of RDMA and similar high-speed network technologies.
Thus, as the second contribution, we design RDMA-optimized tree-based indexes — especially applicable for disaggregated databases to access remote data efficiently.
We then turn our attention to the unique challenges of RDMA.
One-sided RDMA, one of the network primitives introduced by RDMA, presents a performance advantage in enabling remote memory access while bypassing the remote CPU and the operating system.
This allows the remote CPU to process transactions uninterrupted, with no requirement to be on hand for network communication. However, that way, specialized one-sided RDMA synchronization primitives are required since traditional CPU-driven primitives are bypassed.
We found that existing RDMA one-sided synchronization schemes are unscalable or, even worse, fail to synchronize correctly, leading to hard-to-detect data corruption.
As our third contribution, we address this issue by offering guidelines to build scalable and correct one-sided RDMA synchronization primitives.
Finally, recognizing that maintaining all data in memory becomes economically unattractive, we propose a distributed buffer manager design that efficiently utilizes cost-effective NVMe flash storage.
By leveraging low-latency RDMA messages, our buffer manager provides a transparent memory abstraction, accessing the aggregated DRAM and NVMe storage across nodes.
Central to our approach is a distributed caching protocol that dynamically caches data.
With this approach, our system can outperform RDMA-enabled in-memory distributed databases while managing larger-than-memory datasets efficiently