497 research outputs found
Resource-aware scheduling for 2D/3D multi-/many-core processor-memory systems
This dissertation addresses the complexities of 2D/3D multi-/many-core processor-memory systems, focusing on two key areas: enhancing timing predictability in real-time multi-core processors and optimizing performance within thermal constraints. The integration of an increasing number of transistors into compact chip designs, while boosting computational capacity, presents challenges in resource contention and thermal management. The first part of the thesis improves timing predictability. We enhance shared cache interference analysis for set-associative caches, advancing the calculation of Worst-Case Execution Time (WCET). This development enables accurate assessment of cache interference and the effectiveness of partitioned schedulers in real-world scenarios. We introduce TCPS, a novel task and cache-aware partitioned scheduler that optimizes cache partitioning based on task-specific WCET sensitivity, leading to improved schedulability and predictability. Our research explores various cache and scheduling configurations, providing insights into their performance trade-offs. The second part focuses on thermal management in 2D/3D many-core systems. Recognizing the limitations of Dynamic Voltage and Frequency Scaling (DVFS) in S-NUCA many-core processors, we propose synchronous thread migrations as a thermal management strategy. This approach culminates in the HotPotato scheduler, which balances performance and thermal safety. We also introduce 3D-TTP, a transient temperature-aware power budgeting strategy for 3D-stacked systems, reducing the need for Dynamic Thermal Management (DTM) activation. Finally, we present 3QUTM, a novel method for 3D-stacked systems that combines core DVFS and memory bank Low Power Modes with a learning algorithm, optimizing response times within thermal limits. This research contributes significantly to enhancing performance and thermal management in advanced processor-memory systems
Adaptive Microarchitectural Optimizations to Improve Performance and Security of Multi-Core Architectures
With the current technological barriers, microarchitectural optimizations are increasingly important to ensure performance scalability of computing systems. The shift to multi-core architectures increases the demands on the memory system, and amplifies the role of microarchitectural optimizations in performance improvement. In a multi-core system, microarchitectural resources are usually shared, such as the cache, to maximize utilization but sharing can also lead to contention and lower performance. This can be mitigated through partitioning of shared caches.However, microarchitectural optimizations which were assumed to be fundamentally secure for a long time, can be used in side-channel attacks to exploit secrets, as cryptographic keys. Timing-based side-channels exploit predictable timing variations due to the interaction with microarchitectural optimizations during program execution. Going forward, there is a strong need to be able to leverage microarchitectural optimizations for performance without compromising security. This thesis contributes with three adaptive microarchitectural resource management optimizations to improve security and/or\ua0performance\ua0of multi-core architectures\ua0and a systematization-of-knowledge of timing-based side-channel attacks.\ua0We observe that to achieve high-performance cache partitioning in a multi-core system\ua0three requirements need to be met: i) fine-granularity of partitions, ii) locality-aware placement and iii) frequent changes. These requirements lead to\ua0high overheads for current centralized partitioning solutions, especially as the number of cores in the\ua0system increases. To address this problem, we present an adaptive and scalable cache partitioning solution (DELTA) using a distributed and asynchronous allocation algorithm. The\ua0allocations occur through core-to-core challenges, where applications with larger performance benefit will gain cache capacity. The\ua0solution is implementable in hardware, due to low computational complexity, and can scale to large core counts.According to our analysis, better performance can be achieved by coordination of multiple optimizations for different resources, e.g., off-chip bandwidth and cache, but is challenging due to the increased number of possible allocations which need to be evaluated.\ua0Based on these observations, we present a solution (CBP) for coordinated management of the optimizations: cache partitioning, bandwidth partitioning and prefetching.\ua0Efficient allocations, considering the inter-resource interactions and trade-offs, are achieved using local resource managers to limit the solution space.The continuously growing number of\ua0side-channel attacks leveraging\ua0microarchitectural optimizations prompts us to review attacks and defenses to understand the vulnerabilities of different microarchitectural optimizations. We identify the four root causes of timing-based side-channel attacks: determinism, sharing, access violation\ua0and information flow.\ua0Our key insight is that eliminating any of the exploited root causes, in any of the attack steps, is enough to provide protection.\ua0Based on our framework, we present a systematization of the attacks and defenses on a wide range of microarchitectural optimizations, which highlights their key similarities.\ua0Shared caches are an attractive attack surface for side-channel attacks, while defenses need to be efficient since the cache is crucial for performance.\ua0To address this issue, we present an adaptive and scalable cache partitioning solution (SCALE) for protection against cache side-channel attacks. The solution leverages randomness,\ua0and provides quantifiable and information theoretic security guarantees using differential privacy. The solution closes the performance gap to a state-of-the-art non-secure allocation policy for a mix of secure and non-secure applications
GME: GPU-based Microarchitectural Extensions to Accelerate Homomorphic Encryption
Fully Homomorphic Encryption (FHE) enables the processing of encrypted data
without decrypting it. FHE has garnered significant attention over the past
decade as it supports secure outsourcing of data processing to remote cloud
services. Despite its promise of strong data privacy and security guarantees,
FHE introduces a slowdown of up to five orders of magnitude as compared to the
same computation using plaintext data. This overhead is presently a major
barrier to the commercial adoption of FHE.
In this work, we leverage GPUs to accelerate FHE, capitalizing on a
well-established GPU ecosystem available in the cloud. We propose GME, which
combines three key microarchitectural extensions along with a compile-time
optimization to the current AMD CDNA GPU architecture. First, GME integrates a
lightweight on-chip compute unit (CU)-side hierarchical interconnect to retain
ciphertext in cache across FHE kernels, thus eliminating redundant memory
transactions. Second, to tackle compute bottlenecks, GME introduces special
MOD-units that provide native custom hardware support for modular reduction
operations, one of the most commonly executed sets of operations in FHE. Third,
by integrating the MOD-unit with our novel pipelined -bit integer
arithmetic cores (WMAC-units), GME further accelerates FHE workloads by .
Finally, we propose a Locality-Aware Block Scheduler (LABS) that exploits the
temporal locality available in FHE primitive blocks. Incorporating these
microarchitectural features and compiler optimizations, we create a synergistic
approach achieving average speedups of , , and
over Intel Xeon CPU, NVIDIA V100 GPU, and Xilinx FPGA
implementations, respectively
Classification of networks-on-chip in the context of analysis of promising self-organizing routing algorithms
This paper contains a detailed analysis of the current state of the
network-on-chip (NoC) research field, based on which the authors propose the
new NoC classification that is more complete in comparison with previous ones.
The state of the domain associated with wireless NoC is investigated, as the
transition to these NoCs reduces latency. There is an assumption that routing
algorithms from classical network theory may demonstrate high performance. So,
in this article, the possibility of the usage of self-organizing algorithms in
a wireless NoC is also provided. This approach has a lot of advantages
described in the paper. The results of the research can be useful for
developers and NoC manufacturers as specific recommendations, algorithms,
programs, and models for the organization of the production and technological
process.Comment: 10 p., 5 fig. Oral presentation on APSSE 2021 conferenc
Design and Implementation of High QoS 3D-NoC using Modified Double Particle Swarm Optimization on FPGA
One technique to overcome the exponential growth bottleneck is to increase the number of cores on a processor, although having too many cores might cause issues including chip overheating and communication blockage. The problem of the communication bottleneck on the chip is presently effectively resolved by networks-on-chip (NoC). A 3D stack of chips is now possible, thanks to recent developments in IC manufacturing techniques, enabling to reduce of chip area while increasing chip throughput and reducing power consumption. The automated process associated with mapping applications to form three-dimensional NoC architectures is a significant new path in 3D NoC research. This work proposes a 3D NoC partitioning approach that can identify the 3D NoC region that has to be mapped. A double particle swarm optimization (DPSO) inspired algorithmic technique, which may combine the characteristics having neighbourhood search and genetic architectures, also addresses the challenge of a particle swarm algorithm descending into local optimal solutions. Experimental evidence supports the claim that this hybrid optimization algorithm based on Double Particle Swarm Optimisation outperforms the conventional heuristic technique in terms of output rate and loss in energy. The findings demonstrate that in a network of the same size, the newly introduced router delivers the lowest loss on the longest path. Three factors, namely energy, latency or delay, and throughput, are compared between the suggested 3D mesh ONoC and its 2D version. When comparing power consumption between 3D ONoC and its electronic and 2D equivalents, which both have 512 IP cores, it may save roughly 79.9% of the energy used by the electronic counterpart and 24.3% of the energy used by the latter. The network efficiency of the 3D mesh ONoC is simulated by DPSO in a variety of configurations. The outcomes also demonstrate an increase in performance over the 2D ONoC. As a flexible communication solution, Network-On-Chips (NoCs) have been frequently employed in the development of multiprocessor system-on-chips (MPSoCs). By outsourcing their communication activities, NoCs permit on-chip Intellectual Property (IP) cores to communicate with one another and function at a better level. The important components in assigning application duties, distributing the work to the IPs, and coordinating communication among them are mapping and scheduling methods. This study aims to present an entirely advanced form of research in the area of 3D NoC mapping and scheduling applications, grouping the results according to various parameters and offering several suggestions for further research
IXIAM: ISA EXtension for Integrated Accelerator Management
During the last few years, hardware accelerators have been gaining popularity thanks to their ability to achieve higher performance and efficiency than classic general-purpose solutions. They are fundamentally shaping the current generations of Systems-on-Chip (SoCs), which are becoming increasingly heterogeneous. However, despite their widespread use, a standard, general solution to manage them while providing speed and consistency has not yet been found. Common methodologies rely on OS mediation and a mix of user-space and kernel-space drivers, which can be inefficient, especially for fine-grained tasks. This paper addresses these sources of inefficiencies by proposing an ISA eXtension for Integrated Accelerator Management (IXIAM), a cost-effective HW-SW framework to control a wide variety of accelerators in a standard way, and directly from the cores. The proposed instructions include reservation, work offloading, data transfer, and synchronization. They can be wrapped in a high-level software API or even integrated into a compiler. IXIAM features also a user-space interrupt mechanism to signal events directly to the user process. We implement it as a RISC-V extension in the gem5 simulator and demonstrate detailed support for complex accelerators, as well as the ability to specify sequences of memory transfers and computations directly from the ISA and with significantly lower overhead than driver-based schemes. IXIAM provides a performance advantage that is more evident for small and medium workloads, reaching around 90x in the best case. This way, we enlarge the set of workloads that would benefit from hardware acceleration
MemPool: A Scalable Manycore Architecture with a Low-Latency Shared L1 Memory
Shared L1 memory clusters are a common architectural pattern (e.g., in
GPGPUs) for building efficient and flexible multi-processing-element (PE)
engines. However, it is a common belief that these tightly-coupled clusters
would not scale beyond a few tens of PEs. In this work, we tackle scaling
shared L1 clusters to hundreds of PEs while supporting a flexible and
productive programming model and maintaining high efficiency. We present
MemPool, a manycore system with 256 RV32IMAXpulpimg "Snitch" cores featuring
application-tunable functional units. We designed and implemented an efficient
low-latency PE to L1-memory interconnect, an optimized instruction path to
ensure each PE's independent execution, and a powerful DMA engine and system
interconnect to stream data in and out. MemPool is easy to program, with all
the cores sharing a global view of a large, multi-banked, L1 scratchpad memory,
accessible within at most five cycles in the absence of conflicts. We provide
multiple runtimes to program MemPool at different abstraction levels and
illustrate its versatility with a wide set of applications. MemPool runs at 600
MHz (60 gate delays) in typical conditions (TT/0.80V/25{\deg}C) in 22 nm FDX
technology and achieves a performance of up to 229 GOPS or 192 GOPS/W with less
than 2% of execution stalls.Comment: 14 pages, 17 figures, 2 table
Towards Efficient Resource Allocation for Embedded Systems
Das Hauptthema ist die dynamische Ressourcenverwaltung in eingebetteten Systemen, insbesondere die Verwaltung von Rechenzeit und Netzwerkverkehr auf einem MPSoC. Die Idee besteht darin, eine Pipeline für die Verarbeitung von Mobiler Kommunikation auf dem Chip dynamisch zu schedulen, um die Effizienz der Hardwareressourcen zu verbessern, ohne den Ressourcenverbrauch des dynamischen Schedulings dramatisch zu erhöhen. Sowohl Software- als auch Hardwaremodule werden auf Hotspots im Ressourcenverbrauch untersucht und optimiert, um diese zu entfernen. Da Applikationen im Bereich der Signalverarbeitung normalerweise mit Hilfe von SDF-Diagrammen beschrieben werden können, wird deren dynamisches Scheduling optimiert, um den Ressourcenverbrauch gegenüber dem üblicherweise verwendeten statischen Scheduling zu verbessern. Es wird ein hybrider dynamischer Scheduler vorgestellt, der die Vorteile von Processing-Networks und der Planung von Task-Graphen kombiniert. Es ermöglicht dem Scheduler, ein Gleichgewicht zwischen der Parallelisierung der Berechnung und der Zunahme des dynamischen Scheduling-Aufands optimal abzuwägen. Der resultierende dynamisch erstellte Schedule reduziert den Ressourcenverbrauch um etwa 50%, wobei die Laufzeit im Vergleich zu einem statischen Schedule nur um 20% erhöht wird. Zusätzlich wird ein verteilter dynamischer SDF-Scheduler vorgeschlagen, der das Scheduling in verschiedene Teile zerlegt, die dann zu einer Pipeline verbunden werden, um mehrere parallele Prozessoren einzubeziehen. Jeder Scheduling-Teil wird zu einem Cluster mit Load-Balancing erweitert, um die Anzahl der parallel laufenden Scheduling-Jobs weiter zu erhöhen. Auf diese Weise wird dem vorhandene Engpass bei dem dynamischen Scheduling eines zentralisierten Schedulers entgegengewirkt, sodass 7x mehr Prozessoren mit dem Pipelined-Clustered-Dynamic-Scheduler für eine typische Signalverarbeitungsanwendung verwendet werden können.
Das neue dynamische Scheduling-System setzt das Vorhandensein von drei verschiedenen Kommunikationsmodi zwischen den Verarbeitungskernen voraus. Bei der Emulation auf Basis des häufig verwendeten RDMA-Protokolls treten Leistungsprobleme auf. Sehr gut kann RDMA für einmalige Punkt-zu-Punkt-Datenübertragungen verwendet werden, wie sie bei der Ausführung von Task-Graphen verwendet werden. Process-Networks verwenden normalerweise Datenströme mit hohem Volumen und hoher Bandbreite. Es wird eine FIFO-basierte Kommunikationslösung vorgestellt, die einen zyklischen Puffer sowohl im Sender als auch im Empfänger implementiert, um diesen Bedarf zu decken. Die Pufferbehandlung und die Datenübertragung zwischen ihnen erfolgen ausschließlich in Hardware, um den Software-Overhead aus der Anwendung zu entfernen. Die Implementierung verbessert die Zugriffsverwaltung mehrerer Nutzer auf flächen-effiziente Single-Port Speichermodule. Es werden 0,8 der theoretisch möglichen Bandbreite, die normalerweise nur mit flächenmäßig teureren Dual-Port-Speichern erreicht wird. Der dritte Kommunikationsmodus definiert eine einfache Message-Passing-Implementierung, die ohne einen Verbindungszustand auskommt. Dieser Modus wird für eine effiziente prozessübergreifende Kommunikation des verteilten Scheduling-Systems und der engen Ansteuerung der restlichen Prozessoren benötigt. Eine Flusskontrolle in Hardware stellt sicher, dass eine große Anzahl von Sendern Nachrichten an denselben Empfänger senden kann. Dabei wird garantiert, dass alle Nachrichten korrekt empfangen werden, ohne dass eine Verbindung hergestellt werden muss und die Nachrichtenlaufzeit gering bleibt.
Die Arbeit konzentriert sich auf die Optimierung des Codesigns von Hardware und Software, um die kompromisslose Ressourceneffizienz der dynamischen SDF-Graphen-Planung zu erhöhen. Besonderes Augenmerk wird auf die Abhängigkeiten zwischen den Ebenen eines verteilten Scheduling-Systems gelegt, das auf der Verfügbarkeit spezifischer hardwarebeschleunigter Kommunikationsmethoden beruht.:1 Introduction
1.1 Motivation
1.2 The Multiprocessor System on Chip Architecture
1.3 Concrete MPSoC Architecture
1.4 Representing LTE/5G baseband processing as Static Data Flow
1.5 Compuation Stack
1.6 Performance Hotspots Addressed
1.7 State of the Art
1.8 Overview of the Work
2 Hybrid SDF Execution
2.1 Addressed Performance Hotspot
2.2 State of the Art
2.3 Static Data Flow Graphs
2.4 Runtime Environment
2.5 Overhead of Deloying Tasks to a MPSoC
2.6 Interpretation of SDF Graphs as Task Graphs
2.7 Interpreting SDF Graphs as Process Networks
2.8 Hybrid Interpretation
2.9 Graph Topology Considerations
2.10 Theoretic Impact of Hybrid Interpretation
2.11 Simulating Hybrid Execution
2.12 Pipeline SDF Graph Example
2.13 Random SDF Graphs
2.14 LTE-like SDF Graph
2.15 Key Lernings
3 Distribution of Management
3.1 Addressed Performance Hotspot
3.2 State of the Art
3.3 Revising Deployment Overhead
3.4 Distribution of Overhead
3.5 Impact of Management Distribution to Resource Utilization
3.6 Reconfigurability
3.7 Key Lernings
4 Sliced FIFO Hardware
4.1 Addressed Performance Hotspot
4.2 State of the Art
4.3 System Environment
4.4 Sliced Windowed FIFO buffer
4.5 Single FIFO Evaluation
4.6 Multiple FIFO Evalutaion
4.7 Hardware Implementation
4.8 Key Lernings
5 Message Passing Hardware
5.1 Addressed Performance Hotspot
5.2 State of the Art
5.3 Message Passing Regarded as Queueing
5.4 A Remote Direct Memory Access Based Implementation
5.5 Hardware Implementation Concept
5.6 Evalutation of Performance
5.7 Key Lernings
6 SummaryThe main topic is the dynamic resource allocation in embedded systems, especially the allocation of computing time and network traffic on an multi processor system on chip (MPSoC). The idea is to dynamically schedule a mobile communication signal processing pipeline on the chip to improve hardware resource efficiency while not dramatically improve resource consumption because of dynamic scheduling overhead. Both software and hardware modules are examined for resource consumption hotspots and optimized to remove them. Since signal processing can usually be described with the help of static data flow (SDF) graphs, the dynamic handling of those is optimized to improve resource consumption over the commonly used static scheduling approach. A hybrid dynamic scheduler is presented that combines benefits from both processing networks and task graph scheduling. It allows the scheduler to optimally balance parallelization of computation and addition of dynamic scheduling overhead. The resulting dynamically created schedule reduces resource consumption by about 50%, with a runtime increase of only 20% compared to a static schedule. Additionally, a distributed dynamic SDF scheduler is proposed that splits the scheduling into different parts, which are then connected to a scheduling pipeli ne to incorporate multiple parallel working processors. Each scheduling stage is reworked into a load-balanced cluster to increase the number of parallel scheduling jobs further. This way, the still existing dynamic scheduling bottleneck of a centralized scheduler is widened, allowing handling 7x more processors with the pipelined, clustered dynamic scheduler for a typical signal processing application.
The presented dynamic scheduling system assumes the presence of three different communication modes between the processing cores. When emulated on top of the commonly used remote direct memory access (RDMA) protocol, performance issues are encountered. Firstly, RDMA can neatly be used for single-shot point-to-point data transfers, like used in task graph scheduling. Process networks usually make use of high-volume and high-bandwidth data streams. A first in first out (FIFO) communication solution is presented that implements a cyclic buffer on both sender and receiver to serve this need. The buffer handling and data transfer between them are done purely in hardware to remove software overhead from the application. The implementation improves the multi-user access to area-efficient single port on-chip memory modules. It achieves 0.8 of the theoretically possible bandwidth, usually only achieved with area expensive dual-port memories. The third communication mode defines a lightweight message passing (MP) implementation that is truly connectionless. It is needed for efficient inter-process communication of the distributed and clustered scheduling system and the worker processing units’ tight coupling. A hardware flow control assures that an arbitrary number of senders can spontaneously start sending messages to the same receiver. Yet, all messages are guaranteed to be correctly received while eliminating the need for connection establishment and keeping a low message delay.
The work focuses on the hardware-software codesign optimization to increase the uncompromised resource efficiency of dynamic SDF graph scheduling. Special attention is paid to the inter-level dependencies in developing a distributed scheduling system, which relies on the availability of specific hardwareaccelerated communication methods.:1 Introduction
1.1 Motivation
1.2 The Multiprocessor System on Chip Architecture
1.3 Concrete MPSoC Architecture
1.4 Representing LTE/5G baseband processing as Static Data Flow
1.5 Compuation Stack
1.6 Performance Hotspots Addressed
1.7 State of the Art
1.8 Overview of the Work
2 Hybrid SDF Execution
2.1 Addressed Performance Hotspot
2.2 State of the Art
2.3 Static Data Flow Graphs
2.4 Runtime Environment
2.5 Overhead of Deloying Tasks to a MPSoC
2.6 Interpretation of SDF Graphs as Task Graphs
2.7 Interpreting SDF Graphs as Process Networks
2.8 Hybrid Interpretation
2.9 Graph Topology Considerations
2.10 Theoretic Impact of Hybrid Interpretation
2.11 Simulating Hybrid Execution
2.12 Pipeline SDF Graph Example
2.13 Random SDF Graphs
2.14 LTE-like SDF Graph
2.15 Key Lernings
3 Distribution of Management
3.1 Addressed Performance Hotspot
3.2 State of the Art
3.3 Revising Deployment Overhead
3.4 Distribution of Overhead
3.5 Impact of Management Distribution to Resource Utilization
3.6 Reconfigurability
3.7 Key Lernings
4 Sliced FIFO Hardware
4.1 Addressed Performance Hotspot
4.2 State of the Art
4.3 System Environment
4.4 Sliced Windowed FIFO buffer
4.5 Single FIFO Evaluation
4.6 Multiple FIFO Evalutaion
4.7 Hardware Implementation
4.8 Key Lernings
5 Message Passing Hardware
5.1 Addressed Performance Hotspot
5.2 State of the Art
5.3 Message Passing Regarded as Queueing
5.4 A Remote Direct Memory Access Based Implementation
5.5 Hardware Implementation Concept
5.6 Evalutation of Performance
5.7 Key Lernings
6 Summar
DynAMO: Improving parallelism through dynamic placement of atomic memory operations
With increasing core counts in modern multi-core designs, the overhead of synchronization jeopardizes the scalability and efficiency of parallel applications. To mitigate these overheads, modern cache-coherent protocols offer support for Atomic Memory Operations (AMOs) that can be executed near-core (near) or remotely in the on-chip memory hierarchy (far).
This paper evaluates current available static AMO execution policies implemented in multi-core Systems-on-Chip (SoC) designs, which select AMOs' execution placement (near or far) based on the cache block coherence state. We propose three static policies and show that the performance of static policies is application dependent. Moreover, we show that one of our proposed static policies outperforms currently available implementations.
Furthermore, we propose DynAMO, a predictor that selects the best location to execute the AMOs. DynAMO identifies the different locality patterns to make informed decisions, improving AMO latency and increasing overall throughput. DynAMO outperforms the best-performing static policy and provides geometric mean speed-ups of 1.09× across all workloads and 1.31× on AMO-intensive applications with respect to executing all AMOs near.This research was supported by the Spanish Ministry of Science and Innovation (MCIN) through contracts [PID2019-107255GB-C21], [TED2021-132634A-I00], and [PID2019-105660RB-C21]; the Generalitat of Catalunya through contract [2021-SGR-00763]; the Government of Aragon [T5820R]; the Arm-BSC Center of Excellence, and the European Processor Initiative (EPI) which is part of the European Union’s Horizon 2020 research and innovation program under grant agreement No. 826647. V. Soria-Pardos has been supported through an FPU fellowship [FPU20-02132]; A. Armejach is a Serra Hunter Fellow and has been partially supported by the Grant [IJCI-2017-33945] funded by MCIN/AEI/10.13039/501100011033; M. Moreto through a Ramón y Cajal fellowship [RYC-2016-21104].Peer ReviewedPostprint (author's final draft
- …