Resource-aware scheduling for 2D/3D multi-/many-core processor-memory systems
This dissertation addresses the complexities of 2D/3D multi-/many-core processor-memory systems, focusing on two key areas: enhancing timing predictability in real-time multi-core processors and optimizing performance within thermal constraints. The integration of an increasing number of transistors into compact chip designs, while boosting computational capacity, presents challenges in resource contention and thermal management. The first part of the thesis improves timing predictability. We enhance shared cache interference analysis for set-associative caches, advancing the calculation of Worst-Case Execution Time (WCET). This development enables accurate assessment of cache interference and the effectiveness of partitioned schedulers in real-world scenarios. We introduce TCPS, a novel task and cache-aware partitioned scheduler that optimizes cache partitioning based on task-specific WCET sensitivity, leading to improved schedulability and predictability. Our research explores various cache and scheduling configurations, providing insights into their performance trade-offs. The second part focuses on thermal management in 2D/3D many-core systems. Recognizing the limitations of Dynamic Voltage and Frequency Scaling (DVFS) in S-NUCA many-core processors, we propose synchronous thread migrations as a thermal management strategy. This approach culminates in the HotPotato scheduler, which balances performance and thermal safety. We also introduce 3D-TTP, a transient temperature-aware power budgeting strategy for 3D-stacked systems, reducing the need for Dynamic Thermal Management (DTM) activation. Finally, we present 3QUTM, a novel method for 3D-stacked systems that combines core DVFS and memory bank Low Power Modes with a learning algorithm, optimizing response times within thermal limits. This research contributes significantly to enhancing performance and thermal management in advanced processor-memory systems
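The idea of partitioning a shared cache according to each task's WCET sensitivity can be illustrated with a small sketch. This is not the TCPS algorithm from the dissertation, whose details the abstract does not give. It is a hypothetical greedy heuristic: the function name `allocate_ways`, the per-task WCET tables, and the rule "give the next way to the task whose WCET drops the most" are all assumptions made for illustration.

```python
def allocate_ways(wcet, total_ways):
    """Greedily hand out cache ways (hypothetical heuristic, not TCPS itself).

    wcet[task][w - 1] is the assumed WCET of `task` when it owns w cache ways.
    Every task starts with one way; each spare way goes to the task whose
    WCET would shrink the most by receiving it.
    """
    alloc = {task: 1 for task in wcet}
    spare = total_ways - len(wcet)
    if spare < 0:
        raise ValueError("not enough ways to give every task one way")
    for _ in range(spare):
        best, best_gain = None, 0
        for task, table in wcet.items():
            w = alloc[task]
            if w < len(table):
                gain = table[w - 1] - table[w]  # WCET saved by one more way
                if gain > best_gain:
                    best, best_gain = task, gain
        if best is None:  # no task benefits from additional ways
            break
        alloc[best] += 1
    return alloc


# Made-up WCET tables (arbitrary time units) for three tasks.
wcet_tables = {
    "A": [100, 60, 50, 48],
    "B": [80, 78, 77, 77],
    "C": [120, 90, 70, 65],
}
allocation = allocate_ways(wcet_tables, total_ways=8)
```

In this made-up example, task B is barely cache-sensitive and keeps a single way, while the spare ways go to A and C, whose WCET drops sharply with more cache.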
LIPIcs, Volume 251, ITCS 2023, Complete Volume
MGG: Accelerating Graph Neural Networks with Fine-grained intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms
The increasing size of input graphs for graph neural networks (GNNs)
highlights the demand for using multi-GPU platforms. However, existing
multi-GPU GNN systems optimize the computation and communication individually
based on the conventional practice of scaling dense DNNs. For irregularly
sparse and fine-grained GNN workloads, such solutions miss the opportunity to
jointly schedule/optimize the computation and communication operations for
high-performance delivery. To this end, we propose MGG, a novel system design
to accelerate full-graph GNNs on multi-GPU platforms. The core of MGG is its
novel dynamic software pipeline to facilitate fine-grained
computation-communication overlapping within a GPU kernel. Specifically, MGG
introduces GNN-tailored pipeline construction and GPU-aware pipeline mapping to
facilitate workload balancing and operation overlapping. MGG also incorporates
an intelligent runtime design with analytical modeling and optimization
heuristics to dynamically improve the execution performance. Extensive
evaluation reveals that MGG outperforms state-of-the-art full-graph GNN systems
across various settings: on average 4.41X, 4.81X, and 10.83X faster than DGL,
MGG-UVM, and ROC, respectively
Guided rewriting and constraint satisfaction for parallel GPU code generation
Graphics Processing Units (GPUs) are notoriously hard to optimise for manually due to their scheduling and memory hierarchies. What is needed are good automatic code generators and optimisers for such parallel hardware. Functional approaches such as Accelerate, Futhark and LIFT leverage a high-level algorithmic Intermediate Representation (IR) to expose parallelism and abstract the implementation details away from the user. However, producing efficient code for a given accelerator remains challenging. Existing code generators depend either on user input to choose a subset of hard-coded optimisations or on automated exploration of the implementation search space. The former suffers from a lack of extensibility, while the latter is too costly due to the size of the search space. A hybrid approach is needed, where a space of valid implementations is built automatically and explored with the aid of human expertise.
This thesis presents a solution combining user-guided rewriting and automatically generated constraints to produce high-performance code. The first contribution is an automatic tuning technique to find a balance between performance and memory consumption. Leveraging its functional patterns, the LIFT compiler is empowered to infer tuning constraints and limit the search to valid tuning combinations only.
Next, the thesis reframes parallelisation as a constraint satisfaction problem. Parallelisation constraints are extracted automatically from the input expression, and a solver is used to identify valid rewritings. By capturing the scheduling restrictions of the GPU in the context of a given program, the constraints restrict the search space to valid parallel mappings only. A synchronisation barrier insertion technique is proposed to prevent data races and improve the efficiency of the generated parallel mappings.
The final contribution of this thesis is the guided rewriting method, where the user encodes a design space of structural transformations using high-level IR nodes called rewrite points. These strongly typed pragmas express macro rewrites and expose design choices as explorable parameters. The thesis proposes a small set of reusable rewrite points to achieve tiling, cache locality, data reuse and memory optimisation.
A comparison with the vendor-provided handwritten kernel ARM Compute Library and the TVM code generator demonstrates the effectiveness of this thesis' contributions. With convolution as a use case, LIFT-generated direct and GEMM-based convolution implementations are shown to perform on par with the state-of-the-art solutions on a mobile GPU. Overall, this thesis demonstrates that a functional IR yields well to user-guided and automatic rewriting for high-performance code generation
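Treating parallelisation as a constraint satisfaction problem can be sketched in a few lines. The example below is a hypothetical simplification, not the thesis' constraint system or solver. It assumes a one-dimensional nest of maps, three made-up level names, and two illustrative constraints, and it uses brute-force enumeration in place of a real solver.

```python
from itertools import product

# Hypothetical parallelism levels a map in the nest can be assigned to.
LEVELS = ("workgroup", "local", "sequential")


def valid(mapping):
    """Check one assignment; mapping[i] is the level of the i-th map, outermost first."""
    # Assumed constraint 1: each parallel level is used at most once in the nest.
    if mapping.count("workgroup") > 1 or mapping.count("local") > 1:
        return False
    # Assumed constraint 2: a local (per-thread) map must sit inside a workgroup map.
    if "local" in mapping:
        if "workgroup" not in mapping:
            return False
        if mapping.index("workgroup") > mapping.index("local"):
            return False
    return True


def valid_mappings(depth):
    """Enumerate every assignment for a nest of `depth` maps and keep the valid ones."""
    return [m for m in product(LEVELS, repeat=depth) if valid(m)]


mappings = valid_mappings(2)
```

Even for two nested maps, the constraints rule out more than half of the nine candidate assignments, which is the kind of pruning the abstract attributes to its automatically extracted constraints.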
Autonomous Rock Instance Segmentation for Extra-Terrestrial Robotic Missions
The collection and analysis of extra-terrestrial matter are two of the main
motivations for space exploration missions. Due to the inherent risks for
participating astronauts during space missions, autonomous robotic systems are
often considered as a promising alternative. In recent years, many
(inter)national space missions containing rovers to explore celestial bodies
have been launched. Hereby, the communication delay as well as limited
bandwidth creates a need for highly self-governed agents that require only
infrequent interaction with scientists at a ground station. Such a setting is
explored in the ARCHES mission, which seeks to investigate different means of
collaboration between scientists and autonomous robots in extra-terrestrial
environments. The analog mission focuses on a team of heterogeneous agents
(two Lightweight Rover Units and ARDEA, a drone), which together perform
various complex tasks under strict communication constraints. In this paper,
we highlight three of these tasks that were successfully demonstrated during a
one-month test mission on Mt. Etna in Sicily, Italy, which was chosen due to
its similarity to the Moon in terms of geological structure. All three tasks
have in common that they leverage an instance segmentation approach deployed
on the rovers to detect rocks within camera imagery. The first application is
a mapping scheme that incorporates semantically detected rocks into its
environment model to safely navigate to points of interest. Secondly, we
present a method for the collection and extraction of in-situ samples with a
rover, which uses rock detection to localize relevant candidates to grasp. For
the third task, we show the usefulness of stone segmentation to autonomously
conduct a spectrometer measurement experiment. We perform a thorough analysis
of the presented methods and evaluate our experimental results. The
demonstrations on Mt. Etna show that our approaches are well suited for
navigation, geological analysis, and sample extraction tasks within autonomous
robotic extra-terrestrial missions
Design and Real-World Evaluation of Dependable Wireless Cyber-Physical Systems
The ongoing effort for an efficient, sustainable, and automated interaction between humans, machines, and our environment will make cyber-physical systems (CPS) an integral part of the industry and our daily lives. At their core, CPS integrate computing elements, communication networks, and physical processes that are monitored and controlled through sensors and actuators. New and innovative applications become possible by extending or replacing static and expensive cable-based communication infrastructures with wireless technology. The flexibility of wireless CPS is a key enabler for many envisioned scenarios, such as intelligent factories, smart farming, personalized healthcare systems, autonomous search and rescue, and smart cities.
High dependability, efficiency, and adaptivity requirements complement the demand for wireless and low-cost solutions in such applications. For instance, industrial and medical systems should work reliably and predictably with performance guarantees, even if parts of the system fail. Because emerging CPS will feature mobile and battery-driven devices that can execute various tasks, the systems must also quickly adapt to frequently changing conditions. Moreover, as applications become ever more sophisticated, featuring compact embedded devices that are deployed densely and at scale, efficient designs are indispensable to achieve desired operational lifetimes and satisfy high bandwidth demands.
Meeting these partly conflicting requirements, however, is challenging due to imperfections of wireless communication and resource constraints along several dimensions, for example, computing, memory, and power constraints of the devices. More precisely, frequent and correlated message losses paired with very limited bandwidth and varying delays for the message exchange significantly complicate the control design. In addition, since communication ranges are limited, messages must be relayed over multiple hops to cover larger distances, such as an entire factory. Although the resulting mesh networks are more robust against interference, efficient communication is a major challenge as wireless imperfections get amplified, and significant coordination effort is needed, especially if the networks are dynamic.
CPS combine various research disciplines, which are often investigated in isolation, ignoring their complex interaction. However, to address this interaction and build trust in the proposed solutions, evaluating CPS using real physical systems and wireless networks paired with formal guarantees of a system’s end-to-end behavior is necessary. Existing works that take this step can only satisfy a few of the abovementioned requirements. Most notably, multi-hop communication has only been used to control slow physical processes while providing no guarantees. One of the reasons is that the current communication protocols are not suited for dynamic multi-hop networks.
This thesis closes the gap between existing works and the diverse needs of emerging wireless CPS. The contributions address different research directions and are split into two parts. In the first part, we specifically address the shortcomings of existing communication protocols and make the following contributions to provide a solid networking foundation:
• We present Mixer, a communication primitive for the reliable many-to-all message exchange in dynamic wireless multi-hop networks. Mixer runs on resource-constrained low-power embedded devices and combines synchronous transmissions and network coding for a highly scalable and topology-agnostic message exchange. As a result, it supports mobile nodes and can serve any possible traffic patterns, for example, to efficiently realize distributed control, as required by emerging CPS applications.
• We present Butler, a lightweight and distributed synchronization mechanism with formally guaranteed correctness properties to improve the dependability of synchronous transmissions-based protocols. These protocols require precise time synchronization provided by a specific node. Upon failure of this node, the entire network cannot communicate. Butler removes this single point of failure by quickly synchronizing all nodes in the network without affecting the protocols’ performance.
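The network coding ingredient mentioned for Mixer can be illustrated with a minimal sketch. The abstract does not specify the coding scheme, so the example assumes random linear network coding over GF(2): every transmitted packet is the XOR of a random subset of the original messages. The names `encode` and `decode` and the integer payloads are hypothetical. A receiver recovers all messages once it holds enough linearly independent combinations, regardless of which neighbours they came from.

```python
import random


def encode(packets, rng):
    """Return (coefficient bitmask, payload) for a random non-empty XOR combination."""
    mask = 0
    while mask == 0:
        mask = rng.getrandbits(len(packets))
    payload = 0
    for i, packet in enumerate(packets):
        if (mask >> i) & 1:
            payload ^= packet
    return mask, payload


def decode(coded, n):
    """Gaussian elimination over GF(2); returns the n messages, or None if rank < n."""
    pivots = {}  # pivot bit -> (mask, payload), kept fully reduced
    for mask, payload in coded:
        # Remove every known pivot bit from the incoming combination.
        for bit, (p_mask, p_payload) in pivots.items():
            if (mask >> bit) & 1:
                mask ^= p_mask
                payload ^= p_payload
        if mask == 0:
            continue  # linearly dependent on what we already have
        bit = mask.bit_length() - 1
        # Remove the new pivot bit from the rows stored so far.
        for b, (p_mask, p_payload) in list(pivots.items()):
            if (p_mask >> bit) & 1:
                pivots[b] = (p_mask ^ mask, p_payload ^ payload)
        pivots[bit] = (mask, payload)
    if len(pivots) < n:
        return None
    return [pivots[i][1] for i in range(n)]


rng = random.Random(0)
messages = [0x11, 0x22, 0x33, 0x44]
received = []
decoded = None
while decoded is None:  # keep listening until the combinations reach full rank
    received.append(encode(messages, rng))
    decoded = decode(received, len(messages))
```

Because any sufficiently large set of independent combinations decodes, a node does not need to know the topology or track which specific message it is still missing, which fits the topology-agnostic exchange described above.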
In the second part, we focus on the challenges of integrating communication and various control concepts using classical time-triggered and modern event-based approaches. Based on the design, implementation, and evaluation of the proposed solutions using real systems and networks, we make the following contributions, which in many ways push the boundaries of previous approaches:
• We are the first to demonstrate and evaluate fast feedback control over low-power wireless multi-hop networks. Essential for this achievement is a novel co-design and integration of communication and control. Our wireless embedded platform tames the imperfections impairing control, for example, message loss and varying delays, and considers the resulting key properties in the control design. Furthermore, the careful orchestration of control and communication tasks enables real-time operation and makes our system amenable to an end-to-end analysis. Due to this, we can provably guarantee closed-loop stability for physical processes with linear time-invariant dynamics.
• We propose control-guided communication, a novel co-design for distributed self-triggered control over wireless multi-hop networks. Self-triggered control can save energy by transmitting data only when needed. However, there are no solutions that bring those savings to multi-hop networks and that can reallocate freed-up resources, for example, to other agents. Our control system informs the communication system of its transmission demands ahead of time so that communication resources can be allocated accordingly. Thus, we can transfer the energy savings from the control to the communication side and achieve an end-to-end benefit.
• We present a novel co-design of distributed control and wireless communication that resolves overload situations in which the communication demand exceeds the available bandwidth. As systems scale up, featuring more agents and higher bandwidth demands, the available bandwidth will be quickly exceeded, resulting in overload. While event-triggered control and self-triggered control approaches reduce the communication demand on average, they cannot rule out situations in which all agents want to communicate simultaneously. We address this limitation by dynamically allocating the available bandwidth to the agents with the highest need. Thus, we can formally prove that our co-design guarantees closed-loop stability for physical systems with stochastic linear time-invariant dynamics.
Abstract
Acknowledgements
List of Abbreviations
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Application Requirements
1.3 Challenges
1.4 State of the Art
1.5 Contributions and Road Map
2 Mixer: Efficient Many-to-All Broadcast in Dynamic Wireless Mesh Networks
2.1 Introduction
2.2 Overview
2.3 Design
2.4 Implementation
2.5 Evaluation
2.6 Discussion
2.7 Related Work
3 Butler: Increasing the Availability of Low-Power Wireless Communication Protocols
3.1 Introduction
3.2 Motivation and Background
3.3 Design
3.4 Analysis
3.5 Implementation
3.6 Evaluation
3.7 Related Work
4 Feedback Control Goes Wireless: Guaranteed Stability over Low-Power Multi-Hop Networks
4.1 Introduction
4.2 Related Work
4.3 Problem Setting and Approach
4.4 Wireless Embedded System Design
4.5 Control Design and Analysis
4.6 Experimental Evaluation
4.A Control Details
5 Control-Guided Communication: Efficient Resource Arbitration and Allocation in Multi-Hop Wireless Control Systems
5.1 Introduction
5.2 Problem Setting
5.3 Co-Design Approach
5.4 Wireless Communication System Design
5.5 Self-Triggered Control Design
5.6 Experimental Evaluation
6 Scaling Beyond Bandwidth Limitations: Wireless Control With Stability Guarantees Under Overload
6.1 Introduction
6.2 Problem and Related Work
6.3 Overview of Co-Design Approach
6.4 Predictive Triggering and Control System
6.5 Adaptive Communication System
6.6 Integration and Stability Analysis
6.7 Testbed Experiments
6.A Proof of Theorem 4
6.B Usage of the Network Bandwidth for Control
7 Conclusion and Outlook
7.1 Contributions
7.2 Future Directions
Bibliography
List of Publications
Blending the Material and Digital World for Hybrid Interfaces
The development of digital technologies in the 21st century is progressing continuously and new device classes such as tablets, smartphones or smartwatches are finding their way into our everyday lives. However, this development also poses problems, as these prevailing touch and gestural interfaces often lack tangibility, take little account of haptic qualities and therefore require full attention from their users. Compared to traditional tools and analog interfaces, the human skills to experience and manipulate material in its natural environment and context remain unexploited. To combine the best of both, a key question is how it is possible to blend the material world and digital world to design and realize novel hybrid interfaces in a meaningful way. Research on Tangible User Interfaces (TUIs) investigates the coupling between physical objects and virtual data. In contrast, hybrid interfaces, which specifically aim to digitally enrich analog artifacts of everyday work, have not yet been sufficiently researched and systematically discussed.
Therefore, this doctoral thesis rethinks how user interfaces can provide useful digital functionality while maintaining their physical properties and familiar patterns of use in the real world. However, the development of such hybrid interfaces raises overarching research questions about the design: Which kind of physical interfaces are worth exploring? What type of digital enhancement will improve existing interfaces? How can hybrid interfaces retain their physical properties while enabling new digital functions? What are suitable methods to explore different designs? And how can technology-enthusiast users be supported in prototyping?
For a systematic investigation, the thesis builds on a design-oriented, exploratory and iterative development process using digital fabrication methods and novel materials. As a main contribution, four specific research projects are presented that apply and discuss different visual and interactive augmentation principles along real-world applications. The applications range from digitally-enhanced paper, interactive cords over visual watch strap extensions to novel prototyping tools for smart garments. While almost all of them integrate visual feedback and haptic input, none of them are built on rigid, rectangular pixel screens or use standard input modalities, as they all aim to reveal new design approaches. The dissertation shows how valuable it can be to rethink familiar, analog applications while thoughtfully extending them digitally. Finally, this thesis’ extensive work of engineering versatile research platforms is accompanied by overarching conceptual work, user evaluations and technical experiments, as well as literature reviews.
The penetration of digital technologies in the 21st century is advancing steadily, and new device classes such as tablets, smartphones or smartwatches are conquering our everyday lives. However, this development also brings problems, because the prevailing touch-sensitive surfaces hardly take haptic qualities into account and therefore demand the full attention of their users. Compared to traditional tools and analog interfaces, the human abilities to grasp and perceive the environment with all senses remain unused. To combine the best of both worlds, the question therefore arises of how novel hybrid interfaces can be meaningfully designed and realized in order to merge the material and the digital world. Research on Tangible User Interfaces (TUIs) investigates the connection between physical objects and virtual data. In contrast, hybrid interfaces that specifically aim to digitally extend physical everyday objects, and to investigate them systematically on the basis of suitable design parameters and design spaces, have not yet been sufficiently researched.
This dissertation therefore investigates how materiality and digitality can merge seamlessly. It explores how future user interfaces can provide useful digital functions without losing their physical properties and familiar patterns of use in the real world. However, the development of such hybrid approaches raises overarching research questions about design: Which kinds of physical interfaces are worth considering? What kind of digital extension improves what already exists? How can hybrid concepts retain their physical properties while enabling new digital functions? What are suitable methods for exploring different designs? How can technology enthusiasts be supported in creating prototypes?
For a systematic investigation, the work relies on a design-oriented, exploratory and iterative development process using digital fabrication methods and novel materials. The main part presents four research projects that discuss different visual and interactive principles along real-world applications. The scenarios range from digitally enriched paper and interactive cords, through visual extensions of watch straps, to novel prototyping tools for smart garments. To reveal new design approaches, almost all of them integrate visual feedback and haptic input in order to create alternatives to standard input modalities on rigid pixel screens. The dissertation has shown how valuable it can be to rethink familiar analog applications while thoughtfully extending them digitally. The work comprises realized technical research platforms as well as overarching conceptual work, user studies and technical experiments, and the analysis of existing research
Flashpoint: A Low-latency Serverless Platform for Deep Learning Inference Serving
Recent breakthroughs in Deep Learning (DL) have led to high demand for executing inferences in interactive services such as ChatGPT and GitHub Copilot. However, these interactive services require low-latency inferences, which can only be met with GPUs and result in exorbitant operating costs. For instance, ChatGPT reportedly requires millions of U.S. dollars in cloud GPUs to serve its 1+ million users. A potential solution to meet low-latency requirements with acceptable costs is to use serverless platforms. These platforms automatically scale resources to meet user demands. However, current serverless systems have long cold starts, which worsen with larger DL models and lead to poor performance during bursts of requests. Meanwhile, the demand for larger and larger DL models makes it more challenging to deliver an acceptable user experience cost-effectively. While current systems over-provision GPUs to address this issue, they incur high costs in idle resources, which greatly reduces the benefit of using a serverless platform.
In this thesis, we introduce Flashpoint, a GPU-based serverless platform that serves DL inferences with low latencies. Flashpoint achieves this by reducing cold start durations, especially for large DL models, making serverless computing feasible for latency-sensitive DL workloads. To reduce cold start durations, Flashpoint reduces download times by sourcing the DL model data from within the compute cluster rather than slow cloud storage. Additionally, Flashpoint minimizes in-cluster network congestion from redundant packet transfers of the same DL model to multiple machines with multicasting. Finally, Flashpoint also reduces cold start durations by automatically partitioning models and deploying them in parallel on multiple machines. The reduced cold start durations achieved by Flashpoint enable the platform to scale resource allocations elastically and complete requests with low latencies without over-provisioning expensive GPU resources.
We perform large-scale data center simulations that were parameterized with measurements from our prototype implementations. We evaluate the system using six state-of-the-art DL models ranging from 499 MB to 11 GB in size. We also measure the performance of the system in representative real-world traces from Twitter and Microsoft Azure. Our results in the full-scale simulations show that Flashpoint achieves an arithmetic mean of 93.51% shorter average cold start durations, leading to 75.42% and 66.90% respective reductions in average and 99th percentile end-to-end request latencies across the DL models with the same amount of resources. These results show that Flashpoint boosts the performance of serving DL inferences on a serverless platform without increasing costs
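A back-of-the-envelope model shows why in-cluster sourcing and parallel partitioned deployment shorten cold starts. The sketch below is an idealized assumption, not Flashpoint's implementation or its simulator. It ignores multicast, container start-up and GPU load time. Both bandwidth figures and the partition count are hypothetical; only the 11 GB model size is taken from the abstract.

```python
MB = 10**6
GB = 10**9


def download_seconds(model_bytes, bandwidth_bytes_per_s, partitions=1):
    """Idealized transfer time when the model is split evenly and the
    partitions are fetched in parallel, each at the full given bandwidth."""
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    return model_bytes / partitions / bandwidth_bytes_per_s


model_size = 11 * GB             # largest model size quoted in the abstract
cloud_storage_bw = 100 * MB      # hypothetical: 100 MB/s from cloud storage
in_cluster_bw = 1250 * MB        # hypothetical: 10 Gbit/s link inside the cluster

baseline = download_seconds(model_size, cloud_storage_bw)
in_cluster = download_seconds(model_size, in_cluster_bw)
partitioned = download_seconds(model_size, in_cluster_bw, partitions=4)
```

Under these made-up numbers the transfer drops from 110 s to 8.8 s when sourced in-cluster and to 2.2 s when additionally split four ways. The figures only illustrate the direction of the effect and do not reproduce the reported results.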
A New Methodology to Manage FPGA Distributed Memory Content via Bitstream for Xilinx ZYNQ Devices
This paper proposes a methodology to access data and manage the content of distributed memories in FPGA designs through the configuration bitstream. With the proposed methods, the data content of registers can be read and written in a straightforward fashion without using the registers' in/out ports. Hence, it becomes possible to perform several operations, such as loading, copying or comparing the information stored in registers, without the need for physical interconnections. This work includes two flows that simplify the design process when using the proposed approach: while the first enables the protection or unprotection of writing on different partial regions through the bitstream, the second permits homogeneous instances of a design implemented in different reconfigurable regions to be obtained without losing efficiency. The approach is based on, and has been physically validated on, the ZYNQ from Xilinx, and when using partially reconfigurable designs, it affects neither the hardware overhead nor the maximum operating frequency of the design.
This work has been supported, within the fund for research groups of the Basque university system IT1440-22, by the Department of Education and, within PILAR ZE-2020/00022 and COMMUTE ZE-2021/00931 projects, by the Hazitek program, both of the Basque Government; the latter also by the Ministerio de Ciencia e Innovación of Spain through the Centro para el Desarrollo Tecnológico Industrial (CDTI) within the projects IDI-20201264 and IDI-20220543, and through the Fondo Europeo de Desarrollo Regional 2014–2020 (FEDER funds)
Interactive Imitation Learning of Bimanual Movement Primitives
Performing bimanual tasks with dual robotic setups can drastically increase
the impact on industrial and daily life applications. However, performing a
bimanual task brings many challenges, like synchronization and coordination of
the single-arm policies. This article proposes the Safe, Interactive Movement
Primitives Learning (SIMPLe) algorithm, to teach and correct single or dual arm
impedance policies directly from human kinesthetic demonstrations. Moreover, it
proposes a novel graph encoding of the policy based on Gaussian Process
Regression (GPR) where the single-arm motion is guaranteed to converge close to
the trajectory and then towards the demonstrated goal. Regulation of the robot
stiffness according to the epistemic uncertainty of the policy allows for
easily reshaping the motion with human feedback and/or adapting to external
perturbations. We tested the SIMPLe algorithm on a real dual-arm setup where
the teacher gave separate single-arm demonstrations and then successfully
synchronized them only using kinesthetic feedback or where the original
bimanual demonstration was locally reshaped to pick a box at a different
height