6,843 research outputs found
ENHANCING CLOUD SYSTEM RUNTIME TO ADDRESS COMPLEX FAILURES
As the reliance on cloud systems intensifies in our progressively digital world, understanding and reinforcing their reliability becomes more crucial than ever. Despite impressive advancements in augmenting the resilience of cloud systems, the growing incidence of complex failures now poses a substantial challenge to the availability of these systems. With cloud systems continuing to scale and increase in complexity, failures not only become more elusive to detect but can also lead to more catastrophic consequences. Such failures question the foundational premises of conventional fault-tolerance designs, necessitating the creation of novel system designs to counteract them.
This dissertation aims to enhance distributed systems’ capabilities to detect, localize, and react to complex failures at runtime. To this end, this dissertation makes contributions to address three emerging categories of failures in cloud systems. The first part delves into the investigation of partial failures, introducing OmegaGen, a tool adept at generating tailored checkers for detecting and localizing such failures. The second part grapples with silent semantic failures prevalent in cloud systems, showcasing our study findings, and introducing Oathkeeper, a tool that leverages past failures to infer rules and expose these silent issues. The third part explores solutions to slow failures via RESIN, a framework specifically designed to detect, diagnose, and mitigate memory leaks in cloud-scale infrastructures, developed in collaboration with Microsoft Azure. The dissertation concludes by offering insights into future directions for the construction of reliable cloud systems
Optimization of Beyond 5G Network Slicing for Smart City Applications
Transitioning from the current fifth-generation (5G) wireless technology, the advent of beyond 5G (B5G) signifies a pivotal stride toward sixth generation (6G) communication technology. B5G, at its essence, harnesses end-to-end (E2E) network slicing (NS) technology, enabling the simultaneous accommodation of multiple logical networks with distinct performance requirements on a shared physical infrastructure. At the forefront of this implementation lies the critical process of network slice design, a phase central to the realization of efficient smart city networks. This thesis assumes a key role in the network slicing life cycle, emphasizing the analysis and formulation of optimal procedures for configuring, customizing, and allocating E2E network slices. The focus extends to catering to the unique demands of smart city applications, encompassing critical areas such as emergency response, smart buildings, and video surveillance. By addressing the intricacies of network slice design, the study navigates through the complexities of tailoring slices to meet specific application needs, thereby contributing to the seamless integration of diverse services within the smart city framework. Addressing the core challenge of NS, which involves the allocation of virtual networks on the physical topology with optimal resource allocation, the thesis introduces a dual integer linear programming (ILP) optimization problem. This problem is formulated to jointly minimize the embedding cost and latency. However, given the NP-hard nature of this ILP, finding an efficient alternative becomes a significant hurdle. In response, this thesis introduces a novel heuristic approach the matroid-based modified greedy breadth-first search (MGBFS) algorithm. This pioneering algorithm leverages matroid properties to navigate the process of virtual network embedding and resource allocation. By introducing this novel heuristic approach, the research aims to provide near-optimal solutions, overcoming the computational complexities associated with the dual integer linear programming problem. The proposed MGBFS algorithm not only addresses the connectivity, cost, and latency constraints but also outperforms the benchmark model delivering solutions remarkably close to optimal. This innovative approach represents a substantial advancement in the optimization of smart city applications, promising heightened connectivity, efficiency, and resource utilization within the evolving landscape of B5G-enabled communication technology
Resource-aware scheduling for 2D/3D multi-/many-core processor-memory systems
This dissertation addresses the complexities of 2D/3D multi-/many-core processor-memory systems, focusing on two key areas: enhancing timing predictability in real-time multi-core processors and optimizing performance within thermal constraints. The integration of an increasing number of transistors into compact chip designs, while boosting computational capacity, presents challenges in resource contention and thermal management. The first part of the thesis improves timing predictability. We enhance shared cache interference analysis for set-associative caches, advancing the calculation of Worst-Case Execution Time (WCET). This development enables accurate assessment of cache interference and the effectiveness of partitioned schedulers in real-world scenarios. We introduce TCPS, a novel task and cache-aware partitioned scheduler that optimizes cache partitioning based on task-specific WCET sensitivity, leading to improved schedulability and predictability. Our research explores various cache and scheduling configurations, providing insights into their performance trade-offs. The second part focuses on thermal management in 2D/3D many-core systems. Recognizing the limitations of Dynamic Voltage and Frequency Scaling (DVFS) in S-NUCA many-core processors, we propose synchronous thread migrations as a thermal management strategy. This approach culminates in the HotPotato scheduler, which balances performance and thermal safety. We also introduce 3D-TTP, a transient temperature-aware power budgeting strategy for 3D-stacked systems, reducing the need for Dynamic Thermal Management (DTM) activation. Finally, we present 3QUTM, a novel method for 3D-stacked systems that combines core DVFS and memory bank Low Power Modes with a learning algorithm, optimizing response times within thermal limits. This research contributes significantly to enhancing performance and thermal management in advanced processor-memory systems
Modern computing: Vision and challenges
Over the past six decades, the computing systems field has experienced significant transformations, profoundly impacting society with transformational developments, such as the Internet and the commodification of computing. Underpinned by technological advancements, computer systems, far from being static, have been continuously evolving and adapting to cover multifaceted societal niches. This has led to new paradigms such as cloud, fog, edge computing, and the Internet of Things (IoT), which offer fresh economic and creative opportunities. Nevertheless, this rapid change poses complex research challenges, especially in maximizing potential and enhancing functionality. As such, to maintain an economical level of performance that meets ever-tighter requirements, one must understand the drivers of new model emergence and expansion, and how contemporary challenges differ from past ones. To that end, this article investigates and assesses the factors influencing the evolution of computing systems, covering established systems and architectures as well as newer developments, such as serverless computing, quantum computing, and on-device AI on edge devices. Trends emerge when one traces technological trajectory, which includes the rapid obsolescence of frameworks due to business and technical constraints, a move towards specialized systems and models, and varying approaches to centralized and decentralized control. This comprehensive review of modern computing systems looks ahead to the future of research in the field, highlighting key challenges and emerging trends, and underscoring their importance in cost-effectively driving technological progress
MGG: Accelerating Graph Neural Networks with Fine-grained intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms
The increasing size of input graphs for graph neural networks (GNNs)
highlights the demand for using multi-GPU platforms. However, existing
multi-GPU GNN systems optimize the computation and communication individually
based on the conventional practice of scaling dense DNNs. For irregularly
sparse and fine-grained GNN workloads, such solutions miss the opportunity to
jointly schedule/optimize the computation and communication operations for
high-performance delivery. To this end, we propose MGG, a novel system design
to accelerate full-graph GNNs on multi-GPU platforms. The core of MGG is its
novel dynamic software pipeline to facilitate fine-grained
computation-communication overlapping within a GPU kernel. Specifically, MGG
introduces GNN-tailored pipeline construction and GPU-aware pipeline mapping to
facilitate workload balancing and operation overlapping. MGG also incorporates
an intelligent runtime design with analytical modeling and optimization
heuristics to dynamically improve the execution performance. Extensive
evaluation reveals that MGG outperforms state-of-the-art full-graph GNN systems
across various settings: on average 4.41X, 4.81X, and 10.83X faster than DGL,
MGG-UVM, and ROC, respectively
Towards a centralized multicore automotive system
Today’s automotive systems are inundated with embedded electronics to host chassis, powertrain, infotainment, advanced driver assistance systems, and other modern vehicle functions. As many as 100 embedded microcontrollers execute hundreds of millions of lines of code in a single vehicle. To control the increasing complexity in vehicle electronics and services, automakers are planning to consolidate different on-board automotive functions as software tasks on centralized multicore hardware platforms. However, these vehicle software services have different and contrasting timing, safety, and security requirements. Existing vehicle operating systems are ill-equipped to provide all the required service guarantees on a single machine. A centralized automotive system aims to tackle this by assigning software tasks to multiple criticality domains or levels according to their consequences of failures, or international safety standards like ISO 26262. This research investigates several emerging challenges in time-critical systems for a centralized multicore automotive platform and proposes a novel vehicle operating system framework to address them.
This thesis first introduces an integrated vehicle management system (VMS), called DriveOS™, for a PC-class multicore hardware platform. Its separation kernel design enables temporal and spatial isolation among critical and non-critical vehicle services in different domains on the same machine. Time- and safety-critical vehicle functions are implemented in a sandboxed Real-time Operating System (OS) domain, and non-critical software is developed in a sandboxed general-purpose OS (e.g., Linux, Android) domain. To leverage the advantages of model-driven vehicle function development, DriveOS provides a multi-domain application framework in Simulink. This thesis also presents a real-time task pipeline scheduling algorithm in multiprocessors for communication between connected vehicle services with end-to-end guarantees. The benefits and performance of the overall automotive system framework are demonstrated with hardware-in-the-loop testing using real-world applications, car datasets and simulated benchmarks, and with an early-stage deployment in a production-grade luxury electric vehicle
SmartChoices: Augmenting Software with Learned Implementations
We are living in a golden age of machine learning. Powerful models are being
trained to perform many tasks far better than is possible using traditional
software engineering approaches alone. However, developing and deploying those
models in existing software systems remains difficult. In this paper we present
SmartChoices, a novel approach to incorporating machine learning into mature
software stacks easily, safely, and effectively. We explain the overall design
philosophy and present case studies using SmartChoices within large scale
industrial systems
SCV-GNN: Sparse Compressed Vector-based Graph Neural Network Aggregation
Graph neural networks (GNNs) have emerged as a powerful tool to process
graph-based data in fields like communication networks, molecular interactions,
chemistry, social networks, and neuroscience. GNNs are characterized by the
ultra-sparse nature of their adjacency matrix that necessitates the development
of dedicated hardware beyond general-purpose sparse matrix multipliers. While
there has been extensive research on designing dedicated hardware accelerators
for GNNs, few have extensively explored the impact of the sparse storage format
on the efficiency of the GNN accelerators. This paper proposes SCV-GNN with the
novel sparse compressed vectors (SCV) format optimized for the aggregation
operation. We use Z-Morton ordering to derive a data-locality-based computation
ordering and partitioning scheme. The paper also presents how the proposed
SCV-GNN is scalable on a vector processing system. Experimental results over
various datasets show that the proposed method achieves a geometric mean
speedup of and over CSC and CSR aggregation
operations, respectively. The proposed method also reduces the memory traffic
by a factor of and over compressed sparse column
(CSC) and compressed sparse row (CSR), respectively. Thus, the proposed novel
aggregation format reduces the latency and memory access for GNN inference
- …