193 research outputs found

    Resource-aware scheduling for 2D/3D multi-/many-core processor-memory systems

    Get PDF
    This dissertation addresses the complexities of 2D/3D multi-/many-core processor-memory systems, focusing on two key areas: enhancing timing predictability in real-time multi-core processors and optimizing performance within thermal constraints. The integration of an increasing number of transistors into compact chip designs, while boosting computational capacity, presents challenges in resource contention and thermal management. The first part of the thesis improves timing predictability. We enhance shared cache interference analysis for set-associative caches, advancing the calculation of Worst-Case Execution Time (WCET). This development enables accurate assessment of cache interference and the effectiveness of partitioned schedulers in real-world scenarios. We introduce TCPS, a novel task and cache-aware partitioned scheduler that optimizes cache partitioning based on task-specific WCET sensitivity, leading to improved schedulability and predictability. Our research explores various cache and scheduling configurations, providing insights into their performance trade-offs. The second part focuses on thermal management in 2D/3D many-core systems. Recognizing the limitations of Dynamic Voltage and Frequency Scaling (DVFS) in S-NUCA many-core processors, we propose synchronous thread migrations as a thermal management strategy. This approach culminates in the HotPotato scheduler, which balances performance and thermal safety. We also introduce 3D-TTP, a transient temperature-aware power budgeting strategy for 3D-stacked systems, reducing the need for Dynamic Thermal Management (DTM) activation. Finally, we present 3QUTM, a novel method for 3D-stacked systems that combines core DVFS and memory bank Low Power Modes with a learning algorithm, optimizing response times within thermal limits. This research contributes significantly to enhancing performance and thermal management in advanced processor-memory systems

    Reshaping Higher Education for a Post-COVID-19 World: Lessons Learned and Moving Forward

    Get PDF
    No abstract available

    Analysing and Reducing Costs of Deep Learning Compiler Auto-tuning

    Get PDF
    Deep Learning (DL) is significantly impacting many industries, including automotive, retail and medicine, enabling autonomous driving, recommender systems and genomics modelling, amongst other applications. At the same time, demand for complex and fast DL models is continually growing. The most capable models tend to exhibit highest operational costs, primarily due to their large computational resource footprint and inefficient utilisation of computational resources employed by DL systems. In an attempt to tackle these problems, DL compilers and auto-tuners emerged, automating the traditionally manual task of DL model performance optimisation. While auto-tuning improves model inference speed, it is a costly process, which limits its wider adoption within DL deployment pipelines. The high operational costs associated with DL auto-tuning have multiple causes. During operation, DL auto-tuners explore large search spaces consisting of billions of tensor programs, to propose potential candidates that improve DL model inference latency. Subsequently, DL auto-tuners measure candidate performance in isolation on the target-device, which constitutes the majority of auto-tuning compute-time. Suboptimal candidate proposals, combined with their serial measurement in an isolated target-device lead to prolonged optimisation time and reduced resource availability, ultimately reducing cost-efficiency of the process. In this thesis, we investigate the reasons behind prolonged DL auto-tuning and quantify their impact on the optimisation costs, revealing directions for improved DL auto-tuner design. Based on these insights, we propose two complementary systems: Trimmer and DOPpler. Trimmer improves tensor program search efficacy by filtering out poorly performing candidates, and controls end-to-end auto-tuning using cost objectives, monitoring optimisation cost. Simultaneously, DOPpler breaks long-held assumptions about the serial candidate measurements by successfully parallelising them intra-device, with minimal penalty to optimisation quality. Through extensive experimental evaluation of both systems, we demonstrate that they significantly improve cost-efficiency of autotuning (up to 50.5%) across a plethora of tensor operators, DL models, auto-tuners and target-devices

    Multi-Path Bound for DAG Tasks

    Full text link
    This paper studies the response time bound of a DAG (directed acyclic graph) task. Recently, the idea of using multiple paths to bound the response time of a DAG task, instead of using a single longest path in previous results, was proposed and leads to the so-called multi-path bound. Multi-path bounds can greatly reduce the response time bound and significantly improve the schedulability of DAG tasks. This paper derives a new multi-path bound and proposes an optimal algorithm to compute this bound. We further present a systematic analysis on the dominance and the sustainability of three existing multi-path bounds and the proposed multi-path bound. Our bound theoretically dominates and empirically outperforms all existing multi-path bounds. What's more, the proposed bound is the only multi-path bound that is proved to be self-sustainable

    Precise Response Time Analysis for Multiple DAG Tasks with Intra-task Priority Assignment

    Get PDF
    In many real-time application domains, there are execution dependencies, such tasks may be formulated as multiple Directed Acyclic Graphs (DAGs) and scheduled with intra-task (i.e., intra-DAG) priority assignment. The worst-case completion time of a DAG must be bounded and schedulability analysis must be conducted during the design phase to estimate the required hardware resources. Typical examples include automotive systems and Ultra-Reliable Low Latency Communications (URLLC), which is the ``to-business'' protocol in 5G technologies, deployed in industrial automation for instance. To bound the execution time of multiple DAGs, there are two key factors to analyze: the intra-task interference for a single DAG and the inter-task interference between DAGs. While extensive efforts have been invested, the existing methods either still contain a large degree of pessimism or are even erroneous due to errors in the derived analysis. In this paper, we first provide an in-depth analysis of the limitation and defects of the existing methods. Inspired by these observations, we construct novel response time analysis for multiple DAG tasks with arbitrary intra-task priority assignment. Our analysis precisely accounts for both the intra- and inter-task interference by fully exploring the node parallelism in each DAG as well as between DAGs. Extensive experimental results show that the proposed analysis obtains tighter bounds and improves the system scheduability by at least 300\% compared to state-of-the-art approaches. This improvement is even larger when the scheduling pressure is relatively high, up to 100\% versus 0\% in many cases. This work notably advances the use of response time analysis in the industry. Practitioners have to resort to either potentially unsafe measurement results or significant resource over-provisioning when precise analysis is unavailable

    Optimal Multiprocessor Locking Protocols Under FIFO Scheduling

    Get PDF
    Real-time locking protocols are typically designed to reduce any priority-inversion blocking (pi-blocking) a task may incur while waiting to access a shared resource. For the multiprocessor case, a number of such protocols have been developed that ensure asymptotically optimal pi-blocking bounds under job-level fixed-priority scheduling. Unfortunately, no optimal multiprocessor real-time locking protocols are known that ensure tight pi-blocking bounds under any scheduler. This paper presents the first such protocols. Specifically, protocols are presented for mutual exclusion, reader-writer synchronization, and k-exclusion that are optimal under first-in-first-out (FIFO) scheduling when schedulability analysis treats suspension times as computation. Experiments are presented that demonstrate the effectiveness of these protocols

    High Performance Real-Time Scheduling Framework for Multiprocessor Systems

    Get PDF
    Embedded systems, performing specific functions in modern devices, have become pervasive in today's technology landscape. As many of these systems are real-time systems, they necessitate operations with stringent time constraints. This is especially evident in sectors like automotive and aerospace. This thesis introduces a High Performance Real-time Scheduling (HPRTS) framework, which is designed to navigate the multifaceted challenges faced by multiprocessor real-time systems. To begin with, the research attempts to bridge the gap between system reliability and resource sharing in Mixed-Criticality Systems (MCS). In addressing this, a novel fault-tolerance solution is presented. Its main goal is to enhance fault management and reduce blocking time during fault tolerance. Following this, the thesis delves into task allocation in systems with shared resources. In this context, we introduce a distinct Resource Contention Model (RCM). Using this model as a foundation, our allocation strategy is formulated with the aim to reduce resource contention. Moreover, in light of the escalating system complexity where tasks are represented using Directed Acyclic Graph (DAG) models, the research unveils a new Response Time Analysis (RTA) for multi-DAG systems. This particular analysis has been tailored to provide a safe and more refined bound. Reflecting on the contributions made, the achievements of the thesis highlight the potency of the HPRTS framework in steering real-time embedded systems toward high performance

    Adaptiveness, Asynchrony, and Resource Efficiency in Parallel Stochastic Gradient Descent

    Get PDF
    Accelerated digitalization and sensor deployment in society in recent years poses critical challenges for associated data processing and analysis infrastructure to scale, and the field of big data, targeting methods for storing, processing, and revealing patterns in huge data sets, has surged. Artificial Intelligence (AI) models are used diligently in standard Big Data pipelines due to their tremendous success across various data analysis tasks, however exponential growth in Volume, Variety and Velocity of Big Data (known as its three V’s) in recent years require associated complexity in the AI models that analyze it, as well as the Machine Learning (ML) processes required to train them. In order to cope, parallelism in ML is standard nowadays, with the aim to better utilize contemporary computing infrastructure, whether it being shared-memory multi-core CPUs, or vast connected networks of IoT devices engaging in Federated Learning (FL).Stochastic Gradient Descent (SGD) serves as the backbone of many of the most popular ML methods, including in particular Deep Learning. However, SGD has inherently sequential semantics, and is not trivially parallelizable without imposing strict synchronization, with associated bottlenecks. Asynchronous SGD (AsyncSGD), which relaxes the original semantics, has gained significant interest in recent years due to promising results that show speedup in certain contexts. However, the relaxed semantics that asynchrony entails give rise to fundamental questions regarding AsyncSGD, relating particularly to its stability and convergence rate in practical applications.This thesis explores vital knowledge gaps of AsyncSGD, and contributes in particular to: Theoretical frameworks – Formalization of several key notions related to the impact of asynchrony on the convergence, guiding future development of AsyncSGD implementations; Analytical results – Asymptotic convergence bounds under realistic assumptions. Moreover, several technical solutions are proposed, targeting in particular: Stability – Reducing the number of non-converging executions and the associated wasted energy; Speedup – Improving convergence time and reliability with instance-based adaptiveness; Elasticity – Resource-efficiency by avoiding over-parallelism, and thereby improving stability and saving computing resources. The proposed methods are evaluated on several standard DL benchmarking applications and compared to relevant baselines from previous literature. Key results include: (i) persistent speedup compared to baselines, (ii) increased stability and reduced risk for non-converging executions, (iii) reduction in the overall memory footprint (up to 17%), as well as the consumed computing resources (up to 67%).In addition, along with this thesis, an open-source implementation is published, that connects high-level ML operations with asynchronous implementations with fine-grained memory operations, leveraging future research for efficient adaptation of AsyncSGD for practical applications
    • …
    corecore