Resource-aware scheduling for 2D/3D multi-/many-core processor-memory systems
This dissertation addresses the complexities of 2D/3D multi-/many-core processor-memory systems, focusing on two key areas: enhancing timing predictability in real-time multi-core processors and optimizing performance within thermal constraints. The integration of an increasing number of transistors into compact chip designs, while boosting computational capacity, presents challenges in resource contention and thermal management. The first part of the thesis improves timing predictability. We enhance shared cache interference analysis for set-associative caches, advancing the calculation of Worst-Case Execution Time (WCET). This development enables accurate assessment of cache interference and of the effectiveness of partitioned schedulers in real-world scenarios. We introduce TCPS, a novel task- and cache-aware partitioned scheduler that optimizes cache partitioning based on task-specific WCET sensitivity, leading to improved schedulability and predictability. Our research explores various cache and scheduling configurations, providing insights into their performance trade-offs. The second part focuses on thermal management in 2D/3D many-core systems. Recognizing the limitations of Dynamic Voltage and Frequency Scaling (DVFS) in S-NUCA many-core processors, we propose synchronous thread migrations as a thermal management strategy. This approach culminates in the HotPotato scheduler, which balances performance and thermal safety. We also introduce 3D-TTP, a transient temperature-aware power budgeting strategy for 3D-stacked systems, reducing the need for Dynamic Thermal Management (DTM) activation. Finally, we present 3QUTM, a novel method for 3D-stacked systems that combines core DVFS and memory bank Low Power Modes with a learning algorithm, optimizing response times within thermal limits. This research contributes significantly to enhancing performance and thermal management in advanced processor-memory systems.
Application-centric bandwidth allocation in datacenters
Today's datacenters host a large number of concurrently executing applications with diverse intra-datacenter latency and bandwidth requirements.
Some of these applications, such as data analytics, graph processing, and machine learning training, are data-intensive and require high bandwidth to function properly.
However, these bandwidth-hungry applications can often congest the datacenter network, leading to queuing delays that hurt application completion time.
To remove the network as a potential performance bottleneck, datacenter operators have begun deploying high-end HPC-grade networks like InfiniBand.
These networks offer fully offloaded network stacks, remote direct memory access (RDMA) capability, and non-discarding links, which allow them to provide both low latency and high bandwidth for a single application.
However, it is unclear how well such networks accommodate a mix of latency- and bandwidth-sensitive traffic in a real-world deployment.
In this thesis, we aim to answer the above question.
To do so, we develop RPerf, a latency measurement tool for RDMA-based networks that can precisely measure the InfiniBand switch latency without hardware support.
Using RPerf, we benchmark a rack-scale InfiniBand cluster in both isolated and mixed-traffic scenarios.
Our key finding is that the evaluated switch can provide either low latency or high bandwidth, but not both simultaneously in a mixed-traffic scenario.
We also evaluate several options to improve the latency-bandwidth trade-off and demonstrate that none are ideal.
We find that while queue separation is a solution to protect latency-sensitive applications, it fails to properly manage the bandwidth of other applications.
We also aim to resolve the problem of bandwidth management for non-latency-sensitive applications.
Previous efforts to address this problem have generally focused on achieving max-min fairness at the flow level.
However, we observe that different workloads exhibit varying levels of sensitivity to network bandwidth.
For some workloads, even a small reduction in available bandwidth can significantly increase completion time, while for others, completion time is largely insensitive to available network bandwidth.
As a result, simply splitting the bandwidth equally among all workloads is sub-optimal for overall application-level performance.
To address this issue, we first propose a robust methodology capable of effectively measuring the sensitivity of applications to bandwidth.
We then design Saba, an application-aware bandwidth allocation framework that distributes network bandwidth based on application-level sensitivity.
Saba combines ahead-of-time application profiling to determine bandwidth sensitivity with runtime bandwidth allocation using lightweight software support, with no modifications to network hardware or protocols.
Experiments with a 32-server hardware testbed show that Saba can significantly increase overall performance by reducing the job completion time for bandwidth-sensitive jobs.
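The core allocation idea — dividing link bandwidth in proportion to each job's measured sensitivity rather than splitting it equally — can be sketched as follows. The job names and sensitivity scores here are hypothetical, and Saba's real mechanism (profiling plus runtime enforcement) is more involved than this proportional split:

```python
def allocate_bandwidth(total_gbps, jobs):
    """Split link bandwidth in proportion to each job's measured
    bandwidth sensitivity, instead of equally (flow-level max-min)."""
    total_sensitivity = sum(s for _, s in jobs)
    return {name: total_gbps * s / total_sensitivity for name, s in jobs}

# Hypothetical profiling results: how strongly completion time
# reacts to a bandwidth reduction (higher = more sensitive).
jobs = [("pagerank", 0.8), ("kmeans", 0.15), ("wordcount", 0.05)]
shares = allocate_bandwidth(100, jobs)  # shares in Gbps
```

Under this sketch the bandwidth-sensitive job receives the bulk of the link, while the insensitive jobs, whose completion times would barely change either way, receive correspondingly less.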
Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU
Many applications, such as autonomous driving and augmented reality, require the concurrent running of multiple deep neural networks (DNNs) that pose different levels of real-time performance requirements. However, coordinating multiple DNN tasks with varying levels of criticality on edge GPUs remains an area of limited study. Unlike server-level GPUs, edge GPUs are resource-limited and lack hardware-level resource management mechanisms for avoiding resource contention. Therefore, we propose Miriam, a contention-aware task coordination framework for multi-DNN inference on edge GPUs. Miriam consolidates two main components, an elastic-kernel generator and a runtime dynamic kernel coordinator, to support mixed-criticality DNN inference. To evaluate Miriam, we build a new DNN inference benchmark based on CUDA with diverse representative DNN workloads. Experiments on two edge GPU platforms show that Miriam can increase system throughput by 92% while incurring less than 10% latency overhead for critical tasks, compared to state-of-the-art baselines.
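The coordination idea — critical kernels take priority, while slices of elastic kernels fill otherwise idle GPU time — can be illustrated with a simplified discrete-time sketch. The kernel names and the slot model are invented for illustration; Miriam itself operates on real CUDA kernels, not a simulated timeline:

```python
from collections import deque

def coordinate(critical, elastic_slices, slots):
    """Each time slot, a released critical kernel runs first; otherwise
    the next elastic-kernel slice fills the idle slot."""
    schedule = []
    crit, best_effort = deque(critical), deque(elastic_slices)
    for t in range(slots):
        if crit and crit[0][0] <= t:          # (release_time, name)
            schedule.append(crit.popleft()[1])
        elif best_effort:
            schedule.append(best_effort.popleft())
        else:
            schedule.append("idle")
    return schedule

# Two critical kernels released at t=0 and t=2; three elastic slices.
sched = coordinate([(0, "detect"), (2, "track")],
                   ["style_0", "style_1", "style_2"], 5)
```

The elastic slices never delay a released critical kernel, yet the GPU stays busy — the same throughput-versus-criticality trade-off the abstract describes.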
Using Simultaneous Multithreading to Support Real-Time Scheduling
The goal of real-time scheduling is to find a way to schedule every program in a specified system without unacceptable deadline misses. If doing so on a given hardware platform is not possible, then the question to ask is ``What can be changed?'' Simultaneous multithreading (SMT) is a technology that allows a single computer core to execute multiple programs at once, at the cost of increasing the time required to execute individual programs. SMT has been shown to improve performance in many areas of computing, but has seen little application to the real-time domain. Reasons for not using SMT in real-time systems include the difficulty of knowing how much execution time a program will require when SMT is in use, concerns that longer execution times could cause unacceptable deadline misses, and the difficulty of deciding which programs should and should not use SMT to share a core. This dissertation shows how SMT can be used to support real-time scheduling in both the hard real-time (HRT) case, where deadline misses are never acceptable, and the soft real-time (SRT) case, where deadline misses are undesirable but tolerable. Contributions can be divided into three categories. First, the effects of SMT on execution times are measured and parameters for modeling these effects are given. Second, scheduling algorithms for the SRT case that take advantage of SMT are given and evaluated. Third, scheduling algorithms for the HRT case are given and evaluated. In both the SRT and HRT cases, using the proposed algorithms does not lead to unacceptable deadline misses and can have effects similar to increasing a platform's core count by a third or more.
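The trade-off at the heart of this dissertation — two hardware threads per core in exchange for inflated per-job execution times — can be shown with a toy utilisation comparison. The task parameters and the uniform 1.5x inflation factor are made up; the dissertation derives its inflation parameters from measurement rather than assuming a single constant:

```python
def smt_utilisation(costs, period, cores, inflation):
    """Compare task-set utilisation on plain cores against SMT, where
    each core offers two threads but every job runs `inflation`x slower."""
    u_plain = sum(costs) / period / cores
    u_smt = sum(c * inflation for c in costs) / period / (2 * cores)
    return u_plain, u_smt

# Five tasks of cost 4 and period 10 on 2 cores, assuming 1.5x slowdown.
u_plain, u_smt = smt_utilisation([4, 4, 4, 4, 4], period=10,
                                 cores=2, inflation=1.5)
```

In this sketch the plain platform is fully saturated (utilisation 1.0), while enabling SMT drops per-thread utilisation to 0.75 despite the slowdown — the "core count increased by a third or more" effect, provided the inflation stays well below 2x.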
Automotive Ethernet architecture and security: challenges and technologies
Vehicle infrastructure must address the challenges posed by today's advances toward connected and autonomous vehicles. To allow for more flexible architectures, high-bandwidth connections and scalability are needed to connect many sensors and electronic control units (ECUs). At the same time, deterministic, low-latency communication is a critical and significant design requirement to support urgent real-time applications in autonomous vehicles. As a recent solution, time-sensitive networking (TSN) was introduced as a set of Ethernet-based amendments in the IEEE 802.1 TSN standards to meet those needs. However, hurdles must be overcome before it can be used effectively. This paper discusses the latest studies concerning automotive Ethernet requirements, including transmission-delay studies that aim to improve worst-case end-to-end delay and end-to-end jitter. The paper also focuses on securing Ethernet-based in-vehicle networks (IVNs) by reviewing new encryption and authentication methods and approaches.
Changing Priorities. 3rd VIBRArch
In order to ensure a good present and future for people around the planet and to safeguard the planet itself, research in architecture has to release all its potential. Therefore, the aims of the 3rd Valencia International Biennial of Research in Architecture are:
- To focus on the most relevant needs of humanity and the planet and what architectural research can do to solve them.
- To assess the evolution of architectural research into traditional matters of interest and the current state of these popular and widespread topics.
- To deepen understanding of the current state and findings of architectural research on subjects akin to post-capitalism and frequently related to equal opportunities and the universal right to personal development and happiness.
- To showcase all kinds of research related to the new and holistic concept of sustainability and to the climate emergency.
- To place in the spotlight those ongoing works or available proposals developed by architectural researchers in order to combat the effects of the COVID-19 pandemic.
- To underline the capacity of architectural research to develop resiliency and abilities to adapt itself to changing priorities.
- To highlight architecture's multidisciplinarity as a melting pot of multiple approaches, points of view and expertise.
- To open new perspectives for architectural research by promoting the development of multidisciplinary and inter-university networks and research groups.
For all that, the 3rd Valencia International Biennial of Research in Architecture is open not only to architects, but also to any academic, practitioner, professional or student with a determination to develop research in architecture or neighboring fields.
Cabrera Fausto, I. (2023). Changing Priorities. 3rd VIBRArch. Editorial Universitat Politècnica de València. https://doi.org/10.4995/VIBRArch2022.2022.1686
Reducing Loss of Service for Mixed-Criticality Systems through Cache-and Stress-Aware Scheduling
Hardware resources found in modern processor architectures, such as the memory hierarchy, can improve the performance of a task by anticipating its needs based on its execution history and behaviour. Interleaved jobs, belonging to other tasks with different behaviours, can cause stress on those resources by disrupting the execution history, thus slowing down more sensitive tasks. Schedulability analyses and policies tend to ignore such behaviours in favour of conservative assumptions, as their effects are difficult to assess. When they are included, the analysis can be very complex and the measures needed are hard to obtain. In this paper, we propose abstract timing models that capture stress and sensitivity with respect to the memory hierarchy. The advantage of an abstract timing model is that it can be derived through measurements without the need for a detailed understanding of the precise cache hierarchy and how it affects software execution. The disadvantage, of course, is that there is no hard timing guarantee, especially if timing anomalies may exist. The contribution of this paper is to build on existing priority assignment schemes, using the timing model to discriminate between tasks sharing a priority level and improve the system's timing behaviour in overload. More specifically, we show that for task sets scheduled with mixed-criticality scheduling, the number of jobs not executed is often reduced.
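One way such an abstract model could feed into priority assignment is sketched below with invented per-task stress and sensitivity scores: within a shared priority level, low-stress, cache-sensitive tasks are ordered ahead of stress-inducing ones. This is only an illustration of the tie-breaking idea, not the paper's actual scheme:

```python
def order_within_level(tasks):
    """Order tasks that share one priority level: low-stress tasks
    first; among equals, higher cache sensitivity runs earlier, so the
    tasks most hurt by a disrupted cache are least exposed in overload."""
    return sorted(tasks, key=lambda t: (t["stress"], -t["sensitivity"]))

# Hypothetical measured scores in [0, 1].
level = [
    {"name": "logger",  "stress": 0.9, "sensitivity": 0.1},
    {"name": "control", "stress": 0.1, "sensitivity": 0.8},
    {"name": "sensor",  "stress": 0.1, "sensitivity": 0.3},
]
ordered = [t["name"] for t in order_within_level(level)]
```

The stressful logger task is pushed to the back of its level, so when jobs must be shed in overload, the cache-sensitive tasks have already run.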
Late-bound code generation
Each time a function or method is invoked during the execution of a program, a stream of instructions is issued to some underlying hardware platform. But exactly what underlying hardware, and which instructions, is usually left implicit. However, in certain situations it becomes important to control these decisions. For example, particular problems can only be solved in real-time when scheduled on specialised accelerators, such as graphics coprocessors or computing clusters.
We introduce a novel operator for hygienically reifying the behaviour of a runtime function instance as a syntactic fragment, in a language which may in general differ from the source function definition. Translation and optimisation are performed by recursively invoked, dynamically dispatched code generators. Side-effecting operations are permitted, and their ordering is preserved.
We compare our operator with other techniques for pragmatic control, observing that: the use of our operator supports lifting arbitrary mutable objects, and neither requires rewriting sections of the source program in a multi-level language, nor interferes with the interface to individual software components. Due to its lack of interference at the abstraction level at which software is composed, we believe that our approach poses a significantly lower barrier to practical adoption than current methods.
The practical efficacy of our operator is demonstrated by using it to offload the user interface rendering of a smartphone application to an FPGA coprocessor, including both statically and procedurally defined user interface components. The generated pipeline is an application-specific, statically scheduled processor-per-primitive rendering pipeline, suitable for place-and-route style optimisation.
To demonstrate the compatibility of our operator with existing languages, we show how it may be defined within the Python programming language. We introduce a transformation for weakening mutable to immutable named bindings, termed let-weakening, to solve the problem of propagating information pertaining to named variables between modular code generating units.
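The flavour of run-time code generation discussed here can be hinted at with a toy Python specialiser that builds and compiles a function once a previously late-bound parameter becomes known. This is only a stand-in for the thesis's hygienic reification operator, which additionally preserves side-effect ordering and can target languages other than the source language:

```python
def specialize_power(n):
    """Generate and compile a `power` function specialised for a fixed
    exponent at run time: the exponent is burned into the code as an
    unrolled chain of multiplications instead of a loop."""
    body = " * ".join(["x"] * n) if n > 0 else "1"
    src = f"def power(x):\n    return {body}\n"
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["power"], src

# Specialise once the exponent is known, then call like any function.
power3, src = specialize_power(3)
```

The generated fragment is ordinary source text, so it could in principle be handed to a different backend — loosely analogous to translating a runtime function instance into a syntactic fragment for another target.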
LIPIcs, Volume 274, ESA 2023, Complete Volume