
    An adaptive, utilization-based approach to schedule real-time tasks for ARM big.LITTLE architectures

    Get PDF
    ARM big.LITTLE architectures are spreading more and more in the mobile world thanks to their power-saving capabilities, due to the use of two ISA-compatible islands, one focused on energy efficiency and the other on computational power. This architecture makes the problem of energy-aware task scheduling particularly challenging, due to the number of variables to take into account and the need for lightweight mechanisms that can be readily computed in an operating system kernel scheduler. This paper presents a novel task scheduler for big.LITTLE platforms, combining the well-known Constant Bandwidth Server (CBS) algorithm with a power-aware per-job migration policy. This achieves real-time adaptation of the CPU islands' frequencies based on the individual cores' overall utilization, which is available to the scheduler thanks to the use of the resource reservation paradigm. Preliminary results obtained through simulations based on modifications to the open-source RTSim tool show that the proposed technique achieves interesting performance/energy trade-offs.
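    To picture the kind of utilization-driven frequency adaptation the abstract refers to, the C++ sketch below sums per-core reserved bandwidth, as a CBS-style reservation scheduler could expose it, and picks the lowest island frequency whose scaled capacity still covers the most loaded core. All names, the scaling model and the frequency table are illustrative assumptions, not taken from the paper.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical sketch: each core's utilization is the sum of the CBS
// reservations (Q_i / P_i) of the tasks currently assigned to it.
struct Reservation { double budget; double period; };  // Q, P

double core_utilization(const std::vector<Reservation>& rsvs) {
    double u = 0.0;
    for (const auto& r : rsvs) u += r.budget / r.period;
    return u;
}

// Pick the lowest frequency of an island such that the utilization of its
// most loaded core, rescaled to that frequency, stays below a safety cap.
// Utilizations are assumed to be measured at f_max (illustrative assumption).
double pick_island_frequency(const std::vector<double>& per_core_util,
                             const std::vector<double>& freqs_mhz,  // sorted ascending
                             double cap = 0.9) {
    double u_max = *std::max_element(per_core_util.begin(), per_core_util.end());
    double f_max = freqs_mhz.back();
    for (double f : freqs_mhz) {
        // At frequency f, work takes f_max/f times longer, so utilization scales up.
        if (u_max * (f_max / f) <= cap) return f;
    }
    return f_max;  // even the highest frequency is close to saturation
}

int main() {
    std::vector<std::vector<Reservation>> little_cores = {
        {{2.0, 10.0}, {1.0, 20.0}},   // core 0: U = 0.25
        {{5.0, 10.0}}};               // core 1: U = 0.50
    std::vector<double> util;
    for (const auto& c : little_cores) util.push_back(core_utilization(c));
    double f = pick_island_frequency(util, {600, 800, 1000, 1200, 1400});
    return f > 0 ? 0 : 1;
}
```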

    Real-time scheduling in multicore: time- and space-partitioned architectures

    Get PDF
    Doctoral thesis, Informatics (Informatics Engineering), Universidade de Lisboa, Faculdade de Ciências, 2014. The evolution of computing systems to address size, weight and power consumption (SWaP) has led to the trend of integrating functions (otherwise provided by separate systems) as subsystems of a single system. To cope with the added complexity of developing and validating such a system, these functions are maintained and analyzed as components with clear boundaries and interfaces. In the case of real-time systems, the adopted component-based approach should maintain the timeliness properties of the function inside each individual component, regardless of the remaining components. One approach to this issue is time and space partitioning (TSP): enforcing strict separation between components in the time and space domains. This allows heterogeneous components (with different real-time requirements and criticality, developed by different teams and/or with different technologies) to safely coexist. The concepts of TSP have been adopted in the civil aviation, aerospace, and (to some extent) automotive industries. These industries are also embracing multiprocessor (or multicore) platforms, either with identical or non-identical processors, but are not taking full advantage thereof because of a lack of support in terms of verification and certification. Furthermore, given the use of TSP in those domains, compatibility between TSP and multiprocessor platforms is highly desirable. This is not currently the case, as the reference TSP-related specifications in the aforementioned industries show limited support for multiprocessors. In this dissertation, we argue that the active exploitation of multiple (possibly non-identical) processor cores can augment the processing capacity of time- and space-partitioned (TSP) systems, while maintaining a compromise with SWaP, and open room for supporting self-adaptive behavior. To allow applying our results to a more general class of systems, we analyze TSP systems as a special case of hierarchical scheduling and adopt a compositional analysis methodology. Funding: Fundação para a Ciência e a Tecnologia (FCT, SFRH/BD/60193/2009, PESSOA programme, SAPIENT project); the European Space Agency (ESA) Innovation Triangle Initiative programme, through ESTEC Contract 21217/07/NL/CB, Project AIR-II; the European Commission Seventh Framework Programme (FP7), through project KARYON (IST-FP7-STREP-288195).
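    As a minimal illustration of the two-level (time- and space-partitioned) scheduling structure the dissertation analyses, the sketch below cycles through a fixed major frame of partition windows, with each partition holding its own local task queue. The structure and all names are hypothetical and only meant to make the TSP concept concrete.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Illustrative two-level TSP structure: the global level cycles through
// fixed partition windows (time partitioning); each partition owns its own
// task list and local scheduling policy.
struct PartitionWindow { int partition_id; int length_ms; };

struct Partition {
    std::string name;
    std::vector<std::string> ready_tasks;  // local scheduler's queue (simplified)
};

// One pass over the major frame: report which partition runs when.
void run_major_frame(const std::vector<PartitionWindow>& frame,
                     const std::vector<Partition>& partitions) {
    int t = 0;
    for (const auto& w : frame) {
        const Partition& p = partitions[w.partition_id];
        // Inside its window, the partition's local scheduler would pick tasks
        // (e.g., fixed-priority or EDF); here we just show the first one.
        const char* task = p.ready_tasks.empty() ? "(idle)" : p.ready_tasks.front().c_str();
        std::printf("[%4d ms, %4d ms) partition %s runs %s\n",
                    t, t + w.length_ms, p.name.c_str(), task);
        t += w.length_ms;
    }
}

int main() {
    std::vector<Partition> parts = {{"P0", {"nav_filter"}}, {"P1", {"telemetry"}}};
    std::vector<PartitionWindow> frame = {{0, 20}, {1, 10}, {0, 20}};  // 50 ms major frame
    run_major_frame(frame, parts);
    return 0;
}
```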

    EDF scheduling of real-time tasks on multiple cores: Adaptive Partitioning vs. Global Scheduling

    Get PDF
    This paper presents a novel migration algorithm for real-time tasks on multicore systems, based on the idea of migrating tasks only when strictly needed to respect their temporal constraints, together with a combination of this new algorithm with EDF scheduling. This new “adaptive migration” algorithm is evaluated through an extensive set of simulations, showing good performance when compared with global or partitioned EDF: our results highlight that it provides a worst-case utilisation bound similar to partitioned EDF for hard real-time tasks and an empirical tardiness bound (like global EDF) for soft real-time tasks. Therefore, the proposed scheduler is effective for dealing with both hard and soft real-time workloads.
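    The "migrate only when strictly needed" idea can be sketched as follows: a task stays on its current core as long as that core still respects the uniprocessor EDF utilisation bound, and is moved to the least-loaded core that fits only when it would otherwise be unschedulable. This is a simplified utilisation-based illustration of the general principle, not the paper's algorithm; all names are invented.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

struct Task { double wcet; double period; double util() const { return wcet / period; } };

// Per-core utilization under partitioned EDF (uniprocessor bound: U <= 1).
double util_of(const std::vector<Task>& core) {
    double u = 0.0;
    for (const auto& t : core) u += t.util();
    return u;
}

// Adaptive-partitioning flavour: keep the task where it is if the core still
// fits under EDF; otherwise migrate it to the least-loaded core that fits.
// Returns the index of the core the task ends up on, or nullopt if none fits.
std::optional<std::size_t> place_or_migrate(std::vector<std::vector<Task>>& cores,
                                            std::size_t current, const Task& t) {
    if (util_of(cores[current]) + t.util() <= 1.0) {
        cores[current].push_back(t);          // no migration needed
        return current;
    }
    std::optional<std::size_t> best;
    for (std::size_t i = 0; i < cores.size(); ++i) {
        if (i == current) continue;
        double u = util_of(cores[i]);
        if (u + t.util() <= 1.0 && (!best || u < util_of(cores[*best]))) best = i;
    }
    if (best) cores[*best].push_back(t);      // migrate only because we had to
    return best;
}

int main() {
    std::vector<std::vector<Task>> cores(2);
    cores[0] = {{4, 10}, {3, 6}};             // U = 0.9
    Task t{2, 10};                            // U = 0.2: does not fit on core 0
    auto where = place_or_migrate(cores, 0, t);
    return where.has_value() ? 0 : 1;
}
```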

    Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics

    Full text link
    Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed-precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This library, interfaced to the QDP++/Chroma framework for LQCD calculations, is currently in production use on the "9g" cluster at the Jefferson Laboratory, enabling unprecedented price/performance for a range of problems in LQCD. Nevertheless, memory constraints on current GPU devices limit the problem sizes that can be tackled. In this contribution we describe the parallelization of the QUDA library onto multiple GPUs using MPI, including strategies for the overlapping of communication and computation. We report on both weak and strong scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in excess of 4 Tflops. Comment: 11 pages, 7 figures, to appear in the Proceedings of Supercomputing 2010 (submitted April 12, 2010).
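    The overlap of communication and computation mentioned in the abstract is commonly organised as in the following C++/MPI sketch: interior work proceeds while non-blocking halo exchanges are in flight, and boundary work runs only after the messages complete. This is a generic halo-exchange pattern, not QUDA code; the compute functions are placeholders for the actual CUDA kernels.

```cpp
#include <mpi.h>
#include <vector>

// Stand-ins for the real (GPU) kernels: apply the operator to the interior
// sites, which need no remote data, and to the boundary sites, which do.
void compute_interior(std::vector<double>&) {}
void compute_boundary(std::vector<double>&, const std::vector<double>& halo_lo,
                      const std::vector<double>& halo_hi) {}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int halo = 1024;                       // illustrative face size
    std::vector<double> field(1 << 20, 1.0);
    std::vector<double> send_lo(field.begin(), field.begin() + halo);
    std::vector<double> send_hi(field.end() - halo, field.end());
    std::vector<double> recv_lo(halo), recv_hi(halo);

    int lo = (rank - 1 + size) % size, hi = (rank + 1) % size;  // 1-D ring of ranks
    MPI_Request reqs[4];
    MPI_Irecv(recv_lo.data(), halo, MPI_DOUBLE, lo, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recv_hi.data(), halo, MPI_DOUBLE, hi, 1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(send_hi.data(), halo, MPI_DOUBLE, hi, 0, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(send_lo.data(), halo, MPI_DOUBLE, lo, 1, MPI_COMM_WORLD, &reqs[3]);

    compute_interior(field);                     // overlaps with the transfers above
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    compute_boundary(field, recv_lo, recv_hi);   // needs the received halos

    MPI_Finalize();
    return 0;
}
```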

    Towards Compositional Hierarchical Scheduling Frameworks on Uniform Multiprocessors

    Get PDF
    In this report, we approach the problem of defining and analysing compositional hierarchical scheduling frameworks (HSFs) upon uniform multiprocessor platforms. For this, we propose the uniform multiprocessor periodic resource (UMPR) model for a component interface. We extend previous work by fellow researchers (for dedicated uniform multiprocessors, and for compositional HSFs on identical multiprocessors) by providing mechanisms for the multiple aspects of compositional analysis: a sufficient test for local schedulability of sporadic task sets under global Earliest Deadline First (GEDF), and guidelines for the complex problem of selecting the virtual platform when abstracting a component. Finally, we present experimental results that provide evidence of the need for future developments within the realm of compositional HSFs on uniform multiprocessors. Funding: FCT/Égide (PESSOA programme, project SAPIENT); European Commission (project IST-FP7-STREP-288195, KARYON); FCT (LaSIGE research unit strategic project, UI 408); FCT (Individual Doctoral Grant SFRH/BD/60193/2009).

    Towards Scalable Design of Future Wireless Networks

    Full text link
    Wireless operators face an ever-growing challenge to meet the throughput and processing requirements of billions of devices that are getting connected. In current wireless networks, such as LTE and WiFi, these requirements are addressed by provisioning more resources: spectrum, transmitters, and baseband processors. However, this simple add-on approach to scale system performance is expensive and often results in resource underutilization. What are, then, the ways to efficiently scale the throughput and operational efficiency of these wireless networks? To answer this question, this thesis explores several potential designs: utilizing unlicensed spectrum to augment the bandwidth of a licensed network; coordinating transmitters to increase system throughput; and finally, centralizing wireless processing to reduce computing costs. First, we propose a solution that allows LTE, a licensed wireless standard, to co-exist with WiFi in the unlicensed spectrum. The proposed solution bridges the incompatibility between the fixed access of LTE, and the random access of WiFi, through channel reservation. It achieves a fair LTE-WiFi co-existence despite the transmission gaps and unequal frame durations. Second, we consider a system where different MIMO transmitters coordinate to transmit data of multiple users. We present an adaptive design of the channel feedback protocol that mitigates interference resulting from the imperfect channel information. Finally, we consider a Cloud-RAN architecture where a datacenter or a cloud resource processes wireless frames. We introduce a tree-based design for real-time transport of baseband samples and provide its end-to-end schedulability and capacity analysis. We also present a processing framework that combines real-time scheduling with fine-grained parallelism. The framework reduces processing times by migrating parallelizable tasks to idle compute resources, and thus decreases the processing deadline-misses at no additional cost. We implement and evaluate the above solutions using software-radio platforms and off-the-shelf radios, and confirm their applicability in real-world settings. PhD, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/133358/1/gkchai_1.pd

    Parallel Programming with Global Asynchronous Memory: Models, C++ APIs and Implementations

    Get PDF
    In the realm of High Performance Computing (HPC), message passing has been the programming paradigm of choice for over twenty years. The durable MPI (Message Passing Interface) standard, with send/receive communication, broadcast, gather/scatter, and reduction collectives, is still used to construct parallel programs where each communication is orchestrated by the developer, based on precise knowledge of data distribution and overheads; collective communications simplify the orchestration but might induce excessive synchronization. Early attempts to bring the shared-memory programming model, with its programming advantages, to distributed computing, referred to as the Distributed Shared Memory (DSM) model, faded away; one of the main issues was combining performance and programmability with the memory consistency model. The recently proposed Partitioned Global Address Space (PGAS) model is a modern revamp of DSM that exposes data placement to enable optimizations based on locality, but it still addresses (simple) data parallelism only and it relies on expensive sharing protocols. We advocate an alternative programming model for distributed computing based on a Global Asynchronous Memory (GAM), aiming to avoid coherency and consistency problems rather than solving them. We materialize GAM by designing and implementing a distributed smart pointers library, inspired by C++ smart pointers. In this model, public and private pointers (resembling C++ shared and unique pointers, respectively) are moved around instead of messages (i.e., data), thus relieving the user of the burden of minimizing transfers. On top of smart pointers, we propose a high-level C++ template library for writing applications in terms of dataflow-like networks, namely GAM nets, consisting of stateful processors exchanging pointers in a fully asynchronous fashion. We demonstrate the validity of the proposed approach, from the expressiveness perspective, by showing how GAM nets can be exploited to implement both standalone applications and higher-level parallel programming models, such as data and task parallelism. As for the performance perspective, preliminary experiments show both close-to-ideal scalability and negligible overhead with respect to state-of-the-art benchmark implementations. For instance, the GAM implementation of a high-quality video restoration filter sustains a 100 fps throughput over 70%-noisy high-quality video streams on a 4-node cluster of Graphics Processing Units (GPUs), with minimal programming effort.
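    To give a flavour of the pointer-passing style described above, the sketch below mimics the public/private distinction with standard C++ shared_ptr/unique_ptr in a toy two-stage pipeline that exchanges pointers instead of copying data. The type aliases and function names are invented for illustration and are not the GAM library's API.

```cpp
#include <cstddef>
#include <memory>
#include <queue>
#include <vector>

// Toy stand-ins for GAM-like pointers: a "private" pointer has a single owner
// and is moved between processors; a "public" pointer can be shared read-only.
template <typename T> using private_ptr = std::unique_ptr<T>;
template <typename T> using public_ptr  = std::shared_ptr<T>;

struct Frame { std::vector<float> pixels; };

// Stage 1: produce frames and hand them over by moving the (private) pointer,
// so no data copy happens and ownership stays explicit.
private_ptr<Frame> produce(std::size_t n) {
    auto f = std::make_unique<Frame>();
    f->pixels.assign(n, 0.5f);
    return f;
}

// Stage 2: filter a frame it now owns; publishing the result as a public_ptr
// lets several downstream consumers read it without further transfers.
public_ptr<const Frame> filter(private_ptr<Frame> in) {
    for (auto& p : in->pixels) p *= 0.9f;               // in-place "restoration"
    return public_ptr<const Frame>(std::move(in));
}

int main() {
    std::queue<public_ptr<const Frame>> published;
    for (int i = 0; i < 3; ++i)
        published.push(filter(produce(1024)));           // pointers, not data, flow
    return published.size() == 3 ? 0 : 1;
}
```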

    Proactive elasticity and energy awareness in data stream processing

    Get PDF
    Data stream processing applications have a long-running nature (24hr/7d), with workload conditions that may exhibit wide variations at run-time. Elasticity is the term coined to describe the capability of applications to dynamically change their resource usage in response to workload fluctuations. This paper focuses on strategies for elastic data stream processing targeting multicore systems. The key idea is to exploit Model Predictive Control, a control-theoretic method that takes into account the system behavior over a future time horizon in order to decide the best reconfiguration to execute. We design a set of energy-aware proactive strategies, optimized for throughput and latency QoS requirements, which regulate the number of used cores and the CPU frequency through the Dynamic Voltage and Frequency Scaling (DVFS) support offered by modern multicore CPUs. We evaluate our strategies in a high-frequency trading application fed by synthetic and real-world workload traces. We introduce specific properties to effectively compare different elastic approaches, and the results show that our strategies are able to achieve the best outcome.
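    A rough illustration of such a proactive control step (not the paper's actual strategy): evaluate a small set of (cores, frequency) configurations against the arrival rates forecast over a short horizon, keep the ones predicted to sustain the load, and pick the cheapest. The throughput and power models below are deliberately simplistic placeholders.

```cpp
#include <cmath>
#include <vector>

struct Config { int cores; double freq_ghz; };

// Placeholder models: service rate grows with cores and frequency; power is a
// crude cubic-in-frequency, linear-in-cores proxy (both are assumptions).
double predicted_throughput(const Config& c) { return c.cores * c.freq_ghz * 1000.0; } // tuples/s
double predicted_power(const Config& c)      { return c.cores * std::pow(c.freq_ghz, 3) * 10.0; }

// One MPC-style step: over a horizon of forecast arrival rates, keep only the
// configurations that sustain every forecast, then pick the lowest-power one.
Config choose_config(const std::vector<Config>& options,
                     const std::vector<double>& forecast_rates) {
    Config best = options.back();                  // fall back to the largest config
    double best_power = predicted_power(best);
    for (const auto& c : options) {
        bool feasible = true;
        for (double r : forecast_rates)
            if (predicted_throughput(c) < r) { feasible = false; break; }
        if (feasible && predicted_power(c) < best_power) {
            best = c;
            best_power = predicted_power(c);
        }
    }
    return best;
}

int main() {
    std::vector<Config> options;
    for (int n = 1; n <= 8; ++n)
        for (double f : {1.2, 1.8, 2.4}) options.push_back({n, f});
    std::vector<double> forecast = {5200.0, 6100.0, 5900.0};   // next-horizon arrival rates
    Config c = choose_config(options, forecast);
    return (c.cores >= 1) ? 0 : 1;
}
```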

    Runtime Management of Multiprocessor Systems for Fault Tolerance, Energy Efficiency and Load Balancing

    Get PDF
    Efficiency of modern multiprocessor systems is hurt by unpredictable events: aging causes permanent faults that disable components; applications spawning and terminating at arbitrary times affect energy proportionality, causing energy waste; load imbalances reduce resource utilization, penalizing performance. This thesis demonstrates how runtime management can mitigate the negative effects of unpredictable events, making decisions guided by a combination of static information known in advance and parameters that only become known at runtime. We propose techniques for three different objectives: graceful degradation of aging-prone systems; energy efficiency of heterogeneous adaptive systems; and load balancing by means of work stealing. Managing aging-prone systems for graceful efficiency degradation is based on a high-level system description that encapsulates hardware reconfigurability and workload flexibility and makes it possible to quantify system efficiency and use it as an objective function. Different custom heuristics, as well as simulated annealing and a genetic algorithm, are proposed to optimize this objective function as a response to component failures. Custom heuristics are one to two orders of magnitude faster, provide better efficiency for the first 20% of system lifetime, and are less than 13% worse than a genetic algorithm at the end of this lifetime. Custom heuristics occasionally fail to satisfy reconfiguration cost constraints; as all algorithms' execution times scale well with respect to system size, a genetic algorithm can be used as a backup in these cases. Managing heterogeneous multiprocessors capable of Dynamic Voltage and Frequency Scaling is based on a model that accurately predicts performance and power: performance is predicted by combining static, application-specific profiling information and dynamic, runtime performance monitoring data; power is predicted using the aforementioned performance estimations and a set of platform-specific, static parameters, determined only once and used for every application mix. Three runtime heuristics are proposed that make use of this model to perform a partial search of the configuration space, evaluating a small set of configurations and selecting the best one. When best-effort performance is adequate, the proposed approach achieves 3% higher energy efficiency compared to the powersave governor and 2x better energy efficiency compared to the interactive and ondemand governors. When individual applications' performance requirements are considered, the proposed approach is able to satisfy them, giving away 18% of the system's energy efficiency compared to powersave, which however misses the performance targets by 23%; at the same time, the proposed approach maintains an efficiency advantage of about 55% over the other governors, which also satisfy the requirements. Lastly, to improve load balancing of multiprocessors, a partial and approximate view of the current load distribution among system cores is proposed, which consists of lightweight data structures and is maintained by each core through cheap operations. A runtime algorithm is developed that uses this view whenever a core becomes idle to perform victim core selection for work stealing, also considering system topology and memory hierarchy. Among 12 diverse imbalanced workloads, the proposed approach achieves better performance than random, hierarchical and local stealing for six workloads. Furthermore, it is at most 8% slower on the other six workloads, while competing strategies incur a penalty of at least 89% on some workload.
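    The work-stealing contribution can be pictured with the following sketch: each core publishes a cheap, possibly stale estimate of its queue length, and an idle core prefers the busiest victim within its own cluster before looking system-wide. This is an illustrative reconstruction of the general approach, with invented data structures, not the thesis' implementation.

```cpp
#include <atomic>
#include <cstddef>
#include <optional>
#include <vector>

// Approximate view: every core publishes its queue length with relaxed atomics;
// readers may see slightly stale values, which is acceptable for victim hints.
struct CoreLoad { std::atomic<int> queued{0}; int cluster = 0; };

void task_enqueued(CoreLoad& c) { c.queued.fetch_add(1, std::memory_order_relaxed); }
void task_finished(CoreLoad& c) { c.queued.fetch_sub(1, std::memory_order_relaxed); }

// Victim selection for an idle core: prefer the busiest core sharing the same
// cluster (cheaper steals, warmer caches); fall back to the busiest core overall.
std::optional<std::size_t> pick_victim(const std::vector<CoreLoad>& cores, std::size_t self) {
    auto busiest = [&](bool same_cluster_only) -> std::optional<std::size_t> {
        std::optional<std::size_t> best;
        int best_load = 0;                       // only steal from cores with work
        for (std::size_t i = 0; i < cores.size(); ++i) {
            if (i == self) continue;
            if (same_cluster_only && cores[i].cluster != cores[self].cluster) continue;
            int load = cores[i].queued.load(std::memory_order_relaxed);
            if (load > best_load) { best = i; best_load = load; }
        }
        return best;
    };
    if (auto v = busiest(true)) return v;        // local cluster first
    return busiest(false);                       // then anywhere in the system
}

int main() {
    std::vector<CoreLoad> cores(4);
    cores[0].cluster = cores[1].cluster = 0;
    cores[2].cluster = cores[3].cluster = 1;
    task_enqueued(cores[1]);
    task_enqueued(cores[3]);
    task_enqueued(cores[3]);
    auto victim = pick_victim(cores, 0);         // expects core 1 (same cluster)
    return (victim && *victim == 1) ? 0 : 1;
}
```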

    High Performance Real-Time Scheduling Framework for Multiprocessor Systems

    Get PDF
    Embedded systems, performing specific functions in modern devices, have become pervasive in today's technology landscape. As many of these systems are real-time systems, they must operate under stringent time constraints. This is especially evident in sectors like automotive and aerospace. This thesis introduces a High Performance Real-time Scheduling (HPRTS) framework, which is designed to navigate the multifaceted challenges faced by multiprocessor real-time systems. To begin with, the research attempts to bridge the gap between system reliability and resource sharing in Mixed-Criticality Systems (MCS). In addressing this, a novel fault-tolerance solution is presented, whose main goal is to enhance fault management and reduce blocking time during fault tolerance. Following this, the thesis delves into task allocation in systems with shared resources. In this context, we introduce a distinct Resource Contention Model (RCM); using this model as a foundation, our allocation strategy is formulated with the aim of reducing resource contention. Moreover, in light of the escalating system complexity where tasks are represented using Directed Acyclic Graph (DAG) models, the research unveils a new Response Time Analysis (RTA) for multi-DAG systems. This particular analysis has been tailored to provide a safe and more refined bound. Reflecting on the contributions made, the achievements of the thesis highlight the potency of the HPRTS framework in steering real-time embedded systems toward high performance.
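    As a pointer to the kind of reasoning behind a response-time analysis for DAG tasks, the sketch below computes a DAG's total work and critical-path length and evaluates the classical makespan bound L + (W - L)/m on m identical cores, which DAG analyses typically refine. This is a textbook bound shown for illustration, not the new analysis introduced in the thesis.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A DAG task: node i has WCET wcet[i] and a list of predecessors preds[i].
// Nodes are assumed to be listed in topological order (illustrative setup).
struct DagTask {
    std::vector<double> wcet;
    std::vector<std::vector<int>> preds;
};

double total_work(const DagTask& g) {
    double w = 0.0;
    for (double c : g.wcet) w += c;
    return w;
}

// Critical path (span): longest WCET-weighted chain, computed over the
// topological order with a simple longest-path recurrence.
double critical_path(const DagTask& g) {
    std::vector<double> finish(g.wcet.size(), 0.0);
    double span = 0.0;
    for (std::size_t i = 0; i < g.wcet.size(); ++i) {
        double start = 0.0;
        for (int p : g.preds[i]) start = std::max(start, finish[p]);
        finish[i] = start + g.wcet[i];
        span = std::max(span, finish[i]);
    }
    return span;
}

// Classical bound on the makespan of one DAG on m identical cores under any
// work-conserving scheduler: L + (W - L) / m.
double makespan_bound(const DagTask& g, int m) {
    double W = total_work(g), L = critical_path(g);
    return L + (W - L) / m;
}

int main() {
    // Fork-join DAG: 0 -> {1, 2} -> 3, with W = 10 and L = 9.
    DagTask g{{2, 3, 1, 4}, {{}, {0}, {0}, {1, 2}}};
    return makespan_bound(g, 2) <= 10.0 ? 0 : 1;
}
```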