26 research outputs found

    A study of scheduling problems with preemptions on multi-core computers with GPU accelerators

    Get PDF
    International audienceFor many years, scheduling problems have been concerned either with parallel processor systems or with dedicated processors-job shop type systems. With a development of new computing architectures this partition is no longer so obvious. Multi-core (processor) computers equipped with GPU co-processors require new scheduling strategies. This paper is devoted to a characterization of this new type of scheduling problems. After a thorough introduction of the new model of a computing system, an extension of the classical notation of scheduling problems is proposed. A special attention is paid to preemptions, since this feature of the new architecture differs the most as compared with the classical model. In the paper, several scheduling algorithms, new ones and those refining classical approaches, are presented. Possible extensions of the model are also discussed

    Independent tasks on 2 resources with co-scheduling effects

    Get PDF
    Concurrent kernel execution is a relatively new feature in modern GPUs, which was designed to improve hardware utilization and the overall system throughput. However, the decision on the simultaneous execution of tasks is performed by the hardware with a leftover policy, that assigns as many resources as possible for one task and then assigns the remaining resources to the next task. This can lead to unreasonable use of resources. In this work, we tackle the problem of co-scheduling for GPUs with and without preemption, with the focus on determining the kernels submission order to reduce the number of preemptions and the kernels makespan, respectively. We propose a graph-based theoretical model to build preemptive and non-preemptive schedules. We show that the optimal preemptive makespan can be computed by solving a Linear Program in polynomial time, and we propose an algorithm based on this solution which minimizes the number of preemptions. We also propose an algorithm that transforms a preemptive solution of optimal makespan into a non-preemptive solution with the smallest possible preemption overhead. We show, however, that finding the minimal amount of preemptions among all preemptive solutions of optimal makespan is a NP-hard problem, and computing the optimal non-preemptive schedule is also NP-hard. In addition, we study the non-preemptive problem, without searching first for a good preemptive solution, and present a Mixed Integer Linear Program solution to this problem. We performed experiments on real-world GPU applications and our approach can achieve optimal makespan by preempting 6 to 9% of the tasks. Our non-preemptive approach, on the other side, obtains makespan within 2.5% of the optimal preemptive schedules, while previous approaches exceed the preemptive makespan by 5 to 12%

    Parallel and Distributed Computing

    Get PDF
    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. Particularly, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peertopeer networks, largescale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing

    Fixed-Priority Memory-Centric Scheduler for COTS-Based Multiprocessors

    Get PDF
    Memory-centric scheduling attempts to guarantee temporal predictability on commercial-off-the-shelf (COTS) multiprocessor systems to exploit their high performance for real-time applications. Several solutions proposed in the real-time literature have hardware requirements that are not easily satisfied by modern COTS platforms, like hardware support for strict memory partitioning or the presence of scratchpads. However, even without said hardware support, it is possible to design an efficient memory-centric scheduler. In this article, we design, implement, and analyze a memory-centric scheduler for deterministic memory management on COTS multiprocessor platforms without any hardware support. Our approach uses fixed-priority scheduling and proposes a global "memory preemption" scheme to boost real-time schedulability. The proposed scheduling protocol is implemented in the Jailhouse hypervisor and Erika real-time kernel. Measurements of the scheduler overhead demonstrate the applicability of the proposed approach, and schedulability experiments show a 20% gain in terms of schedulability when compared to contention-based and static fair-share approaches

    Enabling the use of embedded and mobile technologies for high-performance computing

    Get PDF
    In the late 1990s, powerful economic forces led to the adoption of commodity desktop processors in High-Performance Computing(HPC). This transformation has been so effective that the November 2016 TOP500 list is still dominated by x86 architecture. In 2016, the largest commodity market in computing is not PCs or servers, but mobile computing, comprising smartphones andtablets, most of which are built with ARM-based Systems on Chips (SoC). This suggests that once mobile SoCs deliver sufficient performance, mobile SoCs can help reduce the cost of HPC. This thesis addresses this question in detail.We analyze the trend in mobile SoC performance, comparing it with the similar trend in the 1990s. Through development of real system prototypes and their performance analysis we assess the feasibility of building an HPCsystem based on mobile SoCs. Through simulation of the future mobile SoC, we identify the missing features and suggest improvements that would enable theuse of future mobile SoCs in HPC environment. Thus, we present design guidelines for future generations mobile SoCs, and HPC systems built around them, enabling the newclass of cheap supercomputers.A finales de la década de los 90, razones económicas llevaron a la adopción de procesadores de uso general en sistemas de Computación de Altas Prestaciones (HPC). Esta transformación ha sido tan efectiva que la lista TOP500 de noviembre de 2016 sigue aun dominada por la arquitectura x86. En 2016, el mayor mercado de productos básicos en computación no son los ordenadores de sobremesa o los servidores, sino la computación móvil, que incluye teléfonos inteligentes y tabletas, la mayoría de los cuales están construidos con sistemas en chip(SoC) de arquitectura ARM. Esto sugiere que una vez que los SoC móviles ofrezcan un rendimiento suficiente, podrán utilizarse para reducir el costo desistemas HPC. Esta tesis aborda esta cuestión en detalle. Analizamos la tendencia del rendimiento de los SoC para móvil, comparándola con la tendencia similar ocurrida en los añosnoventa. A través del desarrollo de prototipos de sistemas reales y su análisis de rendimiento, evaluamos la factibilidad de construir unsistema HPC basado en SoCs móviles. A través de la simulación de SoCs móviles futuros, identificamos las características que faltan y sugerimos mejoras quepermitirían su uso en entornos HPC. Por lo tanto, presentamos directrices de diseño para futuras generaciones de SoCs móviles y sistemas HPC construidos a sualrededor, para permitir la construcción de una nueva clase de supercomputadores de coste reducido

    On-Device Deep Learning Inference for System-on-Chip (SoC) Architectures

    Get PDF
    As machine learning becomes ubiquitous, the need to deploy models on real-time, embedded systems will become increasingly critical. This is especially true for deep learning solutions, whose large models pose interesting challenges for target architectures at the “edge” that are resource-constrained. The realization of machine learning, and deep learning, is being driven by the availability of specialized hardware, such as system-on-chip solutions, which provide some alleviation of constraints. Equally important, however, are the operating systems that run on this hardware, and specifically the ability to leverage commercial real-time operating systems which, unlike general purpose operating systems such as Linux, can provide the low-latency, deterministic execution required for embedded, and potentially safety-critical, applications at the edge. Despite this, studies considering the integration of real-time operating systems, specialized hardware, and machine learning/deep learning algorithms remain limited. In particular, better mechanisms for real-time scheduling in the context of machine learning applications will prove to be critical as these technologies move to the edge. In order to address some of these challenges, we present a resource management framework designed to provide a dynamic on-device approach to the allocation and scheduling of limited resources in a real-time processing environment. These types of mechanisms are necessary to support the deterministic behavior required by the control components contained in the edge nodes. To validate the effectiveness of our approach, we applied rigorous schedulability analysis to a large set of randomly generated simulated task sets and then verified the most time critical applications, such as the control tasks which maintained low-latency deterministic behavior even during off-nominal conditions. The practicality of our scheduling framework was demonstrated by integrating it into a commercial real-time operating system (VxWorks) then running a typical deep learning image processing application to perform simple object detection. The results indicate that our proposed resource management framework can be leveraged to facilitate integration of machine learning algorithms with real-time operating systems and embedded platforms, including widely-used, industry-standard real-time operating systems

    SCHEDULING REAL-TIME GRAPH-BASED WORKLOADS

    Get PDF
    Developments in the semiconductor industry in the previous decades have made possible computing platforms with very large computing capacities that, in turn, have stimulated the rapid progress of computationally intensive computer vision (CV) algorithms with highly parallelizable structure (often represented as graphs). Applications using such algorithms are the foundation for the transformation of semi-autonomous systems (e.g., advanced driver-assist systems) to future fully-autonomous systems (e.g., self-driving cars). Enabling mass-produced safety-critical systems with full autonomy requires real-time execution guarantees as a part of system certification.Since multiple CV applications may need to share the same hardware platform due to size, weight, power, and cost constraints, system component isolation is necessary to avoid explosive interference growth that breaks all execution guarantees. Existing software certification processes achieve component isolation through time partitioning, which can be broken by accelerator usage, which is essential for high-efficacy CV algorithms.The goal of this dissertation is to make a first step towards providing real-time guarantees for safety-critical systems by analyzing the scheduling of highly parallel accelerator-using workloads isolated in system components. The specific contributions are threefold.First, a general method for graph-based workloads’ response-time-bound reduction through graph structure modifications is introduced, leading to significant response-time-bound reductions. Second, a generalized real-time task model is introduced that enables real-time response-time bounds for a wider range of graph-based workloads. A proposed response-time analysis for the introduced model accounts for potential accelerator usage within tasks. Third, a scheduling approach for graph-based workloads in a single system component is proposed that ensures the temporal isolation of system components. A response-time analysis for workloads with accelerator usage is presented alongside a non-mandatory schedulability-improvement step. This approach can help to enable component-wise certification in the considered systems.Doctor of Philosoph
    corecore