510 research outputs found

    A Survey of Pipelined Workflow Scheduling: Models and Algorithms

    A large class of applications needs to execute the same workflow on different data sets of identical size. Efficient execution of such applications necessitates intelligent distribution of the application components and tasks on a parallel machine, and the execution can be orchestrated by utilizing task-, data-, pipelined-, and/or replicated-parallelism. The scheduling problem that encompasses all of these techniques is called pipelined workflow scheduling, and it has been widely studied in the last decade. Multiple models and algorithms have flourished to tackle various programming paradigms, constraints, machine behaviors or optimization goals. This paper surveys the field by summing up and structuring known results and approaches.

    CASCH: a tool for computer-aided scheduling

    A software tool called Computer-Aided Scheduling (CASCH) for parallel processing on distributed-memory multiprocessors in a complete parallel programming environment is presented. A compiler automatically converts sequential applications into parallel codes to perform program parallelization. The parallel code that executes on a target machine is optimized by CASCH through proper scheduling and mapping.

    FPGA Implementation of Data Flow Graphs for Digital Signal Processing Applications

    Rapid growth in digital signal processing applications has increased the demand for high-speed digital systems, and multiprocessor systems are the best choice for these applications. Before a hardware implementation is produced, the operations that describe these applications must be scheduled and allocated to hardware. This paper proposes a new scheduling technique for digital signal processing (DSP) applications represented by data flow graphs (DFGs). In addition, hardware allocation is implemented in the form of an embedded system. The proposed scheduling technique achieves the optimal schedule of a DFG at design time. The optimality criterion considered is maximum throughput within the available hardware resources. Maximum throughput is achieved by arranging the DFG nodes according to their data dependencies. Two nodes can then be clustered into one compound task, which reduces the overall execution time by reducing the number of tasks and therefore the number of cycles needed to execute them. Each task is then expressed as an instruction to be executed by the hardware system. The hardware system is composed of one or more homogeneous pipelined processing elements and is designed to meet the maximum-rate schedule. Two implementations of the system architecture are proposed, which differ in the number of processing elements: the serial system and the parallel system. The serial system comprises one processing element on which all tasks are processed sequentially, whilst the parallel system has four processing elements that execute tasks concurrently. Both systems consist mainly of seven units: central shared memory, state table, multiway function unit buffer, execution array, processing element(s), instruction buffer, and address generation unit.
The hardware components were built on an FPGA chip using Verilog HDL. In synthesis results, the parallel system performs 25.5% better than the serial system, while the serial system requires a smaller area, measured by the number of slice registers and slice lookup tables (LUTs). The relationship between the number of instructions executed in both systems and the system area and performance (represented by system frequency) is studied. Increasing the memory size does not affect the performance of the serial system and decreases that of the parallel system slightly, by 1.5% to 4.5%. The area of both systems increases, and in some cases doubles. The proposed scheduling technique is shown to outperform the retiming technique, which we chose for comparison: in synthesis results, the serial system achieves a 19.3% higher system frequency and the parallel system a 51.2% higher system frequency than the retiming technique.

    Planning And Scheduling For Large-scale Distributed Systems

    Many applications require computing resources well beyond those available on any single system. Simulations of atomic and subatomic systems with application to material science, computations related to study of natural sciences, and computer-aided design are examples of applications that can benefit from the resource-rich environment provided by a large collection of autonomous systems interconnected by high-speed networks. To transform such a collection of systems into a user's virtual machine, we have to develop new algorithms for coordination, planning, scheduling, resource discovery, and other functions that can be automated. Then we can develop societal services based upon these algorithms, which hide the complexity of the computing system for users. In this dissertation, we address the problem of planning and scheduling for large-scale distributed systems. We discuss a model of the system, analyze the need for planning, scheduling, and plan switching to cope with a dynamically changing environment, present algorithms for the three functions, report the simulation results to study the performance of the algorithms, and introduce an architecture for an intelligent large-scale distributed system.

    Semi-automated parallel programming in heterogeneous intelligent reconfigurable environments (SAPPHIRE)

    In recent years, as we approach physical limits in making smaller (and faster) computer processors, focus has instead turned toward including multiple processor cores in each device. While this technically allows for more computational power than a single traditional processor core, conventional software can typically only make use of a single processor. Furthermore, we see an increasing number of stream programs that process streams of data such as a stream of images or audio. For stream programs to effectively utilize multi-core processors, multithreading is the key, but it may be difficult to implement in practice depending on the complexity of the programs. We present SAPPHIRE: Semi-Automated Parallel Programming in Heterogeneous Intelligent Reconfigurable Environment, a middleware and SDK for developing multithreaded stream programs. In this middleware, we implement our semi-automated program construction technique, which is designed to aid in writing multithreaded software by reducing the complexity and the lines of code that software developers must write. We also present a novel static task-scheduling algorithm for stream programs with heterogeneous implementation choices. Our algorithm is capable of scheduling stream programs with provably near-optimal results given a specific set of assumptions, without requiring the unrolling of the task graph. Unrolling the task graph, as done in related work, greatly increases the size of the input to the NP-Complete part of the task-scheduling problem. Finally, we present two case study programs implemented using SAPPHIRE. One case study, EM-Capture, has analyzed over 50 billion frames of endoscopy video in real-time in a real hospital, discerning over 71,000 unique endoscopy procedures. The other case study, EM-Feedback-RT, is a collaborative extension to EM-Capture, and is an attempt to provide real-time quality analysis feedback to physicians during a colonoscopy exam.

    Modèles de calculs flot de données avec paramètres entiers et booléens. Modélisation - Analyses - Mise en oeuvre

    Streaming applications are responsible for the majority of the computation load in many embedded systems (video conferencing, computer vision etc). Their high performance requirements make parallel implementations a necessity. Hence, more and more modern embedded systems include many-core processors that allow massive parallelism. Parallel implementation of streaming applications on many-core platforms is challenging because of their complexity, which tends to increase, and their strict requirements both qualitative (e.g., robustness, reliability) and quantitative (e.g., throughput, power consumption). This is observed in the evolution of video codecs that keep increasing in complexity, while their performance requirements remain the same or even increase. Data flow models of computation (MoCs) have been developed to facilitate the design process of such applications, which are typically composed of filters exchanging streams of data via communication links. Data flow MoCs provide an intuitive representation of streaming applications, while exposing the available parallelism of the application. Moreover, they provide static analyses for liveness and boundedness. However, modern streaming applications feature filters that exchange variable amounts of data, and communication links that are not always active. In this thesis, we present a new data flow MoC, the Boolean Parametric Data Flow (BPDF), that allows parametrization of the amount of data exchanged between the filters using integer parameters and the enabling and disabling of communication links using boolean parameters. In this way, BPDF is able to capture more complex streaming applications, like video decoders. Despite the increase in expressiveness, BPDF applications remain statically analyzable for liveness and boundedness. However, increased expressiveness greatly complicates implementation. 
Integer parameters result in parametric data dependencies, and boolean parameters disable communication links, effectively removing data dependencies. We propose a scheduling framework that facilitates the scheduling of BPDF applications. The framework produces as-soon-as-possible (ASAP) schedules for a given static mapping. It takes as input scheduling constraints that derive either from the application (data dependencies) or from the user (schedule optimizations). The constraints are analyzed for liveness and, if possible, simplified. In this way, the framework provides flexibility while guaranteeing the liveness of the application. Calculating the throughput of an application is important both at compile time and at run time: at compile time it allows verifying that the application meets its performance requirements, and at run time it allows scheduling decisions that can improve performance or power consumption. We approach this problem by finding parametric expressions for the maximum throughput of a subset of BPDF graphs. Finally, we provide an algorithm that calculates sufficient buffer sizes for the BPDF graph to operate at maximum throughput.
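The idea of an as-soon-as-possible schedule under a fixed static mapping, mentioned in the abstract above, can be illustrated with a minimal sketch. This is not the thesis's BPDF framework: it assumes a plain task graph with fixed durations and no integer or boolean parameters, and the function name `asap_schedule` and the example tasks are hypothetical.

```python
from collections import defaultdict


def asap_schedule(durations, deps, mapping, order):
    """Return {task: (start, finish)} for an ASAP schedule.

    durations: {task: execution time}
    deps:      {task: list of predecessor tasks}
    mapping:   {task: processor}, the fixed static mapping
    order:     all tasks in a topological order; tasks that share
               a processor run in this order (an assumption)
    """
    proc_free = defaultdict(int)  # time at which each processor becomes idle
    times = {}
    for task in order:
        # A task is ready once all of its predecessors have finished.
        ready = max((times[p][1] for p in deps.get(task, [])), default=0)
        # It starts as soon as it is ready and its processor is idle.
        start = max(ready, proc_free[mapping[task]])
        finish = start + durations[task]
        times[task] = (start, finish)
        proc_free[mapping[task]] = finish
    return times


# Hypothetical diamond-shaped graph: A feeds B and C, which both feed D.
durations = {"A": 2, "B": 3, "C": 1, "D": 2}
deps = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
mapping = {"A": "p0", "B": "p0", "C": "p1", "D": "p1"}
schedule = asap_schedule(durations, deps, mapping, ["A", "B", "C", "D"])
print(schedule)
```

In this example D cannot start before time 5, because it waits for B on the other processor even though its own processor is idle from time 3.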

    Extensions of Task-based Runtime for High Performance Dense Linear Algebra Applications

    On the road to exascale computing, the gap between hardware peak performance and application performance is increasing as system scale, chip density and inherent complexity of modern supercomputers are expanding. Even if we put aside the difficulty of expressing algorithmic parallelism and efficiently executing applications at large scale, other open questions remain. The ever-growing scale of modern supercomputers induces a fast decline of the Mean Time To Failure. A generic, low-overhead, resilient extension becomes a desired aptitude for any programming paradigm. This dissertation addresses these two critical issues, designing an efficient unified linear algebra development environment using a task-based runtime, and extending a task-based runtime with fault tolerant capabilities to build a generic framework providing both soft and hard error resilience to the task-based programming paradigm. To bridge the gap between hardware peak performance and application performance, a unified programming model is designed to take advantage of a lightweight task-based runtime to manage the resource-specific workload, and to control the data flow and parallel execution of tasks. Under this unified development, linear algebra tasks are abstracted across different underlying heterogeneous resources, including multicore CPUs, GPUs and Intel Xeon Phi coprocessors. Performance portability is guaranteed and this programming model is adapted to a wide range of accelerators, supporting both shared and distributed-memory environments. To solve the resilience challenges on large scale systems, fault tolerant mechanisms are designed for a task-based runtime to protect applications against both soft and hard errors. For soft errors, three additions to a task-based runtime are explored.
The first recovers the application by re-executing a minimum number of tasks, the second logs intermediary data between tasks to minimize the necessary re-execution, while the last one takes advantage of algorithmic properties to recover the data without re-execution. For hard errors, we propose two generic approaches, which augment the data logging mechanism for soft errors. The first utilizes a non-volatile storage device to save logged data, while the second saves local logged data on a remote node to protect against node failure. Experimental results have confirmed that our soft and hard error fault tolerant mechanisms exhibit the expected correctness and efficiency.

    QoS-aware predictive workflow scheduling

    This research lays the foundation for QoS-aware predictive workflow scheduling. Its novel contributions open up prospects for future research in handling complex big workflow applications with high uncertainty and dynamism. The results from the proposed workflow scheduling algorithm show significant improvement in the performance and reliability of workflow applications.