151 research outputs found

    Exascale machines require new programming paradigms and runtimes

    Get PDF
    Extreme-scale parallel computing systems will have tens of thousands of optionally accelerator-equipped nodes with hundreds of cores each, as well as deep memory hierarchies and complex interconnect topologies. Such Exascale systems will provide hardware parallelism at multiple levels and will be energy constrained. Their extreme scale and the rapidly deteriorating reliability of their hardware components mean that Exascale systems will exhibit low mean-time-between-failure values. Furthermore, existing programming models already require heroic programming and optimisation efforts to achieve high efficiency on current supercomputers. Invariably, these efforts are platform-specific and non-portable. In this paper we explore the shortcomings of existing programming models and runtime systems for large-scale computing systems. We then propose and discuss important features of programming paradigms and runtime systems for large-scale computing systems, with a special focus on data-intensive applications and resilience. Finally, we also discuss code sustainability issues and propose several software metrics that are of paramount importance for code development for large-scale computing systems.

    Proactive cloud service assurance framework for fault remediation in cloud environment

    Get PDF
    Cloud resiliency is an important issue in the successful implementation of cloud computing systems. Handling cloud faults proactively, with a suitable remediation technique of minimum cost, is an important requirement for a fault management system. The selection of the best applicable remediation technique is a decision-making problem that considers parameters such as i) the impact of the remediation technique, ii) the overhead of the remediation technique, iii) the severity of the fault, and iv) the priority of the application. This manuscript proposes an analytical model to measure the effectiveness of a remediation technique for various categories of faults; it further demonstrates the implementation of an efficient fault remediation system using a rule-based expert system. The expert system is designed to compute a utility value for each remediation technique in a novel way and select the best remediation technique from its knowledge base. A prototype is developed for experimentation purposes, and the results show improved availability with less overhead as compared to a reactive fault management system.
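    The abstract does not give the utility formula, so the following is only a minimal sketch of the kind of weighted-utility selection it describes: the four decision parameters are combined into a single score and the highest-scoring technique in the knowledge base is chosen. The attribute names, weights, and 0-to-1 scales are assumptions, not the manuscript's actual model.

```python
# Minimal sketch of rule-based remediation selection (assumed weighted-utility
# model; weights, attribute names, and scales are illustrative, not the
# manuscript's actual formulation).
from dataclasses import dataclass

@dataclass
class Remediation:
    name: str
    impact: float      # expected benefit of applying the technique, 0..1
    overhead: float    # relative cost in time/resources, 0..1

def utility(r: Remediation, fault_severity: float, app_priority: float,
            weights=(0.4, 0.3, 0.2, 0.1)) -> float:
    """Combine the four decision parameters into a single score.
    Overhead is penalised; the other factors contribute positively."""
    w_impact, w_overhead, w_severity, w_priority = weights
    return (w_impact * r.impact
            - w_overhead * r.overhead
            + w_severity * fault_severity
            + w_priority * app_priority)

def select_remediation(knowledge_base, fault_severity, app_priority):
    """Pick the technique with the highest utility for the given fault."""
    return max(knowledge_base,
               key=lambda r: utility(r, fault_severity, app_priority))

kb = [Remediation("restart_vm", impact=0.8, overhead=0.6),
      Remediation("migrate_vm", impact=0.9, overhead=0.8),
      Remediation("retry_request", impact=0.4, overhead=0.1)]
print(select_remediation(kb, fault_severity=0.7, app_priority=0.9).name)
```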

    Scheduling Tasks on Intermittently-Powered Real-Time Systems

    Get PDF
    Batteryless systems go through sporadic power-on and power-off phases due to intermittently available energy; thus, they are called intermittent systems. Unfortunately, this intermittence in the power supply hinders the timely execution of tasks and limits such devices' potential in certain application domains, e.g., healthcare and livestock tracking. Unlike prior work on time-aware intermittent systems that focuses on timekeeping [1, 2, 3] and discarding expired data [4], this dissertation concentrates on finishing task execution on time. I leverage the data processing and control layer of batteryless systems by developing frameworks that (1) integrate energy harvesting and real-time systems, (2) rethink machine learning algorithms for an energy-aware imprecise task scheduling framework, (3) develop scheduling algorithms that, along with deciding what to compute, answer when to compute and when to harvest, and (4) utilize distributed systems that collaboratively emulate a persistently powered system.

    Scheduling Framework for Intermittently Powered Computing Systems. Batteryless systems rely on sporadically available harvestable energy. For example, kinetic-powered motion detector sensors on impalas can only harvest energy when the impalas are moving, which cannot be ascertained in advance. This uncertainty poses a unique real-time scheduling problem where existing real-time algorithms fail due to the interruption in execution time. This dissertation proposes a unified scheduling framework that includes both harvesting and computing.

    Imprecise Deep Neural Network Inference in Deadline-Aware Intermittent Systems. This dissertation proposes Zygarde, an energy-aware and outcome-aware soft-real-time imprecise deep neural network (DNN) task scheduling framework for intermittent systems. Zygarde leverages the semantic diversity of input data and the layer-dependent expressiveness of deep features, and infers only the necessary DNN layers based on the available time and energy. Zygarde proposes a novel technique to determine the imprecise boundary at runtime by exploiting clustering classifiers and specialized offline training of the DNNs to minimize the loss of accuracy due to partial execution. It also proposes a single metric, η, to represent a system's predictability, which measures how close a harvester's harvesting pattern is to a constant energy source. In addition, Zygarde includes a scheduling algorithm that takes the available time, available energy, impreciseness, and the classifier's performance into account.

    Scheduling Mutually Exclusive Computing and Harvesting Tasks in Deadline-Aware Intermittent Systems. The lack of sufficient ambient energy to directly power intermittent systems introduces mutually exclusive computing and charging cycles in intermittently powered systems. This introduces a challenging real-time scheduling problem where existing real-time algorithms fail due to the interruptions in execution time. To address this, this dissertation proposes Celebi, which considers the dynamics of the available energy and schedules when to harvest and when to compute in batteryless systems (a minimal sketch of such a harvest-or-compute decision follows this abstract). Using data-driven simulation and real-world experiments, this dissertation shows that Celebi significantly increases the number of tasks that complete execution before their deadline when power is only available intermittently.

    Persistent System Emulation with Distributed Intermittent Systems. Intermittently-powered sensing and computing systems go through sporadic power-on and power-off periods due to the uncertain availability of energy sources. Despite recent efforts to advance time-sensitive intermittent systems, such systems fail to capture important target events when energy is absent for a prolonged time. This event miss limits the potential usage of intermittent systems in fault-intolerant and safety-critical applications. To address this problem, this dissertation proposes Falinks, a framework that allows a swarm of distributed intermittently powered nodes to collaboratively imitate the sensing and computing capabilities of a persistently powered system. The framework provides power-on and power-off schedules for the swarm of intermittent nodes, which have no communication capability with each other. Doctor of Philosophy
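    Celebi's actual scheduling algorithm is not given in the abstract; the sketch below only illustrates the general harvest-or-compute decision it refers to, made slot by slot from the stored energy and the remaining work. The energy model, the greedy rule, and the task fields are assumptions.

```python
# Illustrative harvest-or-compute loop for an intermittently powered node.
# The energy model and task fields are invented for this sketch and do not
# reproduce Celebi's scheduling algorithm.
from dataclasses import dataclass

@dataclass
class Task:
    remaining_work: int     # compute slots still needed
    deadline: int           # absolute slot index by which it must finish
    energy_per_slot: float  # energy consumed per compute slot

def schedule(task: Task, stored_energy: float, harvest_rate: float,
             now: int = 0, capacity: float = 10.0):
    """Greedy slot-by-slot decision: compute whenever there is enough stored
    energy, otherwise harvest. Returns the list of decisions taken."""
    decisions = []
    while task.remaining_work > 0 and now < task.deadline:
        if stored_energy >= task.energy_per_slot:
            stored_energy -= task.energy_per_slot
            task.remaining_work -= 1
            decisions.append("compute")
        else:
            # Not enough energy for one compute slot: spend the slot charging.
            stored_energy = min(capacity, stored_energy + harvest_rate)
            decisions.append("harvest")
        now += 1
    decisions.append("met deadline" if task.remaining_work == 0
                     else "missed deadline")
    return decisions

print(schedule(Task(remaining_work=3, deadline=8, energy_per_slot=2.0),
               stored_energy=1.0, harvest_rate=1.5))
```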

    QoS-aware predictive workflow scheduling

    Full text link
    This research lays the foundation for QoS-aware predictive workflow scheduling. Its novel contributions will open up prospects for future research in handling complex big workflow applications with high uncertainty and dynamism. The results from the proposed workflow scheduling algorithm show a significant improvement in the performance and reliability of the workflow applications.

    A distributed processor state management architecture for large-window processors

    Get PDF
    Processor architectures with large instruction windows have been proposed to expose more instruction-level parallelism (ILP) and increase performance. Some of the proposed architectures replace a re-order buffer (ROB) with a check-pointing mechanism and an out-of-order release of processor resources. Check-pointing, however, leads to an imprecise processor state recovery on mis-predicted branches and exceptions, and to re-execution of correct-path instructions after state recovery. It also requires large register files, complicating renaming, allocation and release of physical registers. This paper proposes a new processor architecture called a Multi-State Processor (MSP). The MSP does not use check-pointing, avoids the above-mentioned problems, and has a fast, distributed state recovery mechanism. The MSP uses a novel register management architecture allowing implementation of large register files with simpler and more scalable register allocation, renaming, and release. It is also key to the precise processor state recovery mechanism. The MSP is shown to improve IPC by 14%, on average, for integer SPEC CPU2000 benchmarks compared to a check-pointing-based mechanism [2] when a fast and simple branch predictor is used. With a very aggressive branch predictor the IPC improvement is 1%, on average, and 3% if some of the programs are optimized for the MSP. The MSP also reduces the average number of executed instructions by 16.5% (12% for the aggressive branch predictor), mostly due to precise state recovery. This improves the MSP processor's energy efficiency even though it uses a larger register file. Peer reviewed. Postprint (published version).
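    The abstract mentions renaming, allocation, and release of physical registers without detailing the MSP's novel scheme. As background for those terms only, here is a textbook rename-map-plus-free-list sketch; it is explicitly not the MSP's register management architecture, whose details are not given in the abstract.

```python
# Textbook register-renaming mechanics (rename map + free list), shown as
# background for the terminology in the abstract; NOT the MSP's novel scheme.
class RenameStage:
    def __init__(self, num_arch_regs=32, num_phys_regs=128):
        # Architectural register i initially maps to physical register i.
        self.map_table = list(range(num_arch_regs))
        self.free_list = list(range(num_arch_regs, num_phys_regs))

    def rename(self, dest_arch_reg):
        """Allocate a fresh physical register for a new definition of
        dest_arch_reg; return (new_phys, previous_phys). The previous mapping
        is kept until the instruction commits so recovery can restore it."""
        if not self.free_list:
            raise RuntimeError("rename stalls: no free physical registers")
        new_phys = self.free_list.pop(0)
        prev_phys = self.map_table[dest_arch_reg]
        self.map_table[dest_arch_reg] = new_phys
        return new_phys, prev_phys

    def commit(self, prev_phys):
        """At commit, the previously mapped physical register is released."""
        self.free_list.append(prev_phys)

rs = RenameStage()
p_new, p_old = rs.rename(dest_arch_reg=5)   # e.g. an instruction writing r5
rs.commit(p_old)                            # release the old mapping at commit
```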

    Self-adaptivity of applications on network on chip multiprocessors: the case of fault-tolerant Kahn process networks

    Get PDF
    Technology scaling, accompanied by higher operating frequencies and the ability to integrate more functionality in the same chip, has been the driving force behind delivering higher-performance computing systems at lower costs. Embedded computing systems, which have been riding the same wave of success, have evolved into complex architectures encompassing a high number of cores interconnected by an on-chip network (usually identified as a Multiprocessor System-on-Chip). However, these trends are hindered by issues that arise as technology scaling continues towards deep submicron scales. Firstly, the growing complexity of these systems and the variability introduced by process technologies make it ever harder to perform a thorough optimization of the system at design time. Secondly, designers are faced with a reliability wall that emerges as age-related degradation reduces the lifetime of transistors, and as the probability of defects escaping post-manufacturing testing increases. In this thesis, we take on these challenges within the context of streaming applications running on network-on-chip based parallel (not necessarily homogeneous) systems-on-chip that adopt the no-remote-memory-access model. In particular, this thesis tackles two main problems: (1) fault-aware online task remapping, and (2) application-level self-adaptation for quality management. For the former, by viewing fault tolerance as a self-adaptation aspect, we adopt a cross-layer approach that aims at graceful performance degradation by addressing permanent faults in processing elements mostly at system level, in particular by exploiting the redundancy available in multi-core platforms. We propose an optimal solution based on an integer linear programming formulation (suitable for design-time adoption) as well as heuristic-based solutions to be used at run-time. We assess the impact of our approach on lifetime reliability. We propose two recovery schemes based on a checkpoint-and-rollback and a roll-forward technique. For the latter, we propose two variants of a monitor-controller-adapter loop that adapts application-level parameters to meet performance goals. We demonstrate not only that fault tolerance and self-adaptivity can be achieved in embedded platforms, but also that this can be done without incurring large overheads. In addressing these problems, we present techniques which have been realized (depending on their characteristics) in the form of a design tool, a run-time library or a hardware core to be added to the basic architecture.
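    The thesis's ILP formulation and run-time heuristics are not reproduced in the abstract; the sketch below only illustrates the general idea of online task remapping after a permanent fault, using an assumed greedy least-loaded rule. The task names, load estimates, and the policy itself are illustrative, not the thesis's actual algorithms.

```python
# Illustrative run-time remapping heuristic: when a processing element (PE)
# fails, move its tasks to the least-loaded surviving PEs. The greedy rule is
# an assumption for illustration, not the thesis's ILP or heuristics.
def remap_on_failure(mapping, task_load, failed_pe, alive_pes):
    """mapping: task -> PE; task_load: task -> load estimate.
    Returns an updated mapping with the failed PE's tasks redistributed."""
    # Current load of each surviving PE.
    load = {pe: 0.0 for pe in alive_pes}
    for task, pe in mapping.items():
        if pe in load:
            load[pe] += task_load[task]

    new_mapping = dict(mapping)
    # Reassign the heaviest orphaned tasks first to balance load (greedy).
    orphans = sorted((t for t, pe in mapping.items() if pe == failed_pe),
                     key=lambda t: task_load[t], reverse=True)
    for task in orphans:
        target = min(load, key=load.get)       # least-loaded surviving PE
        new_mapping[task] = target
        load[target] += task_load[task]
    return new_mapping

mapping = {"src": 0, "filter": 1, "fft": 1, "sink": 2}
loads = {"src": 1.0, "filter": 2.0, "fft": 3.0, "sink": 0.5}
print(remap_on_failure(mapping, loads, failed_pe=1, alive_pes=[0, 2, 3]))
```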

    A novel architecture for large windows processors

    Get PDF
    Several processor architectures with large instruction windows have been proposed. They improve performance by maintaining hundreds of instructions in flight to increase instruction-level parallelism (ILP). Such architectures replace a re-order buffer (ROB) with a check-pointing mechanism and an out-of-order release of the processor resources. Check-pointing, however, leads to an imprecise state recovery on mispredicted branches and exceptions, and to frequent re-execution of correct-path instructions during state recovery. It also requires large register files, complicating renaming, allocation and release of physical registers. This technical report proposes a new processor architecture that uses neither a traditional ROB nor check-pointing, avoids the above-mentioned problems, and has a fast, distributed state recovery mechanism. Its novel register management architecture allows implementation of large register files with simpler and more scalable register renaming and commit. It is also key to the precise recovery mechanism. Postprint (published version).

    Saturn: An Optimized Data System for Large Model Deep Learning Workloads

    Full text link
    Large language models such as GPT-3 & ChatGPT have transformed deep learning (DL), powering applications that have captured the public's imagination. These models are rapidly being adopted across domains for analytics on various modalities, often by finetuning pre-trained base models. Such models need multiple GPUs due to both their size and computational load, driving the development of a bevy of "model parallelism" techniques & tools. Navigating such parallelism choices, however, is a new burden for end users of DL such as data scientists and domain scientists, who may lack the necessary systems know-how. The need for model selection, which leads to many models to train due to hyper-parameter tuning or layer-wise finetuning, compounds the situation with two more burdens: resource apportioning and scheduling. In this work, we tackle these three burdens for DL users in a unified manner by formalizing them as a joint problem that we call SPASE: Select a Parallelism, Allocate resources, and SchedulE. We propose a new information system architecture to tackle the SPASE problem holistically, representing a key step toward enabling wider adoption of large DL models. We devise an extensible template for existing parallelism schemes and combine it with an automated empirical profiler for runtime estimation. We then formulate SPASE as a mixed-integer linear program (MILP). We find that direct use of an MILP solver is significantly more effective than several baseline heuristics. We optimize the system runtime further with an introspective scheduling approach. We implement all these techniques into a new data system we call Saturn. Experiments with benchmark DL workloads show that Saturn achieves 39-49% lower model selection runtimes than typical current DL practice. Comment: Under submission at VLDB. Code available: https://github.com/knagrecha/saturn. 12 pages + 3 pages references + 2 pages appendix.
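    The abstract does not spell out the MILP, so the toy model below (using the PuLP library) only conveys the flavour of jointly selecting a parallelism scheme and apportioning GPUs: each job picks exactly one profiled (parallelism, GPU-count) configuration, the chosen configurations must fit a GPU budget, and the objective minimizes the summed runtime estimates. The jobs, runtime numbers, and the omission of the scheduling dimension are assumptions, not Saturn's actual formulation.

```python
# Toy MILP in the spirit of SPASE (Select a Parallelism, Allocate resources,
# SchedulE), heavily simplified: one configuration per job, a shared GPU
# budget, no time dimension. Runtimes and names are invented.
import pulp

# (job, parallelism, gpus) -> profiled runtime estimate in minutes (assumed).
est = {
    ("bert",  "data-parallel",     2): 90, ("bert",  "pipeline-parallel", 4): 55,
    ("gpt2",  "data-parallel",     2): 70, ("gpt2",  "fsdp",              4): 40,
    ("llama", "pipeline-parallel", 4): 120, ("llama", "fsdp",             8): 65,
}
jobs = {j for (j, _, _) in est}
GPU_BUDGET = 8

prob = pulp.LpProblem("spase_toy", pulp.LpMinimize)
x = {k: pulp.LpVariable(f"x{i}", cat="Binary") for i, k in enumerate(est)}

# Objective: minimize total estimated runtime of the chosen configurations.
prob += pulp.lpSum(est[k] * x[k] for k in est)

# Each job must run under exactly one (parallelism, gpus) configuration.
for j in jobs:
    prob += pulp.lpSum(x[k] for k in est if k[0] == j) == 1

# All chosen configurations must fit the GPU budget simultaneously.
prob += pulp.lpSum(k[2] * x[k] for k in est) <= GPU_BUDGET

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for k, var in x.items():
    if var.value() and var.value() > 0.5:
        print(k)    # the selected (job, parallelism, gpus) tuples
```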