622 research outputs found

    Exploiting cache locality at run-time

    Get PDF
    With the increasing gap between the speeds of the processor and memory system, memory access has become a major performance bottleneck in modern computer systems. Recently, Symmetric Multi-Processor (SMP) systems have emerged as a major class of high-performance platforms. Improving the memory performance of Parallel applications with dynamic memory-access patterns on Symmetric Multi-Processors (SMP) is a hard problem. The solution to this problem is critical to the successful use of the SMP systems because dynamic memory-access patterns occur in many real-world applications. This dissertation is aimed at solving this problem.;Based on a rigorous analysis of cache-locality optimization, we propose a memory-layout oriented run-time technique to exploit the cache locality of parallel loops. Our technique have been implemented in a run-time system. Using simulation and measurement, we have shown our run-time approach can achieve comparable performance with compiler optimizations for those regular applications, whose load balance and cache locality can be well optimized by tiling and other program transformations. However, our approach was shown to improve significantly the memory performance for applications with dynamic memory-access patterns. Such applications are usually hard to optimize with static compiler optimizations.;Several contributions are made in this dissertation. We present models to characterize the complexity and present a solution framework for optimizing cache locality. We present an effective estimation technique for memory-access patterns to support efficient locality optimizations and information integration. We present a memory-layout oriented run-time technique for locality optimization. We present efficient scheduling algorithms to trade off locality and load imbalance. We provide a detailed performance evaluation of the run-time technique

    MCFlow: Middleware for Mixed-Criticality Distributed Real-Time Systems

    Get PDF
    Traditional fixed-priority scheduling analysis for periodic/sporadic task sets is based on the assumption that all tasks are equally critical to the correct operation of the system. Therefore, every task has to be schedulable under the scheduling policy, and estimates of tasks\u27 worst case execution times must be conservative in case a task runs longer than is usual. To address the significant under-utilization of a system\u27s resources under normal operating conditions that can arise from these assumptions, several \emph{mixed-criticality scheduling} approaches have been proposed. However, to date there has been no quantitative comparison of system schedulability or run-time overhead for the different approaches. In this dissertation, we present what is to our knowledge the first side-by-side implementation and evaluation of those approaches, for periodic and sporadic mixed-criticality tasks on uniprocessor or distributed systems, under a mixed-criticality scheduling model that is common to all these approaches. To make a fair evaluation of mixed-criticality scheduling, we also address some previously open issues and propose modifications to improve schedulability and correctness of particular approaches. To facilitate the development and evaluation of mixed-criticality applications, we have designed and developed a distributed real-time middleware, called MCFlow, for mixed-criticality end-to-end tasks running on multi-core platforms. The research presented in this dissertation provides the following contributions to the state of the art in real-time middleware: (1) an efficient component model through which dependent subtask graphs can be configured flexibly for execution within a single core, across cores of a common host, or spanning multiple hosts; (2) support for optimizations to inter-component communication to reduce data copying without sacrificing the ability to execute subtasks in parallel; (3) a strict separation of timing and functional concerns so that they can be configured independently; (4) an event dispatching architecture that uses lock free algorithms where possible to reduce memory contention, CPU context switching, and priority inversion; and (5) empirical evaluations of MCFlow itself and of different mixed criticality scheduling approaches both with a single host and end-to-end across multiple hosts. The results of our evaluation show that in terms of basic distributed real-time behavior MCFlow performs comparably to the state of the art TAO real-time object request broker when only one core is used and outperforms TAO when multiple cores are involved. We also identify and categorize different use cases under which different mixed criticality scheduling approaches are preferable

    Effects of Communication Protocol Stack Offload on Parallel Performance in Clusters

    Get PDF
    The primary research objective of this dissertation is to demonstrate that the effects of communication protocol stack offload (CPSO) on application execution time can be attributed to the following two complementary sources. First, the application-specific computation may be executed concurrently with the asynchronous communication performed by the communication protocol stack offload engine. Second, the protocol stack processing can be accelerated or decelerated by the offload engine. These two types of performance effects can be quantified with the use of the degree of overlapping Do and degree of acceleration Daccs. The composite communication speedup metrics S_comm(Do, Daccs) can be used in order to quantify the combined effects of the protocol stack offload. This dissertation thesis is validated empirically. The degree of overlapping Do, the degree of acceleration Daccs, and the communication speedup Scomm characteristic of the system configurations under test are derived in the course of experiments performed for the system configurations of interest. It is shown that the proposed metrics adequately describe the effects of the protocol stack offload on the application execution time. Additionally, a set of analytical models of the networking subsystem of a PC-based cluster node is developed. As a result of the modeling, the metrics Do, Daccs, and Scomm are obtained. The models are evaluated as to their complexity and precision by comparing the modeling results with the measured values of Do, Daccs, and Scomm. The primary contributions of this dissertation research are as follows. First, the metric Daccs and Scomm are introduced in order to complement the Do metric in its use for evaluation of the effects of optimizations in the networking subsystem on parallel performance in clusters. The metrics are shown to adequately describe CPSO performance effects. Second, a method for assessing performance effects of CPSO scenarios on application performance is developed and presented. Third, a set of analytical models of cluster node networking subsystems with CPSO capability is developed and characterised as to their complexity and precision of the prediction of the Do and Daccs metrics

    Scheduling Techniques for Operating Systems for Medical and IoT Devices: A Review

    Get PDF
    Software and Hardware synthesis are the major subtasks in the implementation of hardware/software systems. Increasing trend is to build SoCs/NoC/Embedded System for Implantable Medical Devices (IMD) and Internet of Things (IoT) devices, which includes multiple Microprocessors and Signal Processors, allowing designing complex hardware and software systems, yet flexible with respect to the delivered performance and executed application. An important technique, which affect the macroscopic system implementation characteristics is the scheduling of hardware operations, program instructions and software processes. This paper presents a survey of the various scheduling strategies in process scheduling. Process Scheduling has to take into account the real-time constraints. Processes are characterized by their timing constraints, periodicity, precedence and data dependency, pre-emptivity, priority etc. The affect of these characteristics on scheduling decisions has been described in this paper

    Framework for simulation of fault tolerant heterogeneous multiprocessor system-on-chip

    Full text link
    Due to the ever growing requirement in high performance data computation, current Uniprocessor systems fall short of hand to meet critical real-time performance demands in (i) high throughput (ii) faster processing time (iii) low power consumption (iv) design cost and time-to-market factors and more importantly (v) fault tolerant processing. Shifting the design trend to MPSOCs is a work-around to meet these challenges. However, developing efficient fault tolerant task scheduling and mapping techniques requires optimized algorithms that consider the various scenarios in Multiprocessor environments. Several works have been done in the past few years which proposed simulation based frameworks for scheduling and mapping strategies that considered homogenous systems and error avoidance techniques. However, most of these works inadequately describe today\u27s MPSOC trend because they were focused on the network domain and didn\u27t consider heterogeneous systems with fault tolerant capabilities; In order to address these issues, this work proposes (i) a performance driven scheduling algorithm (PD SA) based on simulated annealing technique (ii) an optimized Homogenous-Workload-Distribution (HWD) Multiprocessor task mapping algorithm which considers the dynamic workload on processors and (iii) a dynamic Fault Tolerant (FT) scheduling/mapping algorithm to employ robust application processing system. The implementation was accompanied by a heterogeneous Multiprocessor system simulation framework developed in systemC/C++. The proposed framework reads user data, set the architecture, execute input task graph and finally generate performance variables. This framework alleviates previous work issues with respect to (i) architectural flexibility in number-of-processors, processor types and topology (ii) optimized scheduling and mapping strategies and (iii) fault-tolerant processing capability focusing more on the computational domain; A set of random as well as application specific STG benchmark suites were run on the simulator to evaluate and verify the performance of the proposed algorithms. The simulations were carried out for (i) scheduling policy evaluation (ii) fault tolerant evaluation (iii) topology evaluation (iv) Number of processor evaluation (v) Mapping policy evaluation and (vi) Processor Type evaluation. The results showed that PD scheduling algorithm showed marginally better performance than EDF with respect to utilization, Execution-Time and Power factors. The dynamic Fault Tolerant implementation showed to be a viable and efficient strategy to meet real-time constraints without posing significant system performance degradation. Torus topology gave better performance than Tile with respect to task completion time and power factors. Executing highly heterogeneous Tasks showed higher power consumption and execution time. Finally, increasing the number of processors showed a decrease in average Utilization but improved task completion time and power consumption; Based on the simulation results, the system designer can compare tradeoffs between a various design choices with respect to the performance requirement specifications. In general, designing an optimized Multiprocessor scheduling and mapping strategy with added fault tolerant capability will enable to develop efficient Multiprocessor systems which meet future performance goal requirements. This is the substance of this work

    Cyclic executive for safety-critical Java on chip-multiprocessors

    Get PDF

    Design of testbed and emulation tools

    Get PDF
    The research summarized was concerned with the design of testbed and emulation tools suitable to assist in projecting, with reasonable accuracy, the expected performance of highly concurrent computing systems on large, complete applications. Such testbed and emulation tools are intended for the eventual use of those exploring new concurrent system architectures and organizations, either as users or as designers of such systems. While a range of alternatives was considered, a software based set of hierarchical tools was chosen to provide maximum flexibility, to ease in moving to new computers as technology improves and to take advantage of the inherent reliability and availability of commercially available computing systems

    Complex scheduling models and analyses for property-based real-time embedded systems

    Get PDF
    Modern multi core architectures and parallel applications pose a significant challenge to the worst-case centric real-time system verification and design efforts. The involved model and parameter uncertainty contest the fidelity of formal real-time analyses, which are mostly based on exact model assumptions. In this dissertation, various approaches that can accept parameter and model uncertainty are presented. In an attempt to improve predictability in worst-case centric analyses, the exploration of timing predictable protocols are examined for parallel task scheduling on multiprocessors and network-on-chip arbitration. A novel scheduling algorithm, called stationary rigid gang scheduling, for gang tasks on multiprocessors is proposed. In regard to fixed-priority wormhole-switched network-on-chips, a more restrictive family of transmission protocols called simultaneous progression switching protocols is proposed with predictability enhancing properties. Moreover, hierarchical scheduling for parallel DAG tasks under parameter uncertainty is studied to achieve temporal- and spatial isolation. Fault-tolerance as a supplementary reliability aspect of real-time systems is examined, in spite of dynamic external causes of fault. Using various job variants, which trade off increased execution time demand with increased error protection, a state-based policy selection strategy is proposed, which provably assures an acceptable quality-of-service (QoS). Lastly, the temporal misalignment of sensor data in sensor fusion applications in cyber-physical systems is examined. A modular analysis based on minimal properties to obtain an upper-bound for the maximal sensor data time-stamp difference is proposed
    • …
    corecore