15 research outputs found

    Compatible phase co-scheduling on a CMP of multi-threaded processors

    Full text link

    Extending Scojo-PECT by migration based on system-level checkpointing

    Get PDF
    In recent years, a significant amount of research has been done on job scheduling in high performance computing area. Parallel jobs have different running time and require a different number of processors, thus jobs need to be scheduled and packed to improve system utilization. Scojo-PECT is a job scheduler which provides service guarantees by using coarse-grain time sharing. However, Scojo-PECT does not provide process migration. We extend the Scojo-PECT by migrating parallel jobs based on system-level checkpointing. We investigate different cases in the Scojo-PECT scheduling algorithm where migration based on system-level checkpointing can be used to improve resource utilization and reduce job response time. Our experimental results show reduction of relative response times on medium jobs over the results of the original Scojo-PECT scheduler and the long jobs do not suffer any disadvantage

    Quality Time: A simple online technique for quantifying multicore execution efficiency

    Full text link
    Abstract—In order to increase utilization, multicore pro-cessors share memory resources among an increasing number of cores. This sharing leads to memory interference, which in turn leads to a non-uniform degradation in the execution of concurrent applications, even in the presence of fairness mechanisms. Many utilities rely on application CPU Time both for measuring resource usage and inferring application progress. These utilities are therefore directly affected by the distorting effects of multicore interference on the representativeness of CPU Time as a proxy for progress. This makes reasoning about myriad properties from fairness, to QoS, to throughput optimality very difficult in consolidated environments, such as IaaS. We introduce the notion of Quality Time, which provides a measure of application progress analogous to CPU Time’s measure of resource usage, and we propose a simple online sampling-based technique to approximate Quality Time with high accuracy. We have implemented three user-space tools called Qtime, Qtop, and Qplacer. Qtime can attach to an application to calculate its Quality Time online, Qtop is a dashboard that monitors the Quality Times of all applications on the system, and Qplacer leverages Quality Time information to find better application placements and improve overall system quality. With Quality Time, we are able to reduce the error in inferring execution efficiency from 150.3 % to 25.1 % in the worst case and from 30.0 % to 7.5 % on average. Qplacer can increase average system throughput by 3.2 % when compared to static application placement. I

    G-LOMARC-TS: Lookahead group matchmaking for time/space sharing on multi-core parallel machines

    Get PDF
    Parallel machines with multi-core nodes are becoming increasingly popular. The performances of applications running on these machines are improved gradually due to the resource competition in each node. Researches have found that coscheduling different applications with complementary resource characteristics on the same set of nodes (semi time sharing) may improve the performance. We propose a scheduling algorithm G-LOMARC-TS which incorporates both space and semi time sharing scheduling methods and matches groups of jobs if possible for coscheduling. Since matchmaking may select jobs further down the waiting queue and the jobs in front of the queue may be delayed subsequently, fairness for each individual job will be watched and the delay will be kept within a limited bound. Several heuristics are used to solve the NP-complete problem of forming groups. Our experiment results show both utilization gain and average relative response time improvements of G-LOMARC-TS over other several scheduling policies

    Compatible phase co-scheduling on a CMP of multi-threaded processors

    No full text
    The industry is rapidly moving towards the adoption of Chip Multi-Processors (CMPs) of Simultaneous Multi-Threaded (SMT) cores for general purpose systems. The most prominent use of such processors, at least in the near term, will be as job servers running multiple independent threads on the different contexts of the various SMT cores. In such an environment, the co-scheduling of phases from different threads plays a significant role in the overall throughput. Less throughput is achieved when phases from different threads that conflict for particular hardware resources are scheduled together, compared with the situation where compatible phases are co-scheduled on the same SMT core. Achieving the latter requires precise per-phase hardware statistics that the scheduler can use to rapidly identify possible incompatibilities among phases of different threads, thereby avoiding the potentially high performance cost of inter-thread contention. In this paper, we devise phase co-scheduling policies for a dual-core CMP of dual-threaded SMT processors. We explore a number of approaches and find that the use of ready and in-flight instruction metrics permits effective coscheduling of compatible phases among the four contexts. This approach significantly outperforms the worst static grouping of threads, and very closely matches the best static grouping, even outperforming it by as much as 7%

    An input centric paradigm for program dynamic optimizations and lifetime evolvement

    Get PDF
    Accurately predicting program behaviors (e.g., memory locality, method calling frequency) is fundamental for program optimizations and runtime adaptations. Despite decades of remarkable progress, prior studies have not systematically exploited the use of program inputs, a deciding factor of program behaviors, to help in program dynamic optimizations. Triggered by the strong and predictive correlations between program inputs and program behaviors that recent studies have uncovered, the dissertation work aims to bring program inputs into the focus of program behavior analysis and program dynamic optimization, cultivating a new paradigm named input-centric program behavior analysis and dynamic optimization.;The new optimization paradigm consists of three components, forming a three-layer pyramid. at the base is program input characterization, a component for resolving the complexity in program raw inputs and extracting important features. In the middle is input-behavior modeling, a component for recognizing and modeling the correlations between characterized input features and program behaviors. These two components constitute input-centric program behavior analysis, which (ideally) is able to predict the large-scope behaviors of a program\u27s execution as soon as the execution starts. The top layer is input-centric adaptation, which capitalizes on the novel opportunities created by the first two components to facilitate proactive adaptation for program optimizations.;This dissertation aims to develop this paradigm in two stages. In the first stage, we concentrate on exploring the implications of program inputs for program behaviors and dynamic optimization. We construct the basic input-centric optimization framework based on of line training to realize the basic functionalities of the three major components of the paradigm. For the second stage, we focus on making the paradigm practical by addressing multi-facet issues in handling input complexities, transparent training data collection, predictive model evolvement across production runs. The techniques proposed in this stage together cultivate a lifelong continuous optimization scheme with cross-input adaptivity.;Fundamentally the new optimization paradigm provides a brand new solution for program dynamic optimization. The techniques proposed in the dissertation together resolve the adaptivity-proactivity dilemma that has been limiting the effectiveness of existing optimization techniques. its benefits are demonstrated through proactive dynamic optimizations in Jikes RVM and version selection using IBM XL C Compiler, yielding significant performance improvement on a set of Java and C/C++ programs. It may open new opportunities for a broad range of runtime optimizations and adaptations. The evaluation results on both Java and C/C++ applications demonstrate the new paradigm is promising in advancing the current state of program optimizations

    Analysis and Approximation of Optimal Co-Scheduling on CMP

    Get PDF
    In recent years, the increasing design complexity and the problems of power and heat dissipation have caused a shift in processor technology to favor Chip Multiprocessors. In Chip Multiprocessors (CMP) architecture, it is common that multiple cores share some on-chip cache. The sharing may cause cache thrashing and contention among co-running jobs. Job co-scheduling is an approach to tackling the problem by assigning jobs to cores appropriately so that the contention and consequent performance degradations are minimized. This dissertation aims to tackle two of the most prominent challenges in job co-scheduling.;The first challenge is in the computational complexity for determining optimal job co-schedules. This dissertation presents one of the first systematic analyses on the complexity of job co-scheduling. Besides proving the NP completeness of job co-scheduling, it introduces a set of algorithms, based on graph theory and Integer/Linear Programming, for computing optimal co-schedules or their lower bounds in scenarios with or without job migrations. For complex cases, it empirically demonstrates the feasibility for approximating the optimal schedules effectively by proposing several heuristics-based algorithms. These discoveries facilitate the assessment of job co-schedulers by providing necessary baselines, and shed insights to the development of practical co-scheduling systems.;The second challenge resides in the prediction of the performance of processes co-running on a shared cache. This dissertation explores the influence on co-run performance prediction imposed by co-runners, program inputs, and cache configurations. Through a sequence of formal analysis, we derive an analytical co-run locality model, uncovering the inherent statistical connections between the data references of programs single-runs and their co-run locality. The model offers theoretical insights on co-run locality analysis and leads to a lightweight approach for fast prediction of shared cache performance. We demonstrate the effectiveness of the model in enabling proactive job co-scheduling.;Together, the two-dimensional findings open up many new opportunities for cache management on modern CMP by laying the foundation for job co-scheduling, and enhancing the understanding to data locality and cache sharing significantly

    Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU

    Get PDF
    Enhancing the match between software executions and hardware features is key to computing efficiency. The match is a continuously evolving and challenging problem. This dissertation focuses on the development of programming system support for exploiting two key features of modern hardware development: the massive parallelism of emerging computational accelerators such as Graphic Processing Units (GPU), and the non-uniformity of cache sharing in modern multicore processors. They are respectively driven by the important role of accelerators in today\u27s general-purpose computing and the ultimate importance of memory performance. This dissertation particularly concentrates on optimizing control flows and memory references, at both compilation and execution time, to tap into the full potential of pure software solutions in taking advantage of the two key hardware features.;Conditional branches cause divergences in program control flows, which may result in serious performance degradation on massively data-parallel GPU architectures with Single Instruction Multiple Data (SIMD) parallelism. On such an architecture, control divergence may force computing units to stay idle for a substantial time, throttling system throughput by orders of magnitude. This dissertation provides an extensive exploration of the solution to this problem and presents program level transformations based upon two fundamental techniques --- thread relocation and data relocation. These two optimizations provide fundamental support for swapping jobs among threads so that the control flow paths of threads converge within every SIMD thread group.;In memory performance, this dissertation concentrates on two aspects: the influence of nonuniform sharing on multithreading applications, and the optimization of irregular memory references on GPUs. In shared cache multicore chips, interactions among threads are complicated due to the interplay of cache contention and synergistic prefetching. This dissertation presents the first systematic study on the influence of non-uniform shared cache on contemporary parallel programs, reveals the mismatch between the software development and underlying cache sharing hierarchies, and further demonstrates it by proposing and applying cache-sharing-aware data transformations that bring significant performance improvement. For the second aspect, the efficiency of GPU accelerators is sensitive to irregular memory references, which refer to the memory references whose access patterns remain unknown until execution time (e.g., A[P[i]]). The root causes of the irregular memory reference problem are similar to that of the control flow problem, while in a more general and complex form. I developed a framework, named G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. It treats both types of irregularities at the same time in a holistic fashion, maximizing the whole-program performance by resolving conflicts among optimizations
    corecore