1,453 research outputs found

    Analysis and Approximation of Optimal Co-Scheduling on CMP

    Get PDF
    In recent years, the increasing design complexity and the problems of power and heat dissipation have caused a shift in processor technology to favor Chip Multiprocessors. In Chip Multiprocessors (CMP) architecture, it is common that multiple cores share some on-chip cache. The sharing may cause cache thrashing and contention among co-running jobs. Job co-scheduling is an approach to tackling the problem by assigning jobs to cores appropriately so that the contention and consequent performance degradations are minimized. This dissertation aims to tackle two of the most prominent challenges in job co-scheduling.;The first challenge is in the computational complexity for determining optimal job co-schedules. This dissertation presents one of the first systematic analyses on the complexity of job co-scheduling. Besides proving the NP completeness of job co-scheduling, it introduces a set of algorithms, based on graph theory and Integer/Linear Programming, for computing optimal co-schedules or their lower bounds in scenarios with or without job migrations. For complex cases, it empirically demonstrates the feasibility for approximating the optimal schedules effectively by proposing several heuristics-based algorithms. These discoveries facilitate the assessment of job co-schedulers by providing necessary baselines, and shed insights to the development of practical co-scheduling systems.;The second challenge resides in the prediction of the performance of processes co-running on a shared cache. This dissertation explores the influence on co-run performance prediction imposed by co-runners, program inputs, and cache configurations. Through a sequence of formal analysis, we derive an analytical co-run locality model, uncovering the inherent statistical connections between the data references of programs single-runs and their co-run locality. The model offers theoretical insights on co-run locality analysis and leads to a lightweight approach for fast prediction of shared cache performance. We demonstrate the effectiveness of the model in enabling proactive job co-scheduling.;Together, the two-dimensional findings open up many new opportunities for cache management on modern CMP by laying the foundation for job co-scheduling, and enhancing the understanding to data locality and cache sharing significantly

    An input centric paradigm for program dynamic optimizations and lifetime evolvement

    Get PDF
    Accurately predicting program behaviors (e.g., memory locality, method calling frequency) is fundamental for program optimizations and runtime adaptations. Despite decades of remarkable progress, prior studies have not systematically exploited the use of program inputs, a deciding factor of program behaviors, to help in program dynamic optimizations. Triggered by the strong and predictive correlations between program inputs and program behaviors that recent studies have uncovered, the dissertation work aims to bring program inputs into the focus of program behavior analysis and program dynamic optimization, cultivating a new paradigm named input-centric program behavior analysis and dynamic optimization.;The new optimization paradigm consists of three components, forming a three-layer pyramid. at the base is program input characterization, a component for resolving the complexity in program raw inputs and extracting important features. In the middle is input-behavior modeling, a component for recognizing and modeling the correlations between characterized input features and program behaviors. These two components constitute input-centric program behavior analysis, which (ideally) is able to predict the large-scope behaviors of a program\u27s execution as soon as the execution starts. The top layer is input-centric adaptation, which capitalizes on the novel opportunities created by the first two components to facilitate proactive adaptation for program optimizations.;This dissertation aims to develop this paradigm in two stages. In the first stage, we concentrate on exploring the implications of program inputs for program behaviors and dynamic optimization. We construct the basic input-centric optimization framework based on of line training to realize the basic functionalities of the three major components of the paradigm. For the second stage, we focus on making the paradigm practical by addressing multi-facet issues in handling input complexities, transparent training data collection, predictive model evolvement across production runs. The techniques proposed in this stage together cultivate a lifelong continuous optimization scheme with cross-input adaptivity.;Fundamentally the new optimization paradigm provides a brand new solution for program dynamic optimization. The techniques proposed in the dissertation together resolve the adaptivity-proactivity dilemma that has been limiting the effectiveness of existing optimization techniques. its benefits are demonstrated through proactive dynamic optimizations in Jikes RVM and version selection using IBM XL C Compiler, yielding significant performance improvement on a set of Java and C/C++ programs. It may open new opportunities for a broad range of runtime optimizations and adaptations. The evaluation results on both Java and C/C++ applications demonstrate the new paradigm is promising in advancing the current state of program optimizations

    Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU

    Get PDF
    Enhancing the match between software executions and hardware features is key to computing efficiency. The match is a continuously evolving and challenging problem. This dissertation focuses on the development of programming system support for exploiting two key features of modern hardware development: the massive parallelism of emerging computational accelerators such as Graphic Processing Units (GPU), and the non-uniformity of cache sharing in modern multicore processors. They are respectively driven by the important role of accelerators in today\u27s general-purpose computing and the ultimate importance of memory performance. This dissertation particularly concentrates on optimizing control flows and memory references, at both compilation and execution time, to tap into the full potential of pure software solutions in taking advantage of the two key hardware features.;Conditional branches cause divergences in program control flows, which may result in serious performance degradation on massively data-parallel GPU architectures with Single Instruction Multiple Data (SIMD) parallelism. On such an architecture, control divergence may force computing units to stay idle for a substantial time, throttling system throughput by orders of magnitude. This dissertation provides an extensive exploration of the solution to this problem and presents program level transformations based upon two fundamental techniques --- thread relocation and data relocation. These two optimizations provide fundamental support for swapping jobs among threads so that the control flow paths of threads converge within every SIMD thread group.;In memory performance, this dissertation concentrates on two aspects: the influence of nonuniform sharing on multithreading applications, and the optimization of irregular memory references on GPUs. In shared cache multicore chips, interactions among threads are complicated due to the interplay of cache contention and synergistic prefetching. This dissertation presents the first systematic study on the influence of non-uniform shared cache on contemporary parallel programs, reveals the mismatch between the software development and underlying cache sharing hierarchies, and further demonstrates it by proposing and applying cache-sharing-aware data transformations that bring significant performance improvement. For the second aspect, the efficiency of GPU accelerators is sensitive to irregular memory references, which refer to the memory references whose access patterns remain unknown until execution time (e.g., A[P[i]]). The root causes of the irregular memory reference problem are similar to that of the control flow problem, while in a more general and complex form. I developed a framework, named G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. It treats both types of irregularities at the same time in a holistic fashion, maximizing the whole-program performance by resolving conflicts among optimizations

    Medium access control in wireless network-on-chip: a context analysis

    Get PDF
    © 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Wireless on-chip communication is a promising candidate to address the performance and efficiency issues that arise when scaling current NoC techniques to manycore processors. A WNoC can serve global and broadcast traffic with ultra-low latency even in thousand-core chips, thus acting as a natural complement to conventional and throughput-oriented wireline NoCs. However, the development of MAC strategies needed to efficiently share the wireless medium among the increasing number of cores remains a considerable challenge given the singularities of the environment and the novelty of the research area. In this position article, we present a context analysis describing the physical constraints, performance objectives, and traffic characteristics of the on-chip communication paradigm. We summarize the main differences with respect to traditional wireless scenarios, and then discuss their implications on the design of MAC protocols for manycore WNoC, with the ultimate goal of kickstarting this arguably unexplored research area.Peer ReviewedPostprint (author's final draft

    A Fuzzy Logic Reconfiguration Engine for Symmetric Chip Multiprocessors

    Get PDF
    Recent developments in reconfigurable multiprocessor system on chip (MPSoC) have offered system designers a great amount of flexibility to exploit task concurrency with higher throughput and less energy consumption. This paper presents a novel fuzzy logic reconfiguration engine (FLRE) for coarse grain MPSoC reconfiguration that facilitates to identify an optimum balance between power and performance of the system. The FLRE is composed on two levels of abstraction layers. The system selects an optimal configuration of Level 1 / Level 2 cache size and Associativity, processor operating frequency and voltage, the number of cores based on miss rate, and energy and throughput information of the system both at core and SoC level. An 8-core symmetric chip multiprocessor has been used to evaluate the proposed scheme. The results show an overall decrease of energy consumption with not more than 30% decrease in the throughput

    Developing graph-based co-scheduling algorithms on multicore computers

    Get PDF
    It is common that multiple cores reside on the same chip and share the on-chip cache. As a result, resource sharing can cause performance degradation of co-running jobs.Job co-scheduling is a technique that can effectively alleviate this contention and many co-schedulers have been reported in related literature. Most solutions however do not aim to find the optimal co-scheduling solution. Being able to determine the optimal solution is critical for evaluating co-scheduling systems. Moreover, most co-schedulers only consider serial jobs, and there often exist both parallel and serial jobs in real-world systems. In this paper a graph-based method is developed to find the optimal co-scheduling solution for serial jobs; the method is then extended to incorporate parallel jobs, including multi-process, and multithreaded parallel jobs. A number of optimization measures are also developed to accelerate the solving process. Moreover, a flexible approximation technique is proposed to strike a balance between the solving speed and the solution quality. Extensive experiments are conducted to evaluate the effectiveness of the proposed co-scheduling algorithms. The results show that the proposed algorithms can find the optimal co-scheduling solution for both serial and parallel jobs. The proposed approximation technique is also shown to be flexible in the sense that we can control the solving speed by setting the requirement for the solution quality
    • …
    corecore