28 research outputs found

    Scheduling threads for constructive cache sharing on CMPs

    Get PDF
    In chip multiprocessors (CMPs), limiting the number of offchip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3 - 1.6X performance improvement relative to WS for several fine- grain parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade-off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors. Copyright 2007 ACM

    Parallel depth first vs. work stealing schedulers on CMP architectures

    Get PDF
    In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this brief announcement, we highlight our ongoing study [4] comparing the performance of two schedulers designed for fine-grained multithreaded programs: Parallel Depth First (PDF) [2], which is designed for constructive sharing, and Work Stealing (WS) [3], which takes a more traditional approach.Overview of schedulers. In PDF, processing cores are allocated ready-to-execute program tasks such that higher scheduling priority is given to those tasks the sequential program would have executed earlier. As a result, PDF tends to co-schedule threads in a way that tracks the sequential execution. Hence, the aggregate working set is (provably) not much larger than the single thread working set [1]. In WS, each processing core maintains a local work queue of readyto-execute threads. Whenever its local queue is empty, the core steals a thread from the bottom of the first non-empty queue it finds. WS is an attractive scheduling policy because when there is plenty of parallelism, stealing is quite rare. However, WS is not designed for constructive cache sharing, because the cores tend to have disjoint working sets.CMP configurations studied. We evaluated the performance of PDF and WS across a range of simulated CMP configurations. We focused on designs that have fixed-size private L1 caches and a shared L2 cache on chip. For a fixed die size (240 mm2), we varied the number of cores from 1 to 32. For a given number of cores, we used a (default) configuration based on current CMPs and realistic projections of future CMPs, as process technologies decrease from 90nm to 32nm.Summary of findings. We studied a variety of benchmark programs to show the following findings.For several application classes, PDF enables significant constructive sharing between threads, leading to better utilization of the on-chip caches and reducing off-chip traffic compared to WS. In particular, bandwidth-limited irregular programs and parallel divide-and-conquer programs present a relative speedup of 1.3-1.6X over WS, observing a 13- 41% reduction in off-chip traffic. An example is shown in Figure 1, for parallel merge sort. For each schedule, the number of L2 misses (i.e., the off-chip traffic) is shown on the left and the speed-up over running on one core is shown on the right, for 1 to 32 cores. Note that reducing the offchip traffic has the additional benefit of reducing the power consumption. Moreover, PDF's smaller working sets provide opportunities to power down segments of the cache without increasing the running time. Furthermore, when multiple programs are active concurrently, the PDF version is also less of a cache hog and its smaller working set is more likely to remain in the cache across context switches.For several other applications classes, PDF and WS have roughly the same execution times, either because there is only limited data reuse that can be exploited or because the programs are not limited by off-chip bandwidth. In the latter case, the constructive sharing PDF enables does provide the power and multiprogramming benefits discussed above.Finally, most parallel benchmarks to date, written for SMPs, use such a coarse-grained threading that they cannot exploit the constructive cache behavior inherent in PDF.We find that mechanisms to finely grain multithreaded applications are crucial to achieving good performance on CMPs

    Log-based architectures for general-purpose monitoring of deployed code

    Get PDF
    Runtime monitoring tools are invaluable for detecting various types of bugs, in both sequential and multi-threaded programs. However, these tools often slow down the monitored program by an order of magnitude or more [4], implying that the tools are ill-suited for always-on monitoring of deployed code. Fortunately, the emergence of chip multiprocessors as a dominant computing platform means that resources are available on-chip to assist in monitoring tasks. In this brief note, we advocate Log-Based Architectures (LBA) that exploit such on-chip resources in order to dramatically reduce the overhead of runtime program monitoring. Specifically, we propose adding hardware support for logging a main program's trace and delivering it to another (otherwise idle) processing core for inspection. A life-guard program running on this other core executes the desired monitoring task

    Reasoning about Programs by Exploiting The Environment

    No full text
    A method for making aspects of a computational model explicit in the formulas of a programming logic is given. The method is based on a new notion of environment---an environment augments the state transitions defined by a program's atomic actions rather than being interleaved with them. Two simple semantic principles are presented for extending a programming logic in order to reason about executions feasible in various environments

    Verification of Temporal Properties

    Full text link
    The paper presents a relatively complete deductive system for proving branching time temporal properties of reactive programs. No deductive system for verifying branching time temporal properties has been presented before. Our deductive system enjoys the following advantages. First, given a well-formed specification there is no need to translate it into a normal-form specification since the system can handle any well-formed specification. Second, given a specification to be verified, the proof rule to be applied is easily determined according to the top level operator of the specification. Third, the system reduces temporal verification to assertional reasoning rather than to temporal reasoning

    Hybrid Verification by Exploiting the Environment

    No full text
    A method for verifying hybrid systems is given. Such systems involve state components whose values are changed by continuous (physical) processes. The verification method is based on proving that only those executions that satisfy constraints imposed by an environment also satisfy the property of interest. A suitably expressive logic then allows the environment to model state components that are changed by physical processes

    A Determinizable Class of Timed Automata

    Full text link
    We introduce the class of event-recording timed automata (ERA). An event-recording automaton contains, for every event A, a clock that records the time of the last occurrence of A. The class ERA is, on one hand, expressive enough to model (finite) timed transition systems and, on the other hand, determinizable and closed under all boolean operations. As a result, the language inclusion problem is decidable for event-recording automata. We present a translation from timed transition systems to event-recording automata, which leads to an algorithm for checking if two timed transition systems have the same set of timed behaviors. We also consider event-predicting timed automata (EPA), which contain clocks that predict the time of the next occurrence of an event. The class of event-clock automata (ECA), which contain both event-recording and event-predicting clocks, is a suitable specification language for real-time properties. We provide an algorithm for checking if a timed automaton meets a specification that is given as an event-clock automaton

    Event-Clock Automata: A Determinizable Class of Timed Automata

    Get PDF
    We introduce event-recording automata. An event-recording automaton is a timed automaton that contains, for every event a, a clock that records the time of the last occurrence of a. The class of event-recording automata is, on one hand, expressive enough to model (finite) timed transition systems and, on the other hand, determinizable and closed under all boolean operations. As a result, the language inclusion problem is decidable for event-recording automata. We present a translation from timed transition systems to event-recording automata, which leads to an algorithm for checking if two timed transition systems have the same set of timed behaviors. We also consider event-predicting automata, which contain clocks that predict the time of the next occurrence of an event. The class of event-clock automata, which contain both event-recording and event-predicting clocks, is a suitable specification language for real-time properties. We provide an algorithm for checking if a timed automa..
    corecore