6 research outputs found

    Cache-conscious scheduling of streaming applications

    Get PDF
    This paper considers the problem of scheduling streaming applications on uniprocessors in order to minimize the number of cache-misses. Streaming applications are represented as a directed graph (or multigraph), where nodes are computation modules and edges are channels. When a module fires, it consumes some data-items from its input channels and produces some items on its output channels. In addition, each module may have some state (either code or data) which represents the memory locations that must be loaded into cache in order to execute the module. We consider synchronous dataflow graphs where the input and output rates of modules are known in advance and do not change during execution. We also assume that the state size of modules is known in advance. Our main contribution is to show that for a large and important class of streaming computations, cache-efficient scheduling is essentially equivalent to solving a constrained graph partitioning problem. A streaming computation from this class has a cache-efficient schedule if and only if its graph has a low-bandwidth partition of the modules into components (subgraphs) whose total state fits within the cache, where the bandwidth of the partition is the number of data items that cross intercomponent channels per data item that enters the graph. Given a good partition, we describe a runtime strategy for scheduling two classes of streaming graphs: pipelines, where the graph consists of a single directed chain, and a fairly general class of directed acyclic graphs (dags) with some additional restrictions. The runtime scheduling strategy consists of adding large external buffers at the input and output edges of each component, allowing each component to be executed many times. Partitioning enables a reduction in cache misses in two ways. First, any items that are generated on edges internal to subgraphs are never written out to memory, but remain in cache. Second, each subgraph is executed many times, allowing the state to be reused. We prove the optimality of this runtime scheduling for all pipelines and for dags that meet certain conditions on buffer-size requirements. Specifically, we show that with constant-factor memory augmentation, partitioning on these graphs guarantees the optimal number of cache misses to within a constant factor. For the pipeline case, we also prove that such a partition can be found in polynomial time. For the dags we prove optimality if a good partition is provided; the partitioning problem itself is NP-complete.National Science Foundation (U.S.) (Grant CCF-1150036)National Science Foundation (U.S.) (Grant CNS-1017058)National Science Foundation (U.S.) (Grant CCF-0937860)United States-Israel Binational Science Foundation (Grant 2010231

    Cache-Conscious Scheduling of Streaming Applications

    Get PDF
    ABSTRACT This paper considers the problem of scheduling streaming applications on uniprocessors in order to minimize the number of cache-misses. Streaming applications are represented as a directed graph (or multigraph), where nodes are computation modules and edges are channels. When a module fires, it consumes some data-items from its input channels and produces some items on its output channels. In addition, each module may have some state (either code or data) which represents the memory locations that must be loaded into cache in order to execute the module. We consider synchronous dataflow graphs where the input and output rates of modules are known in advance and do not change during execution. We also assume that the state size of modules is known in advance. Our main contribution is to show that for a large and important class of streaming computations, cache-efficient scheduling is essentially equivalent to solving a constrained graph partitioning problem. A streaming computation from this class has a cache-efficient schedule if and only if its graph has a low-bandwidth partition of the modules into components (subgraphs) whose total state fits within the cache, where the bandwidth of the partition is the number of data items that cross intercomponent channels per data item that enters the graph. Given a good partition, we describe a runtime strategy for scheduling two classes of streaming graphs: pipelines, where the graph consists of a single directed chain, and a fairly general class of directed acyclic graphs (dags) with some additional restrictions. The runtime scheduling strategy consists of adding large external buffers at the input and output edges of each component, allowing each component to be executed many times. Partitioning enables a reduction in cache misses in two ways. First, any items that are generated on edges internal to subgraphs are never written Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SPAA '12, June 25-27, 2012, Pittsburgh, Pennsylvania, USA. Copyright 2012 ACM 978-1-4503-1213 out to memory, but remain in cache. Second, each subgraph is executed many times, allowing the state to be reused. We prove the optimality of this runtime scheduling for all pipelines and for dags that meet certain conditions on buffersize requirements. Specifically, we show that with constantfactor memory augmentation, partitioning on these graphs guarantees the optimal number of cache misses to within a constant factor. For the pipeline case, we also prove that such a partition can be found in polynomial time. For the dags we prove optimality if a good partition is provided; the partitioning problem itself is NP-complete

    A Monitoring System for Runtime Adaptations of Streaming Applications

    Get PDF
    International audienceStreaming languages are adequate for expressing many applications quite naturally and have been proven to be a good approach for taking advantage of the intrinsic parallelism of modern CPU architectures. While numerous works focus on improving the throughput of streaming programs, we rather focus on satisfying quality-of-service requirements of streaming applications executed alongside non-streaming processes. We monitor synchronous dataflow (SDF) programs at runtime both at the application and system levels in order to identify violations of quality-of-service requirements. Our monitoring requires the programmer to provide the expected throughput of its application (e.g 25 frames per second for a video decoder), then takes full benefit from the compilation of the SDF graph to detect bottlenecks in this graph and identify causes among processor or memory overloading. It can then be used to perform dynamic adaptations of the applications in order to optimize the use of computing and memory resources

    Locality-Aware Concurrency Platforms

    Get PDF
    Modern computing systems from all domains are becoming increasingly more parallel. Manufacturers are taking advantage of the increasing number of available transistors by packaging more and more computing resources together on a single chip or within a single system. These platforms generally contain many levels of private and shared caches in addition to physically distributed main memory. Therefore, some memory is more expensive to access than other and high-performance software must consider memory locality as one of the first level considerations. Memory locality is often difficult for application developers to consider directly, however, since many of these NUMA affects are invisible to the application programmer and only show up in low performance. Moreover, on parallel platforms, the performance depends on both locality and load balance and these two metrics are often at odds with each other. Therefore, directly considering locality and load balance at the application level may make the application much more complex to program. In this work, we develop locality-conscious concurrency platforms for multiple different structured parallel programming models, including streaming applications, task-graphs and parallel for loops. In all of this work, the idea is to minimally disrupt the application programming model so that the application developer is either unimpacted or must only provide high-level hints to the runtime system. The runtime system then schedules the application to provide good locality of access while, at the same time also providing good load balance. In particular, we address cache locality for streaming applications through static partitioning and developed an extensible platform to execute partitioned streaming applications. For task-graphs, we extend a task-graph scheduling library to guide scheduling decisions towards better NUMA locality with the help of user-provided locality hints. CilkPlus parallel for loops utilize a randomized dynamic scheduler to distribute work which, in many loop based applications, results in poor locality at all levels of the memory hierarchy. We address this issue with a novel parallel for loop implementation that can get good cache and NUMA locality while providing support to maintain good load balance dynamically
    corecore