4 research outputs found

    A power-aware, self-adaptive macro data flow framework

    The dataflow programming model has been extensively used as an effective solution for implementing efficient parallel programming frameworks. However, the amount of resources allocated to the runtime support is usually fixed once, by the programmer or by the runtime, and kept static during the entire execution. While there are cases where such a static choice may be appropriate, other scenarios may require the parallelism degree to be changed dynamically during the application's execution. In this paper we propose an algorithm for multicore shared-memory platforms that dynamically selects the optimal number of cores to be used, as well as their clock frequency, according to either the workload pressure or explicit user requirements. We implement the algorithm for both structured and unstructured parallel applications and validate our proposal on three real applications, showing that it saves a significant amount of power without impairing performance or requiring additional effort from the application programmer.
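    The abstract describes the decision such a runtime has to make at each step: how many cores to keep active, and at which frequency. As a rough illustration of that kind of control loop (a hypothetical sketch, not the paper's algorithm: the 100 ms sampling interval, the 0.2/0.8 pressure thresholds, and the callbacks queue_pressure, set_active_workers and set_freq_level are all invented here), a workload-driven adaptation loop might look as follows:

```cpp
// Hypothetical sketch of a workload-driven adaptation loop; not the
// paper's algorithm. The sampling interval, the pressure thresholds and
// the callback names are invented for illustration only.
#include <algorithm>
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

void adaptation_loop(
    int max_workers, int max_freq_level, std::atomic<bool>& stop,
    std::function<double()> queue_pressure,       // fraction of input queue in use
    std::function<void(int)> set_active_workers,  // park/unpark worker threads
    std::function<void(int)> set_freq_level) {    // e.g. a DVFS governor hook
    int workers = max_workers;
    int freq = 0;                                 // 0 = lowest frequency level
    while (!stop) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        const double pressure = queue_pressure();
        if (pressure > 0.8) {
            // Backlog building up: add a core first, raise frequency last.
            if (workers < max_workers) ++workers;
            else freq = std::min(max_freq_level, freq + 1);
        } else if (pressure < 0.2) {
            // Plenty of slack: lower frequency first, then shed cores.
            if (freq > 0) --freq;
            else workers = std::max(1, workers - 1);
        }
        set_active_workers(workers);
        set_freq_level(freq);
    }
}
```

    Adding cores before raising frequency (and doing the reverse when shedding resources) is just one possible ordering; the paper's actual policy and its handling of explicit user requirements may differ.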

    COMPROF and COMPLACE: shared-memory communication profiling and automated thread placement via dynamic binary instrumentation

    Funding: This work was generously supported by UK EPSRC Energise, grant number EP/V006290/1. This paper presents COMPROF and COMPLACE, a novel profiling tool and thread placement technique for shared-memory architectures that require no recompilation or user intervention. We use dynamic binary instrumentation to intercept memory operations and estimate inter-thread communication overhead, deriving (and possibly visualising) a communication graph of data-sharing between threads. We then use this graph to map threads to cores in order to optimise traffic through the memory system. Different paths through a system's memory hierarchy have different latency, throughput and energy properties; COMPLACE exploits this heterogeneity to provide automatic performance and energy improvements for multi-threaded programs. We demonstrate COMPLACE on the NAS Parallel Benchmark (NPB) suite where, using our technique, we achieve improvements of up to 12% in execution time and up to 10% in energy consumption (compared to default Linux scheduling) without requiring any modification or recompilation of the application code.
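    To make the placement idea concrete, the sketch below shows how a communication graph with per-pair traffic counts could be turned into a greedy thread-to-core mapping: the heaviest-communicating pairs are placed first, on neighbouring core ids. It is illustrative only; the Edge structure, place_threads, and the assumption that consecutive core ids share a cache level are inventions of this sketch, not COMPLACE's actual data structures or placement algorithm.

```cpp
// Hypothetical greedy placement sketch, not COMPLACE's algorithm.
// Edge weights would come from a profiling pass that intercepts memory
// operations (as COMPROF does); everything else here is illustrative.
// Consecutive core ids are assumed to share a cache level.
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

struct Edge { int t1, t2; std::uint64_t bytes; };  // data shared between two threads

std::map<int, int> place_threads(std::vector<Edge> edges, int num_threads) {
    // Heaviest edges first, so the busiest pairs get placed together.
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.bytes > b.bytes; });

    std::map<int, int> core_of;                    // thread id -> core id
    int next_core = 0;
    auto assign = [&](int t) { if (!core_of.count(t)) core_of[t] = next_core++; };

    for (const Edge& e : edges) {                  // place both endpoints of each edge
        assign(e.t1);
        assign(e.t2);
    }
    for (int t = 0; t < num_threads; ++t) assign(t);  // threads with no shared data
    return core_of;
}
```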

    When parallel speedups hit the memory wall

    After Amdahl's trailblazing work, many other authors proposed analytical speedup models, but none considered the limiting effect of the memory wall. These models exploited aspects such as problem-size variation, memory size, communication overhead, and synchronization overhead, but assumed data-access delays to be constant. Nevertheless, such delays can vary, for example, with the number of cores used and the ratio between processor and memory frequencies. Given the large number of configurations of operating frequency and number of cores that current architectures offer, speedup models able to describe the variation across these configurations are highly desirable for off-line or on-line scheduling decisions. This work proposes new parallel speedup models that account for variations of the average data-access delay in order to describe the limiting effect of the memory wall on parallel speedups. Analytical results indicate that the proposed models capture the desired behavior, and experiments on real hardware validate them. Additionally, we show that when accounting for parameters that reflect the intrinsic characteristics of the applications, such as degree of parallelism and susceptibility to the memory wall, our proposal has significant advantages over machine-learning-based modeling. Moreover, our experiments show that conventional machine-learning modeling, besides being a black-box approach, needs about one order of magnitude more measurements to reach the same level of accuracy achieved by our models.
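    An illustrative way to see the effect the abstract describes (this is not the paper's exact model) is to extend an Amdahl-style formulation with an average data-access delay d(p) that may grow with the number of cores p; here f denotes the parallel fraction and the serial execution time is normalised to 1 + d(1):

```latex
% Illustrative only; f, p and d(p) are symbols introduced for this sketch,
% not the notation of the paper.
S(p) \;=\; \frac{1 + d(1)}{(1 - f) + \dfrac{f}{p} + d(p)}
```

    If d(p) were constant, this would reduce to a shifted form of Amdahl's law; when memory contention makes d(p) grow with p, the speedup saturates (or even degrades) earlier, which is the memory-wall effect the proposed models aim to capture.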

    Simplifying self-adaptive and power-aware computing with Nornir

    Self-adaptation is an emerging requirement in parallel computing. It enables the dynamic selection of the resources to allocate to an application in order to meet performance and power consumption requirements. This is particularly relevant for Fog applications, where data is generated by a number of devices at a rate that varies with users' activity. By dynamically selecting the appropriate number of resources it is possible, for example, to use at each time step the minimum amount of resources needed to process the incoming data. Implementing such algorithms may be a complex task, due to the low-level interactions with the underlying hardware and to the need for non-intrusive, low-overhead monitoring of the applications. For these reasons, in this paper we propose NORNIR, a C++-based framework that can be used to enforce performance and power consumption constraints on parallel applications running on shared-memory multicores. The framework can be easily customized by algorithm designers to implement new self-adaptive policies. By instrumenting the applications in the PARSEC benchmark suite, we provide strategy designers with a wide set of applications already interfaced to NORNIR. In addition, to prove its flexibility, we implemented and compared several state-of-the-art policies, showing that NORNIR can also be used to easily analyze different algorithms and to provide useful insights into them.
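    To illustrate what a custom self-adaptive policy might look like, the sketch below defines a minimal policy interface and a power-cap strategy that trades cores and frequency against a power budget. This is not NORNIR's real API: the Sample, Knobs and Policy types and all thresholds are invented for the example; a runtime would call decide() once per control step with fresh monitoring data and apply the returned knob settings.

```cpp
// Hypothetical sketch of a self-adaptive policy interface. This is NOT
// NORNIR's actual API; the types, names and thresholds below are invented
// to show the kind of callback a strategy designer might implement.
#include <algorithm>

struct Sample {            // monitoring data collected per control step
    double throughput;     // e.g. tasks processed per second
    double watts;          // current power consumption
};

struct Knobs {             // resources the runtime can change
    int cores;
    int freq_level;        // index into the available DVFS levels
};

struct Policy {
    virtual Knobs decide(const Sample& s, const Knobs& current) = 0;
    virtual ~Policy() = default;
};

// Example policy: stay under a power cap, otherwise favour throughput.
struct PowerCapPolicy : Policy {
    double cap_watts;
    int max_cores, max_freq;
    PowerCapPolicy(double cap, int mc, int mf)
        : cap_watts(cap), max_cores(mc), max_freq(mf) {}

    Knobs decide(const Sample& s, const Knobs& k) override {
        Knobs next = k;
        if (s.watts > cap_watts) {
            // Over budget: drop frequency first, then shed cores.
            if (next.freq_level > 0) --next.freq_level;
            else next.cores = std::max(1, next.cores - 1);
        } else if (s.watts < 0.9 * cap_watts) {
            // Headroom available: add resources to improve throughput.
            if (next.cores < max_cores) ++next.cores;
            else next.freq_level = std::min(max_freq, next.freq_level + 1);
        }
        return next;
    }
};
```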