1,441 research outputs found

    PRISM: an intelligent adaptation of prefetch and SMT levels

    Get PDF
    Current microprocessors include hardware to optimize some specifics workloads. In general, these hardware knobs are set on a default configuration on the booting process of the machine. This default behavior cannot be beneficial for all types of workloads and they are not controlled by anyone but the end user, who needs to know what configuration is the best one for the workload running. Some of these knobs are: (1) the Simultaneous MultiThreading level, which specifies the number of threads that can run simultaneously on a physical CPU, and (2) the data prefetch engine, that manages the prefetches on memory. Parallel programming models are here to stay, and one programming model that succeed in allowing programmers to easily parallelize applications is Open Multi Processing (OMP). Also, the architecture of microprocessors is getting more complex that end users cannot afford to optimize their workloads for all the architectural details. These architectural knobs can help to increase performance but it is needed an automatic and adaptive system managing them. In this work we propose an independent library for OpenMP runtimes to increase performance up to 220% (14.7% on average) while reducing dynamic power consumption up to 13% (2% on average) on a real POWER8 processor

    Efficient openMP over sequentially consistent distributed shared memory systems

    Get PDF
    Nowadays clusters are one of the most used platforms in High Performance Computing and most programmers use the Message Passing Interface (MPI) library to program their applications in these distributed platforms getting their maximum performance, although it is a complex task. On the other side, OpenMP has been established as the de facto standard to program applications on shared memory platforms because it is easy to use and obtains good performance without too much effort. So, could it be possible to join both worlds? Could programmers use the easiness of OpenMP in distributed platforms? A lot of researchers think so. And one of the developed ideas is the distributed shared memory (DSM), a software layer on top of a distributed platform giving an abstract shared memory view to the applications. Even though it seems a good solution it also has some inconveniences. The memory coherence between the nodes in the platform is difficult to maintain (complex management, scalability issues, high overhead and others) and the latency of the remote-memory accesses which can be orders of magnitude greater than on a shared bus due to the interconnection network. Therefore this research improves the performance of OpenMP applications being executed on distributed memory platforms using a DSM with sequential consistency evaluating thoroughly the results from the NAS parallel benchmarks. The vast majority of designed DSMs use a relaxed consistency model because it avoids some major problems in the area. In contrast, we use a sequential consistency model because we think that showing these potential problems that otherwise are hidden may allow the finding of some solutions and, therefore, apply them to both models. The main idea behind this work is that both runtimes, the OpenMP and the DSM layer, should cooperate to achieve good performance, otherwise they interfere one each other trashing the final performance of applications. We develop three different contributions to improve the performance of these applications: (a) a technique to avoid false sharing at runtime, (b) a technique to mimic the MPI behaviour, where produced data is forwarded to their consumers and, finally, (c) a mechanism to avoid the network congestion due to the DSM coherence messages. The NAS Parallel Benchmarks are used to test the contributions. The results of this work shows that the false-sharing problem is a relative problem depending on each application. Another result is the importance to move the data flow outside of the critical path and to use techniques that forwards data as early as possible, similar to MPI, benefits the final application performance. Additionally, this data movement is usually concentrated at single points and affects the application performance due to the limited bandwidth of the network. Therefore it is necessary to provide mechanisms that allows the distribution of this data through the computation time using an otherwise idle network. Finally, results shows that the proposed contributions improve the performance of OpenMP applications on this kind of environments
    • …
    corecore