5 research outputs found

    SANComSim: A scalable, adaptive and non-intrusive framework to optimize performance in computational science applications

    Parallel processing has become the most common approach for developing and executing scientific computing applications, and obtaining good performance requires exploiting parallelism in both processing and communication. Although the study of computational performance has historically focused on CPU power, the CPU is no longer the only factor in overall performance. Because of the way parallel applications are designed, communication networks play a very important role in computational science. Even though the networks used in multicore clusters are fast and have low latency, the volume of transferred data can create a bottleneck in the communication system, since communication-intensive parallel applications spend a significant fraction of their total execution time exchanging data between processes. Moreover, in most cases several users execute different parallel applications on the cluster at the same time. In this paper we present SANComSim, a Scalable, Adaptive and Non-intrusive framework, based on simulation techniques, for optimizing the performance of the network system when executing complex applications. Its main objective is to apply run-time compression to reduce the data sent through the network and thereby increase overall system performance. The main features of SANComSim are: adaptability, as it dynamically adapts to the current state of the system; portability, as the framework is tied to neither a specific programming language nor a platform; non-intrusiveness, as the framework is based on simulation techniques and does not require exclusive access to the entire cluster; and scalability, as any parallel application, independently of the number of processes and computing nodes, can use the framework to improve performance on cluster systems.
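
    The abstract above centres on one technique: compressing message payloads at run time so that less data crosses the network. The sketch below is not SANComSim itself, only a minimal illustration of that idea in Python, assuming mpi4py, NumPy and zlib; the send_compressed/recv_compressed helpers and the 64 KB threshold are hypothetical choices made for this example.

        # Illustrative sketch of run-time message compression (not SANComSim itself).
        # Assumes mpi4py, NumPy and zlib are available; names and the threshold are
        # hypothetical choices for this example.
        import zlib
        import numpy as np
        from mpi4py import MPI

        COMPRESS_THRESHOLD = 64 * 1024  # hypothetical cutoff, in bytes

        def send_compressed(comm, array, dest, tag=0):
            """Compress the raw buffer before sending when it is large enough."""
            raw = array.tobytes()
            if len(raw) >= COMPRESS_THRESHOLD:
                payload, compressed = zlib.compress(raw), True
            else:
                payload, compressed = raw, False
            comm.send((compressed, array.dtype.str, array.shape, payload),
                      dest=dest, tag=tag)

        def recv_compressed(comm, source, tag=0):
            """Receive and, if needed, decompress the buffer back into an array."""
            compressed, dtype, shape, payload = comm.recv(source=source, tag=tag)
            raw = zlib.decompress(payload) if compressed else payload
            return np.frombuffer(raw, dtype=dtype).reshape(shape)

        if __name__ == "__main__":
            comm = MPI.COMM_WORLD
            if comm.Get_rank() == 0:
                send_compressed(comm, np.zeros((512, 512)), dest=1)
            elif comm.Get_rank() == 1:
                data = recv_compressed(comm, source=0)

    A genuinely adaptive framework would base the compression decision on the observed state of the network and CPU load rather than on a fixed size threshold.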

    Dynamically Balanced Synchronization-Avoiding LU Factorization with Multicore and GPUs


    Multi-GPU Implementation of LU Factorization

    LU factorization is the most computationally intensive step in solving systems of linear equations. By first obtaining the LU factorization of the coefficient matrix, we can then readily solve the system using forward and backward substitution. The computational cost of LU factorization, in terms of floating-point operations, is cubic. There have been various efforts to improve the performance of LU factorization. We propose a multi-core multi-GPU hybrid LU factorization algorithm that leverages the strengths of both multiple CPUs and multiple GPUs. Our algorithm uses some of the CPU cores for panel factorization, and the rest of the CPU cores together with all the available GPUs for trailing submatrix updates. It employs both dynamic scheduling and static scheduling. Experiments show that our approach reaches 1134 Gflop/s with 4 Fermi GPU boards combined with a total of 48 AMD CPU cores. This is the first time such a level of performance has been reported in a shared-memory environment. Execution traces show that our code also achieves good load balance and high system utilization.
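
    The abstract describes a blocked LU algorithm in which some resources perform panel factorization while others apply the GEMM-heavy trailing submatrix updates. The sketch below is a sequential, unpivoted NumPy illustration of that panel/update structure, not the authors' hybrid CPU-GPU code; the block size of 64 and the omission of pivoting and of any CPU/GPU scheduling are simplifying assumptions.

        # Illustrative sketch only: sequential, unpivoted blocked LU factorization
        # showing the panel-factorization / trailing-submatrix-update split the
        # abstract describes. The actual algorithm distributes these steps across
        # CPU cores and GPUs; the block size and lack of pivoting are assumptions.
        import numpy as np

        def blocked_lu(A, block=64):
            """In-place blocked LU (unit lower L and upper U stored in A), no pivoting."""
            n = A.shape[0]
            for k in range(0, n, block):
                b = min(block, n - k)
                # Panel factorization: unblocked LU of the current column panel.
                for j in range(k, k + b):
                    A[j+1:, j] /= A[j, j]
                    A[j+1:, j+1:k+b] -= np.outer(A[j+1:, j], A[j, j+1:k+b])
                if k + b < n:
                    # Triangular solve for the U block to the right of the panel.
                    L_kk = np.tril(A[k:k+b, k:k+b], -1) + np.eye(b)
                    A[k:k+b, k+b:] = np.linalg.solve(L_kk, A[k:k+b, k+b:])
                    # Trailing submatrix update: the GEMM-heavy, GPU-friendly part.
                    A[k+b:, k+b:] -= A[k+b:, k:k+b] @ A[k:k+b, k+b:]
            return A

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            n = 256
            A = rng.standard_normal((n, n)) + n * np.eye(n)  # diagonally dominant
            LU = blocked_lu(A.copy())
            L = np.tril(LU, -1) + np.eye(n)
            U = np.triu(LU)
            print(np.allclose(L @ U, A))  # sanity check: should print True

    Once L and U are available, solving a system amounts to a forward substitution with L followed by a backward substitution with U.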
