
    Guiding the Optimization of Parallel Codes on Multicores Using an Analytical Cache Model

    Final accepted version of: https://doi.org/10.1007/978-3-319-93713-7_32. This is a post-peer-review, pre-copyedit version of an article published in Lecture Notes in Computer Science (ICCS 2018 proceedings). The final authenticated version is available online at: http://dx.doi.org/10.1007/978-3-319-93713-7_32

    [Abstract]: Cache performance is particularly hard to predict in modern multicore processors, as several threads can be in execution concurrently and private cache levels are combined with shared ones. This paper presents an analytical model that can evaluate the cache performance of the whole cache hierarchy for parallel applications in less than one second, taking as input their source code and the cache configuration. While the model does not tackle some advanced hardware features, it can help optimizers make reasonably good decisions in a very short time. This is supported by an evaluation based on two modern architectures and three different case studies, in which the model predictions differ on average by just 5.05% from the results of a detailed hardware simulator and correctly guide different optimization decisions.

    This research was supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds (80%) of the EU (TIN2016-75845-P), and by the Government of Galicia (Xunta de Galicia), co-funded by the European Regional Development Fund (ERDF), under the Consolidation Programme of Competitive Reference Groups (ED431C 2017/04) as well as under the Centro Singular de Investigación de Galicia accreditation 2016-2019 (ED431G/01). We also acknowledge the Centro de Supercomputación de Galicia (CESGA) for the use of their computers.
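    The abstract does not give the model's equations, so as a rough illustration of what an analytical cache estimate can look like, the C++ sketch below computes the compulsory (cold) misses of a single strided sweep over an array from the element size, stride, and cache line size. The function name coldMisses and the formula are assumptions for illustration only; this is not the model proposed in the paper.

        #include <cstddef>

        // Toy analytical estimate of compulsory (cold) misses for one strided
        // sweep over an array, assuming no data reuse across sweeps.
        // Illustration only; not the model described in the paper above.
        std::size_t coldMisses(std::size_t numElems, std::size_t elemSize,
                               std::size_t stride, std::size_t lineSize) {
            // Number of accesses performed by the sweep (one every `stride` elements).
            std::size_t accesses = (numElems + stride - 1) / stride;
            std::size_t bytesPerStep = stride * elemSize;
            if (bytesPerStep >= lineSize) {
                return accesses;  // every access touches a new cache line
            }
            // Several consecutive accesses share a line.
            std::size_t accessesPerLine = lineSize / bytesPerStep;
            return (accesses + accessesPerLine - 1) / accessesPerLine;
        }

    For example, a sequential sweep (stride 1) over 1024 doubles with 64-byte lines yields 128 misses, i.e. one per line touched.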

    Exact Sparse Matrix-Vector Multiplication on GPU's and Multicore Architectures

    We propose different implementations of the sparse matrix-dense vector multiplication (SpMV) for finite fields and rings Z/mZ. We take advantage of graphics card processors (GPUs) and multi-core architectures. Our aim is to improve the speed of SpMV in the LinBox library, and hence the speed of its black-box algorithms. In addition, we use this together with a new parallelization of the sigma-basis algorithm in a parallel block Wiedemann rank implementation over finite fields.
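    As a rough sketch of the kind of kernel the abstract refers to, the C++ fragment below performs SpMV over Z/mZ on a matrix in CSR format. The CSRMatrix layout, the names, and the assumption that m fits in 32 bits (so that every product fits in 64 bits) are ours for illustration; this is not the LinBox or GPU implementation described in the paper.

        #include <cstdint>
        #include <cstddef>
        #include <vector>

        // CSR sparse matrix with entries already reduced modulo m.
        // Layout and names are illustrative, not LinBox's.
        struct CSRMatrix {
            std::vector<std::uint64_t> values;  // nonzero entries, each < m
            std::vector<std::size_t>   colIdx;  // column of each nonzero
            std::vector<std::size_t>   rowPtr;  // row i spans [rowPtr[i], rowPtr[i+1]); size = rows + 1
        };

        // y = A * x over Z/mZ, assuming m < 2^32 so every product fits in 64 bits.
        std::vector<std::uint64_t> spmvMod(const CSRMatrix& A,
                                           const std::vector<std::uint64_t>& x,
                                           std::uint64_t m) {
            std::vector<std::uint64_t> y(A.rowPtr.size() - 1, 0);
            for (std::size_t i = 0; i + 1 < A.rowPtr.size(); ++i) {
                std::uint64_t acc = 0;
                for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
                    // Accumulate and reduce modulo m at each step to stay within 64 bits.
                    acc = (acc + A.values[k] * (x[A.colIdx[k]] % m)) % m;
                }
                y[i] = acc;
            }
            return y;
        }

    Each output row depends only on its own slice of the CSR arrays, which is what makes SpMV a natural fit for GPU and multi-core parallelization: rows can be processed independently by different threads.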