366 research outputs found

    Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors

    Get PDF
    Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest for low-power high performance computing, this type of architectures is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric--static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, expose that our cache-aware versions of gemm with asymmetric scheduling attain important gains in performance with respect to its architecture-oblivious counterparts while exploiting all the resources of the AMP to deliver considerable energy efficiency

    Multi-threaded dense linear algebra libraries for low-power asymmetric multicore processors

    Full text link
    [EN] Dense linear algebra libraries, such as BLAS and LAPACK, provide a relevant collection of numerical tools for many scientific and engineering applications. While there exist high performance implementations of the BLAS (and LAPACK) functionality for many current multi-threaded architectures, the adaption of these libraries for asymmetric multicore processors (AMPs) is still pending. In this paper we address this challenge by developing an asymmetry-aware implementation of the BLAS, based on the BLIS framework, and tailored for AMPs equipped with two types of cores: fast/power-hungry versus slow/energy-efficient. For this purpose, we integrate coarse-grain and fine-grain parallelization strategies into the library routines which, respectively, dynamically distribute the workload between the two core types and statically repartition this work among the cores of the same type. Our results on an ARM (R) big.LITTLE (TM) processor embedded in the Exynos 5422 SoC, using the asymmetry-aware version of the BLAS and a plain migration of the legacy version of LAPACK, experimentally assess the benefits, limitations, and potential of this approach from the perspectives of both throughput and energy efficiency. (C) 2016 Elsevier B.V. All rights reserved.The researchers from Universidad Jaume I were supported by projects CICYT TIN2011-23283 and TIN2014-53495-R of MINECO and FEDER, and the FPU program of MECD. The researcher from Universidad Complutense de Madrid was supported by project CICYT TIN2015-65277-R. The researcher from Universitat Politecnica de Catalunya was supported by projects TIN2015-65316-P from the Spanish Ministry of Education and 2014 SGR 1051 from the Generalitat de Catalunya, Dep. dinnovacio, Universitats i Empresa.Catalán, S.; Herrero, JR.; Igual Peña, FD.; Rodríguez-Sánchez, R.; Quintana Ortí, ES.; Adeniyi-Jones, C. (2018). Multi-threaded dense linear algebra libraries for low-power asymmetric multicore processors. Journal of Computational Science. 25:140-151. https://doi.org/10.1016/j.jocs.2016.10.020S1401512

    Static scheduling of the LU factorization with look-ahead on asymmetric multicore processors

    Get PDF
    [EN] We analyze the benefits of look-ahead in the parallel execution of the LU factorization with partial pivoting (LUpp) in two distinct "asymmetric" multicore scenarios. The first one corresponds to an actual hardware-asymmetric architecture such as the Samsung Exynos 5422 system-on-chip (SoC), equipped with an ARM big.LITTLE processor consisting of a quad core Cortex-A15 cluster plus a quad-core Cortex-A7 cluster. For this scenario, we propose a careful mapping of the different types of tasks appearing in LUpp to the computational resources, in order to produce an efficient architecture-aware exploitation of the computational resources integrated in this SoC. The second asymmetric configuration appears in a hardware-symmetric multicore architecture where the cores can individually operate at a different frequency levels. In this scenario, we show how to employ the frequency slack to accelerate the tasks in the critical path of LUpp in order to produce a faster global execution as well as a lower energy consumption. (C) 2018 Elsevier B.V. All rights reserved.The researchers from Universidad Jaume I were supported by projects TIN2014-53495-R and TIN2017-82972-R of MINECO and FEDER, and the FPU program of MECD. The researcher from Universitat Politecnica de Catalunya was supported by projects TIN2015-65316-P of MINECO and FEDER and 2017-SGR-1414 from the Generalitat de Catalunya.Catalán, S.; Herrero, JR.; Quintana Ortí, ES.; Rodríguez-Sánchez, R. (2018). Static scheduling of the LU factorization with look-ahead on asymmetric multicore processors. Parallel Computing. 76:18-27. https://doi.org/10.1016/j.parco.2018.04.006S18277

    Vector support for multicore processors with major emphasis on configurable multiprocessors

    Get PDF
    It recently became increasingly difficult to build higher speed uniprocessor chips because of performance degradation and high power consumption. The quadratically increasing circuit complexity forbade the exploration of more instruction-level parallelism (JLP). To continue raising the performance, processor designers then focused on thread-level parallelism (TLP) to realize a new architecture design paradigm. Multicore processor design is the result of this trend. It has proven quite capable in performance increase and provides new opportunities in power management and system scalability. But current multicore processors do not provide powerful vector architecture support which could yield significant speedups for array operations while maintaining arealpower efficiency. This dissertation proposes and presents the realization of an FPGA-based prototype of a multicore architecture with a shared vector unit (MCwSV). FPGA stands for Filed-Programmable Gate Array. The idea is that rather than improving only scalar or TLP performance, some hardware budget could be used to realize a vector unit to greatly speedup applications abundant in data-level parallelism (DLP). To be realistic, limited by the parallelism in the application itself and by the compiler\u27s vectorizing abilities, most of the general-purpose programs can only be partially vectorized. Thus, for efficient resource usage, one vector unit should be shared by several scalar processors. This approach could also keep the overall budget within acceptable limits. We suggest that this type of vector-unit sharing be established in future multicore chips. The design, implementation and evaluation of an MCwSV system with two scalar processors and a shared vector unit are presented for FPGA prototyping. The MicroBlaze processor, which is a commercial IP (Intellectual Property) core from Xilinx, is used as the scalar processor; in the experiments the vector unit is connected to a pair of MicroBlaze processors through standard bus interfaces. The overall system is organized in a decoupled and multi-banked structure. This organization provides substantial system scalability and better vector performance. For a given area budget, benchmarks from several areas show that the MCwSV system can provide significant performance increase as compared to a multicore system without a vector unit. However, a MCwSV system with two MicroBlazes and a shared vector unit is not always an optimized system configuration for various applications with different percentages of vectorization. On the other hand, the MCwSV framework was designed for easy scalability to potentially incorporate various numbers of scalar/vector units and various function units. Also, the flexibility inherent to FPGAs can aid the task of matching target applications. These benefits can be taken into account to create optimized MCwSV systems for various applications. So the work eventually focused on building an architecture design framework incorporating performance and resource management for application-specific MCwSV (AS-MCwSV) systems. For embedded system design, resource usage, power consumption and execution latency are three metrics to be used in design tradeoffs. The product of these metrics is used here to choose the MCwSV system with the smallest value

    ERASE: Energy Efficient Task Mapping and Resource Management for Work Stealing Runtimes

    Get PDF
    Parallel applications often rely on work stealing schedulers in combination with fine-grained tasking to achieve high performance and scalability. However, reducing the total energy consumption in the context of work stealing runtimes is still challenging, particularly when using asymmetric architectures with different types of CPU cores. A common approach for energy savings involves dynamic voltage and frequency scaling (DVFS) wherein throttling is carried out based on factors like task parallelism, stealing relations, and task criticality. This article makes the following observations: (i) leveraging DVFS on a per-task basis is impractical when using fine-grained tasking and in environments with cluster/chip-level DVFS; (ii) task moldability, wherein a single task can execute on multiple threads/cores via work-sharing, can help to reduce energy consumption; and (iii) mismatch between tasks and assigned resources (i.e., core type and number of cores) can detrimentally impact energy consumption. In this article, we propose EneRgy Aware SchedulEr (ERASE), an intra-application task scheduler on top of work stealing runtimes that aims to reduce the total energy consumption of parallel applications. It achieves energy savings by guiding scheduling decisions based on per-task energy consumption predictions of different resource configurations. In addition, ERASE is capable of adapting to both given static frequency settings and externally controlled DVFS. Overall, ERASE achieves up to 31% energy savings and improves performance by 44% on average, compared to the state-of-the-art DVFS-based schedulers
    • …
    corecore