3,344 research outputs found

    Dynamic partitioning of loop iterations on heterogeneous PC clusters

    Loop partitioning on parallel and distributed systems has long been a critical problem, and it becomes even more difficult to handle in emerging heterogeneous PC cluster environments. In the past, several loop self-scheduling schemes have been proposed for heterogeneous cluster environments. In this paper, we propose a performance-based approach that partitions loop iterations according to the performance ratio of the cluster nodes. To verify the proposed approach, a heterogeneous cluster is built, and three types of application programs are implemented and executed on this testbed. Experimental results show that the proposed approach performs better than traditional schemes.
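
    The abstract does not give the partitioning rule itself, but the core idea, assigning loop iterations to nodes in proportion to a measured performance ratio, can be sketched as follows. The node names, ratios, and iteration count are illustrative assumptions, not values from the paper.

    /* Minimal sketch of performance-based loop partitioning: iterations are
     * divided among cluster nodes in proportion to a measured performance
     * ratio (e.g. normalized kernel timings). All values are hypothetical. */
    #include <stdio.h>

    #define NUM_NODES 4

    int main(void) {
        double ratio[NUM_NODES] = { 1.0, 1.8, 2.5, 0.7 };  /* assumed node speeds */
        const char *node[NUM_NODES] = { "node0", "node1", "node2", "node3" };
        long total_iters = 100000;   /* size of the parallel loop */

        double sum = 0.0;
        for (int i = 0; i < NUM_NODES; i++)
            sum += ratio[i];

        long assigned = 0;
        for (int i = 0; i < NUM_NODES; i++) {
            /* The last node picks up rounding leftovers so every iteration is covered. */
            long chunk = (i == NUM_NODES - 1)
                           ? total_iters - assigned
                           : (long)(total_iters * ratio[i] / sum);
            printf("%s: iterations %ld..%ld (%ld)\n",
                   node[i], assigned, assigned + chunk - 1, chunk);
            assigned += chunk;
        }
        return 0;
    }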

    Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors

    Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest in low-power high-performance computing, this type of architecture is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high-performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. Experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, show that our cache-aware versions of gemm with asymmetric scheduling attain significant performance gains over their architecture-oblivious counterparts while exploiting all the resources of the AMP to deliver considerable energy efficiency.
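
    As a rough illustration of the asymmetric-static idea, the sketch below splits the column dimension of a blocked gemm between the big and LITTLE clusters of an Exynos 5422 in proportion to an assumed per-core speed ratio, then shares each cluster's portion evenly among its cores. The speed ratio is a placeholder; the paper's BLIS-based implementation operates at the micro-kernel level and is considerably more refined.

    /* Minimal sketch of an asymmetric-static partition for a blocked GEMM loop
     * on a big.LITTLE SoC. The per-core speed ratio is an assumption, not a
     * measured value from the paper. */
    #include <stdio.h>

    int main(void) {
        int n = 4096;              /* columns of C to distribute             */
        int big_cores = 4;         /* Cortex-A15 cores on the Exynos 5422    */
        int little_cores = 4;      /* Cortex-A7 cores                        */
        double speed_ratio = 3.0;  /* assumed big:LITTLE per-core throughput */

        /* Weight each cluster by core count times relative per-core speed. */
        double weight_big    = big_cores * speed_ratio;
        double weight_little = little_cores * 1.0;

        int n_big    = (int)(n * weight_big / (weight_big + weight_little));
        int n_little = n - n_big;   /* remainder goes to the LITTLE cluster */

        printf("big cluster   : %d columns (%d per core)\n",
               n_big, n_big / big_cores);
        printf("LITTLE cluster: %d columns (%d per core)\n",
               n_little, n_little / little_cores);
        return 0;
    }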

    Concurrent Design of Embedded Control Software

    Embedded software design for mechatronic systems is becoming an increasingly time-consuming and error-prone task. To cope with the heterogeneity and complexity, a systematic model-driven design approach is needed, in which several parts of the system can be designed concurrently. There is, however, a trade-off between concurrency efficiency and integration efficiency. In this paper, we present a case study on the development of the embedded control software for a real-world mechatronic system, in order to evaluate how concurrently and largely independently designed embedded software parts can be integrated efficiently. The case study was executed using our embedded control system design methodology, which employs a systematic model-based approach that supports a concurrent design process while still allowing a fast integration phase through automatic code synthesis. The result was a predictable, concurrently designed embedded software realization with a short integration time.