Static partitioning and mapping of kernel-based applications over modern heterogeneous architectures
Heterogeneous architectures are being used extensively to improve system processing capabilities. Critical functions of each application (kernels) can be mapped to different computing devices (i.e., CPUs, GPGPUs, accelerators) to maximize performance. However, the best performance can only be achieved if kernels are accurately mapped to the right device. Moreover, in some cases those kernels can be split and executed over several devices at the same time to maximize the use of compute resources on heterogeneous parallel architectures. In this paper, we define a static partitioning model based on profiling information from previous executions. This model follows a quantitative approach that computes the optimal match according to user-defined constraints. We test different scenarios to evaluate our model: single-kernel and multi-kernel applications. Experimental results show that our static partitioning model can increase the performance of parallel applications by deploying not only different kernels over different devices but also a single kernel over multiple devices. This avoids idle compute resources on heterogeneous platforms and enhances overall performance. (C) 2015 Elsevier B.V. All rights reserved. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement n. 609666.
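The abstract does not give the partitioning formula itself, but a minimal sketch of the general idea is shown below: using throughput figures profiled in previous executions, a kernel's iteration space is split across devices in proportion to each device's rate, so that all devices are expected to finish at roughly the same time. All names and numbers (DeviceProfile, items_per_sec, the CPU/GPU rates) are hypothetical illustrations, not the paper's actual model.

#include <cstdio>
#include <string>
#include <vector>

struct DeviceProfile {
    std::string name;      // e.g. "CPU", "GPU", "accelerator"
    double items_per_sec;  // throughput profiled in earlier runs (hypothetical)
};

// Assign each device a share of the total work proportional to its throughput.
std::vector<long> partition(long total_items, const std::vector<DeviceProfile>& devs) {
    double total_rate = 0.0;
    for (const auto& d : devs) total_rate += d.items_per_sec;

    std::vector<long> share(devs.size());
    long assigned = 0;
    for (size_t i = 0; i + 1 < devs.size(); ++i) {
        share[i] = static_cast<long>(total_items * devs[i].items_per_sec / total_rate);
        assigned += share[i];
    }
    share.back() = total_items - assigned;  // remainder goes to the last device
    return share;
}

int main() {
    std::vector<DeviceProfile> devs = {{"CPU", 2.0e6}, {"GPU", 6.0e6}};
    std::vector<long> share = partition(1000000, devs);
    for (size_t i = 0; i < devs.size(); ++i)
        std::printf("%s gets %ld items\n", devs[i].name.c_str(), share[i]);
    return 0;
}

With the assumed rates above, the GPU receives 75% of the items and the CPU 25%, which is the kind of proportional split a profiling-driven static model would produce before user-defined constraints are applied.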
Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors
Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest in low-power high performance computing, this type of architecture is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications.
In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high-performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, show that our cache-aware versions of gemm with asymmetric scheduling attain important performance gains over their architecture-oblivious counterparts, while exploiting all the resources of the AMP to deliver considerable energy efficiency.
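As a rough illustration of the asymmetric-static idea (this is not the BLIS code itself), the sketch below divides the loop over gemm micro-panels between the big and LITTLE clusters in proportion to each cluster's aggregate throughput, so both clusters finish their share at about the same time. The per-core GFLOPS figures are assumed values, not measurements from the paper.

#include <cstdio>

struct Cluster {
    int cores;           // Exynos 5422: 4 Cortex-A15 and 4 Cortex-A7 cores
    double gflops_core;  // per-core gemm rate; the values below are hypothetical
};

int main() {
    Cluster big    = {4, 3.2};   // assumed Cortex-A15 rate
    Cluster little = {4, 1.1};   // assumed Cortex-A7 rate
    int n_panels = 256;          // micro-panels along one matrix dimension

    double rate_big = big.cores * big.gflops_core;
    double rate_lit = little.cores * little.gflops_core;
    int panels_big  = static_cast<int>(n_panels * rate_big / (rate_big + rate_lit));

    // Static split: each cluster works on a contiguous range of panels.
    std::printf("big cluster:    panels [0, %d)\n", panels_big);
    std::printf("LITTLE cluster: panels [%d, %d)\n", panels_big, n_panels);
    return 0;
}

A dynamic variant would instead hand out small chunks of panels to whichever cluster becomes idle, trading some scheduling overhead for robustness when the static throughput estimates are off.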
The Virtual Block Interface: A Flexible Alternative to the Conventional Virtual Memory Framework
Computers continue to diversify with respect to system designs, emerging
memory technologies, and application memory demands. Unfortunately, continually
adapting the conventional virtual memory framework to each possible system
configuration is challenging, and often results in performance loss or requires
non-trivial workarounds. To address these challenges, we propose a new virtual
memory framework, the Virtual Block Interface (VBI). We design VBI based on the
key idea that delegating memory management duties to hardware can reduce the
overheads and software complexity associated with virtual memory. VBI
introduces a set of variable-sized virtual blocks (VBs) to applications. Each
VB is a contiguous region of the globally-visible VBI address space, and an
application can allocate each semantically meaningful unit of information
(e.g., a data structure) in a separate VB. VBI decouples access protection from
memory allocation and address translation. While the OS controls which programs
have access to which VBs, dedicated hardware in the memory controller manages
the physical memory allocation and address translation of the VBs. This
approach enables several architectural optimizations to (1) efficiently and
flexibly cater to different and increasingly diverse system configurations, and
(2) eliminate key inefficiencies of conventional virtual memory. We demonstrate
the benefits of VBI with two important use cases: (1) reducing the overheads of
address translation (for both native execution and virtual machine
environments), as VBI reduces the number of translation requests and associated
memory accesses; and (2) two heterogeneous main memory architectures, where VBI
increases the effectiveness of managing fast memory regions. For both cases,
VBI significantly improves performance over conventional virtual memory.
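The sketch below is an illustrative software analogy only, not the actual VBI hardware design: it models a virtual block (VB) as a variable-sized region of a global address space whose access rights are recorded by the OS, while physical allocation and address translation would be handled by the memory controller and are not shown. The structure and field names are assumptions made for this example.

#include <cstdint>
#include <cstdio>
#include <vector>

struct VirtualBlock {
    uint64_t base;      // start address within the globally-visible VBI address space
    uint64_t size;      // variable block size, e.g. sized to one data structure
    int      owner_pid; // process permitted to access this VB (simplified to one owner)
};

// OS-side check: access protection is decoupled from allocation and translation.
bool may_access(const std::vector<VirtualBlock>& vbs, int pid, uint64_t addr) {
    for (const auto& vb : vbs)
        if (addr >= vb.base && addr < vb.base + vb.size)
            return vb.owner_pid == pid;
    return false;  // address not covered by any VB
}

int main() {
    std::vector<VirtualBlock> vbs = {{0x1000, 0x4000, 42}};  // one hypothetical VB
    std::printf("pid 42 @0x2000: %d\n", may_access(vbs, 42, 0x2000));  // allowed
    std::printf("pid  7 @0x2000: %d\n", may_access(vbs,  7, 0x2000));  // denied
    return 0;
}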