4,193 research outputs found
Improving Memory Hierarchy Utilisation for Stencil Computations on Multicore Machines
Although modern supercomputers are composed of multicore machines, one can
find scientists that still execute their legacy applications which were
developed to monocore cluster where memory hierarchy is dedicated to a sole
core. The main objective of this paper is to propose and evaluate an algorithm
that identify an efficient blocksize to be applied on MPI stencil computations
on multicore machines. Under the light of an extensive experimental analysis,
this work shows the benefits of identifying blocksizes that will dividing data
on the various cores and suggest a methodology that explore the memory
hierarchy available in modern machines
A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures
As multicore systems continue to gain ground in the High Performance
Computing world, linear algebra algorithms have to be reformulated or new
algorithms have to be developed in order to take advantage of the architectural
features on these new processors. Fine grain parallelism becomes a major
requirement and introduces the necessity of loose synchronization in the
parallel execution of an operation. This paper presents an algorithm for the
Cholesky, LU and QR factorization where the operations can be represented as a
sequence of small tasks that operate on square blocks of data. These tasks can
be dynamically scheduled for execution based on the dependencies among them and
on the availability of computational resources. This may result in an out of
order execution of the tasks which will completely hide the presence of
intrinsically sequential tasks in the factorization. Performance comparisons
are presented with the LAPACK algorithms where parallelism can only be
exploited at the level of the BLAS operations and vendor implementations
PowerPack: Energy Profiling and Analysis of High-Performance Systems and Applications
Energy efficiency is a major concern in modern high-performance computing system design. In the past few years, there has been mounting evidence that power usage limits system scale and computing density, and thus, ultimately system performance. However, despite the impact of power and energy on the computer systems community, few studies provide insight to where and how power is consumed on high-performance systems and applications. In previous work, we designed a framework called PowerPack that was the first tool to isolate the power consumption of devices including disks, memory, NICs, and processors in a high-performance cluster and correlate these measurements to application functions. In this work, we extend our framework to support systems with multicore, multiprocessor-based nodes, and then provide in-depth analyses of the energy consumption of parallel applications on clusters of these systems. These analyses include the impacts of chip multiprocessing on power and energy efficiency, and its interaction with application executions. In addition, we use PowerPack to study the power dynamics and energy efficiencies of dynamic voltage and frequency scaling (DVFS) techniques on clusters. Our experiments reveal conclusively how intelligent DVFS scheduling can enhance system energy efficiency while maintaining performance
Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors
Asymmetric multicore processors (AMPs) have recently emerged as an appealing
technology for severely energy-constrained environments, especially in mobile
appliances where heterogeneity in applications is mainstream. In addition,
given the growing interest for low-power high performance computing, this type
of architectures is also being investigated as a means to improve the
throughput-per-Watt of complex scientific applications.
In this paper, we design and embed several architecture-aware optimizations
into a multi-threaded general matrix multiplication (gemm), a key operation of
the BLAS, in order to obtain a high performance implementation for ARM
big.LITTLE AMPs. Our solution is based on the reference implementation of gemm
in the BLIS library, and integrates a cache-aware configuration as well as
asymmetric--static and dynamic scheduling strategies that carefully tune and
distribute the operation's micro-kernels among the big and LITTLE cores of the
target processor. The experimental results on a Samsung Exynos 5422, a
system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the
big.LITTLE model, expose that our cache-aware versions of gemm with asymmetric
scheduling attain important gains in performance with respect to its
architecture-oblivious counterparts while exploiting all the resources of the
AMP to deliver considerable energy efficiency
Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes
The ongoing hardware evolution exhibits an escalation in the number, as well
as in the heterogeneity, of computing resources. The pressure to maintain
reasonable levels of performance and portability forces application developers
to leave the traditional programming paradigms and explore alternative
solutions. PaStiX is a parallel sparse direct solver, based on a dynamic
scheduler for modern hierarchical manycore architectures. In this paper, we
study the benefits and limits of replacing the highly specialized internal
scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and
StarPU. The tasks graph of the factorization step is made available to the two
runtimes, providing them the opportunity to process and optimize its traversal
in order to maximize the algorithm efficiency for the targeted hardware
platform. A comparative study of the performance of the PaStiX solver on top of
its native internal scheduler, PaRSEC, and StarPU frameworks, on different
execution environments, is performed. The analysis highlights that these
generic task-based runtimes achieve comparable results to the
application-optimized embedded scheduler on homogeneous platforms. Furthermore,
they are able to significantly speed up the solver on heterogeneous
environments by taking advantage of the accelerators while hiding the
complexity of their efficient manipulation from the programmer.Comment: Heterogeneity in Computing Workshop (2014
Recommended from our members
On the conditions for efficient interoperability with threads: An experience with PGAS languages using Cray communication domains
Today's high performance systems are typically built from shared memory nodes connected by a high speed network. That architecture, combined with the trend towards less memory per core, encourages programmers to use a mixture of message passing and multithreaded programming. Unfortunately, the advantages of using threads for in-node programming are hindered by their inability to efficiently communicate between nodes. In this work, we identify some of the performance problems that arise in such hybrid programming environments and characterize conditions needed to achieve high communication performance for multiple threads: addressability of targets, separability of communication paths, and full direct reachability to targets. Using the GASNet communication layer on the Cray XC30 as our experimental platform, we show how to satisfy these conditions. We also discuss how satisfying these conditions is influenced by the communication abstraction, implementation constraints, and the interconnect messaging capabilities. To evaluate these ideas, we compare the communication performance of a thread-based node runtime to a process-based runtime. Without our GASNet extensions, thread communication is significantly slower than processes - up to 21x slower. Once the implementation is modified to address each of our conditions, the two runtimes have comparable communication performance. This allows programmers to more easily mix models like OpenMP, CILK, or pthreads with a GASNet-based model like UPC, with the associated performance, convenience and interoperability advantages that come from using threads within a node. © 2014 ACM
DReAM: An approach to estimate per-Task DRAM energy in multicore systems
Accurate per-task energy estimation in multicore systems would allow performing per-task energy-aware task scheduling and energy-aware billing in data centers, among other applications. Per-task energy estimation is challenged by the interaction between tasks in shared resources, which impacts tasks’ energy consumption in uncontrolled ways. Some accurate mechanisms have been devised recently to estimate per-task energy consumed on-chip in multicores, but there is a lack of such mechanisms for DRAM memories. This article makes the case for accurate per-task DRAM energy metering in multicores, which opens new paths to energy/performance optimizations. In particular, the contributions of this article are (i) an ideal per-task energy metering model for DRAM memories; (ii) DReAM, an accurate yet low cost implementation of the ideal model (less than 5% accuracy error when 16 tasks share memory); and (iii) a comparison with standard methods (even distribution and access-count based) proving that DReAM is much more accurate than these other methods.Peer ReviewedPostprint (author's final draft
- …