Efficient and scalable scheduling for performance heterogeneous multicore systems
Performance heterogeneous multicore processors (HMPs for brevity), which consist of multiple cores with the same instruction set but different performance characteristics (e.g., clock speed, issue width), have attracted great interest because they can deliver higher performance per watt and per unit area than comparable homogeneous processors for programs with diverse architectural requirements. However, these power and area efficiencies are achieved only when workloads are matched with cores according to both the properties of the workload and the features of the cores. Several heterogeneity-aware schedulers have been proposed in previous work. Depending on whether workload properties are obtained online, these scheduling algorithms fall into two classes: online monitoring and offline profiling. Previous online monitoring approaches must trace each thread's execution on every core type, which becomes impractical as the number of core types grows; moreover, tracing all core types requires migrating threads among cores, which can cause load imbalance and degrade performance. Existing offline profiling approaches profile programs with a given input set before actually executing them, and thus avoid overhead that grows with the number of core types. However, offline profiling does not account for threads' phase changes, and because the collected properties are tied to the given input set, such approaches adapt poorly to other input sets, which can drastically degrade program performance. To address these problems, we propose a new metric, ASTPI (Average Stall Time Per Instruction), that measures how efficiently threads use fast cores. We design, implement, and evaluate a new online monitoring approach based on this metric, called ESHMP. Our evaluation on the Linux 2.6.21 operating system shows that ESHMP delivers scalability while adapting to a wide variety of applications. Our experimental results also show that, among HMP systems that adopt heterogeneity-aware schedulers and have more than one LLC (last-level cache), architectures in which heterogeneous cores share LLCs achieve better performance than those in which homogeneous cores share LLCs.
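To make the metric concrete: ASTPI divides a thread's accumulated stall cycles by its retired instruction count over a monitoring interval. Below is a minimal sketch of how a scheduler might compute and use it; the counter fields, the sample numbers, and the rule that low-ASTPI threads get fast cores first are illustrative assumptions, not details taken from the paper.

```python
# Sketch of the ASTPI metric: average stall time per instruction, used to
# rank threads for fast-core placement on an HMP system.
from dataclasses import dataclass

@dataclass
class ThreadSample:
    tid: int
    stall_cycles: int        # cycles the pipeline was stalled this interval
    retired_instructions: int

def astpi(sample: ThreadSample) -> float:
    """Average Stall Time Per Instruction over one monitoring interval."""
    if sample.retired_instructions == 0:
        return float("inf")
    return sample.stall_cycles / sample.retired_instructions

def assign_fast_cores(samples, n_fast):
    """Assumed policy: threads that stall least per instruction are taken to
    benefit most from fast cores; the rest stay on slow cores."""
    ranked = sorted(samples, key=astpi)
    return {s.tid for s in ranked[:n_fast]}

samples = [ThreadSample(1, 4_000_000, 10_000_000),
           ThreadSample(2,   500_000, 12_000_000),
           ThreadSample(3, 9_000_000,  6_000_000)]
print(assign_fast_cores(samples, n_fast=1))  # {2}: lowest stall per instruction
```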
Thread Progress Equalization: Dynamically Adaptive Power and Performance Optimization of Multi-threaded Applications
Dynamically adaptive multi-core architectures have been proposed as an
effective solution to optimize performance for peak power constrained
processors. In such processors, the micro-architectural parameters or
voltage/frequency of each core can be changed at run-time, providing a
range of power/performance operating points for each core. In this paper, we
propose Thread Progress Equalization (TPEq), a run-time mechanism for power
constrained performance maximization of multithreaded applications running on
dynamically adaptive multicore processors. Compared to existing approaches,
TPEq (i) identifies and addresses two primary sources of inter-thread
heterogeneity in multithreaded applications, (ii) determines the optimal core
configurations in polynomial time with respect to the number of cores and
configurations, and (iii) requires no modifications in the user-level source
code. Our experimental evaluations demonstrate that TPEq outperforms
state-of-the-art run-time power/performance optimization techniques proposed in
the literature for dynamically adaptive multicores by up to 23%.
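A minimal sketch of the underlying idea, under assumptions not fixed by the abstract: per-thread progress estimates drive a search that upgrades the laggard's core configuration while a power budget holds. The configuration table, the progress fractions, and the greedy policy below are illustrative; the paper's TPEq determines optimal configurations in polynomial time.

```python
CONFIGS = [          # (power in watts, relative speed) per core configuration;
    (1.0, 1.0),      # illustrative placeholder numbers, not from the paper
    (2.0, 1.4),
    (3.5, 1.8),
]

def equalize(progress, power_budget):
    """Greedy sketch: repeatedly upgrade the configuration of the thread
    projected to finish last, until the budget would be exceeded or the
    laggard is already at the fastest configuration."""
    choice = [0] * len(progress)  # start every core at the baseline config
    while True:
        # projected time-to-finish ~ remaining work / relative speed
        lag = [(1.0 - p) / CONFIGS[c][1] for p, c in zip(progress, choice)]
        slowest = max(range(len(choice)), key=lag.__getitem__)
        if choice[slowest] + 1 == len(CONFIGS):
            break  # laggard already at the top configuration
        choice[slowest] += 1
        if sum(CONFIGS[c][0] for c in choice) > power_budget:
            choice[slowest] -= 1  # undo: upgrade does not fit the budget
            break
    return choice

# Thread 0 has made the least progress, so it receives the upgrades.
print(equalize(progress=[0.2, 0.7, 0.8], power_budget=8.0))  # -> [2, 0, 0]
```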
Achieving Efficient Realization of Kalman Filter on CGRA through Algorithm-Architecture Co-design
In this paper, we present an efficient realization of the Kalman Filter (KF)
that can achieve up to 65% of the theoretical peak performance of the underlying
architecture platform. KF is realized using the Modified Faddeeva Algorithm (MFA)
as a basic building block due to its versatility, and the REDEFINE Coarse-Grained
Reconfigurable Architecture (CGRA) is used as the experimental platform since
REDEFINE is capable of supporting realization of a set of algorithmic compute
structures at run-time on a Reconfigurable Data-path (RDP). We perform several
hardware- and software-based optimizations in the realization of KF to achieve a
116% improvement in terms of Gflops over the first realization of KF. Overall,
with the presented approach for KF, a 4-105x performance improvement in terms of
Gflops/watt over several academically and commercially available realizations
of KF is attained. In REDEFINE, we show that our implementation is scalable and
the performance attained is commensurate with the underlying hardware resources.
Comment: Accepted in ARC 201
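For readers unfamiliar with the building block: the Faddeeva algorithm reduces many matrix operations to a single block-elimination primitive. The NumPy sketch below shows that primitive in plain software form; the sign convention and the mapping of MFA onto REDEFINE's RDP are assumptions here, not details from the paper.

```python
# Eliminating the lower-left block of [[A, B], [-C, D]] leaves the Schur
# complement D + C A^{-1} B, which subsumes multiply, add, and inverse-based
# Kalman Filter operations via suitable choices of A, B, C, D.
import numpy as np

def faddeeva(A, B, C, D):
    """Return D + C @ inv(A) @ B (illustrative; a hardware realization
    would perform in-place Gaussian elimination on the block matrix)."""
    return D + C @ np.linalg.solve(A, B)

# Example: the product P @ Q falls out with A = I, B = Q, C = P, D = 0.
P = np.array([[1.0, 2.0], [3.0, 4.0]])
Q = np.array([[5.0, 6.0], [7.0, 8.0]])
print(faddeeva(np.eye(2), Q, P, np.zeros((2, 2))))  # == P @ Q
```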
Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout
We introduce a task-parallel algorithm for sparse incomplete Cholesky
factorization that utilizes a 2D sparse partitioned-block layout of a matrix.
Our factorization algorithm follows the idea of algorithms-by-blocks by using
the block layout. The algorithm-by-blocks approach induces a task graph for the
factorization. These tasks are inter-related through their data
dependences in the factorization algorithm. To process the tasks on various
manycore architectures in a portable manner, we also present a portable tasking
API that incorporates different tasking backends and device-specific features
using Kokkos, an open-source framework for manycore platforms. A
performance evaluation is presented on both Intel Sandybridge and Xeon Phi
platforms for matrices from the University of Florida sparse matrix collection
to illustrate merits of the proposed task-based factorization. Experimental
results demonstrate that our task-parallel implementation delivers about a 26.6x
speedup (geometric mean) over single-threaded incomplete Cholesky-by-blocks and
a 19.2x speedup over a serial incomplete Cholesky implementation that carries no
tasking overhead, using 56 threads on the Intel Xeon Phi processor for sparse
matrices arising from various application problems.
Comment: 25 pages
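A minimal sketch of the algorithm-by-blocks idea over a 2D partitioned-block layout, assuming dense NumPy blocks in place of the paper's sparse blocks and Kokkos tasks: each POTRF/TRSM/GEMM call below corresponds to one node in the induced task graph, and blocks missing from the sparsity pattern are simply skipped, which is what makes the factorization incomplete.

```python
import numpy as np

def chol_by_blocks(blocks, nb):
    """blocks: dict {(i, j): ndarray} holding the lower triangle (i >= j).
    Factors in place; absent blocks are dropped (incomplete factorization)."""
    for k in range(nb):
        # POTRF task: factor the diagonal block.
        blocks[k, k] = np.linalg.cholesky(blocks[k, k])
        for i in range(k + 1, nb):
            if (i, k) in blocks:
                # TRSM task: L_ik = A_ik (L_kk^T)^{-1}.
                blocks[i, k] = np.linalg.solve(blocks[k, k], blocks[i, k].T).T
        for i in range(k + 1, nb):
            for j in range(k + 1, i + 1):
                # GEMM/SYRK task: update a trailing block only if all three
                # participating blocks exist in the sparsity pattern.
                if (i, j) in blocks and (i, k) in blocks and (j, k) in blocks:
                    blocks[i, j] -= blocks[i, k] @ blocks[j, k].T
    return blocks

# Dense 2x2-block example (full pattern): matches np.linalg.cholesky.
n, b = 4, 2
A = np.random.rand(n, n); A = A @ A.T + n * np.eye(n)
blks = {(i, j): A[i*b:(i+1)*b, j*b:(j+1)*b].copy()
        for i in range(2) for j in range(2) if i >= j}
L = chol_by_blocks(blks, 2)
Lfull = np.zeros_like(A)
for (i, j), blk in L.items():
    Lfull[i*b:(i+1)*b, j*b:(j+1)*b] = blk
assert np.allclose(np.tril(Lfull), np.linalg.cholesky(A))
```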
FlashAbacus: A Self-Governing Flash-Based Accelerator for Low-Power Systems
Energy efficiency and computing flexibility are some of the primary design
constraints of heterogeneous computing. In this paper, we present FlashAbacus,
a data-processing accelerator that self-governs heterogeneous kernel executions
and data storage accesses by integrating many flash modules in lightweight
multiprocessors. The proposed accelerator can simultaneously process data from
different applications with diverse types of operational functions, and it
allows multiple kernels to directly access flash without the assistance of a
host-level file system or an I/O runtime library. We prototype FlashAbacus on a
multicore-based PCIe platform that connects to FPGA-based flash controllers
with a 20 nm node process. The evaluation results show that FlashAbacus can
improve the bandwidth of data processing by 127%, while reducing energy
consumption by 78.4%, as compared to a conventional method of heterogeneous
computing.
Comment: This paper is published at the 13th edition of EuroSys (2018).
Parallelization in Scientific Workflow Management Systems
Over the last two decades, scientific workflow management systems (SWfMS)
have emerged as a means to facilitate the design, execution, and monitoring of
reusable scientific data processing pipelines. At the same time, the amounts of
data generated in various areas of science outpaced enhancements in
computational power and storage capabilities. This is especially true for the
life sciences, where new technologies increased the sequencing throughput from
kilobytes to terabytes per day. This trend requires current SWfMS to adapt:
Native support for parallel workflow execution must be provided to increase
performance; dynamically scalable "pay-per-use" compute infrastructures have to
be integrated to diminish hardware costs; adaptive scheduling of workflows in
distributed compute environments is required to optimize resource utilization.
In this survey we give an overview of parallelization techniques for SWfMS,
both in theory and in their realization in concrete systems. We find that
current systems leave considerable room for improvement and we propose key
advancements to the landscape of SWfMS.
Comment: 24 pages, 17 figures (13 PDF, 4 PNG)
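As a concrete illustration of the survey's first requirement, native parallel workflow execution, here is a minimal wave-by-wave DAG executor; the task names, the toy pipeline, and the thread-pool strategy are illustrative assumptions, not a description of any surveyed system.

```python
from concurrent.futures import ThreadPoolExecutor

def run_workflow(tasks, deps, workers=4):
    """tasks: {name: zero-argument callable}; deps: {name: set of prerequisites}.
    Repeatedly runs every task whose prerequisites are satisfied, in parallel."""
    done = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(done) < len(tasks):
            ready = [t for t in tasks
                     if t not in done and deps.get(t, set()) <= done]
            if not ready:
                raise ValueError("workflow contains a cycle")
            # Tasks in one wave are mutually independent, so run concurrently.
            list(pool.map(lambda name: tasks[name](), ready))
            done.update(ready)

run_workflow(
    tasks={name: (lambda name=name: print("ran", name)) for name in "ABCD"},
    deps={"B": {"A"}, "C": {"A"}, "D": {"B", "C"}},
)
```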
Best-by-Simulations: A Framework for Comparing Efficiency of Reconfigurable Multicore Architectures on Workloads with Deadlines
Energy consumption is a major concern in multicore systems. Perhaps the
simplest strategy for reducing energy costs is to use only as many cores as
necessary while still being able to deliver a desired quality of service.
Motivated by earlier work on a dynamic (heterogeneous) core allocation scheme
for H.264 video decoding that reduces energy costs while delivering desired
frame rates, we formulate operationally the general problem of executing a
sequence of actions on a reconfigurable machine while meeting a corresponding
sequence of absolute deadlines, with the objective of reducing cost. Using a
transition system framework that associates costs (e.g., time, energy) with
executing an action on a particular resource configuration, we use the notion
of amortised cost to formulate, in terms of simulation relations, appropriate
notions for comparing deadline-conformant executions. We believe these notions
can provide the basis for an operational theory of optimal cost executions and
performance guarantees for approximate solutions, in particular relating the
notion of simulation from transition systems to that of competitive analysis
used for, e.g., online algorithms.
Comment: In Proceedings PLACES 2017, arXiv:1704.0241
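One plausible formalization of the amortised-cost simulation the abstract alludes to, offered as a sketch rather than the paper's actual definition: relate a state of one machine, together with a credit u ≥ 0 recording accumulated savings, to a state of the other, and require the credit to stay non-negative as actions are matched.

```latex
% Sketch only; the side conditions are assumptions, not taken from the paper.
\[
  (p, u)\,\mathcal{R}\,q \;\wedge\; p \xrightarrow{a,\,c} p'
  \;\Longrightarrow\;
  \exists\, q', c'.\;
  q \xrightarrow{a,\,c'} q'
  \;\wedge\; u + c - c' \ge 0
  \;\wedge\; (p',\, u + c - c')\,\mathcal{R}\,q'
\]
% Both executions must additionally meet the same absolute deadlines.
```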
A Graph-Partition-Based Scheduling Policy for Heterogeneous Architectures
In order to improve system performance efficiently, many systems are equipped
with multi-core and many-core processors (such as GPUs). Because of their
discrete memories, these heterogeneous architectures constitute a distributed
system within a single computer. A data-flow programming model is attractive in this
setting for its ease of expressing concurrency. Programmers only need to define
task dependencies without considering how to schedule them on the hardware.
However, mapping the resulting task graph onto hardware efficiently remains a
challenge. In this paper, we propose a graph-partition scheduling policy for
mapping data-flow workloads to heterogeneous hardware. According to our
experiments, our graph-partition-based scheduling achieves comparable
performance to conventional queue-based approaches.
Comment: Presented at DATE Friday Workshop on Heterogeneous Architectures and Design Methods for Embedded Image Systems (HIS 2015) (arXiv:1502.07241)
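A minimal sketch of the partition-then-map approach, with an assumed greedy growth heuristic standing in for a real multilevel partitioner; the example graph and the two-device mapping are illustrative.

```python
import math
from collections import deque

def partition(graph, n_parts):
    """graph: {task: set of neighbouring tasks} (undirected view of the
    data-flow edges). Greedily grows parts of roughly equal size around
    seeds, preferring neighbours so communicating tasks stay together; a
    production scheduler would use a multilevel min-cut partitioner."""
    target = math.ceil(len(graph) / n_parts)
    assigned = {}
    for part in range(n_parts):
        queue = deque(t for t in graph if t not in assigned)
        placed = 0
        while queue and placed < target:
            task = queue.popleft()
            if task in assigned:
                continue
            assigned[task] = part
            placed += 1
            queue.extendleft(n for n in graph[task] if n not in assigned)
    return assigned

def cut_edges(graph, assigned):
    """Edges that cross parts approximate device-to-device transfers."""
    return sum(1 for t in graph for n in graph[t] if assigned[t] < assigned[n])

g = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a", "d"},
     "d": {"b", "c", "e"}, "e": {"d", "f"}, "f": {"e"}}
mapping = partition(g, 2)           # e.g., part 0 -> CPU, part 1 -> GPU
print(mapping, "cut =", cut_edges(g, mapping))
```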
A Left-Looking Selected Inversion Algorithm and Task Parallelism on Shared Memory Systems
Given a sparse matrix $A$, the selected inversion algorithm is an efficient
method for computing certain selected elements of $A^{-1}$. These selected
elements correspond to all or some nonzero elements of the LU factors of $A$.
In many ways, the type of matrix updates performed in the selected inversion
algorithm is similar to that performed in the LU factorization, although the
sequence of operations is different. In the context of LU factorization, it is
known that the left-looking and right-looking algorithms exhibit different
memory access and data communication patterns, and hence different behavior on
shared memory and distributed memory parallel machines. Corresponding to
right-looking and left-looking LU factorization, the selected inversion algorithm
can be organized as a left-looking or a right-looking algorithm. The parallel
right-looking version of the algorithm has been developed in [1]. The sequence
of operations performed in this version of the selected inversion algorithm is
similar to those performed in a left-looking LU factorization algorithm. In
this paper, we describe the left-looking variant of the selected inversion
algorithm and, based on a task-parallel approach, present an efficient
implementation of the algorithm for shared-memory machines. We demonstrate that
with the task scheduling features provided by OpenMP 4.0, the left-looking
selected inversion algorithm can scale well both on the Intel Haswell multicore
architecture and on the Intel Knights Corner (KNC) manycore architecture.
Compared to the right-looking selected inversion algorithm, the left-looking
formulation facilitates pipelining of work along different branches of the
elimination tree, and can be a promising candidate for future development of
massively parallel selected inversion algorithms on heterogeneous architectures.
Comment: 9 pages, 7 figures, submitted to SuperComputing 201
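The left-looking/right-looking distinction that drives the paper can be seen in a few lines of dense Cholesky, which exhibits the same access-pattern contrast the abstract describes; this NumPy sketch is illustrative only and does not implement selected inversion or the OpenMP task version.

```python
import numpy as np

def right_looking(A):
    """After factoring column k, immediately scatter updates to the whole
    trailing submatrix (eager, write-heavy access pattern)."""
    A = A.copy(); n = len(A)
    for k in range(n):
        A[k, k] = np.sqrt(A[k, k])
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k+1:, k])
    return np.tril(A)

def left_looking(A):
    """Before factoring column k, gather all pending updates from previous
    columns into column k only (lazy, read-heavy access pattern)."""
    A = A.copy(); n = len(A)
    for k in range(n):
        A[k:, k] -= A[k:, :k] @ A[k, :k]
        A[k, k] = np.sqrt(A[k, k])
        A[k+1:, k] /= A[k, k]
    return np.tril(A)

M = np.random.rand(5, 5); M = M @ M.T + 5 * np.eye(5)
assert np.allclose(right_looking(M), left_looking(M))
```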
Adaptive-Latency DRAM (AL-DRAM)
This paper summarizes the idea of Adaptive-Latency DRAM (AL-DRAM), which was
published in HPCA 2015. The key goal of AL-DRAM is to exploit the extra margin
that is built into the DRAM timing parameters to reduce DRAM latency. The key
observation is that the timing parameters are dictated by the worst-case
temperatures and worst-case DRAM cells, both of which lead to a small amount of
charge storage and hence high access latency. One can therefore reduce latency
by adapting the timing parameters to the current operating temperature and the
current DIMM that is being accessed. Using an FPGA-based testing platform, our
work first characterizes the extra margin for 115 DRAM modules from three major
manufacturers. The experimental results demonstrate that it is possible to
reduce four of the most critical timing parameters by a minimum/maximum of
17.3%/54.8% at 55°C while maintaining reliable operation. AL-DRAM adaptively
selects between multiple different timing parameters for each DRAM module based
on its current operating condition. AL-DRAM does not require any changes to the
DRAM chip or its interface; it only requires multiple different timing
parameters to be specified and supported by the memory controller. Real system
evaluations show that AL-DRAM improves the performance of memory-intensive
workloads by an average of 14% without introducing any errors.
Comment: This is a summary of the original paper, entitled "Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case", which appears in HPCA 2015.
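A minimal sketch of the selection mechanism as summarized above: the memory controller holds several timing-parameter sets per module and picks the most aggressive one whose temperature ceiling covers the current reading. The table values below are placeholders, not measurements from the paper.

```python
TIMING_PROFILES = {
    # temperature ceiling (°C) -> (tRCD, tRAS, tWR, tRP) in ns, per module;
    # illustrative placeholder values only.
    "dimm0": [(55, (10.0, 24.0, 10.0, 10.0)),     # reduced, cool operation
              (85, (13.75, 35.0, 15.0, 13.75))],  # standard worst-case set
}

def select_timings(dimm, temperature_c):
    """Return the most aggressive profile whose ceiling covers the current
    temperature; fall back to the worst-case (last) entry."""
    for ceiling, timings in TIMING_PROFILES[dimm]:
        if temperature_c <= ceiling:
            return timings
    return TIMING_PROFILES[dimm][-1][1]

print(select_timings("dimm0", 45))   # reduced-latency set
print(select_timings("dimm0", 70))   # conservative set
```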