2,611 research outputs found
Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors
Asymmetric multicore processors (AMPs) have recently emerged as an appealing
technology for severely energy-constrained environments, especially in mobile
appliances where heterogeneity in applications is mainstream. In addition,
given the growing interest for low-power high performance computing, this type
of architectures is also being investigated as a means to improve the
throughput-per-Watt of complex scientific applications.
In this paper, we design and embed several architecture-aware optimizations
into a multi-threaded general matrix multiplication (gemm), a key operation of
the BLAS, in order to obtain a high performance implementation for ARM
big.LITTLE AMPs. Our solution is based on the reference implementation of gemm
in the BLIS library, and integrates a cache-aware configuration as well as
asymmetric--static and dynamic scheduling strategies that carefully tune and
distribute the operation's micro-kernels among the big and LITTLE cores of the
target processor. The experimental results on a Samsung Exynos 5422, a
system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the
big.LITTLE model, expose that our cache-aware versions of gemm with asymmetric
scheduling attain important gains in performance with respect to its
architecture-oblivious counterparts while exploiting all the resources of the
AMP to deliver considerable energy efficiency
AutoParallel: A Python module for automatic parallelization and distributed execution of affine loop nests
The last improvements in programming languages, programming models, and
frameworks have focused on abstracting the users from many programming issues.
Among others, recent programming frameworks include simpler syntax, automatic
memory management and garbage collection, which simplifies code re-usage
through library packages, and easily configurable tools for deployment. For
instance, Python has risen to the top of the list of the programming languages
due to the simplicity of its syntax, while still achieving a good performance
even being an interpreted language. Moreover, the community has helped to
develop a large number of libraries and modules, tuning them to obtain great
performance.
However, there is still room for improvement when preventing users from
dealing directly with distributed and parallel computing issues. This paper
proposes and evaluates AutoParallel, a Python module to automatically find an
appropriate task-based parallelization of affine loop nests to execute them in
parallel in a distributed computing infrastructure. This parallelization can
also include the building of data blocks to increase task granularity in order
to achieve a good execution performance. Moreover, AutoParallel is based on
sequential programming and only contains a small annotation in the form of a
Python decorator so that anyone with little programming skills can scale up an
application to hundreds of cores.Comment: Accepted to the 8th Workshop on Python for High-Performance and
Scientific Computing (PyHPC 2018
A Scalable Asynchronous Distributed Algorithm for Topic Modeling
Learning meaningful topic models with massive document collections which
contain millions of documents and billions of tokens is challenging because of
two reasons: First, one needs to deal with a large number of topics (typically
in the order of thousands). Second, one needs a scalable and efficient way of
distributing the computation across multiple machines. In this paper we present
a novel algorithm F+Nomad LDA which simultaneously tackles both these problems.
In order to handle large number of topics we use an appropriately modified
Fenwick tree. This data structure allows us to sample from a multinomial
distribution over items in time. Moreover, when topic counts
change the data structure can be updated in time. In order to
distribute the computation across multiple processor we present a novel
asynchronous framework inspired by the Nomad algorithm of
\cite{YunYuHsietal13}. We show that F+Nomad LDA significantly outperform
state-of-the-art on massive problems which involve millions of documents,
billions of words, and thousands of topics
The Potential of Synergistic Static, Dynamic and Speculative Loop Nest Optimizations for Automatic Parallelization
Research in automatic parallelization of loop-centric programs started with
static analysis, then broadened its arsenal to include dynamic
inspection-execution and speculative execution, the best results involving
hybrid static-dynamic schemes. Beyond the detection of parallelism in a
sequential program, scalable parallelization on many-core processors involves
hard and interesting parallelism adaptation and mapping challenges. These
challenges include tailoring data locality to the memory hierarchy, structuring
independent tasks hierarchically to exploit multiple levels of parallelism,
tuning the synchronization grain, balancing the execution load, decoupling the
execution into thread-level pipelines, and leveraging heterogeneous hardware
with specialized accelerators. The polyhedral framework allows to model,
construct and apply very complex loop nest transformations addressing most of
the parallelism adaptation and mapping challenges. But apart from
hardware-specific, back-end oriented transformations (if-conversion, trace
scheduling, value prediction), loop nest optimization has essentially ignored
dynamic and speculative techniques. Research in polyhedral compilation recently
reached a significant milestone towards the support of dynamic, data-dependent
control flow. This opens a large avenue for blending dynamic analyses and
speculative techniques with advanced loop nest optimizations. Selecting
real-world examples from SPEC benchmarks and numerical kernels, we make a case
for the design of synergistic static, dynamic and speculative loop
transformation techniques. We also sketch the embedding of dynamic information,
including speculative assumptions, in the heart of affine transformation search
spaces
Implementation of a parallel unstructured Euler solver on shared and distributed memory architectures
An efficient three dimensional unstructured Euler solver is parallelized on a Cray Y-MP C90 shared memory computer and on an Intel Touchstone Delta distributed memory computer. This paper relates the experiences gained and describes the software tools and hardware used in this study. Performance comparisons between two differing architectures are made
- …