Challenges Towards Deploying Data Intensive Scientific Applications on
  Extreme Heterogeneity Supercomputers by Liu, Hang et al.
arXiv:1804.09738v1 [cs.DC] 25 Apr 2018
Challenges Towards Deploying Data Intensive Scientific
Applications on Extreme Heterogeneity Supercomputers
Hang Liu∗ Yufei Ding‡ Da Zheng† Seung Woo Son∗ Da Yan⋄
∗: UMass Lowell ‡: UCSB †: Unaffiliated ⋄: UAB
Hang Liu@uml.edu
Abstract
Transistor shrinking, which powered the advancement of
computing over the past half century, has stalled due to
the power wall; extreme heterogeneity now promises to be
the next driving force that meets the needs of increasingly
diverse scientific domains. To unlock the potential of such
supercomputers, we identify eight challenges in three
categories. First, one needs fast data movement, since
extreme heterogeneity will inevitably complicate the
communication circuits and thus hamper data movement.
Second, we need to intelligently schedule suitable hardware
for the corresponding applications/stages. Third, we have to
lower the programming complexity in order to encourage the
adoption of heterogeneous computing.
1 Introduction
Extreme heterogeneity is the result of using multiple
types of processors, accelerators and memory/storage in a
single computing platform. This differs from traditional
Top 500 supercomputers (8), which install at most two types
of processors and one type of accelerator and memory/storage
according to their specifications (e.g., Titan, Sequoia,
Trinity and Cori). Clearly, higher (or extreme)
heterogeneity supercomputing will support a wider variety of
application workflows and meet the needs of increasingly
diverse scientific domains.
This is especially true for data intensive scientific
applications such as High-Performance Conjugate Gradient
(HPCG) (4; 7) and High-Performance Data Analytics (HPDA),
each of which comprises a diverse collection of operations.
In particular, HPCG involves a sequence of repeated Basic
Linear Algebra Subprograms (BLAS) operations, e.g., SpMV,
xaxpy, xdot, xcopy and xxdot, and dissimilar BLAS operations
prefer different processors and accelerators. For instance,
SpMV prefers GPUs, while xaxpy and xxdot may benefit more
from many-core CPUs, and xdot and xcopy should favor
multi-core CPUs. HPDA stands for a collection of data
mining, graph computing and machine learning algorithms
(5; 11; 3). These applications typically process a daunting
volume of data and generate a nontrivial amount of
algorithmic metadata, with random access patterns on both
data structures. As such, the richer memory/storage
hierarchy of extreme heterogeneity will give users more
options to cache data according to its size and reuse
frequency.
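To make the mix of operations concrete, the following is a minimal sketch of an unpreconditioned conjugate gradient loop in pure Python. It is not the HPCG reference code (the benchmark additionally applies a preconditioner), and the CSR layout, the helper names `spmv`, `dot` and `axpy`, and the 3x3 test matrix are illustrative assumptions. It only shows that one iteration interleaves a sparse matrix-vector product with several vector updates and reductions, which are the kinds of operations that may prefer different hardware.

```python
def spmv(indptr, indices, data, x):
    """Sparse matrix-vector product y = A x, with A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y


def dot(x, y):
    """Reduction: inner product of two vectors."""
    return sum(a * b for a, b in zip(x, y))


def axpy(alpha, x, y):
    """Vector update: returns alpha * x + y."""
    return [alpha * a + b for a, b in zip(x, y)]


def conjugate_gradient(indptr, indices, data, b, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive definite A (no preconditioner)."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x, with x = 0
    p = list(r)              # search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:
            break
        Ap = spmv(indptr, indices, data, p)   # sparse, irregular access
        alpha = rr / dot(p, Ap)               # reduction
        x = axpy(alpha, p, x)                 # streaming vector update
        r = axpy(-alpha, Ap, r)               # streaming vector update
        rr_new = dot(r, r)                    # reduction
        p = axpy(rr_new / rr, p, r)           # p = r + beta * p
        rr = rr_new
    return x


# Hypothetical 3x3 SPD system [[4, 1, 0], [1, 3, 1], [0, 1, 2]] in CSR form.
indptr = [0, 2, 5, 7]
indices = [0, 1, 0, 1, 2, 1, 2]
data = [4.0, 1.0, 1.0, 3.0, 1.0, 1.0, 2.0]
b = [1.0, 2.0, 3.0]
x = conjugate_gradient(indptr, indices, data, b)
```

In a heterogeneous setting, each of the three helpers would be a candidate for placement on a different processor type, which is what makes per-operation hardware selection relevant.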
Putting hardware together is important, but a system that
can sufficiently and intelligently exploit the potential of
the corresponding hardware is even more crucial. Briefly,
we envision a need to address challenges along the
following three directions to unleash the capabilities of
extreme heterogeneity computing. First, one needs to
provide fast data movement from storage media to
processors, since installing more devices together will
inevitably complicate the communication circuits and thus
hamper data movement. Second, it is essential to
intelligently suggest the most suitable hardware for the
corresponding application (stages), with dynamic
adjustment. Third, we have to lower the programming entry
bar for both expert and non-expert users in order to
encourage the adoption of heterogeneous computing.
2 Data Movement
Although a deeper memory/storage hierarchy gives users
the opportunity to cache data at various levels according
to its reuse frequency, it may also lengthen the distance
between data and processing units. We admit that data
movement is already a significant bottleneck on
contemporary computing nodes, which are only modestly
heterogeneous. Such adverse effects will worsen rapidly as
system heterogeneity climbs toward the extreme. For
instance, a recent study (10) demonstrates that the
throughput drop for memory accesses across non-uniform
memory access (NUMA) nodes grows from 20% on a two-socket
machine (i.e., Intel Xeon E5-2683 v3) to 6.85× on an
eight-socket machine (i.e., Intel Xeon E7-8850 v3).
Unfortunately, contemporary Operating Systems
(OSes) (1; 6) pay little attention to the potential
drawbacks of heterogeneity. In particular, they disregard
the differences in circuit distance between processors,
accelerators and storage media caused by heterogeneity:
they assign processors, accelerators and storage media to
applications in a round-robin fashion. In addition, even
when computation happens only on accelerators, they still
require the application to copy the data from storage media
to CPU memory before sending it on to the accelerators.
Beyond that, the existing page cache relies on a two-layer
FIFO queue policy to cache all data fetched from disk into
CPU memory. If the data that should be cached does not fit
this two-layer FIFO queue policy, which is most likely the
case in HPDA, the current page cache may waste both memory
space and time caching and querying the data (9). As such,
we advocate addressing the following challenges in this
field.
Challenge 1. Need to configure affinity across storage,
memory, processor and accelerator.
Challenge 2. Need to introduce direct data access from
accelerator to storage.
Challenge 3. Need user controllable page cache with a
variety of caching policies.
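As one illustration of what a user-controllable page cache could look like, the sketch below implements a small user-level cache whose eviction policy is chosen by the caller. It is a toy under stated assumptions: the class `PageCache`, the `load` callback, and the access trace are hypothetical, and the single-queue FIFO here is a simplification of, not a model of, the kernel policy discussed above.

```python
from collections import OrderedDict


class PageCache:
    """User-level cache with a caller-selected eviction policy (hypothetical API)."""

    def __init__(self, capacity, policy="lru"):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        if policy not in ("fifo", "lru"):
            raise ValueError("unknown policy: " + policy)
        self.capacity = capacity
        self.policy = policy
        self.pages = OrderedDict()   # oldest entry first
        self.hits = 0
        self.misses = 0

    def read(self, page_id, load):
        """Return the page, calling load(page_id) only on a miss."""
        if page_id in self.pages:
            self.hits += 1
            if self.policy == "lru":
                # LRU refreshes recency on every hit; FIFO does not.
                self.pages.move_to_end(page_id)
            return self.pages[page_id]
        self.misses += 1
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)   # evict the oldest entry
        self.pages[page_id] = load(page_id)
        return self.pages[page_id]


def hit_count(policy, trace, capacity=2):
    """Replay an access trace and report the number of cache hits."""
    cache = PageCache(capacity, policy)
    for page_id in trace:
        cache.read(page_id, load=lambda pid: "data-%d" % pid)
    return cache.hits


# Hypothetical trace in which page 0 is reused between one-off accesses.
trace = [0, 1, 0, 2, 0, 3, 0]
fifo_hits = hit_count("fifo", trace)
lru_hits = hit_count("lru", trace)
```

On this trace the FIFO variant evicts the frequently reused page while the LRU variant keeps it, which is the kind of workload-dependent difference that motivates letting the user pick the policy.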
3 Resource Management
Despite certain drawbacks, we envision that future
extreme heterogeneity supercomputers can be a strong fit
for today’s increasingly irregular scientific applications,
provided they come with intelligent resource management. We
illustrate the computational irregularity of scientific
applications with cardiac simulation (7), which simulates
the blood flow patterns in the left ventricle. In
particular, this process needs to simulate, in an
interleaved manner, muscle systole and diastole as well as
blood flow entering and exiting the heart over time. In
this application, muscle simulation involves tremendous
data access and thus prefers many-core CPUs, blood
simulation expects more benefit from accelerators, and the
initializations of both stages perform better on multi-core
CPUs.
Similarly, the various levels of the memory/storage
hierarchy, which have different capacities, throughputs and
latencies, can help enhance performance, reduce cost and
save energy if used effectively. For instance, the
Neurodata storage system (2) lives in the cloud. It uses S3
for backend storage and memory and SSDs for caching. To
accelerate different types of queries, such as data
ingestion and data analytics, this system maintains
separate caching layers, which further increase resource
utilization and performance. However, none of the above
benefits will be possible for scientific computing if we do
not have a toolkit that can intelligently select the
correct hardware for distinct datasets and applications.
Consequently, we envision the following resource management
challenges in this category.
Challenge 4. Need a judicious processor selection based
on the execution features of various applications (stages).
Challenge 5. Need an intelligent data distribution across
HDD, SSD, NVMe, fast and slow memory components.
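One simple baseline for the data distribution in Challenge 5 is a greedy placement that puts the most frequently reused data in the fastest tier with room for it. The sketch below assumes hypothetical tier capacities, dataset sizes and reuse counts; the function `place` is illustrative and not part of any existing toolkit.

```python
def place(datasets, tiers):
    """Greedy placement: most-reused data goes to the fastest tier that fits.

    datasets: list of (name, size_gb, reuse_count)
    tiers:    list of (tier_name, capacity_gb), ordered fastest first
    """
    free = [capacity for _, capacity in tiers]
    placement = {}
    by_reuse = sorted(datasets, key=lambda d: d[2], reverse=True)
    for name, size, _ in by_reuse:
        for i, (tier_name, _) in enumerate(tiers):
            if size <= free[i]:
                free[i] -= size
                placement[name] = tier_name
                break
        else:
            raise ValueError("no tier can hold " + name)
    return placement


# Hypothetical capacities in GB, ordered from fastest to slowest.
tiers = [
    ("fast memory", 16),
    ("slow memory", 64),
    ("NVMe", 512),
    ("SSD", 2048),
    ("HDD", 16384),
]

# Hypothetical datasets: (name, size in GB, expected reuse count).
datasets = [
    ("graph edges", 400, 50),
    ("vertex metadata", 12, 1000),
    ("checkpoint", 1500, 1),
    ("frontier queue", 8, 800),
]

placement = place(datasets, tiers)
```

The greedy rule is only a starting point. An intelligent distribution would also weigh access patterns, migration cost and energy, none of which are modeled here.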
4 Compiler Framework
Application-domain experts should be relieved of the
demand for expertise in various programming models and a
spectrum of hardware architectures. Otherwise, they may be
discouraged from deploying their applications on a
heterogeneous supercomputing system, as exhaustively
developing source code for all types of hardware is both
time consuming and labor intensive. To utilize a
contemporary, modestly heterogeneous supercomputer with
CPUs, GPUs, and FPGAs, for instance, users need to learn
MPI/OpenMP, CUDA/OpenCL and Verilog/VHDL.
One common remedy is to develop functional libraries
for each of these processors. On top of these libraries, a
set of unified APIs can then be designed to alleviate the
processor-specific programming effort. This has proven to
be a reasonable approach, and it has been used extensively
in the deep learning domain. For example, TensorFlow
provides APIs in both C++ and Python, with which users can
deploy their programs on CPUs or GPUs without learning to
program in MPI/OpenMP or CUDA/OpenCL.
However, such a library-API methodology suffers from
two major drawbacks. First, tuning a special-purpose
library can itself take years, even for domain experts. For
instance, cuDNN for GPUs, a specialized counterpart of the
dense linear algebra libraries (BLAS) for CPUs, was tuned
by a team of expert programmers at NVIDIA and was not
released until after two years of extensive tuning. Second,
high-level APIs often implicitly sacrifice performance for
abstraction. It is known that cross-layer and other
large-scope optimizations are missing in TensorFlow and
other library-API based frameworks, as the supplied APIs
operate at the layer level (e.g., convolution layer,
pooling layer, fully-connected layer). Consequently, a
program written with these APIs will, by default, suffer
from the loss of those optimizations.
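A toy example may help show what a cross-layer optimization buys. The sketch below is not how TensorFlow or any other framework is implemented; the two "layers" and their names are hypothetical stand-ins. It contrasts a layer-by-layer execution, which materializes a full intermediate buffer between layers, with a fused version that computes the same result in one pass.

```python
def scale_layer(xs, weight, bias):
    """Hypothetical layer 1: elementwise affine transform."""
    return [weight * x + bias for x in xs]


def relu_layer(xs):
    """Hypothetical layer 2: elementwise rectifier."""
    return [max(0.0, x) for x in xs]


def layerwise(xs, weight, bias):
    """Layer-level API: each call writes and then re-reads a full buffer."""
    hidden = scale_layer(xs, weight, bias)   # intermediate buffer
    return relu_layer(hidden)


def fused(xs, weight, bias):
    """Cross-layer fusion: one pass over the data, no intermediate buffer."""
    return [max(0.0, weight * x + bias) for x in xs]


xs = [-2.0, -0.5, 0.0, 1.5]
layerwise_out = layerwise(xs, 2.0, 1.0)
fused_out = fused(xs, 2.0, 1.0)
```

Both versions return the same values, but the fused one touches memory once per element, which is the kind of saving a layer-level API cannot express on its own.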
To resolve these problems, we suggest building a more
powerful compiler framework which will tackle the fol-
lowing challenges.
Challenge 6. Need a compiler to translate source code
from a high-level programming language, e.g., Python, to a
processor-specific language, e.g., CUDA.
Challenge 7. Need a compiler that can apply a good set of
optimizations to the translated programs, including
traditional optimizations like loop tiling and fusion, as
well as some predefined/user-specified domain-specific
optimizations.
Challenge 8. Need a runtime that examines the data flow
of the program and other dynamic information for en-
abling a larger set of optimizations dynamically, includ-
ing those cross-layer and other large-scope optimizations.
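As an illustration of one traditional optimization named in Challenge 7, the sketch below applies loop tiling to a dense matrix multiplication by hand. A compiler would perform this transformation automatically; the function names, the tile size and the test matrices are assumptions made for the example. Pure Python will not show the cache benefit, so the point is only that the tiled loop nest visits the same iterations in a blocked order and produces the same result.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A B."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same computation with each loop split into tile-sized blocks."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, p, tile):
            for kk in range(0, m, tile):
                # Work on one block; min() handles ragged edges.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, p)):
                        for k in range(kk, min(kk + tile, m)):
                            c[i][j] += a[i][k] * b[k][j]
    return c


# Hypothetical integer test matrices: A is 4x5, B is 5x3.
a = [[i + j for j in range(5)] for i in range(4)]
b = [[i * j + 1 for j in range(3)] for i in range(5)]
reference = matmul_naive(a, b)
tiled = matmul_tiled(a, b, tile=2)
```

On real hardware the tile size would be chosen so that a block of each operand fits in a fast memory level, which ties this optimization back to the memory hierarchy discussed earlier.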
References
[1] Daniel P Bovet and Marco Cesati. Understanding the Linux Kernel: from
I/O ports to process management. O’Reilly Media, Inc., 2005.
[2] Randal Burns et al. The open connectome project data cluster: Scalable
analysis and vision for high-throughput neuroscience. In SSDBM, 2013.
[3] Yufei Ding et al. Yinyang k-means: A drop-in replacement of the classic
k-means with consistent speedup. In ICML, 2015.
[4] High-Performance Conjugate Gradient (HPCG) Benchmark.
http://www.hpcg-benchmark.org/, 2017.
[5] Hang Liu et al. Enterprise: Breadth-first graph traversal on GPUs. In SC,
2015.
[6] Hang Liu and H Howie Huang. Graphene: Fine-grained IO management for
graph computing. In FAST, 2017.
[7] Rajat Mittal, Jung Hee Seo, Hang Liu, et al. Computational modeling of
cardiac hemodynamics: current status and future outlook. JCP, 2016.
[8] Top 500 November Ranking. https://www.top500.org/lists/2017/11/, 2017.
[9] Brian Van Essen, Henry Hsieh, Sasha Ames, Roger Pearce, and Maya
Gokhale. Di-mmap – a scalable memory-map runtime for out-of-core data-
intensive applications. Cluster Computing, 2015.
[10] Kaiyuan Zhang, Rong Chen, and Haibo Chen. NUMA-aware graph-structured
analytics. In PPoPP, 2015.
[11] Da Zheng et al. FlashGraph: Processing billion-node graphs on an array of
commodity SSDs. In FAST, 2015.
