Copyright 2015 Babak Behzad
OPTIMIZING PARALLEL I/O PERFORMANCE OF HPC APPLICATIONS
BY
BABAK BEHZAD
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2015
Urbana, Illinois
Doctoral Committee:
Professor Marc Snir, Chair
Professor Marianne Winslett
Professor William Gropp
Doctor Dean Hildebrand, IBM Almaden Research Center
ABSTRACT
Parallel I/O is an essential component of modern High Performance Computing (HPC). Ob-
taining good I/O performance for a broad range of applications on diverse HPC platforms is
a major challenge, in part because of complex inter-dependencies between I/O middleware
and hardware. The parallel file system and I/O middleware layers all offer optimization
parameters that can, in theory, result in better I/O performance. Unfortunately, the right
combination of parameters is highly dependent on the application, HPC platform, and prob-
lem size/concurrency. Scientific application developers do not have the time or expertise
to take on the substantial burden of identifying good parameters for each problem config-
uration. They resort to using system defaults, a choice that frequently results in poor I/O
performance. We expect this problem to be compounded on exascale class machines, which
will likely have a deeper software stack with hierarchically arranged hardware resources.
We present a line of solutions to this problem, comprising an autotuning system for optimizing I/O performance, I/O performance modeling, I/O kernel generation, and pattern-driven I/O tuning. We demonstrate the value of these solutions across platforms, applications, and at scale.
To my loving family and friends, who helped me on this journey.
ACKNOWLEDGMENTS
I would like to express my deepest appreciation to Marc Snir, whom I was thrilled to have as my adviser.
I also greatly appreciate the help and support of Marianne Winslett, Bill Gropp, and Dean Hildebrand. Without their support this work would not have been possible.
Let me also thank LBL's Prabhat and Suren Byna and all the ExaHDF5 project members for their continuous support. I would also like to thank Quincey Koziol and Ruth
Aydt from The HDF Group.
My labmates deserve a special mention. I spent many hours with them, shared many
ideas, and discussed many issues. They taught me as many things as the courses did.
Fredrik Kjolstad, Aparna Sasidharan, Jon Calhoun, Alex Brooks, Hoang-Vu Dang, and Farah Hariri were the best labmates one could hope for.
Without the love and help of my family, this would not have been possible. I would like to first thank my sister and brother-in-law, Banafsheh Behzad and Hadi Tavassol, for always being there for me. And finally, I thank my mom and dad, Simin Samadian and Mohammad Reza Behzad, for giving me the opportunity to take this journey.
GRANTS
This work is supported by the Director, Office of Science, Office of Advanced Scientific
Computing Research, of the U.S. Department of Energy under Contract No. DE-AC02-
05CH11231. This research used resources of the National Energy Research Scientific Com-
puting Center, the Texas Advanced Computing Center and the Argonne Leadership Com-
puting Facility at Argonne National Laboratory, which is supported by the Office of Science
of the U.S. Department of Energy under contract DE-AC02-06CH11357. It was partly sup-
ported by NSF grant 0938064.
This work was supported by the Office of Advanced Scientific Computing Research, Of-
fice of Science, U.S. Department of Energy, under contract numbers DE-AC02-05CH11231
and DE-AC02-06CH11357. This research used resources of the National Energy Research
Scientific Computing Center.
This work is supported by NSF grant 0938064; by the Director, Office of Science, Office of
Advanced Scientific Computing Research, of the U.S. Department of Energy under Contract
No. DE-AC02-05CH11231; and by The HDF Group. This research used resources of Texas
Advanced Computing Center.
This work is supported by the Director, Office of Science, Office of Advanced Scien-
tific Computing Research, of the U.S. Department of Energy under Contract DE-AC02-05CH11231 and DE-AC02-06CH11357. It used resources of the Texas Advanced Computing
Center.
TABLE OF CONTENTS

List of Figures
List of Tables
List of Algorithms

CHAPTER 1  Introduction
  1.1  Dissertation Organization

CHAPTER 2  Background
  2.1  Parallel I/O
  2.2  HPC Platforms
  2.3  Application I/O Kernels

CHAPTER 3  Taming Parallel I/O Complexity with Autotuning
  3.1  Autotuning Framework
    3.1.1  H5Evolve: Sampling the search space
    3.1.2  H5Tuner: Setting I/O parameters at runtime
  3.2  Experimental Setup
    3.2.1  Scale and dataset sizes
    3.2.2  Parameter space
  3.3  Results
    3.3.1  Tuned I/O performance results
    3.3.2  Tuned configurations
    3.3.3  Tuned I/O performance across platforms
    3.3.4  Tuned I/O for different benchmarks
    3.3.5  Tuned I/O at different scales
  3.4  Conclusions

CHAPTER 4  Improving Parallel I/O Autotuning with Performance Modeling
  4.1  Experimental Setup
  4.2  Empirical Performance Models
    4.2.1  Nonlinear regression model preliminaries
    4.2.2  Development of I/O performance models
  4.3  Integration of Performance Models in Autotuning Framework
  4.4  Experimental Results
    4.4.1  Performance models vs. Genetic algorithms
    4.4.2  Testing on space similar to training
    4.4.3  Testing on a larger space
    4.4.4  Testing on a different application: VORPAL-IO
    4.4.5  Testing on larger scale
    4.4.6  Large-scale results
    4.4.7  Overall improvement
    4.4.8  Analysis of the interdependencies
  4.5  I/O Interference
  4.6  Conclusions

CHAPTER 5  A Multi-Level Approach for Understanding I/O Activity in HPC Applications
  5.1  Framework
  5.2  Evaluation
    5.2.1  VPIC-IO Benchmark
    5.2.2  Simple HDF5 Benchmark
  5.3  Conclusions

CHAPTER 6  Automatic Generation of I/O Kernels for HPC Applications
  6.1  Framework
    6.1.1  I/O Tracing: Recorder
    6.1.2  Trace Merging
    6.1.3  Code Generation
  6.2  Setup and Evaluation Results
    6.2.1  Correctness of the framework
    6.2.2  Quality of the generated code
  6.3  Conclusions

CHAPTER 7  Pattern-driven Parallel I/O Tuning
  7.1  I/O Autotuning Framework
    7.1.1  I/O Traces
    7.1.2  Extraction and Identification of High-level I/O Patterns
  7.2  Setup and Evaluation Results
    7.2.1  An application with the same I/O pattern
    7.2.2  An application with similar I/O pattern
    7.2.3  A new application
  7.3  Conclusions

CHAPTER 8  Related Work
  8.1  Autotuning
  8.2  I/O Modeling
  8.3  I/O Recording
  8.4  I/O Replaying
  8.5  I/O Patterns

CHAPTER 9  Concluding Remarks
  9.1  Comparison of the approaches
  9.2  Contributions
  9.3  Future Research Directions

REFERENCES
LIST OF FIGURES

2.1  Parallel I/O Stack and various tunable parameters
2.2  An illustration of Cray CB algorithm 2
2.3  Partitioning of file domains and processors between aggregators in VPIC-IO when the Lustre stripe size is (a) 16MB, (b) 128MB
2.4  3D Block structure of VORPAL-IO datasets in HDF5
3.1  Overall Architecture of the Autotuning Framework
3.2  A pictorial depiction of the genetic algorithm used in the autotuning framework
3.3  Design of H5Tuner component as a dynamic library which intercepts HDF5 functions to tune I/O parameters
3.4  An XML file showing a sample configuration with optimization parameters at different levels of the parallel I/O stack. The tuning can be applied to all files an application writes or to a specific file
3.5  Summary of performance improvement for each I/O benchmark running on (a) 128 cores, (b) 2048 cores, (c) 4096 cores. The I/O bandwidth axes' scales are different in each of the plots
3.6  Speedups with respect to platforms, benchmarks, and scale of the experiments
3.7  The effect of Hopper's Collective Buffer Size on performance of VPIC-IO on 2048 cores
3.8  The effect of Intrepid's Collective Buffer Size on performance of VPIC-IO on 2048 cores
3.9  The effect of Lustre Stripe Size value on performance of VORPAL on 2048 cores of Stampede
3.10 The effect of Lustre Stripe Size value on performance of GCRM on 2048 cores of Stampede
3.11 Raw Bandwidth plots and breakdown across scale
4.1  I/O performance variability and effect of interference on a single node writing to a file
4.2  Correlation between observed and predicted single-node write times on training (50%) and testing (50%) subsets
4.3  Raw data and nonlinear model of the form (4.6) for VPIC-IO write times as the number of aggregators is varied
4.4  Design of our new autotuning system making use of performance models
4.5  Comparison of the model-predicted, measured, and default-setting write times of the top twenty configurations for VPIC-IO running on 512 cores
4.6  Comparison of the model-predicted, measured, and default-setting write times of the top twenty configurations for VPIC-IO on 512 cores
4.7  Comparison of the model-predicted, measured, and default-setting write times of the top twenty configurations for VPIC-IO on 2048 cores
4.8  Comparison of the model-predicted, measured, and default-setting write times of the top twenty configurations for VORPAL-IO on 512 cores
4.9  Comparison of the model-predicted, measured, and default-setting write times of the top twenty configurations for VORPAL-IO on 2048 cores
4.10 Comparison of the model-predicted, measured, and default-setting write times of the top twenty configurations for VPIC-IO on 8192 cores
4.11 Comparison of the model-predicted, measured, and default-setting write times of the top twenty configurations for VORPAL-IO on 8192 cores
4.12 Effect of MPI-IO aggregators on training set of VPIC-IO on Stampede
4.13 Effect of Lustre stripe size on training set of VPIC-IO on Stampede
4.14 Effect of Lustre stripe size on performance of the Top 20 VPIC-IO configurations on 4K cores of Edison. Stripe count is fixed at 96 (maximum # of OSTs on Edison)
4.15 Effect of MPI-IO aggregators on performance of 14 configurations of the Top 20 experiments on 16K cores of Edison. Stripe count is fixed at 96 (maximum # of OSTs on Edison)
4.16 Summary of the best I/O performance obtained in the Top-20 configurations for each I/O benchmark running on (a) 4K cores, (b) 8K cores, (c) 16K cores. Note that (a) and (b) are log-scale plots
4.17 Effect of Lustre stripe count at three different scales of VPIC-IO on Stampede (a) 512 cores, (b) 1K cores, (c) 2K cores
4.18 Ratio of MPI-IO's aggregators and Lustre's stripe count on three different applications on 2K cores of Hopper (a) VPIC-IO, (b) VORPAL-IO, (c) GCRM-IO
5.1  Dynamic instrumentation of the I/O stack by Recorder
5.2  Trace file sizes for VPIC-IO process ranks
5.3  VPIC-IO write operations, rank 0, default parameters
5.4  VPIC-IO write operations, rank 1, default parameters
5.5  VPIC-IO write operations, rank 0, tuned parameters
5.6  Independent MPI-IO read operations issued by all processes to read HDF5 metadata
6.1  Flow of the Framework
6.2  Method of intercepting HDF5 calls by the Recorder
6.3  Illustrating three consecutive merging operations
6.4  Comparison of Darshan POSIX I/O counters of original application and the generated one by the framework (a) VPIC-IO, (b) VORPAL-IO, (c) GCRM-IO
7.1  An overview of our I/O autotuning framework
7.2  A sample I/O trace generated by the Recorder for a simple parallel application called pH5Example
7.3  The four HDF5 hyperslab selection function calls across different ranks of a parallel four-process run of pH5Example
7.4  I/O patterns of (a) VPIC-IO (b) GCRM-IO (c) VORPAL-IO benchmark
7.5  The I/O performance of the autotuned (a) IOR (b) Resemble-VORPAL-IO (c) FLASH-IO application on Hopper and Edison compared to the default configuration
LIST OF TABLES

2.1  Details of various HPC systems used in this thesis
3.1  Weak scaling configuration for the three I/O benchmarks
3.2  A list of the tunable parameters and their ranges used for experiments in this study. We show the minimum and maximum values for each parameter, with powers-of-two values in between. The last column shows the number of distinct values used for each parameter
3.3  I/O rate and speedups of I/O Benchmarks with Tuned Parameters over Default Parameters
3.4  Tuned parameters of benchmarks on the systems for 2048-core experiments
4.1  Training configurations (90 in total) tested as part of the single-node experiment
4.2  Breakdown of training set for the parallel I/O model
4.3  Top ten predicted configurations of VPIC-IO on 512 cores. The best-performing configuration is provided to the user as output of the autotuning process
4.4  Top ten predicted configurations of VPIC-IO on 512 cores selected by the autotuning framework
4.5  Best ten predicted configurations selected by the framework for VPIC-IO on 2048 cores
4.6  Best ten predicted configurations for VORPAL-IO on 512 cores
4.7  Highest bandwidth achieved for the two applications by selecting the best-performing configuration suggested by our autotuning framework on Hopper
4.8  Highest bandwidth achieved for the three applications by selecting the best-performing configuration suggested by our proposed framework
4.9  The top twenty configurations predicted by our model and their respective I/O bandwidth for VPIC-IO on 4K cores of Stampede generating a 1 TB file size
5.1  Features of Recorder Framework
5.2  VPIC-IO parameters and I/O bandwidth
5.3  Darshan MPI-IO counters for simple HDF5 benchmark on eight cores
6.1  Comparison of the code size of original and generated benchmarks
9.1  A comparison of GA, modeling and default configuration
LIST OF ALGORITHMS

1  Merging Algorithm
2  MergeEvent method
3  GetDistance method
CHAPTER 1
INTRODUCTION
High Performance Computing (HPC) applications are constantly moving towards simulating scientific phenomena at finer granularities and massive scales by leveraging advances in parallel processing hardware. Finer granularities mean larger amounts of data, and this has caused data to grow at an unprecedented rate. Such rapid growth of data in size
and in complexity requires efficient techniques to manage data on file systems. However,
scalability of applications is often limited by poorly performing parallel I/O. Ensuring fast
and efficient parallel I/O is critical for many HPC applications.
I/O can be a significant bottleneck on HPC application performance. The need to in-
crease checkpoint frequency and the increasing emphasis on big data analytics increase the
importance of I/O. Parallel I/O systems are complex: I/O is often done at the application
level using a high-level library, such as HDF5 [1]; HDF5 is implemented atop MPI-IO [2]
which, in turn, performs POSIX I/O calls against a parallel file system, such as Lustre [3].
Each of these subsystems has multiple configuration parameters and performance can be
very sensitive to their settings.
However, the configuration of these parameters to obtain the best possible I/O perfor-
mance depends on diverse factors, such as the I/O application, storage hardware, problem
size, and number of processors. HPC application developers, typically experts in their sci-
entific domains, do not have the time or expertise to explore the intricacies of I/O systems.
They often resort to using default I/O parameter settings that can result in poor performance
and inefficient use of available I/O bandwidth. As the complexity and concurrency of future
HPC systems grow, we expect that so too will obstacles to achieving high-performance I/O.
Application developers should be able to achieve good I/O performance without becoming
experts on the tunable parameters for every file system and I/O middleware layer they
encounter. Scientists want to write their application once and obtain reasonable performance
across multiple systems; that is, they want I/O performance portability across platforms. From an
I/O research-centric viewpoint, a considerable amount of effort is spent optimizing individual
applications for specific platforms. While the benefits are definitely worthwhile for specific
application codes, and some optimizations carry over to other applications and middleware
layers, it would be ideal if a single optimization framework were capable of generalizing across
multiple applications.
In order to use HPC machines and human resources effectively, it is imperative that we
design systems that can hide the complexity of the I/O stack from scientific application
developers without penalizing performance. Our vision is to develop a system that will
allow application developers to issue I/O calls without modification and rely on an intelligent
runtime system to transparently determine and execute an I/O strategy that takes all the
levels of the I/O stack into account.
1.1 Dissertation Organization
Our autotuning framework, discussed in detail in Chapter 3, is the first effort towards this goal. This framework uses a genetic algorithm to search a large space of tunable parameters
and to identify effective settings at all layers of the parallel I/O stack. The parameter settings
are applied transparently by the autotuning system via dynamically intercepted HDF5 calls.
To validate our autotuning system, we applied it to three I/O benchmarks (VPIC, VOR-
PAL, and GCRM) that replicate the I/O activity of their respective applications. We tested
the system with different weak-scaling configurations (128, 2048, and 4096 CPU cores) that
generate 30 GB to 1 TB of data, and executed these configurations on diverse HPC plat-
forms (Cray XE6, IBM BG/P, and Dell Cluster). In all cases, the autotuning framework
identified parameter values that substantially improved write performance over default sys-
tem settings. We consistently demonstrate I/O write speedups between 2x and 100x for test
configurations as compared to the default I/O settings.
While the GA-based autotuning framework of Chapter 3 consistently demonstrates I/O write speedups between 2x and 100x, the overhead of the GA approach is substantial. We reduce the search time significantly by using empirical models of I/O performance in Chapter 4.
Tools that capture and analyze I/O activity and guide performance optimization are highly desirable. Unfortunately, understanding the I/O behavior of an HPC application is not a simple task, due to the previously noted complex interactions between multiple software components. Chapter 5 shows how the trace output of Recorder, our multi-level I/O tracing framework, can be used to investigate I/O activity and identify performance inefficiencies in two I/O benchmarks running on a leading-
edge HPC platform. We believe that a multi-level I/O tracing framework can provide key
insights to end users and I/O library developers working to improve I/O on HPC platforms.
Realistic I/O kernels are an important tool for the study of I/O. They can be used to
evaluate storage systems (current systems through execution, and new designs through simulation), and they facilitate collaboration between institutions: there are many cases where full-fledged application codes cannot be shared between institutions because of their proprietary or classified nature. An I/O kernel can provide detailed information on the I/O characteristics of such an application while revealing little about the computation it performs. In addition, it can be run faster than the full application. In Chapter 6, we show how to
automatically create such an I/O kernel program: we execute the target application with an instrumented I/O library and then “compress” the resulting I/O traces into a compact C program that regenerates those traces.
Chapter 7 focuses on developing and testing a framework for tuning the parallel I/O of arbitrary applications. We believe that I/O patterns are the key to this general problem. Therefore, we first define a notion of I/O patterns and populate a database of good configurations for these patterns, computed by our autotuning framework. We then implement an intelligent runtime system capable of extracting I/O patterns from arbitrary applications and consulting the performance database to propose an improved I/O strategy.
In Chapter 8, we review the research projects and publications related to each of the components of this work. Finally, in Chapter 9, we conclude with the list of contributions along with some possible future research directions.
CHAPTER 2
BACKGROUND
2.1 Parallel I/O
A parallel I/O subsystem typically consists of various layers of middleware libraries and
hardware. The most common parallel I/O stack in current HPC machines has high-level
I/O libraries and file formats (e.g., HDF5, NetCDF, and ADIOS), I/O middleware (e.g.,
MPI-IO and POSIX), parallel file systems (e.g., Lustre, GPFS, and PVFS), and storage and
I/O hardware. When parallel applications perform I/O operations, the data moves from
individual processors to the storage hardware through the multiple layers of the stack.
Figure 2.1 shows a contemporary parallel I/O software stack with HDF5 [4] as the high-
level I/O library, MPI-IO as the middleware layer, and a parallel file system (Lustre, GPFS,
etc.). While each layer of the stack exposes tunable parameters for improving performance,
there is little guidance for application developers on how these parameters interact with each
other and affect overall I/O performance.
Additionally, to achieve good I/O performance, each of the layers offers optimization
strategies. For instance, MPI-IO provides two modes of writing data to disks: independent
I/O and collective I/O [5]. With independent I/O, each MPI process writes the data to
storage independent of other processes of the application. In collective I/O mode, the data
is collected at a few aggregator processes and the aggregators write the data to storage. The
collective I/O mode is preferable when the number of MPI processes is large because too
many requests to the file system degrade I/O performance. Throughout this chapter, we
focus on the write operations that originate from large simulations.
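As a concrete illustration, in HDF5 this choice is made per write through the dataset transfer property list. The following is a minimal sketch, assuming a parallel (MPI-enabled) HDF5 build; the dataset and dataspace handles passed in are whatever the application has already created:

#include <hdf5.h>

/* Minimal sketch: issue one write in collective (two-phase) mode.
 * Passing H5FD_MPIO_INDEPENDENT instead selects independent I/O. */
static herr_t write_one_variable(hid_t dset, hid_t memspace, hid_t filespace,
                                 const float *buf)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);      /* transfer property list */
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);  /* request collective I/O */
    herr_t status = H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace,
                             dxpl, buf);
    H5Pclose(dxpl);
    return status;
}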
A typical implementation of a collective I/O write operation includes two phases: the
data collection phase at aggregators and the I/O phase [6]. Each MPI process first analyzes
its request to the file and calculates the start offset and end offset. These two variables
identify the segment of the file accessed by the processor. After calculating these variables,
each process sends its values to all the other processes. The aggregators then compute the
partitions, called file domains, of the file they are responsible for writing.

[Figure 2.1: Parallel I/O Stack and various tunable parameters. Layers from top to bottom: Application; HDF5 (alignment, chunking, etc.); MPI-IO (enabling collective buffering, sieving buffer size, collective buffer size, collective buffer nodes, etc.); Parallel File System (number of I/O nodes, stripe size, enabling prefetching buffer, etc.); Storage Hardware.]

In ROMIO [5], which is the basis for many MPI-IO implementations, the aggregators split the range of the
file being updated equally in a block-cyclic distribution. Figure 2.2 shows an example file
domain assignment in a configuration with “a” I/O aggregators (processes shaded in gray),
each of them in charge of one file domain.
Parallel file systems, such as Lustre, typically use multiple storage servers to parallelize
I/O operations. Lustre uses Object Storage Targets (OSTs) for storing chunks of data.
Lustre allows users or applications to control the number of OSTs, called the stripe count,
and the size of contiguous chunks of data, called the stripe size, for storing the data. The
MPI-IO aggregators write blocks of size equal to the stripe size in a round-robin fashion [7].
Several algorithms have been designed for selecting the aggregators and writing data to
stripes [8]. Among these is CB alignment algorithm 2, developed in Cray's MPT library. Figure 2.2 illustrates CB algorithm 2: the block size used to partition the file into domains equals the stripe size, and the blocks are consequently written to OSTs in a round-robin fashion. In this algorithm, Cray's MPT sets the collective buffering buffer size equal to the
Lustre stripe size. Therefore, the main I/O parameters to tune are: Lustre stripe count,
Lustre stripe size, and the MPI-IO number of collective buffering nodes (aggregators).
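To make the preceding concrete, the sketch below opens a shared file with these three tunables applied as MPI-IO hints. This is illustrative only: the hint names follow ROMIO/Cray MPT conventions, and the values shown are arbitrary examples rather than recommendations (striping hints take effect only at file creation):

#include <mpi.h>

/* Open a shared file with the three main tunables passed as MPI-IO hints. */
static MPI_File open_with_hints(MPI_Comm comm, const char *path)
{
    MPI_Info info;
    MPI_File fh;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "64");     /* Lustre stripe count */
    MPI_Info_set(info, "striping_unit", "8388608");  /* stripe size: 8 MB   */
    MPI_Info_set(info, "cb_nodes", "64");            /* # of aggregators    */
    MPI_File_open(comm, (char *)path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  info, &fh);
    MPI_Info_free(&info);
    return fh;
}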
Note that from now on, we define write time to be the time elapsed from calling a write operation in a higher-level library until the call returns; this includes all the communication and I/O time needed for the operation.
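A minimal sketch of how this quantity can be measured in practice (an illustrative helper, not the kernels' actual instrumentation; it assumes a float dataset and a pre-built transfer property list):

#include <mpi.h>
#include <hdf5.h>

/* Time one write as defined above: from entering the call until it
 * returns, reported as the maximum across ranks, since the slowest
 * rank determines the operation's write time. */
static double timed_write(hid_t dset, hid_t memspace, hid_t filespace,
                          hid_t dxpl, const void *buf, MPI_Comm comm)
{
    double t, t_max;
    MPI_Barrier(comm);              /* align ranks before starting the clock */
    t = MPI_Wtime();
    H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, buf);
    t = MPI_Wtime() - t;
    MPI_Allreduce(&t, &t_max, 1, MPI_DOUBLE, MPI_MAX, comm);
    return t_max;
}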
[Figure 2.2: An illustration of Cray CB algorithm 2. The file is divided into stripes 0, 1, ..., c-1, c, c+1, ..., 2c-1, 2c, ..., written round-robin across OST 0 through OST c; processes P0 through Pn-1 are grouped into file domains File_Domain_0 through File_Domain_a, each handled by one aggregator (a0 = Pi, a1 = Pj, ..., shaded in gray).]
2.2 HPC Platforms
All the experiments in this dissertation are conducted on the following HPC platforms:
• Edison: Edison is a supercomputer at National Energy Research Scientific Computing
Center (NERSC). It is a Cray XC30 system consisting of 5,576 twenty-four-core nodes with 64 GB of memory per node. It uses the Cray Aries interconnect with a Dragonfly topology and has three Lustre file systems with an aggregate bandwidth of 168 GB/s. We used only the scratch2 file system in these experiments, with a maximum of 96 OSTs and a peak I/O bandwidth of 48 GB/s. Cray's MPI library v7.0.4, HDF5 v1.8.11, and H5Part v1.6.6 were used on Edison.
• Hopper: Hopper is another supercomputing system located at NERSC. It is a Cray XE6 system containing 6,384 twenty-four-core nodes with 32 GB of memory per node. It employs the Gemini interconnect with a 3D torus topology. We used a Lustre file system with 156 OSTs and a peak bandwidth of about 35 GB/s for storing data. We used Cray's MPI library v6.0.1, HDF5 v1.8.11, and H5Part v1.6.6 for compiling the I/O kernels.
• Intrepid: Intrepid, an IBM Blue Gene/P (BG/P) system at the Argonne Leadership Computing Facility (ALCF), is a 40-rack, half-petaflop system. Each rack contains 1024 nodes with 850 MHz quad-core processors and 2 GB RAM per node. It is also equipped with 640 I/O nodes and more than 7.6 PB of storage.
• Stampede: Stampede is a Dell PowerEdge C8220 cluster at the Texas Advanced Computing Center (TACC). It has 6,400 sixteen-core nodes with 32 GB of memory per node. It uses Mellanox FDR InfiniBand technology with a two-level fat-tree topology. Stampede's Lustre file system, with 160 OSTs (in the testing experiments we use 156 OSTs as the maximum stripe count on Stampede as well, for consistent comparisons), has shown a peak I/O bandwidth of 159 GB/s. Intel compiler v13.0.2, MVAPICH2 v1.9a2, HDF5 v1.8.10, and H5Part v1.6.5 were used on Stampede.
Table 2.1 lists details of these HPC systems; note that the number and type of I/O
resources vary across these platforms. We also note that the I/O stack is different on
Intrepid from that on Hopper and Stampede. On Intrepid, the parallel file system is GPFS,
while Hopper and Stampede use the Lustre file system. GPFS uses dedicated I/O Nodes
(IONs) to act as proxies between the compute nodes and storage nodes. Each ION serves
64 4-core nodes, and the collection of the ION and compute nodes is called a pset.
HPC System      Architecture     Node Hardware                Filesystem   Storage Hardware     Peak I/O BW
NERSC/Hopper    Cray XE6         AMD Opteron processors,      Lustre       156 OSTs,            35 GB/s [9]
                                 24 cores per node,                        26 OSSs
                                 32 GB memory
ALCF/Intrepid   IBM BG/P         PowerPC 450 processors,      GPFS         640 I/O nodes,       47 GB/s (write) [10]
                                 4 cores per node,                         128 file servers
                                 2 GB memory
TACC/Stampede   Dell PowerEdge   Xeon E5-2680 processors,     Lustre       160 OSTs,            159 GB/s [11]
                                 16 cores per node,                        58 OSSs
                                 32 GB memory

Table 2.1: Details of various HPC systems used in this thesis
2.3 Application I/O Kernels
We chose three parallel I/O kernels to evaluate our autotuning framework: VPIC-IO, VORPAL-
IO, and GCRM-IO. These kernels are derived from the I/O calls of three applications,
Vector Particle-In-Cell (VPIC) [12], VORPAL [13], and Global Cloud Resolving Model
(GCRM) [14], respectively. These I/O kernels represent three distinct I/O write motifs
with different data sizes.
• VPIC-IO—plasma physics: VPIC is a highly optimized and scalable particle
physics simulation developed by Los Alamos National Lab [12]. VPIC-IO uses the
H5Part [15] API to create a file, write eight variables, and close the file. H5Part
provides a simple veneer API for issuing HDF5 calls corresponding to a time-varying,
multi-variate particle data model. VPIC-IO extracts all the H5Part function calls of
the VPIC code to form the VPIC-IO kernel. The particle data written in the kernel
is random data of float type. The I/O motif of VPIC-IO is a 1D array of a given number of particles, where each particle has eight variables. The kernel writes 8M
particles per MPI process for all experiments reported in this chapter.
VPIC-IO uses the H5Part library [16] to initiate and write data pertaining to particles.
The code is run in weak-scaling configuration, where each MPI process writes eight
million particles; as the number of processes increases, the number of particles increases
proportionately. Each particle has eight (six float and two integer) variables. All
processes issue one write call per variable (i.e., eight write calls) in order to write the
data into a single shared HDF5 file.
Figure 2.3 shows the partitioning of VPIC-IO file domains for two Lustre stripe size settings. In VPIC-IO's 1D-array pattern, each process writes 4 bytes per particle for each variable (all variables are 32-bit) of this 1D dataset into the file in the order of its rank. All of our experiments are conducted with 8 million particles per process, which leads to a write size of 32 MB per process. Therefore, for each of the collective write calls, process 0 writes to file offsets 0 to 32 MB, process 1 writes to file offsets 32 MB to 64 MB, and so on (a sketch of this per-rank selection appears after this list). Figure 2.3(a) shows the partitioning for a stripe size of 16 MB, and Figure 2.3(b) for a stripe size of 128 MB. The notation Pi refers to the MPI processes, while ai refers to the aggregators. On Hopper, the size of each aggregator's file domain is equal to the stripe size. Hence, the stripe size directly affects the number of write calls issued at each aggregator.
• VORPAL-IO—accelerator modeling: This I/O kernel is extracted from VOR-
PAL, a computational plasma framework application simulating the dynamics of elec-
tromagnetic systems, plasmas, and rarefied as well as dense gases, developed by TechX
[13]. This benchmark uses H5Block to write non-uniform chunks of 3D data per proces-
sor. The kernel takes 3D block dimensions (x, y, and z) and the number of components
as input. In our experiments, we used 3D blocks of 100x100x60 with different numbers of processors, and the data is written for 20 time steps.
VORPAL-IO leverages the H5Block library [17], which uses HDF5 library to handle
block structured data. VORPAL-IO partitions a 3D grid of points into a 3D grid of
processes. Each process writes the sub-block of points in its partition. For example, in
a 128-process run with a block of size 300× 100× 60 and a decomposition of (8, 4, 4),
the size of the total block is going to be 2400×400×240. This kernel is also configured
to run in a weak-scaling mode.
In terms of I/O pattern, VORPAL-IO is more complex than VPIC-IO. It writes 3D block-structured grids using 3D HDF5 datasets. We have configured this I/O kernel so that each block is of size 300 x 100 x 60. In contrast to VPIC-IO, VORPAL-IO variables are of type double and of size 8 bytes; the size of each block is therefore about 13 MB. The method of scaling VORPAL-IO is also different from VPIC-IO: VORPAL-IO has a configurable non-uniform grid decomposition scheme, in which the user specifies how each of the three dimensions is scaled across the given number of processors. For example, for a 128-core run of VORPAL-IO, if the user chooses the block decomposition (8, 4, 4), the total X dimension of the grid will be 2400 (= 300 x 8), and the Y and Z dimensions will be 400 (= 100 x 4) and 240 (= 60 x 4), respectively. This grid and the way the blocks are assigned to each process rank are shown in Figure 2.4 (a sketch of this rank-to-block mapping appears after this list).
• GCRM-IO—global atmospheric model: This I/O kernel simulates I/O for
GCRM, a global atmospheric circulation model, simulating the circulations associ-
ated with large convective clouds. This I/O benchmark also uses H5Part to perform
I/O operations. The kernel performs all the GCRM I/O operations with random data.
The I/O pattern of GCRM-IO corresponds to a semi-structured geodesic mesh, where the
grid resolution and subdomain resolution are specified as input. In our tests we used
varying grid resolutions at different concurrencies. By default, this benchmark uses 25
vertical levels and 1 iteration.
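To illustrate the two write motifs above, here are two small sketches; they are illustrative only, not the kernels' actual code. The first expresses the VPIC-IO selection: each rank selects a contiguous region of 8M elements of a shared 1D dataset, at an offset determined by its rank.

#include <hdf5.h>

#define NPARTICLES 8388608UL  /* 8M particles per process: 32 MB per variable */

/* Rank r writes elements [r*8M, (r+1)*8M) of the shared 1D dataset,
 * i.e. file offsets [r*32MB, (r+1)*32MB) for each 4-byte variable. */
static void select_vpic_region(int rank, int nprocs,
                               hid_t *filespace, hid_t *memspace)
{
    hsize_t total  = (hsize_t)nprocs * NPARTICLES;
    hsize_t offset = (hsize_t)rank * NPARTICLES;
    hsize_t count  = NPARTICLES;

    *filespace = H5Screate_simple(1, &total, NULL);
    H5Sselect_hyperslab(*filespace, H5S_SELECT_SET, &offset, NULL,
                        &count, NULL);
    *memspace = H5Screate_simple(1, &count, NULL);
}

The second mirrors the VORPAL-IO rank-to-block mapping, assuming ranks advance fastest along the X dimension, as Figure 2.4 depicts:

/* Map an MPI rank to its block's starting indices in the global grid,
 * for a (px, py, pz) decomposition of (bx, by, bz) blocks.  For 128
 * ranks, px=8, py=4, pz=4 with bx=300, by=100, bz=60 give the
 * 2400 x 400 x 240 grid described above. */
static void block_origin(int rank, int px, int py, int pz,
                         int bx, int by, int bz, int origin[3])
{
    int ix = rank % px;          /* X index: fastest varying */
    int iy = (rank / px) % py;   /* then Y                   */
    int iz = rank / (px * py);   /* then Z                   */
    (void)pz;                    /* pz is implied by the rank range */
    origin[0] = ix * bx;
    origin[1] = iy * by;
    origin[2] = iz * bz;
}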
[Figure 2.3: Partitioning of file domains and processors between aggregators in VPIC-IO when the Lustre stripe size is (a) 16MB, (b) 128MB. In (a), each 16 MB file domain covers half of one process's 32 MB write (File_Domain_0 and File_Domain_1 belong to P0, File_Domain_2 and File_Domain_3 to P1, and so on), with aggregators a0 = Pi, a1 = Pj, a2 = Pk, a3 = Pe. In (b), each 128 MB file domain aggregates the writes of four processes (File_Domain_0 covers P0 through P3, File_Domain_1 covers P4 through P7).]
[Figure 2.4: 3D Block structure of VORPAL-IO datasets in HDF5: an x = 2400, y = 400, z = 240 grid of 300 x 100 x 60 blocks, with ranks 0 through 7 advancing along x and ranks 8, 16, and 24 starting successive rows, up to rank 31.]
CHAPTER 3
TAMING PARALLEL I/O COMPLEXITY WITH
AUTOTUNING
In this chapter, we present our first step towards accomplishing the ambitious goal of taming
parallel I/O complexity automatically. We develop an autotuning system that searches a
large space of configurable parameters across multiple layers of the I/O stack, transparently setting I/O parameters at runtime via intercepted HDF5 calls, to identify parameter settings that perform well. We apply the autotuning system to three I/O kernels extracted from
real scientific applications and identify tuned parameters on three HPC systems that have
different architectures and parallel file systems.
In brief, this chapter makes the following research contributions:
• We design and implement an autotuning system that hides the complexity of tuning
the Parallel I/O stack.
• We demonstrate high performance across diverse HPC platforms.
• We demonstrate the applicability of the system to multiple scientific application bench-
marks.
• We demonstrate I/O performance tuning at different scales (both concurrency and
dataset size).
The remainder of this chapter is structured as follows: Section 3.1 presents our I/O
autotuning system; Section 3.2 discusses the experimental setup used to evaluate benefits
of the autotuning system across platforms, applications, and at scale. Section 3.3 presents
performance results from our tests and discusses the insights gained from the autotuning
effort. Finally, Section 3.4 offers concluding thoughts.
3.1 Autotuning Framework
The main challenges in designing and implementing an I/O autotuning system are (1) se-
lecting an effective set of tunable parameter values at all layers of the stack, and (2) applying
the parameters to applications or I/O benchmarks without modifying the source code. We
tackle these challenges with the development of two components: H5Evolve and H5Tuner.
For selecting tunable parameters, a naïve strategy is to execute an application or a repre-
sentative I/O kernel of the application using all possible combinations of tunable parameter
values for all layers of the I/O stack. This is an extremely time and resource consuming
approach, as there are many thousands of combinations in a typical parameter space. A rea-
sonable approach is to search the parameter space with a small number of tests. Towards this
goal, we developed H5Evolve to search the I/O parameter space using a genetic algorithm
(GA). H5Evolve samples the parameter space by testing a set of parameter combinations and
then, based on I/O performance, adjusts the combination of tunable parameters for further
testing. As H5Evolve passes through multiple generations, better parameter combinations
(i.e., sets of tuned parameters with high I/O performance) emerge.
An application can control tuning parameters for each layer of the I/O stack using
hints set via API calls. For instance, HDF5 alignment parameters can be set using the
H5Pset_alignment() function. MPI-IO hints can be set in a similar fashion for the
collective I/O and file system striping parameters. While changing the application source
code is possible if the code is available, it is impractical when testing a sizable number of
parameter combinations. H5Tuner solves this problem by dynamically intercepting HDF5
calls and injecting selected parameter values into parallel I/O calls at multiple layers of the
stack without the need for source code modifications. H5Tuner is a transparent shared li-
brary that can be preloaded before the HDF5 library, prioritizing it over the original HDF5
function calls.
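For reference, the sketch below is the manual equivalent of what H5Tuner injects. It applies the settings of the sample configuration in Figure 3.4 by hand, at the HDF5, MPI-IO, and file-system levels, when creating a file. This is a sketch only, assuming a parallel HDF5 build and omitting error handling:

#include <hdf5.h>
#include <mpi.h>

/* Hand-coded equivalent of Figure 3.4's configuration: MPI-IO and
 * file-system values go in as hints; HDF5-level alignment is set via
 * H5Pset_alignment(). */
static hid_t create_tuned_file(const char *path, MPI_Comm comm)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_buffer_size", "1048576");  /* MPI-IO layer      */
    MPI_Info_set(info, "cb_nodes", "32");
    MPI_Info_set(info, "striping_factor", "16");      /* file-system layer */
    MPI_Info_set(info, "striping_unit", "65536");

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info);               /* MPI-IO driver     */
    H5Pset_alignment(fapl, 0, 65536);                 /* HDF5 layer        */

    hid_t file = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    MPI_Info_free(&info);
    return file;
}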
Figure 3.1 shows our autotuning system that uses both H5Tuner and H5Evolve for search-
ing a parallel I/O parameter space. H5Evolve takes the I/O parameter space as input and
for each experiment generates a configuration file in XML format. The parameter space
contains possible values for I/O tuning parameters at each layer of the I/O stack and the
configuration file contains the parameter settings that will be used for a given run.
H5Tuner reads the configuration file and dynamically links to HDF5 calls of an application
or I/O benchmark. After running the executable, the parameter settings and I/O perfor-
mance results are fed back to H5Evolve and influence the contents of the next configuration
file. As H5Evolve tests various combinations of parameter settings, the autotuning system
selects the best found performing configuration for a specific I/O kernel or application.
Figure 3.1: Overall Architecture of the Autotuning Framework
3.1.1 H5Evolve: Sampling the search space
As mentioned previously, due to the large size of the parameter space and the possibly long execution time of a trial run, finding optimal parameter sets for writing data of a given size is a nontrivial task. Depending on the granularity with which the parameter values are set, the parameter space can grow exponentially, becoming unmanageably large for a brute-force, enumerative optimization approach. As an example, in the experiments in this work, the Lustre-specific parameter space without chunking contains 13,440 possible configurations; with chunking, it contains 336,000 configurations.
Exact optimization techniques are not appropriate for sampling the search space, given the variable nature of the objective function, which is the runtime of a particular configuration. Instead of relying on the simplest approach, exhaustive search, adaptive heuristic search methods such as genetic algorithms and simulated annealing can traverse the search space in a reasonable amount of time. In H5Evolve, we explore genetic algorithms for sampling the search space.
Figure 3.2: A pictorial depiction of the genetic algorithm used in the autotuning framework.

A genetic algorithm (GA) [18] is a meta-heuristic for approaching an optimization problem, particularly one that is ill-suited for traditional exact or approximation methods. A
GA is meant to emulate the natural process of evolution, working with a “population” of
potential solutions through successive “generations” (iterations) as they “reproduce” (inter-
mingle portions between two members of the population) and are subject to “mutations”
(random changes to portions of the solution). A GA is expected, although this cannot nec-
essarily be proven, to converge to an optimal or near-optimal solution, as strong solutions
beget stronger children, while the random mutations offer a sampling of the remainder of
the space.
Our implementation, dubbed H5Evolve, is shown in Figure 3.2. It was built in Python
using the Pyevolve [19] module, which provides a framework for performing genetic algorithm
experiments in Python.
The workflow of H5Evolve is as follows. For a given benchmark at a specific concurrency
and problem size, H5Evolve runs the genetic algorithm (GA). H5Evolve takes a predefined
parameter space which contains possible values for the I/O tuning parameters at each layer
of the I/O stack. The evolution process starts with a randomly selected initial population.
H5Evolve generates an XML file containing the selected I/O parameters (an I/O configura-
tion) that H5Tuner injects into the benchmark. In all of our experiments, the H5Evolve GA
uses a population size of 15; this size is a configurable option. Starting with an initial group
of configuration sets, the genetic algorithm passes through successive generations. H5Evolve
uses the runtime as the fitness evaluation for a given I/O configuration. After each gen-
eration has completed, H5Evolve evaluates the fitness of the population and considers the
fastest I/O configurations (i.e., the “elite members”) for inclusion in the next generation.
Additionally, the entire current population undergoes a series of mutations and crossovers
to populate the other member sets in the population of the next generation. This process
repeats for each generation. In our experiments, we set the number of generations to be
40, meaning that H5Evolve runs a maximum of 600 executions of a given benchmark. We
used a mutation rate of 15%, meaning that 15% of the population undergoes mutation at
each generation. After H5Evolve finishes sampling the search space, the best performing
I/O configuration is stored as the tuned parameter set.
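The sketch below condenses this workflow. It is an illustrative C rendering only (H5Evolve itself is Python built on Pyevolve), with run_benchmark() as a hypothetical hook standing in for the real fitness evaluation, i.e., writing the XML configuration, running the benchmark under H5Tuner, and returning its runtime:

#include <stdlib.h>
#include <string.h>

#define NPARAMS 4    /* e.g., stripe count, stripe size, cb_nodes, alignment */
#define POP     15   /* population size used by H5Evolve */
#define GENS    40   /* generations: at most 15 x 40 = 600 benchmark runs */
#define MUT_PCT 15   /* mutation rate: 15% of genes mutate */

static const int nvalues[NPARAMS] = {10, 8, 12, 14};  /* values per parameter */

/* Hypothetical hook: writes the XML config for cfg, runs the benchmark
 * under H5Tuner, and returns the measured runtime in seconds. */
extern double run_benchmark(const int cfg[NPARAMS]);

void evolve(int best[NPARAMS], double *best_time)
{
    int pop[POP][NPARAMS], next[POP][NPARAMS];
    double fit[POP];
    *best_time = 1e30;

    for (int i = 0; i < POP; i++)               /* random initial population */
        for (int p = 0; p < NPARAMS; p++)
            pop[i][p] = rand() % nvalues[p];

    for (int g = 0; g < GENS; g++) {
        int elite = 0;
        for (int i = 0; i < POP; i++) {         /* fitness = measured runtime */
            fit[i] = run_benchmark(pop[i]);
            if (fit[i] < fit[elite]) elite = i;
            if (fit[i] < *best_time) {          /* remember best ever seen */
                *best_time = fit[i];
                memcpy(best, pop[i], sizeof(int) * NPARAMS);
            }
        }
        memcpy(next[0], pop[elite], sizeof(int) * NPARAMS);   /* elitism */
        for (int i = 1; i < POP; i++) {         /* crossover + mutation */
            int a = rand() % POP, b = rand() % POP;
            for (int p = 0; p < NPARAMS; p++) {
                next[i][p] = (rand() % 2) ? pop[a][p] : pop[b][p];
                if (rand() % 100 < MUT_PCT)
                    next[i][p] = rand() % nvalues[p];
            }
        }
        memcpy(pop, next, sizeof(pop));
    }
}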
3.1.2 H5Tuner: Setting I/O parameters at runtime
The goal of the H5Tuner component is to develop an autonomous parallel I/O parameter
injector for scientific applications with minimal user involvement, allowing parameters to be
altered without requiring source code modifications and a recompilation of the application.
The H5Tuner dynamic library is able to set the parameters of different levels of the I/O
stack—namely, the HDF5, MPI-IO, and parallel file system levels in our implementation.
Assuming all the I/O optimization parameters for different levels of the stack are in a con-
figuration file, H5Tuner first reads the values of the I/O configuration. When the HDF5
calls appear in the code during the execution of a benchmark or application, the H5Tuner
library intercepts the HDF5 initialization function calls via dynamic linking. The library
reroutes the intercepted HDF5 calls to a new library, where the parameters from the con-
figuration are set and then the original HDF5 function is called using the dynamic library
package functions. This approach has the added benefit of being completely transparent to
the user; the function calls remain exactly the same and all alterations are made without
change to the source code. We show an example in Figure 3.3, where H5Tuner intercepts
an H5Fcreate() function call that creates an HDF5 file, applies various I/O parameters, and then calls the original H5Fcreate() function.
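The following is a minimal sketch of that interception mechanism, not H5Tuner's full source: a shared library, built with -shared -fPIC and loaded via LD_PRELOAD, defines H5Fcreate itself, adjusts the file-access property list, and forwards to the real HDF5 symbol. For brevity it hard-codes one HDF5-level parameter; the real H5Tuner reads all parameters from the XML file and also sets MPI-IO and file-system values:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <hdf5.h>

/* Interposed H5Fcreate: adjust the file-access property list, then
 * forward to the real implementation found via dlsym(RTLD_NEXT). */
hid_t H5Fcreate(const char *name, unsigned flags, hid_t create_id,
                hid_t access_id)
{
    static hid_t (*real_H5Fcreate)(const char *, unsigned, hid_t, hid_t);
    if (!real_H5Fcreate)
        real_H5Fcreate = (hid_t (*)(const char *, unsigned, hid_t, hid_t))
                             dlsym(RTLD_NEXT, "H5Fcreate");

    /* Copy (or create) the access property list and inject tuned values;
     * here only HDF5 alignment, hard-coded for illustration. */
    hid_t fapl = (access_id == H5P_DEFAULT) ? H5Pcreate(H5P_FILE_ACCESS)
                                            : H5Pcopy(access_id);
    H5Pset_alignment(fapl, 0, 65536);

    hid_t file = real_H5Fcreate(name, flags, create_id, fapl);
    H5Pclose(fapl);
    return file;
}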
H5Tuner uses MiniXML [20], a small XML library, to read the XML configuration files. In our implementation, the configuration file is read from the user's home directory, giving the user the ability to change it. Figure 3.4 shows a sample configuration
file with HDF5, MPI-IO, and Lustre parallel file system tunable parameters.
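As a hedged sketch of the parsing step, assuming MiniXML's C API (mxmlLoadFile, mxmlFindElement, mxmlGetOpaque); the element name matches Figure 3.4, and the real H5Tuner walks every element in all three sections:

#include <stdio.h>
#include <stdlib.h>
#include <mxml.h>

/* Read one tunable, e.g. <cb_nodes>, from the XML configuration file. */
static int read_cb_nodes(const char *path)
{
    FILE *fp = fopen(path, "r");
    if (!fp) return -1;
    mxml_node_t *tree = mxmlLoadFile(NULL, fp, MXML_OPAQUE_CALLBACK);
    fclose(fp);

    mxml_node_t *node = mxmlFindElement(tree, tree, "cb_nodes",
                                        NULL, NULL, MXML_DESCEND);
    const char *text = node ? mxmlGetOpaque(node) : NULL;
    int cb_nodes = text ? atoi(text) : -1;
    mxmlDelete(tree);
    return cb_nodes;
}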
[Figure 3.3: Design of H5Tuner component as a dynamic library which intercepts HDF5 functions to tune I/O parameters. When the application, I/O benchmark, or application I/O kernel calls H5Fcreate(const char *name, unsigned flags, hid_t create_id, hid_t access_id), H5Tuner (1) obtains the address of H5Fcreate using dlsym(), (2) reads I/O parameters from the XML control file, (3) sets the I/O parameters (e.g., for MPI we use MPI_Info_set()), (4) sets up the new access_id using the new MPI_Info, (5) calls real_H5Fcreate(name, flags, create_id, new_access_id) in the unmodified HDF5 library, and (6) returns the result of the call to real_H5Fcreate().]
3.2 Experimental Setup
We have evaluated the effectiveness of our autotuning framework on three HPC platforms
using three I/O benchmarks at three different scales. The HPC platforms include Hopper,
Intrepid, and Stampede. The I/O benchmarks are derived from the I/O traces of the VPIC,
VORPAL, and GCRM applications. We ran these benchmarks using 128, 2048, and 4096
cores. A full description of these platforms and applications can be found in Chapter 2.
In the following subsections, we briefly explain the data sizes of these I/O benchmarks at
different scales.
3.2.1 Scale and dataset sizes
We designed a weak-scaling configuration to test the performance of the autotuning frame-
work at three concurrencies, i.e., 128, 2048, and 4096 cores. The amount of data each core
writes is constant for a given I/O kernel, i.e., the amount of data an I/O kernel writes in-
creases proportional to the number of cores used. Table 3.1 shows the sizes of the datasets
generated by the I/O benchmarks. The amount of data written by a kernel ranges from 32
GB (with 128 cores) to 1.1 TB (with 4096 cores).

<Parameters>
    <High_Level_IO_Library>
        <alignment> 0, 65536 </alignment>
    </High_Level_IO_Library>
    <Middleware_Layer>
        <cb_buffer_size> 1048576 </cb_buffer_size>
        <cb_nodes> 32 </cb_nodes>
    </Middleware_Layer>
    <Parallel_File_System>
        <striping_factor FileName="sample_dataset.h5part"> 4 </striping_factor>
        <striping_factor> 16 </striping_factor>
        <striping_unit> 65536 </striping_unit>
    </Parallel_File_System>
</Parameters>

Figure 3.4: An XML file showing a sample configuration with optimization parameters at different levels of the parallel I/O stack. The tuning can be applied to all files an application writes or to a specific file.
I/O Benchmark    128 Cores    2048 Cores    4096 Cores
VPIC-IO          32 GB        512 GB        1.1 TB
VORPAL-IO        34 GB        549 GB        1.1 TB
GCRM-IO          40 GB        650 GB        1.3 TB

Table 3.1: Weak scaling configuration for the three I/O benchmarks
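These totals follow directly from the kernels' per-process write sizes. For VPIC-IO, for instance, a quick back-of-the-envelope check (8M particles per process, eight 4-byte variables each):

#include <stdio.h>

int main(void)
{
    const double per_proc = 8.0 * 1024 * 1024 * 8 * 4;  /* bytes per process */
    const int cores[] = {128, 2048, 4096};
    for (int i = 0; i < 3; i++)
        printf("%4d cores -> %6.0f GiB\n", cores[i],
               per_proc * cores[i] / (1024.0 * 1024.0 * 1024.0));
    return 0;   /* prints 32, 512, and 1024 GiB, matching Table 3.1's
                   32 GB, 512 GB, and 1.1 TB for VPIC-IO */
}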
3.2.2 Parameter space
H5Evolve can take arbitrary values as input for a parameter space. However, evolution of
the GA will require more generations to search a parameter space with arbitrary values.
To shorten the search time, we selected a few meaningful parallel I/O parameters for all
the layers of the I/O stack based on previous research efforts [21] and our experience [9].
We have chosen most of the parameter values to be powers-of-two except some parallel file
system parameters. We set the largest parameter value of Lustre stripe count to be equal
to the maximum number of available OSTs, which is 156 on Hopper and 160 on Stampede.
The GPFS parameters that we tuned are Boolean. Curtailing parameter values to reasonable ranges, based on knowledge of page sizes, min/max striping ranges, and powers-of-two values, can be done by anyone modestly familiar with the system, and this task needs to be performed only once per system. Table 3.2 shows ranges of
various parameter values. A user of our autotuning system can set the parameter space by
simply modifying the parameter list in H5Evolve. Adding new parameters to the search requires only simple modifications to H5Tuner. The following is a list of parameters we used as part of
the parameter space and their target platforms.
• Lustre (on Hopper and Stampede):
– Stripe count (strp fac) sets the number of OSTs over which a file is distributed.
– Stripe size (strp unt) sets the number of bytes written to an OST before cycling
to the next OST.
• GPFS (on BG/P Intrepid):
– Locking: Intrepid has a ROMIO (an MPI-IO implementation [22]) driver to
avoid NFS-type file locking. This option is enabled by prefixing a file name
with bglockless:.
– Large blocks: ROMIO has a hint for GPFS named IBM largeblock io which
optimizes I/O with operations on large blocks.
• MPI-IO (on all three platforms):
– Number of collective buffering nodes (cb nds) sets the number of aggregators for
collective buffering. On Intrepid, the parameter to set the number of aggregators
is bgl nodes pset.
– Collective buffer size (cb buf size) is the size of the intermediate buffer on an
aggregator for collective I/O. We set this value to be equal to the stripe size on
Hopper and Stampede.
• HDF5 (on all three platforms):
– Alignment (align(thresh, bndry)): HDF5 file access is faster if certain
data elements are aligned in a specific manner. Alignment sets the start of any
file object with size more than a threshold value to an address that is a multiple
of a user-specified block size.
– Chunk size (chunk size): In addition to contiguous datasets, where datasets
are stored in single blocks in files, HDF5 supports chunked layout in which the
data are stored in separate chunks. We used this parameter specifically for the
GCRM-IO kernel.
Parameter               Min       Max            # Values
strp fac                4         156/160        10
strp unt / cb buf siz   1 MB      128 MB         8
cb nds                  1         256            12
align(thresh, bndry)    (0, 1)    (16KB, 32MB)   14
bglockless              True      False          2
IBM largeblock io       True      False          2
chunk size              10 MB     2 GB           25

Table 3.2: A list of the tunable parameters and their ranges used for experiments in this study. We show the minimum and maximum values for each parameter, with powers-of-two values in between. The last column shows the number of distinct values used for each parameter.
3.3 Results
Out of the 27 experiments (3 I/O benchmarks x 3 concurrency levels x 3 HPC platforms),
we successfully completed 24 experiments in time for this chapter. Due to computer resource
allocation limitations on Stampede, we could not finish the three 4096-core experiments on
that system. However, we expect the performance improvement trends in the remaining
runs to be the same as the completed experiments.
In the following subsections, we first compare the I/O rates that our autotuning system
achieved with those obtained using system default settings. We then discuss the achieved
speedup with respect to different platforms, I/O benchmarks, and concurrency/scale in
Sections 3.3.3, 3.3.4, and 3.3.5, respectively.
Bandwidth (MB/s)        VPIC-IO                     VORPAL-IO                   GCRM-IO
# Cores  Platform   Default  Tuned   Speedup   Default  Tuned   Speedup   Default  Tuned   Speedup
128      Hopper     400      3034    7.57      378      2614    6.90      757      2348    3.10
128      Intrepid   659      1126    1.70      846      1102    1.30      255      1801    7.05
128      Stampede   394      2328    5.90      439      2130    4.85      331      2291    6.90
2048     Hopper     365      14900   40.80     370      12669   34.16     240      17816   74.12
2048     Intrepid   2282     5964    2.61      2033     4842    2.38      414      870     2.10
2048     Stampede   380      13047   34.28     436      12542   28.70     128      13825   107.73
4096     Hopper     348      17620   50.60     320      12192   38.00     413      20136   48.67
4096     Intrepid   2841     7014    2.46      3131     7766    2.47      523      2177    4.16

Table 3.3: I/O rates (in MB/s) and speedups of the I/O benchmarks with tuned
parameters over default parameters.
[Figure 3.5: grouped bar charts of Default versus Tuned I/O bandwidth (MB/s) for VPIC, VORPAL, and GCRM on Hopper, Intrepid, and Stampede, one panel per scale.]
Figure 3.5: Summary of performance improvement for each I/O benchmark
running on (a) 128 cores, (b) 2048 cores, (c) 4096 cores. The I/O bandwidth
axes’ scales are different in each of the plots.
3.3.1 Tuned I/O performance results
The plots in Figure 3.5 present the I/O rate improvement using the tuned parameters that
our autotuning system detected for the three I/O benchmarks. H5Evolve ran for 10 hours,
12 hours, and 24 hours, respectively, for the three scales to search through the parameter
space of each experiment. In most cases, the GA evolved through 15 to 40 generations. We
selected the tuned configuration that achieved the best I/O performance over the course of
the GA evolution. Figure 3.5 compares the tuned I/O rate with the default I/O rate for all
applications on all HPC systems at the 128-, 2048-, and 4096-core scales. We calculated the
I/O rate as the ratio of the amount of data a benchmark writes into an HDF5 file at a given
scale to the time taken to write the data. The time taken includes the overhead of opening,
writing, and closing the HDF5 file. The overhead of HDF5 call interception by H5Tuner,
which is included in the time taken, was negligibly small, even at high core counts. The I/O
rate on the y-axis is expressed in MB/s. Readers should note that the range of I/O rates
shown in each of the three plots is different. The measured default I/O rate for a benchmark
on an HPC platform is the average I/O rate we obtained after running the benchmark three
times. The default experiments correspond to the system default settings that a typical user
of the HPC platform would encounter without access to an autotuning framework.
Table 3.3 shows the raw I/O rates (in MB/s) of the default and tuned runs for all 24
experiments. We also show the speedup that the autotuned settings achieved over the default
settings for each experiment. For all the benchmarks, platforms, and concurrencies, the
speedups lie between 1.3X and 41X, with four larger cases of roughly 49X, 51X, 74X, and
108X. We note that the default I/O rates on the Intrepid platform are noticeably higher than
those on Hopper and Stampede. Hence, the speedups on Hopper and Stampede with tuned
parameters are much larger than those on Intrepid.
3.3.2 Tuned configurations
Table 3.4 shows the sets of selected parameter values for all benchmarks on all systems for
the 2048-core experiments. Due to space constraints, we cannot present a detailed analysis
for all experimental configurations at the other two scales; we generally observed similar
trends for the 128-core and 4096-core experiments. First, we note that the tuned values are
different for each benchmark and platform. This highlights the strength of the autotuning
framework: while I/O experts and sysadmins can probably recommend good settings for a
few cases based on their experience, it is hard to encapsulate that knowledge and generalize
it across multiple problem configurations.
VPIC-IO and VORPAL-IO on Hopper and Stampede have similar sets of tuned parameters,
i.e., strp_fac, strp_unt, cb_nds, cb_buf_size, and align. On Intrepid, these two benchmarks
use bgl_nodes_pset, cb_buf_size, bglockless, IBM_largeblock_io, and align. On all platforms,
GCRM-IO achieved better performance with HDF5's chunking and alignment parameters and
the Lustre parameters (stripe factor and stripe size), without the MPI-IO collective buffering
parameters. We chose this parameter space for GCRM-IO because Howison et al. [21]
demonstrated that HDF5 chunking provides a significant performance improvement for this
I/O benchmark. Moreover, we show that the autotuning system is capable of searching a
parameter space with multiple HDF5 tunable parameters. On Intrepid, GCRM-IO did not use
the GPFS tunable parameters because going through HDF5's MPI-POSIX driver (a driver in
HDF5 that uses only MPI communication) bypasses the MPI-IO layer, which is needed to set
the GPFS parameters. Despite that, HDF5 tuning alone achieved a 2X improvement.
We note some higher-level trends from Table 3.4. At the same scale and with the same
benchmark, the tuned parameters differ across platforms, even platforms with the same
parallel file system. For example, although the VPIC-IO benchmark uses the Lustre file
system on both Hopper and Stampede, the stripe settings that achieve the highest performance
are different. The tuned parameters can also differ on the same platform at the same scale
for different benchmarks. For instance, the VPIC-IO and VORPAL-IO benchmarks obtain
their highest I/O rates with different MPI-IO collective buffering settings and HDF5 alignment
settings, whereas their Lustre settings are the same. Similarly, the same benchmark at
different scales on the same platform has different tuned parameters. For example, at 128
cores (not shown in the table), VPIC-IO achieves its tuned performance with 48 Lustre stripes
and a 32 MB stripe size, whereas at 2048 cores, VPIC-IO uses 128 stripes with a 64 MB stripe
size. We analyze these observations further in the following sections.
3.3.3 Tuned I/O performance across platforms
Figure 3.6(a) shows the distribution of speedups (time to write with the system's default
I/O configuration over time to write with the tuned I/O configuration) across the Hopper,
Intrepid, and Stampede systems, representing three different architectures. The speedups
are color-coded by I/O benchmark. Overall, the autotuning system achieved improved
performance on all platforms for all benchmarks. We observe that the speedups on Intrepid
are much lower than those on Hopper and Stampede. The speedups on Hopper range from
3.10 to 74.12, with an average of 28.55. Speedups on Intrepid range from 1.30 to 7.05, with
an average of 2.76.
I/O Kernel   System     Tuned Parameters
VPIC-IO      Hopper     strp_fac=128, strp_unt=64MB, cb_nds=1024, cb_buf_size=64MB, align=(0, 64K)
VPIC-IO      Intrepid   bgl_nodes_pset=512, cb_buf_size=128MB, bglockless=true, largeblock_io=false, align=(8K, 1MB)
VPIC-IO      Stampede   strp_fac=128, strp_unt=8MB, cb_nds=512, cb_buf_size=8MB, align=(8K, 2MB)
VORPAL-IO    Hopper     strp_fac=128, strp_unt=16MB, cb_nds=1024, cb_buf_size=16MB, align=(1K, 16K)
VORPAL-IO    Intrepid   bgl_nodes_pset=128, cb_buf_size=128MB, bglockless=true, largeblock_io=true, align=(8K, 8MB)
VORPAL-IO    Stampede   strp_fac=160, strp_unt=2MB, cb_nds=512, cb_buf_size=2MB, align=(8K, 8MB)
GCRM-IO      Hopper     strp_fac=156, strp_unt=32MB, chunk_size=(1,26,327680)=32MB, align=(2K, 64KB)
GCRM-IO      Intrepid   chunk_size=(1,26,1048760)=1GB, align=(1MB, 4MB)
GCRM-IO      Stampede   strp_fac=160, strp_unt=32MB, chunk_size=(1,26,1048760)=1GB, align=(1MB, 4MB)

Table 3.4: Tuned parameters of the benchmarks on all systems for the 2048-core
experiments.
Speedups on Stampede range from 4.85 to 107.73, with an average of 31.39. As mentioned
earlier, the higher speedups on Stampede are due to poor default performance. In contrast,
the lower speedups on Intrepid can be attributed to higher default performance. The tuned
raw I/O rates on Stampede are similar to those on Hopper.
The aim of this section is to highlight how the autotuning framework can deduce high-performance
configurations for the same application at the same scale running on different
platforms. We illustrate this capability with the VPIC-IO benchmark running on 2048 cores
of Hopper and Intrepid, and provide some insights on the configurations returned by the GA.
We consider the effect of choosing the collective buffer size parameter for VPIC-IO, as
illustrated by Figures 3.7 and 3.8.
[Figure 3.6: log-scale scatter plots of speedups between roughly 2 and 100: (a) speedups across platforms (Hopper, Intrepid, Stampede), colored by benchmark; (b) speedups across benchmarks (VPIC, VORPAL, GCRM), colored by core count; (c) speedups across scale (128, 2048, 4096 cores), colored by platform.]
Figure 3.6: Speedups with respect to platforms, benchmarks, and scale of the
experiments.
On Hopper (Figure 3.7), multiple buffer size values (equal to the Lustre stripe sizes) obtain
good I/O performance; on average, the 32 MB buffer size achieves the best I/O rate (64 MB
achieved the single best run, but 32 MB has consistently shown high performance at this
scale and others). In the VPIC-IO benchmark, each MPI process writes eight variables, and
the size of each variable is 32 MB. When the Lustre stripe size is equal to 32 MB, VPIC-IO
obtains the best performance on Hopper. The powers-of-two fractions and multiples of 32 MB
also obtain
reasonably good performance. On Intrepid (Figure 3.8), we obtain the best performance
when the collective buffer size is 128 MB. From Table 3.4, we can see that the number of
pset nodes from the tuned parameters is 512, i.e., four MPI processes are being served by
one collective buffer. When VPIC-IO writes 32 MB per process, a total of 128 MB data
gets collected at the collective buffer node (aggregator) and this node writes data to the
file system as one I/O request, which we believe aligns well with the GPFS file system to
achieve the best performance. We note that the framework is able to derive these meaningful
configurations without detailed prior knowledge of platform specific features.
[Figure 3.7: I/O bandwidth (MB/s, roughly 4000 to 14000) versus collective buffer size (2 to 128 MB).]
Figure 3.7: The effect of Hopper’s Collective Buffer Size on performance of
VPIC-IO on 2048 cores
[Figure 3.8: I/O bandwidth (MB/s, roughly 1000 to 6000) versus collective buffer size (1 to 128 MB).]
Figure 3.8: The effect of Intrepid’s Collective Buffer Size on performance of
VPIC-IO on 2048 cores
3.3.4 Tuned I/O for different benchmarks
Figure 3.6(b) presents the speedup numbers with respect to the different I/O benchmarks.
Speedups for VPIC range from 1.70 to 50.60, with an average of 16.04. Speedups for VORPAL
range from 1.30 to 38.00, with an average of 13.69. Speedups for GCRM range from 2.10 to
107.73, with an average of 33.50.
We now discuss the configurations returned by the autotuning framework for different
applications, while holding the platform and scale constant. We highlight the VORPAL-IO
and GCRM-IO applications, running on 2048 cores of Stampede, and consider the tuned
Lustre configurations returned by the GA. Figures 3.9 and 3.10 show the impact of the Lustre
stripe size on the VORPAL-IO and GCRM-IO benchmarks. Both of these benchmarks obtain
their highest performance using a Lustre stripe count of 160. However, VORPAL-IO obtains
its best performance with a 2 MB stripe size, whereas GCRM-IO works well with a 32 MB
stripe size. We note that these different high-performance configurations likely result from
the different I/O patterns exercised by these benchmarks: VORPAL-IO uses MPI-IO in
collective mode, whereas GCRM-IO uses the MPI-POSIX driver. This result highlights both
a strength and a weakness of the autotuning approach: the autotuning process can produce a
good configuration that performs well in practice but is hard to reason about. On the other
hand, it would be very hard for a human expert to propose such a configuration in the first
place, since the interactions in the software stack are very complicated to analyze.
[Figure 3.9: I/O bandwidth (MB/s, roughly 7000 to 12000) versus Lustre stripe size (2 to 64 MB).]
Figure 3.9: The effect of Lustre Stripe Size value on performance of VORPAL
on 2048 cores of Stampede
3.3.5 Tuned I/O at different scales
Figure 3.6(c) demonstrates the weak-scaling performance obtained by our framework. We
observe that the autotuning system obtains higher speedups in the 2048- and 4096-core
experiments. This shows that the default settings on all platforms fare reasonably well at
smaller scales; but as the scale of the application increases, more resources are at stake, which
presents more opportunities to optimize the stack.
Figure 3.11 shows another view of Figure 3.6(c), with the raw I/O rates of the benchmarks
at various scales grouped by platform. Each box illustrates the range of I/O rates of the
benchmarks. This also illustrates our observation above that autotuning is more beneficial at
larger scales. The figure further shows that, with tuning, the Lustre-based platforms, i.e.,
Hopper and Stampede, can achieve higher I/O rates than the GPFS-based platform (Intrepid)
at the scales we experimented with. We also show that tuning helps improve performance on
the BG/P-based Intrepid.
3.4 Conclusions
We have presented an autotuning framework for optimizing I/O performance of scientific
applications. The framework is capable of transparently optimizing all levels of the I/O
stack, consisting of HDF5, MPI-IO, and Lustre/GPFS parameters, without requiring any
modification of user code. We have successfully demonstrated the power of the framework
[Figure 3.10: I/O bandwidth (MB/s, roughly 2000 to 12000) versus Lustre stripe size (2 to 128 MB).]
Figure 3.10: The effect of Lustre Stripe Size value on performance of GCRM on
2048 cores of Stampede
by obtaining a wide range of speedups across diverse HPC platforms, benchmarks, and
concurrencies. Perhaps most importantly, we believe that the autotuning framework can
provide a route to hiding the complexity of the I/O stack from application developers,
thereby providing a truly performance portable I/O solution for scientific applications.
[Figure 3.11: box plots of raw I/O bandwidth (MB/s, roughly 5000 to 20000) versus number of cores (128, 2048, 4096), colored by platform (Hopper, Intrepid, Stampede).]
Figure 3.11: Raw Bandwidth plots and breakdown across scale
CHAPTER 4
IMPROVING PARALLEL I/O AUTOTUNING WITH
PERFORMANCE MODELING
While we consistently demonstrated I/O write speedups between 2X and 100X in the
previous chapter, the overhead of the GA approach was significant. For example, running the
GA for fifteen generations with a population of forty members typically takes about twelve
hours. This overhead is considerable; it severely limits the general-purpose applicability of
such an autotuning framework.
In this chapter, we significantly reduce the search time by using empirical models of the
I/O performance. We characterize performance of a typical parallel I/O subsystem with
multiple levels of data movement and develop performance prediction models. Existing
models for predicting parallel I/O performance (see, e.g., [23–25]) often aim for highly
accurate predictions of I/O performance and are relatively complex. Many of these models have
limited applicability, being restricted to specific systems or I/O kernels. We take a two-step
approach: the first step crafts an empirical model that effectively reduces the search space
of interest and the second step searches in this small parameter space.
This chapter makes the following technical contributions:
• We develop an approach to automatically construct an I/O performance model
• We use the model thus constructed to reduce the search space for good I/O configurations
• We demonstrate the applicability of the autotuning framework to scientific I/O kernels
with different write patterns and various problem sizes
The chapter is structured as follows. Section 4.1 describes the platform and the application
I/O kernels used in this study. Section 4.2 presents a general development of nonlinear
models for predicting parallel I/O performance while Section 4.3 describes the usage of
models by our autotuning framework. In Section 4.4, we demonstrate the performance
benefit over different settings. We discuss initial work on accounting for interference from
the I/O activity of the other jobs running on a system in Section 4.5. We conclude the
chapter in Section 4.6.
4.1 Experimental Setup
We have tested our autotuning framework on Hopper and examined three I/O kernels in our
study: VPIC-IO, VORPAL-IO, and GCRM-IO. Please refer to Chapter 2 for a full description
of Hopper and the I/O kernels.
4.2 Empirical Performance Models
We now summarize our approach for building models based on I/O performance data for
the purposes of autotuning.
4.2.1 Nonlinear regression model preliminaries
We denote the independent variables/parameters (e.g., the stripe count) in our model by
$x = [x_1, \ldots, x_{n_x}]$ and the scalar-valued output/dependent variable (e.g., the write time)
associated with the configuration $x$ by $y(x)$. In our setting, this output depends on the state
of the system and can be viewed as stochastic. By $y_j$ we denote a particular measurement
of the output at a specific $x_j$. Hence, data collected from a set of experiments is of the
form $\{(x_j, y_j) : j = 1, \ldots, n_y\}$, where the $x_j$ need not be distinct (which occurs if replicated
measurements are conducted at a particular $x_j$).
We consider smooth, nonlinear models, which can be written as linear combinations of $n_b$
nonlinear basis functions $\phi$,
$$m(x; \beta) = \sum_{k=1}^{n_b} \beta_k \phi_k(x). \qquad (4.1)$$
Once a basis $\phi$ has been selected, the hyperparameters $\beta$ can be selected by standard
regression-based approaches. For example, since these models are linear in $\beta$, a common
approach is to employ
$$\hat{\beta} = \arg\min_{\beta} \sum_{j=1}^{n_y} \left( m(x_j; \beta) - y_j \right)^2, \qquad (4.2)$$
which corresponds to the maximum likelihood estimator for $\beta$ under the assumption that $y$
is Gaussian.
There can be many choices of basis functions; for simplicity, we focus on terms that are
low-degree polynomials in either the parameter, $x_i$, or the inverse of the parameter, $1/x_i$. In
particular, we consider terms of the form
$$\left\{ \prod_{i=1}^{n_x} (x_i)^{p_i} : p_i \in \{-1, 0, 1\},\ i = 1, \ldots, n_x \right\}. \qquad (4.3)$$
We could have expanded our set to include terms that better account for differences in scale
(e.g., $x_1 \log(x_2)$) or higher-degree polynomials (e.g., $x_1^2 x_2 / x_3^2$), but we found that the set
(4.3) was sufficiently rich for our purposes.
Since one of our goals in building a model of the form (4.2) was simplicity, we desired to
incorporate only a handful of basis terms, $n_b$, from the set (4.3). Each term in (4.3) can be
defined by the integer vector $p \in \{-1, 0, 1\}^{n_x}$. We let $\hat{m}(x; P)$ denote the model prediction
at $x$ resulting from selecting a basis defined by $P = \{p_1, \ldots, p_{n_b}\}$ and using the coefficients
defined by (4.2). Given an initially empty set $P$, we follow a greedy procedure (also known
as a forward model selection approach) of adding to $P$ the $p$ that most reduces the prediction
error. Formally, this means we determine the $p$ that solves
$$\min_{p \in \{-1,0,1\}^{n_x}} \sum_{j=1}^{n_y} \left( \frac{\hat{m}(x_j; P \cup p) - y_j}{y_j} \right)^2. \qquad (4.4)$$
After updating $P$, this procedure can be repeated until: (i) we have reached a desired limit
on the number of terms to include, (ii) we have exhausted the set in (4.3), or (iii) additional
terms lead to negligible reductions of the prediction error (which, under certain regularity
assumptions, can be interpreted as the terms not being statistically significant). In our
experiments, we always terminated the approach based on (i), reaching an upper limit on the
number of model terms.
Before proceeding, we note that in (4.4) we are using a relative error metric that is slightly
different from the usual least-squares error criterion (e.g., as used in (4.2)). We made this
choice in order to bias our model terms toward smaller values of the output $y$. In the context
of I/O models for optimization, we are less interested in accurately predicting large write
times than in accurately predicting small ones. An alternative approach to building models
with a bias toward high-performing configurations is discussed in [26].
4.2.2 Development of I/O performance models
We now examine nonlinear regression models in the context of modeling I/O write times
for a given application. As discussed previously, the main I/O parameters on a Lustre file
system are Lustre stripe settings (e.g., stripe count and stripe size) and MPI-IO collective
buffering settings (e.g., number of collective buffering nodes and collective buffering size).
In order to identify potential challenges and illustrate our approach, we first vary only the
stripe settings. In subsequent tests, we also vary the collective buffering settings and consider
multiple file sizes.
Output variability on a single node
We begin by examining the problem of building a single model for write times as Lustre
settings are modified. In order to isolate Lustre settings, we developed a micro-benchmark
that uses POSIX I/O from a single node to write a single file on the Lustre file system. We
fixed the file size to about 20 GB (20 * 1024 = 20480 MB). Since we have a single node using
POSIX I/O, the number of I/O aggregators is also fixed. Table 4.1 shows the different stripe
settings that comprised the set of training configurations in this first set of experiments.
Parameter             Tested Values                              # of Values
c, stripe count       1, 2, 4, 8, 16, 32, 64, 96, 128, 156      10
s, stripe size (MB)   1, 2, 4, 8, 16, 32, 64, 96, 128           9

Table 4.1: Training configurations (90 in total) tested as part of the single-node
experiment.
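For reference, a minimal sketch of such a single-node micro-benchmark follows, assuming the standard lfs setstripe utility to fix the Lustre layout at file creation; the output path and 64 MB write granularity are hypothetical choices.

```python
# Minimal single-node micro-benchmark sketch: fix the Lustre striping of a
# new file with `lfs setstripe`, then time a plain POSIX write of ~20 GB.
import os
import subprocess
import time

def time_write(path, stripe_count, stripe_size_mb,
               total_mb=20480, chunk_mb=64):
    if os.path.exists(path):
        os.remove(path)  # the striping of a file is fixed at creation time
    subprocess.check_call(["lfs", "setstripe", "-c", str(stripe_count),
                           "-S", "%dM" % stripe_size_mb, path])
    buf = b"\0" * (chunk_mb * 1024 ** 2)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT)
    start = time.time()
    for _ in range(total_mb // chunk_mb):
        os.write(fd, buf)
    os.fsync(fd)  # include the time to flush the data to the file system
    os.close(fd)
    return time.time() - start
```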
One of our goals in this initial analysis was to inspect write time variability in simple settings
(in this case, using a single node). Therefore, we evaluated all 90 training configurations in
four different experiments (each taking place on a different day of the week) in order to
increase our chances of encountering different levels of interference from the I/O activity of
other jobs running on a shared system such as Hopper.
Figure 4.1 shows the 360 write times recorded as part of these four experiments. Here, the
90 training configurations are sorted by the minimum write time across the four experiments.
Variability within a particular configuration is illustrated by a vertical line connecting the
four write times for that experiment. It can be seen that, even in this single-node setting,
interference/noise can have a significant impact on the performance.
This variability can significantly complicate the modeling process, since it necessitates a
more careful definition of the modeling objectives prior to performing experiments. For
example, if one wishes to model “average” I/O performance, then experimental setups would
need to sample sufficiently across the different system states/sources of the variability.
Furthermore, since this variance is nonstationary (having different magnitudes from
configuration to configuration), accurately modeling performance across the entire
configuration space can be a daunting task, likely requiring one to also model the variability
over the configuration space.

Figure 4.1: I/O performance variability and effect of interference on a single node
writing to a file.
In our context, we are interested in identifying sets of high-performing configurations
(that are not already in the training set) for subsequent evaluation; we are less concerned
with the accuracy of model predictions in an absolute sense. In Figure 4.1, we observe
that the highest-performing configurations tend to be less sensitive to noise; reordering the
configurations based on the mean or median of the four experiments has little effect on the
constituents of the highest-performing quartile. Consequently, in building our models we use,
for each configuration, the minimum write time across the experiments. In Section 4.5 we
discuss the problem of variability further.
Given the $n_x = 2$ parameters $c$ (the stripe count) and $s$ (the stripe size), there are
$3^{n_x} = 9$ possible terms in the set (4.3). Following the approach described in Section 4.2.1
and selecting $n_b = 6$ terms, we arrive at the basis $\{1, c, s, \frac{1}{c}, \frac{1}{s}, \frac{s}{c}\}$ (the remaining,
unselected terms being $cs$, $\frac{c}{s}$, and $\frac{1}{cs}$). We performed a cross-validation test of this
form of model by randomly partitioning the 90 outputs into training and testing subsets. We
determined values of the model coefficients $\beta_1, \ldots, \beta_6$ for a model using the six terms
$\{1, c, s, \frac{1}{c}, \frac{1}{s}, \frac{s}{c}\}$ using the training subset. The resulting model is then evaluated
on the testing subset (to which the
model was not fit). Figure 4.2 shows that the trained model predicts the write times of the
testing subset nearly as well as it does the write times of the training subset. Furthermore,
the lowest quartile of write times are predicted within 10% of the observed write time values.
Figure 4.2: Correlation between observed and predicted single-node write times
on training (50%) and testing (50%) subsets.
To complete the description of our methodology, we provide our final model for the write
time data. Using the nonlinear basis above and all 90 data points, we obtain the model
$$m(c, s) = \beta_1 + \beta_2 c + \beta_3 s + \beta_4 \frac{1}{c} + \beta_5 \frac{1}{s} + \beta_6 \frac{s}{c} = 28 + 0.3c + 0.3s + \frac{20}{c} + \frac{10}{s} - 0.2\frac{s}{c}. \qquad (4.5)$$
Write time models for multiple nodes
Having observed that nonlinear regression models can predict the trend of I/O performance
when one node is writing to one file, we now show how such a model can be used when
writing to shared files from multiple nodes. To this end, we use the VPIC-IO benchmark
with 128 cores and a file size of 32 GB.
For training, we consider the same 90 combinations of the stripe size, $s$, and stripe count,
$c$, shown in Table 4.1, but we enrich the configuration space to include the number of
aggregators, $a$. In particular, we consider one, two, and four collective buffering nodes, for a
total of 270 $(c, s, a)$ configurations. The maximum of four aggregators was chosen based on
the well-known criterion in the literature of using one aggregator per node; 128 cores on
Hopper occupies at least five nodes. We performed two different runs of the 270
configurations, with the training data again taken as the minimum write time over these two
runs.
The data, shown in Figure 4.3, reveal that, for small stripe count values (c ≤ 2), the
write behavior is difficult to predict with the simple models considered here. Consequently,
we formed our model on the basis of the remaining 216 configurations. As illustrated in
Figure 4.3, the five-term model of the form
$$m(c, s, a) = \beta_1 + \beta_2 \frac{c}{a} + \beta_3 \frac{s}{a} + \beta_4 \frac{1}{a} + \beta_5 \frac{a}{cs} \qquad (4.6)$$
tends to reproduce the training data well. We again note that the empirical data suggest that
the variability is smaller for the configurations with lower write times. Furthermore, even
though the models we consider in this chapter do not directly account for the variability, we
observe that they tend to yield more accurate predictions for those configurations where little
variability is seen.
Figure 4.3: Raw data and nonlinear model of the form (4.6) for VPIC-IO write
times as the number of aggregators is varied.
Thus far we have only considered a single file size when building nonlinear regression models.
This modeling approach reflects the typical workflow in automatic empirical performance
tuning, where one wishes to determine parameter values for actionable decisions.
In the language of mathematical optimization, we would seek to solve, for example,
$$\min_{(c,s,a) \in \Omega} g(c, s, a), \qquad (4.7)$$
where $g(c, s, a)$ represents an empirical performance metric (e.g., run time when there is
no contention, mean energy consumption) for a single problem (fixed file size/input, fixed
machine, etc.) and $\Omega$ represents the space of realizable/correct configurations [27]. In general
(i.e., with no additional assumptions on the metric g), solving (4.7) directly is challenging
because it requires many empirical evaluations.
Provided that a model for $g$ is available, one can consider substituting this model for $g$
in (4.7). If this model is easy to optimize over, one can obtain values $(c^*, s^*, a^*)$ that
minimize the model in far less time than a single empirical evaluation of $g$ would take.
Updating the model based on an empirical evaluation at $(c^*, s^*, a^*)$ and iterating would lead
to a so-called model-based optimization algorithm [27].
One of the main benefits of our models is that their simple, algebraic form allows us to
very quickly solve optimization problems involving them. For example, for the model (4.6),
we can obtain algebraic expressions for the first- and second-order derivatives of the model:
$$\nabla m = \begin{bmatrix} \beta_2 \frac{1}{a} - \beta_5 \frac{a}{c^2 s} \\ \beta_3 \frac{1}{a} - \beta_5 \frac{a}{c s^2} \\ -\frac{\beta_2 c + \beta_3 s + \beta_4}{a^2} + \frac{\beta_5}{c s} \end{bmatrix},$$

$$\nabla^2 m = \begin{bmatrix} 2\beta_5 \frac{a}{c^3 s} & \beta_5 \frac{a}{c^2 s^2} & -\frac{\beta_2}{a^2} - \frac{\beta_5}{c^2 s} \\ \beta_5 \frac{a}{c^2 s^2} & 2\beta_5 \frac{a}{c s^3} & -\frac{\beta_3}{a^2} - \frac{\beta_5}{c s^2} \\ -\frac{\beta_2}{a^2} - \frac{\beta_5}{c^2 s} & -\frac{\beta_3}{a^2} - \frac{\beta_5}{c s^2} & \frac{2(\beta_2 c + \beta_3 s + \beta_4)}{a^3} \end{bmatrix}.$$
Given specific values for the constants $\beta$ and the domain $\Omega$, these derivatives can be used
to quickly obtain minimizers of the model. This is in contrast with the models in [26], which
may better capture differences in variability across the decision space but are significantly
more computationally expensive to evaluate and minimize.
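As an illustration, the following is a minimal sketch of such a model-based minimization for the three-parameter model (4.6), assuming SciPy's bound-constrained L-BFGS-B solver; the box bounds are placeholders, and the coefficient vector b would come from a fit such as the one above.

```python
# Sketch: minimize the fitted model (4.6) over a box relaxation of Omega,
# supplying the algebraic gradient derived above. Bounds are placeholders.
import numpy as np
from scipy.optimize import minimize

def m(x, b):
    c, s, a = x
    return b[0] + b[1] * c / a + b[2] * s / a + b[3] / a + b[4] * a / (c * s)

def grad_m(x, b):
    c, s, a = x
    return np.array([
        b[1] / a - b[4] * a / (c**2 * s),
        b[2] / a - b[4] * a / (c * s**2),
        -(b[1] * c + b[2] * s + b[3]) / a**2 + b[4] / (c * s),
    ])

def minimize_model(b, bounds=((1, 156), (1, 128), (1, 256))):
    # Relax the integer parameters to a continuous box; the result should be
    # rounded back to the nearest allowed (e.g., power-of-two) configuration.
    x0 = np.array([(lo + hi) / 2.0 for lo, hi in bounds])
    res = minimize(m, x0, args=(b,), jac=grad_m,
                   method="L-BFGS-B", bounds=bounds)
    return res.x, res.fun
```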
Write time models for multiple file sizes
We now consider models that could be employed in tuning for multiple different file sizes
simultaneously. Consequently, we will now have $n_x = 4$ independent variables, $x = (c, s, a, f)$,
and there are $3^{n_x} = 81$ possible terms in the set (4.3).
Experimentally, we ran tests using VPIC-IO and different file sizes (i.e., different core counts)
on Hopper. The training set for each of the VPIC-IO experiments and their file sizes are
shown in Table 4.2. For VPIC-IO on 128 cores, the 216 configurations were the ones we have
seen previously. We chose to decrease the size of the training set as the core counts (and
hence file sizes) increase, because of the corresponding increase in the computational
resources required. These training sets are chosen in a systematic and automatic manner: for
example, for the 2048-core experiments, out of the 10 stripe count values shown in Table 4.1,
3 were chosen to cover the space: [16, 32, 256]. We chose 4 values (in MB) for the stripe size,
[1, 4, 16, 64], and 5 values for the number of aggregators, [16, 32, 48, 64, 80]. This leads to
the 60 configurations used for training our model.
# of cores   File size (GB)   Training set size
128          32               216
256          64               120
512          128              72
1024         256              60
2048         512              60

Table 4.2: Breakdown of the training set for the parallel I/O model.
Following the approach in Section 4.2.1 on the entire training data set, we obtain a six-term
basis of $\{1, f, \frac{f}{a}, \frac{a}{c}, \frac{cs}{a}, \frac{cf}{a}\}$. However, inspection of this basis shows that any
resulting model is necessarily monotone in $s$: if the coefficient for $\frac{cs}{a}$ is positive, the write
times are increasing in $s$; otherwise, the write times are non-increasing in $s$. Consequently,
we made the decision to include a seventh term. The term with a factor $\frac{1}{s}$ that best solved
(4.4), given the other six terms, was determined to be $\frac{a}{s}$. Therefore, our seven-term model
is of the form
$$m(x) = \beta_1 + \beta_2 f + \beta_3 \frac{f}{a} + \beta_4 \frac{a}{c} + \beta_5 \frac{a}{s} + \beta_6 \frac{cs}{a} + \beta_7 \frac{cf}{a}, \qquad (4.8)$$
with a fit to the data yielding
$$\hat{\beta} = [-20.65,\ 0.11,\ 4.17,\ 27.13,\ 4.50,\ 0.0038,\ 0.01].$$
We do not perform a detailed validation of this model here. Instead, in the next section
we will analyze in detail this model’s ability to perform space reduction and optimization for
a variety of I/O tuning tasks. Before proceeding to this study, we note that the model (4.8)
includes both actionable parameters (c, s, a) as well as an ancillary parameter (f) determined
from an input. In the context of model-based optimization, we could use this new model in
a minimization for any file size for which the model is deemed reliable,
$$m^*(f) \equiv \min_{(c,s,a) \in \Omega} m(c, s, a, f). \qquad (4.9)$$
Before proceeding, we note that both of the applications considered are weak-scaling
applications (i.e., the number of processors used to run the application is directly proportional
to the file size). Therefore, there was no need to use the number of processors ($p$) as another
parameter in the model. If, instead, the file size were fixed as we scaled the number of
processors, $p$ should also be an independent variable in the model.
4.3 Integration of Performance Models in Autotuning Framework
In this section, we explain how we use the empirical performance model described earlier
in our parallel I/O autotuning framework. Figure 4.4 shows the flow of the three steps of
the autotuning process: pruning, exploration, and refitting. In the pruning step, for a given
I/O kernel and problem size, the framework takes the sets of all possible values of the tunable
parameters and uses the model to predict the I/O cost of every combination. It then sorts (by
a derivative-based optimization or by enumeration) all the configurations based on the
predicted write times and chooses the top k configurations with the smallest predicted write
times. In the exploration step, the framework executes the I/O kernel with the selected
configurations (twenty, in our experiments) to determine their empirical, rather than
predicted, performance. The framework can then refit the model with the newly collected
write time data included. The number of best-predicted configurations selected from the
model and the number of refitting iterations are controllable by the user of our framework:
while the top twenty configurations proved effective in our tests, a user who prefers a
different number, or who wishes to refit the model iteratively, can configure the framework
with simple settings. After the model has been trained for the search-space pruning step,
choosing the top twenty configurations only involves evaluating the model, a task whose
computational expense is negligible (relative to evaluating a configuration) for our simple
choice of models. Therefore, this approach requires only a single batch evaluation of a few
configurations on the platform, decreasing the optimization time significantly. In fact, there is
an added bonus to evaluating configurations with low predicted write times: in our
experiments, these top twenty configurations always resulted in low write times and, as
opposed to the previous approach, the system never spent excessive time evaluating
especially poor configurations.
[Figure 4.4: flow diagram of the dynamic model-driven I/O tuning process. A training phase runs a training set of an I/O kernel on the HPC system and storage system to develop an I/O model; the pruning step evaluates all possible values with the model and selects the top k configurations; the exploration step runs them and collects performance results; the best-performing configuration is selected, and the model can be refit (controlled by the user).]
Figure 4.4: Design of our new autotuning system making use of performance
models.
When we used a GA for selecting the best-performing configuration [28], which executed
populations of I/O kernel configurations numerous times, a large number of inefficient
configurations (especially in the early populations) led to high write times. As an example,
for all the results of this work shown in Section 4.4, the total wall-time request for running the
top twenty configurations was less than two hours, compared to roughly twelve hours for the
GA runs.
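For concreteness, the following is a minimal sketch of the pruning step, where model stands for a fitted predictor of the form (4.8); the parameter grids are illustrative placeholders, not the values used in our experiments.

```python
# Sketch of the pruning step: rank every candidate configuration by its
# model-predicted write time and keep the k best for empirical exploration.
import itertools

def prune(model, file_size_gb, k=20,
          stripe_counts=(4, 8, 16, 32, 64, 128, 156),
          stripe_sizes_mb=(1, 2, 4, 8, 16, 32, 64, 128),
          aggregators=(16, 32, 64, 128, 256, 512, 1024)):
    space = itertools.product(stripe_counts, stripe_sizes_mb, aggregators)
    # model(c, s, a, f) returns the predicted write time of one configuration.
    scored = [((c, s, a), model(c, s, a, file_size_gb)) for c, s, a in space]
    scored.sort(key=lambda item: item[1])       # ascending predicted time
    return [cfg for cfg, _ in scored[:k]]       # top-k, passed to exploration
```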
4.4 Experimental Results
In this section, we present the I/O write time performance results for the VPIC-IO and
VORPAL-IO kernels at different scales. We first compare the performance of our autotuning
framework using the empirical models with that of the previous framework using GAs. In the
subsequent sections we evaluate the effectiveness of the model-based framework on a variety
of problem settings: for a configuration space relatively similar to that in the training space
(Section 4.4.2); for a configuration space much larger than the training space (Section 4.4.3);
for a different application (Section 4.4.4); and for a larger-scale run not covered by the
training space (Section 4.4.5).
4.4.1 Performance models vs. Genetic algorithms
To develop the model, we ran the various training configurations; the number of configurations
for each scale is shown in Table 4.2. The total time to run all the configurations of VPIC-IO
at the specified numbers of processors was 16.5 hours; the same runs for VORPAL-IO required
28 hours. Note that this training cost is a one-time expense for the performance model. The
resulting model is used for predicting write times across different concurrencies. Once the
model is ready, the incremental time spent in the pruning, exploration, and refitting steps is
minimal. For example, the exploration step for the VPIC-IO kernel using 2048 cores took
31 minutes; that of the VORPAL-IO kernel using the same number of cores took 89 minutes.
In contrast, our GA-based tuning process, which tested roughly 400 configurations for the
VPIC-IO and VORPAL-IO kernels (running at 2048 cores), ran for 12 hours.
To summarize, the GA-based approach has a high runtime overhead associated with every
kernel and scale level. The empirical-model-based approach has a one-time cost associated
with fitting a model for a specific kernel, but can thereafter be used to predict times for any
number of processors, with a fractional cost for refitting.
4.4.2 Testing on space similar to training
As expected, we find that tests conducted on a space relatively similar to that used in the
training phase lead to accurate predictions. As an example, for the 512-core VPIC-IO
experiment, we used the 72 configurations in Table 4.2 to train the model. In tuning the
VPIC-IO kernel at the same scale using our autotuning framework, we derived a larger space
of 384 configurations by increasing the number of values for each parameter (i.e., by
increasing the granularity of the allowed configuration space). We provided this extended
search space to our framework as input. The framework predicted the write time for all 384
configurations and pruned the space by selecting the top twenty configurations. In the
exploration step, the framework executed the kernel and obtained the write time for the
selected configurations. Figure 4.5 compares the predicted write times (labeled “Predicted”),
the measured write times (labeled “Actual”), the default write time of the kernel without any
tuning, and the least
write time observed among the measured performance of the 72 training configurations
(labeled “Training_Best”). We use the same labels for all the remaining plots. The top ten
configurations with the best performance are shown in Table 4.3. We also note that the best
configuration achieved a 12X speedup over the default configuration.
[Figure 4.5: write time (s, 0 to 300) versus configuration number (1 to 20), showing Predicted, Actual, Default, and Training_Best.]
Figure 4.5: Comparison of the model-predicted, measured, and default-setting
write times of the top twenty configurations for VPIC-IO running on 512 cores.
Stripe count   Stripe size (MB)   Aggregators   Pred. Time (s)
64 64 20 34.88
64 128 20 34.95
128 64 20 35.51
64 32 20 35.89
128 32 20 36.14
128 128 20 36.37
156 64 20 36.88
156 32 20 37.34
156 128 20 38.08
64 16 20 38.51
Table 4.3: Top ten predicted configurations of VPIC-IO on 512 cores. The best-
performing configuration is provided to the user as output of the autotuning
process.
4.4.3 Testing on a larger space
In this experiment, we evaluate the framework with a larger space of 640 configurations for
the VPIC-IO kernel running on 512 cores. Figure 4.6 shows the twenty configurations selected
from this space for their least predicted write times, and Table 4.4 shows the top ten
best-performing configurations. Comparing Tables 4.3 and 4.4, we see that the autotuning
framework found configurations that achieve an approximately 1.4X speedup in write time
performance by using the larger configuration/search space. The new configurations use
larger stripe counts, stripe sizes, and numbers of aggregators. In this 512-core VPIC-IO
experiment, the number of nodes used is 22 (i.e., 512 divided by 24 cores per node). It has
been suggested by some studies that using one aggregator per node achieves the best write
times. We observe in Table 4.4 that all of the top ten configurations use more aggregators
than the number of nodes. Further analysis is needed to understand the reasons for this
behavior.
[Figure 4.6: write time (s, 0 to 300) versus configuration number (1 to 20), showing Predicted, Actual, Default, and Training_Best.]
Figure 4.6: Comparison of the model-predicted, measured, and default-setting
write times of the top twenty configurations for VPIC-IO on 512 cores.
We also tested the autotuning framework on the VPIC-IO kernel at other concurrencies.
Figure 4.7 and Table 4.5 show, respectively, the performance comparison of the top twenty
selected configurations (from a space of 640 input configurations) and the best ten
configurations after the refitting process. Comparing the configurations in Tables 4.4 and 4.5,
we see that the tuned parameter values differ for I/O kernels running at different numbers of
processors. Among the configurations, the number of aggregators is again larger than the
number of nodes (85 for the 2048-core test) in many cases. Although the accuracy of the
predicted write times is lower than in the 512-core experiment, the best configuration achieves
a 27X speedup over the default configuration.
Stripe count   Stripe size (MB)   Aggregators   Pred. Time (s)
128 128 32 25.08
128 64 32 25.24
156 64 32 25.35
156 128 32 25.41
128 32 32 27.00
156 32 32 27.01
128 64 28 27.36
128 128 28 27.49
156 64 28 27.82
156 128 28 28.19
Table 4.4: Top ten predicted configurations of VPIC-IO on 512 cores selected
by the autotuning framework.
[Figure 4.7: write time (s, 0 to 1200) versus configuration number (1 to 20), showing Predicted, Actual, Default, and Training_Best.]
Figure 4.7: Comparison of the model-predicted, measured, and default-setting
write times of the top twenty configurations for VPIC-IO on 2048 cores.
4.4.4 Testing on a different application: VORPAL-IO
We now evaluate the model form based on the VPIC-IO training configurations for tuning a
different I/O kernel. We first ran the VORPAL-IO kernel with the same training
configurations as shown in Table 4.2. We observed minor differences in the coefficient values
(i.e., the $\beta_i$ values) for the VORPAL-IO kernel.
Stripe count   Stripe size (MB)   Aggregators   Pred. Time (s)
156 128 128 85.94
128 128 128 89.59
156 64 128 90.15
128 64 128 93.84
128 128 64 96.06
156 128 64 96.08
156 64 64 97.73
128 64 64 97.82
156 32 128 99.00
156 128 256 100.95
Table 4.5: Best ten predicted configurations selected by the framework for VPIC-
IO on 2048 cores.
The model is again of the form
$$m(x) = \beta_1 + \beta_2 f + \beta_3 \frac{f}{a} + \beta_4 \frac{a}{c} + \beta_5 \frac{a}{s} + \beta_6 \frac{cs}{a} + \beta_7 \frac{cf}{a}, \qquad (4.10)$$
with
$$\hat{\beta} = [27.47,\ 0.57,\ 3.42,\ 9.89,\ 11.94,\ 0.0013,\ -0.0054].$$
Using the new coefficients and the same model form, we performed the same tests for
VORPAL-IO as we did for VPIC-IO in the previous subsection. Figures 4.8 and 4.9 compare
the predicted, measured, and default write times for the VORPAL-IO kernel running on 512
and 2048 cores, respectively. Table 4.6 shows the best ten predicted configurations selected
by the framework for the 512-core VORPAL-IO run. Note that the write time predicted by
the model for this application is larger than the corresponding prediction for the VPIC-IO
application, even though the sizes of the generated files are comparable. The increased write
time is due to the complex write pattern of VORPAL-IO. Additionally, it is clear that the
model tends to choose larger stripe counts and stripe sizes for VORPAL-IO. Overall, the
tuned configurations achieve speedups of 6X and 13X for VORPAL-IO running on 512 and
2048 cores, respectively.
4.4.5 Testing on larger scale
We now evaluate the model developed using the training configurations at smaller scales (see
Table 4.2) to tune the two I/O kernels running at 8192 cores. Note that we did not use any
configurations from the 8192-core runs in training the model.
[Figure 4.8: write time (s, 0 to 300) versus configuration number (1 to 20), showing Predicted, Actual, Default, and Training_Best.]
Figure 4.8: Comparison of the model-predicted, measured, and default-setting
write times of the top twenty configurations for VORPAL-IO on 512 cores.
[Figure 4.9: write time (s, 0 to 1500) versus configuration number (1 to 20), showing Predicted, Actual, Default, and Training_Best.]
Figure 4.9: Comparison of the model-predicted, measured, and default-setting
write times of the top twenty configurations for VORPAL-IO on 2048 cores.
Stripe count   Stripe size (MB)   Aggregators   Pred. Time (s)
156 64 32 131.25
128 64 32 131.59
128 128 32 132.14
64 128 32 132.59
156 128 32 132.57
64 64 32 133.81
156 32 32 135.07
128 32 32 135.79
32 128 32 136.53
32 64 32 138.63
Table 4.6: Best ten predicted configurations for VORPAL-IO on 512 cores.
The 8192-core runs use 342 nodes of Hopper and produce roughly 2 TB of data. We used a
configuration/search space of 1080 configurations for both kernels.
For each application, Figures 4.10 and 4.11 show the selected top twenty configurations
after the pruning and exploration steps. We observe significant performance improvement
for both applications. The speedups over the default I/O configurations on Hopper at a
concurrency of 8192 cores are 54X and 35X for the VPIC-IO and VORPAL-IO kernels,
respectively.
Table 4.7 summarizes the achieved speedups for both I/O kernels running at different
concurrencies. The table also shows the size of the data written to the file system and the
I/O bandwidth achieved. Overall, the tuned configurations achieve speedups ranging from
3.5X to 54X, which is consistent with exploring the search space using GAs. The time to
traverse the search space after training was reduced from 12 hours to a maximum of two
hours. In most cases, exploring the top twenty configurations took one hour, resulting in
significant improvements to overall parallel I/O performance.
4.4.6 Large-scale results
In this section, we first present the I/O performance results for the three I/O kernels at
different scales on the three platforms. The achieved I/O bandwidth and the overall
improvement compared to the default I/O settings are presented. We then analyze the
interdependencies of the I/O parameters by taking a closer look at these results.
[Figure 4.10: write time (s, 0 to 6000) versus configuration number (1 to 20), showing Predicted, Actual, and Default.]
Figure 4.10: Comparison of the model-predicted, measured, and default-setting
write times of the top twenty configurations for VPIC-IO on 8192 cores.
[Figure 4.11: write time (s, 0 to 9000) versus configuration number (1 to 20), showing Predicted, Actual, and Default.]
Figure 4.11: Comparison of the model-predicted, measured, and default-setting
write times of the top twenty configurations for VORPAL-IO on 8192 cores.
# cores   I/O Kernel   File Size (GB)   Actual B.W. (MB/s)   Default B.W. (MB/s)   Speedup
128       VPIC         32               2074.65              471.75                4.40
128       VORPAL       35.156           1501.87              427.6                 3.51
512       VPIC         128              5185.4               408.6                 12.69
512       VORPAL       140.625          3035.99              453.89                6.69
1024      VPIC         256              6181.75              336.6                 18.37
1024      VORPAL       281.25           2707.63              429.56                6.30
2048      VPIC         512              11422.28             412.19                27.71
2048      VORPAL       562.5            4917.61              378.35                13.00
8192      VPIC         2048             18857.3              345.27                54.62
8192      VORPAL       2250             7442.58              207.25                35.91

Table 4.7: Highest bandwidth achieved for the two applications by selecting the
best-performing configuration suggested by our autotuning framework on Hopper.
4.4.7 Overall improvement
The best I/O bandwidth results we have obtained for each of the applications on the different
platforms are summarized below. For each experiment, this is the best-performing
configuration among the top 20 configurations predicted by the model. Figure 4.16 shows the
I/O bandwidth grouped by the number of cores, from 4K to 16K. For all these experiments,
we used the training-phase experiments without a refitting phase. As can be observed, the
I/O bandwidths of the kernels are in the range of 5-30 GB/s, which is efficient performance
for writing to one shared file on these platforms at their respective scales. We also show the
default I/O performance of the applications at 4096 and 8192 cores on the Hopper platform.
Compared to the default performance, our tuned configurations perform 6X-94X better. We
expect the default performance and our speedups to be at a similar level on the other
platforms. Note that on the Stampede platform we scaled our runs only up to 4K cores, due
to queue policies for running large-scale tests.
Table 4.8 summarizes the achieved I/O bandwidths for the three I/O kernels running at
different concurrencies on the three platforms. The table also shows the size of the data
written to the file system. The time to traverse the search space after training was at most
three hours. In most cases, exploring the top twenty configurations took less than one hour,
resulting in significant improvements to overall parallel I/O performance.
# cores   I/O Kernel   File Size (GB)   Edison (GB/s)   Hopper (GB/s)   Stampede (GB/s)   Hopper Default (GB/s)
512       VPIC         128              8.19            3.00            9.30              0.39
512       VORPAL       140.625          3.24            2.67            7.76              0.44
512       GCRM         166.4            9.78            5.27            11.62             -
1K        VPIC         256              14.24           5.09            14.71             0.32
1K        VORPAL       281.25           9.91            2.34            9.10              0.41
1K        GCRM         166.4            14.63           6.70            13.28             -
2K        VPIC         512              19.72           8.18            14.75             0.40
2K        VORPAL       562.5            17.81           4.63            12.67             0.36
2K        GCRM         665.6            23.96           6.82            21.05             0.24
4K        VPIC         1024             20.57           12.57           29.20             0.34
4K        VORPAL       1197             10.26           4.50            15.35             0.31
4K        GCRM         2600             16.64           10.59           26.99             0.41
8K        VPIC         2048             24.32           18.93           -                 0.20
8K        VORPAL       2250             12.77           7.26            -                 0.33
8K        GCRM         10400            28.60           22.09           -                 -
16K       VPIC         512              23.21           21.96           -                 -
16K       VORPAL       4394             15.20           9.45            -                 -
16K       GCRM         10400            24.58           19.73           -                 -

Table 4.8: Highest bandwidth achieved for the three applications by selecting the
best-performing configuration suggested by our proposed framework.
4.4.8 Analysis of the interdependencies
In this subsection we analyze some of the interdependencies of the parallel I/O tunable
parameters by looking at the results of the experiments we conducted.
We first analyze the impact of the individual tuning parameters (stripe count, number of
aggregators, and stripe size) on performance, and then discuss the combined impact of stripe
count and aggregators, and of stripe size and aggregators.
In order to see the effect of Lustre's stripe count parameter on the I/O performance of an
application, we look at the three different scales at which we ran VPIC-IO on Stampede.
Figure 4.17 shows the box plots for these experiments, where each plot contains all the
training set configurations for that scale. We can see that as the stripe count increases, the
I/O performance improves, especially at the higher scales of the VPIC-IO application, since
the amount of data to be written is large. This behavior is exactly reflected in the model,
since it tries to use all available OSTs for VPIC-IO.
Figure 4.12 shows the effect of varying the number of aggregators on VPIC-IO's training set
on Stampede. Similar to Lustre's stripe count, increasing the number of aggregators helps
improve the I/O performance of VPIC-IO. Therefore, the model takes this into account and
tries to maximize the number of aggregators for the larger-scale testing sets.

Figure 4.12: Effect of MPI-IO aggregators on the training set of VPIC-IO on
Stampede.
Figure 4.13 shows the same box plots for the stripe size of VPIC-IO on Stampede. As the
plot shows, Lustre's stripe size does not behave like the stripe count: each of the stripe size
values chosen in the training set has shown both good and bad I/O performance, depending
on the values of the other I/O parameters. As we show next, the model's behavior for this
parameter is interesting.
Having observed the variation and performance of the individual parameters, we now analyze
the top twenty configurations predicted by the model for a larger-scale VPIC-IO experiment
on Stampede. Table 4.9 contains the configurations proposed by the model as the
top-performing configurations for VPIC-IO on 4K cores of Stampede, which produces an
output file of size 1024 GB. As noted before, the number of aggregators is chosen to be the
maximum of 1024, and the stripe counts vary from 156 (the maximum in the testing set)
down to 64. Since there is no strong correlation between stripe size and I/O performance in
the training sets, all the stripe sizes in the testing sets are chosen by the model to be tested.
Looking at this table, we can see that the highest I/O bandwidth, approximately 29 GB/s, is
achieved in experiment exp_id 6 with a stripe size of 64 MB.
Another interesting behavior we found in the results of the training set experiments is a
relationship between the number of aggregators and the stripe count. We analyze this
relationship using the ratio of the number of aggregators to the stripe count. This relationship
makes sense from the parallel I/O perspective, as the number of aggregators each OST
handles has an impact on the concurrency of Lustre and on the communication between an
aggregator and an OST.

Figure 4.13: Effect of Lustre stripe size on the training set of VPIC-IO on Stampede.
Figure 4.18 shows the impact of the ratio of aggregators to stripe count for the various I/O
kernels running on Hopper at a concurrency of 2K. We can observe that the impact of the
ratio is similar for VPIC-IO and VORPAL-IO, while GCRM-IO shows a different behavior. It
is not surprising that the higher end of the spectrum performs poorly for all kernels, as it
corresponds to the experiments with low stripe counts. There is a peak in the middle of this
plot where we see the best I/O performance for VPIC-IO and VORPAL-IO: this is where
both the stripe count and the number of aggregators are large enough to extract the most
parallelism, but not so large that overhead causes the performance to drop. The behavior
differs for GCRM, because its stripe count should be large but its number of aggregators
should not be.
Analyzing the top twenty results predicted by our model, once we ran them on the platforms,
provides insight as well. Here we present some of the insights that we think are important for
the scientific community in achieving efficient parallel I/O performance.
The first insight concerns the role of Lustre's stripe size. Figure 4.14 compares the
performance of the top twenty configurations proposed by the model for VPIC-IO on 4K
cores of Edison. The stripe count for all these configurations is fixed at 96, so it is easy to
compare the impact of the stripe size in one plot. The three bars in different colors show the
numbers for the three different aggregator counts chosen by the model, and the x-axis shows
different values of the stripe size in MB.
exp_id   c     s     a      f (GB)   time (s)   bandwidth (GB/s)
0        156   1     1024   1024     58.87      17.39
1        156   2     1024   1024     49.84      20.54
2        156   4     1024   1024     47.06      21.75
3        156   8     1024   1024     42.11      24.31
4        156   16    1024   1024     38.99      26.25
5        156   32    1024   1024     40.28      25.41
6        156   64    1024   1024     35.06      29.20
7        156   128   1024   1024     44.96      22.77
8        128   1     1024   1024     61.33      16.69
9        128   2     1024   1024     65.87      15.54
10       128   4     1024   1024     58.94      17.37
11       128   8     1024   1024     54.72      18.71
12       128   16    1024   1024     68.53      14.94
13       128   32    1024   1024     61.76      16.57
14       128   64    1024   1024     49.47      20.69
15       128   128   1024   1024     57.31      17.86
16       64    1     1024   1024     104.13     9.83
17       64    2     1024   1024     95.14      10.76
18       64    4     1024   1024     129.01     7.93
19       64    8     1024   1024     78.20      13.09

Table 4.9: The top twenty configurations predicted by our model and their
respective I/O bandwidths for VPIC-IO on 4K cores of Stampede, generating a
1 TB file.
We can observe that the difference between the poorest-performing and best-performing
configurations is almost two-fold. This behavior is similar to what we observed in Table 4.9,
which shows the same application at the same concurrency on Stampede; however, on Edison
the best I/O performance was obtained with a stripe size of 16 MB, while on Stampede it was
64 MB. This shows that the best values of these parameters depend on the platform. It also
emphasizes that the selection of the stripe size has an impact on I/O performance, contrary
to a recent study [29] that downplays this impact.
Another insight we gained from the results is that, unlike Lustre's stripe count, where
increasing the number of OSTs gives better performance, the number of aggregators has
a sweet spot that depends on the I/O pattern of an application. Figure 4.15 demonstrates
this impact for the VORPAL-IO application on 16K cores of Edison. Fourteen of the
top twenty configurations proposed by our model for this experiment have a stripe count of
96, so we can compare the effect of the number of aggregators for each stripe size value.
Based on the plot, one can conclude that having too many aggregators does not provide good
performance
Figure 4.14: Effect of Lustre stripe size on performance of the Top 20 VPIC-IO
configurations on 4K cores of Edison. Stripe count is fixed at 96 (maximum number
of OSTs on Edison). [Plot: I/O bandwidth (GB/s) versus stripe size (MB), for
1024, 512, and 256 aggregators.]
Figure 4.15: Effect of MPI-IO aggregators on performance of 14 configurations
of the Top 20 experiments on 16K cores of Edison. Stripe count is fixed at 96
(maximum number of OSTs on Edison). [Plot: I/O bandwidth (GB/s) versus number
of aggregators, for stripe sizes of 128, 64, and 32 MB.]
(most likely because of high overhead). On the other hand, having too few aggregators is
suboptimal because the nodes cannot saturate the I/O bandwidth. We can also conclude
from the plot that the value of the stripe size plays a role in choosing the number of
aggregators, confirming the interdependency among the various I/O parameters.
In summary, we list the following findings from our analysis of I/O performance (a sketch
of setting the Lustre striping parameters follows the list):

• As the size of a file increases, increasing Lustre's stripe count, irrespective of the
platform, provides more parallelism and therefore improves I/O bandwidth.

• Lustre's stripe size is an important factor in tuning I/O performance. It can have a
dramatic impact, and its best value depends on the I/O operations, the other I/O
parameters (e.g., the number of aggregators), and the HPC platform.

• The number of MPI-IO aggregators should be chosen carefully, not blindly minimized
or maximized. This parameter also depends on the I/O operations, the other I/O
parameters, the platform, and the amount of communication in the application, i.e.,
the application's I/O pattern.
4.5 I/O Interference
As described earlier, one of our initial goals in this work was to characterize interference
(“noise”). This characterization would allow us to take noise into account as a source of
error in the model and/or in posing the tuning objectives (e.g., maximizing ideal performance
or worst-case performance). The problem of interference is well-known in high-performance
storage systems, mostly due to the storage system being a shared resource (see, e.g., [30]).
Depending on the activity on the network and storage system, one can observe different I/O
performance for identical I/O configurations.
In order to estimate this noise accurately, one approach is to run experiments multiple
times under the system states one wishes to capture (e.g., on different days of a week
or in the presence of different rates of utilization of the I/O subsystem). However, this
approach may be difficult to implement (e.g., because of the challenges associated with
producing/controlling exact system states) and impractical to execute (e.g., because of the
large number of runs that would need to be conducted to obtain data under the desired
states).
Figure 4.16: Summary of the best I/O performance obtained in the Top-20
configurations for each I/O benchmark (VPIC-IO, VORPAL-IO, GCRM-IO) running on
(a) 4K cores, (b) 8K cores, (c) 16K cores. [Plots: I/O bandwidth (GB/s) on
Edison, Hopper, and Stampede, with the default-configuration bandwidth of each
benchmark marked for reference; note that (a) and (b) are log-scale plots.]
An alternative approach is to monitor the amounts of data read from and written to each
OST while a configuration is being evaluated. These read/write amounts could be captured
along with the experimental data (OSTs involved and time recorded from the evaluation),
and then used as a proxy for the uncertainty associated with the time. For example, when
the OSTs associated with a run have external read/write amounts above a threshold, this
run could be flagged as a potentially “noisy” experiment.
As a first step towards investigating the potential of the latter approach, we attempted to
use data gathered through the Lustre Monitoring Toolkit (LMT) [31], which is stored on a
daily basis on Hopper. The original version of LMT keeps track of the bytes read/written
by sampling this information on each OST every 5 seconds. We extracted the total amount
of bytes read/written to all of the OSTs on the file-system during program execution (corre-
sponding to aggregate I/O). Unfortunately, no meaningful correlation was observed between
the write times across multiple runs and the aggregate I/O activity. We then attempted
to examine the LMT data from only those OSTs that our application was using, but we were
unable to find a correlation between I/O times and aggregate activity. Richer network and
storage system monitoring data will become available in the coming years; we expect
significant opportunities to leverage such information to assess the fidelity of both the
underlying empirical data and the predictions from the models in our framework.
4.6 Conclusions
This chapter has presented an important development in our work on autotuning parallel
I/O. We have dramatically reduced the run time for our framework from 12 hours to 2 hours
by incorporating an empirical performance model. The model accounts for major parameters
pertaining to parallel I/O operations on a production supercomputing platform. We fit the
model with a relatively small training set of application runs. The model was then used
to predict configurations with high levels of I/O performance on two applications and at
varying levels of concurrency.
Figure 4.17: Effect of Lustre stripe count at three different scales of VPIC-IO
on Stampede: (a) 512 cores, (b) 1K cores, (c) 2K cores
Figure 4.18: Ratio of MPI-IO aggregators to Lustre stripe count for three
different applications on 2K cores of Hopper: (a) VPIC-IO, (b) VORPAL-IO, (c)
GCRM-IO
CHAPTER 5
A MULTI-LEVEL APPROACH FOR
UNDERSTANDING I/O ACTIVITY IN HPC
APPLICATIONS
In a hierarchical I/O stack, the layers provide bridges between the data representations of
adjacent levels and offer essential abstractions to users. The layers help hide complex im-
plementation details and employ optimization techniques designed to improve performance.
Unfortunately, since each layer is normally treated as a black box, optimizations are sel-
dom coordinated across layers and the source of performance bottlenecks can be extremely
difficult to determine. A multi-level I/O tracing and trace data analysis tool that presents
a view of the function call flow through the entire I/O stack can expose cause and effect
relationships across layers and make the origin of performance bottlenecks more apparent.
To the best of our knowledge, while there are several tracing facilities for the MPI-IO
and POSIX I/O levels, none of the currently available tracing tools work with higher-level
I/O libraries such as HDF5. We believe that tracing I/O functions at higher levels in the
stack is important because events closer to the application better reflect inherent application
characteristics and are more intuitive to analyze. In addition, insights into all levels of the
I/O stack are necessary in order to get a full picture of interactions between layers and to
identify sources of performance bottlenecks.
In this chapter, we argue that a multi-level I/O tracing and trace data analysis tool can
help end users understand the behavior of their application and I/O subsystem, and can
provide insights into the source of I/O performance bottlenecks. We make the following
contributions:
• We implement a multi-level I/O tracing framework, called Recorder, that can capture
I/O function calls at multiple levels of the I/O stack, including HDF5, MPI-IO, and
POSIX I/O. Recorder requires no modification or recompilation of the application and
users can control what levels are traced.
• We demonstrate the effectiveness of Recorder as an aid to understanding the I/O
activity of applications and identifying a performance bottleneck in the current HDF5
implementation of metadata reads.
The remainder of this chapter is organized as follows: We describe our framework in
Section 5.1.
Feature                              Recorder
Parallel file system compatibility   Yes
Ease of installation and use         1 (V. Easy)
Anonymization                        3 (Medium)
Event types                          System calls and library calls
Control of trace granularity         Moderate
Replayable trace generation          Planned
Trace replay fidelity                N/A
Reveals dependencies                 No
Intrusive vs. passive                1 (V. Passive)
Analysis tools                       Planned
Trace data format                    Binary and human-readable
Accounts for time skew and drift     No
Elapsed time overhead                Under investigation
Table 5.1: Features of the Recorder framework
In Section 5.2, we evaluate the effectiveness of our framework. Finally, we
summarize our current efforts in Section 5.3, discuss open issues, and outline future work.
5.1 Framework
The features of Recorder based on the taxonomy proposed in [32] are summarized in Table
5.1 and briefly described here. Recorder is designed to work with parallel file systems and
does not require any modifications to application or I/O library source code. It provides a
medium level of anonymization to protect sensitive data by, for example, recording only the
size of a message and not its value. Recorder can capture I/O functions at the HDF5, MPI
I/O, and POSIX I/O layers, and users can specify the layers to be traced when Recorder is
compiled, providing some control over trace granularity. Recorder provides a passive method
of tracing events through the use of dynamic library preloading. Development of analysis
tools is future work, while direct inspection of the trace files is possible now as both binary
and human readable formats are supported. Initial experiments showed an acceptable level
of elapsed time overhead, but attempts to measure and compare runtimes with and without
tracing yielded noisy results. We are currently exploring better ways to collect time overhead
measurements.
We chose to build Recorder as a shared library so that it does not require modification
or recompilation of the application. Recorder uses function interpositioning to prioritize
itself over standard functions, as shown in Figure 5.1. Once Recorder is specified as the
Figure 5.1: Dynamic instrumentation of the I/O stack by Recorder. [Diagram: an
application call such as H5Fcreate("sample_dataset.h5", H5F_ACC_TRUNC,
H5P_DEFAULT, plist_id) is intercepted by Recorder, which (1) obtains the address
of the real H5Fcreate using dlsym(), (2) records the timestamp, function name,
and arguments, and (3) calls the real H5Fcreate in the unmodified HDF5 library.
The HDF5 library's resulting MPI_File_open call at the MPI-IO layer and the open
call at the POSIX layer are intercepted in the same way.]
preloading library, it intercepts HDF5 function calls issued by the application and reroutes
them to the tracing implementation, where the timestamp, function name, and function
parameters are recorded. The original HDF5 function is called after this recording step.
The mechanism is the same for the MPI-IO and POSIX layers. Figures 5.3, 5.4, and 5.5 show
sample trace output. This tracing approach is transparent to the user because the
interception requires no change to application or library source code.
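As an illustration of this preloading mechanism, the following is a minimal interposition
sketch in C; it is our own sketch, not the Recorder source, and the build and run commands
are illustrative:

/* rec.c -- build: cc -shared -fPIC rec.c -o librec.so -ldl
 *          run:   LD_PRELOAD=./librec.so ./application      */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <sys/time.h>
#include <hdf5.h>

hid_t H5Fcreate(const char *name, unsigned flags, hid_t fcpl_id, hid_t fapl_id)
{
    /* Look up the real H5Fcreate the first time we are called. */
    static hid_t (*real_H5Fcreate)(const char *, unsigned, hid_t, hid_t);
    if (!real_H5Fcreate)
        real_H5Fcreate = (hid_t (*)(const char *, unsigned, hid_t, hid_t))
                         dlsym(RTLD_NEXT, "H5Fcreate");

    /* Record the timestamp, function name, and arguments. */
    struct timeval tv;
    gettimeofday(&tv, NULL);
    fprintf(stderr, "%ld.%05ld H5Fcreate(%s,%u,%lld,%lld)\n",
            (long)tv.tv_sec, (long)tv.tv_usec, name, flags,
            (long long)fcpl_id, (long long)fapl_id);

    /* Forward to the real implementation. */
    return real_H5Fcreate(name, flags, fcpl_id, fapl_id);
}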
5.2 Evaluation
We evaluated the effectiveness of our tracing framework using two parallel I/O benchmarks
running on Stampede [33], a Dell PowerEdge C8220 cluster at the Texas Advanced Comput-
ing Center. Stampede has 6,400 nodes, each with 32 GB memory and 16 cores. The peak
I/O bandwidth is 159 GB/s.
5.2.1 VPIC-IO Benchmark
In our first case study, we used I/O traces from Recorder to investigate how I/O parameter
settings change the underlying behavior of an HPC I/O benchmark in runs with default and
tuned parameters.
Table 5.2 summarizes the parameters, their default and tuned values, and the measured
I/O bandwidth for VPIC-IO experiments run on 128 cores of Stampede.
Description                  Default Parameters   Tuned Parameters
Lustre stripe count          2                    48
Lustre stripe size           1 MB                 2 MB
Collective buffering nodes   chosen by ROMIO      16
Collective buffer size       1 MB                 2 MB
HDF5 alignment               none                 (0, 4 KB)
I/O bandwidth                394 MB/s             2329 MB/s
Table 5.2: VPIC-IO parameters and I/O bandwidth
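For concreteness, the tuned values in Table 5.2 could be applied by hand through standard
MPI-IO hints and the HDF5 file access property list, as in the following minimal sketch
(our own example, not the H5Tuner code; the file name is illustrative):

#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "48");     /* Lustre stripe count        */
    MPI_Info_set(info, "striping_unit", "2097152");  /* Lustre stripe size: 2 MB   */
    MPI_Info_set(info, "cb_nodes", "16");            /* collective buffering nodes */
    MPI_Info_set(info, "cb_buffer_size", "2097152"); /* collective buffer: 2 MB    */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);
    H5Pset_alignment(fapl, 0, 4096);                 /* HDF5 alignment (0, 4 KB)   */

    hid_t file = H5Fcreate("vpic_tuned.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    /* ... dataset creation and collective writes would go here ... */
    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}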
Since I/O trace size can be a metric of I/O activity, we first compared the sizes of the
per-process I/O trace files for the default and tuned VPIC-IO experiments, as shown in
Figure 5.2. In the default case, there are two spikes in the trace file size, both at about 1.7
MB. In the tuned case, we see eight smaller peaks at around 280 KB.
After noting the patterns in the trace file sizes, we looked at the trace files in more
detail to understand the source of the observed patterns. Figures 5.3, 5.4, and 5.5 show
lines from the trace files produced by Recorder during runs of VPIC-IO. Comparing Figures
5.3 and 5.4, corresponding to ranks 0 and 1 using default I/O parameters, the trace for
rank 0 shows POSIX write operations to the file sample_dataset.h5, while the trace for
rank 1 does not. This difference may come as a surprise to an application developer with
little experience in parallel I/O, as the HDF5 calls are identical.
Recognizing that the actual POSIX write operations were happening on the processors
with the larger trace files, we next sought to understand how the number of spikes was
related to the I/O parameter settings. One of the tuned parameters was the number of
collective buffering nodes, also referred to as aggregators. In the tuned version, an MPI-IO
hint was used to set the number of aggregators to 16. However, the MPI-IO library will
not assign more than one aggregator per node. Since Stampede has 16 cores per node, a
128-process run resides on 8 nodes, and ROMIO uses no more than eight aggregators. This
is why we see 8 spikes in the trace file size plot for the tuned run instead of 16. The default
case did not explicitly specify the number of aggregators and ROMIO chose two. Without
the trace file output, these patterns of I/O behavior are difficult to anticipate from just the
application source and specified parameter settings.
Looking more closely at Figures 5.3 and 5.5 (and even more apparent in the full trace
files), we see that the POSIX write operations never transfer more than 1 MB of data in
the default case, while in the tuned case each operation can transfer up to 2 MB. For the
collective buffer size, the observed behavior agrees with the expected behavior based on
the parameter settings specified.
Figure 5.2: Trace file sizes for VPIC-IO process ranks. [Plot: trace file size
(KB) versus process rank (1-128) for the default and tuned I/O configurations.]
Although not shown in detail, the Recorder trace output can also be used to investigate the
effects of the HDF5 alignment parameters on the I/O patterns throughout the stack. Since
the traces do not go lower than the POSIX I/O level, they do not give direct insights into the
behavior of the Lustre file system with different stripe count and stripe size parameters. In
general, the traces do provide a valuable resource for understanding how an identical HDF5
call in the application can be instantiated very differently at lower levels of the I/O stack
depending on process rank and the parameter settings chosen.
5.2.2 Simple HDF5 Benchmark
A simple HDF5 benchmark was developed for the purpose of testing the Recorder framework.
In the first phase, the benchmark creates an HDF5 file on the parallel file system and writes
five large global attributes to the file. Each rank then writes 1 GB of data to a rank-based
offset in a single HDF5 dataset, and the file is closed.
1378057484.0000 H5Dwrite(83886080,50331690,67108866,67108868,167772178)
1378057484.00000 MPI_File_set_view(fh=30077080, disp=2184, 
etype=MPI_BYTE, filetype=, datarep=native, info=469762048)
1378057484.00000 MPI_File_write_at_all(fh=30077080, offset=0, buf=*buf, 
count=33554432, datatype=MPI_BYTE, status=-680384912)
1378057484.00000 write(uverbs0, void *buf, 48)
1378057484.00000 write(sample_dataset.h5, void *buf, 1046392)
1378057484.00000 write(uverbs0, void *buf, 48)
1378057484.00000 write(sample_dataset.h5, void *buf, 1048576)
...
Figure 5.3: VPIC-IO write operations, rank 0, default parameters
1378057484.0000 H5Dwrite(83886080,50331690,67108866,67108868,167772178)
1378057484.00000 MPI_File_set_view(fh=15792648, disp=2184, 
etype=MPI_BYTE, filetype=, datarep=native, info=469762048) 
1378057484.00000 MPI_File_write_at_all(fh=15792648, offset=0, buf=*buf, 
count=33554432, datatype=MPI_BYTE, status=-1274399632) 
1378057484.00000 write(/dev/infiniband/uverbs0, void *buf, 88) 
1378057484.00000 write(/dev/infiniband/uverbs0, void *buf, 120)
1378057484.00000 write(/dev/infiniband/uverbs0, void *buf, 120) 
...
Figure 5.4: VPIC-IO write operations, rank 1, default parameters
In the second phase, the file is reopened and each rank reads the global attributes. Each
rank then reads its portion of the dataset, compares the values read to those written in
the first phase, and closes the file.
I/O performance characterization tools such as Darshan can provide helpful information
about the MPI-IO calls issued when an application executes. Table 5.3 shows the values
of six important Darshan MPI-IO counters for an experimental run of the simple HDF5
benchmark on eight cores of Stampede. Since the experiment used a collective version of
the benchmark, the 16 collective opens are expected: one open per rank for each phase of
the benchmark. The 8 collective writes and 8 collective reads are also expected,
corresponding to the dataset writes and reads by the eight processes.
The 72 independent reads and 7 independent writes are more difficult to understand;
they do not correlate with the benchmark’s writes and reads of the global attributes. I/O
characterization tools do not provide the level of detail needed to investigate further, but the
traces from Recorder did allow us to identify the HDF5 calls that resulted in these puzzling
independent reads and writes. We will focus our discussion on the reads.
As can be seen in Figure 5.6, an HDF5 dataset open triggered each process to perform
1378057875.0000 H5Dwrite(83886080,50331690,67108866,67108868,167772178)
1378057875.00000 MPI_File_set_view(fh=40052760, disp=262144, 
etype=MPI_BYTE, filetype=, datarep=native, info=-603979774)
1378057875.00000 write(uverbs0, void *buf, 48)
1378057875.00000 MPI_File_write_at_all(fh=40052760, offset=0, buf=*buf, 
count=33554432, datatype=MPI_BYTE, status=-172795264)
1378057875.00000 write(uverbs0, void *buf, 48)
...
1378057875.00000 write(sample_dataset.h5, void *buf, 1835008)
1378057875.00000 write(uverbs0, void *buf, 48)
...
1378057875.00000 write(sample_dataset.h5, void *buf, 2097152)
...
Figure 5.5: VPIC-IO write operations, rank 0, tuned parameters
Darshan counter                Value
Number of independent opens    0
Number of independent reads    72
Number of independent writes   7
Number of collective opens     16
Number of collective reads     8
Number of collective writes    8
Table 5.3: Darshan MPI-IO counters for the simple HDF5 benchmark on eight cores
multiple small independent read operations that were not expected. Looking at other parts
of the trace files, we became aware of a performance bottleneck in the current HDF5 library
that was confirmed by The HDF Group.
The majority of the 72 independent read operations were due to HDF5 metadata operations,
such as reading the file's superblock; getting the root group's object header, B-tree,
and local heap; retrieving information for the dataset; and so on. While accessing this
metadata is necessary, the current implementation introduces a performance bottleneck as
the number of ranks increases, because all processes perform the same metadata read
operations even when they are operating collectively. The HDF Group is working on a
feature called collective metadata reads, which eliminates this duplication by letting
one process perform the metadata reads and share the results with the other processes.
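For reference, in HDF5 releases that postdate this work (1.10 and later), this feature is
exposed on the file access property list; a minimal sketch:

hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
H5Pset_all_coll_metadata_ops(fapl, 1); /* metadata reads performed collectively:
                                          one process reads, results are shared */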
This case study demonstrates the value of Recorder traces to end users, as they provide the
details needed to understand how a simple call at the HDF5 level may result in numerous
I/O accesses at lower levels of the stack. Without this understanding, especially when
1379881178.00000 H5Dopen2(16777216,DS1,0)
1379881178.00000 MPI_File_read_at(fh=34945416, offset=680, buf=*buf, 
count=512, datatype=MPI_BYTE, status=1330219344) 
1379881178.00000 read(20, buf, 512) 
1379881178.00000 MPI_File_read_at(fh=34945416, offset=136, buf=*buf, 
count=544, datatype=MPI_BYTE, status=1330219856) 
1379881178.00000 read(20, buf, 544) 
1379881178.00000 MPI_File_read_at(fh=34945416, offset=42416, buf=*buf, 
count=328, datatype=MPI_BYTE, status=1330219232) 
1379881178.00000 read(20, buf, 328) 
1379881178.00000 MPI_File_read_at(fh=34945416, offset=42144, buf=*buf, 
count=512, datatype=MPI_BYTE, status=1330221072) 
1379881178.00000 read(20, buf, 512) 
1379881178.00000 H5Dget_space(83886080)
1379881178.00000 H5Sget_simple_extent_dims(67108866,1330223856,0)
Figure 5.6: Independent MPI-IO read operations issued by all processes to read
HDF5 metadata
performance is not as expected, it is almost impossible to identify the source of problems.
Armed with this understanding, application developers can see the importance of tuning
I/O across all the levels of the stack.
5.3 Conclusions
This chapter presents our first steps toward a full-featured multi-level I/O tracing framework
that we believe will help end users and library developers diagnose bottlenecks and optimize
performance throughout the parallel I/O stack. Recorder is built as a dynamic library so
it does not require any modification or recompilation of the application. Early case studies
have shown it to be very useful in performing in-depth analyses of I/O activity at the HDF5,
MPI-IO, and POSIX levels of the parallel I/O stack. In one case, output from Recorder made
a bottleneck in the current HDF5 library’s implementation of metadata reads apparent to
an end user who was previously unaware of the implementation details.
CHAPTER 6
AUTOMATIC GENERATION OF I/O KERNELS
FOR HPC APPLICATIONS
Efforts toward increasing the I/O performance of current HPC platforms, with their large
degree of parallelism, fall into several categories: algorithms such as data sieving and
collective I/O [5]; manual optimizations based on I/O expertise [34]; and autotuning,
i.e., systematically searching for good I/O configurations [28]. All of these techniques
can be helped by I/O profiling and tracing.
I/O kernels are typically built manually, which is a time-consuming and error-prone pro-
cess. An alternative is to record a trace of the I/O operations and “replay” the trace.
Existing systems capture traces mostly at the level of POSIX I/O and MPI-IO calls [35–37].
Ganger [38] explains the limitations of this approach.
In the previous chapter, we implemented a multi-level I/O tracing framework called
Recorder [39]. It captures parallel I/O function calls at multiple levels of the parallel
I/O stack, including HDF5, MPI-IO, and POSIX I/O. Having such a tracing framework enabled
us to compare and contrast the ability to replay and/or generate I/O kernels from the
traces at every level. We concluded that tracing the HDF5 library significantly eases
the process of creating a standalone parallel I/O kernel. Every object in HDF5, such as a
file, dataspace, dataset, attribute, or group, has a unique integer identifier, so it is
easy to keep track of these objects in the generated code. Additionally, most HDF5 I/O
operations are called collectively, which eases the process of merging traces across ranks.
Since the HDF5 calls determine the calls at the lower levels, capturing only these calls
causes no loss of information.
This chapter focuses on automating the creation of an I/O skeleton code: a code that
issues the same (HDF5) I/O calls as the original program while shedding the details of the
computation. Furthermore, we wish to do so without requiring access to the program's
source, because in many situations we can run the program but have access only to its
binary.
We start with trace files generated by an instrumented HDF5 library at each process; the
traces are merged into one file, and the order of the I/O operations is preserved by
merging these traces correctly. The skeleton program is generated from this merged trace
file.
A naive application of this algorithm would yield a kernel program of length proportional
to the total length of the trace files. However, simple pattern-matching-based compression
techniques can reduce the size of the kernel code: it is often possible to generate a
kernel code of length proportional to the number of HDF5 calls in the original code.
The remainder of this chapter is organized as follows: We present our framework in
Section 6.1, Section 6.2 presents the results of our experiments, and Section 6.3 provides
conclusions and discusses future work.
6.1 Framework
Figure 6.1: Flow of the Framework
This section introduces our framework for automatically generating I/O kernels from HPC
applications. Figure 6.1 shows the overall flow of this framework. An HPC application is
first linked with our Recorder library, which stores traces of all the I/O calls in a
separate trace file for each MPI rank. Therefore, after running the application, n trace
files are generated, where n is the number of MPI processes. These n trace files are fed
into a merging tool. Once the traces are merged into a single trace file, the code
generator produces an SPMD MPI-based application from it. Each of these three steps is
explained in detail in the following subsections.
6.1.1 I/O Tracing: Recorder
Figure 6.2 shows the process of intercepting an HDF5 function call (H5Fcreate(), used
for creating an HDF5 file). When Recorder is in use, it intercepts HDF5 function calls
issued by the application and reroutes them to the tracing implementation, where the
timestamp, function name, function parameters, return values, and the duration of the call
are recorded. The original HDF5 function is called after this recording step. This tracing
approach is transparent to the user because the interception requires no change to
application or library source code. Recorder can be built as a shared library and linked
to the application at runtime, so it does not require modification or recompilation of the
application. It can also be built statically using the linker's -wrap functionality.
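A minimal sketch of that static variant using the GNU linker's --wrap option (the
__wrap_/__real_ symbol names follow the linker's convention; the build command is
illustrative):

/* Linked with: mpicc app.o recorder.o -Wl,--wrap=H5Fcreate ... */
hid_t __real_H5Fcreate(const char *name, unsigned flags,
                       hid_t fcpl_id, hid_t fapl_id);

hid_t __wrap_H5Fcreate(const char *name, unsigned flags,
                       hid_t fcpl_id, hid_t fapl_id)
{
    /* record timestamp, function name, and arguments here ... */
    return __real_H5Fcreate(name, flags, fcpl_id, fapl_id);
}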
Figure 6.2: Method of intercepting HDF5 calls by the Recorder. [Diagram: the
application's call H5Fcreate("sample_dataset.h5", H5F_ACC_TRUNC, H5P_DEFAULT,
plist_id) is rerouted to Recorder, which records the timestamp, function name,
and arguments, then calls the real H5Fcreate in the unmodified HDF5 library
before control returns to the rest of the application.]
6.1.2 Trace Merging
The output of the Recorder is n log files, where n is the number of MPI processes. Before
the merging process, these files are converted into a uniform format that is easier to
process. The merger takes these log files as input and outputs one file containing the
merged traces. We expect that many HDF5 calls are collective; therefore, the merge process
attempts to identify tuples of records, one from each file, that have the same "signature"
(we discuss how signatures are matched below). We assume that matching records appear at
roughly the same location in the different files (i.e., that the I/O operations executed by
different processes are similar). Therefore we keep, within each trace file, a fixed-size
window of records under consideration for merging. Note that incomplete matching affects
the size of the merged trace, but not its correctness.
We keep track of our position in each file t_i using a pointer p_i. If all the records
currently pointed at have the same signature, we merge them. If not, we pick from the
current tuple the record whose matches in the other files are farthest away and move it to
the merged file. The intuition behind this heuristic is that we want to merge traces
greedily, merging the largest number of records at each step. Hence, if a record has a
match in another file that is close by, we prefer to keep it until we can merge it with
its matches. This case is illustrated in Figure 6.3.
Figure 6.3: Illustrating three consecutive merging operations
In order to quickly find the matches of a given record, a hash table keeps track of all
the records from all the files within a window of m consecutive records. In our
experiments we fixed m at 200. In order to explain our data structure, we define the
following terms:
structure, we define the following terms:
• Let n be the number of MPI processes.
• Let e be a record in a trace file. It has the function name, arguments and other
recorded information.
• The signature of a record e is defined by a function k(e). Signatures are chosen so that
records have the same signature if and only if they match and can be merged. Matching
records have the same function name and same values for significant parameters. The
set of significant parameters is different from function to function and are tagged at
a trace preprocessing step. An example of a significant argument is the MPI commu-
nicator since the difference in the communicator clearly indicates the two traces come
from different MPI calls. An example of non-significant argument is pointer addresses
since they are always different from function to function.
• Let δ(e) be the smallest distance to a matching record of e in another file.
The hash table uses a hash of k(e) to store records. For each value of k(e), we maintain,
for each MPI process, a linked list of the records with that signature within the window.
The records in each linked list are kept in an order consistent with their order in the
trace files. We also store, with each record, the current value of δ(e) within the window.
Assuming that adding, removing, or searching for an entry in the hash table takes constant
time, we can update δ in constant time when an entry is added or removed. Similarly,
whenever we need δ(e), we query the hash table at k(e), so this too takes constant time.
The algorithm considers the current record in each file and merges only if they all match
each other. Otherwise, we pick the record e with the largest value of δ(e) and move it to
the merged file.

Algorithm 1 shows high-level pseudocode of the merging algorithm. First, we initialize all
the pointers p_i to zero, referring to the first entry of each log file. Then we insert
the first m entries of each file into our hash table h. Until we reach the end of all the
trace files, we repeat the merging step, which executes the two cases explained above.
Input: Trace files to be merged
Output: File containing the merged trace (OF)
Variables:
p_i = current position in the i-th trace file
m = maximum window depth
n = total number of trace files
t_i = i-th trace file, 0 < i ≤ n
h = the hash table data structure
Pseudo Code:
Initialize all p_i to zero.
Insert the first m entries of each t_i into h.
while ∃i : p_i ≠ t_i.size() do
    if all events pointed to by the p_i match then
        MergeEvents(e_{p_i} for any i)
    else
        ∀i : δ_i = GetDistance(e_{p_i})
        δ_max = max_i(δ_i)
        MergeEvents(e_{p_j} for the j corresponding to δ_max)
    end if
end while
Algorithm 1: Merging Algorithm
Recall that each event is added to and removed from the data structure once, each in
constant time. While choosing the candidate to emit at each step, an event is accessed to
find its δ distance value until it is finally merged and emitted.
Input: Trace event e
Pseudo Code:
h2 = h(k(e)).h2
Pop all events e_j at the head of h2.list[j] and recompute h2.δ
Merge the e_j into one entry and output it to OF.
Increment all p_i by 1.
Insert new entries from the t_i into h.
Algorithm 2: MergeEvents method
If an event has no match in the current window, it is emitted immediately, since its
distance is infinite; otherwise, the number of accesses to such an event is bounded by the
constant window size. Since each access also takes constant time, the run-time complexity
of our algorithm is linear in the total number of events across all processes. In
practice, since the number of distinct I/O functions is small, an event quickly finds its
matches and is merged before the window limit is reached.
6.1.3 Code Generation
Once the merged trace is created by the merger, a code generator produces a compilable
Single-Program, Multiple-Data (SPMD) code from it. The merged records, which are called
collectively by all the processes, are generated straightforwardly in the I/O kernel. For
the unmerged functions, there are a number of ways to differentiate what each process does
(a sketch of the third option appears after this list):
1. Using conditions: The most straightforward solution is to use an if-else
statement and put each rank's operations in its corresponding if clause. The problem
with this approach is that the code length is proportional to the number of processes,
so for large-scale experiments the generated code would be very large.
2. Using memory: The second solution is to trade memory for code size: for every
number or array that differs across ranks, a new dimension is added, indexed by the
rank of the MPI process.
Input: Trace event e at p_i
Output: Distance value δ(e)
Pseudo Code:
return h(k(e)).δ
Algorithm 3: GetDistance method
This solution decreases the size of the generated code significantly, but it has
downsides, such as requiring extra memory and replicating this data in the memory of
each MPI process.
3. Identifying the relationship with the MPI rank: In most cases, there is a
simple relationship between the file offsets a process accesses and the rank of that
process. The code generator can therefore try to identify this pattern; a symbolic
math library can be used for this purpose. Currently, our code generator first checks
for such a relationship, specifically for each dimension of each of the arguments
that the merger has flagged as differing across ranks. If it cannot find one, it
falls back to the second solution and uses memory to differentiate the values for
different processes.
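As an illustration of the third option, the following is a hypothetical excerpt of
generated code for a 2D dataset decomposed in blocks along its first dimension;
identifiers such as dset, memspace, and buf are assumed to be set up earlier in the
generated program:

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

/* The generator inferred start = rank * 6 for the first dimension. */
hsize_t start[2] = { (hsize_t)(rank * 6), 0 };
hsize_t count[2] = { 6, 24 };                  /* identical on every rank */

hid_t filespace = H5Dget_space(dset);
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
H5Dwrite(dset, H5T_NATIVE_INT, memspace, filespace, H5P_DEFAULT, buf);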
6.2 Setup and Evaluation Results
In this section we discuss how our framework performs for different applications at
different scales. All the traces were gathered on the Stampede Dell cluster at the Texas
Advanced Computing Center (TACC).
The three I/O kernels described previously are used for this purpose. For each I/O
kernel, we used our framework to generate a replayable I/O kernel and compared it to the
original. An advantage of these experiments is that they enable a fair comparison of the
I/O calls of the original kernel with those of the generated kernel. We ran all three
benchmarks on 2048 cores of Stampede, each generating an output file of about 500 GB.
6.2.1 Correctness of the framework
Figure 6.4(a) compares selected POSIX I/O counters of the original and generated VPIC-IO
kernels. These counters are derived using Darshan [40]. As can be seen, the numbers are
exactly the same for the two kernels. The output files produced by the generated kernel
are also correct, verified both by size and with the h5dump utility.
Figures 6.4(b) and 6.4(c) show the same comparison for the VORPAL-IO and GCRM-IO kernels,
respectively. In these cases the framework was also able to generate the same output file
as the original application, both in terms of size and HDF5 file format.
Figure 6.4: Comparison of Darshan POSIX I/O counters (CP_POSIX_READS,
CP_POSIX_WRITES, CP_POSIX_OPENS, CP_POSIX_SEEKS) of the original application and
the one generated by the framework: (a) VPIC-IO, (b) VORPAL-IO, (c) GCRM-IO
6.2.2 Quality of the generated code
The previous subsection established the correctness of the framework for the three
benchmarks; in this subsection we look at the quality of the generated code, as
represented by its size. Table 6.1 compares the size of the original I/O benchmark source
code with that of the code generated by our framework. For VPIC-IO and GCRM-IO, the
generated code is of a size proportional to the original code (the original GCRM-IO code
has more options, which makes it larger than the generated benchmark). VORPAL-IO,
however, has much larger generated source code. The reason is the complex relationship
between the starting addresses of the 3D blocks assigned to the processes and their MPI
ranks. The code generator was unable to find this relationship and had to fall back to
using memory (solution #2) in order to generate correct code; since 2048 cores were used
for these experiments, the code for initializing this array accounts for the large source
size. It is easy, though, for the program developer to supply this relationship in the
generated code and reduce its size.
I/O Benchmark   Original Code   Generated Code   Generated Code with User's Help
VPIC-IO         8 KB            8 KB             8 KB
VORPAL-IO       12 KB           616 KB           36 KB
GCRM-IO         36 KB           12 KB            12 KB
Table 6.1: Comparison of the code size of original and generated benchmarks
6.3 Conclusions
The use of I/O kernels is becoming more popular in the HPC community. High-level I/O
libraries are also gaining popularity, as they offer higher productivity, good
performance, and simpler coding. There have been several efforts to automatically trace
and replay the I/O operations of applications, but they have mostly focused on the
lower-level layers of the I/O stack, which are much more complex. In this work, we showed
that it is easier to trace a full application and generate I/O kernels at the level of
higher-level I/O libraries such as HDF5. The framework consists of a Recorder library
that traces the high-level I/O operations, a merger tool that merges the traces recorded
on each process, and a code generator that produces the I/O skeleton application from the
merged trace. We have shown the applicability of this framework for three I/O kernels
with very different I/O patterns.
CHAPTER 7
PATTERN-DRIVEN PARALLEL I/O TUNING
In the prior chapters, we have shown the effectiveness of I/O tuning at multiple layers of
tunable parameters using genetic algorithms. We have improved the configuration search
process significantly by developing an empirical performance prediction model for a selection
of I/O kernels derived from real scientific simulations. Despite these efforts, the challenge
of tuning an arbitrary I/O phase at runtime in a simulation remains an open issue. For
instance, when a simulation needs to perform a large write operation, an I/O autotuning
framework is required to identify the characteristics of the write operation, to find optimal
tunable parameters, and to apply them at runtime without the need to stop the simulation
for recompiling the simulation code with the optimal configurations.
In this chapter, we address these requirements. We first define high-level I/O patterns
to characterize write operations. We use our tracing library to collect high-level I/O
calls, such as HDF5 data model definition and write calls; this library uses binary
instrumentation to redirect a set of HDF5 calls and collect the required information. We
analyze these traces to obtain the I/O pattern of a simulation's I/O phase. We then match
the pattern against previously tuned I/O kernels to obtain their optimal configurations,
and we provide a runtime library to apply the selected configuration without recompiling
the code. If no matching previously tuned pattern is available, we use our empirical
prediction model to find tuning parameters offline and store them in the database for
future use.
Overall, this chapter makes the following contributions:

• We provide a new representation for I/O patterns based on the traces of high-level I/O
libraries such as HDF5. This representation captures the global view of I/O accesses
from all MPI processes in a parallel application.

• We develop a trace analysis tool for automatically identifying the I/O patterns of an
application.

• We show that, using our runtime library, users can achieve a significant portion of the
peak I/O performance for arbitrary I/O patterns.
The remainder of the chapter is structured as follows: In Section 7.1, we introduce our
auto-tuning framework and present the functions of various components in the framework.
We describe our experimental setup to test the framework and to evaluate performance
improvement in Section 7.2. We finally conclude the discussion in Section 7.3.
7.1 I/O Autotuning Framework
Figure 7.1 illustrates an overview of our proposed I/O autotuning framework. It consists
of two phases. The first is the tuning phase, which extracts the I/O pattern of an
application. Once a pattern is extracted, a look-up step queries a database of patterns
and their corresponding tuned configurations for the best I/O performance. If the pattern
is found, the tuned parameters associated with it are stored in an XML file. In the
adoption phase, the application is dynamically linked with our H5Tuner library, which
applies the tuning parameters in the XML file at runtime.
Figure 7.1: An overview of our I/O autotuning framework. [Diagram: in the tuning
phase, the application's I/O kernel and pattern are extracted and looked up among
pairs of patterns and tuned parameters; if the pattern has not been tuned before,
model-based tuning is performed. The resulting tuned parameter set (an XML file)
is used in the adoption phase, where the application runs with the H5Tuner
dynamic library on the HPC system to produce the HDF5 file.]
Our previous work [28, 41] describes the adoption phase in detail. This chapter describes
the tuning phase of the framework: detecting an I/O pattern and matching it against the
history of tuned patterns. To simplify the description of these components, we use a
sample parallel HDF5 application distributed with the HDF5 source code, called
pH5Example. The code creates two two-dimensional HDF5 datasets and writes them to a file.
7.1.1 I/O Traces
To automatically extract the I/O activities of an application, we first need to capture
the characteristics of the I/O operations it performs. The I/O trace of an application is
used toward this end. In our previous work, we developed a multi-level I/O tracing tool
called Recorder [39]; it uses dynamic library preloading to intercept I/O functions at
different levels of the I/O stack. We observe that the best level of the I/O stack at
which to define I/O patterns is that of the high-level I/O libraries such as HDF5.
Therefore, we use the Recorder to capture all the HDF5 I/O operations of an application.
At the end of one run of the application on P processes, P trace files are generated by
the Recorder library. Figure 7.2 shows the trace file for process 0 of a four-process run
of the pH5Example code. The traced function calls first create an HDF5 file (named
"ParaEg0.h5") and two datasets (named "Data1" and "Data2"); each process then
selects a hyperslab of these datasets, writes its data to them, and closes the file.
The following subsection discusses how we use the information in the trace files to
derive the I/O pattern of the application.
7.1.2 Extraction and Identification of High-level I/O Patterns
To perform automatic tuning of large dataset writes, we first need to identify the I/O
pattern of the write operation. We define these patterns by observing the high-level I/O
library calls, i.e., the HDF5 calls.
As mentioned previously, high-level I/O libraries give us much more information with
which to define and distinguish the ways different applications conduct their I/O. A
central example is the concept of selection in HDF5. Selection is an important and
powerful feature of the HDF5 library that lets developers select different parts of a
file and different parts of memory on which to conduct I/O operations; it is also the
main mechanism by which the processes of a parallel I/O application choose different
parts of the file. We therefore base our definition of I/O patterns on the concept of
selection. In summary, we define the I/O pattern of an application as a coverage of the
datasets based on the selections made on them.
1396296304.23583 H5Pcreate (H5P_FILE_ACCESS) 167772177 0.00003
1396296304.23587 H5Pset_fapl_mpio (167772177,MPI_COMM_WORLD,
469762048) 0 0.00025
1396296304.23613 H5Fcreate (output/ParaEg0.h5,2,0,167772177) 16777216 
0.00069
1396296304.23683 H5Pclose (167772177) 0 0.00002
1396296304.23685 H5Screate_simple (2,{24;24},NULL) 67108866 0.00002
1396296304.23688 H5Dcreate2 (16777216,Data1,H5T_STD_I32LE,
67108866,0,0,0) 83886080 0.00012
1396296304.23702 H5Dcreate2 (16777216,Data2,H5T_STD_I32LE,
67108866,0,0,0) 83886081 0.00003
1396296304.23707 H5Dget_space (83886080) 67108867 0.00001
1396296304.23708 H5Sselect_hyperslab (67108867,0,{0;0},{1;1},
{6;24},NULL) 0 0.00002
1396296304.23710 H5Screate_simple (2,{6;24},NULL) 67108868 0.00001
1396296304.23710 H5Dwrite (83886080,50331660,67108868,67108867,0) 0 
0.00009
1396296304.23721 H5Dwrite (83886081,50331660,67108868,67108867,0) 0 
0.00002
1396296304.23724 H5Sclose (67108867) 0 0.00000
1396296304.23724 H5Dclose (83886080) 0 0.00001
1396296304.23726 H5Dclose (83886081) 0 0.00001
1396296304.23727 H5Sclose (67108866) 0 0.00000
1396296304.23728 H5Fclose (16777216) 0 0.00043
Figure 7.2: A sample I/O trace generated by the Recorder for a simple parallel
application called pH5Example
In HDF5 terminology, hyperslabs are portions of datasets: either a logically contiguous
collection of points in a dataspace, or a regular pattern of points or blocks in a
dataspace. In a parallel HDF5 program, each process defines both a memory and a file
hyperslab, and then executes a partial read/write [42]. Hyperslabs are selected using the
H5Sselect_hyperslab function, which takes four array parameters: start, stride,
count, and block. The start array specifies the starting location of the hyperslab;
the stride array specifies the distance between two consecutive selected elements or
blocks; the count array specifies the number of elements or blocks to select; and the
block array specifies the size of the block selected from the dataspace.
To be concrete, we illustrate the definition of I/O patterns with an example application
used in this chapter. Figure 7.3 shows the four hyperslab selections of a parallel
four-process run of pH5Example.
Rank 0: H5Sselect_hyperslab (...,H5S_SELECT_SET,{0;0},{1;1},{6;24},NULL)
Rank 1: H5Sselect_hyperslab (...,H5S_SELECT_SET,{6;0},{1;1},{6;24},NULL)
Rank 2: H5Sselect_hyperslab (...,H5S_SELECT_SET,{12;0},{1;1},{6;24},NULL)
Rank 3: H5Sselect_hyperslab (...,H5S_SELECT_SET,{18;0},{1;1},{6;24},NULL)

Function Signature:
herr_t H5Sselect_hyperslab(hid_t space_id, H5S_seloper_t op, const
hsize_t *start, const hsize_t *stride, const hsize_t *count, const
hsize_t *block)

Figure 7.3: The four HDF5 hyperslab selection function calls across different
ranks of a parallel four-process run of pH5Example
As can be seen, all the processes call the same function with the same arguments except
for start. The start arrays are {0, 0}, {6, 0}, {12, 0}, and {18, 0}, while the
count arrays on all ranks are {6, 24}. The calls specify that the 2D dataset is
decomposed along the first dimension, with each process accessing a distinct horizontal
slice.
To abstract these patterns, we use the array distribution notation of High Performance
Fortran (HPF) [43]. HPF uses data distribution directives to help the programmer
distribute data between processes. Among these directives, the DISTRIBUTE directive
specifies the partitioning of array data onto an abstract processor array. The basic
distributions are BLOCK, CYCLIC, and DEGENERATE, and a different distribution can be used
for each dimension. Below is a short description of each:
1. Block Distribution: In a block distribution, each process gets a single contiguous
block of the array.

2. Cyclic Distribution: In a cyclic distribution, array elements are distributed in a
round-robin manner: the first element goes to the first process, the second element to
the second process, and so on.
3. Degenerate Distribution: Represented by *, this is no distribution (serial
distribution): all the elements of the dimension are assigned to one processor.
Applying this terminology to pH5Example is straightforward. There is one HDF5 dataspace
in the whole application, created by H5Screate_simple(); it is a 2D dataspace of size
24 × 24. Two datasets, named Data1 and Data2, are created on this dataspace. Each rank
then selects its own decomposition of the space and creates a memory dataspace of the
size of the selected set. Finally, there are two H5Dwrite() calls, writing to Data1 and
Data2. Using the HPF terminology, we can abstract pH5Example as follows:
• pH5Example:
<2D, (BLOCK, *), (6, 24)>
<2D, (BLOCK, *), (6, 24)>
The advantage of this representation is that it is succinct enough to be stored in a
key-value store serving as the I/O pattern repository. Currently, we use text files to
store the patterns, without requiring a global database. As the number of patterns grows,
however, a key-value store database can be used to associate the patterns with their I/O
performance models. The schema of this database should include the dimensionality of each
pattern, its decomposition, its sizes, and the corresponding I/O performance model.
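As an illustration, a hypothetical C record for one repository entry (the field names are
ours, not part of the framework):

#include <stdint.h>

typedef enum { DIST_BLOCK, DIST_CYCLIC, DIST_DEGENERATE } dist_t;

#define MAX_DIMS 4

typedef struct {
    int      ndims;             /* dimensionality of the dataspace         */
    dist_t   dist[MAX_DIMS];    /* per-dimension distribution              */
    uint64_t count[MAX_DIMS];   /* per-process selection size (elements)   */
    char     model_key[64];     /* key of the associated performance model */
} io_pattern_t;

/* e.g., <2D,(BLOCK,*),(6,24)> becomes:
 *   { 2, { DIST_BLOCK, DIST_DEGENERATE }, { 6, 24 }, "pH5Example" } */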
7.2 Setup and Evaluation Results
We conducted all the experiments presented in this chapter on two platforms, Edison and
Hopper. We chose several I/O benchmarks and kernels: Vector Particle-In-Cell (VPIC-IO),
VORPAL-IO, Global Cloud Resolving Model (GCRM-IO), and FLASH-IO. We briefly describe
their I/O patterns below.
Figures 7.4(a)-7.4(c) show the I/O accesses of the three applications considered in this
work, expressed as ranges based on the four parameters of the hyperslab selection.
VPIC-IO is a one-dimensional application, while VORPAL-IO and GCRM-IO have
three-dimensional I/O accesses. Each process writes the same amount of data (identical
count arrays) and accesses a different part of the file in parallel (different start
arrays).
In the VPIC-IO benchmark, each process writes a contiguous 8 MB block of data, one after
the other. This is a very common and simple I/O pattern, and we will see below how it is
abstracted.
(a) VPIC-IO — [start, stride, count, block]:
P0 = [ {0}, {1}, {8 M}, {0} ]
P1 = [ {8 M}, {1}, {8 M}, {0} ]
P2 = [ {16 M}, {1}, {8 M}, {0} ]
...

(b) GCRM-IO — [start, stride, count, block]:
P0 = [ {0,0,0}, {1,1,1}, {1,26,327680}, {0,0,0} ]
P1 = [ {0,0,327680}, {1,1,1}, {1,26,327680}, {0,0,0} ]
P2 = [ {0,0,655360}, {1,1,1}, {1,26,327680}, {0,0,0} ]
...

(c) VORPAL-IO — [start, stride, count, block]:
P0 = [ {0,0,0}, {1,1,1}, {60,100,300}, {0,0,0} ]
P1 = [ {0,0,300}, {1,1,1}, {60,100,300}, {0,0,0} ]
P2 = [ {0,100,0}, {1,1,1}, {60,100,300}, {0,0,0} ]
...

Figure 7.4: I/O patterns of the (a) VPIC-IO, (b) GCRM-IO, (c) VORPAL-IO
benchmarks
A more complex access pattern is GCRM-IO's. It is a 3-dimensional I/O benchmark
decomposed along only one dimension, as Figure 7.4(b) shows. Since only one dimension is
decomposed in GCRM-IO, the full dimension size appears in the count array for the other
two dimensions, and their start values are 0.
The I/O benchmark with the most complex pattern is VORPAL-IO. It writes a 3-dimensional
grid with a 3-dimensional decomposition along each of the dimensions. The size of the
block each process writes is fixed, so the count array is the same for every process,
but the processes have different values along the three dimensions of the start array.
Using the notation described in Section 7.1, we can represent our three applications as
below:
• VPIC-IO:
<1D,BLOCK,8388608>
<1D,BLOCK,8388608>
... (5 more times) ...
<1D,BLOCK,8388608>
• GCRM-IO:
<3D,(*,*,BLOCK), (1,1,327680)>
<3D,(*,*,BLOCK), (1,1,327680)>
... (7 more times) ...
<3D,(*,*,BLOCK), (1,1,327680)>
• VORPAL-IO:
<3D,(BLOCK,BLOCK,BLOCK),(60,100,300)>
<3D,(BLOCK,BLOCK,BLOCK),(60,100,300)>
... (17 more times) ...
<3D,(BLOCK,BLOCK,BLOCK),(60,100,300)>
We now present our results in four subsections. Note that for these results we use the
models developed in our previous work [41]; no new tuning was performed for any
application here, as we reuse the models developed there.
7.2.1 An application with the same I/O pattern
To have IOR issue write patterns similar to VPIC-IO's, we configured it to use its HDF5
interface. Since VPIC-IO writes 8 datasets, we configured IOR accordingly: eight segments
(-s 8), write mode (-w), a 32 MB blockSize (-b 32m), and a 32 MB transfer size
(-t 32m).
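A hypothetical invocation along these lines (the binary path and output file name are
illustrative):

ior -a HDF5 -w -s 8 -b 32m -t 32m -o testFile.h5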
Figure 7.5(a) shows the performance of the autotuned configuration proposed for IOR
(which has the same pattern as VPIC-IO) on 512 and 4096 cores of Hopper and Edison, from
[41]. As mentioned before, no modeling effort was spent on this application, and yet we
obtain up to 4.21 GB/s and 15.01 GB/s on 512 and 4096 cores of Hopper, respectively; on
Edison the numbers are 9.34 GB/s and 16.70 GB/s.
7.2.2 An application with similar I/O pattern
Resemble-VORPAL-IO is a synthetic benchmark generated by a record-and-replay framework
[44]. It has an I/O pattern very similar to the VORPAL-IO benchmark, but with different
block sizes: 64 × 128 × 256 instead of VORPAL-IO's 60 × 100 × 300.
[Figure 7.5 appears here: three bar charts, (a) IOR, (b) Resemble-VORPAL-IO, and
(c) FLASH-IO, comparing the I/O bandwidth (GB/s) of the default and autotuned
configurations on 512 and 4096 cores of Hopper and Edison.]
Figure 7.5: The I/O performance of the autotuned (a) IOR, (b) Resemble-
VORPAL-IO, and (c) FLASH-IO applications on Hopper and Edison, compared
to the default configuration.
The purpose of these experiments is two-fold: (a) to show that applications whose I/O
patterns differ only slightly in block sizes can use the same I/O configuration and still obtain
good I/O performance; and (b) to show that accepting a similarity threshold between I/O
patterns can save a dramatic amount of I/O tuning time.
Figure 7.5(b) shows the performance of the autotuned configuration proposed in [41] for
Resemble-VORPAL-IO on 512 and 4096 cores of Hopper and Edison. As in the previous
experiment, no modeling effort was done for this application, and yet we are able to reach
up to 3.32 GB/s and 7.89 GB/s on 512 and 4096 cores of Hopper, respectively. On Edison,
the highest bandwidths achieved by this mechanism were 8.75 GB/s and 13.07 GB/s on the
same core counts.
7.2.3 A new application
The last experiment is designed to test an arbitrary application that has not been tuned
before. For this experiment, we chose a well-known I/O kernel called FLASH-IO, because
it is popular in the HPC I/O community and also hard to tune. As in the previous
experiments, we ran FLASH-IO at two scales, 512 and 4096 cores, on Hopper and Edison.
The way we calculate bandwidth for this application differs slightly from the others, because
it produces three files: we define bandwidth as the sum of all the output sizes divided by
the runtime of the whole I/O benchmark, which is a conservative way of defining an
application's I/O bandwidth.
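For example, if the three output files have sizes s1, s2, and s3 bytes and the whole benchmark takes t seconds, the reported bandwidth is simply (s1 + s2 + s3) / t, regardless of how the I/O time is distributed among the three files.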
FLASH-IO differs from the other applications we have looked at mainly because it
writes many datasets with different I/O patterns. To overcome this problem, the framework
considers the largest datasets by size and looks up their patterns in the database. Based
on the output of the H5Analyze tool, FLASH-IO has 34 datasets, 24 of which share the
largest size in the file; on 4096 cores, this is about 40 GB per dataset. These datasets are
4-dimensional, and their patterns are identical: <BLOCK, DEGENERATE, DEGENERATE,
DEGENERATE>. Although no exact match for this pattern exists in the database, GCRM-IO
has the most similar pattern, and therefore the framework uses the configurations proposed
for GCRM-IO.
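A minimal sketch of this lookup step follows. The similarity measure (prefer an exact match on dimensionality and decomposition, otherwise rank the whole database by block volume) is our illustration of the idea rather than the framework's exact metric, and the FLASH-IO block sizes are hypothetical.

def block_volume(block):
    v = 1
    for b in block:
        v *= b
    return v

def closest_pattern(new, database):
    ndims, decomp, block = new
    exact = [p for p in database if p[:2] == (ndims, decomp)]
    pool = exact or database  # no exact match: rank the whole database
    return min(pool, key=lambda p: abs(block_volume(p[2]) - block_volume(block)))

database = [
    (1, ('BLOCK',), (8388608,)),                       # VPIC-IO
    (3, ('*', '*', 'BLOCK'), (1, 1, 327680)),          # GCRM-IO
    (3, ('BLOCK', 'BLOCK', 'BLOCK'), (60, 100, 300)),  # VORPAL-IO
]
flash = (4, ('BLOCK', 'DEGENERATE', 'DEGENERATE', 'DEGENERATE'),
         (80, 16, 16, 16))                # hypothetical block sizes
best = closest_pattern(flash, database)   # -> the GCRM-IO entry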
Figure 7.5(c) shows the performance of the autotuned configuration proposed by our
framework for FLASH-IO, based on the GCRM-IO model, on 512 and 4096 cores of Hopper
and Edison. As in the previous experiments, no modeling effort was done for this application,
and yet we are able to reach up to 2.09 GB/s and 5.95 GB/s on 512 and 4096 cores of Hopper,
respectively. On Edison, the highest bandwidths achieved by this mechanism were 3.34 GB/s
and 8.23 GB/s on the same core counts.
7.3 Conclusions
Poorly tuned parallel I/O is a major performance bottleneck in HPC applications that
need to write or read data. This is not due to any inherent incapability of the I/O subsystems,
but mainly to the complexity of tuning them. In this chapter, we proposed a pattern-driven
autotuning framework to solve this problem. The framework consists of components to
extract I/O patterns, tune configurations for the detected patterns, store them in a database
of patterns associated with their I/O models, and finally map an arbitrary I/O pattern to
a previously tuned model in order to improve its I/O performance. We showed that, using
these patterns, one can tune applications ranging from ones that have been tuned before,
to ones similar to previously tuned applications, to totally new ones.
CHAPTER 8
RELATED WORK
8.1 Autotuning
Autotuning is a prevalent term in computer science for improving the performance of
computational kernels. There has been extensive research in developing optimized linear
algebra libraries and matrix operation kernels using autotuning [45–51]. The search space in
these efforts involves optimization of CPU cache and DRAM parameters along with code
changes. All these autotuning techniques search various data structure and code transformations
using performance models of processor architectures, computation kernels, and compilers. In
contrast to tuning computational kernels, our study focuses on autotuning the I/O subsystem
for writing and reading data to a parallel file system.
There are a few key challenges unique to the I/O autotuning problem. Each function
evaluation in the I/O case takes on the order of minutes, as opposed to milliseconds for
computational kernels. Thus, an exhaustive search through the parameter space is infeasible,
and a heuristic-based search approach is needed. I/O runs also face dynamic variability and
system noise, while linear algebra tuning assumes a clean and isolated single-node system.
Moreover, the interactions between the various I/O parameters and their impact on performance
are not well studied, making the interpretation of tuned results a complex task.
We use genetic algorithms as a parameter-space search strategy. Heuristics and meta-heuristics
have been studied extensively for combinatorial optimization problems, as well as for code
optimization [52] and parameter optimization [53] problems similar to the one we address.
Among heuristic approaches, genetic algorithms seem particularly well suited to real-parameter
optimization problems, and a variety of literature details the efficacy of the approach [54–56].
A few recent studies have used genetic algorithms [57] and a combination of an approximation
algorithm with search-space reduction techniques [58]; both again target auto-tuning compiler
options for linear algebra kernels. We chose to implement a genetic algorithm to intelligently
traverse the sample space for each test case; we found that our approach produced
well-performing configurations after a suitably small number of test runs.
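As a hedged sketch of this idea, the loop below evolves I/O parameter tuples by elitist selection, crossover, and mutation. The parameter names and value ranges are illustrative, and evaluate() stands in for an actual timed I/O run of the benchmark.

import random

STRIPE_COUNTS   = [4, 8, 16, 32, 64, 128]  # Lustre stripe count
STRIPE_SIZES_MB = [1, 4, 16, 32, 64, 128]  # Lustre stripe size
CB_NODES        = [1, 2, 4, 8, 16, 32]     # MPI-IO collective-buffering aggregators
POOLS = (STRIPE_COUNTS, STRIPE_SIZES_MB, CB_NODES)

def random_config():
    return tuple(random.choice(pool) for pool in POOLS)

def crossover(a, b):
    return tuple(random.choice(genes) for genes in zip(a, b))

def mutate(cfg, rate=0.2):
    return tuple(random.choice(pool) if random.random() < rate else gene
                 for gene, pool in zip(cfg, POOLS))

def ga_search(evaluate, pop_size=16, generations=10):
    # evaluate(config) -> measured bandwidth; each call is a real I/O run
    pop = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        elite = sorted(pop, key=evaluate, reverse=True)[:pop_size // 4]
        pop = elite + [mutate(crossover(*random.sample(elite, 2)))
                       for _ in range(pop_size - len(elite))]
    return max(pop, key=evaluate)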
Various optimization strategies have been proposed to tune parallel I/O performance for
a specific application or I/O kernel. However, these are not designed for automatic tuning
of an arbitrary application and require manual selection of optimization strategies, whereas
our autotuning framework is designed to tune an arbitrary parallel I/O application. Hence,
we do not discuss an exhaustive list of such research efforts; we focus instead on comparing
our research with automatic performance tuning efforts.
There are a few research efforts to auto-tune and optimize resource provisioning and system
design for storage systems [59–61]. In contrast, our study focuses on tuning the parallel I/O
stack on top of a working storage system.
Autotuning of parallel I/O has not been studied at the same level as the tuning of
computation kernels. The Panda project [62, 63] studied automatic performance optimization
of collective I/O operations, in which all the processes of an application synchronize I/O
operations such as reading and writing an array. The Panda project searched for disk
layout and disk buffer size parameters using a combination of a rule-based strategy and
randomized search-based algorithms: the rule-based strategy is used when the optimal
settings are understood, and simulated annealing is used otherwise. The simulated annealing
problem is posed as a general minimization problem in which the I/O cost is minimized.
The Panda project also used genetic algorithms to search for tuning parameters [64]. The
optimization approaches proposed in this project applied to the Panda I/O library, which
predates MPI-IO and HDF5; Panda I/O is no longer in use, and its optimization strategy
was not designed for current parallel file systems.
Yu et al. [65] characterize, tune, and optimize parallel I/O performance on the Lustre file
system of Jaguar, a Cray XT supercomputer at Oak Ridge National Laboratory (ORNL).
The authors tuned the data sieving buffer size, the I/O aggregator buffer size, and the number
of I/O aggregator processes. This study did not propose an autotuning framework, but
manually ran a selected set of codes several times with different parameters. Howison et
al. [21] also performed manual tuning of various benchmarks, selecting parameters for HDF5
(chunk size), MPI-IO (collective buffer size and number of aggregator nodes), and Lustre
(stripe size and stripe count) on the Hopper supercomputer at NERSC. These two studies
show that tuning parallel I/O parameters can achieve better performance; in our study, we
develop an autotuning framework that selects the tuning parameters automatically.
You et al. [23] proposed an autotuning framework for the Lustre file system on the Cray
XT5 systems at ORNL. They search for the file system stripe count and stripe size, the I/O
transfer size, and the number of I/O processes. This study uses mathematical models based
on queuing theory: the autotuning framework first develops, in a training phase, a model
that approximates the real system, and then searches it for optimal parameters using search
heuristics such as simulated annealing and genetic algorithms. A queuing-theory model
developed for different systems can, however, stray far from the real system and produce
inaccurate performance predictions. In contrast, our framework searches for parameters
on the real system using search heuristics. A preliminary version of our autotuning framework
appears in earlier work [66], where we primarily study the performance of our system at a
small scale. In this work, we perform a more thorough analysis of the system on diverse
platforms, applications, and concurrencies, and conduct an in-depth analysis of the resulting
configurations.
8.2 I/O Modeling
Tuning the I/O subsystem has unique challenges. While computation kernels run in
milliseconds, a single evaluation of an I/O function can take minutes. Due to the complexity
of, and the interdependencies among, the multiple layers of the I/O system, searching for
tuned parameters is a cumbersome process. Our previous work [28] used a heuristic-based
search with a GA to achieve substantial performance improvements; however, this heuristic
search process has a prohibitive runtime. In this work, we took a performance-modeling
approach to filter the combinations down to a small number and then search within this
smaller space. We now review works that address autotuning of parallel I/O, as well as
works that seek to reduce the parameter search space.
Autotuning of parallel I/O has been studied by relatively few projects. The Panda project
[62, 63] studied automatic optimization of collective I/O operations, where all the processes
of an application perform I/O operations synchronously. The Panda project used GAs to
search for tuning parameters [64], an approach that we found introduces a large overhead.
Moreover, the optimizations developed in this project were applied in the Panda I/O library,
which predates MPI-IO and HDF5. You et al. [23] proposed an autotuning framework for the
Lustre file system on Cray XT5 systems at ORNL. The authors use mathematical models
based on queuing theory to develop a prediction model that approximates the real system;
the framework then searches for optimal parameters of the Lustre file system and of the other
I/O layers using search heuristics. In contrast to these queuing-theory-based models, we
develop empirical models based on a training phase of real execution times from writing
data to the file system.
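As a hedged sketch of what such an empirical model looks like (the model form and the training numbers below are illustrative, not the dissertation's actual model), a nonlinear least-squares fit over a handful of timed training runs might be:

import numpy as np
from scipy.optimize import curve_fit

def io_time(x, t0, a, b):
    size_mb, stripe_count, cb_nodes = x
    # startup cost + a transfer term that scales down with striping
    # + a per-aggregator coordination term
    return t0 + a * size_mb / stripe_count + b * cb_nodes

# hypothetical training runs: (MB written, stripe count, aggregators) -> seconds
x = np.array([[512, 4, 2], [512, 16, 4], [2048, 16, 4], [2048, 64, 8]]).T
y = np.array([9.1, 4.2, 12.8, 6.5])
params, _ = curve_fit(io_time, x, y)
# the fitted model then ranks candidate configurations cheaply, and only
# the most promising ones are executed on the real system and refit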
There have been several efforts to predict parallel I/O performance [23–25, 67–70].
Shan et al. [24] use the IOR benchmark to match the I/O patterns of an application and
predict its I/O performance. Meswani et al. [68] use a similar strategy, running the I/O
operations of an application on a reference system and calibrating the performance of the
reference system against a target system. Smirni et al. [25] use a queuing network model
to predict the performance of RAID-3 disks. Song et al. [69] propose an analytical model
to predict the cost of read operations for accessing data organized in different layouts on
the file system. Kumar et al. [70] use various machine-learning algorithms to improve
I/O performance in the PIDX file format library; their prediction focuses on network and
I/O performance while keeping the stripe settings fixed. While many of these efforts seek to
predict I/O performance accurately, our work uses the models to identify fruitful parameter
values and then iterates through executing and refitting stages, searching among this smaller
set of parameter values. Using this approach, we have shown that our technique is fast and
effective in achieving good I/O performance.
8.3 I/O Recording
Existing profiling and tracing tools have demonstrated the value of multi-level views of I/O
activity and paved the way for Recorder.
Darshan [71] is a powerful profiling library that characterizes application I/O via statistics
and cumulative information. The information recorded by Darshan includes counters
for MPI-IO and POSIX I/O operations, counters for MPI-IO datatypes and access patterns,
and cumulative information such as the number of bytes read or written and the time spent
in operations. Darshan's lightweight design allows it to be deployed full-time for workload
characterization of large systems [40, 72], but the compact nature of the stored information
limits its usefulness in understanding the detailed I/O behavior of applications or libraries.
IOPin [73] is another profiling tool; it instruments the MPI library and the PVFS file system.
IOPin gathers information such as the rank, MPI call id, PVFS call id, I/O type (read/write),
and latency, and stores it in a database. One distinguishing feature of IOPin is its ability to
gather and correlate function calls at the MPI-IO and file system levels. As opposed to our
approach (dynamic library preloading), IOPin uses runtime binary instrumentation.
The RIOT I/O tracing toolkit [74] intercepts MPI-IO and POSIX I/O function calls and
records timestamp, data size, and file offset details. The toolkit also includes a post-processor
to create statistical and graphical reports showing application I/O activity. RIOT has been
used to discover performance inefficiencies, demonstrating the value of tracing toolkits in
I/O performance analysis and tuning. We share the same vision as the RIOT authors, and
with Recorder provide a framework that also captures the I/O activity of the high-level I/O
library.
Earlier versions of the HDF5 library (1.4.x) included tools developed by the Pablo Research
Group [75] to log and analyze entry and exit from HDF5, MPI-IO, and POSIX I/O functions.
While this capability proved useful, it required source code modification to enable tracing
and is no longer available.
//TRACE [35] and ScalaHTrace [37] include both trace capture and replay capabilities.
//TRACE emphasizes accurate replay and can introduce considerable overhead during trace
capture as inter-node data dependencies are discovered. ScalaHTrace captures both com-
munication and I/O activity, using a novel compression technique to keep trace files at near
constant size for different problem scales. Its replay engine uses a distributed approach to
deterministically replay traces without decompressing them. Since //TRACE and ScalaH-
Trace primarily focus on trace replay rather than on diagnosis and correction of performance
issues, their traces mainly contain timing information and offer little assistance in identifying
sources of performance degradation.
In [32], Konwinski et al. propose a taxonomy for cataloging features of I/O tracing
frameworks based on their survey of three existing packages (LANL-TRACE [76], Tracefs
[36], and //TRACE). We found the proposed taxonomy very useful in considering and
describing the features of Recorder.
8.4 I/O Replaying
Tracefs [36] is a low-overhead and flexible tracing file system that intercepts operations at
the VFS level. The recorded traces are useful for security auditing and debugging.
//Trace [35] provides a detailed framework for POSIX-level I/O recording and replaying.
It emphasizes the average replay accuracy of its parallel replayer, which can mimic the
behavior of the traced application; inter-node data dependencies and computing times are
discovered in order to create more representative workloads for storage system evaluation.
ScalaHTrace [37] focuses on the recording and compression of MPI-IO-level traces. It
uses histograms based on a user-specified merge precision level, and replays the statistical
histogram traces without decompressing the original trace file.
Our work is distinct in that it traces I/O at the level of the HDF5 application calls. Since
MPI-IO and file system activities result from the HDF5 calls, tracing at the highest possible
level provides a complete view of I/O activities. In addition, we focus on lossless compression
of the traces, so that no information is lost.
Skel [77] is probably the closest work to ours: both have the same target, but with
very different approaches. Skel creates skeletal I/O applications by utilizing the ADIOS [78]
framework. ADIOS users configure the I/O of their applications in an external XML file,
and Skel uses this XML configuration, plus an additional XML file of test parameters, to
create the skeletal I/O application. In our work, no configuration file is needed; running the
application is enough to obtain the traces and generate the I/O kernel. Additionally, our
approach replays all the HDF5 I/O calls of an application, leading to exactly the same I/O
behavior as the original application.
8.5 I/O Patterns
I/O Signature is a notation proposed by Byna et al. [79] consisting of five dimensions of
I/O operations: operation, spatial offset, request size, repetitive behavior, and temporal
intervals. These signatures are gathered by a framework for each application and stored
persistently for later lookup, in order to help prefetching.
Statistical models, such as Markov models, have long been proposed to reproduce and
predict I/O operations and file system performance [80, 81]. These are used mostly in the
context of prefetching, caching, or scheduling.
Omnisc’IO [82] is a grammar-based I/O model that aims to capture and predict the I/O
operations of an application. At its heart, it uses an algorithm based on the Sequitur
algorithm, which, given a sequence of symbols, builds a grammar for text compression; it
supports both spatial and temporal patterns in this regard. In order to be more general,
the authors use the program’s stack trace as the symbols of the grammar. One strength of
their approach is that it performs real-time prediction as the grammar is updated by the
algorithm; this is similar to what we called “real-time tuning” earlier.
He et al. [83] correctly argue that a lot of information is lost in a typical I/O stack as
data flows between its layers: although high-level I/O libraries contain rich information
about the data structures, everything eventually comes down to simple offset and length
pairs in the storage system. Their solution is to “rediscover these structures in unstructured
I/O” using a gray-box technique. Our approach, instead, is not to lose this information in
the first place, by intercepting the calls at the higher levels. In terms of framework design
there are some similarities, such as the way the pattern detection engine works. However,
since their approach operates at the POSIX level, it needs both a local pattern structure
and a global one: for the local patterns, a modified algorithm based on LZ77 is presented,
and for the global patterns, the local patterns are sorted in order to check for a pattern
among them. Neither is necessary in our work.
CHAPTER 9
CONCLUDING REMARKS
Parallel I/O is an integral part of modern HPC; however, it remains challenging to obtain
maximum performance from I/O subsystems. This is mainly due to the inter-dependencies
among the multiple layers of the parallel I/O stack. The parameter values at each layer
of this stack are critical to I/O performance, and the best values vary across applications,
platforms, and the concurrency of the application.
In this dissertation, we proposed an end-to-end solution to the HPC I/O problem: starting
from an autotuning framework using Genetic Algorithms (GA), optimizing it with I/O
performance modeling, and finally explaining how I/O patterns can be exploited to form an
intelligent runtime system for parallel I/O tuning. This intelligent runtime system, along
with profiling tools such as Recorder and I/O kernel generators such as Replayer, can hide
the complexity of parallel I/O from scientists and HPC users.
9.1 Comparison of the approaches
Table 9.1 compares the three approaches to I/O tuning discussed in this dissertation.
With the default configuration and no I/O tuning, each application run takes more than
3 hours. With Genetic Algorithms, a tuning cost of more than 10 hours is paid for each
application and scale. With the current approach, the cost of training is paid once; applying
the model to each application then takes less than an hour and yields fast application run
times. For N applications at one scale, the GA approach thus costs more than 10N hours
of tuning, whereas the modeling approach costs roughly 10 + N hours.
9.2 Contributions
This dissertation makes the following contributions:
• The design and implementation of an autotuning system that hides the complexity of
tuning the parallel I/O stack. This framework is covered in Chapter 3.
Table 9.1: A comparison of GA, modeling, and the default configuration.

Method           Training        Applying       Per App. &     App. Runtime
                 Phase           the Model      Scale Tuning   (VPIC-8192 on Hopper)
GA               N/A             N/A            > 10 hours     118 seconds
Model Fitting    > 10 hours      < 1 minute     < 1 hour       100 seconds
                 (can reuse)     (automatic)
Default Config.  none            none           none           > 3 hours
• Demonstration of performance portability of the autotuning system across diverse HPC
platforms.
• Demonstration of the applicability of the system to multiple scientific application
benchmarks.
• Demonstration of I/O performance tuning at different scales (both concurrency and
dataset size).
• Development of an approach to automatically construct an I/O performance model.
This model and its development are explained in Chapter 4.
• Usage of the model thus constructed to reduce the search space for good I/O configu-
rations.
• Demonstration of the applicability of the autotuning framework exploiting the perfor-
mance model to scientific I/O kernels with different write patterns and various problem
sizes.
• An implementation of a multi-level I/O tracing framework, called Recorder, that can
capture I/O function calls at multiple levels of the I/O stack, including HDF5, MPI-IO,
and POSIX I/O. Recorder is discussed in Chapter 5.
• Demonstration of the effectiveness of Recorder as an aid to understanding the I/O
activity of applications and to identifying a performance bottleneck in HDF5’s current
implementation of metadata reads.
• The design and implementation of a framework, called Replayer, for generating I/O
kernels from a full HPC application. Replayer is presented in Chapter 6.
• Usage of the Replayer to generate correct I/O kernels for various HPC applications.
• A new representation for I/O patterns based on the traces of high-level I/O libraries.
• Design and development of a trace analysis tool for identifying I/O patterns of an
application automatically.
• Achieving a significant portion of the peak I/O performance for arbitrary I/O patterns
using this method.
9.3 Future Research Directions
The following ideas derive from the contributions in this dissertation:
• The IBM GPFS [84] file system is another major file system used on HPC production
systems. We did not have a chance to test our modeling efforts on it; one way to extend
this work is therefore to validate the modeling approach on a system equipped with
GPFS.
• Runtime noise and dynamic interference from other users are a fact of life in production
HPC facilities. While our autotuning framework has produced compelling results, we
assume that the user will encounter a runtime workload comparable to the one
encountered during the autotuning process. We believe that measuring noise and
interference during the tuning process, and deriving models to project their effect at
runtime, will be key to tackling this hard problem.
• Although we used statistical non-linear regression models, one could look into machine-
learning-based approaches (such as Gaussian processes) to intelligently sample the
search space and further reduce the runtime.
• Our current approach of determining a training set is based on a batch execution
model. Namely, we pre-compute a training set with a space-filling design in advance,
and evaluate the training set in a single batch job. We could have opted for an adaptive,
“sequential design of experiments” approach (see, e.g., [85]), where each configuration
is based on the results of the previous runs. This has the potential to further reduce
the size of the training set.
• Given the considerable variation in the performance of I/O subsystems on HPC platforms,
it is difficult to obtain reliable measurements of tracing overhead by comparing
the execution times of traced and untraced runs. One could investigate an alternate
method of computing the overhead: measuring the cumulative time spent in Recorder’s
logging functions and comparing that to the overall runtime.
• In order to support effective cross-level trace analysis, we need to correlate lower-level
functions with the higher-level function where they originated. The correlation will
be essential for multi-threaded applications and those with asynchronous I/O, because
one cannot simply use the order of events in the trace file to infer which operation
caused another under those circumstances.
• Developing a trace analysis and visualization tool that can help identify I/O bottlenecks
or automatically draw useful conclusions from large-scale runs. A good tool should be
usable by, and useful to, end users without extensive experience in parallel I/O.
• Improving the pattern matching capabilities of our framework, in order to detect and
compress more general control structures. While this “reverse engineering” of the
control structure of the original program is not tractable in general, we conjecture
that the I/O of most scientific codes has simple control structures that can be detected
by our methods.
• Our work has focused on generating an I/O skeleton for a fixed problem size and
fixed number of processes. However, the same pattern matching techniques we use to
compress traces can be used to detect dependencies on the number of processors or
key input parameters. This will require multiple runs with different input sizes and
different process counts.
• Although we have only shown that this framework works for HPC applications using
the HDF5 library, the tools and operations in this work are all applicable to other
high-level I/O libraries, such as PnetCDF. Adding such a capability would be very
useful, and is another piece of future work under consideration.
• Last but not least, this framework can be applied to many HPC applications in order
to build a repository of representative I/O kernels. Such a repository can be used for
different purposes, such as storage system evaluation, system procurement, and I/O
performance analysis.
REFERENCES
[1] M. Folk, A. Cheng, and K. Yates, “HDF5: A file format and I/O library for high
performance computing applications,” in Proceedings of Supercomputing, vol. 99, 1999.
[2] P. Corbett, D. Feitelson, S. Fineberg, Y. Hsu, B. Nitzberg, J.-P. Prost, M. Snir,
B. Traversat, and P. Wong, “Overview of the MPI-IO parallel I/O interface,” in Input/Output
in Parallel and Distributed Computer Systems. Springer, 1996, pp. 127–146.
[3] P. Schwan, “Lustre: Building a file system for 1000-node clusters,” in Proceedings of
the 2003 Linux Symposium, vol. 2003, 2003.
[4] The HDF Group, “Hierarchical data format version 5,” 2000-2010. [Online]. Available:
http://www.hdfgroup.org/HDF5
[5] R. Thakur, W. Gropp, and E. Lusk, “Data sieving and collective I/O in ROMIO,” in
Proceedings of the 7th Symposium on the Frontiers of Massively Parallel Computation,
ser. FRONTIERS ’99. Washington, DC: IEEE, 1999, p. 182.
[6] J. M. del Rosario, R. Bordawekar, and A. Choudhary, “Improved parallel I/O via a
two-phase run-time access strategy,” SIGARCH Comput. Archit. News, vol. 21, no. 5,
pp. 31–38, 1993.
[7] D. Knaak and D. Oswald, “Optimizing MPI-IO for applications on Cray XT
systems,” Report S-0013-10, 2009. [Online]. Available: docs.cray.com/books/
S-0013-10//S-0013-10.pdf
[8] W.-K. Liao and A. Choudhary, “Dynamically adapting file domain partitioning meth-
ods for collective I/O based on underlying parallel file system locking protocols,” in
International Conference for High Performance Computing, Networking, Storage and
Analysis, ser. SC ’08, 2008, pp. 1–12.
[9] S. Byna, J. Chou, O. Rübel, Prabhat et al., “Parallel I/O, analysis, and
visualization of a trillion particle simulation,” in Proceedings of the International
Conference on High Performance Computing, Networking, Storage and Analysis, ser.
SC ’12, 2012. [Online]. Available: http://dl.acm.org/citation.cfm?id=2388996.2389077
pp. 59:1–59:12.
[10] S. Lang, P. Carns, R. Latham, R. Ross, K. Harms, and W. Allcock, “I/O performance
challenges at leadership scale,” in Proceedings of the Conference on High Performance
Computing Networking, Storage and Analysis, ser. SC ’09. New York, NY, USA:
ACM, 2009. [Online]. Available: http://doi.acm.org/10.1145/1654059.1654100 pp.
40:1–40:12.
[11] K. Schulz, “Experiences from the Deployment of TACC’s Stampede System,” March
2013. [Online]. Available: http://www.hpcadvisorycouncil.com/events/2013/
Switzerland-Workshop/Presentations/Day 1/7 TACC.pdf
[12] K. J. Bowers, B. J. Albright, L. Yin, B. Bergen, and T. J. T. Kwan, “Ultrahigh
performance three-dimensional electromagnetic relativistic kinetic plasma simulation,”
Physics of Plasmas, vol. 15, no. 5, p. 7, 2008.
[13] C. Nieter and J. R. Cary, “VORPAL: a versatile plasma simulation code,” Journal of
Computational Physics, vol. 196, pp. 448–472, 2004.
[14] D. Randall, M. Khairoutdinov, A. Arakawa, and W. Grabowski, “Breaking the Cloud
Parameterization Deadlock,” Bull. Amer. Meteor. Soc., vol. 84, no. 11, pp. 1547–1564,
Nov. 2003. [Online]. Available: http://dx.doi.org/10.1175/bams-84-11-1547
[15] E. W. Bethel, J. M. Shalf, C. Siegerist, K. Stockinger, A. Adelmann, A. Gsell,
B. Oswald, and T. Schietinger, “Progress on H5Part: A Portable High Performance
Parallel Data Interface for Electromagnetics Simulations,” in Proceedings of the 2007
IEEE Particle Accelerator Conference (PAC 07), Albuquerque, New Mexico, June
2007, p. 3396. [Online]. Available:
http://adsabs.harvard.edu/cgi-bin/nph-bib_query?bibcode=2007pac..conf.3396B
[16] E. W. Bethel, J. M. Shalf, C. Siegerist, K. Stockinger, A. Adelmann, A. Gsell, B. Os-
wald, and T. Schietinger, “Progress on H5Part: A portable high performance parallel
data interface for electromagnetics simulations,” in Proceedings of the 2007 IEEE Par-
ticle Accelerator Conference, ser. PAC 07, 2007.
[17] L. B. N. Laboratory, “H5Part: a portable high performance parallel data interface to
HDF5.” [Online]. Available: http://vis.lbl.gov/Research/H5Part/
[18] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning,
1st ed. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1989.
[19] C. S. Perone, “Pyevolve: a Python open-source framework for genetic algorithms,”
SIGEVOlution, vol. 4, no. 1, pp. 12–20, 2009. [Online]. Available: http:
//dx.doi.org/10.1145/1656395.1656397
[20] M. Sweet, “Mini-XML, a small XML parsing library,” 2003-2011. [Online]. Available:
http://www.easysw.com/mike/mxml
[21] M. Howison, Q. Koziol, D. Knaak, J. Mainzer, and J. Shalf, “Tuning HDF5 for Lustre
File Systems,” in Proceedings of 2010 Workshop on Interfaces and Abstractions for
Scientific Data Storage (IASDS10), Heraklion, Crete, Greece, Sep. 2010, LBNL-4803E.
[22] R. Thakur, W. Gropp, and E. Lusk, “Data Sieving and Collective I/O in ROMIO,”
in Proceedings of the The 7th Symposium on the Frontiers of Massively Parallel
Computation, ser. FRONTIERS ’99. Washington, DC, USA: IEEE Computer Society,
1999. [Online]. Available: http://dl.acm.org/citation.cfm?id=795668.796733 pp. 182–.
[23] H. You, Q. Liu, Z. Li, and S. Moore, “The design of an auto-tuning I/O framework on
Cray XT5 system,” in Cray Users Group Conference (CUG’11) (Best Paper Finalist),
Fairbanks, Alaska, May 2011.
[24] H. Shan, J. Shalf, and K. Antypas, “Characterizing and predicting the I/O performance
of HPC applications using a parameterized synthetic benchmark,” in SC’ 08. Austin,
TX: ACM/IEEE, 2008.
[25] E. Smirni, C. L. Elford, D. A. Reed, and A. A. Chien, “Performance modeling of a
parallel I/O system: An application driven approach,” in PPSC. SIAM, 1997.
[26] P. Balaprakash, R. Gramacy, and S. M. Wild, “Active-learning-based surrogate models
for empirical performance tuning,” in Proceedings of IEEE International Conference on
Cluster Computing, ser. CLUSTER ’13, September 2013, pp. 1–8.
[27] P. Balaprakash, S. M. Wild, and P. D. Hovland, “An experimental study of global
and local search algorithms in empirical performance tuning,” in High Performance
Computing for Computational Science - VECPAR 2012, 10th International Conference,
Kobe, Japan, July 17-20, 2012, Revised Selected Papers, ser. Lecture Notes in Computer
Science. Springer, 2013, pp. 261–269.
[28] B. Behzad, L. Huong Vu Thanh, J. Huchette, S. Byna, Prabhat, R. Aydt, Q. Koziol,
and M. Snir, “Taming parallel I/O complexity with auto-tuning,” in Proceedings of
2013 International Conference for High Performance Computing, Networking, Storage
and Analysis, ser. SC ’13, 2013.
[29] R. McLay, D. James, S. Liu, J. Cazes, and W. Barth, “A user-friendly approach for
tuning parallel file operations,” in Proceedings of the International Conference for High
Performance Computing, Networking, Storage and Analysis (SC ’14). Piscataway, NJ,
USA: IEEE Press, 2014, pp. 229–236.
[30] J. Dennis and R. Loft, “Optimizing high-resolution climate variability experiments on
the Cray XT4 and XT5 systems at NICS and NERSC,” in Proceedings of the 51st Cray
User Group Conference (CUG), 2009.
[31] H. Wartens, J. Garlick, and C. Morrone, “LMT - The Lustre Monitoring Tool,”
https://github.com/chaos/lmt/wiki/, developed at Lawrence Livermore National Laboratory.
[32] A. Konwinski, J. Bent, J. Nunez, and M. Quist, “Towards an I/O Tracing Framework
Taxonomy,” in Proceedings of the 2nd International Workshop on Petascale data
storage: held in conjunction with Supercomputing ’07, ser. PDSW ’07. New York, NY,
USA: ACM, 2007. [Online]. Available: http://doi.acm.org/10.1145/1374596.1374610
pp. 56–62.
[33] “Stampede supercomputer,” http://www.tacc.utexas.edu/resources/hpc/stampede.
[34] M. Howison, Q. Koziol, D. Knaak, J. Mainzer, and J. Shalf, “Tuning HDF5 for Lustre
File Systems,” in Proceedings of 2010 Workshop on Interfaces and Abstractions for
Scientific Data Storage (IASDS10), Heraklion, Crete, Greece, Sep. 2010, LBNL-4803E.
[35] M. P. Mesnier, M. Wachs, R. R. Sambasivan, J. López, J. Hendricks, G. R. Ganger, and
D. O’Hallaron, “//TRACE: Parallel Trace Replay with Approximate Causal Events,”
in FAST. USENIX, 2007, pp. 153–167.
[36] A. Aranya, C. P. Wright, and E. Zadok, “Tracefs: A File System to Trace Them All,”
in Proceedings of the 3rd USENIX Conference on File and Storage Technologies, ser.
FAST ’04. Berkeley, CA, USA: USENIX Association, 2004, pp. 129–145.
[37] X. Wu, K. Vijayakumar, F. Mueller, X. Ma, and P. C. Roth, “Probabilistic Communi-
cation and I/O Tracing with Deterministic Replay at Scale,” in ICPP. IEEE, 2011,
pp. 196–205.
[38] G. R. Ganger, “Generating representative synthetic workloads: An unsolved problem,”
in in Proceedings of the Computer Measurement Group (CMG) Conference, 1995, pp.
1263–1269.
[39] H. Luu, B. Behzad, R. Aydt, and M. Winslett, “A multi-level approach for understand-
ing i/o activity in hpc applications,” in Cluster Computing (CLUSTER), 2013 IEEE
International Conference on, 2013, pp. 1–5.
[40] P. Carns, K. Harms, W. Allcock, C. Bacon, S. Lang, R. Latham, and R. Ross, “Un-
derstanding and Improving Computational Science Storage Access through Continuous
Characterization,” in Mass Storage Systems and Technologies (MSST), 2011 IEEE 27th
Symposium on, 2011.
[41] B. Behzad, S. Byna, S. M. Wild, M. Prabhat, and M. Snir, “Improving Parallel I/O
Autotuning with Performance Modeling,” in Proceedings of the 23rd International Sym-
posium on High-performance Parallel and Distributed Computing, ser. HPDC ’14, 2014.
[42] “HDF5 Tutorial - Parallel Topics http://www.hdfgroup.org/HDF5/Tutor/parallel.
html,” Feb. 2011.
[43] H. Richardson, “High Performance Fortran: history, overview and current develop-
ments,” 1.4 TMC-261, Thinking Machines Corporation, Tech. Rep., 1996.
[44] B. Behzad, H.-V. Dang, F. Hariri, W. Zhang, and M. Snir, “Automatic generation
of i/o kernels for hpc applications,” in Proceedings of the 9th Parallel Data Storage
Workshop, ser. PDSW ’14. Piscataway, NJ, USA: IEEE Press, 2014. [Online].
Available: http://dx.doi.org/10.1109/PDSW.2014.6 pp. 31–36.
[45] R. C. Whaley, A. Petitet, and J. J. Dongarra, “Automated empirical optimization of
software and the ATLAS project,” Parallel Computing, vol. 27, no. 1–2, pp. 3–35, 2001.
[46] M. Frigo and S. G. Johnson, “FFTW: An adaptive software architecture for the
FFT,” in Proc. 1998 IEEE Intl. Conf. Acoustics Speech and Signal Processing, vol. 3.
IEEE, 1998, pp. 1381–1384.
[47] J. Bilmes, K. Asanović, C.-W. Chin, and J. Demmel, “Optimizing matrix multiply using
PHiPAC: a portable, high-performance, ANSI C coding methodology,” in Proceedings
of the 11th international conference on Supercomputing, ser. ICS ’97, 1997. [Online].
Available: http://doi.acm.org/10.1145/263580.263662 pp. 340–347.
[48] R. Vuduc, J. Demmel, and K. Yelick, “OSKI: A library of automatically tuned sparse
matrix kernels,” in Proceedings of SciDAC 2005, Journal of Physics: Conference Series,
2005.
[49] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel, “Optimization of
sparse matrix-vector multiplication on emerging multicore platforms,” in Proceedings
of the 2007 ACM/IEEE conference on Supercomputing, ser. SC ’07, 2007. [Online].
Available: http://doi.acm.org/10.1145/1362622.1362674 pp. 38:1–38:12.
[50] K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson,
J. Shalf, and K. Yelick, “Stencil Computation Optimization and Auto-tuning on
state-of-the-art Multicore Architectures,” in Proceedings of the 2008 ACM/IEEE
conference on Supercomputing, ser. SC ’08, 2008. [Online]. Available: http:
//dl.acm.org/citation.cfm?id=1413370.1413375 pp. 4:1–4:12.
[51] S. Williams, K. Datta, J. Carter, L. Oliker, J. Shalf, K. A. Yelick, and D. Bailey, “PERI:
Autotuning memory intensive kernels for multicore,” in Journal of Physics, SciDAC PI
Conference: Conference Series: 123012001, 2008.
[52] K. Seymour, H. You, and J. Dongarra, “A comparison of search heuristics for empirical
code optimization,” in Cluster Computing, 2008 IEEE International Conference on,
Sep. 29–Oct. 1, 2008, pp. 421–429.
[53] H. Casanova, D. Zagorodnov, F. Berman, and A. Legrand, “Heuristics for Scheduling
Parameter Sweep Applications in Grid Environments,” in Proceedings of the 9th
Heterogeneous Computing Workshop, ser. HCW ’00. Washington, DC, USA: IEEE
Computer Society, 2000. [Online]. Available: http://dl.acm.org/citation.cfm?id=
795691.797922 pp. 349–.
[54] T. Bäck and H.-P. Schwefel, “An overview of evolutionary algorithms for parameter
optimization,” Evol. Comput., vol. 1, no. 1, pp. 1–23, Mar. 1993. [Online]. Available:
http://dx.doi.org/10.1162/evco.1993.1.1.1
[55] K. Deb, A. Anand, and D. Joshi, “A computationally efficient evolutionary algorithm
for real-parameter optimization,” Evol. Comput., vol. 10, no. 4, pp. 371–395, Dec.
2002. [Online]. Available: http://dx.doi.org/10.1162/106365602760972767
[56] A. H. Wright, “Genetic Algorithms for Real Parameter Optimization,” in Foundations
of Genetic Algorithms. Morgan Kaufmann, 1991, pp. 205–218.
[57] A. Tiwari and J. K. Hollingsworth, “Online Adaptive Code Generation and Tuning,”
in Proceedings of the 2011 IEEE International Parallel & Distributed Processing
Symposium, ser. IPDPS ’11. Washington, DC, USA: IEEE Computer Society, 2011.
[Online]. Available: http://dx.doi.org/10.1109/IPDPS.2011.86 pp. 879–892.
[58] H. Jordan, P. Thoman, J. J. Durillo, S. Pellegrini, P. Gschwandtner, T. Fahringer,
and H. Moritsch, “A Multi-Objective Auto-tuning Framework for Parallel Codes,”
in Proceedings of the International Conference on High Performance Computing,
Networking, Storage and Analysis, ser. SC ’12. Los Alamitos, CA, USA: IEEE
Computer Society Press, 2012. [Online]. Available: http://dl.acm.org/citation.cfm?id=
2388996.2389010 pp. 10:1–10:12.
[59] G. A. Alvarez, E. Borowsky, S. Go, T. H. Romer, R. Becker-Szendy, R. Golding,
A. Merchant, M. Spasojevic, A. Veitch, and J. Wilkes, “Minerva: An
Automated Resource Provisioning Tool for Large-scale Storage Systems,” ACM
Trans. Comput. Syst., vol. 19, no. 4, pp. 483–518, Nov. 2001. [Online]. Available:
http://doi.acm.org/10.1145/502912.502915
[60] E. Anderson, M. Hobbs, K. Keeton, S. Spence, M. Uysal, and A. Veitch,
“Hippodrome: Running Circles Around Storage Administration,” in Proceedings
of the 1st USENIX Conference on File and Storage Technologies, ser. FAST
’02. Berkeley, CA, USA: USENIX Association, 2002. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1083323.1083341
[61] J. Strunk, E. Thereska, C. Faloutsos, and G. R. Ganger, “Using utility to provision
storage systems,” in Proceedings of the 6th USENIX Conference on File and
Storage Technologies, ser. FAST’08. USENIX Association, 2008. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1364813.1364834 pp. 21:1–21:16.
[62] Y. Chen, M. Winslett, S.-w. Kuo, Y. Cho, M. Subramaniam, and K. Seamons,
“Performance modeling for the panda array I/O library,” in Proceedings of the 1996
ACM/IEEE conference on Supercomputing (CDROM), ser. Supercomputing ’96, 1996.
[Online]. Available: http://dx.doi.org/10.1145/369028.369122
[63] Y. Chen, M. Winslett, Y. Cho, S. Kuo, and C. Y. Chen, “Automatic Parallel I/O Perfor-
mance Optimization in Panda,” in In Proceedings of the 10th Annual ACM Symposium
on Parallel Algorithms and Architectures, 1998, pp. 108–118.
[64] Y. Chen, M. Winslett, Y. Cho, and S. Kuo, “Automatic parallel I/O performance
optimization using genetic algorithms,” in High Performance Distributed Computing,
1998. Proceedings. The Seventh International Symposium on, Jul. 1998, pp. 155–162.
[65] W. Yu, J. Vetter, and H. Oral, “Performance characterization and optimization of
parallel I/O on the Cray XT,” in Parallel and Distributed Processing, 2008. IPDPS
2008. IEEE International Symposium on, Apr. 2008, pp. 1–11.
[66] B. Behzad, J. Huchette, H. V. T. Luu, R. Aydt, S. Byna, Y. Yao, Q. Koziol,
and Prabhat, “A framework for auto-tuning HDF5 applications,” in Proceedings
of the 22nd international symposium on High-performance parallel and distributed
computing, ser. HPDC ’13. New York, NY, USA: ACM, 2013. [Online]. Available:
http://doi.acm.org/10.1145/2462902.2462931 pp. 127–128.
[67] J. Oly and D. A. Reed, “Markov model prediction of I/O requests for scientific appli-
cations,” in Proceedings of the 16th International Conference on Supercomputing, ser.
ICS ’02, 2002, pp. 147–155.
[68] M. Meswani, M. Laurenzano, L. Carrington, and A. Snavely, “Modeling and predicting
disk I/O time of HPC applications,” in High Performance Computing Modernization
Program Users Group Conference, ser. HPCMP-UGC, 2010, pp. 478–486.
[69] H. Song, Y. Yin, Y. Chen, and X.-H. Sun, “Cost-intelligent application-specific data
layout optimization for parallel file systems,” Cluster Computing, vol. 16, no. 2, pp.
285–298, June 2013.
[70] S. Kumar, A. Saha, V. Vishwanath, P. Carns, J. A. Schmidt, G. Scorzelli, H. Kolla,
R. Grout, R. Latham, R. Ross, M. E. Papka, J. Chen, and V. Pascucci, “Characteriza-
tion and modeling of PIDX parallel I/O for performance optimization,” in Proceedings
of the International Conference for High Performance Computing, Networking, Storage
and Analysis, ser. SC ’13, 2013, pp. 67:1–67:12.
[71] P. Carns, K. Harms, W. Allcock, C. Bacon, R. Latham, S. Lang, and R. Ross, “Un-
derstanding and Improving Computational Science Storage Access through Continuous
Characterization,” in In Proceedings of 27th IEEE Conference on Mass Storage Systems
and Technologies, 2011.
[72] P. Carns, Y. Yao, K. Harms, R. Latham, R. Ross, and K. Antypas, “Production I/O
Characterization on the Cray XE6,” in Proceedings of the Cray User Group meeting
2013 (CUG 2013), May 2013.
[73] S. J. Kim, S. W. Son, W.-k. Liao, M. Kandemir, R. Thakur, and A. Choudhary,
“IOPin: Runtime Profiling of Parallel I/O in HPC Systems,” in SC Companion.
IEEE Computer Society, 2012. [Online]. Available: http://dblp.uni-trier.de/db/conf/
sc/sc2012c.html#KimSLKTC12 pp. 18–23.
[74] S. A. Wright, S. D. Hammond, S. J. Pennycook, R. F. Bird, J. A. Herdman, I. Miller,
A. Vadgama, A. Bhalerao, and S. A. Jarvis, “Parallel File System Analysis Through
Application I/O Tracing.” Comput. J., vol. 56, no. 2, pp. 141–155, 2013.
[75] “Pablo Research Group,”
http://www.renci.org/focus-areas/project-archive/pablo#.
[76] “Lanl-trace,” http://institute.lanl.gov/data/software/#lanl-trace.
[77] J. Logan, S. Klasky, J. Lofstead, H. Abbasi, S. Ethier, R. Grout, S.-H. Ku,
Q. Liu, X. Ma, M. Parashar, N. Podhorszki, K. Schwan, and M. Wolf, “Skel:
Generative software for producing skeletal i/o applications,” in Proceedings of the 2011
IEEE Seventh International Conference on e-Science Workshops, ser. ESCIENCEW
’11. Washington, DC, USA: IEEE Computer Society, 2011. [Online]. Available:
http://dx.doi.org/10.1109/eScienceW.2011.26 pp. 191–198.
[78] ADIOS 1.5 user’s manual. [Online]. Available: http://users.nccs.gov/~pnorbert/
adios-usersmanual-1.5.0.pdf
[79] S. Byna, Y. Chen, X.-H. Sun, R. Thakur, and W. Gropp, “Parallel I/O Prefetching
Using MPI File Caching and I/O Signatures,” in Proceedings of the 2008 ACM/IEEE
Conference on Supercomputing, ser. SC ’08. Piscataway, NJ, USA: IEEE Press,
2008. [Online]. Available: http://dl.acm.org/citation.cfm?id=1413370.1413415 pp.
44:1–44:12.
[80] E. Smirni and D. A. Reed, “Lessons from Characterizing Input/Output Behavior of Par-
allel Scientific Applications,” International Journal on Performance Evaluation, vol. 33,
pp. 27–44, 1998.
[81] H. Simitci and D. A. Reed, “A Comparison of Logical and Physical Parallel I/O Pat-
terns,” International Journal of High Performance Computing Applications, vol. 12, pp.
364–380, 1998.
[82] M. Dorier, S. Ibrahim, G. Antoniu, and R. Ross, “Omnisc’IO: A Grammar-based
Approach to Spatial and Temporal I/O Patterns Prediction,” in Proceedings of the
International Conference for High Performance Computing, Networking, Storage and
Analysis, ser. SC ’14. Piscataway, NJ, USA: IEEE Press, 2014. [Online]. Available:
http://dx.doi.org/10.1109/SC.2014.56 pp. 623–634.
[83] J. He, J. Bent, A. Torres, G. Grider, G. Gibson, C. Maltzahn, and X.-H.
Sun, “I/O Acceleration with Pattern Detection,” in Proceedings of the 22Nd
International Symposium on High-performance Parallel and Distributed Computing,
ser. HPDC ’13. New York, NY, USA: ACM, 2013. [Online]. Available:
http://doi.acm.org/10.1145/2462902.2462909 pp. 25–36.
[84] F. Schmuck and R. Haskin, “GPFS: A Shared-Disk File System for Large Computing
Clusters,” in Proceedings of the 1st USENIX Conference on File and Storage
Technologies, ser. FAST ’02. Berkeley, CA, USA: USENIX Association, 2002. [Online].
Available: http://dl.acm.org/citation.cfm?id=1083323.1083349
[85] R. B. Gramacy and H. K. H. Lee, “Adaptive design and analysis of supercomputer
experiments,” Technometrics, vol. 51, no. 2, pp. 130–145, 2009.
