W&M ScholarWorks
Dissertations, Theses, and Masters Projects

Theses, Dissertations, & Master Projects

2011

Effective Large Scale Computing Software for Parallel Mesh
Generation
Andriy Kot
College of William & Mary - Arts & Sciences

Follow this and additional works at: https://scholarworks.wm.edu/etd
Part of the Computer Sciences Commons

Recommended Citation
Kot, Andriy, "Effective Large Scale Computing Software for Parallel Mesh Generation" (2011).
Dissertations, Theses, and Masters Projects. Paper 1539623585.
https://dx.doi.org/doi:10.21220/s2-hsxy-8829

This Dissertation is brought to you for free and open access by the Theses, Dissertations, & Master Projects at W&M
ScholarWorks. It has been accepted for inclusion in Dissertations, Theses, and Masters Projects by an authorized
administrator of W&M ScholarWorks. For more information, please contact scholarworks@wm.edu.

Effective Large Scale Computing Software
for Parallel Mesh Generation

Andriy Kot
Ternopil, Ukraine
Master of Science, The College of William and Mary, 2004
Master of Science, Ternopil Academy of National Economy, 2003

A Dissertation presented to the Graduate Faculty
of the College of William and Mary in Candidacy for the Degree of
Doctor of Philosophy

Department of Computer Science

The College of William and Mary
May 2011

APPROVAL PAGE
This Dissertation is submitted in partial fulfillment of
the requirements for the degree of
Doctor of Philosophy

Approved by the Committee, May 2011

Associate Professor Weizhen Mao, Computer Science
The College of William and Mary

Associate Professor Qun Li, Computer Science
The College of William and Mary

Assistant Professor Xipeng Shen, Computer Science

The College of William and Mary

ABSTRACT PAGE

Scientists commonly turn to supercomputers or Clusters of Workstations with hundreds (even thousands) of nodes to generate meshes for large-scale simulations. Parallel mesh generation software is then used to decompose the original mesh generation problem into smaller sub-problems that can be solved (meshed) in parallel. The size of the final mesh is limited by the amount of aggregate memory of the parallel machine. Also, requesting many compute nodes on a shared computing resource may result in a long wait in the queue, far surpassing the time it takes to solve the problem.

These two problems (i.e., insufficient memory when computing on a small number of nodes, and long waiting times when using many nodes from a shared computing resource) can be addressed by using out-of-core algorithms. These are algorithms that keep most of the dataset out-of-core (i.e., outside of memory, on disk) and load only a portion in-core (i.e., into memory) at a time.

We explored two approaches to out-of-core computing. First, we presented a traditional approach, which is to modify existing in-core algorithms to enable out-of-core computing. While we achieved good performance with this approach, the task is complex and labor intensive. As an alternative, we presented a runtime system designed to support out-of-core applications. It requires little modification of the existing in-core application code and still produces acceptable results. Evaluation of the runtime system showed little performance degradation while simplifying and shortening the development cycle of out-of-core applications. The overhead from using the runtime system for small problem sizes is between 12% and 41%, while the overlap of computation, communication and disk I/O is above 50% and as high as 61% for large problems.

The main contribution of our work is the ability to utilize computing resources more effectively. The user has a choice of either solving larger problems than would otherwise be possible, or solving problems of the same size using fewer computing nodes, thus reducing the waiting time on shared clusters and supercomputers. We demonstrated that the latter could potentially lead to substantially shorter wall-clock time.

Table of Contents

Acknowledgments

List of Tables

List of Figures

1 Introduction
  1.1 Related work
    1.1.1 Out-of-core computing
    1.1.2 Run-time systems

2 Out-of-core Parallel Delaunay Refinement
  2.1 Parallel Delaunay Refinement Method
    2.1.1 Shared memory implementation of the PDR
  2.2 Out-of-core PDR
    2.2.1 Out-of-Core Shared memory PDR
    2.2.2 Out-of-core Distributed Memory PDR
    2.2.3 Out-of-core Hybrid Memory PDR

3 Out-of-core Parallel Constrained Delaunay Meshing
  3.1 Parallel Constrained Delaunay Meshing
  3.2 Programming model
  3.3 Out-of-core subsystem for the PCDM
  3.4 Implementation

4 Multi-layered Run-Time System
  4.1 Requirements
  4.2 Background
  4.3 Programming Model
  4.4 Organization
  4.5 Implementation
    4.5.1 Software layers
    4.5.2 Mobile Objects and Threads
    4.5.3 Message Passing
    4.5.4 Object Migration
  4.6 Out-of-core Non-Uniform Parallel Delaunay Refinement
    4.6.1 Implementation
    4.6.2 Optimization
    4.6.3 Findings

5 Performance Evaluation
  5.1 Experimental Setup
  5.2 Out-of-core Parallel Delaunay Refinement
  5.3 Out-of-Core Parallel Constrained Delaunay Meshing
  5.4 Multi-layered Run-Time System

6 Conclusion and Future work

Bibliography

Vita

ACKNOWLEDGMENTS
I am grateful to my adviser Nikos Chrisochoides for everything he has done.
I thank Andrey Chernikov for his collaboration on numerous projects.
I very much appreciate feedback and support from my committee: Henry Krakauer,
Weizhen Mao, Qun Li and Xipeng Shen.
Last but not least, I thank the people who suggested and helped me to get into
graduate school, Allen Tucker and his wife Meg.
This work was performed using computational facilities at the College of William
and Mary which were enabled by grants from Sun Microsystems, the National Science
Foundation, and Virginia's Commonwealth Technology Research Fund.
This work was supported in part by NSF grants: CCF-0916526, CCF-0833081, and
CSI-719929 and by the John Simon Guggenheim Foundation and Richard T. Cheng
Endowment.


List of Tables

5.1 Parallel Delaunay refinement for a mesh of a unit square using the IBM cluster. The OSPDR, the ODPDR and the OHPDR use 4 processors; the PDR uses 4, 9, 16 and 25 processors.

5.2 Parallel Delaunay refinement for a mesh of a unit square using the IBM cluster. The OSPDR, the ODPDR and the OHPDR use 4 processors; the PDR uses 4, 9, 16 and 25 processors.

5.3 Parallel Delaunay refinement for the unit square. The ODPDR and the OHPDR1 use 16 processors (4 nodes, 4 CPUs per node); the OHPDR2 uses 16 processors (2 nodes, 8 CPUs per node) of the IBM cluster; the PDR uses up to 121 processors of the SciClone cluster. In parentheses in the PDR column are the corresponding values from running the in-core PDR on up to 32 processors of the IBM cluster. Wait-in-queue time is included when computing normalized speed for the in-core algorithm.

5.4 Parallel Delaunay refinement for a mesh of the pipe model. The ODPDR and the OHPDR use 16 processors (4 nodes, 4 CPUs per node); the PDR uses a varying number of processors (16-121). Wait-in-queue time is included when computing normalized speed for the in-core algorithm.

5.5 Average sustained speed of local disks for nodes 1 through 4.

5.6 Normalized speed of the PCDM and the OPCDM for problems that fit completely in-core. OPCDM(d) and OPCDM(b) denote respectively OPCDM with disk and database OoC subsystems. Pipe geometry.

5.7 Normalized speed of the PCDM with virtual memory and the OPCDM for problems that have a memory footprint twice as large as the available physical memory. OPCDM(d) and OPCDM(b) denote respectively OPCDM with disk and database OoC subsystems. Pipe geometry.

5.8 Normalized speed of the PCDM (estimated) and the OPCDM for large problem sizes. The normalized speed for the PCDM is estimated using statistical data for wait-in-queue time and the average per-processor performance demonstrated on smaller in-core problems. The normalized speed for the OPCDM is computed from the actual total execution time using 16 PEs (2 nodes) with total physical memory of 16GB on varying problem sizes (there is no wait-in-queue time for the OPCDM). OPCDM(d) and OPCDM(b) denote respectively OPCDM with disk and database OoC subsystems. Pipe geometry.

5.9 OPCDM disk utilization and I/O overlap using the disk OoC subsystem. Utilization is shown as a fraction of achievable speed. Overlap is shown as a fraction of total time. Pipe geometry.

5.10 Normalized speed of the PCDM (estimated) and the OPCDM for large problem sizes. The normalized speed for the PCDM is estimated using statistical data for wait-in-queue time and the average per-processor performance demonstrated on smaller in-core problems. The normalized speed for the OPCDM is computed from the actual total execution time using 16 PEs (2 nodes) with total physical memory of 16GB on varying problem sizes (there is no wait-in-queue time for the OPCDM). OPCDM(d) and OPCDM(b) denote respectively OPCDM with disk and database OoC subsystems. Brain geometry.

5.11 OPCDM disk utilization and I/O overlap using the disk OoC subsystem. Utilization is shown as a fraction of achievable speed. Overlap is shown as a fraction of total time. Brain geometry.

5.12 Normalized speed of the PCDM (estimated) and the OPCDM for large problem sizes. The normalized speed for the PCDM is estimated using statistical data for wait-in-queue time and the average per-processor performance demonstrated on smaller in-core problems. The normalized speed for the OPCDM is computed from the actual total execution time using 16 PEs (2 nodes) with total physical memory of 16GB on varying problem sizes (there is no wait-in-queue time for the OPCDM). OPCDM(d) and OPCDM(b) denote respectively OPCDM with disk and database OoC subsystems. Letter "A" geometry.

5.13 OPCDM disk utilization and I/O overlap using the disk OoC subsystem. Utilization is shown as a fraction of achievable speed. Overlap is shown as a fraction of total time. Letter "A" geometry.

5.14 Single PE performance of UPDR and OUPDR methods.

5.15 Single PE performance of NUPDR and ONUPDR methods.

5.16 Single PE performance of PCDM and OPCDM methods.

5.17 Overlap of computation, communication and out-of-core disk I/O in the OUPDR.

5.18 Overlap of computation, synchronization and out-of-core disk I/O in the ONUPDR.

5.19 Overlap of computation, communication and out-of-core disk I/O in the OPCDM.

5.20 The comparison of performance of the computing layer implementations.

List of Figures

1.1 The wait-in-queue time statistics for parallel jobs collected over the last four and a half years from a 300+ processor cluster at the College of William and Mary.

2.1 Subdivision of a mesh M.

2.2 An example of the PDR algorithm in one dimension. The mesh is comprised of three submeshes M_1, M_2 and M_3 (there are three processors); ∂M_{i,j} denote border segments. Stages (0)-(5) correspond to algorithm steps 0-5. Arrows between different steps indicate movements of submeshes between domains (e.g., network send-receive). Right-dashed (thin lines) areas show parts that are being modified during refinement; left-dashed (thick lines) areas show refined parts.

2.3 An example of the out-of-core PDR algorithm in one dimension. The mesh is comprised of four submeshes M_1, M_2, M_3, M_4 (there are two processors; RAM is limited so only one submesh can be loaded per processor); ∂M_{i,j} denote border segments. Solid arrows between different steps indicate movements of submeshes between subdomains; dashed arrows indicate that a submesh will be stored on disk until it is required. Right-dashed (thin lines) areas show parts that are being modified during refinement; left-dashed (thick lines) areas show refined parts. Large gray-shaded areas show data that currently reside on disk.

2.4 An example of domain partitioning for the ODPDR (left) and the OHPDR (right) methods. P is the number of processors in the 1 processor/core per node scenario, ppn is the number of processors per node, K is the number of nodes. N is derived empirically and depends on the amount of memory and disk space (N² is the total number of subdomains).

2.5 Out-of-core schemes of top-level shifts for ODPDR: along axis (left) and diagonal (right). Input geometry is the outline of the North American continent. Setup: 4 processors, 9 subdomains, distributed memory and disk storage.

2.6 Out-of-core schemes of top-level shifts for OHPDR: along axis (left) and diagonal (right). Input geometry is the outline of the North American continent. Setup: 2 nodes, 2 processors with shared memory per node, 9 subdomains, disk storage.

3.1 Decomposition of a human brain into 1024 subdomains mapped onto 8 processors.

3.2 Splitting an edge.

3.3 A high level description of the OPCDM.

3.4 A high level description of function Schedule.

4.1 Memory organization and global addressing of the MRTS.

4.2 Software organization of the MRTS.

5.1 Geometries used for evaluation: pipe cross-section, brain cross-section and letter A.

5.2 Execution times for UPDR and OUPDR for in-core problem sizes.

5.3 Execution times for NUPDR and ONUPDR for in-core problem sizes.

5.4 Execution times for PCDM and OPCDM for in-core problem sizes.

5.5 Execution times for OUPDR for out-of-core problem sizes.

5.6 Execution times for ONUPDR for out-of-core problem sizes.

5.7 Execution times for OPCDM for out-of-core problem sizes.

Effective Large Scale Computing Software
for Parallel Mesh Generation

Chapter 1

Introduction
Scientists commonly turn to supercomputers or Clusters of Workstations (CoW) with hundreds (even thousands) of nodes to generate meshes for large-scale simulations. Parallel mesh generation software is then used to decompose the original mesh generation problem into smaller sub-problems that can be solved (meshed) in parallel. The limiting factor is memory: it is not possible to generate the mesh if memory is not sufficient.

While mesh generation time shortens as the number of Processing Elements (PEs) increases, there are other factors that can affect the total wall-clock (or completion) time significantly. For instance, it takes only several minutes to generate the largest mesh that fits in the available memory using one of our parallel mesh generation applications on the SciClone cluster at the College of William and Mary. At the same time, according to statistics collected over about four and a half years (see Fig. 1.1), the waiting time is considerably longer than the actual execution time for cases requiring more than 16 PEs.

The only way for the user to decrease the waiting time is to ask
[Bar chart omitted: maximum and average wait-in-queue times (seconds, log scale) for 1-15, 16-31, 32-47, 48-63, 64-95, 96-127, and 128+ requested processors.]

Figure 1.1: The wait-in-queue time statistics for parallel jobs collected over the last four and a half years from a 300+ processor cluster at the College of William and Mary.

for fewer nodes. This, in turn, results in less aggregate memory. This thesis aims to solve this problem by enabling larger problems to be computed using less memory than would otherwise be required. In turn, this can potentially lead to shorter waiting times and even shorter overall times, also known as wall-clock times. Because we use computing resources effectively (i.e., fewer nodes and shorter wall-clock time) we call our approach effective computing.

We focus on parallel mesh generation since we have access to experts and readily available state-of-the-art software in that area, but our research should be applicable to a much broader range of scientific applications. Thus, our goal is to make it possible to generate large meshes on machines with limited memory, including both shared memory workstations with multiple processors and/or processing cores and small, affordable CoWs. Our solution is to store most of the mesh out-of-core (OoC), with only the small portion we are working on kept in memory.


First, we present several out-of-core parallel mesh generation algorithms that are based on our research in parallel in-core mesh generation. These algorithms required a substantial amount of time and effort to enable out-of-core computing, as well as to optimize them to an acceptable level of performance. Since the algorithms need to be restructured for out-of-core computing, the bulk of the changes are specific to each algorithm and cannot be easily reused.

Next, we designed an out-of-core layer to be used with a run-time system and customized an existing application for this run-time system to support out-of-core computing. While it still required a substantial amount of customization to port and optimize the application, we could reuse the out-of-core layer.

Finally, we designed and implemented the Multi-layered Runtime System (MRTS), which permits out-of-core computing with any application designed for or ported to this runtime system with virtually no changes to the application code. Note that to achieve optimal performance some modification of the application code may still be required. Nevertheless, the changes are small compared to adding out-of-core support from scratch.
We evaluated the out-of-core layer with the Out-of-core Parallel Constrained Delaunay Mesh Generation (OPCDM), an out-of-core version of the Parallel Constrained Delaunay Mesh Generation (PCDM) [12]. Because the OPCDM uses the same programming model, the porting process was simple and few changes had to be made to the application code. The OPCDM is not limited in problem size and its performance is comparable to the original PCDM for small to medium mesh sizes. The OPCDM can be used as an effective alternative to the PCDM for very large meshes that require a high number of processors for their aggregate physical memory. In such cases the difference in waiting for a larger number of processors can often be quite substantial, and the wait time can be much higher than the additional overhead one pays for running the PCDM in an out-of-core mode. We also evaluated the MRTS with the Out-of-core Non-Uniform Parallel Delaunay Refinement (ONUPDR), an out-of-core version of the Non-Uniform Parallel Delaunay Refinement (NUPDR). The NUPDR uses a different programming model than the MRTS and therefore the porting was more challenging. However, once the application was ported and able to compute in-core, adding out-of-core support was very straightforward.
In summary, we presented an approach for effective computing of large irregular scientific problems such as unstructured mesh generation. We showed that out-of-core computing allows solving larger problems than otherwise possible, as well as getting the results faster on shared computing resources. We designed, implemented and evaluated the MRTS, which permits out-of-core computing with many applications by simply porting an existing application or developing a new one for the MRTS. While porting and development are greatly simplified, performance is not sacrificed.
The main contributions of this thesis are presented in:
• Andriy Kot, Andrey Chernikov, and Nikos Chrisochoides. Effective out-of-core
parallel Delaunay mesh refinement using off-the-shelf software. ACM Journal of
Experimental Algorithmics. In print.
• Andriy Kot, Andrey Chernikov, and Nikos Chrisochoides. The evaluation of an effective out-of-core run-time system in the context of parallel mesh generation. In IEEE International Parallel and Distributed Processing Symposium, 2011. To appear.
• Andriy Kot, Andrey Chernikov, and Nikos Chrisochoides. Parallel out-of-core Delaunay refinement. In IEEE Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, Sofia, Bulgaria, September 2005.

1.1 Related work

1.1.1 Out-of-core computing

There are two basic approaches for out-of-core computing: implicit, which usually involves virtual memory (VM) supported by internal mechanisms of an operating system (OS); and explicit, which often implies algorithm-specific optimizations.
While VM is easy to employ, it has limitations. The OS-supported VM is optimized
for system throughput and usually cannot exploit access patterns of irregular and adaptive applications. On four processors, our tests indicate that an increase in the problem
size from 23.8 million elements to 58.8 million elements (doubling the amount of memory
by using disk) resulted in an increase of the execution time from about 7 minutes to over
3 hours (192 minutes). Our out-of-core methods generate meshes of the same size (58.8
million) in less than 30 minutes on the same four-processor workstation. Additionally,
the amount of VM may be limited by either computer architecture (32-bit processors
can only address 4GB) or by administration of the computing resources (it is common
to set VM at no more than twice the amount of RAM¹).

¹Based on the authors' personal experience with computational clusters of varying sizes.


In contrast, the explicit approach is usually employed to develop algorithm-specific
out-of-core methods. This approach has been very effective in linear algebra parallel
computations [20, 42]. Out-of-core linear algebra libraries use various mapping layouts
(depending on the underlying I/O and algorithm specifics) to store out-of-core matrices
and to employ vendor-supplied libraries for asynchronous disk I/0. They rely on high
performance in-core subroutines of BLAS [23], LAPACK [22] and ScaLAPACK [19] and
a simple non-recursive (in most cases) pipeline to hide latencies associated with disk
accesses.
Extensive research was performed on designing optimal algorithms for a parallel
multi-level memory model [1, 39, 45-47] as well as designing methods to map existing
in-core algorithms, based on bulk-synchronous parallel models, into efficient out-of-core
algorithms [21].
Salmon et al. described [40] an out-of-core N-body parallel method which is irregular
and does not involve creation or deletion of new bodies during the execution, unlike the
parallel mesh refinement computation. The authors extend the virtual memory scheme
to store out-of-core pages on the disk. They use an algorithm-specific space-filling curve
to arrange data within memory pages. A problem-independent feature [40] is the page
replacement algorithm, which is based on the least recently used (LRU) replacement
policy. The same policy is used as a basic virtual memory policy for many platforms (e.g.,
Linux). However, the authors extend it by introducing priorities, different aging speeds
for different data types, and explicit page locking.
Etree [43] is an out-of-core algorithm-specific approach for sequential mesh generation. The novelty of Etree is in its use of a spatial database to store and operate on large octree meshes. Each octant is assigned a unique key using the linear quadtree technique, and the keys are stored in a B-tree. There are three steps associated with generating a mesh with Etree: (1) create an unbalanced octree on disk, (2) balance the etree by further decomposing the octants that violate the 2-to-1 constraint (each octant may not have more than two neighbors on each side), and (3) store the element-node relations and node coordinates in two separate databases. Subsequently, all the mesh operations are performed by querying the databases using Etree calls. This method targets octree meshes and is exceptionally fast, especially after additional improvements using a two-level bucket sort algorithm [44]. However, it targets octree-based meshes and is not yet parallel.
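As a rough illustration of the linear-quadtree/octree addressing that Etree builds on, the sketch below (in C++) interleaves the bits of an octant's integer coordinates into a single key, a standard Morton-style encoding; the function name and key layout are illustrative assumptions and do not reproduce Etree's actual key format.

#include <cstdint>

// Interleave the bits of integer octant coordinates (x, y, z) at a given
// tree depth into one linear-octree (Morton-style) key; three bits are
// appended per level, so deeper octants get longer keys.
std::uint64_t morton_key(std::uint32_t x, std::uint32_t y, std::uint32_t z, int depth) {
    std::uint64_t key = 0;
    for (int level = depth - 1; level >= 0; --level) {
        key = (key << 3)
            | (static_cast<std::uint64_t>((x >> level) & 1u) << 2)
            | (static_cast<std::uint64_t>((y >> level) & 1u) << 1)
            |  static_cast<std::uint64_t>((z >> level) & 1u);
    }
    return key;
}

Keys produced this way preserve spatial locality, which is why storing them in a B-tree gives efficient range queries over neighboring octants.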

1.1.2 Run-time systems

Charm++ [30] supports a global address space by providing a directory service, message delivery and migration of chares. A chare is a collection of data, similar to an object from object-oriented programming. Each chare can have a number of entry methods (again, similar to class methods). To invoke a method on a chare, a message is sent from another chare. A program consists of a collection of chares and progresses by the exchange of messages between chares (with subsequent method executions), by the creation of new chares and by the destruction of existing ones. It should be noted that asynchronous entry method invocation is the only means of communication for a Charm++ application. The system is load-balanced by "off-loading" chares with the highest loads to the least loaded processors.
Chaos++ [9] is a runtime library extending the functionality of Chaos by supporting distributed arrays and distributed pointer-addressable data structures. Two base classes are provided: mobile object and globally addressable object. Contrary to globally addressable objects, the content of mobile objects is accessible by remote processors. Every object is owned by a single processor, and shadow copies are maintained by all other processors that access it remotely. The independently developed ABC++ [4] is similar to Chaos++: it supports mobile objects and allows objects to migrate between nodes; communication with a home node is required to find an object once it has migrated.

There are also languages that are designed to achieve the same goal. Emerald [29] is a specialized language that relies on its own compiler and preprocessor to allow transparent accesses to remote objects. Amber [10] is a dialect of C++ where each object is assigned a globally unique address, which permits seamless migration between nodes.
The Portable Runtime Environment for Mobile Applications (PREMA) [7] framework has been created to support the development of adaptive and irregular applications like parallel mesh refinement. It has been demonstrated that PREMA has a number of advantages over similar systems [7] while simplifying application development. PREMA consists of a communication layer and the Implicit Load Balancing Library (ILB) [7]. There were several implementations of the communication layer; the latest and current version is Clam [25]. Clam is a runtime system designed to be used for the development of irregular and adaptive applications. It provides one-sided communication and Remote Service Request (RSR) functionality as well as a global address space and management of so-called "mobile objects". A Clam mobile object is a user-defined data structure that is referenced by a mobile pointer anywhere in the system regardless of its location. A mobile object can migrate to another processor without the necessity of updating the mobile pointers that point to that object². Additionally, RSR handlers called mobile object handlers can be invoked on the node where the mobile object is currently located (a local pointer to the object is passed to the handler upon execution). The application does not need to know where the object is located to post an RSR; Clam routes it to the processor where the object is located and can postpone it if the object is in the process of migration.

The ILB uses the mobile object concept provided by Clam to implement schedulable objects. These are the smallest units of work managed by the ILB and can be moved between processors to counter load imbalance.

In contrast to existing runtime systems, the MRTS presented in this thesis provides support for out-of-core computing and interfaces to both fine- and coarse-grain parallelism. Enabling out-of-core computing with the MRTS is relatively straightforward, and for many applications only minor changes are required. Similarly to the systems presented above, the MRTS provides one-sided communication and RSR functionality, a global address space and management of user-defined mobile objects.

²The application is responsible for the actual movement of the data structure; Clam only provides procedures to uninstall the mobile object on one processor and then install it on another.


Chapter 2

Out-of-core Parallel Delaunay Refinement

2.1 Parallel Delaunay Refinement Method

The Parallel Delaunay Refinement (PDR) algorithm is based on a theoretical framework for constructing guaranteed quality Delaunay meshes in parallel [15, 16]. Sequential guaranteed quality Delaunay refinement algorithms insert points within the selection disks around circumcenters of triangles [14] of poor quality or of unacceptable size. Two points are called Delaunay-independent iff they can be inserted concurrently without destroying the conformity and Delaunay properties of the mesh. For 2-dimensional geometries, the authors presented in [15] a sufficient condition for Delaunay-independence which is based on the distance between points: two points are Delaunay-independent if the distance between them is greater than 4r̄, where r̄ is an upper bound on triangle circumradius in the initial mesh. In n dimensions, to ensure that processors insert only Delaunay-independent points at each step of the algorithm, they impose an n-dimensional hypercube lattice¹ over the entire n-dimensional domain.
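To make the condition concrete, the following minimal C++ sketch (illustrative names only; it is not the PDR implementation) tests the 2D distance criterion and maps a point to a cell of a uniform lattice whose cell size is an assumed parameter.

#include <cmath>
#include <utility>

struct Point { double x, y; };

// Sufficient condition from the text: two candidate points may be inserted
// concurrently if the distance between them exceeds 4*r_bar, where r_bar is
// an upper bound on the circumradii of the initial mesh.
bool delaunay_independent(const Point& p, const Point& q, double r_bar) {
    return std::hypot(p.x - q.x, p.y - q.y) > 4.0 * r_bar;
}

// Illustrative mapping of a point to a cell of a uniform square lattice;
// cell_size is an assumed parameter (a multiple of r_bar in the PDR setting).
std::pair<int, int> lattice_cell(const Point& p, double cell_size) {
    return { static_cast<int>(std::floor(p.x / cell_size)),
             static_cast<int>(std::floor(p.y / cell_size)) };
}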
In this thesis we present an out-of-core version of the algorithm which we published in [31]. For simplicity we begin by presenting the algorithm in one dimension². In one dimension the hypercube lattice is equivalent to a segment subdivided into a number of smaller equal-size subsegments (cells). We call the length of the segment (i.e., the 1-D lattice) the size of the lattice. Similarly, we call the length of a subsegment (i.e., a cell) the size of the cell. Consequently, the length of a segment that consists of several cells is the size of the segment and is equivalent to the sum of the cell sizes.

Given a conforming Delaunay mesh M and the number of available processors P, we compute r̄ such that the size of the corresponding lattice can be computed as α·r̄ × P, where α is a constant that depends on the implementation and dimensionality of the problem (α = 16 for our 2D implementation). Next, M is distributed among P processors: let M_i be the mesh that resides in the memory of processor i, such that M = ⋃_{i=1}^{P} M_i, and the size of the lattice segment that corresponds to M_i is equal to α·r̄.

We denote the bordering segments of M_i as ∂M_{i,j}, where i is the index of the subdomain containing M_i and j is the index of the respective neighbor, j ∈ {i−1, i+1} (e.g., ∂M_{3,4} would be the rightmost segment of M_3). The size of each bordering segment is β·r̄, where β is a constant that depends on the implementation and dimensionality of the problem (β = 4 for our 2D implementation, β ≤ α). Additionally, for each border ∂M_{i,j} we denote the segment of equal size inside M_i as ∂M′_{i,j}. Figure 2.1 shows the subdivision³ of M.

¹The point pattern of the lattice is equivalent to that of the vertices of an n-dimensional hypercube.
²In one dimension a "triangulation" of a segment is a discretization of the segment.
³According to the figure, M_i = (∂M_{i,i−1} ∪ ∂M′_{i,i−1} ∪ ∂M′_{i,i+1} ∪ ∂M_{i,i+1}), which is true for our 2D implementation but is not required; in fact (∂M_{i,i−1} ∪ ∂M′_{i,i−1} ∪ ∂M′_{i,i+1} ∪ ∂M_{i,i+1}) ⊆ M_i.

[Figure omitted: the 1-D lattice of total size α·r̄ × P subdivided into segments of size α·r̄.]
Figure 2.1: Subdivision of a mesh M.

Below is the outline of the algorithm. First, we define the necessary operations (for simplicity, A and B are abstract variables):

A ← B: A is assigned a copy of the value in B; this includes transferring the copy to the processor where A is located, if necessary.

A ∪ B: the result of this operation is a mesh that contains all elements of A and B as a single simply connected mesh; A and B are not modified.

A \ B: the result of this operation is a mesh that contains all elements in A except those in B; A and B are not modified.

refine(A, B): defined only if B ⊆ A; mesh A is refined as follows: elements of A that belong to A ∩ B are refined; additionally, refinement may affect elements of A that belong to A \ B and are geometrically within γ of the bounding box of B (γ is an implementation-dependent constant; γ = 2r̄ for our 2D implementation), resulting in a refined mesh stored in A.
The algorithm performs the following steps:

0. Distribute M: M = ⋃_{i=1}^{P} M_i, M_i ∩ M_j = ∅, i, j = 1, 2, ..., P, i ≠ j;
   let I = {2, 3, ..., P − 1}.

1. Each submesh transfers a copy of its right border segment ∂M_{i,i+1} to its right neighbor.

2. ∀i, i ∈ I: refine(M_i, (M_i \ (∂M′_{i,i+1} ∪ ∂M_{i−1,i})))
   refine(M_1, (M_1 \ (∂M′_{1,2} ∪ ∂M_{1,2})))
   refine(M_P, (∂M_{P−1,P} ∪ ∂M′_{P,P−1}))

3. Each submesh transfers its left border segment, together with the segment received in step 1, to its left neighbor.

4. ∀i, i ∈ I: refine(M_i, (M_i \ (∂M′_{i,i−1} ∪ ∂M_{i+1,i})))
   refine(M_1, (∂M′_{1,2} ∪ ∂M_{1,2}))
   refine(M_P, (M_P \ ∂M′_{P,P−1}))

5. The border segments now located on the right of each submesh are returned to their original locations on the left side of their respective submeshes.

See Figure 2.2 for an example of algorithm execution with the mesh partitioned between three subdomains. In step 0, the mesh is subdivided into submeshes and distributed between processors. In step 1, border segments on the right side of each submesh are transferred to neighbors on the right of their respective processors. In step 2, each processor refines its submesh; border segments ∂M′_{i,i+1} and ∂M_{i−1,i} are not refined but changes may propagate into them. In step 3, border segments on the left side of each submesh together with the border segments that were transferred in step 1 are transferred to neighbors on the left of their respective processors. In step 4, each processor refines its submesh; border segments ∂M′_{i,i−1} and ∂M_{i+1,i} are not refined but changes may propagate into them. In step 5, border segments now located on the right of each submesh are transferred to their original locations, on the left side of their respective submeshes. At this point the mesh is refined and the algorithm finishes.

Figure 2.2: An example of the PDR algorithm in one dimension. The mesh is comprised of three submeshes M_1, M_2 and M_3 (there are three processors); ∂M_{i,j} denote border segments. Stages (0)-(5) correspond to algorithm steps 0-5. Arrows between different steps indicate movements of submeshes between domains (e.g., network send-receive). Right-dashed (thin lines) areas show parts that are being modified during refinement; left-dashed (thick lines) areas show refined parts.


2.1.1 Shared memory implementation of the PDR

The original implementation of the PDR was for distributed memory computing. However, since multi-core processors (including support for hardware threads) are becoming increasingly popular, we implemented a modified algorithm to take advantage of shared resources and to avoid unnecessary communication:

• due to the location of buffer cells from different domains in the same memory space
it is no longer necessary to exchange them using message passing; instead those
cells are referenced by different processors
• synchronization is necessary to allow concurrent access to shared data-structures
• consequently, all supportive operations that accompany buffer exchange (i.e., packing/unpacking and merging of submeshes) are no longer needed

Our evaluation showed [32] that the performance of the Shared memory PDR (SPDR) is better than that of the original method when used on the same hardware platform. However, the difference is very small and the problem size is limited by the total memory of an SMP/SMT node. Nevertheless, this work was used to implement an advanced version of the out-of-core algorithm, giving more of a performance boost (see Section 2.2.3).

2.2 Out-of-core PDR

See Figure 2.3 for an example of the algorithm execution with the mesh partitioned into
four subdomains with only room for two in memory.


Figure 2.3: An example of the out-of-core PDR algorithm in one dimension. The mesh is comprised of four submeshes M_1, M_2, M_3, M_4 (there are two processors; RAM is limited so only one submesh can be loaded per processor); ∂M_{i,j} denote border segments. Solid arrows between different steps indicate movements of submeshes between subdomains; dashed arrows indicate that a submesh will be stored on disk until it is required. Right-dashed (thin lines) areas show parts that are being modified during refinement; left-dashed (thick lines) areas show refined parts. Large gray-shaded areas show data that currently reside on disk.

2.2.1 Out-of-Core Shared memory PDR

The Out-of-Core Shared memory PDR (OSPDR) algorithm is designed to create large meshes in parallel, using only one workstation or a single node of a cluster with the hard disk complementing the memory. The following assumptions were made for the design of the OSPDR algorithm: (1) parts of the mesh stored on disk can be accessed by any processor that needs them, but synchronization is necessary to handle collisions; (2) only a small fraction of the mesh can be loaded into the system memory; and (3) disk accesses have a very high latency. Therefore, our goal in OSPDR is to minimize the number of accesses and overlap them with computation whenever possible.

The mesh is stored on disk as a collection of subdomains. The subdomains are generated from the block decomposition using an auxiliary lattice similar to the one utilized in the PDR [15]. All processors can access all subdomains, therefore no specific data distribution is required. The subdomains are stored as a sequence of separate entities; that is, each subdomain is an atomic block and can be loaded/stored independently of the others, yet it must be loaded/stored as a whole. Only one subdomain can be loaded into processor memory at any time. Throughout this thesis, for simplicity of presentation, we assume that subdomains send and receive data (e.g., when we say subdomain i sends data to subdomain j, and subdomain i is loaded into the memory of processor m while subdomain j is loaded into the memory of processor n, in fact processor m sends data to processor n).

There are four main steps (we call them phases) in the PDR algorithm, each consisting of a refinement step and a data exchange called a shift. Since we only have enough memory to hold a portion of the mesh in-core, it is impossible to perform a phase simultaneously for all subdomains as in the PDR. In OSPDR, we break each phase into several steps. At each step we load a portion of the mesh, refine it, exchange data between in-core subdomains and store the updated portion of the mesh. We call the data exchange between in-core subdomains a shift, consistent with the PDR.
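As a minimal sketch of what loading and storing a subdomain "as a whole" amounts to, the C++ fragment below serializes an illustrative submesh structure to a single file and reads it back in one operation; the data layout, field names and file format are assumptions, not the actual OSPDR representation.

#include <cstdio>
#include <string>
#include <vector>

// Illustrative submesh payload; the real code would hold the Triangle
// data structures for one subdomain.
struct Submesh {
    std::vector<double> points;     // x0, y0, x1, y1, ...
    std::vector<int>    triangles;  // vertex indices, 3 per triangle
};

// Store a subdomain as one atomic block (a single file written in one pass).
bool store_subdomain(const Submesh& m, const std::string& path) {
    std::FILE* f = std::fopen(path.c_str(), "wb");
    if (!f) return false;
    std::size_t np = m.points.size(), nt = m.triangles.size();
    std::fwrite(&np, sizeof np, 1, f);
    std::fwrite(&nt, sizeof nt, 1, f);
    std::fwrite(m.points.data(), sizeof(double), np, f);
    std::fwrite(m.triangles.data(), sizeof(int), nt, f);
    return std::fclose(f) == 0;
}

// Load it back as a whole before refinement.
bool load_subdomain(Submesh& m, const std::string& path) {
    std::FILE* f = std::fopen(path.c_str(), "rb");
    if (!f) return false;
    std::size_t np = 0, nt = 0;
    if (std::fread(&np, sizeof np, 1, f) != 1 ||
        std::fread(&nt, sizeof nt, 1, f) != 1) { std::fclose(f); return false; }
    m.points.resize(np);
    m.triangles.resize(nt);
    std::fread(m.points.data(), sizeof(double), np, f);
    std::fread(m.triangles.data(), sizeof(int), nt, f);
    return std::fclose(f) == 0;
}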


During a shift each subdomain⁴ receives data from one of its neighbors and sends data to another. We define a direction of a shift as the relative geometric position of the subdomain that receives data with regard to the position of the subdomain that sends data. All shifts in a phase share the same direction, which is the direction of the phase. There are two distinct types of phases based on their direction: parallel (up, right, down, left) and diagonal (up-right, down-right, down-left, up-left). We only need to explain one of each; the rest can be understood by analogy. In particular, we describe the phase with the right shift and the phase with the down-right shift.

PDR_refinement refines a portion of the mesh using an external mesh library (Triangle). PDR_shifts integrates triangles in the border subdomain into the mesh. For a more detailed description of PDR_refinement and PDR_shifts see the in-core algorithm [15].

A phase with parallel direction is rather straightforward; the order of refinement (the geometrical direction in which blocks are loaded, refined and stored back to disk) coincides with the direction of the shift:

OSPDR.HORIZONTALSHIFT(M, Δ̄, ρ̄, P, p, N)
Input:  M is a Delaunay mesh computed in previous phase(s)
        X is a planar straight line graph which defines the domain of M
        Δ̄ and ρ̄ are desired upper bounds on triangle area
          and circumradius-to-shortest-edge ratio, respectively
        P is the total number of processors (√P is integer)
        p is the index of the current processor, 1 ≤ p ≤ P
        N² is the total number of subdomains (N/√P is integer)
Output: a (partially) refined Delaunay mesh M_p which conforms to X
        and respects (in certain regions) Δ̄ and ρ̄
0   Calculate row(p) and col(p) of the current processor
    // 1 ≤ row(p), col(p) ≤ √P
1   for m = 1, ..., N
2     for n = 1, ..., N
3       Load block p of subdomain (m−1) × N + n as local mesh M_p
4       if n ≠ 0 and col(p) = 1
5         Reference cells {c_{i,1} | 1 ≤ i ≤ 4} of local mesh M_p
6       endif
7       M_p ← PDR_refinement(M_p, Δ̄, ρ̄, P, p)
8       M_p ← PDR_shifts(M_p, Δ̄, ρ̄, P, p)
9       if col(p) = √P and n ≠ N
10        Assign cells {c_{i,4} | 1 ≤ i ≤ 4} to processor in (row(p), 1)
11      endif
12      Store local mesh M_p as block p of subdomain (m−1) × N + n
13    endfor
14  endfor
15  return M_p

⁴With the exception of the boundary subdomains.

A phase with diagonal direction is more complex, because the corner cell shifts both horizontally and vertically, and both groups of side cells shift in their respective directions:

OSPDR.DIAGONALSHIFT(M, Δ̄, ρ̄, P, p, N)
Input:  same as in OSPDR.HorizontalShift
Output: a (partially) refined Delaunay mesh M_p which conforms to X
0   Calculate row(p) and col(p) of the current processor
    // 1 ≤ row(i), col(i) ≤ √P
1   for m = 1, ..., N
2     for n = 1, ..., N
3       Load block p of subdomain (m−1) × N + n as local mesh M_p
4       if n ≠ 0 and col(p) = 1
5         Reference cells {c_{i,1} | 1 ≤ i ≤ 3} of local mesh M_p
6       endif
7       M_p ← PDR_refinement(M_p, Δ̄, ρ̄, P, p)
8       M_p ← PDR_shifts(M_p, Δ̄, ρ̄, P, p)
9       if col(p) = √P and n ≠ N
10        Assign cells {c_{i,4} | 1 ≤ i ≤ 3} to processor in (row(p), 1)
11      endif
12      if row(p) = √P and m ≠ N
13        Assign cells {c_{4,i} | 1 ≤ i ≤ 3} to processor in (1, col(p))
14      endif
15      if p = P and n ≠ N and m ≠ N
16        Assign cell c_{4,4} to processor in (1, 1)
17      endif
18      if row(p) = 1 and m < N
19        Reference cells {b_{1,i} | 1 ≤ i ≤ 3} of local buffer B
20        Overwrite cells {c_{1,i} | 1 ≤ i ≤ 3} of block p
          in subdomain m × N + n with the content of B
21      endif
22      if n < N and m < N
23        Reference cell b_{1,1} of local buffer B
24        Overwrite cell c_{1,1} of block p
          in subdomain m × N + n + 1 with the content of B
25      endif
26      Store local mesh M_p as block p of subdomain (m−1) × N + n
27    endfor
28  endfor
29  return M_p


2.2.2 Out-of-core Distributed Memory PDR

Similarly to the OSPDR, the Out-of-core Distributed memory PDR (ODPDR) algorithm is designed to create very large meshes in parallel. Unlike the OSPDR, the ODPDR is designed to use multiple nodes of a CoW and exploit aggregate and concurrent access to disk space. The following assumptions were made for the design of the ODPDR algorithm: (1) parts of the mesh stored on disk can only be accessed by the processor that the disk is directly attached to; (2) only a small fraction of the mesh can be loaded into the system memory; and (3) network and disk accesses have a very high latency. Therefore our goal in the ODPDR is the same as in the OSPDR: to minimize the number of accesses and overlap them with computation whenever possible.

Figure 2.4: An example of domain partitioning for the ODPDR (left) and the OHPDR (right) methods. P is the number of processors in the 1 processor/core per node scenario, ppn is the number of processors per node, K is the number of nodes. N is derived empirically and depends on the amount of memory and disk space (N² is the total number of subdomains).

Again, the mesh is stored on disk as a collection of subdomains. The subdomains are generated from the block decomposition (using an auxiliary lattice) we used for the PDR method. The ODPDR relies on the PDR (in-core) parallel Delaunay meshing and refinement code, but uses a different assignment of cells to processors than the PDR. Optimal data distribution reduces the amount of communication to a necessary minimum and consequently lowers the associated latencies. We propose an interleaving block partitioning (see Figure 2.4, left). That is, the domain is partitioned into N² subdomains, where N is a number related to the size of the mesh and the amount of available RAM. Each subdomain is further partitioned into P blocks, where P is the total number of processors. Since P is a constant for every configuration, N is chosen such that the memory requirements of any single block are small enough to fully fit into the RAM of a single node. The total number of blocks in the domain is P × N²; each processor stores (on its local disk) one block from each subdomain, having a total of N² blocks. This scattered decomposition helps to implicitly improve workload imbalances. Similarly to the way we described the OSPDR, we will only explain one horizontal and one diagonal shift.
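The interleaved partitioning can be summarized in a few lines: every one of the N² subdomains contributes exactly one block to each processor, so processor p keeps N² blocks on its local disk, one per subdomain index (m−1)·N + n. The C++ sketch below (file naming is an illustrative assumption) enumerates those blocks.

#include <string>
#include <vector>

// Blocks owned by processor p under the interleaved partitioning:
// block p of every subdomain, indexed (m-1)*N + n as in the pseudocode below.
std::vector<std::string> local_blocks(int p, int N) {
    std::vector<std::string> files;
    for (int m = 1; m <= N; ++m) {
        for (int n = 1; n <= N; ++n) {
            int subdomain = (m - 1) * N + n;
            files.push_back("subdomain_" + std::to_string(subdomain) +
                            "_block_" + std::to_string(p) + ".bin");
        }
    }
    return files;  // N*N entries, all resident on p's local disk
}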
The horizontal/vertical type of top-level shift is rather straightforward: the order of refinement coincides with the direction of the shift (see Figure 2.5):

ODPDR.HORIZONTALSHIFT(M, Δ̄, ρ̄, P, p, N)
Input:  M is a Delaunay mesh computed in previous phase(s)
        X is a planar straight line graph which defines the domain of M
        Δ̄ and ρ̄ are desired upper bounds on triangle area
          and circumradius-to-shortest-edge ratio, respectively
        P is the total number of processors (√P is integer)
        p is the index of the current processor, 1 ≤ p ≤ P
        N² is the total number of subdomains (N/√P is integer)
Output: a (partially) refined Delaunay mesh M_p which conforms to X
        and respects (in certain regions) Δ̄ and ρ̄
0   Calculate row(p) and col(p) of the current processor
    // 1 ≤ row(i), col(i) ≤ √P, 1 ≤ i ≤ P
1   for m = 1, ..., N
2     for n = 1, ..., N
3       Load block p of subdomain (m−1) × N + n as local mesh M_p
4       if n ≠ 0 and col(p) = 1
5         Receive cells {c_{i,1} | 1 ≤ i ≤ 4} of local mesh M_p
6       endif
7       M_p ← PDR_refinement(M_p, Δ̄, ρ̄, P, p)
8       M_p ← PDR_shifts(M_p, Δ̄, ρ̄, P, p)
9       if col(p) = √P and n ≠ N
10        Send cells {c_{i,4} | 1 ≤ i ≤ 4} to processor in (row(p), 1)
11      endif
12      Store local mesh M_p as block p of subdomain (m−1) × N + n
13    endfor
14  endfor
15  return M_p

The diagonal shift is more complex, because the corner cell shifts both horizontally and vertically, and both groups of side cells shift in their respective directions (see Figure 2.5):

ODPDR.DIAGONALSHIFT(M, Δ̄, ρ̄, P, p, N)
Input:  same as in ODPDR.HorizontalShift
Output: a (partially) refined Delaunay mesh M_p which conforms to X
0   Calculate row(p) and col(p) of the current processor
    // 1 ≤ row(i), col(i) ≤ √P, 1 ≤ i ≤ P
1   for m = 1, ..., N
2     for n = 1, ..., N
3       Load block p of subdomain (m−1) × N + n as local mesh M_p
4       if n ≠ 0 and col(p) = 1
5         Receive cells {c_{i,1} | 1 ≤ i ≤ 3} of local mesh M_p
6       endif
7       M_p ← PDR_refinement(M_p, Δ̄, ρ̄, P, p)
8       M_p ← PDR_shifts(M_p, Δ̄, ρ̄, P, p)
9       if col(p) = √P and n ≠ N
10        Send cells {c_{i,4} | 1 ≤ i ≤ 3} to processor in (row(p), 1)
11      endif
12      if row(p) = √P and m ≠ N
13        Send cells {c_{4,i} | 1 ≤ i ≤ 3} to processor in (1, col(p))
14      endif
15      if p = P and n ≠ N and m ≠ N
16        Send cell c_{4,4} to processor in (1, 1)
17      endif
18      if row(p) = 1 and m < N
19        Receive cells {b_{1,i} | 1 ≤ i ≤ 3} of local buffer B
20        Overwrite cells {c_{1,i} | 1 ≤ i ≤ 3} of block p
          in subdomain m × N + n with the content of B
21      endif
22      if n < N and m < N
23        Receive cell b_{1,1} of local buffer B
24        Overwrite cell c_{1,1} of block p
          in subdomain m × N + n + 1 with the content of B
25      endif
26      Store local mesh M_p as block p of subdomain (m−1) × N + n
27    endfor
28  endfor
29  return M_p


[Diagram omitted. Legend: cells stored on disk, cells loaded to RAM, processed cells stored on disk, active buffers, buffers stored in RAM, buffers stored on disk, processor assignment, and data movements (shifts).]
Figure 2.5: Out-of-core schemes of top-level shifts for ODPDR: along axis (left) and diagonal (right). Input geometry is the outline of the North American continent. Setup: 4 processors, 9 subdomains, distributed memory and disk storage.


2.2.3 Out-of-core Hybrid Memory PDR

To take full advantage of the current hardware trend of having multiple processors/cores per node, we designed and implemented the Out-of-core Hybrid memory PDR (OHPDR). Indeed, our experimental study (see Section 5.2) showed that the OHPDR method is faster than the ODPDR on nodes with more than one processor/core. We made the same design assumptions as in the case of the ODPDR. Additionally, processors of the same node have equal access time to its local disk.

Again, the mesh is stored on disks as a collection of subdomains generated from the block decomposition (using the auxiliary lattice). Part of the code responsible for meshing is taken from the OSPDR, but the assignment of cells to processors is different. We use an interleaving partition similar to the one used in the ODPDR (see Figure 2.4, right). The mesh is divided into N² subdomains, where N is a number related to the size of the mesh and the amount of available RAM. Each subdomain is then subdivided into ppn × K blocks, where K is the number of SMP nodes and ppn is the number of processors per node. The value of N is chosen in the same way we chose the number of subdomains for the ODPDR method.

The OHPDR also uses the same two levels of data movements as the ODPDR. However, a shift can be either shared (between processors of an SMP node) or distributed, over the network (between nodes). Similarly, there are two distinct types of top-level shifts: horizontal/vertical and diagonal. We will only focus on the horizontal shift to the right and the diagonal shift to the right and down (the rest is done by analogy).
A top-level horizontal shift is performed in the following steps (see Figure 2.6):


[Diagram omitted. Legend: cells stored on disk, cells loaded to RAM, processed cells stored on disk, active buffers, buffers stored in RAM, buffers stored on disk, processor assignment, and data movements (shifts) between processors and between nodes.]
Figure 2.6: Out-of-core schemes of top-level shifts for OHPDR: along axis (left) and diagonal (right). Input geometry is the outline of the North American continent. Setup: 2 nodes, 2 processors with shared memory per node, 9 subdomains, disk storage.


OHPDR.HORIZONTALSHIFT(M, Δ̄, ρ̄, K, ppn, p, N)
Input:  ppn is the number of processors per node (the same number of
          processors on all nodes)
        K is the number of nodes (we assume √(K · ppn) is integer and,
          for simplicity of the presentation, K = ppn)
        p is the index of the current processor, 1 ≤ p ≤ ppn × K
        M, X, Δ̄, ρ̄ and N are the same as in ODPDR.HorizontalShift
Output: a (partially) refined Delaunay mesh M_p which conforms to X
        and respects (in certain regions) Δ̄ and ρ̄
0   Calculate node(p) and proc(p) of the current processor
    // 1 ≤ node(i) ≤ K, 1 ≤ proc(i) ≤ ppn, 1 ≤ i ≤ ppn × K
1   for m = 1, ..., N
2     for n = 1, ..., N
3       Load block p of subdomain (m−1) × N + n as local mesh M_p
4       if n ≠ 0 and proc(p) = 1
5         Read cells {c_{i,1} | 1 ≤ i ≤ 4} of local mesh M_p from the shared-memory buffer
6       endif
7       M_p ← SPDR_refinement(M_p, Δ̄, ρ̄, ppn, K, p)
8       M_p ← SPDR_shifts(M_p, Δ̄, ρ̄, ppn, K, p)
9       if proc(p) = ppn and n ≠ N
10        Write cells {c_{i,4} | 1 ≤ i ≤ 4} into the shared-memory buffer
11      endif
12      Store local mesh M_p as block p of subdomain (m−1) × N + n
13    endfor
14  endfor
15  return M_p

The top-level diagonal shift to the right and down is performed in the following steps (see Figure 2.6):

OHPDR.DIAGONALSHIFT(M, Δ̄, ρ̄, K, ppn, p, N)
Input:  same as in OHPDR.HorizontalShift
Output: a (partially) refined Delaunay mesh M_p which conforms to X
0   Calculate node(p) and proc(p) of the current processor
    // 1 ≤ node(i) ≤ K, 1 ≤ proc(i) ≤ ppn, 1 ≤ i ≤ ppn × K
1   for m = 1, ..., N
2     for n = 1, ..., N
3       Load block p of subdomain (m−1) × N + n as local mesh M_p
4       if n ≠ 0 and proc(p) = 1
5         Read cells {c_{i,1} | 1 ≤ i ≤ 3} of local mesh M_p from the shared-memory buffer
6       endif
7       M_p ← SPDR_refinement(M_p, Δ̄, ρ̄, ppn, K, p)
8       M_p ← SPDR_shifts(M_p, Δ̄, ρ̄, ppn, K, p)
9       if proc(p) = ppn and n ≠ N
10        Write cells {c_{i,4} | 1 ≤ i ≤ 3} into the shared-memory buffer
11      endif
12      if node(p) = K and m ≠ N
13        Send cells {c_{4,i} | 1 ≤ i ≤ 3} to node 1
14      endif
15      if proc(p) = ppn and node(p) = K and n ≠ N and m ≠ N
16        Send cell c_{4,4} to node 1
17      endif
18      if node(p) = 1 and m < N
19        Receive cells {b_{1,i} | 1 ≤ i ≤ 3} of local buffer B
20        Overwrite cells {c_{1,i} | 1 ≤ i ≤ 3} of block p
          in subdomain m × N + n with the content of B
21      endif
22      if n < N and m < N
23        Receive cell b_{1,1} of local buffer B
24        Overwrite cell c_{1,1} of block p
          in subdomain m × N + n + 1 with the content of B
25      endif
26      Store local mesh M_p as block p of subdomain (m−1) × N + n
27    endfor
28  endfor
29  return M_p

Chapter 3

Out-of-core Parallel Constrained Delaunay Meshing

3.1 Parallel Constrained Delaunay Meshing

In this thesis we present an out-of-core version of the Parallel Constrained Delaunay Meshing (PCDM) [12] which we presented in [33].

The mesh generation procedure starts with constructing an initial mesh which conforms to the input vertices and segments, and then refines this mesh until the constraints on triangle quality and size are met. The general idea of Delaunay refinement is to insert points inside the circumscribed circles¹ of triangles that violate the required bounds, until there are no such triangles left. To update the triangulation, the Bowyer/Watson algorithm [8, 50] is used, which is based on deleting the triangles that are no longer Delaunay and inserting new triangles that satisfy the Delaunay property.

¹Traditionally circumcenters are used. However, as the authors have shown [11], there exist entire regions for the selection of new points.


The set of triangles in the mesh whose circumcircles include the newly inserted point p_i is called a cavity [26], and is denoted as C(p_i). Also, the symbol ∂C(p_i) stands for the set of edges which belong to only one triangle in C(p_i), i.e., external edges.

In the absence of external boundaries, the algorithm maintains a Delaunay mesh M, and at every iteration performs the following steps (a code sketch of a single insertion is given after the list):

1. Select a triangle from the queue of unsatisfactory triangles.
2. Compute the circumcenter p_i of this triangle.
3. Compute the cavity C(p_i), i.e., the set of triangles whose circumcircles contain p_i.
4. Delete all triangles in C(p_i) from M.
5. Add triangles obtained by connecting p_i with every edge in ∂C(p_i) to M.
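The following is a compact C++ sketch of steps 3-5 for a given insertion point p, using the simplest possible mesh representation (a flat list of triangles and a non-robust circumcircle test); it illustrates the cavity-based insertion described above and is not the PCDM implementation.

#include <algorithm>
#include <map>
#include <utility>
#include <vector>

struct Point    { double x, y; };
struct Triangle { int v[3]; };  // vertex indices into the point array

// Non-robust circumcircle test: true if p lies strictly inside the
// circumcircle of triangle t (a real mesher would use exact predicates).
static bool in_circumcircle(const Triangle& t, const Point& p,
                            const std::vector<Point>& pts) {
    const Point &a = pts[t.v[0]], &b = pts[t.v[1]], &c = pts[t.v[2]];
    double d  = 2.0 * ((b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x));
    double la = b.x * b.x - a.x * a.x + b.y * b.y - a.y * a.y;
    double lb = c.x * c.x - a.x * a.x + c.y * c.y - a.y * a.y;
    double ux = (la * (c.y - a.y) - lb * (b.y - a.y)) / d;   // circumcenter
    double uy = (lb * (b.x - a.x) - la * (c.x - a.x)) / d;
    double r2  = (ux - a.x) * (ux - a.x) + (uy - a.y) * (uy - a.y);
    double pr2 = (ux - p.x) * (ux - p.x) + (uy - p.y) * (uy - p.y);
    return pr2 < r2;
}

// Steps 3-5: collect the cavity C(p), remove its triangles, and connect p
// to every external (boundary) edge of C(p). Orientation handling omitted.
void insert_point(std::vector<Triangle>& mesh, std::vector<Point>& pts,
                  const Point& p) {
    std::vector<Triangle> kept;
    std::map<std::pair<int, int>, int> edge_count;  // cavity edge multiplicities
    for (const Triangle& t : mesh) {
        if (in_circumcircle(t, p, pts)) {
            for (int k = 0; k < 3; ++k) {
                int a = t.v[k], b = t.v[(k + 1) % 3];
                ++edge_count[{std::min(a, b), std::max(a, b)}];
            }
        } else {
            kept.push_back(t);
        }
    }
    mesh.swap(kept);                      // step 4: cavity triangles deleted
    pts.push_back(p);
    int pi = static_cast<int>(pts.size()) - 1;
    for (const auto& e : edge_count)      // step 5: edges seen once are external
        if (e.second == 1)
            mesh.push_back(Triangle{{e.first.first, e.first.second, pi}});
}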
The case when the new point happens to be close to a constrained edge is treated separately. Following Shewchuk [41], diametral lenses are used to detect whether a segment is encroached upon. The diametral lenses of a segment is the intersection of two disks whose centers lie on opposite sides of the segment on each other's boundaries, and whose boundaries intersect in the endpoints of the segment. A segment is said to be encroached upon by a point p_i if p_i lies inside its diametral lenses. When a point selected for insertion is found to encroach upon a segment, another point is inserted in the middle of the segment instead.
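The lens test itself reduces to two disk-membership checks. Under the construction described above, each disk passes through both endpoints and its center lies on the other disk's boundary, which gives radius |ab|/√3; the C++ sketch below is only an illustration of that geometry, not Shewchuk's or the PCDM's implementation.

#include <cmath>

struct Point { double x, y; };

// True if p encroaches upon segment ab, i.e., lies inside its diametral
// lenses: the intersection of two disks of radius |ab|/sqrt(3) whose centers
// sit on opposite sides of ab, each on the other disk's boundary.
bool encroaches(const Point& a, const Point& b, const Point& p) {
    double dx = b.x - a.x, dy = b.y - a.y;
    double len = std::hypot(dx, dy);
    double r   = len / std::sqrt(3.0);          // lens disk radius
    double h   = len / (2.0 * std::sqrt(3.0));  // center offset from midpoint
    double mx = 0.5 * (a.x + b.x), my = 0.5 * (a.y + b.y);
    double nx = -dy / len, ny = dx / len;       // unit normal to ab
    double c1x = mx + h * nx, c1y = my + h * ny;   // disk centers on
    double c2x = mx - h * nx, c2y = my - h * ny;   // opposite sides
    return std::hypot(p.x - c1x, p.y - c1y) < r &&
           std::hypot(p.x - c2x, p.y - c2y) < r;
}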
To refine the mesh in parallel, a coarse-grained domain decomposition obtained with the meDDec [36] library is used. The goal is to distribute the subdomains among processors so that the sums of the weights of the subdomains on each processor are approximately equal, and the total length of the subdomain boundaries which are shared between processors is minimized. Figure 3.1² shows an example of a human brain domain decomposition. During runtime, the Load Balancing Library [5] maintains the equidistribution and small edge-cut conditions by moving the subdomains among the processors in response to dynamically changing work load imbalance.

Figure 3.1: Decomposition of a human brain into 1024 subdomains mapped onto 8 processors.

The domain decomposition procedure described above creates N subdomains, each
of which is bounded by edges of the initial domain decomposition. The edges and their
endpoints that are shared between two subdomains are duplicated. The boundary edges
are treated as constrained segments, and whenever they are split due to encroachment
on one processor, an active message [25, 49] is sent to the mobile object holding the
adjacent subdomain, so that the duplicate of the boundary edge is also split, and the
mesh is globally consistent (see Figure 3.2).

Figure 3.2: Splitting an edge.

3.2 Programming model

The first step in enabling out-of-core computing for the PCDM is to port it to the Portable Runtime Environment for Mobile Applications (PREMA) [7] framework. PREMA was created to support the development of adaptive and irregular applications like parallel mesh refinement. It has been demonstrated that PREMA has a number of advantages over similar systems while simplifying application development.
The PREMA programming model is centered around the mobile object concept. A mobile object is an application-defined structure and is not restricted to contiguous memory. A mobile object can be referenced by any processor via a unique global mobile pointer. PREMA is designed for data-centric computation where most of the communication happens between mobile objects rather than between processors. The communication is handled by message operations. A message is an asynchronous remote request call that is routed to the location of the target mobile object. Since a mobile object can move between processors, the request is sent to the last known location and then routed to the correct destination. Upon arrival, a user-defined handler function is called with the target mobile object and the arguments from the message.
PREMA encourages overdecomposition, that is, the problem is broken into N subproblems with N ≫ P, where P is the number of processors. Originally, a high degree of overdecomposition was required to allow greater flexibility to the Implicit Load Balancing (ILB) library. However, it is also important in out-of-core computing: the more subproblems can fit in-core, the greater the flexibility of paging.
Thus, the computation of a typical PREMA application consists of operations on mobile objects. When communication is necessary, a message operation is used regardless of the location of the target mobile object. Conventional iterative accesses in this model should be replaced by sending messages to the corresponding mobile objects.
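To make this porting pattern concrete, the sketch below shows a direct call on a neighboring subdomain replaced by a message posted to its mobile object. The types and the post_message routine are a minimal single-process stand-in written only for this illustration; they are not the PREMA API.

    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <unordered_map>
    #include <vector>

    // Hypothetical stand-ins: a mobile pointer is a global id, and post_message()
    // routes a handler invocation to the mobile object that owns that id.
    struct Subdomain {                        // application-defined mobile object
        int id;
        std::vector<double> boundary_splits;  // placeholder for mesh data
    };

    using MobilePtr = std::uint64_t;

    struct MiniRuntime {                      // toy single-node router, not PREMA
        std::unordered_map<MobilePtr, Subdomain> objects;
        void post_message(MobilePtr target,
                          const std::function<void(Subdomain&, double)>& handler,
                          double arg) {
            handler(objects.at(target), arg); // a real RTS would queue and route this
        }
    };

    // User-defined handler: split the duplicated boundary edge at parameter t.
    void split_edge_handler(Subdomain& sub, double t) {
        sub.boundary_splits.push_back(t);
        std::cout << "subdomain " << sub.id << " split at " << t << "\n";
    }

    int main() {
        MiniRuntime rts;
        rts.objects[42] = Subdomain{42, {}};
        // Instead of calling the neighboring subdomain directly, post a message to it.
        rts.post_message(42, split_edge_handler, 0.5);
    }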

3.3 Out-of-core subsystem for the PCDM

The smallest chunk of data that can be stored out-of-core is a mobile object. We developed a light-weight out-of-core layer to facilitate loading and storing objects, as well as to determine their status (in-core or out-of-core) and to assist the algorithm in making decisions on what needs to be loaded or stored at any given time.
This layer provides the functionality of a high-level cache backed by the local disk. For every node, it maintains a directory of local objects currently in memory or stored on disk, as well as a history of accesses. On disk, the objects can be stored as a collection of files (one object per file) or as a single file. Storing all objects in one file eliminates the overhead of the file system (although unless we can write directly to the disk device, which is rarely an option on public resources, the overhead savings are insignificant), but it requires extra data structures and code to index the objects. In the single-file case there is an option to pad the objects to accommodate their growth. We also support raw access to a disk partition, but this solution is not very practical due to access limitations on public resources.
Upon a request to load an object, it is loaded into a buffer, the callbacks to unpack the object are executed, and its pointer is returned. Often the object is already in the buffer, in which case no loading is necessary. Regardless of whether the object was in the buffer, the request blocks until the object is ready.
When there is a request to store an object, the state of the buffer is evaluated. Then, based on the amount of available space in the buffer, the access statistics, and the priorities, the object may be unloaded to disk or kept in the buffer. In the latter case the priority of the object is lowered. Requests to store an object are always non-blocking.
Additionally, on every object store request, the state of the cache is evaluated and, based on the access statistics, some objects may be scheduled to move to disk or be prefetched. For these decisions we use a prioritized least recently used page replacement policy, with priority information passed from the higher layers.
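A minimal sketch of such a layer is shown below, assuming objects are opaque byte strings identified by an integer id and reducing eviction to a plain priority-aware LRU; the class and its methods are illustrative only and do not reproduce the actual PCDM code.

    #include <cstddef>
    #include <list>
    #include <string>
    #include <unordered_map>
    #include <utility>

    // Illustrative out-of-core layer: "disk" is simulated with an in-memory map.
    class OocCache {
    public:
        explicit OocCache(std::size_t capacity) : capacity_(capacity) {}

        // Blocking load: bring the object in-core if necessary and return it.
        const std::string& load(int id) {
            if (!incore_.count(id))
                incore_[id] = disk_.at(id);            // unpack callbacks would run here
            touch(id);
            evict_if_needed();
            return incore_.at(id);
        }

        // Non-blocking store request: keep or evict based on space and priority.
        void store(int id, std::string data, int priority = 0) {
            incore_[id] = std::move(data);
            priority_[id] = priority;
            touch(id);
            evict_if_needed();
        }

    private:
        void touch(int id) { lru_.remove(id); lru_.push_front(id); }

        void evict_if_needed() {                       // prioritized LRU eviction
            bool evicted = true;
            while (incore_.size() > capacity_ && evicted) {
                evicted = false;
                for (auto it = lru_.rbegin(); it != lru_.rend(); ++it) {
                    if (priority_[*it] <= 0) {         // skip high-priority ("sticky") objects
                        int victim = *it;
                        disk_[victim] = incore_[victim]; // pack callbacks would run here
                        incore_.erase(victim);
                        lru_.remove(victim);
                        evicted = true;
                        break;
                    }
                }
            }
        }

        std::size_t capacity_;                         // objects allowed in-core
        std::list<int> lru_;                           // most recently used at the front
        std::unordered_map<int, std::string> incore_, disk_;
        std::unordered_map<int, int> priority_;
    };

    // Example: two-object capacity; storing a third object evicts the LRU one.
    int main() {
        OocCache cache(2);
        cache.store(1, "mesh block 1");
        cache.store(2, "mesh block 2", /*priority=*/1); // hint: keep in-core
        cache.store(3, "mesh block 3");                 // evicts object 1 to "disk"
        return cache.load(1).size() > 0 ? 0 : 1;        // object 1 is reloaded
    }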

3.4 Implementation

The implementation is done in two steps: first, we port the PCDM to PREMA without adding any new functionality; second, we make changes and optimize the application to enable and improve its out-of-core performance.
We start by registering the subdomains as mobile objects; then we replace all communication between processors with communication between subdomains. We also remove some of the code that keeps track of the remote subdomains and decides which subdomain should receive a split request when it arrives. With the exception of a few minor changes, this constitutes the porting process.

We modify the main loop to access the local mobile objects through the out-of-core subsystem rather than directly. This ensures that objects are in fact in-core when the application tries to access them. In case an object is out-of-core, the application call blocks until the object is loaded. At this stage the application can already run out-of-core, but the performance is poor. The main reason for the poor performance is the order in which each processor refines the subdomains it owns. Since the original PCDM does not differentiate between in-core and out-of-core subdomains, it often tries to refine out-of-core subdomains before it refines all subdomains that are currently in-core. This adds extra overhead to migrate unrefined subdomains to disk and back.
We introduced the following changes (see Figures 3.3 and 3.4). First, before processing a subdomain in the main loop, we check whether the next subdomain in the queue is in-core and either mark it as sticky if it is, or post a non-blocking load request for it if it is not. Second, after all bad triangles are processed for a subdomain, we check whether the next subdomain in the queue is in-core. If it is not, we move it to the end of the queue and examine the next one. If we cannot find an in-core subdomain, we load the next subdomain in the queue with a blocking call.
It should be noted that the RTS will mark subdomains with multiple incoming messages as sticky and may attempt to prefetch them. Additionally, when processing incoming messages (the application is polling), the RTS first executes messages addressed to in-core subdomains regardless of the order in which the messages were received (the order of the messages sent to the same subdomain is preserved).

OPCDM({(X_i, M_i) | i = 1, ..., N}, Ā, ρ̄, P, p)
Input:   X_i are PSLGs that define the subdomains Ω_i
         M_i are initial coarse meshes of the Ω_i
         Ā is the upper bound on triangle area
         ρ̄ is the upper bound on triangle circumradius-to-shortest edge ratio
         P is the total number of processes
         p is the index of the current process
Output:  Modified Delaunay meshes {M_i} which respect the bounds Ā and ρ̄
 1   Compute the mapping κ : {1, ..., N} → {1, ..., P} of subdomains to processes
 2   Distribute subdomains to processes
 3   Let {Ω_{i_1}, ..., Ω_{i_Np}} be the set of local subdomains
*4   Let Q be the set of unrefined subdomains
*5   Q ← {Ω_{i_j} | j = 1, ..., Np}
*6   while Q ≠ ∅
*7       Ω_{i_j} ← SCHEDULE(Q)
 8       DELAUNAYREFINEMENT(X_{i_j}, M_{i_j}, Ā, ρ̄, κ)
*9   endwhile
10   TERMINATE()

DELAUNAYREFINEMENT(X, M, Ā, ρ̄, κ)
11   Q ← {t ∈ M | (ρ(t) ≥ ρ̄) ∨ (A(t) ≥ Ā)}
12   while Q ≠ ∅
13       Let t ∈ Q
14       BADTRIANGLEELIMINATION(X, M, t, κ)
15       Update Q
16   endwhile

BADTRIANGLEELIMINATION(X, M, t, κ)
17   p_t ← CIRCUMCENTER(t)
18   if p_t encroaches upon a segment s ∈ X
19       p_t ← MIDPOINT(s)
20       REMOTESPLITREQUEST(κ(NEIGHBOR(s)), p_t)
21   endif
22   C(p_t) ← {t ∈ M | p_t is inside the circumscribed circle of t}
23   M ← M \ C(p_t) ∪ {△(p_t p_m p_n) | e(p_m p_n) ∈ ∂C(p_t)}

Figure 3.3: A high level description of the OPCDM. This figure is based on the original PCDM algorithm figure [12], with the augmented lines marked with *. Function SCHEDULE is explained in Figure 3.4.


SCHEDULE(Q)
Input:   Q is a set of local subdomains (in-core and out-of-core)
Output:  A subdomain Ω_{i_j} that is guaranteed to be in-core; also start preloading
 1   Ω_{i_j} ← FINDINCORE(Q)
 2   if Ω_{i_j} = null
 3       Ω_{i_j} ← pop(Q)
 4       Blocking load Ω_{i_j}
 5   endif
 6   Ω_{i_{j+1}} ← FINDINCORE(Q)
 7   if Ω_{i_{j+1}} = null
 8       Ω_{i_{j+1}} ← pop(Q)
 9       Non-blocking load Ω_{i_{j+1}}
10   else
11       Mark Ω_{i_{j+1}} as sticky
12   endif
13   push(Q, Ω_{i_{j+1}})
14   return Ω_{i_j}

Figure 3.4: A high level description of function SCHEDULE, part of the OPCDM algorithm.
FINDINCORE returns a subdomain that is currently in-core (if any).


Chapter 4

Multi-layered Run-Time System
To simplify and streamline the process of enabling existing codes to compute out-of-core, as well as the development of new out-of-core applications, we designed and implemented the Multi-layered Run-Time System (MRTS) [35], a practical out-of-core runtime system that supports the execution of large-scale parallel applications on a fraction of the nodes that would otherwise be required.

4.1 Requirements

The mesh generation methods we described earlier have the following common characteristics:

1. spatial locality - each PE works with a subset of mesh elements that cover a certain geometrically defined area, and most of the computation is performed on data that does not have outside dependencies;

2. although the communication patterns vary among the methods, the common property is that the amount of data that the PEs need to exchange is substantially smaller than the sizes of the subdomains, i.e., the mesh sizes of the subdomains;

3. local synchronization - changes in a subdomain usually affect only the neighbors of that subdomain, and global synchronization is not required;

4. irregular access pattern - it is not possible to predict the exact mesh elements and memory locations that are accessed;

5. SPMD data model - a single program is used to process portions of the dataset in parallel;

6. interoperability - to simplify the porting process we should not obstruct MPI or any other form of communication used by the rest of the application (i.e., an FE solver).

4.2 Background

We adopted the mobile object, which is defined in [7] as a location-independent container implemented by the run-time system to store application data. The decision of what to define as mobile objects is left to the application programmer, but we recommend using them for representing larger semi-isolated fragments of a dataset (e.g., subdomains). A mobile object can be freely moved by the run-time system between nodes and is globally addressable.
A message is an amalgamation of a data transfer and a remote procedure call [48]. It is one-sided, which means the receiving node does not have to post an explicit receive and is not interrupted when a message arrives.
A message handler is a function defined by an application and registered with a mobile object. A message is delivered to a mobile object by invocation of the corresponding message handler on the node where the mobile object is located. Message handlers, messages and mobile objects allow encapsulation of the data represented by mobile objects.
A mobile pointer is a global identifier and is used to reference a mobile object. Because a mobile object is not restricted to any specific node, a message is addressed to the mobile pointer and the run-time system routes the message appropriately. The order of messages is preserved only between two endpoints.
In the course of out-of-core computing, mobile objects can be unloaded to and reloaded from the disk. Mobile objects support serialization (the process of transforming the memory representation of an object into a data format suitable for storage or transmission) by implementing the serialization interfaces provided by the run-time system. A more detailed description of the behavior and requirements of out-of-core objects is provided in the following sections.
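As an illustration, a mobile object might implement a serialization interface along the following lines; the interface name and method signatures are assumptions made for this sketch, not the actual MRTS interfaces.

    #include <cstring>
    #include <vector>

    // Hypothetical serialization interface a mobile object could implement so that
    // the run-time system can move it between nodes or write it to disk.
    struct Serializable {
        virtual std::vector<char> pack() const = 0;             // to a flat buffer
        virtual void unpack(const std::vector<char>& buf) = 0;  // restore from a buffer
        virtual ~Serializable() = default;
    };

    struct LeafObject : Serializable {     // e.g., one quad-tree leaf of the mesh
        std::vector<double> vertices;

        std::vector<char> pack() const override {
            std::vector<char> buf(vertices.size() * sizeof(double));
            std::memcpy(buf.data(), vertices.data(), buf.size());
            return buf;
        }
        void unpack(const std::vector<char>& buf) override {
            vertices.resize(buf.size() / sizeof(double));
            std::memcpy(vertices.data(), buf.data(), buf.size());
        }
    };

    int main() {
        LeafObject a;  a.vertices = {0.0, 1.5, 2.5};
        LeafObject b;  b.unpack(a.pack());        // round trip through a flat buffer
        return b.vertices.size() == 3 ? 0 : 1;
    }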

4.3 Programming Model

The programming model is centered around the mobile object concept. The run-time system is designed for data-centric computation where most communication happens between mobile objects rather than between processors. Parallelism is achieved by executing message handlers simultaneously on multiple nodes and by executing multiple tasks within each message handler. The MRTS tries to achieve maximum utilization by executing as many tasks as are available while not oversubscribing the PEs, which could lead to unnecessary context switches and performance degradation.
A typical application for the run-time system has its dataset broken into a collection of mobile objects. We encourage overdecomposition, that is, breaking the problem into N subproblems with N ≫ P, where P is the number of PEs. Overdecomposition allows greater flexibility for dynamic load balancing [6] and is even more important for out-of-core computing, where the number of objects simultaneously allowed in memory is limited by the available physical memory.
At the beginning, an application performs initial preprocessing (if necessary), creates mobile objects, defines serialization interfaces, registers message handlers, distributes the mobile objects between nodes (optional), initiates the parallel phase by posting the initial messages (e.g., a main/driver function) and then passes control to the run-time system.
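The skeleton below sketches this structure. The rts_* routines are trivial single-process stand-ins written only so that the example runs; they are not the MRTS API, and a real application would create far more mobile objects than PEs.

    #include <cstdio>
    #include <functional>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Toy stand-ins for the run-time system, used only by this sketch.
    struct Subdomain { int id = 0; };
    using MobilePtr = int;
    using Handler = std::function<void(Subdomain&)>;

    static std::vector<Subdomain*> g_objects;
    static std::unordered_map<std::string, Handler> g_handlers;
    static std::vector<std::pair<MobilePtr, std::string>> g_queue;

    MobilePtr rts_create_mobile_object(Subdomain* s) {
        g_objects.push_back(s);
        return static_cast<MobilePtr>(g_objects.size()) - 1;
    }
    void rts_register_handler(const std::string& name, Handler h) { g_handlers[name] = std::move(h); }
    void rts_post_message(MobilePtr p, const std::string& name) { g_queue.emplace_back(p, name); }
    void rts_run() {                        // termination: no messages and no handlers left
        while (!g_queue.empty()) {
            auto msg = g_queue.front();
            g_queue.erase(g_queue.begin());
            g_handlers.at(msg.second)(*g_objects.at(msg.first));
        }
    }

    void refine_handler(Subdomain& s) { std::printf("refining subdomain %d\n", s.id); }

    int main() {
        std::vector<Subdomain> subs(8);                         // overdecomposed dataset
        for (int i = 0; i < 8; ++i) subs[i].id = i;
        std::vector<MobilePtr> ptrs;
        for (auto& s : subs) ptrs.push_back(rts_create_mobile_object(&s));
        rts_register_handler("refine", refine_handler);         // register message handlers
        for (MobilePtr p : ptrs) rts_post_message(p, "refine"); // post the initial messages
        rts_run();                                              // pass control to the RTS
        return 0;                                               // post-processing would follow
    }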
The execution progresses by executing message handlers, posting messages and dynamically creating new mobile objects. A message is posted to perform an operation on the data of a particular mobile object. Messages can be addressed to local (including self), out-of-core and remote mobile objects. In fact, we strongly recommend using messages rather than function calls or other means of communication outside the context of the mobile object. Otherwise, the application is responsible for load balancing and for checking and ensuring the availability of the data it tries to access.
A message addressed to a local mobile object is inserted into its message queue. If the object is local but out-of-core, the message is queued and the object is scheduled to be loaded in-core. If the object is remote, the message is routed to the corresponding node and processed there. The processing of a message from a remote node is the same as for a local message.
The bulk of the parallel computation is performed inside message handlers. When no message handlers are executing and no messages are being delivered, the run-time system detects a termination condition. At this point the control is passed back to the application. Usually, at this point the application performs post-processing (if necessary) and terminates, although it is possible to start another phase of computing with the run-time system.
Figure 4.1: Memory organization and global addressing of the MRTS.

4.4 Organization

The run-time system is organized into layers according to the principle of separation of concerns (see Fig. 4.1). Parallelism is exploited via multi-threading on the node level and via message passing between nodes. The memory space available to an application consists of local, disk and remote memory; hence, we call our run-time system the Multi-layered Run-Time System. The MRTS is organized into the following layers: the storage layer, the out-of-core layer, the control layer and the computing layer.
The storage layer is used for managing the mobile objects stored out-of-core. The underlying storage facility is hidden from the application and can utilize regular files, block devices and databases (the evaluation of different storage subsystems is out of the scope of this thesis and will be presented elsewhere; in all experiments presented here, out-of-core objects are stored in a single large file and the meta-data is kept in memory at all times). Blocking and non-blocking operations for loading and storing a mobile object are provided. This functionality is primarily used by the MRTS internally and is not exposed to the application.
The out-of-core layer is responsible for keeping track of mobile objects and controlling swapping (i.e., determining when and which objects should be unloaded from or loaded to memory). The out-of-core layer also maintains a cache to prefetch mobile objects depending on the swapping scheme and input from the application.
The control layer is responsible for delivering messages either locally or remotely and for controlling the migration of objects between nodes. Object location is determined by querying the distributed mobile object directory. Depending on the location of the object, the message can be routed to a remote node or queued for local execution. The control layer decides the order in which the message queues of local mobile objects are processed. The input from the control layer influences the swapping decisions of the out-of-core layer. In addition, the control layer provides memory management primitives to the application [2].

The computing layer is used to provide a uniform interface to the various multi-threading technologies employed in the MRTS. We encourage the use of tasks: fragments of code that can run in parallel and are expected to complete without blocking. Each message handler function is viewed as a task once it is scheduled to be executed, and it can spawn new tasks during its execution. Unlike messages, tasks can only access the data of the corresponding mobile object. However, tasks are lightweight and can be used to exploit fine-grain parallelism without much overhead. The computing layer manages the execution of message handlers and tasks and is responsible for memory allocation, synchronization and load balancing of the tasks between PEs (i.e., cores, nodes, racks).

4.5 Implementation

4.5.1 Software layers

The storage layer implements several swapping schemes which are based on popular cache algorithms. In addition to the least recently used (LRU) scheme, we implemented the least frequently used (LFU), the most recently used (MRU), the most used (MU) and the least used (LU) schemes. While the LRU scheme delivers the highest performance most of the time, for some applications (e.g., the PCDM) the LFU can be up to 7% faster.
A set of swapping thresholds is used to influence swapping in normal cases as well as to force swapping in extreme cases. The hard swapping threshold is defined as a multiple of the size of the largest mobile object currently stored on disk. The actual value can be set at the initialization of the MRTS; the default is 2. This threshold is checked whenever the application wants to allocate additional memory. If the amount of memory remaining after the allocation would be less than the threshold, unused objects are forcefully unloaded to free memory. The soft swapping threshold is defined as a fraction of the total available memory and is used to influence the caching of out-of-core mobile objects. When the amount of free memory drops below the soft threshold, the storage layer is "advised" to start swapping. The soft threshold can be set at the initialization of the MRTS; the default is 0.5.
Additionally, the out-of-core layer provides an API to assign swapping priorities to mobile objects (the swapping priority assigned to a mobile object is stored inside the corresponding mobile pointer data structure) and to directly lock/unlock mobile objects. The locking is straightforward: a locked object cannot be unloaded from memory before it is unlocked. The priorities are used to provide hints to the run-time system regarding the importance of keeping an object in-core while still allowing it to make the final decisions.
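For illustration, the threshold checks can be summarized roughly as follows, with the stated defaults of 2 and 0.5; the function and field names are ours and only sketch the logic described above.

    #include <cstddef>

    // Illustrative hard/soft swapping threshold check (not the MRTS code).
    struct SwapAdvice {
        bool force_unload;   // hard threshold violated: forcefully unload unused objects
        bool advise_swap;    // soft threshold violated: advise the storage layer to swap
    };

    SwapAdvice check_thresholds(std::size_t free_mem, std::size_t alloc_request,
                                std::size_t largest_ooc_object, std::size_t total_mem,
                                double hard_factor = 2.0, double soft_fraction = 0.5) {
        SwapAdvice advice{false, false};
        // Hard threshold: a multiple of the largest object currently stored on disk,
        // checked whenever the application asks for additional memory.
        if (static_cast<double>(free_mem) <
            static_cast<double>(alloc_request) + hard_factor * static_cast<double>(largest_ooc_object))
            advice.force_unload = true;
        // Soft threshold: a fraction of the total available memory.
        if (static_cast<double>(free_mem) < soft_fraction * static_cast<double>(total_mem))
            advice.advise_swap = true;
        return advice;
    }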
The control layer uses preemptive communication internally. When such a message arrives, it interrupts one of the currently running threads and gives control to the message handler. The control is returned to the interrupted thread after the message handler completes its execution. Executing potentially long-running mobile messages in this way can lead to high overheads; therefore, application messages are queued upon arrival and executed when appropriate. When a message is removed from the queue it is "delivered" by executing its respective message handler. When the message handler terminates, the control layer decides whether to continue processing the message queue of the current object, to switch to another object, or to serve system aspects such as information dissemination and/or decision making for load balancing or swapping. The control layer keeps track of all messages, including the messages of out-of-core mobile objects, and assigns swapping priorities depending on the number of messages and the order in which they were delivered. Depending on the amount of work (i.e., the number of messages) in-core, the control layer can "advise" the out-of-core layer to initiate swapping.

4.5.2 Mobile Objects and Threads

The mobile object directory that stores mobile pointers is a distributed directory with lazy updates [24]; for a mobile object that resides on a remote node, its last known location is stored. When a message is sent to that location, it is not guaranteed that the destination mobile object will be there. If not, the message is forwarded to the last known location of the object recorded on that node. When the message finally arrives at the object's current location, an update service message is sent back to all nodes through which the message was routed. The experimental evaluation of different location management policies in [24] shows that lazy updates provide a good compromise between accuracy and update overhead.
The computing layer provides a lightweight, mostly wrapper-style interface to multi-threading libraries. We encourage and support multi-threading within a message handler. Each message handler is a task and can be further broken into child tasks, and some of those tasks can be executed in parallel. We utilize two different but similar industrial-strength multi-threading programming technologies (only one can be active at a time).

(1) Intel Threading Building Blocks (TBB) [27] is a C++ template library designed to simplify and streamline parallel programming for C++ developers. It provides high-level abstractions, is based on generic programming, hides the low-level details of managing threads, and supports nested parallelism. (2) Grand Central Dispatch (GCD) [3] is an Apple technology used to optimize application support for systems with multiple and/or multi-core processors. GCD implements task parallelism based on the thread pool pattern. In both cases we use the provided functionality to achieve task-level parallelism within a message handler; a task can be implemented as a block in the case of GCD, or as a method of a task class or a lambda function [28] in the case of TBB.

A user-defined mobile object must implement initialization, un-/registration and de-/serialization methods. Initialization is performed when the object is first created; the object is unregistered when it has to be moved to another node and is registered when it is installed on a new node; the object is de-/serialized when it is transferred from/to disk.
Whenever a mobile object is created, a mobile pointer is generated. Each mobile pointer contains either a reference to its object, if that object is local and in-core, or its location otherwise. Additionally, a mobile pointer of a local mobile object is associated with a queue of messages that were delivered to the mobile object. When an object is loaded in-core its message queue is processed. The size of a message queue influences scheduling and swapping.
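Putting the pieces above together, the information carried by a mobile pointer can be pictured roughly as follows; the field names and types are illustrative, not the MRTS definitions.

    #include <deque>
    #include <vector>

    // Illustrative layout of a mobile pointer, based on the description above.
    struct Message {
        int handler_id;             // which registered handler to invoke
        std::vector<char> args;     // packed optional arguments
    };

    struct MobilePointer {
        void* local_object = nullptr;  // set only while the object is local and in-core
        int   last_known_node = -1;    // otherwise: last known location (lazy updates)
        int   swap_priority = 0;       // hint consulted by the out-of-core layer
        std::deque<Message> queue;     // messages delivered to a local object; its
                                       // length influences scheduling and swapping
    };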

4.5.3 Message Passing

A message is composed of a destination mobile pointer, a message handler and optional arguments. A message handler is implemented as a function. When it is called, it is provided with a reference to the corresponding mobile object (not the mobile pointer) and the optional arguments. Messages that are delivered to their destination nodes are stored together with the respective mobile objects. This means that if an object is out-of-core, its messages are also stored out-of-core. The number of messages in a message queue is stored in the respective mobile pointer.
To send a message to a mobile object the following must be supplied: a mobile pointer that identifies the destination mobile object, a message handler, and optional arguments. In the case of a local mobile object, the message is queued in the respective queue. Otherwise, the message is delivered through a one-sided communication mechanism to the last known node where the object might be located. A remote procedure call is performed both to deliver the message and to notify the remote node of the delivery.
We use the Aggregate Remote Memory Copy Interface (ARMCI) [37] library for such low-level inter-node communication. ARMCI is a portable one-sided communication library that can be used in MPI applications and offers an extensive set of functionality in the area of RMA communication: (1) data transfer operations, (2) atomic operations, (3) memory management and synchronization operations, and (4) locks. Additionally, the ARMCI library is part of the Global Arrays toolkit [38], which is popular in scientific computing and widely supported on existing and upcoming supercomputers, which, in turn, ensures the portability of the MRTS.

4.5.4 Object Migration

When an object is to be migrated to another node or stored out-of-core, it must be appropriately serialized, i.e., packed. Likewise, when an object is installed on a node or is loaded in-core, it has to be de-serialized, i.e., unpacked. Due to the potentially complex internal structure of a mobile object, the serialization operation must be defined by the application. Not all mobile objects designated as out-of-core are actually unloaded to disk; some are cached in memory. To allow a high degree of flexibility for out-of-core computing we provide several instruments of control. An application can choose not to influence the system at all; in such a case the decision to load/store mobile objects is made based on their access pattern (i.e., message pattern). Alternatively, an application can assign priorities, which causes high-priority objects to be cached more often. Finally, an application can force the loading of an object as well as lock an object, which means the object is loaded or stays in memory regardless of its access pattern and priority, respectively. Note that an application should be very careful with locking too many objects, since this can result in running out of memory.

Figure 4.2: Software organization of the MRTS. (The diagram shows the multi-threaded technology layer (TBB or GCD), the multi-layer memory manager, the mobile object directory, the disk and database object managers, and MPI and ARMCI, together with the underlying network, disk and DBMS.)

Figure 4.2 shows the software organization of the MRTS.


4.6 Out-of-core Non-Uniform Parallel Delaunay Refinement

In this section we describe in more detail the out-of-core NUPDR method and its implementation with the MRTS. Its in-core versions appeared in [16] for 2D and in [13] for 3D. We presented the out-of-core version in [35].
The NUPDR uses a master-worker model. The master starts by constructing a quad-tree, which initially contains a single leaf enclosing the entire geometry, and an initial triangulation. Next, a queue of leaves containing poor quality triangles is generated (we will refer to it as the refinement queue). At this point the master enters a loop which terminates only when the refinement queue is empty and no workers are computing. Termination of the loop indicates that the mesh is refined, and the algorithm terminates.
Inside the loop, if the refinement queue is not empty and there is an available worker, a leaf is removed from the queue. Additionally, a buffer zone BUF of the leaf, which is defined as the collection of all its neighboring leaves, is also removed from the queue. The leaf is then passed to an available worker for refinement.
If the queue is empty or no workers are available, the master waits for a worker to finish refining. When this happens, the leaves that compose the buffer BUF of the refined leaf are checked for poor quality triangles. All leaves that have bad triangles are reinserted into the refinement queue.
Poor quality triangles are stored in several structures based on the ratio between the side length of the enclosing leaf and their circumradius. A worker refines a leaf by processing the poor quality triangle structures in a loop starting with the lowest ratio (largest triangles). In that loop a queue of poor triangles with a specific ratio is processed until it is empty.
For each poor triangle, a point is computed using a deterministic function and is inserted into the mesh. Then the mesh is updated, which could lead to a propagation of changes into the buffer leaves BUF and the creation of poor triangles for the current leaf and for the buffer leaves. As a result, the new poor quality triangles are inserted into the corresponding data structures.
When both loops complete, the leaf is recursively split while a relation for constructing the quad-tree holds [13]. The locally refined mesh and quad-tree leaf are returned to the master.
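The master loop can be pictured with the simplified sequential sketch below; refinement is reduced to a stub and the worker pool is not modeled, so this only illustrates the queue and buffer-zone bookkeeping described above.

    #include <algorithm>
    #include <deque>
    #include <vector>

    // Simplified, sequential sketch of the NUPDR master loop (illustrative only).
    struct Leaf {
        bool bad = false;              // leaf contains poor quality triangles
        std::vector<int> buf;          // indices of neighboring leaves (buffer zone BUF)
    };

    void refine_leaf(std::vector<Leaf>& leaves, int id) {      // worker stub
        leaves[id].bad = false;        // real refinement may also create bad triangles in BUF
    }

    void master(std::vector<Leaf>& leaves) {
        std::deque<int> queue;                                 // refinement queue
        for (int i = 0; i < static_cast<int>(leaves.size()); ++i)
            if (leaves[i].bad) queue.push_back(i);

        while (!queue.empty()) {
            int id = queue.front();
            queue.pop_front();
            for (int nb : leaves[id].buf)                      // remove the buffer zone too
                queue.erase(std::remove(queue.begin(), queue.end(), nb), queue.end());
            refine_leaf(leaves, id);                           // would be handed to a worker
            for (int nb : leaves[id].buf)                      // re-queue buffer leaves that
                if (leaves[nb].bad) queue.push_back(nb);       // still hold bad triangles
        }
    }

    int main() {
        std::vector<Leaf> leaves(4);
        leaves[0] = {true, {1, 2}};
        leaves[3] = {true, {2}};
        master(leaves);
    }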

4.6.1 Implementation

The MRTS programming model does not support the master-worker pattern directly, and as such some restructuring of the algorithm is required. First, for each leaf of the quad-tree we create a mobile object which holds the portion of the mesh that is enclosed by this leaf. The refinement queue is also a mobile object. Additionally, the refinement queue mobile object holds and updates the quad-tree structure internally.
At the start, a single thread creates the first (top) leaf mobile object and generates the initial mesh. In the process of mesh generation, the top leaf could be split, and in such cases new mobile objects are constructed. Each leaf stores its list of poor quality triangles independently of the rest.
Next, a list of leaves that contain poor triangles is generated. A message designated update is sent to the refinement queue mobile object and the control is passed to the MRTS. When the control is returned to the application, the mesh is fully refined.
The update message takes the following arguments: a list of changes to the quad-tree, which is a list of mobile pointers to the newly created leaves and their relation to the existing leaves; and a list of mobile pointers of the leaves that have bad triangles.
When an update message is received by the refinement queue mobile object, its handler performs the following. The quad-tree and the refinement queue are updated with the new leaves. If the refinement queue is empty (the list of leaves with bad triangles could be empty), the message handler exits. If not, a leaf is removed from the queue, its buffer BUF is computed, and the respective leaves are also removed from the queue. A message designated construct buffer is sent to the leaf and to its buffer BUF. The only arguments of the message are the mobile pointer of the leaf and the number of leaves in the buffer.
The message handler of construct buffer does the following, depending on the receiver. If the message is received by the leaf object, a counter is created with the number of leaves in the buffer. If the message is received by one of the leaves in the buffer, that leaf sends the message add to buffer to the leaf being refined and frees the memory it used for storing its portion of the mesh.
The add to buffer message is used to deliver a portion of the mesh to another leaf. When an add to buffer message is received by a leaf, the counter of the buffer leaves is decremented and the mesh passed as an argument is integrated into the mesh of the receiving mobile object. When the counter reaches zero, a message designated refine is sent to the leaf object (i.e., to itself). The refine message takes no arguments.
The message handler of a refine message performs the same steps as a worker in the NUPDR algorithm. The only difference is the following. Instead of updating a global list of leaves with poor triangles, a local structure is created and updated throughout the refinement. After the refinement completes, an update message is sent to the refinement queue object. The local list of leaves with poor triangles, as well as any changes made to the quad-tree, are passed as arguments to the update message. Then, new mobile objects are created as needed (for every new leaf) and the corresponding portions of the mesh are distributed among them. Finally, the portions of the mesh that correspond to the leaves other than the current leaf are returned to their owners via recreate messages.
In the end, when no message handlers are executing and no messages are traveling, we reach the termination condition. At this point the control is returned to the application and the algorithm completes.

4.6.2 Optimization

While the algorithm described above works correctly, it is not as efficient as it could be. The following are the changes we introduced to considerably improve the performance.
The refinement queue object is relatively small and receives and sends many messages. Therefore, we lock it in memory, meaning it will never be unloaded out-of-core.
Since we operate in a shared memory environment, we try to minimize the use of add to buffer messages. We check whether the receiving leaf object is in-core, and in such a case we call the message handler directly. When the handler is called directly, the sender's mesh fragment is made available to the receiver and does not have to be copied. Consequently, the memory occupied by the mesh fragment is not freed and a recreate message is unnecessary.
The leaves that are part of the buffer are locked in memory after they send the add to buffer messages or call the respective handlers directly. They do not occupy a significant amount of memory at this point and do not require a recreate message anymore. Instead, a recreate message handler is called directly, and afterwards the objects are unlocked (i.e., can be unloaded from memory). Similarly, we call the refine message handler directly, thus eliminating the possibility that the leaf will be forced out of memory before the message is delivered.
We change the order of the leaves in the refinement queue based on how many leaves are in their buffers. This way we try to have as many leaves as possible present together in-core and available for refining. We also check which leaves are in-core and try to refine the leaves with the most buffer leaves loaded.
Additional improvements come from managing the priorities of the out-of-core subsystem. When we remove a leaf from the refinement queue we check whether it is currently loaded. If it is, we assign it a very high priority to minimize the possibility that it will be unloaded before a construct buffer message arrives. Also, we assign different priorities to the leaves of the buffers depending on the order in which they were removed from the refinement queue.

4.6.3 Findings

The NUPDR algorithm requires access to several leaves of the quad-tree to refine a single leaf. To accommodate this, we either have to collect all leaves in one mobile object dynamically on demand, or store a single leaf in each object but then ensure that, when the message is delivered, all related objects are local and in-core. Since the MRTS discourages direct control over mobile objects, we used the first approach. With optimization, the ONUPDR using this approach performs similarly to the NUPDR. However, this finding led us to believe that the ability to collect several mobile objects during the execution of a mobile message can simplify development and provide additional room for optimization.
We introduced a multicast mobile message to the MRTS. A multicast mobile message is similar to a mobile message except that it can be sent to multiple mobile objects, and it ensures that specific mobile objects are loaded into memory when the message is delivered. Note that this feature is still experimental and requires further research and evaluation.
Instead of a destination mobile pointer, a vector of mobile pointers is supplied. Additionally, a counter specifies which objects will receive the message (the first n objects will receive the message, where n is the counter). In the example of the ONUPDR, we would provide a vector containing the mobile pointers of a leaf and its buffer as the first argument and 1 as the second argument, meaning the message should be delivered only to the leaf mobile object.
Internally, the MRTS must first collect all mobile objects from the vector on the same node and in-core, and only after that is the mobile message delivered. The message is then delivered to one or more mobile objects in the vector (depending on the second argument); the delivery order is not important and can be simultaneous.
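A rough sketch of what posting such a multicast could look like is given below; the types and post_multicast() are stand-ins invented for this illustration (with delivery stubbed out), not the experimental MRTS interface itself.

    #include <functional>
    #include <iostream>
    #include <vector>

    // Illustrative multicast: co-locate and load every object in `targets`, then
    // deliver the handler to the first `receivers` objects of the vector.
    struct LeafObject { int id; };

    void post_multicast(std::vector<LeafObject*>& targets, int receivers,
                        const std::function<void(LeafObject&)>& handler) {
        // ... a real runtime would first load/collect all objects in `targets` ...
        for (int i = 0; i < receivers && i < static_cast<int>(targets.size()); ++i)
            handler(*targets[i]);
    }

    int main() {
        LeafObject leaf{0}, b1{1}, b2{2};            // leaf to refine and its buffer BUF
        std::vector<LeafObject*> v = {&leaf, &b1, &b2};
        // Deliver "refine" to the leaf only (counter = 1), while the whole buffer is loaded.
        post_multicast(v, 1, [](LeafObject& l) { std::cout << "refine leaf " << l.id << "\n"; });
    }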


Chapter 5

Performance Evaluation

5.1 Experimental Setup

For the evaluation we used the CRTC cluster, which is part of the Center for Real-time Computing (http://crtc.wm.edu/) at the College of William and Mary. The cluster consists of four four-way SMP IBM OpenPower 720 compute nodes, with IBM Power5 processors clocked at 1.62 GHz and 8 GB of physical memory on every node. The IBM Power5 is a dual-core processor, and each of its cores is organized as a simultaneous multi-threading execution engine, running two concurrent threads of control from the same or different address spaces. The processor has a large L2 cache (1.9 MB organized in three banks) which is shared between the cores via a crossbar switch, and a very large (36 MB) dedicated L3 cache, which is also shared between the processor's cores and threads. The nodes are interconnected with Gigabit Ethernet and the cluster is accessible from the outside world via Gigabit lines as well. The main 16-processor, 32-core, 64-thread compute infrastructure is housed in one rack along with one dual-processor OpenPower 720 storage server and one dual-processor OpenPower 720 management and software development node. We also employed some nodes of the SciClone cluster at the College of William and Mary (http://compsci.wm.edu/SciClone): 64 single-CPU Sun Fire V120 servers at 650 MHz with 1 GB of memory and 32 dual-CPU Sun Fire 280R servers at 900 MHz with 2 GB of memory.

5.2 Out-of-core Parallel Delaunay Refinement

All algorithms are independent of the geometry of the domain; however, for our performance evaluation we used a square geometry to eliminate other parameters such as work-load imbalance. This and other issues of the in-core algorithm are addressed in the non-uniform Parallel Delaunay Refinement algorithm [15] and are out of the scope of this thesis. However, it should be noted that the over-decomposition introduced by the out-of-core algorithms somewhat improves the work-load imbalance. We tested it with a mesh of a cross section of a pipe model that is part of a rocket fuel system (see Figure 5.1, left). This test geometry shows that the impact of load imbalance is much less severe for the out-of-core PDR algorithms.
To date, there are no agreed standards to evaluate the performance of out-of-core algorithms, and the existing metrics suitable for in-core parallel algorithms are not sufficient for this task. Usually, we expect an out-of-core algorithm to have the following qualities:

• for small problems that can fit in-core, the execution time should be as close as possible to that of an in-core counterpart algorithm;

• for large problems that do not fit in-core it is acceptable to have lower performance, yet it should be comparable to the performance of an in-core algorithm for the same number of processors;

• for the same hardware setup it should be possible to solve much larger problems with an out-of-core algorithm than with its in-core counterpart, although the execution time will be longer;

• ideally, per processor performance should not degrade as the problem size increases while the number of processors and the physical memory stay constant.

Therefore, to evaluate the performance of the out-of-core methods, we compare the in-core methods and the out-of-core methods using the notion of normalized speed that we introduced earlier [34]. This measure is the number of elements generated by a single processor over a unit time period, and it is given by V = N / (T · P), where N is the number of elements generated, P is the number of processors in the configuration and T is the total execution time.
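For example, taking the first row of Tables 5.1 and 5.2 (the 23.8 × 10^6 element mesh, refined by the in-core PDR on P = 4 processors in T = 121 seconds), the normalized speed is

\[
V \;=\; \frac{N}{T \cdot P} \;=\; \frac{23.8 \times 10^{6}}{121 \cdot 4} \;\approx\; 49.2 \times 10^{3}\ \text{triangles per second per processor},
\]

which agrees with the corresponding entry of Table 5.2 up to rounding of the mesh size.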

Tables 5.1 and 5.2 show the performance of all three out-of-core methods on a single 4-way SMP node of the workstation cluster. The PDR performance is also included for comparison. However, the PDR has to use 9, 16 and 25 processors, respectively, from the second problem size on, since those problems would not fit in the aggregate memory of fewer processors. As expected, the OSPDR and the OHPDR show the best performance (not counting the PDR). The ODPDR does not take advantage of shared memory and thus is slower. These data show that the OHPDR method can be as little as 19% slower (and no method is more than about two times slower) than its counterpart in-core PDR method for the mesh sizes that fit completely in the core memory of the CoWs.
Mesh size,        PDR            OSPDR    ODPDR    OHPDR
#elements x10^6                    execution time, sec
23.8              121 (4)          249      276      264
58.8              105 (9)          438      486      444
109.3             116 (16)         631      639      578
175.4             114 (25)        1236     1136     1257

Table 5.1: Parallel Delaunay refinement for a mesh of a unit square using the IBM cluster. The OSPDR, the ODPDR and the OHPDR use 4 processors; the PDR uses 4, 9, 16 and 25 processors.

Mesh size,        PDR            OSPDR    ODPDR    OHPDR
#elements x10^6        normalized speed (x 10^3 triangles per sec per proc)
23.8              49.10 (4)      23.87    21.52    22.53
58.8              62.01 (9)      33.56    30.24    33.12
109.3             58.67 (16)     43.28    42.76    47.26
175.4             61.67 (25)     35.47    38.61    34.89

Table 5.2: Parallel Delaunay refinement for a mesh of a unit square using the IBM cluster. The OSPDR, the ODPDR and the OHPDR use 4 processors; the PDR uses 4, 9, 16 and 25 processors.

Table 5.3 shows the performance of the distributed memory out-of-core PDR methods along with the in-core PDR using up to 121 processors. The unit square is used as the test case. The OHPDR is tested on two slightly different configurations: (1) using 16 nodes with a single processor per node, listed as OHPDR1, and (2) using 8 nodes with two processors per node, listed as OHPDR2. The OSPDR, being designed solely for shared memory, cannot run on these configurations.

Mesh size,        PDR              ODPDR    OHPDR1    OHPDR2
#elements x10^6        normalized speed (x 10^3 triangles per sec per proc)
109.3             23.24 (98.1)     53.33    53.01     45.23
175.4             23.78 (96.74)    52.24    47.61     43.24
255.0             24.01 (98.32)    42.12    48.54     44.51
352.6             24.23 (97.84)    39.8     40.9      38.45
470.7             25.1 (100.2)     52.1     46.24     43.18
587.8             24.6             49.8     50.23     47.12
738.9             24.63            47.27    50.43     46.88
873.5             24.55            51.2     49.67     45.81
1284.1            23.11            48.72    50.6      44.14
1967.2            24.23            50.01    49.82     46.12

Table 5.3: Parallel Delaunay refinement for the unit square. The ODPDR and the OHPDR1 use 16 processors (4 nodes, 4 CPUs per node); the OHPDR2 uses 16 processors (2 nodes, 8 CPUs per node) of the IBM cluster; the PDR uses up to 121 processors of the SciClone cluster. In parentheses in the PDR column are the corresponding values from running the in-core PDR on up to 32 processors of the IBM cluster. Wait-in-queue time is included when computing the normalized speed for the in-core algorithm.

The performance of both OoC methods is similar on the same configuration, which is expected since the OHPDR does not take advantage of shared memory. On SMP nodes the OHPDR (listed as OHPDR2) performs slightly worse. This is the opposite of the results we have seen on another system [34]. It is likely due to the smaller cache (per core) and/or different implementations of MPI and OpenMP.
The normalized speed of the parallel OoC methods is approximately constant for all the large problem sizes we ran. This suggests that the parallel OoC methods scale very well with respect to the problem size.
The total execution time for just under 2 billion elements is a little over an hour and a half (one hour and 37 minutes) using the parallel OoC methods and 16 processors. However, the wait-in-queue delay for parallel jobs with more than 100 processors (they are required to generate the same size mesh using the in-core PDR) in our cluster is on average about five hours, while on the same cluster the waiting time for 16 processors is less than half an hour. This makes the OHPDR2 response time 3.3 times shorter than the response time of the in-core PDR, for mesh sizes close to a billion elements.
Moreover, many scientific computing groups can afford to own a dedicated 8 to 16 processor cluster, which means zero waiting time. Thus, the parallel OoC methods are much more effective and even faster if one uses the total "wall-clock" time.
Mesh size,        PDR             ODPDR    OHPDR
#elements x10^6      normalized speed (x 10^3 triangles per sec per proc)
58.3              16.12 (16)      36.38    36.96
91.1              15.18 (25)      35.21    35.85
131.2             14.29 (36)      36.12    37.02
178.6             14.35 (49)      35.78    36.65
233.3             13.3 (64)       36.35    36.88
295.3             14.08 (81)      35.10    36.03
364.6             15.72 (100)     35.61    36.83
441.1             17.2 (121)      35.89    37.12

Table 5.4: Parallel Delaunay refinement for a mesh of the pipe model. The ODPDR and the OHPDR use 16 processors (4 nodes, 4 CPUs per node); the PDR uses a varying number of processors (16-121). Wait-in-queue time is included when computing the normalized speed for the in-core algorithm.

Table 5.4 shows the performance of the distributed and shared memory OoC methods along with the PDR on large configurations for an irregular geometry, the pipe model. The uniform block data decomposition we used for the pipe model results in an uneven distribution of work to processors. This load imbalance on average reduces the speed of both the in-core method (by 61%) and the OoC methods (by 27%). In the case of the OoC methods, at every point in time the processors refine only a portion of the over-decomposed [6] mesh, with all processors working in close proximity to each other. As a result, the workload is implicitly balanced, because all processors have to perform approximately the same amount of computation.

5.3 Out-of-Core Parallel Constrained Delaunay Meshing

In practice, the mesh generation time is linear with respect to the number of the resulting triangles. The number of elements is roughly inversely proportional to the required triangle area bound, and can therefore be controlled by selecting the area bound correspondingly. The estimate is not an exact prediction of the size of the final mesh, but it works well in our experiments. We used several geometries (see Fig. 5.1) for our evaluation, with the same triangle shape constraint (20° minimal angle).
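Assuming a near-uniform triangle size over a domain Ω, this rough relation can be written as

\[
N \;\approx\; c\,\frac{\mathrm{Area}(\Omega)}{\bar{A}},
\]

where Ā is the triangle area bound and c is a constant depending on the geometry and the quality constraint; halving Ā thus roughly doubles the number of elements N.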

Figure 5.1: Geometries used for evaluation: pipe cross-section, brain cross-section and letter "A".

Additionally, we compare the effectiveness of two different object managers: disk and database. Below we use OPCDM(d) and OPCDM(b) to refer to the experiments performed with the disk object manager and the database object manager, respectively.
Table 5.5 shows the average sustained speeds of the local disks. These speeds are upper bounds for any disk operations. We will use them to determine the utilization of the disks in our evaluation. We measure the average disk read and write speeds for our applications and report them below, as fractions of the sustained speeds, as disk utilization.
Table 5.6 shows the normalized speed for problem sizes that fit completely in-core on a varying number of processors. The problem sizes are experimentally chosen to be

Table 5.5: Average sustained speed of local disks for nodes 1 through 4.

Disk            Sustained speed (MB/sec) for node:
operation         1        2        3        4
read            25.01    25.53    24.61    25.34
write           18.58    17.33    17.95    18.87

as large as possible while still fitting in memory when computed with the PCDM. We can see that the performance of the OPCDM is very close to that of the PCDM, yet it is slightly slower due to the overheads, and it has a slightly larger memory footprint (some of the OPCDM data may be stored out-of-core). There is no difference between using the disk object manager and the database object manager. Since a problem that fits completely in memory does not trigger the use of virtual memory, we do not have a separate column for the case when virtual memory is used.
Table 5.6: Normalized speed of the PCDM and the OPCDM for problems that fit completely in-core. OPCDM(d) and OPCDM(b) denote respectively the OPCDM with disk and database OoC subsystems. Pipe geometry.

Mesh size,         number of PEs    Normalized speed, x 10^3 triangles per second
x 10^6 triangles   (nodes)          PCDM     OPCDM(d)    OPCDM(b)
19.79              2 (1)            68.61    60.01       58.91
39.55              4 (1)            70.10    61.35       62.06
79.11              8 (1)            72.11    64.91       64.07
158.25             16 (2)           70.18    63.86       64.01
316.50             32 (4)           69.95    61.91       62.25

Table 5.7 compares the out-of-core performance of the PCDM with virtual memory to that of the OPCDM for problem sizes that have a memory footprint twice as large as the available physical memory. We use half the PEs for the same problem sizes when compared to Table 5.6.
Tables 5.8, 5.10 and 5.12 demonstrate the effectiveness of the out-of-core approach

Table 5.7: Normalized speed of the PCDM with virtual memory and the OPCDM for problems that have a memory footprint twice as large as the available physical memory. OPCDM(d) and OPCDM(b) denote respectively the OPCDM with disk and database OoC subsystems. Pipe geometry.

Mesh size,         number of PEs    Normalized speed, x 10^3 triangles per second
x 10^6 triangles   (nodes)          PCDM     OPCDM(d)    OPCDM(b)
158.25             8 (1)            29.23    55.59       54.24
316.50             16 (2)           29.00    57.31       55.63
633.07             32 (4)           28.91    55.93       54.75

for computing very large problems in a real-life environment. Because such problems will not fit into the physical memory of a small cluster like ours, one has to run the application on a larger cluster. Therefore, one must factor in the time that is spent waiting in the queue for the job to be scheduled. We used the wall-clock time instead of the total execution time to compute the normalized speed shown in the last table. The wall-clock time is the sum of the wait-in-queue time and the total execution time. We used the average wait-in-queue time for a given number of processors from the statistical data gathered on the SciClone cluster during several years (see Fig. 1.1).
Tables 5.9, 5.11, and 5.13 show the utilization of the disks and the overlap of computation and disk I/O. Disk utilization is roughly a quarter of the peak possible, which is acceptable considering that commercial tools achieve about one third of the peak for sparse data access. The overlap is also quite high (up to 68%) when taking into account the complexity of our application.
We see that the normalized speed for the OPCDM does not change much as we increase the problem size or change the geometry. Disk utilization and overlap also do not change much. At the same time, the wait-in-queue time dominates the wall-clock time

Table 5.8: Normalized speed of the PCDM (estimated) and the OPCDM for large problem sizes. The normalized speed for the PCDM is estimated using statistical data for the wait-in-queue time and the average per processor performance demonstrated on smaller in-core problems. The normalized speed for the OPCDM is computed from the actual total execution time using 16 PEs (2 nodes) with a total physical memory of 16 GB on varying problem sizes (there is no wait-in-queue time for the OPCDM). OPCDM(d) and OPCDM(b) denote respectively the OPCDM with disk and database OoC subsystems. Pipe geometry.

Mesh size,         number of PEs    Normalized speed, x 10^3 triangles per second
x 10^6 triangles   (est.)           PCDM (est.)    OPCDM(d)    OPCDM(b)
949.47             48               7.65           54.56       54.09
1265.96            64               3.38           53.79       52.90
1582.44            80               3.24           57.74       57.79
2531.91            128              1.57           52.80       53.84
3164.89            160              2.15           47.30       48.75
3956.11            200              1.76           53.23       51.85

Table 5.9: OPCDM disk utilization and I/O overlap using the disk OoC subsystem. Utilization is shown as a fraction of the achievable speed. Overlap is shown as a fraction of the total time. Pipe geometry.

Mesh size, x 10^6 triangles    read (%)    write (%)    overlap (%)
949.47                         27.74       22.19        51.67
1265.96                        27.93       22.35        63.00
1582.44                        24.51       19.61        53.48
2531.91                        22.40       17.92        61.50
3164.89                        24.01       19.21        59.86
3956.11                        25.59       20.47        58.38

for the PCDM, and this results in a much lower normalized speed. It is clear that the in-core generation of very large meshes on large clusters with hundreds of processors is less effective in terms of time than the out-of-core generation of the same meshes on small clusters with a limited number of processors and physical memory. Additionally, we see that the OPCDM(b) performs slightly better as the problem size increases. This supports our assumption that databases can be used to store out-of-core data.

Table 5.10: Normalized speed of the PCDM (estimated) and the OPCDM for large problem sizes. The normalized speed for the PCDM is estimated using statistical data for the wait-in-queue time and the average per processor performance demonstrated on smaller in-core problems. The normalized speed for the OPCDM is computed from the actual total execution time using 16 PEs (2 nodes) with a total physical memory of 16 GB on varying problem sizes (there is no wait-in-queue time for the OPCDM). OPCDM(d) and OPCDM(b) denote respectively the OPCDM with disk and database OoC subsystems. Brain geometry.

Mesh size,         number of PEs    Normalized speed, x 10^3 triangles per second
x 10^6 triangles   (est.)           PCDM (est.)    OPCDM(d)    OPCDM(b)
981.95             48               7.42           55.40       55.64
1282.79            64               3.50           51.34       50.84
1599.01            80               3.37           57.14       57.07
2468.28            128              1.63           55.10       52.06
3316.15            160              2.13           47.37       45.73
3962.73            200              1.81           53.02       52.53

Table 5.11: OPCDM disk utilization and I/O overlap using the disk OoC subsystem. Utilization is shown as a fraction of the achievable speed. Overlap is shown as a fraction of the total time. Brain geometry.

Mesh size, x 10^6 triangles    read (%)    write (%)    overlap (%)
981.95                         27.22       21.62        55.27
1282.79                        30.53       22.25        59.52
1599.01                        23.04       20.09        52.26
2468.28                        22.46       16.65        61.03
3316.15                        23.15       18.80        62.47
3962.73                        28.08       20.37        57.04

5.4 Multi-layered Run-Time System

We start by evaluating the performance of the control layer of the MRTS. We tested small problem sizes on the CRTC cluster for all three methods, and very large problems were tested on SciClone for the in-core methods.
Figure 5.2 shows the execution times of the UPDR (16 and 25 PEs) and the OUPDR (16 PEs). The largest problem size on the chart, 175 million elements, is too large for the UPDR running on 16 processors. We can see that the performance of the

Table 5.12: Normalized speed of the PCDM (estimated) and the OPCDM for large problem sizes. The normalized speed for the PCDM is estimated using statistical data for the wait-in-queue time and the average per processor performance demonstrated on smaller in-core problems. The normalized speed for the OPCDM is computed from the actual total execution time using 16 PEs (2 nodes) with a total physical memory of 16 GB on varying problem sizes (there is no wait-in-queue time for the OPCDM). OPCDM(d) and OPCDM(b) denote respectively the OPCDM with disk and database OoC subsystems. Letter "A" geometry.

Mesh size,         number of PEs    Normalized speed, x 10^3 triangles per second
x 10^6 triangles   (est.)           PCDM (est.)    OPCDM(d)    OPCDM(b)
925.78             48               7.31           55.05       56.41
1244.82            64               3.49           52.26       50.77
1585.31            80               3.37           56.90       60.23
2437.68            128              1.49           51.30       54.66
3147.60            160              2.10           49.59       48.70
3974.47            200              1.69           51.66       51.14

Table 5.13: OPCDM disk utilization and I/O overlap using the disk OoC subsystem. Utilization is shown as a fraction of the achievable speed. Overlap is shown as a fraction of the total time. Letter "A" geometry.

Mesh size, x 10^6 triangles    read (%)    write (%)    overlap (%)
925.78                         24.62       19.7         56.68
1244.82                        23.11       18.49        67.97
1585.31                        27.34       21.87        65.56
2437.68                        23.88       19.1         53.72
3147.6                         22.31       17.84        67.34
3974.47                        27.19       21.75        63.11

UPDR and that of the OUPDR is very similar (the OUPDR is up to 12% slower) for in-core problem sizes, which means that the overhead introduced by the MRTS is small.
Figure 5.3 shows the execution times of the NUPDR and the ONUPDR for 2, 4, and 8 PEs (the NUPDR and the current implementation of the ONUPDR are shared memory applications and as such are restricted to a single node). For 4 and 8 PEs, the overhead can be as high as 18%, which is acceptable. For 2 PEs the ONUPDR is up to 41% slower. This is explained by the fact that the NUPDR uses a custom memory allocator that shows a much lower overhead than the MRTS memory manager in the 2 PE case. Figure 5.4 shows the execution times of the PCDM (16 and 25 PEs) and the OPCDM [33] for 8 and 16 processors. As is the case with the UPDR and the OUPDR, the performance of the OPCDM is very similar to that of the PCDM (up to 13% overhead).
Figures 5.5, 5.6 and 5.7 demonstrate the performance of the out-of-core and storage
layers of the MRTS. They show the execution times of the OUPDR (8 and 16 PEs), the
ONUPDR (2, 4 and 8 PEs) and the OPCDM (8 and 16 PEs) for very large problems. These
charts demonstrate that very large problem sizes do not degrade the performance of the
methods on the MRTS (execution time increases almost linearly with problem size).
Table 5.14: Single PE performance of the UPDR and OUPDR methods.

Size,     PEs    Time (sec)         Speed (x 10^3 /sec)
x 10^6           UPDR    OUPDR      UPDR    OUPDR
24        4      294     46         20      33
59        9      295     102        22      36
109       16     295     176        23      39
175       25     297     368        24      30
255       36     293     576        24      28
353       49     295     802        24      27
471       64     300     1133       25      26
588       81     296     1386       24      27
739       100    300     1745       25      26
874       121    294     2111       25      26
1284      n/a    n/a     3122       0       26
1967      n/a    n/a     4599       0       27
Tables 5.14, 5.15 and 5.16 reflect the performance of the out-of-core layer as well as
the performance of the control layer. Note that the execution times of the original
applications are from the older SciClone cluster, since they need the aggregate memory
of over a hundred processors; the MRTS applications run on the newer, faster CRTC
cluster and have higher per-PE speed in most cases.

Figure 5.2: Execution times for UPDR and OUPDR for in-core problem sizes. (Plot of time
in seconds versus mesh size in million elements; series: UPDR 16 PEs, UPDR 25 PEs,
OUPDR 16 PEs.)

Figure 5.3: Execution times for NUPDR and ONUPDR for in-core problem sizes. (Plot of
time in seconds versus mesh size in million elements; series: NUPDR 2, 4 and 8 PEs,
ONUPDR 2, 4 and 8 PEs.)

Figure 5.4: Execution times for PCDM and OPCDM for in-core problem sizes. (Plot of time
in seconds versus mesh size in million elements; series: PCDM 8 and 16 PEs, OPCDM 8
and 16 PEs.)

Figure 5.5: Execution times for OUPDR for out-of-core problem sizes. (Plot of time in
seconds versus mesh size in million elements; series: OUPDR 8 PEs, OUPDR 16 PEs.)

Figure 5.6: Execution times for ONUPDR for out-of-core problem sizes. (Plot of time in
seconds versus mesh size in million elements; series: ONUPDR 2, 4 and 8 PEs.)

Figure 5.7: Execution times for OPCDM for out-of-core problem sizes. (Plot of time in
seconds versus mesh size in million elements; series: OPCDM 8 PEs, OPCDM 16 PEs.)


Table 5.15: Single PE performance of the NUPDR and ONUPDR methods.

Size,     Time (sec)          Speed (x 10^3 /sec)
x 10^6    NUPDR   ONUPDR      NUPDR   ONUPDR
8         17      20          119     100
9         21      27          114     89
12        24      33          124     90
16        35      46          115     86
29        n/a     157         n/a     46
46        n/a     322         n/a     36
74        n/a     589         n/a     31
118       n/a     1016        n/a     29
188       n/a     1638        n/a     29
301       n/a     2702        n/a     28

Table 5.16: Single PE performance of the PCDM and OPCDM methods.

Size,     PEs    Time (sec)          Speed (x 10^3 /sec)
x 10^6           PCDM    OPCDM       PCDM    OPCDM
30        4      308     73          24      26
59        8      296     101         25      37
122       16     319     163         24      47
238       32     310     425         24      35
366       48     327     707         23      32
480       64     304     918         25      33
706       96     324     1408        23      31
963       128    299     1772        25      34
1074      n/a    n/a     1986        n/a     34
1235      n/a    n/a     2256        n/a     34
1480      n/a    n/a     2614        n/a     35
1662      n/a    n/a     2900        n/a     36
1864      n/a    n/a     3285        n/a     35

Rather than compare the actual speeds in those tables, we want to see the trend as the
problem size increases. We can see that the original applications as well as the MRTS
implementations maintain more or less constant speed. This means that as we increase
the problem size, the MRTS is able to sustain the performance level. Additionally, for
the original applications this means they scale rather well [13, 15, 17, 18].
Table 5.17: Overlap of computation, communication and out-of-core disk I/O in the OUPDR.

Size,    Time    Comp   Comm   Disk   Overlap (%)
x 10^6   (sec)   (%)    (%)    (%)    min   max   avg
24       46      88     18     0      1     7     6
59       102     85     16     0      0     2     1
109      176     86     21     0      2     8     7
175      368     65     15     36     4     19    16
255      576     61     12     51     8     29    24
353      802     58     11     61     6     35    30
471      1133    57     13     64     11    38    33
588      1386    55     13     70     5     46    38
739      1745    54     14     73     5     48    41
874      2111    51     18     73     6     54    42
1284     3122    52     18     76     5     57    46
1967     4599    53     16     82     20    63    50

Table 5.18: Overlap of computation, synchronization and out-of-core disk I/O in the ONUPDR.

Size,    Time    Comp      Sync      Disk      Overlap (%)
x 10^6   (sec)   avg (%)   avg (%)   avg (%)   min   max   avg
8        20      98        2         0         0     0     0
9        27      99        1         0         0     0     0
12       33      98        2         0         0     0     0
16       46      98        2         0         0     0     0
29       157     51        1         81        5     38    33
46       322     40        1         103       7     52    43
74       589     36        1         112       7     56    48
118      1016    35        1         116       17    58    52
188      1638    32        1         123       18    64    56
301      2702    33        0         124       17    64    58

Tables 5.17, 5.18 and 5.19 demonstrate the out-of-core performance of the MRTS
applications.

Table 5.19: Overlap of computation, communication and out-of-core disk I/O in the OPCDM.

Size,    Time    Comp      Comm      Disk      Overlap (%)
x 10^6   (sec)   avg (%)   avg (%)   avg (%)   min   max   avg
30       73      49        53        0         0     2     2
59       101     64        36        0         0     0     0
122      163     94        12        0         2     7     5
238      425     66        7         50        4     27    23
366      707     62        5         64        8     36    30
480      918     60        4         72        6     43    36
706      1408    61        3         76        10    50    40
963      1772    57        3         87        6     56    47
1074     1986    58        3         88        8     63    49
1235     2256    59        3         91        9     65    53
1480     2614    58        3         95        14    67    57
1662     2900    59        4         98        10    73    60
1864     3285    60        4         97        7     74    62

These tables show computation, communication (or synchronization for the ONUPDR) and
disk I/O as a percentage of total execution time. The last three columns show the
overlap of computation, communication/synchronization and disk I/O, which we compute as

    Overlap = (Comp + Comm + Disk - Total) / Total x 100%,

where Comp is the computation time, Comm is the communication/synchronization time,
Disk is the disk I/O time and Total is the total execution time. The MRTS is designed
to promote overlapping of communication and I/O, and our data show we have been very
successful at it. The overlap is over 50% for large problems and can be as high as 62%.
This means the MRTS is capable of tolerating high latencies rather well and can
accommodate data-intensive applications.
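To make the metric concrete, the sketch below computes the overlap from per-phase timings. This is a minimal illustration: the struct and function names are ours and are not part of the MRTS interface.

    #include <iostream>

    // Hypothetical per-run timing record; the field names are illustrative only.
    struct PhaseTimes {
        double comp;   // computation time (sec)
        double comm;   // communication or synchronization time (sec)
        double disk;   // out-of-core disk I/O time (sec)
        double total;  // total wall-clock execution time (sec)
    };

    // Overlap = (Comp + Comm + Disk - Total) / Total x 100%
    double overlap_percent(const PhaseTimes& t) {
        return (t.comp + t.comm + t.disk - t.total) / t.total * 100.0;
    }

    int main() {
        // Example: the 1864M-element OPCDM run from Table 5.19, reconstructing
        // absolute times from the reported (rounded) percentages.
        PhaseTimes t{0.60 * 3285, 0.04 * 3285, 0.97 * 3285, 3285.0};
        std::cout << overlap_percent(t) << "%\n";  // ~61%; the table lists 62%
        return 0;
    }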
The MRTS supports either the GCD or the TBB multi-threading library for shared-memory
computing. Since GCD availability on non-Apple systems is still very limited, we had to
use an older system running an experimental version of FreeBSD: a Dell PowerEdge 6600
with four Intel Xeon MP 1.47 GHz processors and 16 GB of memory.
Table 5.20: Comparison of the performance of the computing layer implementations.

Size,      Threading Building Blocks       Grand Central Dispatch
x 10^6     T1 (sec)   T4 (sec)   Spdup     T1 (sec)   T4 (sec)   Spdup
7.97       49.20      24.94      1.97      46.29      27.54      1.68
9.49       60.98      31.88      1.91      61.89      34.05      1.82
11.98      70.38      32.93      2.14      71.17      37.84      1.88
16.04      114.59     56.66      2.02      115.31     60.11      1.92

Table 5.20 shows the sequential time (T1), the parallel time with 4 PEs (T4) and the
relative speedup (Spdup) for the ONUPDR with the TBB and GCD implementations of the
computing layer. Size is the number of elements in the resulting mesh; a pipe
cross-section geometry was used for all experiments. The speedup is comparable to that
of the NUPDR. We can see that the GCD implementation is slightly slower, yet both
implementations show similar trends.
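As an illustration of the kind of data-parallel work the computing layer abstracts over, the sketch below expresses the same loop with TBB and, in a comment, with GCD. The function and variable names are ours; this is not the actual MRTS computing-layer API.

    #include <cstddef>
    #include <vector>
    #include <tbb/parallel_for.h>

    // Stand-in for one unit of refinement work; the real computing-layer
    // interface of the MRTS is not reproduced here.
    static void refine_subdomain(std::size_t i, std::vector<double>& work) {
        work[i] *= 2.0;  // placeholder computation
    }

    int main() {
        std::vector<double> work(1024, 1.0);

        // TBB version: one fine-grain task per subdomain index.
        tbb::parallel_for(std::size_t(0), work.size(), [&](std::size_t i) {
            refine_subdomain(i, work);
        });

        // GCD version (on systems with blocks support), for comparison:
        //   dispatch_apply(work.size(),
        //                  dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
        //                  ^(size_t i) { refine_subdomain(i, work); });
        return 0;
    }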


Chapter 6

Conclusion and Future work
This thesis aims to provide an approach for effective computing of large irregular
scientific problems such as unstructured mesh generation. The main contributions are the
design, implementation and evaluation of the Multi-layered Run-Time System, a practical
parallel out-of-core runtime system which enables out-of-core computing for both new and
existing applications at a small cost in performance and labor.
We followed an evolutionary approach to out-of-core computing. First, we designed
and implemented several custom out-of-core codes based on an existing state-of-the-art
in-core algorithm, PDR. We achieved good performance with low overhead (as low as
19% for our best method) and were able to demonstrate the effectiveness of the out-of-core
approach. That is, using our custom out-of-core codes we were able to solve larger
problems than otherwise possible, or to solve problems of the same size (compared to the
in-core method) using significantly fewer PEs. Compared to the in-core method, this
potentially allows a shorter wall-clock time (the time between when the user submits the
application and when the results come back) on shared computing resources. In fact, in
the case of the SciClone cluster the wall-clock time of our best out-of-core method can
be as low as one third of the wall-clock time of the in-core method.
Despite the good performance, developing custom out-of-core codes from scratch is a
time-consuming and labor-intensive task. Additionally, restructuring of the in-core
algorithms is often required, which means the same out-of-core solution cannot be easily
applied to a different algorithm. To counter this we focused on a runtime system rather
than a single application. Our next step was to design and implement out-of-core support
for an application based on the PREMA framework. If successful, this support can be
reused for any application built on top of PREMA. We designed and implemented an
out-of-core version of the PCDM method. Focusing on the framework rather than a specific
application simplified the porting process and still produced acceptable results in terms
of performance. The out-of-core PCDM adds little overhead compared to the PCDM and
shows high overlap (up to 68%) of computation, communication and disk I/O.
Finally, we used our experience to design and implement the MRTS, which enables
out-of-core computing automatically for any application that is built on top of, or
ported to, this runtime system. The MRTS extends PREMA by adding "free" out-of-core
support and interfaces for fine-grain parallelism. We ported the existing in-core methods,
UPDR, PCDM and NUPDR, to demonstrate that the porting process is not overly complex. In
fact, the porting of the PCDM, which uses the PREMA programming model, was
straightforward.
We used traditional CoWs to evaluate our implementation using three parallel
unstructured mesh generation methods with a wide spectrum of memory access patterns and
communication/synchronization requirements to stress test the MRTS. In particular, the
NUPDR was used to test multi-threaded performance, the UPDR was used to test structured
communication with some synchronization, and the PCDM was used to test fully
asynchronous communication. Furthermore, each application exercised the out-of-core
subsystem. The performance of the MRTS-based codes was slightly worse than that of the
custom out-of-core codes (overlap of 50% on average and up to 61% for large problem
sizes). However, this is a small price to pay for a shorter and simpler development or
porting process.
The MRTS is implemented on top of established software libraries and standards like
TBB/GCD for multi-threading and ARMCI/MPI for both one- and two-sided message
passing. This permits incremental application development for multi-layered parallel
architectures. Moreover, it allows for an evolutionary approach to the migration of
complex applications (e.g., parallel mesh generation) from traditional parallel platforms
to emerging massively parallel platforms.
In the future, I plan to continue working on the MRTS and make it available on emerging
platforms, like the Blue Waters supercomputer. While there is no possibility (nor
necessity) for traditional out-of-core computing on Blue Waters, its highly hierarchical
architecture makes the principles used in the design and development of the MRTS and
out-of-core applications relevant. Since many applications can benefit from the large
memory of Blue Waters but cannot take advantage of its high level of concurrency, one
possible scenario is to partition resources into compute nodes and memory nodes and use
the MRTS for managing the data flow between them.
GPU computing is gaining popularity among scientific applications and is conceptually
similar to traditional out-of-core computing, with system memory taking the place of the
disk. I see potential uses of the MRTS in this area in the near future and plan to add an
implementation of the computing layer using OpenCL for compatibility with both high-count
multi-core processors and GPUs.
To summarize, we presented an approach for effective computing of large irregular
scientific problems such as unstructured mesh generation. We showed that out-of-core
computing allows solving larger problems than otherwise possible, as well as getting the
results faster on shared computing resources. We designed, implemented and evaluated the
MRTS, which permits out-of-core computing for many applications by simply porting an
existing application or developing a new one for the MRTS. While porting and development
are greatly simplified, performance is not sacrificed.


Bibliography
[1] ALOK AGGARWAL AND JEFFREY S. VITTER. The input/output complexity of
sorting and related problems. Commun. ACM, 31(9):1116-1127, 1988.

[2] CHRISTOS D. ANTONOPOULOS, FILIP BLAGOJEVIC, ANDREY N. CHERNIKOV,
NIKOS P. CHRISOCHOIDES, AND DIMITRIS S. NIKOLOPOULOS. A multigrain Delaunay
mesh generation method for multicore SMT-based architectures. Journal of Parallel
and Distributed Computing, 69:589-600, 2009.

[3] APPLE INC. Grand Central Dispatch (GCD) Reference, 2009.

[4] ESHRAT ARJOMANDI, WILLIAM G. O'FARRELL, IVAN KALAS, GITA KOBLENTS,
FRANK CH. EIGLER, AND GUANG R. GAO. ABC++: Concurrency by inheritance
in C++. IBM Systems Journal, 34(1):120-137, 1995.

[5] KEVIN BARKER, ANDREY CHERNIKOV, NIKOS CHRISOCHOIDES, AND KESHAV
PINGALI. A load balancing framework for adaptive and asynchronous applications.
IEEE Transactions on Parallel and Distributed Systems, 15(2):183-192, February
2004.

[6] KEVIN BARKER, ANDREY CHERNIKOV, NIKOS CHRISOCHOIDES, AND KESHAV
PINGALI. A load balancing framework for adaptive and asynchronous applications.
IEEE Transactions on Parallel and Distributed Systems, 15(2):183-192, February
2004.

[7] KEVIN BARKER AND NIKOS CHRISOCHOIDES. Practical performance model for
optimizing dynamic load balancing of adaptive applications. In Proceedings of the 19th
IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) -
Papers - Volume 01, IPDPS '05, pages 28.2-, Washington, DC, USA, 2005. IEEE
Computer Society.

[8] ADRIAN BOWYER. Computing Dirichlet tessellations. Computer Journal, 24:162-166,
1981.

[9] CHIALIN CHANG, JOEL SALTZ, AND ALAN SUSSMAN. Chaos++: A runtime library
for supporting distributed dynamic data structures. Technical report, Gregory V.
Wilson, Editor, Parallel Programming Using C++, 1995.

[10] JEFFREY S. CHASE, FRANZ G. AMADOR, EDWARD D. LAZOWSKA, HENRY M.
LEVY, AND RICHARD J. LITTLEFIELD. The Amber system: Parallel programming
on a network of multiprocessors. In Proceedings of the 12th ACM Symposium on
Operating Systems Principles, pages 147-158, Litchfield Park, AZ, USA, 1989.
[11] ANDREY CHERNIKOV AND NIKOS CHRISOCHOIDES. Three-dimensional semi-generalized
point placement method for Delaunay mesh refinement. In 16th International Meshing
Roundtable, pages 25-44, Seattle, WA, October 2007.

[12] ANDREY CHERNIKOV AND NIKOS CHRISOCHOIDES. Algorithm 872: Parallel 2D
constrained Delaunay mesh generation. ACM Transactions on Mathematical Software,
34:6-25, January 2008.

[13] ANDREY CHERNIKOV AND NIKOS CHRISOCHOIDES. Three-dimensional Delaunay
refinement for multi-core processors. In 22nd ACM International Conference on
Supercomputing, pages 214-224, Island of Kos, Greece, June 2008.

[14] ANDREY CHERNIKOV AND NIKOS CHRISOCHOIDES. Generalized two-dimensional
Delaunay mesh refinement. SIAM Journal on Scientific Computing, 31:3387-3403,
2009.

[15] ANDREY N. CHERNIKOV AND NIKOS P. CHRISOCHOIDES. Practical and efficient
point insertion scheduling method for parallel guaranteed quality Delaunay refinement.
In Proceedings of the 18th International Conference on Supercomputing, pages
48-57. ACM Press, 2004.

[16] ANDREY N. CHERNIKOV AND NIKOS P. CHRISOCHOIDES. Parallel 2D graded
guaranteed quality Delaunay mesh refinement. In Proceedings of the 14th International
Meshing Roundtable, pages 505-517. Springer, September 2005.

[17] ANDREY N. CHERNIKOV AND NIKOS P. CHRISOCHOIDES. Parallel guaranteed
quality Delaunay uniform mesh refinement. SIAM Journal on Scientific Computing,
28:1907-1926, 2006.

[18] ANDREY N. CHERNIKOV AND NIKOS P. CHRISOCHOIDES. Algorithm 872: Parallel
2D constrained Delaunay mesh generation. ACM Transactions on Mathematical
Software, 34(1):1-20, January 2008.

[19] JAEYOUNG CHOI AND J. J. DONGARRA. Scalable linear algebra software libraries
for distributed memory concurrent computers. In Proceedings of the 5th IEEE
Workshop on Future Trends of Distributed Computing Systems, FTDCS '95, pages
170-, Washington, DC, USA, 1995. IEEE Computer Society.

[20] EDUARDO F. D'AZEVEDO AND JACK DONGARRA. The design and implementation
of the parallel out-of-core ScaLAPACK LU, QR, and Cholesky factorization routines.
Concurrency - Practice and Experience, 12(15):1481-1493, 2000.

[21] FRANK DEHNE, WOLFGANG DITTRICH, AND DAVID HUTCHINSON. Efficient external
memory algorithms by simulating coarse-grained parallel algorithms. In Proceedings
of the 9th ACM Symposium on Parallel Algorithms and Architectures,
pages 106-115, 1997.

[22] JAMES DEMMEL, JACK DONGARRA, JEREMY DU CROZ, ANNE GREENBAUM,
SVEN HAMMARLING, AND DANNY SORENSEN. Prospectus for the development
of a linear algebra library for high-performance computers. Technical Report
ANL/MCS-TM-97, 9700 South Cass Avenue, Argonne, IL 60439-4801, USA, 1987.

[23] JACK J. DONGARRA, JEREMY DU CROZ, SVEN HAMMARLING, AND RICHARD J.
HANSON. An extended set of FORTRAN Basic Linear Algebra Subprograms. ACM
Transactions on Mathematical Software, 14(1):1-17, 1988.

[24] A. FEDOROV AND N. CHRISOCHOIDES. Location management in object-based
distributed computing. In Proceedings of the 2004 IEEE International Conference on
Cluster Computing, pages 299-308, Washington, DC, USA, 2004. IEEE Computer
Society.

[25] ANDRIY FEDOROV AND NIKOS CHRISOCHOIDES. Communication support for
dynamic load balancing of irregular adaptive applications. In 2004 International
Conference on Parallel Processing Workshops (ICPPW'04), pages 555-562, 2004.

[26] PAUL-LOUIS GEORGE AND HOUMAN BOROUCHAKI. Delaunay Triangulation and
Meshing. Application to Finite Elements. HERMES, 1998.

[27] INTEL CORPORATION. Intel(R) Threading Building Blocks Reference Manual, 2008.

[28] JAAKKO JARVI AND JOHN FREEMAN. Lambda functions for C++0x. In Proceedings
of the 2008 ACM Symposium on Applied Computing, SAC '08, pages 178-183, New
York, NY, USA, 2008. ACM.

[29] ERIC JUL, HENRY LEVY, NORMAN HUTCHINSON, AND ANDREW BLACK. Fine-grained
mobility in the Emerald system. ACM Trans. Comput. Syst., 6(1):109-133,
1988.

[30] LAXMIKANT V. KALE AND SANJEEV KRISHNAN. Charm++: a portable concurrent
object oriented system based on C++. SIGPLAN Not., 28(10):91-108, 1993.

[31] ANDRIY KOT, ANDREY CHERNIKOV, AND NIKOS CHRISOCHOIDES. Effective out-of-core
parallel Delaunay mesh refinement using off-the-shelf software. ACM Journal
of Experimental Algorithmics. In press.

[32] ANDRIY KOT, ANDREY CHERNIKOV, AND NIKOS CHRISOCHOIDES. Out-of-core
parallel Delaunay mesh generation. In IMACS World Congress Scientific Computation,
Applied Mathematics and Simulation, number 17, July 2005.

[33] ANDRIY KOT, ANDREY CHERNIKOV, AND NIKOS CHRISOCHOIDES. Parallel out-of-core
Delaunay refinement. In IEEE Intelligent Data Acquisition and Advanced
Computing Systems: Technology and Applications, pages 183-195, Sofia, Bulgaria,
September 2005.

[34] ANDRIY KOT, ANDREY CHERNIKOV, AND NIKOS CHRISOCHOIDES. Effective out-of-core
parallel Delaunay mesh refinement using off-the-shelf software. In Proceedings
of the 20th International Conference on Parallel and Distributed Processing,
IPDPS'06, pages 125-125, Washington, DC, USA, 2006. IEEE Computer Society.

[35] ANDRIY KOT, ANDREY CHERNIKOV, AND NIKOS CHRISOCHOIDES. The evaluation
of an effective out-of-core run-time system in the context of parallel mesh generation.
In IEEE International Parallel and Distributed Processing Symposium, 2011. To
appear.

[36] LEONIDAS LINARDAKIS AND NIKOS CHRISOCHOIDES. Graded Delaunay decoupling
method for parallel guaranteed quality planar mesh generation. SIAM Journal on
Scientific Computing, 30:1875-1891, March 2008.

[37] J. NIEPLOCHA, V. TIPPARAJU, M. KRISHNAN, AND D. K. PANDA. High performance
remote memory access communication: The ARMCI approach. International
Journal of High Performance Computing Applications, 20(2):233-253, Summer 2006.

[38] JAREK NIEPLOCHA, BRUCE PALMER, MANOJKUMAR KRISHNAN, HAROLD
TREASE, AND EDOARDO APRA. Advances, applications and performance of the
Global Arrays shared memory programming toolkit. Intern. J. High Perf. Comp.
Applications, 20, 2005.

[39] MARK H. NODINE AND JEFFREY SCOTT VITTER. Greed sort: optimal deterministic
sorting on parallel disks. J. ACM, 42(4):919-933, 1995.

[40] JOHN SALMON AND MICHAEL WARREN. Parallel out-of-core methods for N-body
simulation. In Proceedings of the Eighth SIAM Conference on Parallel Processing
for Scientific Computing, 1997.

[41] JONATHAN RICHARD SHEWCHUK. Delaunay Refinement Mesh Generation. PhD
thesis, Carnegie Mellon University, 1997.

[42] S. TOLEDO AND F. GUSTAVSON. The design and implementation of SOLAR, a
portable library for scalable out-of-core linear algebra computations. In 4th Annual
Workshop on I/O in Parallel and Distributed Systems, pages 28-40, 1996.

[43] TIANKAI TU AND DAVID R. O'HALLARON. A computational database system
for generating unstructured hexahedral meshes with billions of elements. In SC
'04: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, page 25,
Washington, DC, USA, 2004. IEEE Computer Society.

[44] TIANKAI TU AND DAVID R. O'HALLARON. Extracting hexahedral mesh structures
from balanced linear octrees. In Proceedings of the 13th International Meshing
Roundtable, pages 191-200, 2005.

[45] JEFFREY SCOTT VITTER AND MARK H. NODINE. Large-scale sorting in uniform
memory hierarchies. Journal of Parallel and Distributed Computing, 17:107-114,
1993.

[46] JEFFREY SCOTT VITTER AND ELIZABETH A. M. SHRIVER. Algorithms for parallel
memory II: Hierarchical multilevel memories. Algorithmica, 12:148-169, 1993.

[47] JEFFREY SCOTT VITTER AND ELIZABETH A. M. SHRIVER. Algorithms for parallel
memory I: Two-level memories. Algorithmica, 12:110-147, 1994.

[48] T. VON EICKEN, D. CULLER, S. GOLDSTEIN, AND K. SCHAUSER. Active messages:
A mechanism for integrated communication and computation. In Proceedings of the
19th Int. Symp. on Comp. Arch., pages 256-266. ACM Press, May 1992.

[49] THORSTEN VON EICKEN, DAVID E. CULLER, SETH COPEN GOLDSTEIN, AND
KLAUS ERIK SCHAUSER. Active messages: a mechanism for integrated communication
and computation. Volume 20, pages 256-266, New York, NY, USA, April
1992. ACM.

[50] DAVID F. WATSON. Computing the n-dimensional Delaunay tessellation with
application to Voronoi polytopes. Computer Journal, 24:167-172, 1981.

VITA

Andriy Kot

Andriy Kot was born in Ternopil, Ukraine in 1981. From 1997 through 2003 he studied
at the Ternopil Academy of National Economy in Ternopil, Ukraine, where he received
a Master's degree (2003) in Computer Engineering with "red diploma" distinction. His
Master's thesis was on information security systems, incorporating both software
and hardware components.
In the Fall of 2001 Andriy Kot started studying in the Department of Computer Science
at the College of William and Mary, where he received his Master's degree in Computer
Science. His Master's project was on out-of-core computing and parallel run-time
systems.
Since the Spring of 2004 he has been pursuing a Ph.D. degree in the Department of
Computer Science at the College of William and Mary, where he is currently a Ph.D.
candidate and a research assistant.


