A Virtual Tile Approach to Rastel-based Calculations of Large Digital Elevation Models in a Shared-Memory System by Yildirim, Ahmet Artu et al.
NOTICE 
 
 
This is the author’s version of a work that was accepted for publication in 
Computers & Geosciences. Changes resulting from the publishing process, 
such as peer review, editing, corrections, structural formatting, and other 
quality control mechanisms may not be reflected in this document. Changes 
may have been made to this work since it was submitted for publication. This 
document is licensed under Creative Commons Attribution-Non-Commercial-
NoDerivatives 4.0 International (CC-BY-NC-ND license). For more 
information about the license, please visit: 
https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode  
 
The definitive published version is:       
 
Yıldırım A.A., Watson D., Tarboton, D.G., Wallace, R.M., A virtual tile 
approach to raster-based calculations of large digital elevation models in a 
shared-memory system, Computers & Geosciences, Volume 82, September 
2015, Pages 78-88, ISSN 0098-3004, 
http://dx.doi.org/10.1016/j.cageo.2015.05.014. 
 
 
Corresponding Author: 
Ahmet Artu Yıldırım 
E-mail: ahmetartu@aggiemail.usu.edu 
 
A Virtual Tile Approach to Raster-based Calculations of Large Digital Elevation
Models in a Shared-Memory System
Ahmet Artu Yıldırıma, ,˚ Dan Watsona, David Tarbotonb, Robert M. Wallacec
aDepartment of Computer Science, Utah State University, Logan, Utah, USA
bUtah Water Research Laboratory, Utah State University, Logan, Utah, USA
cUS Army Engineer Research and Development Center, Information Technology Lab, Vicksburg, Mississippi, USA
Abstract
Grid digital elevation models (DEMs) are commonly used in hydrology to derive information related to topographically
driven flow. Advances in technology for creating DEMs has increased their resolution and data size with the result that
algorithms for processing them are frequently memory limited. This paper presents a new approach to the management
of memory in the parallel solution of hydrologic terrain processing using a user-level virtual memory system for shared-
memory multithreaded systems. The method includes tailored virtual memory management of raster-based calculations
for data sets that are larger than available memory and a novel order-of-calculations approach to parallel hydrologic
terrain analysis applications. The method is illustrated for the pit filling algorithm used first in most hydrologic terrain
analysis workflows.
Keywords: Multithreaded parallel digital elevation model analysis, pit filling algorithm, user-level virtual memory
management
1. Introduction
Grid digital elevation models (DEMs) are commonly
used in hydrology to derive information related to topo-
graphically driven flow [1, 2, 3]. When dealing with large
digital elevation model (DEM), datasets, computational5
efficiency and memory capacity become important consid-
erations. Prior work in TauDEM [4] has advanced paral-
lel methods for terrain processing using a message pass-
ing (MPI) approach that allows memory to be distributed
across multiple processors in medium-sized cluster com-10
puters [5, 6, 7].
In desktop computers, virtual memory systems are the
standard approach to working with data that is too big to
fit in memory. Operating systems typically implement vir-
tual memory using page files that hold on disk contents of15
memory. However repeated swapping (thrashing) occurs
when these get large because the system has limited gen-
eral capability to anticipate the pages needed next. Most
operating systems implement the virtual machine in kernel
that makes it difficult, sometimes impossible, to change its20
functionality and page replacement policy. This necessities
the implementation of a user-level tailored virtual memory
system to handle a programs’ locality better for fine-grain
control [8].
˚Corresponding author
Email addresses: ahmetartu@aggiemail.usu.edu
(Ahmet Artu Yıldırım ), dan.watson@usu.edu (Dan Watson),
dtarb@usu.edu (David Tarboton),
robert.m.wallace@usace.army.mil (Robert M. Wallace)
There is a need to process large DEMs on desktop com-25
puters that are often limited in total memory. This is the
primary problem addressed in this work. It is also desir-
able to have parallel terrain analysis algorithms that use
multi-threading to take advantage of common multi-core
processors [9] for greater efficiency. This is a secondary30
consideration in this work.
This paper presents a new approach to the manage-
ment of memory and the parallel solution of the raster-
based computations for shared-memory multithreaded sys-
tems, such as desktop computers. The contributions of35
this method in the context of parallelism are a tailored
user-level tile based virtual memory manager for raster-
based calculations for data sets that are larger than avail-
able memory and a novel order-of-calculations approach
to parallel hydrology analysis applications.40
We implemented a modified version of the Planchon
and Darboux pit filling algorithm [10] as implemented in
[4, 6] as an application of our tiled virtual memory man-
ager and evaluated its effectiveness for pit removal in DEMs
of varying size, with a varying number of operating system45
threads and memory capacity. The results demonstrate
several benefits over the use of standard virtual memory
approaches.
Furthermore, this study examines a load-balancing tech-
nique to minimize the idle times which might occur in50
case of uneven load between compute threads due to data
variability. We used the GDAL library [11] to enable a
wide range of raster file formats within the implemented
memory manager. We implemented our memory man-
ager using the Microsoft Windows 7 operating system as55
it is widely used for desktop Geographic Information Sys-
tem terrain analysis and enabling the processing of large
DEMs on Windows systems was a goal of this work. The
source code of the virtual memory project and the pit-
filling algorithm can be found at https://bitbucket.60
org/ahmetartu/hydrovtmm.
The paper is organized as follows: Section 2 provides
background and literature review. Section 3 gives specifics
of the modified Planchon and Darboux Algorithm used
here. Section 4 describes the design of the multi-threaded65
tiled virtual memory manager for raster-based calculations
that is contributed here. Section 5 gives performance re-
sults. Finally, we discuss conclusions based on the ob-
tained results in Section 6.
2. Background70
This review addresses three subjects that are needed
to set the context for this work. First we review existing
DEM pit filling approaches focusing on the Planchon and
Darboux Algorithm [10] modified and used in this work.
We then examine other efforts that use parallel methods75
for hydrologic terrain analysis and general methods for vir-
tual memory management to enable the processing of large
data sets that do not fit in physically available computer
memory.
Pits, defined as grid cells or sets of grid cells completely80
surrounded by higher grid cells often occur during DEM
production and are generally considered to be artifacts of
the data collection process [12]. Drainage conditioning to
remove pits is an important preprocessing step in the hy-
drologic analysis of DEMs, and is representative of a broad85
class of raster-based algorithms (e.g. [13, 1]) designed to
determine topographically driven flow. Once drainage con-
ditioning has been performed, a DEM that has no pits is
referred to as being hydrologically conditioned. The most
common approach to drainage conditioning is pit filling,90
whereby pit grid cells are raised to a level where they are
at a minimum equal to the lowest surrounding grid cell
and can drain. A well-known effective pit filling algorithm
is described by Planchon and Darboux [10]. Pit removal
using the Planchon and Darboux (PD) algorithm is one95
of the multiple hydrologic terrain analysis functions in the
TauDEM package. Time-complexity of the direct imple-
mentation of the PD algorithm is reported to grow with the
number of cells N in DEM as OpN1.5q [10]. Planchon et
al. also provided an improved implementation that embed-100
ded a recursive dry upward cell function that was reported
to achieve a time-complexity of OpN1.2q [10]. Alternative
pit filling algorithms, some claiming better efficiency have
been presented by others [14, 15, 16].
The Planchon and Darboux (PD) approach fills pits by105
covering the whole surface with a layer of “water” up to
a level greater than or equal to the highest point in the
DEM, then removes the excess water in an iterative man-
ner. Doing so, the algorithm naturally leaves the water in
the depressions at the height of their outlet. Let Z P R2110
be the set of input elevation points (i.e., the input DEM
with size m, where each member is an elevation point xi,
1 ď i ď m) and let W P R2 be the output DEM consisting
of “filled” elevation points yi. The goal of the PD algo-
rithm is to increase each elevation point xi with a minimal115
difference of elevation. The PD algorithm initializes all
grid cells to a large value (greater than the highest point
in the domain). It then uses iterative scans across the
domain to lower elevation values to the lowest elevation
greater than or equal to the original terrain hydrologically120
conditioned so that they drain to one of their neighbors.
Wallis et al. [4, 6] implemented a parallel version of
the PD algorithm. This adopted a distributed memory
domain partitioning approach to parallelism and divided
the domain into horizontal stripes, one stripe per paral-125
lel process. The PD algorithm was applied separately to
each stripe in parallel, with a step to exchange informa-
tion across stripe boundaries at the end of each iteration
so as to ensure convergence to the same global solution
as obtained by a serial implementation. The original PD130
algorithm and the Wallis et al. parallel implementation
visit each grid cell on each iteration. Scans of the grid
cycle through all eight possible combinations of row and
column scan orders. PD also offered an improved imple-
mentation that used a recursive dry upward tree search135
each time a cell was set to the original elevation to en-
hance efficiency. However each iterative pass across the
DEM still examined each grid cell.
Subsequent to the Wallis et al. [6] work, the TauDEM
team [4] identified the visiting of each grid cell on each it-140
eration as an inefficiency and developed a stack based ap-
proach whereby unresolved grid cells are placed on a stack
on the first scan, then removed from the first stack on each
subsequent scan and placed on a second stack. Stacks are
then switched. This limits the scanning to two directions145
rather than eight, but was found to result in a speedup
of a factor of 2 for small datasets and 4.3 for a modestly
large 1.5 GB dataset in comparison to the eight combi-
nation full grid scanning. The benefits of the stack thus
seem to outweigh the inefficiency of fewer scan directions.150
The TauDEM team did not evaluate the recursive dry up-
ward approach of the improved PD algorithm. Recursive
methods use the system stack to expand system memory,
posing a challenge for memory management in large data
computations. They also pose a challenge for a domain155
partitioned parallel approach as cross partition calls are
less predictable. They are also hard to implement on a
stack as they would require additional code to track the
stack position of each grid cell and to handle changing the
order in which grid cells on the stack are processed. The160
two direction stack based modified PD algorithm was in-
corporated into the publicly released version of TauDEM
[4] that was the starting point for this work. The focus
of this paper is on virtual memory management for large
DEM data, and the modified PD algorithm as used by165
TauDEM [4, 6] is used as an example to illustrate a gen-
2
eral approach.
Beyond pit filling several parallel computing technolo-
gies have been employed in the implementation of the hy-
drologic algorithms to achieve higher performance by effec-170
tively utilizing available processors in the system includ-
ing MPI for distributed memory architectures [17, 5, 6],
OpenMP [18] or low-level standard threading library for
multi-threaded shared memory architectures, and NVIDIA
CUDA for general-purpose computing on graphics process-175
ing units (GPUs) [19, 20]. Xu et al. [18] present a grid-
associated algorithm to improve the performance of a D8
algorithm [21] via OpenMP technology which is a thread-
level standard of parallel programming. CUDA algorithms
for drainage network determination are presented by Or-180
tega et al. [20], achieving up to 8x speed-up improvement
with respect to corresponding CPU implementation. Do et
al. introduced a parallel algorithm to compute the global
flow accumulation in a DEM using MPI [17] where hi-
erarchical catchment basins are computed by means of a185
parallel spanning tree algorithm to model the flow of wa-
ter. Each processor computes a local flow direction graph
using the D8 flow routing model.
There have been a number of general approaches to
better manage virtual memory. The Tempest [22] inter-190
face was proposed for customizing the memory policies
of a given application on parallel computers. Translation
lookaside buffer (TLB) is a limited memory cache that is
utilized for efficient memory address translation. However,
TLB misses are also common in memory-hungry applica-195
tions such as databases and in-memory caches. To elim-
inate these, Basu et. al [23] proposed mapping part of a
process’ linear virtual address space with a direct segment.
Huang et. al [24] introduced a power-aware virtual mem-
ory system to reduce the energy consumed by the memory200
for data-centric applications.
Software programs may implement their own level of
memory management to overcome the inefficiencies of system-
level virtual memory. For example, ArcGIS reportedly
uses quad-tree tiling to organize on disk file storage in205
a way that facilitated better virtual memory management
[25]. However full details of their implementation are not
available.
3. Modified Planchon and Darboux Algorithm
We employed a modified version of the PD algorithm,210
as used in TauDEM [4], in this study. The PD algorithm
has two phases, initialization (Algorithm 1) and filling
(Algorithm 2). Initialization starts by allocating memory
space for elevation data, W . Then, if the cell is located
at the edge, the algorithm assigns the value of cell in Z215
to each cell of W with the same cell location, otherwise
a maximum number is assigned. Here edge is defined to
include cells at the edge of the DEM or internal cells adja-
cent to cells with no data values, or cells marked as inter-
nally draining and not to be filled in a separate input file.220
Thus, it is assumed that water may drain from the edges
of the input DEM or into internally draining or no data
cells (because data may not fit neatly into the rectangular
DEM domain). We employ stacks to hold indexes of the
cells not yet solved. The indexes of the cells having excess225
“water” in W are pushed onto the first stack, (S1), which
is used later in the filling phase. An upper bound on the
number of elements the stack needs to hold is the number
of internal (non-edge) grid cells in the DEM.
Algorithm 1 Initialisation Phase of Stack-based Plan-
chon and Darboux Algorithm
Require: Input DEM Z
Ensure: Elevation data W , stack S1
1: procedure initializePlanchon(Z)
2: for Cell number i Ð 1 to |Z| do
3: if ci P Z is on the edge then
4: W piq Ð Zpiq
5: else
6: W piq Ð `8
7: S1 Ð push(S1, i);
8: end if
9: end for
10: return W , S1
11: end procedure
In the filling phase (Algorithm 2), the cell values of W230
are decreased in an iterative manner until no cell value
is changed. The procedure pulls an unresolved grid cell
from stack, S1. The operation of minN8pW piqq denotes
finding the minimum value among 8 neighbours of the cell
W piq. In Line 6, the algorithm checks whether the value235
of Zpiq is greater than or equal to the minimum elevation
of its 8 neighbours plus a small parameter ϵ to enforce
strictly positive slopes. This parameter may be 0, and
is defaulted to 0 where minimally altered hydrologically
conditioned surfaces are desired. If “true”, this implies240
that the cell drains, so the value of Zpiq is assigned to the
cell W piq. Otherwise the value Nmin` ϵ is assigned to the
cell, lowering its value to the elevation at which it drains
(line 10). Cells that are set to their original elevation value
(line 7) do not need to be revisited so are not placed on245
the stack S2. Cells lowered to a neighbor value (plus ϵ)
may need to be revisited in a later iteration so are pushed
to the stack S2. The stacks decrease in size progressively
as cells are processed. After each iteration, in Line 16, the
indexes pointing to the stacks are swapped.250
The original PD algorithm did not employ a stack as
described above. Rather at each iteration it scanned the
entire DEM examining for each cell whether W piq ą Zpiq,
and if true examined neighbor elevations and applied the
excess water removal logic.255
4. Virtual tile approach to the parallel raster-based
algorithm
We implemented a parallel modified PD algorithm for a
multi-threaded shared memory system with limited mem-
3
Algorithm 2 Filling Phase of Planchon and Darboux Al-
gorithm
Require: Input DEM Z, elevation dataW , stacks S1 and
S2, ϵ-descending path parameter, i denotes to the vis-
ited cell index, minN8pW piqq denotes to minimum el-
evation of the neighbouring cells of cell i
Ensure: Elevation data W
1: procedure fillPlanchon(Z, W , S1, S2, ϵ)
2: do
3: for all cell number i in S1 do
4: if W piq ą Zpiq then
5: Nmin Ð minN8pW piqq
6: if Zpiq ě Nmin ` ϵ then
7: W piq Ð Zpiq
8: else
9: if W piq ą Nmin ` ϵ then
10: W piq Ð Nmin ` ϵ
11: end if
12: S2 Ð pushpS2, iq
13: end if
14: end if
15: end for
16: call swapStacks(S1, S2)
17: while any cell c P W is changed
18: return W
19: end procedure
ory to process large DEMs. The input DEM is divided260
into a number of generally square qˆ q tiles where q is tile
size that can be considered a generalized domain decom-
position approach when compared to the TauDEM stripe
approach. Tiles may be rectangular along the edges to ac-
commodate domain sizes that are not multiples of q. Each265
tile is processed individually by one thread. Tiles are dis-
tributed among the threads with respect to number of cells
in the stacks using a load balancing mechanism. In order
to process big DEMs, which might be larger than available
physical memory, we adopt a tile-based user-level virtual270
memory management system. The memory system swaps
out tiles and their related data to hard-disk to free the
memory space automatically when the predefined mem-
ory limit is reached. Tiles are chosen for swapping out by
the memory manager according to a least recently used275
rule.
In our algorithm, we use a single input/output (I/O)
thread and multiple compute threads. Compute threads
are responsible for the execution of the PD algorithm while
the I/O thread is used to avoid performance bottlenecks280
due to overlapping disk I/O and compute operations. The
I/O thread services all compute threads. The main thread
is regarded as a compute thread and is responsible for
spawning other compute threads and the I/O thread. Mem-
ory address space is shared among all threads, alleviating285
the overhead associated with interprocess communication
in shared memory systems.
Figure 1: Partition of the DEM into tiles with one-cell border. Grid
cells contain elevation values with nondraining cells (pits) in the tile
detail depicted in gray.
5 5 4 5 5 5 5 6 5 6 6
6 6 5 6 6 5 6 7 4 4    6
7 7 6 7 7 4 7 8 4 3 6
9 9 8 9 9 4 9 7 3 9 10
11 11 10 11 11 11 11 9 11 11 10
12 12 8 12 12 12 12 10 12 9 9
13 12 7 12 13 13 9 11 9 13 14
14 7 6 11 14 14 14 9 14 14 14
15 7 7 8 9 15 15 13 15 15 16
15 8 8 8 7 16 16 14 16 16 15
15 11 9 11 11 17 17 6 17 17 16
Tile Width
Tile Height
We adopt data parallelism among compute threads where
each thread works on one tile at a time and executes the
same algorithm (Algorithm 3). Each tile is a rectangular290
subset of the DEM that consists of primary data and a
one-cell border that overlaps into the primary data from
adjacent tiles as illustrated in Figure 1, that is used to fa-
cilitate transfer of information between tiles. The primary
data of all tiles completely covers the domain without over-295
lap. A tile page, Tk, has the corresponding subset of eleva-
tion data, Wk, input DEM data, Zk and two stacks. Tile
pages are indexed by tile page number k that define the
page uniquely in the address space. The initializePlanchon
and fillPlanchon functions change values only within the300
primary data, but refer to values from adjacent grid cells
using the edge data held locally as part of each tile page.
After the initialization pass (Lines 3-5) and each filling
pass (Lines 9-13) the resulting elevation values are hydro-
logically conditioned within the local context of each tile,305
but may need to be modified as each tile is juxtaposed
with its surrounding tiles. Hence, local elevation data on
the edges must be updated from surrounding tiles. Neigh-
boring threads exchange the border of the tiles. Thus,
we implement a barrier to synchronize all threads at this310
point to make sure all the threads are at the same position.
Barriers are shown in Algorithm 3. Although barriers are
desired to be avoided to fully exploit the parallelism in al-
gorithm design, in order to circumvent race conditions it
is required. Then the exchangeBorders function is called315
for each tile by each owner thread. This accesses the data
from each bordering tile and updates its edge data with
the corresponding primary data from the adjacent tile.
Remaining cells to be processed at the next iteration
can be determined by the number of cells in the stack.320
This determines the require processing time for each tile.
After locally filling the pits of the tiles, imbalance might
occur in which some threads finish the local filling before
other threads. To avoid this inefficiency, we implement
load balancing functionality that distributes the tiles to325
the threads evenly based on the number of cells in the
stacks. Before distributing the tiles, all threads must wait
4
Algorithm 3 Multithreaded PD algorithm
Require: Input DEM Z, tile size TS, memory limit ML,
number of compute threads NC, each with start in-
dex, SIi and end index, EIi, referencing the tiles it
processes.
Ensure: Output DEM W (final elevation data)
1: procedure parallelplanchon(Z, TS, ML, NC)
2: for all Compute Threads i in parallel do
3: for kÐ SIi, EIi do
4: call initializePlanchon(Tk)
5: end for
6: call exchangeBorders() with barrier()
7: call performLoadBalancing() with barrier()
8: do
9: for k Ð SIi, EIi do
10: if Tk is not finished then
11: call fillPlanchon(Tk)
12: end if
13: end for
14: call exchangeBorders() with barrier()
15: call performLoadBalancing() with barrier()
16: while any cell elevation in W is changed
17: end for
18: call barrier()
19: call writeOutputDEM()
20: end procedure
for the main thread that executes the loadBalancing func-
tion. loadBalancing examines the size of the unresolved
stack in each tile and adjusts the tiles assigned to each330
thread by setting SIi and EIi prior to the next iteration.
More discussion about the load balancing can be found in
Load balancing section.
A virtual memory management strategy (Figure 2) was
developed to support implementation of this algorithm in335
systems where the physical memory is limited. A memory
manager is responsible for performing swapping operations
as memory reaches its maximum capacity. In system-level
virtual memory, each program on the operating system has
its own address space that is consisting of memory chunks340
(pages) [26]. In our approach, each thread is provided a
set of tiles on the DEM file and its associated data whose
memory is managed by the virtual memory through a well-
defined software interface.
All memory management functionality is performed by345
the I/O thread, using the GDAL library [11]) so that pro-
cessing in the compute threads may proceed in parallel
with I/O swapping of tiles to disk. One of the reasons for
a single I/O thread is that GDAL did not support parallel
I/O. However even with a parallel I/O library on the PC350
platforms we are targeting, the parallel I/O capability is
limited and a single I/O thread avoids cross thread I/O
bottlenecks.
The main thread initializes the memory manager and
creates the tile page table. Tile page table is in shared355
Figure 2: Shared virtual memory management strategy
Main ThreadThread 0 Compute Thread 0 Main Thread
Thread 1
Thread n-1
Thread n
(1) Initialize memory 
 manager
(2) Create I/O Thread 
and Compute Threads
(3) Apply Algorithm 1 (4) Wait until all  Compute
Threads are done
(5) Finalize memory 
manager
(1) Request tile (2) Return tile
Compute Thread n-1
Compute Thread 1
I/O Thread
memory accessible to the compute threads and I/O thread.
The tile page table contains tile data structures mainly in-
cluding tile state, memory addresses of original and con-
ditioned DEM data and stack data. We implemented a
tailored stack data structure where the data are kept in360
memory in a contiguous fashion in order to serialize the
writing and reading of stack memory content efficiently.
The stack data structure contains a dynamically increas-
ing heap array (using the realloc utility), and size and
capacity variables.365
Figure 3: Tile distribution and sharing among the compute threads
in pit-remove algorithm. Tiles are numbered with the owner compute
thread number
Memory
Compute 
Thread 1
Memory
Compute 
Thread 2
Memory
Compute 
Thread N. . .
1
1
1
1
1
1
2
2
2
2
2
2
N-1
N-1
N
N
N
N
Tile-based Virtual Memory
3
3
Figure 3 illustrates how tiles are assigned to compute
threads. Note that the exchangeBorders function results
in tiles assigned to one thread being accessed by adjacent
threads. This is a non-locking read access so that con-
tention that may occur when a thread attempts to lock370
a memory block that is owned by another thread [27]
can be avoided. Data sharing in shared memory systems
is one source of performance degradation and should be
avoided where possible, thus the exchangeBorders func-
tion is called only at infrequent intervals.375
5
Each tile page has state, µk where µk “ tnotloaded,
present, swappedout, finishedu: notloaded, indicating that
the page has not yet been requested by any compute thread;
present, the page is resident in the memory; swappedout,
the page is swapped out to the disk; or finished, all eleva-380
tion points in the tile have been hydrologically conditioned
and no more pit-filling is required. The condition finished
is set on a tile when its stack size becomes 0. The algo-
rithm iterates until all the tiles have achieved the state
of finished, or until all stacks on all processes remain un-385
changed. The page table is illustrated in Figure 4(a) with
tile states illustrated in Figure 4(b).
A distinguishing property in our model, as compared
to most GIS software, e.g., ArgGIS, is that a tile page does
not consist only of raster data, but also contains algorithm-390
specific data such as the input, output and associated data
that comprise the memory space of the program.
A message queue is used for interprocess communica-
tion between compute threads and the I/O thread. The
I/O thread constantly retrieves the messages in the mes-395
sage loop in a First In, First Out (FIFO) manner. The
message passing between the compute threads and I/O
thread is performed using PostThreadMessage and PeekMes-
sage of Win32 functions. If a compute thread requests a
tile retrieval, it sends a message to the I/O thread and400
waits until the page is returned. If the page is present
in memory, the operation is completed without waiting.
Each tile has a time stamp variable to keep track of the
last request time. The time stamp is updated by the I/O
thread when requested by any compute thread. Using this405
time stamp, the memory manager swaps out the oldest
tiles according to the Least Recently Used (LRU) replace-
ment algorithm.
The granularity of the virtual memory manager is an
important parameter that affects the system’s performance.410
A small page/tile size can result in larger page tables,
which might cause the system to spend its CPU time
mostly in I/O operations. On the other hand, too large a
tile size can hinder the benefit of using memory manage-
ment and memory usage can exceed the memory limit by415
a factor of tile page size.
Thrashing is a well-known phenomenon encountered
when a process idles excessively waiting for the referenced
page to be loaded by the operating system. In this algo-
rithm, the effect of thrashing is alleviated by pre-fetching420
tiles using a pre-specified pattern based on knowledge of
the sequence in which the algorithm accesses tiles. The
prefetching technique has been applied for sequential pro-
grams in which predicting patterns of program execution
and data access in the near future improves the system ef-425
ficiency [28]. In the implemented virtual memory system,
when a tile is requested, the memory manager caches the
next α tiles (from the flist array) in this pattern asyn-
chronously. The goal of this prefetching technique is to in-
crease the hit rate of tiles found in present memory when430
requested. We define the hit rate as the number of tile
references whose memory present in RAM divided by the
total number of tile references.
4.1. API to access to the virtual memory manager
The user-level virtual memory is designed to be used as435
a general-purpose virtual memory for raster-based compu-
tations to execute on a single machine with limited mem-
ory through a well-defined interface. The raster-based al-
gorithm accesses the VM through a well-defined interface
so as to enable the potential for application with other440
raster-based algorithms. The API is explained below.
• initializeMemoryManagerpSq: The function accepts
array S that contains the setting values of the vir-
tual memory manager including maximum allowable
memory, and the DEM file accessed to determine445
extent and tile sizes. Using the parameter S, the
virtual memory is initialized and memory allocated
for resources such as the messaging system and the
data structures used by the main thread.
• getT ileptid, f listrsq: A compute thread requests the450
tile with uniquely-defined tile id tid from VM. A tile
can be owned by only one compute thread for writ-
ing at a time to avoid race condition error. During
the tile-retrieval process with the getTile function,
the tile’s reference count is incremented by 1. The455
virtual memory manager prevents access to a shared
resource from other threads using the EnterCritical-
Section and LeaveCriticalSection Win32 functions.
As an advantage with respect to most operating sys-
tem level virtual memory systems, the tiles to be460
prefetched are determined by the algorithm from the
flist array containing tile ids. Prefetching occurs
in parallel. The next tile on the list is prefetched
by the virtual memory manager. After loading each
tile, the virtual memory manager checks the mem-465
ory limit and if this has been exceeded swaps out the
least-recently accessed tile not in use. This strategy
is a heuristic attempt to ensure that the next tiles
needed by the compute threads will be in memory
when they are requested.470
• unlockT ileptidq: After the compute thread is fin-
ished with the tile, the tile is explicitly released by
the owner thread. The virtual memory manager
decrements the tile’s reference count where the tile
is requested with getT ile function. When the tile’s475
reference count reaches 0, the tile is returned to the
memory pool. Then, virtual memory manager de-
cides to keep the tile in memory or swaps out to the
disk based on the page replacement algorithm to free
memory space.480
• saveT ileptidq: If the tile is finished completely by
the compute thread, the result is saved to the disk
via saveT ile function and then the tile is marked as
finished state which is the final state of a tile.
6
Input DEM
(notloaded)
Tile 0
Tile 1
Tile 2
Tile 3
Tile 4
Tile 5
Tile Page Table
System  RAM
(present )Swap Files
(swappedout )
Output DEM
  (finished)
Tile n-1
(a)
notloaded present finished
swappedout
(b)
Figure 4: (a)Tile page table maps each page to a region in a DEM dataset, data block in a physical memory or data stored in a swap file on
the hard disk (b) Tile state diagram
• performLoadBalancingptilesrs, ctidq: Given a set485
of tile ids tilesrs and compute thread id ctid, per-
formLoadBalancing returns the working set of tiles
for the compute thread ctid. The virtual memory
manager distributes the tiles evenly among the com-
pute threads. Load balancing is a program option490
that can be disabled if desired.
• finalizeMemoryManagerpq: The memory resources
used by the virtual memory manager are freed.
4.2. Load balancing
The variable nature of the topography that the algo-495
rithm operates on gives rise to variability in the number
of cells on the unresolved stack for each tile at each it-
eration. Load balancing has been applied for concurrent
MPI execution streams that exhibit irregular structure and
dynamic load patterns [29] and to ensure that the load500
on each core is proportional to its computing power in
multi-core architectures [30]. We developed a load balanc-
ing mechanism that is specifically tailored for the parallel
modified PD algorithm. The modified PD algorithm ap-
plied to a tile iteratively processes unresolved cells whose505
indices are stored on a stack. The size of the stack on a
tile indicates the work load per tile. The load balancing
rule distributes the tiles based on tile stack size (|S|). Fig-
ure 5 illustrates the load balancing mechanism. Let T be
the set of all tiles and CT be the set of compute threads.510
Then
ř
|S| “
ř|T |
i“1 |S|i is the sum of stack sizes, and
|S| “
ř
|S|{|CT | is used as the target number of elements
in the stacks per compute thread. The load balancing al-
gorithm assigns consecutive tiles to each compute thread
to take advantage of CPU caches through the principle of515
memory locality. The number of assigned cells might be
greater than or equal to |S| as the tile granularity affects
the distribution of the load.
Figure 5: Illustration of load balancing among the compute threads;
In this example,
ř
|S| “ 1000, |S| “ 250 for 4 compute threads and
10 tiles
|S|1=90 |S|2=100 |S|3=70 |S|4=140 |S|5=170 |S|6=60 |S|7=50 |S|8=40 |S|9=100 |S|10=180
Thread 1
Working
tile set
T1
T2
T3
Thread 2
Working
tile set
T4
T5
Thread 3
Working
tile set
Thread 4
Working
tile set
T10T6
T9
T8
T7
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10
5. Experimental Results
Numerical experiments were performed to evaluate the520
virtual memory manager using five DEM datasets obtained
from the National Elevation Dataset web site (http://
ned.usgs.gov) (Table 1). These ranged from a relatively
small test dataset to a dataset that exceeded the physical
memory capacity of the test computer. Tile sizes, number525
of compute threads and memory capacity were all varied in
the runs. The experiments were conducted on a machine
equipped with an Intel Core I7 processor with 3.40 GHz
and configured with either 4 GB or 16 GB of RAM. We
used the 64-bit version of Microsoft Windows 7. Although530
not popular in HPC systems, Windows is widely used for
desktop Geographic Information System terrain analysis
7
500
1000
2000
5000
10000
10
00
30
00
50
00
70
00
90
00
11
00
0
13
00
0
15
00
0
Tile Size
To
ta
l T
im
e 
(s
ec
)
Number of Compute Threads
1 CT
2 CTs
3 CTs
4 CTs
(a)
250
500
1000
2000
5000
10000
10
00
30
00
50
00
70
00
90
00
11
00
0
13
00
0
15
00
0
Tile Size
IO
 T
im
e 
(s
ec
)
(b)
200
300
400
10
00
30
00
50
00
70
00
90
00
11
00
0
13
00
0
15
00
0
Tile Size
Al
go
rit
hm
 T
im
e 
(s
ec
)
(c)
Figure 6: Time results in seconds for DEM3 as a function of tile size where y-axis is in logarithmic scale. (a) Total execution time (b) Waiting
time (c) Algorithm time. (for varying number of compute threads up to 4 and prefetch count α “ 1)
and enabling the processing of large DEMs on these sys-
tems was a goal of this work.
Tile size is one of the most important parameters im-535
pacting the performance of the parallel PD algorithm. To
examine the impact of tile size, the parallel PD algorithm
was executed with shared virtual memory management en-
abled and a varying number of compute threads up to 4
using DEM3 with a range of tile sizes from 1000 to 15000540
on a machine with 16 GB RAM. The maximum memory
capacity for this experiment was set to half memory size
of DEM3 dataset for these runs. Maximum memory is the
size of the allocated memory for the entire page structure.
In this experiment it was purposely set less than the total545
memory required by the algorithm, around 4 times the file
size, or 16 GB to induce swapping and exercise the user
level virtual memory management.
The total execution time is a combination of Algorithm
time and I/O Waiting time (i.e., the elapsed time between550
the time a tile is requested and the completion of that re-
quest by the I/O thread). In Figure 6(a), the total exe-
cution time is seen to decrease sharply when more than 2
compute threads are utilized for a tile size of 1000. Par-
allelization is seen here to be effective in overlapping I/O555
operations with computation by the I/O thread. We ob-
serve that total execution time continues to decrease as
the tile size increases up to 9000, in which case the min-
imum time is achieved with 468 seconds when 2 compute
threads are used. After a tile size of 9000, the total execu-560
tion time begins to increase due to a thrashing effect be-
cause the memory manager performs an excessive number
of disk swapping operations. The waiting time accounts
for the majority of this total execution time, indicating
an increased I/O overhead cost associated with small tile565
sizes and correspondingly more distinct disk accesses (Fig-
ure 6(b)). The algorithm time is also dependent on tile
size, reflecting the additional computation involved with
edge exchanges for smaller tile sizes, although this effect
is significantly smaller than the I/O effect quantified by570
the wait time (Figure 6(c)). These experiments show that
the tile size can have a dramatic effect on performance of
the program but this effect diminishes as the tile size in-
creases, and then becomes more prominent due to better
data locality. This experiment suggests limiting tile sizes575
to approximately 9000 for this system for the best perfor-
mance using 3 compute threads. Multithreading also helps
decrease the algorithm time. However, we observe that as
the memory limit decreases, the performance benefit of
multithreading diminishes because the algorithm becomes580
more I/O intensive. Because I/O overhead increases as
tile sizes decrease, experiments for tile sizes less than 1000
were not performed. Conversely, note that when the tile
size is 15000, the algorithm execution times with 3 and
4 compute threads approach the algorithm time with 2585
compute threads because of the unfair load distribution
among the threads. We also performed the experiment for
tile size 24000 to investigate the effect of the biggest tile
size on the virtual memory manager but the experiments
failed due to insufficient memory.590
Table 1: Properties of DEM datasets used in the experiments
Label Location Domain
Size
File Size
DEM1 GSL Basin (100 m cells) 4045x7402 117 MB
DEM2 GSL Basin (27.3 m cells) 14849x27174 1.61 GB
DEM3 Boise River Basin (10 m
cells)
24856x41000 3.98 GB
DEM4 Wasatch Front (10 m cells) 34649x44611 6.05 GB
DEM5 Chesapeake Bay (10 m
cells)
45056x49152 8.65 GB
Memory use during the run of the parallel PD algo-
rithm to evaluate the effect of tile size with the DEM3
dataset was profiled to find the average and maximum
memory used (Table 2) where the memory limit was set to
2000 MB. In practice, because memory is allocated in tile595
size increments for each thread, the maximum memory us-
age is larger than the target memory capacity parameter.
The amount of this excess varies with tile size. When we
set the tile size to 24000, the algorithm is failed to allo-
cate memory space because the test machine doesn’t have600
sufficient memory space and, therefore, we completely lost
the benefit of the virtual memory system. The experi-
8
Table 2: Memory profile result of DEM3 in which expected maximum
memory limit is 2000 MB
Tile Size
(q)
Avg. Memory
Usage
Max. Achieved
Memory Usage
1000 1828 MB 2046 MB
3000 1851 MB 2396 MB
5000 1780 MB 3477 MB
7000 2089 MB 4955 MB
9000 2249 MB 6392 MB
11000 2438 MB 9253 MB
13000 3308 MB 12242 MB
15000 3456 MB 13683 MB
24000 ˆ ˆ
Figure 7: Speedup Ratio of the parallel PD algorithm for DEM1,
DEM2, DEM3 and DEM4
1.0
1.5
2.0
2.5
3.0
3.5
1 2 3 4 5 6 7 8
Number of Compute Threads
Sp
ee
d−
up
 R
at
io Datasets
DEM1
DEM2
DEM3
DEM4
ments show that the amount by which actual memory use
exceeds target memory capacity increases with tile size.
Maximum memory use exceeding the target parameter is605
a disadvantage of this virtual memory management ap-
proach as it limits the control on memory use by the pro-
gram, resulting in potential for operating system virtual
memory to kick in or for the program to crash due to in-
sufficient memory.610
In a second experiment we evaluated the parallel per-
formance of the algorithm on DEM datasets that fit within
RAM and where swapping was not needed. With the hard-
ware reconfigured to 16 GB RAM we tested the speed on
the first four DEM datasets. Although we achieved best615
time result with tile size 9000, tile size was set to 1000.
Speed up ratios for this test (Figure 7) show a beneficial
speed-up relative to the sequential modified PD algorithm
with one I/O thread. A speed-up ratio of around 3 was
obtained with 8 compute threads. Speed-up ratios are gen-620
erally slightly more for the larger datasets, though this pat-
tern is not consistent across the full range of the number
of threads tested, the variability presumably being due to
differences in the data. The machine used in the study has
four cores, but the experiments were expanded to 8 com-625
pute threads to observe the effect of the hyper-threading
feature of the processor, which improves processor perfor-
mance by enabling the execution of pipeline-scaled inter-
leaving of multiple threads on a single core, resulting in a
modest performance benefit.630
Figure 8: Hit rate results in percentage for virtual memory man-
ager without prefetching (blue bars) and with prefetching (red bars)
with respect to memory capacity and number of compute thread for
DEM1
1 2 3 4 5 6 7 8
0
50
100
Full Memory
Number of Compute Threads
Hi
t R
at
e
1 2 3 4 5 6 7 8
0
50
100
1/2 Memory
Number of Compute Threads
Hi
t R
at
e
1 2 3 4 5 6 7 8
0
50
100
1/4 Memory
Number of Compute Threads
Hi
t R
at
e
1 2 3 4 5 6 7 8
0
50
100
Number of Compute Threads
Hi
t R
at
e
1/8 Memory
 
 
Prefetching Disabled Prefetching Enabled
Next, the prefetching capability of the shared virtual
memory manager module, described at the end of Section
3, was evaluated (Figure 8). The experiments were con-
ducted with respect to the fraction of tiles that can be
present in the available system memory. For example, 1/4635
memory means there can be at most 1/4 of total number
of tiles present in memory at the same time. The experi-
ments show that as the size of the memory limit allowed to
be used by the application decreases, the hit rate (i.e., the
fraction of tiles already in memory when requested) also640
decreases due to an increasing number of swapping out op-
erations. This may lead to higher loads on the I/O thread
and disk. By contrast, because the implemented system
uses one I/O thread, prefetching more than one tile ahead
may also result in other threads experiencing higher I/O645
times. This I/O bottleneck is exacerbated in the presence
of multiple compute threads and low memory limits.
Table 3 shows average (HRD) and standard deviation
(σHRD) values of the difference of hit rates (HRD) be-
tween the VM with prefetching (P ) and without prefetch-650
ing (WP ) in percentage with respect to memory capacity
(M) and a number of compute threads (CT ). The values
9
0250000
500000
750000
CT1 CT2 CT3 CT4 CT5 CT6 CT7 CT8
Compute Threads
Nu
m
be
r o
f P
oin
ts
(a)
0
250000
500000
750000
CT1 CT2 CT3 CT4 CT5 CT6 CT7 CT8
Compute Threads
Nu
m
be
r o
f P
oin
ts
(b)
Figure 9: Compute thread loads with respect to a number of processed points (a) Without load balancing (b) With load balancing
Table 3: Average (HRD) and standard deviation (σ
HRD
) of the
difference of hit ratio between the VM with prefetching and with-
out prefetching in percentage with respect to a number of compute
threads for varying memory limits
Statistics Full
Mem.
1/2
Mem.
1/4
Mem.
1/8
Mem.
Average 0.784 5.834 22.664 31.279
Std. Dev. 0.015 0.450 9.355 11.301
are computed as follows:
HRDpCT,Mq “ pHRCT,M,P ´HRCT,M,WP q (1)
HRDpMq “
CTÿ
i“1
HRDi,M
CT
(2)
σHRDpMq “
gffffe
CTÿ
i“1
pHRDpi,Mq ´HRDpMqq2
CT ´ 1
(3)
As the virtual memory limit decreases, the average of
difference of hit rate values increase that shows the benefit655
of using prefetching for low memory limit. However, the
variation of hit ratios also increase. This is because in-
creasing compute threads affects the hit ratio more under
lower memory limits when compared to higher memory
limits.660
We evaluated the load balancing mechanism in which
the tiles are distributed among the compute threads. With-
out a load balancing mechanism the number of points (grid
cells) evaluated by each compute thread is quite unevenly
distributed (Figure 9 a), resulting in the threads with fewer665
points to evaluate waiting for the threads with more points
to evaluate before each synchronization that occurs with
Table 4: DEM datasets for which Pitremove was successfully run us-
ing TauDEM 5.1, the parallel PD algorithm with Operating System’s
built-in virtual memory system (WVM) and tiled virtual memory
(TVM) approach
DEM1
(0.1
GB)
DEM2
(2.3
GB)
DEM3
(4 GB)
DEM4
(6 GB)
DEM5
(8.7
GB)
TauDEM
5.1
! ! ˆ ˆ ˆ
PD
Algorithm
with WVM
! ! ! ˆ ˆ
PD
Algorithm
with TVM
! ! ! ! !
each call of the exchangeBorders function. When load bal-
ancing is not enabled, the compute thread 6 (CT6) pro-
cesses nearly 6ˆ105 more cells than the compute thread 7670
(CT7). However, the load balancing approach resulted in
a much more even distribution of the number of points pro-
cessed by each thread (Figure 9 b) where the difference of
processed cells between the compute threads dropped sig-
nificantly. Fair data distribution can be enhanced by the675
use of smaller tile size that affects the granularity of the
virtual memory.
The objectives of the study were to develop a capability
to run the PD algorithm for large datasets without relying
on the operating system’s built-in virtual memory system.680
We performed an experiment using three approaches for
the five DEMs listed in Table 1. We used a machine config-
ured with 4 GB RAM. The last two DEM datasets require
memory significantly more than this. DEM4 is 6.05 GB
and requires nearly 21 GB of memory for the algorithm,685
while DEM5 is 8.65 GB and requires 34 GB of memory
for the algorithm. We set the target memory capacity for
our tiled virtual memory (TVM) approach to 2 GB and
used a tile size of 1000 because of its better memory usage
efficiency (see Table 2). We also performed a run with the690
tiled virtual memory disabled so that the Windows 7 op-
10
erating system’s built-in virtual memory manager (WVM)
was invoked. The default Windows 7 paging file size was
used. The third comparison ran the same problem using
TauDEM 5.1 [4] which uses an MPI message-passing ap-695
proach to implement the PD parallel algorithm. It has
no virtual memory management so also relies on the Win-
dows 7 virtual memory manager for swapping where the
total memory demanded by all processes is greater than
the physical memory capacity. As illustrated in Table 4,700
we were able to successfully run the PD algorithm with
the developed system for all the datasets, the largest of
which demanded memory more than five times the physical
memory available. By contrast, the TauDEM 5.1 package
failed to process DEM3, DEM4 and DEM5, and the PD705
algorithm with WVM failed to process DEM4 and DEM5.
6. Conclusion and Discussion
The increased availability of high resolution DEMs is
driving the need for software to process and analyze these
on all systems including desktop PC systems with lim-710
ited memory. In this paper we introduced a virtual tile
memory manager that can be used with hydrologic ter-
rain analysis programs to provide user level memory man-
agement and enable the processing of large DEMs on a
PC, where operating system virtual memory management715
and multi-process domain partitioning fail. This was il-
lustrated using a parallel pit removal algorithm for hydro-
logically conditioning DEM datasets. Elimination of pits
is an essential step in hydrologic terrain analysis, required
before applying other hydrological algorithms to calculate720
flow paths, slopes, and delineation of watersheds and sub-
watersheds. We used a parallel implementation of a modi-
fied Planchon and Darboux algorithm [10] for pit removal
in shared-memory multithreaded systems, designed to run
on desktop computers having limited system memory. sys-725
tem. The system is optimized to retrieve tiles with a high
hit ratio. The main conclusions of the system developed
are:
1. We are able to run this hydrologic terrain analysis
algorithm on limited memory systems for datasets so730
large that previously available algorithms fail. This
overcomes the main obstacle of the memory capac-
ity of the machine, by utilizing a user-level shared
virtual memory system.
2. While evaluated with pitremove, there is nothing in735
the user-level virtual memory system that should
preclude it from use with other algorithms which
may require modification of the page replacement al-
gorithm based on algorithm-specific data access pat-
terns to maximize system throughput.740
3. Efficiencies were achieved using algorithm refinements
such as load balancing and the pre-fetching approach
used in the page replacement algorithm of the user-
level shared virtual memory system that increased
hit rate and reduced wait time.745
We conclude from the experimental results that the im-
plemented parallel application is beneficial to geoscientists
and researchers for computing raster-based computations
in very large DEM datasets on a single machine with lim-
ited memory. We believe that the implemented system750
also can be used for other flow algebra algorithms used
in TauDEM [7] with slight modification. For other algo-
rithms the pre-fetching approach will need to be modified
according to the algorithm-specific data access patterns.
Furthermore, it may be beneficial to explore adaptive page755
replacement algorithms that change the replacement algo-
rithm at run time based on the tile-access pattern of the
system.
Acknowledgments
This research was supported by the US Army Engi-760
neer Research and Development Center System-Wide Wa-
ter Resources Program under contract number W912HZ-
11-P-0338. This support is gratefully acknowledged.
References
[1] I. D. Moore, R. B. Grayson, A. R. Ladson, Digital terrain mod-765
elling: A review of hydrological, geomorphological, and bio-
logical applications, Hydrological Processes 5 (1) (1991) 3–30.
doi:10.1002/hyp.3360050103 .
[2] T. Hengl, H. I. Reuter, Geomorphometry: concepts, software,
applications, Vol. 33, Elsevier, 2009.770
[3] J. P. Wilson, J. C. Gallant (Eds.), Terrain Analysis: Principles
and Applications, Wiley, New York, 2000.
[4] D. G. Tarboton, Terrain Analysis Using Digital Elevation Mod-
els (TauDEM), Utah Water Research Laboratory, Utah State
University, http://hydrology.usu.edu/taudem/ (2014).775
[5] T. K. Tesfa, D. G. Tarboton, D. W. Watson, K. A. Schreud-
ers, M. E. Baker, R. M. Wallace, Extraction of hydrological
proximity measures from dems using parallel processing, Envi-
ronmental Modelling and Software 26 (12) (2011) 1696 – 1709.
doi:10.1016/j.envsoft.2011.07.018 .780
[6] C. Wallis, R. Wallace, D. G. Tarboton, D. W. Wat-
son, K. A. T. Schreuders, T. K. Tesfa, Hydrologic
terrain processing using parallel computing, 18th World
IMACS / MODSIM Congress, 2009, pp. 2540 – 2545,
http://www.mssanz.org.au/modsim09/F13/wallis.pdf.785
[7] D. G. Tarboton, K. A. T. Schreuders, D. W. Watson, M. E.
Baker, Generalized terrain-based flow analysis of digital el-
evation models, in: Proceedings of the 18th World IMACS
Congress and MODSIM09 International Congress on Modelling
and Simulation, Cairns, Australia, 2009, pp. 2000–2006.790
[8] D. Engler, S. Gupta, M. Kaashoek, Avm: application-level
virtual memory, in: Hot Topics in Operating Systems, 1995.
(HotOS-V), Proceedings., Fifth Workshop on, 1995, pp. 72–77.
doi:10.1109/HOTOS.1995.513458 .
[9] A. Marowka, Back to thin-core massively parallel processors,795
Computer 44 (12) (2011) 49 –54. doi:10.1109/MC.2011.133 .
[10] O. Planchon, F. Darboux, A fast, simple and ver-
satile algorithm to fill the depressions of digital ele-
vation models, CATENA 46 (23) (2002) 159 – 176.
doi:10.1016/S0341-8162(01)00164-3.800
[11] GDAL Development Team, GDAL - Geospatial Data Ab-
straction Library, Open Source Geospatial Foundation,
http://www.gdal.org (2014).
[12] S. K. Jenson, J. O. Dominque, Extracting topographic struc-
ture from digital elevation data for geographic information sys-805
tem analysis, Photogrammetric Engineering and Remote Sens-
ing 54 (11) (1988) 1593–1600.
11
[13] D. G. Tarboton, R. L. Bras, I. Rodriguez-Iturbe, On
the extraction of channel networks from digital eleva-
tion data, Hydrological Processes 5 (1) (1991) 81–100.810
doi:10.1002/hyp.3360050107.
[14] L. Wang, H. Liu, An efficient method for identifying and
filling surface depressions in digital elevation models for hy-
drologic analysis and modelling, International Journal of
Geographical Information Science 20 (2) (2006) 193–213.815
doi:10.1080/13658810500433453.
[15] R. Barnes, C. Lehman, D. Mulla, Priority-flood: An opti-
mal depression-filling and watershed-labeling algorithm for dig-
ital elevation models, Comput. Geosci. 62 (2014) 117–127.
doi:10.1016/j.cageo.2013.04.024 .820
[16] X. J.-W. LIU Yong-He, ZHANG Wan-Chang, Another fast and
simple dem depression-filling algorithm based on priority queue
structure, Atmospheric and Oceanic Science Letters 2 (4) (2009)
214.
[17] H.-T. Do, S. Limet, E. Melin, Parallel computing flow ac-825
cumulation in large digital elevation models, Procedia Com-
puter Science 4 (0) (2011) 2277 – 2286, proceedings of the In-
ternational Conference on Computational Science, ICCS 2011.
doi:10.1016/j.procs.2011.04.248 .
[18] R. Xu, X. Huang, L. Luo, S. Li, A new grid-associated al-830
gorithm in the distributed hydrological model simulations,
SCIENCE CHINA Technological Sciences 53 (2010) 235–241,
10.1007/s11431-009-0426-4.
[19] H. Wang, Y. Zhou, X. Fu, J. Gao, G. Wang, Maximum speedup
ratio curve (msc) in parallel computing of the binary-tree-based835
drainage network, Computers and; Geosciences 38 (1) (2012)
127 – 135. doi:10.1016/j.cageo.2011.05.015.
[20] L. Ortega, A. Rueda, Parallel drainage network computation
on cuda, Computers and Geosciences 36 (2) (2010) 171 – 178.
doi:10.1016/j.cageo.2009.07.005 .840
[21] J. F. O’Callaghan, D. M. Mark, The extraction of drainage
networks from digital elevation data, Computer Vision,
Graphics, and Image Processing 28 (3) (1984) 323 – 344.
doi:http://dx.doi.org/10.1016/S0734-189X(84)80011-0 .
[22] S. K. Reinhardt, J. R. Larus, D. A. Wood,845
Tempest and typhoon: User-level shared memory,
SIGARCH Comput. Archit. News 22 (2) (1994) 325–336.
doi:10.1145/192007.192062.
URL http://doi.acm.org/10.1145/192007.192062
[23] A. Basu, J. Gandhi, J. Chang, M. D. Hill, M. M.850
Swift, Efficient virtual memory for big memory servers,
SIGARCH Comput. Archit. News 41 (3) (2013) 237–248.
doi:10.1145/2508148.2485943.
URL http://doi.acm.org/10.1145/2508148.2485943
[24] H. Huang, P. Pillai, K. G. Shin,855
Design and implementation of power-aware virtual memory,
in: Proceedings of the Annual Conference on USENIX Annual
Technical Conference, ATEC ’03, USENIX Association, Berke-
ley, CA, USA, 2003, pp. 5–5.
URL http://dl.acm.org/citation.cfm?id=1247340.1247345860
[25] J. Pardy, K. Hartling, Efficient Data Manage-
ment and Analysis with Geoprocessing, ESRI,
http://proceedings.esri.com/library/userconf/devsummit13/papers/devsummit-
086.pdf (2014).
[26] A. S. Tanenbaum, A. S. Woodhull, A. S. Tanenbaum, A. S.865
Tanenbaum, Operating systems: design and implementation,
Vol. 2, Prentice-Hall Englewood Cliffs, NJ, 1987.
[27] K. Li, Ivy: A shared virtual memory system for parallel com-
puting., ICPP (2) 88 (1988) 94.
[28] A. J. Smith, Sequential program prefetching in memory hierar-870
chies, Computer 11 (12) (1978) 7–21.
[29] M. Bhandarkar, L. Kal, E. de Sturler, J. Hoeflinger,
Adaptive load balancing for mpi programs, in: V. Alexandrov,
J. Dongarra, B. Juliano, R. Renner, C. Tan (Eds.), Compu-
tational Science - ICCS 2001, Vol. 2074 of Lecture Notes in875
Computer Science, Springer Berlin Heidelberg, 2001, pp. 108–
117. doi:10.1007/3-540-45718-6_13.
URL http://dx.doi.org/10.1007/3-540-45718-6_13
[30] T. Li, D. Baumberger, D. A. Koufaty, S. Hahn, Efficient oper-
ating system scheduling for performance-asymmetric multi-core880
architectures, in: Proceedings of the 2007 ACM/IEEE Confer-
ence on Supercomputing, SC ’07, ACM, New York, NY, USA,
2007, pp. 53:1–53:11. doi:10.1145/1362622.1362694.
12
