Abstract. Flash memory-based solid-state disks are fast becoming the dominant form of end-user storage devices, partly even replacing the traditional hard-disks. Existing two-level memory hierarchy models fail to realize the full potential of flash-based storage devices. We propose two new computation models, the general flash model and the unit-cost model, for memory hierarchies involving these devices. Our models are simple enough for meaningful algorithm design and analysis. In particular, we show that a broad range of existing external-memory algorithms and data structures based on the merging paradigm can be adapted efficiently into the unit-cost model. Our experiments show that the theoretical analysis of algorithms on our models corresponds to the empirical behavior of algorithms when using solid-state disks as external memory.
Introduction
In many practical applications, one needs to compute on data that exceeds the capacity of the main memory of the available computing device. This happens in a variety of settings, ranging from small devices, such as PDAs, to high-performance servers and large clusters. In such cases, the cost of data transfers between disk and main memory often proves to be a critical bottleneck in practice, since a single disk transfer may be as time-costly as millions of CPU operations. To capture the effect that memory transfers have on the running time of algorithms, several computational models have been proposed over the past decades. One of the most successful of these models is the I/O-model.

I/O-model. The I/O-model, as defined in [1], is a two-level memory model. It consists of a CPU, a fast internal memory of size M and a slow external memory of infinite size. The CPU can access only data stored in the internal memory, and data transfers between the two memories are performed in chunks of B consecutive data items. The I/O-complexity of an algorithm is given by the number of memory transfers, or I/Os, performed. Many problems have been studied in this model and efficient algorithms have been proposed. For comprehensive overviews we refer the interested reader to [2, 3].
Flash memories. In recent years, a new trend has emerged in storage device technology: solid-state disks based on flash memory. Flash memories are non-volatile, reprogrammable memories. Compared to traditional mechanical hard-disks, flash memory devices are lighter, more shock resistant and consume less power. Moreover, since random accesses are faster on solid-state disks than on mechanical hard-disks, flash memory is fast becoming the dominant form of end-user storage in mobile computing. Many recent notebook and netbook models have already replaced traditional mechanical hard-disks by flash memory disks. Market research company In-Stat predicted in July 2006 that 50% of all mobile computers would use flash (instead of hard-disks) by 2013.
Flash memory devices typically consist of an array of memory cells that are grouped into pages of consecutive cells, where a fixed number of consecutive pages form a block. Reading a bit is performed by reading the whole page containing the given bit. When writing, we distinguish between changing bits from 1 to 0 and from 0 to 1. To change a bit from 0 to 1, the device first "erases" the entire block containing the given bit, i.e. all the bits in the block are set to 1. However, changing a bit from 1 to 0 is done by writing only the page containing it, and each page can be programmed only a small number of times before it must be erased again. Reading and writing pages is relatively fast, whereas erasing a block is significantly slower. Each block can sustain only a limited number of erasures. To prevent blocks from wearing prematurely, flash devices usually have a built-in micro-controller that dynamically maps the logical block addresses to physical addresses to even out the erase operations sustained by the blocks.
Related work. Recently, there has been an increased interest in using flash memories to improve the performance of computer systems. This includes the experimental use of flash memories in database systems [4-6], using flash memories as caches in hard-disks (e.g. Seagate's Momentus 5400 PSD hybrid drives), Windows Vista's ReadyBoost, i.e. using USB flash memories as a cache, or integrating flash memories into motherboards or I/O-buses, e.g. Intel's Turbo Memory technology [7].
Most previous algorithmic work on flash memories deals with wear leveling, i.e. block-mapping and flash-targeted file systems (see [8] for a comprehensive survey). There exists very little work on algorithms designed to exploit the characteristics of flash memories. Wu et al. [9, 10] proposed flash-aware implementations of B-trees and R-trees without file system support by explicitly handling block-mapping. More recently, efficient dictionaries on flash disks have been engineered [11]. Other works include the use of flash memories for model checking [12] or route planning on mobile devices [13, 14].
Our contributions. Owing to the lack of good computation models that help exploit the particular characteristics of flash devices, there is no firm theoretical foundation for comparing algorithms. In this paper, we propose two computational models for flash devices that exploit their constructive characteristics: the general flash model and the unit-cost flash model. These models can be used as a basis for a theoretical comparison between different algorithms on flash memory devices. While the general flash model is very generic and is especially suitable for studying lower bounds, the unit-cost flash model is appealing for the design and analysis of algorithms. In particular, we show that a large number of external-memory algorithms can be easily adapted to give efficient algorithms in the unit-cost flash model. Interestingly, we observe that external-memory algorithms based on the merging paradigm are easy to adapt to the unit-cost flash model, while this is not true for algorithms based on the distribution paradigm. We conduct experiments on several algorithms exhibiting various I/O-access patterns, i.e. random and sequential reads, as well as random and sequential writes. Our experiments confirm that the analysis of algorithms on our models (particularly, the unit-cost flash model) predicts the observed running times much better than the I/O-model. Our experiments also show that the adaptations of these algorithms improve their running times on solid-state disks.
Models for flash memory
In this section we propose and discuss models for flash memories. We first discuss the practical behavior of flash memories. We then propose two models of computation, a general flash model and a unit-cost flash model. They are both based on the I/O-model, but use a different block size for reading than for writing.
Flash memory behavior. Due to their constructive characteristics, flash memories behave significantly differently from hard disks in practice [15-17]. In Figure 1 we give empirical results showing the dependence of throughput on the block size when performing random reads and writes, as well as sequential reads and writes. We used two different disks: a 64 GB Hama SSD drive and a Seagate Barracuda 7200 rpm 500 GB hard-drive. The main difference concerns the relative performance of random reads and random writes. For hard-disks random reads and random writes provide similar throughput, whereas for the SSD drive random reads provide significantly more throughput than random writes, especially for small block sizes. Furthermore, the throughput of random accesses converges to the throughput of the corresponding sequential accesses at different block sizes, implying different block sizes for reading and writing. Also, the throughput provided by sequential reads is nearly the same as the throughput provided by sequential writes for most flash devices [15].
The key characteristic of the flash devices that we model is the different block sizes for reading and writing. For the general flash model we also consider different throughput for reading and writing. To keep our computation models simple enough for algorithm design, we abstract away the other flash-memory characteristics, such as the effects of misalignment and limited endurance.

General flash model. As in the I/O-model, the general flash model consists of a CPU, an internal memory of size M, and an external flash memory of infinite size. The input and output data reside on the external flash memory, and computation can only be done on data residing in the internal memory. Read and write I/Os from and to the flash memory occur in blocks of consecutive data of sizes B_r and B_w respectively. The complexity of an algorithm is x + c · y, where x and y are the number of read and write I/Os respectively, and c is a penalty factor for writing. Similarly to the I/O-model, the parameters M, B_r, B_w, and c are known to the algorithms. Typically, we assume B_r ≤ B_w < M ≪ N and c ≥ 1, where N denotes the input size. We note that the I/O-model is a particular case of this general model, obtained when B_r = B_w = B and c = 1.
Unit-cost flash model. The fact that in the general flash model c may take arbitrary values implies arbitrary relative costs between read and write I/Os. This complicates the reuse of existing external-memory algorithms and algorithmic techniques. In [15] it was shown that for most flash devices the throughput provided by reads and writes is nearly the same, assuming proper block sizes, i.e. B_r and B_w are set so that the maximum throughput is achieved on random I/Os. This means that, in spite of different read and write block sizes, the access time per element is nearly the same. The unit-cost flash model is the general flash model augmented with the assumption of an equal access time per element for reading and writing. This simplifies the model considerably, since it becomes significantly easier to adapt external-memory results. For the sake of clarity, the cost of an algorithm performing x read I/Os and y write I/Os is given by x · B_r + y · B_w, where B_r and B_w denote the read and write block sizes respectively. Essentially, the cost of an algorithm in this model is given by the total number of items transferred between the flash disk and the internal memory.
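The two cost measures can be written down directly. The following is a minimal sketch; the function names and the sample numbers are illustrative assumptions and do not come from the measurements in this paper.

```python
def general_flash_cost(reads: int, writes: int, c: float) -> float:
    """Cost x + c*y in the general flash model (x read I/Os, y write I/Os)."""
    if c < 1:
        raise ValueError("the model assumes a write penalty c >= 1")
    return reads + c * writes


def unit_cost_flash_cost(reads: int, writes: int, b_r: int, b_w: int) -> int:
    """Cost x*B_r + y*B_w: all items moved, whether useful or not."""
    return reads * b_r + writes * b_w


# Hypothetical workload: 1000 read I/Os and 10 write I/Os,
# with B_r = 128 items and B_w = 16384 items.
print(general_flash_cost(1000, 10, c=1.0))         # 1010.0
print(unit_cost_flash_cost(1000, 10, 128, 16384))  # 291840
```

With c = 1 and equal block sizes, the first function reduces to the plain I/O count of the I/O-model.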
For both models, we note that "items transferred" refers to all the B_r (B_w) elements moved during a read (write) I/O and not just the useful elements transferred. Also, our models can be adapted to obtain hardware-oblivious models.
Relating unit-cost models to external-memory models. We turn to exploring the relation between the unit-cost models and the external-memory models. Consider some algorithm A in the unit-cost flash model, which transfers f(N, M, B_r, B_w) items. Denote by f_r(N, M, B_r, B_w) the total cost for read I/Os and let f_w(N, M, B_r, B_w) be the total cost for write I/Os, so that f = f_r + f_w. The algorithm is executed as an external-memory algorithm with a block size B = B_r as follows. Read operations are done in blocks of size B_r and therefore the reads incur f_r(N, M, B_r, B_w)/B_r I/Os, whereas writes are done in blocks of size B_w, which implies that each write incurs B_w/B_r I/Os. We obtain that all the writes take (f_w(N, M, B_r, B_w)/B_w) · (B_w/B_r) = f_w(N, M, B_r, B_w)/B_r I/Os, so the simulation performs f(N, M, B_r, B_w)/B_r I/Os in total.

Lemma 1. Any algorithm in the unit-cost flash model that transfers f(N, M, B_r, B_w) items can be simulated by an external-memory algorithm with block size B = B_r that performs f(N, M, B_r, B_w)/B_r I/Os.
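The simulation in Lemma 1 provides an efficient mechanism for obtaining lower bounds in the unit-cost flash model, as stated in Lemma 2.

Lemma 2. A problem that requires Ω(L(N, M, B)) I/Os in the I/O-model requires Ω(B_r · L(N, M, B_r)) items to be transferred in the unit-cost flash model.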
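The way the simulation turns an I/O-model lower bound into a unit-cost lower bound can be spelled out in two steps. The derivation below is a sketch of the standard argument rather than a reproduction of the original proof; L(N, M, B) is used here as a placeholder for an arbitrary I/O-model lower bound, and f is the number of items transferred by any unit-cost algorithm for the problem.

```latex
% The simulation with B = B_r performs f / B_r I/Os, and no I/O-model
% algorithm with block size B_r can beat the lower bound L(N, M, B_r):
\[
  \frac{f(N, M, B_r, B_w)}{B_r} = \Omega\bigl(L(N, M, B_r)\bigr)
  \quad\Longrightarrow\quad
  f(N, M, B_r, B_w) = \Omega\bigl(B_r \cdot L(N, M, B_r)\bigr).
\]
% Instance: sorting, with L(N, M, B) = (N/B) \log_{M/B} (N/B).
\[
  f_{\mathrm{sort}} = \Omega\!\left(B_r \cdot \frac{N}{B_r}\,\log_{M/B_r}\frac{N}{B_r}\right)
                    = \Omega\!\left(N \log_{M/B_r}\frac{N}{B_r}\right).
\]
```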
Algorithms for the unit-cost flash model
Typical external-memory algorithms manipulate buffers using various operations, such as merging and distributing. Given that in the unit-cost flash model the block sizes for reads and writes are different, algorithms can merge O(M/B_r)-way and distribute O(M/B_w)-way. Since M/B_r > M/B_w, merging is preferred to distributing because more buffers can be manipulated simultaneously. A surprisingly large body of merging-based external-memory algorithms (and data structures) can be easily adapted to obtain efficient and sometimes even optimal algorithms (and data structures) in the unit-cost flash model, sometimes by simply setting the block size B to B_r. In this section we show a few typical examples of how simple changes lead to efficient algorithms in the unit-cost flash model.
Sorting
Sorting N records in the I/O-model requires Ω((N/B) log_{M/B}(N/B)) I/Os [1]. Using Lemma 2, we obtain that sorting N elements needs Ω(N log_{M/B_r}(N/B_r)) items to be transferred in the unit-cost flash model.
To sort in the unit-cost flash model, we use multi-way mergesort, which is optimal in the I/O-model, and we show that it is also optimal in the unit-cost flash model. The algorithm splits the input into Θ(M/B) subsequences, recursively sorts them, and in the end merges the (sorted) subsequences. The I/O-complexity is Θ((N/B) log_{M/B}(N/B)) I/Os. For the unit-cost flash model, different costs are achieved depending on the number of subsequences the input is split into. Splitting the input into Θ(M/B_w) subsequences yields an algorithm that transfers O(N log_{M/B_w}(N/B_w)) items, whereas splitting Θ(M/B_r)-way yields the optimal Θ(N log_{M/B_r}(N/B_r)) cost.

Lemma 3. Sorting N elements can be done by transferring Θ(N log_{M/B_r}(N/B_r)) items in the unit-cost flash model.
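The effect of the merge fan-in on the number of items transferred can be illustrated with a small pass-counting sketch. It assumes a simplified mergesort in which run formation creates sorted runs of M items in one pass and every merge pass reads and writes all N items once; the function name and the sample sizes are hypothetical, and real implementations use a somewhat smaller fan-in because of buffer overheads.

```python
import math


def mergesort_items_transferred(n: int, m: int, fan_in: int) -> int:
    """Items read plus items written by a fan_in-way external mergesort.

    Assumption: one run-formation pass produces ceil(n / m) sorted runs,
    and each later pass merges fan_in runs at a time over all n items.
    """
    runs = math.ceil(n / m)
    passes = 1  # run formation
    while runs > 1:
        runs = math.ceil(runs / fan_in)
        passes += 1
    return 2 * n * passes  # every pass reads n items and writes n items


# Illustrative sizes counted in bytes: M = 512 MB, B_r = 128 KB, B_w = 16 MB.
N, M, B_R, B_W = 2**38, 2**29, 2**17, 2**24
print(mergesort_items_transferred(N, M, M // B_R) // N)  # 4
print(mergesort_items_transferred(N, M, M // B_W) // N)  # 6
```

In this example a fan-in of M/B_r = 4096 needs a single merge pass over the 512 initial runs, whereas a fan-in of M/B_w = 32 needs two.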
Data structures
In this section we give brief descriptions of efficient implementations for search trees and priority queues in the unit-cost flash model.
Search trees. For searching, we show how to adapt the B-trees used in the I/O-model to obtain an efficient implementation in the unit-cost flash model. We employ a two-level structure. The primary data structure is a B-tree with a fan-out of Θ(B_w); each node of the primary structure is itself stored as a B-tree, but with nodes having a fan-out of Θ(B_r). Searches and updates transfer O(B_r log_{B_r} N) items.
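The search cost of the two-level structure can be checked with a short calculation. This is a sketch under simplifying assumptions: the fan-outs are taken to be exactly B_w and B_r (the constants hidden by Θ are dropped), each visited inner node costs one read of B_r items, and the helper names are hypothetical.

```python
def ceil_log(x: int, base: int) -> int:
    """Smallest h >= 0 with base**h >= x, using integer arithmetic only."""
    h, reach = 0, 1
    while reach < x:
        reach *= base
        h += 1
    return h


def search_items_read(n: int, b_r: int, b_w: int) -> int:
    """Items read by one root-to-leaf search in the two-level B-tree."""
    primary_levels = ceil_log(n, b_w)   # height of the fan-out-B_w tree
    inner_levels = ceil_log(b_w, b_r)   # height of the tree inside each node
    return primary_levels * inner_levels * b_r


# N = 2^40 keys, B_r = 2^10, B_w = 2^20: 2 * 2 = 4 = log_{B_r} N reads.
print(search_items_read(2**40, 2**10, 2**20))  # 4096
```

The product of the two heights equals log_{B_r} N whenever B_w is a power of B_r, which matches the stated O(B_r log_{B_r} N) bound.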
Priority queues. Several optimal external-memory priority queues have been proposed [18-21]. Each of them takes amortized O((1/B) log_{M/B}(N/B)) I/Os per operation. However, only the cache-oblivious priority queue in [20] translates directly into an optimal priority queue in the unit-cost flash model, transferring amortized O(log_{M/B_r}(N/B_r)) items per operation. This is because it only merges buffers, whereas the other priority queues also employ distribution and achieve only amortized O(log_{M/B_w}(N/B_w)) transferred items. We note that priority queues are the core of time-forward processing, a technique widely employed to achieve efficient external-memory graph algorithms.
BFS
For BFS on undirected graphs G(V, E) in the unit-cost flash model, we focus on the randomized external-memory algorithm by Mehlhorn and Meyer [22]. For ease of exposition, we restrict ourselves to sparse graphs, i.e. |E| = O(|V|). The algorithm starts with a preprocessing phase, in which the input graph is rearranged on disk. This is done by building |V|/µ disjoint clusters of small diameter (O(µ · log |V|) with high probability (whp.)) that are laid out contiguously on disk. In the BFS phase, the algorithm exploits the fact that in an undirected graph, the edges from a node in BFS level t lead to nodes in BFS levels t − 1, t or t + 1 only. Thus, in order to compute the nodes in BFS level t + 1, the algorithm collects all neighbors of nodes in level t, removes duplicates and removes the nodes visited in levels t − 1 and t. For collecting the neighbors of nodes efficiently, the algorithm spends one random read I/O (and possibly some further sequential read accesses, depending on the cluster size) for loading a whole cluster as soon as a first node of it is visited, and then keeps the cluster data in an efficiently accessible data structure (hot pool) until all nodes in the cluster are visited.
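The level-by-level rule used in the BFS phase can be illustrated in memory. The sketch below only shows how level t + 1 is derived from levels t and t − 1; it does not model the clustering, the hot pool, or any I/O, and the function name and the sample graph are assumptions made for illustration.

```python
def bfs_levels(adj: dict[int, list[int]], source: int) -> list[set[int]]:
    """BFS levels of an undirected graph given as adjacency lists."""
    levels = [{source}]
    prev: set[int] = set()  # level t - 1
    while levels[-1]:
        cur = levels[-1]    # level t
        # Collect all neighbors of level t; the set removes duplicates.
        neighbors = {w for v in cur for w in adj[v]}
        # In an undirected graph, visited neighbors lie in level t - 1 or t.
        nxt = neighbors - cur - prev
        prev = cur
        levels.append(nxt)
    levels.pop()            # drop the trailing empty level
    return levels


graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(bfs_levels(graph, 0))  # [{0}, {1, 2}, {3}, {4}]
```

In the external-memory setting the set operations would be replaced by sorting and scanning; the rule is only correct for undirected graphs.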
Experimental results
The main goal of our experimental study is to verify the suitability of the proposed unit-cost flash model for predicting the running time of algorithms using an SSD as external memory. We want to check how well the behavior of the algorithms on SSDs corresponds to their theoretical analysis in the unit-cost flash model. In particular, we look at the improvements from the adaptation process as predicted theoretically in the unit-cost flash model and ascertain whether these gains are actually observed in practice. We consider three algorithms which present various I/O-patterns and have very different complexities in the I/O-model. First, we consider sorting, which takes sort(N) = O((N/B) log_{M/B}(N/B)) I/Os and performs mainly sequential I/Os. We then move to BFS, which requires O(|V| · sqrt(log |V| / B) + sort(|V|)) I/Os whp. for sparse graphs and causes both sequential and random reads, but no random writes. Finally, the classical DFS implementation performs O(|V|) I/Os on sparse graphs and does a large number of random reads and writes. We observe the performance of these algorithms when using an SSD as external memory.
Experimental setup. For algorithms and data structures designed in the I/O-model we use existing implementations from the STXXL library [23] wherever possible. We show results where the size of the blocks in which data is transferred between the internal memory and the flash device is set to either the read or the write block size of the device. According to our flash models, algorithms read blocks of size B_r and write blocks of size B_w. To comply with this requirement, we implement a translation layer similar to Easy Computing Company's MFT (Managed Flash Technology) [24]. The translation layer prevents random writes of blocks of size B_r by buffering B_r-sized blocks into blocks of size B_w that provide optimal throughput when written to the disk. When using the translation layer, an algorithm reads and writes pages of size B_r. Transparently to the algorithm, the translation layer logically groups B_w/B_r pages into a block of size B_w, which is written to the flash disk. To do so, O(1) B_w-sized buffers are reserved in memory, so that when one such buffer gets full it is immediately written to the flash disk. To keep track of the data used, this layer maintains a mapping from the logical addresses of the pages viewed by the algorithm to their actual addresses on the flash disk. Since this mapping occupies little space and is used only to manage temporary data, the translation layer is stored in main memory throughout the execution of the algorithm. Additionally, the translation layer is responsible for keeping track of the free pages and blocks.
Due to its simplicity and generality, we view the translation layer as a generic, easy-to-implement adaptation of I/O algorithms to algorithms in the unit-cost flash model. However, we note that there exist cases where the translation layer cannot be employed, e.g. for extremely large inputs, when the translation layer may no longer fit into the main memory.
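The buffering and address mapping performed by the translation layer can be sketched as follows. This is a toy in-memory model and not the implementation used in our experiments: the class and method names are hypothetical, the flash disk is replaced by a Python list, a single write buffer is used, and free-space tracking and reclamation of stale pages are omitted.

```python
class TranslationLayer:
    """Groups B_r-sized page writes into B_w-sized block writes."""

    def __init__(self, pages_per_block: int):
        self.pages_per_block = pages_per_block          # B_w / B_r
        self.buffer: list[tuple[int, bytes]] = []       # pending (page, data)
        self.blocks: list[list[bytes]] = []             # stands in for the disk
        self.mapping: dict[int, tuple[int, int]] = {}   # page -> (block, slot)
        self.block_writes = 0                           # B_w-sized writes issued

    def write_page(self, logical: int, data: bytes) -> None:
        # Replace a still-buffered older version of the same logical page.
        self.buffer = [(p, d) for p, d in self.buffer if p != logical]
        self.buffer.append((logical, data))
        if len(self.buffer) == self.pages_per_block:
            self._flush()

    def _flush(self) -> None:
        # Write the full buffer as one block and record the new addresses.
        block_id = len(self.blocks)
        self.blocks.append([d for _, d in self.buffer])
        for slot, (page, _) in enumerate(self.buffer):
            self.mapping[page] = (block_id, slot)
        self.buffer = []
        self.block_writes += 1

    def read_page(self, logical: int) -> bytes:
        # Pages that are still buffered are served without touching the disk.
        for page, data in self.buffer:
            if page == logical:
                return data
        block_id, slot = self.mapping[logical]
        return self.blocks[block_id][slot]


tl = TranslationLayer(pages_per_block=4)
for page in (10, 3, 7, 1):        # random logical pages, one block write
    tl.write_page(page, bytes([page]))
print(tl.block_writes, tl.mapping[7])  # 1 (0, 2)
```

Random page writes issued by the algorithm thus reach the disk only as sequentially filled blocks of B_w/B_r pages.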
Our experiments were conducted on a standard Linux machine, with an Intel Core 2 Quad 2.4 GHz CPU, 8 GB RAM, of which algorithms are restricted to use only 512 MB, and a 64 GB Hama flash disk. The smallest block sizes where the disk reaches optimal performance for random reads and random writes are 128 KB and 16 MB respectively (see Figure 1), and consequently we set B_r and B_w to these values. The code was compiled using GCC version 4.3.
Sorting. For sorting we consider the STXXL implementation, which is based on (cache-aware) multi-way mergesort. The results in Table 1 show that when the block size is set to B_w, the running time is larger than when the block size equals B_r, and the volume of data read and written by the algorithm is larger as well. This behavior is easily explained theoretically by the larger number of recursion levels in the former case, noticeable in the ratio between the read/write volumes and the input volume. Also, when using the translation layer we obtain results very similar to those obtained when setting the block size to B_r. This behavior is also in line with the theoretical analysis in the unit-cost flash model, since the algorithm essentially writes data sequentially, and in this case writing blocks of size B_r yields the same throughput as writing blocks of size B_w (when using the translation layer). Such a behavior would be inexplicable in the I/O-model, which assumes equally sized blocks for reading and writing. We note that, due to the limited size of the flash disk, we could not sort larger sequences.

Table 1. The read volume (RDV), write volume (WRV), and the running time (RT) for sorting N random integers (taking the specified volume) when using the translation layer (TL), and when setting the block size to B_r and to B_w respectively. RDV and WRV are measured in GB, and RT is measured in seconds.
BFS. We perform experiments on square grid graphs as they have proven to be a difficult graph class [25] for the external-memory BFS algorithm. As shown in Table 2, using the translation layer yields only a small benefit compared to the read block size. This is explained by the fact that the algorithm performs no random writes, while random and sequential reads are not affected by the layer. For preprocessing, using a smaller block size, and consequently a smaller µ, results in a smaller running time, since the computed clusters tend to contain fewer nodes and have a smaller diameter. Comparing the preprocessing times for B_r and B_w on the square grid graph in Table 2 confirms this, as preprocessing using B_w takes up to three times as long as when B_r is used.

Table 2. Read/write volumes (in GB) and running times (in seconds) for external-memory BFS with randomized preprocessing on square grid graphs, separated into preprocessing phase (pp) and BFS phase, using block sizes B_r, B_w and the translation layer (TL).
For the BFS phase, choosing a larger block size reduces the number of random I/Os needed to load clusters, but at the same time potentially increases the size of the hot pool because clusters with bigger diameter tend to stay longer in the pool. This affects the performance adversely if the hot pool no longer fits in internal memory, as can be seen in Table 2 for |V| ≥ 2^24. At that point the algorithm using B_w is outperformed by the one using B_r.
DFS. For DFS, we use a straightforward non-recursive implementation of the textbook RAM algorithm. The algorithm explores the graph by visiting for each node the first not yet visited neighbor, and to do so we use two data structures: a vector to mark the nodes visited and a stack to store the nodes for which not all the neighbors have been visited. The key particularity of this algorithm is that it performs extensive random reads to access many adjacency lists, as well as extensive random writes to mark the nodes. For a graph G = (V, E) the unit-cost of the algorithm is given by O(|E| · B_r + |V| · B_w), since there are |E| read accesses to the adjacency lists and |V| write accesses to mark the vertices visited. The costs for accessing the stack are much smaller since both reads and writes can be buffered. We note that when transferring data in chunks of size B_r the cost of the algorithm remains O(|E| · B_r + |V| · B_w), but when the block size is set to B_w the cost increases to O(|E| · B_w + |V| · B_w).
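The structure of the non-recursive DFS and its unit-cost estimate can be sketched as follows. This is an in-memory illustration and not the code used in the experiments: storing the next adjacency-list position on the stack is an implementation choice assumed here, the function names are hypothetical, and the cost function simply evaluates the bound above with all constants set to one.

```python
def dfs_order(adj: list[list[int]], source: int) -> list[int]:
    """Visit order of a non-recursive DFS using a visited vector and a stack."""
    visited = [False] * len(adj)      # random writes in the external setting
    visited[source] = True
    order = [source]
    stack = [(source, 0)]             # (node, next adjacency-list position)
    while stack:
        v, start = stack.pop()
        for j in range(start, len(adj[v])):  # read access to adj[v]
            w = adj[v][j]
            if not visited[w]:
                visited[w] = True
                order.append(w)
                stack.append((v, j + 1))     # resume v after w is finished
                stack.append((w, 0))
                break
    return order


def dfs_unit_cost(edges: int, nodes: int, read_block: int, b_w: int) -> int:
    """|E| * read_block + |V| * B_w, with read_block equal to B_r or B_w."""
    return edges * read_block + nodes * b_w


print(dfs_order([[1, 2], [0, 3], [0, 3], [1, 2]], 0))  # [0, 1, 3, 2]
```

Calling dfs_unit_cost with read_block set to B_w instead of B_r scales the dominant read term by B_w/B_r.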
We conduct experiments which show the running time of DFS when transferring chunks of B_r and B_w consecutive data between the memory and the flash disk, as well as when using the translation layer. Due to the extensive running times, we restrict ourselves to square grid graphs. We observed that for all input sizes using the translation layer yields better running times than doing I/Os in blocks of size B_r, which is due to writing many blocks of size B_r at random locations. When the graph fits into the main memory the algorithm is extremely fast. For |V| ≤ 2^20, the running times were below two seconds. However, when the graph no longer fits into the main memory, the running times and the I/O-traffic increase significantly.
For |V| = 2^22, the running times were 4 180, 4 318, and 610 000 seconds for the translation layer, B_r, and B_w block sizes respectively. The huge running time for the B_w block size is explained by the huge volume of read data, of about 46 TB, compared to 336 GB read when using B_r-sized blocks and 311 GB when using the translation layer. The volume ratio between B_w and B_r approximately matches B_w/B_r = 128. However, the volume of data written was very low (less than 300 MB in each experiment). This is due to the vector marking the visited nodes residing completely in memory.
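The comparison between the block-size ratio and the observed read-volume ratio is simple arithmetic. The sketch below assumes binary units (1 TB = 1024 GB), which is an assumption about how the volumes were reported; with decimal units the second ratio is slightly smaller.

```python
KB, MB, GB, TB = 2**10, 2**20, 2**30, 2**40  # binary units (assumption)

block_ratio = (16 * MB) // (128 * KB)        # B_w / B_r
volume_ratio = (46 * TB) / (336 * GB)        # read volume with B_w vs. B_r

print(block_ratio)             # 128
print(round(volume_ratio, 1))  # about 140 under binary units
```

Both ratios are of the same order, which is what the read term |E| · B of the DFS cost predicts when B grows from B_r to B_w.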
Therefore we used another approach and stored the visited information with each node, effectively scattering the bits over a larger range of external memory. Internal memory was further restricted to cache at most half of an external-memory data structure. Comparable experiments with block size B_w are not possible in these settings because the internal memory cannot store the required minimum number of blocks. For |V| = 2^21 the DFS using the translation layer took 6 064 seconds, reading 250 GB and writing 146 GB of data. Using block size B_r instead, the running time increased to 11 352 seconds and the read volume to 421 GB, while the write volume was 145 GB. The translation layer could serve a fraction of the read requests directly from its write buffers, which explains why the read volume is larger without it. While the written volume and write throughput rate were nearly unchanged (145 GB, 77-80 MB/s), the read throughput dropped from 69 MB/s to 46 MB/s. The suboptimal block size used for writing evidently triggers reorganization operations in the flash device that block subsequent operations (reads in our case). This accounts for the major part of the additional running time and shows a clear benefit of having the translation layer bundle these small random write requests.
Conclusions and future research
We proposed two models that capture the particularities of flash memory storage devices, the general flash model and the unit-cost flash model. We showed that existing external-memory algorithms and data structures based on the merging paradigm can be easily translated into efficient algorithms in the unit-cost flash model. Relevant examples include sorting, search trees, priority queues, and undirected BFS. Our experiments show that the unit-cost flash model correctly predicts the running times of several algorithms that present various I/O-patterns.
For the general flash model, an interesting future direction concerns obtaining lower bounds for fundamental problems, such as sorting or graph traversals, even for extreme cases when we set the penalty factor c to a very large value that allows the algorithm to write only the output. Future investigations in this model include engineering fast algorithms for basic problems, such as sorting.
For the unit-cost flash model, possible topics for future research include identifying problems for which the best external memory upper bounds cannot be matched in the unit-cost flash model.
Promising directions also include introducing relevant computational models that capture other characteristics of the flash devices and yet allow meaningful algorithm design.
