Introduction
Efficient hardware acceleration and adoption require simplicity and universality. The best examples supporting this claim are the two most widely used methods in modern interactive computer graphics, namely hardware accelerated rasterization and depth buffering. There were many attempts at overthrowing their predominant position, but none of them succeeded. Simplicity gave these methods ample room for hardware and software based optimization, aimed both at greater performance and at lower production cost. At the same time, universality allowed for outstanding adoption by the programming community.
In light of the above, we found that the way to succeed in this highly pragmatic ecosystem is not to fight the existing status quo. We would rather comply with existing solutions than try to overthrow them completely. At the same time, we note that it has been over thirty years since Wolfgang Straßer (in early 1974) and later Edwin Earl Catmull introduced the concept of Z-buffering as a way of "flattening" 3D objects for representation on the 2D surface of a display device. Graphics algorithms soon moved beyond storing a single data element for each pixel: the work behind the less constrained Carpenter's A-buffer started just a decade later.
Yet to this day, methods that attempt to store a number of data elements that is irregularly distributed between the image's pixels struggle with efficiency problems when implemented on modern graphics hardware. This comes partially from the rigid nature of the architecture, which cannot really escape the legacy of its single instruction multiple data (SIMD) processing model [1]. True, we have moved far beyond a simple processing unit with access to only a single memory bank; that is, we have gone from SIMD to multiple threads multiple data (MTMD). But the requirements of memory access regularity, to avoid conflicts, and predictability, to avoid latency, are hard to deny.
That said, our goal was to design a data structure and accompanying construction algorithms that would offer the benefits of efficient memory access on highly parallel processing architectures. At the same time, we aimed at making it as simple as possible, to leave plenty of room for optimization. Last but not least, we find that the concept behind our solution, memory storage optimization through early access pattern prediction, can be applied to any compute heavy problem and, thus, our work can be deemed universal and extensible.
To be more precise, we can name a few possible applications of our method. In the field of computer generated images, these include:
• reducing aliasing in a rasterizer based framework;
• solving the global illumination problem;
• composing contributions from multiple, possibly intersecting transparent and translucent surfaces;
• visualizing light scattering and non-uniform volumetric objects.
In the field of visual data acquisition, there is the problem of efficient data storage and transmission in environments with multiple cameras and multiple processing nodes for high definition imagery, which can benefit from the compression aspects of dense data packing, irrelevant data omission and ordering of elements that do not always arrive from recording devices in a regular fashion.
The goal of this paper is to present a solution that is complete in terms of both its application to the problem and its hardware accelerated implementation on acclaimed graphics architectures. We present and refer to the bare minimum of knowledge required to use the concept efficiently, thus easing the burden placed on practitioners wanting to make use of our findings and on fellow researchers wanting to further explore and extend the concept.
To move closer to the aforementioned goal, we organized the rest of this paper as follows.
We start by presenting previous work by other authors that we found both directly related to the problem of storing multiple data elements per picture element and ground breaking and highly influential at the time of its publication.
Next, we define the most basic, yet important, set of terms that we use throughout the paper.
We follow by expanding these terms with some fundamental concepts that are the proverbial cornerstones (and key differences from the existing solutions) of the idea behind the data structure presented in this paper.
The main part of the paper consists of a description of the data structure construction algorithms that are tailored to fit the capabilities of three different graphics hardware architectures.
We proceed by stating the test conditions, hardware and software configuration and data set description, and finally present the experimental results.
The results are discussed in the next section, and their relevance to future research directions concludes the paper.
Previous work
In this section we present previous works by other authors that revolve around the problem of fragment data storage. They display varying degrees of universality and sophistication when viewed from the standpoint of the number of fragments and the possible applications. The simplest allow a user to store a single or otherwise very limited number of data elements per pixel. Others can potentially store an unlimited number of fragments but are limited by the data composition algorithms. They all share one common characteristic: they can be perceived as proverbial milestones in terms of image data storage solutions. We have deliberately chosen the most original works and avoided derivatives, to focus on the high level aspect of the evolution of solutions in the field of multi-dimensional data storage.
Therefore, we start by presenting the concept of the Z-buffer [2]. It works by extending the idea of a frame buffer with storage for an additional screen space depth value for each pixel (in addition to the traditionally occurring intensity or colour). For each arriving fragment, its depth (Z) value is compared with the one that has already been recorded in the depth buffer. If the new fragment is in front of the previously recorded one, it replaces it. Otherwise it is discarded. Such a simple solution trivially solves the problem of hidden surface removal. It does not impose any restrictions on inter-surface intersections or on the order of rasterization. At the same time, it is very efficient from the computational cost perspective, leaving much room for optimization and hardware acceleration. The same cannot be said of storage: it needs additional per-pixel storage for the depth value, which can be on par with the only other displayable pixel attribute, the colour value. The binary nature of this visibility solution can also lead to significant aliasing issues.
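The depth comparison described above can be illustrated with a minimal sketch. It assumes that a smaller depth value means "closer to the viewer" and that fragments arrive as (x, y, depth, colour) tuples; the function name `zbuffer_resolve` and the tuple layout are illustrative choices, not taken from Ref. 2.

```python
import math

def zbuffer_resolve(width, height, fragments, clear_colour=(0, 0, 0)):
    """Resolve (x, y, depth, colour) fragments with a 'less than' depth test."""
    # Assumption: smaller depth means closer to the viewer.
    depth = [[math.inf] * width for _ in range(height)]
    colour = [[clear_colour] * width for _ in range(height)]
    for x, y, z, c in fragments:
        if z < depth[y][x]:      # the new fragment is in front of the stored one
            depth[y][x] = z      # so it replaces it ...
            colour[y][x] = c
        # ... otherwise it is discarded
    return colour, depth

frags = [(0, 0, 0.7, "red"), (0, 0, 0.3, "green"),
         (1, 0, 0.5, "blue"), (0, 0, 0.9, "white")]
colour, depth = zbuffer_resolve(2, 1, frags, clear_colour="black")
```

Because only the nearest fragment survives, the result does not depend on the order in which the fragments are rasterized, which the sketch reproduces when the input list is reversed.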
The A-buffer (or anti-aliased, area-averaged, accumulation buffer) as presented in Ref. 3 works by storing per-pixel fragment lists or simple pixels (with colour and depth values), based on the local depth complexity of the visualized scene. Fragments are stored in sorted order with respect to their depth values. The per-pixel sorted nature of the storage container allowed the authors to solve the problem of transparent surface composition. To handle the aliasing issues, the authors stored additional coverage masks per fragment.
The most interesting recent development in bringing the A-buffer concept closer to interactive computer graphics is the concept of using the stencil buffer to route fragments into a multi-sample frame buffer, presented by the authors of Ref. 4. They use the fact that the stencil test is performed per sample in most modern hardware anti-aliasing implementations. They initialize the stencil buffer and set the stencil test to pass only for samples that have a particular stencil value. The stencil operation is set to modify the stencil buffer value for each incoming fragment, so that each sample can be filled at most once. The number of fragments stored per pixel is limited by the number of samples and the stencil buffer precision. Using the image samples to store multiple, independent fragments also prevents the authors from using them in traditional multi-sampled drawing. Nevertheless, we have found this method very interesting and inspiring and, therefore, use a similar approach for routing fragments into a temporary multi-layered frame buffer in the Shader Model 4.0 compliant version of our algorithm (see Sect. 5.3).
Here we would like to make one last exception from the high level nature of this section and refer to the work presented in Refs. 5 and 6. The authors of those works focused on using the flexibility of random access memory operations offered by recent hardware to construct an A-buffer that is virtually unbounded in terms of the number of fragments stored per pixel. To avoid memory access synchronization issues and account for the non-uniform nature of the per-pixel fragment distribution, they use reversed per-pixel linked lists of fragments (see Fig. 2). They even presented its application to solving order independent transparency and global illumination problems at interactive speeds [7]. The purely dynamic nature of fragment allocation has one shortcoming on current graphics hardware, which does not allow the user to allocate global memory resources in shader code: there is no prior knowledge of the number of fragments that are to be stored before actually rasterizing the geometry.
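A reversed per-pixel linked list can be sketched as follows. This is a serial illustration only: the names `head`, `nodes` and `NIL` are hypothetical, and the plain Python list operations stand in for whatever synchronized counter and pointer updates a GPU implementation would need.

```python
NIL = -1  # hypothetical end-of-list marker

def build_linked_lists(width, height, fragments):
    """fragments: iterable of (x, y, payload) in arrival order."""
    head = [NIL] * (width * height)  # per-pixel head pointer buffer
    nodes = []                       # global node pool of (payload, next) pairs
    for x, y, payload in fragments:
        pixel = y * width + x
        nodes.append((payload, head[pixel]))  # new node links to the old head
        head[pixel] = len(nodes) - 1          # the pixel now starts at the new node
    return head, nodes

def walk(head, nodes, pixel):
    """Collect a pixel's payloads by following the next pointers."""
    out, i = [], head[pixel]
    while i != NIL:
        payload, i = nodes[i]
        out.append(payload)
    return out

head, nodes = build_linked_lists(2, 1, [(0, 0, "a"), (1, 0, "b"), (0, 0, "c")])
```

Walking a list returns fragments in reverse arrival order, and the node pool has to be sized before the fragment count is known, which is the allocation shortcoming noted above.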
The goal of the work presented in Ref. 8 was to apply the Z-buffer algorithm to the visualization of semi-transparent surfaces. The authors use auxiliary, or so called virtual, pixel maps to decompose the global depth sorting problem into a series of simpler problems, each of which finds the transparent surface that is both the closest to the so called moving depth and the furthest from the viewpoint. The moving depth surface is updated in each pass, starting at the non-transparent surface that is the closest to the viewpoint, with the blended-in colour and depth values of the next transparent surface, which become the new set of attributes stored in the virtual (moving depth) pixel map. The authors present an application of virtual pixel maps to solve aliasing issues and merge it with the visualization of transparent surfaces to form a coherent framework. As the methods using virtual pixel maps are inherently multi-pass, they can be heavy on geometry transformation and rasterization.
An example of a procedural approach to fragment data storage can be found in Ref. 9. The authors solve the aliasing problem in a Z-buffer-based infrastructure by storing a series of per-pixel parameters that are used to find the final pixel colour value. This allows them to greatly reduce the memory complexity of anti-aliased visualization. To find the colour of an 8-times anti-aliased pixel, they store only 3 fragments (hence the name Z³-buffer), each consisting of colour, depth, stencil and two depth gradient values (along the screen space axes). Like most procedural storage methods, this one is closely tailored to the needs of the problem at hand and is hardly universal. Nevertheless, it offers a great quality to memory consumption ratio and, thus, we found it worth mentioning. Unfortunately, as the authors themselves point out, the concept behind it breaks down in highly unusual cases of fragment data non-uniformity.
The F-buffer (or fragment-stream buffer) presents an alternative to the traditional 2D frame buffer as a way of storing fragments [10]. The fragment data is stored in a first-in-first-out (FIFO) queue consistent with the arrival (rasterization) order. Such an unstructured representation imposes the need to store pixel coordinates per fragment, so that the data can be mapped to the final displayed 2D image elements. It can be perceived as a special case of a stream buffer [11, 12], or as a generalization of the A-buffer that does not require any temporary 2D image representation. Although there is some inefficiency connected with storing pixel coordinates per fragment, the method can be quite useful, as we found during the preparation of one variant of our Shader Model 5.0 compliant algorithm (see Sect. 5.5). The authors also underline the advantages over the regular structure of so called deep frame buffers, in particular the universality of data reinterpretation and the absence of space wasted on unused image elements. They view F-buffers as hardware friendly due to the elimination of read-after-write hazards and, at the same time, find that the problem of buffer overflow (see the discussion above about the dynamically allocated A-buffer) can be quite problematic.
A special case of multi-fragment storage for solving the visualization of semi-transparent surfaces is also represented by the R-buffer (or recirculating buffer) algorithm [13]. The rasterization is handled in a way similar to that used in the virtual pixel maps: there is an auxiliary set of attribute buffers involved, but all geometric primitives are rasterized in a single pass. The fragments contained in it are merged together based on their opacity and depth values, stored along with the main attribute, colour. As the buffer is implemented as a FIFO queue, in cases when a buffer overflow occurs it can be swapped into memory resources with greater capacity. The fragment composition is performed in multiple passes controlled by the depth attribute of transparent fragments and a set of control flags. As the R-buffer has no inherent 2D structure, it has to store frame buffer coordinates per fragment.
The K-buffer offers a hybrid approach to the so called hardware-assisted visibility sorting (HAVS) problem. It first performs a rough sorting of geometric primitives using the CPU, which generates fragments in a so called nearly sorted order [14]. The fragments are then sorted by limited depth sorting nets using graphics hardware. The letter k in the name of the algorithm stands both for the maximum distance (in terms of placement) of an element of a nearly sorted sequence from its rightful place in an exactly sorted sequence and, as a consequence, for the depth of the sorting nets used. As the composition algorithm is fixed and known before the storage is filled, the authors can use it to reduce the number of fragments stored per pixel to prevent buffer overflow cases that could otherwise arise (recall the problems with the unconstrained A-buffer and F-buffer algorithms). When a new fragment arrives, the one that is at the beginning of the sequence (as per front-to-back blending) is blended in, to provide space for the new arrival. All this means that the K-buffer is limited to storing at most k fragments per pixel.
The idea behind the K-buffer has been extended in Ref. 15 to allow for a more general form of data element merging. It assumes that the algorithm of choice works in a read-modify-write (RMW) scheme. That is, whenever a new fragment arrives, all the ones that are already stored are loaded, modified according to its attributes and written back into the storage, and the new fragment (having an unspecified influence on the ones stored) is then discarded. This generalization allows us to perceive the K-buffer algorithm as a stream processor, but with a very limited maximum stream length.
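The bounded storage with front-to-back blending can be sketched for a single pixel as below. This is a loose illustration under stated assumptions: a binary heap stands in for the sorting nets of Ref. 14, fragments are hypothetical (depth, colour, alpha) tuples with a scalar colour, and smaller depth is assumed to be closer.

```python
import heapq

def kbuffer_composite(fragments, k):
    """Composite nearly sorted (depth, colour, alpha) fragments with k slots."""
    heap = []                       # at most k stored fragments, front-most on top
    acc_colour, acc_alpha = 0.0, 0.0
    order = []                      # depths in the order they were blended

    def blend(frag):
        nonlocal acc_colour, acc_alpha
        depth, colour, alpha = frag
        # Front-to-back "under" blending with a scalar colour.
        acc_colour += (1.0 - acc_alpha) * alpha * colour
        acc_alpha += (1.0 - acc_alpha) * alpha
        order.append(depth)

    for frag in fragments:
        if len(heap) == k:              # storage is full:
            blend(heapq.heappop(heap))  # blend in the front-most stored fragment
        heapq.heappush(heap, frag)      # to provide space for the new arrival
    while heap:                         # flush what is still stored
        blend(heapq.heappop(heap))
    return acc_colour, acc_alpha, order

frags = [(0.2, 1.0, 0.5), (0.1, 1.0, 0.5), (0.4, 1.0, 0.5), (0.3, 1.0, 0.5)]
colour, alpha, order = kbuffer_composite(frags, k=2)
```

In this particular sketch the blending order is exact only while every fragment is displaced by fewer than k positions from its sorted place; with k = 1 the fragments are simply blended in arrival order. That off-by-one relation is a property of the sketch and not a statement about the original algorithm.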
Most of the algorithms above are strongly devoted to solving a particular fragment data composition problem. Data storage algorithms that are generally usable (that is, time efficient) on an unspecified highly parallel architecture are rare or highly limited in terms of maximum element count.
For instance, the A-buffer was first designed as an extension of the Z-buffer algorithm to solve the aliasing problem in a manner closely fitting the Reyes drawing infrastructure, in which fragment-pixel coherency is less important due to the fact that most polygons cover very few pixels.
Virtual pixel maps offset the solution of data storage to multi-pass rendering using data structures that are flat in terms of the number of fragments per pixel. This, in turn, severely hampers their performance.
The Z³-buffer was designed as close to the metal as possible. Its highly specialized nature causes it to fail in unusual cases and, thus, it is rarely used in commodity graphics hardware. It served us as an example that being too specific (although efficient) is not always the best way to solve problems.
The R-buffer is a highly complex solution, involving much flow control in place of computation, to a single problem, namely the visualization of semi-transparent surfaces.
On the other hand, the F-buffer and the generalized form of the K-buffer impose few to no restrictions on the final fragment processing algorithm, although the latter, besides requiring the fragments to be sorted, limits the number of fragments in a way that can often be unacceptable.
Only a handful of the presented algorithms allow for theoretically unbounded per-pixel fragment storage, namely the A-buffer, F-buffer and R-buffer, but often at the cost of performance or processing flexibility.
In light of the previous works presented above, our goal was to design a data structure and accompanying construction algorithms that would alleviate such bounds, while being efficient and general enough to allow for virtually any final fragment processing algorithm.
Basic terms
In this section we list some basic terms that are frequently used throughout the rest of the paper. For clarity and simplicity, we restrict the description below to terms that are either used in a non-traditional context or that can have multiple meanings depending on the contexts that this paper refers to (e.g., computer graphics, the general stream processing model or parallel processing).
• Polygon - geometric primitive (e.g., triangle) that can cover each frame buffer element (or pixel) at most once.
• Pointer buffer - data structure holding, for each 2D frame buffer pixel, an offset into the global fragment data array that corresponds to the pixel's first fragment.
• Count buffer - data structure holding the fragment count for each corresponding 2D frame buffer pixel.
Basic concepts
This section provides a brief description of the basic concepts behind the data structure and the algorithms used for its construction, which are presented in the remainder of the paper. The idea is far from complicated, but we feel that some fundamental explanation is required for the reader to grasp the reason behind its higher performance when compared with methods proposed by other authors.
Sparse data structure
A sparse data structure contains some elements that can be considered irrelevant from the processing algorithm's point of view. These irrelevant elements lie between relevant structure elements.
Taking the above into account, the 2D frame buffer (shown on the left of Fig. 1) can generally be considered sparse, since not all of its pixels come from the polygon rasterization process (some can be filled with a clear buffer colour).
For a multi-layered frame buffer the situation is analogous. Its elements (multi-fragment pixels) can be sparse or dense. Please note, though, that it is not sufficient for a multi-layered frame buffer to contain only dense pixels to be considered dense (see the right part of Fig. 1). There should also be no irrelevant fragments at the end of each pixel.
Unordered data structure
In an unordered data structure there is no obvious order between simple elements that belong to the same complex element.
One example of an unordered data structure is per-pixel linked lists (see Fig. 2) as presented in Ref. 7. Note that it is unordered from the point of view of the linearly addressed memory space and, thus, requires additional data (per-fragment pointers) to describe the proper pixel data ordering.
To be more precise, in an unordered frame buffer there is at least one multi-fragment pixel (with a fragment count of at least 2) that has at least one fragment for which the fragments preceding it and following it come from a different pixel.
D-buffer data structure
The deque buffer (D-buffer) is a fragment data structure (an array of arrays) that is ordered from the point of view of the pixel-fragment relation. The D-buffer is dense from the fragment data point of view. That is, per-pixel arrays of fragment data have no irrelevant elements between them and they are dense themselves.
We do not require any particular per-pixel fragment ordering, though. We leave it up to the final fragment composition algorithm of choice. We also do not impose any particular ordering of the fragment data arrays that build up the global array (involving all pixels).
To map a dense D-buffer into a 2D displayed frame buffer we can use, just as in per-pixel linked lists, a sparse 2D pointer buffer with the addition of a sparse 2D fragment count buffer. Alternatively, we can use a dense 1D map buffer, which for each non-empty pixel holds its address in the 2D frame buffer, the pointer into the global fragment data array for its first fragment and its fragment count.
All of the above assumes linear memory address space.
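The layout described in this section can be made concrete with a small CPU-side sketch. It is only an illustration of the data layout, not of the GPU construction algorithms: the function names are hypothetical, fragments are assumed to be (x, y, payload) tuples, and the row-major ordering of the per-pixel arrays is an arbitrary choice, since the structure itself imposes none.

```python
def build_dbuffer(width, height, fragments):
    """Pack (x, y, payload) fragments into a dense, pixel-ordered array."""
    n = width * height
    count = [0] * n                       # sparse 2D count buffer (flattened)
    for x, y, _ in fragments:
        count[y * width + x] += 1
    pointer, running = [0] * n, 0         # sparse 2D pointer buffer (flattened)
    for p in range(n):
        pointer[p] = running              # offset of the pixel's first fragment
        running += count[p]
    data = [None] * running               # global fragment data array
    cursor = list(pointer)                # next free slot of every pixel
    for x, y, payload in fragments:
        p = y * width + x
        data[cursor[p]] = payload
        cursor[p] += 1
    # Dense 1D map buffer: one (x, y, pointer, count) entry per non-empty pixel.
    dense_map = [(p % width, p // width, pointer[p], count[p])
                 for p in range(n) if count[p]]
    return pointer, count, data, dense_map

def pixel_fragments(pointer, count, data, width, x, y):
    """Return the contiguous fragment array of pixel (x, y)."""
    p = y * width + x
    return data[pointer[p]:pointer[p] + count[p]]

frags = [(2, 0, "a"), (0, 0, "b"), (2, 0, "c"), (0, 0, "d"), (2, 0, "e")]
pointer, count, data, dense_map = build_dbuffer(3, 1, frags)
```

Every pixel's fragments occupy one contiguous slice of `data`, with no unused slots between or inside the slices, which is what distinguishes this layout from the sparse multi-layered frame buffer and from linked lists.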
Constructing the D-buffer
We have prepared a D-buffer construction algorithm for each of the three most relevant hardware architectures. For clarity, all the schemes are named after the corresponding Direct3D Shader Model, but the assumed hardware capabilities, as well as the proposed implementation, are defined in terms of OpenGL extensions [17]. This allows our D-buffer calculation algorithms to be generic in terms of hardware vendor and, at the same time, specific in terms of the hardware accelerated features required. The description of each algorithm consists of a graphical overview and a detailed description of each step. The graphical overview is presented in the manner of a data-flow diagram. Each step is placed according to its operational ordering from top to bottom. Steps that can be performed in parallel are placed at the same level or their blocks have the same predecessor in the diagram. The step description can contain the following subsections:
Shader Model 3.0 architecture
The third generation (SM3) programmable architecture is most widely represented in: 
2. In parallel
a. Fragment count pass
Problem: find the fragment count for each pixel.
Solution: use additive 16-bit floating point blending (see Fig. 5) to find the per-pixel fragment count, an occlusion query to find the total fragment count and a saturated incrementing stencil function to find the non-empty pixel mask [26].
Input: clip space vertex coordinates.
Output: per-pixel and total fragment count, non-empty pixel stencil buffer mask.
b. Minimum per-pixel depth pass
Problem: find the depth range for each pixel to more efficiently encode it as a fixed point value.
Solution: use depth only rendering to exploit the double fill rate optimization [28], setting the depth test to choose fragments with the smallest depth value.
c. Maximum per-pixel depth pass (see step 2b)
3. In parallel
a. Fragment count reduction pass
Problem: find the number of layers for the temporary multi-layered frame buffer as the maximum fragment count, the address of the first and last non-empty pixel in each row and the first and last non-empty row. The bounds calculated in this step will be used during the data gathering pass.
Opto-Electron. Rev., 21, no. 1, 2013 © 2013 SEP, Warsaw
Solution: perform mixed GPU (row-wise pass, column-wise pass) and CPU (4-tuple reduction pass, 4-tuple to scalar pass) buffer reduction.
Input: per-pixel fragment count.
Output: maximum fragment count, per-row non-empty pixel bounds and non-empty row bounds.
Details:
Reduce the count buffer by rectangular folding along the horizontal axis, producing column buffers that hold the maximum fragment count and the first and last non-empty pixel in each row of the count buffer.
Reduce the above column buffers using 1D folding until the minimum GPU reduction size is reached.
Read back the remaining data into an array of __m128.
Reduce the 4-tuples using SIMD CPU processing [29] .
Reduce the remaining data using scalar CPU processing.
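What the reduction computes can be sketched on the CPU with a generic folding helper. This is a serial illustration under assumptions of our own: each texel is represented as a hypothetical (max, first, last) triple, an empty row is encoded with the sentinels `first = width` and `last = -1`, and the GPU/SIMD/scalar split of the actual passes is not modelled.

```python
def fold(values, combine):
    """Halve the list repeatedly, combining element i with element i + half."""
    values = list(values)
    while len(values) > 1:
        half = (len(values) + 1) // 2
        values = [combine(values[i], values[i + half])
                  if i + half < len(values) else values[i]
                  for i in range(half)]
    return values[0]

def combine(a, b):
    # (maximum count, first non-empty index, last non-empty index)
    return (max(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]))

def reduce_count_buffer(count, width, height):
    rows = []
    for y in range(height):               # horizontal folding, one result per row
        texels = [(c, x if c > 0 else width, x if c > 0 else -1)
                  for x, c in enumerate(count[y * width:(y + 1) * width])]
        rows.append(fold(texels, combine))
    column = [(m, y if last >= 0 else height, y if last >= 0 else -1)
              for y, (m, first, last) in enumerate(rows)]
    max_count, first_row, last_row = fold(column, combine)  # vertical folding
    row_bounds = [(first, last) for _, first, last in rows]
    return max_count, row_bounds, (first_row, last_row)

count = [0, 0, 0, 0,
         0, 2, 0, 1,
         0, 0, 5, 0]
result = reduce_count_buffer(count, 4, 3)
```

Because maximum and minimum are associative and commutative, the order in which elements are paired during folding does not affect the result, which is what makes the reduction suitable for parallel hardware.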
b. Fragment data buffer allocation
Solution: use the total fragment count and maximum viewport dimensions to allocate a 2D buffer capable of storing all of the D-buffer's fragment data.
Input: total fragment count.
Output: fragment attribute buffer.
c. Map buffer calculation pass
Problem: find the data structure mapping the D-buffer into a 2D frame buffer on an architecture with no programmable scatter functionality.
Solution: we tried two radically different solutions:
• the first involved using the VS to process non-empty pixels, rasterization based scattering and result write-out masking using the stencil buffer, finishing with additive blending for pointer buffer calculation (just like in Fig. 7), which as a result gave us a dense map buffer (see
The second method proved to be up to 100 times faster than the first one, although in theory it required more memory access operations. We link its efficiency to the fact that it showed perfectly predictable result write-out order and locality, a quality that the first method was obviously lacking.
Input: per-pixel fragment count.
Output: sparse map (pointer) buffer.
Details:
In FS shift the contents of the count buffer by one sample to the right (with a 1D wrapping applied to a 2D texture), filling with 0, and put it into a pointer buffer.
In FS calculate the row-wise running sum of elements in the pointer buffer (use the scissor test to limit the drawing region).
Shift the last column of the pointer buffer by one position downward, filling with 0.
In FS calculate the running sum of elements of the column buffer (use the scissor test to limit the drawing region).
In FS add the above running sum into each column of the pointer buffer.
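The five steps above amount to an exclusive prefix sum over the count buffer in row-major order. The following serial sketch mirrors them one by one on a flattened list; the function name and the list representation are illustrative assumptions, and no shader or scissor test behaviour is modelled.

```python
def pointer_buffer(count, width, height):
    """Exclusive row-major prefix sum of a flattened count buffer."""
    n = width * height
    # Step 1: shift by one sample to the right with 1D wrapping, fill with 0.
    shifted = [0] + count[:n - 1]
    # Step 2: running (inclusive) sum along every row.
    ptr = []
    for y in range(height):
        running = 0
        for x in range(width):
            running += shifted[y * width + x]
            ptr.append(running)
    # Step 3: take the last column, shifted one position downward, fill with 0.
    last_col = [ptr[y * width + width - 1] for y in range(height)]
    column = [0] + last_col[:height - 1]
    # Step 4: running (inclusive) sum down the column buffer.
    for y in range(1, height):
        column[y] += column[y - 1]
    # Step 5: add the column sums to every element of the matching row.
    return [ptr[y * width + x] + column[y]
            for y in range(height) for x in range(width)]

count = [0, 0, 0, 0,
         0, 2, 0, 1,
         0, 0, 5, 0]
pointers = pointer_buffer(count, 4, 3)
```

Each resulting element equals the number of fragments belonging to all pixels that precede it, so the values are monotonically non-decreasing and can serve directly as offsets into the global fragment data array.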
d. Scale buffer calculation pass
Problem: see step 5.
Solution: process all non-empty pixels (identified by the stencil mask calculated during the fragment counting pass), calculating scale values as
4. Multi-layered frame buffer allocation
Problem: see step 5.
Solution: use multiple rectangular, fixed-point textures (the size of the current displayed frame buffer) bound as colour buffers of an MRT FBO.
Input: maximum fragment count.
Output: MRT attribute FBO with alpha component.
5. Multi-layered frame buffer initialization
Problem: we want to draw to multiple subsequent output locations from the FS. We can output data into multiple colour buffers using the MRT functionality, but we want to draw only into the colour buffer whose index corresponds to the fragment index in the incoming sequence.
(where i is the render target index and N is the number of MRTs). To counter the upper clamping limit, we can use per-pixel scaling based on the attribute value and fragment count. We can do that by extending the normalized attribute value to the full [0; 1] range, knowing its minimum and maximum values. We can also use the per-pixel fragment count to scale the normalized attribute's value. The scaling factor is found as
6. Multi-layered frame buffer gathering
Problem: gather the sparse MRT attribute buffer into a dense D-buffer.
Solution: we process all entries of the D-buffer. As the map (pointer) buffer consists of monotonically increasing D-buffer pointer values, we can use binary search to find the pixel data appropriate for a given D-buffer location. In fact, we perform this gathering operation in two passes (see Fig. 9) for the sake of higher efficiency. The first pass involves a "blind" binary search that finds the correct sparse buffer data and coordinates for D-buffer samples placed at regular strides. The second pass uses the coordinates found in the first pass to perform a faster linear search.
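The two-pass search can be sketched serially as follows. The sketch assumes flattened pointer and count buffers as in the earlier sections and a hypothetical `layers[k][p]` layout holding the k-th fragment of pixel p in the sparse multi-layered buffer; the stride value is arbitrary.

```python
from bisect import bisect_right

def gather(pointer, count, layers, stride=4):
    """Gather a sparse multi-layered buffer into a dense fragment array."""
    total = pointer[-1] + count[-1]
    # Pass 1: "blind" binary search for D-buffer samples at regular strides.
    # bisect_right(...) - 1 is the last pixel whose pointer is <= d; it is
    # always a non-empty pixel because the following pointer is larger than d.
    anchors = [bisect_right(pointer, d) - 1 for d in range(0, total, stride)]
    # Pass 2: linear search forward from the nearest preceding anchor.
    dense = []
    for d in range(total):
        p = anchors[d // stride]
        while pointer[p] + count[p] <= d:   # pixel p ends before location d
            p += 1
        dense.append(layers[d - pointer[p]][p])
    return dense

pointer = [0, 2, 2]
count = [2, 0, 3]
layers = [["b", None, "a"],
          ["d", None, "c"],
          [None, None, "e"]]
dense = gather(pointer, count, layers, stride=2)
```

The binary search works on the pointer buffer alone because its values never decrease; empty pixels merely repeat the pointer of the next non-empty pixel and are skipped by both passes.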
7. D-buffer processing and scattering
Problem: process the fragments belonging to a given displayed frame buffer pixel and output the result at the appropriate pixel address.
Solution: process all non-empty pixels (identified by the stencil mask calculated during the fragment counting pass), using the pointer and count buffers to identify each pixel's fragment data.
Input: pointer, count, colour, depth, scale and minimum depth textures.
Output: displayed frame buffer.
Details:
In FS fetch the pointer and count values for the incoming fragment's coordinate.
In FS fetch and decode (see step 5) all depth and fragment data associated with the D-buffer pointer value fetched above.
In FS calculate and output the final pixel colour.
Multi-pass Shader Model 3.0 algorithm
When there are more fragments than can be stored for each individual pixel (as limited by the number of layers of the temporary multi-layered frame buffer), we need to resort to a multi-pass drawing scheme. As the simplest solution, we could perform steps 2 through 7 multiple times to fit all the polygons into the limited number of render targets of the MRT frame buffer. Please note that the way in which the fragments are stored in the multi-layered frame buffer presents a significant improvement over the traditional method of globally assigning each polygon to a different layer. In our method the polygons are assigned to layers locally, that is, they are densely packed in each pixel (compare the dense and sparse pixels in Fig. 1). Nevertheless, the number of passes would be constrained by the number of polygons divided by the number of render targets whenever the maximal fragment count (as found in step 3a) exceeds the number of render targets. As this ratio tends to be high for scenes with a high polygon count and a low number of render targets at our disposal, we decided to design a more elaborate geometry decomposition scheme. It involved minor restructuring of the algorithm's flow (see Fig. 10), but the idea behind it stays the same:
The steps that are new are outlined in grey (see Fig. 10), and only these are described below. As in the single pass version, the passes that require processing of the scene's geometry source their data from the vertex transformation pass, but the actual processing is limited to the polygons described by the results of the polygon index reduction pass. We omit this index data flow from the diagram for the sake of clarity and readability.
Overflow test pass
Problem: determine whether, for any pixel, the count has exceeded the maximum allowed number of fragments.
Solution: we use a Boolean occlusion query [27] and a stencil test on the stencil buffer coming from the fragment counting pass to test whether, for any pixel, the reference stencil value is less than that stored in the stencil buffer at the corresponding pixel's coordinate.
Input: stencil buffer.
Output: Boolean occlusion query result.
Bit field buffer calculation pass
Problem: find the polygon distribution among all displayable pixels, that is, which polygons cover each pixel.
Solution: we assign each polygon a unique index, which is then converted into a bit field mask with only one bit set (the one corresponding to the aforementioned index). We use an MRT frame buffer of fixed point component data combined with a logical function (bit-wise alternative) to obtain a polygon index bit field for each pixel. This limits us to processing only num_draw_buffers * num_components * num_bits_per_component polygons at once, which gives rise to the outer loop seen in Fig. 10. We use an additional per-vertex polygon index buffer and a texture map (see Fig. 11) to calculate bit fields for each component of each active render target (the number of which comes from the number of polygons drawn in a given loop iteration). We also use Boolean occlusion queries to test whether drawing the polygon subset produced any non-empty pixels.
Input: clip space vertex coordinates (stored in a texture as a side effect of FS based vertex transformation), per-vertex polygon index buffer, vertex index bounds (as constrained by the previous outer loop iterations and the aforementioned limit), polygon index to bit field conversion texture.
Output: MRT bit field set of textures with bits set for covering polygons.
Details:
In VS fetch the clip space vertex coordinate from the texture based on the incoming vertex index, decrement the incoming polygon index by the iteration's base index value and pass it to FS.
In FS for each active render target fetch from the polygon index texture map and output (to bit-wise colour operations) the bit field value for all components.
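The bit field accumulation can be sketched as below. The mapping of a polygon index to a render target, component and bit is an assumption made for illustration (the lookup texture of Fig. 11 is not reproduced), as are the default buffer sizes and the `coverage` dictionary that stands in for rasterization.

```python
def bitfield_pass(width, height, coverage, base_index,
                  num_draw_buffers=4, num_components=4,
                  num_bits_per_component=8):
    """coverage: dict mapping polygon index -> iterable of covered (x, y)."""
    capacity = num_draw_buffers * num_components * num_bits_per_component
    # One integer bit field per (pixel, render target, component).
    fields = [[[0] * num_components for _ in range(num_draw_buffers)]
              for _ in range(width * height)]
    for poly, pixels in coverage.items():
        local = poly - base_index           # index relative to this iteration
        if not 0 <= local < capacity:
            continue                        # handled by another outer-loop iteration
        # Assumed layout: consecutive indices fill bits, then components, then targets.
        target, rest = divmod(local, num_components * num_bits_per_component)
        component, bit = divmod(rest, num_bits_per_component)
        for x, y in pixels:
            fields[y * width + x][target][component] |= 1 << bit  # bit-wise OR
    return fields

coverage = {0: [(0, 0), (1, 0)], 9: [(0, 0)], 130: [(1, 0)]}
fields = bitfield_pass(2, 1, coverage, base_index=0)
```

With the default sizes one iteration handles 4 * 4 * 8 = 128 polygons, so polygon 130 is left for the next outer loop iteration, where it is processed with `base_index=128`.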
Polygon index calculation pass
Problem: for each pixel find the index of the first, the last and the one after the last covering polygon, as well as the count of the covering polygons, all for a given maximum polygon count and offset into the current bit field buffer. Solution: we use the bit field value stored for each component of each active render target to address a texture map encoding the aforementioned parameters for each possible component value of a bit field. The texture map is organized as follows (see Fig. 12): We need to represent both the maximum bit (that is covering polygons) count and the bit offset within the field, due to the inherently variant nature of these values, which are found globally (for all pixels) during previous inner loop iterations. We proceed with looking for the index of the first bit set given a global per-iteration offset (that is the one after the last bit consumed by the previous iteration). Then, while the number of bits processed is less than the maximum per-iteration count (coming either from the limited number of polygons yet to be processed or from the fragment count found during the global fragment counting pass), we record the (so constrained) index of the last and next bit set in a bit field component. Input: MRT bit field set of textures found for the current outer loop iteration, render target (texture in the aforementioned set), component (in the first texture to be processed), bit (in the first component of the first texture to be processed) and combined bit offsets, maximum bit count, texture encoding bit indices in terms of bit field component values. Output: polygon index buffer (per-pixel index of the first, last, one after the last and polygon count).
Details:
In FS fetch the fragment count and then the bit field value based on the render target offset.
In FS fetch the index values from the encoding texture based on the maximum bit count, component and bit offsets.
In FS while the accumulated count is less than the maximum bit count and there are still bits to be processed in the bit field set of textures:
{
  Based on the current bit field component value (fetched if needed):
  {
    Increment bit count.
    Choose the minimum first bit index.
    Record the last and one after the last bit indices.
  }
}
In FS if there was no one after the last bit set in the last component processed, proceed with the above, refraining from bit count, first and last bit index manipulation.
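The per-pixel search described above can be sketched on the CPU as follows. This is a simplified illustration rather than the paper's shader: it walks the bits one by one instead of using the precomputed encoding texture, it treats all components of all render targets as one flat list, and the function name and the use of None for "not found" are assumptions.

```python
def scan_bit_field(components, bits_per_component, bit_offset, max_count):
    """Return (first, last, next, count) for one pixel.

    components         -- bit field component values, lowest polygon index first
    bits_per_component -- number of usable bits in each component
    bit_offset         -- global bit index at which this iteration starts
    max_count          -- maximum number of polygons accepted in this iteration
    """
    first = last = next_index = None
    count = 0
    total_bits = len(components) * bits_per_component
    for index in range(bit_offset, total_bits):
        component, bit = divmod(index, bits_per_component)
        if not (components[component] >> bit) & 1:
            continue  # this polygon does not cover the pixel
        if count < max_count:
            if first is None:
                first = index
            last = index
            count += 1
        else:
            # First covering polygon that no longer fits: it starts
            # the next inner loop iteration for this pixel.
            next_index = index
            break
    return first, last, next_index, count


# Bits 1, 2 and 5 are set in the first component, bit 0 in the second.
field = [0b00100110, 0b00000001]
print(scan_bit_field(field, 8, 0, 2))  # (1, 2, 5, 2)
print(scan_bit_field(field, 8, 3, 4))  # (5, 8, None, 2)
```

The second call shows the case handled by the last step of the details: the field is exhausted before the maximum count is reached, so there is no "one after the last" polygon for this pixel.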
Polygon index reduction pass
Problem: find the global (in terms of pixels) maximum polygon count, the first and last polygon index to be drawn in the current iteration, as well as the first one that should be considered in the subsequent iteration. Solution: we use a reduction process that is similar in structure to the count buffer reduction (see step 3a and Fig. 6 of the single pass algorithm), but performs a slightly more complicated calculation per pixel. It finds the largest set of polygons that can fit into all pixels' storage limits. The polygon count found in this pass is used to allocate an appropriate temporary multi-layered frame buffer. The index of the one after the last polygon drawn in this iteration is used to update the loop's control variable, which limits the sweeping set for the next iteration. As a side effect we also find (based on the per-pixel polygon count) the search bounds used in the fragment data gathering pass. Input: polygon index buffer. Output: maximum polygon count, first, last and next polygon index, per-row non-empty pixel bounds and non-empty row bounds. Details: see step 3a and Fig. 6 of the single pass algorithm.
Multi-layered frame buffer decoding
Problem: find the fragment colour and depth values that are independent of the per−iteration calculated scale and mini− mum depth buffers. Solution: as we need to store colour and depth values that are consistent across all dense fragment buffer calculation iterations, we should decouple the data decoding from final data processing. The algorithm is just as the one described for data encoding/decoding parts of steps 5, 6 and 8 of the 
Map buffer update pass
Problem: find the per-pixel offset into the global fragment buffer for the subsequent iteration. Solution: add the contents of the current iteration's count buffer to a global per-pixel offset buffer (one that is added to the values fetched from a sparse map buffer during multi-layered frame buffer gathering). This offset buffer is represented with the same values as the sparse map and count buffer and is initialized to hold zeros for all pixels just before entering the outer loop.
Input: global offset buffer, current iteration's count texture. Output: updated global offset buffer.
Details:
In FS fetch and output (to additive blending) the fragment count from current iteration's count texture.
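As a rough illustration, the additive blending used here amounts to a per-pixel accumulation across iterations. The sketch below is a CPU emulation with a flat list standing in for the offset texture; the function name and the sample counts are hypothetical.

```python
def update_offsets(global_offsets, iteration_counts):
    # Emulates additive blending: every pixel adds the fragment count
    # of the current iteration to its running offset.
    if len(global_offsets) != len(iteration_counts):
        raise ValueError("buffers must cover the same pixels")
    for pixel, count in enumerate(iteration_counts):
        global_offsets[pixel] += count
    return global_offsets


# The offset buffer is zeroed before entering the outer loop.
offsets = [0, 0, 0, 0]
update_offsets(offsets, [2, 0, 1, 3])  # first iteration
update_offsets(offsets, [1, 0, 0, 2])  # second iteration
print(offsets)  # [3, 0, 1, 5]
```

After two iterations each entry tells how many fragments of that pixel have already been stored, which is the amount by which the next iteration's writes have to be shifted.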
After each iteration of the inner loop, its control variable is updated with the index of the first relevant polygon (in terms of covered pixels) after the last one processed in the current iteration. Its value is then compared (as the loop's exit condition) with the limit imposed by the last polygon that fits into the current outer loop's bit field buffer.
The outer loop continues batching polygons into the bit field buffer until there are no polygons left to be processed in the scene's geometry representation.
Shader Model 4.0 architecture
The fourth generation (SM4) programmable architecture is most widely represented in:
• energy efficient mobile computing devices (such as netbooks, UMPCs and tablets), and
• CPU integrated graphics solutions (such as Intel Sandy Bridge architecture).
In addition to the capabilities found in the SM3 architecture, it assumes the following hardware capabilities:
• transform feedback (or TF) [30];
• uniform buffer objects (or UBO) [31];
• texture buffer objects (or TBO) [32];
• programmable geometry stage (or GS) [33];
• multi-layered rendering [33], and
• array texture objects [34].
The D-buffer calculation algorithm follows. Most of its steps take advantage of the code prepared for the SM3 algorithm. The ones that are new and, therefore, interesting use the transform feedback and multi-layered rendering capabilities of the SM4 hardware. There are some steps missing as well, due to the fact that we no longer need to concern ourselves with fragment data encoding and alpha based routing.
Vertex transformation
Problem: (see step 1 of the SM3 algorithm). Solution: use VS to transform, VBO to store and TF to cap− ture vertex coordinates. Input: (see step 1 of the SM3 algorithm). Output: (see step 1 of the SM3 algorithm).
Details:
In VS transform incoming vertex coordinates by a matrix stored in a UBO. Capture the homogeneous clip space vertex coordinates into the output VBO using TF recording point primitives. Input: maximum fragment count. Output: multi-layered FBO with stencil component.
Multi-layered frame buffer initialization
Problem: we want to store multiple fragments' data for each pixel. Solution: use the multi-layered FBO to store a pixel's fragments as efficiently as possible. This simply means that the fragments should occupy consecutive layers starting at layer 0 up to the number of fragments for a given pixel. As the geometric primitive rendering and fragment generation order is largely unknown (see for example Ref. 17), we cannot assign the appropriate layer index to each fragment beforehand. Instead we use a "runtime" storage mechanism inspired by Ref. 4. We start by initializing the stencil buffer of each layer with values going from 2 for the first layer up to the maximum number of fragments (or layers, whichever is smaller) plus 1 for the last layer used. The shift in the lowest stencil value allows us to detect overflow cases just as described in Ref. 4. In fact we could forfeit this trick by using an appropriate geometry rendering schedule based on the count buffer contents and, thus, avoid the overflow cases. But for clarity we use the same method as the one described by Myers. We draw to each layer of the FBO by writing out only to layers for which the stencil index is equal to 2. To block each layer from being written to multiple times, we decrement the stencil index for each incoming fragment, saturating at 0. Input: multi-layered attribute FBO. Output: multi-layered attribute FBO with initialized stencil write masks.
Multi-layered frame buffer rendering
Problem: see step 5. Solution: use GS to replicate geometry into all initialized layers of the multi-layered FBO. Input: clip space vertex coordinates, initialized multi-layered attribute FBO. Output: multi-layered attribute FBO.
Details:
Set the stencil test to pass only for pixels with the stencil index equal to 2 and decrement it with saturation to 0 for each incoming fragment. Draw as many instances of the geometry as there are initialized layers (see footnote 5). In GS emit primitive vertices into the layer determined by the instance index.
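The stencil based routing of the two steps above can be emulated for a single pixel as in the sketch below. It only models the stencil arithmetic (initial values 2, 3, ..., pass on equality with 2, saturating decrement for every incoming fragment); the function names are hypothetical, and reading a zero in the last layer as the overflow indicator is our interpretation of the shifted initial value, not a detail spelled out in the text.

```python
def make_stencils(num_layers):
    # Layer 0 starts at 2, layer 1 at 3, and so on.
    return [layer + 2 for layer in range(num_layers)]


def route_fragment(stencils, layers, fragment):
    # The geometry is replicated into every layer, so each layer sees
    # the fragment; only the layer whose stencil equals 2 stores it.
    for layer, value in enumerate(stencils):
        if value == 2:
            layers[layer] = fragment
        # Decrement for every incoming fragment, saturating at 0.
        stencils[layer] = max(value - 1, 0)


def route_all(num_layers, fragments):
    stencils = make_stencils(num_layers)
    layers = [None] * num_layers
    for fragment in fragments:
        route_fragment(stencils, layers, fragment)
    return layers, stencils


def overflowed(stencils):
    # Assumed check: the last layer reaches 0 only if at least one more
    # fragment arrived after that layer had been written.
    return stencils[-1] == 0


layers, stencils = route_all(3, ["a", "b", "c"])
print(layers, stencils, overflowed(stencils))  # ['a', 'b', 'c'] [0, 0, 1] False
layers, stencils = route_all(3, ["a", "b", "c", "d"])
print(layers, stencils, overflowed(stencils))  # ['a', 'b', 'c'] [0, 0, 0] True
```

Had the first layer started at 1 instead of 2, a last layer value of 0 could not distinguish "exactly full" from "one fragment too many", which is one way to understand the shift.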
Multi-layered frame buffer gathering
Problem: (see step 7 of the SM3 algorithm). Solution: (see step 7 of the SM3 algorithm with the binary/linear search distinction removed due to performance reasons).
D-buffer processing and scattering
Problem: (see step 8 of the SM3 algorithm). Solution: (see step 8 of the SM3 algorithm). Input: pointer, count, colour, depth textures. Output: (see step 8 of the SM3 algorithm).
Details:
In FS fetch the pointer and count values for an incoming fragment's coordinate. In FS fetch depth and fragment data (as needed by the processing algorithm of choice) associated with the D-buffer pointer value fetched above. In FS calculate and output final pixel colour.
Multi-pass Shader Model 4.0 algorithm
The multi-pass algorithm (see Fig. 14) is very much the same as for the Shader Model 3.0 version (see Fig. 10) with the following distinctions:
• we use on-the-fly polygon bit field value calculation based on the primitive index in the geometry shader;
• we use integer textures in place of fixed point ones for the bit field and polygon index buffer representation;
• we use a 2D texture array to represent the texture map encoding polygon indices in terms of bit field component values (see footnote 6).
The use of 2D array textures, as well as a wider range of purely integer data representations, allows us to process substantially more polygons in each of the outer and inner loop iterations. In fact, we are only constrained by the memory limits and buffer initialization costs, unlike the very limited number of render targets imposed by Shader Model 3.0 hardware capabilities.
The only step distinct in terms of processing (data representation put aside) is therefore:
Bit field buffer calculation pass
Problem: (see step 2 of the multi-pass SM3 algorithm). Solution: use the GS's predefined primitive identifier input variable as the basis of the bit field calculation. As we are now constrained only by the number of texture array layers used, the limit on the number of polygons processed in a single outer loop iteration is equal to num_layers * num_components * num_bits_per_component. Input: clip space vertex coordinates, vertex index bounds (as constrained by the previous outer loop iterations and the aforementioned limit).
Details:
In GS calculate the bit field buffer layer index as primitive_id / (num_components * num_bits_per_component 
Shader Model 5.0 architecture
The fifth generation (SM5) programmable architecture is most widely represented in desktop and workstation solutions (such as NVIDIA Fermi and AMD Evergreen graphics architectures).
In addition to the capabilities found in the SM4 architecture, it assumes the following hardware capabilities relevant to the contents of this paper:
• random read/write access to texture memory through image units (or IU) [35];
• atomic counters (or AC) [36];
• viewport arrays (or VA) [37].
The flexibility of unconstrained random access to image memory greatly reduces the programming burden needed to implement D-buffer rendering.
The D-buffer construction algorithm follows (see Fig. 15). It is for the most part similar to the SM4 algorithm, but there are some algorithm steps missing due to the fact that we no longer need to concern ourselves with temporary multi-layered frame buffer management (we draw directly into the dense fragment buffer). accelerated, fixed function blending should be favourable even on current generation, highly programmable hardware.
To explore the possibility of higher memory access parallelization, we have also tried to divide the drawing region into multiple sub-regions represented by distinct image objects. The drawing was constrained using viewport and scissor arrays (see Fig. 16), and polygons had to be replicated in GS for cases when they could span multiple sub-regions and, thus, should be rasterized to all of them. We also had to maintain per-region fragment (atomic) counters used for address offsetting during the multi-region map buffer calculation pass. This optimization offered no speed improvement, but is interesting from the high resolution drawing perspective, when the image drawn cannot fit the maximum texture or viewport dimension limits. To reduce the number of full geometry drawing passes, we have also experimented with using this pass to output unordered fragment data (see footnote 7) and then using the rendering step to order it without the need for a second rasterization pass. In this case we additionally need access to an atomic counter, just like in the per-pixel linked lists [7] case, or a single element image object, if we decide to use a paged fragment allocation scheme similar to the one in Ref. 38 (see footnote 8). Skipping the fragment counting step means reverting back to the original per-pixel linked lists algorithm. Input: (see step 2a of the SM3 algorithm) and discussion above.
Output: (see step 2a of the SM3 algorithm) and discussion above.
In parallel a. Fragment data buffer allocation
Solution: as we have random access to image memory, we do not need to concern ourselves with the constraints imposed by outputting data into a 2D viewport. Therefore, we use the total fragment count to simply allocate a properly sized buffer texture capable of storing all of the D-buffer's fragment data. Input: (see step 3b of the SM3 algorithm). Output: (see step 3b of the SM3 algorithm).
b. Map buffer calculation pass
Problem: find the data structure mapping the D-buffer into a 2D frame buffer. Solution: we represent the map buffer, just like in the case of the SM3 and SM4 algorithms, as a sparse 2D fragment map buffer (shown in Fig. 3). We calculate it by using a single element image object (or multiple of these in the multi-region drawing case, see Fig. 17), which is atomically incremented by each processed pixel's fragment count, while its previous value is output into the map buffer at the corresponding pixel coordinate. As in the case of count buffer calculation, we have found that using the traditional FS output infrastructure in place of image manipulation offered greater performance.
To reduce the number of memory access operations, we have also experimented with packing the fragment count and address data into a single component of a single texture object. In this case, we additionally use fixed function bit-wise colour operations performed on FS outputs. As before, if we have chosen to use a multi-region drawing scheme, we need to use GS to route data into the appropriate buffer regions. But this time its main purpose is not to replicate the incoming geometry; instead it calculates address offsets for each region as a partial sum of the fragment counts for the regions preceding it. The fragment counts are passed through as a TBO reinterpreted as a UBO. The array of offsets is computed once for the whole viewport and its appropriate elements are output while emitting vertices of polygons covering subsequent regions. As in our case using multiple regions did not offer any benefits, we limit the remainder of this step's description to single region drawing. We have also experimented with skipping this step altogether, offsetting its calculation to the rendering step with dynamic fragment data allocation (see step 4). It also did not offer any substantial performance benefits and incurred additional memory over-allocation and, thus, was not used in the final implementation of the algorithm. Input: (see step 3c of the SM3 algorithm) and discussion above.
Output: (see step 3c of the SM3 algorithm).
Details:
In FS atomically increment the fragment counter value and output its previous value.
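The effect of this step can be emulated sequentially as below. It is a sketch only: the AtomicCounter class stands in for the single element image object, the pixels are visited one after another rather than in parallel, and the `order` parameter is a hypothetical device for showing that the visiting order changes the pointers but not the validity of the result.

```python
class AtomicCounter:
    """Sequential stand-in for a single element image with atomic add."""

    def __init__(self):
        self.value = 0

    def fetch_add(self, amount):
        previous = self.value
        self.value += amount
        return previous


def build_map_buffer(counts, order=None):
    # Each pixel reserves `count` consecutive slots of the dense
    # fragment buffer and remembers where its range starts.
    counter = AtomicCounter()
    pointers = [0] * len(counts)
    for pixel in (order if order is not None else range(len(counts))):
        pointers[pixel] = counter.fetch_add(counts[pixel])
    return pointers, counter.value


counts = [2, 0, 3, 1]
print(build_map_buffer(counts))                      # ([0, 2, 2, 5], 6)
print(build_map_buffer(counts, order=[2, 0, 3, 1]))  # ([3, 6, 0, 5], 6)
```

In both runs the ranges [pointer, pointer + count) of non-empty pixels are disjoint and together cover exactly the 6 allocated slots, which is all the later passes rely on.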
D-buffer rendering
Problem: given the pointer buffer calculated in the previous step, output fragments resulting from the scene's geometry rasterization at the appropriate locations of the dense fragment data buffer. Solution: for each incoming fragment we use the value stored at its screen coordinate in the pointer buffer as an address into an output buffer (texture) image representing the dense fragment data buffer. The appropriate value of the pointer buffer is retrieved as a side effect of atomic incrementation. This way we coalesce pointer retrieval and next (per-pixel) fragment offset calculation into one memory access operation.
Footnote 7: With the addition of the pixel coordinate for pointer buffer addressing (see step 4).
Footnote 8: With the difference that we do not drop fragments. When page allocation times out, we allocate a possibly superfluous page to get pixel-exact drawing results.
Opto-Electron. Rev., 21, no. 1, 2013, J.K. Lipowski
If we have chosen to skip the map buffer calculation step, we additionally need to access a fragment counter (represented as before with a single element image object) and one additional image object storing pointers into the list part of the dynamically allocated D-buffer, as well as the count buffer (as a read only memory resource, see Fig. 18). In this modified version of the algorithm, for each incoming fragment we test if the properly sized array of fragments has been allocated for the corresponding pixel; if it has not and the fragment is the first one for the given pixel, we allocate said array. If the allocation is in progress, we output the fragment's data in a manner similar to the one used in the original per-pixel linked list algorithm, storing the calculated pointer in an auxiliary pointer buffer and additionally outputting a per-fragment pointer to the previously written-out fragment. On the other hand, if we decide to output unordered data in the fragment counting step, we use VS sourcing input data from vertex attribute buffers or texture objects (see footnote 9) along with per-fragment pixel coordinates (just like in the F-buffer's case [10]) for pointer buffer addressing, to output vertices representing single pixel points at the appropriate locations. In the case of paged fragment allocation during the fragment counting step, we use the FS to gather data from the possibly sparse, paged fragment data buffer into the dense D-buffer representation. As earlier, the basic and simplest form, taking advantage of the results of all the above steps, prevailed as the most time efficient. Input: clip space vertex coordinates, sparse map (pointer) buffer and see discussion above. Output: D-buffer, sparse map (pointer) buffer with entries incremented with per-pixel fragment counts.
Details:
In FS atomically increment the pointer buffer value at current pixel's coordinate and output fragment's data into fragment data buffer at the address given by the pointer's previous value.
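A sequential emulation of this scatter is sketched below, assuming the pointer buffer holds each pixel's start address in the dense buffer. The fragment stream, the payload strings and the function name are made up for the example; on the GPU the increment is atomic and fragments arrive in an order we do not control.

```python
def render_fragments(pointers, fragments, total_fragments):
    """Scatter (pixel, payload) fragments into a dense buffer.

    pointers        -- per-pixel start addresses (the map buffer)
    fragments       -- iterable of (pixel, payload) in arrival order
    total_fragments -- size of the dense fragment data buffer
    """
    data = [None] * total_fragments
    pointers = list(pointers)  # the pass updates the pointer buffer
    for pixel, payload in fragments:
        address = pointers[pixel]  # previous value of the atomic increment
        pointers[pixel] += 1
        data[address] = payload
    return data, pointers


start = [0, 2, 2, 5]  # map buffer for per-pixel counts [2, 0, 3, 1]
stream = [(2, "c0"), (0, "a0"), (2, "c1"), (3, "d0"), (0, "a1"), (2, "c2")]
data, end = render_fragments(start, stream, 6)
print(data)  # ['a0', 'a1', 'c0', 'c1', 'c2', 'd0']
print(end)   # [2, 2, 5, 6]
```

After the pass every pointer has advanced by its pixel's fragment count, which matches the output listed for this step and explains why the composition pass reads with a pre-decremented pointer.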
D-buffer processing and scattering
Problem: (see step 8 of the SM3 algorithm). Solution: (see step 8 of the SM3 algorithm); as a distinct hardware feature we use a TBO as the fragment data storage and, thus, use a 1D address space for fragment attribute retrieval. This requires us to supply the data retrieval algorithm with a pointer buffer, a count buffer and of course the fragment data buffer itself. All can be perceived as read only memory resources. The algorithm starts by fetching the pointer and count buffer values for the current pixel's coordinate. Then, it proceeds with sequential data retrieval by addressing the fragment data buffer with a pre-decremented pointer value, until the number of retrieved fragments reaches the current pixel's fragment count. If we decide to use the dynamic allocation scheme as described earlier, we additionally need read only access to the list pointer buffer (see Fig. 19) and the per-fragment pointer buffer. We start by fetching and counting all the fragments stored in the list just like in Ref. 7. Then we fetch the remaining (count buffer value decremented by the list fragment count) fragments from the array.
Final fragment composition according to the algorithm of choice follows. Input: pointer incremented by the fragment count, count, colour and depth textures. Output: (see step 8 of the SM3 algorithm).
Details:
In FS fetch the pointer and count values for incoming fragment's coordinate. In FS for each pixel's fragment fetch depth and fragment data (as needed by the processing algorithm of choice) associated with the pre-decremented D-buffer pointer value fetched above.
In FS calculate and output final pixel colour.
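The retrieval loop can be sketched for one pixel as follows. The fragments are stored as hypothetical (depth, colour) pairs, smaller depth is assumed to mean nearer to the viewpoint, and nearest-fragment selection is used only as one example of a composition algorithm.

```python
def gather_pixel(end_pointer, count, data):
    # The pointer was advanced past the pixel's last fragment during
    # rendering, so we walk backwards with a pre-decremented address.
    fragments = []
    pointer = end_pointer
    while len(fragments) < count:
        pointer -= 1
        fragments.append(data[pointer])
    return fragments


def compose_nearest(fragments):
    # Example composition: colour of the fragment with the smallest depth.
    if not fragments:
        return None  # empty pixel, nothing to draw
    depth, colour = min(fragments)
    return colour


data = [(0.7, "red"), (0.2, "green"), (0.5, "blue")]
print(gather_pixel(2, 2, data))                   # [(0.2, 'green'), (0.7, 'red')]
print(compose_nearest(gather_pixel(2, 2, data)))  # green
print(compose_nearest(gather_pixel(3, 1, data)))  # blue
```

Only the per-pixel pointer and count are read before the fragment data itself, so the number of bookkeeping reads does not grow with the number of fragments in the pixel.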
Opto-Electron. Rev., 21, no. 1, 2013 © 2013 SEP, Warsaw
Fig. 17. Pointer buffer, counter(s) and optional fragment offsets used when drawing with multiple sub-regions.
Footnote 9: Which are both equivalent due to the use of TBO for fragment data storage.
Results
We have evaluated all the algorithm versions on a common hardware/software configuration (see Table 1) to measure the algorithm's efficiency and avoid platform dependent skews. We also include results for our implementation of the Per-Pixel Linked Lists to estimate the benefits when compared with a state-of-the-art method proven to be practically applicable [7]. We present frame drawing times in units of seconds (as measured with high performance CPU counters [39] (see footnote 10) and GPU timer queries [40] (see footnote 11)) and memory consumption in units of megabytes (as measured with the GPU memory info extension [41]), all averaged over 1000, 100 or 10 frames (see footnote 12), depending on the processing time and rules of sanity. The memory consumption was measured as the difference between the memory available upon software initialization and that available when finishing frame drawing. The multi-layered frame buffers were not freed (for performance reasons), although they could be after the fragment gathering step has completed. Along with the memory consumption values there are some additional configuration parameters listed: Fragment colour values are generated procedurally based on screen coordinates (see Fig. 23). This colour scheme was used more for the sake of debugging than aesthetics.
As our approach is not linked to any particular visualization solution, final fragment data composition was done with three different algorithms:
• Z-buffer: choosing the colour of the fragment nearest to the viewpoint; does not require any local temporary storage for the process; loads colour only for fragments passing the depth test; well suited to benchmark the storage algorithm's efficiency, as most of the work is shifted away from post-processing;
Footnote 10: The upper value in appropriate columns of the tables.
Footnote 11: The lower value in appropriate columns of the tables.
Footnote 12: With the addition of 10% of this number for initial warm-up, discarded frames.
Footnote 13: Sweeping region bounding was not used when it did not offer any performance benefits (all pixels covered by at least one fragment and all fragments fit in the multi-layered frame buffer) or in memory constrained cases, as it imposes the use of additional texture resources.
Happy Buddha (see Fig. 25) model from The Stanford 3D Scanning Repository represented by 32328 geometry vertices and 67240 triangles [42]. Translucency used for final fragment data composition. Edge case for the SM5 algorithm, in which all fragment counts fit single pass drawing requirements. A relatively large number of polygons when compared to covered pixels is less favourable when drawing with multiple passes.
Series of building segments (see Fig. 26) filling the whole screen (typical use case for computer aided design or CAD applications) represented by 956643 geometry vertices and 522720 triangles [43]. Transparency used for final fragment data composition. The significant number of polygons, the maximum number of fragments and the large fragment count variance are especially ill-suited from the multi-pass drawing and memory access coherency perspective.
We also took a more in-depth look at the dependence of processing time on screen resolution (pixel count), as well as on the number of fragments (per pixel), for the SM5 D-buffer algorithm and Per-Pixel Linked Lists.
We have plotted the single frame processing time as a function of number of on−screen pixels (see Fig. 20 ). We have chosen to increase the image's width from 16 up to 2560 with an increment of 16 pixels per−measurement point. Image's height ranged from 10 to 1600 with an increment of 10 pixels. All respecting typical screen resolutions aspect ratio of 16:10. In this case per−pixel fragment count was held constant at 15.
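For reference, the measurement points of this sweep can be enumerated with a few lines of code. This is only a sketch of the arithmetic implied by the stated ranges and increments; the function name and default arguments are ours.

```python
def resolution_sweep(max_width=2560, width_step=16, height_step=10):
    # Width and height grow together, so every point keeps 16:10.
    steps = max_width // width_step
    return [(width_step * i, height_step * i) for i in range(1, steps + 1)]


points = resolution_sweep()
pixel_counts = [width * height for width, height in points]
print(len(points), points[0], points[-1])  # 160 (16, 10) (2560, 1600)
```

The sweep therefore contains 160 measurement points, and the pixel count grows quadratically with the step index, from 160 pixels up to 4096000 pixels.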
The acceleration ratio stayed roughly constant at the level of 4 times to the D−buffer's advantage.
Next we held the screen resolution constant at 1920 by 1200 and varied the number of fragments generated (uniformly) per pixel from 1 to 38 with an increment of 1 fragment (see Fig. 21).
The acceleration ratio was logarithmic in terms of the per-pixel fragment count (see Fig. 22).
All times were averaged over 1000 frames according to rules specified in the opening of this section.
Discussion
The first and most pronounced issue that we feel requires commentary and analysis is the distribution of memory consumption with respect to data set and image resolution. For the Per-Pixel Linked Lists we decided to allocate the fragment data buffer for the worst possible scenario, that is the number of fragments equal to the number of pixels multiplied by the maximal number of fragments per pixel coming from the geometrical data description. This, in turn, is a consequence of the fact that we wanted to achieve pixel-exact results, that is not to drop any fragments. Having no prior knowledge of the fragment count requires us to take such a drastic measure. As a result the memory consumption for the Per-Pixel Linked Lists quickly reaches the limit imposed by the graphics memory at our disposal and, thus, it is hard to compare our method with the original algorithm. Nevertheless, benefits are even more obvious.
As for the theoretical analysis of this problem, we reach the conclusion that the D-buffer's (see Fig. 3) memory management (that is, not involving access to the actual fragment data) requirements are as follows: Putting aside the fact that in the case of Per-Pixel Linked Lists we should allocate memory for the worst possible scenario, we expect an overall memory complexity reduction when 2 * pixel_count < pixel_count + fragment_count, that is when fragment_count exceeds pixel_count (on average more than one fragment per pixel; see the results for one polygon covering the entire screen in Table 3).
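The condition can be reduced to a statement about the average depth complexity. In the short derivation below, P stands for pixel_count and F for fragment_count. The reading of the two sides is an assumption on our part, since the itemized requirements are not reproduced here: the left side is taken as two per-pixel entries for the D-buffer (map and count), the right side as one head pointer per pixel plus one next pointer per fragment for the Per-Pixel Linked Lists.

```latex
\begin{align*}
2P &< P + F \\
P  &< F \\
1  &< \frac{F}{P}
\end{align*}
```

For a single polygon covering the entire screen F = P, so both sides are equal and this case marks the break-even point of the bookkeeping cost.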
Memory consumption is also a pressing issue when it comes to using temporary multi-layered frame buffers. This was first mentioned by the authors of the F-buffer concept and is supported by our findings [10]. For the simplest case of a single full-screen polygon (see Table 3) it is pronounced as a close to 100% over-allocation. The additional memory cost for the SM3 and SM4 methods is related to the use of data structures required by bounded gathering and count buffer reduction. One conclusion we find especially worth mentioning is that the case in which we can fit all the layers into a single temporary frame buffer (thus avoiding the multi-pass solution) is highly desirable from the processing time perspective. This is even more true for scenes with a uniform fragment count distribution (see Table 2 and Table 4), as it can take better advantage of multi-layered (somewhat fixed-function) memory access regularity and, thus, achieve up to 400% better performance than the random access implementation (see Table 3 and Table 5). When we reach cases where multi-pass drawing is unavoidable (see Table 7 and Table 9), we should always try to maintain the structures used for bounded gathering, because they have an even more positive influence when used with the sparse results of each inner loop iteration. Next, if the number of polygons is significant, we should use as few outer loop iterations as possible, thus favouring bit field frame buffer size over temporary attribute frame buffer size. Using a single-layered temporary frame buffer (see the edge case results of the SM3 algorithm for the highest resolution in Table 9) means in practice that we revert to a solution that is on par with virtual pixel maps [8] (or depth peeling), but with the added optimization that in each pass we only rasterize polygons that potentially have a chance of generating and writing out fragments.
Comparing the performance of the SM5 algorithm with Per-Pixel Linked Lists, we come to the conclusion that the cost of an additional count buffer pass is in most cases negligible. The case in which it should be most pronounced (see Table 2) leads to a 2% performance drop for the highest resolution (see Table 3). When the number of layers is low and the number of fragments is distributed evenly between pixels (see Table 4), we get around 70% better performance (see Table 5). For a low average number of fragments per pixel and a low number of non-empty pixels (see Table 6) we get a 48-50% performance gain (see Table 7). A high average and a varying number of fragments per pixel (see Table 8) yields 66-95% better performance (see Table 9).
We assume that the most common cases to which our method would be applicable (e.g., interactive simulations, CAD/CAM tools) are those most similar to the one represented by the architectural scene (see Table 8 and Fig. 26), that is with a low number of empty pixels and rich and highly varying (in terms of depth complexity) content. We deem the close to 100% performance improvement quite satisfying under those conditions (see footnote 14).
As for the results of synthetic benchmarks (see Figs. 20, 21 and 22), we connect the local irregularities of frame drawing time with respect to the number of pixels processed (see Fig. 20) to hardware memory access scheduling issues. In a wider perspective they show a regular pattern of local irregularity, reoccurring at roughly the same pixel count intervals. As they are more pronounced for the Per-Pixel Linked Lists, we connect them with the memory management: the original algorithm requires a larger number of memory accesses per pixel and performs them in a hard to predict manner.
In theory the D-buffer's memory management related (that is, not involving access to the actual fragment data) requirements to write out and compose fragments are as follows: This way we reduce on average the number of memory writes from fragment_count to fragment_count / pixel_count and, at the same time, reduce the number of necessary memory reads during final fragment composition from fragment_count to pixel_count pointer elements. This allowed us to greatly reduce the influence of memory access latency on fragment data processing performance. It is best reflected by the results obtained for varying pixel counts (see Fig. 20), where on average we get 400% better performance when compared with Per-Pixel Linked Lists. This closely corresponds to twice the amount (in terms of the overall number of fragments) of writes during drawing into a fragment data buffer and twice the amount of reads during the final composition.
Virtually equal CPU and GPU processing times mean that all code paths spent most of their time processing data on the GPU, which is advantageous in light of the proper utilization of dedicated resources.
Conclusions
From the results of the previous section we conclude that we have presented a method that is simple in its concept, uni− versal in its nature, applicable to a wide range of hardware architectures and problems and yet more efficient than the existing solutions.
We are particularly proud of the method's universality both in terms of presented results, as well as applications to other compute intensive problems. The first come from the fact, that it can run on virtually any scene (see Fig. 26 ) even in a highly limited hardware environment (see SM3 algo− rithm and Table 9 ). True, its performance is far from being interactive under such circumstances, but it is nevertheless applicable even in such conditions.
The second comes from the fact that the concept used to order the data stored in the D-buffer can be applied to any architecture with multi-level memory, or one that otherwise optimizes memory access through reference prediction. Applying such a concept could work by performing an early, lightweight simulation pass, involving as few off-chip resources as possible, to give the hardware knowledge about the pattern of latency-burdened operations.
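To make the idea of an early, lightweight pass more concrete, the sketch below shows a generic count, prefix-sum and scatter scheme in Python. It is only one plausible instance of the concept, written under stated assumptions: the first pass touches nothing but a per-pixel counter, an exclusive prefix sum turns the counts into offsets, and the second pass writes each payload into a contiguous slot. The function name and the (pixel_index, payload) input format are hypothetical and are not taken from the D-buffer implementation.

```python
def build_packed_buffer(fragments, pixel_count):
    """Pack (pixel_index, payload) pairs so each pixel's payloads are contiguous.

    Returns (counts, offsets, data). Illustrative only: a serial stand-in
    for what would be parallel passes on a GPU.
    """
    fragments = list(fragments)

    # Pass 1 (lightweight "simulation"): only count fragments per pixel.
    counts = [0] * pixel_count
    for pixel, _ in fragments:
        counts[pixel] += 1

    # Exclusive prefix sum: offsets[i] is where pixel i's run starts.
    offsets = [0] * pixel_count
    running = 0
    for i, c in enumerate(counts):
        offsets[i] = running
        running += c

    # Pass 2: the access pattern is now known, so every write has a fixed slot.
    cursor = offsets.copy()
    data = [None] * len(fragments)
    for pixel, payload in fragments:
        data[cursor[pixel]] = payload
        cursor[pixel] += 1

    return counts, offsets, data


counts, offsets, data = build_packed_buffer(
    [(2, "a"), (0, "b"), (2, "c"), (1, "d"), (2, "e")], pixel_count=4
)
print(counts)   # [1, 1, 3, 0]
print(offsets)  # [0, 1, 2, 5]
print(data)     # ['b', 'd', 'a', 'c', 'e']
```

After the second pass, composing pixel i only needs offsets[i] and counts[i] to read a contiguous run, which is the kind of predictable access pattern the paragraph above argues for.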
The simplicity of the method leaves plenty of room for hardware- and software-based optimizations. For instance, we could use the count buffer as an input to the thread scheduler, telling it how much time it will need to devote to processing a given compound data element (e.g., a pixel). This concept can be exploited even further if we assume local coherency of the count buffer. This way we could take additional advantage of hierarchical optimizations, assigning workload in batches (thread workgroups) or discarding whole sub-regions of the processed data set, just as in the case of the hierarchical depth buffer.
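The scheduling idea above can be sketched in a few lines of Python. The example assumes a row-major count buffer and square tiles standing in for thread workgroups; it sums the counts per tile, discards empty tiles, and orders the rest by workload. The function name, the tile layout and the heaviest-first ordering are assumptions chosen for illustration, not a description of an existing scheduler.

```python
def plan_tiles(counts, width, height, tile):
    """Return (tile_x, tile_y, work) for non-empty tiles, heaviest first.

    counts is a row-major per-pixel fragment count buffer of size width * height.
    """
    if len(counts) != width * height:
        raise ValueError("counts must hold width * height entries")
    plans = []
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            # Workload estimate for this tile: total fragments it contains.
            work = sum(
                counts[y * width + x]
                for y in range(ty, min(ty + tile, height))
                for x in range(tx, min(tx + tile, width))
            )
            if work > 0:  # empty tiles are discarded as a whole
                plans.append((tx // tile, ty // tile, work))
    plans.sort(key=lambda p: p[2], reverse=True)
    return plans


# Hypothetical 4 x 4 count buffer, processed in 2 x 2 tiles.
example_counts = [
    0, 0, 3, 1,
    0, 0, 2, 0,
    1, 0, 0, 0,
    0, 1, 0, 0,
]
print(plan_tiles(example_counts, width=4, height=4, tile=2))
# [(1, 0, 6), (0, 1, 2)]
```

Two of the four tiles carry no fragments and are never scheduled, which mirrors the sub-region rejection performed by a hierarchical depth buffer.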
Future work
Given the experimental results for the SM4 algorithm, we find it very intriguing to pursue a hybrid approach that uses a set of multi-layered temporary buffers to take advantage of the memory access regularity of predefined graphics API image structures. Each buffer could have a different depth and resolution based on the needs of the local frame buffer sub-region. We could route drawing into such buffers using viewport arrays [37], and assigning the buffers to regions based on region size and local depth complexity could limit the amount of wasted space and the bandwidth used to initialize them.
This assumes local coherency of the fragment count distribution and thus implies that we would like to investigate the possible benefits of hierarchical optimizations based on the count buffer contents.
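A rough way to estimate how much space such per-region buffers might save is sketched below in Python. The model is deliberately simple and rests on assumptions: regions are fixed square tiles, each region's layered buffer is as deep as the largest fragment count inside it, and the alternative is a single full-frame layered buffer as deep as the global maximum. Variable resolution and viewport array routing are not modelled, and the function name is hypothetical.

```python
def layered_slots(counts, width, height, tile):
    """Return (uniform_slots, per_region_slots, used_slots) for a count buffer."""
    if len(counts) != width * height:
        raise ValueError("counts must hold width * height entries")
    used = sum(counts)
    # One full-frame buffer, as deep as the deepest pixel anywhere.
    uniform = width * height * max(counts, default=0)
    per_region = 0
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            cells = [
                counts[y * width + x]
                for y in range(ty, min(ty + tile, height))
                for x in range(tx, min(tx + tile, width))
            ]
            # Each region only needs as many layers as its deepest pixel.
            per_region += len(cells) * max(cells)
    return uniform, per_region, used


# Hypothetical 4 x 4 count buffer, split into 2 x 2 regions.
region_counts = [
    0, 0, 3, 1,
    0, 0, 2, 0,
    1, 0, 0, 0,
    0, 1, 0, 0,
]
print(layered_slots(region_counts, width=4, height=4, tile=2))
# (48, 16, 8)
```

In this toy case the per-region layout allocates 16 slots instead of 48 for 8 stored fragments. The gain depends entirely on how coherent the local fragment counts are, which is the assumption stated in the paragraph above.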
We are also eager to try applying early memory reference pattern prediction to problems not directly linked to the field of computer graphics.
Acknowledgements
To Professor Przemysław Rokita for encouraging me to keep working on the idea presented herein.
To Professor Karol Myszkowski for suggesting the idea of hierarchical optimizations based on count buffer contents.
And, last but not least, to Professor Władysław Skarbek for taking an interest in my work and giving me the opportunity to share it with the rest of the scientific world.
