Abstract. An efficient algorithm for parallel acquisition of visualization data for large sparse matrices is presented and evaluated both analytically and empirically. The algorithm was designed to be application-independent, i.e., it works with any matrix-processors mapping and with any sparse storage format/scheme. The empirical scalability study of the algorithm was carried on using multiple modern HPC systems. In our largest experiment, we utilized 262144 processors for 73 seconds to gather and store to a file the visualization data for a matrix with 1.17 · 10 13 nonzero elements. Using the proposed algorithm, one can thus visualize large sparse matrices with a minimal runtime overhead imposed on executed HPC codes.
1. Introduction. Within our previous work, we have addressed weaknesses of common solutions for the problem of visualization of large sparse matrices emerging in HPC applications [9] . There are several reasons that make such a problem difficult to solve. First, matrices exist in memory only for a short and unknown period of time determined by the scheduler of a given HPC system. Second, it is generally impossible to integrate the visualization process directly into the HPC code so that it will automatically produce a final matrix image of a desired quality. Third, very large matrices, due to their sizes, cannot be processed locally on personal computers. Moreover, their storage to a file on an HPC system and their transfer via network would take high amount of time.
We therefore proposed a solution where a matrix is first partitioned into blocks. Then, for each block, visualization information is calculated and stored into a file. Finally, this file is downloaded from the HPC system into a personal computer and processed there, possibly interactively, into a final matrix image. We introduced an algorithm that accomplishes a part of this procedure performed on the side of an HPC system, i.e., the acquisition of visualization data for a given sparse matrix. We also evaluated this algorithm experimentally using up to 1024 processors of a small-scale HPC system. This paper is an extended version of our previous results [9] . We present an updated version of the algorithm, which uses more efficient approach to the calculation of visualization data. Namely, due to the utilization of an ordered associative array, frequent lookup operations with logarithmic runtime complexity were substituted by a single iteration over the ordered records. Each required record is thus available with constant runtime complexity. We also provide more detailed analysis of the computational, space, and communication complexities of the algorithm. The results of large-scale experiments using several modern HPC systems are presented for up to 266144 utilized processors.
Modern HPC systems are typically hybrid shared-distributed memory machines. They consist of sharedmemory multicore computational nodes connected by network subsystems. Parallel programming models and corresponding runtime environments allow to use these machines as are, virtualize them as pure shared-memory, or virtualize them as pure distributed-memory. In the context of this paper, we consider the latter case.
Particularly, we assume the utilization of the MPI parallel programming library and its runtime environment [5, 11] , which is motivated by its widespread adoption in the HPC community.
We further call MPI processes involved in algorithm runs simply processors. However, we always assume that each MPI process is mapped at runtime to a single CPU core. Therefore, a processor may also refer to a CPU core on which the algorithm runs.
2. Terminology and Notation. Let A = (a i,j ) be an m A × n A real or complex matrix. Let B = (b i,j ) be an m B × n B real matrix, where m B ≤ m A and
We call B the visualization matrix. Let us partition A into m B × n B blocks (submatrices) of same or nearly same sizes as follows:
We will further write simply a i,j ∈ A k,l to indicate that the element a i,j belongs to the block A k,l . From Eq. (2.2) it follows that a i,j ∈ A k,l if
We call V the visualization function. Then, we define the elements of the visualization matrix B as follows:
where
(S k,l equals the number of elements of A k,l ). In case of sparse A, condition Eq. (2.4) allows to calculate b k,l by performing the summation only over nonzero elements of A k,l . We can thus rewrite Eq. (2.5) as
Suppose now that we have an algorithm that calculates B for a sparse matrix A on a given HPC system. Let P denote the number of processors involved in an algorithm run and let p 1 , . . . , p P denote these processors. Then, we can consider A as
where A (q) contains nonzero elements of A stored in the local memory of processor p q . (We use the (q) superscript frequently in this text to indicate that some entity belongs to processor p q .) We will further write simply a i,j ∈ A (q) to indicate that the element a i,j is contained in A (q) , therefore in the local memory of processor p q . Let |A (q) | denote the number of nonzero elements of A (q) . The calculation of b k,l now becomes trickier, since the nonzero elements of A k,l can be distributed among multiple processors. Let P k,l denote a set of processors each of which contains at least one nonzero element of A k,l . Thus,
Conversely, let A q denote a set of blocks from each of which processor p q contains in its memory at least one nonzero element. Thus
We can now rewrite Eq. (2.7) as
We further say that any processor from P k,l contributes to the calculation of b k,l . Let b k, * denote the k-th row of B. Let P k, * denote a set of processors that contribute to at least one element of this row, thus:
We can now define the problem of acquisition of visualization data for a sparse matrix A as follows: Problem 1. We are looking for an efficient parallel algorithm that calculates the visualization matrix B, for a given sparse matrix A and a visualization function V, and saves it to a file F. The additional requirements are: Req. 1: The algorithm should not depend on the type of matrix-processors mapping (the type of distribution of the nonzero elements of A among processors). Req. 2: The algorithm should not depend either on the computer representation of A or on the order in which its nonzero elements are accessible.
Sparse matrices are stored in computer memory in special data structures called sparse matrix storage formats/schemes (see, for instance [10, 1, 12] ). All these formats have one common feature-they allow to iterate over the nonzero elements. Due to Req. 2, we thus further regard A (q) , . . . , A (P ) as a sequence of their nonzero elements with unspecified order. This allows us to work with A independently of the storage format used within an actual HPC code, where the matrix appears. Moreover, this allows us to work with matrices that are even not stored in memory at all (their elements are computed on-thy-fly).
The information we want to visualize is determined by V. For instance, if we want to visualize the structure (pattern) of the nonzero elements of A, the visualization function might be defined simply as follows: Then, B would contain the density of the nonzero elements of blocks of A. Other visualization functions might be defined, e.g., as follows:
Thus, the usage of V 2 , V 3 , and V 4 results in the visualization of the average magnitude of the elements of each block, the average magnitude of their real parts, and the average magnitude of their imaginary parts, respectively. The visualization matrix B represents the intermediate visualization data that will be saved into F and from which the final matrix image will be constructed. We call F the visualization file. Recall that F is supposed to be downloaded from a parallel system to a user's personal computer. This, in effect, limits the size of F to some extent (presumably, to hundreds of megabytes or gigabytes nowadays), which consequently limits the size of B.
Suppose that F is a binary file and that we save the elements of B into F using a b-byte floating-point data type T (e.g., b = 4 for T being the IEEE 754 single-precision floating point data type [6] ). The memory requirements in bytes for saving B in F are then
Let S(F) denote the file size of F . Actually, it might be slightly higher than S(B), since there might be some space overhead given by the used file format as well as by any supplemental saved information. In practice, such overhead should not exceed several kilobytes, therefore we consider it as negligible for further analysis and set
S(F) = S(B).
(2.13)
Since our goal is to acquire as much visualization data as possible, we are looking for the answer to the following question: What is the maximum size of the visualization matrix B to be stored in F of size at most S(F) bytes? Combining Eqs (2.1), (2.12), and (2.13), we can find the solution as
together with Eq. (2.1).
Example.
Let A be a square matrix. Let us limit the size of the visualization file to 4 GB, which implies S(B) = 2 32 . Let us use the IEEE 754 single-precision floating-point data type for storing elements of B into F, thus b = 4. Then, m B = n B = 32768. We can thus partition A into 32768 × 32768 ≈ 10 9 blocks and save the visualization data in a form of a single floating-point number for each block into a file of approximate size 4 GB. From this file, we can then generate a matrix image up to the size of 1 gigapixel.
In the text below, the capitalized word "Algorithm" followed by a number always refers to a pseudocode presented by a corresponding floating text environment (as are figures and tables). Within pseudocode, we write references to corresponding equations in the end-of-line comments.
3. Methodology and Algorithm. To solve Problem 1, we need to develop an algorithm for the calculation of the visualization matrix B according to Eq. (2.9), and its storage into the visualization file F. First, note that we can translate Eq. (2.9) into pseudocode as follows: for l ← 1 to n B do 3:
for all processors in P k,l do in parallel 5:
for all local nonzero elements a i,j do 7:
end for 9:
perform parallel reduction of b
write b k,l into F
12:
end for 13: end for 14: end for for all processors p q in P k,l do in parallel b
Since this process needs to be performed for all possible combinations of k and l, a pseudocode that would solve Problem 1 might look like Algorithm 9.
The drawback of this approach is its computational complexity O |A (q) | · m B · n B for processor p q , where |A (q) | might be very high in HPC applications. However, by rearranging Algorithm 9 as described below, we can reduce the computational complexity to O |A
when |A (q) |, m B , n B are not all extremely small numbers. The idea of this rearrangement is to split the solution of Problem 1 into two phases:
1. Within Phase1, processor p q iterates over all its local nonzero elements a i,j ∈ A (q) . In each iteration, the contribution of a i,j to b
k,l is calculated, where k and l are given by Eq. (2.3). This process can be performed by all processors in parallel with no communication costs.
2. Within Phase2, the local contributions b
k,l are reduced in parallel to b k,l , which is then written to the visualization file F.
The price for such a solution is the additional need to temporarily store the local contributions b
k,l in memory of processor p q . Usage of a plain two-dimensional array would imply the algorithm space complexity O(m B · n B ) for each processor. Consequently, it would limit the size of B by the minimum of the available amounts of memory of all processors. An alternative is to store the contributions in an associative array (commonly also known as a map or a dictionary) of type (k, l) → R, where k and l are integers. The algorithm space complexity then changes to O |A q | for processor p q . In the worst case, processor p q contributes to all elements of B, which turns this space complexity into O(m B · n B ) as well (moreover, the hidden constants will be higher here, since the memory overhead of an associative array is higher than of a plain array). However, matrices are in practice mapped to processors typically according to some one-or two-dimensional partitioning scheme. In the best case, nonzero elements are distributed in matrices evenly, which implies |A q | ≈ m B · n B /P . The space complexity of an associative array-based algorithm then becomes O(m B · n B /P ). (We observed such Algorithm 10 Visualization data acquisition
4:
execute Phase1 ⊲ calculate local contributions b
execute Phase2 ⊲ reduce them to b k,l and store into F 7: end for a behaviour in experiments [9] .)
In HPC applications, computational problems are often solved as big as the available resources allow. Therefore, the minimum of available amounts of memory of processors can be a relatively small value. Since we do not want it to limit the size of B, we further consider, for the algorithm design, the storage of local contributions in associative arrays.
Moreover, we assume this associative array to be ordered lexicographically by the (k, l) key. This allows to iterate over the contributions b (q) k,l of processor p q in Phase2 only once, which is much more efficient than calling the lookup operation for each needed combination of k and l. We further denote an instance of the defined ordered associative array by ≎ and its record with the (k, l) key by ≎ [k, l].
3.1. Algorithm Pseudocode. We present here an efficient algorithm that solves Problem 1. Its outline is presented by a pseudocode as Algorithm 10. Let all variables introduced by Algorithm 10 have a global scope, i.e., they are available within the pseudocode of phases as well. Moreover, let all auxiliary variables/arrays defined in the "Data" section of each pseudocode be local to processors.
The pseudocode of Phase1 is shown as Algorithm 11. Within, each processor iterates over its nonzero elements and for each one, its contribution to the corresponding element of B is calculated. These contributions are stored in the ≎ data structure such that at the end of Phase1,
k,l on processor p q . In Phase2, local contributions to the elements of B are reduced in parallel to their final values. These are then written to the visualization file F. Performing the parallel reduction m B × n B times, each one for a single element of B, would be inefficient due to the overhead of communication operations. We therefore designed Phase2 such that the reductions are performed for whole rows of B at once, resulting in m B reductions, each one for n B elements at once.
To calculate b k, * , all the contributions b (q) k,l need to be reduced in parallel from processors p q ∈ P k, * . Due to Req. 1 of Problem 1, each processor can generally contribute to any element of B. Consequently, the sets P k,l and P k, * can consist of all processors p 1 , . . . , p P . To calculate b k, * , all the contributions b (q) k,l need to be reduced in parallel from processors p q ∈ P k, * . There are thus two options how to perform such a reduction:
1. from all processors p 1 , . . . , p P , while setting b (q) k,l = 0 for p q / ∈ P k, * ; 2. from P k, * only, while this set need to be constructed first. The second option provides no advantage over the first, since the construction of P k, * would require collective communication between all processors as well. (In terms of MPI, it would require to form a processors group and a corresponding communicator. There has been, in fact, developed even a noncollective communicator
Algorithm 11 Visualization data acquisition: Phase1
Data: k, l, r, r ′ , c, c ′ , S ⊲ auxiliary variables 1: for all local nonzero elements a i,j do 2:
if record (k, l) does not exist in ≎ then
5:
insert a record into ≎ with (k, l) key and value V(a i,j ) ⊲ (2.9)
6:
end if 9: end for 10: for all records (k, l) in ≎ do 11:
16: 
while j ≤ number of ≎ records and k ′ = k, where (k ′ , l ′ ) is the key of the jth ≎ record do 5:
end while ⊲ reduce to b k,l on p 1 and write to F:
perform parallel reduction of all values of the ≫⋍ array using + operator to processor p 1 ⊲ from all processors 9: if q = 1 then ⊲ if run on processor p 1 10:
end if 12: end for creation technique, however, it does not perform well for our purposes [4] ). We therefore further consider only the first option for algorithm design.
The pseudocode of Phase2 is presented as Algorithm 12. Note that the order of processing the local contribution matches the order of records in the ≎ data structure. This allowed us to design the pseudocode such that no ≎ lookup operation is needed, which considerably reduced the overall algorithm complexity. The auxiliary array ≫⋍ serves as a buffer for storing and reducing local contributions for a single row of B. Before reduction of the kth row contributions,
k,l on processor p q . After this reduction, ≫⋍[l] = b k,l on processors p 1 . The reduction is performed by all processors (see line 1 in Algorithm 10).
Complexities.
Let us analyze complexities of the presented algorithm. Generally, the complexities highly depend on the matrix-processors mapping. We therefore restrict our analysis to the following two extreme 28 D. Langr, I.Šimeček, P. Tvrdík and T. Dytrych cases: In the worst case, there is at least one processor that contributes to all m B × n B elements of B. On the other hand, in the best case, all processors contribute to at most ⌈m B × n B /P ⌉ elements of B.
We define the complexity of the algorithm as its complexity for the most loaded processor (this value determines both the algorithm running time as well as its per-processor memory requirements).
3.2.1. Computational Complexity. For processor p q , the computational complexity of Phase1 is given by two iterative processes defined at lines 1-9 and 10-17 of Algorithm 11. The computational complexity of the first process is given by iterating over |A (q) | nonzero elements, while in each iteration a record is either found or inserted into ≎ . Recall, that ≎ is an ordered associative array. These are typically implemented as binary search trees with logarithmic complexity of both search and insert operations (see, for instance [2, Chap. 12]). The computational complexity of the second process is given by iterating over all |A q | records of ≎ . The computational complexity of Phase1 on processor p q is thus
In the worst case, |A q ′ | = m B · n B for some processor p q ′ . Therefore, the computational complexity of Phase1 becomes
In the best case, |A q | ≤ ⌈m B · n B /P ⌉ for all processors. The computational complexity of Phase1 then equals
The computational complexity of Phase2 is, for processor p 1 and thus for the whole algorithm,
in all cases.
Space Complexity.
The space complexity of Phase1 is given by the number of records of the associative array ≎ , which equals |A q | on processor p q . The space complexity of Phase2 is given by the auxiliary array ≫⋍ of size n B . The overall space complexity of the algorithm is thus
in the worst case and best case, respectively.
Communication Complexity.
The communication complexity depends on the network topology of a given HPC system as well as on the implementation of communication operations by the utilized version of the MPI library. We can therefore only discuss the number of communication operations, instead of analyzing their asymptotic behaviour in terms of running time. In Phase1, there is no communication at all. In Phase2, m B parallel all-to-one reductions are performed by all processors, each over n B elements.
Experiments.
We have performed a weak-scalability study (constant problem size per processor), since it matches the real-world situations, where HPC problems are usually solved as large as available resources allow. On modern hybrid shared-distributed memory HPC systems, the available amount of memory, which limits the size of A, is thus proportional to the number of utilized processors P . In experiments, we always set the size of A such that its number of nonzero elements was approximately P · 4.6 · 10 7 (this corresponded to the per-processor memory requirements of A being around 1 GB using the coordinate storage scheme and the single-precision floating point data type for matrix elements).
As a source of large sparse matrices, we used the parallel scalable generator of benchmark sparse matrices [8] . This generator produces matrices by scalable enlargement of a small fixed seed matrix. As the seed matrix, we used the cage12 real square matrix obtained from the University of Florida Sparse Matrix Collection [3] . We set the input algorithm parameters so to generate a matrix image according to Section 2.1. The size of B was thus always set to m B = n B = 32768, which resulted in possibility of generation of a 1 gigapixel matrix image. As a visualization function, we used V 1 defined by Eq. (2.10).
We have implemented the presented algorithm with C++ and MPI to evaluate its performance and scalability. As an ordered associative array, we used the std::map container from the C++ Standard Library [7, Sect. 7.7] , which is typically implemented as a red-black tree [2, Chap. 13]. We exploited the HDF5 library [13] for creation of visualization files F.
We utilized several modern parallel systems that are listed in Table 4 .1, where we show their parameters together with the C++ compilers and the MPI libraries used for compilations of test programs and their runtime execution. For the Blue Waters system, we present parameters of the XE cabinets only (the XK cabinets are intended primarily for GPU-based computations).
In all experiments, we primarily measured algorithm running times. These are presented by Figure 4 .1. The results indicate that the majority of the running time is spent in Phase2, which is caused by the execution of communication and I/O operations. However, the algorithm is generally fast and the overall algorithm running time grows only slightly with the growing number of processors.
We also measured the estimated memory requirements of the ≎ data structure for all processors. Their maximal and average values are shown in Figure 4 .2. Up to some small constant memory overhead, these memory requirements clearly decrease inversely proportionally to the number of processors P . For higher P , the overall memory footprint of the algorithm is thus negligible in practice.
5.
Conclusions. This paper extends the work of Langr et al. [9] . It presents an updated version of the algorithm for the parallel acquisition of visualization data for large sparse matrices, along with its more detailed analytical and empirical evaluation. The main characteristic of the algorithm, from a user's point of view, is its application independence. It can be used with any mapping of matrices to processors and any sparse storage formats/schemes. Moreover, it can be used even when matrix nonzero elements are computed on-the-fly. It can be also easily implemented using the MPI parallel programming library, from which it needs to execute only the all-to-one parallel reduction operations.
Using this algorithm, one can visualize sparse matrices emerging in an HPC code with minimal runtime and memory overhead. Namely, within the largest presented experiment, 266144 processors gathered and stored visualization data for a sparse matrix with 1.17 · 10 13 nonzero elements with a measured runtime of 73 seconds. The majority of the algorithm running time takes the collective communication between all processors together with serial I/O operations. Resulting visualization data are appended to output files, which allows, among others, visualization files to be ASCII-based. Due to the application independence of the algorithm, parallel I/O cannot be efficiently employed to increase the I/O performance. This does not seem to be of much importance at the petascale performance level of today's prime HPC systems, as demonstrated by the presented experiments.
However, for emerging exascale and higher-level computing, even faster algorithms might appear to be necessary. In our future work, we want to focus on two promising options how to achieve them. The first one is to adapt the presented algorithm for the hybrid MPI+OpenMP parallel programming model, which naturally represents the shared-distributed memory architecture of modern HPC systems. The second option is to give up the application independence and to adapt the algorithm for certain types of matrix-processor mappings, which should allow to reduce the burden of collective communication and to efficiently utilize parallel I/O.
