Matrix multiplication A t A appears as intermediate operation during the solution of a wide set of problems. In this paper, we propose a new cache-oblivious algorithm for the A t A multiplication. Our algorithm, ATA, calls classical Strassen's algorithm as subroutine, decreasing the computational cost of the conventional A t A multiplication to 2 7 n log 2 7 . It works for generic rectangular matrices and exploits the peculiar symmetry of the resulting product matrix for sparing memory. We used the MPI paradigm to implement ATA in parallel, and we tested its performances on a small subset of nodes of the Galileo cluster. Experiments highlight good scalability and speed-up, also thanks to minimal number of exchanged messages in the designed communication system. Parallel overhead and inherently sequential time fraction are negligible in the tested configurations.
Introduction
Matrix multiplication is probably the main pillar of linear algebra. It is a fundamental operation in many problems of mathematics, physics, engineering, and computer science. The algorithmic aspects of matrix multiplication have been extensively explored, as well as its parallelization. Many approaches have been used over the years. In particular distributed computing and highperformance computing have been taken into considerations with the aim of optimizing performance and obtaining more and more efficient parallel algorithms and implementations.
Matrix A t A is a particular matrix multiplication involved in several applications. In Strang's book [23] , the product A t A is extensively used for its many useful properties. In fact, matrix A t A is symmetric and positive-definite. Product A t A appears as intermediate operation in many situations. For example, it appears in the projection matrix P = A(A t A) −1 A t , and in the equations A t Ax = A t b, known as the normal equations. Furthermore, it is used in the least square problem, in the Gram-Schmidt orthogonalization process, in the Singular Value Decomposition, and has many other applications (see, e.g., [23] ).
In this work, we propose a new parallel algorithm to compute the product A t A. Our algorithm is based on the Strassens strategy for the fast matrix multiplication. Strassens algorithm [24] is the most used fast algorithm for matrix multiplication. In fact, Strassen broke the O(n 3 ) operation count for executing the product among two matrices, reorganizing the recursive matrix multiplication algorithm, that is replacing a multiplication step with 18 cheaper matrix additions. This substitution implies asymptotically fewer multiplications and additions, and provides an algorithm executing O(n 2.81 ) operations. Winograd's variant improves Strassens complexity by a constant factor replacing one matrix multiplication with 15 matrix additions.
Since our algorithm exploits the characteristics of the resulting product matrix, it results to be faster than other parallel matrix multiplication algorithms, such as those based on classical Θ(n 3 ) multiplication, and those based on Strassen-like matrix multiplications. We study the performance of our algorithm by performing a set of tests on matrices of size 5 3 and 10 4 . The MPI paradigm has
Related work
Since Strassens algorithm proposal, many fast matrix multiplication algorithms were designed and improved the asymptotic complexity (see, e.g., [11, 22, 25] for recent algorithm). Unfortunately, often improvements come at the cost of very large hidden constants. Since for small matrices, Strassens algorithm has a significant overhead, several authors have designed hybrid algorithms, deploying Strassens multiplication in conjunction with conventional matrix multiplication, see, e.g., [5, 6, 13, 15, 4] . In [8] , Authors extend Strassens algorithm to deal with rectangular and arbitrary-size matrices. They consider the performance effects of Strassens directly applied to rectangular matrices or, after a cache-oblivious problem division, to (almost) square matrices, thus exploiting data locality. They also exploit the state-of-the-art adaptive software packages ATLAS and hand tuned packages such as GotoBLAS. Besides, they show that choosing a suitable combination of Strassens with ATLAS/GotoBLAS, their approach achieves up to 30%-22% speed-up versus ATLAS/GotoBLAS alone on modern high-performance single processors.
Numerous parallel algorithms for Strassens Matrix Multiplication have been proposed. In [19] , Luo and Drake explored Strassen-based parallel algorithms that use the communication patterns known for classical matrix multiplication. They considered using a classical 2D parallel algorithm and using Strassen locally. They also considered using Strassen at the highest level and performing a classical parallel algorithm for each sub-problem generated, where the size of the sub-problems depends on the number of Strassen steps taken. Further, the communication costs for the two approaches is also analyzed. In [12] , the above approach is improved using a more efficient parallel matrix multiplication algorithm and running on a more communication-efficient machine. They obtained better performance results compared to a purely classical algorithm for up to three levels of Strassen's recursion. In [17] , Strassen's algorithm is implemented on a shared-memory machine. The trade-off between available parallelism and total memory footprint is found by differentiating between partial and complete evaluation of the algorithm. The Authors show that by using partial steps before using complete steps, the memory footprint is reduced by a factor of (7/4) compared to using all complete steps. Other parallel approaches [9, 14, 21] have used more complex parallel schemes and communication patterns, but consider at most two steps of Strassen and obtain modest performance improvements over classical algorithms.
In [2] , a parallel algorithm based on Strassen's fast matrix multiplication, CommunicationAvoiding Parallel Strassen (CAPS), is described. Authors present the computational and communication cost analyses of the algorithm, and show that it matches the communication lower bounds described in [3] .
In this work, we consider a particular matrix multiplication, that is the multiplication between A t and A, where A may have any size and shape. We exploit the recursive Strassen's algorithm, that is recursively applied to conceivably rectangular matrices, exploiting the idea described in [8] .
To the best of our knowledge, this is the first parallel algorithm specifically thought for computing the product A t A.
The algorithm works for general m × n rectangular matrices. At each recursive step, the matrix A is divided in four sub-matrices as:
If we define:
, then, for the four sub-matrices composing A we have:
Also the product matrix C = A t · A can in turn be divided into four sub-matrices. The four components of C are obtained using the four components of the matrices A and A t for executing the product. Thus, C consists of the following four sub-matrices:
Both C 1,1 and C 2,2 components of matrix C consist of two addends that are, in their turn, the left hand product of a matrix by its transpose, that is a product of type M T M . Hence, four recursive calls are employed to compute the sub-products A Since for any matrix A the product A t · A is symmetric, at each recursive step only the lower triangular part of the product matrix is stored. This allows to spare almost half of the memory occupation with respect to a full matrix representation, that is n(n + 1)/2 entries versus the usual n 2 . For this reason, the term C 1,2 is not returned by the algorithm (being C 1,2 = C t 2,1 ). As for component C 2,1 , in order to compute A t 1,2 A 1,1 and A t 2,2 A 2,1 , we implemented the generalized Strassen's algorithm for non-square matrices presented in [8] , denoted as HASA. Finally, matrices A 1,2 and A 2,2 are transposed using the cache oblivious algorithm for matrix transposition shown in [18] .
In Algorithm 1 the pseudo-code of the ATA algorithm is provided. The base-case occurs as the number of rows or of columns of the current sub-matrix is less than or equal to 32. This size has been chosen taking into account several considerations, including experimental tests and observations highlighted in [10] on the cost difference between performing an arithmetic operation and loading/storing operations. In that case, the multiplication is performed using a non-recursive algorithm for matrix multiplication. The initialization of m i , n j , and A i,j , i, j = 1, 2, is performed as described above.
As anticipated, we implemented the pseudocode shown in [8] for matrix multiplications A t 1,2 A 1,1 and A t 2,2 A 2,1 , and we used the same notation; we refer to such procedure as to HASA (originally standing for Hybrid ATLAS/GotoBLASStrassen algorithm), keeping the same name used in [8] for a matter of reference and convenience, although we do not use ATLAS and GotoBLAS packages.
Computational cost
Strassen's algorithm is a cache oblivious algorithm to compute the product of two matrices and it was first described in [24] . It achieves to perform a 2 × 2 matrix multiplication using 7 multiplications instead of 8. By recursion, the general n × n matrix multiplication is performed using O(n log 2 7 ) multiplications. In Algorithm 1, there are four recursive calls to ATA on basically halved dimensions, and two calls to HASA. Thus we can derive the general recursive function:
if m ∨ n <= 32 then
3:
return mult(A, m, n);
4:
Define m 1 , n 1 , m 2 , n 2 ;
6:
Initialize A i,j , i, j = 1, 2;
7:
10:
11:
12:
13:
Using the Master Theorem (see, e.g., [7] ), it holds that the number of multiplications performed by ATA is upper-bounded by 
Parallel implementation
The recursive sequential Algorithm 1 has been implemented in parallel for a multiprocessor system using MPI (Message Passing Interface). MPI is a message-passing library specification that provides a powerful, efficient and portable way to write parallel algorithms, [20] . The parallelization idea consists in executing the recursive calls to ATA and to HASA (lines 7 to 12 in the sequential Algorithm 1) in a concurrent fashion. A parallel multiprocessor implementation has been developed for both ATA and HASA algorithms. We shall refer to these parallel routines as to ATA-P and HASA-P, where P denotes the number of available processors. Depending on the value of P , a certain number of parallel levels can be executed. A parallel level is a parallel execution of ATA-P and of HASA-P, where at least two processors share out the recursive calls. Each parallel execution of ATA-P requires at most six processes, one for each of the four calls to ATA-P and the two calls to HASA-P. Instead, each parallel execution of HASA-P requires at most seven processes, since HASA implements the Strassen's algorithm and executed seven recursive calls. When the maximum number of parallel levels is reached, processes execute sequentially ATA or HASA, depending on the task that has been assigned to them.
During the parallel phase all processes work independently from one another and do not need to interact nor to exchange data. Communication is resumed at the end of the sequential phase that is executed at the bottom of the recursive tree, when the output arguments of the recursive calls are collected and arranged to form the product matrix C = A t ·A on each parallel level. In the following subsections, we describe in detail how ATA-P and HASA-P have been implemented and the structures created in order to manage cooperation between processes. We shall also explain how the communication system was implemented.
ATA-P insights
In the previous section, we introduced the notion of parallel levels; the number of parallel levels is the maximum number of parallel executions of ATA-P and HASA-P where at least two processes are responsible for the recursive calls.
We shall classify parallel levels as either complete or incomplete. A parallel level is complete when six processes are assigned to each call to ATA-P and seven processes are assigned to each call to HASA-P. On the contrary, a parallel level is incomplete when the number of processes to which recursive calls are assigned in a recursive step of ATA-P or HASA-P is less than six and seven, respectively.
Every time ATA-P is called recursively in a complete parallel level, four processes will call ATA-P on the four sub-matrices of A defined in Equation (3), whilst the other two processes will execute HASA-P to compute the two rectangular products that are required to obtain C 2,1 (see equation (2)). In a complete parallel level, HASA-P execution implies that seven processes execute HASA-P, each on its own sub-data. A representation of how processes are distributed in ATA-P is depicted in Figure 1 , that shows the tree structure of the processes organization for two complete parallel levels. Figure 1 : Representation of how processes are distributed among two complete parallel levels. Squares represent processes, and the number next to them are their identity ranks. Solid and dotted lines between processes represent recursive calls to ATA-P and HASA-P, respectively. After the second complete parallel level, each process executes either ATA or HASA independently.
The structure described above, according to which ATA-P or HASA-P are called, can be used to derive the function npl( ), that we use to calculate the number of processes needed to accomplish complete parallel levels of ATA-P. In particular it holds that:
Given the number of available processors P , the maximum number of executable complete parallel levels max is:
If the number of available processes P is equal to npl( ), for some , then no more processes are available and therefore all active processes will continue their task sequentially. The alternative case occurs when npl( max ) < P < npl( max + 1). This corresponds to the scenario where no more complete parallel levels can be executed, yet some computational resources are still available. In this case the remaining P − npl( max ) processes are distributed in order to lighten the workloads of those processes involved in the complete parallel levels. When this event occurs, an incomplete parallel level is set between the max -th complete parallel level and the sequential calls.
When an incomplete parallel level can be performed, processes are distributed according to the following idea. To spread the P − npl( max ) processes as evenly as possible, the value k is calculated as follows:
Then, k of the still available processes are paired to each process involved in the last complete parallel level. If P − npl( max ) > k · npl( max ), there are other P − (k + 1) · npl( max ) processes to distribute. The distribution is realized following a hierarchical chart that exploits a classification of already active processes, based on the heaviness of the task they have to work on. First, we can observe that HASA-P is computationally more expensive than ATA-P, since it involves seven recursive calls instead of six. Hence HASA-P has the highest priority. The second factor used to establish process priority is represented by the size of the sub-problem assigned to a call. In fact, if m and n are not a power of 2, after a certain number of recursive calls, it holds that m 1 = m 2 + 1 and n 1 = n 2 + 1. Therefore, the size of the data of some processes may be higher.
Hence, processes may be ordered depending on the call (HASA-P or ATA-P) first, and on the size of the sub-problem as second parameter. In summary, the last P − (k + 1) · npl( max ) processes are paired, in a orderly way, to: (1) processes that are in charge of HASA-P calls, (2) processes that work on larger sub-problems. An example of how processes are distributed in a incomplete parallel level is depicted in Figure 2 , for P = 15. In this case, max = 1 and lefties= 9. The value of k resulting from equation (6) is 1, hence each of the npl( max ) = 6 processes can be paired to one of the lefties processes. As for the further remaining three processes, two are paired associated to processes performing HASA (namely, one is associated to the process having father with rank 4 and one is associated to the process having father with rank 5), whereas the third process must be associated to one of the process with higher sub-data size, that in this case is only P 0 . 
Implementation details
The parallel algorithm has been implemented using MPI for running on a multiprocessor system (see Section 6) . Once the MPI environment is initialized, every process gets its own identifying rank through the MPI function MPI rank. In addition to the input arguments given to Algorithm 1 in the sequential version, a parallel recursive call to ATA-P also reads:
• the index of the level it is working on;
• max the maximum number of complete parallel levels;
• father the rank of the father of the process that is serving the current call;
• lefties the number of processes beyond the ones needed to accomplish max complete parallel levels.
The argument father is the rank of the process that has called ATA-P, tracing the tree of the recursive calls (see Figure 1) . We assume that P ≥ 6. When the function is first launched, the identifier of the generated parallel level is 1, and the father is process 0. At each parallel level of ATA-P, the sub-problems are divided among processes depending on their task. Initially, for each parallel recursive call, an array of size 6 called ids is defined in the following way. In the level , the i-th element ids i of ids is defined as ids i = father + i · npl(x), for i = 0, 1, . . . , 4, while the 6-th element ids 5 is defined as ids 5 = father + 4 · npl(x) + 7
x , where x = max − . Afterwards, recursive calls to complete parallel levels are assigned to processes with rank in [ids i , ids i+1 ) if i < 5, and to process with rank [ids 5 , npl( max )) otherwise, for working on different sub-problems. Notice that process P ids0 is P f ather . Similarly, in each parallel level of HASA-P, the array ids of size 7 is defined in such a way that the i-th element ids i of ids is ids i = father + i · 7
x , where again x = max − . It is easy to understand why the numbering is constructed in this way by looking at the tree shown in Figure 1 .
The pseudocode of a simplified version of ATA-P is reported in Algorithm 2. This version of the algorithm is simplified in the sense that we assume that P is taken equal to npl( max ), for some max , and this implies that only complete parallel levels are executed. Since the input argument lefties is necessary only when an incomplete parallel level can be performed, it is omitted in Algorithm 2, (in fact, in this case lefties is equal to 0, being P = npl( max )). This simplification can be removed simply by adding the management of the set of lefties processes that we have when P = npl( max ) and that is described in Section 4.1.
Communication between processes
In the following sections 4.3.1 and 4.3.2, we shall describe how communication between processes was developed. For both functions ATA-P and HASA-P, we combined communicators and pointto-point communication. Communicators group together processes that may be organized in different topology within the communicator they belong to. Point-to-point communication allows two specific processes to share data.
Communication in ATA-P
In this section we describe how communication is carried out in ATA-P complete parallel levels. In each complete parallel level, the six processes generated by the call to ATA-P are responsible for the four recursive calls to ATA-P and for the two calls to HASA-P, and are now identified by rank ids 0 , . . . , ids 5 (where ids is the array defined as in the previous section). Recall that, for each call, ids 0 is the father of itself and of the remaining elements of ids. Tasks are divided in the following way:
Each of the three pairs of processes (P ids0 ,P ids1 ), (P ids2 , P ids3 ), and (P ids4 , P ids5 ) is responsible for computing the two addend of the three sub-matrices of C = A t · A (see equation (2)). Four communicators are created at this stage. Three communicators link together the addends of each of the three sub-matrices C 1,1 , C 2,2 and C 2,1 . To this end, an MPI reduction performing a matrix sum is executed within each of such communicators. As a result, processes P ids0 , P ids2 and P ids4 have C 1,1 , C 2,2 and C 2,1 respectively. The fourth communicator, that we call Current World, collects together all processors in ids. Current World is used to handle MPI barriers to synchronize Algorithm 2 ATA-P -simplified version: P = npl( max ) Input: A ∈ R m×n Output: Lower triangular part of C = A t · A 1: procedure AtA-P(A, m, n, , max, father) 2: if m ∨ n <= 32 then return mult(A, m, n); %Base case 3: else %Iterative case %Initialize sub-data 4:
Define m 1 , n 1 , m 2 , n 2 ; 5:
Initialize A i,j , i, j = 1, 2; 6:
if > 0 then % > 0 characterizes a parallel level 7:
x ← max − ; 8:
step ← npl(x); 9:
for (i = 0; i < 4; i++) do 10: if id ∈ [ids 0 , ids 1 ) then 21:
if id ∈ [ids 1 , ids 2 ) then 23: A) ATA-P is executed on matrix A. The communicator Current World collects together the six processes involved. Communicators Comm11, Comm22 and Comm21 group together the processes that compute the addends of C1,1, C2,2 and C2,1, respectively, where
is the process with rank idsi in COMM WORLD, k in Current World and l in the smaller communicator CommIJ. B) When the recursion is over, a MPI reduction computing a matrix sum is performed within each communicator CommIJ. The result is stored in the process P 0 . C) Processes of rank 2 and 4 in Current World send C2,2 and C2,1 respectively to P0, that is now in charge for patching together and returning C.
processes, guaranteeing safe communication. By convention, the root of communication is the process with the lowest rank in the current communicator. Blocking Send and Receive functions are employed to transfer C 2,2 and C 2,1 from P ids2 and P ids4 , respectively, to processor P ids0 , which is now in charge of defining and returning C. Communication is represented thoroughly in Figure  3 .
Communication in HASA-P
Here below we show how we implemented the communication system in a HASA-P complete parallel level. In each complete parallel level, the seven processes with rank ids 0 , . . . , ids 6 are responsible for the seven recursive calls to HASA-P. In particular if HASA is run on matrices A and B, to obtain their product D = A · B, the workload is distributed among processes as follows:
When processes P ids0 , . . . , P ids6 complete their work, the product matrix D = A · B is obtained calculating its four blocks as follows:
Similarly to how was done for ATA-P, four communicators (one for each block of D) are created at this stage. Processes computing a term that appears in more than one block of D (namely, all terms but M 6 and M 7 ) belong to two communicators. A MPI reduction allows to store D i,j , i, j = 1, 2 in the root of each communicator. Once all MPI reductions are completed, all data is sent to P ids0 using Send/Receive functions. P ids0 is responsible for recovering and returning D. Similarly to what described for ATA-P, synchronization among processes is achieved by creating a Current World communicator that includes all processes P idsi , i = 0, . . . , 6. Communicators are represented in Figure 4 .
Communication model and cost
Our communication model is similar to the one used in [2] . We consider latency and bandwidth costs, denoted as L(n, P ) and BW (n, P ), respectively. Latency cost is the communicated message count while bandwidth cost is expressed in terms of communicated word count. Message and word counts are computed along the critical path introduced in [26] . If α is the time spent for communicating a message and β is the time for communicating a word, then the total communication cost is given by: αL(n, P ) + βBW (n, P ).
The number of messages that are exchanged between processes depends on how many complete parallel levels of processes can be layered, max (see equation (5)). Notice that ∀P it holds max < log 7 P ; this is because of the number of nodes of the ideal tree that processes are distributed on (see Figure 1 for an example); more precisely, we may observe that for all P ≤ 11602 it holds that max = O(log 6.5 P ).
In a ATA-P communication step, three MPI reductions occurs simultaneously. Afterwards, two messages are sent to the root of the Current World communicator (see Figures 3 B) and C)). In a HASA-P communication step, four MPI reductions are performed at the same time, and three messages are sent to the Current World root. Since HASA-P is recalled on max − 1 levels, it holds that L(n, P ) = max {4 · ( max − 1); 3 · max }, that is L(n, P ) = max {4 · (O(log 7 P ) − 1); 3 · O(log 7 P )} in general. Hence, we can observe that the achieved latency is lower than the one reached in [2] for generic A · B multiplication, but communication suffers from high bandwidth cost (BW (n, P ) = n 2 2 ). Insights on this fact are given in Section 6.3.2.
Performance evaluation
In this section we describe the results of experimental tests carried out in order to assess ATA-P. We implemented ATA-P using MPI and tested it on a cluster of Intel processors. Performances are analyzed using different number of processors and several features for parallel algorithms were investigated.
Experimental setup
Performances have been tested on a small subset of nodes of the Galileo cluster, installed in CINECA (Bologna, Italy), [1] . It is an IBM NeXtScale, Linux Infiniband Cluster consisting of 360 nodes 2 x 18-cores Intel Xeon E5-2697 v4 (Broadwell) processors (2.30 GHz), and 15 nodes 2 x 8-cores Intel Haswell (2.40 Ghz ) processors endowed with 2 nVidia K80 GPUs.
Experimental results
We studied the performances of ATA-P in terms of execution time, speed-up, efficiency and KarpFlatt metric. For tests, we considered randomly generated square matrices of size n = 5000 and n = 10000. Tests have been carried out for the following values of activated processes P = 6, 12, 18, 38, 76, 114, 250. The cases P = 6, 38, 250 correspond to a parallel execution with complete parallel levels, whereas the remaining values of P generate incomplete parallel levels. Execution times for matrix size 5000 and 10000 are depicted in Figure 5 . For P = 1, the execution time is obtained running the sequential implementation of ATA. From Figure 5 , we can observe that for values of P that do not correspond to the activation of complete parallel levels there is a degradation in the trend of the execution time. On the contrary when the number of processes allows to generate complete parallel levels, the execution time improves. This aspect is also observable in Figure 6 , where we show the speed-up values. We can observe that for matrices of size n = 10000 the speed-up maximum value is 64.28%, and is obtained for P = 250, see Figure 6 .
The trend of the execution times shown in Figure 5 is strictly decreasing, highlighting the good scalability of the approach. The efficiency is computed as the ratio between the speed-up S and the number of processes P , and ranges from 0.66 (obtained for P = 6) to 0.26 (obtained for P = 250), as reported in Figure 7 . As expected, efficiency has a decreasing trend, except for P = 38, where it grows. A first observation about this fact is the following: for P = 38, exactly 2 complete parallel levels can be performed, while for P = 18 it holds that max = 1 and the remaining 12 processes are paired to the npl(1) = 6 processes in order to create an incomplete parallel level. Therefore, some portions of executed code are not shared in the two cases arising from having P = 18 and P = 38. Further comments on this phenomenon are in Section 6.3.2. Finally, in Figure 8 , the Karp-Flatt metric values, giving the experimentally determined serial fraction, are reported. The Karp-Flatt metric was first introduced in [16] and depicts the fraction of time spent by a parallel program to perform serial code, e. It is defined as follows:
where S is the speed-up and P is the number of processes. As we can see in Figure 8 , values of e are small and decreasing, meaning that there is no significant parallel overhead and that the portion of serial code that is executed is very low.
Discussion of performance results
The performance assessment parameters show that ATA is scalable and highlight high speed-up and negligible portions of executed serial code. Efficiency has a frequently observed decreasing trend that is due to intrinsic parallel overhead when the number of processes P grows. In this section we make two key observations that may be taken into account for improving the performances of ATA-P.
Process redistribution
As we described in Section 4, every time ATA-P and HASA-P are called recursively, in each complete parallel level one process is responsible for one recursive call, either to ATA-P or to HASA-P. Nevertheless, the computational cost of HASA-P is higher than the one of ATA-P. This results in a idle time for those processes performing ATA-P that worsen efficiency. As a matter of fact, the peak of efficiency for P = 38 (see Figure 7) is rather due to a low efficiency for P = 12 and P = 18, that are the two configurations that report the highest time difference between processes executing ATA-P and those performing HASA-P. A strategy to overcome this issue can be to distribute processes so that more processes are responsible for HASA-P calls, at the expenses of more workload for those performing ATA-P.
Data redistribution
In Section 5 we discussed the expression for latency and bandwidth of ATA-P. We noticed that the number of sent messages is small, but that the maximum size of sent messages is independent from P . Because of the low latency, this does not introduce a very appreciable delay when the number of processes is low, but it may introduce non negligible overhead for a very high number of processes. In the configurations that we investigated, the maximum percentage of time spent for communication ranges between ∼0.14% (P = 6, maximum time spent for communication is 0.08s) and ∼0.46% (P = 250, maximum time spent for communication is 0.16s) of the total parallel execution time. A possible solution to high bandwidth is to divide matrix size such that all processes work on the same amount of data and avoiding intermediate communication between processors.
Conclusions and future work
We have defined a cache-oblivious recursive algorithm for the A t A matrix multiplication, ATA, that is an operation that has applications in several problems in geometry, linear algebra, statistics, etc. The number of multiplications performed on matrices of size n is upper-bounded by 2 7 n log 2 7 , in the face of n 2 (n + 1)/2 products for conventional A t A multiplication algorithm; this is achieved because ATA includes recursive calls to generalized Strassen's algorithm for rectangular matrix multiplications. The algorithm was implemented in parallel using MPI and tested on a cluster. MPI communication facilities were used to perform smart communication in each parallel execution of ATA. Latency is low and the performances were assessed in terms of several performance evaluation indices, highlighting good scalability, low parallel overhead and negligible fraction of inherently sequential code. We detected possible improvements for the enhancement of parallel performances consisting in a different balance for task distribution among processes, and a more equally spread load of data among processors. We plan to find a trade-off between the two proposed improvements. Also, we believe that existing performance evaluation indices penalize the test results of systems where the execution time of employed processors do not overlap perfectly: first and foremost, conventional efficiency does not take into account the amount of time during which not all P processors are actively working; yet processes performing faster operations may use idle time for fulfilling additional tasks if the algorithm is integrated in a more complex system. We believe that a more accurate formulation for efficiency and speed-up may be useful to assess more truthfully systems like the one that we introduced.
